# Practice: <br>PSET 1 Comments, <br>More Sophisticated Reading/Writing Files, &<br> Importing Libraries

**12/2/2019<br>
BIOS 274: Introductory Python Programming for Genomics**<br>

## Table of Contents
1. [PSET 1: Some comments](#PSET1)<br>
2. [Let's Practice!](#practice)

<a id="PSET1"></a>
## PSET 1: Some comments

In [None]:
def read_fasta(fasta_filename):
    '''
    Go through file, reading one line at a time, using a
    dictionary to store the DNA sequence for each of the FASTA
    entries (Gavin Sherlock, November 28, 2019)
    '''    
    with open(fasta_filename, mode='r') as fasta_file:

        sequences = {}
        
        for line in fasta_file:
            line = line.rstrip()
            if line.startswith('>'): # it's a new fasta record
                line = line.lstrip('>')
                sequences[line] = '' # initialize dictionary entry for this record
                currSeqName = line
            else:
                sequences[currSeqName] += line

    return sequences

In [None]:
enzyme_sites = {'EcoRI': 'GAATTC', 'HindIII': 'AAGCTT',
                'BamHI': 'GGATCC', 'HpaI': 'GTTAAC',
                'HaeIII': 'GGCC'}

cutsite_offset = {'EcoRI': 1, 'HindIII': 1, 'BamHI': 1,
                  'HpaI': 3, 'HaeIII': 2}
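
To make the offset convention concrete, here is a small sketch on a made-up toy sequence (`toy_seq` and `cut_positions` are names invented for this illustration). It assumes the offset is added to the 0-based index where the recognition site starts, so the resulting number can be read as the count of bases to the left of the cut.

```python
# Two of the enzymes from the dictionaries above.
enzyme_sites = {'EcoRI': 'GAATTC', 'HaeIII': 'GGCC'}
cutsite_offset = {'EcoRI': 1, 'HaeIII': 2}

# Made-up sequence containing one EcoRI site and one HaeIII site.
toy_seq = 'AAGAATTCTTGGCC'

cut_positions = {}
for enz_name, enz_seq in enzyme_sites.items():
    match_index = toy_seq.find(enz_seq)  # 0-based start of the first match (-1 if absent)
    cut_positions[enz_name] = match_index + cutsite_offset[enz_name]

print(cut_positions)  # {'EcoRI': 3, 'HaeIII': 12}
```

EcoRI cuts between the G and the first A of GAATTC, so with the site starting at index 2 the cut leaves 3 bases (AAG) on the left.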

In [None]:
fasta_filename = 'rosalind_dna.fsa'
read_seqs = read_fasta(fasta_filename)

### Part 1

In [None]:
outFileName = 'PS1_Part1.txt'

with open(outFileName, 'w') as outFile: # open the output file
    for seqName, fastaSeq in read_seqs.items(): # go through each DNA sequence

        numCut = 0 # initialize a variable to count how many enzymes cut the sequence
        
        outFile.write('Sequence: ' + seqName + ' (cut sites)\n') # write the sequence name to the outFile
        print('Sequence: ' + seqName + ' (cut sites)')

        for enzName, enzSeq in enzyme_sites.items(): # go through each enzyme
            cutsiteList = [] # initialize a list to store all the cutsites for this particular DNA sequence and enzyme
            enzOffset = cutsite_offset[enzName] # look up the offset for this particular enzyme

            # find all the cutsites for this particular DNA sequence and enzyme and store them in a list
            for i in range(len(fastaSeq)):
                if fastaSeq.startswith(enzSeq, i): # test for a match at position i without copying the sequence
                    cutsiteList.append(str(i + enzOffset))
            
            # only if there's a cutsite in the list (cutsiteList isn't empty), print out the enzyme name and the cutsites
            if cutsiteList:
                outFile.write(enzName + '\t' + ', '.join(cutsiteList) + '\n')
                print(enzName + '\t' + ', '.join(cutsiteList))
                numCut += 1
        
        # only if no enzymes at all cut this DNA sequence (numCut == 0), print out that there are no cutsites found
        if not numCut:
            outFile.write('no cutsites found\n')
            print('no cutsites found')
            
        outFile.write('\n')
        print()
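
The cut-site search can also be pulled out into a small helper so it can be reused. The sketch below is one possible version, not the official solution: `find_cut_sites` is a hypothetical name, and it uses `str.find` with a start index to jump from match to match. Unlike the cell above, it returns integers rather than strings.

```python
def find_cut_sites(seq, site, offset):
    """Return the cut position (an int) for every occurrence of `site` in `seq`."""
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start + offset)
        # Resume one base further along so overlapping matches are still found.
        start = seq.find(site, start + 1)
    return positions


print(find_cut_sites('GGCCGGCC', 'GGCC', 2))  # [2, 6]
```

Keeping the positions as integers makes later arithmetic (such as fragment lengths) easier; convert with `str()` only when writing them out.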

### Part 2

It's best **not to repeat code**!

Other options:<br>
>**1. Store the cutsite info from Part 1 in a variable (perhaps a dictionary of dictionaries) and use that variable in Part 2.**<br><br>
<code>{'Rosalind_6820': {'HpaI': ['118'], 'HaeIII': ['596']}, 
 'Rosalind_3684': {'HaeIII': ['106', '121', '263', '408', '800', '916']}, 
 'Rosalind_6908': {}, 
 'Rosalind_2711': {'HindIII': ['133'], 'HaeIII': ['70', '225', '614', '628']}, 
 'Rosalind_1559': {'EcoRI': ['367'], 'HaeIII': ['91', '429', '458', '614', '622']}, 
 'Rosalind_9546': {}, 
 'Rosalind_5746': {'EcoRI': ['143'], 'HindIII': ['37'], 'BamHI': ['552'], 'HpaI': ['826'], 'HaeIII': ['177', '250', '272', '329', '409', '708']}}</code><br><br>
 
>**2. Open two output files (instead of just one) before your loop starts.
<br>Write some information to one of the files and different information to the other file.**<br><br>
<code>outFileName1 = 'PS1_part1.txt'
outFileName2 = 'PS1_part2.txt'</code><br><br>
<code>with open(outFileName1, 'w') as outFile1, open(outFileName2, 'w') as outFile2:</code><br>
<code>    # FIND CUTSITES</code><br>
<code>    # FIND FRAGMENT LENGTHS</code><br>
<code>    outFile1.write('CUTSITE INFO')</code><br>
<code>    outFile2.write('FRAGMENT LENGTH INFO')</code><br>
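
As a runnable illustration of option 2, the sketch below writes made-up results for two toy sequences to two files in one pass. The data in `toy_results` is invented for the example, and the files go into a temporary directory so nothing in your working folder is overwritten.

```python
import os
import tempfile

# Made-up results for two toy sequences (not real PSET output).
toy_results = {
    'seqA': {'cuts': [3, 12], 'fragments': [3, 9, 2]},
    'seqB': {'cuts': [], 'fragments': [10]},
}

# Write into a temporary directory so the example leaves the working folder untouched.
out_dir = tempfile.mkdtemp()
outFileName1 = os.path.join(out_dir, 'PS1_part1.txt')
outFileName2 = os.path.join(out_dir, 'PS1_part2.txt')

# Both files are opened once, before the loop, and closed together at the end.
with open(outFileName1, 'w') as outFile1, open(outFileName2, 'w') as outFile2:
    for seqName, result in toy_results.items():
        cuts = ', '.join(str(c) for c in result['cuts'])
        fragments = ', '.join(str(f) for f in result['fragments'])
        outFile1.write(seqName + '\t' + cuts + '\n')        # cut-site info
        outFile2.write(seqName + '\t' + fragments + '\n')   # fragment-length info
```

A single `with` statement can manage several files at once, so both are closed properly even if something goes wrong inside the loop.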

### Part 1 - Improved

In [None]:
outFileName = 'PS1_Part1.txt'

##### ADDED THIS
cutsiteDict = {}

with open(outFileName, 'w') as outFile:
    for seqName, fastaSeq in read_seqs.items():

        ##### ADDED THIS
        cutsiteDict[seqName] = {}
        
        outFile.write('Sequence: ' + seqName + ' (cut sites)\n')
        print('Sequence: ' + seqName + ' (cut sites)')

        for enzName, enzSeq in enzyme_sites.items():
            cutsiteList = []
            enzOffset = cutsite_offset[enzName]

            for i in range(len(fastaSeq)):
                if fastaSeq.startswith(enzSeq, i):
                    cutsiteList.append(str(i + enzOffset))

            if cutsiteList:
                ##### ADDED THIS
                cutsiteDict[seqName][enzName] = cutsiteList
                outFile.write(enzName + '\t' + ', '.join(cutsiteList) + '\n')
                print(enzName + '\t' + ', '.join(cutsiteList))
        
        ##### MADE THIS CHECK BETTER (no need to count how many enzymes cut anymore)
        if not cutsiteDict[seqName]:
            outFile.write('no cutsites found\n')
            print('no cutsites found')
            
        outFile.write('\n')
        print()

In [None]:
cutsiteDict
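
Here is a sketch of how a dictionary shaped like `cutsiteDict` could feed Part 2. It assumes a linear sequence digested by one enzyme at a time, and that each stored position is the number of bases to the left of the cut. The helper name `fragment_lengths`, the toy dictionary, and the sequence length of 300 are all made up for illustration.

```python
def fragment_lengths(cut_positions, seq_length):
    """Lengths of the pieces left after cutting a linear sequence at each position."""
    # Positions are stored as strings, so convert to int before sorting;
    # sorting the strings themselves would put '1000' ahead of '200'.
    cuts = sorted(int(pos) for pos in cut_positions)
    ends = [0] + cuts + [seq_length]
    return [right - left for left, right in zip(ends, ends[1:])]


# Hypothetical entry shaped like cutsiteDict, with a made-up sequence length.
toy_cutsite_dict = {'toy_seq': {'HaeIII': ['106', '121', '263']}}
toy_seq_length = 300

for seq_name, enzymes in toy_cutsite_dict.items():
    for enz_name, positions in enzymes.items():
        print(seq_name, enz_name, fragment_lengths(positions, toy_seq_length))
        # toy_seq HaeIII [106, 15, 142, 37]
```

The fragment lengths always add up to the sequence length, which makes a handy sanity check.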

<a id="practice"></a>
## Let's Practice!

In the Day 5 folder on Canvas, there are BED files that contain all the peaks found in ChIP-seq experiments performed for various transcription factors in various cell types.

>**1.** Without hardcoding the file names, find (and store in a variable) the names of all the BED files in the directory you downloaded from Canvas (Day 5).<br>
>**2.** Generate a single tab-delimited file (with an informative header) called <code>numPeaks.tsv</code> that has one row for each transcription factor tested in each cell type, listing:<br>
>>**a.** the cell type used<br>
>>**b.** the transcription factor tested<br>
>>**c.** the number of ChIP-seq peaks within a genomic region of interest (in this case, chr2:47000000-48000000).

e.g.<br>
<code>TF_NAME</code>&nbsp;&nbsp;&nbsp;<code>CELL_TYPE</code>&nbsp;&nbsp;&nbsp;&nbsp;<code>COUNT_IN_ROI</code><br>
<code>FOXA1</code>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<code>K562</code>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<code>2</code><br>
<code>JUND</code>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<code>GM12878</code>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<code>7</code><br>
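
As a hint for step 1, the standard-library `glob` module can list files that match a pattern, so no file name needs to be typed out by hand. The sketch below is only a starting point. In particular, `parse_bed_name` assumes a hypothetical naming scheme of `CELLTYPE_TF.bed` (for example `K562_FOXA1.bed`); check how the Canvas files are actually named and adjust the split accordingly.

```python
import glob
import os


def list_bed_files(directory):
    """Return the sorted paths of all files ending in .bed inside `directory`."""
    return sorted(glob.glob(os.path.join(directory, '*.bed')))


def parse_bed_name(path):
    """Split a file name into (cell_type, tf_name).

    ASSUMPTION: names look like 'K562_FOXA1.bed'. This scheme is hypothetical.
    """
    stem = os.path.splitext(os.path.basename(path))[0]  # drop folder and extension
    cell_type, tf_name = stem.split('_', 1)
    return cell_type, tf_name


# '.' means the current directory; point this at your downloaded Day 5 folder.
bedFiles = list_bed_files('.')
print(bedFiles)
```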

In [None]:
# YOUR SOLUTION HERE

chromOfInterest = 'chr2'
startOfInterest = 47000000
endOfInterest = 48000000
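
For the counting step, here is one possible sketch rather than the expected solution. It relies on the first three BED columns being chromosome, start, and end, and it reads "within" as "the peak lies entirely inside the region"; if peaks that merely overlap the region should count too, the comparison needs to change as noted in the comment. The function name and the toy lines are made up for illustration.

```python
def count_peaks_in_region(bed_lines, chrom, region_start, region_end):
    """Count peaks on `chrom` that lie entirely between region_start and region_end."""
    count = 0
    for line in bed_lines:
        line = line.strip()
        # Skip blank lines and the optional header lines a BED file may carry.
        if not line or line.startswith(('#', 'track', 'browser')):
            continue
        fields = line.split()
        peak_chrom, peak_start, peak_end = fields[0], int(fields[1]), int(fields[2])
        # For "overlaps the region" instead, use:
        #   peak_start < region_end and peak_end > region_start
        if peak_chrom == chrom and peak_start >= region_start and peak_end <= region_end:
            count += 1
    return count


# Made-up BED lines for illustration.
toy_bed = [
    'chr2\t47000100\t47000400\tpeak1\n',
    'chr2\t47999900\t48000100\tpeak2\n',   # spills past the end of the region
    'chr3\t47000100\t47000400\tpeak3\n',   # wrong chromosome
]
print(count_peaks_in_region(toy_bed, 'chr2', 47000000, 48000000))  # 1
```

Because an open file is itself an iterable of lines, the same function can be given a file handle directly, for example `with open(bedFileName) as bedFile: count_peaks_in_region(bedFile, chromOfInterest, startOfInterest, endOfInterest)`.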