# Practice: Regular Expressions and Classes

**12/5/2019<br>
BIOS 274: Introductory Python Programming for Genomics**<br>

## Table of Contents
1. [Regular expressions](#regular)
2. [Classes](#classes)
3. [Let's practice!](#practice)

<a id="regular"></a>
## Regular expressions
https://docs.python.org/3.7/library/re.html

<code>re.match(PATTERN, STRING)</code>: matches PATTERN only at the beginning of STRING<br>
<code>re.search(PATTERN, STRING)</code>: matches the first instance of PATTERN anywhere in STRING<br>
<code>re.findall(PATTERN, STRING)</code>: finds all non-overlapping instances of PATTERN and returns them as a list<br>
<code>re.sub(PATTERN, REPLACEMENT, STRING)</code>: replaces all non-overlapping instances of PATTERN in STRING with REPLACEMENT<br>

<code>re.compile(PATTERN)</code>: compiles PATTERN into a reusable pattern object<br>
<code>re.finditer(PATTERN, STRING)</code>: returns an iterator of match objects, one for each non-overlapping instance of PATTERN in STRING
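
To see how the functions listed above differ, here is a small sketch that runs each of them on the same sequence used in the cells below. The variable names (`at_start`, `first`, `all_hits`, `masked`) are just illustrative choices.

```python
import re

seq = 'TTACTGCTCACTACTA'

# re.match only looks at the very beginning of the string
at_start = re.match('ACT', seq)        # None, because seq starts with 'TT'

# re.search scans forward and returns the first match it finds
first = re.search('ACT', seq)
print(first.group(0), first.start(), first.end())   # ACT 2 5

# re.findall returns every non-overlapping match as a list of strings
all_hits = re.findall('ACT', seq)
print(all_hits)                        # ['ACT', 'ACT', 'ACT']

# re.sub replaces every non-overlapping match with the replacement string
masked = re.sub('ACT', 'N', seq)
print(masked)                          # TTNGCTCNNA

# A compiled pattern object offers the same functions as methods
pattern = re.compile('ACT')
print([m.start() for m in pattern.finditer(seq)])   # [2, 9, 12]
```

Note that `re.match` and `re.search` return `None` when nothing matches, so check the result before calling `.group()` on it.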

In [None]:
# Find the FIRST instance of a pattern using re.search()

import re

seq = 'TTACTGCTCACTACTA'

pattern = re.compile('(ACT)|(GC)')
#pattern = re.compile('(ACT)|(CT)')

matches = re.search(pattern, seq)
#print(matches)

subseq = matches.group(0)
start = matches.start(0)
end = matches.end(0)

print(subseq, start, end)

In [None]:
# Find ALL instances of a pattern using re.finditer()

import re

seq = 'TTACTGCTCACTACTA'

pattern = re.compile('(ACT)|(GC)')
#pattern = re.compile('(ACT)|(CT)')

iterator = re.finditer(pattern, seq)
#print(iterator)

for matches in iterator:
    #print(matches)
    subseq = matches.group(0)
    start = matches.start(0)
    end = matches.end(0)
    
    print(subseq, start, end)

**<code>regex</code>: A third-party regular expression library that supports overlapping matches**

In [None]:
# Find ALL instances of a pattern (including overlapping patterns!) 
#   using regex (use the flag overlapped=True)

import regex

seq = 'TTACTGCTCACTACTA'

pattern = regex.compile('(ACT)|(GC)')
#pattern = regex.compile('(ACT)|(CT)')

iterator = regex.finditer(pattern, seq, overlapped=True)
#print(iterator)

for matches in iterator:
    subseq = matches.group(0)
    start = matches.start(0)
    end = matches.end(0)
    
    print(subseq, start, end)
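
If <code>regex</code> is not installed, one possible workaround (not part of the original notebook) is to wrap the pattern in a lookahead with the built-in <code>re</code> module. A lookahead matches without consuming characters, so the search can start again at the very next position. This sketch uses the `(ACT)|(CT)` alternative from the commented-out line above, since those two patterns actually overlap in this sequence. One limitation to keep in mind: it reports at most one match per start position.

```python
import re

seq = 'TTACTGCTCACTACTA'

# Ordinary search: a match consumes its characters, so the 'CT' inside 'ACT' is skipped
plain = [m.group(0) for m in re.finditer('ACT|CT', seq)]
print(plain)   # ['ACT', 'CT', 'ACT', 'ACT']

# Lookahead: the outer match is empty, and the capture group holds the actual hit
pattern = re.compile('(?=(ACT|CT))')
hits = [(m.group(1), m.start(1), m.end(1)) for m in pattern.finditer(seq)]

for subseq, start, end in hits:
    print(subseq, start, end)
```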

<a id="classes"></a>
## Classes

In [None]:
# Thanks to Olivia deGoede for this example!

class Dog:
    def __init__(self, name, size, mood):
        self.name = name
        self.size = size
        self.mood = mood
        self.is_good_dog = True

    def action(self):
        if self.mood == 'hungry':
            return 'whines'
        elif self.mood == 'sleepy':
            return 'naps'
        else:
            return 'wags tail'
    
    def feed(self):
        if self.mood == 'hungry':
            print(self.name, 'is hungry.', self.name, self.action()+'.')
            print('Feeding', self.name, '...')
            self.mood = 'happy'
            print(self.name, 'is happy and full!')
        else:
            print('Feeding', self.name, '...')
            print(self.name, 'is not hungry.')
        print(self.name, self.action()+'.\n')

    def throw(self, toy):
        print('Throwing the', toy, 'for', self.name + '!')
        if self.mood == 'sleepy':
            print(self.name, self.action()+'.\n')
        else:
            print(self.name, 'runs and gets the', toy, 'back!')
            print(self.name, self.action()+'.\n')
        

puppy1 = Dog('Champion', 'medium', 'hungry')
puppy1.feed()

puppy2 = Dog('Rupert', 'big', 'sleepy')
puppy2.throw('ball')

puppy3 = Dog('Daisy', 'small', 'happy')
puppy3.feed()
puppy3.throw('stick')

<a id="practice"></a>
## Let's practice!

### Exercise 1

Do Part 1 of PSET 1 using <code>re</code> or <code>regex</code> to find all the cutsites!

In [None]:
def read_fasta(fasta_filename):
    '''
    Go through file, reading one line at a time, using a
    dictionary to store the DNA sequence for each of the FASTA
    entries (Gavin Sherlock, November 28, 2019)
    '''    
    with open(fasta_filename, mode='r') as fasta_file:

        sequences = {}
        
        for line in fasta_file:
            line = line.rstrip()
            if line.startswith('>'): # it's a new fasta record
                line = line.lstrip('>')
                sequences[line] = '' # initialize dictionary for this entry
                currSeqName = line
            else:
                sequences[currSeqName] += line

    return sequences

In [None]:
enzyme_sites = {'EcoRI': 'GAATTC', 'HindIII': 'AAGCTT',
                'BamHI': 'GGATCC', 'HpaI': 'GTTAAC',
                'HaeIII': 'GGCC'}

cutsite_offset = {'EcoRI': 1, 'HindIII': 1, 'BamHI': 1,
                  'HpaI': 3, 'HaeIII': 2}

In [None]:
fasta_filename = 'rosalind_dna.fsa'
read_seqs = read_fasta(fasta_filename)
#read_seqs   # uncomment this line to inspect the format of read_seqs

In [None]:
import regex

for seqName, fastaSeq in read_seqs.items(): # go through each DNA sequence

    print('Sequence: ' + seqName + ' (cut sites)') # print out the DNA sequence name

    for enzName, enzSeq in enzyme_sites.items(): # go through each enzyme
        
        cutsiteList = [] # initialize a list to store all the cutsites for this particular DNA sequence and enzyme
        enzOffset = cutsite_offset[enzName] # look up the offset for this particular enzyme
        
        ### YOUR SOLUTION HERE
        ### USE re OR regex TO FIND ALL THE CUTSITES FOR EACH ENZYME AND APPEND THEM TO cutsiteList ###

        if cutsiteList:
            print(enzName + '\t' + ', '.join(cutsiteList))

    print()

### Exercise 2

**2.a.** How many JUND binding motif sites are there in regions of open chromatin from human fetal brain tissue?<br><br>
Use <code>DNase_brain.tsv</code>.<br>
JUND binds <code>TGACTCA</code> and <code>TGAGTCA</code>.<br>

Run your script on <code>DNase_placenta.tsv</code> and <code>DNase_spinalCord.tsv</code>, too!
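
As a hint, the two JUND motifs differ only at the fourth base, so a character class can describe both with one pattern. The sketch below counts matches in a made-up sequence (`toy_seq` is hypothetical and not taken from the data files); reading the <code>.tsv</code> files is left to you.

```python
import re

# [CG] matches either C or G at the fourth position
jund = re.compile('TGA[CG]TCA')

# Hypothetical toy sequence: one TGACTCA, one TGAGTCA, and one near miss (TGATTCA)
toy_seq = 'AATGACTCAGGTGAGTCATTTGATTCA'

sites = [(m.group(0), m.start()) for m in jund.finditer(toy_seq)]
count = len(sites)

print(sites)   # [('TGACTCA', 2), ('TGAGTCA', 11)]
print(count)   # 2
```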

In [None]:
# YOUR SOLUTION HERE



**2.b.** Can you figure out how to use <code>grep</code> from the command line to accomplish this same task?

**2.c.** Make it so you can run your script from the command line like this:<br><br>
<code>python  findMotifs.py  INPUT_FILENAME  REGEX_FOR_TF_MOTIF</code>

For example:
<code>python  findMotifs.py  DNase_brain.tsv  TACGGGCAT</code>
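
One way to pick up the two command-line arguments is through <code>sys.argv</code>. The following is only a rough sketch of how <code>findMotifs.py</code> could be laid out: the helper names `count_motifs` and `main` are made up for illustration, and it assumes that searching each whole line of the input file is acceptable. If the file also holds columns you do not want to search, split each line on tabs and search only the sequence column.

```python
import re
import sys


def count_motifs(lines, motif_regex):
    """Count non-overlapping matches of motif_regex across an iterable of text lines."""
    pattern = re.compile(motif_regex)
    return sum(len(pattern.findall(line)) for line in lines)


def main(argv):
    # argv[0] is the script name, so two arguments means three entries
    if len(argv) != 3:
        print('usage: python findMotifs.py INPUT_FILENAME REGEX_FOR_TF_MOTIF')
        return
    input_filename, motif_regex = argv[1], argv[2]
    with open(input_filename, mode='r') as infile:
        print(count_motifs(infile, motif_regex))


if __name__ == '__main__':
    main(sys.argv)
```

If your regular expression contains characters that the shell treats specially, such as square brackets or `|`, put it in quotes on the command line.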