# Class 8 - **Regular Expressions**
This notebook will cover basic principles and syntax for finding text patterns in string variables.\
We will use the `re` package.\
*RE = Regular Expressions.*

**Regular expressions** can be used in very complex combinations. For more details and resources, see:
 - https://docs.python.org/library/re.html
 - https://www.py4e.com/html3/11-regex
 - https://developers.google.com/edu/python/regular-expressions
 - https://regexr.com/

Before starting with REs, let's quickly review Python's built-in string methods.
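As a quick refresher, here is a minimal sketch of the membership test and `str.find()`. The string `text` is a made-up example, not part of the class data.

```python
text = 'gene: BRCA1, organism: human'  # made-up example string

has_gene = 'BRCA1' in text        # True/False membership test
position = text.find('BRCA1')     # index of the first occurrence
missing = text.find('TP53')       # find() returns -1 when the substring is absent

# Pitfall: -1 is "truthy", so the result of find() is not a reliable Boolean
print(has_gene, position, missing, bool(missing))
```

Use `in` when you only need a yes/no answer, and `find()` when you need the position.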

In [None]:
s = 'David Brent - Age:45 - 0502056655 - TZ:045111912'

# Below, fill in the code:
## 1. Get a true/false variable indicating whether the string
## contains the phone number 0502056655.
my_boolean_var = '0502056655' in s  # `in` returns True/False; find() returns an index
my_boolean_var

## 2. Get the index of where the phone number starts
#idx = s.

## `re` package

In [None]:
import re

### `re.search(SEARCH TERM, STRING)`

In [None]:
s = 'David Brent - Age:45 - 0502056655 - TZ:045111912'
m = re.search('0502056655',s)
m

### Match object
A match object is an object (variable) that holds all the information resulting from the search, e.g.:
- Is the searched expression in the string? (Boolean)
- What was the found substring?
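A minimal sketch of how these questions can be answered from a match object. The variable names (`found`, `matched_text`, `no_match`) are chosen here for illustration only.

```python
import re

s = 'David Brent - Age:45 - 0502056655 - TZ:045111912'

m = re.search('0502056655', s)
found = m is not None        # is the searched expression in the string?
matched_text = m.group()     # the substring that was found
start, end = m.span()        # indices such that s[start:end] == m.group()

# When nothing matches, re.search returns None instead of a match object
no_match = re.search('0000000000', s)

print(found, matched_text, start, end, no_match)
```

Because a failed search returns `None`, it is safest to test `if m:` before calling `m.group()`.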

In [None]:
type(m)

In [None]:
s = 'David Brent - Age:45 - 0502056655 - TZ:045111912'
m = re.search('0502056655',s)
if m:
    print('Found!')

In [None]:
m.group()

#### Example

In [None]:
seq = 'GTGGTATTTCTTCGTTGTCTCTGGCGTGGTCACGTTGATTGGTCCGCTATCTGGACCGAAAAAAGTCGTA\
TTCCAAAAATAATACGTATACGTACGCGCGTGAATCACTCGCCACACGACCCAGCGCAGCGCGACAATCG\
AGTGTAGACACATAACATAACTGCCGTAGTCGTCGCCTCCGTGACATCCGCGCCAGCACCAACCCGAATC\
CGGCCGCGTCCCCCGTTTCCAAATCCAATTTCCCCTTTAATTTCGGTGGCTAATATTAGAGTCTTGCGAC\
ATGTTTAGCTTTCTGAAGCGCGAGAAGAACACCCAGGAGGTAGTGGAGAATGTGATCGGCGAGCTGAAGA\
AGATCTATCGGAGCAAGTTGCTGCCGCTGGAGGAGCACTACCAGTTCCACGACTTTCACTCGCCAAAGCT\
CGAGGATCCAGACTTCGATGCGAAGCCCATGATCCTGCTGGTGGGCCAGTACTCCACGGGCAAGACGACC\
TTCATCCGCTATCTGCTGGAACGCGACTTTCCGGGCATTAGAATTGGTCCGGAGCCAACGACGGACCGCT\
TCATCGCCGTGATGTACGACGACAAGGAGGGCGTGATACCGGGCAACGCCCTGGTTGTGGACCCCAAGAA\
GCAGTTCCGGCCGCTGTCCAAGTACGGCAACGCCTTCCTGAATCGCTTCCAATGCAGCAGTGTGGCCTCG\
CCGGTGCTGAACGCCATCTCCATCGTGGACACGCCCGGAATTCTCTCCGGCGAAAAGCAGCGCATCGACA\
GGGGCTACGACTTCACCGGCGTGCTGGAGTGGTTCGCGGAGCGCGTGGACCGCATCATCCTGCTCTTCGA'
re.findall('A.A',seq)

## Meta (/special) characters:
`. ^ $ * + ? { } [ ] \ | ( )`

### `[]`
a list/set of characters

In [None]:
# Find the age
re.search(':[0-9][0-9] -',s)

In [None]:
# Alternately:
re.search(r':\d\d',s)  # raw string (r'...') keeps the backslash intact

In [None]:
s = 'David Brent - Age:45 - 0502056655 - TZ:045111912'
re.search(r'\d\d\d',s)
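The sketch below puts a few character sets side by side. The `r'...'` raw-string prefix is used so that Python passes backslashes such as `\d` to `re` unchanged. The variable names are illustrative.

```python
import re

s = 'David Brent - Age:45 - 0502056655 - TZ:045111912'

# [0-9] is an explicit range; \d is the shorthand for "a digit"
age_range = re.search(r':[0-9][0-9]', s).group()
age_short = re.search(r':\d\d', s).group()

# A set matches exactly ONE character out of those listed
capitals = re.findall(r'[A-Z]', s)
vowels = re.findall(r'[aeiou]', 'David Brent')

print(age_range, age_short, capitals, vowels)
```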

#### Exercise
Write a program that checks whether each of the restriction enzymes appears in the sequence. If it does, print the enzyme name; if not, print that the enzyme does not appear.

In [None]:
sequence1 = 'atatatccgggatatatcccggatatat'

restrictionEnzymes = {}
restrictionEnzymes['bamH1'] = ['ggatcc',0]
restrictionEnzymes['sma1'] = ['cccggg',2]
restrictionEnzymes['nci1'] = ['cc[cg]gg',2]
restrictionEnzymes['scrF1'] = ['cc[atcg]gg',2]


In [None]:
for key in restrictionEnzymes:
    print(re.search(restrictionEnzymes[key][0],sequence1))

### `.`
*any* single character (except a newline)

In [None]:
re.search('05........',s)

In [None]:
# Alternately
re.search('05.{8}',s)  # 8 repetitions of '.', same as the eight dots above
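A small sketch showing that the dot stands for exactly one character, and that by default it does not match a newline. The `re.DOTALL` flag is part of the standard `re` module. The test string `'a\nc'` is made up for illustration.

```python
import re

s = 'David Brent - Age:45 - 0502056655 - TZ:045111912'

# Eight dots and .{8} describe the same thing: any eight characters
phone_dots = re.search(r'05........', s).group()
phone_count = re.search(r'05.{8}', s).group()

# By default '.' does not match a newline; re.DOTALL changes that
without_flag = re.search(r'a.c', 'a\nc')
with_flag = re.search(r'a.c', 'a\nc', re.DOTALL)

print(phone_dots, phone_count, without_flag, with_flag)
```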

#### EXERCISE
Write a `re.search()` command to extract the T.Z.

In [None]:
# Write a re.search() command to extract the T.Z.
XXX

#### "Escaping"
What if we are actually looking for the "." character? (or any other meta-character)

In [None]:
s = 'David Brent - Age:45 - 0502056655 - TZ:045111912 - davidbrent@gmail.com'
re.search(r'\.',s)  # the backslash makes '.' a literal dot

In [None]:
re.search(r'\.[a-zA-Z]+',s)
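To make the difference concrete, here is a sketch comparing an unescaped dot with two ways of writing a literal dot. It also uses `re.escape()`, a standard-library helper that escapes every meta-character in a string for you. Variable names are illustrative.

```python
import re

s = 'David Brent - Age:45 - 0502056655 - TZ:045111912 - davidbrent@gmail.com'

unescaped = re.search(r'.', s).group()    # any character: the first one in s
escaped = re.search(r'\.', s).group()     # a literal dot
in_set = re.search(r'[.]', s).group()     # inside [] the dot is also literal

# re.escape builds a pattern that matches the given text literally
pattern = re.escape('gmail.com')
literal_hit = re.search(pattern, s).group()
literal_miss = re.search(pattern, 'gmailxcom')

print(unescaped, escaped, in_set, literal_hit, literal_miss)
```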

### `^`
Inside `[]`, a leading `^` negates the set: match any one character that is *not* listed

In [None]:
re.search('[^a-zA-Z]',s)

In [None]:
seq = 'TGCCAAGCAGGAACTCATCAAATCGAAACTGCCCAACTCGGTGCTCAGCAAGATCTGGAAACTGTCGGAC'
re.search('[^CGT]', seq)
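One practical use of a negated set is validating input, sketched below. The sketch also shows that `^` *outside* square brackets has a different meaning: it anchors the pattern to the start of the string. Variable names are illustrative.

```python
import re

seq = 'TGCCAAGCAGGAACTCATCAAATCGAAACTGCCCAACTCGGTGCTCAGCAAGATCTGGAAACTGTCGGAC'

# First character that is not C, G or T
first_other = re.search(r'[^CGT]', seq)

# No character outside A/C/G/T means the sequence looks like valid DNA
invalid_char = re.search(r'[^ACGT]', seq)
is_valid_dna = invalid_char is None

# Outside [], ^ means "at the start of the string"
starts_with_tgcc = re.search(r'^TGCC', seq) is not None
starts_with_gcc = re.search(r'^GCC', seq) is not None

print(first_other.group(), first_other.start(), is_valid_dna,
      starts_with_tgcc, starts_with_gcc)
```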

### `\s`
Whitespace

In [None]:
print(re.search(r'\s',s))  # print, otherwise only the last line of the cell is displayed
s[4:7]

#### Upper and lower case meta-characters (`\s` vs. `\S`)
An uppercase (capitalized) meta-character is the negation of its lowercase counterpart, so think of `\S` as "not `\s`" and `\D` as "not `\d`".

In [None]:
l = ['This string starts with a word - "This"', '\tThis string started with a tab']
print(l[0])
print(l[1])

In [None]:
print(re.search(r'\s',l[0]))
print(re.search(r'\s',l[1]))

In [None]:
print(re.search(r'\S',l[0]))
print(re.search(r'\S',l[1]))

In [None]:
s = '05020555555-is my phone number'
re.search(r'\d',s)

In [None]:
re.search(r'\D',s)
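The lowercase/uppercase pairs can be compared side by side, as in the sketch below. It also uses `\W` ("not a word character"), which follows the same convention but is not covered in the cells above.

```python
import re

s = '05020555555-is my phone number'

first_digit = re.search(r'\d', s)       # first digit
first_non_digit = re.search(r'\D', s)   # first character that is NOT a digit
first_space = re.search(r'\s', s)       # first whitespace character
first_non_word = re.search(r'\W', s)    # first character that is not a letter, digit or _

print(first_digit.start(), first_non_digit.group(),
      first_space.start(), first_non_word.group())
```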

## Quantifiers
How many times the preceding element must appear:
- `*` zero or more occurrences
- `+` one or more occurrences
- `?` zero or one occurrence

In [None]:
seq = 'TGCCAAGCAGGAACTCATCAAATCGAAACTGCCCAACTCGGTGCTCAGCAAGATCTGGAAACTGTCGGAC'
re.search('CA+[ACGT]',seq)
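To see how the three quantifiers differ, the sketch below runs them against the same short string. The string `'CT CAT CAAT'` is made up for illustration.

```python
import re

text = 'CT CAT CAAT'  # made-up string: zero, one and two A's between C and T

star = re.findall(r'CA*T', text)      # zero or more A's
plus = re.findall(r'CA+T', text)      # one or more A's
optional = re.findall(r'CA?T', text)  # zero or one A

seq = 'TGCCAAGCAGGAACTCATCAAATCGAAACTGCCCAACTCGGTGCTCAGCAAGATCTGGAAACTGTCGGAC'
first_hit = re.search(r'CA+[ACGT]', seq).group()

print(star, plus, optional, first_hit)
```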

#### Exercise
Find a sequence that starts with G, ends with A, and in between has at least one occurrence of C.

In [None]:
# Find a sequence that starts with G, ends with A,
# and in between has at least one occurrence of C.
XXX

### `.*`
Use `.*` to match any run of 'filler' characters (including none) between two parts of a pattern.

In [None]:
seq = 'TGCCAAGCAGGAACTCATCAAATCGAAACTGCCCAACTCGGTGCTCAGCAAGATCTGGAAACTGTCGGAC'
re.search('AA.*AA',seq)

`.*` is **greedy**: it matches as many characters as it can. What if we want the *shortest* sequence? Add `?` after the quantifier (`.*?`).

In [None]:
re.search('AA.*?AA',seq)
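The two behaviours can be compared directly, as in this sketch. The variable names `greedy` and `lazy` are illustrative.

```python
import re

seq = 'TGCCAAGCAGGAACTCATCAAATCGAAACTGCCCAACTCGGTGCTCAGCAAGATCTGGAAACTGTCGGAC'

greedy = re.search(r'AA.*AA', seq).group()   # runs from the first AA to the LAST AA
lazy = re.search(r'AA.*?AA', seq).group()    # stops at the NEAREST following AA

print(len(greedy), greedy)
print(len(lazy), lazy)
```

Both matches start at the same position. Only where they end differs.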

### `{number of repetitions}` or `{minimum,maximum}`
Set the exact number of repetitions (or a minimum and maximum number) of the preceding element

In [None]:
sequence1 = 'atatatccgggatatatcccggatatat'
re.findall('cc.{3}', sequence1)

In [None]:
seq = 'TGCCAAGCAGGAACTCATCAAATCGAAACTGCCCAACTCGGTGCTCAGCAAGATCTGGAAACTGTCGGAC'
re.search('AA.{10,20}AA', seq)
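A short sketch of the different brace forms. The strings of repeated `A`s are made up for illustration. The `{2,}` form (minimum only, no maximum) is standard `re` syntax but is not used in the cells above.

```python
import re

sequence1 = 'atatatccgggatatatcccggatatat'
exact_three = re.findall(r'cc.{3}', sequence1)    # 'cc' plus exactly 3 characters

runs = 'A AA AAA AAAA'  # made-up string with runs of 1 to 4 A's
two_to_three = re.findall(r'A{2,3}', runs)        # between 2 and 3 A's
two_or_more = re.findall(r'A{2,}', runs)          # at least 2 A's, no upper limit

print(exact_three, two_to_three, two_or_more)
```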

#### Exercise
Find *all* sequences that are flanked by 'AA' (begin and end with it) and are between 7 and 14 characters (nucleotides) long.

In [None]:
seq = 'TGCCAAGCAGGAACTCATCAAATCGAAACTGCCCAACTCGGTGCTCAGCAAGATCTGGAAACTGTCGGAC'
XXX

## Grouping `(regexp)`

In [None]:
seq = 'TGCCAAGCAGGAACTCTCTACAAATCGAAACTGCCCAACTCGGTGCTCAGCAAGATCTGGAAACTGTCGGAC'
re.search('(CT)+',seq)

Grouping works well with the OR condition - `|`

In [None]:
re.search('(AAA|CCC)',seq)
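Parentheses do two jobs: they let a quantifier apply to a whole sub-pattern, and they capture the text they matched so it can be read back with `m.group(1)`. The sketch below illustrates both. The `Age:(\d+)` pattern is an illustrative example, not taken from the cells above.

```python
import re

seq = 'TGCCAAGCAGGAACTCTCTACAAATCGAAACTGCCCAACTCGGTGCTCAGCAAGATCTGGAAACTGTCGGAC'

no_group = re.search(r'CT+', seq).group()      # + applies only to T
with_group = re.search(r'(CT)+', seq)          # + applies to the whole 'CT' unit
either = re.search(r'(AAA|CCC)', seq).group(1)

s = 'David Brent - Age:45 - 0502056655 - TZ:045111912'
m = re.search(r'Age:(\d+)', s)
whole, age = m.group(0), m.group(1)            # group(0) is the full match

print(no_group, with_group.group(), either, whole, age)
```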

## `re.findall()`
`findall()` is similar to `search()`, but it doesn't stop at the first match. Instead, it returns all the matches in the string.
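One detail worth knowing before combining `findall()` with grouping: when the pattern contains a capturing group, `findall()` returns the text of the group rather than the whole match. A non-capturing group, `(?:...)`, avoids this. The sketch below uses a made-up string.

```python
import re

text = 'ACT AGT ATT'  # made-up string

first_only = re.search(r'A[CG]T', text).group()      # search stops at the first match
all_matches = re.findall(r'A[CG]T', text)            # findall returns every match

groups_only = re.findall(r'A(C|G)T', text)           # capturing group: only the group text
whole_matches = re.findall(r'A(?:C|G)T', text)       # non-capturing group: whole matches

print(first_only, all_matches, groups_only, whole_matches)
```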

In [None]:
print(seq)
re.findall('AA.*?AA',seq)

Note that the list returned above *does not* contain overlapping sequences.\
If you want overlaps, you will need to [install](https://youtu.be/PPKj9ic5MmI) (first time only) and import the `regex` package. `regex` is a third-party Python package.

In [None]:
import regex

In [None]:
m = regex.findall('AA.*?AA', seq, overlapped=True)
m
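If installing a third-party package is not an option, one common workaround with the standard `re` module is to wrap the pattern in a lookahead with a capturing group, `(?=(...))`. This is a different technique from `regex`'s `overlapped=True`, shown here as a sketch on a made-up string.

```python
import re

text = 'AAxAAyAA'  # made-up string with two overlapping AA...AA stretches

# Ordinary findall: the second stretch is skipped because it overlaps the first
plain = re.findall(r'AA.*?AA', text)

# The lookahead (?=...) consumes no characters, so the search can restart
# one position later; the inner group captures what the lookahead saw
overlapping = re.findall(r'(?=(AA.*?AA))', text)

print(plain, overlapping)
```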

Great, we now also have the overlapping sequences.\
But what we didn't get with either `re` or `regex` is the **indices of where in the sequence** each substring is found...

## `re.finditer()`

In [None]:
print('Sequence:', seq)
m_iter = re.finditer('AA.*?AA',seq)
(m_iter)

In [None]:
next(m_iter)

In [None]:
print('Sequence:', seq)
m_iter = re.finditer('AA.*?AA',seq)
for each_match in m_iter:
    print(each_match.group(), each_match.start(), each_match.end())
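If the matches and their positions are needed later, they can be collected into a list instead of printed. A minimal sketch, with `hits` as an illustrative name:

```python
import re

seq = 'TGCCAAGCAGGAACTCTCTACAAATCGAAACTGCCCAACTCGGTGCTCAGCAAGATCTGGAAACTGTCGGAC'

# One (text, start, end) tuple per match
hits = [(m.group(), m.start(), m.end()) for m in re.finditer(r'AA.*?AA', seq)]

for text, start, end in hits:
    # start/end are slice indices into the original sequence
    print(text, start, end, seq[start:end] == text)
```

Note that the iterator returned by `finditer()` can be consumed only once. Call `finditer()` again (or keep the list) to go over the matches a second time.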

## Exercises

Write an 'email verifying program'. The program asks the user to input an email address and checks that it:
1. Contains an @ character
2. Has a string before and after the @
3. Has the string after the @ followed by either .com, .net, or .co.il

In [None]:
email = input()
m = re.search(XXX, email)
