# Class 9 - **Regular Expressions**
This notebook will cover basic principles and syntax for finding text patterns in string variables.\
We will use the `re` package.\
*RE = Regular Expressions.*

**Regular expressions** can be used in very complex combinations. For more details and resources, see:
 - https://docs.python.org/library/re.html
 - https://www.py4e.com/html3/11-regex
 - https://developers.google.com/edu/python/regular-expressions
 - https://regexr.com/

Before starting with REs, let's quickly remember built-in string methods.

In [1]:
s = 'David Brent - Age:45 - 0502056655 - TZ:045111912'

# Below, fill in the code:
## 1. Get a true/false variable indicating whether the string
## contains the phone number 0502056655.
my_boolean_var = '0502056655' in s
print(my_boolean_var)

## 2. Get the index of where the phone number starts
idx = s.find('0502056655')
idx

False


23

In [4]:
s[23:]

'0502056655 - TZ:045111912'

## `re` package

In [5]:
import re

### `re.search(SEARCH TERM, STRING)`

In [6]:
s = 'David Brent - Age:45 - 0502056655 - TZ:045111912'
m = re.search('0502056655',s)
m

<re.Match object; span=(23, 33), match='0502056655'>

In [7]:
s[23:33]

'0502056655'

### Match object
A match object contains an object (variable) with all the resulting information from the search, e.g.:
- Is the searched expression in the string? (Boolean)
- What was the found substring?

In [9]:
type(m)

re.Match

In [8]:
s = 'David Brent - Age:45 - 0502056655 - TZ:045111912'
m = re.search('0502056655',s)
if m:
    print('Found!')
else:
    print('Not found')

Found!


In [10]:
m.group()

'0502056655'

#### Example

In [14]:
seq = 'GTGGTATTZCTTCGTTGTCTCTGGCGTGGTCACGTTGATTGGTCCGCTATCTGGACCGAAAAAAGTCGTA\
TTCCAAAAATAATACGTATACGTACGCGCGTGAATCACTCGCCACACGACCCAGCGCAGCGCGACAATCG\
AGTGTAGACACATAACATAACTGCCGTAGTCGTCGCCTCCGTGACATCCGCGCCAGCACCAACCCGAATC\
CGGCCGCGTCCCCCGTTTCCAAATCCAATTTCCCCTTTAATTTCGGTGGCTAATATTAGAGTCTTGCGAC\
ATGTTTAGCTTTCTGAAGCGCGAGAAGAACACCCAGGAGGTAGTGGAGAATGTGATCGGCGAGCTGAAGA\
AGATCTATCGGAGCAAGTTGCTGCCGCTGGAGGAGCACTACCAGTTCCACGACTTTCACTCGCCAAAGCT\
CGAGGATCCAGACTTCGATGCGAAGCCCATGATCCTGCTGGTGGGCCAGTACTCCACGGGCAAGACGACC\
TTCATCCGCTATCTGCTGGAACGCGACTTTCCGGGCATTAGAATTGGTCCGGAGCCAACGACGGACCGCT\
TCATCGCCGTGATGTACGACGACAAGGAGGGCGTGATACCGGGCAACGCCCTGGTTGTGGACCCCAAGAA\
GCAGTTCCGGCCGCTGTCCAAGTACGGCAACGCCTTCCTGAATCGCTTCCAATGCAGCAGTGTGGCCTCG\
CCGGTGCTGAACGCCATCTCCATCGTGGACACGCCCGGAATTCTCTCCGGCGAAAAGCAGCGCATCGACA\
GGGGCTACGACTTCACCGGCGTGCTGGAGTGGTTCGCGGAGCGCGTGGACCGCATCATCCTGCTCTTCGA'
print(re.search('[A-Z]C',seq))

<re.Match object; span=(31, 33), match='AC'>


## Meta (/special) characters:
`. ^ $ * + ? { } [ ] \ | ( )`

### `[]`
a list/set of characters

In [15]:
s

'David Brent - Age:45 - 0502056655 - TZ:045111912'

In [17]:
# Find the age
re.search('Age:[0-9][0-9]',s)

<re.Match object; span=(17, 20), match=':45'>

In [18]:
# Alternately:
re.search(':\d\d',s)

<re.Match object; span=(17, 20), match=':45'>

In [20]:
s = 'David Brent - Age:45 - 0502056655 - TZ:045111912'
re.search('\d\d\d',s)

<re.Match object; span=(38, 42), match=':045'>

In [None]:
s = 'David Brent - Age:45 - 0502056655 - TZ:045111912'
pattern = '\d\d\d'
re.search(pattern,s)

#### Exercise
Write a program that checks whether each of the restriction enzymes appears in the sequence. If yes, print the enzyme name, if not print that the enzyme does not appear.

In [None]:
sequence1 = 'atatatccgggatatatcccggatatat'

restrictionEnzymes = {}
restrictionEnzymes['bamH1'] = ['ggatcc',0]
restrictionEnzymes['sma1'] = ['cccggg',2]
restrictionEnzymes['nci1'] = ['cc[cg]gg',2]
restrictionEnzymes['scrF1'] = ['cc[atcg]gg',2]


### `.`
*any* character

In [21]:
s

'David Brent - Age:45 - 0502056655 - TZ:045111912'

In [22]:
re.search('05........',s)

<re.Match object; span=(23, 33), match='0502056655'>

In [23]:
# Alternately
re.search('05.{8}',s)

<re.Match object; span=(23, 33), match='0502056655'>

#### EXERCISE
Write a `re.search()` command to extract the T.Z.

In [24]:
# Write a re.search() command to extract the T.Z. from s variable
s = 'David Brent - Age:45 - 0502056655 - TZ:045111912'
re.search('TZ:.{9}', s)

<re.Match object; span=(36, 48), match='TZ:045111912'>

#### "Escaping"
What if we are actually looking for the "." character? (or any other meta-character)

In [25]:
s = 'David Brent - Age:45 - 0502056655 - TZ:045111912 - davidbrent@gmail.net'
re.search('\.',s)

<re.Match object; span=(67, 68), match='.'>

In [26]:
re.search('\.[a-zA-Z]+',s)

<re.Match object; span=(67, 71), match='.net'>

### `^`
negative of a character

In [30]:
s = 'David Brent - Age:45 - 0502056655 - TZ:045111912 - davidbrent@gmail.net'

In [31]:
re.search('[^a-zA-Z ]',s)

<re.Match object; span=(13, 14), match='-'>

#### EXERCISE
In the sequence below, there is one character that is not a valid nucleotide.\
Write a `re.search()` command to find its location.

In [32]:
seq = 'GTGGTATTTCTTCGTTGTCTCTGGCGTGGTCACGTTGATTGGTCCGCTATCTGGACCGAAAAAAGTCGTA\
TTCCAAAAATAATACGTATACGTACGCGCGTGAATCACTCGCCACACGACCCAGCGCAGCGCGACAATCG\
AGTGTAGACACATAACATAACTGCCGTAGTCGTCGCCTCCGTGACATCCGCGCCAGCACCAACCCGAAIC\
CGGCCGCGTCCCCCGTTTCCAAATCCAATTTCCCCTTTAATTTCGGTGGCTAATATTAGAGTCTTGCGAC\
ATGTTTAGCTTTCTGAAGCGCGAGAAGAACACCCAGGAGGTAGTGGAGAATGTGATCGGCGAGCTGAAGA\
AGATCTATCGGAGCAAGTTGCTGCCGCTGGAGGAGCACTACCAGTTCCACGACTTTCACTCGCCAAAGCT\
CGAGGATCCAGACTTCGATGCGAAGCCCATGATCCTGCTGGTGGGCCAGTACTCCACGGGCAAGACGACC\
TTCATCCGCTATCTGCTGGAACGCGACTTTCCGGGCATTAGAATTGGTCCGGAGCCAACGACGGACCGCT\
TCATCGCCGTGATGTACGACGACAAGGAGGGCGTGATACCGGGCAACGCCCTGGTTGTGGACCCCAAGAA\
GCAGTTCCGGCCGCTGTCCAAGTACGGCAACGCCTTCCTGAATCGCTTCCAATGCAGCAGTGTGGCCTCG\
CCGGTGCTGAACGCCATCTCCATCGTGGACACGCCCGGAATTCTCTCCGGCGAAAAGCAGCGCATCGACA\
GGGGCTACGACTTCACCGGCGTGCTGGAGTGGTTCGCGGAGCGCGTGGACCGCATCATCCTGCTCTTCGA'
re.search('[^ACGT]', seq)

<re.Match object; span=(208, 209), match='I'>

### `\s`
Whitespace

In [34]:
s

'David  Brent - Age:45 - 0502056655 - TZ:045111912 - davidbrent@gmail.net'

In [33]:
re.search('\s',s)


<re.Match object; span=(5, 6), match=' '>

#### Upper and lower case meta-characters (`\s` vs. `\S`)
Upper (capitalized) meta-characters are the negation - so think of `\S` as `not('\s')`

In [35]:
l = ['This string starts with a word - "This"', '\tThis string started with a tab']
print(l[0])
print(l[1])

This string starts with a word - "This"
	This string started with a tab


In [36]:
print(re.search('\s',l[0]))
print(re.search('\s',l[1]))

<re.Match object; span=(4, 5), match=' '>
<re.Match object; span=(0, 1), match='\t'>


In [None]:
print(re.search('\S',l[0]))
print(re.search('\S',l[1]))

In [37]:
s = '05020555555-is my phone number'
re.search('\d',s)

<re.Match object; span=(0, 1), match='0'>

In [38]:
re.search('\D',s)

<re.Match object; span=(11, 12), match='-'>

## Quantifiers
How many times the portion must appear:
- `*` zero or more occurences
- `+` one or more occurences
- `?` one or zero (none) occurences

In [39]:
seq = 'TGCACAAGGGCAGGAACTCATCAAATCGAAACTGCCCAACTCGGTGCTCAGCAAGATCTGGAAACTGTCGGAC'
re.search('CA+[GT]',seq)

<re.Match object; span=(10, 13), match='CAG'>

In [41]:
s = '9990'
m = re.search('9+',s)
m

<re.Match object; span=(0, 3), match='999'>

In [None]:
m.end() - m.start()

#### Exercise
Find a sequence that starts with G, ends with A, and in-between has at least one occurence of C.

In [None]:
# Find a sequence that starts with G, ends with A,
# and in-between has at least one occurence of C.
seq = 'TGTTTTTTAAGGGCCCCCCCCCCAGGAACTCATCAAATCGAAACTGCCCAACTCGGTGCTCAGCAAGATCTGGAAACTGTCGGAC'
...
m

### `.*`
Use to find any 'fillers'.

In [42]:
seq = 'TGCCAAGCAGGAACTCATCAAATCGAAACTGCCCAACTCGGTGCTCAGCAAGATCTGGAAACTGTCGGAC'
m = re.search('AA.*AA',seq)
print(m)
m.group()

<re.Match object; span=(4, 61), match='AAGCAGGAACTCATCAAATCGAAACTGCCCAACTCGGTGCTCAGCAAGA>


'AAGCAGGAACTCATCAAATCGAAACTGCCCAACTCGGTGCTCAGCAAGATCTGGAAA'

In [43]:
len(m.group())

57

`.*` is **greedy**. What if we want the *shortest* sequence?

In [44]:
re.search('AA.*?AA',seq)

<re.Match object; span=(4, 13), match='AAGCAGGAA'>

### `{number of repetitions}` or `{minimum,maximum}`
Set length of spaces

In [None]:
sequence1 = 'atatatccgggatatatcccggatatatccgga'
re.findall('cc.{3}', sequence1)

In [45]:
seq = 'TGCCAAGCAGGAACTCATCAAATCGAAACTGCCCAACTCGGTGCTCAGCAAGATCTGGAAACTGTCGGAC'
m = re.search('AA.{10,20}AA', seq)
m

<re.Match object; span=(4, 28), match='AAGCAGGAACTCATCAAATCGAAA'>

In [None]:
len(m.group())

#### Exercise
Find *all* sequences that are flanked (begin with and end with) 'AA' and are between 7 and 14 characters (neucleotides) long

In [None]:
seq = 'TGCCAAGCAGGAACTCATCAAATCGAAACTGCCCAACTCGGTGCTCAGCAAGATCTGGAAACTGTCGGAC'


## Grouping `(regexp)`

In [None]:
seq = 'TGCCAAGCAGGAACTCTCTACAAATCGAAACTGCCCAACTCGGTGCTCAGCAAGATCTGGAAACTGTCGGAC'
re.search('(CT.*)|()',seq)

Grouping works well with the OR condition - `|`

In [None]:
m = re.findall('A.{2,10}A',seq)
for each_match in m:
    idx = seq.find(each_match)
    print(each_match, idx)

## `re.findall()`
`findall()` is similar to `search()`, but it doesn't stop at the first match. Instead, it returns all the matches in the string.

In [46]:
print(seq)
re.findall('AA.*?AA',seq)

TGCCAAGCAGGAACTCATCAAATCGAAACTGCCCAACTCGGTGCTCAGCAAGATCTGGAAACTGTCGGAC


['AAGCAGGAA', 'AAATCGAA', 'AACTCGGTGCTCAGCAA']

Note that the above returned list *does not* contain overlapping sequences.\
If you want overlaps, you will need to (first time [install](https://youtu.be/PPKj9ic5MmI)) and import `regex` package. `regex` is a third-party python package.

In [47]:
import regex

In [48]:
seq

'TGCCAAGCAGGAACTCATCAAATCGAAACTGCCCAACTCGGTGCTCAGCAAGATCTGGAAACTGTCGGAC'

In [49]:
m = regex.findall('AA.*?AA', seq, overlapped=True)
m

['AAGCAGGAA',
 'AACTCATCAA',
 'AAATCGAA',
 'AATCGAA',
 'AAACTGCCCAA',
 'AACTGCCCAA',
 'AACTCGGTGCTCAGCAA',
 'AAGATCTGGAA']

Great, we have also overlapping sequences.\
But, what we didn't get with either `re` or `regex` is the **indices of where in the sequence** the substring is...

## `re.finditer()`

In [50]:
print('Sequence:', seq)
m_iter = re.finditer('AA.*?AA',seq)
m_iter

Sequence: TGCCAAGCAGGAACTCATCAAATCGAAACTGCCCAACTCGGTGCTCAGCAAGATCTGGAAACTGTCGGAC


<callable_iterator at 0x270b947fd90>

In [None]:
next(m_iter)

In [51]:
print('Sequence:', seq)
m_iter = re.finditer('AA.*?AA',seq)
for each_match in m_iter:
    print(each_match.group(), each_match.start(), each_match.end())

Sequence: TGCCAAGCAGGAACTCATCAAATCGAAACTGCCCAACTCGGTGCTCAGCAAGATCTGGAAACTGTCGGAC
AAGCAGGAA 4 13
AAATCGAA 19 27
AACTCGGTGCTCAGCAA 34 51


## Exercises

Write an 'email verifying program'. The program that asks a user to input an email address, and the program checks if it:
1. Contains a @ character
2. Has a string before and after the @
4. The string after @ is followed by either .com, .net, or .co.il