# Regular Expressions

Regular expressions are a powerful tool for searching for patterns in text files. Here we're going to learn how to use them by searching for patterns in the unicorn genome.

Outline
- Using regular expressions at the command line with `grep`
- Using regular expressions in Python






## Command line searching using grep
Perhaps we're getting ready to perform a RAD analysis with unicorns. We have two restriction enzymes in the lab (SbfI [CCTGCAGG] and TseI [GCWGC]), and we want to use whichever one produces the most loci. We can use regular expressions to search for each restriction site.

Usage:
>grep pattern file

1. Search for SbfI

In [32]:
!grep CCTGCAGG unicorn_genome.fa

CCCCTGCAGGTCAGAGCCTGCAGGCCAACCCCTCGAAGTCAATTCCTTCCAAGCGACTGAGCTAAGTATTTGCCGAGTAGCTAGTAGTGCCAGTTGTCCCATTCAAATCCATTCGGTGGTAACTACGCAGGCTTTGCGAGCGCTCGTATTGTCCGACTTGATCTACGTAAGATGGACGTTGCTATATTCATCATCTTATTCGCCCATCAATTGGGCGACTATGACTATAAGACACCGTCTACTCCAGACCCTTAATGGGGCCCATGCGATTTCACATCACATTTTTTATCTGCCATTTACTTACTACCGCTGACTTTGATAGCGCTCGAACTGAGTCGTAGCCCCGTTTCAACATAGTATTCCCAATCCCTTTATTGTGTGGCCAGTAGGAATTCCTGCAGGCCTTTGCCGAAAAATGCCTGCGCCCATCCTAAGATGCTACTATAATACCCACTGCCCCCGGATTAGGGGATGGGTGCTCGTTAATCCACTGATTGTCCTTACGTTGTAGGCATGGACTTAGACGTGGGATTTCGAATTATATTCGATTAGGCTGAGCTGTGGAACTAAATTTGCGATACTGGGGTAGAAAAGCAACAGATGAGAGGATCGAGGGTGGTTTAAGAGAGCAGGTATTCCGGATATAGCCGTGGAGTGTTACCATAACAGGACAGCCGCCTCCAAGACGTAGAACCTTTCAGTTGTAGGCATCGTCAATTAATCTATCAGGACGCCTGCCCATCTATCTAGTATTAATATATGACAAACGACCCCCCGGGAGTGGCCGCTGATTTGGAGAATGGGTACACGGTAAATATCTGCTCTCCTTCTGCCCCATCCGTACGTCCATAGCGCAGCCCTTTCCATAGTGCGCTAGATTCGCCCGCCTACTGGACTACTCGGCCTCTCCGCGGTGAATACATCTTCATTCGTGATGATAGAGGGCTGCTAGCTTTAGCGCCAAGATATGCTTTAAGACCTTTGTGTCTCGGTTTGGAGC

This returns every line in which the pattern CCTGCAGG was found. To count those lines, pipe the results to the `wc` command with the `-l` option, which counts lines.

In [33]:
!grep "CCTGCAGG" unicorn_genome.fa | wc -l

45


But what if a line had the restriction site more than once? We can use the -o option to print out only the matches and not the whole lines. This will print one match per line.

In [35]:
!grep -o "CCTGCAGG" unicorn_genome.fa | wc -l

50
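
To see why the two counts can differ, here is a minimal Python sketch on a few made-up sequence lines (not taken from `unicorn_genome.fa`). Counting lines that contain the site mirrors `grep ... | wc -l`, while counting every match mirrors `grep -o ... | wc -l`. The variable names are only illustrative.

```python
import re

# Toy sequence lines, invented for illustration (not from unicorn_genome.fa).
lines = [
    "AACCTGCAGGTTCCTGCAGGAA",  # two SbfI sites on one line
    "GGCCTGCAGGTT",            # one site
    "ATATATATAT",              # no site
]

site = re.compile("CCTGCAGG")

# Like `grep CCTGCAGG file | wc -l`: count lines with at least one match.
matching_lines = sum(1 for line in lines if site.search(line))

# Like `grep -o CCTGCAGG file | wc -l`: count every non-overlapping match.
total_matches = sum(len(site.findall(line)) for line in lines)

print(matching_lines, total_matches)  # 2 3
```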


The pattern for TseI has a 'W' in it, which is the IUPAC code for A or T. In regular expressions, square brackets match any single character listed inside them, so `[AT]` matches either A or T.

In [30]:
!grep -o "GC[AT]GC" unicorn_genome.fa

GCAGC
GCTGC
GCAGC
GCAGC
GCAGC
GCTGC
GCTGC
GCTGC
GCAGC
GCAGC
GCTGC
GCAGC
GCTGC
GCAGC
GCTGC
GCTGC
GCAGC
GCAGC
GCTGC
GCTGC
GCAGC
GCAGC
GCTGC
GCAGC
GCAGC
GCAGC
GCAGC
GCAGC
GCTGC
GCAGC
GCAGC
GCTGC
GCTGC
GCAGC
GCTGC
GCAGC
GCTGC
GCTGC
GCAGC
GCAGC
GCAGC
GCTGC
GCTGC
GCTGC
GCTGC
GCTGC
GCTGC
GCTGC
GCAGC
GCTGC
GCAGC
GCTGC
GCTGC
GCTGC
GCAGC
GCAGC
GCTGC
GCAGC
GCTGC
GCTGC
GCAGC
GCTGC
GCAGC
GCTGC
GCTGC
GCTGC
GCTGC
GCAGC
GCTGC
GCTGC
GCAGC
GCAGC
GCTGC
GCAGC
GCTGC
GCAGC
GCAGC
GCAGC
GCAGC
GCTGC
GCAGC
GCAGC
GCAGC
GCAGC
GCAGC
GCTGC
GCTGC
GCAGC
GCTGC
GCTGC
GCTGC
GCAGC
GCAGC
GCAGC
GCTGC
GCTGC
GCTGC
GCTGC
GCTGC
GCAGC
GCTGC
GCAGC
GCTGC
GCAGC
GCTGC
GCTGC
GCTGC
GCTGC
GCTGC
GCTGC
GCAGC
GCTGC
GCTGC
GCTGC
GCAGC
GCTGC
GCAGC
GCTGC
GCAGC
GCTGC
GCAGC
GCAGC
GCAGC
GCTGC
GCTGC
GCAGC
GCTGC
GCTGC
GCAGC
GCTGC
GCTGC
GCAGC
GCTGC
GCTGC
GCAGC
GCTGC
GCTGC
GCAGC
GCTGC
GCAGC
GCTGC
GCTGC
GCTGC

GCTGC
GCTGC
GCAGC
GCAGC
GCTGC
GCAGC
GCAGC
GCTGC
GCTGC
GCAGC
GCTGC
GCTGC
GCTGC
GCTGC
GCAGC
GCTGC
GCAGC
GCAGC
GCAGC
GCAGC
GCTGC
GCAGC
GCTGC
GCAGC
GCTGC
GCAGC
GCTGC
GCAGC
GCTGC
GCAGC
GCAGC
GCTGC
GCAGC
GCTGC
GCTGC
GCAGC
GCTGC
GCAGC
GCAGC
GCAGC
GCAGC
GCTGC
GCTGC
GCTGC
GCAGC
GCTGC
GCAGC
GCAGC
GCAGC
GCAGC
GCTGC
GCTGC
GCAGC
GCAGC
GCTGC
GCTGC
GCAGC
GCTGC
GCAGC
GCTGC
GCTGC
GCTGC
GCTGC
GCTGC
GCAGC
GCAGC
GCTGC
GCAGC
GCAGC
GCAGC
GCAGC
GCTGC
GCAGC
GCAGC
GCTGC
GCTGC
GCTGC
GCTGC
GCAGC
GCAGC
GCTGC
GCAGC
GCAGC
GCTGC
GCAGC
GCTGC
GCAGC
GCTGC
GCAGC
GCAGC
GCTGC
GCAGC
GCTGC
GCTGC
GCTGC
GCAGC
GCAGC
GCAGC
GCAGC
GCTGC
GCTGC
GCAGC
GCAGC
GCTGC
GCAGC
GCTGC
GCAGC
GCAGC
GCAGC
GCAGC
GCTGC
GCAGC
GCAGC
GCAGC
GCAGC
GCTGC
GCTGC
GCTGC
GCTGC
GCAGC
GCAGC
GCAGC
GCTGC
GCTGC
GCAGC
GCTGC
GCTGC
GCAGC
GCAGC
GCTGC
GCAGC
GCTGC
GCAGC
GCTGC
GCAGC
GCTGC
GCAGC
GCAGC
GCTGC
GCTGC
GCTGC
GCTGC
GCAGC

GCTGC
GCAGC
GCTGC
GCAGC
GCTGC
GCAGC
GCAGC
GCTGC
GCAGC
GCAGC
GCAGC
GCTGC
GCAGC
GCAGC
GCTGC
GCAGC
GCTGC
GCTGC
GCTGC
GCTGC
GCAGC
GCAGC
GCAGC
GCAGC
GCTGC
GCAGC
GCTGC
GCAGC
GCAGC
GCTGC
GCAGC
GCAGC
GCTGC
GCAGC
GCAGC
GCAGC
GCTGC
GCTGC
GCTGC
GCAGC
GCAGC
GCAGC
GCTGC
GCTGC
GCAGC
GCTGC
GCAGC
GCAGC
GCAGC
GCAGC
GCTGC
GCAGC
GCAGC
GCTGC
GCAGC
GCAGC
GCTGC
GCAGC
GCTGC
GCAGC
GCTGC
GCAGC
GCAGC
GCAGC
GCAGC
GCTGC
GCAGC
GCAGC
GCAGC
GCAGC
GCTGC
GCAGC
GCAGC
GCAGC
GCAGC
GCTGC
GCAGC
GCTGC
GCTGC
GCTGC
GCTGC
GCAGC
GCAGC
GCTGC
GCTGC
GCAGC
GCAGC
GCTGC
GCAGC
GCTGC
GCAGC
GCTGC
GCAGC
GCTGC
GCTGC
GCTGC
GCTGC
GCAGC
GCTGC
GCAGC
GCAGC
GCAGC
GCTGC
GCAGC
GCTGC
GCAGC
GCAGC
GCAGC
GCTGC
GCAGC
GCAGC
GCAGC
GCAGC
GCAGC
GCTGC
GCAGC
GCTGC
GCAGC
GCTGC
GCTGC
GCTGC
GCTGC
GCTGC
GCTGC
GCTGC
GCTGC
GCTGC
GCTGC
GCTGC
GCAGC
GCTGC
GCAGC
GCAGC
GCTGC
GCAGC
GCAGC
GCAGC
GCTGC
GCTGC
GCAGC
GCTGC
GCTGC
GCTGC

In [31]:
!grep -o "GC[AT]GC" unicorn_genome.fa | wc -l

7542


TseI sites are far more common (7542 matches versus 50 for SbfI), so TseI is the enzyme that will produce the most loci.
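
The bracket trick used for 'W' extends to the other IUPAC ambiguity codes. As a rough sketch, the hypothetical helper `site_to_regex` below (a name made up for this example) translates a recognition site into a regular expression, and then tries it on an invented toy sequence.

```python
import re

# Standard IUPAC nucleotide ambiguity codes mapped to regex character classes.
IUPAC = {
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def site_to_regex(site):
    """Translate a recognition site such as 'GCWGC' into a regex string.

    Hypothetical helper for illustration; plain bases pass through unchanged.
    """
    return "".join(IUPAC.get(base, base) for base in site)

pattern = site_to_regex("GCWGC")
print(pattern)  # GC[AT]GC

# Toy sequence, invented for illustration.
toy_hits = re.findall(pattern, "TTGCAGCAAGCTGCTT")
print(toy_hits)  # ['GCAGC', 'GCTGC']
```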

### Other useful regular expression patterns

`^` - pattern is at the beginning of the line

`$` - pattern is at the end of the line

`.` - can be used in place of any character
> "A.T" would match AAT, ACT, AGT, and ATT

`*` - matches zero or more instances of the previous character
> "A*T" would match T, AT, AAAAAAAAAAAAAAT (or any number of A's before the T)
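
These four metacharacters behave the same way in Python's `re` module for simple cases like these, so the patterns can be tried out quickly on short strings. The sequences below are invented for illustration.

```python
import re

# Toy strings, invented for illustration.
starts_with_atg = bool(re.search("^ATG", "ATGCCC"))  # ATG at the start of the string
not_at_start = bool(re.search("^ATG", "CCATGC"))     # ATG present, but not at the start
ends_with_taa = bool(re.search("TAA$", "CCCTAA"))    # TAA at the end of the string
any_middle = re.findall("A.T", "AATACTAGTATT")       # "." stands in for any one character
zero_or_more = re.findall("A*T", "TATAAAT")          # zero or more A's, then a T

print(starts_with_atg, not_at_start, ends_with_taa)  # True False True
print(any_middle)    # ['AAT', 'ACT', 'AGT', 'ATT']
print(zero_or_more)  # ['T', 'AT', 'AAAT']
```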


### Other useful grep flags

`-v` - print lines that DO NOT contain the pattern

`-c` - print a count of the number of lines that contain the pattern

`-x` - exactly match the entire line

`-m NUM` - stop searching after `NUM` matching lines

`-o` - print only the matched pattern (one match per line)

`-n` - print the line number where the match was found

See more flags by reading the manual page (`man grep`).
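
For readers who think more easily in Python, here is a rough analogue of a few of these flags, applied to a made-up list of lines. It is only a sketch of what each flag does, not how `grep` is implemented.

```python
import re

# Toy lines, invented for illustration.
lines = ["GATTACA", "CCTGCAGG", "TTCCTGCAGGAA", "ACGT"]
site = re.compile("CCTGCAGG")

# -c: count lines that contain the pattern
count = sum(1 for line in lines if site.search(line))

# -v: keep lines that do NOT contain the pattern
non_matching = [line for line in lines if not site.search(line)]

# -x: keep lines where the pattern matches the entire line
whole_line = [line for line in lines if site.fullmatch(line)]

# -n: pair each matching line with its 1-based line number
numbered = [(i, line) for i, line in enumerate(lines, start=1) if site.search(line)]

print(count)         # 2
print(non_matching)  # ['GATTACA', 'ACGT']
print(whole_line)    # ['CCTGCAGG']
print(numbered)      # [(2, 'CCTGCAGG'), (3, 'TTCCTGCAGGAA')]
```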





## Regular expressions and Python

What if we want to do the same thing as above, but with Python?
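
One possible approach, sketched here with the standard `re` module, is to loop over the lines and count matches ourselves. The function name `count_sites` and the toy FASTA-style lines are assumptions made for this example; it returns both the number of matching lines (like `grep | wc -l`) and the number of non-overlapping matches (like `grep -o | wc -l`).

```python
import re

def count_sites(lines, pattern):
    """Return (lines with a match, total non-overlapping matches).

    Hypothetical helper for illustration; `lines` is any iterable of strings.
    """
    regex = re.compile(pattern)
    n_lines = 0
    n_matches = 0
    for line in lines:
        hits = regex.findall(line)
        if hits:
            n_lines += 1
            n_matches += len(hits)
    return n_lines, n_matches

# Toy FASTA-style lines, invented for illustration.
toy = [">chr1", "AACCTGCAGGTTGCAGCAA", "GCTGCTTGCAGCCC", "ATATAT"]

sbfi = count_sites(toy, "CCTGCAGG")
tsei = count_sites(toy, "GC[AT]GC")
print(sbfi)  # (1, 1)
print(tsei)  # (2, 3)
```

To run it on the real file, pass an open file handle instead of the toy list, for example inside `with open("unicorn_genome.fa") as handle:`, since iterating over a file yields its lines one at a time.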