In [1]:
# run this cell to play back an audio file, type Esc-o to hide player
from IPython.display import Audio
Audio("media/rgx-intro.mp3")

Regular expressions (or REGEX) are compact ways of summarising a text pattern. The need to handle text patterns is very common indeed: for instance typing
```
ls *.py
```
in a Linux shell will list all files that end in ```.py```. The character ```*``` is known as a *wildcard*. The above is known as *glob* syntax, and is not technically a REGEX; rather, regular expressions are a major refinement of this concept.

Regular expressions help with extracting information from text (e.g. BLAST output or FASTA files) by locating particular patterns. For instance, in a [FASTA](https://en.wikipedia.org/wiki/FASTA_format) file, the accession number may come between a ">" and a "|": such a pattern can be easily described by a regular expression.
```
>P04637|P53_HUMAN Cellular tumor antigen p53 - Homo sapiens (Human).
```
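
As a preview of what is coming, here is a minimal sketch of how such a pattern might be written in Python. It assumes the header layout shown above (accession number between the ">" and the first "|"); other FASTA sources format their headers differently, and the variable names are chosen only for illustration.

```python
import re

header = ">P04637|P53_HUMAN Cellular tumor antigen p53 - Homo sapiens (Human)."

# ^>       the ">" that opens a FASTA header, at the start of the line
# ([^|]+)  one or more characters that are not "|"; the parentheses capture them
# [|]      a literal "|"
mo = re.search("^>([^|]+)[|]", header)

# group(1) is the text captured by the first pair of parentheses
accession = mo.group(1) if mo else None
print(accession)
```

With the header above this prints `P04637`.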

Also, databases such as [PROSITE](http://prosite.expasy.org/) list *patterns* that identify particular families of proteins or domains; as we will see, from a computational point of view, these are really regular expressions in disguise.

Regular expression syntax in Python is very similar to Perl syntax, so migrating between the two languages should not be difficult.

## The ```re``` Module

In Python, REGEX support is provided in the ```re``` module. Simple usage is indeed straightforward: 

In [2]:
import re

# mo is a "match object"
mo=re.search("hello", "Hello world, hello Python! hello o")
print (mo.group())
print (mo.span())

hello
(13, 18)


This is not too different from the ```.index()``` method of a string:


In [3]:
print ("Hello world, hello Python!".index("hello"))

13


But it is a lot more flexible:

In [4]:
re.findall("[Hh][ea]llo", "Hallo world, hello Python!")

['Hallo', 'hello']

Here the square brackets define a set of alternatives for a single character: ```[Hh]``` matches either ```H``` or ```h```.
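
Square brackets also accept ranges and negation. The following is a small sketch of both; the sample text is made up for the purpose.

```python
import re

text = "Hallo world, hello Python 3!"

# a range inside brackets: any single uppercase letter
capitals = re.findall("[A-Z]", text)

# a leading ^ negates the set: any single character that is
# neither a letter nor a space
others = re.findall("[^A-Za-z ]", text)

print(capitals)
print(others)
```

This prints `['H', 'P']` followed by `[',', '3', '!']`.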

If a match is not found, the search returns None:

In [5]:
mo=re.search("hello", "Hi world!")
print (mo)

None
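
Because of this, it is usually wise to test the result before calling methods on it. Here is a minimal sketch; the helper name `first_match` is hypothetical and not part of the ```re``` module.

```python
import re

def first_match(pattern, text):
    """Return the first matching substring, or None if there is no match."""
    mo = re.search(pattern, text)
    if mo is None:
        # no match: calling mo.group() here would raise an AttributeError
        return None
    return mo.group()

print(first_match("hello", "Hello world, hello Python!"))
print(first_match("hello", "Hi world!"))
```

The first call prints `hello`, the second prints `None`.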


In [6]:
# run this cell to show a video, use slider to resize it, type Esc-o to hide it
from IPython.display import Video, clear_output; from ipywidgets import interactive, IntSlider
def _play(resize): display(Video(filename="media/rgx-hello.webm",data="",width=resize))
interactive(_play, resize=IntSlider(min=150, max=900, step=50, value=600, continuous_update=False, readout=False))


## Performing matches

We have already seen ```.search()```, which finds the first match only, and ```.findall()```.
The ```re``` module offers four matching functions:


| Function/Method   | Use                                                                                |
|-------------------|------------------------------------------------------------------------------------|
| match()           | Determine if the RE matches at the beginning of the string.                        |
| search()          | Scan through a string, looking for any location where the RE matches.              |
| findall()         | Find all substrings where the RE matches, and return them as a list.               |
| finditer()        | Find all substrings where the RE matches; returns match objects as an iterator(*). |

(*) an iterator works very much like a list in that for instance you can loop over it, but its items are computed on the fly as they are needed, so it is more memory-efficient. 
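
To compare the four functions side by side, here is a short sketch that runs each of them on the same string used earlier; the variable names are only illustrative.

```python
import re

text = "Hello world, hello Python! hello o"

at_start = re.match("hello", text)       # None: the string starts with a capital H
first = re.search("hello", text)         # match object for the first occurrence
all_strings = re.findall("hello", text)  # plain list of the matching substrings
# finditer yields one match object per hit; here we keep only the spans
all_spans = [mo.span() for mo in re.finditer("hello", text)]

print(at_start)
print(first.span())
print(all_strings)
print(all_spans)
```

The output is `None`, `(13, 18)`, `['hello', 'hello']` and `[(13, 18), (27, 32)]`.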


In [7]:
# run this cell to show a video, use slider to resize it, type Esc-o to hide it
from IPython.display import Video, clear_output; from ipywidgets import interactive, IntSlider
def _play(resize): display(Video(filename="media/rgx-operators.webm",data="",width=resize))
interactive(_play, resize=IntSlider(min=150, max=900, step=50, value=600, continuous_update=False, readout=False))


## Compiling a pattern

For reasons of efficiency, if a pattern is going to be used repeatedly, it is best to compile it. This is done as follows:

In [9]:
rgx=re.compile("[Hh][ea]llo")
rgx.findall("Hallo world, hello Python!")

['Hallo', 'hello']

The same search functions listed above are available as methods of the *compiled expression* object ```rgx```.
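
A typical situation in which compiling pays off is a loop over many strings. The sketch below applies one compiled expression to a small, made-up list of lines.

```python
import re

rgx = re.compile("[Hh][ea]llo")
lines = ["Hallo world", "goodbye", "well, hello there"]

# the compiled object is reused for every line,
# without stating the pattern again each time
hits = [line for line in lines if rgx.search(line)]
print(hits)
```

This prints `['Hallo world', 'well, hello there']`.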

In [10]:
# run this cell to play back an audio file, type Esc-o to hide player
from IPython.display import Audio
Audio("media/rgx-compiling.mp3")

## Beware of the backslash

Regular expressions are a powerful tool, though a bit tedious to learn. Besides matching very complex patterns, they can be used to split a string wherever a pattern matches and to perform substitutions. I invite you to have a look at the official [howto](https://docs.python.org/3/howto/regex.html) to get a feeling for what can be done.
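
As a taste of splitting, here is a minimal sketch using ```re.split```. The record and its mix of separators are invented for illustration.

```python
import re

record = "P04637 ; P53_HUMAN,Homo sapiens;  9606"

# split wherever a comma or a semicolon occurs,
# swallowing any spaces immediately around it
fields = re.split(" *[;,] *", record)
print(fields)
```

This prints `['P04637', 'P53_HUMAN', 'Homo sapiens', '9606']`, something the ```.split()``` method of a string cannot do in a single call because the separators differ.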

As you will see, REGEX syntax makes heavy use of backslashes. This is a problem in Python, because inside an ordinary string a backslash is interpreted as an *escape character*. That is, certain combinations of a backslash and a standard character are translated to a single special character (for example ```\n``` becomes a newline), according to this [table](http://www.python-ds.com/python-3-escape-sequences).

In [11]:
print("escape\nsequence")

escape
sequence


The solution is to use the Python "raw string" syntax by prepending an "r" (for "raw") to the string in question. This saves the backslash from being crunched as an escape sequence:

In [12]:
print(r"escape\nsequence")

escape\nsequence
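
The same issue affects patterns directly. The sketch below uses ```\b```, which in a REGEX means "word boundary" but in an ordinary Python string is the backspace character; the sample text is made up.

```python
import re

# without the r prefix, Python turns \b into a single backspace character
# before the re module ever sees the pattern
print(len("\bcat\b"), len(r"\bcat\b"))

text = "cat concat cat"
plain = re.findall("\bcat\b", text)   # looks for backspace + "cat" + backspace
raw = re.findall(r"\bcat\b", text)    # \b reaches re intact: a word boundary

print(plain)
print(raw)
```

This prints `5 7`, then `[]`, then `['cat', 'cat']`: only the raw version finds the two standalone words.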


To be on the safe side, you may want to put an "r" before all of the regular expressions you write. Example:

In [16]:
solomon="""
    Solomon Grundy,
    Born on a Mon_day,
    Christened on Tuesday,
    Married on Wednesday,
    Took ill on Thursday,
    Grew worse on Friday,
    Died on Saturday,
    Buried on Sunday.
    That was the end of,
    Solomon Grundy."""

# \w+ matches one or more word characters (letters, digits or underscore)
# a? makes the "a" optional, so both "...dy" and "...day" are matched
rgx=re.compile(r"\w+da?y")
rgx.findall(solomon)

['Grundy',
 'Mon_day',
 'Tuesday',
 'Wednesday',
 'Thursday',
 'Friday',
 'Saturday',
 'Sunday',
 'Grundy']
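
In the pattern above, ```+``` and ```?``` are *quantifiers*: they say how many times the preceding item may occur, which is why "Grundy" matches as well as "Tuesday". The following sketch compares the most common quantifiers on a made-up string; ```\b``` marks a word boundary, so only whole words are matched.

```python
import re

words = "dy day daay dry Monday"

# ?    zero or one of the preceding item
# +    one or more
# *    zero or more
# {2}  exactly two
optional = re.findall(r"\bda?y\b", words)
plus = re.findall(r"\bda+y\b", words)
star = re.findall(r"\bda*y\b", words)
exactly_two = re.findall(r"\bda{2}y\b", words)

print(optional)
print(plus)
print(star)
print(exactly_two)
```

The four lines of output are `['dy', 'day']`, `['day', 'daay']`, `['dy', 'day', 'daay']` and `['daay']`.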

If you would like more information about your matches, the ```finditer``` method may be a better option, since it returns the individual match objects for you to process.

In [17]:
# you have to loop over an iterator to process its values
for mo in rgx.finditer(solomon):
    print (mo.group(), mo.span())

Grundy (13, 19)
Mon_day (35, 42)
Tuesday (62, 69)
Wednesday (86, 95)
Thursday (113, 121)
Friday (141, 147)
Saturday (161, 169)
Sunday (185, 191)
Grundy (230, 236)


In [18]:
# run this cell to show a video, use slider to resize it, type Esc-o to hide it
from IPython.display import Video, clear_output; from ipywidgets import interactive, IntSlider
def _play(resize): display(Video(filename="media/rgx-quantifiers.webm",data="",width=resize))
interactive(_play, resize=IntSlider(min=150, max=900, step=50, value=600, continuous_update=False, readout=False))


## Text substitution

There are times when you may want to edit text automatically: for instance, you may want to remove all *http* links from a text you have scraped, remove [stop-words](https://en.wikipedia.org/wiki/Stop_word) from a document in preparation for some natural language processing, re-format telephone numbers or hide credit card numbers. The ```re``` module supports this through the ```re.sub``` function, which you can think of as a powerful programmatic *Find-and-Replace* tool. The documentation is [here](https://docs.python.org/3/library/re.html#re.sub), and usage is straightforward:

In [19]:
re.sub(pattern="[Hh][ea]llo", repl="Bye", string="Hallo world, hello Python!")

'Bye world, Bye Python!'

The ```sub``` function is very flexible. You may be a bit disappointed that "Hallo" is uppercase, "hello" is lowercase, but it seems that we have to choose whether we want an uppercase "Bye" or a lowercase "bye". Of course we could use two separate expressions, but isn't there a way to match the case in one go? It turns out there is - we can pass a function as the ```repl``` argument, in which case that function is passed the match object and can use it to compute the appropriate replacement. In our case:

In [20]:
def matching_case_bye(mo):
    greeting=mo.group() 
    if greeting[0]=='H':
        return "Bye"
    else:
        return "bye"

In [21]:
# matching_case_bye is called once for each match
re.sub(pattern="[Hh][ea]llo", repl=matching_case_bye, string="Hallo world, hello Python!")

'Bye world, bye Python!'

This gives you a lot of flexibility. For instance, you might need to update all hyperlinks in a website to reflect the new structure of the site: just code a REGEX that matches hyperlinks and a function that maps the old URLs to the new URLs (maybe simply using a dictionary), and hey presto.
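
A minimal sketch of this idea follows. The URLs, the dictionary `NEW_URL` and the function `update_link` are all hypothetical, and the pattern is deliberately simple; real hyperlinks would need a more careful expression.

```python
import re

# hypothetical mapping from old paths to new ones
NEW_URL = {
    "/old/about.html": "/about/",
    "/old/staff.html": "/people/",
}

def update_link(mo):
    """Return the new URL for a matched old one; leave unknown URLs untouched."""
    old = mo.group()
    return NEW_URL.get(old, old)

page = ('See <a href="/old/about.html">about</a> and '
        '<a href="/old/staff.html">staff</a>, or '
        '<a href="/old/misc.html">misc</a>.')

# "/old/" followed by word characters and a literal ".html"
updated = re.sub(r"/old/\w+\.html", update_link, page)
print(updated)
```

The first two links are rewritten to `/about/` and `/people/`, while `/old/misc.html` is left as it is because the dictionary has no entry for it.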

## Matching PROSITE patterns


In [None]:
# run this cell to play back an audio file, type Esc-o to hide player
from IPython.display import Audio
Audio("media/rgx-patterns.mp3")

The [Thioredoxin](https://en.wikipedia.org/wiki/Thioredoxin) pattern listed on PROSITE under accession number [PS00194](http://prosite.expasy.org/PS00194) is the following:
```
[LIVMF]-[LIVMSTA]-x-[LIVMFYC]-[FYWSTHE]-x(2)-[FYWGTN]-C-[GATPLVE]-
[PHYWSTA]-C-{I}-x-{A}-x(3)-[LIVMFYWT].
```
Though the [syntax](https://prosite.expasy.org/prosuser.html#conv_pa) is different, this is really a regular expression, and we can easily translate it to a Python REGEX:
```
r'[LIVMF][LIVMSTA]\w[LIVMFYC][FYWSTHE]\w{2}[FYWGTN]C[GATPLVE][PHYWSTA]C[^I]\w[^A]\w{3}[LIVMFYWT]'
```
where ```\w``` matches any letter, digit or underscore (and therefore any amino acid code), ```\w{3}``` matches exactly three such characters and for example ```[^I]``` will match any character except an ```I```. The following code scans the chicken proteome for matches and prints out the header, which includes the accession number, of each protein that matches.

NOTE: the chicken proteome can be retrieved from the list of Uniprot reference proteomes for [Eukaryotes](https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Eukaryota/);
*Gallus gallus* is entry ```U000000539_*.fasta.gz```, where ```*``` stands for the current revision number.
Download the file, unzip it and rename it ```CHICK.fasta``` for convenience. I have included here a file named ```CHICK.fasta``` with a more or less outdated revision; I encourage you to fetch the current data. You may also download data for other organisms of interest to you (the key to the file names is in the [README](https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/README) file in the parent directory on the Uniprot server). If you are running on Binder or on the QM JupyterHub instance, you may need to download the file locally to your machine and then upload it.
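
The translation from PROSITE syntax to a Python REGEX shown earlier can also be automated. The function below is a simplified sketch, not a complete converter: the name `prosite_to_regex` is hypothetical, and it handles only the elements that appear in PS00194 (sets, exclusions, ```x``` and repetition counts).

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE pattern into Python REGEX syntax (simplified sketch)."""
    pattern = "".join(pattern.split())         # drop spaces and line breaks
    pattern = pattern.rstrip(".")              # drop the final period
    rgx = pattern.replace("-", "")             # elements are simply concatenated
    rgx = rgx.replace("x", r"\w")              # x stands for any residue
    rgx = re.sub(r"\{([A-Z]+)\}", r"[^\1]", rgx)       # {I} becomes [^I]
    rgx = re.sub(r"\((\d+(?:,\d+)?)\)", r"{\1}", rgx)  # (2) becomes {2}
    return rgx

PS00194_PROSITE = ("[LIVMF]-[LIVMSTA]-x-[LIVMFYC]-[FYWSTHE]-x(2)-[FYWGTN]-C-[GATPLVE]-"
                   "[PHYWSTA]-C-{I}-x-{A}-x(3)-[LIVMFYWT].")

# the translation written by hand, for comparison
by_hand = (r'[LIVMF][LIVMSTA]\w[LIVMFYC][FYWSTHE]\w{2}[FYWGTN]'
           r'C[GATPLVE][PHYWSTA]C[^I]\w[^A]\w{3}[LIVMFYWT]')

translated = prosite_to_regex(PS00194_PROSITE)
print(translated == by_hand)
```

This prints `True`, meaning that the automatic translation agrees with the one written by hand.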

#### Warning! Real data - handle with care

```CHICK.fasta``` contains around 10 MB of data (almost 16,000 proteins, filling about 170K lines of text). This is by no means big data, but it is too large to browse comfortably in an editor. The following code prints the first few lines of it:

In [None]:
# we could loop over the file, count the lines and break once we reach 12, we choose to
# loop over the numbers and fetch a line from the file each time instead
with open("CHICK.fasta", "rt") as FILE: # the file is closed automatically on exit
    for i in range(12):
        line=next(FILE) # a file is an iterator; next yields the next item
        print(line.rstrip())
print("...continues...")

In [None]:
# run this cell to show a video, use slider to resize it, type Esc-o to hide it
from IPython.display import Video, clear_output; from ipywidgets import interactive, IntSlider
def _play(resize): display(Video(filename="media/rgx-chick.webm",data="",width=resize))
interactive(_play, resize=IntSlider(min=150, max=900, step=50, value=600, continuous_update=False, readout=False))

In [None]:
""" Trawl chicken proteome and find all proteins that match
PROSITE pattern PS00194 (THIOREDOXIN_1) """

import re

# Compile the regexp
PS00194=(r'[LIVMF][LIVMSTA]\w[LIVMFYC][FYWSTHE]\w{2}[FYWGTN]'+
    r'C[GATPLVE][PHYWSTA]C[^I]\w[^A]\w{3}[LIVMFYWT]')
rgx=re.compile(PS00194)

seq="" # build sequence here
header="" # name of protein

# the with statement closes the file automatically when the block ends
with open("CHICK.fasta", "r") as INFILE:
    for line in INFILE:
        if line[0]==">": # current line is a header
            # search protein we just read and print header
            # if pattern is found
            if rgx.search(seq) is not None:
                print(header)
            # update header and reset sequence
            header=line.rstrip()
            seq=""
        else:  # this line contains part of the sequence
            seq+=line.rstrip() # remove trailing newline

# process the last protein
if rgx.search(seq) is not None:
    print(header)
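
The same logic can be packaged so that reading the file and searching are kept apart. The following is a sketch under stated assumptions: the generator `read_fasta` is a hypothetical helper, the two records are invented and held in memory instead of ```CHICK.fasta```, and the pattern is a toy one.

```python
import io
import re

def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file, one protein at a time."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip()
        if line.startswith(">"):
            if header is not None:          # hand over the protein just completed
                yield header, "".join(chunks)
            header, chunks = line, []
        elif line:
            chunks.append(line)
    if header is not None:                  # do not forget the last protein
        yield header, "".join(chunks)

# two invented records; io.StringIO behaves like an open text file
sample = io.StringIO(">sp|TEST1|made up\nAAAC\nGPCA\n>sp|TEST2|made up\nMKV\n")
rgx = re.compile(r"C\w{2}C")                # toy pattern: C, any two residues, C

hits = [header for header, seq in read_fasta(sample) if rgx.search(seq)]
print(hits)
```

This prints `['>sp|TEST1|made up']`. Note that the match in the first record spans two lines of the file, which is why the sequence is joined into a single string before searching.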


In [None]:
# run this cell to show a video, use slider to resize it, type Esc-o to hide it
from IPython.display import Video, clear_output; from ipywidgets import interactive, IntSlider
def _play(resize): display(Video(filename="media/rgx-trawling.webm",data="",width=resize))
interactive(_play, resize=IntSlider(min=150, max=900, step=50, value=600, continuous_update=False, readout=False))

**(C) 2014,2020 Fabrizio Smeraldi** ([f.smeraldi@qmul.ac.uk](mailto:f.smeraldi@qmul.ac.uk) - [web](http://www.eecs.qmul.ac.uk/~fabri/)), all rights reserved. In: "Computer Programming", School of Electronic Engineering and Computer Science, Queen Mary University of London.