<h1><font color='blue'>Session 6 - Regular Expressions</font></h1>

# Google Drive Set Up

Run this to access files on Google Drive

In [1]:
import os
from google.colab import drive

drive.mount('/content/drive')
os.chdir('/content/drive/My Drive/Python Crash Course')

Mounted at /content/drive


In this tutorial, we will learn about very useful and pervasive text-processing tools called **regular expressions**, a.k.a. RE, re, regex, regexp, or regex patterns.  They are tools for specifying text-matching patterns.  The desired patterns can then be extracted quickly and automatically from possibly large amounts of text.

In [2]:
!cat 'regex/towels.txt'

The Hitch Hiker's Guide to the Galaxy has a few things to
say on the subject of towels.
A towel, it says, is about the most massively useful thing
an interstellar hitchhiker can have.  Partly it has great
practical value - you can wrap it around you for warmth as
you bound across the cold moons of Jaglan Beta; you can lie
on it on the brilliant marble-sanded beaches of Santraginus
V, inhaling the heady sea vapours; you can sleep under it
beneath the stars which shine so redly on the desert world
of Kakrafoon; use it to sail a mini raft down the slow heavy
river Moth; wet it for use in hand-to- hand-combat; wrap it
round your head to ward off noxious fumes or to avoid the
gaze of the Ravenous Bugblatter Beast of Traal (a
mindboggingly stupid animal, it assumes that if you can't
see it, it can't see you - daft as a bush, but very
ravenous); you can wave your towel in emergencies as a
distress signal, and of course dry yourself off with it if
it still seems to be clean enough.  More impor

Since the text concerns towels we might be interested in getting every line that contains the term "towel".

In [3]:
!grep "towel" regex/towels.txt

say on the subject of towels.
A towel, it says, is about the most massively useful thing
ravenous); you can wave your towel in emergencies as a
towel has immense psychological value.  For some reason, if
hiker has his towel with him, he will automatically assume
odds, win through, and still knows where his towel is is


Please note that this finds lines that contain the string "`towel`" as well as the string "`towels`", because `towels` contains `towel`.

A more interesting problem arises when we want to look up lines that say something about hitchhikers. A quick look at the text above shows us that Douglas Adams enjoyed his poetic liberty as far as the spelling of the word "hitchhiker" was concerned. We find the following versions:
Hitch Hiker
hitchhiker
hitch hiker
Hitchhiker

Can regexes help us to find all the different spellings? 

The escape character `\` tells grep to treat the following character differently. The pipe character `|` is a logical OR. Thus, the expression `\(H\|h\)` is interpreted as "`H` or `h`". The Kleene star, `*`, means that the preceding character occurs zero or more times. Placed after a space between `hitch` and `hiker`, it means that the two strings are either separated by one or more spaces or not separated at all.

Often it is possible to employ a number of different REs to the same end. In more advanced cases, choosing the appropriate RE can have a significant impact on performance. In many other cases, however, the time gained by executing the perfect RE won't make up for the time invested in crafting it.

# Section 1 - Basic Regex

Python's `re` module enables use of REs within Python programs.  This is useful for parsing strings.  The [standard library documentation](https://docs.python.org/3.5/library/re.html#match-objects) is quite good and is a useful resource.

Unlike most other code we write in Python, REs are compiled into a series of bytecodes that are executed by a matching engine written in C, which makes them fast.  A compiled regular expression has methods, such as `match`, that allow us to process strings with it.

## Compiling REs
We will investigate the syntax of the `re` module, as usual, through example.  In this case, we will use an RE to find instances of hitchhiker-like words in the text from Douglas Adams.

Here we use `re.compile` to make a regex object that acts as a sieve for text that matches the pattern provided to it.

In [4]:
import re

# Read file contents as a single string
with open('regex/towels.txt') as f:
    hh_string = f.read()
    
# Define the regex pattern
pattern = '.*[H|h]itch *[H|h]iker.*'
regex = re.compile(pattern)

# Get list: each item is string with line that has variant of hitchhiker in it
regex.findall(hh_string)

["The Hitch Hiker's Guide to the Galaxy has a few things to",
 'an interstellar hitchhiker can have.  Partly it has great',
 'a strag (strag: non-hitch hiker) discovers that a hitch',
 'Douglas Adams:  "The Hitchhiker\'s Guide to the Galaxy" ']

We get a list of lines in the `towels.txt` file that have a hitchhiker-like word in them.
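The reason whole lines come back is that `.` does not match a newline, so each `.*` stops at a line boundary. The sketch below illustrates this on a short inline string; `sample` is a made-up stand-in for `towels.txt`, and the set is written as `[Hh]`, which matches the same two letters.

```python
import re

# Hypothetical three-line sample standing in for towels.txt
sample = (
    "The Hitch Hiker's Guide\n"
    "say on the subject of towels.\n"
    "an interstellar hitchhiker can have."
)

# With .* on both sides, each match grows to fill its line but cannot cross "\n"
line_regex = re.compile(r'.*[Hh]itch *[Hh]iker.*')
matching_lines = line_regex.findall(sample)
print(matching_lines)

# Without the surrounding .*, only the hitchhiker-like word itself is returned
word_regex = re.compile(r'[Hh]itch *[Hh]iker')
matching_words = word_regex.findall(sample)
print(matching_words)
```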

In the following cell we find the square bracket in the text. It is necessary to use the escape character `\` because a square bracket has special meaning in regex, and we do not want to invoke that special meaning here.

In [5]:
# Define the regex pattern
pattern = r'\['  # raw string, so Python passes the backslash through to re unchanged
regex = re.compile(pattern)

# Get list of every literal '[' found in the string
regex.findall('Find the[ bracket.')

['[']

Let's look again at our `.*[H|h]itch *[H|h]iker.*` pattern.  The opening and closing `.*` mean that we do not care what comes before or after the hitchhiker-like expression in the line.  The "` *`" in the middle of the expression means the same as in the command line case: arbitrarily many spaces (including zero) may be between `hitch` and `hiker`.  

We use `[H|h]` to mean either upper or lowercase `H`.  Strictly speaking, a `|` inside square brackets is a literal pipe character rather than an OR, so `[H|h]` also matches `|`; `[Hh]` is the tighter way to write the same set.  This is in contrast to the regex we used with `grep`, which used parentheses.  In Python's `re` module, parentheses serve to form **groups**.  Let's see what happens if we use some parentheses.

In [6]:
# Define the regex pattern
pattern = '(.*((H|h)itch *(H|h)iker).*)'
regex = re.compile(pattern)

# Get list: each item is a tuple with one entry per group in the pattern
regex.findall(hh_string)

[("The Hitch Hiker's Guide to the Galaxy has a few things to",
  'Hitch Hiker',
  'H',
  'H'),
 ('an interstellar hitchhiker can have.  Partly it has great',
  'hitchhiker',
  'h',
  'h'),
 ('a strag (strag: non-hitch hiker) discovers that a hitch',
  'hitch hiker',
  'h',
  'h'),
 ('Douglas Adams:  "The Hitchhiker\'s Guide to the Galaxy" ',
  'Hitchhiker',
  'H',
  'h')]

The parentheses form a hierarchy of groups.  At the outermost level, we get the entire line that has a hitchhiker-like word.  At the next level, we get the actual hitchhiker-like word.  And at the innermost level, we get the individual `H` or `h` characters.
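Groups can also be read one at a time from a match object. The following is a minimal sketch on a single made-up line rather than the whole file; groups are numbered by the position of their opening parenthesis, and `(?:...)` is the non-capturing form for cases where the parentheses are only needed for the OR.

```python
import re

regex = re.compile(r'(.*((H|h)itch *(H|h)iker).*)')
match = regex.search("The Hitch Hiker's Guide to the Galaxy")

whole_line = match.group(1)   # outermost parentheses: the entire line
word = match.group(2)         # the hitchhiker-like word
first_h = match.group(3)      # first (H|h)
second_h = match.group(4)     # second (H|h)
all_groups = match.groups()   # all four groups as a tuple
print(all_groups)

# Non-capturing groups keep the OR but add nothing to the result,
# so findall returns plain strings again
simple = re.compile(r'(?:H|h)itch *(?:H|h)iker')
words = simple.findall("Hitch Hiker, hitchhiker")
print(words)
```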

## Flags in REs
We can also compile REs with **flags**.  These are given as a second argument for the `re.compile` function and specify variants on how the RE compilation is to be done.  For example, we could have used a flag to make our RE even simpler.

In [7]:
# Define the regex pattern
pattern = '.*hitch *hiker.*'
regex = re.compile(pattern, re.IGNORECASE)

# Get list: each item is string with line that has variant of hitchhiker in it
regex.findall(hh_string)

["The Hitch Hiker's Guide to the Galaxy has a few things to",
 'an interstellar hitchhiker can have.  Partly it has great',
 'a strag (strag: non-hitch hiker) discovers that a hitch',
 'Douglas Adams:  "The Hitchhiker\'s Guide to the Galaxy" ']

The `re.IGNORECASE` flag lets us drop the `[H|h]` from the RE, because it tells `re.compile` to treat lowercase and uppercase characters the same.  Let's have a look at the available flags:

|   flag   | Description   |
|----------|---------------|
| `re.DEBUG` | Displays debugging information about compiled expression |
| `re.IGNORECASE` | Case insensitive matching |
| `re.MULTILINE` | `^` and `$` also match the beginning and end of a line respectively.|
| `re.DOTALL` | Allows `.` to match any character, including the newline character.|
| `re.VERBOSE` | Allows the usage of comments; everything from a `#` to the end of the line is ignored. This flag also ignores non-escaped whitespace (i.e., whitespace without the preceding `\`) outside of character sets. This improves the readability of REs.|

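To combine flags, separate them with a vertical bar (the bitwise OR operator, which we did not cover in our discussion of operators).  Because `re.VERBOSE` ignores unescaped whitespace, the space in the pattern has to be escaped when that flag is used.  E.g.,

    my_regex_query = re.compile(r"hitch\ *hiker", re.IGNORECASE | re.VERBOSE)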
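As a rough illustration of what the flags change, the sketch below applies a few of them to a short made-up string. It also shows the whitespace pitfall of `re.VERBOSE`: an unescaped space in the pattern is dropped, so a literal space is written here as the set `[ ]`.

```python
import re

text = "first line\nHitch Hiker\nlast line"

# re.MULTILINE: ^ matches at the start of every line, not only the string
line_starts = re.findall(r'^\w+', text, re.MULTILINE)
string_start = re.findall(r'^\w+', text)

# re.DOTALL: . also matches "\n", so .* can span several lines
dotall = re.search(r'first.*last', text, re.DOTALL)
no_dotall = re.search(r'first.*last', text)

# re.VERBOSE: comments and layout whitespace inside the pattern are ignored
verbose_regex = re.compile(r"""
    hitch   # first half of the word
    [ ]*    # zero or more spaces, written as a set so verbose mode keeps them
    hiker   # second half of the word
""", re.IGNORECASE | re.VERBOSE)
verbose_match = verbose_regex.search(text)

# Pitfall: here the bare space is discarded, leaving the pattern hitch*hiker
pitfall = re.search("hitch *hiker", "Hitch Hiker", re.IGNORECASE | re.VERBOSE)

print(line_starts, string_start, verbose_match.group(), pitfall)
```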

## Searching, Matching, Splitting, and more
Once we are happy with our compiled RE we can deploy it in a number of ways.  Compilation gives us a pattern object (`re.Pattern` in Python 3.7 and later, `SRE_Pattern` in older versions) that has methods for searching strings.  In the table below, we will assume the compiled object is called `regex`.  Flags are set when the RE is compiled, so these methods do not take a `flags` argument.

|   action   | Description   |
|---------------------|---------------|
| `regex.search(string)` | Scans through the string and returns a match object for the first location where the RE matches, or `None` if there is no match. |
| `regex.match(string)` | Returns a match object if zero or more characters at the beginning of the string match the RE, otherwise `None`. |
| `regex.fullmatch(string)` | Returns a match object if the whole string matches the RE, otherwise `None`. New in Python 3.4. |
| `regex.findall(string)` | Returns a list of all non-overlapping matches in the string. If the RE contains one group, the list holds the text of that group; if it contains several groups, each entry in the list is a tuple with one item per group.|
| `regex.finditer(string)` | Same as `regex.findall()`, except that it returns an iterator that yields match objects instead of a list of strings.|
| `regex.split(string, maxsplit=0)` | Splits the string into a list at each occurrence of the pattern. If `maxsplit` is nonzero, at most `maxsplit` splits are made.|
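The methods in the table can be compared side by side on one short string. This is a small sketch with a made-up sentence, not part of the `towels.txt` workflow.

```python
import re

regex = re.compile(r'towels?')
text = "A towel, two towels, no napkins"

first = regex.search(text)           # first match anywhere in the string
at_start = regex.match(text)         # None, because the text starts with "A"
whole = regex.fullmatch("towels")    # the entire string must match
all_matches = regex.findall(text)    # list of matched strings

# finditer yields match objects, which also carry positions
spans = [m.span() for m in regex.finditer(text)]

# split cuts the string at every occurrence of the pattern
pieces = re.compile(r',\s*').split(text)

print(first.group(), at_start, whole.group())
print(all_matches, spans, pieces)
```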

# Section 2 - Metacharacters

## Metacharacters in Python REs

In the command line example using `grep`, we got acquainted with a few metacharacters that help us improve our pattern matching. Let's have a look at all RE metacharacters that can be used in Python:

|     Metacharacter   |   Description   |
|:------------------:|-----------------|
| `.`                   | (dot) The ultimate wildcard. It matches any character other than the newline character (`\n`). If it is desirable to also match `\n`, the alternative mode (`re.DOTALL`) can be invoked.
| `^`                   | (caret) Matches the start of the string. In `re.MULTILINE` mode it also matches the position immediately after each newline character.
| `$`                   | Similar to the caret, but matches the end of the string, or the position just before a newline at the end of the string. In `re.MULTILINE` mode it also matches just before each newline character.
| `*`                   | The Kleene star `*` following an RE allows zero or more repetitions of that expression.  `ab*c` will match `ac`, `abc`, `abbc`, `abbbc`, ...
| `+`                   | Similar to the Kleene star, but it matches one or more occurrences of the preceding RE; thus `ab+c` matches `abc`, `abbc`, `abbbc`, but not `ac`.
| `?`                   | Matches 0 or 1 repetitions of the preceding RE. `ab?` matches `a` and `ab`.
| `{m}`                   | Matches exactly `m` repeats. `a{3}` matches `aaa`.
| `{m,n}`                  | Matches `m` to `n` repeats; `a{2,4}` matches `aa`, `aaa`, and `aaaa`. The lower and upper bounds are optional: `a{,4}` is the same as `a{0,4}`, and omitting the upper bound, as in `a{4,}`, matches four or more repetitions of `a`.
| `[]`                   | Square brackets describe a set of characters, e.g. `[atcg]` matches `a`, `t`, `c`, or `g`, and `[a-z]` matches any lowercase ASCII letter. `[0-2][0-9]` matches all two-digit numbers from 00 to 29. Most metacharacters lose their special function within sets. Thus, `[(a*b+)]` matches `(`, `a`, `*`, `b`, `+`, and `)`.
| `\`                   | The escape character `\` makes sure that the following metacharacter is interpreted literally. The Kleene star (`*`), for example, is interpreted as a simple asterisk if prefaced by the escape character (`\*`).
| &#124;                  | Logical OR: matches either the expression to its left or the expression to its right.
| `(...)`                   | Matches whatever regular expression is inside the parentheses.  As we discussed, these serve to describe groupings.
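To make the repetition metacharacters concrete, the sketch below checks a few patterns from the table against a fixed list of candidate strings. The helper `full_matches` is introduced only for this illustration.

```python
import re

candidates = ['ac', 'abc', 'abbc', 'abbbc']

def full_matches(pattern):
    """Return the candidates that the pattern matches from start to end."""
    regex = re.compile(pattern)
    return [s for s in candidates if regex.fullmatch(s)]

star = full_matches(r'ab*c')              # zero or more b
plus = full_matches(r'ab+c')              # one or more b
optional = full_matches(r'ab?c')          # zero or one b
exactly_two = full_matches(r'ab{2}c')     # exactly two b
two_to_three = full_matches(r'ab{2,3}c')  # two or three b

# Sets: two-digit numbers from 00 to 29
two_digit = re.findall(r'[0-2][0-9]', '07 19 31 24')

# Inside a set, ( * + ) are ordinary characters
in_set = re.findall(r'[(a*b+)]', 'a*b+c')

print(star, plus, optional, exactly_two, two_to_three, two_digit, in_set)
```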

### Greedy vs non-greedy
`+`, `*`, and `?` match as much text as possible. This behavior is referred to as greedy. Adding a `?` after these quantifiers renders them non-greedy, so they match as little text as possible. The match still starts at the leftmost possible position. For example, applying the RE `(K.*F)` to this amino acid sequence:
`MKKSLVFAFFAFFLSL`
yields:
`KKSLVFAFFAFF`
whereas `(K.*?F)` would yield:
`KKSLVF`
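The difference can be checked directly in Python. In this short sketch both searches start at the first `K` in the sequence; only the point where the match ends changes.

```python
import re

seq = 'MKKSLVFAFFAFFLSL'

# Greedy: .* runs to the last F that still lets the pattern match
greedy = re.search(r'K.*F', seq).group()

# Non-greedy: .*? stops at the first F after the leftmost K
lazy = re.search(r'K.*?F', seq).group()

print(greedy)
print(lazy)
```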

### More escaping
Let's have a look at how a word is defined before having a look at more escape options.  A word is a sequence of Unicode alphanumeric or underscore characters.  Examples are: 

    Hello_world
    P4ssw0rd

A regex returns exactly the text that its pattern matches, which may be a whole line, a word, or only part of a word.  The escape sequences below let us narrow down what is matched, for example by requiring a word boundary or a particular class of characters.

|   Escape sequence   | Description   |
|--------------|---------------|
| `\number` | Matches the same text that was matched by the group with that number. For example, applying `(.+) \1` to the string `Homo sapiens sapiens` matches `sapiens sapiens`|
|`\A` | Matches the start of a string|
|`\b` | Matches the empty string at the beginning or end of a word. Thus using `towel\b` in the example above would only yield lines with the word towel and not towels. |
|`\B` | Opposite of `\b`. Thus `towel\B` matches `towel` only when another word character follows, as in towels.|
|`\d` | Matches any Unicode decimal digit.|
|`\D` | Opposite of `\d` (Are you seeing a pattern?)|
|`\s` | Matches whitespace characters |
|`\S` | any guess? |
|`\w` | Matches Unicode word characters|
|`\W` | your turn again: |
|`\Z` | Matches only the end of the string|
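A few of these escapes in action, as a minimal sketch on made-up strings. Since `towel\b` and `towel\B` both return the text `towel`, the example records match positions to show which occurrence each one finds.

```python
import re

text = "a towel, two towels"

# \b: "towel" must end at a word boundary, so only the first occurrence fits
whole_word = [m.span() for m in re.finditer(r'towel\b', text)]

# \B: "towel" must be followed by another word character, as in "towels"
inside_word = [m.span() for m in re.finditer(r'towel\B', text)]

digits = re.findall(r'\d+', 'Room 42, floor 7')      # runs of digits
chunks = re.findall(r'\S+', 'wet  weather gear')     # runs of non-whitespace

# \1 repeats whatever group 1 matched
repeated = re.search(r'(.+) \1', 'Homo sapiens sapiens').group()

print(whole_word, inside_word, digits, chunks, repeated)
```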


Now let's work through a few examples. For convenience, the contents of `towels.txt` are provided here:

```
The Hitch Hiker's Guide to the Galaxy has a few things to
say on the subject of towels.
A towel, it says, is about the most massively useful thing
an interstellar hitchhiker can have.  Partly it has great
practical value - you can wrap it around you for warmth as
you bound across the cold moons of Jaglan Beta; you can lie
on it on the brilliant marble-sanded beaches of Santraginus
V, inhaling the heady sea vapours; you can sleep under it
beneath the stars which shine so redly on the desert world
of Kakrafoon; use it to sail a mini raft down the slow heavy
river Moth; wet it for use in hand-to- hand-combat; wrap it
round your head to ward off noxious fumes or to avoid the
gaze of the Ravenous Bugblatter Beast of Traal (a
mindboggingly stupid animal, it assumes that if you can't
see it, it can't see you - daft as a bush, but very
ravenous); you can wave your towel in emergencies as a
distress signal, and of course dry yourself off with it if
it still seems to be clean enough.  More importantly, a
towel has immense psychological value.  For some reason, if
a strag (strag: non-hitch hiker) discovers that a hitch
hiker has his towel with him, he will automatically assume
that he is also in possession of a toothbrush, face flannel,
soap, tin of biscuits, flask, compass, map, ball of string,
gnat spray, wet weather gear, space suit etc., etc.
Furthermore, the strag will then happily lend the hitch
hiker any of these or a dozen other items that the hitch
hiker might accidentally have "lost".  What the strag will
think is that any man who can hitch the length and breadth
of the galaxy, rough it, slum it, struggle against terrible
odds, win through, and still knows where his towel is is
clearly a man to be reckoned with.  

Douglas Adams:  "The Hitchhiker's Guide to the Galaxy" 

```

## Example 1 - All hyphenated words

We want to match patterns where there are one or more word characters `\w` on either side of a hyphen `-`. 

The solution is as follows

In [8]:
# Define the regex pattern
pattern = r'\w+-\w+'  # raw string, so \w reaches re unchanged
regex = re.compile(pattern)

# Get list: each item is a hyphenated word
regex.findall(hh_string)

['marble-sanded', 'hand-to', 'hand-combat', 'non-hitch']

## Example 2 - All Capital Cased words

We want to match patterns where a capital letter `[A-Z]` is followed by one or more lowercase letters `[a-z]`.

The solution is as follows:

In [9]:
# Define the regex pattern
pattern = r'[A-Z][a-z]+'
regex = re.compile(pattern)

# Get list: each item is a word that starts with a capital letter
regex.findall(hh_string)

['The',
 'Hitch',
 'Hiker',
 'Guide',
 'Galaxy',
 'Partly',
 'Jaglan',
 'Beta',
 'Santraginus',
 'Kakrafoon',
 'Moth',
 'Ravenous',
 'Bugblatter',
 'Beast',
 'Traal',
 'More',
 'For',
 'Furthermore',
 'What',
 'Douglas',
 'Adams',
 'The',
 'Hitchhiker',
 'Guide',
 'Galaxy']

## Example 3 - All words between quotations

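We want to capture everything between a pair of double quotes.  The group `([^"]+)` matches one or more characters that are not a double quote, and the `"` on either side anchors it to the quotation marks.

In [11]:
# Define the regex pattern
pattern = r'"([^"]+)"'
regex = re.compile(pattern)

# Get list: each item is the text found between a pair of double quotes
regex.findall(hh_string)

['lost', "The Hitchhiker's Guide to the Galaxy"]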

# Section 3 - Lookahead Expressions

## The very powerful question mark

With the help of `?` we can expand the functionality of parentheses.  The general syntax is `(?...)`.  We will not get into these here, but the table below gives a summary, and more detail can be found in the [`re` package documentation](https://docs.python.org/3.5/library/re.html).

|   `(?...)`   |   Description   |
|------------|-----------------|
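| `(?aiLmsux)` | Inline flags: one or more letters from `a`, `i`, `L`, `m`, `s`, `u`, `x` set the corresponding flag (e.g. `(?i)` for `re.IGNORECASE`); the group itself matches the empty string |
| `(?:...)` | Non-capturing version of the regular parentheses|
|`(?P<name>...)`| The matched string is accessible by the symbolic group name *name*.|
| `(?P=name)`| Matches the same text that was matched by the earlier group `(?P<name>...)`|
| `(?#...)` | A comment, contents are ignored|
| `(?=...)` | A positive lookahead assertion: matches if `...` matches next, without consuming any of the string|
| `(?!...)` | A negative lookahead assertion: opposite of `(?=...)`|
| `(?<=...)`| A positive lookbehind assertion; for example, applying the RE `(?<=[HKRED])[A-Z]` to an amino acid sequence yields residues with a preceding charged residue|
| `(?<!...)`| Opposite of `(?<=...)` (another pattern?)|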
|`(?(id/name)yes-pattern`&#124;`no-pattern)`| Matching with yes-pattern if group given with id or name exists, with no-pattern if it doesn't. The latter is optional. |
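Although we do not go into these constructs in depth, a brief sketch may help fix the idea. It reuses the amino acid sequence from the greedy example and the `Homo sapiens sapiens` string from the escape table; lookahead and lookbehind check the neighbouring text without including it in the match.

```python
import re

seq = 'MKKSLVFAFFAFFLSL'

# Lookbehind: residues whose preceding residue is charged (H, K, R, E or D)
after_charged = re.findall(r'(?<=[HKRED])[A-Z]', seq)

# Lookahead: residues that are immediately followed by F
before_f = re.findall(r'[A-Z](?=F)', seq)

# Named group and named backreference: a word that is repeated after a space
m = re.search(r'(?P<word>\w+) (?P=word)', 'Homo sapiens sapiens')
repeated_word = m.group('word')

print(after_charged, before_f, repeated_word)
```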

# Section 4 - Example with Tweets

We will combine regex with `pandas` to show you how useful it can be. We will cover `pandas` in a later tutorial.

In [None]:
import pandas as pd
tweets = pd.read_csv('regex/exp_tweet.csv',encoding = "ISO-8859-1",skiprows=1)

tweets.head(15)

Unnamed: 0,user_id,# of followers,text
0,857402298749681700\t,943,RT @mtvasia: Think @BTS_twt should win the #Bi...
1,4844811867\t,N.A.,RT @youngstars710: Yixing: Director! What to d...
2,3180831223\t,340,RT @Myung_Q_: 0903930979 9149930979 3239930979...
3,1058328517815996400\t,236,--
4,255924545\t,,@ConradChenXZ But how can Zaryn do this to her...
5,na,650,RT @luvly_junwelry: 23282229282329232328 23212...
6,1902970316\t,,RT @Twittblaster: Life is like a journey into;...
7,29917270\t,1023,.
8,3289470552\t,239,RT @sensible_k: 181012 #2985 https://t.co/8WI0...
9,4567039463\t,909,RT @arilshah9618: Nak tahu quote paling deep y...


In [None]:
tweets['text']

0      RT @mtvasia: Think @BTS_twt should win the #Bi...
1      RT @youngstars710: Yixing: Director! What to d...
2      RT @Myung_Q_: 0903930979 9149930979 3239930979...
3                                                     --
4      @ConradChenXZ But how can Zaryn do this to her...
                             ...                        
995           how can a question lead to an argument??? 
996    RT @contradichen: @cutekjdpics you just did Go...
997    RT @MsLeaSalonga: To everyone that posted #LS4...
998    @nikzulhilmi @asyrafjot Nik sombong takjumpak ...
999                          Hujan buat sis lapar balik 
Name: text, Length: 1000, dtype: object

Let's say you want to extract all the Twitter handles in the `text` column. All Twitter handles begin with an `@` followed by letters, digits, or underscores, with no spaces in between. We can easily write a regular expression for this.

In [None]:
# \w already covers letters, digits, and underscores, and @ needs no escape
pattern = r'@\w+'
regex = re.compile(pattern)
result = tweets['text'].map(lambda x: regex.findall(x))
result

0               [@mtvasia, @BTS_twt]
1                   [@youngstars710]
2                        [@Myung_Q_]
3                                 []
4                    [@ConradChenXZ]
                   ...              
995                               []
996    [@contradichen, @cutekjdpics]
997                  [@MsLeaSalonga]
998       [@nikzulhilmi, @asyrafjot]
999                               []
Name: text, Length: 1000, dtype: object
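One caveat with a bare `@` pattern is that it also fires inside email addresses. The sketch below, on made-up tweets that are not taken from `exp_tweet.csv`, compares the plain pattern with a variant that uses a negative lookbehind to require that no word character precedes the `@`. It uses plain lists so that it runs without `pandas`.

```python
import re

# Hypothetical sample tweets for illustration only
sample_tweets = [
    "RT @alice_01: thanks @bob!",
    "mail me at carol@example.com",
    "no handles here",
]

simple = re.compile(r'@\w+')
strict = re.compile(r'(?<!\w)@\w+')  # @ must not follow a word character

simple_result = [simple.findall(t) for t in sample_tweets]
strict_result = [strict.findall(t) for t in sample_tweets]

print(simple_result)
print(strict_result)
```

Whether the stricter form is worth it depends on the data; for tweets that rarely contain email addresses the plain pattern may be enough.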

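The same approach extracts URLs: `https?` matches `http` or `https`, and `\S*` takes every following non-whitespace character.

In [None]:
pattern = r'https?://\S*'
regex = re.compile(pattern)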
result = tweets['text'].map(lambda x: regex.findall(x))
result

0          [https://t.co/LAliRÿ]
1                             []
2      [https://t.co/TdJDiK9Wna]
3                             []
4                             []
                 ...            
995                           []
996                           []
997                           []
998                           []
999                           []
Name: text, Length: 1000, dtype: object

Finally, `\d+` extracts every run of digits.

In [None]:
pattern = r'\d+'
regex = re.compile(pattern)
result = tweets['text'].map(lambda x: regex.findall(x))
result

0                                                     []
1                                                  [710]
2      [0903930979, 9149930979, 3239930979, 342123859...
3                                                     []
4                                                    [5]
                             ...                        
995                                                   []
996                                                   []
997                                                 [40]
998                                                 [06]
999                                                   []
Name: text, Length: 1000, dtype: object