<img src="images/MIK.png" style="width:375px;height:200px;">

## <center> MIK - Python for beginners: Regular Expressions</center>
### <center>by Ivaldo Tributino and Marcos Machado</center>

## Introduction

Searching and extracting information from text is so common that Python has a very powerful library for `regular expressions`, the `re` module, that handles many of these tasks quite elegantly.

<img src="images/regular_ex.png" style="width:525px;height:300px;">

A `regular expression` is a sequence of characters that forms a search pattern. The characters above can be used to check if a string contains a specified search pattern.

Regular expressions are almost their own little programming language for searching and parsing strings. In this notebook, we will only cover the very basics of regular expressions. For more details, see:
https://docs.python.org/3/library/re.html

**Note:**
The regular expression library `re` must be imported into your program before you can use it. The simplest use of the regular expression library is the `search()` function. The following program demonstrates a trivial use of the search function.

In [1]:
import re  # Importing re module

fhand = open('attendance.txt')
for line in fhand:
    line = line.rstrip()
    if re.search('Ivaldo',line):  # Try Ivaldo and if line.find('Ivaldo')!=-1:
        print(line)

Ivaldo Tributino de Sousa Joined 7/27/2021, 1:28:33 PM
Tributino Ivaldo de Sousa Joined 7/27/2021, 1:28:33 PM


In [2]:
fhand = open('attendance.txt')
for line in fhand:
    line = line.rstrip()
    if line.find('Ivaldo')!= -1:  
        print(line)

Ivaldo Tributino de Sousa Joined 7/27/2021, 1:28:33 PM
Tributino Ivaldo de Sousa Joined 7/27/2021, 1:28:33 PM
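
As a minimal sketch of what `re.search()` gives back, the example below uses two sample lines copied from the output above in place of `attendance.txt` (the in-memory list `lines` and the list `results` are assumptions made so the snippet runs on its own). A successful search returns a match object, which counts as true in an `if`; a failed search returns `None`.

```python
import re

# Sample lines copied from the attendance output shown above.
lines = [
    'Ivaldo Tributino de Sousa Joined 7/27/2021, 1:28:33 PM',
    'Roberto Moura Joined 7/27/2021, 1:28:33 PM',
]

results = []
for line in lines:
    match = re.search('Ivaldo', line)
    if match:
        # span() gives the (start, end) positions of the matched text
        results.append((match.group(), match.span()))
    else:
        results.append(None)

print(results)
```

The match object is what makes `re.search()` more informative than `line.find()`: it tells us not only whether the pattern was found but also exactly which text matched.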


The power of `regular expressions` can be seen when we add special characters to the search string that allow us to control which lines match the string. Adding these special characters to our regular expression allows us to do `sophisticated matching` and extraction while writing very little code.

For example, the caret character (`^`) is used in regular expressions to match “the beginning” of a line. We could change our program to only match lines where “Ivaldo” was at the beginning of the line as follows:

In [3]:
fhand = open('attendance.txt')
for line in fhand:
    line = line.rstrip()
    if re.search('^Ivaldo', line):  # Try ^Ivaldo and compare with: if line.startswith('Ivaldo'):
        print(line)

Ivaldo Tributino de Sousa Joined 7/27/2021, 1:28:33 PM


Now we will only match lines that start with the string “Ivaldo”. This is still a very simple example that we could have done equivalently with the `startswith()` string method. But it serves to introduce the notion that regular expressions contain special action characters that give us more control as to what will match the regular expression.

In [4]:
fhand = open('attendance.txt')
for line in fhand:
    line = line.rstrip()
    if line.startswith('Ivaldo'): 
        print(line)

Ivaldo Tributino de Sousa Joined 7/27/2021, 1:28:33 PM
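
The effect of the caret can be checked side by side on a small sample. This is only a sketch: the two lines are copied from the output above, and the variable names (`anywhere`, `at_start`, `with_method`) are chosen here for illustration.

```python
import re

# Two sample lines copied from the attendance output shown above.
lines = [
    'Ivaldo Tributino de Sousa Joined 7/27/2021, 1:28:33 PM',
    'Tributino Ivaldo de Sousa Joined 7/27/2021, 1:28:33 PM',
]

# Without the caret, "Ivaldo" may appear anywhere in the line.
anywhere = [line for line in lines if re.search('Ivaldo', line)]

# With the caret, "Ivaldo" must be at the very beginning of the line.
at_start = [line for line in lines if re.search('^Ivaldo', line)]

# The string method gives the same answer for this simple case.
with_method = [line for line in lines if line.startswith('Ivaldo')]

print(len(anywhere), len(at_start), at_start == with_method)
```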


## Character matching in regular expressions

There are a number of other special characters that let us build even more powerful regular expressions. The most commonly used special character is the `period` or `full stop`, which matches any single character (except, by default, a newline).

For example, the regular expression `F..m:` would match any of the strings `“From:”`, `“Fxxm:”`, `“F12m:”`, or `“F!@m:”` since the period characters in the regular expression match any character.
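
A quick way to check this claim is to try the pattern against each string. In the sketch below, the last two candidates (`'Fm:'` and `'Form'`) are extra strings added here for contrast; they are not part of the original example.

```python
import re

candidates = ['From:', 'Fxxm:', 'F12m:', 'F!@m:', 'Fm:', 'Form']

# Map each candidate to True/False depending on whether F..m: is found in it.
matches = {text: bool(re.search('F..m:', text)) for text in candidates}

for text, found in matches.items():
    print(text, found)
```

Each period must consume exactly one character, so `'Fm:'` fails (nothing between `F` and `m`) and `'Form'` fails (no colon after the `m`).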

In [5]:
fhand = open('attendance.txt')
for line in fhand:
    line = line.rstrip()
    if re.search('J..n', line): # Try Joined, Join and J..n
        print(line)

Ivaldo Tributino de Sousa Joined 7/27/2021, 1:28:33 PM
Tributino Ivaldo de Sousa Joined 7/27/2021, 1:28:33 PM
Roberto Moura Joined 7/27/2021, 1:28:33 PM
Karine Oliveira Joined 7/27/2021, 1:28:33 PM
Sophia Oliveira Joined 7/27/2021, 1:28:33 PM
Lidiany Tributino Joined 7/27/2021, 1:28:33 PM
Jose Valdir Joined 7/27/2021, 1:28:33 PM
Leidiane Mercedes Joined 7/27/2021, 1:28:33 PM
Francisco de Assis Joined 7/27/2021, 1:28:54 PM
Olivia Smith Joined 7/27/2021, 1:28:58 PM
Emma Brown Join 7/27/2021, 1:30:46 PM
Emma Brown Joined 7/27/2021, 3:10:10 PM
Emma Brown Joined 7/27/2021, 3:20:02 PM
Amelia Tremblay Joined 7/27/2021, 1:31:03 PM
Amelia Tremblay Joined 7/27/2021, 2:06:12 PM
Aria Martin Joined 7/27/2021, 1:32:26 PM


Let's say you don't know how to spell 'Ivaldo'. However, you know that it starts with `"I"` and ends with `"o"`. How can we find the name if we don't know how many characters there are between `"I"` and `"o"`?

## <center>I____o</center>

In [6]:
fhand = open('attendance.txt')
for line in fhand:
    line = line.rstrip()
    if re.search(r'I\S+o', line):  # try I..o and I.+o or I.*o
        print(line)

Ivaldo Tributino de Sousa Joined 7/27/2021, 1:28:33 PM
Tributino Ivaldo de Sousa Joined 7/27/2021, 1:28:33 PM


It is good to think of the `plus` and `asterisk` characters as “pushy”. For example, the first string below matches all the way to its last `o`, because the `.*` pushes outwards as far as it can:

In [7]:
line1 = 'I am not an impresario'  # impresario: a person who produces entertainment, especially the director of an opera company
line2 = 'Ivaldo'
lines = [line1, line2]
for line in lines:
    if re.search('I.*o', line): 
        print(line)

I am not an impresario
Ivaldo
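
To see how far the pattern actually “pushes”, we can print the matched text instead of the whole line. This is a small sketch using the `group()` method of the match object; the second sample string is taken from the attendance output earlier in the notebook.

```python
import re

lines = ['I am not an impresario', 'Ivaldo Tributino de Sousa']

# group() returns the exact text that the pattern matched.
matched = [re.search('I.*o', line).group() for line in lines]

for text in matched:
    print(repr(text))
```

In the second string the match runs past the name and stops only at the `o` in “Sousa”, the last `o` in the line.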


It is possible to tell an asterisk or plus sign not to be so `“greedy”` (that is, not to match as many characters as possible) by adding a question mark after it. However, in this example it is better to use the `non-whitespace character` `\S`.

In [8]:
for line in lines:
    if re.search(r'I\S*o', line):  # try 'I.*o', 'I\S+o' and 'I\S+'
        print(line)

Ivaldo
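
The three options can be compared directly. The sketch below assumes the standard non-greedy form `*?` (an asterisk followed by a question mark) and again uses `group()` to show the matched text; the variable names are illustrative.

```python
import re

line = 'I am not an impresario'

greedy = re.search('I.*o', line).group()    # pushes to the last "o"
lazy = re.search('I.*?o', line).group()     # stops at the first "o"
no_spaces = re.search(r'I\S*o', line)       # cannot cross a space

print(repr(greedy))
print(repr(lazy))
print(no_spaces)

# On a line that really contains the name, \S* stays inside one word.
name = re.search(r'I\S*o', 'Ivaldo Tributino de Sousa').group()
print(repr(name))
```

Note that the non-greedy version still matches the first line (it finds `'I am no'`), which is why `\S` is the better choice when the goal is to find a single word.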


## Extracting data using regular expressions

If we want to extract data from a string in Python we can use the `findall()` function to extract all of the `substrings` which match a regular expression. Let’s use the example of wanting to extract the numbers that make up a time.

In [9]:
s = 'I joined the meeting at 01:32:24 and left the meeting at 02:56:54'
lst = re.findall('[0-9][0-9]', s) # try [0-9], [0-9][0-9], [0-9]+
print(lst)

['01', '32', '24', '02', '56', '54']


In [10]:
lst = list(map(int, lst))
print(lst)
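h = (lst[3]-lst[0])*60   # difference in hours, converted to minutes
m = (lst[4]-lst[1])      # difference in minutes
s = (lst[5]-lst[2])/60   # difference in seconds, converted to minutes
t = int(h+m+s)      # round(h+m+s, 2) is better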
print('I spent %g minutes in the meeting' %t)        

[1, 32, 24, 2, 56, 54]
I spent 84 minutes in the meeting
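
As an alternative sketch, we could let the regular expression pick out each complete time and leave the arithmetic to the standard `datetime` module. This is not the approach used in the cell above, just one possible variation; it assumes both times fall on the same day.

```python
import re
from datetime import datetime

s = 'I joined the meeting at 01:32:24 and left the meeting at 02:56:54'

# Match two digits, a colon, two digits, a colon, two digits.
times = re.findall('[0-9][0-9]:[0-9][0-9]:[0-9][0-9]', s)
print(times)

# Convert each matched string into a datetime and subtract them.
start, end = (datetime.strptime(t, '%H:%M:%S') for t in times)
minutes = (end - start).total_seconds() / 60
print('I spent %g minutes in the meeting' % minutes)
```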


Another example from the book Python for Everybody:

In [11]:
s = 'A message from csev@umich.edu to cwen@iupui.edu about meeting @2PM'
lst = re.findall(r'\S+@\S+', s)  # try '\s@\S+', '@+\S'
print(lst)

['csev@umich.edu', 'cwen@iupui.edu']


Translating the regular expression, we are looking for substrings that have at least one non-whitespace character, followed by an `@`, followed by at least one more non-whitespace character. The `\S+` matches as many non-whitespace characters as possible.

The regular expression would match twice (`csev@umich.edu` and `cwen@iupui.edu`), but it would not match the string “@2PM” because there are no non-blank characters before the at-sign. We can use this regular expression in a program to read all the lines in a file and print out anything that looks like an email address.
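
A minimal sketch of such a program is shown here. To keep it self-contained it loops over an in-memory list instead of a file (the list `lines` and its third entry are assumptions for illustration); with a real file, `lines` would be replaced by the file handle.

```python
import re

# Stand-in for the lines of a file such as mbox-short.txt.
lines = [
    'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008',
    'A message from csev@umich.edu to cwen@iupui.edu about meeting @2PM',
    'This line has no address at all',
]

found = []
for line in lines:
    line = line.rstrip()
    addresses = re.findall(r'\S+@\S+', line)
    if len(addresses) > 0:   # skip lines with nothing that looks like an email
        print(addresses)
        found.extend(addresses)
```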

## Combining searching and extracting

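Let's translate the regular expression in the code below: we are looking for lines that start with `X`, followed by zero or more non-whitespace characters (`\S*`), followed by a colon, a space, and then one or more characters that are either a digit or a period (`[0-9.]+`).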

In [12]:
fhand = open('mbox-short.txt') 
for line in fhand:
    line = line.rstrip()
    if re.search(r'^X\S*: [0-9.]+', line):
        print(line)    

X-DSPAM-Confidence: 0.8475
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6178
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6961
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7565
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7626
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7556
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7002
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7615
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7601
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7605
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6959
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7606
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7559
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7605
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6932
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7558
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6526
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6948
X-DSPAM-Probability: 0.0000
X-DSPAM-Co

**Note:**

`Parentheses` are another special character in regular expressions. When you add `parentheses` to a regular expression, they are ignored when matching the string. But when you are using `findall()`, parentheses indicate that while you want the whole expression to match, you are only interested in `extracting` a portion of the substring that matches the regular expression.

In [13]:
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    x = re.findall(r'^X\S*: ([0-9.]+)', line)  # compare with '^X\S*: [0-9.]+' (no parentheses)
    if len(x) > 0:
        print(x)

['0.8475']
['0.0000']
['0.6178']
['0.0000']
['0.6961']
['0.0000']
['0.7565']
['0.0000']
['0.7626']
['0.0000']
['0.7556']
['0.0000']
['0.7002']
['0.0000']
['0.7615']
['0.0000']
['0.7601']
['0.0000']
['0.7605']
['0.0000']
['0.6959']
['0.0000']
['0.7606']
['0.0000']
['0.7559']
['0.0000']
['0.7605']
['0.0000']
['0.6932']
['0.0000']
['0.7558']
['0.0000']
['0.6526']
['0.0000']
['0.6948']
['0.0000']
['0.6528']
['0.0000']
['0.7002']
['0.0000']
['0.7554']
['0.0000']
['0.6956']
['0.0000']
['0.6959']
['0.0000']
['0.7556']
['0.0000']
['0.9846']
['0.0000']
['0.8509']
['0.0000']
['0.9907']
['0.0000']
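
The difference the parentheses make is easiest to see on a single line. The sketch below uses two lines copied from the output above plus one made-up line (`'X-Plane is behind schedule: two weeks'`) that starts with `X` but has no number, and then converts the extracted strings to floats.

```python
import re

lines = [
    'X-DSPAM-Confidence: 0.8475',
    'X-DSPAM-Probability: 0.0000',
    'X-Plane is behind schedule: two weeks',   # illustrative non-matching line
]

# Without parentheses findall() returns the whole matching substring.
whole = re.findall(r'^X\S*: [0-9.]+', lines[0])

# With parentheses it returns only the part inside them.
part = re.findall(r'^X\S*: ([0-9.]+)', lines[0])
print(whole, part)

# Collect every extracted number as a float.
values = []
for line in lines:
    for number in re.findall(r'^X\S*: ([0-9.]+)', line):
        values.append(float(number))
print(values)
```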


Now, in the following example, we are interested in the time of day of each mail message. We look for lines of the form:

```
From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008
```

We can find these lines with the following regular expression:

```
^From .* [0-9][0-9]:
```

The translation of this regular expression is that we are looking for lines that start with `From ` (note the space), followed by any number of characters `(.*)`, followed by a space, followed by two digits `[0-9][0-9]`, followed by a colon character. This is the definition of the types of lines we are looking for. In order to pull out only the hour using `findall()`, we add parentheses around the two digits as follows:
```
^From .* ([0-9][0-9]):
```

In [14]:
fhand = open('mbox-short.txt')
for line in fhand:
    line = line.rstrip()
    x = re.findall(r'^From .* ([0-9][0-9]):', line)
    if len(x) > 0: 
        print(x)

['09']
['18']
['16']
['15']
['15']
['14']
['11']
['11']
['11']
['11']
['11']
['11']
['10']
['10']
['10']
['09']
['07']
['06']
['04']
['04']
['04']
['19']
['17']
['17']
['16']
['16']
['16']
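
Here is the same extraction on the sample line quoted above, as a self-contained sketch. The second line (a `From:` header with no time) is an assumed example added to show a line that does not match.

```python
import re

lines = [
    'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008',
    'From: stephen.marquard@uct.ac.za',   # assumed header line, has no time
]

hours = []
for line in lines:
    x = re.findall(r'^From .* ([0-9][0-9]):', line)
    if len(x) > 0:
        hours.append(int(x[0]))   # '09' becomes the integer 9

print(hours)
```

Only `09` is extracted: the minutes (`14`) are also two digits followed by a colon, but they are preceded by a colon rather than a space, so they do not fit the pattern.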


## Escape character


Since we use special characters in regular expressions to match the beginning or end of a line or specify wild cards, we need a way to indicate that these characters are “normal” and we want to match the actual character such as a dollar sign, caret or parentheses.

We can indicate that we want to simply match a character by prefixing that character with a `backslash`. For example, we can find the smiley `:)` in a comment by escaping its closing parenthesis, as in the following regular expression.

In [15]:
comment = '''The view from my window overlooked the wall of another building and the location 
        was not convenient. But other than that everything was perfect, the whole 
        staff was very attentive :).'''
print(comment)

The view from my window overlooked the wall of another building and the location 
        was not convenient. But other than that everything was perfect, the whole 
        staff was very attentive :).


In [16]:
x = re.findall(r':\)', comment)  # the backslash makes ")" a literal character
x

[':)']
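
The same idea applies to the dollar sign, which normally means “end of the line”. The sketch below uses an illustrative sentence (not taken from the data files in this notebook) to compare the escaped and unescaped patterns.

```python
import re

text = 'We just received $10.00 for cookies.'

# Escaped: "\$" matches a real dollar sign, followed by digits or periods.
escaped = re.findall(r'\$[0-9.]+', text)

# Unescaped: "$" means end of line, so nothing can follow it and nothing matches.
unescaped = re.findall('$[0-9.]+', text)

print(escaped)
print(unescaped)
```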

<img src="images/sent_analysis.png" style="width:350px;height:175px;">

<center>Image from monkeylearn.com</center>

If you want to know a little more about `Sentiment Analysis`, see: https://monkeylearn.com/sentiment-analysis/

References:

- Python for Everybody: Exploring Data Using Python 3, by Dr. Charles R. Severance
- https://medium.com/@chongye225/python-regular-expression-2ac91e084662

<img align="right" src="images/logo.png" style="width:50px;height:50px;">