# Tutorial 2A: Regular Expressions

In the lecture, you learnt how to use regular expressions. Here is a quick review of some regular expression syntax:


* <font color="red">[0-9]</font> Matches a single digit.
* <font color="red">[a-z0-9]</font> Matches a single character that must be a lower-case letter or a digit.
* <font color="red">[A-Za-z]</font> Matches a single character that must be an upper- or lower-case letter.
* <font color="red">\d</font> Matches any decimal digit; equivalent to the set [0-9].
* <font color="red">\D</font> Matches any character that is not a digit; equivalent to [^0-9] or [^\d].
* <font color="red">\w</font> Matches any alphanumeric character or the underscore; for ASCII text this is equivalent to [a-zA-Z0-9_].
* <font color="red">\W</font> Matches any character that \w does not match; for ASCII text this is equivalent to [^a-zA-Z0-9_] or [^\w].
* <font color="red">\s</font> Matches any whitespace character; equivalent to [ \t\n\r\f\v], where \t indicates tabs, \n line feeds, \r carriage returns, \f form feeds and \v vertical tabs.
* <font color="red">\S</font> Matches any non-whitespace character; equivalent to [^ \t\n\r\f\v].
* <font color="red">^</font> Matches the start of the string (or of each line in multiline mode).
* <font color="red">$</font> Matches the end of the string (or of each line in multiline mode).
* <font color="red">.</font> Matches any character except a newline (a wildcard).
* <font color="red">*</font> Matches when the preceding character or group occurs zero or more times.
* <font color="red">?</font> Matches when the preceding character or group occurs zero or one time.
* <font color="red">+</font> Matches when the preceding character or group occurs one or more times.
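
As a quick illustration of the classes and anchors listed above, the following sketch applies a few of them to short strings. The sample strings are made up for this example and are not part of the tutorial data.

```python
import re

# \d matches single digits; [a-z0-9] matches lower-case letters or digits
digits = re.findall(r'\d', 'Room 42B')                 # ['4', '2']
lower_or_digit = re.findall(r'[a-z0-9]', 'Room 42B')   # ['o', 'o', 'm', '4', '2']

# \w+ groups runs of word characters (letters, digits and the underscore)
words = re.findall(r'\w+', 'snake_case id-7')          # ['snake_case', 'id', '7']

# \s matches whitespace such as spaces and tabs
spaces = re.findall(r'\s', 'a b\tc')                   # [' ', '\t']

# ^ anchors the match at the start of the string, $ at the end
first = re.findall(r'^\w+', 'first second')            # ['first']
last = re.findall(r'\w+$', 'first second')             # ['second']
```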

More information can be found here:
https://docs.python.org/2/library/re.html
* * *

In [None]:
import sys
print (sys.version_info)

Libraries needed are:

In [None]:
import re # library for regular expression
import pandas as pd
pd.__version__

## 1. Backslash

**First, what is '\'?**

The backslash '\', also called the escape character, is used either to indicate special forms (such as \d) or to allow special characters to be used without invoking their special meaning.
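
A minimal sketch of the second use, assuming a made-up sample string: an unescaped '.' is a wildcard, while '\.' matches only a literal full stop. The call to `re.escape` at the end is an optional extra that adds the backslashes automatically.

```python
import re

sample = 'a.b'  # hypothetical sample string

# Unescaped: '.' is a wildcard and matches every character
wildcard = re.findall(r'.', sample)   # ['a', '.', 'b']

# Escaped: '\.' matches only the literal dot
literal = re.findall(r'\.', sample)   # ['.']

# re.escape() escapes the special characters of a string for you,
# so the result matches that string literally
escaped = re.escape('1+1=2')
matches_literally = re.fullmatch(escaped, '1+1=2') is not None
```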



**How about r""? When should you use it?**

r"" is Python's raw string literal prefix, which has nothing to do with regular expressions themselves. With r"" or r'', Python does not process backslash escape sequences; in other words, it treats the contents as a raw string. For example, r"\t" represents
a two-character string containing '\' and 't', whereas "\t" represents a tab.
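
The difference can be checked directly, without any regular expression involved. This is a small sketch; the variable names are chosen only for illustration.

```python
plain = "\t"   # one character: a tab
raw = r"\t"    # two characters: a backslash followed by 't'

plain_length = len(plain)   # 1
raw_length = len(raw)       # 2

# A raw string is the same as an ordinary string with the backslash doubled
same_as_doubled = (raw == "\\t")   # True
```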

Sometimes a plain string and a raw string can be used interchangeably:

In [None]:
str1 = re.findall('\t', "Please find \t")
print (str1)

str2 = re.findall(r'\t', "Please find \t")
print (str2)

Sometimes not!

In [None]:
str1=re.match(r"\W(.)\1\W", " ff ")
print (str1)

str2=re.match("\W(.)\1\W", " ff ")
print (str2)

str3=re.match("\\W(.)\\1\\W", " ff ")
print (str3)

"\W(.)\1\W" doesn't match. What is the difference?

In [None]:
str4="\W(.)\1\W"
print (str4)
str4

In [None]:
str4=r"\W(.)\1\W"
print (str4)
str4

Now you might be able to guess what "\W(.)\1\W" will match.

In [None]:
str2=re.match("\W(.)\1\W", " f\x01 ")
print (str2)

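It matches a non-word character + any one character + "\x01" + a non-word character, because in an ordinary (non-raw) string Python turns "\1" into the single character "\x01" before the pattern reaches the regular expression engine.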

**Conclusion: always validate your regular expression first, then test it with Python.**

What is \*? <br>
\* is a quantifier, like ? and \+ <br>
\* matches zero or more repetitions <br>
? matches zero or one repetition <br>
\+ matches one or more repetitions <br>
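
To see the three quantifiers side by side, the following sketch applies each of them to the letter 'b' after an 'a'. The sample string is made up for this example.

```python
import re

sample = 'a ab abb'  # hypothetical sample string

# 'b*' allows zero or more 'b' after the 'a'
star = re.findall(r'ab*', sample)       # ['a', 'ab', 'abb']

# 'b?' allows at most one 'b' after the 'a'
optional = re.findall(r'ab?', sample)   # ['a', 'ab', 'ab']

# 'b+' requires at least one 'b' after the 'a'
plus = re.findall(r'ab+', sample)       # ['ab', 'abb']
```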

In [None]:
str1 = re.findall(r'.*', 'Please find all.')
print (str1)

In [None]:
str1 = re.findall(r'.?', 'Please find all.')
print (str1)

In [None]:
str1 = re.findall(r'.+', 'Please find all.')
print (str1)

In [None]:
str1 = re.findall(r'l+', 'Please find all')
print (str1)

## 2. Answer to the homework in the reading material "Introduction to Regular Expressions"

Refine the regular expression for dates to distinguish months with 29, 30 and 31 days.

Note: assume every year is a leap year, so that every February has 29 days.

In [None]:
def date(pattern, m):
    if re.match(pattern, m):
        print (m + " is a date")
    else:
        print (m + " is NOT a date")

In [None]:
regex = r'''(?x)
    (?:
    # February (29 days every year)
      ([12][0-9]|0?[1-9])[/-](0?2)
    # 30-day months
      |(30|[12][0-9]|0?[1-9])[/-](0?[469]|11)
    # 31-day months
      |(3[01]|[12][0-9]|0?[1-9])[/-](0?[13578]|1[02])
    ) 
    # Year
    [/-]((?:[0-9]{2})?[0-9]{2})
'''

In [None]:
#date(r"((31[/-](0?[13578]|1[02]))|(30[/-](0?[469]|11))|(28[/-]02))[/-]((?:\d{2})?\d{2})", "28/02/2019")
date(regex, "28/02/2019")
date(regex, "31/04/2019")
date(regex, "29/05/2019")
date(regex, "31/06/2019")
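
Note that `re.match` only anchors the pattern at the start of the string, so extra characters after the year are silently accepted. A minimal sketch of a stricter check, assuming `re.fullmatch` (Python 3.4 or later) is available; the helper name `is_date` is hypothetical, and the pattern is the same as above, passed with `re.VERBOSE` instead of the inline `(?x)` flag:

```python
import re

DATE_PATTERN = re.compile(r'''
    (?:
    # February (29 days every year)
      ([12][0-9]|0?[1-9])[/-](0?2)
    # 30-day months
      |(30|[12][0-9]|0?[1-9])[/-](0?[469]|11)
    # 31-day months
      |(3[01]|[12][0-9]|0?[1-9])[/-](0?[13578]|1[02])
    )
    # Year
    [/-]((?:[0-9]{2})?[0-9]{2})
''', re.VERBOSE)

def is_date(candidate):
    """Return True only if the whole string is a date (hypothetical helper)."""
    return DATE_PATTERN.fullmatch(candidate) is not None

print(is_date("28/02/2019"))    # True
print(is_date("31/04/2019"))    # False: April has 30 days
print(is_date("28/02/20199"))   # False: trailing digit after the year

# re.match accepts the same string because it ignores what follows the match
print(DATE_PATTERN.match("28/02/20199") is not None)   # True
```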

What is this regular expression?

![](figure1.jpg) 

## 3. Extract IPs, dates, and email addresses with regular expressions

For the following tasks, we will use the mailbox data ([mbox-short.txt](http://www.pythonlearn.com/code3/mbox-short.txt)) provided by the book [Python for Informatics: Exploring Information](http://www.pythonlearn.com/book.php#python-for-informatics).

In [None]:
with open('mbox-short.txt','r') as infile:
    text = infile.read()

### 3.1 Find IP addresses 

In this task we need to:
1. find all IP addresses in the mbox-short dataset;
2. print the unique IP addresses.

Let's have a try first: 

In [None]:
str1 = re.findall(r'\b(?:\d{1,3}\.){3}\d{1,3}\b', "This is a IP address 111.23.39.99")
str1

![](figure2.png)

From https://regexper.com/

In [None]:
str1 = re.findall(r'(?:[0-9]{1,3}\.){3}[0-9]{1,3}', text)
if len(str1) > 0:
    print(str1)

By running the code above, we are able to print all IP addresses. 

Next, can we collect all unique IP addresses? We have already read the whole text file into 'text', so we apply the re.findall function to it. The set() function then returns the unique values.

In [None]:
str1=re.findall(r'(?:[0-9]{1,3}\.){3}[0-9]{1,3}', text)
set(str1)
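
The pattern above accepts any group of one to three digits, so a string such as 999.1.1.1 would also be reported as an IP address. A minimal sketch of an additional range check, in which the helper name `find_ipv4` and the sample text are hypothetical and not taken from mbox-short.txt:

```python
import re

def find_ipv4(text):
    """Return the sorted unique dotted quads whose four parts are all 0-255."""
    candidates = re.findall(r'\b(?:\d{1,3}\.){3}\d{1,3}\b', text)
    valid = {
        ip for ip in candidates
        if all(int(part) <= 255 for part in ip.split('.'))
    }
    return sorted(valid)

# Hypothetical sample text with one duplicate and one out-of-range candidate
sample = ("Received: from host ([134.68.220.122]) via 999.1.1.1; "
          "also 134.68.220.122 and 10.0.0.1")

print(find_ipv4(sample))   # ['10.0.0.1', '134.68.220.122']
```

The result is sorted as text, not numerically, which is enough for displaying a short list of unique addresses.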

### 3.2 Extract all dates and times


In the next task, we need to extract all dates and times from the file. For now, we trust that all of them are valid.


In [None]:
str1=re.findall(r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})', text)
set(str1)
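
If the extracted values cannot be trusted, one possible follow-up is to let `datetime.strptime` reject impossible values such as month 13. This is a sketch under that assumption; the helper name `valid_datetimes` and the sample text are hypothetical.

```python
import re
from datetime import datetime

def valid_datetimes(text):
    """Return the timestamps in text that are real calendar dates and times."""
    candidates = re.findall(r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}', text)
    valid = []
    for candidate in candidates:
        try:
            # strptime raises ValueError for values outside the calendar
            datetime.strptime(candidate, '%Y-%m-%d %H:%M:%S')
        except ValueError:
            continue
        valid.append(candidate)
    return valid

# Hypothetical sample: the second timestamp has the right shape but is invalid
sample = "Date: 2008-01-05 09:14:16 -0500 and 2008-13-45 25:61:61"

print(valid_datetimes(sample))   # ['2008-01-05 09:14:16']
```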

### 3.3 Extract author's email address


There are many email addresses in the file. We would like to extract the email addresses of the authors, which normally appear in the format:

"Author: stephen.marquard@uct.ac.za"

Now let's see if we can use the following regular expression:

```python
r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+"
```
which was copied and pasted from http://emailregex.com/

Does it work for this task?

In [None]:
str1=re.findall(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+", text)
set(str1)

What if we only want the email addresses that follow "Author:"?

In [None]:
str1=re.findall(r'Author: ([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)', text)
str1
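
Because the pattern contains a capturing group, re.findall returns only the part inside the parentheses, not the "Author: " prefix. The following sketch shows this on a made-up sample (the lines are illustrative, not real results from the file). It also assumes that the Author field always starts its line, and therefore anchors the pattern with ^ and re.MULTILINE:

```python
import re

# Hypothetical sample in the style of the mailbox file
sample = (
    "From: someone@example.org\n"
    "Author: stephen.marquard@uct.ac.za\n"
    "X-Note: Author: ignored@example.org\n"
    "Author: someone@example.org\n"
)

pattern = r'^Author: ([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)'

# With re.MULTILINE, ^ matches at the start of every line
authors = re.findall(pattern, sample, flags=re.MULTILINE)

print(authors)   # ['stephen.marquard@uct.ac.za', 'someone@example.org']
```

Without the ^ anchor, the address on the X-Note line would be returned as well.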

## Homework

If you need more help, watch the Software Carpentry lecture on regular expressions:

https://www.youtube.com/playlist?list=PL7C1EB31127AB8A0B

Alternatively, you can watch the video lecture on Lynda.com:

https://wwwlyndacom.ezproxy.lib.monash.edu.au/Regular-Expressions-tutorials/Welcome/85870/93904-4.html?autoplay=true

To access Lynda.com, you need to set up your account according to:

http://resources.lib.monash.edu.au/eresources/lyndacom.html