### A short practical introduction to RegEx

The concept of the regular expression arose in the 1950s, when the American mathematician Stephen Kleene formalized the description of a regular language. It came into common use with the Unix text-processing utilities ed and grep (global regular expression print).

In the classical construction, a regular expression processor translates a regular expression into a nondeterministic finite automaton (NFA), in which a given state and input symbol may lead to several next states. The NFA is then made deterministic (exactly one possible transition for each state and symbol) and run on the target text string to recognize substrings that match the regular expression.
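To make the idea concrete, here is a minimal sketch of simulating an NFA by tracking the set of states it could be in, which is the same subset idea used to make an automaton deterministic. The automaton is hand-built for the pattern `(a|ab)c`; the state numbers and the helper `nfa_accepts` are illustrative assumptions, not part of any library. Note that Python's own `re` module uses a backtracking matcher rather than this construction.

```python
# Hand-built NFA for the pattern (a|ab)c; state numbers are arbitrary.
# On reading 'a' from state 0 the automaton may go to state 1 OR state 2,
# which is what makes it nondeterministic.
TRANSITIONS = {
    (0, "a"): {1, 2},
    (1, "c"): {4},
    (2, "b"): {3},
    (3, "c"): {4},
}
START, ACCEPTING = 0, {4}


def nfa_accepts(text):
    """Return True if the whole text is accepted by the NFA."""
    current = {START}  # every state the NFA could be in right now
    for symbol in text:
        # Follow all possible transitions at once (the subset idea).
        current = set().union(
            *(TRANSITIONS.get((state, symbol), set()) for state in current)
        )
        if not current:  # no state survives: reject early
            return False
    return bool(current & ACCEPTING)


for candidate in ["ac", "abc", "ab", "abcc"]:
    print(candidate, nfa_accepts(candidate))
```

Only `ac` and `abc` are accepted; `ab` stops in a non-accepting state and `abcc` runs out of transitions.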

You write regular expressions (regex) to match patterns in strings. When you are processing text, you may want to extract a substring of some predictable structure: a phone number, an email address, or something more specific to your research or task. You may also want to clean your text of some kind of junk: maybe there are repetitive formatting errors due to some transcription process that you need to remove.

In these cases and in many others like them, writing the right regex will be a good choice.

[<img src="RE.png">](https://xkcd.com/208/)




<div class = "alert alert-info" style = "border-radius:10px;border-width:3px;border-color:darkblue;font-family:Verdana,sans-serif;font-size:16px;">
<ul>
    <li> Regular expressions are pattern matching rules. In essence, everything is a character, and a regular expression is a set of rules describing the character patterns to look for.</li>
    <li> If we provide a raw set of characters it will look for exact matches, e.g. 'aBc1' </li>
</ul>
</div>

In [12]:
import re #the regex module in the python standard library

#string to be searched for matches of the regex pattern
#'abc' occurs three times in it, but re.search returns only the first match
ex1 = 'abc abcde abcdefg'
pattern = 'abc'
match = re.search(pattern,ex1)

print ("First match:" + match.group())

First match:abc


The `search` function returns a match object (`re.Match` in Python 3, displayed as `SRE_Match` in older versions) if it finds a match for the given pattern; otherwise it returns `None`. The `group()` method of the match object returns the substring that matched the pattern.

Note that since we are using `re.search`, only a single substring is returned. That's because of the following:

+ We only defined a single string pattern 
+ `re.search` finds the first possible match and then doesn't look for any more.
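Because `re.search` returns `None` when nothing matches, calling `.group()` directly on its result can fail. A minimal sketch of guarding against that case (the second pattern, `xyz`, is only an illustrative value that does not occur in the string):

```python
import re

text = "abc abcde abcdefg"

for pattern in ("abc", "xyz"):
    match = re.search(pattern, text)
    if match is None:
        # Calling match.group() here would raise AttributeError.
        print(f"{pattern!r}: no match")
    else:
        print(f"{pattern!r}: found {match.group()!r} at {match.span()}")
```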

If you want to find all possible matches in a string, you can use re.findall(), which will return a list of all strings that match:

In [13]:
print (re.findall(pattern,ex1))

['abc', 'abc', 'abc']


You can also compile your regex ahead of time with `re.compile`. This avoids re-parsing the pattern each time it is used, which can help when the same pattern is applied many times. Additionally, you can create lists of these compiled objects and iterate over both strings and patterns more easily, using `finditer`. Here's an example:

In [14]:
strings = ['abc123xyz define123 var g = 123', "abc abcde abcdefg"]

patterns = [re.compile('abc'), re.compile('123')]

for string in strings:
    for pattern in patterns:
        # finditer is like findall, but yields one match object at a time,
        # which is handy when you need the position of each match
        for m in pattern.finditer(string):
            print('Searching r"' + pattern.pattern + '" in ' + string)
            print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))

Searching r"abc" in abc123xyz define123 var g = 123
00-03: abc
Searching r"123" in abc123xyz define123 var g = 123
03-06: 123
Searching r"123" in abc123xyz define123 var g = 123
16-19: 123
Searching r"123" in abc123xyz define123 var g = 123
28-31: 123
Searching r"abc" in abc abcde abcdefg
00-03: abc
Searching r"abc" in abc abcde abcdefg
04-07: abc
Searching r"abc" in abc abcde abcdefg
10-13: abc


<H3> Summary of terms for regular expressions </H3>
<ul>
<li><strong>'[ ]'</strong> - Character set: matches any one of the characters listed inside.</li>
<li><strong>'|'</strong> - Alternation: matches either the expression on its left or the one on its right.</li>
<li><strong>'( )'</strong> - Group: the enclosed expression is treated as a single unit, and the text it matches is captured.</li>
<li><strong>'{ }'</strong> - Repetition count: {n} means exactly n repetitions, {m,n} means between m and n repetitions.</li>
<li><strong>'\'</strong> - Escape: treats the next character as a literal character and not as a regular expression symbol (it also introduces special sequences such as \d).</li>
<li><strong>'.'(Dot.)</strong> - In the default mode, this matches any character except a newline.</li>
<li><strong>'^'(Caret.)</strong> - Matches the start of the string, and in MULTILINE mode also matches immediately after each newline.</li>
<li><strong>'$'</strong> - Matches the end of the string or just before the newline at the end of the string, and in MULTILINE mode also matches before a newline.</li>
<li><strong>'*'</strong> - Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. ab* will match ‘a’, ‘ab’, or ‘a’ followed by any number of ‘b’s.</li>
<li><strong>'+'</strong> - Causes the resulting RE to match 1 or more repetitions of the preceding RE. ab+ will match ‘a’ followed by any non-zero number of ‘b’s; it will not match just ‘a’.</li>
<li><strong>'?'</strong> - Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. ab? will match either ‘a’ or ‘ab’.</li>
</ul>

**\*?, +?, ??** - The **'\*'**, **'+'**, and **'?'** quantifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn’t desired; if the RE `<.*>` is matched against `<a> b <c>`, it will match the entire string, and not just `<a>`. Adding `?` after the quantifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using the RE `<.*?>` will match only `<a>`.
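The difference between greedy and non-greedy matching can be checked directly with the same `<a> b <c>` string; this is a small sketch, and the variable names are arbitrary:

```python
import re

text = "<a> b <c>"

greedy = re.search(r"<.*>", text).group()  # '<a> b <c>'
lazy = re.search(r"<.*?>", text).group()   # '<a>'
all_lazy = re.findall(r"<.*?>", text)      # ['<a>', '<c>']

print(greedy)
print(lazy)
print(all_lazy)
```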

**\d** - Matches any decimal digit; this is equivalent to the class [0-9].

**\D** - Matches any non-digit character; this is equivalent to the class [^0-9].

**\s** - Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].

**\S** - Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v]

**\w** - Matches any alphanumeric character; this is equivalent to the class [a-zA-Z0-9_].

**\W** - Matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_].

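**\b** - Matches the empty string at a word boundary, that is, the position between a word character (\w) and a non-word character (\W) or the start or end of the string. It consumes no characters, whitespace or otherwise.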
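A short sketch of the shorthand classes and the word boundary in use; the sample strings are made up for illustration:

```python
import re

sample = "Room 42, floor_3: call 555-0199 at 9am!"

print(re.findall(r"\d+", sample))  # ['42', '3', '555', '0199', '9']
print(re.findall(r"\w+", sample))  # ['Room', '42', 'floor_3', 'call', '555', '0199', 'at', '9am']
print(re.split(r"\s+", sample))    # ['Room', '42,', 'floor_3:', 'call', '555-0199', 'at', '9am!']

# \b matches a position, not a character: it keeps "at" from matching
# inside "cat", "sat" or "bat". The raw string (r"...") matters here,
# because in a normal Python string "\b" means a backspace character.
words = "at the cat sat at bat"
print(len(re.findall(r"at", words)))      # 5
print(len(re.findall(r"\bat\b", words)))  # 2
```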

<p>For more comprehensive and complete documentation, including the (?...) extensions, see: <a href="http://docs.python.org/2/library/re.html#re-syntax">http://docs.python.org/2/library/re.html#re-syntax</a></p>


In [1]:
#Find all adverbs (words ending in "ly")
import re
text = "He was carefully disguised but captured quickly by police ly."
for m in re.finditer(r"\w+ly", text):
    print ('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))

07-16: carefully
40-47: quickly



<ul>
     <li>Optional values can be given by the question mark sign. The preceding character becomes optional, e.g. <code>cats?</code> matches both cat and cats.</li>
     <li>Another way of checking for specific options is to use square brackets. For example, <code>[abc]</code> matches exactly one character: a, b, or c.</li>
     <li>We can negate a set in square brackets: <code>[^abc]</code> matches any single character except a, b, or c.</li>
     <li>We can select ranges, such as <code>[a-z]</code>, <code>[A-Z]</code> or <code>[0-9]</code>.</li>
</ul>
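The four rules in the list above can be tried out in a few lines. This is a minimal sketch with made-up input strings:

```python
import re

# Optional character: the trailing "s" may or may not be present.
print(re.findall(r"cats?", "cat cats catalog"))      # ['cat', 'cats', 'cat']

# Character set: exactly one of a, b or c.
print(re.findall(r"[abc]", "a1b2c3d4"))              # ['a', 'b', 'c']

# Negated set: any single character except a, b or c.
print(re.findall(r"[^abc]", "a1b2c3d4"))             # ['1', '2', '3', 'd', '4']

# Ranges: runs of lowercase letters or runs of digits.
print(re.findall(r"[a-z]+|[0-9]+", "abc123DEF456"))  # ['abc', '123', '456']
```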


In [2]:
#Find file names starting with "file" and ending with ".pdf"
#\b requires "file" to begin at a word boundary, \w* allows any number of
#word characters after it, and \. matches the literal dot before "pdf"
import re
text = 'file_a_record_file.pdf file_yesterday.pdf test_file_fake.pdf.tmp' 
for m in re.finditer(r"\bfile\w*\.pdf", text):
    print ('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))

00-22: file_a_record_file.pdf
23-41: file_yesterday.pdf



<ul>
     <li>Another interesting feature is capturing. Parentheses define a group, that is, the part of the match we want to get back. In Python we access these groups by index on the match object: <code>group(0)</code> is the whole match, <code>group(1)</code> is the first group, <code>group(2)</code> the second, and so on. Groups are numbered by the order of their opening parentheses, so a nested group comes after the group that contains it.</li>
</ul>

In [3]:
#Strip leading spaces with a capture group (the greedy .+ still captures the trailing spaces)

text = "               Masters of Ba Gua Zhang    "

for m in re.finditer(r"\s*(.+)\s*", text):
    print ('%02d-%02d: %s' % (m.start(1), m.end(1), m.group(1)))
    print ('%02d-%02d: %s' % (m.start(0), m.end(0), m.group(0)))
#group(1) is the text captured by the parentheses; group(0) is the complete match, including the leading spaces

15-42: Masters of Ba Gua Zhang    
00-42:                Masters of Ba Gua Zhang    


Check what happens if we change index 1 for index 0 in the former example.

In [4]:
#Same pattern as before, printing only the complete match

text = "               Masters of Ba Gua Zhang    "

for m in re.finditer(r"\s*(.+)\s*", text):
    print ('%02d-%02d: %s' % (m.start(0), m.end(0), m.group(0)))
#group(0) is the complete match, so the leading spaces are included

00-42:                Masters of Ba Gua Zhang    
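The pattern above uses a greedy `.+`, so the captured group still contains the trailing spaces. One possible way to trim both ends, sketched here under the assumption that the text is a single line, is to make the group lazy and anchor the pattern at the end of the string. The second part shows how nested groups are numbered; the `pages 12-34` string is a made-up example.

```python
import re

text = " " * 15 + "Masters of Ba Gua Zhang" + " " * 4

# Lazy (.+?) plus the $ anchor leaves the trailing spaces to the final \s*.
m = re.search(r"^\s*(.+?)\s*$", text)
print(repr(m.group(1)))  # 'Masters of Ba Gua Zhang'
print(m.span(1))         # (15, 38)

# Groups are numbered by the position of their opening parenthesis,
# so the outer group is 1 and the two nested groups are 2 and 3.
d = re.search(r"((\d+)-(\d+))", "pages 12-34")
print(d.group(0), d.group(1), d.group(2), d.group(3))  # 12-34 12-34 12 34
```

For plain trimming, `text.strip()` does the same job without a regex; the pattern is mainly useful when trimming is one part of a larger match.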


In [5]:
#Match any number 
numbers = '3.1452 -255.34 128 1.9e10 12,334.00 720p'


print (re.findall(r"-?\d+[\.,]?\d*[\.e]?\d*\b", numbers))

['3.1452', '-255.34', '128', '1.9e10', '12,334.00']


### Regular Expressions in an HTML page

Find all the links in a web page:

In [6]:
#read the saved page; the with block closes the file when done
with open("Data Science - Universitat de Barcelona.htm") as f:
    html = f.read()

for m in re.finditer(r"href=\"(\S+)\"", html):
    print ('%02d-%02d: %s' % (m.start(1), m.end(1), m.group(1)))

760-834: http://maxcdn.bootstrapcdn.com/font-awesome/4.3.0/css/font-awesome.min.css
870-891: css/themify-icons.css
927-995: http://maxcdn.bootstrapcdn.com/bootstrap/3.3.2/css/bootstrap.min.css
1108-1195: http://fonts.googleapis.com/css?family=Roboto+Condensed:300,700%7COpen+Sans:300,400,700
1260-1281: css/style.default.css
2044-2050: #intro
2413-2421: #contact
3035-3042: master/
3369-3374: deep/
3668-3680: postgraduate
6960-7001: http://bootstrapious.com/portfolio-themes
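One caveat with `\S+` is that it is greedy and can run past the closing quote when attributes are not separated by whitespace. A sketch of the difference, using a hypothetical HTML snippet in place of the saved page and a negated set `[^"]+` as one possible stricter alternative:

```python
import re

# Hypothetical snippet standing in for the saved page used above.
html = '<a href="a.html"title="A">A</a> <a href="b.html">B</a>'

greedy_links = re.findall(r'href="(\S+)"', html)
strict_links = re.findall(r'href="([^"]+)"', html)

print(greedy_links)  # ['a.html"title="A', 'b.html']
print(strict_links)  # ['a.html', 'b.html']
```

Either pattern is still a rough approximation: regular expressions do not understand HTML structure, so this approach is best kept for quick extraction jobs.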


### Regular Expressions in pandas

You can use regular expressions to work with text data inside a Series. pandas provides vectorized string methods that operate on each element of the array and skip missing/NA values automatically. They are accessed via the `str` attribute.

In [7]:
import pandas as pd
df = pd.read_csv('./educ_figdp_1_Data.csv',na_values=':')
df.head(5)

Unnamed: 0,TIME,GEO,INDIC_ED,Value,Flag and Footnotes
0,2000,European Union (28 countries),Total public expenditure on education as % of ...,,
1,2001,European Union (28 countries),Total public expenditure on education as % of ...,,
2,2002,European Union (28 countries),Total public expenditure on education as % of ...,5.0,e
3,2003,European Union (28 countries),Total public expenditure on education as % of ...,5.03,e
4,2004,European Union (28 countries),Total public expenditure on education as % of ...,4.95,e


In [8]:
pattern = r"\((?P<European_Union>\d+ countries)\)" # A group can be named using ?P<name of group>
s = df["GEO"].str.extract(pattern, expand=False) # Returns a Series.
s.dropna()

0     28 countries
1     28 countries
2     28 countries
3     28 countries
4     28 countries
          ...     
79    13 countries
80    13 countries
81    13 countries
82    13 countries
83    13 countries
Name: European_Union, Length: 84, dtype: object

In [9]:
df["GEO"].str.extract(pattern, expand=True) # Returns a DataFrame.

Unnamed: 0,European_Union
0,28 countries
1,28 countries
2,28 countries
3,28 countries
4,28 countries
...,...
379,
380,
381,
382,


### Regular expression methods for `str`
+ `findall()` Compute a list of all occurrences of the pattern/regex for each string
+ `match()` Call re.match on each element, returning a boolean that tells whether the string starts with a match
+ `extract()` Call re.search on each element, returning a DataFrame with one row for each element and one column for each regex capture group
+ `extractall()` Call re.findall on each element, returning a DataFrame with one row for each match and one column for each regex capture group
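A compact sketch of the four methods side by side. The Series values are hypothetical, loosely modelled on the GEO column used above, and the group name `number` is an arbitrary choice:

```python
import pandas as pd

# Hypothetical values, loosely modelled on the GEO column used above.
geo = pd.Series([
    "European Union (28 countries)",
    "Euro area (13 countries, 2 observers)",
    "Spain",
])

# findall: a list of every match for each element
print(geo.str.findall(r"\d+"))

# match: True where the string starts with a match
print(geo.str.match(r"Euro"))

# extract: first match only, NaN where there is none
print(geo.str.extract(r"\((\d+) countries", expand=False))

# extractall: one row per match, indexed by (element, match number)
print(geo.str.extractall(r"(?P<number>\d+)"))
```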

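### Checking matches

If we don't need the matching substring and only want to check whether a string matches a pattern, we can use the `re.match` function and check whether its result is None. Note that `re.match` anchors the pattern only at the start of the string, so the pattern needs a trailing `$` to cover the whole string.

For example, to check whether a string is a well-formed URL: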

In [10]:
import re
pattern = r'^((https?:\/\/)|www\.)([\da-z\.-]+)\.([\/\w\.-]*)$' # raw string, so the backslashes reach the regex engine unchanged

str_true = ('https://github.com', 
            'http://github.com',
            'www.github.com',
            'https://www.github.com/rasbt'
            )
            
str_false = ('//testmail.com', 'http:testmailcom', )

strings = str_true + str_false

for t in strings:
    f = bool(re.match(pattern, t))
    print ('%s is a %s URL' % (t,f))

https://github.com is a True URL
http://github.com is a True URL
www.github.com is a True URL
https://www.github.com/rasbt is a True URL
//testmail.com is a False URL
http:testmailcom is a False URL
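As a sketch of how the three entry points differ when used for checking, the loop below applies the same made-up pattern (`\d{4}`, four digits) with `re.match`, `re.search` and `re.fullmatch`:

```python
import re

pattern = r"\d{4}"

for text in ("2014", "2014-08", "year 2014"):
    print(
        text,
        bool(re.match(pattern, text)),      # anchored at the start only
        bool(re.search(pattern, text)),     # anywhere in the string
        bool(re.fullmatch(pattern, text)),  # the whole string must match
    )
```

Only `2014` passes all three checks; `2014-08` passes `match` and `search` but not `fullmatch`, and `year 2014` passes `search` alone. For validation, `re.fullmatch` (or a pattern ending in `$`) is usually the safer choice.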


# Exercises

* A regular expression that checks most email addresses:

In [26]:
import re
pattern = r"\S+@\w+\.\w+$"

str_true = ('l-l.l@mail.Aom',)
            
str_false = ('testmail.com','test@mail.com.', '@testmail.com', 'test@mailcom')

strings = str_true + str_false
for t in strings:
    f = bool(re.match(pattern, t))
    print ('%s is a %s mail address' % (t,f))

l-l.l@mail.Aom is a True mail address
testmail.com is a False mail address
test@mail.com. is a False mail address
@testmail.com is a False mail address
test@mailcom is a False mail address


* Validate dates in mm/dd/yyyy format:

In [43]:
import re
pattern = r"(0[1-9]|1[012])/(0[1-9]|[12]\d|3[01])/\d\d\d\d"
str_true = ('01/08/2014', '12/30/2014', )
            
str_false = ('22/08/2014', '-123', '1/8/2014', '1/08/2014', '01/8/2014')

strings = str_true + str_false
for t in strings:
    f = bool(re.match(pattern, t))
    print ('%s is a %s date format' % (t,f))

01/08/2014 is a True date format
12/30/2014 is a True date format
22/08/2014 is a False date format
-123 is a False date format
1/8/2014 is a False date format
1/08/2014 is a False date format
01/8/2014 is a False date format
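The pattern above checks only the shape of the date and, because `re.match` anchors only at the start, it also accepts trailing characters. A possible way to tighten this, sketched under the assumption that real calendar validity matters, is to combine `fullmatch` with `datetime.strptime`. The helper name `is_valid_date` is made up for this example.

```python
import re
from datetime import datetime

DATE_RE = re.compile(r"(0[1-9]|1[012])/(0[1-9]|[12]\d|3[01])/\d{4}")


def is_valid_date(text):
    """Check the mm/dd/yyyy shape first, then that the date really exists."""
    if DATE_RE.fullmatch(text) is None:
        return False
    try:
        datetime.strptime(text, "%m/%d/%Y")
    except ValueError:  # e.g. February 31st
        return False
    return True


for candidate in ("01/08/2014", "02/31/2014", "01/08/20145"):
    print(candidate, bool(DATE_RE.match(candidate)), is_valid_date(candidate))
```

All three candidates pass the plain `match` check, but only `01/08/2014` passes `is_valid_date`.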


* 12-Hour format

In [72]:
import re
pattern = r'(1[0-2]|[1-9]):[0-5]\d\s?[aApP][mM]$' # hour 1-12, minutes 00-59, optional space, am/pm in any case
str_true = ('2:00pm', '7:30 AM', '12:05 am', )
            
str_false = ('22:00pm', '14:00', '3:12', '03:12pm', )
strings = str_true + str_false
for t in strings:
    f = bool(re.match(pattern, t))
    print ('%s is a %s 12-hour format' % (t,f))

2:00pm is a True 12-hour format
7:30 AM is a True 12-hour format
12:05 am is a True 12-hour format
22:00pm is a False 12-hour format
14:00 is a False 12-hour format
3:12 is a False 12-hour format
03:12pm is a False 12-hour format
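A common pitfall in patterns like this is that `|` has the lowest precedence, so it splits the whole pattern unless the alternatives are wrapped in a group. The sketch below illustrates that and shows one possible alternative formulation with `re.IGNORECASE` and `re.fullmatch`; the names `loose`, `grouped` and `TIME_RE` are illustrative only.

```python
import re

# '|' splits the WHOLE pattern unless it is wrapped in a group.
loose = r"1[0-2]|[1-9]:[0-5]\d"        # means: '1[0-2]'  OR  '[1-9]:[0-5]\d'
grouped = r"(?:1[0-2]|[1-9]):[0-5]\d"  # means: hour, then ':' and minutes

print(re.match(loose, "12:99").group())  # '12' (the minutes are never checked)
print(re.match(grouped, "12:99"))        # None

# re.IGNORECASE avoids spelling out [aApP][mM]; fullmatch rejects trailing junk.
TIME_RE = re.compile(grouped + r"\s?[ap]m", re.IGNORECASE)
for text in ("2:00pm", "7:30 AM", "12:05 am", "22:00pm", "03:12pm"):
    print(text, bool(TIME_RE.fullmatch(text)))
```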


* Checking for HTML/XML, etc. tags (a very simple approach)

In [61]:
import re
pattern = r"<\S.*>"
str_true = ('<a>', '<a href="somethinG">', '</a>', '<img src>')
            
str_false = ('a>', '<a ', '< a >')
strings = str_true + str_false
for t in strings:
    f = bool(re.match(pattern, t))
    print ('%s is a %s HTML/XML tag' % (t,f))

<a> is a True HTML/XML tag
<a href="somethinG"> is a True HTML/XML tag
</a> is a True HTML/XML tag
<img src> is a True HTML/XML tag
a> is a False HTML/XML tag
<a  is a False HTML/XML tag
< a > is a False HTML/XML tag
