






Regular Expressions
-------------------

This document will present basic regular expression syntax and cover common use cases for regular expressions: pattern matching, filtering, data extraction, and string replacement. 

We will present examples using Python's standard [re regular expression library](http://docs.python.org/library/re.html).

You may also want to look at this [*excellent* tutorial from Google](https://developers.google.com/edu/python/regular-expressions).


### Basic Patterns

* a, X, 9, < -- ordinary characters just match themselves exactly. The meta-characters which do not match themselves because they have special meanings are: . ^ \$ * + ? { [ ] \ | ( ) (details below)
* . (a period) -- matches any single character except newline '\n'
* \w -- (lowercase w) matches a "word" character: a letter or digit or underbar [a-zA-Z0-9_]. Note that although "word" is the mnemonic for this, it only matches a single word char, not a whole word. \W (upper case W) matches any non-word character.
* \b -- boundary between word and non-word
* \s -- (lowercase s) matches a single whitespace character -- space, newline, return, tab, form feed [ \n\r\t\f]. \S (upper case S) matches any non-whitespace character.
* \t, \n, \r -- tab, newline, return
* \d -- decimal digit [0-9] (some older regex utilities do not support \d, but they all support \w and \s)
* ^ = start, $ = end -- match the start or end of the string
* \ -- inhibit the "specialness" of a character. So, for example, use \. to match a period or \\ to match a backslash. If you are unsure whether a character has special meaning, such as '@', you can put a backslash in front of it, \@, to make sure it is treated just as a character.
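
As a quick illustration of the patterns listed above, here is a minimal sketch. It uses `re.findall`, a function of the `re` module that returns all non-overlapping matches as a list, and the `+` quantifier ("one or more"). The sample string is made up for illustration, and `print(...)` is written with a single argument so the code runs on both Python 2 and Python 3.

```python
import re

# Hypothetical sample text, made up for this illustration
sample = "Call 555-1234 at 9 a.m. (ext. 42)"

# \d+ matches runs of one or more digits
digits = re.findall(r'\d+', sample)

# \w+ matches runs of word characters (letters, digits, underscore)
words = re.findall(r'\w+', sample)

# \. matches a literal period; an unescaped . would match any character
periods = re.findall(r'\.', sample)

# \b\d\b matches a digit with a word boundary on both sides
lone_digits = re.findall(r'\b\d\b', sample)

# ^ anchors the match to the start of the string
starts_with_call = re.search(r'^Call', sample) is not None

print(digits)            # ['555', '1234', '9', '42']
print(words)             # ['Call', '555', '1234', 'at', '9', 'a', 'm', 'ext', '42']
print(periods)           # ['.', '.', '.']
print(lone_digits)       # ['9']
print(starts_with_call)  # True
```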

### Searching strings using regexes

In [1]:
# first import the library
import re

In [3]:
# Regular expressions are compiled into pattern objects
regex = re.compile(r'D.*Data')
text = "Panos Ipeirotis, Dealing with Data, 212-998-0803, panos@nyu.edu"
match = regex.search(text)
print match.group()

regex = re.compile(r'\d{3}\D*\d{3}\D*\d{4}')
match = regex.search(text)
print match.group()

Dealing with Data
212-998-0803


In [5]:
# We will now try to match an email address. What is wrong in our regex?
# regex = re.compile(r'[a-z]*@([a-z])+(.[a-z]+)+')
regex = re.compile(r'[a-z\.]+@([a-z0-9\-]+\.)+\w+')
text = "adam.brandenburger@stern.nyu.edu"
match = regex.search(text)
print match.group()

adam.brandenburger@stern.nyu.edu


In [7]:
# We are looking for binary numbers
regex = re.compile(r'[10]+')
text = "asddf1101110100011abd"
match = regex.search(text)
print match.group() 

1101110100011


In [9]:
# We look for money figures, either integers, or with 1 or 2 decimal
# digits
regex = re.compile(r'^\$?(\d*(\.\d\d?)?|\d+)')
text = '$1200.23 is the price today'
match = regex.search(text)
print match.group()

$1200.23


In [12]:
# This code is going to generate an error
regex = re.compile(r'Ra*nd.*m R[egex]')
text = "Panos Ipeirotis, Dealing with Data, 212-998-0803, panos@nyu.edu"
match = regex.search(text)
print match.group()

AttributeError: 'NoneType' object has no attribute 'group'

When the regular expression does not match anything, `search` returns `None`, and calling `group()` on `None` raises the `AttributeError` shown above. Testing the result with an `if` statement avoids the error:

In [11]:
# Use the name text rather than str, which would shadow the built-in str type
text = 'an example word:cat!!'
match = re.search(r'word:\w\w\w', text)
# If-statement after search() tests if it succeeded
if match:                      
    print 'found', match.group() ## 'found word:cat'
else:
    print 'did not find'

found word:cat


Therefore, we need to check that the returned object is not None, before trying to access a method of the object. The `None` value within the context of an `if` conditional gets translated to `False`; hence, we can modify the code above as follows:

In [None]:
# Regular expressions are compiled into pattern objects
regex = re.compile(r'Ra*nd.*m R[egex]')
text = "Panos Ipeirotis, Dealing with Data, 212-998-0803, panos@nyu.edu"
match = regex.search(text)
if match:
    print match.group()
else:
    print "not found"

### Flags for regexes: Case-sensitivity and multiline searches

Regular expressions are case-sensitive by default.

In [None]:
# Regular expressions are case-sensitive by default,
# so 'I.*IS' does not match the lowercase 'is' in 'Ipeirotis'
regex = re.compile(r'I.*IS')
text = "Panos Ipeirotis, Dealing with Data, 212-998-0803, panos@nyu.edu"
match = regex.search(text)
if match:
    print match.group()
else:
    print "not found"

But we can specify that they are case-insensitive, using the flag re.IGNORECASE

In [None]:
# Unless we specify that they are case-insensitive, using the flag re.IGNORECASE
regex = re.compile('I.*IS',re.IGNORECASE)
text = "Panos Ipeirotis, Dealing with Data, 212-998-0803, panos@nyu.edu"
match = regex.search(text)
if match:
    print match.group()
else:
    print "not found"

For a full list of available flags, please see the [re documentation](http://docs.python.org/library/re.html).
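
The heading of this section also mentions multiline searches. As a minimal sketch, assuming the made-up text below: by default `^` and `$` match only at the start and end of the whole string, while with the `re.MULTILINE` flag they also match at the start and end of each line. Flags can be combined with `|`. (`print(...)` is written with a single argument so the code runs on both Python 2 and Python 3.)

```python
import re

# Hypothetical multi-line text, made up for this illustration
text = "panos: 212-998-0803\nAdam: 646-555-5555\nPANOS: none"

# Without flags, ^ matches only at the very start of the string
default_matches = re.findall(r'^\w+', text)

# With re.MULTILINE, ^ also matches right after each newline
multiline_matches = re.findall(r'^\w+', text, re.MULTILINE)

# Flags are combined with |: here, multiline and case-insensitive
combined = re.findall(r'^panos', text, re.MULTILINE | re.IGNORECASE)

print(default_matches)    # ['panos']
print(multiline_matches)  # ['panos', 'Adam', 'PANOS']
print(combined)           # ['panos', 'PANOS']
```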

### Multiple matches in a string

The `search` method scans the string from left to right and stops at the first position where the regex matches.
It returns only that first match. For example, we will not get the second phone number:

In [13]:
# The search method scans the string from left to right and stops at the
# first match. For example, we will not get the second phone number
regex = re.compile('[0-9]{3}-[0-9]{3}-[0-9]{4}')
text = '''
Panos Ipeirotis, Dealing with Data, 
212-998-0803, panos@nyu.edu, 646-555-5555
'''
match = regex.search(text)
if match:
    print match.group()
else:
    print "not found"

212-998-0803


If we want to find multiple matches within the string, we use the `finditer` method, which returns an iterator over `MatchObject` items, one per non-overlapping match. (For comparison, `search` returns just the first `MatchObject` item.)

In [14]:
# The finditer method goes through the string to find all the substrings that match the regex
regex = re.compile(r'[0-9]{3}-[0-9]{3}-[0-9]{4}')
text = "Panos Ipeirotis, Dealing with Data, 212-998-0803, panos@nyu.edu, 646-555-5555"
matches = regex.finditer(text)
for m in matches:
    # Backslashes continue the print statement across lines
    print "Starts at:", m.start(), \
        "Ends at:", m.end(), \
        "Content:", m.group()

Starts at: 36 Ends at: 48 Content: 212-998-0803
Starts at: 65 Ends at: 77 Content: 646-555-5555


### Extracting Data -- where regexes start to get really cool

In addition to simple matching and filtering, many regular expression implementations, including Python's `re`, provide a powerful mechanism for extracting meaningful data from raw text: capturing. The portion of a regular expression that should be captured is surrounded by parentheses, `( )`. When the regular expression matches, the text matched by each parenthesized portion is stored as a group in the returned match object and can be retrieved with the `group` method.

Groups are indexed from one: `result.group(1)` returns the text matched by the first pair of parentheses, `result.group(2)` the text matched by the second pair, and so on. `result.group(0)` returns the entire portion of the input string matched by the regular expression, not just the parts inside the capturing parentheses.
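
A minimal sketch of group indexing, using a made-up input line and a two-group regex chosen only for illustration. It also calls `groups()`, a match-object method that returns all captured groups as a tuple. (`print(...)` is written with a single argument so the code runs on both Python 2 and Python 3.)

```python
import re

# Hypothetical input line, made up for this illustration
text = "Dealing with Data, 212-998-0803"

# Two capturing groups: the area code and the rest of the number
regex = re.compile(r'(\d{3})-(\d{3}-\d{4})')
match = regex.search(text)

if match:
    whole = match.group(0)      # entire matched text
    area_code = match.group(1)  # first pair of parentheses
    local = match.group(2)      # second pair of parentheses
    both = match.groups()       # tuple of all captured groups
    print(whole)      # 212-998-0803
    print(area_code)  # 212
    print(local)      # 998-0803
    print(both)       # ('212', '998-0803')
```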

As an example of data extraction using capturing regular expressions, say we're scanning some raw text for phone numbers that we wish to retain for later processing. We might try something like:

In [15]:
raw_text = r"""
512-234-5234
foo
bar
124-512-5555
biz
125-555-5785
679-397-5255
2126660921
212-998-0902
888-888-2222
801-555-1211
802 555 1212
803.555.1213
(804) 555-1214
1-805-555-1215
1(806)555-1216
807-555-1217-1234
808-555-1218x1234
809-555-1219 ext. 1234
work 1-(810) 555.1220 #1234
"""

# Notice now that each part of the phone is included in parentheses
# allowing us to grab individual part of the phone number
regex = re.compile(r'([2-9]\d{2})\W*(\d{3})\W*(\d{4})')
matches = regex.finditer(raw_text)

for m in matches:
    print "(", m.group(1) , ")",m.group(2), "-", m.group(3)


( 512 ) 234 - 5234
( 679 ) 397 - 5255
( 212 ) 666 - 0921
( 212 ) 998 - 0902
( 888 ) 888 - 2222
( 801 ) 555 - 1211
( 802 ) 555 - 1212
( 803 ) 555 - 1213
( 804 ) 555 - 1214
( 805 ) 555 - 1215
( 806 ) 555 - 1216
( 807 ) 555 - 1217
( 808 ) 555 - 1218
( 809 ) 555 - 1219
( 810 ) 555 - 1220


(See also http://www.diveintopython.net/regular_expressions/phone_numbers.html if you want to see further examples.)

The examples will look like gobbledygook at first.  But after you go through some actual cases, and especially after you struggle to write a few for a real data science task, you will realize that you're not in Kansas any longer.  Now get ready for a horse of a different color...

### String Replacement

In addition to matching and extraction, regular expressions can be used to change data--especially unstructured text--in very powerful ways.  In particular, regexes allow you to find specific patterns and then replace them with specified strings.

As a data scientist, you will find this useful when getting data formatted correctly as input to a specific system, such as during data cleanup.

In Python's `re` library, the function used for replacement is `sub()` (think "substitute").

The pattern for invoking `sub()` is 

`re.sub(regex, replacement, text)`

This returns a copy of `text` in which every non-overlapping match of `regex` has been replaced with `replacement`. The original string is not modified.

Imagine we want to conceal all phone numbers in a document. We could use the following call to `sub()`:

In [None]:
raw_text = """512-234-5234
foo
bar
124-512-5555
biz
125-555-5785
679-397-5255
2126660921
212-998-0902
888-888-2222
801-555-1211
802 555 1212
803.555.1213
(804) 555-1214
1-805-555-1215
1(806)555-1216
807-555-1217-1234
808-555-1218x1234
809-555-1219 ext. 1234
work 1-(810) 555.1220 #1234
"""

# Raw string, so that \W reaches the regex engine unchanged
regex = re.compile(r'([0-9]{3})\W*([0-9]{3})\W*([0-9]{4})')

newstring = re.sub(regex, "XXX-XXX-XXXX", raw_text)

print newstring

When performing substitution, matches found using the capturing mechanism are available to the replacement using `\1`, `\2`, and so on, as shortcuts to `group(1)`, `group(2)`, etc. 

To use back-references, write the replacement as a raw string (the `r` prefix) so that Python passes `\1` to `sub()` unchanged instead of interpreting it as an escape sequence. For instance, if we want to normalize all phone numbers and surround every area code with parentheses, we can use:

In [None]:
print re.sub(regex, r"(\1)-\2-\3", raw_text)
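
A minimal sketch of both uses of `sub()` on a small made-up sample, so the result is easy to check by eye. The sample string is an assumption for illustration; the regex and the replacement strings are the ones used above. (`print(...)` is written with a single argument so the code runs on both Python 2 and Python 3.)

```python
import re

# Hypothetical sample, a few numbers in the styles found in raw_text
sample = "212-998-0902\n803.555.1213\n802 555 1212"

regex = re.compile(r'([0-9]{3})\W*([0-9]{3})\W*([0-9]{4})')

# Replace every match with a fixed mask
masked = regex.sub("XXX-XXX-XXXX", sample)

# Rebuild every match from its three captured groups
normalized = regex.sub(r"(\1)-\2-\3", sample)

# Without the r prefix, "\1" is a single control character,
# not a back-reference to the first group
not_a_backreference = "\1" == "\x01"

print(masked)
print(normalized)
print(not_a_backreference)  # True
```

Here `masked` contains three lines of `XXX-XXX-XXXX`, and `normalized` contains `(212)-998-0902`, `(803)-555-1213`, and `(802)-555-1212`.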

### Exercise

Find all the emails in a webpage. 

Since we have not yet covered the networking abilities of Python (coming next), just use curl to fetch the HTML source of the page. Remember that you can either store the output of curl in a file and then read the file into Python, or (preferably) capture the output of curl directly in a Python variable.

Then you will need to write the right regex and write the code that finds emails in the retrieved html.

In [29]:
#your code here
html = !curl -s 'http://www.stern.nyu.edu/faculty/search_name_form'
print html

['<!DOCTYPE html>', '<html lang="en">', '<head>', '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">', '  <meta charset="utf-8" />', '<link rel="shortcut icon" href="http://www.stern.nyu.edu/sites/all/themes/stern/favicon.ico" type="image/vnd.microsoft.icon" />', '<link rel="apple-touch-icon" href="apple-touch-icon.png" />', '<link rel="apple-touch-icon" sizes="152x152" href="touch-icon-ipad.png" />', '<title>NYU Stern | Faculty Directory</title>', '  <link type="text/css" rel="stylesheet" href="http://www.stern.nyu.edu/sites/default/files/css/css_xE-rWrJf-fncB6ztZfd2huxqgxu4WO-qwma6Xer30m4.css" media="all" />', '<link type="text/css" rel="stylesheet" href="http://www.stern.nyu.edu/sites/default/files/css/css_sMZt_LGhD1AhX95CPWtFSGmwHMFmLTPDIPTN7PXksBc.css" media="all" />', '<link type="text/css" rel="stylesheet" href="http://www.stern.nyu.edu/sites/default/files/css/css_uHf4Jfry1oEnmrrhjGx_fXwVpo1JaZzZgENAn5ZRMg8.css" media="all" />', '<link type="text/css" rel="styl

In [31]:
regex = re.compile(r'[a-z\.]+@([a-z0-9\-]+\.)+\w+')
emails = []
for line in html:
    match = regex.search(line)
    if match:
        # print match.group()
        emails.append(match.group())
print emails

['rabrante@stern.nyu.edu', 'vacharya@stern.nyu.edu', 'pagnello@stern.nyu.edu', 'nahmad@stern.nyu.edu', 'talbanes@stern.nyu.edu', 'wallen@stern.nyu.edu', 'aalter@stern.nyu.edu', 'daltman@stern.nyu.edu', 'ealtman@stern.nyu.edu', 'yamihud@stern.nyu.edu', 'marmony@stern.nyu.edu', 'aasadpou@stern.nyu.edu', 'hassael@stern.nyu.edu', 'dbackus@stern.nyu.edu', 'bakos@stern.nyu.edu', 'kbalacha@stern.nyu.edu', 'tbaldeni@stern.nyu.edu', 'abandyop@stern.nyu.edu', 'isa@stern.nyu.edu', 'ebartov@stern.nyu.edu', 'wbaumol@stern.nyu.edu', 'bbechky@stern.nyu.edu', 'rberenbe@stern.nyu.edu', 'jbergenf@stern.nyu.edu', 'rbernste@stern.nyu.edu', 'yberson@stern.nyu.edu', 'gbesner@stern.nyu.edu', 'kbigel@stern.nyu.edu', 'jbiggs@stern.nyu.edu', 'jbilders@stern.nyu.edu', 'mbilling@stern.nyu.edu', 'sblader@stern.nyu.edu', 'cbleuste@stern.nyu.edu', 'mbonacch@stern.nyu.edu', 'abonezzi@stern.nyu.edu', 'sbowmake@stern.nyu.edu', 'eboyle@stern.nyu.edu', 'adam.brandenburger@stern.nyu.edu', 'abreen@stern.nyu.edu', 'mbrennan

#### Solution for Exercise



In [None]:
# Email regex: allow dots in the local part (e.g. adam.brandenburger)
# and hyphens in the domain labels
regex = re.compile(r'[a-z0-9.]+@([a-z0-9\-]+\.)+[a-z]+')

# We can create either a list or a set, but let's avoid duplicates
emails = set()

# We iterate through the lines of the html code
for line in html:
    # Find matches
    matches = regex.finditer(line)
    # Go through matches
    for m in matches:
        print m.group()
        emails.add(m.group())

print len(emails)

In [None]:
# and let's make it very compact using list comprehensions
html = !curl -s 'http://www.stern.nyu.edu/faculty/search_name_form'
regex = re.compile(r'[a-z0-9.]+@([a-z0-9\-]+\.)+[a-z]+')
emails = set([m.group() for line in html for m in regex.finditer(line) ])
emails
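
To check an email regex without network access, a minimal sketch can run it on a hand-written HTML fragment instead of the output of curl. The fragment and the addresses below are made up for illustration (they use the reserved `example.com` domain). The regex is a variant of the ones above that allows digits and dots in the local part, and the `re.IGNORECASE` flag and the call to `lower()` are added as assumptions so that uppercase addresses are found and deduplicated. (`print(...)` is written with a single argument so the code runs on both Python 2 and Python 3.)

```python
import re

# Hypothetical HTML lines standing in for the output of curl
html = [
    '<a href="mailto:jane.doe@example.com">Jane Doe</a>',
    '<p>Contact: JOHN@mail.example.com or jane.doe@example.com</p>',
    '<p>No address on this line</p>',
]

regex = re.compile(r'[a-z0-9.]+@([a-z0-9\-]+\.)+[a-z]+', re.IGNORECASE)

# A set keeps each address once, even if it appears on several lines
emails = set()
for line in html:
    for m in regex.finditer(line):
        emails.add(m.group().lower())

print(sorted(emails))  # ['jane.doe@example.com', 'john@mail.example.com']
```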