






Regular Expressions
-------------------

This document will present basic regular expression syntax and cover common use cases for regular expressions: pattern matching, filtering, data extraction, and string replacement. 

We will present examples using Python's standard [`re` regular expression library](http://docs.python.org/library/re.html).

You may also want to look at this [*excellent* tutorial from Google](https://developers.google.com/edu/python/regular-expressions).


### Basic Patterns

* a, X, 9, < -- ordinary characters just match themselves exactly. The meta-characters which do not match themselves because they have special meanings are: . ^ \$ * + ? { [ ] \ | ( ) (details below)
* . (a period) -- matches any single character except newline '\n'
* \w -- (lowercase w) matches a "word" character: a letter, digit, or underscore [a-zA-Z0-9_]. Note that although "word" is the mnemonic for this, it only matches a single word character, not a whole word.
* \W -- (uppercase W) matches any non-word character.
* \s -- (lowercase s) matches a single whitespace character: space, newline, return, tab, form feed [ \n\r\t\f].
* \S -- (uppercase S) matches any non-whitespace character.
* \d -- decimal digit [0-9] (some older regex utilities do not support \d, but they all support \w and \s)
* \D -- matches any character that is not a decimal digit
* \b -- boundary between word and non-word
* \t, \n, \r -- tab, newline, return
* ^ = start, $ = end -- match the start or end of the string
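* \ -- inhibit the "specialness" of a character. So, for example, use \. to match a period or \\ to match a backslash. If you are unsure whether a punctuation character such as '@' has special meaning, you can put a backslash in front of it, \@, to make sure it is treated just as a character. (Do not do this with letters or digits, where the backslash usually introduces a special meaning of its own, as in \d or \1.)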
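
A minimal sketch of a few of these patterns in action, using `re.findall` (which returns all non-overlapping matches as a list). The sample string and variable names are made up for illustration.

```python
import re

sample = "Room 42, call ext. 7 at 9am"  # hypothetical sample text

digits = re.findall(r'\d', sample)        # every single decimal digit
words = re.findall(r'\w+', sample)        # runs of one or more word characters
literal_dot = re.findall(r'\.', sample)   # an escaped period matches only '.'
any_char = re.findall(r't.', sample)      # 't' followed by any one character

print(digits)
print(words)
print(literal_dot)
print(any_char)
```

Note how `t.` picks up both `t.` and `t` followed by a space, while `\.` matches only the literal period.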

### Searching strings using regexes

In [None]:
# first import the library
import re

In [None]:
# Regular expressions are compiled into pattern objects
regex = re.compile(r'D.*Data')
text = "Norman White, Dealing with Data, 212-998-0842, nwhite@stern.nyu.edu"
matches = regex.finditer(text)
for match in matches:
    print(match.group())

In [None]:
# Find phone numbers: 
# Three digits \d{3}
# followed by zero or more non-digits \D*
# followed by three digits \d{3}
# followed by zero or more non-digits \D*
# followed by four digits \d{4}
regex = re.compile(r'\d{3}\D*\d{3}\D*\d{4}')
text = '''
Norman White, Dealing with Data,
tel: 212-998-0842
email: nwhite@stern.nyu.edu
fax: 646-255-5555
'''

matches = regex.finditer(text)
for match in matches:
    print(match.group())
    print("Starts at:", match.start())
    print("Ends at:", match.end())

In [None]:
# We will now try to match an email address. What is wrong with our regex?
# (Hint: look at the end of the match.) Can you fix it? Try to use \w as a shorthand
regex = re.compile(r'[\.\w]+@[\.\w_\-]+')

#regex = re.compile(r'[\.\w]+@[\w\-]+(\.[\w\-]+)+')
text = "My email is adam.branden_burger034@stern.nyu.edu. You can email me."

matches = regex.finditer(text)
for match in matches:
    print(match.group())

In [None]:
# We are looking for binary numbers
regex = re.compile(r'[10]+')
text = "asddf1101110100011abd1111norm0000"
matches = regex.finditer(text)
for match in matches:
    print(match.group())

In [None]:
# We look for money figures, either integers, or with 1 or 2 decimal
# digits
regex = re.compile(r'\$\d+(\.\d\d?)?')
text = '$1200.23 is the price today. $1200 was the price yesterday'
matches = regex.finditer(text)
for match in matches:
    print(match.group())

In [None]:
# This regex is going to generate no matches in the text below
regex = re.compile(r'Ra*nd.*m R[egex]')
text = "Norman White, Dealing with Data, 212-998-0842, nwhite@stern.nyu.edu"
matches = regex.finditer(text)
for match in matches:
    print(match.group())

# A compiled pattern object can be reused on different strings
regex = re.compile(r'D.*Data')
text = "Norman White, Dealing with Data, 212-998-0842, nwhite@stern.nyu.edu"
matches = regex.finditer(text)
for match in matches:
    print(match.group())
text = "Panos Ipeirotis, Dealing with Data, 212-998-0803, panos@nyu.edu"
matches = regex.finditer(text)
for match in matches:
    print(match.group())

### Flags for regexes: Case-sensitivity and multiline searches

Regular expressions are case-sensitive by default.

In [None]:
# Regular expressions are case-sensitive, so this pattern
# (uppercase 'TE') finds nothing in the text below
regex = re.compile(r'W.*TE,')
text = "Norman White, Dealing with Data, 212-998-0842, nwhite@stern.nyu.edu"
matches = regex.finditer(text)
for match in matches:
    print(match.group())


But we can specify that they are case-insensitive, using the flag re.IGNORECASE

In [None]:
# Unless we specify that they are case-insensitive, using the flag re.IGNORECASE
regex = re.compile(r'W.*TE,', re.IGNORECASE)

text = "Norman White, Dealing with Data, 212-998-0842, nwhite@stern.nyu.edu"
matches = regex.finditer(text)
for match in matches:
    print(match.group())



For a full list of available flags, please see the [re documentation](http://docs.python.org/library/re.html).
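
For multiline searches, the relevant flag is `re.MULTILINE`, which makes `^` and `$` match at the start and end of every line rather than only at the start and end of the whole string. A small sketch, with a made-up sample string:

```python
import re

text = "first line\nsecond line\nthird line"  # hypothetical sample text

# Without re.MULTILINE, ^ matches only at the very start of the string
default_starts = re.findall(r'^\w+', text)

# With re.MULTILINE, ^ also matches right after every newline
multiline_starts = re.findall(r'^\w+', text, re.MULTILINE)

print(default_starts)    # ['first']
print(multiline_starts)  # ['first', 'second', 'third']
```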

### Multiple matches in a string

The `search` method scans the string from left to right and stops as soon as it finds the first match. For example, we will not get the second phone number:

In [None]:
# search returns only the first (leftmost) match, or None if there is no match
regex = re.compile(r'\d{3}-\d{3}-\d{4}')
text = '''
Norman White, Dealing with Data, 
212-998-0842, nwhite@stern.nyu.edu, 646-555-5555
'''
match = regex.search(text)
if match:
    print(match.group())

If we want to find multiple matches within the string, then we use the `finditer` method, which returns an iterator over match objects. (For comparison, `search` returns just the first match object.)

In [None]:
# finditer returns an iterator of match objects, which have a variety of attributes and methods
regex = re.compile(r'\d{3}-\d{3}-\d{4}')
text = "Norman White, Dealing with Data, 212-998-0842, 646-555-5555"
matches = regex.finditer(text)
for m in matches:
    print("Starts at:", m.start(), 
    "Ends at:", m.end(),
    "Content:", m.group())

### Extracting Data -- where regex starts to get really cool

#### Defining groups within regexes

In addition to simple matching and filtering, many regular expression implementations, including Python's `re`, provide a powerful mechanism for extracting meaningful data from raw text. Through capturing, the substrings that satisfy a particular part of a regular expression are extracted from the text being matched and returned to the program processing the raw data.

**The portion of regular expressions that should be captured is surrounded by parentheses, `"( )"`.**

Then, provided the regular expression containing the capturing parentheses matches, the resulting match object exposes the captured text through its `group` method. Groups are indexed from one: `result.group(1)` returns the text matched by the first pair of parentheses, `result.group(2)` the text matched by the second pair, and so on. The value of `result.group(0)` is the entire portion of the input string matched by the regular expression, not just the portions inside the capturing parentheses.

As an example of data extraction using capturing regular expressions, say we're scanning some raw text for phone numbers that we wish to retain for later processing. We might try something like:
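
Before turning to phone numbers, here is a minimal sketch of group indexing on a simpler pattern. The date string and the variable names are hypothetical, chosen only for illustration.

```python
import re

# Hypothetical input: a date in YYYY-MM-DD form embedded in a sentence
regex = re.compile(r'(\d{4})-(\d{2})-(\d{2})')
result = regex.search("Released on 2015-03-14, updated later")

whole = result.group(0)       # the entire match
year = result.group(1)        # first pair of parentheses
month = result.group(2)       # second pair
day = result.group(3)         # third pair
all_groups = result.groups()  # tuple of all captured groups, starting from group 1

print(whole, year, month, day)
print(all_groups)
```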

In [None]:
raw_text = r"""
512-234-5234
foo
bar
124-512-5555
biz
125-555-5785
679-397-5255
2126660921
212-998-0902
888-888-2222
801-555-1211
802 555 1212
803.555.1213
(804) 555-1214
1-805-555-1215
1(806)555-1216
807-555-1217-1234
808-555-1218x1
809-555-1219 ext. 1234
work 1-(810) 555.1220 #1234
"""

In [None]:
raw_text

In [None]:
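# Notice now that each part of the phone is included in parentheses,
# allowing us to grab the individual parts of the phone number:
#   ([2-9]\d{2})  area code: three digits, the first one between 2 and 9
#   (\d{3})       next three digits
#   (\d{4})       last four digits
#   (\d+)?        optional extension
# The separators are [^\d\n]* (non-digits other than newline) rather than \D*,
# because \D also matches '\n' and would let a match spill onto the next line.
# The flag goes in re.compile(): the second argument of a compiled pattern's
# finditer() is a start position, not a flag.
regex = re.compile(r'([2-9]\d{2})[^\d\n]*(\d{3})[^\d\n]*(\d{4})[^\d\n]*(\d+)?.*$', re.MULTILINE)
matches = regex.finditer(raw_text)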

phones = set()
for m in matches:
    area_code = m.group(1)
    first_three_digits = m.group(2)
    last_four_digits =  m.group(3)
    extension = m.group(4)
    
    if extension:
        phone = "(" + area_code + ")" + first_three_digits + \
            "-" + last_four_digits + "x" + extension
    else:
        phone = "(" + area_code + ")" + first_three_digits + \
            "-" + last_four_digits 
    phones.add(phone)

# Notice that our list does not include numbers with invalid area codes (e.g., 124, 125)
phones

(See also http://www.diveintopython.net/regular_expressions/phone_numbers.html if you want to see further examples.)

The examples will look like gobbledygook at first.  But after you go through some actual cases, and especially after you struggle to write a few for a real data science task, you will realize that you're not in Kansas any longer.  Now get ready for a horse of a different color...

#### Exercise

* Modify the code above to allow the extraction of the extension number, if one exists

In [None]:
# your code here

### String Replacement

In addition to matching and extraction, regular expressions can be used to change data, especially unstructured text, in very powerful ways. In particular, regexes allow you to find specific patterns and then replace them with specified strings.

As a data scientist, this is useful when trying to get data formatted correctly as input to a specific system, such as when doing data cleanup.

In Python's `re` library, the function used for replacement is `sub()` (think "substitute").

The pattern for invoking `sub()` is 

`re.sub(regex, replacement, text)`

This will return a version of text where all instances of the regex have been substituted with replacement.

Imagine we want to conceal all phone numbers in a document. We could use the following call to `sub()`:

In [None]:
raw_text = """512-234-5234
foo
bar
124-512-5555
biz
125-555-5785
679-397-5255
2126660921
212-998-0902
888-888-2222
801-555-1211
802 555 1212
803.555.1213
(804) 555-1214
1-805-555-1215
1(806)555-1216
807-555-1217-1234
808-555-1218x1234
809-555-1219 ext. 1234
work 1-(810) 555.1220 #1234
"""

regex = re.compile(r'([2-9][0-9]{2})\W*([0-9]{3})\W*([0-9]{4})')

newstring = re.sub(regex, "XXX-XXX-XXXX", raw_text)

print(newstring)

When performing substitution, the text captured by each group is available to the replacement string as `\1`, `\2`, and so on, corresponding to `group(1)`, `group(2)`, etc.

In order to use this back-referencing capability, we write the replacement as a raw string (with the `r` prefix), so that Python passes the backslashes through to `sub()` instead of interpreting `\1` as an escape sequence. For instance, if we want to make sure all phone numbers are normalized and all area codes are surrounded by parentheses, we can use:

In [None]:
print(re.sub(regex, r"(\1)-\2-\3", raw_text))
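
The following sketch illustrates why the `r` prefix on the replacement matters. The pattern, sample text, and variable names are hypothetical and independent of the cells above.

```python
import re

phone_regex = re.compile(r'(\d{3})-(\d{3})-(\d{4})')
text = "Call 212-998-0842 today"  # hypothetical sample text

# Raw string: \1 reaches sub() intact and is read as "contents of group 1"
good = phone_regex.sub(r'(\1) \2-\3', text)

# Regular string: Python converts '\1' into the control character chr(1)
# before sub() ever sees it, so no back-reference takes place
bad = phone_regex.sub('(\1) \2-\3', text)

print(repr(good))
print(repr(bad))
```

Printing with `repr` makes the stray control characters in the second result visible.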

#### Exercise 1

The webpage at `http://www.stern.nyu.edu/faculty/search_name_form/` contains the contact emails for all the Stern faculty members. Write code that will allow you to extract all the emails that appear in the page. Just for your convenience, the code below will fetch the page, and store the HTML source in the variable `html`.

Then you will need to write the right regex and write the code that finds emails in the retrieved html.

In [None]:
import requests
url = 'http://www.stern.nyu.edu/faculty/search_name_form'
response = requests.get(url)
html = response.text
html

# Your code here
# You want to write a regular expression that will find all the email addresses that appear in the html
# variable, and store the emails in a list. You may also want to write the list of emails in a text file.

#### Exercise 2

* The webpage at `http://www.nasdaq.com/screening/companies-by-name.aspx?letter=A` contains the list of all tickers at the NASDAQ exchange that start with the letter `A`. Inspect the HTML and figure out the pattern used to refer to a ticker (hint: you will see URLs of the form `http://www.nasdaq.com/symbol/....`).
* Write regular expressions to extract the tickers that appear in a web page
* Write code for iterating over all pages of NASDAQ for all the different letters
* Write code for going over multiple pages within the same letter. (optional)

In [None]:
# your code here


#### Solution for Exercise 1

In [None]:
# Email regex
regex = re.compile(r'\w+@(\w+\.)+\w+')

# We can create either a list or a set, but let's avoid duplicates
emails = set()

# Fetch the HTML source
url = 'http://www.stern.nyu.edu/faculty/search_name_form'
html = requests.get(url).text

# Find matches
matches = regex.finditer(html)
# Go through matches and add them in our result set
for m in matches:
    emails.add(m.group())

sorted(emails)

In [None]:
# and let's make it very compact using a set comprehension
import requests
url = 'http://www.stern.nyu.edu/faculty/search_name_form'
html = requests.get(url).text
regex = re.compile(r'\w+@(\w+\.)+\w+')
emails = {m.group() for m in regex.finditer(html)}
emails

#### Solution for Exercise 2

In [None]:
import requests

alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
tickers = set()
for letter in alphabet:
    url = 'http://www.nasdaq.com/screening/companies-by-name.aspx?letter='+letter
    print(url)
    html = requests.get(url).text
    
    # The code below extracts the number of pages for each letter
    # of the alphabet. Potentially we can use that number to
    # iterate over all the pages in NASDAQ. Left as an exercise
    # for the interested reader :-)
    pages_regex = r'Displaying.*of.*<b>(\d+)</b>.*results'
    pregex = re.compile(pages_regex)
    for m in pregex.finditer(html):
        num_results = int(m.group(1))
        print("Results:", num_results)
        # Ceiling division, assuming 50 results per page
        num_pages = (num_results + 49) // 50
        print("Letter", letter, "needs", num_pages, "pages")
    
    # Dots are escaped so that they match only a literal '.'
    ticker_regex = r'http://www\.nasdaq\.com/symbol/(\w+)'
    regex = re.compile(ticker_regex)
    matches = regex.finditer(html)
    for m in matches:
        ticker = m.group(1).upper()
        #print("URL:", m.group())
        #print("Ticker:", ticker)
        tickers.add(ticker)
    print("We have ", len(tickers), "tickers")

tickers