Regular Expressions
-------------------

This document will present basic regular expression syntax and cover common use cases for regular expressions: pattern matching, filtering, data extraction, and string replacement. 

We will present examples using Python’s standard [`re` regular expression library](http://docs.python.org/library/re.html).

You may also want to look at this [*excellent* tutorial from Google](https://developers.google.com/edu/python/regular-expressions).


### Searching strings using regexes

In [None]:
# first import the library
import re

In [None]:
# Regular expressions are compiled into pattern objects
regex = re.compile(r'D.*Data')
text = "Panos Ipeirotis, Dealing with Data, 212-998-0803, panos@nyu.edu"
matches = regex.finditer(text)
for match in matches:
    print(match.group())

In [None]:
# We will now try to match an email address. What is wrong in our regex? 
# Can you fix it? Try to use \w as a shorthand
regex = re.compile(r'\w+@\w+')
text = "My email is adam.brandenburger@stern.nyu.edu. You can email me."

matches = regex.finditer(text)
for match in matches:
    print(match.group())
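
One possible fix, offered as a sketch rather than the intended answer: `\w` matches letters, digits, and underscores but not the dot, so the original pattern stops at the first dot on either side of the `@`. The version below allows dots before the `@` and requires a dotted domain after it. The variable names are illustrative, and real-world email addresses can contain characters this pattern does not cover.

```python
import re

text = "My email is adam.brandenburger@stern.nyu.edu. You can email me."

# Original pattern: \w does not match '.', so the match is cut short
naive = re.compile(r'\w+@\w+')
naive_matches = [m.group() for m in naive.finditer(text)]

# Sketch of a fix: allow dots before the @, and require one or more
# ".label" pieces after the first domain label.
# (?: ... ) groups the pieces without capturing them.
fixed = re.compile(r'[\w.]+@\w+(?:\.\w+)+')
fixed_matches = [m.group() for m in fixed.finditer(text)]

print(naive_matches)  # ['brandenburger@stern']
print(fixed_matches)  # ['adam.brandenburger@stern.nyu.edu']
```

Requiring a word character after every dot in the domain keeps the sentence-ending period out of the match.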

In [None]:
# We are looking for binary numbers
regex = re.compile(r'[10]+')
text = "asddf1101110100011abd1111panos0000"
matches = regex.finditer(text)
for match in matches:
    print(match.group())

In [None]:
# We look for money figures, either integers, or with 1 or 2 decimal
# digits
regex = re.compile(r'\$\d+(\.\d\d?)?')
text = '$1200.23 is the price today. $1200 was the price yesterday'
matches = regex.finditer(text)
for match in matches:
    print(match.group())

In [None]:
# This code is going to generate no matches
regex = re.compile(r'Ra*nd.*m R[egex]')
text = "Panos Ipeirotis, Dealing with Data, 212-998-0803, panos@nyu.edu"
matches = regex.finditer(text)
for match in matches:
    print(match.group())

### Flags for regexes: Case-sensitivity and multiline searches

Regular expressions are typically case-sensitive. 

In [None]:
# Regular expressions are case-sensitive, so this pattern finds no match:
# the text never contains an uppercase "IS"
regex = re.compile(r'P.*IS')
text = "Panos Ipeirotis, Dealing with Data, 212-998-0803, panos@nyu.edu"
matches = regex.finditer(text)
for match in matches:
    print(match.group())

But we can make them case-insensitive by passing the flag `re.IGNORECASE`.

In [None]:
# Unless we specify that they are case-insensitive, using the flag re.IGNORECASE
regex = re.compile(r'P.*IS', re.IGNORECASE)
text = "Panos Ipeirotis, Dealing with Data, 212-998-0803, panos@nyu.edu"
matches = regex.finditer(text)
for match in matches:
    print(match.group())

For a full list of available flags, please see the [re documentation](http://docs.python.org/library/re.html).
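
As a minimal sketch of a multiline search: by default `^` matches only at the very start of the string, while the `re.MULTILINE` flag lets it match at the start of every line. The sample text below is made up for illustration.

```python
import re

# Made-up three-line sample text
text = "Panos Ipeirotis\nDealing with Data\npanos@nyu.edu"

# Without the flag, ^ anchors only at the beginning of the whole string
first_words_default = re.findall(r'^\w+', text)

# With re.MULTILINE, ^ also anchors right after each newline
first_words_multiline = re.findall(r'^\w+', text, re.MULTILINE)

print(first_words_default)    # ['Panos']
print(first_words_multiline)  # ['Panos', 'Dealing', 'panos']
```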

### Multiple matches in a string

The `search` method scans the string and stops at the first position where the regex matches, so by itself it would not return the second phone number in the text below. The `finditer` method keeps scanning after each match, which is why the code below prints both numbers.

In [None]:
# finditer scans the string from left to right and yields every
# non-overlapping match, so both phone numbers are printed
regex = re.compile(r'\d{3}-\d{3}-\d{4}')
text = '''
Panos Ipeirotis, Dealing with Data, 
212-998-0803, panos@nyu.edu, 646-555-5555
'''
matches = regex.finditer(text)
for i, match in enumerate(matches):
    print(i+1, "==>", match.group())

If we want to find multiple matches within the string, we use the `finditer` method, which returns an iterator over match objects. (For comparison, `search` returns just the first match object, or `None` if there is no match.)
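
The difference between the two methods can be sketched as follows; the variable names `first_number` and `all_numbers` are illustrative.

```python
import re

regex = re.compile(r'\d{3}-\d{3}-\d{4}')
text = "Panos Ipeirotis, Dealing with Data, 212-998-0803, panos@nyu.edu, 646-555-5555"

# search returns a single match object (or None), for the first match only
first = regex.search(text)
first_number = first.group() if first else None

# finditer yields one match object per non-overlapping match
all_numbers = [m.group() for m in regex.finditer(text)]

print(first_number)  # 212-998-0803
print(all_numbers)   # ['212-998-0803', '646-555-5555']
```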

In [None]:
# finditer returns an iterator over match objects, which have a variety of methods
regex = re.compile(r'\d{3}-\d{3}-\d{4}')
text = "Panos Ipeirotis, Dealing with Data, 212-998-0803, panos@nyu.edu, 646-555-5555"
matches = regex.finditer(text)
for m in matches:
    print("Starts at:", m.start(), 
    "Ends at:", m.end(),
    "Content:", m.group())

### Extracting Data -- where regexes start to get really cool

#### Defining groups within regexes

In addition to simple matching and filtering, many regular expression implementations, including Python’s `re`, provide a powerful mechanism for extracting meaningful data from raw text. Through capturing, the substrings that match a particular part of a regular expression are extracted from the text and returned to the program processing the raw data.

**The portion of regular expressions that should be captured is surrounded by parentheses, `"( )"`.**

Then, provided the regular expression matches, the resulting match object gives access to the captured text through its `group` method. Groups are numbered from one, in the order of their opening parentheses: `result.group(1)` returns the text matched by the first pair of parentheses, `result.group(2)` the text matched by the second, and so on. `result.group(0)` returns the entire portion of the input string matched by the regular expression, not just the portions inside the capturing parentheses.
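
A minimal sketch of the group numbering, using a made-up input string and a deliberately simple pattern:

```python
import re

# Two capturing groups: the part before the @ and the first domain label
result = re.search(r'(\w+)@(\w+)\.edu', "Contact: panos@nyu.edu")

whole = result.group(0)     # the entire match
user = result.group(1)      # text captured by the first parentheses
domain = result.group(2)    # text captured by the second parentheses
captured = result.groups()  # all captured groups as a tuple

print(whole)     # panos@nyu.edu
print(user)      # panos
print(domain)    # nyu
print(captured)  # ('panos', 'nyu')
```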

As an example of data extraction using capturing regular expressions, say we’re scanning some raw text for phone numbers that we wish to retain for later processing. We might try something like:

In [None]:
# Find phone numbers: 
# Three digits \d{3}
# followed by zero or more non-digits \D*
# followed by three digits \d{3}
# followed by zero or more non-digits \D*
# followed by four digits \d{4}

# The re.VERBOSE flag at the end allows us to write the regex as a multiline string 
# and allows for comments (after the # character)
# In this mode, any whitespace character is ignored, unless explicitly added as part
# of a bracketed expression or when preceded by an unescaped backslash

regex = re.compile(r"""(\d{3}) # The first three digits / area code
                       \D*     # Followed by zero or more non-digits
                       (\d{3}) # The first three digits of the "local" part 
                       \D*     # Followed by zero or more non-digits
                       (\d{4}) # The last four digits of the phone number
                       """, re.VERBOSE)
text = '''
Panos Ipeirotis, Dealing with Data,
tel: 212-998-0803
email: panos@nyu.edu
fax: 646-255-5555
'''

matches = regex.finditer(text)
for match in matches:
    print(match.group())
    print("Formatted:", match.group(1),"-", match.group(2), "-", match.group(3))
    # print("Starts at:", match.start())
    # print("Ends at:", match.end())
    print("===========")

Now we will try to extract and format all phone numbers that are part of a big file:

In [None]:
raw_text = """
512-234-5234
foo
bar
124-512-5555
biz
125-555-5785
679-397-5255
2126660921
212-998-0902
888-888-2222
801-555-1211
802 555 1212
803.555.1213
(804) 555-1214
1-805-555-1215
1(806)555-1216
807-555-1217-1234
808-555-1218x1
809-555-1219 ext. 1234
work 1-(810) 555.1220 #1234
"""

In [None]:
raw_text

In [None]:
# Notice now that each part of the phone number is enclosed in parentheses,
# allowing us to grab the individual parts of the phone number
regex = re.compile(r'([2-9]\d{2})\D*(\d{3})\D*(\d{4})')
matches = regex.finditer(raw_text)

phones = list()
for m in matches:
    area_code = m.group(1)
    first_three_digits = m.group(2)
    last_four_digits =  m.group(3)
    
    phone = "(" + area_code + ")" + first_three_digits + "-" + last_four_digits
            
    phones.append(phone)

# Notice that our list does not include numbers with invalid area codes (e.g., 124, 125)
phones

### String Replacement

In addition to matching and extraction, regular expressions can be used to change data, especially unstructured text, in very powerful ways. In particular, regexes allow you to find specific patterns and then replace them with specified strings.

As a data scientist, you will find this useful when getting data formatted correctly as input to a specific system, such as when doing data cleanup.

In Python’s `re` library, the function used for replacement is `sub()` (think "substitute").

The pattern for invoking `sub()` is 

`re.sub(regex, replacement, text)`

This will return a version of text where all instances of the regex have been substituted with replacement.

Imagine we want to conceal all phone numbers in a document. We could use the following call to `sub()`:

In [None]:
raw_text = """512-234-5234
foo
bar
124-512-5555
biz
125-555-5785
679-397-5255
2126660921
212-998-0902
888-888-2222
801-555-1211
802 555 1212
803.555.1213
(804) 555-1214
1-805-555-1215
1(806)555-1216
807-555-1217-1234
808-555-1218x1234
809-555-1219 ext. 1234
work 1-(810) 555.1220 #1234
"""

regex = re.compile(r'([2-9]\d{2})\D*(\d{3})\D*(\d{4})')

newstring = re.sub(regex, "XXX-XXX-XXXX", raw_text)

print(newstring)

When performing substitution, the text captured by each group is available to the replacement string as `\1`, `\2`, and so on, corresponding to `group(1)`, `group(2)`, etc.

To use these backreferences, write the replacement as a raw string (`r"..."`), so that Python passes the backslashes through to `sub()` instead of interpreting `\1` as an escape sequence. For instance, if we want to make sure all phone numbers are normalized and all area codes are surrounded by parentheses, we can use:

In [None]:
print(re.sub(regex, r"(\1)-\2-\3", raw_text))
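
To illustrate why the raw string matters, here is a small sketch with a made-up input. In an ordinary string literal, Python reads `\1` as an octal escape for the character with code 1, so `sub()` never sees a backreference.

```python
import re

regex = re.compile(r'(\d{3})-(\d{3})-(\d{4})')
text = "tel: 212-998-0803"

# Raw string: sub() receives the backslashes and expands the backreferences
good = regex.sub(r'(\1) \2-\3', text)

# Ordinary string: '\1', '\2', '\3' become control characters before sub() runs
bad = regex.sub('(\1) \2-\3', text)

print(good)       # tel: (212) 998-0803
print(repr(bad))  # 'tel: (\x01) \x02-\x03'
```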

#### Exercise 1

The webpage at `http://www.stern.nyu.edu/faculty/search_name_form/` contains the contact emails for all the Stern faculty members. Write code that will allow you to extract all the emails that appear in the page. Just for your convenience, the code below will fetch the page, and store the HTML source in the variable `html`.

Then you will need to write the right regex and write the code that finds emails in the retrieved html.

In [None]:
import requests
url = 'http://www.stern.nyu.edu/faculty/search_name_form'
response = requests.get(url)
html = response.text
html

In [None]:
# Find occurrences of the pattern in the HTML source

# You want to write a regular expression that will find all the email addresses that appear in the html
# variable, and store the emails in a list. You may also want to write the list of emails in a text file.
pattern = r'YOUR PATTERN HERE'
regex = re.compile(pattern)
matches = regex.finditer(html)
for m in matches:
    ... #YOUR CODE HERE



#### Solution for Exercise 1

In [None]:
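# Email regex: word characters, an @, then a dotted domain.
# Note that \w does not match '.', so for an address such as first.last@...
# only the part after the last dot before the @ is captured
regex = re.compile(r'\w+@(\w+\.)+\w+')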

# We can create either a list or a set, but let's avoid duplicates
emails = set()

# Fetch the HTML source
url = 'http://www.stern.nyu.edu/faculty/search_name_form'
html = requests.get(url).text

# Find matches
matches = regex.finditer(html)
# Go through matches and add them in our result set
for m in matches:
    emails.add(m.group())

sorted(emails)

In [None]:
# and let's make it very compact using a set comprehension
import requests
url = 'http://www.stern.nyu.edu/faculty/search_name_form'
html = requests.get(url).text
regex = re.compile(r'\w+@(\w+\.)+\w+')
emails = {m.group() for m in regex.finditer(html)}
emails
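
The exercise also suggests writing the emails to a text file. A minimal sketch of that step follows. It assumes a small made-up HTML snippet in place of the fetched page so that it runs offline; the addresses, the file name `stern_emails.txt`, and its location in the temporary directory are all hypothetical.

```python
import os
import re
import tempfile

# Hypothetical stand-in for the fetched page (addresses are invented)
html = '''
<a href="mailto:jdoe@stern.nyu.edu">jdoe@stern.nyu.edu</a>
<a href="mailto:asmith@stern.nyu.edu">asmith@stern.nyu.edu</a>
'''

regex = re.compile(r'\w+@(\w+\.)+\w+')

# The set comprehension removes duplicates; sorted() gives a stable order
emails = sorted({m.group() for m in regex.finditer(html)})

# Write one address per line; the path is an arbitrary choice
out_path = os.path.join(tempfile.gettempdir(), 'stern_emails.txt')
with open(out_path, 'w') as f:
    for email in emails:
        f.write(email + '\n')

print(emails)  # ['asmith@stern.nyu.edu', 'jdoe@stern.nyu.edu']
```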