
# Regular Expressions
This document will present basic regular expression syntax and cover common use cases for regular expressions: pattern matching, filtering, data extraction, and string replacement.

We will present examples using re, the regular expression module in Python’s standard library.

You may also want to look at this excellent tutorial from Google, https://developers.google.com/edu/python/regular-expressions, and at the reference documentation for the re module, http://docs.python.org/library/re.html

## Searching strings using regexes

In [1]:
# first import the library
import re

In [2]:
# Regular expressions are compiled into pattern objects
regex = re.compile(r'D.*Data')
text = "Panos Ipeirotis, Dealing with Data, 212-998-0803, panos@nyu.edu"
matches = regex.finditer(text)
for match in matches:
    print(match.group())

Dealing with Data
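
Compiling a pattern first is a convenience rather than a requirement. As a minimal sketch, the snippet below runs the same pattern once through a compiled pattern object and once through the module-level function re.search; the variable names compiled_result and direct_result are illustrative only.

```python
import re

text = "Panos Ipeirotis, Dealing with Data, 212-998-0803, panos@nyu.edu"

# Option 1: compile once, then reuse the pattern object
regex = re.compile(r'D.*Data')
compiled_result = regex.search(text).group()

# Option 2: pass the pattern string directly to the module-level function
direct_result = re.search(r'D.*Data', text).group()

print(compiled_result)                   # Dealing with Data
print(compiled_result == direct_result)  # True
```

Compiling is mostly worthwhile when the same pattern is reused many times, as in the cells of this document.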


In [3]:
# We will now try to match an email address. What is wrong in our regex?
# Can you fix it? Try to use \w as a shorthand
regex = re.compile(r'\w+@\w+')
text = "My email is adam.brandenburger@stern.nyu.edu. You can email me."

matches = regex.finditer(text)
for match in matches:
    print(match.group())

brandenburger@stern


In [6]:
# fix
regex = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')
text = "My email is adam.brandenburger@stern.nyu.edu. You can email me."

matches = regex.finditer(text)
for match in matches:
    print(match.group())

adam.brandenburger@stern.nyu.edu


In [7]:
# We are looking for binary numbers
regex = re.compile(r'[10]+')
text = "asddf1101110100011abd1111panos0000"
matches = regex.finditer(text)
for match in matches:
    print(match.group())

1101110100011
1111
0000


In [8]:
# We look for money figures, either integers, or with 1 or 2 decimal
# digits
regex = re.compile(r'\$\d+(\.\d\d?)?')
text = '$1200.23 is the price today. $1200 was the price yesterday'
matches = regex.finditer(text)
for match in matches:
    print(match.group())

$1200.23
$1200


In [9]:
# This code is going to generate no matches
regex = re.compile(r'Ra*nd.*m R[egex]')
text = "Panos Ipeirotis, Dealing with Data, 212-998-0803, panos@nyu.edu"
matches = regex.finditer(text)
for match in matches:
    print(match.group())
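
When nothing matches, finditer yields no items and search returns None. The sketch below reuses the pattern and text from the cell above to show how both cases can be checked; the variable name all_matches is illustrative.

```python
import re

regex = re.compile(r'Ra*nd.*m R[egex]')
text = "Panos Ipeirotis, Dealing with Data, 212-998-0803, panos@nyu.edu"

# search returns None when nothing matches, so test the result before using it
match = regex.search(text)
if match is None:
    print("No match found")
else:
    print(match.group())

# finditer yields nothing, so the list of matches is empty
all_matches = [m.group() for m in regex.finditer(text)]
print(all_matches)  # []
```

Calling match.group() without the None check would raise an AttributeError here.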

## Flags for regexes: Case-sensitivity and multiline searches
Regular expressions are case-sensitive by default.

In [10]:
# Regular expressions are compiled into pattern objects
# Regular expressions are case-sensitive
regex = re.compile(r'P.*IS')
text = "Panos Ipeirotis, Dealing with Data, 212-998-0803, panos@nyu.edu"
matches = regex.finditer(text)
for match in matches:
    print(match.group())

## But we can specify that they are case-insensitive, using the flag re.IGNORECASE

Full list of available flags: http://docs.python.org/library/re.html

In [11]:
# Unless we specify that they are case-insensitive, using the flag re.IGNORECASE
regex = re.compile(r'P.*IS', re.IGNORECASE)
text = "Panos Ipeirotis, Dealing with Data, 212-998-0803, panos@nyu.edu"
matches = regex.finditer(text)
for match in matches:
    print(match.group())

Panos Ipeirotis
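
The other flag named in the section heading is re.MULTILINE. As a small sketch on a made-up three-line string, it changes ^ so that it matches at the start of every line rather than only at the start of the whole string.

```python
import re

text = """Panos Ipeirotis
Dealing with Data
212-998-0803"""

regex_default = re.compile(r'^\w+')
regex_multiline = re.compile(r'^\w+', re.MULTILINE)

# Without the flag, ^ matches only at the very start of the string
default_matches = [m.group() for m in regex_default.finditer(text)]

# With re.MULTILINE, ^ also matches right after each newline
multiline_matches = [m.group() for m in regex_multiline.finditer(text)]

print(default_matches)    # ['Panos']
print(multiline_matches)  # ['Panos', 'Dealing', '212']
```

Flags can be combined with the | operator, for example re.IGNORECASE | re.MULTILINE.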


## Multiple matches in a string
The search method scans the string from left to right and stops as soon as it finds the first location where the regex matches. With search alone, for example, we would not get the second phone number. The finditer method used below keeps scanning and returns every non-overlapping match.

In [13]:
# finditer scans the string from left to right and yields the first match,
# then it continues with the second one, and so on
regex = re.compile(r'\d{3}-\d{3}-\d{4}')
text = '''
Panos Ipeirotis, Dealing with Data,
212-998-0803, panos@nyu.edu, 646-555-5555
Dealing with Data,
212-998-0222
'''
matches = regex.finditer(text)
for i, match in enumerate(matches):
    print(i+1, "==>", match.group())

1 ==> 212-998-0803
2 ==> 646-555-5555
3 ==> 212-998-0222


If we want to find multiple matches within the string, we use the finditer method, which returns an iterator over match objects. (For comparison, search returns just the first match object, or None if there is no match.)

In [14]:
# finditer returns an iterator of match objects, which have a variety of methods and attributes
regex = re.compile(r'\d{3}-\d{3}-\d{4}')
text = "Panos Ipeirotis, Dealing with Data, 212-998-0803, panos@nyu.edu, 646-555-5555"
matches = regex.finditer(text)
for m in matches:
    print("Starts at:", m.start(),
    "Ends at:", m.end(),
    "Content:", m.group())

Starts at: 36 Ends at: 48 Content: 212-998-0803
Starts at: 65 Ends at: 77 Content: 646-555-5555
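
To make the difference between search and finditer concrete, here is a short side-by-side sketch on the same text; the variable names first and all_numbers are illustrative.

```python
import re

regex = re.compile(r'\d{3}-\d{3}-\d{4}')
text = "Panos Ipeirotis, Dealing with Data, 212-998-0803, panos@nyu.edu, 646-555-5555"

# search stops at the first match
first = regex.search(text)
print(first.group())  # 212-998-0803

# finditer keeps going and yields every non-overlapping match
all_numbers = [m.group() for m in regex.finditer(text)]
print(all_numbers)    # ['212-998-0803', '646-555-5555']
```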


## Extracting Data: where regexes start to get really cool
**Defining groups within regexes**

In addition to simple matching and filtering, many regular expression implementations, including Python’s re, provide a powerful mechanism for extracting meaningful data from raw text. Through capturing, the strings that satisfy a particular part of a regular expression are extracted from the text being matched and returned to the program processing the raw data.

**The portion of regular expressions that should be captured is surrounded by parentheses, "( )".**

Then, whenever the regular expression matches, the resulting match object stores the text matched by each capturing group, and its group method retrieves it. Groups are indexed from one: result.group(1) returns the text matched by the first pair of parentheses, result.group(2) the text matched by the second pair, and so on. result.group(0) returns the entire portion of the input string matched by the regular expression, not just the portions inside the capturing parentheses.
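
A minimal sketch of this indexing, using a deliberately simplified email pattern and a made-up input string:

```python
import re

# Two capturing groups: the user name and the institution
regex = re.compile(r'(\w+)@(\w+)\.edu')
result = regex.search("Contact: panos@nyu.edu")

print(result.group(0))  # panos@nyu.edu  (the whole match)
print(result.group(1))  # panos          (first parentheses)
print(result.group(2))  # nyu            (second parentheses)
print(result.groups())  # ('panos', 'nyu')
```

The groups method returns all captured groups at once as a tuple, which is handy when there are several of them.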

As an example of data extraction using capturing regular expressions, say we’re scanning some raw text for phone numbers that we wish to retain for later processing. We might try something like:

In [15]:
# Find phone numbers:
# Three digits \d{3}
# followed by zero or more non-digits \D*
# followed by three digits \d{3}
# followed by zero or more non-digits \D*
# followed by four digits \d{4}

# The re.VERBOSE flag at the end allows us to write the regex as a multiline string
# and allows for comments (after the # character)
# In this mode, any whitespace character is ignored, unless explicitly added as part
# of a bracketed expression or when preceded by an unescaped backslash

regex = re.compile(r"""(\d{3}) # The first three digits / area code
                       \D*     # Followed by zero or more non-digits
                       (\d{3}) # The first three digits of the "local" part
                       \D*     # Followed by zero or more non-digits
                       (\d{4}) # The last four digits of the phone number
                       """, re.VERBOSE)
text = '''
Panos Ipeirotis, Dealing with Data,
tel: 212-998-0803
email: panos@nyu.edu
fax: 646-255-5555
'''

matches = regex.finditer(text)
for match in matches:
    print(match.group())
    print("Formatted:", match.group(1),"-", match.group(2), "-", match.group(3))
    # print("Starts at:", match.start())
    # print("Ends at:", match.end())
    print("===========")

212-998-0803
Formatted: 212 - 998 - 0803
===========
646-255-5555
Formatted: 646 - 255 - 5555
===========
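
The whitespace rule for re.VERBOSE described in the comments above can be checked with a tiny sketch. The three pattern names below are illustrative only.

```python
import re

# In verbose mode the space in the pattern is ignored, so this is just 'ab'
ignored_space = re.compile(r"a b", re.VERBOSE)

# An escaped space, or a space inside brackets, is kept as a literal space
escaped_space = re.compile(r"a\ b", re.VERBOSE)
bracketed_space = re.compile(r"a[ ]b", re.VERBOSE)

print(ignored_space.search("ab") is not None)      # True
print(ignored_space.search("a b") is not None)     # False
print(escaped_space.search("a b") is not None)     # True
print(bracketed_space.search("a b") is not None)   # True
```

This is why the phone number pattern above can be spread over several indented lines without the indentation becoming part of the regex.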


In [16]:
#Now we will try to extract and format all phone numbers that are part of a big file:

raw_text = """
512-234-5234
foo
bar
124-512-5555
biz
125-555-5785
679-397-5255
2126660921
212-998-0902
888-888-2222
801-555-1211
802 555 1212
803.555.1213
(804) 555-1214
1-805-555-1215
1(806)555-1216
807-555-1217-1234
808-555-1218x1
809-555-1219 ext. 1234
work 1-(810) 555.1220 #1234
"""

In [17]:
raw_text

'\n512-234-5234\nfoo\nbar\n124-512-5555\nbiz\n125-555-5785\n679-397-5255\n2126660921\n212-998-0902\n888-888-2222\n801-555-1211\n802 555 1212\n803.555.1213\n(804) 555-1214\n1-805-555-1215\n1(806)555-1216\n807-555-1217-1234\n808-555-1218x1\n809-555-1219 ext. 1234\nwork 1-(810) 555.1220 #1234\n'

In [18]:
# Notice now that each part of the phone number is enclosed in parentheses,
# allowing us to grab the individual parts of the phone number
regex = re.compile(r'([2-9]\d{2})\D*(\d{3})\D*(\d{4})')
matches = regex.finditer(raw_text)

phones = []
for m in matches:
    area_code = m.group(1)
    first_three_digits = m.group(2)
    last_four_digits = m.group(3)

    # Rebuild the number in the form (AAA)BBB-CCCC
    phone = "(" + area_code + ")" + first_three_digits + "-" + last_four_digits

    phones.append(phone)

# Notice that our list does not include numbers with invalid area codes (e.g., 124, 125)
phones

['(512)234-5234',
 '(679)397-5255',
 '(212)666-0921',
 '(212)998-0902',
 '(888)888-2222',
 '(801)555-1211',
 '(802)555-1212',
 '(803)555-1213',
 '(804)555-1214',
 '(805)555-1215',
 '(806)555-1216',
 '(807)555-1217',
 '(808)555-1218',
 '(809)555-1219',
 '(810)555-1220']

## String Replacement
In addition to matching and extraction, regular expressions can be used to change data, especially unstructured text, in very powerful ways. In particular, regexes allow you to find specific patterns and then replace them with specified strings.

As a data scientist, you will find this useful when getting data formatted correctly as input to a specific system, such as when doing data cleanup.

In Python’s re library, the function used for replacement is sub() (think "substitute").

The pattern for invoking sub() is

re.sub(regex, replacement, text)

This will return a version of text where all instances of the regex have been substituted with replacement.
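
As a quick sketch of this call on a made-up sentence, the snippet below replaces every run of digits with a placeholder. It also uses the optional count argument of re.sub to limit the number of replacements.

```python
import re

text = "Call 212-998-0803 or 646-555-5555 today"

# Replace every run of digits with the placeholder '#'
masked = re.sub(r'\d+', '#', text)
print(masked)       # Call #-#-# or #-#-# today

# The optional count argument limits how many replacements are made
masked_once = re.sub(r'\d+', '#', text, count=1)
print(masked_once)  # Call #-998-0803 or 646-555-5555 today
```

Note that sub() returns a new string; the original text is left unchanged.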

Imagine we want to conceal all phone numbers in a document. We could use the following call to sub():

In [19]:
raw_text = """512-234-5234
foo
bar
124-512-5555
biz
125-555-5785
679-397-5255
2126660921
212-998-0902
888-888-2222
801-555-1211
802 555 1212
803.555.1213
(804) 555-1214
1-805-555-1215
1(806)555-1216
807-555-1217-1234
808-555-1218x1234
809-555-1219 ext. 1234
work 1-(810) 555.1220 #1234
"""

regex = re.compile(r'([2-9]\d{2})\D*(\d{3})\D*(\d{4})')

newstring = re.sub(regex, "XXX-XXX-XXXX", raw_text)

print(newstring)

XXX-XXX-XXXX
foo
bar
124-512-5555
biz
125-555-5785
XXX-XXX-XXXX
XXX-XXX-XXXX
XXX-XXX-XXXX
XXX-XXX-XXXX
XXX-XXX-XXXX
XXX-XXX-XXXX
XXX-XXX-XXXX
(XXX-XXX-XXXX
1-XXX-XXX-XXXX
1(XXX-XXX-XXXX
XXX-XXX-XXXX-1234
XXX-XXX-XXXXx1234
XXX-XXX-XXXX ext. 1234
work 1-(XXX-XXX-XXXX #1234



When performing substitution, matches found using the capturing mechanism are available to the replacement using \1, \2, and so on, as shortcuts to group(1), group(2), etc.

To use this back-referencing capability, we write the replacement as a raw string (with the r prefix), so that Python passes \1, \2, etc. to sub() unchanged instead of interpreting them as string escape sequences. For instance, if we want to make sure all phone numbers are normalized and all area codes are surrounded by parentheses, we can use:

In [20]:
print(re.sub(regex, r"(\1)-\2-\3", raw_text))


(512)-234-5234
foo
bar
124-512-5555
biz
125-555-5785
(679)-397-5255
(212)-666-0921
(212)-998-0902
(888)-888-2222
(801)-555-1211
(802)-555-1212
(803)-555-1213
((804)-555-1214
1-(805)-555-1215
1((806)-555-1216
(807)-555-1217-1234
(808)-555-1218x1234
(809)-555-1219 ext. 1234
work 1-((810)-555-1220 #1234
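
The r prefix on the replacement string matters. The sketch below, on a made-up one-line input, contrasts a raw replacement string with a non-raw one; the names good and bad are illustrative.

```python
import re

regex = re.compile(r'(\d{3})-(\d{3})-(\d{4})')
text = "tel: 212-998-0803"

# Raw string: the backslashes reach sub(), which expands \1, \2, \3
good = regex.sub(r"(\1) \2-\3", text)
print(good)       # tel: (212) 998-0803

# Non-raw string: Python turns "\1" into the control character chr(1)
# before sub() ever sees it, so no group is substituted
bad = regex.sub("(\1)", text)
print(repr(bad))  # 'tel: (\x01)'
```

Using raw strings for both the pattern and the replacement avoids this class of silent error.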



The webpage at http://www.stern.nyu.edu/faculty/search_name_form/ contains the contact emails for all the Stern faculty members. Write code that will allow you to extract all the emails that appear in the page. Just for your convenience, the code below will fetch the page, and store the HTML source in the variable html.

Then you will need to write the right regex and write the code that finds emails in the retrieved html.

In [21]:
import requests
url = 'http://www.stern.nyu.edu/faculty/search_name_form'
response = requests.get(url)
html = response.text
html

'<!DOCTYPE html>\n<html lang="en">\n<head profile="https://www.w3.org/1999/xhtml/vocab">\n<meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n  <meta charset="utf-8" /><script type="text/javascript">(window.NREUM||(NREUM={})).loader_config={licenseKey:"5b89eb49f4",applicationID:"65778391"};window.NREUM||(NREUM={}),__nr_require=function(n,e,t){function r(t){if(!e[t]){var i=e[t]={exports:{}};n[t][0].call(i.exports,function(e){var i=n[t][1][e];return r(i||e)},i,i.exports)}return e[t].exports}if("function"==typeof __nr_require)return __nr_require;for(var i=0;i<t.length;i++)r(t[i]);return r}({1:[function(n,e,t){function r(){}function i(n,e,t){return function(){return o(n,[u.now()].concat(f(arguments)),e?null:this,t),e?void 0:this}}var o=n("handle"),a=n(4),f=n(5),c=n("ee").get("tracer"),u=n("loader"),s=NREUM;"undefined"==typeof window.newrelic&&(newrelic=s);var p=["setPageViewName","setCustomAttribute","setErrorHandler","finished","addToTrace","inlineHit","addRelease"],d="ap

In [23]:
# Email regex
regex = re.compile(r'\w+@(\w+\.)+\w+')

# We can create either a list or a set, but let's avoid duplicates
emails = set()

# Fetch the HTML source
url = 'http://www.stern.nyu.edu/faculty/search_name_form'
html = requests.get(url).text

# Find matches
matches = regex.finditer(html)
# Go through matches and add them in our result set
for m in matches:
    emails.add(m.group())


sorted(emails)

['Eggers@stern.nyu.edu',
 'Kuehlwein@stern.nyu.edu',
 'ab4775@stern.nyu.edu',
 'ab70@stern.nyu.edu',
 'ab862@stern.nyu.edu',
 'abs9397@stern.nyu.edu',
 'ad3@stern.nyu.edu',
 'ad4@stern.nyu.edu',
 'af131@stern.nyu.edu',
 'ag122@stern.nyu.edu',
 'ag5@stern.nyu.edu',
 'ag918@stern.nyu.edu',
 'ahg2061@stern.nyu.edu',
 'ajt10@stern.nyu.edu',
 'ak199@stern.nyu.edu',
 'ak5@stern.nyu.edu',
 'akb214@stern.nyu.edu',
 'al26@stern.nyu.edu',
 'al74@stern.nyu.edu',
 'ala8@stern.nyu.edu',
 'als455@stern.nyu.edu',
 'am14005@stern.nyu.edu',
 'amalin@stern.nyu.edu',
 'amh22@stern.nyu.edu',
 'amm22@stern.nyu.edu',
 'amp453@stern.nyu.edu',
 'an1490@stern.nyu.edu',
 'angelica@stern.nyu.edu',
 'ar183@stern.nyu.edu',
 'ark8@stern.nyu.edu',
 'as11475@stern.nyu.edu',
 'as3631@stern.nyu.edu',
 'as5275@stern.nyu.edu',
 'as5552@stern.nyu.edu',
 'as83@stern.nyu.edu',
 'as9@stern.nyu.edu',
 'asarto@stern.nyu.edu',
 'asiegman@stern.nyu.edu',
 'at106@stern.nyu.edu',
 'at1@stern.nyu.edu',
 'at2@stern.nyu.edu',
 'at302

In [24]:
# and let's make it very compact using a set comprehension
import requests
url = 'http://www.stern.nyu.edu/faculty/search_name_form'
html = requests.get(url).text
regex = re.compile(r'\w+@(\w+\.)+\w+')
emails = {m.group() for m in regex.finditer(html)}
emails

{'Eggers@stern.nyu.edu',
 'Kuehlwein@stern.nyu.edu',
 'ab4775@stern.nyu.edu',
 'ab70@stern.nyu.edu',
 'ab862@stern.nyu.edu',
 'abs9397@stern.nyu.edu',
 'ad3@stern.nyu.edu',
 'ad4@stern.nyu.edu',
 'af131@stern.nyu.edu',
 'ag122@stern.nyu.edu',
 'ag5@stern.nyu.edu',
 'ag918@stern.nyu.edu',
 'ahg2061@stern.nyu.edu',
 'ajt10@stern.nyu.edu',
 'ak199@stern.nyu.edu',
 'ak5@stern.nyu.edu',
 'akb214@stern.nyu.edu',
 'al26@stern.nyu.edu',
 'al74@stern.nyu.edu',
 'ala8@stern.nyu.edu',
 'als455@stern.nyu.edu',
 'am14005@stern.nyu.edu',
 'amalin@stern.nyu.edu',
 'amh22@stern.nyu.edu',
 'amm22@stern.nyu.edu',
 'amp453@stern.nyu.edu',
 'an1490@stern.nyu.edu',
 'angelica@stern.nyu.edu',
 'ar183@stern.nyu.edu',
 'ark8@stern.nyu.edu',
 'as11475@stern.nyu.edu',
 'as3631@stern.nyu.edu',
 'as5275@stern.nyu.edu',
 'as5552@stern.nyu.edu',
 'as83@stern.nyu.edu',
 'as9@stern.nyu.edu',
 'asarto@stern.nyu.edu',
 'asiegman@stern.nyu.edu',
 'at106@stern.nyu.edu',
 'at1@stern.nyu.edu',
 'at2@stern.nyu.edu',
 'at302
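
Because the live page can change or be unavailable, the same extraction can be tried offline. The sketch below applies the email regex from the cells above to a hypothetical HTML fragment; sample_html and the two addresses in it are invented for illustration and are not taken from the real page.

```python
import re

# Hypothetical HTML fragment standing in for the downloaded page
sample_html = """
<li><a href="mailto:jdoe@stern.nyu.edu">jdoe@stern.nyu.edu</a></li>
<li><a href="mailto:asmith@stern.nyu.edu">asmith@stern.nyu.edu</a></li>
"""

regex = re.compile(r'\w+@(\w+\.)+\w+')

# Each address appears twice (href and link text); the set removes duplicates
emails = sorted({m.group() for m in regex.finditer(sample_html)})
print(emails)  # ['asmith@stern.nyu.edu', 'jdoe@stern.nyu.edu']
```

finditer with m.group() is used here because the pattern contains a capturing group: re.findall would return only the text captured by that group rather than the whole address.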