# Regular Expressions

## Announcements

- [Library Data Services and Assessment Assistant](https://careers.pageuppeople.com/885/ci/en-us/job/494795/library-data-services-and-assessment-assistant)

## Final project discussion

## Review

## Looking Back

- SQL, Relational Databases
   - Selecting, sorting, limiting, joins
   - Using SQLite in Colab with `%%sql`

- Data manipulation with Pandas
   - DataFrames and Series
   - Importing from and exporting to SQL
   - Pulling tables from web
   - Selection, sorting, counting

- Split-Apply-Combine
   - Groupby in Pandas e.g. `.groupby('column').mean()`

- Visualization

- Semi-Structured data in MongoDB
    - JSON
    - selection, sorting
    - Aggregations
    - MapReduce concepts

- String pattern matching and extraction with regular expressions

## Final Project Updates

## <center>Regular Expressions</center>
### <center>aka *regex*</center>

### Overview

Regular expressions help you find and change patterns in strings.

*Pattern Matching*

e.g. Find all phone numbers on a web page

*Manipulation*

e.g. Match "{Lastname}, {Firstname}" in a set of records and rewrite it as "{Firstname} {Lastname}"
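As a rough sketch of the manipulation case, `re.sub` can swap the two parts of a name. The sample records and the `swap_name` helper are made up for illustration, and the pattern assumes single-word names separated by a comma and a space.

```python
import re

# Hypothetical sample records in "Lastname, Firstname" form.
records = ["Lovelace, Ada", "Hopper, Grace"]

def swap_name(record):
    # Each (\w+) captures one name part; \2 and \1 reinsert them swapped.
    return re.sub(r'^(\w+), (\w+)$', r'\2 \1', record)

swapped = [swap_name(r) for r in records]
print(swapped)  # ['Ada Lovelace', 'Grace Hopper']
```

Records that do not fit the pattern are returned unchanged, which is usually what you want when cleaning messy data.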

## Why?

- Checking whether an input is valid (e.g. password, phone number, email, etc.)
- Cleaning data
- More complex data subsetting
- Working with user inputs or other unstructured data
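Input validation might look roughly like this. The `NNN-NNN-NNNN` phone format and the `looks_like_phone` helper are assumptions chosen for the sketch, not a general-purpose phone validator.

```python
import re

# Assumed format for this sketch: NNN-NNN-NNNN, with nothing before or after.
PHONE_PATTERN = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

def looks_like_phone(value):
    # fullmatch succeeds only if the entire string fits the pattern.
    return PHONE_PATTERN.fullmatch(value) is not None

print(looks_like_phone("303-555-0123"))       # True
print(looks_like_phone("call 303-555-0123"))  # False
```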

### Q: Where can you use regular expressions?

### A: Many, many places!

## In Python

In [2]:
import re
comment = "It was a dark and stormy night." 

Find a simple string:

In [3]:
re.findall('dark', comment)

['dark']

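Find all sequences of one or more word characters (the `r` prefix makes a raw string, so Python passes the backslash through to the regex unchanged):

In [3]:
re.findall(r'\w+', comment)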

['It', 'was', 'a', 'dark', 'and', 'stormy', 'night']
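`re.findall` is only one entry point of the `re` module. As a small sketch using the same sentence, `re.search` returns a match object for the first hit and `re.sub` replaces matches; the replacement word is arbitrary.

```python
import re

comment = "It was a dark and stormy night."

# re.search returns a Match object for the first hit, or None if there is none.
match = re.search(r'dark', comment)
print(match.group())  # dark
print(match.span())   # (9, 13)

# re.sub returns a new string with every match replaced.
replaced = re.sub(r'dark', 'bright', comment)
print(replaced)  # It was a bright and stormy night.
```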

## In SQL

SQLite has no regular expression function built in, but...

**MySQL**

Select rows where the column contains only alphanumeric characters:

```
SELECT * FROM table WHERE column REGEXP '^[A-Za-z0-9]+$';
```

**PostgreSQL**

Match strings that include foo, bar, or baz:

```
SELECT * FROM table WHERE value ~ 'foo|bar|baz';
```
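SQLite parses the `REGEXP` operator but ships no function behind it. From Python you can supply one with `sqlite3`'s `create_function`. The sketch below assumes an in-memory database and a made-up `items` table.

```python
import re
import sqlite3

def regexp(pattern, value):
    # SQLite evaluates "X REGEXP Y" by calling regexp(Y, X): pattern comes first.
    return value is not None and re.search(pattern, value) is not None

conn = sqlite3.connect(":memory:")
conn.create_function("REGEXP", 2, regexp)

# Hypothetical table and rows, only for demonstration.
conn.execute("CREATE TABLE items (name TEXT)")
conn.executemany("INSERT INTO items VALUES (?)",
                 [("abc123",), ("hello world",), ("foo_bar",)])

rows = conn.execute(
    "SELECT name FROM items WHERE name REGEXP '^[A-Za-z0-9]+$' ORDER BY name"
).fetchall()
print(rows)  # [('abc123',)]
```

Because the function is plain Python, the pattern syntax here is Python's, not MySQL's or PostgreSQL's.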

## In Pandas

In [4]:
import pandas as pd
movies = pd.read_csv('https://raw.githubusercontent.com/organisciak/Scripting-Course/master/data/movielens_small.csv')
movies.sample()

Unnamed: 0,userId,rating,title,genres,timestamp,year
78484,282,3.5,"Dirty Dozen, The",Action,1111611630,1967


Find movies where there is a digit (`\d`) right before the end of the string (`$`):

In [5]:
matches = movies['title'].str.contains(r'\d$')
movies[matches].sample(10)

Unnamed: 0,userId,rating,title,genres,timestamp,year
36811,607,4.5,Kill Bill: Vol. 2,Action,1113406147,2004
2043,39,4.0,Apollo 13,Adventure,832523013,1995
2042,34,4.0,Apollo 13,Adventure,973746527,1995
65679,213,2.5,District 9,Mystery,1462634092,2009
82253,73,1.5,Home Alone 3,Children,1255845284,1997
65865,423,0.5,2012,Action,1353701305,2009
20203,527,1.0,Lethal Weapon 4,Action,1281232047,1998
12684,294,3.5,Spider-Man 2,Action,1119923345,2004
36819,655,2.5,Kill Bill: Vol. 2,Action,1470072662,2004
81724,306,3.0,Airport '77,Drama,939901272,1977


Find movies where the substring ' Part ' exists:

In [6]:
matches = movies.title.str.contains(' Part ')
movies[matches].sample(10)

Unnamed: 0,userId,rating,title,genres,timestamp,year
88805,95,4.0,Fright Night Part II,Horror,1019023671,1988
41343,316,4.0,Harry Potter and the Deathly Hallows: Part 1,Action,1460822796,2010
41327,13,4.5,Harry Potter and the Deathly Hallows: Part 1,Action,1331380387,2010
25114,163,5.0,"Godfather: Part II, The",Crime,1294084374,1974
25108,118,5.0,"Godfather: Part II, The",Crime,950153998,1974
73552,350,4.0,Rambo: First Blood Part II,Action,1026761580,1985
43898,518,1.0,Hot Shots! Part Deux,Action,945365087,1993
50254,119,2.0,Back to the Future Part III,Adventure,913050261,1990
25118,198,5.0,"Godfather: Part II, The",Crime,1068822896,1974
50222,560,5.0,Back to the Future Part II,Adventure,1452849491,1989


Find movies that are named "The ... of ..."

In [7]:
matches = movies.title.str.contains('^The .+ of ')
movies[matches].sample(10)

Unnamed: 0,userId,rating,title,genres,timestamp,year
36105,386,3.0,The Count of Monte Cristo,Action,1047028511,2002
92516,564,4.0,The Golden Voyage of Sinbad,Action,974713375,1973
36117,664,4.0,The Count of Monte Cristo,Action,1393891028,2002
68247,15,1.0,The Theory of Everything,Drama,1425875426,2014
68276,287,5.0,The Hobbit: The Battle of the Five Armies,Adventure,1469162192,2014
97198,402,4.0,The Best of Me,Drama,1462948242,2014
75357,23,3.0,The Importance of Being Earnest,Comedy,1148669643,1952
92202,176,1.0,The Scorpion King: Rise of a Warrior,Action,1340915855,2008
36102,294,4.0,The Count of Monte Cristo,Action,1112390008,2002
36108,468,2.5,The Count of Monte Cristo,Action,1296193249,2002


## In MongoDB

In [None]:
from pymongo import MongoClient
client = MongoClient()
db = client.week7
collection = db.cooking

Find a recipe with an ingredient called "yellow ..."

In [10]:
collection.find_one({
    "ingredients": {"$regex": "yellow .*"}
})

{'_id': ObjectId('5cedb796db075a25e4ac71b5'),
 'cuisine': 'southern_us',
 'id': 25693,
 'ingredients': ['plain flour',
  'ground pepper',
  'salt',
  'tomatoes',
  'ground black pepper',
  'thyme',
  'eggs',
  'green tomatoes',
  'yellow corn meal',
  'milk',
  'vegetable oil']}

After unwinding the recipes to one doc per ingredient, find ingredients with a qualified salt:

In [24]:
pipeline = [
    { "$unwind": "$ingredients" },
    { "$project": {"ingredients": 1, "_id":0} },
    { "$match":{
        "ingredients": {"$regex": "^.+ salt" }
        }
    },
    { "$limit": 5 }
]
results = collection.aggregate(pipeline)
list(results)

[{'ingredients': 'sea salt'},
 {'ingredients': 'kosher salt'},
 {'ingredients': 'fine sea salt'},
 {'ingredients': 'kosher salt'},
 {'ingredients': 'kosher salt'}]

Count the qualified salt types:

In [25]:
pipeline = [
    { "$unwind": "$ingredients" },
    { "$project": {"ingredients": 1, "_id":0} },
    { "$match":{ "ingredients": {"$regex": "^.+ salt$" } } },
    { "$group":{
        "_id": "$ingredients", "count": {"$sum": 1} } 
    },
    { "$sort": { "count": -1} },
    { "$limit": 20 }
]
results = collection.aggregate(pipeline)
list(results)

[{'_id': 'kosher salt', 'count': 3113},
 {'_id': 'sea salt', 'count': 940},
 {'_id': 'coarse salt', 'count': 578},
 {'_id': 'fine sea salt', 'count': 285},
 {'_id': 'garlic salt', 'count': 240},
 {'_id': 'seasoning salt', 'count': 131},
 {'_id': 'table salt', 'count': 79},
 {'_id': 'coarse sea salt', 'count': 68},
 {'_id': 'coarse kosher salt', 'count': 64},
 {'_id': 'celery salt', 'count': 52},
 {'_id': 'fine salt', 'count': 24},
 {'_id': 'onion salt', 'count': 15},
 {'_id': 'rock salt', 'count': 14},
 {'_id': 'black salt', 'count': 12},
 {'_id': 'pickling salt', 'count': 12},
 {'_id': 'Himalayan salt', 'count': 11},
 {'_id': 'celtic salt', 'count': 9},
 {'_id': 'maldon sea salt', 'count': 8},
 {'_id': 'smoked sea salt', 'count': 6},
 {'_id': 'iodized salt', 'count': 4}]

### Note on variation

- Regular expressions are *close* to standardized, but each implementation (Python, MySQL, PostgreSQL, MongoDB, ...) differs slightly in syntax and supported features.

## Basics of Regular Expressions

In this class, we'll cover the basics and practice them in Python and Pandas.

To follow along:

In [5]:
import re

In [8]:
text = "it was a dark and stormy night."
re.findall(r'\w+', text)

['it', 'was', 'a', 'dark', 'and', 'stormy', 'night']

## Wild Cards

`a` - Match the letter 'a'. Same for most other characters

In [9]:
text = "Colorado"
re.findall('o', text)

['o', 'o', 'o']

In [10]:
text = "Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo"
re.findall('Buffalo buffalo', text)

['Buffalo buffalo', 'Buffalo buffalo', 'Buffalo buffalo']

`.` - Match any single character (except a line break, by default)

In [13]:
text = "who, what, where, why, and how"
re.findall('wh.', text)

['who', 'wha', 'whe', 'why']

In [14]:
text = "who, what, where, why, and how"
re.findall('wh.,', text)

['who,', 'why,']

- `\w` - Match any word character (letters, digits, underscore; support for non-English characters varies)
- `\W` - Match any non-word character

In [28]:
text = "Who, what, where, why, and how"
re.findall(r'\w\w\w,', text)

['Who,', 'hat,', 'ere,', 'why,']

In [13]:
text = "Who, what, where, why, and how"
re.findall(r'\w', text)

['W',
 'h',
 'o',
 'w',
 'h',
 'a',
 't',
 'w',
 'h',
 'e',
 'r',
 'e',
 'w',
 'h',
 'y',
 'a',
 'n',
 'd',
 'h',
 'o',
 'w']
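How `\w` treats non-English characters depends on the implementation. In Python 3, for instance, `\w` is Unicode-aware for text patterns unless you pass `re.ASCII`. The sample text is arbitrary and is written with escape codes so the accented letters are unambiguous.

```python
import re

# "café naïve über", written with escapes for the accented letters.
text = "caf\u00e9 na\u00efve \u00fcber"

# Default for str patterns in Python 3: \w also matches accented letters.
unicode_words = re.findall(r'\w+', text)

# re.ASCII limits \w to [a-zA-Z0-9_], so accented letters split the words.
ascii_words = re.findall(r'\w+', text, flags=re.ASCII)

print(unicode_words)  # ['café', 'naïve', 'über']
print(ascii_words)    # ['caf', 'na', 've', 'ber']
```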

`\d` - Match any digit

In [14]:
text = "Party like it's 1999"
re.findall(r'\d', text)

['1', '9', '9', '9']

In [17]:
text = "Party like it's 1999"
re.findall(r'\d\d\d\d', text)

['1999']

### *What if I want to match an actual backslash or period?*

This is a problem:

In [15]:
text = "Dr. Jones Drinks Too Much"
re.findall('Dr.', text)

['Dr.', 'Dri']

Precede the character with a backslash to *escape* it.

E.g.

- `.` - Matches *any* character
- `\.` - Matches a literal period
- `\\` - Matches a literal backslash

In [19]:
re.findall(r'Dr\.', text)

['Dr.']
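When the text to match comes from somewhere else (user input, say), `re.escape` can add the backslashes for you. A brief sketch; the `path` string is only an illustration of matching a literal backslash.

```python
import re

text = "Dr. Jones Drinks Too Much"

# re.escape adds the backslashes needed to treat the string literally.
pattern = re.escape("Dr.")
print(pattern)                    # Dr\.
print(re.findall(pattern, text))  # ['Dr.']

# A literal backslash is matched by an escaped backslash.
path = r"C:\temp"
backslashes = re.findall(r'\\', path)
print(len(backslashes))           # 1
```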

## Reference (so far)
- `a` - Match the letter 'a'. Same for most other characters
- `\w` - Match any word character (letters, digits, underscore; support for non-English characters varies)
- `\W` - Match any non-word character
- `\d` - Match any digit
- `.` - Match any single character
- `\.` - Match a literal period

Let's try the first few lab questions - 1.1. to 1.4.

`\s` - Match any whitespace character (spaces, tabs, line breaks)

*What will this return?*

In [20]:
text = "The quick brown fox jumped over the lazy yellow dog"
re.findall(r'\s....\s', text)

[' over ', ' lazy ']

`[ab]` - Set of possible characters; matches one of them, in this case 'a' or 'b'

In [21]:
text = "The quick brown fox jumped over the lazy yellow dog"
re.findall('[Tt]he', text)

['The', 'the']

- `[a-z]` matches any character from a to z
- `[A-Z]` matches any character from A to Z

In [16]:
text = "text 1-800-SPAM for more information"
re.findall('[A-Z][A-Z][A-Z][A-Z]', text)

['SPAM']

The square brackets work the same as before, so you can combine `A-Z` with other characters or classes.

e.g. Match capital letters, digits, or hyphens:

In [23]:
text = "text 1-800-SPAM for more information"
re.findall(r'[\d\-A-Z]+', text)

['1-800-SPAM']

*Note that inside square brackets a hyphen is special because it marks a range, so a literal `-` is written as `\-`.*
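The hyphen only needs this care inside square brackets. As a sketch of the alternatives in Python, a hyphen placed last in the brackets is also taken literally, and outside brackets it is an ordinary character.

```python
import re

text = "text 1-800-SPAM for more information"

# Escaped hyphen inside the brackets.
escaped = re.findall(r'[\d\-A-Z]+', text)

# A hyphen placed last in the brackets cannot form a range, so it is literal.
trailing = re.findall(r'[\dA-Z-]+', text)

# Outside square brackets a hyphen has no special meaning.
plain = re.findall(r'1-800', text)

print(escaped)   # ['1-800-SPAM']
print(trailing)  # ['1-800-SPAM']
print(plain)     # ['1-800']
```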

Returning to the earlier data.

In [17]:
titles = movies.title.drop_duplicates()

"The (single word) of ..."

In [18]:
matches = titles.str.contains(r'^The \w+ of ')
titles[matches].sample(10)

99821                       The Face of an Angel
99139                        The End of the Tour
75357            The Importance of Being Earnest
68247                   The Theory of Everything
94869             The Possession of Michael King
93216                    The Diary of Anne Frank
96669    The Disappearance of Eleanor Rigby: Her
36095                  The Count of Monte Cristo
94735                  The Plague of the Zombies
97199                           The Book of Life
Name: title, dtype: object

Titles that contain a colon:

In [19]:
matches = titles.str.contains(':')
titles[matches].sample(10)

85974                          Carnosaur 3: Primal Species
44442                        Kids in the Hall: Brain Candy
97936                                 Imagine: John Lennon
87777                                Underworld: Evolution
48488                              Speed 2: Cruise Control
94916                               Madonna: Truth or Dare
85005               Children of the Corn IV: The Gathering
88472                  The Butterfly Effect 3: Revelations
85954    Teenage Mutant Ninja Turtles II: The Secret of...
73527                         History of the World: Part I
Name: title, dtype: object

Titles made of exactly two words joined by a hyphen:

In [25]:
matches = titles.str.contains(r"^\w+\-\w+$")
titles[matches]

259                  Ben-Hur
12000             Spider-Man
40269                  X-Men
55032                  U-571
58796             Scooby-Doo
61252              Fail-Safe
65729                G-Force
66092               Kick-Ass
68396                Ant-Man
69332            Re-Animator
69765    Slaughterhouse-Five
81831                  K-PAX
83228                 BURN-E
83394               Non-Stop
83908               Bio-Dome
89557            Topsy-Turvy
93256               Cry-Baby
94106              She-Devil
95079               Kon-Tiki
96155              De-Lovely
96602               Catch-22
96617                Ben-hur
96717               Semi-Pro
98638                  T-Men
99056     Shakespeare-Wallah
99971        Straight-Jacket
Name: title, dtype: object

## Reference (so far)
- `a` - Match the letter `a`. Same for most other characters
- `.` - Match any single character
- `\w` - Match any word character (letters, digits, underscore; support for non-English characters varies)
- `\W` - Match any non-word character
- `\d` - Match any digit
- `\.` - Match a literal period
- `\s` - Match any whitespace character (spaces, tabs, line breaks)
- `[ab]` - Set of possible characters; matches one of them, in this case `a` or `b`
- `[a-z]` - Match any character from a to z
- `[A-Z]` - Match any character from A to Z
- `[A-Zab]` - Match any character from A to Z (`A-Z`), *or* `a` *or* `b`

## Repetition

`?` - One or zero of the preceding match

In [20]:
text = "color colour"
re.findall('colou?r', text)

['color', 'colour']

- `+` - One or more of the preceding match
- `*` - Zero or more of the preceding match

In [22]:
text = "GOAL GOOOOOOOOOAAAAAAL GAL"
re.findall('GO+A+L', text)

['GOAL', 'GOOOOOOOOOAAAAAAL']

In [34]:
text = "GOAL"
re.findall('GO+A+L', text)

['GOAL']

`*` and `+` are *greedy* in Python. They will grab as much as possible. 

In [23]:
text = "<p>Something or other</p><p>Yet more junk.</p>" 
re.findall('<p>.*</p>', text)

['<p>Something or other</p><p>Yet more junk.</p>']

In [26]:
text = "foo1@gmail.com;b-a-r@gmail.com;baz@gmail.com" 
re.findall(r'\w.*@gmail.com', text)

['foo1@gmail.com;b-a-r@gmail.com;baz@gmail.com']

`*?` is the *lazy* alternative: it grabs as little as possible.

In [123]:
re.findall(r'\w.*?@gmail.com', text)

['foo1@gmail.com', 'b-a-r@gmail.com', 'baz@gmail.com']
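The same lazy quantifier separates the two paragraphs from the earlier `<p>` example. This is a sketch for illustrating greedy versus lazy matching, not a recommendation to parse HTML with regular expressions.

```python
import re

text = "<p>Something or other</p><p>Yet more junk.</p>"

# Greedy: .* runs to the last </p>, so everything comes back as one match.
greedy = re.findall(r'<p>.*</p>', text)

# Lazy: .*? stops at the first </p> it can, giving one match per paragraph.
lazy = re.findall(r'<p>.*?</p>', text)

print(len(greedy))  # 1
print(lazy)         # ['<p>Something or other</p>', '<p>Yet more junk.</p>']
```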

## Reference (so far)

**Matching characters**
- `a` - Match the letter `a`. Same for most other characters
- `.` - Match any single character
- `\w` - Match any word character (letters, digits, underscore; support for non-English characters varies)
- `\W` - Match any non-word character
- `\d` - Match any digit
- `\.` - Match a literal period
- `\s` - Match any whitespace character (spaces, tabs, line breaks)

**Multiple Matches**
- `[ab]` - Set of possible characters; matches one of them, in this case `a` or `b`
- `[a-z]` - Match any character from a to z
- `[A-Z]` - Match any character from A to Z
- `[A-Zab]` - Match any character from A to Z (`A-Z`), *or* `a` *or* `b`

**Repeating**

*'greedy' means that it captures as much as it can, 'lazy' means it captures as little as possible.*
- `?` - One or zero of the preceding match
- `+` - One or more of the preceding match (greedy)
- `*` - Zero or more of the preceding match (greedy)
- `*?`, `+?`  - Lazy versions of `*` and `+`

## Start and End of Line

`^` - Start of line (by default in Python, the start of the string; with `re.MULTILINE`, the start of each line)

In [27]:
text = "The quick brown fox jumped over the lazy yellow dog"
re.findall('^The', text)

['The']

In [109]:
re.findall('^.*fox', text)

['The quick brown fox']

`$` - End of line (by default in Python, the end of the string; with `re.MULTILINE`, the end of each line)

In [104]:
text = "The quick brown fox jumped over the lazy yellow dog"
re.findall('.......$', text)

['low dog']

In [28]:
text = "The quick brown fox jumped over the lazy yellow dog"
re.findall("^.*$", text)

['The quick brown fox jumped over the lazy yellow dog']
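The examples above use single-line strings, where "line" and "string" are the same thing. With text that spans several lines, Python needs the `re.MULTILINE` flag before `^` and `$` work per line. A small sketch with a made-up two-line string:

```python
import re

text = "The quick brown fox\nThe lazy yellow dog"

# Default: ^ matches only at the very start of the string.
default = re.findall(r'^The \w+', text)

# re.MULTILINE: ^ also matches right after each line break.
per_line = re.findall(r'^The \w+', text, flags=re.MULTILINE)

# With re.MULTILINE, $ also matches just before each line break.
last_words = re.findall(r'\w+$', text, flags=re.MULTILINE)

print(default)     # ['The quick']
print(per_line)    # ['The quick', 'The lazy']
print(last_words)  # ['fox', 'dog']
```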

## Reference

**Matching characters**
- `a` - Match the letter `a`. Same for most other characters
- `.` - Match any single character
- `\w` - Match any word character (letters, digits, underscore; support for non-English characters varies)
- `\W` - Match any non-word character
- `\d` - Match any digit
- `\.` - Match a literal period
- `\s` - Match any whitespace character (spaces, tabs, line breaks)

**Multiple Matches**
- `[ab]` - Set of possible characters; matches one of them, in this case `a` or `b`
- `[a-z]` - Match any character from a to z
- `[A-Z]` - Match any character from A to Z
- `[A-Zab]` - Match any character from A to Z (`A-Z`), *or* `a` *or* `b`

**Repeating**

*'greedy' means that it captures as much as it can, 'lazy' means it captures as little as possible.*
- `?` - One or zero of the preceding match
- `+` - One or more of the preceding match (greedy)
- `*` - Zero or more of the preceding match (greedy)
- `*?`, `+?`  - Lazy versions of `*` and `+`

**Position**
- `^` - Start of line
- `$` - End of line

## Additional tips

Choose a range for repetition with `{min,max}`. The range applies to the item right before it, here the final `O`:

In [57]:
text = "YOLO"
re.search('YOLO{1,3}$', text) 

<_sre.SRE_Match object; span=(0, 4), match='YOLO'>

In [58]:
text = "YOLOOO"
re.search('YOLO{1,3}$', text)

<_sre.SRE_Match object; span=(0, 6), match='YOLOOO'>

In [66]:
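text = "YOLOOOOOO"
re.search('YOLO{1,3}$', text)

No output this time: more than three `O`s follow `YOL`, so nothing matches and `re.search` returns `None`.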
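Two common variations on the braces, sketched with sample strings in the style of the earlier examples: `{n}` for an exact count, and `{min,}` with the maximum left open.

```python
import re

# {4} means exactly four repetitions, so \d{4} is the same as \d\d\d\d.
four_digits = re.findall(r'\d{4}', "Party like it's 1999")

# {2,} leaves the maximum open: two or more O's.
long_goals = re.findall(r'GO{2,}AL', "GOAL GOOAL GOOOOAL")

print(four_digits)  # ['1999']
print(long_goals)   # ['GOOAL', 'GOOOOAL']
```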

*Negation*
    
Use a caret as the first character inside square brackets: `[^aeiou]` matches any single character that is *not* a, e, i, o, or u.
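A short sketch of negation on an arbitrary sample phrase. Note that "not a vowel" includes spaces and punctuation unless you exclude them too.

```python
import re

text = "regular expressions"

# [^aeiou] matches any one character that is not a lowercase vowel,
# and that includes the space.
non_vowels = re.findall(r'[^aeiou]', text)

# Adding \s to the negated set excludes whitespace as well.
consonant_runs = re.findall(r'[^aeiou\s]+', text)

print(non_vowels)
# ['r', 'g', 'l', 'r', ' ', 'x', 'p', 'r', 's', 's', 'n', 's']
print(consonant_runs)
# ['r', 'g', 'l', 'r', 'xpr', 'ss', 'ns']
```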

*Groups*
    
Treat multiple characters as if they were a single character, so that repetition applies to all of them.

Use parentheses, e.g.:

In [53]:
text = "banana"
re.findall('^ba(na)+$', text)

['na']

In [66]:
text = "lololololololololololol"
re.findall('^l(ol)+$', text)

['ol']

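Capturing groups: parentheses also *capture*. When a pattern contains a group, `re.findall` returns the text matched by the group (its last repetition, if it repeats) instead of the whole match: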

In [61]:
text = "Ketchup Catsup"
re.findall('(Ketch|Cats)up', text)

['Ketch', 'Cats']
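If you want the grouping without the capturing, Python offers the non-capturing form `(?:...)`. The sketch below compares it with one and with two capturing groups on the same text.

```python
import re

text = "Ketchup Catsup"

# One capturing group: findall returns only the group's text.
captured = re.findall(r'(Ketch|Cats)up', text)

# (?:...) groups without capturing, so findall returns the whole match.
whole = re.findall(r'(?:Ketch|Cats)up', text)

# Two capturing groups: findall returns one tuple per match.
pairs = re.findall(r'(Ketch|Cats)(up)', text)

print(captured)  # ['Ketch', 'Cats']
print(whole)     # ['Ketchup', 'Catsup']
print(pairs)     # [('Ketch', 'up'), ('Cats', 'up')]
```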