# Regular Expressions

## Announcements

## Final project discussion

## Review

## Looking Back

- SQL, Relational Databases
   - Selecting, sorting, limiting, joins
   - Using SQLite in Colab with `%%sql`

- Data manipulation with Pandas
   - DataFrames and Series
   - Importing from and exporting to SQL
   - Pulling tables from web
   - Selection, sorting, counting

- Split-Apply-Combine
   - Groupby in Pandas e.g. `.groupby('column').mean()`

- Visualization

- String pattern matching and extraction with regular expressions

## <center>Regular Expressions</center>
### <center>aka *regex*</center>

### Overview

Regular Expressions help you work with strings

*Pattern Matching*

e.g. Find all phone numbers on a web page

*Manipulation*

e.g. Match "{Lastname}, {Firstname}" in a set of records and rewrite it as "{Firstname} {Lastname}"
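A minimal sketch of that name rewrite, assuming single-word names. The sample records are made up for illustration, and the parentheses (groups) used here are explained later in these notes.

```python
import re

# Hypothetical records in "Lastname, Firstname" form (illustrative data only).
records = ["Lovelace, Ada", "Turing, Alan", "Hopper, Grace"]

# Group 1 captures the last name and group 2 the first name;
# \2 and \1 in the replacement text swap their order.
pattern = r'^(\w+), (\w+)$'
rewritten = [re.sub(pattern, r'\2 \1', record) for record in records]

print(rewritten)  # ['Ada Lovelace', 'Alan Turing', 'Grace Hopper']
```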

## Why?

- Checking whether an input is valid (e.g. password, phone number, email, etc.)
- Cleaning data
- More complex data subsetting
- Working with user inputs or other unstructured data

### Q: Where can you use regular expressions?

### A: Many, many places!

## In Python

In [1]:
import re
comment = "It was a dark and stormy night." 

Find a simple string:

In [26]:
re.findall('dark', comment)

['dark']

Find all sequences of one or more word characters:

In [9]:
re.findall(r'\w+', comment)

['It', 'was', 'a', 'dark', 'and', 'stormy', 'night']

*(What does the `r` before the opening quote mean? It tells Python to treat the string as a 'raw' string, so backslashes reach the regex engine unchanged. If your regular expression uses backslashes, it's generally less messy to write it as `r'...'`.)*
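To see why the raw string matters, here is a small sketch. It uses `\b` (a word boundary in regex), which is not covered elsewhere in these notes and is only here because it shows the difference clearly.

```python
import re

comment = "It was a dark and stormy night."

# In a normal string, Python interprets backslash escapes before `re` sees them.
print(len('\n'), len(r'\n'))  # 1 2

# '\b' becomes a backspace character in a normal string,
# while r'\b' reaches `re` intact and means "word boundary".
without_raw = re.findall('\bdark\b', comment)
with_raw = re.findall(r'\bdark\b', comment)

print(without_raw)  # []
print(with_raw)     # ['dark']
```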

## In SQL

SQLite doesn't support it out of the box (its `REGEXP` operator needs a user-supplied function), but most 'bigger' databases do. e.g.

**MySQL**

Select rows where a column contains alphanumeric characters only:

```
SELECT * FROM table WHERE column REGEXP '^[A-Za-z0-9]+$';
```

**PostgreSQL**

Match strings that include foo, bar, or baz:

```
SELECT * FROM table WHERE value ~ 'foo|bar|baz';
```
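SQLite can still be given regex matching from Python. The sketch below registers a Python function under the name `REGEXP` using the standard `sqlite3` module. The table name and rows are hypothetical, chosen only to mirror the MySQL example.

```python
import re
import sqlite3

def regexp(pattern, value):
    """Return 1 if `pattern` is found in `value`, else 0."""
    if value is None:
        return 0
    return 1 if re.search(pattern, value) else 0

conn = sqlite3.connect(':memory:')
# SQLite evaluates `X REGEXP Y` by calling a user function regexp(Y, X),
# so the pattern arrives as the first argument.
conn.create_function('REGEXP', 2, regexp)

# Hypothetical table and rows, for illustration only.
conn.execute('CREATE TABLE items (name TEXT)')
conn.executemany('INSERT INTO items VALUES (?)',
                 [('abc123',), ('hello world',), ('foo_bar',)])

rows = conn.execute(
    "SELECT name FROM items WHERE name REGEXP '^[A-Za-z0-9]+$'"
).fetchall()
print(rows)  # [('abc123',)]
conn.close()
```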

## In Pandas

In [10]:
import pandas as pd
movies = pd.read_csv('https://raw.githubusercontent.com/organisciak/Scripting-Course/master/data/movielens_small.csv')
movies.sample()

Unnamed: 0,userId,rating,title,genres,timestamp,year
69472,262,3.0,Tenebre,Horror,1433901090,1982


Find movies where there is a digit (`\d`) right before the end of the string (`$`):

In [12]:
matches = movies['title'].str.contains(r'\d$')
movies[matches].sample(10)

Unnamed: 0,userId,rating,title,genres,timestamp,year
36747,199,3.5,Kill Bill: Vol. 2,Action,1215033850,2004
36603,320,4.0,Kill Bill: Vol. 1,Action,1460751900,2003
2121,317,4.0,Apollo 13,Adventure,847633374,1995
49323,306,3.0,Scream 2,Comedy,939901609,1997
61316,176,3.5,Shrek 2,Adventure,1341056045,2004
61311,146,3.0,Shrek 2,Adventure,1256071512,2004
12698,426,2.5,Spider-Man 2,Action,1310374343,2004
82146,654,4.0,Wayne's World 2,Comedy,1145393760,1993
36600,309,4.5,Kill Bill: Vol. 1,Action,1114567270,2003
61341,378,0.5,Shrek 2,Adventure,1443625509,2004


Find movies where the substring ' Part ' exists:

In [25]:
matches = movies.title.str.contains(' Part ')
movies[matches].sample(10)

Unnamed: 0,userId,rating,title,genres,timestamp,year
85818,212,2.0,"Karate Kid, Part II, The",Action,1253932268,1986
50145,30,4.0,Back to the Future Part II,Adventure,945294811,1989
85816,157,2.5,"Karate Kid, Part II, The",Action,1291598091,1986
41353,501,3.0,Harry Potter and the Deathly Hallows: Part 1,Action,1307129483,2010
41371,157,3.5,Harry Potter and the Deathly Hallows: Part 2,Action,1369334443,2011
43892,416,3.0,Hot Shots! Part Deux,Action,841446959,1993
25997,388,4.0,"Godfather: Part III, The",Crime,946528959,1990
50222,560,5.0,Back to the Future Part II,Adventure,1452849491,1989
43874,232,3.0,Hot Shots! Part Deux,Action,955086058,1993
25111,130,4.0,"Godfather: Part II, The",Crime,1139003651,1974


Find movies that are named "The ... of ..."

In [24]:
matches = movies.title.str.contains('^The .+ of ')
movies[matches].sample(10)

Unnamed: 0,userId,rating,title,genres,timestamp,year
68278,371,2.0,The Hobbit: The Battle of the Five Armies,Adventure,1462738278,2014
68479,15,3.0,The Age of Adaline,Drama,1458506026,2015
36103,353,3.5,The Count of Monte Cristo,Action,1113052800,2002
92516,564,4.0,The Golden Voyage of Sinbad,Action,974713375,1973
92515,346,5.0,The Golden Voyage of Sinbad,Action,1044651633,1973
68276,287,5.0,The Hobbit: The Battle of the Five Armies,Adventure,1469162192,2014
68248,72,3.5,The Theory of Everything,Drama,1461784635,2014
96814,380,5.0,The Jinx: The Life and Deaths of Robert Durst,Documentary,1465156469,2015
68256,378,3.5,The Theory of Everything,Drama,1443292443,2014
68275,270,3.0,The Hobbit: The Battle of the Five Armies,Adventure,1469306927,2014


## In MongoDB

In [41]:
from pymongo import MongoClient
with open('credentials.txt', mode='r') as f:
    user, mongopw, cluster_url = [l.strip() for l in f.readlines()]
client = MongoClient("mongodb+srv://{}:{}@{}/test?retryWrites=true&w=majority".format(user, mongopw, cluster_url))
db = client.scripting
collection = db.cooking

Find a recipe with an ingredient called "yellow ..."

In [42]:
db.cooking.find_one({
    "ingredients": {"$regex": "yellow .*"}
})

{'_id': ObjectId('682b56f789dffff004d7c273'),
 'id': 25693,
 'cuisine': 'southern_us',
 'ingredients': ['plain flour',
  'ground pepper',
  'salt',
  'tomatoes',
  'ground black pepper',
  'thyme',
  'eggs',
  'green tomatoes',
  'yellow corn meal',
  'milk',
  'vegetable oil']}

After unwinding the recipes to one doc per ingredient, find ingredients with a qualified salt:

In [43]:
pipeline = [
    { "$unwind": "$ingredients" },
    { "$project": {"ingredients": 1, "_id":0} },
    { "$match":{
        "ingredients": {"$regex": "^.+ salt" }
        }
    },
    { "$limit": 5 }
]
results = collection.aggregate(pipeline)
list(results)

[{'ingredients': 'sea salt'},
 {'ingredients': 'kosher salt'},
 {'ingredients': 'fine sea salt'},
 {'ingredients': 'kosher salt'},
 {'ingredients': 'kosher salt'}]

Count the qualified salt types:

In [44]:
pipeline = [
    { "$unwind": "$ingredients" },
    { "$project": {"ingredients": 1, "_id":0} },
    { "$match":{ "ingredients": {"$regex": "^.+ salt$" } } },
    { "$group":{
        "_id": "$ingredients", "count": {"$sum": 1} } 
    },
    { "$sort": { "count": -1} },
    { "$limit": 20 }
]
results = collection.aggregate(pipeline)
list(results)

[{'_id': 'kosher salt', 'count': 3113},
 {'_id': 'sea salt', 'count': 940},
 {'_id': 'coarse salt', 'count': 578},
 {'_id': 'fine sea salt', 'count': 285},
 {'_id': 'garlic salt', 'count': 240},
 {'_id': 'seasoning salt', 'count': 131},
 {'_id': 'table salt', 'count': 79},
 {'_id': 'coarse sea salt', 'count': 68},
 {'_id': 'coarse kosher salt', 'count': 64},
 {'_id': 'celery salt', 'count': 52},
 {'_id': 'fine salt', 'count': 24},
 {'_id': 'onion salt', 'count': 15},
 {'_id': 'rock salt', 'count': 14},
 {'_id': 'pickling salt', 'count': 12},
 {'_id': 'black salt', 'count': 12},
 {'_id': 'Himalayan salt', 'count': 11},
 {'_id': 'celtic salt', 'count': 9},
 {'_id': 'maldon sea salt', 'count': 8},
 {'_id': 'smoked sea salt', 'count': 6},
 {'_id': 'iodized salt', 'count': 4}]
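If you don't have a MongoDB cluster handy, the same unwind, match, group, and sort steps can be imitated in plain Python. This is only a rough local analogue, not how MongoDB runs the pipeline, and the recipe documents below are made up.

```python
import re
from collections import Counter

# Hypothetical stand-in for the cooking collection (made-up documents).
recipes = [
    {"ingredients": ["kosher salt", "milk", "eggs"]},
    {"ingredients": ["sea salt", "kosher salt", "salt"]},
    {"ingredients": ["salted butter", "fine sea salt"]},
]

qualified_salt = re.compile(r'^.+ salt$')

# Like $unwind: one entry per ingredient.
unwound = [ingredient for recipe in recipes for ingredient in recipe["ingredients"]]
# Like $match with $regex: keep only the qualified salts.
matched = [ingredient for ingredient in unwound if qualified_salt.search(ingredient)]
# Like $group and $sort: count each value, most frequent first.
counts = Counter(matched).most_common()

print(counts)  # [('kosher salt', 2), ('sea salt', 1), ('fine sea salt', 1)]
```

Note that plain `salt` and `salted butter` are both rejected: the pattern requires at least one character and a space before `salt`, and `salt` must end the string.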

### Note on variation

- Regular Expressions are *close* to standard, but different implementations are slightly different.

## Basics of Regular Expressions

In this class, we'll cover the basics, practiced in Python and Pandas.

To follow along:

In [45]:
import re

In [47]:
text = "it was a dark and stormy night."
re.findall(r'\w+', text)

['it', 'was', 'a', 'dark', 'and', 'stormy', 'night']

## Wild Cards

`a` - Match the letter 'a'. Same for most other characters

In [48]:
text = "Colorado"
re.findall('o', text)

['o', 'o', 'o']

In [49]:
text = "Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo"
re.findall('Buffalo buffalo', text)

['Buffalo buffalo', 'Buffalo buffalo', 'Buffalo buffalo']

`.` - Match any single character

In [50]:
text = "who, what, where, why, and how"
re.findall('wh.', text)

['who', 'wha', 'whe', 'why']

In [51]:
text = "who, what, where, why, and how"
re.findall('wh.,', text)

['who,', 'why,']

- `\w` - Match any word character (letters, digits, and underscore; support for non-English characters varies)
- `\W` - Match any non-word character

In [53]:
text = "Who, what, where, why, and how"
re.findall(r'\w\w\w,', text)

['Who,', 'hat,', 'ere,', 'why,']

In [54]:
text = "Who, what, where, why, and how"
re.findall(r'\w', text)

['W', 'h', 'o', 'w', 'h', 'a', 't', 'w', 'h', 'e', 'r', 'e', 'w', 'h', 'y', 'a', 'n', 'd', 'h', 'o', 'w']
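For comparison, a quick sketch of `\W` on the same sentence, plus a check of what `\w` accepts in Python 3. The second sample string is made up and written with `\u` escapes so the accented letters are unambiguous.

```python
import re

text = "Who, what, where, why, and how"

# \W is the complement of \w: here it picks up the commas and spaces.
non_word = re.findall(r'\W', text)
print(non_word)  # [',', ' ', ',', ' ', ',', ' ', ',', ' ', ' ']

# In Python 3, \w on a str also matches the underscore and accented letters.
sample = "na\u00efve caf\u00e9_1"  # "naïve café_1"
words = re.findall(r'\w+', sample)
print(words)  # ['naïve', 'café_1']
```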

`\d` - Match any digit

In [56]:
text = "Party like it's 1999"
re.findall(r'\d', text)

['1', '9', '9', '9']

In [57]:
text = "Party like it's 1999"
re.findall(r'\d\d\d\d', text)

['1999']

### *What if I want to match an actual backslash or period?*

This is a problem:

In [58]:
text = "Dr. Jones Drinks Too Much"
re.findall('Dr.', text)

['Dr.', 'Dri']

Precede the character with a backslash

E.g.

- `.` - Matches *any* character
- `\.` - Matches a literal period

In [60]:
re.findall(r'Dr\.', text)

['Dr.']
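The same trick works for the backslash itself, and `re.escape` can add the backslashes for you. A small sketch, where the Windows-style path is just a made-up example string:

```python
import re

# A literal backslash is written as \\ inside the pattern.
path = r'C:\Users\me'  # made-up example string containing two backslashes
backslashes = re.findall(r'\\', path)
print(len(backslashes))  # 2

# re.escape() backslash-escapes the special characters in a string.
escaped = re.escape('Dr.')
print(escaped)  # Dr\.

matches = re.findall(escaped, "Dr. Jones Drinks Too Much")
print(matches)  # ['Dr.']
```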

## Reference (so far)
- `a` - Match the letter 'a'. Same for most other characters
- `\w` - Match any word character (letters, digits, and underscore; support for non-English characters varies)
- `\W` - Match any non-word character
- `\d` - Match any digit
- `.` - Matches *any* character
- `\.` - Matches a literal period

Let's try the first few lab questions - 1.1. to 1.4.

`\s` - Match any whitespace character (spaces, tabs, line breaks)

*What will this return?*

In [61]:
text = "The quick brown fox jumped over the lazy yellow dog"
re.findall(r'\s....\s', text)

[' over ', ' lazy ']
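A short sketch of which characters `\s` picks up in Python, using a made-up string that contains a space, a tab, and a line break:

```python
import re

sample = "a b\tc\nd"

# \s matches the space, the tab, and the newline.
whitespace = re.findall(r'\s', sample)
print(whitespace)  # [' ', '\t', '\n']

# A common use: split on any run of whitespace.
pieces = re.split(r'\s+', sample)
print(pieces)  # ['a', 'b', 'c', 'd']
```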

`[ab]` - Group of multiple possible characters - in this case 'a' or 'b'

In [62]:
text = "The quick brown fox jumped over the lazy yellow dog"
re.findall(r'[Tt]he', text)

['The', 'the']

- `[a-z]` matches any character from a to z
- `[A-Z]` matches any character from A to Z

In [64]:
text = "text 1-800-SPAM for more information"
re.findall('[A-Z][A-Z][A-Z][A-Z]', text)

['SPAM']

Those square brackets work the same way as before, so you can combine `A-Z` with other characters or classes.

e.g. Match capital letters, digits, or hyphens:

In [65]:
text = "text 1-800-SPAM for more information"
re.findall(r'[\d\-A-Z]+', text)

['1-800-SPAM']

*Note above that inside square brackets a hyphen is a special character (it marks a range such as `A-Z`), so a literal `-` is matched with `\-`.*
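Two related details, sketched on the same string: in Python a hyphen placed last inside the brackets is also read literally, and outside brackets a hyphen needs no escaping at all.

```python
import re

text = "text 1-800-SPAM for more information"

# Hyphen placed last in the brackets: treated as a literal hyphen.
at_end = re.findall(r'[\dA-Z-]+', text)
print(at_end)  # ['1-800-SPAM']

# Outside square brackets a hyphen is an ordinary character.
outside = re.findall(r'\d-\d', text)
print(outside)  # ['1-8']
```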

Returning to the earlier data.

In [66]:
titles = movies.title.drop_duplicates()

"The (single word) of ..."

In [67]:
matches = titles.str.contains(r'^The \w+ of ')
titles[matches].sample(10)

88980               The Earrings of Madame de...
68702                       The Legend of Tarzan
94735                  The Plague of the Zombies
36095                  The Count of Monte Cristo
94869             The Possession of Michael King
97199                           The Book of Life
68247                   The Theory of Everything
99821                       The Face of an Angel
96669    The Disappearance of Eleanor Rigby: Her
99139                        The End of the Tour
Name: title, dtype: object

In [68]:
matches = titles.str.contains(':')
titles[matches].sample(10)

84673     Friday the 13th Part VIII: Jason Takes Manhattan
87329    Dragon Ball Z: The Return of Cooler (Doragon b...
87634                            Babylon 5: A Call to Arms
84625                        Amityville II: The Possession
87124    Fullmetal Alchemist the Movie: Conqueror of Sh...
83682       Nightmare on Elm Street 2: Freddy's Revenge, A
82362        Vampire Hunter D: Bloodlust (Banpaia hantâ D)
65723                          G.I. Joe: The Rise of Cobra
99229                         Amityville: A New Generation
84541    Going Clear: Scientology and the Prison of Belief
Name: title, dtype: object

In [69]:
matches = titles.str.contains(r"^\w+\-\w+$")
titles[matches]

259                  Ben-Hur
12000             Spider-Man
40269                  X-Men
55032                  U-571
58796             Scooby-Doo
61252              Fail-Safe
65729                G-Force
66092               Kick-Ass
68396                Ant-Man
69332            Re-Animator
69765    Slaughterhouse-Five
81831                  K-PAX
83228                 BURN-E
83394               Non-Stop
83908               Bio-Dome
89557            Topsy-Turvy
93256               Cry-Baby
94106              She-Devil
95079               Kon-Tiki
96155              De-Lovely
96602               Catch-22
96617                Ben-hur
96717               Semi-Pro
98638                  T-Men
99056     Shakespeare-Wallah
99971        Straight-Jacket
Name: title, dtype: object

## Reference (so far)
- `a` - Match the letter `a`. Same for most other characters
- `.` - Match any single character
- `\w` - Match any word character (letters, digits, and underscore; support for non-English characters varies)
- `\W` - Match any non-word character
- `\d` - Match any digit
- `\.` - Matches a literal period
- `\s` - Match any whitespace character (spaces, tabs, line breaks)
- `[ab]` - Group of multiple possible characters - in this case `a` or `b`
- `[a-z]` matches any character from a to z
- `[A-Z]` matches any character from A to Z
- `[A-Zab]` matches any character from A to Z (`A-Z`), *or* `a` *or* `b`

## Repetition

`?` - One or zero of the preceding match

In [70]:
text = "color colour"
re.findall('colou?r', text)

['color', 'colour']

- `+` - One or more of the preceding match
- `*` - Zero or more of the preceding match

In [71]:
text = "GOAL GOOOOOOOOOAAAAAAL GAL"
re.findall('GO+A+L', text)

['GOAL', 'GOOOOOOOOOAAAAAAL']

In [72]:
text = "GOAL"
re.findall('GO+A+L', text)

['GOAL']
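The examples above only exercise `+`. Here is a small sketch contrasting it with `*`; the extra `GL` in the text is added here for illustration.

```python
import re

text = "GOAL GOOOOOOOOOAAAAAAL GAL GL"

# + requires at least one O and at least one A.
plus_matches = re.findall('GO+A+L', text)
print(plus_matches)  # ['GOAL', 'GOOOOOOOOOAAAAAAL']

# * also accepts zero of them, so 'GAL' and 'GL' now match too.
star_matches = re.findall('GO*A*L', text)
print(star_matches)  # ['GOAL', 'GOOOOOOOOOAAAAAAL', 'GAL', 'GL']
```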

`*` and `+` are *greedy* in Python. They will grab as much as possible. 

In [73]:
text = "<p>Something or other</p><p>Yet more junk.</p>" 
re.findall('<p>.*</p>', text)

['<p>Something or other</p><p>Yet more junk.</p>']

In [74]:
text = "foo1@gmail.com;b-a-r@gmail.com;baz@gmail.com" 
re.findall(r'\w.*@gmail.com', text)

['foo1@gmail.com;b-a-r@gmail.com;baz@gmail.com']

`*?` is the *lazy* alternative, it will grab as little as possible.

In [76]:
re.findall(r'\w.*?@gmail.com', text)

['foo1@gmail.com', 'b-a-r@gmail.com', 'baz@gmail.com']
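The lazy quantifier fixes the earlier `<p>` example in the same way. The sketch also shows one possible alternative to laziness: a negated character class (`[^;]`, covered under Additional tips) that simply refuses to cross the separator.

```python
import re

html = "<p>Something or other</p><p>Yet more junk.</p>"

# Lazy .*? stops at the first </p> it can reach.
paragraphs = re.findall(r'<p>.*?</p>', html)
print(paragraphs)  # ['<p>Something or other</p>', '<p>Yet more junk.</p>']

emails = "foo1@gmail.com;b-a-r@gmail.com;baz@gmail.com"

# [^;]+ matches anything except a semicolon, so a match cannot span two addresses.
addresses = re.findall(r'[^;]+@gmail\.com', emails)
print(addresses)  # ['foo1@gmail.com', 'b-a-r@gmail.com', 'baz@gmail.com']
```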

## Reference (so far)

**Matching characters**
- `a` - Match the letter `a`. Same for most other characters
- `.` - Match any single character
- `\w` - Match any word character (letters, digits, and underscore; support for non-English characters varies)
- `\W` - Match any non-word character
- `\d` - Match any digit
- `\.` - Matches a literal period
- `\s` - Match any whitespace character (spaces, tabs, line breaks)

**Multiple Matches**
- `[ab]` - Group of multiple possible characters - in this case `a` or `b`
- `[a-z]` matches any character from a to z
- `[A-Z]` matches any character from A to Z
- `[A-Zab]` matches any character from A to Z (`A-Z`), *or* `a` *or* `b`

**Repeating**

*'greedy' means that it captures as much as it can, 'lazy' means it captures as little as possible.*
- `?` - One or zero of the preceding match
- `+` - One or more of the preceding match (greedy)
- `*` - Zero or more of the preceding match (greedy)
- `*?`, `+?`  - Lazy versions of `*` and `+`

## Start and End of Line

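`^` - Start of line (by default in Python, the start of the whole string; pass `re.MULTILINE` to match at the start of every line)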

In [77]:
text = "The quick brown fox jumped over the lazy yellow dog"
re.findall('^The', text)

['The']

In [79]:
re.findall('^.*fox', text)

['The quick brown fox']

`$` - End of line

In [80]:
text = "The quick brown fox jumped over the lazy yellow dog"
re.findall('.......$', text)

['low dog']

In [81]:
text = "The quick brown fox jumped over the lazy yellow dog"
re.findall("^.*$", text)

['The quick brown fox jumped over the lazy yellow dog']
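All of the examples above use single-line strings. With multi-line text, Python's `^` and `$` only match at each line when the `re.MULTILINE` flag is passed. A small sketch with a made-up two-line string:

```python
import re

text = "first line\nsecond line"

# Default: ^ only matches at the very start of the string.
default_starts = re.findall(r'^\w+', text)
print(default_starts)  # ['first']

# With re.MULTILINE, ^ also matches right after each line break.
line_starts = re.findall(r'^\w+', text, flags=re.MULTILINE)
print(line_starts)  # ['first', 'second']

# The same applies to $ at the end of each line.
default_ends = re.findall(r'\w+$', text)
line_ends = re.findall(r'\w+$', text, flags=re.MULTILINE)
print(default_ends)  # ['line']
print(line_ends)     # ['line', 'line']
```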

## Reference

**Matching characters**
- `a` - Match the letter `a`. Same for most other characters
- `.` - Match any single character
- `\w` - Match any word character (letters, digits, and underscore; support for non-English characters varies)
- `\W` - Match any non-word character
- `\d` - Match any digit
- `\.` - Matches a literal period
- `\s` - Match any whitespace character (spaces, tabs, line breaks)

**Multiple Matches**
- `[ab]` - Group of multiple possible characters - in this case `a` or `b`
- `[a-z]` matches any character from a to z
- `[A-Z]` matches any character from A to Z
- `[A-Zab]` matches any character from A to Z (`A-Z`), *or* `a` *or* `b`

**Repeating**

*'greedy' means that it captures as much as it can, 'lazy' means it captures as little as possible.*
- `?` - One or zero of the preceding match
- `+` - One or more of the preceding match (greedy)
- `*` - Zero or more of the preceding match (greedy)
- `*?`, `+?`  - Lazy versions of `*` and `+`

**Position**
- `^` - Start of line
- `$` - End of line

## Additional Tips

Choose a range for repetition with `{min,max}`. e.g.

In [82]:
text = "YOLO"
re.search('YOLO{1,3}$', text) 

<re.Match object; span=(0, 4), match='YOLO'>

In [83]:
text = "YOLOOO"
re.search('YOLO{1,3}$', text)

<re.Match object; span=(0, 6), match='YOLOOO'>

In [84]:
text = "YOLOOOOOO"
re.search('YOLO{1,3}$', text)
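That last cell shows no output because `re.search` returns `None` when nothing matches: six O's is more than the allowed three. The sketch below confirms this and shows two related forms, `{n}` (exactly n) and `{n,}` (n or more). The third sample string is made up.

```python
import re

# Six O's exceed the {1,3} limit, so there is no match.
result = re.search('YOLO{1,3}$', "YOLOOOOOO")
print(result)  # None

# {4} means exactly four of the preceding match.
years = re.findall(r'\d{4}', "Party like it's 1999")
print(years)  # ['1999']

# {2,} means two or more.
long_numbers = re.findall(r'\d{2,}', "a1 b22 c333")
print(long_numbers)  # ['22', '333']
```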

*Negation*
    
Use the caret in square brackets: `[^aeiou]` means *not* a, e, i, o, or u
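A short sketch of negation on a made-up phrase. The `\s` inside the brackets is added so that spaces are excluded along with the vowels.

```python
import re

text = "dark and stormy"

# Every character that is neither a lowercase vowel nor whitespace.
non_vowels = re.findall(r'[^aeiou\s]', text)
print(non_vowels)  # ['d', 'r', 'k', 'n', 'd', 's', 't', 'r', 'm', 'y']
```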

*Groups*
    
Treat multiple characters together, as if they were a single character.

Use parentheses. e.g:

In [85]:
text = "banana"
re.findall('^ba(na)+$', text)

['na']

In [86]:
text = "lololololololololololol"
re.findall('^l(ol)+$', text)

['ol']

Capturing groups:

In [87]:
text = "Ketchup Catsup"
re.findall('(Ketch|Cats)up', text)

['Ketch', 'Cats']
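The outputs above contain only the group contents because `re.findall` returns the captured groups when a pattern has any. If you want the whole match instead, one option is a non-capturing group, written `(?:...)`. Another is to use `re.search` and ask the match object for group 0. A minimal sketch:

```python
import re

# (?:...) groups without capturing, so findall returns the full match.
whole_banana = re.findall(r'^ba(?:na)+$', "banana")
print(whole_banana)  # ['banana']

condiments = re.findall(r'(?:Ketch|Cats)up', "Ketchup Catsup")
print(condiments)  # ['Ketchup', 'Catsup']

# With a capturing group, the match object still exposes both pieces.
match = re.search(r'^ba(na)+$', "banana")
print(match.group(0))  # banana (the whole match)
print(match.group(1))  # na (the last repetition captured by the group)
```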

## Regex Crosswords

Try https://regexcrossword.com for a puzzle game.