# Regular Expressions

## Announcements

## Final project discussion

## Review

## Looking Back

- SQL, Relational Databases
   - Selecting, sorting, limiting, joins
   - Using SQLite in Colab with `%%sql`

- Data manipulation with Pandas
   - DataFrames and Series
   - Importing from and exporting to SQL
   - Pulling tables from web
   - Selection, sorting, counting

- Split-Apply-Combine
   - Groupby in Pandas e.g. `.groupby('column').mean()`

- Visualization

- Semi-Structured data in MongoDB
    - JSON
    - selection, sorting
    - Aggregations
    - MapReduce concepts

- String pattern matching and extraction with regular expressions

- Next week
    - Advanced Pandas (rolling, dates)
    - Web scraping

## Final Project Updates

## <center>Regular Expressions</center>
### <center>aka *regex*</center>

## *Recording - Intro to regular expressions*

In [None]:
from IPython.display import HTML
HTML('<iframe id="kaltura_player" src="https://cdnapisec.kaltura.com/p/2357732/sp/235773200/embedIframeJs/uiconf_id/41433732/partner_id/2357732?iframeembed=true&playerId=kaltura_player&entry_id=0_f7fvihh9&flashvars[streamerType]=auto&amp;flashvars[localizationCode]=en&amp;flashvars[leadWithHTML5]=true&amp;flashvars[sideBarContainer.plugin]=true&amp;flashvars[sideBarContainer.position]=left&amp;flashvars[sideBarContainer.clickToClose]=true&amp;flashvars[chapters.plugin]=true&amp;flashvars[chapters.layout]=vertical&amp;flashvars[chapters.thumbnailRotator]=false&amp;flashvars[streamSelector.plugin]=true&amp;flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&amp;flashvars[dualScreen.plugin]=true&amp;flashvars[hotspots.plugin]=1&amp;flashvars[Kaltura.addCrossoriginToIframe]=true&amp;&wid=0_b9qukktj" width="790" height="474" allowfullscreen webkitallowfullscreen mozAllowFullScreen allow="autoplay *; fullscreen *; encrypted-media *" sandbox="allow-forms allow-same-origin allow-scripts allow-top-navigation allow-pointer-lock allow-popups allow-modals allow-orientation-lock allow-popups-to-escape-sandbox allow-presentation allow-top-navigation-by-user-activation" frameborder="0" title="Kaltura Player"></iframe>')

*Video Notes: Link to Pandas string methods documentation: https://pandas.pydata.org/docs/user_guide/text.html#string-methods*

### Overview

Regular Expressions help you work with strings

*Pattern Matching*

e.g. Find all phone numbers on a web page

*Manipulation*

e.g. Match "{Lastname}, {Firstname}" in a set of records and rewrite it as "{Firstname} {Lastname}"
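As a minimal sketch of the manipulation case above, the cell below swaps the two names with `re.sub`. The sample records are hypothetical, and the pattern uses parentheses, `^`, and `$`, which are covered later in these notes.

```python
import re

# Hypothetical sample records in "Lastname, Firstname" form.
records = ["Lovelace, Ada", "Turing, Alan"]

# Capture each name in a group, then swap them with backreferences \2 and \1.
pattern = r'^(\w+), (\w+)$'
rewritten = [re.sub(pattern, r'\2 \1', record) for record in records]
print(rewritten)  # ['Ada Lovelace', 'Alan Turing']
```

This only handles single-word names; real records would likely need a more permissive pattern.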

## Why?

- Checking whether an input is valid (e.g. password, phone number, email, etc.)
- Cleaning data
- More complex data subsetting
- Working with user inputs or other unstructured data
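To illustrate the first use case in the list above, here is a rough sketch of input validation. The helper name `looks_like_phone` and the `ddd-ddd-dddd` format are assumptions made for this example, not a general phone number rule.

```python
import re

def looks_like_phone(value):
    """Return True if value has the assumed form ddd-ddd-dddd."""
    # fullmatch succeeds only if the entire string fits the pattern.
    return re.fullmatch(r'\d\d\d-\d\d\d-\d\d\d\d', value) is not None

print(looks_like_phone("303-555-0199"))  # True
print(looks_like_phone("555-0199"))      # False
```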

### Q: Where can you use regular expressions?

### A: Many, many places!

## In Python

In [None]:
import re
comment = "It was a dark and stormy night." 

Find a simple string:

In [None]:
re.findall('dark', comment)

['dark']

Find all sequences of one or more word characters:

In [None]:
re.findall(r'\w+', comment)  # raw string keeps the backslash intact

['It', 'was', 'a', 'dark', 'and', 'stormy', 'night']

## In SQL

SQLite doesn't support regular expressions by default, but other databases do.

**MySQL**

Select rows where the column contains only alphanumeric characters:

```
SELECT * FROM table WHERE column REGEXP '^[A-Za-z0-9]+$';
```

**PostgreSQL**

Match strings that include foo, bar, or baz:

```
SELECT * FROM table WHERE value ~ 'foo|bar|baz';
```
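SQLite does recognize the `REGEXP` operator but ships without a function behind it. From Python, one possible workaround is to register your own with the standard library's `sqlite3` module. This is a sketch under assumptions: the `items` table and its rows are made up for the example.

```python
import re
import sqlite3

def regexp(pattern, value):
    # SQLite evaluates `value REGEXP pattern` by calling regexp(pattern, value).
    if value is None:
        return 0
    return int(re.search(pattern, value) is not None)

conn = sqlite3.connect(":memory:")
conn.create_function("REGEXP", 2, regexp)

# Hypothetical table and rows, for illustration only.
conn.execute("CREATE TABLE items (name TEXT)")
conn.executemany(
    "INSERT INTO items VALUES (?)",
    [("abc123",), ("no spaces!",), ("XYZ",)],
)

rows = conn.execute(
    "SELECT name FROM items WHERE name REGEXP '^[A-Za-z0-9]+$' ORDER BY name"
).fetchall()
print(rows)  # [('XYZ',), ('abc123',)]
```

The function only exists on the connection where it was registered, so it would need to be registered again for each new connection.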

## In Pandas

In [None]:
import pandas as pd
movies = pd.read_csv('https://raw.githubusercontent.com/organisciak/Scripting-Course/master/data/movielens_small.csv')
movies.sample()

Unnamed: 0,userId,rating,title,genres,timestamp,year
35931,109,4.0,"Lord of the Rings: The Fellowship of the Ring,...",Adventure,1153229733,2001


Find movies where there is a digit (`\d`) right before the end of the string (`$`):

In [None]:
matches = movies['title'].str.contains(r'\d$')
movies[matches].sample(10)

Unnamed: 0,userId,rating,title,genres,timestamp,year
49307,29,3.0,Scream 2,Comedy,1313925074,1997
73691,574,3.5,Predator 2,Action,1232817836,1990
40428,153,1.0,Pokémon the Movie 2000,Animation,1046739925,2000
86675,475,1.5,Transporter 2,Action,1447327663,2005
65868,571,4.0,2012,Action,1334342752,2009
41287,48,4.0,Toy Story 3,Adventure,1318721995,2010
82142,564,3.0,Wayne's World 2,Comedy,974836927,1993
12610,243,4.0,Fahrenheit 9/11,Documentary,1094226630,2004
47833,247,4.0,Die Hard 2,Action,953102260,1990
92992,239,3.0,Love Potion #9,Comedy,991863254,1992


Find movies where the substring ' Part ' exists:

In [None]:
matches = movies.title.str.contains(' Part ')
movies[matches].sample(10)

Unnamed: 0,userId,rating,title,genres,timestamp,year
50290,346,4.0,Back to the Future Part III,Adventure,1044651160,1990
50167,155,2.0,Back to the Future Part II,Adventure,943350250,1989
41358,620,4.0,Harry Potter and the Deathly Hallows: Part 1,Action,1455532293,2010
41392,620,4.0,Harry Potter and the Deathly Hallows: Part 2,Action,1455532588,2011
50232,612,2.0,Back to the Future Part II,Adventure,1455638646,1989
41390,553,4.0,Harry Potter and the Deathly Hallows: Part 2,Action,1423010553,2011
50312,486,3.5,Back to the Future Part III,Adventure,1464121628,1990
25138,309,5.0,"Godfather: Part II, The",Crime,1114566751,1974
50216,518,3.0,Back to the Future Part II,Adventure,945364806,1989
25136,297,3.5,"Godfather: Part II, The",Crime,1318703831,1974


Find movies that are named "The ... of ..."

In [None]:
matches = movies.title.str.contains('^The .+ of ')
movies[matches].sample(10)

Unnamed: 0,userId,rating,title,genres,timestamp,year
93217,294,4.0,The Diary of Anne Frank,Drama,1119922983,1959
94735,262,2.5,The Plague of the Zombies,Horror,1433901869,1966
68275,270,3.0,The Hobbit: The Battle of the Five Armies,Adventure,1469306927,2014
68277,347,3.0,The Hobbit: The Battle of the Five Armies,Adventure,1462999892,2014
68250,84,4.0,The Theory of Everything,Drama,1429911324,2014
92711,199,3.0,The Mating Habits of the Earthbound Human,Comedy,1214914182,1999
36099,212,4.0,The Count of Monte Cristo,Action,1228789284,2002
36097,73,4.0,The Count of Monte Cristo,Action,1264835164,2002
36102,294,4.0,The Count of Monte Cristo,Action,1112390008,2002
99139,547,4.0,The End of the Tour,Drama,1454253806,2015


## In MongoDB

In [None]:
from pymongo import MongoClient
client = MongoClient()
db = client.week7
collection = db.cooking

Find a recipe with an ingredient called "yellow ..."

In [None]:
collection.find_one({
    "ingredients": {"$regex": "yellow .*"}
})

{'_id': ObjectId('5cedb796db075a25e4ac71b5'),
 'cuisine': 'southern_us',
 'id': 25693,
 'ingredients': ['plain flour',
  'ground pepper',
  'salt',
  'tomatoes',
  'ground black pepper',
  'thyme',
  'eggs',
  'green tomatoes',
  'yellow corn meal',
  'milk',
  'vegetable oil']}

After unwinding the recipes to one doc per ingredient, find ingredients with a qualified salt:

In [None]:
pipeline = [
    { "$unwind": "$ingredients" },
    { "$project": {"ingredients": 1, "_id":0} },
    { "$match":{
        "ingredients": {"$regex": "^.+ salt" }
        }
    },
    { "$limit": 5 }
]
results = collection.aggregate(pipeline)
list(results)

[{'ingredients': 'sea salt'},
 {'ingredients': 'kosher salt'},
 {'ingredients': 'fine sea salt'},
 {'ingredients': 'kosher salt'},
 {'ingredients': 'kosher salt'}]

Count the qualified salt types:

In [None]:
pipeline = [
    { "$unwind": "$ingredients" },
    { "$project": {"ingredients": 1, "_id":0} },
    { "$match":{ "ingredients": {"$regex": "^.+ salt$" } } },
    { "$group":{
        "_id": "$ingredients", "count": {"$sum": 1} } 
    },
    { "$sort": { "count": -1} },
    { "$limit": 20 }
]
results = collection.aggregate(pipeline)
list(results)

[{'_id': 'kosher salt', 'count': 3113},
 {'_id': 'sea salt', 'count': 940},
 {'_id': 'coarse salt', 'count': 578},
 {'_id': 'fine sea salt', 'count': 285},
 {'_id': 'garlic salt', 'count': 240},
 {'_id': 'seasoning salt', 'count': 131},
 {'_id': 'table salt', 'count': 79},
 {'_id': 'coarse sea salt', 'count': 68},
 {'_id': 'coarse kosher salt', 'count': 64},
 {'_id': 'celery salt', 'count': 52},
 {'_id': 'fine salt', 'count': 24},
 {'_id': 'onion salt', 'count': 15},
 {'_id': 'rock salt', 'count': 14},
 {'_id': 'black salt', 'count': 12},
 {'_id': 'pickling salt', 'count': 12},
 {'_id': 'Himalayan salt', 'count': 11},
 {'_id': 'celtic salt', 'count': 9},
 {'_id': 'maldon sea salt', 'count': 8},
 {'_id': 'smoked sea salt', 'count': 6},
 {'_id': 'iodized salt', 'count': 4}]

### Note on variation

- Regular expression syntax is *close* to standard, but implementations differ slightly from one tool to the next.

## Basics of Regular Expressions

In this class, we'll cover the basics, practiced in Python and Pandas.

To follow along:

In [None]:
import re

In [None]:
text = "hello"  # example text (assumed); swap in your own
re.findall('ell', text)  # swap in your own pattern

['ell']

## Wild Cards

In [None]:
HTML('<iframe id="kaltura_player" src="https://cdnapisec.kaltura.com/p/2357732/sp/235773200/embedIframeJs/uiconf_id/41433732/partner_id/2357732?iframeembed=true&playerId=kaltura_player&entry_id=0_ezsa2pdh&flashvars[streamerType]=auto&amp;flashvars[localizationCode]=en&amp;flashvars[leadWithHTML5]=true&amp;flashvars[sideBarContainer.plugin]=true&amp;flashvars[sideBarContainer.position]=left&amp;flashvars[sideBarContainer.clickToClose]=true&amp;flashvars[chapters.plugin]=true&amp;flashvars[chapters.layout]=vertical&amp;flashvars[chapters.thumbnailRotator]=false&amp;flashvars[streamSelector.plugin]=true&amp;flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&amp;flashvars[dualScreen.plugin]=true&amp;flashvars[hotspots.plugin]=1&amp;flashvars[Kaltura.addCrossoriginToIframe]=true&amp;&wid=0_gsd80sah" width="790" height="474" allowfullscreen webkitallowfullscreen mozAllowFullScreen allow="autoplay *; fullscreen *; encrypted-media *" sandbox="allow-forms allow-same-origin allow-scripts allow-top-navigation allow-pointer-lock allow-popups allow-modals allow-orientation-lock allow-popups-to-escape-sandbox allow-presentation allow-top-navigation-by-user-activation" frameborder="0" title="Kaltura Player"></iframe>')

`a` - Match the letter 'a'. Same for most other characters

In [None]:
text = "Colorado"
re.findall('o', text)

['o', 'o', 'o']

In [None]:
text = "Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo"
re.findall('Buffalo buffalo', text)

['Buffalo buffalo', 'Buffalo buffalo', 'Buffalo buffalo']

`.` - Match any single character

In [None]:
text = "who, what, where, why, and how"
re.findall('wh.', text)

['who', 'wha', 'whe', 'why']

In [None]:
text = "who, what, where, why, and how"
re.findall('wh.,', text)

['who,', 'why,']

- `\w` - Match any word character (letters, digits, and underscore; support for non-English characters varies)
- `\W` - Match any non-word character

In [None]:
text = "Who, what, where, why, and how"
re.findall(r'\w\w\w,', text)

['Who,', 'hat,', 'ere,', 'why,']

In [None]:
text = "Who, what, where, why, and how"
re.findall(r'\w', text)

['W',
 'h',
 'o',
 'w',
 'h',
 'a',
 't',
 'w',
 'h',
 'e',
 'r',
 'e',
 'w',
 'h',
 'y',
 'a',
 'n',
 'd',
 'h',
 'o',
 'w']
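The caveat about non-English characters can be seen in Python itself. As a small sketch: by default `\w` in Python 3 matches Unicode letters, while the `re.ASCII` flag limits it to `[a-zA-Z0-9_]`. The example word is written with an escape so the accented letter is a single character.

```python
import re

text = "Pok\u00e9mon"  # "Pokémon" with a single precomposed é

unicode_words = re.findall(r'\w+', text)
ascii_words = re.findall(r'\w+', text, flags=re.ASCII)

print(unicode_words)  # ['Pokémon']
print(ascii_words)    # ['Pok', 'mon']
```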

`\d` - Match any digit

In [None]:
text = "Party like it's 1999"
re.findall(r'\d', text)

['1', '9', '9', '9']

In [None]:
text = "Party like it's 1999"
re.findall(r'\d\d\d\d', text)

['1999']

### *What if I want to match an actual backslash or period?*

This is a problem:

In [None]:
text = "Dr. Jones Drinks Too Much"
re.findall('Dr.', text)

['Dr.', 'Dri']

The fix: precede the special character with a backslash.

E.g.

- `.` - Matches *any* character
- `\.` - Matches a literal period

In [None]:
re.findall(r'Dr\.', text)

['Dr.']
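When the text to match comes from a variable, escaping by hand gets error-prone. One option, sketched here, is `re.escape`, which adds the backslashes for you. The variable name `literal` is just for illustration.

```python
import re

text = "Dr. Jones Drinks Too Much"
literal = "Dr."

# re.escape backslash-escapes characters that are special in a pattern.
pattern = re.escape(literal)
print(pattern)                    # Dr\.
print(re.findall(pattern, text))  # ['Dr.']
```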

## Reference (so far)
- `a` - Match the letter 'a'. Same for most other characters
- `\w` - Match any word character (letters, digits, and underscore; support for non-English characters varies)
- `\W` - Match any non-word character
- `\d` - Match any digit
- `.` - Match *any* single character
- `\.` - Match a literal period

Let's try the first few lab questions - 1.1. to 1.4.

In [None]:
HTML('<iframe id="kaltura_player" src="https://cdnapisec.kaltura.com/p/2357732/sp/235773200/embedIframeJs/uiconf_id/41433732/partner_id/2357732?iframeembed=true&playerId=kaltura_player&entry_id=0_6scfp6s9&flashvars[streamerType]=auto&amp;flashvars[localizationCode]=en&amp;flashvars[leadWithHTML5]=true&amp;flashvars[sideBarContainer.plugin]=true&amp;flashvars[sideBarContainer.position]=left&amp;flashvars[sideBarContainer.clickToClose]=true&amp;flashvars[chapters.plugin]=true&amp;flashvars[chapters.layout]=vertical&amp;flashvars[chapters.thumbnailRotator]=false&amp;flashvars[streamSelector.plugin]=true&amp;flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&amp;flashvars[dualScreen.plugin]=true&amp;flashvars[hotspots.plugin]=1&amp;flashvars[Kaltura.addCrossoriginToIframe]=true&amp;&wid=0_fatbgvj9" width="790" height="474" allowfullscreen webkitallowfullscreen mozAllowFullScreen allow="autoplay *; fullscreen *; encrypted-media *" sandbox="allow-forms allow-same-origin allow-scripts allow-top-navigation allow-pointer-lock allow-popups allow-modals allow-orientation-lock allow-popups-to-escape-sandbox allow-presentation allow-top-navigation-by-user-activation" frameborder="0" title="Kaltura Player"></iframe>')

`\s` - Match any whitespace character (spaces, tabs, and, in Python, line breaks)

*What will this return?*

In [None]:
text = "The quick brown fox jumped over the lazy yellow dog"
re.findall(r'\s....\s', text)

[' over ', ' lazy ']

`[ab]` - Match any one of several possible characters, in this case 'a' or 'b'

In [None]:
text = "The quick brown fox jumped over the lazy yellow dog"
re.findall('[Tt]he', text)

['The', 'the']

- `[a-z]` matches any character from a to z
- `[A-Z]` matches any character from A to Z

In [None]:
text = "text 1-800-SPAM for more information"
re.findall('[A-Z][A-Z][A-Z][A-Z]', text)

['SPAM']

Those square brackets are the same as before, so you can combine `A-Z` with other matches.

e.g. Match capital letters, digits, or hyphens:

In [None]:
text = "text 1-800-SPAM for more information"
re.findall(r'[\d\-A-Z]+', text)

['1-800-SPAM']

*Note above that inside square brackets a hyphen is a special character (it marks a range such as `A-Z`), so a literal `-` is matched with `\-`.*
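As a short sketch of an alternative to escaping, Python also treats a hyphen as literal when it is the last character inside the brackets. Both patterns below give the same result on the example text.

```python
import re

text = "text 1-800-SPAM for more information"

escaped = re.findall(r'[\d\-A-Z]+', text)   # hyphen escaped with a backslash
trailing = re.findall(r'[\dA-Z-]+', text)   # hyphen placed last, so it is literal

print(escaped)   # ['1-800-SPAM']
print(trailing)  # ['1-800-SPAM']
```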

Returning to the earlier data.

In [None]:
titles = movies.title.drop_duplicates()

"The (single word) of ..."

In [None]:
matches = titles.str.contains(r'^The \w+ of ')
titles[matches].sample(10)

96669    The Disappearance of Eleanor Rigby: Her
88980               The Earrings of Madame de...
99821                       The Face of an Angel
97199                           The Book of Life
68479                         The Age of Adaline
95699                 The Lair of the White Worm
93216                    The Diary of Anne Frank
94869             The Possession of Michael King
97198                             The Best of Me
94735                  The Plague of the Zombies
Name: title, dtype: object

In [None]:
matches = titles.str.contains(':')
titles[matches].sample(10)

67847                  Captain America: The Winter Soldier
29299    Léon: The Professional (a.k.a. The Professiona...
96755    Will Ferrell: You're Welcome America - A Final...
88694         Nightmare on Elm Street 3: Dream Warriors, A
82307                   Police Academy 6: City Under Siege
92984                             Exorcist II: The Heretic
75368                      Tabu: A Story of the South Seas
94572                       Sherlock: The Abominable Bride
93522       Librarian, The: The Curse of the Judas Chalice
76784         City Slickers II: The Legend of Curly's Gold
Name: title, dtype: object

In [None]:
matches = titles.str.contains(r"^\w+\-\w+$")
titles[matches]

259                  Ben-Hur
12000             Spider-Man
40269                  X-Men
55032                  U-571
58796             Scooby-Doo
61252              Fail-Safe
65729                G-Force
66092               Kick-Ass
68396                Ant-Man
69332            Re-Animator
69765    Slaughterhouse-Five
81831                  K-PAX
83228                 BURN-E
83394               Non-Stop
83908               Bio-Dome
89557            Topsy-Turvy
93256               Cry-Baby
94106              She-Devil
95079               Kon-Tiki
96155              De-Lovely
96602               Catch-22
96617                Ben-hur
96717               Semi-Pro
98638                  T-Men
99056     Shakespeare-Wallah
99971        Straight-Jacket
Name: title, dtype: object

## Reference (so far)
- `a` - Match the letter `a`. Same for most other characters
- `.` - Match *any* single character
- `\w` - Match any word character (letters, digits, and underscore; support for non-English characters varies)
- `\W` - Match any non-word character
- `\d` - Match any digit
- `\.` - Match a literal period
- `\s` - Match any whitespace character (spaces, tabs, and, in Python, line breaks)
- `[ab]` - Match any one of several possible characters, in this case `a` or `b`
- `[a-z]` matches any character from a to z
- `[A-Z]` matches any character from A to Z
- `[A-Zab]` matches any character from A to Z (`A-Z`), *or* `a` *or* `b`

## Repetition

In [None]:
HTML('<iframe id="kaltura_player" src="https://cdnapisec.kaltura.com/p/2357732/sp/235773200/embedIframeJs/uiconf_id/41433732/partner_id/2357732?iframeembed=true&playerId=kaltura_player&entry_id=0_eawu2nno&flashvars[streamerType]=auto&amp;flashvars[localizationCode]=en&amp;flashvars[leadWithHTML5]=true&amp;flashvars[sideBarContainer.plugin]=true&amp;flashvars[sideBarContainer.position]=left&amp;flashvars[sideBarContainer.clickToClose]=true&amp;flashvars[chapters.plugin]=true&amp;flashvars[chapters.layout]=vertical&amp;flashvars[chapters.thumbnailRotator]=false&amp;flashvars[streamSelector.plugin]=true&amp;flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&amp;flashvars[dualScreen.plugin]=true&amp;flashvars[hotspots.plugin]=1&amp;flashvars[Kaltura.addCrossoriginToIframe]=true&amp;&wid=0_mdxb0rv1" width="790" height="474" allowfullscreen webkitallowfullscreen mozAllowFullScreen allow="autoplay *; fullscreen *; encrypted-media *" sandbox="allow-forms allow-same-origin allow-scripts allow-top-navigation allow-pointer-lock allow-popups allow-modals allow-orientation-lock allow-popups-to-escape-sandbox allow-presentation allow-top-navigation-by-user-activation" frameborder="0" title="Kaltura Player"></iframe>')

`?` - One or zero of the preceding match

In [None]:
text = "color colour"
re.findall('colou?r', text)

['color', 'colour']

- `+` - One or more of the preceding match
- `*` - Zero or more of the preceding match

In [None]:
text = "GOAL GOOOOOOOOOAAAAAAL"
re.findall('GO+A+L', text)

['GOAL', 'GOOOOOOOOOAAAAAAL']

In [None]:
text = "GOAL"
re.findall('GO+A+L', text)

['GOAL']
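The examples above only exercise `+`. As a quick sketch of the difference, `*` also accepts zero repetitions. The text here is made up for the comparison.

```python
import re

text = "GAL GOAL GOOOAL"

plus = re.findall(r'GO+AL', text)  # requires at least one O
star = re.findall(r'GO*AL', text)  # allows zero O's as well

print(plus)  # ['GOAL', 'GOOOAL']
print(star)  # ['GAL', 'GOAL', 'GOOOAL']
```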

`*` and `+` are *greedy* in Python. They will grab as much as possible. 

In [None]:
text = "<p>Something or other</p><p>Yet more junk.</p>" 
re.findall('<p>.*</p>', text)

['<p>Something or other</p><p>Yet more junk.</p>']

In [None]:
text = "foo1@gmail.com;b-a-r@gmail.com;baz@gmail.com" 
re.findall(r'\w.*@gmail.com', text)

['foo1@gmail.com;b-a-r@gmail.com;baz@gmail.com']

`*?` is the *lazy* alternative, it will grab as little as possible.

In [None]:
re.findall(r'\w.*?@gmail.com', text)

['foo1@gmail.com', 'b-a-r@gmail.com', 'baz@gmail.com']
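The same lazy quantifier fixes the earlier paragraph example. A brief sketch, reusing that text:

```python
import re

text = "<p>Something or other</p><p>Yet more junk.</p>"

greedy = re.findall(r'<p>.*</p>', text)   # grabs from the first <p> to the last </p>
lazy = re.findall(r'<p>.*?</p>', text)    # stops at the first </p> it can

print(greedy)  # ['<p>Something or other</p><p>Yet more junk.</p>']
print(lazy)    # ['<p>Something or other</p>', '<p>Yet more junk.</p>']
```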

## Reference (so far)

**Matching characters**
- `a` - Match the letter `a`. Same for most other characters
- `.` - Match *any* single character
- `\w` - Match any word character (letters, digits, and underscore; support for non-English characters varies)
- `\W` - Match any non-word character
- `\d` - Match any digit
- `\.` - Match a literal period
- `\s` - Match any whitespace character (spaces, tabs, and, in Python, line breaks)

**Multiple Matches**
- `[ab]` - Match any one of several possible characters, in this case `a` or `b`
- `[a-z]` matches any character from a to z
- `[A-Z]` matches any character from A to Z
- `[A-Zab]` matches any character from A to Z (`A-Z`), *or* `a` *or* `b`

**Repeating**

*'greedy' means that it captures as much as it can, 'lazy' means it captures as little as possible.*
- `?` - One or zero of the preceding match
- `+` - One or more of the preceding match (greedy)
- `*` - Zero or more of the preceding match (greedy)
- `*?`, `+?`  - Lazy versions of `*` and `+`

## Start and End of Line

In [None]:
HTML('<iframe id="kaltura_player" src="https://cdnapisec.kaltura.com/p/2357732/sp/235773200/embedIframeJs/uiconf_id/41433732/partner_id/2357732?iframeembed=true&playerId=kaltura_player&entry_id=0_h4jaxqcp&flashvars[streamerType]=auto&amp;flashvars[localizationCode]=en&amp;flashvars[leadWithHTML5]=true&amp;flashvars[sideBarContainer.plugin]=true&amp;flashvars[sideBarContainer.position]=left&amp;flashvars[sideBarContainer.clickToClose]=true&amp;flashvars[chapters.plugin]=true&amp;flashvars[chapters.layout]=vertical&amp;flashvars[chapters.thumbnailRotator]=false&amp;flashvars[streamSelector.plugin]=true&amp;flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&amp;flashvars[dualScreen.plugin]=true&amp;flashvars[hotspots.plugin]=1&amp;flashvars[Kaltura.addCrossoriginToIframe]=true&amp;&wid=0_2jd77h7r" width="790" height="474" allowfullscreen webkitallowfullscreen mozAllowFullScreen allow="autoplay *; fullscreen *; encrypted-media *" sandbox="allow-forms allow-same-origin allow-scripts allow-top-navigation allow-pointer-lock allow-popups allow-modals allow-orientation-lock allow-popups-to-escape-sandbox allow-presentation allow-top-navigation-by-user-activation" frameborder="0" title="Kaltura Player"></iframe>')

`^` - Start of line

In [None]:
text = "The quick brown fox jumped over the lazy yellow dog"
re.findall('^The', text)

['The']

In [None]:
re.findall('^.*fox', text)

['The quick brown fox']

`$` - End of line

In [None]:
text = "The quick brown fox jumped over the lazy yellow dog"
re.findall('.......$', text)

['low dog']

In [None]:
text = "The quick brown fox jumped over the lazy yellow dog"
re.findall("^.*$", text)

['The quick brown fox jumped over the lazy yellow dog']
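In Python, `^` and `$` anchor to the start and end of the whole string unless told otherwise. As a sketch with a made-up two-line string, the `re.MULTILINE` flag makes them match at every line.

```python
import re

text = "The quick brown fox\nthe lazy yellow dog"

default = re.findall(r'^\w+', text)                        # start of string only
multiline = re.findall(r'^\w+', text, flags=re.MULTILINE)  # start of every line

print(default)    # ['The']
print(multiline)  # ['The', 'the']
```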

## Reference

**Matching characters**
- `a` - Match the letter `a`. Same for most other characters
- `.` - Match *any* single character
- `\w` - Match any word character (letters, digits, and underscore; support for non-English characters varies)
- `\W` - Match any non-word character
- `\d` - Match any digit
- `\.` - Match a literal period
- `\s` - Match any whitespace character (spaces, tabs, and, in Python, line breaks)

**Multiple Matches**
- `[ab]` - Match any one of several possible characters, in this case `a` or `b`
- `[a-z]` matches any character from a to z
- `[A-Z]` matches any character from A to Z
- `[A-Zab]` matches any character from A to Z (`A-Z`), *or* `a` *or* `b`

**Repeating**

*'greedy' means that it captures as much as it can, 'lazy' means it captures as little as possible.*
- `?` - One or zero of the preceding match
- `+` - One or more of the preceding match (greedy)
- `*` - Zero or more of the preceding match (greedy)
- `*?`, `+?`  - Lazy versions of `*` and `+`

**Position**
- `^` - Start of line
- `$` - End of line

# Additional tips

Choose a range for repetition with `{min,max}`. e.g.

In [None]:
HTML('<iframe id="kaltura_player" src="https://cdnapisec.kaltura.com/p/2357732/sp/235773200/embedIframeJs/uiconf_id/41433732/partner_id/2357732?iframeembed=true&playerId=kaltura_player&entry_id=0_yrvd1amw&flashvars[streamerType]=auto&amp;flashvars[localizationCode]=en&amp;flashvars[leadWithHTML5]=true&amp;flashvars[sideBarContainer.plugin]=true&amp;flashvars[sideBarContainer.position]=left&amp;flashvars[sideBarContainer.clickToClose]=true&amp;flashvars[chapters.plugin]=true&amp;flashvars[chapters.layout]=vertical&amp;flashvars[chapters.thumbnailRotator]=false&amp;flashvars[streamSelector.plugin]=true&amp;flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&amp;flashvars[dualScreen.plugin]=true&amp;flashvars[hotspots.plugin]=1&amp;flashvars[Kaltura.addCrossoriginToIframe]=true&amp;&wid=0_j3srkwqh" width="640" height="360" allowfullscreen webkitallowfullscreen mozAllowFullScreen allow="autoplay *; fullscreen *; encrypted-media *" sandbox="allow-forms allow-same-origin allow-scripts allow-top-navigation allow-pointer-lock allow-popups allow-modals allow-orientation-lock allow-popups-to-escape-sandbox allow-presentation allow-top-navigation-by-user-activation" frameborder="0" title="Kaltura Player"></iframe>')

In [None]:
text = "YOLO"
re.search('YOLO{1,3}$', text) 

<_sre.SRE_Match object; span=(0, 4), match='YOLO'>

In [None]:
text = "YOLOOO"
re.search('YOLO{1,3}$', text)

<_sre.SRE_Match object; span=(0, 6), match='YOLOOO'>

In [None]:
text = "YOLOOOOOO"
re.search('YOLO{1,3}$', text)  # no match (returns None): six O's exceed the {1,3} limit

*Negation*
    
Use the caret in square brackets: `[^aeiou]` means *not* a, e, i, o, or u
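A minimal sketch of negation, using a short example word:

```python
import re

text = "banana"

# [^aeiou] matches any single character that is not a lowercase vowel.
non_vowels = re.findall(r'[^aeiou]', text)
print(non_vowels)  # ['b', 'n', 'n']
```

Note that a negated set still matches spaces and punctuation, since those are also "not a vowel".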

*Groups*
    
Treat multiple characters together, as if they were a single character.

Use parentheses. e.g:

In [None]:
text = "banana"
re.findall('^ba(na)+$', text)

['na']

In [None]:
text = "lololololololololololol"
re.findall('^l(ol)+$', text)

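['ol']

When a pattern contains a group, `re.findall` returns what the group captured (here, its final repetition) rather than the whole match.

Capturing groups: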

In [None]:
text = "Ketchup Catsup"
re.findall('(Ketch|Cats)up', text)

['Ketch', 'Cats']
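If you want grouping without changing what `re.findall` returns, Python offers non-capturing groups written `(?:...)`. This goes slightly beyond the material above, so treat it as an optional sketch.

```python
import re

text = "Ketchup Catsup"

captured = re.findall(r'(Ketch|Cats)up', text)   # returns only the group contents
whole = re.findall(r'(?:Ketch|Cats)up', text)    # non-capturing: returns full matches

print(captured)  # ['Ketch', 'Cats']
print(whole)     # ['Ketchup', 'Catsup']
```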