# Regex

- What is a regular expression?
    - a small language for describing patterns in text; these notes use the Python flavor, but regex is larger in scope than Python
- When are regular expressions useful?
    - parsing regular text
    - text with some sort of structure
    - commonly used in data acquisition and data prep
    - a regex can be a very simple operation

In [2]:
import pandas as pd
import re

In [3]:
log_file_lines = '''
76.185.131.226 - - [11/May/2020:14:25:53 +0000] "GET / HTTP/1.1" 200 42 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:25:46 +0000] "GET / HTTP/1.1" 200 42 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:25:58 +0000] "GET / HTTP/1.1" 200 42 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36"
76.185.131.226 - - [11/May/2020:16:25:58 +0000] "GET /favicon.ico HTTP/1.1" 200 162 "https://python.zach.lol/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36"
104.5.217.57 - - [11/May/2020:16:26:27 +0000] "GET / HTTP/1.1" 200 42 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:26:46 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:26:54 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
104.5.217.57 - - [11/May/2020:16:27:04 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:27:05 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:27:10 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
'''

In [4]:
import re # part of the python stdlib

- search: shows a single match for a regex
- findall: shows *all* the matches for a regex in a subject

### Literals

In [5]:
regexp = r'a'
subject = 'abc'

re.search(regexp, subject)

<re.Match object; span=(0, 1), match='a'>

- the match is `a`; the span `(0, 1)` gives the start index and the (exclusive) end index of the match

In [6]:
regexp = r'b'
subject = 'abc'

re.search(regexp, subject)

<re.Match object; span=(1, 2), match='b'>

- the match is `b`, found at span `(1, 2)`

In [7]:
regexp = r'ab'
subject = 'abc'

re.search(regexp, subject)

<re.Match object; span=(0, 2), match='ab'>

- the match is `ab`, spanning index 0 up to (but not including) index 2

In [8]:
regexp = r'd'
subject = 'abc'

re.search(regexp, subject)

- there is no output because `re.search` returns `None`: the subject does not contain the literal `d`

In [12]:
regexp = r'ab'
subject = 'abc'

re.findall(regexp, subject)

['ab']

- `findall` returns a list of every match in the subject; if there are none, it returns an empty list
- `search` only returns the first match of the regexp
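The difference between the two functions can be sketched side by side. This is a minimal illustration; the subject string and variable names are made up for the example.

```python
import re

subject = 'abc abc'

# re.search returns a Match object for the first match only (or None)
first = re.search(r'ab', subject)
print(first.group())  # the matched text
print(first.span())   # (start, end) indices; end is exclusive

# re.findall returns a list of every non-overlapping match
every = re.findall(r'ab', subject)
print(every)

# When nothing matches, search gives None and findall gives an empty list
missing_search = re.search(r'd', subject)
missing_findall = re.findall(r'd', subject)
print(missing_search, missing_findall)
```

Because `search` can return `None`, it is usually safest to check the result before calling `.group()` on it.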

In [13]:
regexp = r''
subject = 'abc'

re.findall(regexp, subject)

['', '', '', '']

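- an empty regexp matches the empty string at every position in the subject (before each character and at the end), so `findall` returns four empty strings for the three-character subject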

In [15]:
regexp = r'.'
subject = 'abc'

re.findall(regexp, subject)

['a', 'b', 'c']

- `.` matches any single character (except a newline by default), so `findall` returns every character in the subject
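Two details about `.` are easy to miss: it skips newlines unless a flag is set, and it has to be escaped to match a literal period. A short sketch, with made-up subjects:

```python
import re

subject = 'ab\ncd'

# By default "." matches any character except a newline
default_matches = re.findall(r'.', subject)

# With the re.DOTALL flag, "." matches the newline as well
dotall_matches = re.findall(r'.', subject, re.DOTALL)

# To match a literal period, escape it with a backslash
literal_dots = re.findall(r'\.', 'a.b.c')

print(default_matches, dotall_matches, literal_dots)
```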

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>Use the cell above to start experimenting with regular expressions.</p>
    <ol>
        <li>Change your regular expression to match the literal character "b". What do you notice?</li>
        <li>Change your regular expression to match the literal string "ab". What do you notice?</li>
        <li>Change your regular expression to match the literal "d". What do you notice?</li>
        <li>Use <code>re.findall</code> instead of <code>re.search</code>. How do the results differ?</li>
        <li>Change your regular expression to just the "." character. What are the results?</li>
    </ol>
</div>

### Metacharacters

- `.`: anything
- `\w`: any word character (alphanumeric or underscore)
- `\s`: any whitespace
- `\d`: any digit
- Capital variants (`\W`, `\S`, `\D`): match anything that the lowercase variant does not.
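As a quick check of the list above, the sketch below shows that `\w` includes the underscore and that `\w` and `\W` split a string between them. The subject is invented for the example.

```python
import re

subject = 'user_1 has 2 cats!'

word_chars = re.findall(r'\w', subject)  # letters, digits, and "_"
non_word = re.findall(r'\W', subject)    # spaces and punctuation
digits = re.findall(r'\d', subject)

print(word_chars)
print(non_word)
print(digits)

# Every character is either a word character or a non-word character
print(len(word_chars) + len(non_word) == len(subject))
```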

In [23]:
regexp = r'\W'
subject = 'abc 123'

re.search(regexp, subject)

<re.Match object; span=(3, 4), match=' '>

In [16]:
regexp = r'\w'
subject = 'abc 123'

re.search(regexp, subject)

<re.Match object; span=(0, 1), match='a'>

In [24]:
regexp = r'\S'
subject = 'abc 123'

re.search(regexp, subject)

<re.Match object; span=(0, 1), match='a'>

In [19]:
regexp = r'\s'
subject = 'abc 123'

re.search(regexp, subject)

<re.Match object; span=(3, 4), match=' '>

In [26]:
regexp = r'\D'
subject = 'abc 123'

re.search(regexp, subject)

<re.Match object; span=(0, 1), match='a'>

In [22]:
regexp = r'\d'
subject = 'abc 123'

re.search(regexp, subject)

<re.Match object; span=(4, 5), match='1'>

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>Continue to use the same subject variable from above.</p>
    <ol>
        <li>Use all of the above metacharacters with <code>re.findall</code>. What do you notice?</li>
        <li>What does the regular expression <code>\w\w</code> match?</li>
        <li>Use only metacharacters to write a regular expression to match "c 1".</li>
        <li>Use a combination of metacharacters to match 3 digits in a row.</li>
    </ol>
</div>

In [27]:
regexp = r'\w'
subject = 'abc 123'

re.findall(regexp, subject)

['a', 'b', 'c', '1', '2', '3']

In [30]:
regexp = r'\W'
subject = 'abc 123'

re.findall(regexp, subject)

[' ']

In [28]:
regexp = r'\s'
subject = 'abc 123'

re.findall(regexp, subject)

[' ']

In [31]:
regexp = r'\S'
subject = 'abc 123'

re.findall(regexp, subject)

['a', 'b', 'c', '1', '2', '3']

In [29]:
regexp = r'\d'
subject = 'abc 123'

re.findall(regexp, subject)

['1', '2', '3']

In [32]:
regexp = r'\D'
subject = 'abc 123'

re.findall(regexp, subject)

['a', 'b', 'c', ' ']

In [38]:
regexp = r'\w\w'
subject = 'abc 123'

re.findall(regexp, subject)

['ab', '12']

- matches two word characters in a row; matches do not overlap, so `c` and `3` are left over

In [39]:
regexp = r'\w\w\w'
subject = 'abc 123'

re.findall(regexp, subject)

['abc', '123']

- matches three word characters in a row

In [44]:
regexp = r'\w\s\d'
subject = 'abc 123'

re.findall(regexp, subject)

['c 1']

- this works because `c` and `1` are separated by a single whitespace character, so `c`, the space, and `1` satisfy `\w`, `\s`, and `\d` in order

In [48]:
regexp = r'\d\d\d'
subject = 'abc 123'

re.findall(regexp, subject)

['123']

### Repeating

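- `{}`: a specific number of repetitions
- `*`: zero or more
- `+`: one or more
- `?`: optional
- greedy vs. non-greedy: quantifiers match as much as possible by default; adding `?` after a quantifier makes it match as little as possible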
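A compact sketch of the quantifiers listed above, including the `{m,n}` and `{m,}` range forms. The subjects are made up so that each quantifier gives a different result.

```python
import re

subject = 'a ab abb'

zero_or_more = re.findall(r'ab*', subject)  # "a" followed by any number of "b"
one_or_more = re.findall(r'ab+', subject)   # "a" followed by at least one "b"
optional = re.findall(r'ab?', subject)      # "a" followed by at most one "b"

numbers = '1 22 333 4444'

two_to_three = re.findall(r'\d{2,3}', numbers)  # between 2 and 3 digits
three_or_more = re.findall(r'\d{3,}', numbers)  # 3 digits or more

print(zero_or_more, one_or_more, optional)
print(two_to_three, three_or_more)
```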

In [52]:
regexp = r'\w+' # one or more word characters
subject = 'abc 123'

re.search(regexp, subject)

<re.Match object; span=(0, 3), match='abc'>

In [55]:
regexp = r'\s*\w+' 
subject = 'abc 123'

re.search(regexp, subject)

<re.Match object; span=(0, 3), match='abc'>

In [51]:
regexp = r'\d+'
subject = 'abc 123'

re.search(regexp, subject)

<re.Match object; span=(4, 7), match='123'>

In [56]:
regexp = r'\s*\d+'
subject = 'abc 123'

re.search(regexp, subject)

<re.Match object; span=(3, 7), match=' 123'>

In [58]:
regexp = r'\d{2}' # two digits in a row
subject = 'abc 123'

re.search(regexp, subject)

<re.Match object; span=(4, 6), match='12'>

In [62]:
regexp = r'\w{2}' # two word characters in a row
subject = 'abc 123'

re.search(regexp, subject)

<re.Match object; span=(0, 2), match='ab'>

In [71]:
regexp = r'abcd?' # the d is optional: matches with or without it
subject = 'abc 123'

re.search(regexp, subject)

<re.Match object; span=(0, 3), match='abc'>

In [72]:
regexp = r'\w+' # greedy (tries to match as much as it can until it hits a non-word character)
subject = 'abc 123'

re.findall(regexp, subject)

['abc', '123']

In [74]:
regexp = r'\w+?' # non-greedy (tries to match as little as possible)
subject = 'abc 123'

re.findall(regexp, subject)

['a', 'b', 'c', '1', '2', '3']

In [78]:
regexp = r'.+\d' # greedy: match as much as possible, as long as the match ends with a digit
subject = 'abc 123'

re.search(regexp, subject)

<re.Match object; span=(0, 7), match='abc 123'>

In [79]:
regexp = r'.+?\d' # non-greedy version: stops at the first digit it can end on
subject = 'abc 123'

re.search(regexp, subject)

<re.Match object; span=(0, 5), match='abc 1'>
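The greedy versus non-greedy difference tends to matter most when a delimiter repeats. A minimal sketch, using a made-up HTML-like subject:

```python
import re

subject = '<b>bold</b> and <i>italic</i>'

# Greedy: .+ runs to the last ">" in the subject, so there is one long match
greedy = re.findall(r'<.+>', subject)

# Non-greedy: .+? stops at the first ">" it reaches, so each tag matches separately
lazy = re.findall(r'<.+?>', subject)

print(greedy)
print(lazy)
```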

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>Use the string below as your subject for this exercise.</p>
    <pre><code>Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, San Antonio, TX 78230. You can find us online at http://codeup.com and our alumni portal is located at https://alumni.codeup.com.</code></pre>
    <ol>
        <li>Write a regular expression that matches all the numbers.</li>
        <li>Write a regular expression that matches a 5 digit number, but not a number with fewer digits.</li>
        <li>Write a regular expression that matches any urls in the subject.</li>
    </ol>
</div>

In [88]:
regexp = r'\d+'
subject = 'Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, San Antonio, TX 78230. You can find us online at http://codeup.com and our alumni portal is located at https://alumni.codeup.com.'

re.findall(regexp, subject)

['2014', '600', '350', '78230']

In [89]:
regexp = r'\d'
subject = 'Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, San Antonio, TX 78230. You can find us online at http://codeup.com and our alumni portal is located at https://alumni.codeup.com.'

re.findall(regexp, subject)

['2', '0', '1', '4', '6', '0', '0', '3', '5', '0', '7', '8', '2', '3', '0']

In [97]:
regexp = r'\d{5}'
subject = 'Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, San Antonio, TX 78230. You can find us online at http://codeup.com and our alumni portal is located at https://alumni.codeup.com.'

re.findall(regexp, subject)

['78230']
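One caveat: `\d{5}` also matches the first five digits of a longer number. The subject above has no longer numbers, so it does not show up there. A possible refinement, sketched on a made-up subject, is to add the word boundary `\b`, which is not covered in these notes:

```python
import re

subject = 'ZIP 78230, order 1234567, year 2014'

# \d{5} alone also matches the first five digits of a longer number
loose = re.findall(r'\d{5}', subject)

# \b is a word boundary, so this only matches runs of exactly five digits
strict = re.findall(r'\b\d{5}\b', subject)

print(loose)
print(strict)
```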

In [149]:
regexp = r'https?://.+?com'
subject = 'Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, San Antonio, TX 78230. You can find us online at http://codeup.com and our alumni portal is located at https://alumni.codeup.com.'

re.findall(regexp, subject)

['http://codeup.com', 'https://alumni.codeup.com']

### Any/None Of

In [157]:
regexp = r'[a1]'
subject = 'abc 123'

re.findall(regexp, subject)

['a', '1']

In [150]:
regexp = r'[a1][b2][c3]'
subject = 'abc 123'

re.match(regexp, subject)

<re.Match object; span=(0, 3), match='abc'>

In [158]:
subject = '123abc'

re.match(regexp, subject)

<re.Match object; span=(0, 1), match='1'>

In [160]:
regexp = r'[^1 - 6]' # not the range 1-6: with the spaces this means "anything except 1, a space, or 6"
subject = 'abc 123'

re.findall(regexp, subject)

['a', 'b', 'c', '2', '3']

In [163]:
regexp = r'[^c3]+' # one or more characters that are neither c nor 3
subject = 'abc 123'

re.findall(regexp, subject)

['ab', ' 12']
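Character classes also accept ranges such as `a-z` or `1-6`, and whitespace inside the brackets counts as a character. A short sketch comparing the spaced pattern used above with the range it resembles:

```python
import re

subject = 'abc 123'

# A range inside brackets: any lowercase letter
lowercase = re.findall(r'[a-z]', subject)

# Negated range: anything that is not a digit from 1 through 6
not_one_to_six = re.findall(r'[^1-6]', subject)

# With spaces around the hyphen the class is just "1", " ", and "6"
spaced = re.findall(r'[^1 - 6]', subject)

print(lowercase)
print(not_one_to_six)
print(spaced)
```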

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>For this exercise you should make up various subjects and test them with your regular expressions.</p>
    <ol>
        <li>Write a regular expression that matches even numbers.</li>
        <li>Write a regular expression that matches 2 or more odd numbers in a row.</li>
        <li>Write a regular expression that matches any word with a vowel in it.</li>
    </ol>
</div>

In [192]:
regexp = r'[02468]'
subject = '76.185.131.226 - - [11/May/2020:14:25:53 +0000] "GET / HTTP/1.1" 200 42 "-" "python-requests/2.23.0"'

re.findall(regexp, subject)

['6',
 '8',
 '2',
 '2',
 '6',
 '2',
 '0',
 '2',
 '0',
 '4',
 '2',
 '0',
 '0',
 '0',
 '0',
 '2',
 '0',
 '0',
 '4',
 '2',
 '2',
 '2',
 '0']

In [193]:
regexp = r'[13579]{2,}'
subject = '76.185.131.226 - - [11/May/2020:14:25:53 +0000] "GET / HTTP/1.1" 200 42 "-" "python-requests/2.23.0"'

re.findall(regexp, subject)

['131', '11', '53']

In [206]:
regexp = r'[aeiou]'
subject = 'aardvark'

re.match(regexp, subject) # match only looks at the start of the string

<re.Match object; span=(0, 1), match='a'>

In [205]:
regexp = r'[aeiou]' 
subject = 'aardvark'

re.search(regexp, subject) # search scans the whole string for the first match

<re.Match object; span=(0, 1), match='a'>

In [210]:
regexp = r'[aeiou]'
subject = 'banana'

if re.search(regexp, subject):
    print('Found a vowel')
    print(re.search(regexp, subject))
else:
    print('No vowels found')

Found a vowel
<re.Match object; span=(1, 2), match='a'>
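The cells above find a single vowel rather than the whole word that contains it. One possible way to return the full words is sketched below. It assumes the word boundary `\b` and the `re.IGNORECASE` flag, neither of which is covered in these notes, and the subject is made up.

```python
import re

subject = 'Try a rhythm with Oranges'

# \b marks the edges of a word; \w* allows any word characters
# before and after the required vowel
words_with_vowel = re.findall(r'\b\w*[aeiou]\w*\b', subject, re.IGNORECASE)

print(words_with_vowel)
```

`Try` and `rhythm` are skipped because `y` is not in the character class.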


### Anchors

- `^`: starts with
- `$`: ends with

In [213]:
regexp = r'^b' # does the string start with b?
subject = 'abc 123'

re.search(regexp, subject) # No output

In [214]:
regexp = r'[a-z$]' # inside brackets, $ is a literal dollar sign, not an anchor
subject = 'abc 123'

re.search(regexp, subject) 

<re.Match object; span=(0, 1), match='a'>

In [216]:
regexp = r'[a-z]$' 
subject = 'abc 123'
re.search(regexp, subject) # outside brackets, $ anchors to the end of the subject; the last character is a digit, so there is no match

In [217]:
regexp = r'^.*\d$' 
subject = 'abc 123'

re.search(regexp, subject) 

<re.Match object; span=(0, 7), match='abc 123'>
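By default the anchors apply to the subject as a whole. For multi-line text such as a log file, the `re.MULTILINE` flag makes them apply to each line instead. A minimal sketch with a made-up two-line subject:

```python
import re

subject = 'abc 123\nxyz 789'

# Without flags, ^ only matches at the very start of the string
whole_string = re.findall(r'^\w+', subject)

# With re.MULTILINE, ^ and $ also match at the start and end of each line
line_starts = re.findall(r'^\w+', subject, re.MULTILINE)
line_ends = re.findall(r'\d+$', subject, re.MULTILINE)

print(whole_string)
print(line_starts)
print(line_ends)
```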

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>For this exercise you should make up various subjects and test them with your regular expressions.</p>
    <ol>
        <li>Write a regular expression that matches if a word starts with a vowel.</li>
        <li>Write a regular expression that matches if a word starts with a capital letter.</li>
        <li>Write a regular expression that matches if a word ends with a capital letter.</li>        
        <li>Write a regular expression that matches if a word starts <b>and</b> ends with a capital letter.</li>
    </ol>
</div>

In [293]:
regexp = r'^[aeiouAEIOU]' 
subject = 'abc, Ale 123'

re.findall(regexp, subject) 

['a']

In [297]:
regexp = r'^[A-Z]' 
subject = 'Vaal'

re.search(regexp, subject) 

<re.Match object; span=(0, 1), match='V'>

In [303]:
regexp = r'[A-Z]$'
subject = 'BanneR'

re.findall(regexp, subject) 

['R']

In [305]:
regexp = r'^[A-Z].*[A-Z]$' # starts and ends with a capital letter
subject = 'BanneR'

re.findall(regexp, subject) 

['BanneR']

### Capture Groups

In [306]:
regexp = r'.*?(\d+)' # raw string so that \d is passed to the regex engine unchanged
s = pd.Series(['abc', 'abc123', '123'])
s.str.extract(regexp)

|   | 0 |
|---|---|
| 0 | NaN |
| 1 | 123 |
| 2 | 123 |
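Capture groups work the same way in the plain `re` module. The sketch below uses a made-up subject to show how parentheses change what `findall` returns and how to read groups from a Match object.

```python
import re

subject = 'abc123 def456'

# Without a group, the whole match is returned
whole = re.findall(r'[a-z]+\d+', subject)

# With one group, findall returns only the captured part
digits_only = re.findall(r'[a-z]+(\d+)', subject)

# With several groups, findall returns one tuple per match
pairs = re.findall(r'([a-z]+)(\d+)', subject)

# On a Match object, group(0) is the whole match and group(n) is the nth group
match = re.search(r'([a-z]+)(\d+)', subject)
print(match.group(0), match.group(1), match.group(2))
print(match.groups())
```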


## `re.sub`

- removing
- substitution

In [325]:
regexp = r'\d+'
subject = 'abc123'

re.sub(regexp, '', subject)

'abc'
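The cell above covers removal. Substitution works the same way with a non-empty replacement, which may refer back to capture groups or be a function. A short sketch on made-up strings:

```python
import re

date = '2020-11-12'

# \1, \2, \3 in the replacement refer to the numbered capture groups
us_date = re.sub(r'(\d{4})-(\d{2})-(\d{2})', r'\2/\3/\1', date)

# The replacement can also be a function that receives each Match object
doubled = re.sub(r'\d+', lambda m: str(int(m.group()) * 2), 'abc 21 def 4')

# count limits how many matches are replaced
first_only = re.sub(r'\d', '#', 'a1b2c3', count=1)

print(us_date, doubled, first_only)
```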

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>Use the code below to get started on this exercise.</p>
    <pre><code>dates = pd.Series(['2020-11-12', '2020-07-13', '2021-01-12'])</code></pre>
    <p>Use regular expression substitution to reformat the dates in the format common in the US: m/d/y.</p>
</div>

In [390]:
dates = pd.Series(['2020-11-12', '2020-07-13', '2021-01-12'])
dates

0    2020-11-12
1    2020-07-13
2    2021-01-12
dtype: object

In [399]:
dates.str.replace(r'(\d{4})-(\d{2})-(\d{2})', r'\2/\3/\1', regex=True)

0    11/12/2020
1    07/13/2020
2    01/12/2021
dtype: object

## Misc

### Pandas Usage

- `.str`
    - `.extract`
    - `.count`
    - `.contains`
    - `.replace`
- extract + concat
- named groups
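Of the methods listed above, `.count`, `.contains`, and the extract + concat pattern are not demonstrated elsewhere in these notes. A minimal sketch follows; the phone-number data and the column names `prefix` and `line` are invented for the example.

```python
import pandas as pd

text = pd.Series([
    'call 555-1234 or 555-9876',
    'no numbers here',
    'office: 555-0000',
])

# .str.contains returns a boolean Series: does the pattern appear at all?
has_number = text.str.contains(r'\d{3}-\d{4}')

# .str.count returns how many times the pattern matches in each string
n_numbers = text.str.count(r'\d{3}-\d{4}')

# .str.extract returns a DataFrame of capture groups (first match only),
# which can be joined back onto the original text with pd.concat
parts = text.str.extract(r'(?P<prefix>\d{3})-(?P<line>\d{4})')
combined = pd.concat([text.rename('text'), parts], axis=1)

print(has_number.tolist())
print(n_numbers.tolist())
print(combined)
```

Rows with no match get `NaN` in the extracted columns.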

In [400]:
df = pd.DataFrame()
df['text'] = pd.Series([
    'You should go check out https://regex101.com, it is a great website!',
    'My favorite search engine is https://duckduckgo.com',
    'If you have a question, you can get it answered through http://askjeeves.com, it is great!',
])
df

|   | text |
|---|---|
| 0 | You should go check out https://regex101.com, ... |
| 1 | My favorite search engine is https://duckduckg... |
| 2 | If you have a question, you can get it answere... |


In [401]:
df.text.str.extract(r'(https?)://(\w+)\.(\w+)')

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | https | regex101 | com |
| 1 | https | duckduckgo | com |
| 2 | http | askjeeves | com |


### Interactive Regex Tool

To install the `hlre` tool:

```
python -m pip install hlre
```

[For more documentation and the source](https://github.com/zgulde/hlre)

See also [regex101](https://regex101.com) (make sure to select the Python flavor)

### Named capture groups

In [402]:
text = 'You should go check out https://regex101.com, it is a great website!'

match = re.search(r'(?P<protocol>https?)://(?P<base_domain>\w+)\.(?P<tld>\w+)', text)
match.groupdict()

{'protocol': 'https', 'base_domain': 'regex101', 'tld': 'com'}

In [403]:
df.text.str.extract(r'(?P<protocol>https?)://(?P<base_domain>\w+)\.(?P<tld>\w+)')

|   | protocol | base_domain | tld |
|---|---|---|---|
| 0 | https | regex101 | com |
| 1 | https | duckduckgo | com |
| 2 | http | askjeeves | com |


### Verbose regular expressions

- `re.VERBOSE`
- `(?# this is a comment)`

In [405]:
text = 'You should go check out https://regex101.com, it is a great website!'

regexp = r'''
(?P<protocol>https?)
:// (?# match, but do not capture, the :// that separates protocol from domain)
(?P<base_domain>\w+)
\.
(?P<tld>\w+)
'''
match = re.search(regexp, text, re.VERBOSE) # whitespace in the regex is ignored
match.groupdict()

{'protocol': 'https', 'base_domain': 'regex101', 'tld': 'com'}
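Verbose mode and named groups pay off on longer patterns, such as one for the access-log lines at the top of these notes. The sketch below is one possible pattern rather than a general log parser. It assumes every line follows the layout of the sample, and the group names (`ip`, `timestamp`, `bytes_sent`, and so on) are labels chosen for this example. In verbose mode a `#` also starts a comment that runs to the end of the line.

```python
import re

# One line copied from the log_file_lines sample
line = '76.185.131.226 - - [11/May/2020:16:26:46 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"'

log_regexp = r'''
^(?P<ip>\d+\.\d+\.\d+\.\d+)      # four dot-separated numbers
\s-\s-\s                         # the two "-" placeholder fields
\[(?P<timestamp>[^\]]+)\]        # everything between the square brackets
\s
"(?P<method>[A-Z]+)\s(?P<path>\S+)\s(?P<http_version>[^"]+)"
\s
(?P<status>\d{3})                # three-digit status code
\s
(?P<bytes_sent>\d+)
\s
"(?P<referrer>[^"]*)"
\s
"(?P<user_agent>[^"]*)"$
'''

match = re.search(log_regexp, line, re.VERBOSE)
fields = match.groupdict()

print(fields)
print(match.group('status'))  # a single group can also be read by name
```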