# Regular Expressions

- [Examples](#Examples)
- [Basic Regex](#Basic-Regex)
    - [Metacharacters](#Metacharacters)
    - [Repetition](#Repetition)
    - [Any of / None of](#Any-of-/-None-of)
    - [Anchors](#Anchors)
    - [Other Functions](#Other-Functions)
    - [Capture Groups](#Capture-Groups)
    - [Flags](#Flags)
    - [Usage with Pandas](#Usage-with-Pandas)

In [1]:
import re
import pandas as pd

## Examples

Say I want to parse the following lines in a log file:

<div style="font-family: monospace; overflow: scroll; white-space: pre">GET /api/v1/sales?page=86 [16/Apr/2019:193452+0000] HTTP/1.1 {200} 510348 "python-requests/2.21.0" 97.105.19.58
POST /users_accounts/file-upload [16/Apr/2019:193452+0000] HTTP/1.1 {201} 42 "User-Agent: Mozilla/5.0 (X11; Fedora; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36" 97.105.19.58
GET /api/v1/items?page=3 [16/Apr/2019:193453+0000] HTTP/1.1 {429} 3561 "python-requests/2.21.0" 97.105.19.58
</div>

In [2]:
inputs = pd.Series([
    r'GET /api/v1/sales?page=86 [16/Apr/2019:193452+0000] HTTP/1.1 {200} 510348 "python-requests/2.21.0" 97.105.19.58',
    r'POST /users_accounts/file-upload [16/Apr/2019:193452+0000] HTTP/1.1 {201} 42 "User-Agent: Mozilla/5.0 (X11; Fedora; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36" 97.105.19.58',
    r'GET /api/v1/items?page=3 [16/Apr/2019:193453+0000] HTTP/1.1 {429} 3561 "python-requests/2.21.0" 97.105.19.58'
])
display(inputs)

# the dots in the IP address are escaped and the last octet uses \d+ so the full address is captured
outputs = inputs.str.extract(r'^([A-Z]+)\s(\/[^? ]+)(\?\S+)?\s\[(.+)\]\s(\S+)\s(\{[0-9]+\})\s([0-9]+)\s(\".+\")\s(\d+\.\d+\.\d+\.\d+)')
outputs.columns = ['call', 'url', 'qurl', 'timestamp', 'version', 'status', 'size', 'payload', 'ip']
display(outputs)

0    GET /api/v1/sales?page=86 [16/Apr/2019:193452+...
1    POST /users_accounts/file-upload [16/Apr/2019:...
2    GET /api/v1/items?page=3 [16/Apr/2019:193453+0...
dtype: object

| | call | url | qurl | timestamp | version | status | size | payload | ip |
|---|---|---|---|---|---|---|---|---|---|
| 0 | GET | /api/v1/sales | ?page=86 | 16/Apr/2019:193452+0000 | HTTP/1.1 | {200} | 510348 | "python-requests/2.21.0" | 97.105.19.58 |
| 1 | POST | /users_accounts/file-upload | NaN | 16/Apr/2019:193452+0000 | HTTP/1.1 | {201} | 42 | "User-Agent: Mozilla/5.0 (X11; Fedora; Fedora;... | 97.105.19.58 |
| 2 | GET | /api/v1/items | ?page=3 | 16/Apr/2019:193453+0000 | HTTP/1.1 | {429} | 3561 | "python-requests/2.21.0" | 97.105.19.58 |
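The one-line pattern above is dense. As a rough sketch of how the same idea could be spelled out, the block below uses `re.VERBOSE` and named groups from the standard `re` module. The group names (`method`, `path`, `agent`, and so on) and the `parse_line` helper are illustrative and not part of the original cell, and here the braces around the status code are left out of the capture.

```python
import re

# Same structure as the extract() pattern, one field per line.
# Group names are illustrative; the status group skips the braces.
LOG_LINE = re.compile(r"""
    ^(?P<method>[A-Z]+)\s            # HTTP method, e.g. GET
    (?P<path>/[^?\s]+)               # path, up to a question mark or whitespace
    (?P<query>\?\S+)?\s              # optional query string
    \[(?P<timestamp>.+)\]\s          # timestamp between square brackets
    (?P<version>\S+)\s               # protocol version
    \{(?P<status>[0-9]+)\}\s         # status code between braces
    (?P<size>[0-9]+)\s               # response size
    "(?P<agent>.+)"\s                # quoted user agent
    (?P<ip>\d+\.\d+\.\d+\.\d+)$      # dotted IPv4 address
""", re.VERBOSE)


def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_LINE.match(line)
    return match.groupdict() if match else None


line = 'GET /api/v1/sales?page=86 [16/Apr/2019:193452+0000] HTTP/1.1 {200} 510348 "python-requests/2.21.0" 97.105.19.58'
fields = parse_line(line)
print(fields['method'], fields['path'], fields['status'], fields['ip'])
```

Returning `None` for lines that do not match makes it easy to count or log the lines the pattern misses.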


Extract various components of an address:

In [3]:
addresses = pd.Series([
    '84 Rainey Street, Arlen, TX',
    '4 Privet Drive, Little Whinging, Surrey, U.K.',
    '740 Evergreen Terrace, Springfield',
    '1 Infinite Loop, Cupertino, California',
    'Wayne Manor, Gotham City',
    '124 Conch Street, Bikini Bottom',
])
addresses

0                      84 Rainey Street, Arlen, TX
1    4 Privet Drive, Little Whinging, Surrey, U.K.
2               740 Evergreen Terrace, Springfield
3           1 Infinite Loop, Cupertino, California
4                         Wayne Manor, Gotham City
5                  124 Conch Street, Bikini Bottom
dtype: object

In [4]:
data = addresses.str.extract(r'^(\d+)?\s*(.*?),\s*([\w\s]+)')
data.columns = ['house_no', 'street', 'city']
data

| | house_no | street | city |
|---|---|---|---|
| 0 | 84 | Rainey Street | Arlen |
| 1 | 4 | Privet Drive | Little Whinging |
| 2 | 740 | Evergreen Terrace | Springfield |
| 3 | 1 | Infinite Loop | Cupertino |
| 4 | NaN | Wayne Manor | Gotham City |
| 5 | 124 | Conch Street | Bikini Bottom |
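The street group in the pattern above uses the lazy quantifier `.*?`, which stops at the first comma. A small sketch with plain `re.match` shows what changes when the quantifier is greedy instead (the greedy variant is only for comparison and is not used in the lesson):

```python
import re

address = '84 Rainey Street, Arlen, TX'

# Lazy: the street stops at the first comma.
lazy = re.match(r'^(\d+)?\s*(.*?),\s*([\w\s]+)', address).groups()

# Greedy: the street runs up to the last comma.
greedy = re.match(r'^(\d+)?\s*(.*),\s*([\w\s]+)', address).groups()

# The house number group is optional, so it can come back as None.
no_number = re.match(r'^(\d+)?\s*(.*?),\s*([\w\s]+)', 'Wayne Manor, Gotham City').groups()

print(lazy)       # ('84', 'Rainey Street', 'Arlen')
print(greedy)     # ('84', 'Rainey Street, Arlen', 'TX')
print(no_number)  # (None, 'Wayne Manor', 'Gotham City')
```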


In [5]:
# find all the csv files referenced in the curriculum (this won't work for you)
# !(cd ~/codeup/curriculum/data-science/content && rg --vimgrep ".*pd.read_csv\(['\"](.+)['\"]\).*" -r '$1')

In [6]:
# find all the imports in .py files in the curriculum (this won't work for you)
# !(cd ~/codeup/curriculum/data-science/content && rg --vimgrep '^import\s+([\.\w]+)\s*(as\s*\w+)?.*$' -r '$1')
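Since the two shell commands above depend on a local checkout, here is a rough Python sketch of the same two searches using `re.findall`. The `source` string and the file names in it are made up for the example, and the patterns are simplified versions of the ones passed to `rg`.

```python
import re

# Hypothetical file contents, made up for this example.
source = '''import pandas as pd
import os.path
df = pd.read_csv('sales.csv')
other = pd.read_csv("items.csv")
'''

# Capture whatever sits between the quotes in pd.read_csv('...').
csv_files = re.findall(r"pd\.read_csv\(['\"](.+?)['\"]\)", source)

# Capture the module name after "import" at the start of a line.
imports = re.findall(r'^import\s+([.\w]+)', source, re.MULTILINE)

print(csv_files)  # ['sales.csv', 'items.csv']
print(imports)    # ['pandas', 'os.path']
```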

## Basic Regex

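- What is a regex? A small pattern language for matching text. It is bigger than Python: many languages and tools support it, each in a slightly different flavor.
- Raw strings (`r'...'`) keep backslashes literal, so a pattern like `\d` reaches the regex engine unchanged.
- `re.findall` returns a list of every non-overlapping match (the `re` module has other functions as well).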
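To make the raw string point concrete, here is a minimal sketch using only the standard `re` module. The sample text is made up for the example.

```python
import re

# '\b' in a normal string is one character (a backspace);
# r'\b' is two characters: a backslash and the letter b.
plain_len = len('\b')
raw_len = len(r'\b')
print(plain_len, raw_len)  # 1 2

text = 'the cat sat'
without_raw = re.findall('\bcat\b', text)  # the pattern contains backspaces
with_raw = re.findall(r'\bcat\b', text)    # \b is a word boundary
all_at = re.findall(r'\w+at', text)        # every match, left to right

print(without_raw)  # []
print(with_raw)     # ['cat']
print(all_at)       # ['cat', 'sat']
```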

In [7]:
# for demonstration in this lesson
from zgulde.hl_matches import hl_all_matches_nb as hl # pip install zgulde

In [8]:
subject = 'Hello, Bayes! Today is Dec 3 and the temperature is 70 degrees.'

In [9]:
re.findall(r'H', subject)

['H']

In [10]:
re.findall(r'e', subject)

['e', 'e', 'e', 'e', 'e', 'e', 'e', 'e', 'e', 'e']

In [11]:
hl(r'e', subject)

In [12]:
hl(r'70', subject)

### Metacharacters

In [13]:
display(hl(r'\w', subject))
display(hl(r'\W', subject))

In [14]:
display(hl(r'\d', subject))
display(hl(r'\D', subject))

In [15]:
display(hl(r'\s', subject))
display(hl(r'\S', subject))

In [16]:
display(hl(r'.', subject))
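The highlighted output of `hl` does not survive outside the notebook, so here is a small sketch of the same metacharacters with `re.findall` on a shorter, made-up string:

```python
import re

text = 'Dec 3!'

word = re.findall(r'\w', text)       # letters, digits, underscore
non_word = re.findall(r'\W', text)   # everything else
digit = re.findall(r'\d', text)      # digits
non_digit = re.findall(r'\D', text)  # anything but a digit
space = re.findall(r'\s', text)      # whitespace
non_space = re.findall(r'\S', text)  # anything but whitespace
anything = re.findall(r'.', text)    # any character except a newline

print(word)       # ['D', 'e', 'c', '3']
print(non_word)   # [' ', '!']
print(digit)      # ['3']
print(non_digit)  # ['D', 'e', 'c', ' ', '!']
print(space)      # [' ']
print(non_space)  # ['D', 'e', 'c', '3', '!']
print(anything)   # ['D', 'e', 'c', ' ', '3', '!']
```

Each uppercase class is the complement of its lowercase counterpart.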

### Repetition

In [17]:
hl(r'\w+', subject)
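The cell above only shows `+`. As a sketch of the other common quantifiers (`?`, `{n}`, `{m,n}`, and the lazy `+?`), using made-up strings:

```python
import re

text = 'a1 b22 c333'

one_or_more = re.findall(r'\d+', text)         # + : one or more
exactly_two = re.findall(r'\d{2}', text)       # {2} : exactly two
two_to_three = re.findall(r'\d{2,3}', text)    # {2,3} : between two and three
optional = re.findall(r'colou?r', 'color colour')  # ? : zero or one

# Quantifiers are greedy by default; adding ? makes them lazy.
greedy = re.findall(r'<.+>', '<a><b>')
lazy = re.findall(r'<.+?>', '<a><b>')

print(one_or_more)   # ['1', '22', '333']
print(exactly_two)   # ['22', '33']
print(two_to_three)  # ['22', '333']
print(optional)      # ['color', 'colour']
print(greedy)        # ['<a><b>']
print(lazy)          # ['<a>', '<b>']
```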

### Any of / None of

In [18]:
display(hl(r'[aeiou]', subject))
display(hl(r'[^aeiou]', subject))

In [19]:
hl(r'[A-Z][a-z]+', subject+' FU')

In [20]:
hl(r'\b[A-Z]{2}\b', subject.upper())
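For reference outside the notebook, a short sketch of the same character classes with `re.findall`, on made-up strings chosen so the results are easy to read:

```python
import re

vowels = re.findall(r'[aeiou]', 'Bayes')       # any one of the listed characters
non_vowels = re.findall(r'[^aeiou]', 'Bayes')  # ^ inside [] negates the class

# A range: one uppercase letter followed by lowercase letters.
capitalized = re.findall(r'[A-Z][a-z]+', 'Hello, Bayes! FU')

# Exactly two uppercase letters standing alone as a word.
two_letter = re.findall(r'\b[A-Z]{2}\b', 'TODAY IS DEC 3')

print(vowels)       # ['a', 'e']
print(non_vowels)   # ['B', 'y', 's']
print(capitalized)  # ['Hello', 'Bayes']
print(two_letter)   # ['IS']
```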

### Anchors

In [21]:
re.findall(r'^.', subject)

['H']

In [22]:
hl(r'^.', subject)

In [23]:
hl(r'.{3}$', subject)

In [24]:
hl(r'\b.\b', subject)

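As a sketch of what the anchors do, here are `^`, `$`, and `\b` with `re.findall` on a short made-up string. Note that `\b.\b` matches any single character with a word boundary on both sides, which includes the spaces between words.

```python
import re

text = 'Dec 3 and 70'

first_word = re.findall(r'^\w+', text)      # ^ : start of the string
last_three = re.findall(r'.{3}$', text)     # $ : end of the string
whole_numbers = re.findall(r'\b\d+\b', text)  # \b : word boundary
single_chars = re.findall(r'\b.\b', text)

print(first_word)     # ['Dec']
print(last_three)     # [' 70']
print(whole_numbers)  # ['3', '70']
print(single_chars)   # [' ', '3', ' ', ' ']
```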


### Other Functions

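- `re.search`: returns a match object for the first match, or `None` if there is none
- `re.sub`: replaces every match with a replacement string
- `re.compile` + flags: builds a reusable pattern object, optionally with flags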
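A minimal sketch of the three functions listed above, reusing the `subject` string from earlier in the lesson:

```python
import re

subject = 'Hello, Bayes! Today is Dec 3 and the temperature is 70 degrees.'

# re.search stops at the first match and returns a match object.
first_number = re.search(r'\d+', subject)
print(first_number.group())  # 3

# When nothing matches, re.search returns None.
missing = re.search(r'xyz', subject)
print(missing)  # None

# re.sub replaces every match.
masked = re.sub(r'\d+', '#', subject)
print(masked)

# re.compile builds a pattern object that can be reused; flags go here.
word = re.compile(r'bayes', re.IGNORECASE)
found = word.findall(subject)
print(found)  # ['Bayes']
```

Because `re.search` can return `None`, it is worth checking the result before calling `.group()` on it.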

### Capture Groups

In [25]:
hl(r'\w+(\w)', subject)

In [26]:
hl(r'(\w)\1', subject)

In [27]:
hl(r'([aeiou])\w+\1', subject)

In [28]:
## double letter

In [29]:
date = '03 12 2019'

In [30]:
# re.sub(needle, replacement, haystack)
re.sub(r'(\d+)\s(\d+)\s(\d+)', r'\3-\2-\1', date)

'2019-12-03'
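A short sketch of reading capture groups back from a match object, plus a named-group version of the substitution above. The group names `day`, `month`, and `year` are an assumption about what the three numbers mean; the original only shows positional groups.

```python
import re

date = '03 12 2019'

match = re.search(r'(\d+)\s(\d+)\s(\d+)', date)
print(match.group(0))  # 03 12 2019 (the whole match)
print(match.group(3))  # 2019 (the third group)
print(match.groups())  # ('03', '12', '2019')

# Named groups (names are assumed here) can be referenced with \g<name>.
named = re.sub(
    r'(?P<day>\d+)\s(?P<month>\d+)\s(?P<year>\d+)',
    r'\g<year>-\g<month>-\g<day>',
    date,
)
print(named)  # 2019-12-03

# With a capture group in the pattern, re.findall returns the group,
# not the whole match. re.finditer gives access to the full match.
groups_only = re.findall(r'(\w)\1', 'Hello degrees')
full_matches = [m.group() for m in re.finditer(r'(\w)\1', 'Hello degrees')]
print(groups_only)   # ['l', 'e']
print(full_matches)  # ['ll', 'ee']
```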

### Flags

In [31]:
re.compile(r'', re.IGNORECASE | re.MULTILINE | re.VERBOSE)

re.compile(r'', re.IGNORECASE|re.MULTILINE|re.UNICODE|re.VERBOSE)
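The cell above compiles an empty pattern, so here is a sketch of what each of the three flags changes, on made-up strings:

```python
import re

# IGNORECASE: letters match regardless of case.
any_case = re.findall(r'hello', 'Hello HELLO', re.IGNORECASE)
print(any_case)  # ['Hello', 'HELLO']

# MULTILINE: ^ and $ match at the start and end of every line.
lines = 'one\ntwo\nthree'
first_only = re.findall(r'^\w+', lines)
every_line = re.findall(r'^\w+', lines, re.MULTILINE)
print(first_only)  # ['one']
print(every_line)  # ['one', 'two', 'three']

# VERBOSE: whitespace and comments inside the pattern are ignored.
# Flags are combined with |.
pattern = re.compile(r'''
    ^(\w+)   # first word on the line
    \s+      # whitespace
    (\d+)    # then a number
''', re.MULTILINE | re.VERBOSE)
pairs = pattern.findall('Dec 3\nJan 14')
print(pairs)  # [('Dec', '3'), ('Jan', '14')]
```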

### Usage with Pandas

In [32]:
pd.Series.str.extract

<function pandas.core.strings.StringMethods.extract(self, pat, flags=0, expand=True)>
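As a sketch of how regexes plug into the pandas string methods, the block below uses `str.extract` with named groups (which become column names), `str.contains`, and `str.replace`. It reuses two of the addresses from the examples at the top; the column and variable names are illustrative.

```python
import pandas as pd

addresses = pd.Series([
    '84 Rainey Street, Arlen, TX',
    'Wayne Manor, Gotham City',
])

# Named groups become the column names of the extracted DataFrame.
parts = addresses.str.extract(
    r'^(?P<house_no>\d+)?\s*(?P<street>.*?),\s*(?P<city>[\w\s]+)'
)

# str.contains returns a boolean Series, handy for filtering rows.
starts_with_number = addresses.str.contains(r'^\d')

# str.replace treats the pattern as a regex when regex=True.
dashed = addresses.str.replace(r',\s*', ' - ', regex=True)

print(parts)
print(starts_with_number.tolist())  # [True, False]
print(dashed.tolist())
```

Rows where an optional group does not match, like the missing house number, come back as `NaN`.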