# Regex

## What?
Regular expressions (also called regexes or regex patterns) are a small language for describing patterns in text.
With regex patterns we can answer questions about strings and transform them:
- Does this string match a pattern?
- Is there a match for the pattern anywhere in the string?
- Modify and split strings in various ways
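
As a rough sketch of those three uses, the cell below runs each one with the standard-library `re` module. The `text` value and the variable names are made up for illustration.

```python
import re

text = "order 66, order 99"  # hypothetical sample string

# 1. Does the string match the pattern, starting at its first character?
starts_with_order = re.match(r"order", text) is not None   # True

# 2. Is there a match for the pattern anywhere in the string?
has_digits = re.search(r"\d+", text) is not None            # True

# 3. Modify and split strings.
masked = re.sub(r"\d", "#", text)     # 'order ##, order ##'
parts = re.split(r",\s*", text)       # ['order 66', 'order 99']

print(starts_with_order, has_digits, masked, parts)
```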

In [1]:
import pandas as pd
import re # part of the python stdlib

In [2]:
log_file_lines = '''
76.185.131.226 - - [11/May/2020:14:25:53 +0000] "GET / HTTP/1.1" 200 42 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:25:46 +0000] "GET / HTTP/1.1" 200 42 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:25:58 +0000] "GET / HTTP/1.1" 200 42 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36"
76.185.131.226 - - [11/May/2020:16:25:58 +0000] "GET /favicon.ico HTTP/1.1" 200 162 "https://python.zach.lol/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36"
104.5.217.57 - - [11/May/2020:16:26:27 +0000] "GET / HTTP/1.1" 200 42 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:26:46 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:26:54 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
104.5.217.57 - - [11/May/2020:16:27:04 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:27:05 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:27:10 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
'''

In [192]:
lines = pd.Series(log_file_lines.strip().split('\n'))
lines[0]

'76.185.131.226 - - [11/May/2020:14:25:53 +0000] "GET / HTTP/1.1" 200 42 "-" "python-requests/2.23.0"'

In [4]:
regex = r'''
    (?P<ip>.*?)\s.*?\[(?P<timestamp>.*?)\]\s+"(?P<method>[A-Z]+)\s(?P<path>.*?)\sHTTP/1.1"
    \s(?P<status>\d+)\s(?P<bytes_sent>\d+)\s"(?P<referrer>.*?)"\s"(?P<user_agent>.*?)"
    '''

regex = re.compile(regex, re.VERBOSE)
regex

re.compile(r'\n    (?P<ip>.*?)\s.*?\[(?P<timestamp>.*?)\]\s+"(?P<method>[A-Z]+)\s(?P<path>.*?)\sHTTP/1.1"\n    \s(?P<status>\d+)\s(?P<bytes_sent>\d+)\s"(?P<referrer>.*?)"\s"(?P<user_agent>.*?)"\n    ',
           re.UNICODE|re.VERBOSE)

In [5]:
lines.str.extract(regex)

| | ip | timestamp | method | path | status | bytes_sent | referrer | user_agent |
|---|---|---|---|---|---|---|---|---|
| 0 | 76.185.131.226 | 11/May/2020:14:25:53 +0000 | GET | / | 200 | 42 | - | python-requests/2.23.0 |
| 1 | 76.185.131.226 | 11/May/2020:16:25:46 +0000 | GET | / | 200 | 42 | - | python-requests/2.23.0 |
| 2 | 76.185.131.226 | 11/May/2020:16:25:58 +0000 | GET | / | 200 | 42 | - | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6... |
| 3 | 76.185.131.226 | 11/May/2020:16:25:58 +0000 | GET | /favicon.ico | 200 | 162 | https://python.zach.lol/ | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6... |
| 4 | 104.5.217.57 | 11/May/2020:16:26:27 +0000 | GET | / | 200 | 42 | - | python-requests/2.23.0 |
| 5 | 76.185.131.226 | 11/May/2020:16:26:46 +0000 | GET | /documentation | 200 | 348 | - | python-requests/2.23.0 |
| 6 | 76.185.131.226 | 11/May/2020:16:26:54 +0000 | GET | /documentation | 200 | 348 | - | python-requests/2.23.0 |
| 7 | 104.5.217.57 | 11/May/2020:16:27:04 +0000 | GET | /documentation | 200 | 348 | - | python-requests/2.23.0 |
| 8 | 76.185.131.226 | 11/May/2020:16:27:05 +0000 | GET | /documentation | 200 | 348 | - | python-requests/2.23.0 |
| 9 | 76.185.131.226 | 11/May/2020:16:27:10 +0000 | GET | /documentation | 200 | 348 | - | python-requests/2.23.0 |
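
To see what the extraction does for a single row, the sketch below applies the same pattern to the first log line using only the `re` module. It is an illustration of the idea, not a description of how pandas implements `str.extract`; the names `log_pattern`, `line`, and `fields` are introduced here.

```python
import re

# Same pattern as above; re.VERBOSE ignores the line breaks and indentation.
log_pattern = re.compile(r'''
    (?P<ip>.*?)\s.*?\[(?P<timestamp>.*?)\]\s+"(?P<method>[A-Z]+)\s(?P<path>.*?)\sHTTP/1.1"
    \s(?P<status>\d+)\s(?P<bytes_sent>\d+)\s"(?P<referrer>.*?)"\s"(?P<user_agent>.*?)"
    ''', re.VERBOSE)

line = ('76.185.131.226 - - [11/May/2020:14:25:53 +0000] '
        '"GET / HTTP/1.1" 200 42 "-" "python-requests/2.23.0"')

match = log_pattern.search(line)
# groupdict() maps each named group to the text it captured
fields = match.groupdict() if match else {}
print(fields)
```

Each named group becomes one key here, in the same way each named group becomes one column in the DataFrame above.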


## `re` library functions

### `re.findall` 

 - finds all substrings where the RE matches; returns a list

### Literals - start simple

In [6]:
subject = 'abc'

#### find the letter a

In [9]:
regexp = r'a'
re.findall(regexp, subject)

['a']

#### find the letter c

In [10]:
regexp = r'c'
re.findall(regexp, subject)

['c']

#### find the letter d

In [11]:
regexp = r'd'
re.findall(regexp, subject)

[]

### Literals - make it more complex

In [12]:
subject = 'Mary had a little lamb. 1 little lamb. Not 10, not 12, not 22, just one'

#### find mary

In [15]:
regexp = r'mary'
re.findall(regexp, subject)

[]

##### regex flag: re.IGNORECASE

In [16]:
re.findall(regexp, subject, re.IGNORECASE)

['Mary']

#### find little

In [17]:
regexp = r'little'
re.findall(regexp, subject, re.IGNORECASE)

['little', 'little']

#### find the number 1

In [18]:
regexp = r'1'
re.findall(regexp, subject, re.IGNORECASE)

['1', '1', '1']

### Metacharacters

- `.`: any character except a newline


- `\w`: any letter, digit, or underscore
- `\W`: anything that is *not* a letter, digit, or underscore


- `\d`: any digit
- `\D`: anything that is *not* a digit


- `\s`: any whitespace
- `\S`: anything that is *not* whitespace
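
A few edge cases of these metacharacters are easy to miss, so here is a small sketch of them. The `sample` string is made up for illustration.

```python
import re

sample = "snake_case 42\nnext line"  # hypothetical two-line string

# \w matches the underscore as well as letters and digits
word_chars = re.findall(r"\w+", sample)        # ['snake_case', '42', 'next', 'line']

# . does not cross a newline by default...
per_line = re.findall(r".+", sample)           # ['snake_case 42', 'next line']
# ...unless the re.DOTALL flag is passed
whole = re.findall(r".+", sample, re.DOTALL)   # ['snake_case 42\nnext line']

# \S is the complement of \s: runs of non-whitespace characters
non_space = re.findall(r"\S+", sample)         # ['snake_case', '42', 'next', 'line']

print(word_chars, per_line, whole, non_space)
```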

In [19]:
subject = 'abc. 123'

#### try all the metacharacters

In [20]:
regexp = r'.'
re.findall(regexp, subject)

['a', 'b', 'c', '.', ' ', '1', '2', '3']

In [21]:
regexp = r'\w'
re.findall(regexp, subject)

['a', 'b', 'c', '1', '2', '3']

In [22]:
regexp = r'\W'
re.findall(regexp, subject)

['.', ' ']

In [24]:
regexp = r'\d'
re.findall(regexp, subject)

['1', '2', '3']

In [25]:
regexp = r'\D'
re.findall(regexp, subject)

['a', 'b', 'c', '.', ' ']

#### what does \w\w bring back?

In [23]:
regexp = r'\w\w'
re.findall(regexp, subject)

['ab', '12']

#### match the period

In [26]:
regexp = r'\.'
re.findall(regexp, subject)

['.']

#### match the string 'c 1' using only metacharacters

In [27]:
subject = 'c 1'

In [31]:
regexp = r'\w\s\d'
re.findall(regexp, subject)

['c 1']

### Repeating

- `{}`: custom number of repetitions
    - `{x}`: exactly x repetitions
    - `{x,}`: x or more
    - `{x,y}`: between x and y repetitions
- `*`: zero or more
- `+`: one or more
- `?`: optional (zero or one)
- `?` after another quantifier (`*?`, `+?`, `??`): non-greedy, match as few characters as possible
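
Quantifiers are greedy by default, meaning they grab as much text as they can, and a trailing `?` makes them non-greedy. The sketch below contrasts the two on a made-up `html` string and also shows a bounded `{x,y}` repetition.

```python
import re

html = "<b>bold</b> and <i>italic</i>"  # hypothetical sample string

# Greedy: .+ runs to the last '>' it can reach
greedy = re.findall(r"<.+>", html)     # ['<b>bold</b> and <i>italic</i>']

# Non-greedy: .+? stops at the first '>' that completes a match
lazy = re.findall(r"<.+?>", html)      # ['<b>', '</b>', '<i>', '</i>']

# {2,3}: between two and three repetitions of 'a'
bounded = re.findall(r"a{2,3}", "a aa aaa aaaa")   # ['aa', 'aaa', 'aaa']

print(greedy, lazy, bounded)
```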

#### what will be returned?

In [32]:
subject = 'abc 123'

In [33]:
regexp = r'\w+\s?\d'

re.findall(regexp, subject)

['abc 1', '23']

### Find the matches

In [34]:
subject = """
    Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, San Antonio, TX 78230. 
    You can find us online at http://codeup.com and our alumni portal is located at https://alumni.codeup.com.
    """

#### match all the numbers

In [40]:
regexp = r'\d'
re.findall(regexp, subject)

['2', '0', '1', '4', '6', '0', '0', '3', '5', '0', '7', '8', '2', '3', '0']

In [42]:
regexp = r'\d+'
re.findall(regexp, subject)

['2014', '600', '350', '78230']

#### match a 5 digit number, but not a number with fewer digits

In [43]:
regexp = r'\d{5}'
re.findall(regexp, subject)

['78230']

#### match a 4 or more digit number

In [44]:
regexp = r'\d{4,}'
re.findall(regexp, subject)

['2014', '78230']

#### match 3 to 4 digit number

In [45]:
regexp = r'\d{3,4}'
re.findall(regexp, subject)

['2014', '600', '350', '7823']

#### match `http://` or `https://`

In [47]:
regexp = r'https?://'
re.findall(regexp, subject)

['http://', 'https://']

In [48]:
regexp = r'http\w?://'
re.findall(regexp, subject)

['http://', 'https://']

In [49]:
regexp = r'https*://'
re.findall(regexp, subject)

['http://', 'https://']

### Any of / None of

- `[]`: matches any one of the characters inside the brackets
- `[^]`: matches any one character that is *not* inside the brackets
- `[-]`: matches any one character in a range of values
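
Inside brackets, most metacharacters are treated as literal characters, and several ranges can be combined in one class. A minimal sketch, using made-up sample strings:

```python
import re

# '.', '$', '(' and ')' are literal inside a character class
punctuation = re.findall(r"[.$()]", "price: $4.50 (approx)")   # ['$', '.', '(', ')']

# Several ranges in one class: hexadecimal digits
hex_runs = re.findall(r"[0-9a-fA-F]+", "ff 1A zz 09")          # ['ff', '1A', '09']

# A hyphen placed last in the class is a literal hyphen, not a range
phone = re.findall(r"[\d-]+", "555-1234 ext")                  # ['555-1234']

print(punctuation, hex_runs, phone)
```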

In [68]:
subject = 'abc 12345'

#### match using brackets

In [53]:
regexp = r'[a1]'
re.findall(regexp, subject)

['a', '1']

In [54]:
regexp = r'[c15]'
re.findall(regexp, subject)

['c', '1', '5']

In [56]:
regexp = r'[a1][b2][c3]'
re.findall(regexp, subject)

['abc', '123']

#### match using caret

In [55]:
regexp = r'[^c15]'
re.findall(regexp, subject)

['a', 'b', ' ', '2', '3', '4']

#### match using range

In [60]:
regexp = r'[1-4]'
re.findall(regexp, subject)

['1', '2', '3', '4']

In [61]:
regexp = r'[1-2a-b]'
re.findall(regexp, subject)

['a', 'b', '1', '2']

#### match using range and caret

In [65]:
regexp = r'[^1-4a-z]'
re.findall(regexp, subject)

[' ', '5']

### Anchors

- `^`: start of the string (starts with)
- `$`: end of the string (ends with)
- `\b`: word boundary
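
By default `^` and `$` anchor to the whole string. With the `re.MULTILINE` flag they also anchor at each line break, which can stand in for splitting the text first. A small sketch on a made-up three-line string:

```python
import re

text = "apple pie\nbanana split\navocado toast"  # hypothetical sample

# Without the flag, ^ only matches at the very start of the string
start_only = re.findall(r"^[aeiou]\w+", text)                  # ['apple']

# With re.MULTILINE, ^ also matches right after every newline
each_line = re.findall(r"^[aeiou]\w+", text, re.MULTILINE)     # ['apple', 'avocado']

# Likewise, $ matches just before every newline
last_words = re.findall(r"\w+$", text, re.MULTILINE)           # ['pie', 'split', 'toast']

print(start_only, each_line, last_words)
```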

In [76]:
subject = """
    Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, San Antonio, TX 78230. 
    You can find us online at http://codeup.com and our alumni portal is located at https://alumni.codeup.com.
    """

#### match 3 or 4 digit number

In [78]:
regexp = r'\b\d{3,4}\b'
re.findall(regexp, subject)

['2014', '600', '350']

In [79]:
subject = 'kiwi aardvark banana codeup data science academy extra'

#### match all words that start with a vowel

In [87]:
# using a word boundary
regexp = r'\b[aeiou]\w+'
re.findall(regexp, subject)

['aardvark', 'academy', 'extra']

In [91]:
# using an anchor: ^ matches only at the start of the string, and 'kiwi' starts with a consonant
regexp = r'^[aeiou]\w+'
re.findall(regexp, subject)

[]

In [92]:
# split the subject into words so the anchor applies to each word
subjects = subject.split()

for subject in subjects:
    print(re.findall(regexp, subject))

[]
['aardvark']
[]
[]
[]
[]
['academy']
['extra']


#### match all words that end with a vowel

In [94]:
regexp = r'\w+[aeiou]$'

for subject in subjects:
    print(re.findall(regexp, subject))

['kiwi']
[]
['banana']
[]
['data']
['science']
[]
['extra']


In [105]:
subject = 'kiwi aardvark banana codeup data science academy extra'

In [106]:
regexp = r'\w+[aeiou]\b'
re.findall(regexp, subject)

['kiwi', 'banana', 'data', 'science', 'extra']

### Capture Groups

- `()`: grab what's contained in parentheses 
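
The shape of what `re.findall` returns depends on how many capture groups the pattern contains. The sketch below shows the three cases on a made-up string of dates.

```python
import re

dates = "2020-05-11 and 2021-12-25"  # hypothetical sample string

# No groups: a list of whole matches
no_groups = re.findall(r"\d{4}-\d{2}-\d{2}", dates)        # ['2020-05-11', '2021-12-25']

# One group: a list of just that group's text
one_group = re.findall(r"(\d{4})-\d{2}-\d{2}", dates)      # ['2020', '2021']

# Several groups: a list of tuples, one item per group
many_groups = re.findall(r"(\d{4})-(\d{2})-(\d{2})", dates)
# [('2020', '05', '11'), ('2021', '12', '25')]

print(no_groups, one_group, many_groups)
```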

In [107]:
subject = '''
You can find us on the web at https://codeup.com. Our ip address is 123.123.123.123 (maybe).
'''

#### find the domain only

In [113]:
regexp = r'https://(\w+)\.com'  # escape the dot so it matches a literal '.'
re.findall(regexp, subject)

['codeup']

#### find everything after the first sentence

In [118]:
regexp = r'\.com\. (.+)'  # escaped dots match literal periods only
re.findall(regexp, subject)

['Our ip address is 123.123.123.123 (maybe).']

#### find the ip address

In [126]:
regexp = r'is (.+)\s'
re.findall(regexp, subject)

['123.123.123.123 (maybe).']

In [128]:
# don't be greedy: +? stops at the first whitespace instead of the last
regexp = r'is (.+?)\s'
re.findall(regexp, subject)

['123.123.123.123']

In [130]:
regexp = r'\d+\.\d+\.\d+\.\d+'  # escaped dots; a bare . would match any character
re.findall(regexp, subject)

['123.123.123.123']

In [133]:
regexp = r'((\d+\.){3}\d+)'
re.findall(regexp, subject)

[('123.123.123.123', '123.')]

#### find domain and ip address

In [134]:
subject.strip()

'You can find us on the web at https://codeup.com. Our ip address is 123.123.123.123 (maybe).'

In [136]:
regexp = r'https://(\w+)\.com\. Our ip address is (.*?)\s'
re.findall(regexp, subject)

[('codeup', '123.123.123.123')]

In [137]:
regexp = r'\/(\w+).*?(\d.*?)\s'
re.findall(regexp, subject)

[('codeup', '123.123.123.123')]

#### find the protocol, domain, and tld

In [138]:
regexp = r'(https?)://(\w+)\.(\w+)'

re.findall(regexp, subject)

[('https', 'codeup', 'com')]

### Non Capture Group

- `(?:...)`: a non-capturing group (sometimes called a shy group); it groups part of the pattern without capturing it
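
A common reason to use a non-capturing group is alternation: the parentheses are needed to group the choices, but the chosen text is not wanted in the result. A minimal sketch with a made-up sentence:

```python
import re

sentence = "Dr. Smith met Ms. Jones and Mr. Lee"  # hypothetical sample

# Capturing group: the title shows up in every result tuple
with_title = re.findall(r"(Mr|Ms|Dr)\. (\w+)", sentence)
# [('Dr', 'Smith'), ('Ms', 'Jones'), ('Mr', 'Lee')]

# Non-capturing group: the title is still required but not returned
names_only = re.findall(r"(?:Mr|Ms|Dr)\. (\w+)", sentence)
# ['Smith', 'Jones', 'Lee']

print(with_title, names_only)
```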

#### find the ip address

In [139]:
regexp = r'((?:\d{3}\.){3}\d+)'

re.findall(regexp, subject)

['123.123.123.123']

#### find the protocol and tld

In [140]:
regexp = r'(https?)://(?:\w+)\.(\w+)'

re.findall(regexp, subject)

[('https', 'com')]

## more re library functions

- `re.findall` finds all substrings where the RE matches; returns a list


- `re.search` scans through a string, looking for any location where the RE matches; returns match object
- `re.sub` allows us to match a regex and substitute in a new substring for the match; returns string

### `re.search`

- scans through a string, looking for any location where the RE matches; returns match object
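
Because `re.search` returns `None` when nothing matches, it is usually worth checking the result before using it. The sketch below shows the common accessors on a match object; the sample string and variable names are illustrative.

```python
import re

sample = "Codeup, founded in 2014"  # hypothetical sample string

match = re.search(r"\d+", sample)
if match is not None:
    year = match.group()         # the matched text: '2014'
    start, end = match.span()    # where it was found: (19, 23)

# No match: search returns None, so calling .group() on it would raise AttributeError
missing = re.search(r"xyz", sample)

print(year, start, end, missing)
```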

In [141]:
subject = """
    Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, San Antonio, TX 78230. 
    You can find us online at http://codeup.com and our alumni portal is located at https://alumni.codeup.com.
    """

#### find the word located

In [142]:
regexp = r'located'
re.search(regexp, subject)

<re.Match object; span=(33, 40), match='located'>

In [143]:
#results type
type(re.search(regexp, subject))

re.Match

#### find numbers

In [144]:
regexp = r'\d+'  # raw string so the backslash reaches the regex engine unchanged
re.search(regexp, subject)

<re.Match object; span=(24, 28), match='2014'>

#### find navarro

In [146]:
regexp = 'navarro'
re.search(regexp, subject, re.IGNORECASE)

<re.Match object; span=(48, 55), match='Navarro'>

In [147]:
#results type
type(re.search(regexp, subject, re.IGNORECASE))

re.Match

In [148]:
type(re.search(regexp, subject))

NoneType

### Name Capture Group

- `(?P<name>...)`: names a capture group so it can be looked up by name
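
Named groups can be read back by name or by position. A short sketch, using a made-up contact string and group names chosen for illustration:

```python
import re

match = re.search(r"(?P<user>\w+)@(?P<host>[\w.]+)", "contact: zach@codeup.com")

user = match.group("user")    # look up by name: 'zach'
host = match["host"]          # index syntax works too: 'codeup.com'
first = match.group(1)        # named groups are still numbered: 'zach'

print(user, host, first)
```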

In [150]:
subject = '''
You can find us on the web at https://codeup.com. Our ip address is 123.123.123.123 (maybe).
'''

#### find the protocol, domain, and tld and name them

In [151]:
regexp = r'(https?)://(\w+)\.(\w+)'
re.findall(regexp, subject)

[('https', 'codeup', 'com')]

In [155]:
regexp = r'(?P<protocol>https?)://(?P<domain>\w+)\.(?P<tld>\w+)'
match = re.search(regexp, subject)
match

<re.Match object; span=(31, 49), match='https://codeup.com'>

In [156]:
#groups()
match.groups()

('https', 'codeup', 'com')

In [159]:
#groupdict()
match.groupdict()

{'protocol': 'https', 'domain': 'codeup', 'tld': 'com'}

##### regex flag:  `re.VERBOSE`

In [161]:
regexp = r'''(?P<protocol>https?)://
            (?P<domain>\w+)\.
            (?P<tld>\w+)'''
match = re.search(regexp, subject, re.VERBOSE)
match

<re.Match object; span=(31, 49), match='https://codeup.com'>

### `re.sub`

- allows us to match a regex and substitute in a new substring for the match; returns string
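
Beyond a fixed replacement string, `re.sub` also accepts backreferences to capture groups, a function that computes each replacement, and a `count` limit. The sketch below illustrates all three; the sample strings are made up.

```python
import re

# Backreferences: \1, \2, \3 refer to the captured groups
reordered = re.sub(r"(\d+)/(\w+)/(\d+)", r"\3-\2-\1", "11/May/2020")   # '2020-May-11'

# A function receives each match object and returns the replacement text
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), "3 apples, 10 pears")
# '6 apples, 20 pears'

# count limits how many matches are replaced
first_only = re.sub(r"\d", "o", "abc 12345xyz", count=1)               # 'abc o2345xyz'

print(reordered, doubled, first_only)
```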

In [162]:
subject = 'abc 12345xyz'

#### remove all the digits

In [163]:
regexp = r'\d'
replacement = ''  # replace each digit with nothing
re.sub(regexp, replacement, subject)

'abc xyz'

#### replace all digits with an o 

In [164]:
regexp = r'\d'
replacement = 'o'  # each digit becomes its own 'o'
re.sub(regexp, replacement, subject)

'abc oooooxyz'

#### replace all the digits with a single o

In [168]:
regexp = r'\d+'  # + matches the whole run of digits as one match
replacement = 'o'
re.sub(regexp, replacement, subject)

'abc oxyz'

### `re.compile`

- compiles a regular expression ahead of time so it can be reused; returns a compiled pattern object
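
A compiled pattern object has its own `findall`, `search`, and `sub` methods, so the pattern does not need to be passed again on every call. A minimal sketch with made-up inputs:

```python
import re

digits = re.compile(r"\d+")

found = digits.findall("a1 b22 c333")     # ['1', '22', '333']
nothing = digits.search("no digits")      # None
hashed = digits.sub("#", "a1 b22")        # 'a# b#'

# Flags are given once, at compile time
any_case = re.compile(r"mary", re.IGNORECASE)
marys = any_case.findall("Mary mary MARY")   # ['Mary', 'mary', 'MARY']

print(found, nothing, hashed, marys)
```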

In [178]:
emails = [
    "jane@company.com",
    "bob@company.com",
    "jane.janeway@company.com",
    "jane.janeway@dogood.org",
    "jane.janet.janeway@dogood.org", # bonus for the 3 part address
]

In [181]:
regexp = r'(\w+\.*\w+)@(\w+)\.(\w+)'
for email in emails:
    print(re.findall(regexp, email))

[('jane', 'company', 'com')]
[('bob', 'company', 'com')]
[('jane.janeway', 'company', 'com')]
[('jane.janeway', 'dogood', 'org')]
[('janet.janeway', 'dogood', 'org')]
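
Note that the last result drops the leading `jane.` from the three-part address, because `\w+\.*\w+` allows dots in only one place. One possible fix, offered here as a sketch rather than the only approach, is to repeat a non-capturing `word.` group zero or more times. The names `full_name_regexp` and `results` are introduced for this example.

```python
import re

emails = [
    "jane@company.com",
    "bob@company.com",
    "jane.janeway@company.com",
    "jane.janeway@dogood.org",
    "jane.janet.janeway@dogood.org",
]

# (?:\w+\.)* allows any number of "word." parts before the final word
full_name_regexp = r'((?:\w+\.)*\w+)@(\w+)\.(\w+)'

results = [re.findall(full_name_regexp, email) for email in emails]
for result in results:
    print(result)
# the last line printed is [('jane.janet.janeway', 'dogood', 'org')]
```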


In [187]:
pattern = re.compile(r'''
(?P<name>\w+\.*\w+)@
(?P<domain>\w+)\.
(?P<tld>\w+)
''', re.VERBOSE)
pattern

re.compile(r'\n(?P<name>\w+\.*\w+)@\n(?P<domain>\w+)\.\n(?P<tld>\w+)\n',
           re.UNICODE|re.VERBOSE)

In [190]:
contacts = [re.search(pattern, email).groupdict() for email in emails]
contacts

[{'name': 'jane', 'domain': 'company', 'tld': 'com'},
 {'name': 'bob', 'domain': 'company', 'tld': 'com'},
 {'name': 'jane.janeway', 'domain': 'company', 'tld': 'com'},
 {'name': 'jane.janeway', 'domain': 'dogood', 'tld': 'org'},
 {'name': 'janet.janeway', 'domain': 'dogood', 'tld': 'org'}]

In [191]:
pd.DataFrame(contacts)

| | name | domain | tld |
|---|---|---|---|
| 0 | jane | company | com |
| 1 | bob | company | com |
| 2 | jane.janeway | company | com |
| 3 | jane.janeway | dogood | org |
| 4 | janet.janeway | dogood | org |
