# Regex

What is it?
- Regular expressions (called regexes or regex patterns) are a tiny language for dealing with text and character patterns

Why do we care? 
- With this, we can extract text, match text, or replace text that matches a pattern

In [4]:
import pandas as pd
import numpy as np

#import regex
import re

## re library function

`re.findall(pattern, string)` 

 - finds all substrings where the REGEX matches; returns a list
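As a quick sketch of what that return value looks like, the snippet below runs `re.findall` on a made-up sample string (`text` is only an illustration, not part of the lesson data). It suggests three things worth remembering: matches come back left to right, they never overlap, and a miss gives an empty list rather than `None`.

```python
import re

text = 'banana'  # hypothetical sample string

# every match, scanned left to right
all_a = re.findall(r'a', text)

# matches never overlap: 'ana' is found once, not twice
no_overlap = re.findall(r'ana', text)

# no match returns an empty list, not None
nothing = re.findall(r'z', text)

print(all_a, no_overlap, nothing)
```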

### Literals - start simple

In [2]:
subject = 'abc'

#### find the letter a

In [6]:
#find the letter a, if there is one - return ['a']
re.findall(r'a',subject)

['a']

In [7]:
regexp = r'a'

re.findall(regexp, subject)

['a']

#### find the letter d

In [8]:
#there is no d in the subject, so we get an empty list back
regexp = r'd'

re.findall(regexp, subject)

[]

### Literals - make it more complex

In [10]:
subject = 'Mary had a little lamb. 1 little lamb. Not 10 lambs, not 12, not 22, just one'

#### find not

In [11]:
#case sensitive!
regexp = r'not'

re.findall(regexp, subject)

['not', 'not']

In [12]:
regexp = r'Not'

re.findall(regexp, subject)

['Not']

##### regex flag: re.IGNORECASE

In [13]:
regexp = r'not'

re.findall(regexp, subject, re.IGNORECASE)

['Not', 'not', 'not']
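Two other spellings of the same flag are sketched below, assuming a shortened copy of the lamb sentence: `re.I` is the short alias for `re.IGNORECASE`, and `(?i)` at the start of the pattern switches the flag on inline. Both should give the same list as the cell above.

```python
import re

subject = 'Not 10 lambs, not 12, not 22'  # shortened sample

# re.I is the short alias for re.IGNORECASE
short_flag = re.findall(r'not', subject, re.I)

# (?i) at the start of the pattern turns the flag on inline
inline_flag = re.findall(r'(?i)not', subject)

print(short_flag, inline_flag)
```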

#### find lamb

In [15]:
#this matches the pattern wherever it appears, regardless of what comes before or after it (so 'lambs' counts too)
regexp = r'lamb'

re.findall(regexp, subject)

['lamb', 'lamb', 'lamb']

## Metacharacters
<div class="alert alert-block alert-info">
    
- `\w`: any letter, number, or underscore
- `\W`: anything that is *not* a letter, number, or underscore



- `\d`: any digit
- `\D`: anything that is *not* a digit


- `\s` : any whitespace


-  `.` : any character except a newline
    
</div>
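A few edge cases of these metacharacters are easy to miss, so here is a small sketch on a made-up string (`sample` is hypothetical and chosen to contain an underscore, a tab, and a newline). It also uses the `re.DOTALL` flag, which is not covered elsewhere in this lesson, to show the one character `.` skips by default.

```python
import re

sample = 'a_1\tb\nc'  # hypothetical sample: underscore, tab, newline

word_chars = re.findall(r'\w', sample)          # underscore counts as a word character
whitespace = re.findall(r'\s', sample)          # tab and newline are whitespace too
dot_default = re.findall(r'.', sample)          # '.' skips the newline by default
dot_all = re.findall(r'.', sample, re.DOTALL)   # re.DOTALL lets '.' match the newline

print(word_chars, whitespace, len(dot_default), len(dot_all))
```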

### Try them all 

In [16]:
subject = 'abcccC. 123!'

#### `\w`: any letter or number

In [17]:
#any letter or number
regexp = r'\w'

re.findall(regexp, subject)

['a', 'b', 'c', 'c', 'c', 'C', '1', '2', '3']

#### what does \w\w bring back?

In [19]:
#pairs of patterns
regexp = r'\w\w'

re.findall(regexp, subject)

['ab', 'cc', 'cC', '12']

#### `\W`: anything that is not a letter or number

In [20]:
#not a letter or number
regexp = r'\W'

re.findall(regexp, subject)

['.', ' ', '!']

#### `\d`: any digit

In [22]:
#any digit
regexp = r'\d'

re.findall(regexp, subject)

['1', '2', '3']

#### `\D`: anything that is not a digit

In [23]:
#anything that is not a digit
regexp = r'\D'

re.findall(regexp, subject)

['a', 'b', 'c', 'c', 'c', 'C', '.', ' ', '!']

#### `\s` : any whitespace

In [24]:
#spaces
regexp = r'\s'

re.findall(regexp, subject)

[' ']

#### `.` : anything

In [27]:
#everything cut up into pieces
regexp = r'.'

re.findall(regexp, subject)

['a', 'b', 'c', 'c', 'c', 'C', '.', ' ', '1', '2', '3', '!']

#### find the `.` only
Use the escape character, `\`, to match a metacharacter literally.

In [30]:
#saying i want the most literal form of this expression, the period
regexp = r'\.'

re.findall(regexp, subject)

['.']
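When the text to match comes from somewhere else (user input, a filename), escaping by hand gets tedious. One option is `re.escape`, sketched here on a made-up string; `text` and `needle` are illustrative names only.

```python
import re

text = 'a.c abc'   # hypothetical sample
needle = 'a.c'     # we want this exact text, period included

# unescaped, the '.' matches any character, so 'abc' also qualifies
loose = re.findall(needle, text)

# re.escape adds the backslashes for us
strict = re.findall(re.escape(needle), text)

print(loose, strict)
```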

<div class="alert alert-block alert-warning">
    <b>Mini-Exercise</b>
    
- Match the string 'c 1' using only metacharacters
- The returned list should have only one element in it
- Find 3 different syntax combinations

    
</div>

In [31]:
subject = 'c 1'

In [53]:
#match the string using metacharacters: not a digit, not a letter or number, whatever is left
regexp = r'\D\W.'

re.findall(regexp, subject)

['c 1']

In [55]:
#add it together using anything 3 times
regexp = r'...'

re.findall(regexp, subject)

['c 1']

In [52]:
#adding metacharacters together
regexp = r'\D\s\d'

re.findall(regexp, subject)

['c 1']

## Repeating

<div class='alert alert-info'>
    
- `{}`: custom number of repetitions
    - `{x}`: exactly x repetitions
    - `{x,}`: x or more
    - `{x,y}`: between x and y repetitions

- `*`: zero or more
- `+`: one or more
- `?`: optional (zero or one); placed after another quantifier (`+?`, `*?`), it makes that quantifier non-greedy
    
</div>
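The two jobs of `?` are easy to mix up, so here is a minimal sketch with made-up strings (the `colour` and `tags` samples are assumptions for illustration). After a single character, `?` makes that character optional; after `+` or `*`, it makes the repetition stop as early as possible.

```python
import re

# '?' after a single token makes that token optional (zero or one)
optional = re.findall(r'colou?r', 'color colour')

tags = '<b>bold</b>'  # hypothetical sample

# '+' alone is greedy: it grabs as much as it can
greedy = re.findall(r'<.+>', tags)

# '+?' is non-greedy: it stops at the first chance
lazy = re.findall(r'<.+?>', tags)

print(optional, greedy, lazy)
```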

In [56]:
subject = 'ccccc! 123 ccc! 99!'

#### find the whole string

In [58]:
#find something and then everything after that
regexp = r'.+'

re.findall(regexp, subject)

['ccccc! 123 ccc! 99!']

#### find all the c groups

In [59]:
#specific number of instances
regexp = r'c{5}'

re.findall(regexp, subject)

['ccccc']

In [61]:
#3 c's
regexp = r'c{3}'

re.findall(regexp, subject)

['ccc', 'ccc']

In [65]:
#grab all groups three or more
regexp = r'c{3,}'

re.findall(regexp, subject)

['ccccc', 'ccc']

In [66]:
#grab all groups without knowing how many repetitions there are
regexp = r'c+'

re.findall(regexp, subject)

['ccccc', 'ccc']

#### find 123 and anything after it

In [71]:
#find a pattern and anything after it
regexp = r'123.+'

re.findall(regexp, subject)

['123 ccc! 99!']

#### find the exclamation points and everything in between them

In [75]:
#give me the first exclamation point, then anything after that, give me the last
#exclamation point

#this is being greedy!
regexp = r'!.+!'

re.findall(regexp, subject)

['! 123 ccc! 99!']

#### find the exclamation points and everything in between the first two

In [77]:
#first exclamation point, then as few characters as possible,
#stopping at the next exclamation point you see
regexp = r'!.+?!'

re.findall(regexp, subject)

['! 123 ccc!']

<div class='alert alert-warning'>
<b>Mini Exercise</b>
    
From the below string, find the following information:
- Find all the numbers
- Find the number that has exactly 5 digits
- Find numbers that have 4 or more digits
- Find `http://` or `https://`
- Find the sentences contained in quotes

</div>

In [164]:
subject = '''
Codeup, founded in 2014, is located at 600 Navarro St. 
Suite 350, San Antonio, TX 78230. 
tagline: "launch your career in tech!" 
You can find us online at http://codeup.com 
and our alumni portal is located at https://alumni.codeup.com
and "codeup is a great school"!
'''

subject = subject.replace('\n','')
subject

'Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, San Antonio, TX 78230. tagline: "launch your career in tech!" You can find us online at http://codeup.com and our alumni portal is located at https://alumni.codeup.comand "codeup is a great school"!'

#### Find all the numbers

In [165]:
regexp = r'\d+'

re.findall(regexp, subject)

['2014', '600', '350', '78230']

#### Find a number that has exactly 5 digits

In [166]:
regexp = r'\d{5}'

re.findall(regexp, subject)

['78230']

In [167]:
regexp = r'\d\d\d\d\d'

re.findall(regexp, subject)

['78230']

#### Find numbers that have 4 or more digits

In [87]:
regexp = r'\d{4,}'

re.findall(regexp, subject)

['2014', '78230']

#### Find the sentences contained in quotes

In [94]:
regexp = r'".+?"'

re.findall(regexp, subject)

['"launch your career in tech!"', '"codeup is a great school"']

#### Find `http://` or `https://`

In [107]:
regexp = r'.tt.+?\D\W{2,}'

re.findall(regexp, subject)

['http://', 'https://']

In [170]:
regexp = r'.t{2,}.s?\W{2,}'

re.findall(regexp, subject)

['http://', 'https://']

In [168]:
regexp = r'.t{2,}ps?\W{2,}'

re.findall(regexp, subject)

['http://', 'https://']

In [171]:
#the s is optional, so use ? (zero or one) rather than * (zero or more)
regexp = r'https?://'

re.findall(regexp, subject)

['http://', 'https://']

### Any of / None of

<div class='alert alert-info'>
    
- `[]`: will match any element inside of
- `[^]`: will match any element NOT inside of
- `[-]`: will match a range of values inside of
    
</div>
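Brackets can hold more than one range, and they combine with the repetition symbols from the previous section. The sketch below assumes the same `subject` used in the next cells plus a made-up version string.

```python
import re

subject = 'abc 12345 1bc'

# several ranges can share one pair of brackets
letters_or_small_digits = re.findall(r'[a-c1-3]', subject)

# a quantifier after the brackets applies to the whole class
runs = re.findall(r'[a-c]+', subject)

# inside brackets most metacharacters are literal: '.' here is just a period
dots = re.findall(r'[.]', 'v1.2.3')  # hypothetical version string

print(letters_or_small_digits, runs, dots)
```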

#### match using brackets

In [173]:
subject = 'abc 12345 1bc'

#### find a or b

In [174]:
regexp = r'[ab]'

re.findall(regexp, subject)

['a', 'b', 'b']

In [175]:
regexp = r'a|b'

re.findall(regexp, subject)

['a', 'b', 'b']

#### find values that are NOT a or b

In [176]:
regexp = r'[^ab]'

re.findall(regexp, subject)

['c', ' ', '1', '2', '3', '4', '5', ' ', '1', 'c']

#### find values that are between 2 and 4

In [155]:
regexp = r'[2-4]'

re.findall(regexp, subject)

['2', '3', '4']

### Anchors
<div class='alert alert-info'>

- `^`: starts with
- `$`: ends with
- `\b`: word boundary

</div>

In [196]:
subject = 'kiwi aardvark banana codeup data science academy extra'

#### match all words that start with a vowel

In [184]:
regexp = r'[aeiou]\w+'

re.findall(regexp, subject)

['iwi', 'aardvark', 'anana', 'odeup', 'ata', 'ience', 'academy', 'extra']

In [185]:
#using a boundary
regexp = r'\b[aeiou]\w+'

re.findall(regexp, subject)

['aardvark', 'academy', 'extra']

In [190]:
#using a caret - only looks at the beginning of the string. The subject is one entire string, so the caret cannot look at each word alone
regexp = r'^[aeiou]\w+'

re.findall(regexp, subject)

[]

In [187]:
#split subjects to use anchor
subjects = subject.split()
subjects

['kiwi', 'aardvark', 'banana', 'codeup', 'data', 'science', 'academy', 'extra']

In [193]:
#use for loop to cycle through them (a separate loop variable keeps subject intact)
for word in subjects:
    print(re.findall(r'^[aeiou]\w+', word))

[]
['aardvark']
[]
[]
[]
[]
['academy']
['extra']
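An alternative to splitting and looping, sketched here as one possible approach rather than the lesson's method: put each word on its own line and pass `re.MULTILINE`, a flag not otherwise covered in this lesson. With it, `^` and `$` match at the start and end of every line instead of only the whole string. The names `words` and `lines` are illustrative.

```python
import re

words = 'kiwi aardvark banana codeup data science academy extra'

# put each word on its own line (hypothetical preprocessing step)
lines = words.replace(' ', '\n')

# with re.MULTILINE, ^ matches at the start of every line
starts_with_vowel = re.findall(r'^[aeiou]\w+', lines, re.MULTILINE)

# and $ matches at the end of every line
ends_with_vowel = re.findall(r'\w+[aeiou]$', lines, re.MULTILINE)

print(starts_with_vowel)
print(ends_with_vowel)
```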


#### match all words that end with a vowel

In [200]:
#using a word boundary after the vowel - matches each word that ends with a vowel
regexp = r'\w+[aeiou]\b'

re.findall(regexp, subject)

['kiwi', 'banana', 'data', 'science', 'extra']

In [201]:
subject = 'kiwi aardvark banana codeup data science academy extra'

In [202]:
#finds the last element in a string that ends in a vowel - only looks at the very end of a string, not each element
regexp = r'\w+[aeiou]\b$'

re.findall(regexp, subject)

['extra']

<div class='alert alert-warning'>

<b>Mini Exercise</b>
    
Write regular expressions to find the following values

- Find any even digits (regardless of whether it's part of a bigger number)
- Find entire numbers that are even
   
    
- Find 2 or more odd digits in a row.

    
- Find all the capital letters
- Find all words that start with a capital letter
    
</div>

In [294]:
subject = '''
Codeup, founded in 2014, is located at 600 Navarro St. 
Suite 350, San Antonio, TX 78230. 
tagline: "launch your career in tech!" 
You can find us online at http://codeup.com 
and our alumni portal is located at https://alumni.codeup.com
and "codeup is a great school"!
'''

subject = subject.replace('\n','')
subject

'Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, San Antonio, TX 78230. tagline: "launch your career in tech!" You can find us online at http://codeup.com and our alumni portal is located at https://alumni.codeup.comand "codeup is a great school"!'

#### Find any even digits (regardless of whether it's part of a bigger number)

In [295]:
regexp = r'[02468]'

re.findall(regexp, subject)

['2', '0', '4', '6', '0', '0', '0', '8', '2', '0']

#### Find entire numbers that are even

In [307]:
#\d* (zero or more) so a single-digit even number would match too
regexp = r'\d*[02468]\b'

re.findall(regexp, subject)

['2014', '600', '350', '78230']

#### Find 2 or more odd digits in a row (regardless of whether it's part of a bigger number)

In [308]:
regexp = r'[13579]{2,}'

re.findall(regexp, subject)

['35']

#### Find all the capital letters

In [298]:
regexp = r'[A-Z]'

re.findall(regexp, subject)

['C', 'N', 'S', 'S', 'S', 'A', 'T', 'X', 'Y']

#### Find all words that start with a capital letter

In [302]:
#split subjects to use anchor
subjects = subject.split()
#use for loop to cycle through them
for sub in subjects:
    print(re.findall(r'^\b[A-Z]\w+', sub))

['Codeup']
[]
[]
[]
[]
[]
[]
[]
['Navarro']
['St']
['Suite']
[]
['San']
['Antonio']
['TX']
[]
[]
[]
[]
[]
[]
[]
['You']
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]


In [309]:
regexp = r'[A-Z]\w*'

re.findall(regexp, subject)

['Codeup', 'Navarro', 'St', 'Suite', 'San', 'Antonio', 'TX', 'You']

### Capture Groups

<div class='alert alert-info'>
    
- `()`: grab what's contained in parentheses 
    
</div> 

In [312]:
subject = '''
    You can find us on the web at https://codeup.com. Our ip address is 123.123.123.123 (maybe).
    '''

#### find the domain only

In [314]:
#escape the period before com so it only matches a literal '.'
regexp = r'https?://(.+)\.com'

re.findall(regexp, subject)

['codeup']

#### find everything after the first sentence

In [324]:
regexp = r'\.\s(.+)'

re.findall(regexp, subject)

['Our ip address is 123.123.123.123 (maybe).']

#### find the ip address

In [353]:
regexp = r'address is (.+)'

re.findall(regexp, subject)

['123.123.123.123 (maybe).']

In [361]:
#don't be greedy, but we are missing the first group of digits
regexp = r'\d{3}\.(.+?)\s'

re.findall(regexp, subject)

['123.123.123']

In [362]:
#using grouping to add the last one, but findall will return BOTH sets of parentheses in the expression
regexp = r'((\d{3}\.){3}\d{3})'

re.findall(regexp, subject)

[('123.123.123.123', '123.')]
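The tuple above follows from how `re.findall` treats capture groups. A minimal sketch on a made-up string (`text` is hypothetical): with no groups you get whole matches, with one group you get just that group, and with two or more you get one tuple per match.

```python
import re

text = 'x=1, y=22'  # hypothetical sample

no_group = re.findall(r'\w=\d+', text)        # no groups: whole matches
one_group = re.findall(r'\w=(\d+)', text)     # one group: just that group
two_groups = re.findall(r'(\w)=(\d+)', text)  # two or more: one tuple per match

print(no_group, one_group, two_groups)
```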

### Non Capture Group

<div class='alert alert-info'>

- `(?:...)`: a non-capturing group (also called a shy group); it groups the pattern but is left out of the results

</div>

#### find the ip address

In [366]:
#using the non-capture group so the inner parentheses are ignored, only grabbing the group that is not shy
regexp = r'((?:\d{3}\.){3}\d{3})'

re.findall(regexp, subject)

['123.123.123.123']

<div class='alert alert-warning'>

<b>Mini-Exercise</b>    
    
- Find the protocol, domain, and tld
- Edit your previous code with a non-capture group to remove the domain from your result
    
</div>

In [367]:
subject = '''
    You can find us on the web at https://codeup.com. Our ip address is 123.123.123.123 (maybe).
    '''

#### Find the protocol, domain, and tld

In [371]:
regexp = r'(https?://)(\w+)\.(\w+)'

re.findall(regexp, subject)

[('https://', 'codeup', 'com')]

#### Edit your previous code with a non-capture group to remove the domain

In [372]:
regexp = r'(https?://)(?:\w+)\.(\w+)'

re.findall(regexp, subject)

[('https://', 'com')]

## more re library functions

### `re.search`

- scans through a string, looking for any location where the RE matches; returns match object
- stops at the first complete match, doesn't iterate through

In [1]:
subject = """
    Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, 
    San Antonio, TX 78230. 
    You can find us online at http://codeup.com and
    our alumni portal is located at https://alumni.codeup.com.
    """

#### find the word located

In [6]:
#save the match result
regexp = r'located'

match = re.search(regexp, subject)
match

<re.Match object; span=(33, 40), match='located'>

In [10]:
#results type
type(match)

re.Match

In [11]:
#.span()
match.span()

(33, 40)

In [14]:
#.group()
match.group()

'located'

In [None]:
#index into the match: match[0] is the whole match, same as match.group()
match[0]

#### find numbers

In [17]:
#search only pulls back the first match, doesn't iterate
regexp = r'\d+'

re.search(regexp, subject)

<re.Match object; span=(24, 28), match='2014'>

#### find navarro

In [18]:
regexp = 'navarro'

re.search(regexp, subject)

In [19]:
#the type is NoneType because re.search returns None when nothing matches
type(re.search(regexp, subject))

NoneType

In [20]:
#case sensitive
regexp = 'Navarro'

re.search(regexp, subject)

<re.Match object; span=(48, 55), match='Navarro'>

In [21]:
type(re.search(regexp, subject))

re.Match
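Because `re.search` can return `None`, calling `.group()` straight on the result raises an `AttributeError` when nothing matches. One way to guard against that is sketched below; `first_match` is a hypothetical helper written for this example, not part of the `re` library, and the subject is a shortened sample.

```python
import re

subject = 'Codeup is located at 600 Navarro St.'  # shortened sample

def first_match(pattern, text):
    """Return the matched text, or None when the pattern is not found."""
    match = re.search(pattern, text)
    # re.search returns None on failure, so check before calling .group()
    if match is None:
        return None
    return match.group()

found = first_match(r'Navarro', subject)
missing = first_match(r'navarro', subject)

print(found, missing)
```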

### Name Capture Group

- `(?P<name>...)`: to name a capture group

In [38]:
subject = '''
    You can find us on the web at https://codeup.com. Our ip address is 123.123.123.123 (maybe).
    '''

#### find the protocol, domain, and tld and name them

In [23]:
regexp = r'(https?)://(\w+)\.(\w+)'

re.search(regexp, subject)

<re.Match object; span=(35, 53), match='https://codeup.com'>

In [27]:
#name each capture group with ?P<name>
regexp = r'(?P<protocol>https?)://(?P<domain>\w+)\.(?P<tld>\w+)'

match = re.search(regexp, subject)
match

<re.Match object; span=(35, 53), match='https://codeup.com'>

In [28]:
#looking at the groups
match.groups()

('https', 'codeup', 'com')

In [29]:
#groupdict()
match.groupdict()

{'protocol': 'https', 'domain': 'codeup', 'tld': 'com'}
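Besides `groups()` and `groupdict()`, a single named group can be pulled out directly. A short sketch, assuming a shortened subject string: `group()` accepts either the name or the position, and the match object also supports subscript access.

```python
import re

subject = 'find us at https://codeup.com today'  # shortened sample
match = re.search(r'(?P<protocol>https?)://(?P<domain>\w+)\.(?P<tld>\w+)', subject)

by_name = match.group('domain')   # look a group up by its name
by_index = match.group(2)         # the same group by position
by_subscript = match['tld']       # subscript access works with names too
whole = match.group(0)            # group 0 is always the entire match

print(by_name, by_index, by_subscript, whole)
```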

##### regex flag:  `re.VERBOSE`

- `re.VERBOSE` will ignore whitespace in the regex pattern and lets you add `#` comments, so a long pattern can be split across lines

In [30]:
regexp = r'''
            (?P<protocol>https?)://
            (?P<domain>\w+)\.
            (?P<tld>\w+)
            '''
match = re.search(regexp, subject, re.VERBOSE)
match

<re.Match object; span=(35, 53), match='https://codeup.com'>
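Verbose mode also allows a comment on each line of the pattern. The sketch below uses a made-up phone-number string (`text` and the pattern are illustrative assumptions). Note that whitespace inside square brackets is still kept, which is one way to match a literal space in verbose mode.

```python
import re

text = 'call 555-1234 or 555 9876'  # hypothetical sample

regexp = r'''
    (\d{3})    # first three digits
    [- ]       # a dash or a space (whitespace inside brackets is kept)
    (\d{4})    # last four digits
    '''

numbers = re.findall(regexp, text, re.VERBOSE)

print(numbers)
```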

### `re.sub(pattern, repl, string)`

- allows us to match a regex and substitute in a new substring for the match; returns string

In [31]:
subject = 'abc 12345xyz'

#### remove all the digits

In [32]:
regexp = r'\d'

re.sub(regexp,'', subject)

'abc xyz'

#### replace all digits with an o 

In [33]:
regexp = r'\d'

re.sub(regexp,'o', subject)

'abc oooooxyz'

#### replace all the digits with a single o

In [36]:
regexp = r'\d+'

re.sub(regexp,'o', subject)

'abc oxyz'
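`re.sub` can do more than swap in fixed text. The sketch below, on a made-up date string plus the same subject as above, shows three options that may be useful: referring back to capture groups with `\1`, `\2`, limiting the number of replacements with `count`, and passing a function as the replacement.

```python
import re

date = '2014-02-05'  # hypothetical sample date

# \1, \2, \3 in the replacement refer to the capture groups
reordered = re.sub(r'(\d{4})-(\d{2})-(\d{2})', r'\2/\3/\1', date)

# count limits how many matches are replaced
first_only = re.sub(r'\d', 'o', 'abc 12345xyz', count=1)

# the replacement can also be a function that receives each match object
doubled = re.sub(r'\d+', lambda m: str(int(m.group()) * 2), 'abc 12345xyz')

print(reordered, first_only, doubled)
```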

#### To use regex with pandas `str.replace`, add the `regex=True` argument

In [43]:
#subject is still 'abc 12345xyz' from the cells above
pd.Series(subject).str.replace(r'[aeiou]','*', regex=True)

0    *bc 12345xyz
dtype: object

### `re.compile`

- prepare a regular expression for use ahead of time; returns a compiled pattern object that can be reused

In [44]:
emails = [
    "jane@company.com",
    "bob@company.com",
    "jane.janeway@company.com",
    "jane.janeway@dogood.org",
    "jane.janet.janeway@dogood.org", # bonus for the 3 part address
]

#### get name, domain, tld

In [46]:
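#using findall to split up the email (the pattern allows only one dotted section in the name, so the 3 part address comes back as 'janet.janeway')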
regexp = r'(\w+\.*\w+)@(\w+)\.(\w+)'

for email in emails:
    print(re.findall(regexp, email))

[('jane', 'company', 'com')]
[('bob', 'company', 'com')]
[('jane.janeway', 'company', 'com')]
[('jane.janeway', 'dogood', 'org')]
[('janet.janeway', 'dogood', 'org')]


In [47]:
#compile the pattern
pattern = re.compile(r'''
        (?P<name>\w+\.*\w+)@
        (?P<domain>\w+)\.
        (?P<tld>\w+)
        ''', re.VERBOSE)
pattern

re.compile(r'\n        (?P<name>\w+\.*\w+)@\n        (?P<domain>\w+)\.\n        (?P<tld>\w+)\n        ',
re.UNICODE|re.VERBOSE)

In [50]:
#throw into contacts list of dictionaries, calling search on the compiled pattern
contacts = [pattern.search(email).groupdict() for email in emails]
contacts

[{'name': 'jane', 'domain': 'company', 'tld': 'com'},
 {'name': 'bob', 'domain': 'company', 'tld': 'com'},
 {'name': 'jane.janeway', 'domain': 'company', 'tld': 'com'},
 {'name': 'jane.janeway', 'domain': 'dogood', 'tld': 'org'},
 {'name': 'janet.janeway', 'domain': 'dogood', 'tld': 'org'}]
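The list comprehension above would fail if any email did not match, since `search` returns `None` in that case. Here is a sketch of a more defensive version. The email list is made up, and the name pattern `[\w.]+` is an alternative chosen for this example (not the pattern used above) so that the whole three-part address is captured.

```python
import re

# hypothetical list, including one entry that is not an email
emails = ['jane@company.com', 'not-an-email', 'jane.janet.janeway@dogood.org']

# assumption: letters, digits, underscores, and periods are allowed in the name
pattern = re.compile(r'(?P<name>[\w.]+)@(?P<domain>\w+)\.(?P<tld>\w+)')

contacts = []
for email in emails:
    match = pattern.search(email)   # same as re.search(pattern, email)
    if match:                       # skip entries that do not match
        contacts.append(match.groupdict())

print(contacts)
```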

In [51]:
#turn into df
pd.DataFrame(contacts)

| | name | domain | tld |
|---|---|---|---|
| 0 | jane | company | com |
| 1 | bob | company | com |
| 2 | jane.janeway | company | com |
| 3 | jane.janeway | dogood | org |
| 4 | janet.janeway | dogood | org |


# Recap

re functions
- `re.findall()`: finds all substrings where the REGEX matches; returns a list
- `re.search()`: scans through a string, looking for any location where the RE matches; returns match object
    - `match.span()`
    - `match.group()`
    - `match.groups()`
    - `match.groupdict()`
- `re.sub()`: allows us to match a regex and substitute in a new substring for the match; returns string
- `re.compile()`: prepare a regular expression for use ahead of time; returns a compiled pattern object

flags
- `re.IGNORECASE`: ignore case
- `re.VERBOSE`: ignore whitespace in the pattern and allow `#` comments



Metacharacters
- `\w`: any letter, number, or underscore
- `\W`: anything that is *not* a letter, number, or underscore
- `\d`: any digit
- `\D`: anything that is *not* a digit
- `\s` : any whitespace
-  `.` : any character except a newline


Repeating

- `{}`: custom number of repetitions
    - `{x}`: exactly x repetitions
    - `{x,}`: x or more
    - `{x,y}`: between x and y repetitions

- `*`: zero or more
- `+`: one or more
- `?`: optional (zero or one); placed after another quantifier (`+?`, `*?`), it makes that quantifier non-greedy


Any/None

- `[]`: will match anything inside of
- `[^]`: will match anything NOT inside of
- `[-]`: will match a range of values inside of


Capture Groups

- `()`: grab what's contained in parentheses 
- `(?:...)`: a non-capturing group (also called a shy group); left out of the results
- `(?P<name>...)`: to name a capture group


Other
- `\`: escape character
    


