# Introduction to Regular Expressions

> A regular expression is a group of characters or symbols which is used to find a specific pattern in a text.

Regular expressions are used for replacing text within a string, validating forms, extracting a substring from a string based on a pattern match, and much more. The term "regular expression" is a mouthful, so you will usually find it abbreviated to "regex" or "regexp".

These are the most common reasons you will use Regular Expressions:
- Looking for substrings within input
- Cleaning of input
- Restructuring of input
- Validating of input

***When learning regular expressions, it may be helpful to also learn restraint. You might be tempted, like me, to see regular expressions as a solution to far too many problems.***

Note that this is a basic introduction to Regular Expressions. There are more in-depth guides out there (see below) that will allow you to truly master Regular Expressions. For this session we are going to focus on the application, learning how to apply Regular Expressions to Pandas dataframes.

### Resources
[Mastering Regular Expressions](https://www.thriftbooks.com/w/mastering-regular-expressions/259441/item/2535867/?mkwid=%7cdc&pcrid=448964098780&pkw=&pmt=&slid=&plc=&pgrid=105775167313&ptaid=pla-924743127976&gclid=CjwKCAiAxp-ABhALEiwAXm6IyV4UD5p8UN1K8KqzEW5ZgAhqxVMoS3Zvo9ppibcqmNyFLJk-MDmG6hoClfEQAvD_BwE#idiq=2535867&edition=2310692) - Hands down, the best resource for learning Regular Expressions<br>
[Regex101.com](https://regex101.com/) - Best online resource for testing your regular expressions<br>
[Regular Expressions Cheat Sheet](https://www.regular-expressions.info/) - Quick Reference Guide<br>
[Regular-Expressions.info](https://www.regular-expressions.info/) - Best online resource for learning regular expressions<br>
[Regular Expression Python Documentation](https://docs.python.org/2/howto/regex.html) - Official Documentation<br>
[Learn Regular Expressions the Easy Way](https://github.com/ziishaned/learn-regex/blob/master/README.md#learn-regex) - Great resource for free on Github

# Basic Syntax

## Special Characters

Special characters either stand for classes of ordinary characters, or affect how the regular expressions around them are interpreted.

| Character | Description |
| :---: | --- |
|\. | Matches any single character except a newline |
|\^ | Matches a pattern at the start of a string |
|\$ | Matches a pattern at the end of a string |
|\\ | Escapes these special characters \. \^ \$ \* \+ \? \[ \] \( \) \{ \} \| \\  |
|\| | Alternation; matches the expression before it or the expression after it |
|\? | Makes the preceding symbol optional |
|\+ | Matches one or more repetitions of the preceding symbol |
|\* | Matches 0 or more repetitions of the preceding symbol |
|\[] |  Matches any character that is contained between the square brackets |
|\[^] | Matches any character that is not contained between the square brackets |
|\{m,n} | Match from m repetitions to n repetitions |
|\() | Groups the enclosed pattern into a single unit and captures the text it matches |
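
As a minimal sketch of how the characters in this table behave, the snippet below runs a few patterns through `re.findall`. The sample strings and variable names are made up for illustration.

```python
import re

text = "cat, cot, cut, coat, ct"

# . matches exactly one character between c and t
any_char = re.findall(r"c.t", text)          # ['cat', 'cot', 'cut']

# [ao] allows only a or o; [^ao] allows anything except a or o
in_class = re.findall(r"c[ao]t", text)       # ['cat', 'cot']
not_in_class = re.findall(r"c[^ao]t", text)  # ['cut']

# ? makes the preceding o optional
optional = re.findall(r"co?t", text)         # ['cot', 'ct']

# | matches either alternative
either = re.findall(r"cat|cut", text)        # ['cat', 'cut']

# ^ and $ anchor the pattern to the start and end of the string
at_start = re.findall(r"^c.t", text)         # ['cat']
at_end = re.findall(r"ct$", text)            # ['ct']

# {m,n} bounds the number of repetitions
bounded = re.findall(r"\d{2,3}", "1 22 333 4444")  # ['22', '333', '444']

# () groups part of the pattern and captures what it matched
captured = re.findall(r"c(.)t", text)        # ['a', 'o', 'u']

# \ escapes a special character so it is matched literally
literal_dots = re.findall(r"\.", "a.b.c")    # ['.', '.']
```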

## Special Sequences

| Character | Description |
| :---: | --- |
| \d | Matches any decimal digit; this is equivalent to the class [0-9] |
| \D | Matches any non-digit character; this is equivalent to the class [^0-9] | 
| \s | Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v] | 
| \S | Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v] | 
| \w | Matches any word character (letter, digit, or underscore); this is equivalent to the class [a-zA-Z0-9_] | 
| \W | Matches any non-word character; this is equivalent to the class [^a-zA-Z0-9_] | 
| \A | Matches only at the start of the string | 
| \Z | Matches only at the end of the string | 
| \B | Matches at any position that is NOT a word boundary (not the beginning or end of a word) | 
| \b | Matches at a word boundary (the beginning or end of a word) | 
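
The short sketch below shows how the boundary and anchor sequences differ from one another. The sample strings are invented for the example.

```python
import re

text = "cat concat category"

# \b requires a word boundary on both sides, so only the standalone word matches
whole_word = re.findall(r"\bcat\b", text)                        # ['cat']

# \B requires that there is NO boundary, so only the "cat" inside "concat" matches
inside_word = [m.start() for m in re.finditer(r"\Bcat", text)]   # [7]

# \A always means the start of the whole string, even in multiline mode,
# while ^ matches at the start of every line when re.M is set
lines = "one\none"
string_start = re.findall(r"\Aone", lines, re.M)   # ['one']
line_start = re.findall(r"^one", lines, re.M)      # ['one', 'one']

# \d+ collects runs of digits, \s+ splits on any run of whitespace
digits = re.findall(r"\d+", "Room 12, Floor 3")    # ['12', '3']
tokens = re.split(r"\s+", "a  b\tc\nd")            # ['a', 'b', 'c', 'd']
```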

## LookArounds

| Character | Description |
| :---: | --- |
| (?=...) | Positive lookahead - matches only if the current position is followed by the lookahead pattern |
| (?!...) | Negative lookahead - matches only if the current position is not followed by the pattern | 
| (?<=...) | Positive lookbehind - matches only if the current position is preceded by the pattern | 
| (?<!...) | Negative lookbehind - matches only if the current position is not preceded by the pattern | 
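
Lookarounds check the text around a match without including it in the result. Here is a small sketch of all four kinds. The `prices` string is a made-up example.

```python
import re

prices = "apple $3, pear $12, 7 plums"

# Positive lookahead: words that are followed by " $"
before_price = re.findall(r"\w+(?= \$)", prices)           # ['apple', 'pear']

# Negative lookahead: whole words that are NOT followed by " $"
not_before_price = re.findall(r"\b\w+\b(?! \$)", prices)   # ['3', '12', '7', 'plums']

# Positive lookbehind: numbers that are preceded by a dollar sign
dollar_amounts = re.findall(r"(?<=\$)\d+", prices)         # ['3', '12']

# Negative lookbehind: numbers NOT preceded by a dollar sign (or another digit)
plain_numbers = re.findall(r"(?<![\$\d])\d+", prices)      # ['7']
```

Note that the dollar sign and the space never appear in the results. The lookaround only decides whether the match is allowed.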


## Flags

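| Flag | Description |
| :---: | --- |
| i | Case insensitive: the match ignores case. In Python, pass re.I (re.IGNORECASE). |
| g | Global search: match all instances, not just the first. Python has no such flag; use re.findall or re.finditer instead. | 
| m | Multiline: the anchors ^ and $ match at the start and end of each line. In Python, pass re.M (re.MULTILINE). | 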
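
In Python, flags are passed as an extra argument (or written inline, as in `(?i)`). The sketch below shows the effect of each one on a made-up three-line string.

```python
import re

text = "Cat\ncat\nDog"

# Without flags the match is case sensitive
case_sensitive = re.findall(r"cat", text)             # ['cat']

# re.I (re.IGNORECASE) ignores case
ignore_case = re.findall(r"cat", text, re.I)          # ['Cat', 'cat']

# The same flag can be written inline at the start of the pattern
inline_flag = re.findall(r"(?i)cat", text)            # ['Cat', 'cat']

# Without re.M, ^ and $ only anchor to the whole string
whole_string = re.findall(r"^\w+$", text)             # []

# With re.M (re.MULTILINE), ^ and $ anchor to every line
each_line = re.findall(r"^\w+$", text, re.M)          # ['Cat', 'cat', 'Dog']

# Flags are combined with |
combined = re.findall(r"^cat$", text, re.I | re.M)    # ['Cat', 'cat']

# There is no "global" flag: re.search stops at the first match, re.findall returns all
first_only = re.search(r"(?i)cat", text).group()      # 'Cat'
```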

# Regex Functions

In [285]:
import re

# Matching a word
sentence = 'One is 1, ten is 10, one hundred is 100'
pattern = 'One'

print(re.search(pattern, sentence), '\n') # Returns a match object if found
print(re.match(pattern, sentence), '\n') # Only works if the match is found at the beginning of the string
print(re.split(pattern, sentence), '\n') # Splits the string, removes the match
print(re.findall(pattern, sentence), '\n') # Return all matches into a list of strings
print(re.finditer(pattern, sentence), '\n') # Returns an iterator, which you can loop through
print(re.sub(pattern, '1', sentence), '\n') # Replaces with a new string
print(re.subn(pattern, '1', sentence)) # Returns a tuple

<re.Match object; span=(0, 3), match='One'> 

<re.Match object; span=(0, 3), match='One'> 

['', ' is 1, ten is 10, one hundred is 100'] 

['One'] 

<callable_iterator object at 0x0000024C0A054148> 

1 is 1, ten is 10, one hundred is 100 

('1 is 1, ten is 10, one hundred is 100', 1)
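
The printed iterator above is not very informative on its own. As a small sketch, this is how you might loop over `re.finditer` and read the match objects, and how `re.match` differs from `re.search` when the pattern is not at the start of the string.

```python
import re

sentence = 'One is 1, ten is 10, one hundred is 100'

# Each item from finditer is a match object with the matched text and its position
number_spans = [(m.group(), m.span()) for m in re.finditer(r'\d+', sentence)]
print(number_spans)   # [('1', (7, 8)), ('10', (17, 19)), ('100', (36, 39))]

# re.match only looks at the beginning of the string, re.search scans all of it
print(re.match('ten', sentence))            # None
print(re.search('ten', sentence).span())    # (10, 13)

# subn returns the new string and the number of replacements made
replaced = re.subn('one', '1', sentence, flags=re.I)
print(replaced)       # ('1 is 1, ten is 10, 1 hundred is 100', 2)
```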


In [286]:
''' Splitting Strings '''
import timeit

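# Built-in str.split (plain Python, not Pandas)
a = sentence.split(",")
%timeit a

# Regex
b = re.split(",", sentence)
%timeit b

# Note: `%timeit a` and `%timeit b` only time looking up the already-computed
# variables, which is why the two results are nearly identical. To compare the
# operations themselves, time the calls, e.g. %timeit sentence.split(",")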

26.4 ns ± 1.86 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
26.4 ns ± 1.55 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
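
To time the operations themselves outside of a notebook, one option is the standard `timeit` module. This is only a sketch: the repetition count is an arbitrary choice, and the numbers depend on your machine, so no expected timings are shown.

```python
import re
import timeit

sentence = 'One is 1, ten is 10, one hundred is 100'

# Both approaches produce the same list for a literal separator
str_result = sentence.split(",")
re_result = re.split(",", sentence)

# Time the calls themselves, not the stored results
str_seconds = timeit.timeit(lambda: sentence.split(","), number=10_000)
re_seconds = timeit.timeit(lambda: re.split(",", sentence), number=10_000)

print(f"str.split: {str_seconds:.4f} s for 10,000 calls")
print(f"re.split:  {re_seconds:.4f} s for 10,000 calls")
```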


In [287]:
''' Replacing '''
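# Built-in str.replace (plain Python, not Pandas)
a = sentence.replace('1', 'One')
%timeit a

# Regex
b = re.sub('1', 'One', sentence)
%timeit b

# Note: this times only the lookup of the variables a and b, not the replacements themselves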

24.8 ns ± 2.4 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
23.7 ns ± 0.643 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
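
For a fixed piece of text, `str.replace` and `re.sub` do the same job. `re.sub` earns its keep when the replacement depends on what was matched. A small sketch, using the same `sentence` as above:

```python
import re

sentence = 'One is 1, ten is 10, one hundred is 100'

# count limits how many replacements are made
first_only = re.sub('1', 'One', sentence, count=1)
print(first_only)   # One is One, ten is 10, one hundred is 100

# The replacement can be a function that receives each match object
doubled = re.sub(r'\d+', lambda m: str(int(m.group()) * 2), sentence)
print(doubled)      # One is 2, ten is 20, one hundred is 200
```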


# String Examples

In [292]:
sentence = "One is 1, ten is 10, one hundred is 100"

''' Finding all the word characters (letters, digits and underscores) in a string '''
print(re.findall(r'\w', sentence))

''' Finding all the words in a string, including numbers '''
print(re.findall(r'\w+', sentence))

''' Finding all the runs of non-digit characters in a string '''
print(re.findall(r'\D+', sentence))

''' Finding all the individual digits within a string '''
print(re.findall(r'\d', sentence))

''' Finding all the numbers within a string (runs of one or more digits) '''
print(re.findall(r'\d+', sentence))

['O', 'n', 'e', 'i', 's', '1', 't', 'e', 'n', 'i', 's', '1', '0', 'o', 'n', 'e', 'h', 'u', 'n', 'd', 'r', 'e', 'd', 'i', 's', '1', '0', '0']
['One', 'is', '1', 'ten', 'is', '10', 'one', 'hundred', 'is', '100']
['One is ', ', ten is ', ', one hundred is ']
['1', '1', '0', '1', '0', '0']
['1', '10', '100']
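
If you do want to limit the length of the numbers you find, a `{m,n}` quantifier on its own is not enough, because it will happily match part of a longer number. A sketch with a made-up string:

```python
import re

text = "7, 42, 365, 2024, 100000"

# {1,3} alone chops longer numbers into pieces
pieces = re.findall(r'\d{1,3}', text)
print(pieces)         # ['7', '42', '365', '202', '4', '100', '000']

# Adding word boundaries keeps only numbers that are 1 to 3 digits long
short_numbers = re.findall(r'\b\d{1,3}\b', text)
print(short_numbers)  # ['7', '42', '365']

# Letters only, leaving the numbers out
sentence = 'One is 1, ten is 10, one hundred is 100'
words_only = re.findall(r'[A-Za-z]+', sentence)
print(words_only)     # ['One', 'is', 'ten', 'is', 'one', 'hundred', 'is']
```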


In [295]:
email = 'matt_federighi@gensler.com'

''' Everything before the @ symbol '''
print(re.findall(r'.+?(?=@)', email))

''' Everything after the @ symbol '''
print(re.findall(r'@(.*)', email))

''' Everything between the @ and the last . '''
print(re.findall(r'(?<=@)(.*)(?=\.)', email))

['matt_federighi']
['gensler.com']
['gensler']
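
The pattern above is greedy, so `.*` runs up to the last dot in the address. With a single-dot domain that makes no difference, but a made-up address with a subdomain shows the effect, and how a lazy `.*?` changes it:

```python
import re

email = 'user@mail.example.com'   # hypothetical address for illustration

# Greedy: .* grabs as much as it can, up to the LAST dot
greedy = re.findall(r'(?<=@)(.*)(?=\.)', email)
print(greedy)   # ['mail.example']

# Lazy: .*? grabs as little as it can, up to the FIRST dot
lazy = re.findall(r'(?<=@)(.*?)(?=\.)', email)
print(lazy)     # ['mail']
```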


In [298]:
file_path = r'C:\Users\Matt\Downloads\PythonLearningGroup\RegularExpressions.py'

''' Grab the File name '''
print(re.findall(r'([^\\]+$)', file_path))



''' Split the string on the \\ '''
print(re.findall(r'([^\\]+)', file_path))

''' Splitting and getting values from the file path with str.split '''
print(file_path.split("\\"))

['RegularExpressions.py']
['C:', 'Users', 'Matt', 'Downloads', 'PythonLearningGroup', 'RegularExpressions.py']
['C:', 'Users', 'Matt', 'Downloads', 'PythonLearningGroup', 'RegularExpressions.py']
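
In the spirit of restraint, file paths are a case where the standard library may already do what you need without any regex. A minimal sketch using `pathlib.PureWindowsPath`, which parses Windows-style paths on any operating system:

```python
from pathlib import PureWindowsPath

file_path = r'C:\Users\Matt\Downloads\PythonLearningGroup\RegularExpressions.py'

path = PureWindowsPath(file_path)

print(path.name)         # RegularExpressions.py
print(path.stem)         # RegularExpressions
print(path.suffix)       # .py
print(path.parent.name)  # PythonLearningGroup
```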


In [299]:
# Built-in str.split (plain Python, not Pandas)
a = file_path.split("\\")[-1]
print(f"str.split - {a}")
%timeit a

# Regex Split
b = re.split(r"\\", file_path)[-1]
print(f"Regex Split - {b}")
%timeit b

# Regex Findall
c = re.findall(r'([^\\]+)', file_path)[-1]
print(f"Regex Findall - {c}")
%timeit c

# Note: each %timeit above times only the lookup of an already-computed variable,
# so the numbers below do not compare the three approaches

str.split - RegularExpressions.py
24 ns ± 0.539 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
Regex Split - RegularExpressions.py
25.5 ns ± 1.59 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
Regex Findall - RegularExpressions.py
28.3 ns ± 2.54 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)


In [305]:
''' Searching for strings in a list '''
animals = ['Blue Cat', 'Red Dog', 'Green Cat', 'Yellow Cat', 'Purple Dog', 'Lonely Category', 'Orange cat']

# Define the pattern
pattern = re.compile(r'.*Cat\b', re.I)

# List Comprehension
print([cat for cat in animals if pattern.match(cat)])

# Alternatively, you can use filter, which returns an iterator
# Convert the iterator back to a list to see the matches,
# otherwise you will get a filter object returned
print(list(filter(pattern.match, animals)))

['Blue Cat', 'Green Cat', 'Yellow Cat', 'Orange cat']
['Blue Cat', 'Green Cat', 'Yellow Cat', 'Orange cat']


In [304]:
''' Without the word boundary and the re.I flag, the pattern .*Cat would
    also match Lonely Category and would miss Orange cat.
    Anchoring the pattern with $ is another way to keep Category out of the list '''

# Define the pattern
pattern = re.compile(r'.*Cat$', re.I) # or re.IGNORECASE

# List Comprehension
print([cat for cat in animals if pattern.match(cat)])

['Blue Cat', 'Green Cat', 'Yellow Cat', 'Orange cat']
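
The leading `.*` in the patterns above is only needed because `match` always starts at the beginning of the string. As a sketch of an alternative, `search` scans the whole string, so the pattern can be just the word itself:

```python
import re

animals = ['Blue Cat', 'Red Dog', 'Green Cat', 'Yellow Cat',
           'Purple Dog', 'Lonely Category', 'Orange cat']

# Whole word "cat", in any case, anywhere in the string
pattern = re.compile(r'\bcat\b', re.I)

# search looks through the whole string
found_with_search = [animal for animal in animals if pattern.search(animal)]
print(found_with_search)  # ['Blue Cat', 'Green Cat', 'Yellow Cat', 'Orange cat']

# match only succeeds if the string STARTS with the pattern, so nothing is found
found_with_match = [animal for animal in animals if pattern.match(animal)]
print(found_with_match)   # []
```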


In [306]:
''' Password Requirements 
    Alphanumeric string that may include _ and - having a length of 8 to 16 characters ''' 

# Define the function with input
def password_check(password): 
    
    # Define the pattern
    pattern = re.compile(r'^[a-z0-9_-]{8,16}$', re.I)
    
    # Look for a pattern match on the password
    match = pattern.match(str(password))
    
    # If a match was found, return acceptable text
    if match:
        return 'This is an acceptable password'
    
    # Otherwise return unacceptable text
    return 'This is not an acceptable password, please try again'

In [311]:
password_check('helloworld_')

'This is an acceptable password'
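
One subtlety with `^...$` is that `$` also matches just before a trailing newline. A possible variation, sketched below with a hypothetical helper named `is_valid_password`, uses `fullmatch` so the whole string has to fit the pattern:

```python
import re

# Same character set and length limits as above, without the anchors
PASSWORD_PATTERN = re.compile(r'[A-Za-z0-9_-]{8,16}')

def is_valid_password(password: str) -> bool:
    """Return True if the whole string is 8 to 16 allowed characters."""
    return PASSWORD_PATTERN.fullmatch(password) is not None

print(is_valid_password('helloworld_'))     # True
print(is_valid_password('short'))           # False
print(is_valid_password('hello world 1'))   # False, spaces are not allowed
print(is_valid_password('helloworld_\n'))   # False, while ^...$ with match() would accept it
```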

# Dataframe Examples

In [334]:
# Pandas is used to build the dataframe
import pandas as pd

# Create fake data
data = [{'Name': 'Steph Johnson',  'Email': 'steph@gensler.com',  'Address': '392 Edgefield Avenue Little Rock, AR 72209', 'Answer': 'I am 30 years old'}, 
        {'Name': 'James Peterson', 'Email': 'james@oracle.com',   'Address': '9736 Prairie St. Port Charlotte, FL 33952',  'Answer': 'Why do you need this?'},
        {'Name': 'Peter Black',    'Email': 'peter@google.com',   'Address': '82 W. Princess Street Algonquin, IL 60102',  'Answer': '42 and 3 months'},
        {'Name': 'Flora Arte',     'Email': 'flora@linkedin.com', 'Address': '166 Stonybrook Drive Westfield, MA 01085',   'Answer': "I'm 19"},
        {'Name': 'Olive Van Damne',    'Email': 'olive@amazon.com',   'Address': '212 Wakehurst Ave. Absecon, NJ 08205',       'Answer': '56'}, 
        {'Name': 'Steve',    'Email': 'steve@amazon.com',   'Address': '214 Wakehurst Ave. Absecon, NJ 08205',       'Answer': '58'}]

# Initialize Dataframe
df = pd.DataFrame(data)

# View Data
df

Unnamed: 0,Name,Email,Address,Answer
0,Steph Johnson,steph@gensler.com,"392 Edgefield Avenue Little Rock, AR 72209",I am 30 years old
1,James Peterson,james@oracle.com,"9736 Prairie St. Port Charlotte, FL 33952",Why do you need this?
2,Peter Black,peter@google.com,"82 W. Princess Street Algonquin, IL 60102",42 and 3 months
3,Flora Arte,flora@linkedin.com,"166 Stonybrook Drive Westfield, MA 01085",I'm 19
4,Olive Van Damne,olive@amazon.com,"212 Wakehurst Ave. Absecon, NJ 08205",56
5,Steve,steve@amazon.com,"214 Wakehurst Ave. Absecon, NJ 08205",58


In [339]:
''' Get First and Last Name '''

# Very easy with str.split, which accepts a Regex pattern
# n=1 splits on the first whitespace only, so multi-word last names stay together
df[['First Name', 'Last Name']] = df['Name'].str.split(r'\s', n=1, expand=True)

df

Unnamed: 0,Name,Email,Address,Answer,First Name,Last Name
0,Steph Johnson,steph@gensler.com,"392 Edgefield Avenue Little Rock, AR 72209",I am 30 years old,Steph,Johnson
1,James Peterson,james@oracle.com,"9736 Prairie St. Port Charlotte, FL 33952",Why do you need this?,James,Peterson
2,Peter Black,peter@google.com,"82 W. Princess Street Algonquin, IL 60102",42 and 3 months,Peter,Black
3,Flora Arte,flora@linkedin.com,"166 Stonybrook Drive Westfield, MA 01085",I'm 19,Flora,Arte
4,Olive Van Damne,olive@amazon.com,"212 Wakehurst Ave. Absecon, NJ 08205",56,Olive,Van Damne
5,Steve,steve@amazon.com,"214 Wakehurst Ave. Absecon, NJ 08205",58,Steve,
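
Another way to get the same columns, sketched here on a small stand-alone Series, is `str.extract` with named groups. The group names `first` and `last` are chosen for this example and become the column names.

```python
import pandas as pd

names = pd.Series(['Steph Johnson', 'Olive Van Damne', 'Steve'])

# first = everything up to the first whitespace
# last  = everything after it, optional so single names still match
parts = names.str.extract(r'^(?P<first>\S+)(?:\s+(?P<last>.+))?$')

print(parts)
```

Rows without a last name get a missing value in the `last` column rather than raising an error.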


In [325]:
''' Get the Company Name '''

# We already have this pattern from the code above, let's apply it to each row with str.extract
df['Company'] = df['Email'].str.extract(r'(?<=@)(.*)(?=\.)', expand=False).str.title()

df

Unnamed: 0,Name,Email,Address,Answer,Company
0,Steph Johnson,steph@gensler.com,"392 Edgefield Avenue Little Rock, AR 72209",I am 30 years old,Gensler
1,James Peterson,james@oracle.com,"9736 Prairie St. Port Charlotte, FL 33952",Why do you need this?,Oracle
2,Peter Black,peter@google.com,"82 W. Princess Street Algonquin, IL 60102",42 and 3 months,Google
3,Flora Arte,flora@linkedin.com,"166 Stonybrook Drive Westfield, MA 01085",I'm 19,Linkedin
4,Olive Van Damne,olive@amazon.com,"212 Wakehurst Ave. Absecon, NJ 08205",56,Amazon


In [326]:
''' Get the Age from Answer '''

# str.extract returns only the first match in each row,
# so we only get the first number in the string
df['Age'] = df['Answer'].str.extract(r'(\d+)')

df

Unnamed: 0,Name,Email,Address,Answer,Company,Age
0,Steph Johnson,steph@gensler.com,"392 Edgefield Avenue Little Rock, AR 72209",I am 30 years old,Gensler,30.0
1,James Peterson,james@oracle.com,"9736 Prairie St. Port Charlotte, FL 33952",Why do you need this?,Oracle,
2,Peter Black,peter@google.com,"82 W. Princess Street Algonquin, IL 60102",42 and 3 months,Google,42.0
3,Flora Arte,flora@linkedin.com,"166 Stonybrook Drive Westfield, MA 01085",I'm 19,Linkedin,19.0
4,Olive Van Damne,olive@amazon.com,"212 Wakehurst Ave. Absecon, NJ 08205",56,Amazon,56.0
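
`str.extract` gives you text, not numbers. As a sketch on a stand-alone Series, this is one way you might turn the result into a nullable integer column, and how `str.findall` differs by returning every number in each row:

```python
import pandas as pd

answers = pd.Series(['I am 30 years old', 'Why do you need this?',
                     '42 and 3 months', "I'm 19", '56'])

# First number in each row, still as text (missing where there is no number)
age_text = answers.str.extract(r'(\d+)', expand=False)

# Convert to numbers; Int64 is pandas' nullable integer type, so the missing row stays missing
age = pd.to_numeric(age_text).astype('Int64')
print(age)

# str.findall keeps every number in the row as a list
all_numbers = answers.str.findall(r'\d+')
print(all_numbers[2])   # ['42', '3']
```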


In [329]:
''' Parsing the Address 
    This is a beast of its own and there are entire companies dedicated to verifying addresses 
    We are not going to go into this, but below are some ways you could go about doing it 
    Notice it's almost impossible with Regex to parse the street and the city '''

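# Grab the digits at the end of the Address string
# Keep the zip code as a string: if it is converted to a number, leading zeros are lost (01085 becomes 1085)
df['Zip Code'] = df['Address'].str.extract(r'(\d+)$')

# Grab the text between the comma (and the space after it) and the zip code
df['State'] = df['Address'].str.extract(r'(?<=,\s)(.*)(?=\s\d)')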


In [330]:
df

Unnamed: 0,Name,Email,Address,Answer,Company,Age,Zip Code,State
0,Steph Johnson,steph@gensler.com,"392 Edgefield Avenue Little Rock, AR 72209",I am 30 years old,Gensler,30.0,72209,AR
1,James Peterson,james@oracle.com,"9736 Prairie St. Port Charlotte, FL 33952",Why do you need this?,Oracle,,33952,FL
2,Peter Black,peter@google.com,"82 W. Princess Street Algonquin, IL 60102",42 and 3 months,Google,42.0,60102,IL
3,Flora Arte,flora@linkedin.com,"166 Stonybrook Drive Westfield, MA 01085",I'm 19,Linkedin,19.0,1085,MA
4,Olive Van Damne,olive@amazon.com,"212 Wakehurst Ave. Absecon, NJ 08205",56,Amazon,56.0,8205,NJ
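
The state and zip code can also be pulled out in a single pass. The sketch below assumes every address ends with a comma, a two-letter state and a five-digit zip code, which holds for this sample data but not for addresses in general. The group names are chosen for this example.

```python
import pandas as pd

addresses = pd.Series(['392 Edgefield Avenue Little Rock, AR 72209',
                       '166 Stonybrook Drive Westfield, MA 01085'])

# One pattern, two named groups, two columns
parsed = addresses.str.extract(r',\s*(?P<state>[A-Z]{2})\s+(?P<zip_code>\d{5})$')

print(parsed)
```

Because the zip code is extracted as text, `01085` keeps its leading zero.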
