# Regular Expressions (AKA *regex*)

* Covered in Module 8.3.9
* Module 8.3.9 has a good cheatsheet in the lower half of the page
* Another cheatsheet: https://cheatography.com/davechild/cheat-sheets/regular-expressions/
* See https://regexr.com/
* See https://regex101.com/ (I think this one is the best)

Please take to heart the quote from Module 8.3.9:
> It starts getting wild quickly

## Regular expressions are just strings of characters that are used as a search pattern.

---
## Functions
* **findall()** - Returns a list containing all matches
* **search()**  - Returns a Match object if there is a match anywhere in the string
* **split()**   - Returns a list where the string has been split at each match
* **sub()**     - Replaces one or many matches with a string
---
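As a quick orientation, the four functions can be tried side by side on one string. This is a minimal sketch; the sample text and variable names are made up for illustration.

```python
import re

text = "a1 b22 c333"  # hypothetical sample string

# findall: every run of digits, as a list of strings
found = re.findall(r"\d+", text)

# search: only the first match, wrapped in a Match object
first = re.search(r"\d+", text)

# split: cut the string at each whitespace character
parts = re.split(r"\s", text)

# sub: replace every run of digits with "#"
masked = re.sub(r"\d+", "#", text)

print(found, first.group(), parts, masked)
```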

In [13]:
# You will need to import:
import re

---
## findall()

In [22]:
# Simplest example I can think of
# How many times does a substring appear in text?
sentence = "The fat cat that sat on a mat"
list_of_matches = re.findall(r"cat", sentence)
list_of_matches

['cat']

In [25]:
# If there are no matches?
list_of_matches = re.findall(r"xxx", sentence)
list_of_matches
# Empty list!

[]

In [30]:
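# Find the standalone word "at" (\b marks a word boundary)
# Python patterns do not take the JavaScript-style /.../g wrapper
# No word in this sentence is exactly "at", so the result is empty
re.findall(r'\bat\b', sentence)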

[]
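Python patterns are plain strings, so the `/.../g` delimiters and flags of JavaScript-style regex literals are treated as literal characters. A small sketch, assuming the goal is to match whole words; the `sample` string is made up for illustration:

```python
import re

sample = "The cat sat at the mat"  # hypothetical sample string

# The slashes and the trailing "g" are literal characters here,
# and the sample contains no "/", so nothing matches.
js_style = re.findall(r"/\bat\b/g", sample)

# \b marks a word boundary, so only the standalone word "at" matches.
whole_word = re.findall(r"\bat\b", sample)

# findall already returns every match; flags such as re.IGNORECASE
# are passed as a separate argument instead of a suffix.
any_case = re.findall(r"\bthe\b", sample, flags=re.IGNORECASE)

print(js_style, whole_word, any_case)
```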

In [31]:
# Find all words that contain a string
re.findall(r'(\w*at\w*)', sentence)

['fat', 'cat', 'that', 'sat', 'mat']

---
## split()

In [35]:
# Split the string based on a pattern.
# In this case the pattern is a whitespace character
# returns a list
sentence = "The fat cat that sat on a mat"
re.split(r"\s", sentence)

['The', 'fat', 'cat', 'that', 'sat', 'on', 'a', 'mat']

In [36]:
# If the pattern is not found, the string is not split
re.split(r"x", sentence)

['The fat cat that sat on a mat']

In [37]:
# Split only at the first 2 occurrences
# This is the "maxsplit" parameter (passed by keyword; the positional form is deprecated in newer Python versions)
re.split(r"\s", sentence, maxsplit=2)

['The', 'fat', 'cat that sat on a mat']

In [50]:
# Split on any word containing "at" that is followed by whitespace
# Because the word is in a capture group, it is kept in the result list
sentence = "The yellow fatty cat that sat on a mat"
split_list = re.split(r"(\w*at\w*)\s", sentence)
split_list

['The yellow ', 'fatty', '', 'cat', '', 'that', '', 'sat', 'on a mat']

In [51]:
# Notice the ''. There is probably a way to change the regular expression to handle this.
# But let's use a list comprehension to remove them
without_empty = [w for w in split_list if w != '']
without_empty

['The yellow ', 'fatty', 'cat', 'that', 'sat', 'on a mat']

In [54]:
# Or we can make it a one-liner
without_empty = [w for w in re.split(r"(\w*at\w*)\s", "The yellow fatty cat that sat on a mat") if w != '']
without_empty

['The yellow ', 'fatty', 'cat', 'that', 'sat', 'on a mat']
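The empty strings appear because consecutive matches have nothing between them, while the capture group adds each matched word to the result. A short sketch of what happens without the group, plus `filter(None, ...)` as another way to drop the empties; this is one possible approach rather than the only fix:

```python
import re

sentence = "The yellow fatty cat that sat on a mat"

# Without the capture group, only the text between matches is returned,
# so the gaps between adjacent matches show up as empty strings.
no_group = re.split(r"\w*at\w*\s", sentence)

# With the capture group, the matched words are interleaved with those gaps.
split_list = re.split(r"(\w*at\w*)\s", sentence)

# filter(None, ...) drops every falsy item, which here means the empty strings.
without_empty = list(filter(None, split_list))

print(no_group)
print(without_empty)
```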

---
## sub()

In [55]:
# Substitute a comma for any whitespaces
sentence = "The yellow fatty cat that sat on a mat"
re.sub(r"\s", ",", sentence)

'The,yellow,fatty,cat,that,sat,on,a,mat'

In [57]:
# Substitute a comma for the FIRST 3 whitespaces
sentence = "The yellow fatty cat that sat on a mat"
re.sub(r"\s", ",", sentence, count=3)

'The,yellow,fatty,cat that sat on a mat'

In [62]:
# Substitute for all whitespace characters
re.sub(r"\s", " SPACE ", sentence)

'The SPACE yellow SPACE fatty SPACE cat SPACE that SPACE sat SPACE on SPACE a SPACE mat'

In [69]:
# Substitute for all words
re.sub(r"\w+", "WORD", sentence)

'WORD WORD WORD WORD WORD WORD WORD WORD WORD'

In [70]:
# Substitute for all occurrences of "cat"
re.sub(r"(cat)", "WORD", sentence)

'The yellow fatty WORD that sat on a mat'

In [72]:
# Any word with "at"
re.sub(r"(\w*at\w*)", "AT-WORD", sentence)

'The yellow AT-WORD AT-WORD AT-WORD AT-WORD on a AT-WORD'
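The replacement passed to `re.sub()` does not have to be a fixed string; it can also be a function that receives each Match object and returns the replacement text. A minimal sketch, assuming we want to upper-case every word containing "at" instead of replacing it (the helper name `shout` is made up for illustration):

```python
import re

sentence = "The yellow fatty cat that sat on a mat"

def shout(match):
    # match.group() is the text that matched; return its upper-case form
    return match.group().upper()

shouted = re.sub(r"\w*at\w*", shout, sentence)
print(shouted)
```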

---
## search()
* Returns a Match object if there is a match anywhere in the string
* A Match Object is an object containing information about the search and the result.
* You don't get back a list, but rather, a match object.

### The Match object has two commonly used methods and one attribute

* .span() returns a tuple containing the start and end positions of the match. Note the end position is NOT INCLUSIVE. For example, a single character at index 3 would be (3, 4).
* .string returns the string passed into the function. Note there is no (), this is a property
* .group() returns the part of the string where there was a match
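One detail worth checking in code: when nothing matches, `re.search()` returns `None` rather than a Match object, so calling `.group()` directly on the result raises an `AttributeError`. A minimal sketch of guarding against that, using the same kind of sentence as the cells in this section:

```python
import re

sentence = "The fat cat that sat On a mat"

match = re.search(r"\bO\w+", sentence)
if match is not None:
    found = match.group()            # the matched text
    start, end = match.span()        # end index is not inclusive
    same_text = sentence[start:end]  # slicing with the span gives the match back
    print(found, start, end, same_text)

# No match: the result is None, not an empty Match object
missing = re.search(r"xxx", sentence)
print(missing)
```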


In [93]:
# Search for a space. 
sentence = "The fat cat that sat On a mat"
re.search(r"\s", sentence)
# (you get back a match object)

<re.Match object; span=(3, 4), match=' '>

In [94]:
# Search for a word starting with a capital "O"
re.search(r"\bO\w+",sentence)

<re.Match object; span=(21, 23), match='On'>

In [95]:
# Get the part where it matched
re.search(r"\bO\w+",sentence).group()

'On'

In [97]:
re.search(r"\bO\w+",sentence).span()

(21, 23)

In [98]:
re.search(r"\bO\w+",sentence).string

'The fat cat that sat On a mat'

---
## Phone Numbers

In [115]:
# Get the phone numbers from a string
phone_number = "The first phone number is 303-333-4444 the other is 916-333-5555, the third is 408-333-66789"
# Use a raw string (r"...") so that \d reaches the regex engine unchanged
re.findall(r"(\d{3})-(\d{3})-(\d{4})", phone_number)
# See the last phone number: it matched incorrectly, because it has too many digits

[('303', '333', '4444'), ('916', '333', '5555'), ('408', '333', '6678')]

In [116]:
# Append (?!\d) to the expression. See "Negative lookahead groups" in Module 8.3.9
phone_number = "The first phone number is 303-333-4444 the other is 916-333-5555, the third is 408-333-66789"
re.findall(r"(\d{3})-(\d{3})-(\d{4})(?!\d)", phone_number)

[('303', '333', '4444'), ('916', '333', '5555')]

In [111]:
# Non-capturing groups (?:...) leave the area code as the only captured group
area_codes = re.findall(r"(\d{3})-(?:\d{3})-(?:\d{4})(?!\d)", phone_number)
area_codes

['303', '916']
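The shape of what `findall()` returns depends on how many capture groups the pattern has, which is why the results in this section switch between strings and tuples. A small sketch comparing the three cases; the `text` string is made up for illustration:

```python
import re

text = "Call 303-333-4444 or 916-333-5555"  # hypothetical sample string

# No capture groups: a list of the full matches
no_groups = re.findall(r"\d{3}-\d{3}-\d{4}", text)

# One capture group: a list of just that group's text
one_group = re.findall(r"(\d{3})-\d{3}-\d{4}", text)

# Several capture groups: a list of tuples, one item per group
many_groups = re.findall(r"(\d{3})-(\d{3})-(\d{4})", text)

print(no_groups)
print(one_group)
print(many_groups)
```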

In [118]:
# Create a list of the full phone numbers as strings
# List comprehension!
['-'.join(ph) for ph in re.findall(r"(\d{3})-(\d{3})-(\d{4})(?!\d)", phone_number)]

['303-333-4444', '916-333-5555']
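An alternative sketch that avoids rejoining the pieces: `re.finditer()` yields Match objects, and `.group(0)` is the whole match while `.group(1)` is the first capture group. The leading `(?<!\d)` lookbehind and the name `phone_pattern` are additions for illustration; the lookbehind is assumed here to reject numbers that have extra digits in front, mirroring the `(?!\d)` check at the end.

```python
import re

phone_number = "The first phone number is 303-333-4444 the other is 916-333-5555, the third is 408-333-66789"

# Compile once so the same pattern can be reused
phone_pattern = re.compile(r"(?<!\d)(\d{3})-(\d{3})-(\d{4})(?!\d)")

# group(0) is the entire matched text, so no join is needed
full_numbers = [m.group(0) for m in phone_pattern.finditer(phone_number)]

# group(1) is the first capture group: the area code
area_codes = [m.group(1) for m in phone_pattern.finditer(phone_number)]

print(full_numbers)
print(area_codes)
```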