# Regular Expressions (Regex)
- Regular expressions allow you to match patterns in strings, rather than matching exact characters.  

### Python `re` Library
- The `re` library in Python allows you to use regular expressions. 
- Some methods of note are:
    - `.search()` (search for a particular pattern given a string)
    - `.findall()` (finds all substrings that match a given pattern)
    - `.sub()` (replaces all matched substrings with another given substring or group)
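
As a quick illustration of the three methods above, here is a minimal sketch on a made-up string. The sample text and the pattern are assumptions for illustration only and are not part of the dataset used later.

```python
import re

text = "Order 12 shipped, order 7 pending"   # hypothetical sample string
digits = r"\d+"                               # one or more digits

first = re.search(digits, text)               # first match only (a Match object or None)
print(first.group(0))                         # 12

print(re.findall(digits, text))               # ['12', '7']

print(re.sub(digits, "#", text))              # Order # shipped, order # pending
```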
    
### Regex Quantifiers
- `?`: exactly zero or one occurrences of the preceding element
- `*`: zero or more occurrences of the preceding element
- `+`: one or more occurrences of the preceding element
- `{n}`: preceding item is matched exactly `n` times
- `{,n}`: preceding item is matched between zero and `n` times (inclusive)
- `{n,}`: preceding item is matched `n` or more times
- `{m,n}`: preceding item is matched at least `m` times and at most `n` times (inclusive)
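
The quantifiers above can be compared side by side with a small sketch. The word list and the `matching` helper are hypothetical, chosen only so that each quantifier selects a different subset.

```python
import re

# Hypothetical test strings: "c", then zero to four "o"s, then "l"
words = ["cl", "col", "cool", "coool", "cooool"]

def matching(pattern):
    """Return the words that the pattern matches in full."""
    return [w for w in words if re.fullmatch(pattern, w)]

print(matching(r"co?l"))      # ['cl', 'col']
print(matching(r"co*l"))      # all five words
print(matching(r"co+l"))      # ['col', 'cool', 'coool', 'cooool']
print(matching(r"co{2}l"))    # ['cool']
print(matching(r"co{,2}l"))   # ['cl', 'col', 'cool']
print(matching(r"co{3,}l"))   # ['coool', 'cooool']
print(matching(r"co{1,3}l"))  # ['col', 'cool', 'coool']
```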
    
### Escaping Special Characters
- Like special characters in Python (e.g. `\n`), special characters in regex need to be escaped when you want to match them literally.
- For example, to match a literal opening bracket `(`, you have to type `\(`, because an unescaped `()` in regex defines a capture group.

In [1]:
import re
import pandas as pd

In [2]:
string = r'Name: Chris, ph: (03) 9923 1123, comments: this is not my real number'

# this is the regex pattern we want
# notice that we need to "escape" the brackets
pattern = r'\(03\) \d{4} \d{4}'

if re.search(pattern, string) :
    print("Phone number found")
    print(re.findall(pattern, string))
else :
    print("Not found")

Phone number found
['(03) 9923 1123']


In [3]:
# This example looks for phone numbers that match the format above
strings = [
    r'Name: Chris, ph: (03) 9923 1123, comments: this is not my real number',
    r'Name: John, ph: 03-9923-1123, comments: this might be an old number',
    r'Name: Sara, phone: (03)-9923-1123, comments: there is data quality issues, so far, three people sharig the same number',
    r'Name: Christopher, ph: (03)\-9923 -1123, comments, is this the same Chris in the first record?'
]

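# change this line
# note: the hyphen is escaped inside the class; an unescaped `[ -\\]` would be read as the range from space to backslash
pattern = r'\(?03\)?[ \-\\]+\d{4}[ \-\\]+\d{4}'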

for s in strings:
    if re.search(pattern, s) :
        print("Phone number found")
        print(re.findall(pattern, s))
    else :
        print("Not found")

Phone number found
['(03) 9923 1123']
Phone number found
['03-9923-1123']
Phone number found
['(03)-9923-1123']
Phone number found
['(03)\\-9923 -1123']


In [4]:
data = pd.read_csv('regex.csv')
data.head()

Unnamed: 0,review,address
0,Avengers: Endgame is dumb. Very dumb. It's a m...,39 Aaron Place Norwood VIC 5091
1,What an unbelievable accomplishment to have sh...,461 Achernar Close Bondi Junction VIC 5125
2,"Disclosure: I'm NOT a Marvel superfan, but I'v...",24 Adair Street Bundaberg VIC 2127
3,"""Avengers: Endgame"" is about memories, nostalg...",51 Academy Close Floreat VIC 2680
4,It feels and watches like a seasin finale of a...,12 Abercorn Crescent Joondalup NSW 3055


Our objective for this example is to split the `address` field into `street_num`, `street_address`, `suburb`, `state`, `postcode` so that we can extract the **state**.

Assuming that the street types fall under the subset:
- Place
- Close
- Street or St (abbreviation)
- Crescent
- Circuit

Now, here's a pattern that should work for the majority of cases:
- `r'\d+ .+ (?:Place|Close|Street|Crescent|Circuit) .+ [A-Z]{2,3} \d{4}'` 

Let's break it down:
1. `\d+ .+` matches one or more digits (street number), a single space, then one or more characters of any kind (street name).
2. `(?:Place|Close|Street|Crescent|Circuit)` matches exactly one of the street types. The `?:` at the start of the group means that we **do not** want this as a capturing group.
3. `.+` matches one or more characters of any kind (suburb).
4. `[A-Z]{2,3}` matches the 2 or 3 character state code.
5. `\d{4}` matches a 4 digit postcode.
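
The breakdown above can also be written directly into the pattern using `re.VERBOSE`, which ignores whitespace and allows comments inside the regex. This is a sketch of the same pattern, not a replacement for it. Because plain spaces are ignored in verbose mode, each literal space is written as `[ ]`; the name `verbose_pattern` is hypothetical.

```python
import re

verbose_pattern = re.compile(r"""
    \d+ [ ] .+ [ ]                             # street number, space, street name, space
    (?:Place|Close|Street|Crescent|Circuit)    # street type (non-capturing)
    [ ] .+ [ ]                                 # space, suburb, space
    [A-Z]{2,3} [ ]                             # 2 or 3 letter state code, space
    \d{4}                                      # 4 digit postcode
""", re.VERBOSE)

print(verbose_pattern.findall('39 Aaron Place Norwood VIC 5091'))
# ['39 Aaron Place Norwood VIC 5091']
print(verbose_pattern.findall('30 Abercrombie Circuit Hazelbrook  5571'))
# [] because the state code is missing
```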

In [6]:
standard_pattern = r'\d+ .+ (?:Place|Close|Street|Crescent|Circuit) .+ [A-Z]{2,3} \d{4}'
re.findall(standard_pattern, '39 Aaron Place Norwood VIC 5091') # This should return a match

['39 Aaron Place Norwood VIC 5091']

Combining regex with the pandas `Series.apply()` method is very powerful. Here's an example to find addresses that **DO NOT** match the pattern.

In [7]:
# return the address if the pattern finds no match; otherwise return None
no_match = data['address'].apply(lambda x: x if not re.findall(standard_pattern, x) else None)

# return all non-None values
no_match.loc[no_match.notnull()].values

array(['30 Abercrombie Circuit Hazelbrook  5571',
       '12 Adamson Crescent Werribee South  6151',
       '41 Abercrombie Circuit Old Reynella  2250',
       "9 A'beckett Street St Albans  6169",
       '29 Abercorn Crescent Cookernup QLD 850',
       ' Adair Street Wandin North NSW 3954',
       '75 Adamson Crescent Casino  2756',
       '38 Abbott Street Lakemba  6062',
       ' Acraman Place Blacktown WA 2121',
       '18 Acraman Place Cedar Vale  6017',
       '43 Acacia Garden Beeliar NSW 4030',
       '2 Abernethy Street Baxter  2640',
       '104 Abbott Street Currans Hill  2112',
       ' Adinda Street Concord West NSW 7315',
       '176 Adamson Crescent Granville QLD ',
       ' Adamson Crescent Ashgrove  2031',
       '15 Adinda Street Coomera  3161',
       ' Abrahams Crescent Seven Hills WA 3175',
       '20 Abercrombie Court Joondalup VIC 2030',
       ' Adcock Place Charlestown NSW 2454',
       '73 Abercorn Crescent Warrandyte  3942',
       '12 Adinda St Inglewood QLD

<blockquote style="padding: 10px; background-color: #ebf5fb;">

## Class Discussion Question
How might we split between `street_address` and `suburb`? </blockquote>

<blockquote style="padding: 10px; background-color: #FFD392;">

## Individual Exercise
- Modify the `standard_pattern` to support capture groups. 
- Then, split the matched addresses into the required fields `street_num`, `street_address`, `suburb`, `state`, `postcode`.

For example, `'39 Aaron Place Norwood VIC 5091'` should return something like this:
```python
['', '39', 'Aaron Place', 'Norwood', 'VIC', '5091', '']
```
</blockquote>

In [8]:
# this is currently the standard pattern, you will need to change it to capture the required fields 
standard_pattern = r'\d+ .+ (?:Place|Close|Street|Crescent|Circuit) .+ [A-Z]{2,3} \d{4}'

### SOLUTION
# the street-type alternation stays non-capturing so it is not returned as a separate field
pattern_with_capture_groups = r'(\d+) (.+ (?:Place|Close|Street|Crescent|Circuit)) (.+) ([A-Z]{2,3}) (\d{4})'

In [9]:
print(re.split(standard_pattern, '39 Aaron Place Norwood VIC 5091'))
print(re.split(pattern_with_capture_groups, '39 Aaron Place Norwood VIC 5091'))

['', '']
['', '39', 'Aaron Place', 'Norwood', 'VIC', '5091', '']


Now let's discuss a test case which doesn't work.

In [10]:
re.split(pattern_with_capture_groups, '66 Abercrombie Circuit St Kilda  3211')

['66 Abercrombie Circuit St Kilda  3211']

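Notice how we get the original string back, unsplit? The pattern found no match because this address has no state code, so `re.split` had nothing to split on.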
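
Because `re.split` returns the input unchanged when nothing matches, a failed match is easy to miss. One alternative, sketched here, is `re.search` with named groups, so that a failed match shows up explicitly as `None`. The group names and the `parse_address` helper are hypothetical, chosen to mirror the field names in the exercise.

```python
import re

# Same structure as the capture-group solution, with named groups (names are illustrative)
named_pattern = (
    r'(?P<street_num>\d+) '
    r'(?P<street_address>.+ (?:Place|Close|Street|Crescent|Circuit)) '
    r'(?P<suburb>.+) '
    r'(?P<state>[A-Z]{2,3}) '
    r'(?P<postcode>\d{4})'
)

def parse_address(address):
    """Return a dict of address fields, or None if the pattern does not match."""
    m = re.search(named_pattern, address)
    return m.groupdict() if m else None

print(parse_address('39 Aaron Place Norwood VIC 5091'))
print(parse_address('66 Abercrombie Circuit St Kilda  3211'))  # None: state code is missing
```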

### Important: `re.search()` vs `re.match()`
So what's the difference? 
- `re.match()` only returns a match if it _occurs at the start of the string_. In that sense it is similar to the `str.startswith()` method.
- `re.search()` on the other hand _searches_ the whole string and returns the first match it finds, wherever it occurs.

See below for the use cases.

In [11]:
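# group the alternatives; without the group, `|` would split the whole pattern into separate options
street_pattern = r'\d+ .+ (?:Place|Close|Street|Crescent|Circuit)'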

for method in (re.match, re.search):
    print(method(street_pattern, '39 Aaron Place Norwood VIC 5091'))

<re.Match object; span=(0, 14), match='39 Aaron Place'>
<re.Match object; span=(0, 14), match='39 Aaron Place'>


These both work since there is a match at the start. Now, let's try to match the state, which occurs near the end of the string.

In [12]:
state_pattern= r'[A-Z]{2,3}'

for method in (re.match, re.search):
    print(method(state_pattern, '39 Aaron Place Norwood VIC 5091'))

None
<re.Match object; span=(23, 26), match='VIC'>


Notice how the `match` method returns `None`? The state code does not occur at the start of the string, so only `search` finds it.
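
One way to see the relationship between the two: anchoring a pattern with `^` makes `re.search` behave like `re.match` on a single-line string. A small sketch using the same address:

```python
import re

address = '39 Aaron Place Norwood VIC 5091'

# An explicit start-of-string anchor makes search behave like match here
print(re.search(r'^[A-Z]{2,3}', address))   # None, same as re.match
print(re.search(r'^\d+', address))          # matches '39' at the start
print(re.match(r'\d+', address))            # also matches '39'
```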

<blockquote style="padding: 10px; background-color: #FFD392;">

## Individual Exercise
Using Regex or an alternative method (`Series.str.extract`), extract the `state` from the `address` field into a new column.

Documentation for `Series.str.extract`: https://pandas.pydata.org/docs/reference/api/pandas.Series.str.extract.html#pandas.Series.str.extract </blockquote>

In [13]:
### SOLUTION 1 USING REGEX
def find_state(s):
    """
    A function which finds the state code if possible
    """
    state_pattern= r'[A-Z]{2,3}'
    
    state = re.search(state_pattern, s)
    
    return state.group(0) if state else None

data['state_regex'] = data['address'].apply(find_state)
data['state_regex']

0      VIC
1      VIC
2      VIC
3      VIC
4      NSW
      ... 
333    VIC
334     WA
335    QLD
336    QLD
337     WA
Name: state_regex, Length: 338, dtype: object

In [14]:
### SOLUTION 2 USING str.extract
state_pattern_g = r'([A-Z]{2,3})'
data['state_extract'] = data['address'].str.extract(state_pattern_g)
data['state_extract']

0      VIC
1      VIC
2      VIC
3      VIC
4      NSW
      ... 
333    VIC
334     WA
335    QLD
336    QLD
337     WA
Name: state_extract, Length: 338, dtype: object
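
Note that `[A-Z]{2,3}` picks up the first run of two or three capital letters anywhere in the address, which may not be the state. A stricter variant, sketched below, assumes the state always sits directly before a four-digit postcode at the end of the string and ties the match to that position. The address with an upper-case street name is made up for illustration.

```python
import re

loose = r'[A-Z]{2,3}'
strict = r'\b([A-Z]{2,3}) \d{4}$'           # state code directly before the final postcode

tricky = '7 ANZAC Place Norwood VIC 5091'   # hypothetical address

print(re.search(loose, tricky).group(0))    # ANZ (not a state)
print(re.search(strict, tricky).group(1))   # VIC
```

Since the stricter pattern contains exactly one capture group, it should also be usable with `Series.str.extract` in the same way as `state_pattern_g`.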

# <u>Concept: $n$-grams and Fuzzy Matching</u>

- To test the second hypothesis, we need to find which reviews refer to `"Captain Marvel"`. 
- However, they might refer to this character as `"Cap Marvel"`, `"Capt. Marvel"`, or there might even be some spelling errors! 

To overcome this, we can use what we call `Fuzzy Matching`. 

In [15]:
sample_review = "I still don't see Capt. Mervel as the leader of the New Avengers, it is still an undeveloped character"

The first step to do so is to split the paragraphs into bi-grams so we can match against `"Captain Marvel"` (which is a bi-gram itself).

FYI, your device may need to download the `punkt` tokenizer data for `nltk`.

In [16]:
from nltk import ngrams

In [17]:
list(ngrams(sample_review.split(), 2))

[('I', 'still'),
 ('still', "don't"),
 ("don't", 'see'),
 ('see', 'Capt.'),
 ('Capt.', 'Mervel'),
 ('Mervel', 'as'),
 ('as', 'the'),
 ('the', 'leader'),
 ('leader', 'of'),
 ('of', 'the'),
 ('the', 'New'),
 ('New', 'Avengers,'),
 ('Avengers,', 'it'),
 ('it', 'is'),
 ('is', 'still'),
 ('still', 'an'),
 ('an', 'undeveloped'),
 ('undeveloped', 'character')]
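
To make the idea concrete, here is a rough pure-Python equivalent of what `ngrams` produces for unpadded input. The helper name `simple_ngrams` is hypothetical, and this is only a sketch, not how `nltk` implements it.

```python
def simple_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples (no padding)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "see Capt. Mervel as the leader".split()

print(simple_ngrams(tokens, 2))
# [('see', 'Capt.'), ('Capt.', 'Mervel'), ('Mervel', 'as'), ('as', 'the'), ('the', 'leader')]
print(simple_ngrams(tokens, 3)[0])
# ('see', 'Capt.', 'Mervel')
```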

<blockquote style="padding: 10px; background-color: #ebf5fb;">

## Class Discussion Question
What does it mean to "pad"? Discuss the `pad_left` and `pad_right` arguments. </blockquote>

In [18]:
def bigram(text):
    """
    Computes bigrams (2-grams) from whitespace-separated text and joins each bigram into a single string
    """
    return [f"{word1} {word2}" for word1, word2 in ngrams(text.split(), 2)]

bigram(sample_review)

['I still',
 "still don't",
 "don't see",
 'see Capt.',
 'Capt. Mervel',
 'Mervel as',
 'as the',
 'the leader',
 'leader of',
 'of the',
 'the New',
 'New Avengers,',
 'Avengers, it',
 'it is',
 'is still',
 'still an',
 'an undeveloped',
 'undeveloped character']

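The second step is to calculate the Levenshtein edit distance between two strings, i.e. the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other.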

In [19]:
from nltk import edit_distance

In [20]:
edit_distance('Captain Marvel', 'Capt. Mervel')

4
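
For intuition about where the `4` comes from, the sketch below computes the same distance with the standard dynamic-programming recurrence. The function name `levenshtein` is hypothetical; in practice `nltk.edit_distance` does this for you.

```python
def levenshtein(s1, s2):
    """Minimum number of single-character insertions, deletions and substitutions."""
    # previous[j] is the distance between the first i-1 chars of s1 and the first j chars of s2
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]  # turning s1[:i] into an empty string takes i deletions
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,               # delete c1
                current[j - 1] + 1,            # insert c2
                previous[j - 1] + (c1 != c2),  # substitute (free if the characters match)
            ))
        previous = current
    return previous[-1]

print(levenshtein('Captain Marvel', 'Capt. Mervel'))  # 4
print(levenshtein('kitten', 'sitting'))               # 3
```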

<blockquote style="padding: 10px; background-color: #ebf5fb;">

## Class Exercise
Write a function which calculates the normalised edit **similarity score** between 0 and 1. </blockquote>

In [21]:
def normalised_edit_sim(s1, s2):
    """
    A function which computes the normalised edit similarity (1 minus the Levenshtein edit distance divided by the longer string's length)
    """
    return 1 - edit_distance(s1, s2) / max(len(s1), len(s2))

In [22]:
# TEST CASES
print(normalised_edit_sim('Captain Marvel', 'Capt. Mervel'))
print(normalised_edit_sim('Captain Marvel', 'Captain America'))
print(normalised_edit_sim('Captain Marvel', 'Cap. Marvel'))
print(normalised_edit_sim('Captain Marvel', 'cap marvel'))
print(normalised_edit_sim('Captain Marvel', 'Marvel'))

0.7142857142857143
0.6
0.7142857142857143
0.5714285714285714
0.4285714285714286


<blockquote style="padding: 10px; background-color: #ebf5fb;">
    
## Class Discussion Question
1. What would be a good _"threshold"_ to determine fuzzy matches for this case?
2. Notice the difference between matches 3 and 4. What can we do to improve the fuzzy matching? </blockquote>

<blockquote style="padding: 10px; background-color: #FFD392;">

## Individual Exercise
Write a function that takes in a review and outputs the lowercased fuzzy-matched term if its similarity passes a certain threshold. By default, the threshold is set to `0.7`.

If there is no match, return `None`. </blockquote>

In [23]:
### SOLUTION
def fuzzy_match(review, threshold=0.7, title='Captain Marvel'):
    """
    A function which applies fuzzy matching to find a certain movie title.
    """
    # convert to lowercase
    review = review.casefold()
    title = title.casefold()
    
    # we use the same bigram function that we defined above
    bigrams = bigram(review)
    
    # find matches at or above the threshold using the normalised_edit_sim function that we defined above
    matches = [word for word in bigrams if normalised_edit_sim(word, title) >= threshold]
    
    return matches[0] if matches else None

In [24]:
fuzzy_match(sample_review, 0.65)

'capt. mervel'

We can then apply this `fuzzy_match` function over all the reviews and find all unique variants.

In [25]:
results = data['review'].apply(fuzzy_match)
results.loc[results.notnull()].unique()

array(['capt. marvel', '"captain marvel"', 'captain marvel',
       'captian marvel', 'captain marvels', 'captain marvel,',
       'capt marvel', 'captain marvel/brie', 'larsoncaptain marvel',
       'captain marvel.', 'cap marvel', 'casual marvel', 'cap. marvel',
       '10.captain marvel', 'cpt marvel'], dtype=object)

# <u> Challenge questions </u>
For the non-matching addresses in **Individual Exercise 3**, use the `re.sub` method to replace all `"St"` with `"Street"`.

However, **do not** replace `St Kilda` and `St Albans`!!!

Hint: Use negative lookahead (https://www.regular-expressions.info/lookaround.html)
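
As a warm-up, a negative lookahead `(?!...)` asserts that what follows does *not* match, without consuming any characters. The strings below are made up for illustration and are unrelated to the address data.

```python
import re

# Match the whole word "Dr" only when it is NOT followed by " Who"
pattern = r'\bDr\b(?! Who)'

print(re.sub(pattern, 'Doctor', 'Dr Smith met Dr Who'))   # Doctor Smith met Dr Who
```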

In [26]:
### ANSWER HERE
pattern_to_find_st = r''

### SOLUTION
# Find `St` as a whole word (so `Street` and names such as `Stanley` are left alone) that is not followed by ` Kilda` or ` Albans`
pattern_to_find_st = r'\bSt\b(?! Kilda| Albans)'

In [27]:
# TEST CASES
print(re.sub(pattern_to_find_st, 'Street', '66 Abercrombie Circuit St Kilda  3211'))
print(re.sub(pattern_to_find_st, 'Street', "9 A'beckett Street St Albans  6169"))
print(re.sub(pattern_to_find_st, 'Street', '93 Adair St Auburn VIC 6084'))
print(re.sub(pattern_to_find_st, 'Street', "5 A'beckett St St Kilda East VIC 6018"))

66 Abercrombie Circuit St Kilda  3211
9 A'beckett Street St Albans  6169
93 Adair Street Auburn VIC 6084
5 A'beckett Street St Kilda East VIC 6018
