<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Regular-Expressions" data-toc-modified-id="Regular-Expressions-1">Regular Expressions</a></span><ul class="toc-item"><li><span><a href="#Metacharacters" data-toc-modified-id="Metacharacters-1.1">Metacharacters</a></span></li><li><span><a href="#Extracting-numerical-values-from-strings" data-toc-modified-id="Extracting-numerical-values-from-strings-1.2">Extracting numerical values from strings</a></span><ul class="toc-item"><li><span><a href="#Using-regular-expression-to-tidy-up-fields-in-the-data" data-toc-modified-id="Using-regular-expression-to-tidy-up-fields-in-the-data-1.2.1">Using regular expression to tidy-up fields in the data</a></span><ul class="toc-item"><li><span><a href="#Use-regular-expressions-to-clean-dates" data-toc-modified-id="Use-regular-expressions-to-clean-dates-1.2.1.1">Use regular expressions to clean dates</a></span></li><li><span><a href="#Use-Regular-Expressions-to-clean-country-spellings" data-toc-modified-id="Use-Regular-Expressions-to-clean-country-spellings-1.2.1.2">Use Regular Expressions to clean country spellings</a></span></li></ul></li></ul></li></ul></li></ul></div>

# Regular Expressions

## Metacharacters

| Character | Description                                             | Example        | Meaning                                                                      |
|-----------|---------------------------------------------------------|----------------|------------------------------------------------------------------------------|
| []        | A set of characters                                     | "[a-m]"        | Lowercase characters from a to m                                             |
| \         | Signals a special sequence or escapes a special character | "\d"         | Digit characters                                                             |
| .         | Any character                                           | "he..o"        | Sequence that starts with "he", followed by two (any) characters, and an "o" |
| ^         | Starts with                                             | "^hello"       | String starts with "hello"                                                   |
| $         | Ends with                                               | "world$"       | String ends with "world"                                                     |
| *         | Zero or more occurrences                                | "aix*"         | String contains "ai" followed by 0 or more "x" characters                    |
| +         | One or more occurrences                                 | "aix+"         | String contains "ai" followed by 1 or more "x" characters                    |
| {}        | Exactly the specified number of occurrences             | "al{2}"        | String contains "a" followed by exactly two "l" characters                   |
| \|        | Either or                                               | "falls\|stays" | String contains either "falls" or "stays"                                    |
| ()        | Grouping                                                | "(ab)+"        | String contains one or more repetitions of "ab"                              |

For more information, see the [W3Schools Python RegEx reference](https://www.w3schools.com/python/python_regex.asp).
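
As a rough illustration of the table, the sketch below pairs each kind of pattern with one string that should match and one that should not. The sample strings and the `examples` list are made up for this sketch; `re.search` is used so that a pattern can match anywhere in the string.

```python
import re

# (pattern, string expected to match, string expected not to match)
# All sample strings are hypothetical and chosen only for illustration.
examples = [
    (r"[a-m]", "cat", "zoo"),                   # set of characters
    (r"\d", "room 7", "room seven"),            # special sequence: a digit
    (r"he..o", "hello", "hero"),                # . is any single character
    (r"^hello", "hello world", "oh hello"),     # anchored at the start
    (r"world$", "hello world", "world peace"),  # anchored at the end
    (r"aix*", "ai", "a"),                       # zero or more "x"
    (r"aix+", "aixx", "ai"),                    # one or more "x"
    (r"al{2}", "all", "al"),                    # exactly two "l"
    (r"falls|stays", "it stays", "it rains"),   # either alternative
    (r"(ab)+c", "ababc", "ac"),                 # repeated group
]

results = []
for pattern, good, bad in examples:
    hit = bool(re.search(pattern, good))
    miss = bool(re.search(pattern, bad))
    results.append((hit, miss))
    print(f"{pattern!r:15} {good!r}: {hit}   {bad!r}: {miss}")
```

Every row should print `True` for the first string and `False` for the second.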

First, let's introduce *regular expressions*, which define patterns for matching strings.

For this, we need to `import re`, Python's built-in regular expression module. We start with two of its functions:

* **`re.compile()`**: compiles a pattern into a reusable pattern object
* **`re.match()`**: checks whether the pattern matches at the beginning of a string

**Exercise-1:** Define a regular expression to match US phone numbers that fit the pattern `xxx-xxx-xxxx`.

In [3]:
# notebook imports

import re

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [14]:
# Import the regular expression module
import re

# Compile the pattern: prog
# (a raw string keeps the backslashes intact for the regex engine)
prog = re.compile(r'\d{3}-\d{3}-\d{4}')

# A well-formed number matches
result = prog.match('123-456-7890')
print(bool(result))

# An extra leading digit breaks the match
result2 = prog.match('1123-456-7890')
print(bool(result2))


True
False
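
One caveat worth sketching: `match()` only anchors the pattern at the start of the string, so extra characters after a valid number still pass. The snippet below compares it with `fullmatch()`, which requires the whole string to fit. The test string with a trailing digit is a made-up example.

```python
import re

prog = re.compile(r'\d{3}-\d{3}-\d{4}')

# match() accepts the string because its first 12 characters fit the pattern
loose = bool(prog.match('123-456-78901'))

# fullmatch() rejects it because of the extra trailing digit
strict = bool(prog.fullmatch('123-456-78901'))

# A string that fits exactly passes fullmatch()
exact = bool(prog.fullmatch('123-456-7890'))

print(loose, strict, exact)
```

If the goal is to validate a whole field rather than find a prefix, `fullmatch()` (or a trailing `$` in the pattern) is probably the safer choice.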




## Extracting numerical values from strings
When using a regular expression to extract multiple numbers (or multiple pattern matches, to be exact), you can use:

* **`re.findall()`**: returns a list that contains all matches

In [15]:
# Import the regular expression module
import re

# Find the numeric values: matches
matches = re.findall(r'\d+', 'the recipe calls for 10 strawberries and 1 banana')

# Print the matches
print(matches)

['10', '1']


`\d` is the pattern that matches a single digit. Following it with a `+` matches the previous element one or more times, which ensures that `10` is treated as one number and not as `1` and `0`.
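
To see the effect of the `+`, here is a small sketch that runs both variants on the same sentence and then converts the matches to integers. The variable names are chosen for this illustration only.

```python
import re

text = 'the recipe calls for 10 strawberries and 1 banana'

# Without +, every digit is returned as a separate match
single_digits = re.findall(r'\d', text)

# With +, consecutive digits are grouped into one match
whole_numbers = re.findall(r'\d+', text)

# findall() returns strings, so convert them before doing arithmetic
numbers = [int(m) for m in whole_numbers]

print(single_digits)
print(whole_numbers)
print(sum(numbers))
```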

**Examples:**
Match the following patterns:
1. dollar sign, arbitrary number of digits, a decimal point, 2 digits
2. A capital letter followed by an arbitrary number of alphanumeric characters

In [16]:
pattern2 = bool(re.match(pattern=r'\$\d*\.\d{2}', string='$123.45'))
print(pattern2)

pattern3 = bool(re.match(pattern=r'[A-Z]\w*', string='Australia'))
print(pattern3)

True
True


To test a *regular expression* against different inputs, try [regex101](https://regex101.com).

### Using regular expression to tidy-up fields in the data

#### Use regular expressions to clean dates

In [21]:
df = pd.read_csv('BL-Flickr-Images-Book.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8287 entries, 0 to 8286
Data columns (total 15 columns):
Identifier                8287 non-null int64
Edition Statement         773 non-null object
Place of Publication      8287 non-null object
Date of Publication       8106 non-null object
Publisher                 4092 non-null object
Title                     8287 non-null object
Author                    6509 non-null object
Contributors              8287 non-null object
Corporate Author          0 non-null float64
Corporate Contributors    0 non-null float64
Former owner              1 non-null object
Engraver                  0 non-null float64
Issuance type             8287 non-null object
Flickr URL                8287 non-null object
Shelfmarks                8287 non-null object
dtypes: float64(3), int64(1), object(11)
memory usage: 971.2+ KB


One field where it makes sense to enforce a numeric value is `Date of Publication`, so that we can do calculations on it later.

In [27]:
df['Date of Publication']

0              1879 [1878]
1                     1868
2                     1869
3                     1851
4                     1857
5                     1875
6                     1872
7                      NaN
8                     1676
9                     1679
10                    1802
11                    1859
12                    1888
13             1839, 38-54
14                    1897
15                    1865
16                 1860-63
17                    1873
18                    1866
19                    1899
20                    1814
21                    1820
22                    1800
23      1847, 48 [1846-48]
24                 [1897?]
25                 [1897?]
26                    1893
27                    1805
28                    1837
29                    1896
               ...        
8257                   NaN
8258                  1896
8259                   NaN
8260                   NaN
8261                  1750
8262                  1879
8

* **`str.extract`**: returns the first match of each capture group in the pattern

A particular book can have only one date of publication. Therefore, we need to do the following:

* Remove the extra dates in square brackets, wherever present: 1879 [1878]
* Convert date ranges to their “start date”, wherever present: 1860-63; 1839, 38-54
* Completely remove the dates we are not certain about and replace them with NumPy’s NaN: [1897?]
* Convert the string nan to NumPy’s NaN value
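
The rules above can be sketched on a handful of values taken from the column output shown earlier. The `sample` Series is a hand-built stand-in, not the real data; the idea is that a single pattern, four digits anchored at the start of the string, covers all four cases.

```python
import numpy as np
import pandas as pd

# Hand-built sample using values visible in the column above
sample = pd.Series(
    ['1879 [1878]', '1860-63', '1839, 38-54', '[1897?]', np.nan],
    dtype=object,
)

# Capture four digits only when they open the string:
# - '1879 [1878]' and the ranges keep their leading year
# - '[1897?]' starts with a bracket, so nothing is captured (NaN)
# - missing values stay missing
years = sample.str.extract(r'^(\d{4})', expand=False)

print(years.tolist())
```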

In [30]:
# regular expression to extract pattern
regex = r'(^\d{4})'

In [31]:
extr = df['Date of Publication'].str.extract(regex, expand=False)
extr.head()

0    1879
1    1868
2    1869
3    1851
4    1857
Name: Date of Publication, dtype: object

Now convert this `object`-dtype column to a numeric dtype:

In [32]:
df['Date of Publication'] = pd.to_numeric(extr)
df['Date of Publication'].dtype

dtype('float64')
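
The result is `float64` rather than an integer type because the column contains missing values, and NumPy integers cannot represent `NaN`. If whole-number years are preferred, one option is pandas' nullable `Int64` dtype. The sketch below uses a small hypothetical Series in place of `extr`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the extracted year strings
extracted = pd.Series(['1879', '1868', np.nan], dtype=object)

# Missing values force a float result
as_float = pd.to_numeric(extracted)

# The nullable integer dtype keeps whole numbers and marks gaps as <NA>
as_int = as_float.astype('Int64')

print(as_float.dtype)
print(as_int.dtype)
print(as_int.tolist())
```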

---------------
#### Use Regular Expressions to clean country spellings

In [7]:
df = pd.read_csv('life_expp.csv',index_col=0)
df.head()

Unnamed: 0,country,year,life_expectancy
0,Afghanistan,1800,28.21
1,Albania,1800,35.4
2,Algeria,1800,28.82
3,Angola,1800,26.98
4,Antigua and Barbuda,1800,33.54


In [8]:
df.country.value_counts()

Portugal                          100
Timor-Leste                       100
Mauritania                        100
Bosnia and Herzegovina            100
Greece                            100
Australia                         100
Aruba                             100
Slovenia                          100
Djibouti                          100
Syria                             100
Nicaragua                         100
Honduras                          100
Malaysia                          100
Montenegro                        100
Mayotte                           100
Oman                              100
Romania                           100
Guam                              100
St. Vincent and the Grenadines    100
West Bank and Gaza                100
Macao, China                      100
Bhutan                            100
Mauritius                         100
China                             100
Guinea                            100
Nigeria                           100
Rwanda      

It is reasonable to assume that country names will contain:

* The set of lower and upper case letters.
* Whitespace between words.
* Periods for any abbreviations.

To check that every country name follows these rules, we can use a regular expression:

In [10]:
countries = df['country']

# Drop all the duplicates from countries
countries = countries.drop_duplicates()

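# Write the regular expression: pattern
# (letters, periods and whitespace only, from start to end)
pattern = r'^[A-Za-z\.\s]*$'

# Create the Boolean vector: mask (True where the name fits the pattern)
mask = countries.str.contains(pattern)

# Invert the mask: mask_inverse
# (True where the name does NOT fit the pattern)
mask_inverse = ~mask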
print(mask_inverse.head())

# Subset countries using mask_inverse: invalid_countries
invalid_countries = countries.loc[mask_inverse]

# Print invalid_countries
invalid_countries

0    False
1    False
2    False
3    False
4    False
Name: country, dtype: bool


38          Congo, Dem. Rep.
39               Congo, Rep.
41             Cote d'Ivoire
73             Guinea-Bissau
77          Hong Kong, China
105             Macao, China
106           Macedonia, FYR
118    Micronesia, Fed. Sts.
177              Timor-Leste
196    Virgin Islands (U.S.)
Name: country, dtype: object
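
The flagged names contain commas, apostrophes, hyphens, and parentheses. If those characters are considered acceptable (an assumption that goes beyond the three rules listed above), the character set could be widened as in this sketch. The name `Fr4nce` is a made-up entry added to show that a genuinely malformed value is still caught.

```python
import pandas as pd

# Four names copied from the flagged output, plus one made-up bad entry
names = pd.Series([
    "Congo, Dem. Rep.",
    "Cote d'Ivoire",
    "Guinea-Bissau",
    "Virgin Islands (U.S.)",
    "Fr4nce",  # hypothetical typo, not in the data
])

# Assumption: also allow , ' ( ) and - (the hyphen goes last in the set
# so that it is read literally rather than as a range)
wider_pattern = r"^[A-Za-z.\s,'()-]*$"

still_invalid = names[~names.str.contains(wider_pattern)]
print(still_invalid.tolist())
```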

In [13]:
df_clean = df.drop(invalid_countries.index, axis=0)

print(df.shape)
print(df_clean.shape)

(20100, 3)
(20090, 3)
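
Note that only 10 rows were dropped, one per flagged name. `invalid_countries` was built from the de-duplicated column, so its index points only at the first row of each country. If the intent is to remove every row for those countries, filtering by name with `isin` is one way to do it. The sketch below uses a tiny made-up table (`toy`) rather than the real file.

```python
import pandas as pd

# Hypothetical miniature of the life-expectancy table (values are made up)
toy = pd.DataFrame({
    'country': ['Albania', 'Timor-Leste', 'Albania', 'Timor-Leste'],
    'year': [1800, 1800, 1801, 1801],
    'life_expectancy': [35.4, 28.0, 35.4, 28.0],
})

pattern = r'^[A-Za-z\.\s]*$'
unique_names = toy['country'].drop_duplicates()
invalid = unique_names[~unique_names.str.contains(pattern)]

# Dropping by index removes only the first occurrence of each flagged name
by_index = toy.drop(invalid.index, axis=0)

# Filtering by name removes every row for the flagged countries
by_name = toy[~toy['country'].isin(invalid)]

print(by_index.shape)
print(by_name.shape)
```

Here `by_index` still contains one `Timor-Leste` row, while `by_name` contains none.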
