<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Regular-Expressions" data-toc-modified-id="Regular-Expressions-1">Regular Expressions</a></span><ul class="toc-item"><li><span><a href="#Metacharacters" data-toc-modified-id="Metacharacters-1.1">Metacharacters</a></span><ul class="toc-item"><li><span><a href="#Extracting-a-pattern-from-strings" data-toc-modified-id="Extracting-a-pattern-from-strings-1.1.1">Extracting a pattern from strings</a></span></li></ul></li></ul></li><li><span><a href="#Regex-with-Pandas-and-Named-Groups" data-toc-modified-id="Regex-with-Pandas-and-Named-Groups-2">Regex with Pandas and Named Groups</a></span></li><li><span><a href="#Using-regular-expression-to-tidy-up-fields-in-the-data" data-toc-modified-id="Using-regular-expression-to-tidy-up-fields-in-the-data-3">Using regular expression to tidy-up fields in the data</a></span><ul class="toc-item"><li><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Use-.str.extract-to-extract-usefull-characters-for-column-data" data-toc-modified-id="Use-.str.extract-to-extract-usefull-characters-for-column-data-3.0.0.1">Use <code>.str.extract</code> to extract usefull characters for column data</a></span></li><li><span><a href="#Filter-out-strings-that-have-some-pattern-using-.str.contains" data-toc-modified-id="Filter-out-strings-that-have-some-pattern-using-.str.contains-3.0.0.2">Filter out strings that have some pattern using <code>.str.contains</code></a></span></li></ul></li><li><span><a href="#Cleaning-strings-using-.replace" data-toc-modified-id="Cleaning-strings-using-.replace-3.0.1">Cleaning strings using <strong><code>.replace</code></strong></a></span></li></ul></li></ul></li></ul></div>

# Regular Expressions

## Metacharacters

| Character | Description                                           | Example        | Meaning of the example                                                       |
|-----------|-------------------------------------------------------|----------------|------------------------------------------------------------------------------|
| []        | A set of characters                                   | "[a-m]"        | Lowercase characters from a to m                                             |
| \         | Signals a special sequence/escapes special characters | "\d"           | Digit characters                                                             |
| .         | Any character (except a newline)                      | "he..o"        | Sequence that starts with "he", followed by two (any) characters, and an "o" |
| ^         | Starts with                                           | "^hello"       | String starts with "hello"                                                   |
| $         | Ends with                                             | "world$"       | String ends with "world"                                                     |
| *         | Zero or more occurrences                              | "aix*"         | String contains "ai" followed by 0 or more "x" characters                    |
| +         | One or more occurrences                               | "aix+"         | String contains "ai" followed by 1 or more "x" characters                    |
| {}        | Exactly the specified number of occurrences           | "al{2}"        | String contains "a" followed by exactly two "l" characters                   |
| \|        | Either or                                             | "falls\|stays" | String contains either "falls" or "stays"                                    |
| ()        | Grouping (and capturing)                              | "(ab)+"        | String contains one or more repetitions of the sequence "ab"                 |

For more info check this [link](https://www.w3schools.com/python/python_regex.asp)

**Meta-Characters: Special Symbols**
* **`\d`** Any digit >> `[0-9]`
* **`\D`** Any non-digit >> `[^0-9]`
* **`\s`** Any whitespace >> `[ \t\n\r\f\v]`
* **`\S`** Any non-whitespace >> `[^ \t\n\r\f\v]`
* **`\w`** Word character (alphanumeric or underscore) >> `[a-zA-Z0-9_]`
* **`\W`** Non-word character >> `[^a-zA-Z0-9_]`
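
A quick way to see these classes in action is to run `re.findall` on a short string. The sample string below is made up purely for illustration; this is a minimal sketch rather than part of the original exercises.

```python
import re

# Hypothetical sample string, chosen only to illustrate the classes
sample = "Room 42, floor_3\tEast"

digits = re.findall(r"\d", sample)       # every single digit
non_word = re.findall(r"\W", sample)     # anything that is not a letter, digit or "_"
whitespace = re.findall(r"\s", sample)   # spaces, tabs, newlines, ...
words = re.findall(r"\w+", sample)       # runs of word characters

print(digits)
print(non_word)
print(whitespace)
print(words)
```

Note that the underscore counts as a word character, so `floor_3` comes back as a single match.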

**Meta-Characters: Repetitions**
* **`*`**: matches ZERO or more occurrences
* **`+`**: matches ONE or more occurrences
* **`?`**: matches ZERO or ONE occurrence
* **`{n}`**: exactly n repetitions
* **`{n,}`**: at least n repetitions
* **`{,n}`**: at most n repetitions
* **`{m,n}`**: at least m and at most n repetitions

NOTE: A repetition meta-character is placed right after the expression it applies to
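
The repetition rules above can be checked with a small experiment. The sketch below assumes a made-up list of candidate strings and uses `re.fullmatch`, so a pattern only counts as matching when it covers the whole string; the helper name `matching` is hypothetical.

```python
import re

# Hypothetical candidates: an "a" followed by zero to four "b" characters
candidates = ["a", "ab", "abb", "abbb", "abbbb"]

def matching(pattern):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

print(matching(r"ab*"))      # zero or more "b"
print(matching(r"ab+"))      # one or more "b"
print(matching(r"ab?"))      # zero or one "b"
print(matching(r"ab{2}"))    # exactly two "b"
print(matching(r"ab{2,}"))   # at least two "b"
print(matching(r"ab{,2}"))   # at most two "b"
print(matching(r"ab{1,3}"))  # between one and three "b"
```

In every pattern the quantifier applies only to the `b` immediately before it, not to the whole `ab`.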


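First, let's introduce *regular expressions*, which define patterns for matching strings.

For this, we need to `import re`, Python's regular expression module.

* **`re.compile()`**: compiles a pattern into a reusable pattern object
* **`re.match()`**: checks whether the pattern matches at the beginning of a string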

**Exercise-1:** define a regular expression to match US phone numbers that fit the pattern of `xxx-xxx-xxxx`

In [2]:
# notebook imports

import pandas as pd
import numpy as np
import re

import matplotlib.pyplot as plt

In [14]:
# Import the regular expression module
import re

# Compile the pattern: prog (a raw string keeps the backslashes intact)
prog = re.compile(r'\d{3}-\d{3}-\d{4}')

# match() checks for a match at the start of the string
result = prog.match('123-456-7890')
print(bool(result))

# An extra leading digit breaks the match at the start of the string
result2 = prog.match('1123-456-7890')
print(bool(result2))


True
False
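
One caveat worth knowing: `match()` only anchors the pattern at the start of the string, so extra trailing characters still pass. The sketch below compares `match()`, `fullmatch()` and `search()` on made-up inputs to show the difference.

```python
import re

phone = re.compile(r"\d{3}-\d{3}-\d{4}")

# match() anchors only at the start, so trailing digits are accepted
trailing_match = bool(phone.match("123-456-7890123"))

# fullmatch() requires the whole string to fit the pattern
trailing_full = bool(phone.fullmatch("123-456-7890123"))
exact_full = bool(phone.fullmatch("123-456-7890"))

# search() scans the string, so it finds the number after the extra leading digit
leading_search = bool(phone.search("1123-456-7890"))

print(trailing_match, trailing_full, exact_full, leading_search)
```

For strict validation of a whole field, `fullmatch()` (or a pattern wrapped in `^...$`) is usually the safer choice.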


**Example-2:**
Match the following patterns:
1. dollar sign, arbitrary number of digits, a decimal point, 2 digits
2. A capital letter followed by an arbitrary number of alphanumeric characters

In [16]:
pattern2 = bool(re.match(pattern=r'\$\d*\.\d{2}', string='$123.45'))
print(pattern2)

pattern3 = bool(re.match(pattern=r'[A-Z]\w*', string='Australia'))
print(pattern3)

True
True


If you want to test your *regular expression* against different inputs, try this [website](https://regex101.com).



### Extracting a pattern from strings

**Example1**: Extract Numeric Values from String

When using a regular expression to extract multiple numbers (or multiple pattern matches, to be exact), you can use:

* **`re.findall()`**: returns a list that contains all matches

In [1]:
# Import the regular expression module
import re

text = 'the recipe calls for 10 strawberries and 1 banana'

# Find the numeric values: matches
matches = re.findall(r'\d+', text)
# Print the matches
print(matches)

['10', '1']


`\d` is the pattern required to find digits. It should be followed by a `+` so that the previous element is matched one or more times. This ensures that `10` is viewed as one number and not as `1` and `0`.
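
To make the effect of the `+` concrete, the following sketch runs both patterns on the same sentence and then converts the matches to integers. The conversion step is an addition for illustration.

```python
import re

text = "the recipe calls for 10 strawberries and 1 banana"

single_digits = re.findall(r"\d", text)    # one match per digit
whole_numbers = re.findall(r"\d+", text)   # one match per run of digits

# findall returns strings, so convert them before doing arithmetic
quantities = [int(m) for m in whole_numbers]

print(single_digits)
print(whole_numbers)
print(sum(quantities))
```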

**Example2**: Extract Callouts (@) from a tweet string

In [1]:
tweet = '"Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr @UN @UN_Women'
words = tweet.split(' ')

words

['"Ethics',
 'are',
 'built',
 'right',
 'into',
 'the',
 'ideals',
 'and',
 'objectives',
 'of',
 'the',
 'United',
 'Nations"',
 '#UNSG',
 '@',
 'NY',
 'Society',
 'for',
 'Ethical',
 'Culture',
 'bit.ly/2guVelr',
 '@UN',
 '@UN_Women']

In [12]:
[w for w in words if w.startswith('@')]

['@', '@UN', '@UN_Women']

This doesn't work: the bare `'@'` is returned as well. Therefore, we have to match a certain pattern after the `'@'`.

We can use regular expressions to help us with more complex parsing.

* **`re.findall()`**
* **`re.search()`**

For example, `'@[A-Za-z0-9_]+'` will return all words that start with `'@'` and are followed by at least one:
* capital letter (`'A-Z'`)
* lowercase letter (`'a-z'`)
* number (`'0-9'`)
* or underscore (`'_'`)

`[]+` means one or more occurrences of what's inside the brackets.

For ASCII text, all this can be expressed in short as `'@\w+'` (in Python 3, `\w` also matches non-ASCII letters and digits unless the `re.ASCII` flag is set).

In [2]:
import re # import re - a module that provides support for regular expressions

[w for w in words if re.search('@[A-Za-z0-9_]+', w)]

['@UN', '@UN_Women']

In [10]:

# Find the callouts directly in the tweet string (another way)
matches2 = re.findall('@[A-Za-z0-9_]+', tweet)

# Print the matches
print(matches2)

['@UN', '@UN_Women']
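
As a rough check of the `\w` shorthand, the sketch below compares it with the explicit character set on the tweet, and then on a made-up handle containing a non-ASCII letter, which is where the two patterns differ.

```python
import re

tweet = ('"Ethics are built right into the ideals and objectives of the United Nations" '
         '#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr @UN @UN_Women')

explicit = re.findall(r"@[A-Za-z0-9_]+", tweet)
shorthand = re.findall(r"@\w+", tweet)
print(explicit == shorthand)

# Hypothetical handle; \u00e9 is an "e" with an acute accent
handle = "@Jos\u00e9_92"
explicit_handle = re.findall(r"@[A-Za-z0-9_]+", handle)     # stops at the accented letter
unicode_handle = re.findall(r"@\w+", handle)                # \w accepts it by default
ascii_handle = re.findall(r"@\w+", handle, flags=re.ASCII)  # re.ASCII restores [a-zA-Z0-9_]

print(explicit_handle, unicode_handle, ascii_handle)
```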


# Regex with Pandas and Named Groups

In [3]:
import pandas as pd

time_sentences = ["Monday: The doctor's appointment is at 2:45pm.", 
                  "Tuesday: The dentist's appointment is at 11:30 am.",
                  "Wednesday: At 7:00pm, there is a basketball game!",
                  "Thursday: Be back home by 11:15 pm at the latest.",
                  "Friday: Take the train at 08:10 am, arrive at 09:00am."]

df = pd.DataFrame(time_sentences, columns=['text'])
df

|   | text                                               |
|---|----------------------------------------------------|
| 0 | Monday: The doctor's appointment is at 2:45pm.     |
| 1 | Tuesday: The dentist's appointment is at 11:30...  |
| 2 | Wednesday: At 7:00pm, there is a basketball game!  |
| 3 | Thursday: Be back home by 11:15 pm at the latest.  |
| 4 | Friday: Take the train at 08:10 am, arrive at ...  |


In [4]:
# find the number of characters for each string in df['text']
df['text'].str.len()

0    46
1    50
2    49
3    49
4    54
Name: text, dtype: int64

In [5]:
# find the number of tokens (words) for each string in df['text']
df['text'].str.split().str.len()

0     7
1     8
2     8
3    10
4    10
Name: text, dtype: int64

In [6]:
# find which entries contain the word 'appointment'
df['text'].str.contains('appointment')

0     True
1     True
2    False
3    False
4    False
Name: text, dtype: bool

In [7]:
# find how many times a digit occurs in each string
df['text'].str.count(r'\d')

0    3
1    4
2    3
3    4
4    8
Name: text, dtype: int64

In [8]:
# find all occurrences of the digits
df['text'].str.findall(r'\d')

0                   [2, 4, 5]
1                [1, 1, 3, 0]
2                   [7, 0, 0]
3                [1, 1, 1, 5]
4    [0, 8, 1, 0, 0, 9, 0, 0]
Name: text, dtype: object

In [9]:
# group and find the hours and minutes
df['text'].str.findall(r'(\d?\d):(\d\d)')

0               [(2, 45)]
1              [(11, 30)]
2               [(7, 00)]
3              [(11, 15)]
4    [(08, 10), (09, 00)]
Name: text, dtype: object

In [10]:
# replace weekdays with '???' (regex=True is required in pandas 2.0+, where the default is a literal match)
df['text'].str.replace(r'\w+day\b', '???', regex=True)

0          ???: The doctor's appointment is at 2:45pm.
1       ???: The dentist's appointment is at 11:30 am.
2          ???: At 7:00pm, there is a basketball game!
3         ???: Be back home by 11:15 pm at the latest.
4    ???: Take the train at 08:10 am, arrive at 09:...
Name: text, dtype: object

In [11]:
# replace weekdays with 3-letter abbreviations (a callable replacement needs regex=True)
df['text'].str.replace(r'(\w+day\b)', lambda x: x.groups()[0][:3], regex=True)

0          Mon: The doctor's appointment is at 2:45pm.
1       Tue: The dentist's appointment is at 11:30 am.
2          Wed: At 7:00pm, there is a basketball game!
3         Thu: Be back home by 11:15 pm at the latest.
4    Fri: Take the train at 08:10 am, arrive at 09:...
Name: text, dtype: object
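
The same two replacements can be tried on a single string with `re.sub`, which also accepts a function as the replacement. This is a minimal sketch; the helper name `abbreviate` is made up for illustration.

```python
import re

sentence = "Monday: The doctor's appointment is at 2:45pm."

def abbreviate(match):
    """Keep only the first three letters of the matched weekday."""
    return match.group(1)[:3]

masked = re.sub(r"\w+day\b", "???", sentence)
shortened = re.sub(r"(\w+day\b)", abbreviate, sentence)

print(masked)
print(shortened)
```

Keep in mind that `\w+day\b` matches any word ending in "day", so a word such as "birthday" would be shortened too.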

<br>

To create new columns from the data, use **`.str.extract`**, which creates one column per group from the first match in each string.

In [12]:
# create new columns from first match of extracted groups
df['text'].str.extract(r'(\d?\d):(\d\d)')

|   | 0  | 1  |
|---|----|----|
| 0 | 2  | 45 |
| 1 | 11 | 30 |
| 2 | 7  | 00 |
| 3 | 11 | 15 |
| 4 | 08 | 10 |


In [13]:
# extract the entire time, the hours, the minutes, and the period
df['text'].str.extractall(r'((\d?\d):(\d\d) ?([ap]m))')

|   | match | 0        | 1  | 2  | 3  |
|---|-------|----------|----|----|----|
| 0 | 0     | 2:45pm   | 2  | 45 | pm |
| 1 | 0     | 11:30 am | 11 | 30 | am |
| 2 | 0     | 7:00pm   | 7  | 00 | pm |
| 3 | 0     | 11:15 pm | 11 | 15 | pm |
| 4 | 0     | 08:10 am | 08 | 10 | am |
| 4 | 1     | 09:00am  | 09 | 00 | am |


To name the groups (and hence the columns), put `?P<NAME>` right after the opening parenthesis of each group, as in `(?P<NAME>...)`.

In [14]:
# extract the entire time, the hours, the minutes, and the period with group names
df['text'].str.extractall(r'(?P<time>(?P<hour>\d?\d):(?P<minute>\d\d) ?(?P<period>[ap]m))')

|   | match | time     | hour | minute | period |
|---|-------|----------|------|--------|--------|
| 0 | 0     | 2:45pm   | 2    | 45     | pm     |
| 1 | 0     | 11:30 am | 11   | 30     | am     |
| 2 | 0     | 7:00pm   | 7    | 00     | pm     |
| 3 | 0     | 11:15 pm | 11   | 15     | pm     |
| 4 | 0     | 08:10 am | 08   | 10     | am     |
| 4 | 1     | 09:00am  | 09   | 00     | am     |
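
Named groups work the same way in the plain `re` module, where they can be read by name from a match object. The sketch below reuses the hour/minute/period groups on one of the sentences and is meant only as an illustration.

```python
import re

time_pattern = re.compile(r"(?P<hour>\d?\d):(?P<minute>\d\d) ?(?P<period>[ap]m)")

sentence = "Friday: Take the train at 08:10 am, arrive at 09:00am."

# search() returns the first match; groups are then available by name
first = time_pattern.search(sentence)
print(first.group("hour"), first.group("minute"), first.group("period"))

# finditer() yields every match; groupdict() maps group names to matched text
all_times = [m.groupdict() for m in time_pattern.finditer(sentence)]
print(all_times)
```

The extracted values are strings, which is why leading zeros such as `08` and `00` are preserved.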


--------------------
# Using regular expression to tidy-up fields in the data

#### Use `.str.extract` to extract usefull characters for column data

In [3]:
df = pd.read_csv('BL-Flickr-Images-Book.csv')

One field where it makes sense to enforce a numeric value is `Date of Publication`, so that we can do calculations down the road.

In [27]:
df['Date of Publication']

0              1879 [1878]
1                     1868
2                     1869
3                     1851
4                     1857
5                     1875
6                     1872
7                      NaN
8                     1676
9                     1679
10                    1802
11                    1859
12                    1888
13             1839, 38-54
14                    1897
15                    1865
16                 1860-63
17                    1873
18                    1866
19                    1899
20                    1814
21                    1820
22                    1800
23      1847, 48 [1846-48]
24                 [1897?]
25                 [1897?]
26                    1893
27                    1805
28                    1837
29                    1896
               ...        
8257                   NaN
8258                  1896
8259                   NaN
8260                   NaN
8261                  1750
8262                  1879
8

* **`str.extract`**

A particular book can have only one date of publication. Therefore, we need to do the following:

* Remove the extra dates in square brackets, wherever present: 1879 [1878]
* Convert date ranges to their “start date”, wherever present: 1860-63; 1839, 38-54
* Completely remove the dates we are not certain about and replace them with NumPy’s NaN: [1897?]
* Convert the string nan to NumPy’s NaN value

In [30]:
# regular expression to extract pattern
regex = r'(^\d{4})'

In [31]:
extr = df['Date of Publication'].str.extract(regex, expand=False)
extr.head()

0    1879
1    1868
2    1869
3    1851
4    1857
Name: Date of Publication, dtype: object

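Now convert this `object`-dtype column to a numeric dtype. The result is `float64` rather than an integer type because the column contains `NaN` values.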

In [32]:
df['Date of Publication'] = pd.to_numeric(extr)
df['Date of Publication'].dtype

dtype('float64')
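
To see what the pattern does to each kind of messy value, here is a minimal pure-Python sketch on a few values copied from the column output. The helper `first_year` is hypothetical and only mimics the extract-then-convert steps for a single string.

```python
import re

# Sample values copied from the `Date of Publication` output
raw_dates = ["1879 [1878]", "1868", "1839, 38-54", "1860-63", "[1897?]"]

def first_year(value):
    """Hypothetical helper: leading four-digit year as a float, or NaN if absent."""
    match = re.match(r"(\d{4})", value)   # re.match anchors at the start, like ^
    return float(match.group(1)) if match else float("nan")

years = [first_year(v) for v in raw_dates]
print(years)
```

The uncertain date `[1897?]` starts with a bracket rather than a digit, so it yields `NaN`, which matches the goal of discarding dates we are not sure about.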

---------------
#### Filter out strings that have some pattern using `.str.contains`

In [7]:
df = pd.read_csv('life_expp.csv',index_col=0)
df.head()

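|   | country             | year | life_expectancy |
|---|---------------------|------|-----------------|
| 0 | Afghanistan         | 1800 | 28.21           |
| 1 | Albania             | 1800 | 35.4            |
| 2 | Algeria             | 1800 | 28.82           |
| 3 | Angola              | 1800 | 26.98           |
| 4 | Antigua and Barbuda | 1800 | 33.54           |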


In [8]:
df.country.value_counts()

Portugal                          100
Timor-Leste                       100
Mauritania                        100
Bosnia and Herzegovina            100
Greece                            100
Australia                         100
Aruba                             100
Slovenia                          100
Djibouti                          100
Syria                             100
Nicaragua                         100
Honduras                          100
Malaysia                          100
Montenegro                        100
Mayotte                           100
Oman                              100
Romania                           100
Guam                              100
St. Vincent and the Grenadines    100
West Bank and Gaza                100
Macao, China                      100
Bhutan                            100
Mauritius                         100
China                             100
Guinea                            100
Nigeria                           100
Rwanda      

It is reasonable to assume that country names will contain:

* The set of lower and upper case letters.
* Whitespace between words.
* Periods for any abbreviations.

To confirm that every name follows this assumption, we can use a regular expression.

In [10]:
countries = df['country']

# Drop all the duplicates from countries
countries = countries.drop_duplicates()

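# Write the regular expression: pattern
# (the whole string may contain only letters, periods and whitespace)
pattern = r'^[A-Za-z\.\s]*$'

# Create the Boolean vector: mask (True where the name matches the pattern)
mask = countries.str.contains(pattern)

# Invert the mask: mask_inverse
# (True where the name does NOT match the pattern)
mask_inverse = ~mask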
print(mask_inverse.head())

# Subset countries using mask_inverse: invalid_countries
invalid_countries = countries.loc[mask_inverse]

# Print invalid_countries
invalid_countries

0    False
1    False
2    False
3    False
4    False
Name: country, dtype: bool


38          Congo, Dem. Rep.
39               Congo, Rep.
41             Cote d'Ivoire
73             Guinea-Bissau
77          Hong Kong, China
105             Macao, China
106           Macedonia, FYR
118    Micronesia, Fed. Sts.
177              Timor-Leste
196    Virgin Islands (U.S.)
Name: country, dtype: object
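
To see why these names were flagged, the following sketch checks a handful of them with plain `re` and lists the characters that fall outside the allowed set. The list of names is a small sample taken from the outputs above.

```python
import re

allowed = re.compile(r"[A-Za-z.\s]*")

names = ["Antigua and Barbuda", "St. Vincent and the Grenadines",
         "Cote d'Ivoire", "Guinea-Bissau", "Macao, China", "Virgin Islands (U.S.)"]

# Map each invalid name to the characters that are not letters, periods or whitespace
offenders = {
    name: sorted(set(re.findall(r"[^A-Za-z.\s]", name)))
    for name in names
    if not allowed.fullmatch(name)
}

for name, chars in offenders.items():
    print(f"{name!r}: {chars}")
```

Apostrophes, hyphens, commas and parentheses are all legitimate in country names, so these entries are not necessarily errors; the pattern simply did not anticipate them.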

In [13]:
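# Note: this drops only the rows whose index labels appear in invalid_countries
# (one row per flagged country), not every row that belongs to those countries
df_clean = df.drop(invalid_countries.index, axis=0)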

print(df.shape)
print(df_clean.shape)

(20100, 3)
(20090, 3)
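
Only 10 rows were removed because `invalid_countries.index` holds one index label per flagged country. If the goal is to remove every row for those countries, filtering on the country name with `isin` is one option. The sketch below uses a tiny made-up DataFrame, not the real file.

```python
import pandas as pd

# Hypothetical miniature of the life-expectancy table
df = pd.DataFrame({
    "country": ["Albania", "Guinea-Bissau", "Albania", "Guinea-Bissau"],
    "year": [1800, 1800, 1801, 1801],
})

countries = df["country"].drop_duplicates()
invalid = countries[~countries.str.contains(r"^[A-Za-z.\s]*$")]

by_index = df.drop(invalid.index)            # removes only the row with label 1
by_name = df[~df["country"].isin(invalid)]   # removes every "Guinea-Bissau" row

print(by_index.shape, by_name.shape)
```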


---------------------------
### Cleaning strings using **`.replace`**
**Example**: Clean the DataFrame so that:
* All digits are removed
* Anything inside a bracket is removed

In [20]:
df = pd.DataFrame(['hi (you)','hey(me)','hey55','mous6afa',
                  'age (me) one'])

df

|   | 0            |
|---|--------------|
| 0 | hi (you)     |
| 1 | hey(me)      |
| 2 | hey55        |
| 3 | mous6afa     |
| 4 | age (me) one |


`\(([^)]+)\)`

* **`\(`** matches the character **`(`** literally
* **`([^)]+)`** is a capturing group:
  * **`[^)]`** matches any single character that is **not** **`)`**
  * **`+`** is a quantifier: it repeats the preceding character set one or more times, as many times as possible, giving back as needed (greedy)
* **`\)`** matches the character **`)`** literally
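
The breakdown above can be verified with `re.search`, which exposes both the full match and the captured group. The test strings below are illustrative; the last two are made up to show cases that do not match.

```python
import re

bracketed = re.compile(r"\(([^)]+)\)")

texts = ["hi (you)", "hey(me)", "age (me) one", "empty ()", "no brackets"]

# For each text: (full match, captured group), or None when nothing matches
results = {}
for text in texts:
    match = bracketed.search(text)
    results[text] = (match.group(0), match.group(1)) if match else None

for text, found in results.items():
    print(f"{text!r}: {found}")
```

Because `+` requires at least one character between the brackets, the empty pair `()` is not matched.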

In [21]:
# remove digits and any bracketed text (together with the space before it)
df_inside = df.replace(regex=True, 
                to_replace=[r'\d', r' \(([^)]+)\)'], value=r'')

In [22]:
df_inside

|   | 0       |
|---|---------|
| 0 | hi      |
| 1 | hey(me) |
| 2 | hey     |
| 3 | mousafa |
| 4 | age one |


In [23]:
# remove digits and everything from ' (' to the end of the string

df_after = df.replace(regex=True, 
                to_replace=[r'\d', r' \(.*'], value=r'')

df_after

|   | 0       |
|---|---------|
| 0 | hi      |
| 1 | hey(me) |
| 2 | hey     |
| 3 | mousafa |
| 4 | age     |


Note that 'hey(me)' stayed the same because both patterns require a space before the opening bracket.

The difference between `df_inside` and `df_after` appears in the last entry: 'age one' keeps the text after the bracket, while 'age' drops it.

**What if you want to do that with a string, not a DataFrame?**

**`re.sub`**

In [13]:
string = 'hey (you) here'
string = re.sub(r' \(.*',"",string)
string

'hey'
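
Both cleaning variants from the DataFrame examples can be reproduced on plain strings by chaining `re.sub` calls. The sketch below wraps them in a hypothetical helper, `clean`, and applies it to the same five sample strings.

```python
import re

def clean(text, drop_rest=False):
    """Hypothetical helper mirroring the two DataFrame examples.

    Removes digits, then either the bracketed text only (default)
    or everything from ' (' to the end of the string (drop_rest=True).
    """
    text = re.sub(r"\d", "", text)
    pattern = r" \(.*" if drop_rest else r" \([^)]+\)"
    return re.sub(pattern, "", text)

samples = ["hi (you)", "hey(me)", "hey55", "mous6afa", "age (me) one"]

inside = [clean(s) for s in samples]
after = [clean(s, drop_rest=True) for s in samples]

print(inside)
print(after)
```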