# Text Wrangling and Regex

Working with text: applying string methods and regular expressions

In [2]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import zipfile

## Demo 1: Canonicalizing County Names

Load both **county_and_state.csv** and **county_and_population.csv**

In [3]:
df1 = pd.read_csv('./data/county_and_state.csv')
df2 = pd.read_csv('./data/county_and_population.csv')

In [4]:
# Display both frames, starting with df1

df1

Unnamed: 0,County,State
0,De Witt County,IL
1,Lac qui Parle County,MN
2,Lewis and Clark County,MT
3,St John the Baptist Parish,LS


In [5]:
df2

Unnamed: 0,County,Population
0,DeWitt,16798
1,Lac Qui Parle,8067
2,Lewis & Clark,55716
3,St. John the Baptist,43044


Both of these DataFrames share a "County" column. Unfortunately, formatting differences (spacing, capitalization, punctuation, and suffixes such as "County" or "Parish") mean that we can't directly merge the two DataFrames on that column.

In [6]:
# Merge both frames on "County" using pd.merge.
# The output shows only the header row because no two county strings
# are formatted identically. We will resolve this later on.

pd.merge(df1, df2, on='County')

Unnamed: 0,County,State,Population
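
One way to see exactly which keys fail to line up is an outer merge with `indicator=True`. The sketch below rebuilds the two frames inline from the rows displayed above; the names `left`, `right`, and `diagnostic` are illustrative and not part of the notebook.

```python
import pandas as pd

# Stand-ins for the two CSV files, rebuilt from the rows displayed above.
left = pd.DataFrame({
    "County": ["De Witt County", "Lac qui Parle County",
               "Lewis and Clark County", "St John the Baptist Parish"],
    "State": ["IL", "MN", "MT", "LS"],
})
right = pd.DataFrame({
    "County": ["DeWitt", "Lac Qui Parle", "Lewis & Clark", "St. John the Baptist"],
    "Population": [16798, 8067, 55716, 43044],
})

# An outer merge keeps every key from both frames. indicator=True adds a
# `_merge` column saying whether a key was found in the left frame only,
# the right frame only, or both.
diagnostic = pd.merge(left, right, on="County", how="outer", indicator=True)
print(diagnostic[["County", "_merge"]])
```

Since no key appears in both frames, every row is labelled `left_only` or `right_only`, which confirms that the empty inner merge is a formatting problem rather than missing data.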


To address this, we can **canonicalize** the "County" string data to apply a common formatting.

In [7]:
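def canonicalize_county(county_series):
    """Lowercase county names and strip spaces, punctuation, and suffixes."""
    county = county_series.str.lower()
    # regex=False treats each pattern as literal text;
    # interpreted as a regex, "." would match every character.
    county = county.str.replace(" ", "", regex=False)
    county = county.str.replace("&", "", regex=False)
    county = county.str.replace(".", "", regex=False)
    county = county.str.replace("county", "", regex=False)
    county = county.str.replace("parish", "", regex=False)
    return county

canonicalize_county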


<function __main__.canonicalize_county(county_series)>

Apply `canonicalize_county` to the "County" column of both frames.

In [8]:
df2

Unnamed: 0,County,Population
0,DeWitt,16798
1,Lac Qui Parle,8067
2,Lewis & Clark,55716
3,St. John the Baptist,43044


In [9]:
# Overwrite each "County" column with its canonical form
df1['County'] = canonicalize_county(df1['County'])
df2['County'] = canonicalize_county(df2['County'])

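Now the merge finds matches. Note that only three of the four counties appear: "Lewis and Clark County" becomes `lewisandclark`, while "Lewis & Clark" becomes `lewisclark`, because the function drops "&" instead of mapping it to "and".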

In [10]:
# Merge again on the canonicalized "County" column
merge_df = pd.merge(df1,df2,on='County')
merge_df

Unnamed: 0,County,State,Population
0,dewitt,IL,16798
1,lacquiparle,MN,8067
2,stjohnthebaptist,LS,43044
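
The merged output above has three rows, not four, because "Lewis and Clark County" and "Lewis & Clark" still canonicalize to different strings. A minimal sketch of one possible fix, assuming we want "&" to be treated as the word "and", follows. The function name `canonicalize_county_and` and the inline frames are hypothetical stand-ins, not part of the notebook.

```python
import pandas as pd

def canonicalize_county_and(county_series):
    """Hypothetical variant: maps "&" to "and" instead of deleting it."""
    return (
        county_series.str.lower()
        .str.replace(" ", "", regex=False)
        .str.replace("&", "and", regex=False)
        .str.replace(".", "", regex=False)
        .str.replace("county", "", regex=False)
        .str.replace("parish", "", regex=False)
    )

# Stand-ins for the two CSV files, rebuilt from the rows displayed earlier.
left = pd.DataFrame({
    "County": ["De Witt County", "Lac qui Parle County",
               "Lewis and Clark County", "St John the Baptist Parish"],
    "State": ["IL", "MN", "MT", "LS"],
})
right = pd.DataFrame({
    "County": ["DeWitt", "Lac Qui Parle", "Lewis & Clark", "St. John the Baptist"],
    "Population": [16798, 8067, 55716, 43044],
})

left["County"] = canonicalize_county_and(left["County"])
right["County"] = canonicalize_county_and(right["County"])

merged = pd.merge(left, right, on="County")
print(merged)
```

With this variant both spellings reduce to `lewisandclark`, so all four counties survive the merge. Removing "county" and "parish" anywhere in the string is a shortcut that suits this small example; a name containing those letters elsewhere would need a more careful rule.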


## Demo 2: Extracting Log Data

Load **log.txt**

In [11]:
# Hint: the file is read almost the same way as we read the JSON file.
# file.readlines() returns a list with one string per line of the text file.

with open('./data/log.txt', 'r') as file:
    log_file = file.readlines()
log_file

['169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/"\n',
 '193.205.203.3 - - [2/Feb/2005:17:23:6 -0800] "GET /stat141/Notes/dim.html HTTP/1.0" 404 302 "http://eeyore.ucdavis.edu/stat141/Notes/session.html"\n',
 '169.237.46.240 - "" [3/Feb/2006:10:18:37 -0800] "GET /stat141/homework/Solutions/hw1Sol.pdf HTTP/1.1"\n']

Suppose we want to extract the day, month, year, hour, minutes, seconds, and timezone. Looking at the data, we see that these items are not in a fixed position relative to the beginning of the string. That is, slicing by some fixed offset isn't going to work.

Instead, we'll need to use some more sophisticated thinking. Let's focus on only the first line of the file.

In [12]:
# code here
first_line = log_file[0]
first_line

'169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/"\n'

Use Python's string methods to extract the date and time fields from the first entry in the log file.

In [13]:
# The slides may help here
pertinent = first_line.split('[')[1].split(']')[0]   # text between the brackets
day, month, rest = pertinent.split('/')
year, hour, minute, rest = rest.split(':')
seconds, time_zone = rest.split(' ')

day, month, year, hour, minute, seconds



('26', 'Jan', '2014', '10', '47', '58')
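
To check that the split-based approach is not tied to the first line, the same steps can be wrapped in a function and run on every entry. This is only a sketch: `split_timestamp` is a hypothetical helper name, and the three lines below are shortened copies of the log entries shown above.

```python
def split_timestamp(line):
    """Hypothetical helper: extract the timestamp fields from one log line using str.split."""
    pertinent = line.split("[")[1].split("]")[0]   # text between the brackets
    day, month, rest = pertinent.split("/")
    year, hour, minute, rest = rest.split(":")
    seconds, time_zone = rest.split(" ")
    return day, month, year, hour, minute, seconds, time_zone

# Shortened copies of the three log entries (request details trimmed).
lines = [
    '169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200 2585\n',
    '193.205.203.3 - - [2/Feb/2005:17:23:6 -0800] "GET /stat141/Notes/dim.html HTTP/1.0" 404 302\n',
    '169.237.46.240 - "" [3/Feb/2006:10:18:37 -0800] "GET /stat141/homework/Solutions/hw1Sol.pdf HTTP/1.1"\n',
]

results = [split_timestamp(line) for line in lines]
for fields in results:
    print(fields)
```

The function works here because every entry has exactly one bracketed timestamp in the same layout. A line without brackets would raise an `IndexError`, which is part of why this style feels fragile.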

This worked, but it felt fairly "hacky": the code is brittle and not particularly elegant. A more robust and widely used approach is to extract the information we need using a *regular expression*.


# Regular Expressions


## String Extraction with Regex

Python's `re.findall` returns a list of all matches of a pattern in a string. We first apply it to the log line, then use it to extract numbers from a given string.

In [14]:
first_line

'169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/"\n'

In [35]:
import re

# One capture group per field: day, month, year, hour, minute, second, time zone
pattern = r"\[(\d+)\/(\w+)\/(\d+):(\d+):(\d+):(\d+) (.+)\]"
matches = re.findall(pattern, first_line)
day, month, year, hour, minute, second, time_zone = matches[0]
day

'26'

In [16]:
import re

text = "My social security number is 123-45-6789 bro, or actually maybe it’s 321-45-6789."

# Three digits, two digits, then four digits, separated by hyphens
pattern = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"
number = re.findall(pattern, text)
number
# Expected output: ['123-45-6789', '321-45-6789']

['123-45-6789', '321-45-6789']

<br/>

Now, let's see vectorized extraction in `pandas`:

`.str.findall` returns a `Series` of lists of all matches in each record.

In [17]:
data = ['987-65-4321','forty','123-45-6789 bro or 321-45-6789','999-99-9999']

# Convert the list into a Series so the vectorized .str methods are available
df = pd.Series(data)
pattern = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"
df.str.findall(pattern)

0                 [987-65-4321]
1                            []
2    [123-45-6789, 321-45-6789]
3                 [999-99-9999]
dtype: object

When the pattern contains capture groups, `.str.findall` returns a list of tuples for each record: one tuple per match, with one element per group.

In [18]:
# -> Series of lists of tuples
df = pd.Series(data)
pattern = r"([0-9]{3})-([0-9]{2})-([0-9]{4})"

df.str.findall(pattern)

# Expected output:
#  0                     [(987, 65, 4321)]
#  1                                    []
#  2    [(123, 45, 6789), (321, 45, 6789)]
#  3                     [(999, 99, 9999)]
#  dtype: object

0                     [(987, 65, 4321)]
1                                    []
2    [(123, 45, 6789), (321, 45, 6789)]
3                     [(999, 99, 9999)]
dtype: object

## Extraction Using Regex Capture Groups

The Python function `re.findall`, in combination with parentheses, returns specific substrings (i.e., **capture groups**) within each matched string, or **match**.

In [37]:
text = "I will meet you at 08:30:00 pm tomorrow"

# Each parenthesized \d\d is one capture group: hour, minute, second
pattern = r'(\d\d):(\d\d):(\d\d)'
matchstring = re.findall(pattern, text)

In [38]:
# the three capture groups in the first matched string
hour, minute, second = matchstring[0]
hour,minute,second

('08', '30', '00')
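
Positional unpacking works, but it depends on remembering the order of the groups. As a sketch of an alternative, Python's `re` module also supports named groups with the `(?P<name>...)` syntax. The group names `hour`, `minute`, and `second` are chosen here for illustration.

```python
import re

text = "I will meet you at 08:30:00 pm tomorrow"

# (?P<name>...) labels a capture group so it can be read back by name.
pattern = r"(?P<hour>\d\d):(?P<minute>\d\d):(?P<second>\d\d)"

# re.search returns the first match object, or None if nothing matches.
match = re.search(pattern, text)
if match is not None:
    parts = match.groupdict()   # dict mapping group name -> matched text
    print(parts["hour"], parts["minute"], parts["second"])
```

Unlike `re.findall`, `re.search` stops at the first match, so this form suits records that contain a single timestamp.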

<br/>

In `pandas`, we can use `.str.extract` to extract each capture group of **only the first match** of each record into separate columns.

In [39]:
# back to SSNs
data = ['987-65-4321','forty','123-45-6789 bro or 321-45-6789','999-99-9999']
series = pd.Series(data)

pattern = r"([0-9]{3})-([0-9]{2})-([0-9]{4})"

series.str.extract(pattern)

Unnamed: 0,0,1,2
0,987,65,4321
1,,,
2,123,45,6789
3,999,99,9999
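
The default column labels `0`, `1`, `2` say little about their contents. A small sketch, assuming we are free to label the three parts ourselves: with named groups in the pattern, `.str.extract` uses the group names as column names. The labels `area`, `group`, and `serial` and the variable names are illustrative choices.

```python
import pandas as pd

ssn_series = pd.Series(
    ['987-65-4321', 'forty', '123-45-6789 bro or 321-45-6789', '999-99-9999']
)

# Named groups become the column names of the extracted DataFrame.
named_pattern = r"(?P<area>[0-9]{3})-(?P<group>[0-9]{2})-(?P<serial>[0-9]{4})"
parts = ssn_series.str.extract(named_pattern)
print(parts)

# Records with no match hold NaN in every column; dropna() removes them.
matched = parts.dropna()
print(matched)
```

The extracted values are strings, so a follow-up `astype(int)` would be needed before doing arithmetic on them.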


Alternatively, `.str.extractall` extracts **all matches** of each record, producing one row per match and one column per capture group. Rows are then MultiIndexed by original record index and match index.

In [23]:
# -> DataFrame, one row per match

data = ['987-65-4321','forty','123-45-6789 bro or 321-45-6789','999-99-9999']

df = pd.Series(data)
pattern = r"([0-9]{3})-([0-9]{2})-([0-9]{4})"
df.str.extractall(pattern)



Unnamed: 0,match,0,1,2
0,0,987,65,4321
2,0,123,45,6789
2,1,321,45,6789
3,0,999,99,9999


## Canonicalization with Regex

In regular Python, canonicalize with `re.sub` (standing for "substitute"):

In [24]:
text = '<div><td valign="top">Moo</td></div>'

# Remove every HTML tag ("<", one or more non-">" characters, then ">"),
# leaving only the word Moo
pattern = r"<[^>]+>"
re.sub(pattern, "", text)

'Moo'

<br/>

In `pandas`, canonicalize with `Series.str.replace`.

In [25]:
# Example list of HTML strings, converted into a DataFrame
df_html = ['<div><td valign="top">Moo</td></div>',
                   '<a href="http://ds100.org">Link</a>',
                   '<b>Bold text</b>']
df_html = pd.DataFrame(df_html,columns = ['HTML'])
df_html

Unnamed: 0,HTML
0,"<div><td valign=""top"">Moo</td></div>"
1,"<a href=""http://ds100.org"">Link</a>"
2,<b>Bold text</b>


In [26]:
# Series -> Series
# Strip the tags so only the text remains: Moo, Link, and Bold text

pattern = r"<[^>]+>"
df_html['HTML'].str.replace(pattern,"",regex = True)

0          Moo
1         Link
2    Bold text
Name: HTML, dtype: object
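
The same regex-based replacement can stand in for the chain of literal replacements used in Demo 1. The sketch below is one possible pattern, not the notebook's own method: it removes spaces, periods, and the words "county" and "parish" in a single pass. The sample values are copied from the county tables shown earlier.

```python
import pandas as pd

counties = pd.Series([
    "De Witt County", "DeWitt",
    "St John the Baptist Parish", "St. John the Baptist",
])

# Character class [ .] matches a space or a literal period;
# the alternation also matches the suffix words after lowercasing.
pattern = r"[ .]|county|parish"
canonical = counties.str.lower().str.replace(pattern, "", regex=True)
print(canonical.tolist())
```

Inside a character class the period is literal, so it does not need escaping here. Outside of one, it would have to be written as `\.`.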


# Revisiting Text Log Processing using Regex

### Python `re` version

In [27]:
line = log_file[0]

# Display the raw first line
display(line)

'169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/"\n'
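
A minimal sketch of the extraction itself, reusing the timestamp pattern from earlier in this notebook. The line is pasted inline so the snippet runs on its own, and the variable names are illustrative.

```python
import re

line = ('169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] '
        '"GET /stat141/Winter04/ HTTP/1.1" 200 2585 '
        '"http://anson.ucdavis.edu/courses/"\n')

# Seven capture groups: day, month, year, hour, minute, second, time zone.
pattern = r"\[(\d+)/(\w+)/(\d+):(\d+):(\d+):(\d+) (.+)\]"

# findall returns a list of tuples; this line has a single timestamp,
# so we unpack the first (and only) tuple.
day, month, year, hour, minute, second, time_zone = re.findall(pattern, line)[0]
print(day, month, year, hour, minute, second, time_zone)
```

Because every field is captured as text, the time zone keeps its leading zero (`-0800`), which would be lost if the values were converted to integers.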

### `pandas` version

In [28]:
df = pd.DataFrame(log_file,columns = ['Log'])
df

Unnamed: 0,Log
0,169.237.46.168 - - [26/Jan/2014:10:47:58 -0800...
1,"193.205.203.3 - - [2/Feb/2005:17:23:6 -0800] ""..."
2,"169.237.46.240 - """" [3/Feb/2006:10:18:37 -0800..."


Option 1: `Series.str.findall`

In [29]:
pattern = r"\[(\d+)\/(\w+)\/(\d+):(\d+):(\d+):(\d+) (.+)\]"
df['Log'].str.findall(pattern)

0    [(26, Jan, 2014, 10, 47, 58, -0800)]
1      [(2, Feb, 2005, 17, 23, 6, -0800)]
2     [(3, Feb, 2006, 10, 18, 37, -0800)]
Name: Log, dtype: object

<br/>

Option 2: `Series.str.extractall`

In [30]:
# One row per match, one column per capture group
pattern = r"\[(\d+)\/(\w+)\/(\d+):(\d+):(\d+):(\d+) (.+)\]"

df = df['Log'].str.extractall(pattern)
df

Unnamed: 0,match,0,1,2,3,4,5,6
0,0,26,Jan,2014,10,47,58,-0800
1,0,2,Feb,2005,17,23,6,-0800
2,0,3,Feb,2006,10,18,37,-0800


In [31]:
df.columns = ['day','month','year','hour','minute','seconds','timezone']
df['index'] = [0,1,2]

df

Unnamed: 0,match,day,month,year,hour,minute,seconds,timezone,index
0,0,26,Jan,2014,10,47,58,-0800,0
1,0,2,Feb,2005,17,23,6,-0800,1
2,0,3,Feb,2006,10,18,37,-0800,2


Wrangling either of these two DataFrames into a nice format (like below) is left as an exercise for you!


||Day|Month|Year|Hour|Minute|Second|Time Zone|
|---|---|---|---|---|---|---|---|
|0|26|Jan|2014|10|47|58|-0800|
|1|2|Feb|2005|17|23|6|-0800|
|2|3|Feb|2006|10|18|37|-0800|


In [32]:
# your code here
...