# Text Wrangling and Regex

Working with text: applying string methods and regular expressions

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [59]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import zipfile

## Demo 1: Canonicalizing County Names

Load both **county_and_state.csv** and **county_and_population.csv**.

In [14]:
# Load and display the population frame
c_population = pd.read_csv(r'C:\Users\Muhammad_Talha\Downloads\COHORT 7\Week 1 + 2\Data Wrangling\county_and_population.csv')
c_population


||County|Population|
|---|---|---|
|0|DeWitt|16798|
|1|Lac Qui Parle|8067|
|2|Lewis & Clark|55716|
|3|St. John the Baptist|43044|


In [15]:
c_state = pd.read_csv(r'C:\Users\Muhammad_Talha\Downloads\COHORT 7\Week 1 + 2\Data Wrangling\county_and_state.csv')
c_state

||County|State|
|---|---|---|
|0|De Witt County|IL|
|1|Lac qui Parle County|MN|
|2|Lewis and Clark County|MT|
|3|St John the Baptist Parish|LS|


Both of these DataFrames share a "County" column. Unfortunately, formatting differences mean that we can't directly merge the two DataFrames on their "County" columns: no county name is spelled identically in both frames, so the merge returns no rows.

In [16]:
# Merge both frames on the shared "County" column
merged = c_state.merge(right=c_population, on="County")
merged

||County|State|Population|
|---|---|---|---|
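
One way to see why the merge comes back empty is an outer merge with `indicator=True`, which labels every row by the frame it came from. The sketch below rebuilds the two small frames by hand, which is an assumption made so the example does not depend on the CSV files.

```python
import pandas as pd

# Hand-built copies of the two frames shown above (assumed to match the CSVs).
c_population = pd.DataFrame({
    "County": ["DeWitt", "Lac Qui Parle", "Lewis & Clark", "St. John the Baptist"],
    "Population": [16798, 8067, 55716, 43044],
})
c_state = pd.DataFrame({
    "County": ["De Witt County", "Lac qui Parle County",
               "Lewis and Clark County", "St John the Baptist Parish"],
    "State": ["IL", "MN", "MT", "LS"],
})

# An outer merge keeps unmatched rows; `_merge` records where each row came from.
diagnostic = c_state.merge(c_population, on="County", how="outer", indicator=True)
print(diagnostic[["County", "_merge"]])

# Count the keys that appear in both frames.
n_matched = (diagnostic["_merge"] == "both").sum()
print(n_matched)
```

Every row is tagged `left_only` or `right_only`, so no key is shared, which is consistent with the empty inner merge.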


To address this, we can **canonicalize** the "County" string data to apply a common formatting.

In [24]:
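def canonicalize_county(county_series):
    # regex=False treats each pattern as literal text, so "." means a period
    return (county_series.str.lower()
                         .str.replace(" ", "", regex=False)
                         .str.replace("&", "and", regex=False)
                         .str.replace(".", "", regex=False)
                         .str.replace("county", "", regex=False)
                         .str.replace("parish", "", regex=False)
    )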



Apply `canonicalize_county` to the "County" column in both frames.

In [27]:
# Code Here
c_population['County'] = canonicalize_county(c_population['County'])
c_state['County'] = canonicalize_county(c_state['County'])
print(c_population)
print(c_state)

             County  Population
0            dewitt       16798
1       lacquiparle        8067
2     lewisandclark       55716
3  stjohnthebaptist       43044
             County State
0            dewitt    IL
1       lacquiparle    MN
2     lewisandclark    MT
3  stjohnthebaptist    LS




Now, the merge works as expected!

In [28]:
merged = c_state.merge(right = c_population)
merged


||County|State|Population|
|---|---|---|---|
|0|dewitt|IL|16798|
|1|lacquiparle|MN|8067|
|2|lewisandclark|MT|55716|
|3|stjohnthebaptist|LS|43044|
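
Overwriting "County" with its canonical form loses the original spelling. A possible variation, sketched below, stores the canonical form in a separate key column and merges on that instead. The column name `county_key` and the hand-built frames are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

def canonicalize_county(county_series):
    # Literal (non-regex) replacements, applied in the same order as above.
    return (county_series.str.lower()
                         .str.replace(" ", "", regex=False)
                         .str.replace("&", "and", regex=False)
                         .str.replace(".", "", regex=False)
                         .str.replace("county", "", regex=False)
                         .str.replace("parish", "", regex=False))

# Hand-built copies of the two frames (assumed to match the CSVs).
c_population = pd.DataFrame({
    "County": ["DeWitt", "Lac Qui Parle", "Lewis & Clark", "St. John the Baptist"],
    "Population": [16798, 8067, 55716, 43044],
})
c_state = pd.DataFrame({
    "County": ["De Witt County", "Lac qui Parle County",
               "Lewis and Clark County", "St John the Baptist Parish"],
    "State": ["IL", "MN", "MT", "LS"],
})

# Keep the original names and add a hypothetical `county_key` column to join on.
c_population["county_key"] = canonicalize_county(c_population["County"])
c_state["county_key"] = canonicalize_county(c_state["County"])

merged = c_state.merge(c_population, on="county_key",
                       suffixes=("_state", "_population"))
print(merged[["County_state", "State", "Population"]])
```

The merged frame still matches all four rows, and the readable names remain available in `County_state` and `County_population`.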


## Demo 2: Extracting Log Data

Load `log.txt`.

In [119]:
# Read the log file line by line and split each line on whitespace
log_file_path = r'C:\Users\Muhammad_Talha\Downloads\COHORT 7\Week 1 + 2\Data Wrangling\log.txt'
with open(log_file_path, 'r') as file:
    for line in file:
        line = line.strip()
        data = line.split()
        print(data)

['169.237.46.168', '-', '-', '[26/Jan/2014:10:47:58', '-0800]', '"GET', '/stat141/Winter04/', 'HTTP/1.1"', '200', '2585', '"http://anson.ucdavis.edu/courses/"']
['193.205.203.3', '-', '-', '[2/Feb/2005:17:23:6', '-0800]', '"GET', '/stat141/Notes/dim.html', 'HTTP/1.0"', '404', '302', '"http://eeyore.ucdavis.edu/stat141/Notes/session.html"']
['169.237.46.240', '-', '""', '[3/Feb/2006:10:18:37', '-0800]', '"GET', '/stat141/homework/Solutions/hw1Sol.pdf', 'HTTP/1.1"']


Suppose we want to extract the day, month, year, hour, minutes, seconds, and timezone. Looking at the data, we see that these items are not in a fixed position relative to the beginning of the string. That is, slicing by some fixed offset isn't going to work.

Instead, we'll need to use some more sophisticated thinking. Let's focus on only the first line of the file.

In [47]:
# code here
with open(log_file_path, 'r') as file:
    first_line = file.readline().strip()
first_line

'169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/"'

Use Python string methods to extract the date, time, and timezone from the first entry in the log file.

In [50]:
# Find the text enclosed in square brackets
start_index = first_line.find('[') + 1
end_index = first_line.find(']')
date_time_str = first_line[start_index:end_index]  # '26/Jan/2014:10:47:58 -0800'

# Split off the timezone after the blank space
timestamp, timezone = date_time_str.split(' ')

# Split the date from the time at the first colon only
date_part, time_part = timestamp.split(':', 1)

# Split up the day/month/year
day, month, year = date_part.split('/')

# Split up the hour:minute:second
hour, minute, second = time_part.split(':')

# Print the result
print((day, month, year, hour, minute, second, timezone))

('26', 'Jan', '2014', '10', '47', '58', '-0800')


This worked, but it felt fairly "hacky": the code depends on the exact position of every separator and isn't particularly elegant. A more sophisticated and very common approach is to extract the information we need using a *regular expression*.


# Regular Expressions


## String Extraction with Regex

Python's `re.findall` returns a list of all matches of a pattern in a string. Here we use it to extract Social Security numbers from a given string.

In [52]:
import re

text = "My social security number is 123-45-6789 bro, or actually maybe it’s 321-45-6789."
reg = r'[0-9]{3}-[0-9]{2}-[0-9]{4}'
sec_number = re.findall(reg,text)
sec_number

['123-45-6789', '321-45-6789']
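
As a small variation on the cell above, the same pattern can be written with the `\d` shorthand and compiled once for reuse. This is a minimal sketch; the use of `re.compile` and `finditer` here is an illustrative choice, not something the notebook requires.

```python
import re

text = "My social security number is 123-45-6789 bro, or actually maybe it's 321-45-6789."

# \d is shorthand for a digit; for ASCII text it matches the same characters as [0-9].
pattern = re.compile(r"\d{3}-\d{2}-\d{4}")

# findall returns the matched substrings ...
matches = pattern.findall(text)
print(matches)  # ['123-45-6789', '321-45-6789']

# ... while finditer yields match objects that also carry each match's position.
spans = [(m.group(), m.start()) for m in pattern.finditer(text)]
print(spans)
```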

<br/>

Now, let's see vectorized extraction in `pandas`:

 `.str.findall` returns a `Series` of lists of all matches in each record.

In [81]:

data = ['987-65-4321','forty','123-45-6789 bro or 321-45-6789','999-99-9999']
df = pd.DataFrame(data,columns = ['SNN'])
df

||SNN|
|---|---|
|0|987-65-4321|
|1|forty|
|2|123-45-6789 bro or 321-45-6789|
|3|999-99-9999|


Find every SSN-formatted match in each entry of the DataFrame.

In [82]:
# -> Series of lists
reg = r'[0-9]{3}-[0-9]{2}-[0-9]{4}'
df_snn = df['SNN'].str.findall(reg)
df_snn

0                 [987-65-4321]
1                            []
2    [123-45-6789, 321-45-6789]
3                 [999-99-9999]
Name: SNN, dtype: object
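
If the goal is only to know which records contain a match, or how many, the list output can be reduced further. The sketch below rebuilds the same frame and uses `.str.contains` for a Boolean mask and `.str.len()` on the `findall` result for a per-record count.

```python
import pandas as pd

df = pd.DataFrame({"SNN": ["987-65-4321", "forty",
                           "123-45-6789 bro or 321-45-6789", "999-99-9999"]})
reg = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"

# Boolean mask: True where the record contains at least one match.
has_ssn = df["SNN"].str.contains(reg, regex=True)

# Number of matches per record, taken from the length of each findall list.
n_matches = df["SNN"].str.findall(reg).str.len()

print(df[has_ssn])
print(n_matches.tolist())  # [1, 0, 2, 1]
```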

## Extraction Using Regex Capture Groups

When the pattern contains parentheses, the Python function `re.findall` returns only the parenthesized substrings (i.e., **capture groups**) of each matched string, or **match**, as a tuple.

In [83]:
text = """I will meet you at 08:30:00 pm tomorrow"""

reg = r"(\d\d):(\d\d):(\d\d)"  # raw string, so \d reaches the regex engine unchanged
time = re.findall(reg, text)
time

[('08', '30', '00')]

In [84]:
# the three capture groups in the first matched string
hour, minute, second = time[0]
time[0]

('08', '30', '00')
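
Capture groups can also be named, which avoids relying on their position in the tuple. The following is a minimal sketch using `re.search` and `groupdict()`; the group names `hour`, `minute`, and `second` are chosen here for illustration.

```python
import re

text = "I will meet you at 08:30:00 pm tomorrow"

# (?P<name>...) gives each capture group a name instead of only a position.
pattern = r"(?P<hour>\d\d):(?P<minute>\d\d):(?P<second>\d\d)"

match = re.search(pattern, text)
if match:
    parts = match.groupdict()
    print(parts)           # {'hour': '08', 'minute': '30', 'second': '00'}
    print(match.group(0))  # the whole matched string: 08:30:00
```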

<br/>

In `pandas`, we can use `.str.extract` to extract each capture group of **only the first match** of each record into separate columns.

In [85]:
# back to SSNs
df_snn

0                 [987-65-4321]
1                            []
2    [123-45-6789, 321-45-6789]
3                 [999-99-9999]
Name: SNN, dtype: object

In [86]:
# Extract the capture groups of the first match in each record
reg_m = r'([0-9]{3})-([0-9]{2})-([0-9]{4})'
extracted = df["SNN"].str.extract(reg_m)
extracted

||0|1|2|
|---|---|---|---|
|0|987|65|4321|
|1|NaN|NaN|NaN|
|2|123|45|6789|
|3|999|99|9999|
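
With named capture groups, `.str.extract` uses the group names as column labels instead of `0`, `1`, `2`. The sketch below assumes the hypothetical names `area`, `group`, and `serial` for the three parts; the extracted values remain strings.

```python
import pandas as pd

df = pd.DataFrame({"SNN": ["987-65-4321", "forty",
                           "123-45-6789 bro or 321-45-6789", "999-99-9999"]})

# Named groups become column names in the result (names are illustrative).
reg_named = r"(?P<area>[0-9]{3})-(?P<group>[0-9]{2})-(?P<serial>[0-9]{4})"
extracted = df["SNN"].str.extract(reg_named)

print(extracted)
print(extracted.columns.tolist())  # ['area', 'group', 'serial']
```

Records without a match, such as `forty`, produce a row of missing values.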


Alternatively, `.str.extractall` extracts **all matches** of each record into separate columns. Rows are then MultiIndexed by original record index and match index.

In [101]:
# -> DataFrame, one row per match
reg_m = r'([0-9]{3})-([0-9]{2})-([0-9]{4})'

# Extract every match of the pattern, one capture group per column
extracted = df["SNN"].str.extractall(reg_m)

# Print the result
print(extracted)

           0   1     2
  match               
0 0      987  65  4321
2 0      123  45  6789
  1      321  45  6789
3 0      999  99  9999
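
The MultiIndex produced by `.str.extractall` can be queried by level. As a hedged sketch, the code below selects only the first match of each record with `xs` and counts matches per record with `groupby`; the frame is rebuilt by hand so the example runs on its own.

```python
import pandas as pd

df = pd.DataFrame({"SNN": ["987-65-4321", "forty",
                           "123-45-6789 bro or 321-45-6789", "999-99-9999"]})
reg_m = r"([0-9]{3})-([0-9]{2})-([0-9]{4})"

extracted = df["SNN"].str.extractall(reg_m)

# Rows where the "match" level equals 0, i.e. the first match of each record.
first_matches = extracted.xs(0, level="match")

# Number of matches per original record (records with no match are absent).
matches_per_record = extracted.groupby(level=0).size()

print(first_matches)
print(matches_per_record)
```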


## Canonicalization with Regex

In regular Python, canonicalize with `re.sub` (standing for "substitute"):

In [103]:
text = '<div><td valign="top">Moo</td></div>'

pattern = r'<[^>]+>'
re.sub(pattern,'',text)

'Moo'

<br/>

In `pandas`, canonicalize with `Series.str.replace`.

In [104]:
# Example strings containing HTML, stored in a DataFrame
html_strings = ['<div><td valign="top">Moo</td></div>',
                '<a href="http://ds100.org">Link</a>',
                '<b>Bold text</b>']
df = pd.DataFrame(html_strings, columns=['Html'])
df

||Html|
|---|---|
|0|`<div><td valign="top">Moo</td></div>`|
|1|`<a href="http://ds100.org">Link</a>`|
|2|`<b>Bold text</b>`|


In [107]:
# Series -> Series of lists
# Extract every word-like token; note that this also picks up tag and attribute names
pattern = r'\b\w+\b'
df['Words'] = df['Html'].str.findall(pattern)
df

||Html|Words|
|---|---|---|
|0|`<div><td valign="top">Moo</td></div>`|[div, td, valign, top, Moo, td, div]|
|1|`<a href="http://ds100.org">Link</a>`|[a, href, http, ds100, org, Link, a]|
|2|`<b>Bold text</b>`|[b, Bold, text, b]|
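
The word-extraction cell above keeps tag and attribute names. To canonicalize with `Series.str.replace`, as the text describes, one option is to reuse the tag pattern from the `re.sub` example and pass `regex=True`. This is a minimal sketch; the output column name `Text` is an assumption.

```python
import pandas as pd

df = pd.DataFrame({"Html": ['<div><td valign="top">Moo</td></div>',
                            '<a href="http://ds100.org">Link</a>',
                            '<b>Bold text</b>']})

# Same pattern as the re.sub example: "<", then anything except ">", then ">".
pattern = r"<[^>]+>"

# regex=True makes str.replace treat the pattern as a regular expression.
df["Text"] = df["Html"].str.replace(pattern, "", regex=True)

print(df["Text"].tolist())  # ['Moo', 'Link', 'Bold text']
```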



# Revisiting Text Log Processing using Regex

### Python `re` version

In [129]:
import re

log_file_path = r'C:\Users\Muhammad_Talha\Downloads\COHORT 7\Week 1 + 2\Data Wrangling\log.txt'

# Read the file and store each line in log_lines
with open(log_file_path, 'r') as file:
    log_lines = file.readlines()

# Display the first line
line = log_lines[0].strip()
print(line)

# Regular expression to extract date and time components
pattern = r'\[(\d{1,2})/(\w{3})/(\d{4}):(\d{1,2}):(\d{1,2}):(\d{1,2}) ([+-]\d{4})\]'

# Search for the pattern in the line
match = re.search(pattern, line)

if match:
    day, month, year, hour, minute, second, timezone = match.groups()
    print((day, month, year, hour, minute, second, timezone))
else:
    print("No match found")


169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/"
('26', 'Jan', '2014', '10', '47', '58', '-0800')
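
The extracted pieces are still strings. If an actual timestamp is needed, one possible follow-up is to capture the whole bracketed field and hand it to `datetime.strptime`. This sketch assumes an English locale, since `%b` parses month abbreviations such as `Jan` according to the current locale.

```python
import re
from datetime import datetime

line = ('169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] '
        '"GET /stat141/Winter04/ HTTP/1.1" 200 2585 '
        '"http://anson.ucdavis.edu/courses/"')

# Capture everything between the square brackets as a single group.
stamp = re.search(r"\[([^\]]+)\]", line).group(1)

# Parse day/month/year:hour:minute:second followed by a UTC offset.
parsed = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")

print(stamp)               # 26/Jan/2014:10:47:58 -0800
print(parsed.isoformat())  # 2014-01-26T10:47:58-08:00
```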


### `pandas` version

In [140]:
# code here
import pandas as pd
import re

# Read the log file into a one-column DataFrame.
# Note: read_csv splits on commas, so this relies on the log lines containing none.
log_file_path = r'C:\Users\Muhammad_Talha\Downloads\COHORT 7\Week 1 + 2\Data Wrangling\log.txt'
log_df = pd.read_csv(log_file_path, header=None, names=['log_line'])

# Regular expression to extract date and time components
pattern = re.compile(r'\[(\d{1,2})/(\w{3})/(\d{4}):(\d{1,2}):(\d{1,2}):(\d{1,2}) ([+-]\d{4})\]')

# Function to extract date and time components using regex
def extract_date_time(line):
    match = pattern.search(line)
    if match:
        return match.groups()
    else:
        return (None, None, None, None, None, None, None)

# Apply the function to extract date and time components
log_df['date_time_components'] = log_df['log_line'].apply(extract_date_time)

# Split the extracted components into separate columns
date_time_df = pd.DataFrame(log_df['date_time_components'].tolist(), columns=['day', 'month', 'year', 'hour', 'minute', 'second', 'timezone'])

# Concatenate the original log lines with the extracted date and time components
result_df = pd.concat([log_df, date_time_df], axis=1)

# Display the first row with extracted components
print(result_df.head(1))



                                            log_line  \
0  169.237.46.168 - - [26/Jan/2014:10:47:58 -0800...   

                 date_time_components day month  year hour minute second  \
0  (26, Jan, 2014, 10, 47, 58, -0800)  26   Jan  2014   10     47     58   

  timezone  
0    -0800  


Option 1: `Series.str.findall`

In [142]:
# Build a DataFrame with one log line per row
df = pd.DataFrame(log_lines, columns=['Log'])

# Regular expression to extract date and time components
pattern = re.compile(r'\[(\d{1,2})/(\w{3})/(\d{4}):(\d{1,2}):(\d{1,2}):(\d{1,2}) ([+-]\d{4})\]')

def extract_datetime_components(log_line):
    match = pattern.search(log_line)
    if match:
        return match.groups()
    else:
        return (None,) * 7

# Apply the function to the DataFrame
df['DateTimeComponents'] = df['Log'].apply(extract_datetime_components)

# Print the result
print(df['DateTimeComponents'])

0    (26, Jan, 2014, 10, 47, 58, -0800)
1      (2, Feb, 2005, 17, 23, 6, -0800)
2     (3, Feb, 2006, 10, 18, 37, -0800)
Name: DateTimeComponents, dtype: object
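
The cell above applies a Python function row by row. The vectorized `Series.str.findall` named in the heading can be sketched as follows. The three log lines are reconstructed from the split output shown earlier, assuming single spaces between fields.

```python
import pandas as pd

# Reconstructed log lines (assumption: fields are separated by single spaces).
log_lines = [
    '169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/"',
    '193.205.203.3 - - [2/Feb/2005:17:23:6 -0800] "GET /stat141/Notes/dim.html HTTP/1.0" 404 302 "http://eeyore.ucdavis.edu/stat141/Notes/session.html"',
    '169.237.46.240 - "" [3/Feb/2006:10:18:37 -0800] "GET /stat141/homework/Solutions/hw1Sol.pdf HTTP/1.1"',
]
df = pd.DataFrame({"Log": log_lines})

pattern = r"\[(\d{1,2})/(\w{3})/(\d{4}):(\d{1,2}):(\d{1,2}):(\d{1,2}) ([+-]\d{4})\]"

# Each record becomes a list of matches; each match is a tuple of capture groups.
found = df["Log"].str.findall(pattern)
print(found)
```

Because each log line contains one timestamp, every list holds a single seven-element tuple, which still needs unpacking into columns.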


<br/>

Option 2: `Series.str.extractall`

In [143]:
extracted_data = df['Log'].str.extractall(pattern)

# Wrangle the extracted data into a nice format
extracted_data.columns = ['Day', 'Month', 'Year', 'Hour', 'Minute', 'Second', 'Time Zone']
extracted_data = extracted_data.droplevel(1).reset_index(drop=True)

# Display the result
extracted_data

Wrangling either of these two DataFrames into a nice format (like below) is left as an exercise for you!


||Day|Month|Year|Hour|Minute|Second|Time Zone|
|---|---|---|---|---|---|---|---|
|0|26|Jan|2014|10|47|58|-0800|
|1|2|Feb|2005|17|23|6|-0800|
|2|3|Feb|2006|10|18|37|-0800|


In [None]:
# your code here
...