# Lesson 08: Python String Methods

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns  # needed for sns.catplot in the restaurant example
#import zipfile

## Canonicalization with Basic Python

In [2]:
county_and_state = pd.read_csv('data/county_and_state.csv')
county_and_pop = pd.read_csv('data/county_and_population.csv')

Suppose we'd like to join these two tables. Unfortunately, we can't, because the strings representing the county names don't match, as seen below.

In [3]:
county_and_state

Unnamed: 0,County,State
0,De Witt County,IL
1,Lac qui Parle County,MN
2,Lewis and Clark County,MT
3,St John the Baptist Parish,LS


In [4]:
county_and_pop

Unnamed: 0,County,Population
0,DeWitt,16798
1,Lac Qui Parle,8067
2,Lewis & Clark,55716
3,St. John the Baptist,43044


Before we can join them, we'll do what I call **canonicalization**.

[Canonicalization](https://en.wikipedia.org/wiki/Canonicalization): A process for converting data that has more than one possible representation into a "standard", "normal", or canonical form (definition via Wikipedia).
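
The cells below call `canonicalize_county`, which is not defined anywhere in this section. Here is a minimal sketch of what such a helper might look like. The specific rules (lowercase, drop spaces and periods, spell out `&`, remove the words "county" and "parish") are an assumption chosen to reconcile the four rows shown above, not necessarily the version used in lecture.

```python
def canonicalize_county(county_name):
    """Reduce a county name to a bare lowercase key (illustrative rules only)."""
    return (
        county_name
        .lower()                # ignore capitalization
        .replace(' ', '')       # ignore spacing ("De Witt" vs "DeWitt")
        .replace('&', 'and')    # "Lewis & Clark" vs "Lewis and Clark"
        .replace('.', '')       # "St." vs "St"
        .replace('county', '')  # drop the suffix used in one table only
        .replace('parish', '')
    )

# The two spellings of each county from the tables above.
pairs = [
    ('De Witt County', 'DeWitt'),
    ('Lac qui Parle County', 'Lac Qui Parle'),
    ('Lewis and Clark County', 'Lewis & Clark'),
    ('St John the Baptist Parish', 'St. John the Baptist'),
]
for left, right in pairs:
    print(canonicalize_county(left), canonicalize_county(right))
```

Because these are plain substring replacements, a name that contains "county" or "parish" in the middle would lose that text too, so rules like these should be checked against the full data before use.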

In [None]:
county_and_pop['clean_county'] = county_and_pop['County'].map(canonicalize_county)
county_and_state['clean_county'] = county_and_state['County'].map(canonicalize_county)

display(county_and_pop)  # display outputs even if not last line in cell - like a fancy print()
county_and_state

In [None]:
county_and_pop.merge(county_and_state, on='clean_county')

## Processing Data from a Text Log Using Basic Python

In [None]:
log_fname = 'data/log.txt'
!cat {log_fname}

In [None]:
with open(log_fname, 'r') as f:
    log_lines = f.readlines()

In [None]:
log_lines

Suppose we want to extract the day, month, year, hour, minutes, seconds, and timezone. Looking at the data, we see that these items are not in a fixed position relative to the beginning of the string. That is, slicing by some fixed offset isn't going to work.

In [None]:
log_lines[0][20:31]

In [None]:
log_lines[1][20:31]

Instead, we'll need to use some more sophisticated thinking. Let's focus on only the first line of the file.

In [None]:
first = log_lines[0]
first

In [None]:
pertinent = first.split("[")[1].split(']')[0]
day, month, rest = pertinent.split('/')
year, hour, minute, rest = rest.split(':')
seconds, time_zone = rest.split(' ')
day, month, year, hour, minute, seconds, time_zone
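
To try the same chain of `split` calls without the data file, the sketch below wraps it in a function and runs it on one hard-coded line. The sample line and the name `parse_timestamp` are assumptions for illustration; the line only imitates the bracketed timestamp layout that the code above expects.

```python
# Hypothetical log line with a bracketed timestamp, standing in for data/log.txt.
sample = ('169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] '
          '"GET /stat141/Winter04/ HTTP/1.1" 200 2585\n')

def parse_timestamp(line):
    """Pull the timestamp fields out of one log line using only str.split."""
    # Keep only the text between the first '[' and the following ']'.
    pertinent = line.split('[')[1].split(']')[0]
    day, month, rest = pertinent.split('/')
    year, hour, minute, rest = rest.split(':')
    seconds, time_zone = rest.split(' ')
    return day, month, year, hour, minute, seconds, time_zone

fields = parse_timestamp(sample)
print(fields)
```

Every field comes back as a string, and the tuple unpacking raises `ValueError` as soon as a line deviates from the expected layout, which is one reason this approach is brittle.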

A much more sophisticated but common approach is to extract the information we need using a regular expression. See [today's lecture slides](https://ds100.org/sp22/lecture/lec06/) (Spring 2022) for more on regular expressions.

<br/><br/><br/>

---
## Regular Expressions

In [None]:
import re

### Canonicalization with Regex

Python `re.sub`

In [None]:
text = '<div><td valign="top">Moo</td></div>'
pattern = r"<[^>]+>"
re.sub(pattern, '', text)

<br/>

`pandas`: `Series.str.replace`

In [None]:
df_html = pd.DataFrame(['<div><td valign="top">Moo</td></div>',
                   '<a href="http://ds100.org">Link</a>',
                   '<b>Bold text</b>'], columns=['Html'])
df_html

In [None]:
# Series -> Series
df_html["Html"].str.replace(pattern, '', regex=True).to_frame()

---

### Extraction with Regex

Python `re.findall`

In [None]:
text = "My social security number is 123-45-6789 bro, or actually maybe it’s 321-45-6789."
pattern = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"
re.findall(pattern, text)  # ['123-45-6789', '321-45-6789']

Regex Groups

In [None]:
text = """Observations: 03:04:53 - Horse awakens.
03:05:14 - Horse goes back to sleep."""       
pattern = r"(\d\d):(\d\d):(\d\d) - (.*)"
re.findall(pattern, text)
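
What `re.findall` returns depends on how many capture groups the pattern has. The short sketch below runs three variants of the pattern on the same horse text (shortened here for brevity) to make the difference visible.

```python
import re

text = "03:04:53 - Horse awakens.\n03:05:14 - Horse goes back to sleep."

# No groups: a list of the full matched strings.
no_groups = re.findall(r"\d\d:\d\d:\d\d", text)

# One group: a list of just that group's text for each match.
one_group = re.findall(r"(\d\d):\d\d:\d\d", text)

# Several groups: a list of tuples, one tuple per match.
many_groups = re.findall(r"(\d\d):(\d\d):(\d\d) - (.*)", text)

print(no_groups)
print(one_group)
print(many_groups)
```

Note that `.` does not match a newline by default, so the final `(.*)` group stops at the end of each line.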

<br/>

`pandas`

In [None]:
df_ssn = pd.DataFrame(
    ['987-65-4321',
     'forty',
     '123-45-6789 bro or 321-45-6789',
     '999-99-9999'],
    columns=['SSN'])
df_ssn

1. `Series.str.findall`

In [None]:
# -> Series of lists
pattern = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"
df_ssn['SSN'].str.findall(pattern)

2. `Series.str.extract`

In [None]:
# -> DataFrame of first match group
pattern_group = r"([0-9]{3}-[0-9]{2}-[0-9]{4})" # 1 group
df_ssn['SSN'].str.extract(pattern_group)

In [None]:
# Will extract first match of all groups
pattern_group_mult = r"([0-9]{3})-([0-9]{2})-([0-9]{4})" # 3 groups
df_ssn['SSN'].str.extract(pattern_group_mult)

3. `Series.str.extractall`

In [None]:
# -> DataFrame, one row per match
df_ssn['SSN'].str.extractall(pattern_group_mult)

In [None]:
# original dataframe
df_ssn
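
To compare `extract` and `extractall` side by side, the sketch below rebuilds the same four strings as a standalone Series (the variable names are illustrative) and inspects the index of each result.

```python
import pandas as pd

ssn = pd.Series(['987-65-4321',
                 'forty',
                 '123-45-6789 bro or 321-45-6789',
                 '999-99-9999'])
pattern = r"([0-9]{3})-([0-9]{2})-([0-9]{4})"

# extract: one row per input row, first match only, NaN where nothing matched.
first_only = ssn.str.extract(pattern)

# extractall: one row per match, indexed by (original row, match number).
every_match = ssn.str.extractall(pattern)

print(first_only)
print(every_match)
print(list(every_match.index))
```

Row 1 (`'forty'`) appears in `first_only` as a row of NaN but is absent from `every_match`, while row 2 contributes two rows to `every_match`.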

## Revisiting Text Log Processing using Regex

Python version:

In [None]:
line = log_lines[0]
display(line)

pattern = r'\[(\d+)\/(\w+)\/(\d+):(\d+):(\d+):(\d+) (.+)\]'
day, month, year, hour, minute, second, time_zone = re.findall(pattern, line)[0] # get first match
day, month, year, hour, minute, second, time_zone

### Regular expressions can be compiled and used as an object

In [None]:
rx = re.compile(pattern)
rx

In [None]:
rx.search(line)

In [None]:
out = rx.search(line)
out.group(0)

This lets you write conditional code more easily:

In [None]:
inputs = [line, "blah blah blah"]
for s in inputs:
    out = rx.search(s)  # returns a match object, or None if nothing matched
    if out:
        print(out.group(0))
    else:
        print(f'*** No match for: {s[0:5]} ...')

In [None]:
# beyond the scope of lecture, but left here for your interest
day, month, year, hour, minute, second, time_zone = re.search(pattern, line).groups()
day, month, year, hour, minute, second, time_zone

<br/><br/>

### Pandas version

In [None]:
df = pd.DataFrame(log_lines, columns=['Log'])
df

Option 1: `Series.str.findall`

In [None]:
pattern = r'\[(\d+)\/(\w+)\/(\d+):(\d+):(\d+):(\d+) (.+)\]'
df['Log'].str.findall(pattern)

<br/>

Option 2: `Series.str.extractall`

In [None]:
df['Log'].str.extractall(pattern)

Wrangling either of these two results (a Series of lists from `findall`, or a DataFrame from `extractall`) into a nice format (like below) is left as an exercise for you! You will do a related problem on the homework.


||Day|Month|Year|Hour|Minute|Second|Time Zone|
|---|---|---|---|---|---|---|---|
|0|26|Jan|2014|10|47|58|-0800|
|1|2|Feb|2005|17|23|6|-0800|
|2|3|Feb|2006|10|18|37|-0800|


In [None]:
# your code here
...

<br/><br/>
<br/>

---

## Real World Example #1: Restaurant Data

In this example, we'll use regexes to define categories based on whether certain keywords appear in a text field, and then see how quantitative data varies across those categories:

> **How do restaurant health scores vary as a function of the number of violations that mention a particular keyword?** 
> <br/>
> (e.g., unclean surfaces, vermin, permits, etc.)

In [None]:
vio = pd.read_csv('data/violations.csv', header=0, names=['bid', 'date', 'desc'])
desc = vio['desc']
vio.head()

In [None]:
counts = desc.value_counts()
counts.shape

That's a lot of different descriptions!! Can we **canonicalize** at all? Let's explore two sets of 10 rows.

In [None]:
counts[:10]

In [None]:
# Hmmm...
counts[50:60]

In [None]:
# Use regular expressions to cut out the extra info in square brackets.
vio['clean_desc'] = (vio['desc']
             .str.replace(r'\s*\[.*\]$', '', regex=True)
             .str.strip()       # removes leading/trailing whitespace
             .str.lower())
vio.head()
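
The pattern `\s*\[.*\]$` matches optional whitespace, then a bracketed span that runs to the end of the string. Here is a small sketch on made-up descriptions (the strings are hypothetical, written to resemble the bracketed suffix being removed above):

```python
import pandas as pd

# Hypothetical violation descriptions; two carry a bracketed suffix.
desc = pd.Series([
    'High risk food holding temperature   [ date violation corrected: 7/31/2015 ]',
    'Unclean or degraded floors walls or ceilings  [ date violation corrected: 9/9/2015 ]',
    'Unclean or degraded floors walls or ceilings',
])

clean = (desc
         .str.replace(r'\s*\[.*\]$', '', regex=True)  # drop " [ ... ]" at the end
         .str.strip()
         .str.lower())

print(clean.tolist())
print(desc.nunique(), '->', clean.nunique())
```

After cleaning, the second and third descriptions collapse into the same string, which is why the number of distinct descriptions drops.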

In [None]:
# canonicalizing definitely helped
vio['clean_desc'].value_counts().shape

In [None]:
vio['clean_desc'].value_counts().head() 

Remember our research question:

> **How do restaurant health scores vary as a function of the number of violations that mention a particular keyword?** 
> <br/>
> (e.g., unclean surfaces, vermin, permits, etc.)

<br/>

Below, we use regular expressions and `df.assign()` ([documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.assign.html?highlight=assign#pandas.DataFrame.assign)) to **method chain** our creation of new boolean features, one per keyword.

In [None]:
# use regular expressions to assign new features for the presence of various keywords
# regex metacharacter | 
with_features = (vio
 .assign(is_unclean     = vio['clean_desc'].str.contains('clean|sanit'))
 .assign(is_high_risk = vio['clean_desc'].str.contains('high risk'))
 .assign(is_vermin    = vio['clean_desc'].str.contains('vermin'))
 .assign(is_surface   = vio['clean_desc'].str.contains('wall|ceiling|floor|surface'))
 .assign(is_human     = vio['clean_desc'].str.contains('hand|glove|hair|nail'))
 .assign(is_permit    = vio['clean_desc'].str.contains('permit|certif'))
)
with_features.head()
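
To see how the `|` alternation behaves inside `str.contains`, here is a small self-contained sketch on made-up descriptions (the strings and the three-feature subset are assumptions for illustration):

```python
import pandas as pd

# Hypothetical cleaned descriptions.
toy = pd.DataFrame({'clean_desc': [
    'unclean or degraded floors walls or ceilings',
    'inadequate and inaccessible handwashing facilities',
    'high risk vermin infestation',
    'unapproved or unmaintained equipment or utensils',
]})

# One boolean column per keyword group; 'a|b' is True if either substring appears.
features = toy.assign(
    is_surface=lambda d: d['clean_desc'].str.contains('wall|ceiling|floor|surface'),
    is_human=lambda d: d['clean_desc'].str.contains('hand|glove|hair|nail'),
    is_vermin=lambda d: d['clean_desc'].str.contains('vermin'),
)
print(features)
```

These are substring matches, not whole-word matches: `'hand'` matches inside `'handwashing'`, and by the same logic `'nail'` would also match a word like `'snail'`.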

<br/><br/>

### EDA

That's the end of our text wrangling. Now let's analyze restaurant health as a function of the number of violation keywords.

To do so, we'll first group so that our **granularity** is one inspection for a business on a particular date. Summing the boolean keyword columns within each group counts the number of violations per keyword for a given inspection.

In [None]:
count_features = (with_features
 .groupby(['bid', 'date'])
 .sum(numeric_only=True)  # sum only the boolean keyword flags, not the text columns
 .reset_index()
)
count_features.iloc[255:260, :]
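
Summing boolean columns within a group counts the `True` values, since `True` behaves as 1 and `False` as 0. A toy sketch with made-up business IDs and dates:

```python
import pandas as pd

# Hypothetical rows: one row per violation, flags already computed.
toy = pd.DataFrame({
    'bid':        [19, 19, 19, 24],
    'date':       [20160513, 20160513, 20171211, 20160311],
    'is_vermin':  [True, True, False, False],
    'is_surface': [False, True, True, False],
})

# One row per (bid, date) inspection; each flag column becomes a count.
counts = toy.groupby(['bid', 'date']).sum().reset_index()
print(counts)
```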

Check out our new dataframe in action:

In [None]:
count_features.query('is_vermin > 1').head(5)

Now we'll reshape this "wide" table into a "tidy" table using a pandas feature called `pd.melt` ([documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.melt.html?highlight=pd%20melt)) which we won't describe in any detail, other than that it's effectively the inverse of `pd.pivot_table`.

Our **granularity** is now a violation type for a given inspection (for a business on a particular date).

In [None]:
broken_down_by_violation_type = pd.melt(count_features, id_vars=['bid', 'date'],
            var_name='feature', value_name='num_vios')

# show a particular inspection's results
broken_down_by_violation_type.query('bid == 489 & date == 20150728')
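
As a quick illustration of what `pd.melt` does, the sketch below reshapes a two-row wide table (made-up values) into the tidy layout used above: the identifier columns are repeated, and each remaining column becomes a block of (`feature`, `num_vios`) rows.

```python
import pandas as pd

# Hypothetical wide table: one row per inspection, one column per keyword count.
wide = pd.DataFrame({
    'bid':        [19, 24],
    'date':       [20160513, 20160311],
    'is_vermin':  [2, 0],
    'is_surface': [1, 0],
})

tidy = pd.melt(wide, id_vars=['bid', 'date'],
               var_name='feature', value_name='num_vios')
print(tidy)
```

Two inspections times two keyword columns gives four rows in the tidy table.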

Remember our research question:

> **How do restaurant health scores vary as a function of the number of violations that mention a particular keyword?** 
> <br/>
> (e.g., unclean surfaces, vermin, permits, etc.)

<br/>

We have the second half of this question! Now let's **join** our table with the inspection scores, located in `inspections.csv`.

In [None]:
# read in the scores
ins = pd.read_csv('data/inspections.csv',
                  header=0,
                  usecols=[0, 1, 2],
                  names=['bid', 'score', 'date'])
ins.head()

While the inspection scores were stored in a separate file from the violation descriptions, we notice that the **primary key** in inspections is (`bid`, `date`)! So we can reference this key in our join.

In [None]:
# join scores with the table broken down by violation type
violation_type_and_scores = (
    broken_down_by_violation_type
    .merge(ins, on=['bid', 'date'])
)
violation_type_and_scores.head(12)
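
A small sketch of the same kind of two-column join on made-up data. Passing `validate='many_to_one'` is an optional extra, not part of the cell above: it asks pandas to raise an error if (`bid`, `date`) is not unique in the right-hand table, which is a cheap way to check the primary-key assumption.

```python
import pandas as pd

# Hypothetical tidy violation counts (several rows per inspection).
tidy = pd.DataFrame({
    'bid':      [19, 19, 24],
    'date':     [20160513, 20160513, 20160311],
    'feature':  ['is_vermin', 'is_surface', 'is_vermin'],
    'num_vios': [2, 1, 0],
})

# Hypothetical inspection scores (one row per inspection).
scores = pd.DataFrame({
    'bid':   [19, 24, 31],
    'score': [94, 98, 90],
    'date':  [20160513, 20160311, 20151204],
})

joined = tidy.merge(scores, on=['bid', 'date'], validate='many_to_one')
print(joined)
```

The default inner join keeps only inspections present in both tables, so business 31 does not appear in the result.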

<br/><br/>

---

Let's plot the distribution of scores, broken down by violation counts, for each inspection feature (`is_unclean`, `is_high_risk`, `is_vermin`, `is_surface`, `is_human`, `is_permit`).

In [None]:
# you will learn this syntax next week. Focus on interpreting for now.
sns.catplot(x='num_vios', y='score',
               col='feature', col_wrap=2,
               kind='box',
               data=violation_type_and_scores);

Above we can observe:
* The inspection score generally goes down with increasing numbers of violations, as expected.
* Depending on the violation keyword, inspection scores on average go down at slightly different rates.
* For example, if a restaurant inspection involved 2 violations with the keyword "vermin", the average score for that inspection would be a little bit below 80.

## Bonus Content: Using pd.to_datetime to Extract Time Information

Date parsing using `pd.to_datetime`.

In [None]:
pd.Series(log_lines).str.extract(r'\[(.*) -0800\]').apply(
    lambda s: pd.to_datetime(s, format='%d/%b/%Y:%H:%M:%S'))
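
Once the strings are parsed into datetimes, the individual fields are available through the `.dt` accessor. A minimal sketch, assuming timestamps in the same `day/month/year:hour:minute:second` layout with the timezone already removed (the two sample strings are taken from the target table shown earlier):

```python
import pandas as pd

stamps = pd.Series(['26/Jan/2014:10:47:58', '02/Feb/2005:17:23:06'])

# %b is the abbreviated month name, so 'Jan' and 'Feb' parse directly.
parsed = pd.to_datetime(stamps, format='%d/%b/%Y:%H:%M:%S')

parts = pd.DataFrame({
    'Day':    parsed.dt.day,
    'Month':  parsed.dt.month,
    'Year':   parsed.dt.year,
    'Hour':   parsed.dt.hour,
    'Minute': parsed.dt.minute,
    'Second': parsed.dt.second,
})
print(parts)
```

Unlike the regex approach, this returns integers rather than strings, and the month comes back as a number instead of an abbreviation.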