# Lesson 08: Python String Methods

In [None]:
import pandas as pd
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt

## 1. Canonicalization with Basic Python

In [None]:
county_and_state = pd.read_csv('data/county_and_state.csv')
county_and_pop = pd.read_csv('data/county_and_population.csv')

Suppose we'd like to join these two tables. Unfortunately, we can't, because the strings representing the county names don't match, as seen below.

In [None]:
county_and_state

In [None]:
county_and_state.County.str.lower().str.replace('county','').str.replace('parish',"").str.replace(" ","")

In [None]:
county_and_pop.County.str.lower().str.replace('.','', regex=False).str.replace('&',"and").str.replace(' ',"")

Before we can join them, we'll do what I call **canonicalization**.

[Canonicalization](https://en.wikipedia.org/wiki/Canonicalization): A process for converting data that has more than one possible representation into a "standard", "normal", or canonical form (definition via Wikipedia).

In [None]:
def can(county_name):
    return (
        county_name
        .lower()
        .replace(' ','')
        .replace('&','and')
        .replace('county','')
        .replace('parish', '')
        .replace('.', '')
    )
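
As a quick sanity check, the sketch below applies the same chain of string methods to a few made-up county names. The names in `variants` are illustrative assumptions, not values read from the CSV files. The point is only that differently spelled versions of a name collapse to one canonical string.

```python
def can(county_name):
    """Canonicalize a county name using the same steps as above."""
    return (
        county_name
        .lower()
        .replace(' ', '')
        .replace('&', 'and')
        .replace('county', '')
        .replace('parish', '')
        .replace('.', '')
    )

# Hypothetical spellings, chosen only for illustration
variants = [
    'De Witt County',
    'DeWitt',
    'St. John the Baptist Parish',
    'St John the Baptist',
]
canonical = [can(name) for name in variants]
print(canonical)
```

Each pair of spellings maps to the same string, which is what makes the later join on `clean_county` possible.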

In [None]:
county_and_pop['clean_county'] = county_and_pop.County.apply(can)
county_and_pop

In [None]:
county_and_state['clean_county'] = county_and_state.County.apply(can)
county_and_state

In [None]:
# Display output even if it is not the last line in the cell
# Similar to a fancy print()
display(county_and_pop)
display(county_and_state)

county_and_pop.merge(county_and_state, on='clean_county')

Now let's merge.

## 2. Processing Data from a Text Log Using Basic Python

In [None]:
log_fname = 'data/log.txt'
!cat {log_fname}

In [None]:
with open(log_fname, 'r') as f:
    log_lines = f.readlines()

In [None]:
log_lines

In [None]:
type(log_lines)

Suppose we want to extract the day, month, year, hour, minutes, seconds, and timezone. Looking at the data, we see that these items are not in a fixed position relative to the beginning of the string. That is, slicing by some fixed offset isn't going to work.

In [None]:
log_lines[0][20:31]

In [None]:
log_lines[1][20:31]

Instead, we'll need to use some more sophisticated thinking. Let's focus on only the first line of the file.

In [None]:
first = log_lines[0]
first
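
One way to avoid fixed offsets with basic string methods is to split on the characters that delimit the timestamp. The sketch below assumes the timestamp sits between square brackets in the common web-server log style. The `line` value is a made-up example, since the actual contents of `data/log.txt` are not shown here.

```python
# Hypothetical log line in an assumed format; not taken from data/log.txt
line = '192.0.2.10 - - [03/Feb/2021:14:05:09 -0800] "GET /index.html HTTP/1.1" 200 512\n'

# Keep only the text between '[' and ']'
timestamp = line.split('[')[1].split(']')[0]

# Peel the fields off one delimiter at a time
day, month, rest = timestamp.split('/')
year, hour, minute, rest = rest.split(':')
second, timezone = rest.split(' ')

print(day, month, year, hour, minute, second, timezone)
```

Because every step keys off a delimiter rather than a position, the same code works no matter how long the text before the opening bracket is.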

In [None]:
df = pd.DataFrame(log_lines, columns=['Log'])
df

Option 1: `Series.str.findall`
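
A minimal sketch of this option follows. It assumes the log lines carry a bracketed timestamp of the form `[day/month/year:hour:minute:second timezone]`. Both the sample line and the `pattern` are illustrative assumptions rather than the notebook's own code.

```python
import pandas as pd

# Hypothetical one-row stand-in for the log DataFrame
toy_log = pd.DataFrame(
    {'Log': ['192.0.2.10 - - [03/Feb/2021:14:05:09 -0800] "GET /index.html HTTP/1.1" 200 512\n']}
)

# One capture group per field we want to extract (assumed timestamp layout)
pattern = r'\[(\d+)/(\w+)/(\d+):(\d+):(\d+):(\d+) (.+)\]'

# findall returns, for each row, a list of tuples (one tuple per match)
matches = toy_log['Log'].str.findall(pattern)
print(matches.iloc[0])
```

With several capture groups, each match is a tuple of the captured fields, so each row of the result holds a list of tuples.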

### Restaurant Data

In this example, we will show how regexes let us track quantitative data across categories defined by the presence of certain keywords in text fields. Specifically, we'll ask:

> **How do restaurant health scores vary as a function of the number of violations that mention a particular keyword?** 
> <br/>
> (e.g., unclean surfaces, vermin, permits, etc.)

In [None]:
vio = pd.read_csv('data/violations.csv', header=0, names=['bid', 'date', 'desc'])
desc = vio['desc']
vio.head()

In [None]:
vio['desc'].value_counts()

In [None]:
counts = vio['desc']

In [None]:
type(counts)

That's a lot of different descriptions! Can we **canonicalize** them at all? Let's explore two sets of 10 rows.

In [None]:
counts[:10]

In [None]:
counts[50:60]

In [None]:
counts[50]

In [None]:
counts[0]

In [None]:
counts[0].split("[")[0].strip().lower()

In [None]:
counts[50].split("[")[0].strip().lower()

In [None]:
def can_desc(description):
    return description.split("[")[0].strip().lower()

In [None]:
clean_counts = vio['desc'].apply(can_desc)
clean_counts.value_counts()

In [None]:
clean_counts.str.contains('clean|sanit')

Did canonicalizing help?

In [None]:
vio['clean_desc'] = vio['desc'].apply(can_desc)

In [None]:
vio

Remember our research question:

> **How do restaurant health scores vary as a function of the number of violations that mention a particular keyword?** 
> <br/>
> (e.g., unclean surfaces, vermin, permits, etc.)

<br/>

Below, we use `df.assign()` ([documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.assign.html?highlight=assign#pandas.DataFrame.assign)) to **method chain** our creation of new boolean features, one per keyword.

In [None]:
with_features = (vio
 .assign(is_unclean   = vio['clean_desc'].str.contains('clean|sanit'))
 .assign(is_high_risk = vio['clean_desc'].str.contains('high risk'))
 .assign(is_vermin    = vio['clean_desc'].str.contains('vermin'))
 .assign(is_surface   = vio['clean_desc'].str.contains('wall|ceiling|floor|surface'))
 .assign(is_human     = vio['clean_desc'].str.contains('hand|glove|hair|nail'))
 .assign(is_permit    = vio['clean_desc'].str.contains('permit|certif'))
)
with_features.head()

## EDA

That's the end of our text wrangling. Now let's analyze restaurant health as a function of the number of violation keywords.

To do so, we'll first group so that our **granularity** is one inspection for a business on a particular date. Summing the boolean features within each group effectively counts the number of violations by keyword for a given inspection.
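
The grouping step might look like the sketch below. The `toy` table is a hypothetical stand-in for `with_features` that keeps only two keyword columns, and `count_features` is an assumed name for the result.

```python
import pandas as pd

# Hypothetical stand-in for with_features (values are made up)
toy = pd.DataFrame({
    'bid': [1, 1, 1, 2],
    'date': ['20160101', '20160101', '20170101', '20160101'],
    'is_unclean': [True, True, False, False],
    'is_vermin': [False, True, False, True],
})

# Summing booleans counts the True values within each (bid, date) group
count_features = toy.groupby(['bid', 'date']).sum().reset_index()
print(count_features)
```

On the full table, it would be reasonable to select only the boolean feature columns before summing, so that the text columns are left out of the aggregation.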

Check out our new dataframe in action:

Now we'll reshape this "wide" table into a "tidy" table using a pandas feature called `pd.melt` ([documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.melt.html?highlight=pd%20melt)) which we won't describe in any detail, other than that it's effectively the inverse of `pd.pivot_table`.
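
As a rough sketch of the reshape, the example below melts a small wide table of keyword counts. The data values and the output column names `feature` and `num_vios` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical wide table: one row per inspection, one column per keyword count
wide = pd.DataFrame({
    'bid': [1, 2],
    'date': ['20160101', '20160101'],
    'is_unclean': [2, 0],
    'is_vermin': [1, 1],
})

# Keep the key columns fixed and stack the keyword columns into name/value pairs
tidy = wide.melt(id_vars=['bid', 'date'],
                 var_name='feature',
                 value_name='num_vios')
print(tidy)
```

Each of the two inspections now contributes one row per keyword, so the two-row wide table becomes a four-row tidy table.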

Our **granularity** is now a violation type for a given inspection (for a business on a particular date).

Remember our research question:

> **How do restaurant health scores vary as a function of the number of violations that mention a particular keyword?** 
> <br/>
> (e.g., unclean surfaces, vermin, permits, etc.)

<br/>

We have the second half of this question! Now let's **join** our table with the inspection scores, located in `inspections.csv`.

In [None]:
ins = pd.read_csv('data/inspections.csv',
                  header=0,
                  usecols=[0, 1, 2],
                  names=['bid', 'score', 'date'])
ins.head()

While the inspection scores were stored in a separate file from the violation descriptions, we notice that the **primary key** in inspections is (`bid`, `date`). So we can reference this key in our join.
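
A join on this composite key can be sketched as follows. Both small tables are hypothetical stand-ins (for the tidy violation counts and for `ins`), and the choice of a left join is an assumption that keeps every violation row even when no score matches.

```python
import pandas as pd

# Hypothetical tidy violation counts
tidy_toy = pd.DataFrame({
    'bid': [1, 1, 2],
    'date': [20160101, 20160101, 20160101],
    'feature': ['is_unclean', 'is_vermin', 'is_vermin'],
    'num_vios': [2, 1, 1],
})

# Hypothetical inspection scores, keyed by (bid, date)
ins_toy = pd.DataFrame({
    'bid': [1, 2],
    'score': [90, 75],
    'date': [20160101, 20160101],
})

# Passing a list to `on` matches rows on both key columns at once
joined = tidy_toy.merge(ins_toy, on=['bid', 'date'], how='left')
print(joined)
```

Both key columns need matching dtypes in the two tables. Otherwise the merge may fail or find no matches.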