# Lecture 16 – Text Wrangling and Regex

### DATA 2201, Fall 2024


In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import zipfile
%matplotlib inline

In [2]:
with open("data/log.txt", 'r') as f:
    for line in f:
        print(line)

169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/"

193.205.203.3 - - [2/Feb/2005:17:23:6 -0800] "GET /stat141/Notes/dim.html HTTP/1.0" 404 302 "http://eeyore.ucdavis.edu/stat141/Notes/session.html"

169.237.46.240 - "" [3/Feb/2006:10:18:37 -0800] "GET /stat141/homework/Solutions/hw1Sol.pdf HTTP/1.1"



## Demo 1: Canonicalization with Basic Python

In [3]:
with open('data/county_and_state.csv') as f:
    county_and_state = pd.read_csv(f)
    
with open('data/county_and_population.csv') as f:
    county_and_pop = pd.read_csv(f)    

Suppose we'd like to join these two tables. Unfortunately, we can't, because the strings representing the county names don't match, as seen below.

In [4]:
county_and_state

Unnamed: 0,County,State
0,De Witt County,IL
1,Lac qui Parle County,MN
2,Lewis and Clark County,MT
3,St John the Baptist Parish,LS


In [5]:
county_and_pop

Unnamed: 0,County,Population
0,DeWitt,16798
1,Lac Qui Parle,8067
2,Lewis & Clark,55716
3,St. John the Baptist,43044


Before we can join them, we'll do what I call **canonicalization**.

Canonicalization: A process for converting data that has more than one possible representation into a "standard", "normal", or canonical form (definition via Wikipedia).

In [6]:
def canonicalize_county(county_name):
    return (
        county_name
        ...               # lower case
                          # remove spaces
                          # replace &
                          # remove dot
                          # remove county
                          # remove parish
    )
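
One possible way to fill in the skeleton above, sketched with chained string methods. The name `canonicalize_county_sketch` is hypothetical (chosen so it does not shadow the lecture's function), and the order of the steps is an assumption that simply follows the comments in the skeleton.

```python
def canonicalize_county_sketch(county_name):
    """Hypothetical helper: reduce a county name to a bare lowercase key."""
    return (
        county_name
        .lower()                  # lower case
        .replace(' ', '')         # remove spaces
        .replace('&', 'and')      # replace &
        .replace('.', '')         # remove dot
        .replace('county', '')    # remove county
        .replace('parish', '')    # remove parish
    )

# Both spellings of each county should collapse to the same key.
pairs = [
    ('De Witt County', 'DeWitt'),
    ('Lewis and Clark County', 'Lewis & Clark'),
    ('St John the Baptist Parish', 'St. John the Baptist'),
]
for left, right in pairs:
    print(canonicalize_county_sketch(left), canonicalize_county_sketch(right))
```

Because the steps run in sequence, spaces are removed before `'county'` and `'parish'` are stripped, so the suffix is matched even though it was originally a separate word.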

In [7]:
county_and_pop['clean_county'] = county_and_pop['County'].map(canonicalize_county)
county_and_state['clean_county'] = county_and_state['County'].map(canonicalize_county)

display(county_and_pop)  # display outputs even if not last line in cell
county_and_state

Unnamed: 0,County,Population,clean_county
0,DeWitt,16798,dewitt
1,Lac Qui Parle,8067,lacquiparle
2,Lewis & Clark,55716,lewisandclark
3,St. John the Baptist,43044,stjohnthebaptist


Unnamed: 0,County,State,clean_county
0,De Witt County,IL,dewitt
1,Lac qui Parle County,MN,lacquiparle
2,Lewis and Clark County,MT,lewisandclark
3,St John the Baptist Parish,LS,stjohnthebaptist


In [8]:
county_and_pop.merge(county_and_state, on='clean_county')

Unnamed: 0,County_x,Population,clean_county,County_y,State
0,DeWitt,16798,dewitt,De Witt County,IL
1,Lac Qui Parle,8067,lacquiparle,Lac qui Parle County,MN
2,Lewis & Clark,55716,lewisandclark,Lewis and Clark County,MT
3,St. John the Baptist,43044,stjohnthebaptist,St John the Baptist Parish,LS


## Demo 2: Processing Data from a Text Log Using Basic Python

In [9]:
with open('data/log.txt', 'r') as f:
    log_lines = f.readlines()

In [10]:
log_lines

['169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/"\n',
 '193.205.203.3 - - [2/Feb/2005:17:23:6 -0800] "GET /stat141/Notes/dim.html HTTP/1.0" 404 302 "http://eeyore.ucdavis.edu/stat141/Notes/session.html"\n',
 '169.237.46.240 - "" [3/Feb/2006:10:18:37 -0800] "GET /stat141/homework/Solutions/hw1Sol.pdf HTTP/1.1"\n']

Suppose we want to extract the day, month, year, hour, minutes, seconds, and timezone. Looking at the data, we see that these items are not in a fixed position relative to the beginning of the string. That is, slicing by some fixed offset isn't going to work.

In [11]:
...

'26/Jan/2014'

In [12]:
...

'/Feb/2005:1'
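
The two outputs above are consistent with slicing both lines at the same fixed offsets. A minimal sketch, assuming the slice `[20:31]` (the lecture's actual cells are elided, so this is a guess at what they do; the lines are shortened copies of the log entries):

```python
line0 = '169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1"'
line1 = '193.205.203.3 - - [2/Feb/2005:17:23:6 -0800] "GET /stat141/Notes/dim.html HTTP/1.0"'

# The same offsets work for the first line...
date0 = line0[20:31]
# ...but not for the second: its IP address is one character shorter
# and its day has one digit, so everything is shifted.
date1 = line1[20:31]

print(repr(date0))
print(repr(date1))
```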

Instead, we'll need to use some more sophisticated thinking. Let's focus on only the first line of the file.

In [13]:
first = log_lines[0]
first

'169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/"\n'

In [14]:
...

('26', 'Jan', '2014', '10', '47', '58', '-0800')
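
One way to get the tuple above with basic string methods is to split on the delimiters that surround each field. This is a sketch, not necessarily the lecture's solution; the variable names are assumptions.

```python
first = ('169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] '
         '"GET /stat141/Winter04/ HTTP/1.1" 200 2585 '
         '"http://anson.ucdavis.edu/courses/"\n')

# Keep only the text between the square brackets.
pertinent = first.split('[')[1].split(']')[0]   # '26/Jan/2014:10:47:58 -0800'

# Peel off one group of fields per delimiter.
day, month, rest = pertinent.split('/')
year, hour, minute, rest = rest.split(':')
second, time_zone = rest.split(' ')

print((day, month, year, hour, minute, second, time_zone))
```

Each `split` depends on the exact delimiter layout of the timestamp, which is part of why this approach is brittle compared to a single pattern.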

## Demo 3: Phone numbers 

**Goal**: Extract all phone numbers from a piece of text, assuming they are of the form `'(###) ###-####'`.

In [46]:
contact = '''
Thank you for buying our expensive product!

If you have a complaint, please send it to complaints@compuserve.com or call (800) 867-5309.

If you are happy with your purchase, please call us at (800) 123-4567; we'd love to hear from you!

Due to high demand, please allow one-hundred (100) business days for a response.
'''

In [47]:
print(contact)


Thank you for buying our expensive product!

If you have a complaint, please send it to complaints@compuserve.com or call (800) 867-5309.

If you are happy with your purchase, please call us at (800) 123-4567; we'd love to hear from you!

Due to high demand, please allow one-hundred (100) business days for a response.



- We can do this using the same string methods we've come to know and love.

- Strategy:
    - Split by spaces.
    - Check if there are any consecutive "words" where:
        - the first "word" looks like an area code, like `'(678)'`.
        - the second "word" looks like the last 7 digits of a phone number, like `'999-8212'`. 

Let's first write a function that takes in a string and returns whether it looks like an area code.

In [48]:
def is_possibly_area_code(s):
    '''Find strings like the following, e.g., (678), (213), (906)'''
    return (len(s) == 5 and
            ...)

In [49]:
is_possibly_area_code('(123)')

True

In [50]:
is_possibly_area_code('(99)')

False

In [51]:
is_possibly_area_code('(906)')

True

In [52]:
is_possibly_area_code('(aaa)')

False
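
A possible completion of the area-code check, written under a different (hypothetical) name so it does not replace the lecture's function. It assumes an area code is exactly three digits wrapped in parentheses.

```python
def is_possibly_area_code_sketch(s):
    '''Hypothetical helper: True for strings like (678), (213), (906).'''
    return (len(s) == 5 and
            s.startswith('(') and
            s.endswith(')') and
            s[1:4].isdigit())   # the three characters between the parentheses

for candidate in ['(123)', '(99)', '(906)', '(aaa)']:
    print(candidate, is_possibly_area_code_sketch(candidate))
```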

Let's also write a function that takes in a string and returns whether it looks like the last 7 digits of a phone number.

In [53]:
def is_last_7_phone_number(s):
    '''Find strings that look like 999-8212'''
    return ...

In [54]:
is_last_7_phone_number('999-8212')

True

In [55]:
is_last_7_phone_number('534 1100')

False

In [56]:
is_last_7_phone_number('aaa-1234')

False
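
Similarly, here is one way the seven-digit check could be written. The function name is hypothetical, and the rule it encodes (three digits, a hyphen, four digits) is an assumption based on the examples above.

```python
def is_last_7_phone_number_sketch(s):
    '''Hypothetical helper: True for strings that look like 999-8212.'''
    return (len(s) == 8 and
            s[:3].isdigit() and    # first three digits
            s[3] == '-' and        # separator
            s[4:].isdigit())       # last four digits

for candidate in ['999-8212', '534 1100', 'aaa-1234']:
    print(candidate, is_last_7_phone_number_sketch(candidate))
```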

Finally, let's split the entire text by spaces, and check whether there are any instances where `pieces[i]` looks like an area code and `pieces[i+1]` looks like the last 7 digits of a phone number.

In [57]:
# Removes punctuation from the end of each string.
pieces = [s.rstrip('.,?;"\'') for s in contact.split()]

for i in range(len(pieces) - 1):
    if is_possibly_area_code(pieces[i]):
        if is_last_7_phone_number(pieces[i+1]):
            print(pieces[i], pieces[i+1])

(800) 867-5309
(800) 123-4567



<br>

These were examples using string methods.

A much more sophisticated but common approach is to extract the information we need using a regular expression. See today's lecture slides for more on regular expressions.

<br/><br/><br/>

---
## Regular Expressions

- A regular expression, or **regex** for short, is a sequence of characters used to **match patterns in strings**.
- Think of regex as a "mini-language" (formally, a regex is a grammar for describing a language).

- **Pros**: They are very powerful and widely used (virtually every programming language has a module for working with them).

- **Cons**: They can be hard to read and have many different "dialects".

In [15]:
import re

### Writing regular expressions

- You will ultimately write most of your regular expressions in Python, using the `re` module. We will see how to do so shortly.

- However, a useful tool for designing regular expressions is [regex101.com](https://regex101.com). Choose the Python “flavor” in the left sidebar.

- We will use it heavily during lecture; you should have it open as we work through examples. 

#### Literals

- A literal is a character that has no special meaning.

- Letters, numbers, and some symbols are all literals.

- Some symbols, like `.`, `*`, `(`, and `)`, are special characters.

- ***Example***: The regex `hey` matches the string `'hey'`. The regex `he.` also matches the string `'hey'`.

### Regex building blocks 


The four main building blocks for all regexes are shown below ([table source](https://www.cs.princeton.edu/courses/archive/spring17/cos226/lectures/54RegularExpressions.pdf), [inspiration](https://docs.google.com/presentation/d/1xQsqa7e3xDZ9nBiekbSBOecwvQm8pSVGa-FBoV6aJ7E/edit#slide=id.g11197671c7e_0_919)).

| operation | order of op. | example | matches ✅ | does not match ❌ |
|:--- |:---|:---|:---|:---|
| <span style='color:purple'><b>concatenation</b></span> | 3 | `AABAAB` | `'AABAAB'` | every other string |
| <span style='color:purple'><b>or</b></span> | 4 | `AA\|BAAB` | `'AA'`, `'BAAB'` | every other string |
| <span style='color:purple'><b>closure</b><br>(zero or more)</span> | 2 | `AB*A` | `'AA'`, `'ABBBBBBA'` | `'AB'`, `'ABABA'` |
| <span style='color:purple'><b>parentheses</b></span> | 1 | `A(A\|B)AAB` <hr style="height:1px"> `(AB)*A` | `'AAAAB'`, `'ABAAB'`<hr style="height:1px">`'A'`, `'ABABABABA'` | every other string<hr style="height:1px">`'AA'`, `'ABBA'` |

Note that `|`, `(`, `)`, and `*` are **special characters**, not literals. They manipulate the characters around them.
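
The rows of the table can be checked in Python. The sketch below uses `re.fullmatch`, which succeeds only when the whole string matches the pattern; the helper `fully_matches` and the `checks` dictionary are names made up for this illustration.

```python
import re

def fully_matches(pattern, s):
    """Hypothetical helper: True if the entire string s matches pattern."""
    return re.fullmatch(pattern, s) is not None

# pattern -> (strings that should match, strings that should not)
checks = {
    r'AA|BAAB':    (['AA', 'BAAB'],       ['AAB']),
    r'AB*A':       (['AA', 'ABBBBBBA'],   ['AB', 'ABABA']),
    r'A(A|B)AAB':  (['AAAAB', 'ABAAB'],   ['AAB']),
    r'(AB)*A':     (['A', 'ABABABABA'],   ['AA', 'ABBA']),
}

for pattern, (good, bad) in checks.items():
    print(pattern,
          [fully_matches(pattern, s) for s in good],
          [fully_matches(pattern, s) for s in bad])
```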


##### Example (or, parentheses): 

* What does `DATA 1202|2201` match?
* What does `DATA (1202|2201)` match?

Explore your understanding: [https://regex101.com/r/wnpNfx/1](https://regex101.com/r/wnpNfx/1) and [https://regex101.com/r/SkZigQ/2](https://regex101.com/r/SkZigQ/2)
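
The same comparison can be run locally. This is a small sketch using `re.fullmatch`; the variable names are illustrative only.

```python
import re

without_parens = r'DATA 1202|2201'    # "DATA 1202" OR "2201"
with_parens = r'DATA (1202|2201)'     # "DATA " followed by 1202 OR 2201

for s in ['DATA 1202', 'DATA 2201', '2201']:
    print(repr(s),
          re.fullmatch(without_parens, s) is not None,
          re.fullmatch(with_parens, s) is not None)
```

Because `|` has the lowest precedence of the four building blocks, it splits the entire pattern unless parentheses limit its scope.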


#### Example 

Write a regular expression that matches `'moon'`, `'moooon'`, etc.
- First, think about how to match strings with any even number of `'o'`s, including zero `'o'`s (i.e. `'mn'`).
- Then, think about how to match only strings with a **positive even** number of `'o'`s.


Try It! [https://regex101.com/r/8tkQ23/1](https://regex101.com/r/8tkQ23/1)

<details>
    <summary>
        Click here to see the answer after you've tried it yourself at <a href='https://regex101.com'>regex101.com</a>
    </summary>
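    `m(oo)*n` matches any even number of `'o'`s, including zero. To require a <b>positive even</b> number, use `moo(oo)*n`.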
</details>

#### Example 

Write a regular expression that matches `'moon'`, `'moooon'`, `'muun'`, `'muuuun'`, etc.

It should match any string with a **positive even** number of `'o'`s or a **positive even** number of `'u'`s in the middle.


Try It! [https://regex101.com/r/kJpHeZ/1](https://regex101.com/r/kJpHeZ/1)

<details>
    <summary>
        Click here to see the answer after you've tried it yourself at <a href='https://regex101.com'>regex101.com</a>
    </summary>
    `m(uu(uu)*|oo(oo)*)n` 
</details>

### More regex syntax

| operation | example | matches ✅ | does not match ❌ |
|:--- |:---|:---|:---|
| <span style='color:purple'><b>wildcard</b></span> | `.U.U.U.` | `'CUMULUS'`<br>`'JUGULUM'` | `'SUCCUBUS'`<br>`'TUMULTUOUS'` |
| <span style='color:purple'><b>character class</b></span>  | `[A-Za-z][a-z]*` | `'word'`<br>`'Capitalized'` | `'camelCase'`<br>`'4illegal'` |
| <span style='color:purple'><b>at least one</b></span> | `bi(ll)+y` | `'billy'`<br>`'billlllly'` | `'biy'`<br>`'bily'` |
| <span style='color:purple'><b>between $i$ and $j$ occurrences</b></span> | `m[aeiou]{1,2}m` | `'mem'`<br>`'maam'`<br>`'miem'` | `'mm'`<br>`'mooom'`<br>`'meme'` |

`.`, `[`, `]`, `+`, `{`, and `}` are also special characters, in addition to `|`, `(`, `)`, and `*`.

***Example (character classes, at least one)***: `[A-E]+` is shorthand for `(A|B|C|D|E)(A|B|C|D|E)*`.

***Example (wildcard)***: 
- What does `.` match? 
- What does `he.` match? 
- What does `...` match?

***Example (at least one, closure)***: 
- What does `123+` match?
- What does `123*` match?

***Example (number of occurrences)***: What does `tri{3, 5}` match? Does it match `'triiiii'`?

***Example (character classes, number of occurrences)***:
What does `[1-6a-f]{3}-[7-9E-S]{2}` match?
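
To experiment with these operators outside of regex101, something like the following sketch can help. The helper `fully_matches` is a hypothetical name, and the test strings are chosen only for illustration.

```python
import re

def fully_matches(pattern, s):
    """Hypothetical helper: True if the entire string s matches pattern."""
    return re.fullmatch(pattern, s) is not None

# + and * apply only to the character (or group) immediately before them.
plus_results = [fully_matches(r'123+', s) for s in ['123', '12333', '12', '123123']]
star_results = [fully_matches(r'123*', s) for s in ['12', '123', '12333', '1']]

# A character class with + behaves like the longer "or" expression.
short_form = r'[A-E]+'
long_form = r'(A|B|C|D|E)(A|B|C|D|E)*'
same = all(fully_matches(short_form, s) == fully_matches(long_form, s)
           for s in ['A', 'ABE', 'EEEE', 'AF', '', 'abc'])

# {i,j} bounds the number of repetitions.
bounded = [fully_matches(r'm[aeiou]{1,2}m', s) for s in ['mem', 'maam', 'mm', 'mooom']]

print(plus_results, star_results, same, bounded)
```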

#### Example 

Write a regular expression that matches any lowercase string that has a repeated vowel, such as `'noon'`, `'peel'`, `'festoon'`, or `'zeebraa'`.

Try answering the question with [https://regex101.com](https://regex101.com). 

<br>

<details>
    <summary>
        Click here to see the answer <b>after</b> you've tried it yourself at <a href="https://regex101.com">regex101.com</a>.
    </summary>

<br> 

One possible answer: <code>[a-z]\*(aa|ee|ii|oo|uu)[a-z]\*</code>
 
<br>
    
This regular expression matches strings of lowercase characters that have <code>'aa'</code>, <code>'ee'</code>, <code>'ii'</code>, <code>'oo'</code>, or <code>'uu'</code> in them anywhere. <code>[a-z]\*</code> means "zero or more of any lowercase characters"; essentially we are saying it doesn't matter what letters come before or after the double vowels, as long as the double vowels exist somewhere.
</details>

### Even more regex syntax

| operation | example | matches ✅ | does not match ❌ |
|:--- |:---|:---|:---|
| <span style='color:purple'><b>escape character</b></span> | `ucsd\.edu` | `'ucsd.edu'` | `'ucsd!edu'` |
| <span style='color:purple'><b>beginning of line</b></span> | `^ark` | `'ark two'`<br>`'ark o ark'` | `'dark'` |
| <span style='color:purple'><b>end of line</b></span>  | `ark$` | `'dark'`<br>`'ark o ark'` | `'ark two'` |
| <span style='color:purple'><b>zero or one</b></span> | `cat?` | `'ca'`<br>`'cat'` | `'cart'` (matches `'ca'` only) |
| <span style='color:purple'><b>built-in character classes*</b></span> | `\w+` <br> `\d+` | `'billy'`<br>`'231231'` | `'this person'`<br>`'858 people'` |
| <span style='color:purple'><b>character class negation</b></span> | `[^a-z]+` | `'KINGTRITON551'`<br>`'1721$$'` | `'porch'`<br>`'billy.edu'` |

****Note***: in Python's implementation of regex,
- `\d` refers to digits.
- `\w` refers to alphanumeric characters and the underscore (`[A-Za-z0-9_]`). **Whenever we say "alphanumeric" in an assignment, we're referring to `\w`!**
- `\s` refers to whitespace.
- `\b` is a word boundary.
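
A short sketch of these built-in classes in action, using `re.findall`, `re.fullmatch`, and `re.sub` from the standard library. The example strings are made up for illustration.

```python
import re

digits = re.findall(r'\d+', 'call 858 people at 10am')       # runs of digits
words = re.findall(r'\w+', 'this person')                     # runs of word characters
whole = re.fullmatch(r'\w+', 'this person')                   # the space is not a \w
standalone = re.findall(r'\bark\b', 'ark dark ark')           # 'ark' as a whole word only
squeezed = re.sub(r'\s+', ' ', 'too   many\tspaces')          # collapse whitespace runs

print(digits, words, whole, standalone, squeezed)
```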

## Demo 1 Revisit: Canonicalization with Regex

Python `re.sub`

In [16]:
text = '<div><td valign="top">Moo</td></div>'
pattern = r"<[^>]+>"
re.sub(pattern, '', text)

'Moo'

<br/>

`pandas`: `Series.str.replace`

In [18]:
df_html = pd.DataFrame(['<div><td valign="top">Moo</td></div>',
                   '<a href="http://mtu.edu">Link</a>',
                   '<b>Bold text</b>'], columns=['Html'])
df_html

Unnamed: 0,Html
0,"<div><td valign=""top"">Moo</td></div>"
1,"<a href=""http://mtu.edu"">Link</a>"
2,<b>Bold text</b>


In [19]:
# Series -> Series
...

0          Moo
1         Link
2    Bold text
Name: Html, dtype: object
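
A sketch of how the elided cell might produce the Series above, assuming it reuses the tag-stripping pattern from the `re.sub` example. Passing `regex=True` tells `Series.str.replace` to treat the pattern as a regular expression rather than a literal string.

```python
import pandas as pd

df_html = pd.DataFrame(['<div><td valign="top">Moo</td></div>',
                        '<a href="http://mtu.edu">Link</a>',
                        '<b>Bold text</b>'], columns=['Html'])

pattern = r"<[^>]+>"   # an opening '<', one or more non-'>' characters, a closing '>'

# Series -> Series: every tag is replaced with the empty string.
stripped = df_html['Html'].str.replace(pattern, '', regex=True)
print(stripped)
```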

<br><br>


### Extraction with Regex

Python `re.findall`

In [19]:
text = "My social security number is 123-45-6789 bro, or actually maybe it’s 321-45-6789.";
pattern = r""
...  # ['123-45-6789', '321-45-6789']

['123-45-6789', '321-45-6789']
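
One pattern that yields the list above is sketched here; the lecture's own pattern is left blank, so treat this as one possibility among several (for example, `\d` could be used in place of `[0-9]`).

```python
import re

text = "My social security number is 123-45-6789 bro, or actually maybe it’s 321-45-6789."
pattern = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"   # 3 digits, hyphen, 2 digits, hyphen, 4 digits

matches = re.findall(pattern, text)
print(matches)
```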

Regex Groups

In [20]:
text = """Observations: 03:04:53 - Horse awakens.
03:05:14 - Horse goes back to sleep."""       
pattern = r"(\d\d):(\d\d):(\d\d) - (.*)"
...

[('03', '04', '53', 'Horse awakens.'),
 ('03', '05', '14', 'Horse goes back to sleep.')]
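
When a pattern contains groups, `re.findall` returns one tuple per match, with one entry per group. A sketch of the elided call, assuming it is a plain `re.findall`:

```python
import re

text = """Observations: 03:04:53 - Horse awakens.
03:05:14 - Horse goes back to sleep."""
pattern = r"(\d\d):(\d\d):(\d\d) - (.*)"

# Each parenthesized group becomes one element of the tuple.
# `.` does not match a newline by default, so (.*) stops at the end of the line.
observations = re.findall(pattern, text)
print(observations)
```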

<br/>

`pandas`

In [21]:
df_ssn = pd.DataFrame(
    ['987-65-4321',
     'forty',
     '123-45-6789 bro or 321-45-6789',
     '999-99-9999'],
    columns=['SSN'])
df_ssn

Unnamed: 0,SSN
0,987-65-4321
1,forty
2,123-45-6789 bro or 321-45-6789
3,999-99-9999


1. `Series.str.findall`

In [22]:
# -> Series of lists
pattern = ...
...

0                 [987-65-4321]
1                            []
2    [123-45-6789, 321-45-6789]
3                 [999-99-9999]
Name: SSN, dtype: object
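
A sketch of a call that would give the Series of lists above. The pattern is an assumption (the lecture leaves it blank), and the Series is rebuilt here so the example runs on its own.

```python
import pandas as pd

ssn = pd.Series(['987-65-4321',
                 'forty',
                 '123-45-6789 bro or 321-45-6789',
                 '999-99-9999'], name='SSN')

pattern = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"

# Series -> Series of lists: one list of matches per row (empty if none).
found = ssn.str.findall(pattern)
print(found)
```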

2. `Series.str.extract`

In [23]:
# -> DataFrame of first match group
pattern_group = ... # 1 group
...

Unnamed: 0_level_0,Unnamed: 1_level_0,0
Unnamed: 0_level_1,match,Unnamed: 2_level_1
0,0,987-65-4321
2,0,123-45-6789
2,1,321-45-6789
3,0,999-99-9999


In [24]:
# Will extract first match of all groups
pattern_group_mult = r"" # 3 groups
...

Unnamed: 0,0,1,2
0,987.0,65.0,4321.0
1,,,
2,123.0,45.0,6789.0
3,999.0,99.0,9999.0


3. `Series.str.extractall`

In [25]:
# -> DataFrame, one row per match
...

Unnamed: 0_level_0,Unnamed: 1_level_0,0,1,2
Unnamed: 0_level_1,match,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,0,987,65,4321
2,0,123,45,6789
2,1,321,45,6789
3,0,999,99,9999
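
To compare the two methods side by side, here is a sketch with an assumed three-group pattern. `Series.str.extract` keeps one row per input row and fills in only the first match, while `Series.str.extractall` returns one row per match, indexed by the original row label and a `match` counter.

```python
import pandas as pd

ssn = pd.Series(['987-65-4321',
                 'forty',
                 '123-45-6789 bro or 321-45-6789',
                 '999-99-9999'], name='SSN')

pattern_groups = r"([0-9]{3})-([0-9]{2})-([0-9]{4})"   # 3 groups (an assumption)

first_match = ssn.str.extract(pattern_groups)      # same index as ssn; NaN where no match
all_matches = ssn.str.extractall(pattern_groups)   # MultiIndex: (original row, match number)

print(first_match)
print(all_matches)
```

Rows with no match (like `'forty'`) appear as missing values in the `extract` result but are dropped entirely from the `extractall` result.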


In [26]:
# original dataframe
df_ssn

Unnamed: 0,SSN
0,987-65-4321
1,forty
2,123-45-6789 bro or 321-45-6789
3,999-99-9999


## Demo 2 Revisit: Text Log Processing using Regex

Python version:

In [27]:
line = log_lines[0]
line = '169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1"'
display(line)
pattern = r'\[(\d+)\/(\w+)\/(\d+):(\d+):(\d+):(\d+) (.+)\]'
re.findall(pattern, line)


'169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1"'

[('26', 'Jan', '2014', '10', '47', '58', '-0800')]

In [28]:
# beyond the scope of lecture, but left here for your interest
day, month, year, hour, minute, second, time_zone = re.search(pattern, line).groups()
day, month, year, hour, minute, second, time_zone

('26', 'Jan', '2014', '10', '47', '58', '-0800')

<br/><br/>
Pandas version:

In [29]:
df = pd.DataFrame(log_lines, columns=['Log'])
df

Unnamed: 0,Log
0,169.237.46.168 - - [26/Jan/2014:10:47:58 -0800...
1,"193.205.203.3 - - [2/Feb/2005:17:23:6 -0800] ""..."
2,"169.237.46.240 - """" [3/Feb/2006:10:18:37 -0800..."


Option 1: `Series.str.findall`

In [30]:
pattern = r''
...

0    [(26, Jan, 2014, 10, 47, 58, -0800)]
1      [(2, Feb, 2005, 17, 23, 6, -0800)]
2     [(3, Feb, 2006, 10, 18, 37, -0800)]
Name: Log, dtype: object

<br/>

Option 2: `Series.str.extractall`

In [31]:
...

Unnamed: 0_level_0,Unnamed: 1_level_0,0,1,2,3,4,5,6
Unnamed: 0_level_1,match,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
0,0,26,Jan,2014,10,47,58,-800
1,0,2,Feb,2005,17,23,6,-800
2,0,3,Feb,2006,10,18,37,-800
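
Both options can be sketched by reusing the pattern from the pure-Python version; the lecture leaves the pandas pattern blank, so this reuse is an assumption. The log lines are shortened copies of the entries above so the example runs without the data file.

```python
import pandas as pd

log_lines = [
    '169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1"\n',
    '193.205.203.3 - - [2/Feb/2005:17:23:6 -0800] "GET /stat141/Notes/dim.html HTTP/1.0"\n',
    '169.237.46.240 - "" [3/Feb/2006:10:18:37 -0800] "GET /stat141/homework/Solutions/hw1Sol.pdf HTTP/1.1"\n',
]
logs = pd.Series(log_lines, name='Log')

pattern = r'\[(\d+)\/(\w+)\/(\d+):(\d+):(\d+):(\d+) (.+)\]'

as_lists = logs.str.findall(pattern)      # Series of lists of tuples
as_table = logs.str.extractall(pattern)   # DataFrame, one row per match, one column per group

print(as_lists)
print(as_table)
```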


Wrangling either of these two results into a nice format (like below) is left as an exercise for you! You will do a related problem on the homework.


||Day|Month|Year|Hour|Minute|Second|Time Zone|
|---|---|---|---|---|---|---|---|
|0|26|Jan|2014|10|47|58|-0800|
|1|2|Feb|2005|17|23|6|-0800|
|2|3|Feb|2006|10|18|37|-0800|


In [None]:
# your code here
...

<br/><br/>
<br/>

---

## Real World Example #1: Restaurant Data

In this example, we will show how regexes let us track quantitative data across categories defined by the presence of certain keywords in a text field:

> **How do restaurant health scores vary as a function of the number of violations that mention a particular keyword?** 
> <br/>
> (e.g., unclean surfaces, vermin, permits, etc.)

In [32]:
vio = pd.read_csv('data/violations.csv', header=0, names=['bid', 'date', 'desc'])
desc = vio['desc']
vio.head()

Unnamed: 0,bid,date,desc
0,19,20171211,Inadequate food safety knowledge or lack of ce...
1,19,20171211,Unapproved or unmaintained equipment or utensils
2,19,20160513,Unapproved or unmaintained equipment or utensi...
3,19,20160513,Unclean or degraded floors walls or ceilings ...
4,19,20160513,Food safety certificate or food handler card n...


In [33]:
counts = desc.value_counts()
counts.shape

(14253,)

That's a lot of different descriptions!! Can we **canonicalize** at all? Let's explore two sets of 10 rows.

In [34]:
counts[:10]

desc
Unclean or degraded floors walls or ceilings                          999
Unapproved or unmaintained equipment or utensils                      659
Inadequately cleaned or sanitized food contact surfaces               493
Improper food storage                                                 476
Inadequate and inaccessible handwashing facilities                    467
Moderate risk food holding temperature                                452
Wiping cloths not clean or properly stored or inadequate sanitizer    418
Moderate risk vermin infestation                                      374
Unclean nonfood contact surfaces                                      369
Food safety certificate or food handler card not available            353
Name: count, dtype: int64

In [35]:
# Hmmm...
counts[50:60]

desc
Unclean or degraded floors walls or ceilings  [ date violation corrected: 11/29/2017 ]              16
Unclean or degraded floors walls or ceilings  [ date violation corrected: 9/19/2017 ]               16
Inadequate HACCP plan record keeping                                                                16
Unclean or degraded floors walls or ceilings  [ date violation corrected: 11/27/2017 ]              15
Unclean or degraded floors walls or ceilings  [ date violation corrected: 12/7/2017 ]               15
Inadequately cleaned or sanitized food contact surfaces  [ date violation corrected: 9/26/2017 ]    14
Unclean or degraded floors walls or ceilings  [ date violation corrected: 11/28/2017 ]              14
Unclean or degraded floors walls or ceilings  [ date violation corrected: 9/6/2017 ]                14
Unapproved or unmaintained equipment or utensils  [ date violation corrected: 9/19/2017 ]           14
Unapproved  living quarters in food facility                        

In [None]:
# Use regular expressions to cut out the extra info in square brackets.
vio['clean_desc'] = ...
vio.head()
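
A sketch of one way to do this cleanup, shown on a small hand-built Series rather than the real `violations.csv`. The pattern and the extra `strip` and `lower` steps are assumptions, chosen because the cleaned descriptions shown later are lowercase with no trailing bracketed text.

```python
import pandas as pd

# A few descriptions copied from the value counts above.
desc_sample = pd.Series([
    'Unclean or degraded floors walls or ceilings  [ date violation corrected: 11/29/2017 ]',
    'Unclean or degraded floors walls or ceilings',
    'Inadequate HACCP plan record keeping',
])

clean_sample = (
    desc_sample
    .str.replace(r'\s*\[.*?\]\s*$', '', regex=True)   # drop a trailing "[ ... ]" block
    .str.strip()                                       # remove leftover whitespace
    .str.lower()                                       # canonicalize case
)
print(clean_sample.tolist())
```

After this step, the first two descriptions collapse to the same string, which is what shrinks the number of distinct values.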

In [37]:
# canonicalizing definitely helped
vio['clean_desc'].value_counts().shape

(68,)

In [38]:
vio['clean_desc'].value_counts().tail() 

clean_desc
mobile food facility stored in unapproved location                   3
mobile food facility not operating with an approved commissary       3
unreported or unrestricted ill employee with communicable disease    2
mobile food facility hcd insignia unavailable                        1
noncompliance with cottage food operation                            1
Name: count, dtype: int64

Remember our research question:

> **How do restaurant health scores vary as a function of the number of violations that mention a particular keyword?** 
> <br/>
> (e.g., unclean surfaces, vermin, permits, etc.)

<br/>

Below, we use regular expressions and `df.assign()` ([documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.assign.html?highlight=assign#pandas.DataFrame.assign)) to **method chain** our creation of new boolean features, one per keyword.

In [39]:
# use regular expressions to assign new features for the presence of various keywords
# regex metacharacter | 
with_features = (vio
 .assign(is_unclean   = vio['clean_desc'].str.contains('clean|sanit'))
 .assign(is_high_risk = vio['clean_desc'].str.contains('high risk'))
 .assign(is_vermin    = vio['clean_desc'].str.contains('vermin'))
 .assign(is_surface   = vio['clean_desc'].str.contains('wall|ceiling|floor|surface'))
 .assign(is_human     = vio['clean_desc'].str.contains('hand|glove|hair|nail'))
 .assign(is_permit    = vio['clean_desc'].str.contains('permit|certif'))
)
with_features.head()

Unnamed: 0,bid,date,desc,clean_desc,is_unclean,is_high_risk,is_vermin,is_surface,is_human,is_permit
0,19,20171211,Inadequate food safety knowledge or lack of ce...,inadequate food safety knowledge or lack of ce...,False,False,False,False,False,True
1,19,20171211,Unapproved or unmaintained equipment or utensils,unapproved or unmaintained equipment or utensils,False,False,False,False,False,False
2,19,20160513,Unapproved or unmaintained equipment or utensi...,unapproved or unmaintained equipment or utensils,False,False,False,False,False,False
3,19,20160513,Unclean or degraded floors walls or ceilings ...,unclean or degraded floors walls or ceilings,True,False,False,True,False,False
4,19,20160513,Food safety certificate or food handler card n...,food safety certificate or food handler card n...,False,False,False,False,True,True


<br/><br/>

### EDA

That's the end of our text wrangling. Now let's analyze restaurant health as a function of the number of violation keywords.

To do so, we'll first group so that our **granularity** is one inspection for a business on a particular date. This effectively counts the number of violations by keyword for a given inspection.

In [40]:
count_features = (with_features
 .groupby(['bid', 'date'])
 .sum()
 .reset_index()
)
count_features.iloc[255:260, :]

Unnamed: 0,bid,date,desc,clean_desc,is_unclean,is_high_risk,is_vermin,is_surface,is_human,is_permit
255,489,20150728,Unclean or degraded floors walls or ceilings ...,unclean or degraded floors walls or ceilingsmo...,5,0,2,3,0,0
256,489,20150807,Unapproved or unmaintained equipment or utensi...,unapproved or unmaintained equipment or utensi...,1,0,0,1,0,0
257,489,20160308,High risk food holding temperature [ date vi...,high risk food holding temperatureother modera...,2,2,1,0,1,0
258,489,20160721,Low risk vermin infestation [ date violation ...,low risk vermin infestationhigh risk food hold...,2,1,1,1,0,1
259,489,20161220,Inadequately cleaned or sanitized food contact...,inadequately cleaned or sanitized food contact...,3,0,1,2,0,0
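
Summing boolean columns counts the `True` values, which is why the keyword columns become counts. The text columns are summed too, which concatenates the strings (visible in `clean_desc` above). A sketch of restricting the sum to the keyword columns, using a tiny made-up table (the `bid`/`date` values and the `toy` and `feature_cols` names are hypothetical):

```python
import pandas as pd

# Made-up violations: two on one inspection, one each on two others.
toy = pd.DataFrame({
    'bid': [19, 19, 19, 24],
    'date': [20171211, 20171211, 20160513, 20171101],
    'clean_desc': ['moderate risk vermin infestation',
                   'unclean or degraded floors walls or ceilings',
                   'unclean nonfood contact surfaces',
                   'improper food storage'],
})
toy = toy.assign(
    is_vermin=toy['clean_desc'].str.contains('vermin'),
    is_surface=toy['clean_desc'].str.contains('wall|ceiling|floor|surface'),
)

feature_cols = ['is_vermin', 'is_surface']
toy_counts = (toy
    .groupby(['bid', 'date'])[feature_cols]   # select only the boolean features
    .sum()                                    # True counts as 1, False as 0
    .reset_index()
)
print(toy_counts)
```

Selecting the feature columns before summing keeps the string columns out of the result, so a later reshape only sees numeric counts.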


Check out our new dataframe in action:

In [41]:
count_features.query('is_vermin > 1').head(5)

Unnamed: 0,bid,date,desc,clean_desc,is_unclean,is_high_risk,is_vermin,is_surface,is_human,is_permit
255,489,20150728,Unclean or degraded floors walls or ceilings ...,unclean or degraded floors walls or ceilingsmo...,5,0,2,3,0,0
291,527,20170821,Inadequate and inaccessible handwashing facili...,inadequate and inaccessible handwashing facili...,1,1,2,1,1,1
1508,2622,20160526,Unclean or degraded floors walls or ceilings ...,unclean or degraded floors walls or ceilingsun...,4,2,2,3,0,0
1573,2721,20150422,Foods not protected from contamination [ date...,foods not protected from contaminationmoderate...,2,1,2,1,0,0
1746,2945,20150921,Inadequate and inaccessible handwashing facili...,inadequate and inaccessible handwashing facili...,2,1,2,2,2,1


Now we'll reshape this "wide" table into a "tidy" table using a pandas feature called `pd.melt` ([documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.melt.html?highlight=pd%20melt)) which we won't describe in any detail, other than that it's effectively the inverse of `pd.pivot_table`.

Our **granularity** is now a violation type for a given inspection (for a business on a particular date).

In [42]:
broken_down_by_violation_type = pd.melt(count_features, id_vars=['bid', 'date'],
            var_name='feature', value_name='num_vios')

# show a particular inspection's results
broken_down_by_violation_type.query('bid == 489 & date == 20150728')

Unnamed: 0,bid,date,feature,num_vios
255,489,20150728,desc,Unclean or degraded floors walls or ceilings ...
12517,489,20150728,clean_desc,unclean or degraded floors walls or ceilingsmo...
24779,489,20150728,is_unclean,5
37041,489,20150728,is_high_risk,0
49303,489,20150728,is_vermin,2
61565,489,20150728,is_surface,3
73827,489,20150728,is_human,0
86089,489,20150728,is_permit,0


Remember our research question:

> **How do restaurant health scores vary as a function of the number of violations that mention a particular keyword?** 
> <br/>
> (e.g., unclean surfaces, vermin, permits, etc.)

<br/>

We have the second half of this question! Now let's **join** our table with the inspection scores, located in `inspections.csv`.

In [43]:
# read in the scores
ins = pd.read_csv('data/inspections.csv',
                  header=0,
                  usecols=[0, 1, 2],
                  names=['bid', 'score', 'date'])
ins.head()

Unnamed: 0,bid,score,date
0,19,94,20160513
1,19,94,20171211
2,24,98,20171101
3,24,98,20161005
4,24,96,20160311


While the inspection scores were stored in a separate file from the violation descriptions, we notice that the **primary key** in inspections is (`bid`, `date`)! So we can reference this key in our join.

In [44]:
# join scores with the table broken down by violation type
violation_type_and_scores = (
    broken_down_by_violation_type
    .merge(ins, on=['bid', 'date'])
)
violation_type_and_scores.head(12)

Unnamed: 0,bid,date,feature,num_vios,score
0,19,20160513,desc,Unapproved or unmaintained equipment or utensi...,94
1,19,20160513,clean_desc,unapproved or unmaintained equipment or utensi...,94
2,19,20160513,is_unclean,1,94
3,19,20160513,is_high_risk,0,94
4,19,20160513,is_vermin,0,94
5,19,20160513,is_surface,1,94
6,19,20160513,is_human,1,94
7,19,20160513,is_permit,1,94
8,19,20171211,desc,Inadequate food safety knowledge or lack of ce...,94
9,19,20171211,clean_desc,inadequate food safety knowledge or lack of ce...,94


<br/><br/>

---

Let's plot the distribution of scores, broken down by violation counts, for each inspection feature (`is_unclean`, `is_high_risk`, `is_vermin`, `is_surface`).

In [None]:
sns.catplot(x='num_vios', y='score',
               col='feature', col_wrap=2,
               kind='box',
               data=violation_type_and_scores)

Above we can observe:
* The inspection score generally goes down as the number of violations increases, as expected.
* Depending on the violation keyword, inspection scores go down on average at slightly different rates.
* For example, if a restaurant inspection involved 2 violations with the keyword "vermin", the average score for that inspection would be a little below 80.