# Data Wrangling with Pandas

We've seen how to get data with Python. Now let's put it to work. From here on, we'll mostly use the PyData stack rather than Python's built-in functionality.

Our objective in this section is to learn enough to clean the larger sample of Chicago Health Inspection data and get it ready for modeling.

## Preliminaries: DataFrames

As mentioned, the core data structure in pandas is called a DataFrame. A DataFrame is a tabular data structure, holding many columns, similar to a spreadsheet.

The **key features** are:

* Easy handling of **missing data**
* **Size mutability**: columns can be inserted and deleted from DataFrames
* Automatic and explicit **data alignment**: objects can be explicitly aligned to a set of labels, or the data can be aligned automatically
* Powerful, flexible **group by functionality** to perform split-apply-combine operations on data sets
* Intelligent label-based **slicing**, **fancy indexing**, and **subsetting** of large data sets
* Intuitive **merging and joining** data sets
* Flexible **reshaping and pivoting** of data sets
* **Hierarchical labeling** of axes
* Robust **IO tools** for loading data from flat files, Excel files, databases, and HDF5
* **Time series functionality**: 
  * date range generation and frequency conversion
  * moving window statistics
  * moving window linear regressions
  * date shifting and lagging, etc.

In [2]:
import pandas as pd

dta = pd.read_csv("data/health_inspection_chi.csv")

Pandas provides labelled **indices** to access rows and columns, should they have natural labels.

In [3]:
dta.index

RangeIndex(start=0, stop=25000, step=1)

In [4]:
dta.columns

Index(['address', 'aka_name', 'city', 'dba_name', 'facility_type',
       'inspection_date', 'inspection_id', 'inspection_type', 'latitude',
       'license_', 'location', 'longitude', 'results', 'risk', 'state',
       'violations', 'zip'],
      dtype='object')

For example, with this data set we have a natural unique identifier in the `inspection_id` column. We might wish to make this our index.

In [5]:
dta.head()

Unnamed: 0,address,aka_name,city,dba_name,facility_type,inspection_date,inspection_id,inspection_type,latitude,license_,location,longitude,results,risk,state,violations,zip
0,5255 W MADISON ST,RED SNAPPER FISH CHICKEN & PIZZA,CHICAGO,RED SNAPPER FISH CHICKEN & PIZZA,Restaurant,2016-09-26T00:00:00.000,1965287,Canvass,41.880237,1991820.0,"{'type': 'Point', 'coordinates': [-87.75722039...",-87.75722,Pass w/ Conditions,Risk 1 (High),IL,"35. WALLS, CEILINGS, ATTACHED EQUIPMENT CONSTR...",60644.0
1,5958 W DIVERSEY AVE,TAQUERIA MORELOS,CHICAGO,TAQUERIA MORELOS,Restaurant,2014-02-06T00:00:00.000,1329698,Canvass,41.93125,2099479.0,"{'type': 'Point', 'coordinates': [-87.77590699...",-87.775907,Pass,Risk 1 (High),IL,33. FOOD AND NON-FOOD CONTACT EQUIPMENT UTENSI...,60639.0
2,5400-5402 N CLARK ST,HAMBURGER MARY'S/MARY'S REC ROOM,CHICAGO,HAMBURGER MARY'S CHICAGO/MARY'S REC ROOM,Restaurant,2010-12-03T00:00:00.000,470787,SFP,41.979884,1933748.0,"{'type': 'Point', 'coordinates': [-87.66842948...",-87.668429,Fail,Risk 1 (High),IL,"6. HANDS WASHED AND CLEANED, GOOD HYGIENIC PRA...",60640.0
3,2804 N CLARK ST,Wells Street Popcorn,CHICAGO,Wells Street Popcorn,Restaurant,2010-02-01T00:00:00.000,68091,Canvass,41.932921,1954774.0,"{'type': 'Point', 'coordinates': [-87.64515454...",-87.645155,Pass,Risk 2 (Medium),IL,,60657.0
4,2409 N WESTERN AVE,CHICAGO CUPCAKE,CHICAGO,CHICAGO CUPCAKE LLC.,Mobile Food Dispenser,2013-05-03T00:00:00.000,1335320,License Re-Inspection,41.925218,2232391.0,"{'type': 'Point', 'coordinates': [-87.68750659...",-87.687507,Fail,Risk 3 (Low),IL,32. FOOD AND NON-FOOD CONTACT SURFACES PROPERL...,60647.0


In [6]:
dta = dta.set_index('inspection_id')

In [7]:
dta.index

Int64Index([1965287, 1329698,  470787,   68091, 1335320, 1228169, 1285582,
             557486,   74468, 1522863,
            ...
            2059403,  114871,  657253,  531556,  325228, 2059771, 1965378,
            1490395, 1326565,  413268],
           dtype='int64', name='inspection_id', length=25000)

In [8]:
dta.head()

inspection_id,address,aka_name,city,dba_name,facility_type,inspection_date,inspection_type,latitude,license_,location,longitude,results,risk,state,violations,zip
1965287,5255 W MADISON ST,RED SNAPPER FISH CHICKEN & PIZZA,CHICAGO,RED SNAPPER FISH CHICKEN & PIZZA,Restaurant,2016-09-26T00:00:00.000,Canvass,41.880237,1991820.0,"{'type': 'Point', 'coordinates': [-87.75722039...",-87.75722,Pass w/ Conditions,Risk 1 (High),IL,"35. WALLS, CEILINGS, ATTACHED EQUIPMENT CONSTR...",60644.0
1329698,5958 W DIVERSEY AVE,TAQUERIA MORELOS,CHICAGO,TAQUERIA MORELOS,Restaurant,2014-02-06T00:00:00.000,Canvass,41.93125,2099479.0,"{'type': 'Point', 'coordinates': [-87.77590699...",-87.775907,Pass,Risk 1 (High),IL,33. FOOD AND NON-FOOD CONTACT EQUIPMENT UTENSI...,60639.0
470787,5400-5402 N CLARK ST,HAMBURGER MARY'S/MARY'S REC ROOM,CHICAGO,HAMBURGER MARY'S CHICAGO/MARY'S REC ROOM,Restaurant,2010-12-03T00:00:00.000,SFP,41.979884,1933748.0,"{'type': 'Point', 'coordinates': [-87.66842948...",-87.668429,Fail,Risk 1 (High),IL,"6. HANDS WASHED AND CLEANED, GOOD HYGIENIC PRA...",60640.0
68091,2804 N CLARK ST,Wells Street Popcorn,CHICAGO,Wells Street Popcorn,Restaurant,2010-02-01T00:00:00.000,Canvass,41.932921,1954774.0,"{'type': 'Point', 'coordinates': [-87.64515454...",-87.645155,Pass,Risk 2 (Medium),IL,,60657.0
1335320,2409 N WESTERN AVE,CHICAGO CUPCAKE,CHICAGO,CHICAGO CUPCAKE LLC.,Mobile Food Dispenser,2013-05-03T00:00:00.000,License Re-Inspection,41.925218,2232391.0,"{'type': 'Point', 'coordinates': [-87.68750659...",-87.687507,Fail,Risk 3 (Low),IL,32. FOOD AND NON-FOOD CONTACT SURFACES PROPERL...,60647.0
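Before promoting a column to the index, it can be worth confirming that its values really are unique. The following is a minimal sketch on a made-up three-row table (the `toy` DataFrame is an assumption for illustration, not the real file); it uses the `verify_integrity` option of `set_index`, which raises an error if the new index would contain duplicates.

```python
import pandas as pd

# Hypothetical miniature version of the inspections table.
toy = pd.DataFrame({
    "inspection_id": [1965287, 1329698, 470787],
    "results": ["Pass w/ Conditions", "Pass", "Fail"],
})

# Check uniqueness before promoting the column to the index.
ids_are_unique = toy["inspection_id"].is_unique

# verify_integrity=True makes set_index raise if the new index has duplicates.
toy_indexed = toy.set_index("inspection_id", verify_integrity=True)
index_name = toy_indexed.index.name

print(ids_are_unique, index_name)
```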


## Indexing

To look at a column from a DataFrame, you can use attribute lookup.

In [13]:
dta.address

inspection_id
1965287                     5255 W MADISON ST 
1329698                   5958 W DIVERSEY AVE 
470787                   5400-5402 N CLARK ST 
68091                         2804 N CLARK ST 
1335320                    2409 N WESTERN AVE 
1228169    3481 S DR MARTIN LUTHER KING JR DR 
1285582              3201-3203 W ARMITAGE AVE 
557486              5215 W CHICAGO AVE BLDG E2
74468                    4445 S Drexel (900E) 
1522863               3958 N NARRAGANSETT AVE 
1365288                   10-20 E DELAWARE ST 
1150454                   10123 S WESTERN AVE 
1453434                     81 E VAN BUREN ST 
1098457                 3615 W IRVING PARK RD 
1978957                       3042 N BROADWAY 
1971125                 3658 W IRVING PARK RD 
1434561                        932 N NOBLE ST 
606496                   9440 S LAFAYETTE AVE 
1322114                   2822 W MONTROSE AVE 
1434800                     4140 W Addison ST 
1096484                        701 S STATE ST 

Or you can use the **getitem** syntax that relies on square brackets `[]`, which is familiar from dealing with dictionaries (uses `__getitem__`).

In [22]:
dta['address']

inspection_id
1965287                     5255 W MADISON ST 
1329698                   5958 W DIVERSEY AVE 
470787                   5400-5402 N CLARK ST 
68091                         2804 N CLARK ST 
1335320                    2409 N WESTERN AVE 
1228169    3481 S DR MARTIN LUTHER KING JR DR 
1285582              3201-3203 W ARMITAGE AVE 
557486              5215 W CHICAGO AVE BLDG E2
74468                    4445 S Drexel (900E) 
1522863               3958 N NARRAGANSETT AVE 
1365288                   10-20 E DELAWARE ST 
1150454                   10123 S WESTERN AVE 
1453434                     81 E VAN BUREN ST 
1098457                 3615 W IRVING PARK RD 
1978957                       3042 N BROADWAY 
1971125                 3658 W IRVING PARK RD 
1434561                        932 N NOBLE ST 
606496                   9440 S LAFAYETTE AVE 
1322114                   2822 W MONTROSE AVE 
1434800                     4140 W Addison ST 
1096484                        701 S STATE ST 
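The two spellings are not always interchangeable. As a small sketch on invented data (the column names `count` and `aka name` are chosen only to show the problem), attribute lookup fails when a column name collides with a DataFrame method or is not a valid Python identifier, while square brackets always work.

```python
import pandas as pd

# Hypothetical columns: one clashes with a method name, one contains a space.
toy = pd.DataFrame({"count": [3, 1], "aka name": ["A", "B"]})

# Brackets always return the column.
count_column = toy["count"]

# Attribute lookup finds the DataFrame.count method instead of the column.
attribute_is_method = callable(toy.count)

# A name with a space cannot be written as an attribute at all.
spaced_column = toy["aka name"]

print(attribute_is_method)
```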

These two operations return pandas **Series** objects. **Series** are like single-column DataFrames. If you want to preserve the DataFrame type, index the DataFrame with a list.

In [26]:
dta[['address', 'aka_name']].head()

inspection_id,address,aka_name
1965287,5255 W MADISON ST,RED SNAPPER FISH CHICKEN & PIZZA
1329698,5958 W DIVERSEY AVE,TAQUERIA MORELOS
470787,5400-5402 N CLARK ST,HAMBURGER MARY'S/MARY'S REC ROOM
68091,2804 N CLARK ST,Wells Street Popcorn
1335320,2409 N WESTERN AVE,CHICAGO CUPCAKE
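To make the Series versus DataFrame distinction concrete, here is a minimal sketch on a made-up two-row table (the `toy` DataFrame is an assumption for illustration). A string key returns a one-dimensional Series, while a list containing the same name returns a two-dimensional DataFrame with one column.

```python
import pandas as pd

# Hypothetical two-row stand-in for the inspections data.
toy = pd.DataFrame(
    {
        "address": ["5255 W MADISON ST", "5958 W DIVERSEY AVE"],
        "city": ["CHICAGO", "CHICAGO"],
    },
    index=pd.Index([1965287, 1329698], name="inspection_id"),
)

as_series = toy["address"]    # a string key gives a Series
as_frame = toy[["address"]]   # a one-element list gives a DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```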


You can use this syntax to pull out multiple columns.

In [27]:
dta[['address', 'inspection_date']].head()

inspection_id,address,inspection_date
1965287,5255 W MADISON ST,2016-09-26T00:00:00.000
1329698,5958 W DIVERSEY AVE,2014-02-06T00:00:00.000
470787,5400-5402 N CLARK ST,2010-12-03T00:00:00.000
68091,2804 N CLARK ST,2010-02-01T00:00:00.000
1335320,2409 N WESTERN AVE,2013-05-03T00:00:00.000


You can index the rows by using the **loc** and **iloc** accessors.

`loc` does *label-based* indexing.

In [24]:
dta.loc[[1965287, 1329698]]

inspection_id,address,aka_name,city,dba_name,facility_type,inspection_date,inspection_type,latitude,license_,location,longitude,results,risk,state,violations,zip
1965287,5255 W MADISON ST,RED SNAPPER FISH CHICKEN & PIZZA,CHICAGO,RED SNAPPER FISH CHICKEN & PIZZA,Restaurant,2016-09-26T00:00:00.000,Canvass,41.880237,1991820.0,"{'type': 'Point', 'coordinates': [-87.75722039...",-87.75722,Pass w/ Conditions,Risk 1 (High),IL,"35. WALLS, CEILINGS, ATTACHED EQUIPMENT CONSTR...",60644.0
1329698,5958 W DIVERSEY AVE,TAQUERIA MORELOS,CHICAGO,TAQUERIA MORELOS,Restaurant,2014-02-06T00:00:00.000,Canvass,41.93125,2099479.0,"{'type': 'Point', 'coordinates': [-87.77590699...",-87.775907,Pass,Risk 1 (High),IL,33. FOOD AND NON-FOOD CONTACT EQUIPMENT UTENSI...,60639.0


`iloc`, on the other hand, provides *integer-based* indexing. We can pass a list of integer row positions.

In [25]:
dta.iloc[[0, 2]]

inspection_id,address,aka_name,city,dba_name,facility_type,inspection_date,inspection_type,latitude,license_,location,longitude,results,risk,state,violations,zip
1965287,5255 W MADISON ST,RED SNAPPER FISH CHICKEN & PIZZA,CHICAGO,RED SNAPPER FISH CHICKEN & PIZZA,Restaurant,2016-09-26T00:00:00.000,Canvass,41.880237,1991820.0,"{'type': 'Point', 'coordinates': [-87.75722039...",-87.75722,Pass w/ Conditions,Risk 1 (High),IL,"35. WALLS, CEILINGS, ATTACHED EQUIPMENT CONSTR...",60644.0
470787,5400-5402 N CLARK ST,HAMBURGER MARY'S/MARY'S REC ROOM,CHICAGO,HAMBURGER MARY'S CHICAGO/MARY'S REC ROOM,Restaurant,2010-12-03T00:00:00.000,SFP,41.979884,1933748.0,"{'type': 'Point', 'coordinates': [-87.66842948...",-87.668429,Fail,Risk 1 (High),IL,"6. HANDS WASHED AND CLEANED, GOOD HYGIENIC PRA...",60640.0


Both support Python's **slice notation** (`start:stop:step`). Note that `iloc` slices exclude the stop position, as in plain Python, while `loc` slices include the stop label.

In [28]:
dta.iloc[:5]

inspection_id,address,aka_name,city,dba_name,facility_type,inspection_date,inspection_type,latitude,license_,location,longitude,results,risk,state,violations,zip
1965287,5255 W MADISON ST,RED SNAPPER FISH CHICKEN & PIZZA,CHICAGO,RED SNAPPER FISH CHICKEN & PIZZA,Restaurant,2016-09-26T00:00:00.000,Canvass,41.880237,1991820.0,"{'type': 'Point', 'coordinates': [-87.75722039...",-87.75722,Pass w/ Conditions,Risk 1 (High),IL,"35. WALLS, CEILINGS, ATTACHED EQUIPMENT CONSTR...",60644.0
1329698,5958 W DIVERSEY AVE,TAQUERIA MORELOS,CHICAGO,TAQUERIA MORELOS,Restaurant,2014-02-06T00:00:00.000,Canvass,41.93125,2099479.0,"{'type': 'Point', 'coordinates': [-87.77590699...",-87.775907,Pass,Risk 1 (High),IL,33. FOOD AND NON-FOOD CONTACT EQUIPMENT UTENSI...,60639.0
470787,5400-5402 N CLARK ST,HAMBURGER MARY'S/MARY'S REC ROOM,CHICAGO,HAMBURGER MARY'S CHICAGO/MARY'S REC ROOM,Restaurant,2010-12-03T00:00:00.000,SFP,41.979884,1933748.0,"{'type': 'Point', 'coordinates': [-87.66842948...",-87.668429,Fail,Risk 1 (High),IL,"6. HANDS WASHED AND CLEANED, GOOD HYGIENIC PRA...",60640.0
68091,2804 N CLARK ST,Wells Street Popcorn,CHICAGO,Wells Street Popcorn,Restaurant,2010-02-01T00:00:00.000,Canvass,41.932921,1954774.0,"{'type': 'Point', 'coordinates': [-87.64515454...",-87.645155,Pass,Risk 2 (Medium),IL,,60657.0
1335320,2409 N WESTERN AVE,CHICAGO CUPCAKE,CHICAGO,CHICAGO CUPCAKE LLC.,Mobile Food Dispenser,2013-05-03T00:00:00.000,License Re-Inspection,41.925218,2232391.0,"{'type': 'Point', 'coordinates': [-87.68750659...",-87.687507,Fail,Risk 3 (Low),IL,32. FOOD AND NON-FOOD CONTACT SURFACES PROPERL...,60647.0


In [29]:
dta.loc[:1335320]

inspection_id,address,aka_name,city,dba_name,facility_type,inspection_date,inspection_type,latitude,license_,location,longitude,results,risk,state,violations,zip
1965287,5255 W MADISON ST,RED SNAPPER FISH CHICKEN & PIZZA,CHICAGO,RED SNAPPER FISH CHICKEN & PIZZA,Restaurant,2016-09-26T00:00:00.000,Canvass,41.880237,1991820.0,"{'type': 'Point', 'coordinates': [-87.75722039...",-87.75722,Pass w/ Conditions,Risk 1 (High),IL,"35. WALLS, CEILINGS, ATTACHED EQUIPMENT CONSTR...",60644.0
1329698,5958 W DIVERSEY AVE,TAQUERIA MORELOS,CHICAGO,TAQUERIA MORELOS,Restaurant,2014-02-06T00:00:00.000,Canvass,41.93125,2099479.0,"{'type': 'Point', 'coordinates': [-87.77590699...",-87.775907,Pass,Risk 1 (High),IL,33. FOOD AND NON-FOOD CONTACT EQUIPMENT UTENSI...,60639.0
470787,5400-5402 N CLARK ST,HAMBURGER MARY'S/MARY'S REC ROOM,CHICAGO,HAMBURGER MARY'S CHICAGO/MARY'S REC ROOM,Restaurant,2010-12-03T00:00:00.000,SFP,41.979884,1933748.0,"{'type': 'Point', 'coordinates': [-87.66842948...",-87.668429,Fail,Risk 1 (High),IL,"6. HANDS WASHED AND CLEANED, GOOD HYGIENIC PRA...",60640.0
68091,2804 N CLARK ST,Wells Street Popcorn,CHICAGO,Wells Street Popcorn,Restaurant,2010-02-01T00:00:00.000,Canvass,41.932921,1954774.0,"{'type': 'Point', 'coordinates': [-87.64515454...",-87.645155,Pass,Risk 2 (Medium),IL,,60657.0
1335320,2409 N WESTERN AVE,CHICAGO CUPCAKE,CHICAGO,CHICAGO CUPCAKE LLC.,Mobile Food Dispenser,2013-05-03T00:00:00.000,License Re-Inspection,41.925218,2232391.0,"{'type': 'Point', 'coordinates': [-87.68750659...",-87.687507,Fail,Risk 3 (Low),IL,32. FOOD AND NON-FOOD CONTACT SURFACES PROPERL...,60647.0


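Note that these inspection ids are *not* sorted, yet we can still use slice notation: a `loc` slice selects rows by their order in the DataFrame, from the start label through the stop label.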
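The difference between positional and label slices is easy to check on a small example. This sketch uses an invented four-row table (an assumption for illustration) with the same unsorted ids as above; it relies on the ids being unique, which label slicing on an unsorted index requires.

```python
import pandas as pd

# Hypothetical four-row table with unsorted, unique ids.
toy = pd.DataFrame(
    {"results": ["Pass w/ Conditions", "Pass", "Fail", "Pass"]},
    index=pd.Index([1965287, 1329698, 470787, 68091], name="inspection_id"),
)

# Positional slice: the stop position 1 is excluded, so one row comes back.
by_position = toy.iloc[:1]

# Label slice: the stop label is included, so two rows come back.
by_label = toy.loc[:1329698]

# A label slice between two labels follows row order, not numeric order.
middle = toy.loc[1329698:470787]

print(len(by_position), len(by_label), list(middle.index))
```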

Of course, we can also combine row and column indexing.

In [47]:
dta.iloc[:5, 0:3]


inspection_id,address,aka_name,city
1965287,5255 W MADISON ST,RED SNAPPER FISH CHICKEN & PIZZA,CHICAGO
1329698,5958 W DIVERSEY AVE,TAQUERIA MORELOS,CHICAGO
470787,5400-5402 N CLARK ST,HAMBURGER MARY'S/MARY'S REC ROOM,CHICAGO
68091,2804 N CLARK ST,Wells Street Popcorn,CHICAGO
1335320,2409 N WESTERN AVE,CHICAGO CUPCAKE,CHICAGO


In [49]:
dta.loc[:68091, ["address", "inspection_date"]]

inspection_id,address,inspection_date
1965287,5255 W MADISON ST,2016-09-26T00:00:00.000
1329698,5958 W DIVERSEY AVE,2014-02-06T00:00:00.000
470787,5400-5402 N CLARK ST,2010-12-03T00:00:00.000
68091,2804 N CLARK ST,2010-02-01T00:00:00.000


In [48]:
dta.iloc[:2,[2,6]]

inspection_id,city,inspection_type
1965287,CHICAGO,Canvass
1329698,CHICAGO,Canvass


In [51]:
dta.inspection_date.head()

inspection_id
1965287    2016-09-26T00:00:00.000
1329698    2014-02-06T00:00:00.000
470787     2010-12-03T00:00:00.000
68091      2010-02-01T00:00:00.000
1335320    2013-05-03T00:00:00.000
Name: inspection_date, dtype: object

## Cleaning Data for Types

So far, we've explicitly made an index. Next, we may want to convert the dates to datetime types. Here we'll use the **apply** method to apply a function to each element of a Series.

In [52]:
dta.inspection_date = dta.inspection_date.apply(pd.to_datetime)

In [54]:
dta.inspection_date.head()

inspection_id
1965287   2016-09-26
1329698   2014-02-06
470787    2010-12-03
68091     2010-02-01
1335320   2013-05-03
Name: inspection_date, dtype: datetime64[ns]
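As an aside, `pd.to_datetime` also accepts a whole Series, which avoids one function call per value. The sketch below compares the two approaches on a made-up two-element Series (an assumption for illustration) and checks only that the resulting values agree.

```python
import pandas as pd

# Hypothetical sample of the raw date strings.
raw = pd.Series(
    ["2016-09-26T00:00:00.000", "2014-02-06T00:00:00.000"],
    name="inspection_date",
)

elementwise = raw.apply(pd.to_datetime)  # one call per value, as above
vectorized = pd.to_datetime(raw)         # one call for the whole Series

# Compare the values rather than the dtypes.
same_values = bool((elementwise == vectorized).all())

print(same_values)
```

On a column with many rows, the single call is usually the faster of the two.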

Now let's cast the zip code from a float to a string. Zip codes can start with 0 (though not in Chicago), and we need to account for that.

In [62]:
dta.head()

inspection_id,address,aka_name,city,dba_name,facility_type,inspection_date,inspection_type,latitude,license_,location,longitude,results,risk,state,violations,zip
1965287,5255 W MADISON ST,RED SNAPPER FISH CHICKEN & PIZZA,CHICAGO,RED SNAPPER FISH CHICKEN & PIZZA,Restaurant,2016-09-26,Canvass,41.880237,1991820.0,"{'type': 'Point', 'coordinates': [-87.75722039...",-87.75722,Pass w/ Conditions,Risk 1 (High),IL,"35. WALLS, CEILINGS, ATTACHED EQUIPMENT CONSTR...",60644
1329698,5958 W DIVERSEY AVE,TAQUERIA MORELOS,CHICAGO,TAQUERIA MORELOS,Restaurant,2014-02-06,Canvass,41.93125,2099479.0,"{'type': 'Point', 'coordinates': [-87.77590699...",-87.775907,Pass,Risk 1 (High),IL,33. FOOD AND NON-FOOD CONTACT EQUIPMENT UTENSI...,60639
470787,5400-5402 N CLARK ST,HAMBURGER MARY'S/MARY'S REC ROOM,CHICAGO,HAMBURGER MARY'S CHICAGO/MARY'S REC ROOM,Restaurant,2010-12-03,SFP,41.979884,1933748.0,"{'type': 'Point', 'coordinates': [-87.66842948...",-87.668429,Fail,Risk 1 (High),IL,"6. HANDS WASHED AND CLEANED, GOOD HYGIENIC PRA...",60640
68091,2804 N CLARK ST,Wells Street Popcorn,CHICAGO,Wells Street Popcorn,Restaurant,2010-02-01,Canvass,41.932921,1954774.0,"{'type': 'Point', 'coordinates': [-87.64515454...",-87.645155,Pass,Risk 2 (Medium),IL,,60657
1335320,2409 N WESTERN AVE,CHICAGO CUPCAKE,CHICAGO,CHICAGO CUPCAKE LLC.,Mobile Food Dispenser,2013-05-03,License Re-Inspection,41.925218,2232391.0,"{'type': 'Point', 'coordinates': [-87.68750659...",-87.687507,Fail,Risk 3 (Low),IL,32. FOOD AND NON-FOOD CONTACT SURFACES PROPERL...,60647


In [59]:
import numpy as np


def float_to_zip(zip_code):
    if np.isnan(zip_code):
        return np.nan
    
    # 0 makes sure to left-pad with zero
    # zip codes have 5 digits
    # .0 means, we don't want anything after the decimal
    # f is for float
    zip_code = "{:05.0f}".format(zip_code)
    return zip_code

Here we use Python's **string formatting** facilities to convert from a numeric type to a string. Some of the zip codes are empty in the file, and pandas reads them in as missing values. Pandas uses NumPy's `NaN` to indicate missingness, so we return it unchanged for those rows.
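To see what the format specification does on its own, here is a small sketch with two sample values. The second value, a four-digit number, is a made-up example of a zip code that has lost its leading zero.

```python
# 0 = pad with zeros, 5 = minimum width, .0 = no decimals, f = fixed-point
spec = "{:05.0f}"

five_digits = spec.format(60644.0)  # already five digits wide, nothing added
four_digits = spec.format(2134.0)   # one zero is added on the left

print(five_digits, four_digits)
```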

In [60]:
dta.zip = dta.zip.apply(float_to_zip)

In [61]:
dta.head()

inspection_id,address,aka_name,city,dba_name,facility_type,inspection_date,inspection_type,latitude,license_,location,longitude,results,risk,state,violations,zip
1965287,5255 W MADISON ST,RED SNAPPER FISH CHICKEN & PIZZA,CHICAGO,RED SNAPPER FISH CHICKEN & PIZZA,Restaurant,2016-09-26,Canvass,41.880237,1991820.0,"{'type': 'Point', 'coordinates': [-87.75722039...",-87.75722,Pass w/ Conditions,Risk 1 (High),IL,"35. WALLS, CEILINGS, ATTACHED EQUIPMENT CONSTR...",60644
1329698,5958 W DIVERSEY AVE,TAQUERIA MORELOS,CHICAGO,TAQUERIA MORELOS,Restaurant,2014-02-06,Canvass,41.93125,2099479.0,"{'type': 'Point', 'coordinates': [-87.77590699...",-87.775907,Pass,Risk 1 (High),IL,33. FOOD AND NON-FOOD CONTACT EQUIPMENT UTENSI...,60639
470787,5400-5402 N CLARK ST,HAMBURGER MARY'S/MARY'S REC ROOM,CHICAGO,HAMBURGER MARY'S CHICAGO/MARY'S REC ROOM,Restaurant,2010-12-03,SFP,41.979884,1933748.0,"{'type': 'Point', 'coordinates': [-87.66842948...",-87.668429,Fail,Risk 1 (High),IL,"6. HANDS WASHED AND CLEANED, GOOD HYGIENIC PRA...",60640
68091,2804 N CLARK ST,Wells Street Popcorn,CHICAGO,Wells Street Popcorn,Restaurant,2010-02-01,Canvass,41.932921,1954774.0,"{'type': 'Point', 'coordinates': [-87.64515454...",-87.645155,Pass,Risk 2 (Medium),IL,,60657
1335320,2409 N WESTERN AVE,CHICAGO CUPCAKE,CHICAGO,CHICAGO CUPCAKE LLC.,Mobile Food Dispenser,2013-05-03,License Re-Inspection,41.925218,2232391.0,"{'type': 'Point', 'coordinates': [-87.68750659...",-87.687507,Fail,Risk 3 (Low),IL,32. FOOD AND NON-FOOD CONTACT SURFACES PROPERL...,60647


DataFrames have a `dtypes` attribute for checking the data types. Pandas relies on NumPy's dtype objects. Here we see that the `object` dtype is used to hold strings. This is for technical reasons.

In [63]:
dta.dtypes[['inspection_date', 'zip']]

inspection_date    datetime64[ns]
zip                        object
dtype: object

We can also convert variables' types using `astype`. Here, we'll explicitly cast to the pandas Categorical type, which is a pandas-specific type rather than a native NumPy dtype.

In [67]:
dta.dtypes[['address', 'aka_name', 'latitude', 'location']]

address      object
aka_name     object
latitude    float64
location     object
dtype: object

In [68]:
dta.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 25000 entries, 1965287 to 413268
Data columns (total 16 columns):
address            25000 non-null object
aka_name           24709 non-null object
city               24981 non-null object
dba_name           25000 non-null object
facility_type      24787 non-null object
inspection_date    25000 non-null datetime64[ns]
inspection_type    25000 non-null object
latitude           24865 non-null float64
license_           24996 non-null float64
location           24865 non-null object
longitude          24865 non-null float64
results            25000 non-null object
risk               24995 non-null object
state              24994 non-null object
violations         23908 non-null object
zip                24990 non-null object
dtypes: datetime64[ns](1), float64(3), object(12)
memory usage: 3.9+ MB


In [69]:
dta.results = dta.results.astype('category')
dta.risk = dta.risk.astype('category')
dta.inspection_type = dta.inspection_type.astype('category')
dta.facility_type = dta.facility_type.astype('category')
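It can help to look at what a categorical column stores. In this minimal sketch, the short `results` Series is made up for illustration; the `.cat` accessor exposes the distinct labels and the integer code that stands in for each value.

```python
import pandas as pd

# Hypothetical sample of inspection results.
results = pd.Series(["Pass", "Fail", "Pass", "Pass w/ Conditions", "Pass"])
as_category = results.astype("category")

categories = list(as_category.cat.categories)  # distinct labels, sorted
codes = list(as_category.cat.codes)            # position of each value's label

print(categories)
print(codes)
```

Because each value is stored as a small integer code, a long column with few distinct labels typically takes less memory as a category than as strings.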

Calling `info` again shows the four converted columns with the `category` dtype. We can also restrict the summary to just the categorical and string columns.

In [70]:
dta.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 25000 entries, 1965287 to 413268
Data columns (total 16 columns):
address            25000 non-null object
aka_name           24709 non-null object
city               24981 non-null object
dba_name           25000 non-null object
facility_type      24787 non-null category
inspection_date    25000 non-null datetime64[ns]
inspection_type    25000 non-null category
latitude           24865 non-null float64
license_           24996 non-null float64
location           24865 non-null object
longitude          24865 non-null float64
results            25000 non-null category
risk               24995 non-null category
state              24994 non-null object
violations         23908 non-null object
zip                24990 non-null object
dtypes: category(4), datetime64[ns](1), float64(3), object(8)
memory usage: 3.2+ MB


In [73]:
dta.select_dtypes(['category', 'object']).info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 25000 entries, 1965287 to 413268
Data columns (total 12 columns):
address            25000 non-null object
aka_name           24709 non-null object
city               24981 non-null object
dba_name           25000 non-null object
facility_type      24787 non-null category
inspection_type    25000 non-null category
location           24865 non-null object
results            25000 non-null category
risk               24995 non-null category
state              24994 non-null object
violations         23908 non-null object
zip                24990 non-null object
dtypes: category(4), object(8)
memory usage: 2.5+ MB


We can use the `select_dtypes` method to pull out a DataFrame with only the requested types.

In [75]:
dta.select_dtypes(['category']).head()

inspection_id,facility_type,inspection_type,results,risk
1965287,Restaurant,Canvass,Pass w/ Conditions,Risk 1 (High)
1329698,Restaurant,Canvass,Pass,Risk 1 (High)
470787,Restaurant,SFP,Fail,Risk 1 (High)
68091,Restaurant,Canvass,Pass,Risk 2 (Medium)
1335320,Mobile Food Dispenser,License Re-Inspection,Fail,Risk 3 (Low)
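`select_dtypes` can also exclude types, and it understands the generic `"number"` selector. The sketch below runs on a made-up three-column table (an assumption for illustration) with one categorical, one float, and one datetime column.

```python
import pandas as pd

# Hypothetical table with one column of each kind.
toy = pd.DataFrame({
    "results": pd.Series(["Pass", "Fail"], dtype="category"),
    "latitude": [41.880237, 41.93125],
    "inspection_date": pd.to_datetime(["2016-09-26", "2014-02-06"]),
})

numeric_cols = list(toy.select_dtypes(include="number").columns)
non_numeric_cols = list(toy.select_dtypes(exclude="number").columns)
categorical_cols = list(toy.select_dtypes(include=["category"]).columns)

print(numeric_cols, non_numeric_cols, categorical_cols)
```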


Finally, we might want to exclude a column like `location` since we have the separate `latitude` and `longitude` columns. We can delete columns in a DataFrame using Python's built-in `del` statement.

In [76]:
dta.head()

inspection_id,address,aka_name,city,dba_name,facility_type,inspection_date,inspection_type,latitude,license_,location,longitude,results,risk,state,violations,zip
1965287,5255 W MADISON ST,RED SNAPPER FISH CHICKEN & PIZZA,CHICAGO,RED SNAPPER FISH CHICKEN & PIZZA,Restaurant,2016-09-26,Canvass,41.880237,1991820.0,"{'type': 'Point', 'coordinates': [-87.75722039...",-87.75722,Pass w/ Conditions,Risk 1 (High),IL,"35. WALLS, CEILINGS, ATTACHED EQUIPMENT CONSTR...",60644
1329698,5958 W DIVERSEY AVE,TAQUERIA MORELOS,CHICAGO,TAQUERIA MORELOS,Restaurant,2014-02-06,Canvass,41.93125,2099479.0,"{'type': 'Point', 'coordinates': [-87.77590699...",-87.775907,Pass,Risk 1 (High),IL,33. FOOD AND NON-FOOD CONTACT EQUIPMENT UTENSI...,60639
470787,5400-5402 N CLARK ST,HAMBURGER MARY'S/MARY'S REC ROOM,CHICAGO,HAMBURGER MARY'S CHICAGO/MARY'S REC ROOM,Restaurant,2010-12-03,SFP,41.979884,1933748.0,"{'type': 'Point', 'coordinates': [-87.66842948...",-87.668429,Fail,Risk 1 (High),IL,"6. HANDS WASHED AND CLEANED, GOOD HYGIENIC PRA...",60640
68091,2804 N CLARK ST,Wells Street Popcorn,CHICAGO,Wells Street Popcorn,Restaurant,2010-02-01,Canvass,41.932921,1954774.0,"{'type': 'Point', 'coordinates': [-87.64515454...",-87.645155,Pass,Risk 2 (Medium),IL,,60657
1335320,2409 N WESTERN AVE,CHICAGO CUPCAKE,CHICAGO,CHICAGO CUPCAKE LLC.,Mobile Food Dispenser,2013-05-03,License Re-Inspection,41.925218,2232391.0,"{'type': 'Point', 'coordinates': [-87.68750659...",-87.687507,Fail,Risk 3 (Low),IL,32. FOOD AND NON-FOOD CONTACT SURFACES PROPERL...,60647


In [77]:
del dta['location']

In [78]:
dta.head()

inspection_id,address,aka_name,city,dba_name,facility_type,inspection_date,inspection_type,latitude,license_,longitude,results,risk,state,violations,zip
1965287,5255 W MADISON ST,RED SNAPPER FISH CHICKEN & PIZZA,CHICAGO,RED SNAPPER FISH CHICKEN & PIZZA,Restaurant,2016-09-26,Canvass,41.880237,1991820.0,-87.75722,Pass w/ Conditions,Risk 1 (High),IL,"35. WALLS, CEILINGS, ATTACHED EQUIPMENT CONSTR...",60644
1329698,5958 W DIVERSEY AVE,TAQUERIA MORELOS,CHICAGO,TAQUERIA MORELOS,Restaurant,2014-02-06,Canvass,41.93125,2099479.0,-87.775907,Pass,Risk 1 (High),IL,33. FOOD AND NON-FOOD CONTACT EQUIPMENT UTENSI...,60639
470787,5400-5402 N CLARK ST,HAMBURGER MARY'S/MARY'S REC ROOM,CHICAGO,HAMBURGER MARY'S CHICAGO/MARY'S REC ROOM,Restaurant,2010-12-03,SFP,41.979884,1933748.0,-87.668429,Fail,Risk 1 (High),IL,"6. HANDS WASHED AND CLEANED, GOOD HYGIENIC PRA...",60640
68091,2804 N CLARK ST,Wells Street Popcorn,CHICAGO,Wells Street Popcorn,Restaurant,2010-02-01,Canvass,41.932921,1954774.0,-87.645155,Pass,Risk 2 (Medium),IL,,60657
1335320,2409 N WESTERN AVE,CHICAGO CUPCAKE,CHICAGO,CHICAGO CUPCAKE LLC.,Mobile Food Dispenser,2013-05-03,License Re-Inspection,41.925218,2232391.0,-87.687507,Fail,Risk 3 (Low),IL,32. FOOD AND NON-FOOD CONTACT SURFACES PROPERL...,60647
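`del` changes the DataFrame in place. If you would rather keep the original around, the `drop` method returns a new DataFrame instead. Here is a minimal sketch of both on a made-up one-row table (an assumption for illustration).

```python
import pandas as pd

# Hypothetical one-row table with a redundant location column.
toy = pd.DataFrame({
    "latitude": [41.880237],
    "longitude": [-87.75722],
    "location": ["{'type': 'Point'}"],
})

# drop returns a new DataFrame and leaves `toy` untouched.
without_location = toy.drop(columns=["location"])

# del modifies the DataFrame it is applied to.
scratch = toy.copy()
del scratch["location"]

print(list(toy.columns), list(without_location.columns), list(scratch.columns))
```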


## Dealing with Types using csv Reader

We can do everything that we did above by providing options to `pd.read_csv`.

We saw before that `csv` reads everything in as strings, `json` does some type conversion with facility for doing more, and `pandas` does a bit more type conversion but it isn't always what we want. For example, we want the zip codes to stay strings.

Let's take a look at how to do this with pandas `read_csv`. First, we can use the `parse_dates` argument to read in the larger inspections data sample and tell pandas that one of our columns is a date column. We'll also go ahead and make `inspection_id` the index.

In [85]:
dta = pd.read_csv(
    "data/health_inspection_chi.csv", 
    index_col="inspection_id",
    parse_dates=["inspection_date"]
)
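If you want to try these options without the data file, `read_csv` also accepts a file-like object. The following sketch feeds it a made-up two-row CSV through `io.StringIO` (the text is an assumption for illustration) and checks that the date column came back as a datetime type.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for the real file.
csv_text = (
    "inspection_id,inspection_date,results\n"
    "1965287,2016-09-26T00:00:00.000,Pass w/ Conditions\n"
    "1329698,2014-02-06T00:00:00.000,Pass\n"
)

toy = pd.read_csv(
    io.StringIO(csv_text),
    index_col="inspection_id",
    parse_dates=["inspection_date"],
)

# dtype.kind is 'M' for datetime64 columns.
date_dtype_kind = toy["inspection_date"].dtype.kind

print(toy.index.name, date_dtype_kind)
```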

Next, we want to turn the zip codes into strings. Here, unlike above, the function must assume that its input (read from the file) is a string.

In [79]:
import numpy as np


def float_to_zip(zip_code):
    # convert from the string in the file to a float
    try:
        zip_code = float(zip_code)
    except ValueError:  # some of them are empty
        return np.nan
    
    # 0 makes sure to left-pad with zero
    # zip codes have 5 digits
    # .0 means, we don't want anything after the decimal
    # f is for float
    zip_code = "{:05.0f}".format(zip_code)
    return zip_code

In [84]:
float_to_zip('123')

'00123'

In [83]:
float_to_zip('1234567')

'1234567'

As another example of defensive programming, we have to make sure that empty strings are handled.

In [82]:
float_to_zip('')

nan

We can supply this function to the `converters` argument.

In [86]:
dta = pd.read_csv(
    "data/health_inspection_chi.csv",
    converters={
        'zip': float_to_zip
    },
)

In [87]:
dta.head()

Unnamed: 0,address,aka_name,city,dba_name,facility_type,inspection_date,inspection_id,inspection_type,latitude,license_,location,longitude,results,risk,state,violations,zip
0,5255 W MADISON ST,RED SNAPPER FISH CHICKEN & PIZZA,CHICAGO,RED SNAPPER FISH CHICKEN & PIZZA,Restaurant,2016-09-26T00:00:00.000,1965287,Canvass,41.880237,1991820.0,"{'type': 'Point', 'coordinates': [-87.75722039...",-87.75722,Pass w/ Conditions,Risk 1 (High),IL,"35. WALLS, CEILINGS, ATTACHED EQUIPMENT CONSTR...",60644
1,5958 W DIVERSEY AVE,TAQUERIA MORELOS,CHICAGO,TAQUERIA MORELOS,Restaurant,2014-02-06T00:00:00.000,1329698,Canvass,41.93125,2099479.0,"{'type': 'Point', 'coordinates': [-87.77590699...",-87.775907,Pass,Risk 1 (High),IL,33. FOOD AND NON-FOOD CONTACT EQUIPMENT UTENSI...,60639
2,5400-5402 N CLARK ST,HAMBURGER MARY'S/MARY'S REC ROOM,CHICAGO,HAMBURGER MARY'S CHICAGO/MARY'S REC ROOM,Restaurant,2010-12-03T00:00:00.000,470787,SFP,41.979884,1933748.0,"{'type': 'Point', 'coordinates': [-87.66842948...",-87.668429,Fail,Risk 1 (High),IL,"6. HANDS WASHED AND CLEANED, GOOD HYGIENIC PRA...",60640
3,2804 N CLARK ST,Wells Street Popcorn,CHICAGO,Wells Street Popcorn,Restaurant,2010-02-01T00:00:00.000,68091,Canvass,41.932921,1954774.0,"{'type': 'Point', 'coordinates': [-87.64515454...",-87.645155,Pass,Risk 2 (Medium),IL,,60657
4,2409 N WESTERN AVE,CHICAGO CUPCAKE,CHICAGO,CHICAGO CUPCAKE LLC.,Mobile Food Dispenser,2013-05-03T00:00:00.000,1335320,License Re-Inspection,41.925218,2232391.0,"{'type': 'Point', 'coordinates': [-87.68750659...",-87.687507,Fail,Risk 3 (Low),IL,32. FOOD AND NON-FOOD CONTACT SURFACES PROPERL...,60647
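To check the converter's behavior in isolation, the sketch below runs it over a tiny in-memory file. The three rows are hypothetical and chosen to cover a float-looking zip, a zip that lost its leading zero, and an empty field; the function is repeated only so the block runs on its own.

```python
import io

import numpy as np
import pandas as pd


def float_to_zip(zip_code):
    # Same logic as the function defined above.
    try:
        zip_code = float(zip_code)
    except ValueError:  # empty fields arrive as ''
        return np.nan
    return "{:05.0f}".format(zip_code)


# Hypothetical rows: float-looking zip, short zip, missing zip.
raw = io.StringIO("name,zip\nA,60644.0\nB,2134\nC,\n")

small = pd.read_csv(raw, converters={"zip": float_to_zip})
print(small["zip"].tolist())
```

Each value passes through the function as a string, so the result is a column of zero-padded strings with `nan` where the field was empty.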


To exclude the `location` column, we can take advantage of the fact that the `usecols` argument accepts a function. The function is called with each column name, and only the columns for which it returns `True` are kept.

In [None]:
dta = pd.read_csv(
    "data/health_inspection_chi.csv",
    usecols=lambda col: col != 'location'
)

Here we are using a **lambda function** that returns `False` for the `location` column name and `True` for every other one. Lambda functions are what are known as anonymous functions, because they don't have a name. This kind of thing is precisely their intended use.
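A quick way to convince yourself of what the callable does is to run it against a one-row, in-memory file. This is only a sketch; the row is made up.

```python
import io

import pandas as pd

# Hypothetical one-row file with a column we want to skip.
raw = io.StringIO("address,location,zip\n5255 W MADISON ST,POINT,60644\n")

# The callable is evaluated against each column name.
small = pd.read_csv(raw, usecols=lambda col: col != "location")
print(list(small.columns))
```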

As another example, here we map the identity function, written as `lambda x: x`, over a list.

In [None]:
list(map(lambda x: x, [1, 2, 3]))
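For comparison, a lambda is interchangeable with a small named function. The sketch below uses a hypothetical helper called `keep_column` to show that both forms filter a list of column names the same way.

```python
columns = ["address", "location", "zip"]


def keep_column(col):
    # Named equivalent of: lambda col: col != "location"
    return col != "location"


kept_with_lambda = list(filter(lambda col: col != "location", columns))
kept_with_def = list(filter(keep_column, columns))

print(kept_with_lambda)
print(kept_with_def)
```

The lambda saves a definition when the function is used once and fits on a line; a named function is usually easier to read and test once the logic grows.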

Finally, in a few cases we may want to take advantage of the pandas native `categorical` type. We can use the `dtype` argument for this, passing a dictionary of type mappings.

In [89]:
dta = pd.read_csv(
    "data/health_inspection_chi.csv",
    dtype={
        'results': 'category',
        'risk': 'category',
        'inspection_type': 'category',
        'facility_type': 'category'
    }
)

In [90]:
dta.risk.head()

0      Risk 1 (High)
1      Risk 1 (High)
2      Risk 1 (High)
3    Risk 2 (Medium)
4       Risk 3 (Low)
Name: risk, dtype: category
Categories (4, object): [All, Risk 1 (High), Risk 2 (Medium), Risk 3 (Low)]
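The effect of the `dtype` mapping can be seen on a handful of rows. In this sketch the file contents are made up; only the `risk` labels are borrowed from the output above.

```python
import io

import pandas as pd

# Hypothetical single-column file with a repeated label.
raw = io.StringIO("risk\nRisk 1 (High)\nRisk 3 (Low)\nRisk 1 (High)\n")

small = pd.read_csv(raw, dtype={"risk": "category"})

print(small["risk"].dtype)
print(list(small["risk"].cat.categories))  # each distinct label is stored once
```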

## Exercise

Put all of the above `read_csv` options together in a single call to `read_csv`.

In [None]:
# Type your solution here

In [92]:
# %load solutions/pandas_read_csv.py
import numpy as np
import pandas as pd


def float_to_zip(zip_code):
    # convert from the string in the file to a float
    try:
        zip_code = float(zip_code)
    except ValueError:  # some of them are empty
        return np.nan

    # 0 makes sure to left-pad with zero
    # zip codes have 5 digits
    # .0 means, we don't want anything after the decimal
    # f is for float
    zip_code = "{:05.0f}".format(zip_code)
    return zip_code


dta = pd.read_csv(
    "data/health_inspection_chi.csv",
    index_col='inspection_id',
    parse_dates=['inspection_date'],
    converters={
        'zip': float_to_zip
    },
    usecols=lambda col: col != 'location',
    dtype={
        'results': 'category',
        'risk': 'category',
        'inspection_type': 'category',
        'facility_type': 'category'
    }
)


assert float_to_zip('1234') == '01234'
assert float_to_zip('123456') == '123456'
assert np.isnan(float_to_zip(''))


## String Cleaning

Ok, let's start to dig into the data a little bit more. One of the things we're going to be really interested in exploring is the free text of the violations field.

The first thing to notice is that the violations field has null values in it.

In [93]:
dta.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 25000 entries, 1965287 to 413268
Data columns (total 15 columns):
address            25000 non-null object
aka_name           24709 non-null object
city               24981 non-null object
dba_name           25000 non-null object
facility_type      24787 non-null category
inspection_date    25000 non-null datetime64[ns]
inspection_type    25000 non-null category
latitude           24865 non-null float64
license_           24996 non-null float64
longitude          24865 non-null float64
results            25000 non-null category
risk               24995 non-null category
state              24994 non-null object
violations         23908 non-null object
zip                24990 non-null object
dtypes: category(4), datetime64[ns](1), float64(3), object(7)
memory usage: 2.4+ MB


We may want to ask ourselves if these values are missing at random or if there is some reason there's no written violation field.

In [94]:
dta.loc[dta.violations.isnull()].head()

Unnamed: 0_level_0,address,aka_name,city,dba_name,facility_type,inspection_date,inspection_type,latitude,license_,longitude,results,risk,state,violations,zip
inspection_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
68091,2804 N CLARK ST,Wells Street Popcorn,CHICAGO,Wells Street Popcorn,Restaurant,2010-02-01,Canvass,41.932921,1954774.0,-87.645155,Pass,Risk 2 (Medium),IL,,60657
233722,3121 W CERMAK RD,TAQUERIA EL PALMAR,CHICAGO,TAQUERIA EL PALMAR,,2010-05-11,Canvass,41.851674,1243326.0,-87.703788,Fail,Risk 1 (High),IL,,60623
284278,3813-3815 W CHICAGO AVE,SUGA RAY'S SPORTS GRILL,CHICAGO,SUGA RAY'S SPORTS GRILL,Restaurant,2010-08-12,Canvass,41.895297,1922230.0,-87.7218,Out of Business,Risk 2 (Medium),IL,,60651
231272,1204 W 36TH PL,MOBILE TRUCK #13,CHICAGO,THUNDERBIRD CATERING,Mobile Food Dispenser,2010-03-22,License,41.828094,1476473.0,-87.655854,Pass,Risk 3 (Low),IL,,60609
277874,2826 N LINCOLN AVE,MGM Catering,CHICAGO,MGM Catering,Catering,2010-08-02,License,41.933101,2037141.0,-87.659683,Pass,Risk 1 (High),IL,,60657


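Nothing obvious stands out in these rows: the missing values show up across different results and facility types, so we'll move on. The next thing to notice is that the violations field actually packs many violations into a single string for each visit.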

In [97]:
with pd.option_context("display.max_colwidth", 500):
    print(dta.violations.head())

inspection_id
1965287    35. WALLS, CEILINGS, ATTACHED EQUIPMENT CONSTRUCTED PER CODE: GOOD REPAIR, SURFACES CLEAN AND DUST-LESS CLEANING METHODS - Comments: MUST CLEAN THE WALLS AT WALL BASE NEAR THE MIXER IN REAR OF PREMISES AND THE PREP AREA OF FOOD SPILLS AND CLEAN THE WALL VENT IN PREP AREA ,INSTRUCTED TO CLEAN AND MAINTAIN AREA | 33. FOOD AND NON-FOOD CONTACT EQUIPMENT UTENSILS CLEAN, FREE OF ABRASIVE DETERGENTS - Comments: MUST CLEAN THE INTERIOR PANEL OF THE ICE MACHINE IN REAR OF PREMISES | 34. FLOORS: CONS...
1329698    33. FOOD AND NON-FOOD CONTACT EQUIPMENT UTENSILS CLEAN, FREE OF ABRASIVE DETERGENTS - Comments: Non food contact surfaces of ice machine not clean, needs cleaning. \nNon food contact surfaces of cooler shelving/racks not clean, need cleaning. \nPrep table lower shelving not clean, need detailed cleaning(crevices). | 34. FLOORS: CONSTRUCTED PER CODE, CLEANED, GOOD REPAIR, COVING INSTALLED, DUST-LESS CLEANING METHODS USED - Comments: Floors under heavy equipment

Let's split these out to make a longer DataFrame where each violation is a single row. Pandas provides a nice way to munge string data through the `str` accessor on string columns.

```python
dta.violations.str.<TAB>
```

In [103]:
print(dta.violations.str)

<pandas.core.strings.StringMethods object at 0x11b100710>
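As a rough illustration of what the accessor offers, the sketch below applies two of its methods to a small, made-up Series. The methods work element by element across the whole column.

```python
import numpy as np
import pandas as pd

# Hypothetical violations text; the last entry is missing.
text = pd.Series(["33. CLEAN | 34. FLOORS", "35. WALLS", np.nan])

lowered = text.str.lower()                # lower-case every entry
has_floors = text.str.contains("FLOORS")  # substring test per entry

print(lowered)
print(has_floors)
```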


## Exercise

Let's see how many violations we have per visit. What does the distribution of violations look like? Explore the methods on the `str` accessor and, perhaps, the `quantile` method.

In [110]:
# Type your solution here
df = pd.DataFrame(np.array([[1, 1], [2, 10], [3, 100], [4, 100]]),
                  columns=['a', 'b'])
df


Unnamed: 0,a,b
0,1,1
1,2,10
2,3,100
3,4,100


In [116]:
df.quantile(0.6)

a     2.8
b    82.0
Name: 0.6, dtype: float64

In [114]:
df.quantile([.1, .5])

Unnamed: 0,a,b
0.1,1.3,3.7
0.5,2.5,55.0


In [105]:
# %load solutions/violation_distribution.py
import numpy as np
import pandas as pd


def float_to_zip(zip_code):
    # convert from the string in the file to a float
    try:
        zip_code = float(zip_code)
    except ValueError:  # some of them are empty
        return np.nan

    # 0 makes sure to left-pad with zero
    # zip codes have 5 digits
    # .0 means, we don't want anything after the decimal
    # f is for float
    zip_code = "{:05.0f}".format(zip_code)
    return zip_code


dta = pd.read_csv(
    "data/health_inspection_chi.csv",
    index_col='inspection_id',
    parse_dates=['inspection_date'],
    converters={
        'zip': float_to_zip
    },
    usecols=lambda col: col != 'location',
    dtype={
        'results': 'category',
        'risk': 'category',
        'inspection_type': 'category',
        'facility_type': 'category'
    }
)


quantiles = [0, .05, .25, .50, .75, .95, 1.00]
(dta.violations.str.count(r"\|") + 1).quantile(quantiles)


0.00     1.0
0.05     2.0
0.25     3.0
0.50     4.0
0.75     6.0
0.95    10.0
1.00    19.0
Name: violations, dtype: float64

In [112]:
x = np.array([-0.204708, 1.965781, 0.478943])

In [113]:
(x[1] - x[2])/0.5*0.4+x[2]

1.6684134

Ok, we have a manageable number of violations. Let's split the violations and then turn them into a long DataFrame with a single row for each violation within each visit.

In [106]:
dta.violations.str.count(r"\|") + 1

inspection_id
1965287     6.0
1329698     4.0
470787      9.0
68091       NaN
1335320     6.0
1228169     2.0
1285582     6.0
557486      5.0
74468       4.0
1522863     5.0
1365288     3.0
1150454     4.0
1453434     4.0
1098457     6.0
1978957     3.0
1971125     3.0
1434561     3.0
606496      6.0
1322114     8.0
1434800     2.0
1096484     2.0
1360289     6.0
1448016     4.0
250608      3.0
1324353     6.0
468107      4.0
1948939     5.0
1751612     8.0
606575      2.0
68160       8.0
           ... 
347209      5.0
1975445     2.0
1326783     9.0
1975775     3.0
612615      2.0
197247      NaN
413579      NaN
2049615     4.0
1230080     4.0
1393310     6.0
1946768    12.0
529379      3.0
1464890     2.0
1182230     4.0
1966555     4.0
1522771     5.0
626394      4.0
120509      2.0
604507      4.0
1184322     4.0
2059403     5.0
114871      7.0
657253     12.0
531556      2.0
325228      5.0
2059771     2.0
1965378     6.0
1490395     4.0
1326565     2.0
413268      NaN
Name: viol

In [117]:
violations = dta.violations.str.split(r"\|", expand=True)
violations.head()

Unnamed: 0_level_0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
inspection_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1
1965287,"35. WALLS, CEILINGS, ATTACHED EQUIPMENT CONSTR...",33. FOOD AND NON-FOOD CONTACT EQUIPMENT UTENS...,"34. FLOORS: CONSTRUCTED PER CODE, CLEANED, GO...",36. LIGHTING: REQUIRED MINIMUM FOOT-CANDLES O...,22. DISH MACHINES: PROVIDED WITH ACCURATE THE...,40. REFRIGERATION AND METAL STEM THERMOMETERS...,,,,,,,,,,,,,
1329698,33. FOOD AND NON-FOOD CONTACT EQUIPMENT UTENSI...,"34. FLOORS: CONSTRUCTED PER CODE, CLEANED, GO...",36. LIGHTING: REQUIRED MINIMUM FOOT-CANDLES O...,38. VENTILATION: ROOMS AND EQUIPMENT VENTED A...,,,,,,,,,,,,,,,
470787,"6. HANDS WASHED AND CLEANED, GOOD HYGIENIC PRA...","11. ADEQUATE NUMBER, CONVENIENT, ACCESSIBLE, ...",19. OUTSIDE GARBAGE WASTE GREASE AND STORAGE ...,36. LIGHTING: REQUIRED MINIMUM FOOT-CANDLES O...,3. POTENTIALLY HAZARDOUS FOOD MEETS TEMPERATU...,33. FOOD AND NON-FOOD CONTACT EQUIPMENT UTENS...,32. FOOD AND NON-FOOD CONTACT SURFACES PROPER...,"35. WALLS, CEILINGS, ATTACHED EQUIPMENT CONST...",40. REFRIGERATION AND METAL STEM THERMOMETERS...,,,,,,,,,,
68091,,,,,,,,,,,,,,,,,,,
1335320,32. FOOD AND NON-FOOD CONTACT SURFACES PROPERL...,38. VENTILATION: ROOMS AND EQUIPMENT VENTED A...,"9. WATER SOURCE: SAFE, HOT & COLD UNDER CITY ...",12. HAND WASHING FACILITIES: WITH SOAP AND SA...,18. NO EVIDENCE OF RODENT OR INSECT OUTER OPE...,"30. FOOD IN ORIGINAL CONTAINER, PROPERLY LABE...",,,,,,,,,,,,,
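The shape of this result is easier to see on a toy Series. In the sketch below the ids and violation text are made up; `expand=True` gives one column per piece, padded with missing values for rows that have fewer pieces.

```python
import numpy as np
import pandas as pd

# Hypothetical visits with three, one, and no recorded violations.
text = pd.Series(
    ["33. CLEAN | 34. FLOORS | 36. LIGHTING", "35. WALLS", np.nan],
    index=pd.Index([101, 102, 103], name="inspection_id"),
)

wide = text.str.split(r"\|", expand=True)

print(wide.shape)   # one row per visit, one column per piece
print(wide)
```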


When we `unstack` the DataFrame, we get back a Series indexed by what's called a `MultiIndex`. This index has two *levels*. One is the original `inspection_id`. The other holds the rather meaningless column names.

In [118]:
violations.unstack().head()

   inspection_id
0  1965287          35. WALLS, CEILINGS, ATTACHED EQUIPMENT CONSTR...
   1329698          33. FOOD AND NON-FOOD CONTACT EQUIPMENT UTENSI...
   470787           6. HANDS WASHED AND CLEANED, GOOD HYGIENIC PRA...
   68091                                                          NaN
   1335320          32. FOOD AND NON-FOOD CONTACT SURFACES PROPERL...
dtype: object
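To inspect the two levels directly, here is a small sketch on a hypothetical two-by-two frame shaped like the split result.

```python
import pandas as pd

# Hypothetical wide frame: one row per visit, one column per violation slot.
wide = pd.DataFrame(
    {0: ["33. CLEAN", "35. WALLS"], 1: ["34. FLOORS", None]},
    index=pd.Index([101, 102], name="inspection_id"),
)

stacked = wide.unstack()

print(type(stacked.index).__name__)
print(stacked.index.names)   # outer level: column labels; inner level: inspection_id
print(stacked)
```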

Let's get rid of the empty rows first.

In [119]:
violations = violations.unstack().dropna()

Now we can drop the column name level, which we don't need.

In [123]:
violations.reset_index(level=0, drop=True, inplace=True)

In [124]:
violations.head()

inspection_id
1965287    35. WALLS, CEILINGS, ATTACHED EQUIPMENT CONSTR...
1329698    33. FOOD AND NON-FOOD CONTACT EQUIPMENT UTENSI...
470787     6. HANDS WASHED AND CLEANED, GOOD HYGIENIC PRA...
1335320    32. FOOD AND NON-FOOD CONTACT SURFACES PROPERL...
1228169    38. VENTILATION: ROOMS AND EQUIPMENT VENTED AS...
dtype: object
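The two cleanup steps can also be chained. The following sketch builds a small two-level Series by hand (the values and ids are hypothetical) and applies `dropna` followed by `reset_index(level=0, drop=True)`.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for the unstacked violations.
index = pd.MultiIndex.from_product(
    [[0, 1], [101, 102]], names=[None, "inspection_id"]
)
stacked = pd.Series(
    ["33. CLEAN", "35. WALLS", " 34. FLOORS", np.nan], index=index
)

# Drop the empty slots, then discard the outer (column label) level.
cleaned = stacked.dropna().reset_index(level=0, drop=True)
print(cleaned)
```

Note that an `inspection_id` now appears once per violation, so the index is no longer unique.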

One last cleaning step may be helpful here. When we split on the pipe (`|`), we likely kept some surrounding whitespace. We can remove that.

In [None]:
violations.str.startswith(" ").any()

In [None]:
violations.str.strip().head()

In [None]:
violations = violations.str.strip()

In [None]:
((violations.str.startswith(" ").any()) | 
 (violations.str.endswith(" ").any()))
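The before-and-after check can be reproduced on a few made-up strings. This sketch assumes pieces that carry the padding left over from splitting on `|`.

```python
import pandas as pd

# Hypothetical pieces with leftover padding from the split.
pieces = pd.Series(["33. CLEAN ", " 34. FLOORS ", " 36. LIGHTING"])

padded = pieces.str.startswith(" ") | pieces.str.endswith(" ")
stripped = pieces.str.strip()
still_padded = stripped.str.startswith(" ") | stripped.str.endswith(" ")

print(padded.any(), still_padded.any())
```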

Later, we'll see how to combine these violations back with our original data to do some analysis.

## Working with Dates and Categoricals

Above, we used the `str` accessor on a string column of the DataFrame. This isn't the only convenient accessor that pandas provides. There is also the `dt` accessor for datetime types and the `cat` accessor for categorical types.

```python
dta.inspection_date.dt.<TAB>
```

In [None]:
dta.inspection_date.head()

In [None]:
dta.inspection_date.dt.month.head()
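For a sense of what else lives on `dt`, here is a sketch using three dates taken from the sample rows shown earlier. The attributes and methods return one value per row.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2016-09-26", "2014-02-06", "2010-12-03"]))

years = dates.dt.year
months = dates.dt.month
weekdays = dates.dt.day_name()

print(years.tolist())
print(months.tolist())
print(weekdays.tolist())
```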

Now, let's take a look at the categorical types.

```python
dta.risk.cat.<TAB>
```

In [None]:
dta.risk.head()

In [None]:
dta.risk.cat.codes.head()
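The relationship between categories and codes is easiest to see on a tiny Series. In this sketch (values borrowed from the `risk` column, with one missing entry added for illustration), each code is the position of the label in `cat.categories`, and a missing value gets the code `-1`.

```python
import pandas as pd

risk = pd.Series(
    ["Risk 1 (High)", "Risk 3 (Low)", "Risk 1 (High)", None],
    dtype="category",
)

print(list(risk.cat.categories))
print(risk.cat.codes.tolist())
```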