# Data Validation

In the previous notebook, two pins were saved:

- City of Chicago - Business License Data (RAW): `chicago-business-license-data`
- City of Chicago - Food Inspection Data (RAW): `chicago-food-inspection-data`

## Setup

In [1]:
import os

import ibis
import pins
import pandas as pd
import numpy as np
import pandera as pa
from sqlalchemy import create_engine

In [2]:
pd.options.display.max_columns = 999

In [3]:
# Set up the board
board = pins.board_connect()
user_name = "sam.edwardes"

In [4]:
# Database details
db_user = "posit"
db_password = os.getenv("DB_PASSWORD")
db_host = "posit-conf-2023-ds-workflowsf5086c0.cpbvczwgws3n.us-east-2.rds.amazonaws.com"
db_port = 5432
db_database = "python_workshop"

# Set up sqlalchemy for writing data
engine = create_engine(f"postgresql+psycopg2://{db_user}:{db_password}@{db_host}:{db_port}/{db_database}")

# Set up ibis for reading data
con = ibis.postgres.connect(
    user=db_user,
    password=db_password,
    host=db_host,
    port=db_port,
    database=db_database
)

## Tips

- Use multiple cursors in VS Code to easily edit many lines at the same time (<https://code.visualstudio.com/docs/getstarted/tips-and-tricks#_column-box-selection>).
- Use `df["col_name"].value_counts()` to understand the distribution of categorical columns.
- Use `df["col_name"].hist()` to understand the distribution of numeric columns.
- Use `df.info()` to understand column types and null values.
- Use [ydata-profiling](https://pypi.org/project/ydata-profiling/) to generate an automated data report.

```python
from ydata_profiling import ProfileReport
ProfileReport(df)
```
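As a minimal sketch of the first few tips, the following uses a small made-up DataFrame (the column names and values are illustrative assumptions, not rows from the Chicago data) to show what `value_counts()` and `info()` report, plus a per-column count of missing values.

```python
import io

import pandas as pd

# Hypothetical toy data standing in for a raw table.
df = pd.DataFrame({
    "results": ["Pass", "Fail", "Pass", None],
    "latitude": [41.93, 41.89, None, 41.96],
})

# Distribution of a categorical column; dropna=False keeps missing values visible.
result_counts = df["results"].value_counts(dropna=False)
print(result_counts)

# Column types and non-null counts. info() prints to stdout by default, so
# capture it in a buffer when the text is needed as a string.
buffer = io.StringIO()
df.info(buf=buffer)
print(buffer.getvalue())

# Count of missing values per column.
null_counts = df.isna().sum()
print(null_counts)
```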

## Load raw data

Use `ibis` to read the data from Postgres.

In [5]:
business_license_raw = con.table("business_license_raw").to_pandas()

In [6]:
food_inspection_raw = con.table("food_inspection_raw").to_pandas()

## Data set (1): Business License Data

<https://data.cityofchicago.org/Community-Economic-Development/Business-Licenses/r5kz-chrr>

In [7]:
business_license_raw

Unnamed: 0,id,license_id,account_number,site_number,legal_name,doing_business_as_name,address,city,state,zip_code,ward,precinct,ward_precinct,police_district,license_code,license_description,business_activity_id,business_activity,license_number,application_type,application_created_date,application_requirements_complete,payment_date,conditional_approval,license_start_date,expiration_date,license_approved_for_issuance,date_issued,license_status,license_status_change_date,ssa,latitude,longitude,location
0,69910-20201016,2745698,21299,2,JOHN J SCOTT,J & R KLASSY CAR WASH,6956 S HALSTED ST 1ST,CHICAGO,IL,60621,16,15,16-15,7,1010,Limited Business License,,,69910,RENEW,,2020-08-15T00:00:00.000,2020-10-22T00:00:00.000,N,2020-10-16T00:00:00.000,2022-10-15T00:00:00.000,2020-10-22T00:00:00.000,2020-10-26T00:00:00.000,AAI,,,41.767160744,-87.644608148,"\n, \n(41.767160744200304, -87.6446081478298)"
1,69910-20221016,2864066,21299,2,JOHN J SCOTT,J & R KLASSY CAR WASH,6956 S HALSTED ST 1ST,CHICAGO,IL,60621,16,15,16-15,7,1010,Limited Business License,,,69910,RENEW,,2022-08-15T00:00:00.000,2022-10-20T00:00:00.000,N,2022-10-16T00:00:00.000,2024-10-15T00:00:00.000,2022-10-20T00:00:00.000,2022-10-21T00:00:00.000,AAI,,,41.767160744,-87.644608148,"\n, \n(41.767160744200304, -87.6446081478298)"
2,69911-20011116,1177990,21299,3,JOHN J SCOTT,J & R KLASSY SNACK SHOP,6958 S HALSTED ST 1ST,CHICAGO,IL,60621,16,15,16-15,7,1006,Retail Food Establishment,775,Retail Sales of Perishable Foods,69911,RENEW,,2001-10-03T00:00:00.000,2001-10-31T00:00:00.000,N,2001-11-16T00:00:00.000,2002-11-15T00:00:00.000,2002-01-08T00:00:00.000,2002-01-09T00:00:00.000,AAI,,,41.76710418,-87.644606795,"\n, \n(41.767104180059185, -87.64460679475337)"
3,69911-20021116,1283205,21299,3,JOHN J SCOTT,J & R KLASSY SNACK SHOP,6958 S HALSTED ST 1ST,CHICAGO,IL,60621,16,15,16-15,7,1006,Retail Food Establishment,775,Retail Sales of Perishable Foods,69911,RENEW,,2002-09-24T00:00:00.000,2002-10-17T00:00:00.000,N,2002-11-16T00:00:00.000,2003-11-15T00:00:00.000,2002-11-25T00:00:00.000,2002-11-25T00:00:00.000,AAI,,,41.76710418,-87.644606795,"\n, \n(41.767104180059185, -87.64460679475337)"
4,69911-20031116,1428793,21299,3,JOHN J SCOTT,J & R KLASSY SNACK SHOP,6958 S HALSTED ST 1ST,CHICAGO,IL,60621,16,15,16-15,7,1006,Retail Food Establishment,775,Retail Sales of Perishable Foods,69911,RENEW,,2003-09-26T00:00:00.000,2003-11-07T00:00:00.000,N,2003-11-16T00:00:00.000,2004-11-15T00:00:00.000,2003-12-22T00:00:00.000,2004-02-26T00:00:00.000,AAI,,,41.76710418,-87.644606795,"\n, \n(41.767104180059185, -87.64460679475337)"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1104485,69910-20121016,2180517,21299,2,JOHN J SCOTT,J & R KLASSY CAR WASH,6956 S HALSTED ST 1ST,CHICAGO,IL,60621,16,15,16-15,7,1010,Limited Business License,,,69910,RENEW,,2012-08-15T00:00:00.000,2012-09-26T00:00:00.000,N,2012-10-16T00:00:00.000,2014-10-15T00:00:00.000,2012-09-26T00:00:00.000,2012-09-27T00:00:00.000,AAI,,,41.767160744,-87.644608148,"\n, \n(41.767160744200304, -87.6446081478298)"
1104486,69910-20141016,2351014,21299,2,JOHN J SCOTT,J & R KLASSY CAR WASH,6956 S HALSTED ST 1ST,CHICAGO,IL,60621,16,15,16-15,7,1010,Limited Business License,,,69910,RENEW,,2014-08-15T00:00:00.000,2014-11-12T00:00:00.000,N,2014-10-16T00:00:00.000,2016-10-15T00:00:00.000,2014-11-12T00:00:00.000,2014-11-14T00:00:00.000,AAI,,,41.767160744,-87.644608148,"\n, \n(41.767160744200304, -87.6446081478298)"
1104487,69910-20161016,2484690,21299,2,JOHN J SCOTT,J & R KLASSY CAR WASH,6956 S HALSTED ST 1ST,CHICAGO,IL,60621,16,15,16-15,7,1010,Limited Business License,,,69910,RENEW,,2016-08-15T00:00:00.000,2016-10-13T00:00:00.000,N,2016-10-16T00:00:00.000,2018-10-15T00:00:00.000,2016-10-13T00:00:00.000,2016-10-14T00:00:00.000,AAI,,,41.767160744,-87.644608148,"\n, \n(41.767160744200304, -87.6446081478298)"
1104488,69910-20181016,2618081,21299,2,JOHN J SCOTT,J & R KLASSY CAR WASH,6956 S HALSTED ST 1ST,CHICAGO,IL,60621,16,15,16-15,7,1010,Limited Business License,,,69910,RENEW,,2018-08-15T00:00:00.000,2018-10-31T00:00:00.000,N,2018-10-16T00:00:00.000,2020-10-15T00:00:00.000,2018-10-31T00:00:00.000,2018-10-31T00:00:00.000,AAI,,,41.767160744,-87.644608148,"\n, \n(41.767160744200304, -87.6446081478298)"


The business license data includes licenses for all Chicago businesses. For this analysis, we are only interested in the licenses where a food inspection may apply. To figure out which licenses are in scope:

- Perform an inner join on the business license and food inspection data.
- Identify all of the unique license codes where food inspections apply.
- Filter the data to include only those businesses.

In [8]:
food_inspection_raw

Unnamed: 0,inspection_id,dba_name,aka_name,license_,facility_type,risk,address,city,state,zip,inspection_date,inspection_type,results,violations,latitude,longitude,location
0,52234,Cafe 608,Cafe 608,2013328,Restaurant,Risk 1 (High),608 W BARRY AVE,CHICAGO,IL,60657,2010-01-04T00:00:00.000,License Re-Inspection,Pass,,41.938006880423615,-87.6447545707008,"(41.938006880423615, -87.6447545707008)"
1,70269,mr.daniel's,mr.daniel's,1899292,Restaurant,Risk 1 (High),5645 W BELMONT AVE,CHICAGO,IL,60634,2010-01-04T00:00:00.000,License Re-Inspection,Pass,,41.93844282365204,-87.76831838068422,"(41.93844282365204, -87.76831838068422)"
2,104236,TEMPO CAFE,TEMPO CAFE,80916,Restaurant,Risk 1 (High),6 E CHESTNUT ST,CHICAGO,IL,60611,2010-01-04T00:00:00.000,Canvass,Fail,18. NO EVIDENCE OF RODENT OR INSECT OUTER OPEN...,41.89843137207629,-87.6280091630558,"(41.89843137207629, -87.6280091630558)"
3,67733,WOLCOTT'S,TROQUET,1992040,Restaurant,Risk 1 (High),1834 W MONTROSE AVE,CHICAGO,IL,60613,2010-01-04T00:00:00.000,License Re-Inspection,Pass,,41.961605669949854,-87.67596676683779,"(41.961605669949854, -87.67596676683779)"
4,67732,WOLCOTT'S,TROQUET,1992039,Restaurant,Risk 1 (High),1834 W MONTROSE AVE,CHICAGO,IL,60613,2010-01-04T00:00:00.000,License Re-Inspection,Pass,,41.961605669949854,-87.67596676683779,"(41.961605669949854, -87.67596676683779)"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
255958,2577904,ARAMARK AT BOEING,THE LANDING,2203996,Restaurant,Risk 1 (High),100 N RIVERSIDE PLZ,CHICAGO,IL,60606,2023-06-27T00:00:00.000,Canvass,Pass w/ Conditions,"3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL E...",41.88326585294293,-87.63876082126886,"(41.88326585294293, -87.63876082126886)"
255959,2577921,ANGRY CRAB,ANGRY CRAB,2476217,Restaurant,Risk 1 (High),1308 N MILWAUKEE,CHICAGO,IL,60622,2023-06-27T00:00:00.000,Complaint,Pass w/ Conditions,14. REQUIRED RECORDS AVAILABLE: SHELLSTOCK TAG...,41.90537246331243,-87.6698531917002,"(41.90537246331243, -87.6698531917002)"
255960,2577891,GIRL & THE GOAT,GIRL & THE GOAT,1992138,Restaurant,Risk 1 (High),809-813 W RANDOLPH ST,CHICAGO,IL,60607,2023-06-27T00:00:00.000,Canvass Re-Inspection,Pass,"53. TOILET FACILITIES: PROPERLY CONSTRUCTED, S...",41.88429093010678,-87.64784177325362,"(41.88429093010678, -87.64784177325362)"
255961,2577898,TARGET STORE T2781,TARGET/STARBUCKS/PIZZA HUT,2189184,Grocery Store,Risk 2 (Medium),1101 W JACKSON BLVD,CHICAGO,IL,60607,2023-06-27T00:00:00.000,Short Form Complaint,Pass,,41.877720207180666,-87.6546900011066,"(41.877720207180666, -87.6546900011066)"


In [9]:
in_scope_license_codes = (
    pd.merge(
        food_inspection_raw,
        business_license_raw,
        how="inner",
        left_on="license_",
        right_on="license_id"
    )
    .loc[:, "license_code"]
    .dropna()
    .unique()
)

in_scope_license_codes

array(['1006', '1475', '1010', '1584', '1474', '1586', '1483', '1481',
       '1013', '1315', '1007', '1005', '1470', '1676', '1020', '1585',
       '1050', '1012', '1011', '1683', '1008', '4404', '1625', '1505',
       '1133', '1061', '1781', '1931', '1932', '1604', '1329', '1375',
       '1002', '1605', '1571', '1684', '1569', '1330', '1477', '1570',
       '1053', '1480', '4408', '1016', '1014', '1023', '8344', '1056',
       '1476', '1003', '1473', '1370', '8343', '4406', '1033', '8340',
       '8345', '1316', '1594', '8342', '4405', '1525', '1784', '1524',
       '1032', '7013', '1471', '7014', '2101', '7036'], dtype=object)
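The same join can be illustrated on two tiny, made-up tables. This is only a sketch: the values are hypothetical, and the dtype comparison and match rate are extra sanity checks that are not part of the notebook. Mismatched key dtypes (for example `str` versus `int`) can make `merge()` fail or match nothing, so it may be worth comparing them before joining.

```python
import pandas as pd

# Hypothetical miniature versions of the two raw tables.
inspections = pd.DataFrame({"license_": ["100", "200", "999"]})
licenses = pd.DataFrame({
    "license_id": ["100", "200", "300"],
    "license_code": ["1006", "1010", "1012"],
})

# The join keys should share a dtype before merging.
assert inspections["license_"].dtype == licenses["license_id"].dtype

# An inner join keeps only rows whose key appears in both tables.
joined = pd.merge(
    inspections,
    licenses,
    how="inner",
    left_on="license_",
    right_on="license_id",
)

# Unique license codes that have at least one inspection.
codes = sorted(joined["license_code"].dropna().unique())
print(codes)

# Share of inspections that found a matching license.
match_rate = inspections["license_"].isin(licenses["license_id"]).mean()
print(f"{match_rate:.0%} of inspections matched a license")
```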

**Data cleaning**

Apply some basic cleaning steps to the data.

In [10]:
business_license_tidy = (business_license_raw
    
    # Only keep in scope licenses
    .loc[lambda x: x["license_code"].isin(in_scope_license_codes)]

    # Filter on the relevant state and city only.
    .loc[lambda x: x["state"] == "IL"]
    .loc[lambda x: x["city"] == "CHICAGO"]

    # Convert conditional approval to a boolean value.
    .assign(conditional_approval=lambda x: x["conditional_approval"] == "Y")
    
    # Drop the "location" column. The same data is already stored in the
    # "latitude" and "longitude" columns.
    .drop(columns=["location"])

    # Reset the index.
    .reset_index(drop=True)
)

business_license_tidy

Unnamed: 0,id,license_id,account_number,site_number,legal_name,doing_business_as_name,address,city,state,zip_code,ward,precinct,ward_precinct,police_district,license_code,license_description,business_activity_id,business_activity,license_number,application_type,application_created_date,application_requirements_complete,payment_date,conditional_approval,license_start_date,expiration_date,license_approved_for_issuance,date_issued,license_status,license_status_change_date,ssa,latitude,longitude
0,69910-20201016,2745698,21299,2,JOHN J SCOTT,J & R KLASSY CAR WASH,6956 S HALSTED ST 1ST,CHICAGO,IL,60621,16,15,16-15,7,1010,Limited Business License,,,69910,RENEW,,2020-08-15T00:00:00.000,2020-10-22T00:00:00.000,False,2020-10-16T00:00:00.000,2022-10-15T00:00:00.000,2020-10-22T00:00:00.000,2020-10-26T00:00:00.000,AAI,,,41.767160744,-87.644608148
1,69910-20221016,2864066,21299,2,JOHN J SCOTT,J & R KLASSY CAR WASH,6956 S HALSTED ST 1ST,CHICAGO,IL,60621,16,15,16-15,7,1010,Limited Business License,,,69910,RENEW,,2022-08-15T00:00:00.000,2022-10-20T00:00:00.000,False,2022-10-16T00:00:00.000,2024-10-15T00:00:00.000,2022-10-20T00:00:00.000,2022-10-21T00:00:00.000,AAI,,,41.767160744,-87.644608148
2,69911-20011116,1177990,21299,3,JOHN J SCOTT,J & R KLASSY SNACK SHOP,6958 S HALSTED ST 1ST,CHICAGO,IL,60621,16,15,16-15,7,1006,Retail Food Establishment,775,Retail Sales of Perishable Foods,69911,RENEW,,2001-10-03T00:00:00.000,2001-10-31T00:00:00.000,False,2001-11-16T00:00:00.000,2002-11-15T00:00:00.000,2002-01-08T00:00:00.000,2002-01-09T00:00:00.000,AAI,,,41.76710418,-87.644606795
3,69911-20021116,1283205,21299,3,JOHN J SCOTT,J & R KLASSY SNACK SHOP,6958 S HALSTED ST 1ST,CHICAGO,IL,60621,16,15,16-15,7,1006,Retail Food Establishment,775,Retail Sales of Perishable Foods,69911,RENEW,,2002-09-24T00:00:00.000,2002-10-17T00:00:00.000,False,2002-11-16T00:00:00.000,2003-11-15T00:00:00.000,2002-11-25T00:00:00.000,2002-11-25T00:00:00.000,AAI,,,41.76710418,-87.644606795
4,69911-20031116,1428793,21299,3,JOHN J SCOTT,J & R KLASSY SNACK SHOP,6958 S HALSTED ST 1ST,CHICAGO,IL,60621,16,15,16-15,7,1006,Retail Food Establishment,775,Retail Sales of Perishable Foods,69911,RENEW,,2003-09-26T00:00:00.000,2003-11-07T00:00:00.000,False,2003-11-16T00:00:00.000,2004-11-15T00:00:00.000,2003-12-22T00:00:00.000,2004-02-26T00:00:00.000,AAI,,,41.76710418,-87.644606795
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
977597,69910-20121016,2180517,21299,2,JOHN J SCOTT,J & R KLASSY CAR WASH,6956 S HALSTED ST 1ST,CHICAGO,IL,60621,16,15,16-15,7,1010,Limited Business License,,,69910,RENEW,,2012-08-15T00:00:00.000,2012-09-26T00:00:00.000,False,2012-10-16T00:00:00.000,2014-10-15T00:00:00.000,2012-09-26T00:00:00.000,2012-09-27T00:00:00.000,AAI,,,41.767160744,-87.644608148
977598,69910-20141016,2351014,21299,2,JOHN J SCOTT,J & R KLASSY CAR WASH,6956 S HALSTED ST 1ST,CHICAGO,IL,60621,16,15,16-15,7,1010,Limited Business License,,,69910,RENEW,,2014-08-15T00:00:00.000,2014-11-12T00:00:00.000,False,2014-10-16T00:00:00.000,2016-10-15T00:00:00.000,2014-11-12T00:00:00.000,2014-11-14T00:00:00.000,AAI,,,41.767160744,-87.644608148
977599,69910-20161016,2484690,21299,2,JOHN J SCOTT,J & R KLASSY CAR WASH,6956 S HALSTED ST 1ST,CHICAGO,IL,60621,16,15,16-15,7,1010,Limited Business License,,,69910,RENEW,,2016-08-15T00:00:00.000,2016-10-13T00:00:00.000,False,2016-10-16T00:00:00.000,2018-10-15T00:00:00.000,2016-10-13T00:00:00.000,2016-10-14T00:00:00.000,AAI,,,41.767160744,-87.644608148
977600,69910-20181016,2618081,21299,2,JOHN J SCOTT,J & R KLASSY CAR WASH,6956 S HALSTED ST 1ST,CHICAGO,IL,60621,16,15,16-15,7,1010,Limited Business License,,,69910,RENEW,,2018-08-15T00:00:00.000,2018-10-31T00:00:00.000,False,2018-10-16T00:00:00.000,2020-10-15T00:00:00.000,2018-10-31T00:00:00.000,2018-10-31T00:00:00.000,AAI,,,41.767160744,-87.644608148


**Data validation**

Use pandera to validate the data and convert each column to the correct type.

In [11]:
business_license_schema = pa.DataFrameSchema({
    "id": pa.Column(str, coerce=True),
    "license_id": pa.Column(str, coerce=True, unique=True), # Primary Key
    "account_number": pa.Column(str, coerce=True),
    "site_number": pa.Column(str, coerce=True),
    "legal_name": pa.Column(str, coerce=True),
    "doing_business_as_name": pa.Column(str, coerce=True, nullable=True),
    "address": pa.Column(str, coerce=True),
    "city": pa.Column(str, coerce=True, nullable=True, checks=[
        pa.Check.eq("CHICAGO")
    ]),
    "state": pa.Column(str, coerce=True, nullable=True, checks=[
        pa.Check.eq("IL")
    ]),
    "zip_code": pa.Column(str, coerce=True, nullable=True, checks=[
        pa.Check(lambda x: x.str.match(r'^\d{5}$').all())
    ]),
    "ward": pa.Column(str, coerce=True, nullable=True),
    "precinct": pa.Column(str, coerce=True, nullable=True),
    "ward_precinct": pa.Column(str, coerce=True, nullable=True),
    "police_district": pa.Column(pa.Category, coerce=True, nullable=True),
    "license_code": pa.Column(pa.Category, coerce=True, checks=[
        pa.Check.isin(in_scope_license_codes)
    ]),
    "license_description": pa.Column(str, coerce=True),
    "business_activity_id": pa.Column(str, coerce=True, nullable=True),
    "business_activity": pa.Column(pa.Category, coerce=True, nullable=True),
    "license_number": pa.Column(str, coerce=True),
    "application_type": pa.Column(pa.Category, coerce=True),
    "application_created_date": pa.Column(str, coerce=True, nullable=True),
    "application_requirements_complete": pa.Column(pa.DateTime, coerce=True, nullable=True),
    "payment_date": pa.Column(pa.DateTime, coerce=True, nullable=True),
    "conditional_approval": pa.Column(bool, coerce=True),
    "license_start_date": pa.Column(pa.DateTime, coerce=True, nullable=True),
    "expiration_date": pa.Column(pa.DateTime, coerce=True, nullable=True),
    "license_approved_for_issuance": pa.Column(pa.DateTime, coerce=True, nullable=True),
    "date_issued": pa.Column(pa.DateTime, coerce=True),
    "license_status": pa.Column(pa.Category, coerce=True),
    "license_status_change_date": pa.Column(pa.DateTime, coerce=True, nullable=True),
    "ssa": pa.Column(str, coerce=True, nullable=True),
    "latitude": pa.Column(pa.Float, coerce=True, nullable=True, checks=[
        pa.Check.between(38, 44)
    ]),
    "longitude": pa.Column(pa.Float, coerce=True, nullable=True, checks=[
        pa.Check.between(-89, -84)
    ]),
})



business_license_validated = business_license_schema.validate(business_license_tidy)
business_license_validated

Unnamed: 0,id,license_id,account_number,site_number,legal_name,doing_business_as_name,address,city,state,zip_code,ward,precinct,ward_precinct,police_district,license_code,license_description,business_activity_id,business_activity,license_number,application_type,application_created_date,application_requirements_complete,payment_date,conditional_approval,license_start_date,expiration_date,license_approved_for_issuance,date_issued,license_status,license_status_change_date,ssa,latitude,longitude
0,69910-20201016,2745698,21299,2,JOHN J SCOTT,J & R KLASSY CAR WASH,6956 S HALSTED ST 1ST,CHICAGO,IL,60621,16,15,16-15,7,1010,Limited Business License,,,69910,RENEW,,2020-08-15,2020-10-22,False,2020-10-16,2022-10-15,2020-10-22,2020-10-26,AAI,NaT,,41.767161,-87.644608
1,69910-20221016,2864066,21299,2,JOHN J SCOTT,J & R KLASSY CAR WASH,6956 S HALSTED ST 1ST,CHICAGO,IL,60621,16,15,16-15,7,1010,Limited Business License,,,69910,RENEW,,2022-08-15,2022-10-20,False,2022-10-16,2024-10-15,2022-10-20,2022-10-21,AAI,NaT,,41.767161,-87.644608
2,69911-20011116,1177990,21299,3,JOHN J SCOTT,J & R KLASSY SNACK SHOP,6958 S HALSTED ST 1ST,CHICAGO,IL,60621,16,15,16-15,7,1006,Retail Food Establishment,775,Retail Sales of Perishable Foods,69911,RENEW,,2001-10-03,2001-10-31,False,2001-11-16,2002-11-15,2002-01-08,2002-01-09,AAI,NaT,,41.767104,-87.644607
3,69911-20021116,1283205,21299,3,JOHN J SCOTT,J & R KLASSY SNACK SHOP,6958 S HALSTED ST 1ST,CHICAGO,IL,60621,16,15,16-15,7,1006,Retail Food Establishment,775,Retail Sales of Perishable Foods,69911,RENEW,,2002-09-24,2002-10-17,False,2002-11-16,2003-11-15,2002-11-25,2002-11-25,AAI,NaT,,41.767104,-87.644607
4,69911-20031116,1428793,21299,3,JOHN J SCOTT,J & R KLASSY SNACK SHOP,6958 S HALSTED ST 1ST,CHICAGO,IL,60621,16,15,16-15,7,1006,Retail Food Establishment,775,Retail Sales of Perishable Foods,69911,RENEW,,2003-09-26,2003-11-07,False,2003-11-16,2004-11-15,2003-12-22,2004-02-26,AAI,NaT,,41.767104,-87.644607
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
977597,69910-20121016,2180517,21299,2,JOHN J SCOTT,J & R KLASSY CAR WASH,6956 S HALSTED ST 1ST,CHICAGO,IL,60621,16,15,16-15,7,1010,Limited Business License,,,69910,RENEW,,2012-08-15,2012-09-26,False,2012-10-16,2014-10-15,2012-09-26,2012-09-27,AAI,NaT,,41.767161,-87.644608
977598,69910-20141016,2351014,21299,2,JOHN J SCOTT,J & R KLASSY CAR WASH,6956 S HALSTED ST 1ST,CHICAGO,IL,60621,16,15,16-15,7,1010,Limited Business License,,,69910,RENEW,,2014-08-15,2014-11-12,False,2014-10-16,2016-10-15,2014-11-12,2014-11-14,AAI,NaT,,41.767161,-87.644608
977599,69910-20161016,2484690,21299,2,JOHN J SCOTT,J & R KLASSY CAR WASH,6956 S HALSTED ST 1ST,CHICAGO,IL,60621,16,15,16-15,7,1010,Limited Business License,,,69910,RENEW,,2016-08-15,2016-10-13,False,2016-10-16,2018-10-15,2016-10-13,2016-10-14,AAI,NaT,,41.767161,-87.644608
977600,69910-20181016,2618081,21299,2,JOHN J SCOTT,J & R KLASSY CAR WASH,6956 S HALSTED ST 1ST,CHICAGO,IL,60621,16,15,16-15,7,1010,Limited Business License,,,69910,RENEW,,2018-08-15,2018-10-31,False,2018-10-16,2020-10-15,2018-10-31,2018-10-31,AAI,NaT,,41.767161,-87.644608
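When a check such as the five-digit `zip_code` pattern fails, it helps to see which values are responsible. The sketch below runs the same pattern element by element with plain pandas. The sample values are made up, and treating missing values as valid is an assumption that mirrors `nullable=True` in the schema.

```python
import pandas as pd

# Hypothetical zip codes: two malformed values and one missing value.
zip_codes = pd.Series(["60621", "6062", "60614-1234", None])

# fullmatch requires the whole string to match; na=True treats missing
# values as valid.
is_valid = zip_codes.str.fullmatch(r"\d{5}", na=True).astype(bool)

# The values that would make the check fail.
invalid = zip_codes[~is_valid]
print(invalid)
```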


Insert the data into PostgreSQL.

In [12]:
# Insert the data into postgres. Inserting large amounts of data can be slow, so
# iterate over 10,000 rows at a time.

n_rows = business_license_validated.shape[0]
step_size = 10_000

for i in range(0, n_rows, step_size):
    index_start = i
    index_end = min(n_rows - 1, i + step_size - 1)
    
    if i == 0:
        if_exists = "replace"
    else:
        if_exists = "append"

    print(f"Inserting rows: {index_start:,} - {index_end:,}")
    
    business_license_validated \
        .loc[index_start:index_end, :] \
        .to_sql("business_license_validated", engine, if_exists=if_exists, index=False)

Inserting rows: 0 - 9,999
Inserting rows: 10,000 - 19,999
Inserting rows: 20,000 - 29,999
Inserting rows: 30,000 - 39,999
Inserting rows: 40,000 - 49,999
Inserting rows: 50,000 - 59,999
Inserting rows: 60,000 - 69,999
Inserting rows: 70,000 - 79,999
Inserting rows: 80,000 - 89,999
Inserting rows: 90,000 - 99,999
Inserting rows: 100,000 - 109,999
Inserting rows: 110,000 - 119,999
Inserting rows: 120,000 - 129,999
Inserting rows: 130,000 - 139,999
Inserting rows: 140,000 - 149,999
Inserting rows: 150,000 - 159,999
Inserting rows: 160,000 - 169,999
Inserting rows: 170,000 - 179,999
Inserting rows: 180,000 - 189,999
Inserting rows: 190,000 - 199,999
Inserting rows: 200,000 - 209,999
Inserting rows: 210,000 - 219,999
Inserting rows: 220,000 - 229,999
Inserting rows: 230,000 - 239,999
Inserting rows: 240,000 - 249,999
Inserting rows: 250,000 - 259,999
Inserting rows: 260,000 - 269,999
Inserting rows: 270,000 - 279,999
Inserting rows: 280,000 - 289,999
Inserting rows: 290,000 - 299,999
...

In [13]:
# Confirm number of rows
pd.read_sql_query("SELECT COUNT(*) FROM business_license_validated", engine)

Unnamed: 0,count
0,977602
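The chunk boundaries used for the insert can be checked in isolation. Because `.loc` slices by label and includes both endpoints, each chunk runs from `i` to `i + step_size - 1`, and the last valid label is `n_rows - 1`. The helper name `chunk_bounds` below is hypothetical; it is a sketch of the arithmetic only and does not touch the database.

```python
from typing import Iterator, Tuple


def chunk_bounds(n_rows: int, step_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) row labels for each chunk, with end inclusive."""
    for start in range(0, n_rows, step_size):
        # The last valid label is n_rows - 1, so cap the end there.
        end = min(n_rows - 1, start + step_size - 1)
        yield start, end


bounds = list(chunk_bounds(n_rows=25, step_size=10))
print(bounds)

# Every row is covered exactly once.
total = sum(end - start + 1 for start, end in bounds)
print(total)
```

`DataFrame.to_sql` also accepts a `chunksize` argument that batches the writes internally, which may be a simpler option when per-chunk progress messages are not needed.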


## Data set (2): Food inspections

<https://data.cityofchicago.org/Health-Human-Services/Food-Inspections/4ijn-s7e5>

In [14]:
food_inspection_raw

Unnamed: 0,inspection_id,dba_name,aka_name,license_,facility_type,risk,address,city,state,zip,inspection_date,inspection_type,results,violations,latitude,longitude,location
0,52234,Cafe 608,Cafe 608,2013328,Restaurant,Risk 1 (High),608 W BARRY AVE,CHICAGO,IL,60657,2010-01-04T00:00:00.000,License Re-Inspection,Pass,,41.938006880423615,-87.6447545707008,"(41.938006880423615, -87.6447545707008)"
1,70269,mr.daniel's,mr.daniel's,1899292,Restaurant,Risk 1 (High),5645 W BELMONT AVE,CHICAGO,IL,60634,2010-01-04T00:00:00.000,License Re-Inspection,Pass,,41.93844282365204,-87.76831838068422,"(41.93844282365204, -87.76831838068422)"
2,104236,TEMPO CAFE,TEMPO CAFE,80916,Restaurant,Risk 1 (High),6 E CHESTNUT ST,CHICAGO,IL,60611,2010-01-04T00:00:00.000,Canvass,Fail,18. NO EVIDENCE OF RODENT OR INSECT OUTER OPEN...,41.89843137207629,-87.6280091630558,"(41.89843137207629, -87.6280091630558)"
3,67733,WOLCOTT'S,TROQUET,1992040,Restaurant,Risk 1 (High),1834 W MONTROSE AVE,CHICAGO,IL,60613,2010-01-04T00:00:00.000,License Re-Inspection,Pass,,41.961605669949854,-87.67596676683779,"(41.961605669949854, -87.67596676683779)"
4,67732,WOLCOTT'S,TROQUET,1992039,Restaurant,Risk 1 (High),1834 W MONTROSE AVE,CHICAGO,IL,60613,2010-01-04T00:00:00.000,License Re-Inspection,Pass,,41.961605669949854,-87.67596676683779,"(41.961605669949854, -87.67596676683779)"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
255958,2577904,ARAMARK AT BOEING,THE LANDING,2203996,Restaurant,Risk 1 (High),100 N RIVERSIDE PLZ,CHICAGO,IL,60606,2023-06-27T00:00:00.000,Canvass,Pass w/ Conditions,"3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL E...",41.88326585294293,-87.63876082126886,"(41.88326585294293, -87.63876082126886)"
255959,2577921,ANGRY CRAB,ANGRY CRAB,2476217,Restaurant,Risk 1 (High),1308 N MILWAUKEE,CHICAGO,IL,60622,2023-06-27T00:00:00.000,Complaint,Pass w/ Conditions,14. REQUIRED RECORDS AVAILABLE: SHELLSTOCK TAG...,41.90537246331243,-87.6698531917002,"(41.90537246331243, -87.6698531917002)"
255960,2577891,GIRL & THE GOAT,GIRL & THE GOAT,1992138,Restaurant,Risk 1 (High),809-813 W RANDOLPH ST,CHICAGO,IL,60607,2023-06-27T00:00:00.000,Canvass Re-Inspection,Pass,"53. TOILET FACILITIES: PROPERLY CONSTRUCTED, S...",41.88429093010678,-87.64784177325362,"(41.88429093010678, -87.64784177325362)"
255961,2577898,TARGET STORE T2781,TARGET/STARBUCKS/PIZZA HUT,2189184,Grocery Store,Risk 2 (Medium),1101 W JACKSON BLVD,CHICAGO,IL,60607,2023-06-27T00:00:00.000,Short Form Complaint,Pass,,41.877720207180666,-87.6546900011066,"(41.877720207180666, -87.6546900011066)"


**Data cleaning**

Apply some basic cleaning steps to the data.

In [15]:
food_inspection_tidy = (food_inspection_raw

    # Filter on the relevant state and city only.
    .loc[lambda x: x["state"] == "IL"]
    .loc[lambda x: x["city"] == "CHICAGO"]

    # Drop columns that also exist in the business license data.
    .drop(columns=["address", "city", "state", "latitude", "longitude", "location"])

    # Convert categorical columns to be all upper case for consistency
    .assign(
        dba_name=lambda x: x["dba_name"].str.upper(),
        aka_name=lambda x: x["aka_name"].str.upper(),
        facility_type=lambda x: x["facility_type"].str.upper(),
        risk=lambda x: x["risk"].str.upper(),
        inspection_type=lambda x: x["inspection_type"].str.upper(),
        results=lambda x: x["results"].str.upper(),
        violations=lambda x: x["violations"].str.upper(),
    )

    # Specify the order of categorical columns.
    .assign(risk=lambda x: x["risk"].astype("category").cat.set_categories(["ALL", "RISK 1 (HIGH)", "RISK 2 (MEDIUM)", "RISK 3 (LOW)"], ordered=True))

    # The "violations" column can hold multiple violations separated by " | ". E.g.
    # "32. FOOD AND NON-FOOD ... REPLACED. | 33. FOOD AND NON-FOOD CONTACT E"
    # To make the data easier to work with, split each violation into its own item.
    # The result is that the violations column contains a list of strings.
    # A raw string keeps "\|" as a literal pipe in the regular expression.
    .assign(violations=lambda x: x["violations"].str.split(pat=r" \| "))

    # Reset the index.
    .reset_index(drop=True)
)

food_inspection_tidy

Unnamed: 0,inspection_id,dba_name,aka_name,license_,facility_type,risk,zip,inspection_date,inspection_type,results,violations
0,52234,CAFE 608,CAFE 608,2013328,RESTAURANT,RISK 1 (HIGH),60657,2010-01-04T00:00:00.000,LICENSE RE-INSPECTION,PASS,
1,70269,MR.DANIEL'S,MR.DANIEL'S,1899292,RESTAURANT,RISK 1 (HIGH),60634,2010-01-04T00:00:00.000,LICENSE RE-INSPECTION,PASS,
2,104236,TEMPO CAFE,TEMPO CAFE,80916,RESTAURANT,RISK 1 (HIGH),60611,2010-01-04T00:00:00.000,CANVASS,FAIL,[18. NO EVIDENCE OF RODENT OR INSECT OUTER OPE...
3,67733,WOLCOTT'S,TROQUET,1992040,RESTAURANT,RISK 1 (HIGH),60613,2010-01-04T00:00:00.000,LICENSE RE-INSPECTION,PASS,
4,67732,WOLCOTT'S,TROQUET,1992039,RESTAURANT,RISK 1 (HIGH),60613,2010-01-04T00:00:00.000,LICENSE RE-INSPECTION,PASS,
...,...,...,...,...,...,...,...,...,...,...,...
254886,2577904,ARAMARK AT BOEING,THE LANDING,2203996,RESTAURANT,RISK 1 (HIGH),60606,2023-06-27T00:00:00.000,CANVASS,PASS W/ CONDITIONS,"[3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL ..."
254887,2577921,ANGRY CRAB,ANGRY CRAB,2476217,RESTAURANT,RISK 1 (HIGH),60622,2023-06-27T00:00:00.000,COMPLAINT,PASS W/ CONDITIONS,[14. REQUIRED RECORDS AVAILABLE: SHELLSTOCK TA...
254888,2577891,GIRL & THE GOAT,GIRL & THE GOAT,1992138,RESTAURANT,RISK 1 (HIGH),60607,2023-06-27T00:00:00.000,CANVASS RE-INSPECTION,PASS,"[53. TOILET FACILITIES: PROPERLY CONSTRUCTED, ..."
254889,2577898,TARGET STORE T2781,TARGET/STARBUCKS/PIZZA HUT,2189184,GROCERY STORE,RISK 2 (MEDIUM),60607,2023-06-27T00:00:00.000,SHORT FORM COMPLAINT,PASS,
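The effect of splitting the `violations` column can be sketched with plain Python on a single value. The helper `split_violations` and the sample string are hypothetical; the notebook itself uses the pandas `.str.split` accessor, which leaves missing values as missing in the same way.

```python
from typing import List, Optional


def split_violations(text: Optional[str]) -> Optional[List[str]]:
    """Split a raw violations string on " | "; pass missing values through."""
    if text is None:
        return None
    return [part.strip() for part in text.split(" | ")]


# Made-up example in the same shape as the raw data.
raw = "32. FOOD AND NON-FOOD CONTACT SURFACES | 33. FOOD AND NON-FOOD CONTACT E"
parts = split_violations(raw)
print(parts)
print(split_violations(None))
```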


**Data validation**

Use pandera to validate the data and convert each column to the correct type.

In [16]:
food_inspection_schema = pa.DataFrameSchema({
    "inspection_id": pa.Column(str, coerce=True, unique=True), # Primary Key
    "dba_name": pa.Column(str, coerce=True),
    "aka_name": pa.Column(str, coerce=True, nullable=True),
    "license_": pa.Column(str, coerce=True, nullable=True), # Foreign Key
    "facility_type": pa.Column(pa.Category, coerce=True, nullable=True),
    "risk": pa.Column(str, coerce=True, nullable=True, checks=[
        pa.Check.isin(["ALL", "RISK 1 (HIGH)", "RISK 2 (MEDIUM)", "RISK 3 (LOW)"])
    ]),
    "zip": pa.Column(str, coerce=True, nullable=True),
    "inspection_date": pa.Column(pa.DateTime, coerce=True),
    "inspection_type": pa.Column(pa.Category, coerce=True, nullable=True),
    "results": pa.Column(pa.Category, coerce=True),
    "violations": pa.Column(pa.Object, coerce=True, nullable=True)
})

food_inspection_validated = food_inspection_schema.validate(food_inspection_tidy)
food_inspection_validated

Unnamed: 0,inspection_id,dba_name,aka_name,license_,facility_type,risk,zip,inspection_date,inspection_type,results,violations
0,52234,CAFE 608,CAFE 608,2013328,RESTAURANT,RISK 1 (HIGH),60657,2010-01-04,LICENSE RE-INSPECTION,PASS,
1,70269,MR.DANIEL'S,MR.DANIEL'S,1899292,RESTAURANT,RISK 1 (HIGH),60634,2010-01-04,LICENSE RE-INSPECTION,PASS,
2,104236,TEMPO CAFE,TEMPO CAFE,80916,RESTAURANT,RISK 1 (HIGH),60611,2010-01-04,CANVASS,FAIL,[18. NO EVIDENCE OF RODENT OR INSECT OUTER OPE...
3,67733,WOLCOTT'S,TROQUET,1992040,RESTAURANT,RISK 1 (HIGH),60613,2010-01-04,LICENSE RE-INSPECTION,PASS,
4,67732,WOLCOTT'S,TROQUET,1992039,RESTAURANT,RISK 1 (HIGH),60613,2010-01-04,LICENSE RE-INSPECTION,PASS,
...,...,...,...,...,...,...,...,...,...,...,...
254886,2577904,ARAMARK AT BOEING,THE LANDING,2203996,RESTAURANT,RISK 1 (HIGH),60606,2023-06-27,CANVASS,PASS W/ CONDITIONS,"[3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL ..."
254887,2577921,ANGRY CRAB,ANGRY CRAB,2476217,RESTAURANT,RISK 1 (HIGH),60622,2023-06-27,COMPLAINT,PASS W/ CONDITIONS,[14. REQUIRED RECORDS AVAILABLE: SHELLSTOCK TA...
254888,2577891,GIRL & THE GOAT,GIRL & THE GOAT,1992138,RESTAURANT,RISK 1 (HIGH),60607,2023-06-27,CANVASS RE-INSPECTION,PASS,"[53. TOILET FACILITIES: PROPERLY CONSTRUCTED, ..."
254889,2577898,TARGET STORE T2781,TARGET/STARBUCKS/PIZZA HUT,2189184,GROCERY STORE,RISK 2 (MEDIUM),60607,2023-06-27,SHORT FORM COMPLAINT,PASS,
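The schema marks `license_` as a foreign key, but pandera validates one table at a time, so the link to the business license data is not checked. A rough sketch of such a check with plain pandas follows. The miniature tables are made up, and treating a missing `license_` as acceptable is an assumption that mirrors `nullable=True`.

```python
import pandas as pd

# Hypothetical miniature tables; the real ones are much larger.
licenses = pd.DataFrame({"license_id": ["100", "200", "300"]})
inspections = pd.DataFrame({
    "inspection_id": ["a", "b", "c", "d"],
    "license_": ["100", "300", "999", None],
})

# A foreign key is satisfied when it is missing or present in the parent table.
has_key = inspections["license_"].notna()
is_known = inspections["license_"].isin(licenses["license_id"])

# Rows that reference a license that does not exist.
orphans = inspections.loc[has_key & ~is_known]
print(orphans)
```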


Insert the data into PostgreSQL.

In [17]:
# Insert the data into postgres. Inserting large amounts of data can be slow, so
# iterate over 10,000 rows at a time.

n_rows = food_inspection_validated.shape[0]
step_size = 10_000

for i in range(0, n_rows, step_size):
    index_start = i
    index_end = min(n_rows - 1, i + step_size - 1)

    if i == 0:
        if_exists = "replace"
    else:
        if_exists = "append"

    print(f"Inserting rows: {index_start:,} - {index_end:,}")

    food_inspection_validated \
        .loc[index_start:index_end, :] \
        .to_sql("food_inspection_validated", engine, if_exists=if_exists, index=False)

Inserting rows: 0 - 9,999
Inserting rows: 10,000 - 19,999
Inserting rows: 20,000 - 29,999
Inserting rows: 30,000 - 39,999
Inserting rows: 40,000 - 49,999
Inserting rows: 50,000 - 59,999
Inserting rows: 60,000 - 69,999
Inserting rows: 70,000 - 79,999
Inserting rows: 80,000 - 89,999
Inserting rows: 90,000 - 99,999
Inserting rows: 100,000 - 109,999
Inserting rows: 110,000 - 119,999
Inserting rows: 120,000 - 129,999
Inserting rows: 130,000 - 139,999
Inserting rows: 140,000 - 149,999
Inserting rows: 150,000 - 159,999
Inserting rows: 160,000 - 169,999
Inserting rows: 170,000 - 179,999
Inserting rows: 180,000 - 189,999
Inserting rows: 190,000 - 199,999
Inserting rows: 200,000 - 209,999
Inserting rows: 210,000 - 219,999
Inserting rows: 220,000 - 229,999
Inserting rows: 230,000 - 239,999
Inserting rows: 240,000 - 249,999
Inserting rows: 250,000 - 254,890


In [18]:
# Confirm number of rows
pd.read_sql_query("SELECT COUNT(*) FROM food_inspection_validated", engine)

Unnamed: 0,count
0,254891


## Data set (3): Map Data

Generate map data for use in downstream applications.

In [19]:
map_cols = [
    "license_id",
    "legal_name", 
    "doing_business_as_name",
    "license_code",
    "license_description",
    "address", 
    "zip_code",
    "latitude",
    "longitude"
]

map_data = (
    business_license_validated
    .head(1_000)
    .loc[:, map_cols]
    .drop_duplicates()
    .reset_index(drop=True)
)

map_data

Unnamed: 0,license_id,legal_name,doing_business_as_name,license_code,license_description,address,zip_code,latitude,longitude
0,2745698,JOHN J SCOTT,J & R KLASSY CAR WASH,1010,Limited Business License,6956 S HALSTED ST 1ST,60621,41.767161,-87.644608
1,2864066,JOHN J SCOTT,J & R KLASSY CAR WASH,1010,Limited Business License,6956 S HALSTED ST 1ST,60621,41.767161,-87.644608
2,1177990,JOHN J SCOTT,J & R KLASSY SNACK SHOP,1006,Retail Food Establishment,6958 S HALSTED ST 1ST,60621,41.767104,-87.644607
3,1283205,JOHN J SCOTT,J & R KLASSY SNACK SHOP,1006,Retail Food Establishment,6958 S HALSTED ST 1ST,60621,41.767104,-87.644607
4,1428793,JOHN J SCOTT,J & R KLASSY SNACK SHOP,1006,Retail Food Establishment,6958 S HALSTED ST 1ST,60621,41.767104,-87.644607
...,...,...,...,...,...,...,...,...,...
995,2701063,JEROME PLUCINSKI,DEALS ON WHEELS DETAILING,1010,Limited Business License,5622 W 65TH ST MAIN,60638,41.774492,-87.762541
996,2819913,JEROME PLUCINSKI,DEALS ON WHEELS DETAILING,1010,Limited Business License,5622 W 65TH ST MAIN,60638,41.774492,-87.762541
997,1208587,ANITA KRAMER,ANITA KRAMER INTERIOR DESIGN,1012,Home Occupation,1701 N DAYTON ST E,60614,41.912815,-87.649400
998,1315652,ANITA KRAMER,ANITA KRAMER INTERIOR DESIGN,1012,Home Occupation,1701 N DAYTON ST E,60614,41.912815,-87.649400


Save the map data as a pin for easy access by other applications.

In [20]:
# Pin the data to Connect
pin_name = f"{user_name}/chicago-business-map-data"

board.pin_write(
    map_data, 
    name=pin_name, 
    type="arrow", 
    versioned=True,
    title="City of Chicago - Business Map Data"
)

Writing pin:
Name: 'sam.edwardes/chicago-business-map-data'
Version: 20230629T120010Z-f1374


Meta(title='City of Chicago - Business Map Data', description=None, created='20230629T120010Z', pin_hash='f13742214f8bbf89', file='chicago-business-map-data.arrow', file_size=48418, type='arrow', api_version=1, version=VersionRaw(version='76556'), tags=None, name='sam.edwardes/chicago-business-map-data', user={}, local={})