# Data Validation

In the previous notebook, two pins were saved:

- City of Chicago - Business License Data (RAW): `chicago-business-license-data`
- City of Chicago - Food Inspection Data (RAW): `chicago-food-inspection-data`

## Setup

In [1]:
import os

import ibis
import pins
import pandas as pd
import numpy as np
import pandera as pa
from sqlalchemy import create_engine

In [2]:
pd.options.display.max_columns = 999

In [3]:
# Set up the board
board = pins.board_connect()
user_name = "sam.edwardes"

In [4]:
# Database details
db_user = "posit"
db_password = os.environ["CONF23_DB_PASSWORD"]
db_host = os.environ["CONF23_DB_HOST"]
db_port = 5432
db_database = "python_workshop"

# Set up sqlalchemy for writing data
engine = create_engine(f"postgresql+psycopg2://{db_user}:{db_password}@{db_host}:{db_port}/{db_database}")

# Set up ibis for reading data
con = ibis.postgres.connect(
    user=db_user,
    password=db_password,
    host=db_host,
    port=db_port,
    database=db_database
)
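Building the connection string with an f-string works as long as the password contains no characters that are special in a URL (such as `@` or `/`). As a minimal sketch, assuming SQLAlchemy 1.4 or later, the URL can instead be assembled with `sqlalchemy.engine.URL.create`, which percent-encodes such characters when the URL is rendered. The credentials below are hypothetical placeholders, not the workshop values.

```python
from sqlalchemy.engine import URL

# Hypothetical credentials, for illustration only.
url = URL.create(
    drivername="postgresql+psycopg2",
    username="posit",
    password="p@ss/word",  # contains characters that are special in a URL
    host="localhost",
    port=5432,
    database="python_workshop",
)

# Special characters are percent-encoded when the URL is rendered as a string.
rendered = url.render_as_string(hide_password=False)

# The password can be masked for logging.
masked = url.render_as_string(hide_password=True)
print(masked)
```

The resulting `url` object can be passed to `create_engine` in place of the formatted string.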

## Tips

- Use multiple cursors in VS Code to easily edit many lines at the same time (<https://code.visualstudio.com/docs/getstarted/tips-and-tricks#_column-box-selection>).
- Use `df["col_name"].value_counts()` to understand the distribution of categorical columns.
- Use `df["col_name"].hist()` to understand the distribution of numeric columns.
- Use `df.info()` to understand column types and null values.
- Use [ydata-profiling](https://pypi.org/project/ydata-profiling/) to generate an automated data report.

```python
from ydata_profiling import ProfileReport
ProfileReport(df)
```
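The inspection tips above can be tried on a small toy DataFrame. This is only a sketch: the column names mirror the workshop data, but the values are made up, and `isna().sum()` is added here as a compact complement to `df.info()`.

```python
import pandas as pd

# Toy data, for illustration only.
df = pd.DataFrame({
    "results": ["Pass", "Fail", "Pass", None],
    "latitude": [41.96, 41.94, None, 41.75],
})

# Distribution of a categorical column; dropna=False also counts missing values.
result_counts = df["results"].value_counts(dropna=False)
print(result_counts)

# Column types and non-null counts (printed to stdout).
df.info()

# Number of missing values per column.
missing = df.isna().sum()
print(missing)
```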

## Load raw data

Use `ibis` to read the data from Postgres.

In [5]:
business_license_raw = con.table("business_license_raw").to_pandas()

In [6]:
food_inspection_raw = con.table("food_inspection_raw").to_pandas()

## Data set (1): Business License Data

<https://data.cityofchicago.org/Community-Economic-Development/Business-Licenses/r5kz-chrr>

In [7]:
business_license_raw

Unnamed: 0,id,license_id,account_number,site_number,legal_name,doing_business_as_name,address,city,state,zip_code,ward,precinct,ward_precinct,police_district,license_code,license_description,business_activity_id,business_activity,license_number,application_type,application_created_date,application_requirements_complete,payment_date,conditional_approval,license_start_date,expiration_date,license_approved_for_issuance,date_issued,license_status,license_status_change_date,ssa,latitude,longitude,location
0,1045374-20050816,1606309,56459,1,PAUL COLLURAFICI,TATTOO FACTORY,4408 N BROADWAY,CHICAGO,IL,60640,46,32,46-32,19,1013,Body Piercing,,,1045374,RENEW,,2005-06-21T00:00:00.000,2005-07-22T00:00:00.000,N,2005-08-16T00:00:00.000,2006-08-15T00:00:00.000,2005-07-22T00:00:00.000,2005-07-25T00:00:00.000,AAI,,34,41.961998612,-87.655462297,"\n, \n(41.9619986123253, -87.65546229718817)"
1,1045378-20010216,1118018,203578,1,AMBER RAE BORGLIN,AMBER RAE BORLING,1417 W ROSCOE ST APT.2,CHICAGO,IL,60657,32,47,32-47,19,1525,Massage Therapist,,,1045378,RENEW,2000-12-18T00:00:00.000,2000-12-19T00:00:00.000,2003-09-11T00:00:00.000,N,2001-02-16T00:00:00.000,2002-02-15T00:00:00.000,,2003-09-11T00:00:00.000,AAI,,27,41.943296755,-87.664647788,"\n, \n(41.94329675516324, -87.66464778785225)"
2,1045378-20020216,1422900,203578,1,AMBER RAE BORGLIN,AMBER RAE BORLING,1417 W ROSCOE ST APT.2,CHICAGO,IL,60657,32,47,32-47,19,1525,Massage Therapist,,,1045378,RENEW,,2003-09-11T00:00:00.000,2003-09-11T00:00:00.000,N,2002-02-16T00:00:00.000,2003-02-15T00:00:00.000,,2003-09-11T00:00:00.000,AAI,,27,41.943296755,-87.664647788,"\n, \n(41.94329675516324, -87.66464778785225)"
3,1045378-20030216,1422901,203578,1,AMBER RAE BORGLIN,AMBER RAE BORLING,1417 W ROSCOE ST APT.2,CHICAGO,IL,60657,32,47,32-47,19,1525,Massage Therapist,,,1045378,RENEW,,2003-09-11T00:00:00.000,2003-09-11T00:00:00.000,N,2003-02-16T00:00:00.000,2004-02-15T00:00:00.000,2003-09-11T00:00:00.000,2003-09-12T00:00:00.000,AAI,,27,41.943296755,-87.664647788,"\n, \n(41.94329675516324, -87.66464778785225)"
4,1045381-20020816,1251347,6760,1,SOUTH SHORE HOSPITAL CORPORATION,SOUTH SHORE HOSPITAL,8012 S CRANDON AVE 1ST,CHICAGO,IL,60617,8,7,8-7,4,1006,Retail Food Establishment,775,Retail Sales of Perishable Foods,1045381,RENEW,,2002-06-28T00:00:00.000,2002-08-13T00:00:00.000,N,2002-08-16T00:00:00.000,2003-08-15T00:00:00.000,2002-09-17T00:00:00.000,2002-09-18T00:00:00.000,AAI,,,41.749450604,-87.568778624,"\n, \n(41.74945060417943, -87.5687786244394)"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1104878,1045371-20090416,1956030,52917,3,KATHLEEN E CZYZ,KATHLEEN E CZYZ,365 PARKVIEW TER,BUFFALO GROVE,IL,60089,,,,,1046,Automatic Amusement Device Operator,,,1045371,RENEW,,2009-02-17T00:00:00.000,2009-04-15T00:00:00.000,N,2009-04-16T00:00:00.000,2011-04-15T00:00:00.000,2009-04-15T00:00:00.000,2009-04-15T00:00:00.000,AAI,,,,,
1104879,1045371-20110416,2080951,52917,3,KATHLEEN E CZYZ,KATHLEEN E CZYZ,365 PARKVIEW TER,BUFFALO GROVE,IL,60089,,,,,1046,Automatic Amusement Device Operator,,,1045371,RENEW,,2011-02-15T00:00:00.000,2011-02-16T00:00:00.000,N,2011-04-16T00:00:00.000,2013-04-15T00:00:00.000,2011-02-16T00:00:00.000,2011-02-17T00:00:00.000,AAC,2012-12-29T00:00:00.000,,,,
1104880,1045374-20020816,1261763,56459,1,PAUL COLLURAFICI,TATTOO FACTORY,4408 N BROADWAY,CHICAGO,IL,60640,46,32,46-32,19,1013,Body Piercing,,,1045374,RENEW,,2002-06-28T00:00:00.000,2002-07-23T00:00:00.000,N,2002-08-16T00:00:00.000,2003-08-15T00:00:00.000,2002-07-23T00:00:00.000,2002-07-24T00:00:00.000,AAI,,34,41.961998612,-87.655462297,"\n, \n(41.9619986123253, -87.65546229718817)"
1104881,1045374-20030816,1368827,56459,1,PAUL COLLURAFICI,TATTOO FACTORY,4408 N BROADWAY,CHICAGO,IL,60640,46,32,46-32,19,1013,Body Piercing,,,1045374,RENEW,,2003-06-24T00:00:00.000,2003-07-28T00:00:00.000,N,2003-08-16T00:00:00.000,2004-08-15T00:00:00.000,2003-07-28T00:00:00.000,2003-07-29T00:00:00.000,AAI,,34,41.961998612,-87.655462297,"\n, \n(41.9619986123253, -87.65546229718817)"


The business license data includes licenses for all Chicago businesses. For this analysis, we are only interested in the licenses where a food inspection may apply. To figure out which licenses are in scope:

- Perform an inner join on the business license and food inspection data.
- Identify all of the unique license codes where food inspections apply.
- Filter the data to include only those businesses.
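The three steps above can be sketched on toy data before applying them to the full tables. The frames below are made up for illustration; only the column names `license_`, `license_id`, and `license_code` come from the real data.

```python
import pandas as pd

# Toy stand-ins for the two raw tables.
licenses = pd.DataFrame({
    "license_id": ["1", "2", "3", "4"],
    "license_code": ["1006", "1006", "1010", "9999"],
})
inspections = pd.DataFrame({
    "inspection_id": ["a", "b", "c"],
    "license_": ["1", "3", "3"],
})

# 1. Inner join: keep only inspections that match a license.
joined = pd.merge(
    inspections,
    licenses,
    how="inner",
    left_on="license_",
    right_on="license_id",
)

# 2. Unique license codes that appear in at least one inspection.
codes = joined["license_code"].dropna().unique()

# 3. Keep every license whose code is in scope, inspected or not.
in_scope = licenses.loc[licenses["license_code"].isin(codes)]
print(in_scope)
```

Note that license `2` is kept even though it was never inspected, because it shares a license code with an inspected business. Both key columns are strings here; the join keys should have the same dtype for the merge to match as expected.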

In [8]:
food_inspection_raw

Unnamed: 0,inspection_id,dba_name,aka_name,license_,facility_type,risk,address,city,state,zip,inspection_date,inspection_type,results,violations,latitude,longitude,location
0,67733,WOLCOTT'S,TROQUET,1992040,Restaurant,Risk 1 (High),1834 W MONTROSE AVE,CHICAGO,IL,60613,2010-01-04T00:00:00.000,License Re-Inspection,Pass,,41.961605669949854,-87.67596676683779,"(41.961605669949854, -87.67596676683779)"
1,67732,WOLCOTT'S,TROQUET,1992039,Restaurant,Risk 1 (High),1834 W MONTROSE AVE,CHICAGO,IL,60613,2010-01-04T00:00:00.000,License Re-Inspection,Pass,,41.961605669949854,-87.67596676683779,"(41.961605669949854, -87.67596676683779)"
2,67738,MICHAEL'S ON MAIN CAFE,MICHAEL'S ON MAIN CAFE,2008948,Restaurant,Risk 1 (High),8750 W BRYN WAWR AVE,CHICAGO,IL,60631,2010-01-04T00:00:00.000,License,Fail,18. NO EVIDENCE OF RODENT OR INSECT OUTER OPEN...,,,
3,67757,DUNKIN DONUTS/BASKIN-ROBBINS,DUNKIN DONUTS/BASKIN-ROBBINS,1380279,Restaurant,Risk 2 (Medium),100 W RANDOLPH ST,CHICAGO,IL,60601,2010-01-04T00:00:00.000,Tag Removal,Pass,,41.88458626715456,-87.63101044588599,"(41.88458626715456, -87.63101044588599)"
4,52234,Cafe 608,Cafe 608,2013328,Restaurant,Risk 1 (High),608 W BARRY AVE,CHICAGO,IL,60657,2010-01-04T00:00:00.000,License Re-Inspection,Pass,,41.938006880423615,-87.6447545707008,"(41.938006880423615, -87.6447545707008)"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
256207,2578135,LITTLE HANDS LEARNING CENTER ACADEMY,LITTLE HANDS LEARNING CENTER ACADEMY,2831108,Children's Services Facility,Risk 1 (High),10126 S WESTERN AVE,CHICAGO,IL,60643,2023-06-30T00:00:00.000,Canvass,Pass,"38. INSECTS, RODENTS, & ANIMALS NOT PRESENT - ...",41.70921709772374,-87.68185356852545,"(41.70921709772374, -87.68185356852545)"
256208,2578122,7-Eleven,7-Eleven,21286,Grocery Store,Risk 2 (Medium),6057 S KEDZIE AVE,CHICAGO,IL,60629,2023-06-30T00:00:00.000,Complaint,Fail,"1. PERSON IN CHARGE PRESENT, DEMONSTRATES KNOW...",41.78275257014254,-87.70310188982967,"(41.78275257014254, -87.70310188982967)"
256209,2578132,"HAUTE BRATS, LLC",HAUTE BRATS,2841458,Restaurant,Risk 2 (Medium),6239 S ASHLAND AVE,CHICAGO,IL,60636,2023-06-30T00:00:00.000,Complaint,No Entry,,41.78012575548844,-87.66409001096028,"(41.78012575548844, -87.66409001096028)"
256210,2578116,WEST CHICAGO FUEL,WEST CHICAGO FUEL,2891134,Grocery Store,Risk 3 (Low),10 N KILBOURN AVE,CHICAGO,IL,60624,2023-06-30T00:00:00.000,License,Pass,49. NON-FOOD/FOOD CONTACT SURFACES CLEAN - Com...,41.880908954758716,-87.73808517343443,"(41.880908954758716, -87.73808517343443)"


In [9]:
in_scope_license_codes = (
    pd.merge(
        food_inspection_raw,
        business_license_raw,
        how="inner",
        left_on="license_",
        right_on="license_id"
    )
    .loc[:, "license_code"]
    .dropna()
    .unique()
)

in_scope_license_codes

array(['1475', '1006', '1010', '1586', '1584', '1483', '1474', '1481',
       '1470', '1013', '1007', '1315', '1020', '1676', '1005', '1585',
       '1050', '1012', '1011', '1683', '1008', '1625', '4404', '1781',
       '1505', '1133', '1061', '1931', '1932', '1330', '1604', '1329',
       '1375', '1002', '1605', '1571', '1569', '1684', '1477', '1570',
       '1053', '1480', '4408', '1016', '1014', '1023', '8344', '1056',
       '1476', '1003', '1473', '1370', '4405', '1033', '8343', '4406',
       '8340', '8345', '1316', '1594', '8342', '1525', '1784', '1524',
       '1032', '7013', '1471', '7014', '2101', '7036'], dtype=object)

**Data cleaning**

Apply some basic cleaning steps to the data.

In [10]:
business_license_tidy = (business_license_raw
    
    # Only keep in scope licenses
    .loc[business_license_raw["license_code"].isin(in_scope_license_codes)]

    # Filter on the relevant state and city only.
    .loc[business_license_raw["state"] == "IL"]
    .loc[business_license_raw["city"] == "CHICAGO"]

    # Convert conditional approval to a boolean value.
    .assign(conditional_approval=lambda x: x["conditional_approval"] == "Y")
    
    # Drop the "location" column, the same data is already stored in the "latitude"
    # and "longitude" columns.
    .drop(columns=["location"])

    # Reset the index.
    .reset_index(drop=True)
)

business_license_tidy

Unnamed: 0,id,license_id,account_number,site_number,legal_name,doing_business_as_name,address,city,state,zip_code,ward,precinct,ward_precinct,police_district,license_code,license_description,business_activity_id,business_activity,license_number,application_type,application_created_date,application_requirements_complete,payment_date,conditional_approval,license_start_date,expiration_date,license_approved_for_issuance,date_issued,license_status,license_status_change_date,ssa,latitude,longitude
0,1045374-20050816,1606309,56459,1,PAUL COLLURAFICI,TATTOO FACTORY,4408 N BROADWAY,CHICAGO,IL,60640,46,32,46-32,19,1013,Body Piercing,,,1045374,RENEW,,2005-06-21T00:00:00.000,2005-07-22T00:00:00.000,False,2005-08-16T00:00:00.000,2006-08-15T00:00:00.000,2005-07-22T00:00:00.000,2005-07-25T00:00:00.000,AAI,,34,41.961998612,-87.655462297
1,1045378-20010216,1118018,203578,1,AMBER RAE BORGLIN,AMBER RAE BORLING,1417 W ROSCOE ST APT.2,CHICAGO,IL,60657,32,47,32-47,19,1525,Massage Therapist,,,1045378,RENEW,2000-12-18T00:00:00.000,2000-12-19T00:00:00.000,2003-09-11T00:00:00.000,False,2001-02-16T00:00:00.000,2002-02-15T00:00:00.000,,2003-09-11T00:00:00.000,AAI,,27,41.943296755,-87.664647788
2,1045378-20020216,1422900,203578,1,AMBER RAE BORGLIN,AMBER RAE BORLING,1417 W ROSCOE ST APT.2,CHICAGO,IL,60657,32,47,32-47,19,1525,Massage Therapist,,,1045378,RENEW,,2003-09-11T00:00:00.000,2003-09-11T00:00:00.000,False,2002-02-16T00:00:00.000,2003-02-15T00:00:00.000,,2003-09-11T00:00:00.000,AAI,,27,41.943296755,-87.664647788
3,1045378-20030216,1422901,203578,1,AMBER RAE BORGLIN,AMBER RAE BORLING,1417 W ROSCOE ST APT.2,CHICAGO,IL,60657,32,47,32-47,19,1525,Massage Therapist,,,1045378,RENEW,,2003-09-11T00:00:00.000,2003-09-11T00:00:00.000,False,2003-02-16T00:00:00.000,2004-02-15T00:00:00.000,2003-09-11T00:00:00.000,2003-09-12T00:00:00.000,AAI,,27,41.943296755,-87.664647788
4,1045381-20020816,1251347,6760,1,SOUTH SHORE HOSPITAL CORPORATION,SOUTH SHORE HOSPITAL,8012 S CRANDON AVE 1ST,CHICAGO,IL,60617,8,7,8-7,4,1006,Retail Food Establishment,775,Retail Sales of Perishable Foods,1045381,RENEW,,2002-06-28T00:00:00.000,2002-08-13T00:00:00.000,False,2002-08-16T00:00:00.000,2003-08-15T00:00:00.000,2002-09-17T00:00:00.000,2002-09-18T00:00:00.000,AAI,,,41.749450604,-87.568778624
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
977942,1045370-20020816,1266103,203574,1,cafe mauro inc,CAFE MAURO,3633 W 26TH ST 1,CHICAGO,IL,60623,22,28,22-28,10,1006,Retail Food Establishment,775,Retail Sales of Perishable Foods,1045370,RENEW,,2002-06-28T00:00:00.000,2002-08-02T00:00:00.000,False,2002-08-16T00:00:00.000,2003-08-15T00:00:00.000,2002-08-02T00:00:00.000,2002-08-26T00:00:00.000,AAI,,25,41.844258654,-87.715945845
977943,1045370-20030816,1372670,203574,1,cafe mauro inc,CAFE MAURO,3633 W 26TH ST 1,CHICAGO,IL,60623,22,28,22-28,10,1006,Retail Food Establishment,775,Retail Sales of Perishable Foods,1045370,RENEW,,2003-06-24T00:00:00.000,2003-08-18T00:00:00.000,False,2003-08-16T00:00:00.000,2004-08-15T00:00:00.000,2003-09-12T00:00:00.000,2003-09-12T00:00:00.000,AAI,,25,41.844258654,-87.715945845
977944,1045374-20020816,1261763,56459,1,PAUL COLLURAFICI,TATTOO FACTORY,4408 N BROADWAY,CHICAGO,IL,60640,46,32,46-32,19,1013,Body Piercing,,,1045374,RENEW,,2002-06-28T00:00:00.000,2002-07-23T00:00:00.000,False,2002-08-16T00:00:00.000,2003-08-15T00:00:00.000,2002-07-23T00:00:00.000,2002-07-24T00:00:00.000,AAI,,34,41.961998612,-87.655462297
977945,1045374-20030816,1368827,56459,1,PAUL COLLURAFICI,TATTOO FACTORY,4408 N BROADWAY,CHICAGO,IL,60640,46,32,46-32,19,1013,Body Piercing,,,1045374,RENEW,,2003-06-24T00:00:00.000,2003-07-28T00:00:00.000,False,2003-08-16T00:00:00.000,2004-08-15T00:00:00.000,2003-07-28T00:00:00.000,2003-07-29T00:00:00.000,AAI,,34,41.961998612,-87.655462297


**Data validation**

Use pandera to validate the data and convert each column to the correct type.

In [11]:
business_license_schema = pa.DataFrameSchema({
    "id": pa.Column(str, coerce=True),
    "license_id": pa.Column(str, coerce=True, unique=True), # Primary Key
    "account_number": pa.Column(str, coerce=True),
    "site_number": pa.Column(str, coerce=True),
    "legal_name": pa.Column(str, coerce=True),
    "doing_business_as_name": pa.Column(str, coerce=True, nullable=True),
    "address": pa.Column(str, coerce=True),
    "city": pa.Column(str, coerce=True, nullable=True, checks=[
        pa.Check.eq("CHICAGO")
    ]),
    "state": pa.Column(str, coerce=True, nullable=True, checks=[
        pa.Check.eq("IL")
    ]),
    "zip_code": pa.Column(str, coerce=True, nullable=True, checks=[
        pa.Check(lambda x: x.str.match(r'^\d{5}$').all())
    ]),
    "ward": pa.Column(str, coerce=True, nullable=True),
    "precinct": pa.Column(str, coerce=True, nullable=True),
    "ward_precinct": pa.Column(str, coerce=True, nullable=True),
    "police_district": pa.Column(pa.Category, coerce=True, nullable=True),
    "license_code": pa.Column(pa.Category, coerce=True, checks=[
        pa.Check.isin(in_scope_license_codes)
    ]),
    "license_description": pa.Column(str, coerce=True),
    "business_activity_id": pa.Column(str, coerce=True, nullable=True),
    "business_activity": pa.Column(pa.Category, coerce=True, nullable=True),
    "license_number": pa.Column(str, coerce=True),
    "application_type": pa.Column(pa.Category, coerce=True),
    "application_created_date": pa.Column(str, coerce=True, nullable=True),
    "application_requirements_complete": pa.Column(pa.DateTime, coerce=True, nullable=True),
    "payment_date": pa.Column(pa.DateTime, coerce=True, nullable=True),
    "conditional_approval": pa.Column(bool, coerce=True),
    "license_start_date": pa.Column(pa.DateTime, coerce=True, nullable=True),
    "expiration_date": pa.Column(pa.DateTime, coerce=True, nullable=True),
    "license_approved_for_issuance": pa.Column(pa.DateTime, coerce=True, nullable=True),
    "date_issued": pa.Column(pa.DateTime, coerce=True),
    "license_status": pa.Column(pa.Category, coerce=True),
    "license_status_change_date": pa.Column(pa.DateTime, coerce=True, nullable=True),
    "ssa": pa.Column(str, coerce=True, nullable=True),
    "latitude": pa.Column(pa.Float, coerce=True, nullable=True, checks=[
        pa.Check.between(38, 44)
    ]),
    "longitude": pa.Column(pa.Float, coerce=True, nullable=True, checks=[
        pa.Check.between(-89, -84)
    ]),
})



business_license_validated = business_license_schema.validate(business_license_tidy)
business_license_validated

Unnamed: 0,id,license_id,account_number,site_number,legal_name,doing_business_as_name,address,city,state,zip_code,ward,precinct,ward_precinct,police_district,license_code,license_description,business_activity_id,business_activity,license_number,application_type,application_created_date,application_requirements_complete,payment_date,conditional_approval,license_start_date,expiration_date,license_approved_for_issuance,date_issued,license_status,license_status_change_date,ssa,latitude,longitude
0,1045374-20050816,1606309,56459,1,PAUL COLLURAFICI,TATTOO FACTORY,4408 N BROADWAY,CHICAGO,IL,60640,46,32,46-32,19,1013,Body Piercing,,,1045374,RENEW,,2005-06-21,2005-07-22,False,2005-08-16,2006-08-15,2005-07-22,2005-07-25,AAI,NaT,34,41.961999,-87.655462
1,1045378-20010216,1118018,203578,1,AMBER RAE BORGLIN,AMBER RAE BORLING,1417 W ROSCOE ST APT.2,CHICAGO,IL,60657,32,47,32-47,19,1525,Massage Therapist,,,1045378,RENEW,2000-12-18T00:00:00.000,2000-12-19,2003-09-11,False,2001-02-16,2002-02-15,NaT,2003-09-11,AAI,NaT,27,41.943297,-87.664648
2,1045378-20020216,1422900,203578,1,AMBER RAE BORGLIN,AMBER RAE BORLING,1417 W ROSCOE ST APT.2,CHICAGO,IL,60657,32,47,32-47,19,1525,Massage Therapist,,,1045378,RENEW,,2003-09-11,2003-09-11,False,2002-02-16,2003-02-15,NaT,2003-09-11,AAI,NaT,27,41.943297,-87.664648
3,1045378-20030216,1422901,203578,1,AMBER RAE BORGLIN,AMBER RAE BORLING,1417 W ROSCOE ST APT.2,CHICAGO,IL,60657,32,47,32-47,19,1525,Massage Therapist,,,1045378,RENEW,,2003-09-11,2003-09-11,False,2003-02-16,2004-02-15,2003-09-11,2003-09-12,AAI,NaT,27,41.943297,-87.664648
4,1045381-20020816,1251347,6760,1,SOUTH SHORE HOSPITAL CORPORATION,SOUTH SHORE HOSPITAL,8012 S CRANDON AVE 1ST,CHICAGO,IL,60617,8,7,8-7,4,1006,Retail Food Establishment,775,Retail Sales of Perishable Foods,1045381,RENEW,,2002-06-28,2002-08-13,False,2002-08-16,2003-08-15,2002-09-17,2002-09-18,AAI,NaT,,41.749451,-87.568779
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
977942,1045370-20020816,1266103,203574,1,cafe mauro inc,CAFE MAURO,3633 W 26TH ST 1,CHICAGO,IL,60623,22,28,22-28,10,1006,Retail Food Establishment,775,Retail Sales of Perishable Foods,1045370,RENEW,,2002-06-28,2002-08-02,False,2002-08-16,2003-08-15,2002-08-02,2002-08-26,AAI,NaT,25,41.844259,-87.715946
977943,1045370-20030816,1372670,203574,1,cafe mauro inc,CAFE MAURO,3633 W 26TH ST 1,CHICAGO,IL,60623,22,28,22-28,10,1006,Retail Food Establishment,775,Retail Sales of Perishable Foods,1045370,RENEW,,2003-06-24,2003-08-18,False,2003-08-16,2004-08-15,2003-09-12,2003-09-12,AAI,NaT,25,41.844259,-87.715946
977944,1045374-20020816,1261763,56459,1,PAUL COLLURAFICI,TATTOO FACTORY,4408 N BROADWAY,CHICAGO,IL,60640,46,32,46-32,19,1013,Body Piercing,,,1045374,RENEW,,2002-06-28,2002-07-23,False,2002-08-16,2003-08-15,2002-07-23,2002-07-24,AAI,NaT,34,41.961999,-87.655462
977945,1045374-20030816,1368827,56459,1,PAUL COLLURAFICI,TATTOO FACTORY,4408 N BROADWAY,CHICAGO,IL,60640,46,32,46-32,19,1013,Body Piercing,,,1045374,RENEW,,2003-06-24,2003-07-28,False,2003-08-16,2004-08-15,2003-07-28,2003-07-29,AAI,NaT,34,41.961999,-87.655462


Insert the data into PostgreSQL.

In [12]:
# Insert the data into postgres. Inserting large amounts of data can be slow, so
# iterate over 10,000 rows at a time.

n_rows = business_license_validated.shape[0]
step_size = 10_000

for i in range(0, n_rows, step_size):
    index_start = i
    index_end = min(n_rows, i + step_size - 1)
    
    if i == 0:
        if_exists = "replace"
    else:
        if_exists = "append"

    print(f"Inserting rows: {index_start:,} - {index_end:,}")
    
    business_license_validated \
        .loc[index_start:index_end, :] \
        .to_sql("business_license_validated", engine, if_exists=if_exists, index=False)

Inserting rows: 0 - 9,999
Inserting rows: 10,000 - 19,999
Inserting rows: 20,000 - 29,999
Inserting rows: 30,000 - 39,999
Inserting rows: 40,000 - 49,999
Inserting rows: 50,000 - 59,999
Inserting rows: 60,000 - 69,999
Inserting rows: 70,000 - 79,999
Inserting rows: 80,000 - 89,999
Inserting rows: 90,000 - 99,999
Inserting rows: 100,000 - 109,999
Inserting rows: 110,000 - 119,999
Inserting rows: 120,000 - 129,999
Inserting rows: 130,000 - 139,999
Inserting rows: 140,000 - 149,999
Inserting rows: 150,000 - 159,999
Inserting rows: 160,000 - 169,999
Inserting rows: 170,000 - 179,999
Inserting rows: 180,000 - 189,999
Inserting rows: 190,000 - 199,999
Inserting rows: 200,000 - 209,999
Inserting rows: 210,000 - 219,999
Inserting rows: 220,000 - 229,999
Inserting rows: 230,000 - 239,999
Inserting rows: 240,000 - 249,999
Inserting rows: 250,000 - 259,999
Inserting rows: 260,000 - 269,999
Inserting rows: 270,000 - 279,999
Inserting rows: 280,000 - 289,999
Inserting rows: 290,000 - 299,999
...

In [13]:
# Confirm number of rows
pd.read_sql_query("SELECT COUNT(*) FROM business_license_validated", engine)

Unnamed: 0,count
0,977947
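As an alternative to slicing the DataFrame by hand, `DataFrame.to_sql` accepts a `chunksize` argument that writes the rows in batches. The sketch below uses an in-memory SQLite database as a stand-in for Postgres so that it runs anywhere; the toy table and batch size are assumptions for illustration, and this variant does not print progress the way the loop above does.

```python
import sqlite3

import pandas as pd

# Toy data standing in for the validated table.
df = pd.DataFrame({"license_id": [str(i) for i in range(25)]})

# In-memory SQLite database used here instead of Postgres.
conn = sqlite3.connect(":memory:")

# Write the rows in batches of 10.
df.to_sql(
    "business_license_validated",
    conn,
    if_exists="replace",
    index=False,
    chunksize=10,
)

# Confirm the number of rows.
row_count = pd.read_sql_query(
    "SELECT COUNT(*) AS n FROM business_license_validated", conn
)["n"].iloc[0]
print(row_count)

conn.close()
```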


## Data set (2): Food Inspection Data

<https://data.cityofchicago.org/Health-Human-Services/Food-Inspections/4ijn-s7e5>

In [14]:
food_inspection_raw

Unnamed: 0,inspection_id,dba_name,aka_name,license_,facility_type,risk,address,city,state,zip,inspection_date,inspection_type,results,violations,latitude,longitude,location
0,67733,WOLCOTT'S,TROQUET,1992040,Restaurant,Risk 1 (High),1834 W MONTROSE AVE,CHICAGO,IL,60613,2010-01-04T00:00:00.000,License Re-Inspection,Pass,,41.961605669949854,-87.67596676683779,"(41.961605669949854, -87.67596676683779)"
1,67732,WOLCOTT'S,TROQUET,1992039,Restaurant,Risk 1 (High),1834 W MONTROSE AVE,CHICAGO,IL,60613,2010-01-04T00:00:00.000,License Re-Inspection,Pass,,41.961605669949854,-87.67596676683779,"(41.961605669949854, -87.67596676683779)"
2,67738,MICHAEL'S ON MAIN CAFE,MICHAEL'S ON MAIN CAFE,2008948,Restaurant,Risk 1 (High),8750 W BRYN WAWR AVE,CHICAGO,IL,60631,2010-01-04T00:00:00.000,License,Fail,18. NO EVIDENCE OF RODENT OR INSECT OUTER OPEN...,,,
3,67757,DUNKIN DONUTS/BASKIN-ROBBINS,DUNKIN DONUTS/BASKIN-ROBBINS,1380279,Restaurant,Risk 2 (Medium),100 W RANDOLPH ST,CHICAGO,IL,60601,2010-01-04T00:00:00.000,Tag Removal,Pass,,41.88458626715456,-87.63101044588599,"(41.88458626715456, -87.63101044588599)"
4,52234,Cafe 608,Cafe 608,2013328,Restaurant,Risk 1 (High),608 W BARRY AVE,CHICAGO,IL,60657,2010-01-04T00:00:00.000,License Re-Inspection,Pass,,41.938006880423615,-87.6447545707008,"(41.938006880423615, -87.6447545707008)"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
256207,2578135,LITTLE HANDS LEARNING CENTER ACADEMY,LITTLE HANDS LEARNING CENTER ACADEMY,2831108,Children's Services Facility,Risk 1 (High),10126 S WESTERN AVE,CHICAGO,IL,60643,2023-06-30T00:00:00.000,Canvass,Pass,"38. INSECTS, RODENTS, & ANIMALS NOT PRESENT - ...",41.70921709772374,-87.68185356852545,"(41.70921709772374, -87.68185356852545)"
256208,2578122,7-Eleven,7-Eleven,21286,Grocery Store,Risk 2 (Medium),6057 S KEDZIE AVE,CHICAGO,IL,60629,2023-06-30T00:00:00.000,Complaint,Fail,"1. PERSON IN CHARGE PRESENT, DEMONSTRATES KNOW...",41.78275257014254,-87.70310188982967,"(41.78275257014254, -87.70310188982967)"
256209,2578132,"HAUTE BRATS, LLC",HAUTE BRATS,2841458,Restaurant,Risk 2 (Medium),6239 S ASHLAND AVE,CHICAGO,IL,60636,2023-06-30T00:00:00.000,Complaint,No Entry,,41.78012575548844,-87.66409001096028,"(41.78012575548844, -87.66409001096028)"
256210,2578116,WEST CHICAGO FUEL,WEST CHICAGO FUEL,2891134,Grocery Store,Risk 3 (Low),10 N KILBOURN AVE,CHICAGO,IL,60624,2023-06-30T00:00:00.000,License,Pass,49. NON-FOOD/FOOD CONTACT SURFACES CLEAN - Com...,41.880908954758716,-87.73808517343443,"(41.880908954758716, -87.73808517343443)"


**Data cleaning**

Apply some basic cleaning steps to the data.

In [15]:
food_inspection_tidy = (food_inspection_raw

    # Filter on the relevant state and city only.
    .loc[food_inspection_raw["state"] == "IL"]
    .loc[food_inspection_raw["city"] == "CHICAGO"]

    # Drop columns that also exist in the business license data.
    .drop(columns=["address", "city", "state", "latitude", "longitude", "location"])

    # Convert categorical columns to be all upper case for consistency
    .assign(
        dba_name=lambda x: x["dba_name"].str.upper(),
        aka_name=lambda x: x["aka_name"].str.upper(),
        facility_type=lambda x: x["facility_type"].str.upper(),
        risk=lambda x: x["risk"].str.upper(),
        inspection_type=lambda x: x["inspection_type"].str.upper(),
        results=lambda x: x["results"].str.upper(),
        violations=lambda x: x["violations"].str.upper(),
    )

    # Specify the order of categorical columns.
    .assign(risk=lambda x: x["risk"].astype("category").cat.set_categories(["ALL", "RISK 1 (HIGH)", "RISK 2 (MEDIUM)", "RISK 3 (LOW)"], ordered=True))

    # The "violations" can have multiple violations separated by a "|". E.g.
    # "32. FOOD AND NON-FOOD ... REPLACED. | 33. FOOD AND NON-FOOD CONTACT E"
    # To make the data easier to work with split each violation into its own item.
    # The result is the violations column will contain a list of strings.
    .assign(violations=lambda x: x["violations"].str.split(pat=r" \| "))

    # Reset the index.
    .reset_index(drop=True)
)

food_inspection_tidy

Unnamed: 0,inspection_id,dba_name,aka_name,license_,facility_type,risk,zip,inspection_date,inspection_type,results,violations
0,67733,WOLCOTT'S,TROQUET,1992040,RESTAURANT,RISK 1 (HIGH),60613,2010-01-04T00:00:00.000,LICENSE RE-INSPECTION,PASS,
1,67732,WOLCOTT'S,TROQUET,1992039,RESTAURANT,RISK 1 (HIGH),60613,2010-01-04T00:00:00.000,LICENSE RE-INSPECTION,PASS,
2,67738,MICHAEL'S ON MAIN CAFE,MICHAEL'S ON MAIN CAFE,2008948,RESTAURANT,RISK 1 (HIGH),60631,2010-01-04T00:00:00.000,LICENSE,FAIL,[18. NO EVIDENCE OF RODENT OR INSECT OUTER OPE...
3,67757,DUNKIN DONUTS/BASKIN-ROBBINS,DUNKIN DONUTS/BASKIN-ROBBINS,1380279,RESTAURANT,RISK 2 (MEDIUM),60601,2010-01-04T00:00:00.000,TAG REMOVAL,PASS,
4,52234,CAFE 608,CAFE 608,2013328,RESTAURANT,RISK 1 (HIGH),60657,2010-01-04T00:00:00.000,LICENSE RE-INSPECTION,PASS,
...,...,...,...,...,...,...,...,...,...,...,...
255135,2578135,LITTLE HANDS LEARNING CENTER ACADEMY,LITTLE HANDS LEARNING CENTER ACADEMY,2831108,CHILDREN'S SERVICES FACILITY,RISK 1 (HIGH),60643,2023-06-30T00:00:00.000,CANVASS,PASS,"[38. INSECTS, RODENTS, & ANIMALS NOT PRESENT -..."
255136,2578122,7-ELEVEN,7-ELEVEN,21286,GROCERY STORE,RISK 2 (MEDIUM),60629,2023-06-30T00:00:00.000,COMPLAINT,FAIL,"[1. PERSON IN CHARGE PRESENT, DEMONSTRATES KNO..."
255137,2578132,"HAUTE BRATS, LLC",HAUTE BRATS,2841458,RESTAURANT,RISK 2 (MEDIUM),60636,2023-06-30T00:00:00.000,COMPLAINT,NO ENTRY,
255138,2578116,WEST CHICAGO FUEL,WEST CHICAGO FUEL,2891134,GROCERY STORE,RISK 3 (LOW),60624,2023-06-30T00:00:00.000,LICENSE,PASS,[49. NON-FOOD/FOOD CONTACT SURFACES CLEAN - CO...
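The effect of splitting the `violations` column can be seen on a toy Series. This sketch uses a literal separator with `regex=False` rather than an escaped pattern, and adds `explode` to show one way the resulting lists could later be unpacked into one row per violation; the violation text is made up.

```python
import pandas as pd

# Toy data: one inspection with two violations, one with none.
violations = pd.Series([
    "32. FIRST VIOLATION | 33. SECOND VIOLATION",
    None,
])

# Split on the literal " | " separator; missing values stay missing.
split = violations.str.split(" | ", regex=False)
print(split)

# One row per violation; the original index is repeated for each item.
exploded = split.explode()
print(exploded)
```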


**Data validation**

Use pandera to validate the data and convert each column to the correct type.

In [16]:
food_inspection_schema = pa.DataFrameSchema({
    "inspection_id": pa.Column(str, coerce=True, unique=True), # Primary Key
    "dba_name": pa.Column(str, coerce=True),
    "aka_name": pa.Column(str, coerce=True, nullable=True),
    "license_": pa.Column(str, coerce=True, nullable=True), # Foreign Key
    "facility_type": pa.Column(pa.Category, coerce=True, nullable=True),
    "risk": pa.Column(str, coerce=True, nullable=True, checks=[
        pa.Check.isin(["ALL", "RISK 1 (HIGH)", "RISK 2 (MEDIUM)", "RISK 3 (LOW)"])
    ]),
    "zip": pa.Column(str, coerce=True, nullable=True),
    "inspection_date": pa.Column(pa.DateTime, coerce=True),
    "inspection_type": pa.Column(pa.Category, coerce=True, nullable=True),
    "results": pa.Column(pa.Category, coerce=True),
    "violations": pa.Column(pa.Object, coerce=True, nullable=True)
})

food_inspection_validated = food_inspection_schema.validate(food_inspection_tidy)
food_inspection_validated

Unnamed: 0,inspection_id,dba_name,aka_name,license_,facility_type,risk,zip,inspection_date,inspection_type,results,violations
0,67733,WOLCOTT'S,TROQUET,1992040,RESTAURANT,RISK 1 (HIGH),60613,2010-01-04,LICENSE RE-INSPECTION,PASS,
1,67732,WOLCOTT'S,TROQUET,1992039,RESTAURANT,RISK 1 (HIGH),60613,2010-01-04,LICENSE RE-INSPECTION,PASS,
2,67738,MICHAEL'S ON MAIN CAFE,MICHAEL'S ON MAIN CAFE,2008948,RESTAURANT,RISK 1 (HIGH),60631,2010-01-04,LICENSE,FAIL,[18. NO EVIDENCE OF RODENT OR INSECT OUTER OPE...
3,67757,DUNKIN DONUTS/BASKIN-ROBBINS,DUNKIN DONUTS/BASKIN-ROBBINS,1380279,RESTAURANT,RISK 2 (MEDIUM),60601,2010-01-04,TAG REMOVAL,PASS,
4,52234,CAFE 608,CAFE 608,2013328,RESTAURANT,RISK 1 (HIGH),60657,2010-01-04,LICENSE RE-INSPECTION,PASS,
...,...,...,...,...,...,...,...,...,...,...,...
255135,2578135,LITTLE HANDS LEARNING CENTER ACADEMY,LITTLE HANDS LEARNING CENTER ACADEMY,2831108,CHILDREN'S SERVICES FACILITY,RISK 1 (HIGH),60643,2023-06-30,CANVASS,PASS,"[38. INSECTS, RODENTS, & ANIMALS NOT PRESENT -..."
255136,2578122,7-ELEVEN,7-ELEVEN,21286,GROCERY STORE,RISK 2 (MEDIUM),60629,2023-06-30,COMPLAINT,FAIL,"[1. PERSON IN CHARGE PRESENT, DEMONSTRATES KNO..."
255137,2578132,"HAUTE BRATS, LLC",HAUTE BRATS,2841458,RESTAURANT,RISK 2 (MEDIUM),60636,2023-06-30,COMPLAINT,NO ENTRY,
255138,2578116,WEST CHICAGO FUEL,WEST CHICAGO FUEL,2891134,GROCERY STORE,RISK 3 (LOW),60624,2023-06-30,LICENSE,PASS,[49. NON-FOOD/FOOD CONTACT SURFACES CLEAN - CO...


Insert the data into PostgreSQL.

In [17]:
# Insert the data into postgres. Inserting large amounts of data can be slow, so
# iterate over 10,000 rows at a time.

n_rows = food_inspection_validated.shape[0]
step_size = 10_000

for i in range(0, n_rows, step_size):
    index_start = i
    index_end = min(n_rows, i + step_size - 1)
    
    if i == 0:
        if_exists = "replace"
    else:
        if_exists = "append"

    print(f"Inserting rows: {index_start:,} - {index_end:,}")

    food_inspection_validated \
        .loc[index_start:index_end, :] \
        .to_sql("food_inspection_validated", engine, if_exists=if_exists, index=False)

Inserting rows: 0 - 9,999
Inserting rows: 10,000 - 19,999
Inserting rows: 20,000 - 29,999
Inserting rows: 30,000 - 39,999
Inserting rows: 40,000 - 49,999
Inserting rows: 50,000 - 59,999
Inserting rows: 60,000 - 69,999
Inserting rows: 70,000 - 79,999
Inserting rows: 80,000 - 89,999
Inserting rows: 90,000 - 99,999
Inserting rows: 100,000 - 109,999
Inserting rows: 110,000 - 119,999
Inserting rows: 120,000 - 129,999
Inserting rows: 130,000 - 139,999
Inserting rows: 140,000 - 149,999
Inserting rows: 150,000 - 159,999
Inserting rows: 160,000 - 169,999
Inserting rows: 170,000 - 179,999
Inserting rows: 180,000 - 189,999
Inserting rows: 190,000 - 199,999
Inserting rows: 200,000 - 209,999
Inserting rows: 210,000 - 219,999
Inserting rows: 220,000 - 229,999
Inserting rows: 230,000 - 239,999
Inserting rows: 240,000 - 249,999
Inserting rows: 250,000 - 255,140


In [18]:
# Confirm number of rows
pd.read_sql_query("SELECT COUNT(*) FROM food_inspection_validated", engine)

Unnamed: 0,count
0,255140


## Data set (3): Map Data

Generate map data for use in downstream applications.

In [19]:
map_cols = [
    "license_id",
    "legal_name", 
    "doing_business_as_name",
    "license_code",
    "license_description",
    "address", 
    "zip_code",
    "latitude",
    "longitude"
]

map_data = (
    business_license_validated
    .head(1_000)
    .loc[:, map_cols]
    .drop_duplicates()
    .reset_index(drop=True)
)

map_data

Unnamed: 0,license_id,legal_name,doing_business_as_name,license_code,license_description,address,zip_code,latitude,longitude
0,1606309,PAUL COLLURAFICI,TATTOO FACTORY,1013,Body Piercing,4408 N BROADWAY,60640,41.961999,-87.655462
1,1118018,AMBER RAE BORGLIN,AMBER RAE BORLING,1525,Massage Therapist,1417 W ROSCOE ST APT.2,60657,41.943297,-87.664648
2,1422900,AMBER RAE BORGLIN,AMBER RAE BORLING,1525,Massage Therapist,1417 W ROSCOE ST APT.2,60657,41.943297,-87.664648
3,1422901,AMBER RAE BORGLIN,AMBER RAE BORLING,1525,Massage Therapist,1417 W ROSCOE ST APT.2,60657,41.943297,-87.664648
4,1251347,SOUTH SHORE HOSPITAL CORPORATION,SOUTH SHORE HOSPITAL,1006,Retail Food Establishment,8012 S CRANDON AVE 1ST,60617,41.749451,-87.568779
...,...,...,...,...,...,...,...,...,...
995,1994037,ELBA C ZEPEDA,LAS AMERICAS PRINTING,1010,Limited Business License,3621 W ARMITAGE AVE 1ST,60647,41.917162,-87.717616
996,2116653,ELBA C ZEPEDA,LAS AMERICAS PRINTING,1010,Limited Business License,3621 W ARMITAGE AVE 1ST,60647,41.917162,-87.717616
997,2285994,ELBA C ZEPEDA,LAS AMERICAS PRINTING,1010,Limited Business License,3621 W ARMITAGE AVE 1ST,60647,41.917162,-87.717616
998,2425101,ELBA C ZEPEDA,LAS AMERICAS PRINTING,1010,Limited Business License,3621 W ARMITAGE AVE 1ST,60647,41.917162,-87.717616


Save the map data as a pin for easy access by other applications.

In [20]:
# Pin the data to Connect
pin_name = f"{user_name}/chicago-business-map-data"

board.pin_write(
    map_data, 
    name=pin_name, 
    type="arrow", 
    versioned=True,
    title="City of Chicago - Business Map Data"
)

Writing pin:
Name: 'sam.edwardes/chicago-business-map-data'
Version: 20230704T102635Z-2a6ae


Meta(title='City of Chicago - Business Map Data', description=None, created='20230704T102635Z', pin_hash='2a6ae93f42bb0110', file='chicago-business-map-data.arrow', file_size=49202, type='arrow', api_version=1, version=VersionRaw(version='76677'), tags=None, name='sam.edwardes/chicago-business-map-data', user={}, local={})