# Data Validation

In the previous notebook, two pins were saved:

- City of Chicago - Business License Data (RAW): `chicago-business-license-data-raw`
- City of Chicago - Food Inspection Data (RAW): `chicago-food-inspection-data-raw`

## Setup

In [1]:
import pins

import pandas as pd
import numpy as np
import pandera as pa

In [2]:
pd.options.display.max_columns = 999

In [3]:
# Set up the board
board = pins.board_connect()
user_name = "sam.edwardes"

## Tips

- Use multiple cursors in VS Code to easily edit many lines at the same time (<https://code.visualstudio.com/docs/getstarted/tips-and-tricks#_column-box-selection>).
- Use `df["col_name"].value_counts()` to understand the distribution of categorical columns.
- Use `df["col_name"].hist()` to understand the distribution of numeric columns.
- Use `df.info()` to understand column types and null values.
- Use [ydata-profiling](https://pypi.org/project/ydata-profiling/) to generate an automated data report.

```python
from ydata_profiling import ProfileReport
ProfileReport(df)
```
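
As a minimal sketch of the inspection tips above, the snippet below applies them to a small made-up frame (the contents of `df` are illustrative, not taken from the Chicago data). `describe()` is used here as a text-only complement to `hist()`, which needs a plotting backend.

```python
import pandas as pd

# Toy data for illustration only.
df = pd.DataFrame({
    "city": ["CHICAGO", "CHICAGO", "EVANSTON", None],
    "latitude": [41.99, 41.93, 42.05, None],
})

# Distribution of a categorical column, including missing values.
city_counts = df["city"].value_counts(dropna=False)

# Numeric summary (count, mean, quartiles) of a numeric column.
latitude_summary = df["latitude"].describe()

# Number of null values per column.
null_counts = df.isna().sum()

# Column types and non-null counts, printed to stdout.
df.info()
```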

## Data set (1): Business License Data

<https://data.cityofchicago.org/Community-Economic-Development/Business-Licenses/r5kz-chrr>

In [4]:
pin_name = f"{user_name}/chicago-business-license-data-raw"
business_license_raw = board.pin_read(pin_name)
business_license_raw

Unnamed: 0,id,license_id,account_number,site_number,legal_name,doing_business_as_name,address,city,state,zip_code,ward,precinct,ward_precinct,police_district,license_code,license_description,business_activity_id,business_activity,license_number,application_type,application_created_date,application_requirements_complete,payment_date,conditional_approval,license_start_date,expiration_date,license_approved_for_issuance,date_issued,license_status,license_status_change_date,ssa,latitude,longitude,location
0,1000000-20020221,1000000,200001,1,MARK BOSTON,COLORS IN MOTION,6421 N DAMEN AVE,CHICAGO,IL,60645,50,28,50-28,24,1011,Home Repair,,,1000000,ISSUE,2000-06-19T00:00:00.000,2002-02-15T00:00:00.000,2002-02-15T00:00:00.000,N,2002-02-21T00:00:00.000,2002-11-15T00:00:00.000,2002-02-21T00:00:00.000,2002-02-22T00:00:00.000,AAI,,,41.998514371,-87.680010905,"\n, \n(41.99851437112669, -87.68001090539342)"
1,1000049-20010816,1162772,200068,1,ANTONIA CASTREJON,ILLUSIONS HAIR DESIGN,3800 W DIVERSEY AVE,CHICAGO,IL,60647,31,999,31-999,25,1010,Limited Business License,,,1000049,RENEW,,2001-06-25T00:00:00.000,2001-08-20T00:00:00.000,N,2001-08-16T00:00:00.000,2002-08-15T00:00:00.000,2001-08-20T00:00:00.000,2002-04-30T00:00:00.000,AAI,,,41.931960333,-87.722150366,"\n, \n(41.931960332638006, -87.72215036594574)"
2,1000049-20020516,1233615,10141,2,"PEPE""S RETAIL MEATS, INC.",PEREZ MEXICAN FOOD,853-855 W RANDOLPH ST 1ST,CHICAGO,IL,60607,27,1,27-1,12,1006,Retail Food Establishment,775,Retail Sales of Perishable Foods,1000049,RENEW,,2002-03-27T00:00:00.000,2002-04-17T00:00:00.000,N,2002-05-16T00:00:00.000,2003-05-15T00:00:00.000,2002-04-17T00:00:00.000,2002-04-18T00:00:00.000,AAI,,,41.884261422,-87.649534131,"\n, \n(41.88426142200001, -87.6495341312589)"
3,1000049-20020816,1265665,200068,1,ANTONIA CASTREJON,ILLUSIONS HAIR DESIGN,3800 W DIVERSEY AVE,CHICAGO,IL,60647,31,999,31-999,25,1010,Limited Business License,,,1000049,RENEW,,2002-06-28T00:00:00.000,2002-08-13T00:00:00.000,N,2002-08-16T00:00:00.000,2003-08-15T00:00:00.000,2002-08-13T00:00:00.000,2002-08-14T00:00:00.000,AAI,,,41.931960333,-87.722150366,"\n, \n(41.931960332638006, -87.72215036594574)"
4,1000049-20030516,1342680,10141,2,"PEPE""S RETAIL MEATS, INC.",PEREZ MEXICAN FOOD,853-855 W RANDOLPH ST 1ST,CHICAGO,IL,60607,27,1,27-1,12,1006,Retail Food Establishment,775,Retail Sales of Perishable Foods,1000049,RENEW,,2003-03-25T00:00:00.000,2003-04-17T00:00:00.000,N,2003-05-16T00:00:00.000,2004-05-15T00:00:00.000,2003-04-17T00:00:00.000,2003-04-18T00:00:00.000,AAI,,,41.884261422,-87.649534131,"\n, \n(41.88426142200001, -87.6495341312589)"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1104097,9999-20140916,2343163,26256,1,CHURCH & CHAPEL METAL ARTS INC,CHURCH & CHAPEL METAL ARTS INC,2616 W GRAND AVE 1ST,CHICAGO,IL,60612,36,17,36-17,12,1010,Limited Business License,,,9999,RENEW,,2014-07-15T00:00:00.000,2014-12-26T00:00:00.000,N,2014-09-16T00:00:00.000,2016-09-15T00:00:00.000,2014-12-26T00:00:00.000,2014-12-29T00:00:00.000,AAI,,,41.892720807,-87.692331754,"\n, \n(41.89272080716665, -87.69233175444906)"
1104098,9999-20160916,2478055,26256,1,CHURCH & CHAPEL METAL ARTS INC,CHURCH & CHAPEL METAL ARTS INC,2616 W GRAND AVE 1ST,CHICAGO,IL,60612,36,17,36-17,12,1010,Limited Business License,,,9999,RENEW,,2016-07-15T00:00:00.000,2016-09-08T00:00:00.000,N,2016-09-16T00:00:00.000,2018-09-15T00:00:00.000,2016-09-08T00:00:00.000,2016-09-09T00:00:00.000,AAI,,,41.892720807,-87.692331754,"\n, \n(41.89272080716665, -87.69233175444906)"
1104099,9999-20180916,2610578,26256,1,CHURCH & CHAPEL METAL ARTS INC,CHURCH & CHAPEL METAL ARTS INC,2616 W GRAND AVE 1ST,CHICAGO,IL,60612,36,17,36-17,12,1010,Limited Business License,,,9999,RENEW,,2018-07-15T00:00:00.000,2018-09-10T00:00:00.000,N,2018-09-16T00:00:00.000,2020-09-15T00:00:00.000,2018-09-10T00:00:00.000,2018-09-11T00:00:00.000,AAI,,,41.892720807,-87.692331754,"\n, \n(41.89272080716665, -87.69233175444906)"
1104100,9999-20200916,2739432,26256,1,CHURCH & CHAPEL METAL ARTS INC,CHURCH & CHAPEL METAL ARTS INC,2616 W GRAND AVE 1ST,CHICAGO,IL,60612,36,17,36-17,12,1010,Limited Business License,,,9999,RENEW,,2020-07-15T00:00:00.000,2020-08-05T00:00:00.000,N,2020-09-16T00:00:00.000,2022-09-15T00:00:00.000,2020-08-05T00:00:00.000,2020-08-06T00:00:00.000,AAI,,,41.892720807,-87.692331754,"\n, \n(41.89272080716665, -87.69233175444906)"


**Data cleaning**

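Apply some basic cleaning steps: keep only Chicago, IL records, convert `conditional_approval` to a boolean, and drop the redundant `location` column.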

In [5]:
business_license_tidy = (business_license_raw

    # Filter on the relevant state and city only. The callables are evaluated
    # against the intermediate frame, so each mask matches the rows being filtered.
    .loc[lambda x: x["state"] == "IL"]
    .loc[lambda x: x["city"] == "CHICAGO"]

    # Convert conditional approval to a boolean value.
    .assign(conditional_approval=lambda x: x["conditional_approval"] == "Y")
    
    # Drop the "location" column. The same data is already stored in the
    # "latitude" and "longitude" columns.
    .drop(columns=["location"])

    # Reset the index.
    .reset_index(drop=True)
)

business_license_tidy

Unnamed: 0,id,license_id,account_number,site_number,legal_name,doing_business_as_name,address,city,state,zip_code,ward,precinct,ward_precinct,police_district,license_code,license_description,business_activity_id,business_activity,license_number,application_type,application_created_date,application_requirements_complete,payment_date,conditional_approval,license_start_date,expiration_date,license_approved_for_issuance,date_issued,license_status,license_status_change_date,ssa,latitude,longitude
0,1000000-20020221,1000000,200001,1,MARK BOSTON,COLORS IN MOTION,6421 N DAMEN AVE,CHICAGO,IL,60645,50,28,50-28,24,1011,Home Repair,,,1000000,ISSUE,2000-06-19T00:00:00.000,2002-02-15T00:00:00.000,2002-02-15T00:00:00.000,False,2002-02-21T00:00:00.000,2002-11-15T00:00:00.000,2002-02-21T00:00:00.000,2002-02-22T00:00:00.000,AAI,,,41.998514371,-87.680010905
1,1000049-20010816,1162772,200068,1,ANTONIA CASTREJON,ILLUSIONS HAIR DESIGN,3800 W DIVERSEY AVE,CHICAGO,IL,60647,31,999,31-999,25,1010,Limited Business License,,,1000049,RENEW,,2001-06-25T00:00:00.000,2001-08-20T00:00:00.000,False,2001-08-16T00:00:00.000,2002-08-15T00:00:00.000,2001-08-20T00:00:00.000,2002-04-30T00:00:00.000,AAI,,,41.931960333,-87.722150366
2,1000049-20020516,1233615,10141,2,"PEPE""S RETAIL MEATS, INC.",PEREZ MEXICAN FOOD,853-855 W RANDOLPH ST 1ST,CHICAGO,IL,60607,27,1,27-1,12,1006,Retail Food Establishment,775,Retail Sales of Perishable Foods,1000049,RENEW,,2002-03-27T00:00:00.000,2002-04-17T00:00:00.000,False,2002-05-16T00:00:00.000,2003-05-15T00:00:00.000,2002-04-17T00:00:00.000,2002-04-18T00:00:00.000,AAI,,,41.884261422,-87.649534131
3,1000049-20020816,1265665,200068,1,ANTONIA CASTREJON,ILLUSIONS HAIR DESIGN,3800 W DIVERSEY AVE,CHICAGO,IL,60647,31,999,31-999,25,1010,Limited Business License,,,1000049,RENEW,,2002-06-28T00:00:00.000,2002-08-13T00:00:00.000,False,2002-08-16T00:00:00.000,2003-08-15T00:00:00.000,2002-08-13T00:00:00.000,2002-08-14T00:00:00.000,AAI,,,41.931960333,-87.722150366
4,1000049-20030516,1342680,10141,2,"PEPE""S RETAIL MEATS, INC.",PEREZ MEXICAN FOOD,853-855 W RANDOLPH ST 1ST,CHICAGO,IL,60607,27,1,27-1,12,1006,Retail Food Establishment,775,Retail Sales of Perishable Foods,1000049,RENEW,,2003-03-25T00:00:00.000,2003-04-17T00:00:00.000,False,2003-05-16T00:00:00.000,2004-05-15T00:00:00.000,2003-04-17T00:00:00.000,2003-04-18T00:00:00.000,AAI,,,41.884261422,-87.649534131
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1021679,9999-20140916,2343163,26256,1,CHURCH & CHAPEL METAL ARTS INC,CHURCH & CHAPEL METAL ARTS INC,2616 W GRAND AVE 1ST,CHICAGO,IL,60612,36,17,36-17,12,1010,Limited Business License,,,9999,RENEW,,2014-07-15T00:00:00.000,2014-12-26T00:00:00.000,False,2014-09-16T00:00:00.000,2016-09-15T00:00:00.000,2014-12-26T00:00:00.000,2014-12-29T00:00:00.000,AAI,,,41.892720807,-87.692331754
1021680,9999-20160916,2478055,26256,1,CHURCH & CHAPEL METAL ARTS INC,CHURCH & CHAPEL METAL ARTS INC,2616 W GRAND AVE 1ST,CHICAGO,IL,60612,36,17,36-17,12,1010,Limited Business License,,,9999,RENEW,,2016-07-15T00:00:00.000,2016-09-08T00:00:00.000,False,2016-09-16T00:00:00.000,2018-09-15T00:00:00.000,2016-09-08T00:00:00.000,2016-09-09T00:00:00.000,AAI,,,41.892720807,-87.692331754
1021681,9999-20180916,2610578,26256,1,CHURCH & CHAPEL METAL ARTS INC,CHURCH & CHAPEL METAL ARTS INC,2616 W GRAND AVE 1ST,CHICAGO,IL,60612,36,17,36-17,12,1010,Limited Business License,,,9999,RENEW,,2018-07-15T00:00:00.000,2018-09-10T00:00:00.000,False,2018-09-16T00:00:00.000,2020-09-15T00:00:00.000,2018-09-10T00:00:00.000,2018-09-11T00:00:00.000,AAI,,,41.892720807,-87.692331754
1021682,9999-20200916,2739432,26256,1,CHURCH & CHAPEL METAL ARTS INC,CHURCH & CHAPEL METAL ARTS INC,2616 W GRAND AVE 1ST,CHICAGO,IL,60612,36,17,36-17,12,1010,Limited Business License,,,9999,RENEW,,2020-07-15T00:00:00.000,2020-08-05T00:00:00.000,False,2020-09-16T00:00:00.000,2022-09-15T00:00:00.000,2020-08-05T00:00:00.000,2020-08-06T00:00:00.000,AAI,,,41.892720807,-87.692331754
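
The cleaning chain can be tried on a small made-up frame to see what each step does. This is only a sketch: the rows in `raw` are invented, and the filters are written as callables passed to `.loc` so that each mask is computed from the frame produced by the previous step.

```python
import pandas as pd

# Toy stand-in for the raw license data (values are made up).
raw = pd.DataFrame({
    "state": ["IL", "IL", "WI", "IL"],
    "city": ["CHICAGO", "EVANSTON", "MADISON", "CHICAGO"],
    "conditional_approval": ["N", "Y", "N", "Y"],
    "location": ["(41.9, -87.6)", None, None, "(41.8, -87.7)"],
})

tidy = (
    raw
    # Keep only Illinois rows, then only Chicago rows.
    .loc[lambda x: x["state"] == "IL"]
    .loc[lambda x: x["city"] == "CHICAGO"]
    # "Y" becomes True, anything else becomes False.
    .assign(conditional_approval=lambda x: x["conditional_approval"] == "Y")
    # Remove the redundant column.
    .drop(columns=["location"])
    # Renumber the remaining rows from 0.
    .reset_index(drop=True)
)
```

Two of the four toy rows survive the filters, and the index is renumbered so that it no longer has gaps.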


**Data validation**

Use pandera to validate the data and convert each column to the correct type.

In [6]:
business_license_schema = pa.DataFrameSchema({
    "id": pa.Column(str, coerce=True),
    "license_id": pa.Column(str, coerce=True, unique=True), # Primary Key
    "account_number": pa.Column(str, coerce=True),
    "site_number": pa.Column(str, coerce=True),
    "legal_name": pa.Column(str, coerce=True),
    "doing_business_as_name": pa.Column(str, coerce=True, nullable=True),
    "address": pa.Column(str, coerce=True),
    "city": pa.Column(str, coerce=True, nullable=True, checks=[
        pa.Check.eq("CHICAGO")
    ]),
    "state": pa.Column(str, coerce=True, nullable=True, checks=[
        pa.Check.eq("IL")
    ]),
    "zip_code": pa.Column(str, coerce=True, nullable=True, checks=[
        pa.Check.str_matches(r'^\d{5}$')
    ]),
    "ward": pa.Column(str, coerce=True, nullable=True),
    "precinct": pa.Column(str, coerce=True, nullable=True),
    "ward_precinct": pa.Column(str, coerce=True, nullable=True),
    "police_district": pa.Column(pa.Category, coerce=True, nullable=True),
    "license_code": pa.Column(pa.Category, coerce=True),
    "license_description": pa.Column(str, coerce=True),
    "business_activity_id": pa.Column(str, coerce=True, nullable=True),
    "business_activity": pa.Column(pa.Category, coerce=True, nullable=True),
    "license_number": pa.Column(str, coerce=True),
    "application_type": pa.Column(pa.Category, coerce=True),
    "application_created_date": pa.Column(str, coerce=True, nullable=True),
    "application_requirements_complete": pa.Column(pa.DateTime, coerce=True, nullable=True),
    "payment_date": pa.Column(pa.DateTime, coerce=True, nullable=True),
    "conditional_approval": pa.Column(bool, coerce=True),
    "license_start_date": pa.Column(pa.DateTime, coerce=True, nullable=True),
    "expiration_date": pa.Column(pa.DateTime, coerce=True, nullable=True),
    "license_approved_for_issuance": pa.Column(pa.DateTime, coerce=True, nullable=True),
    "date_issued": pa.Column(pa.DateTime, coerce=True),
    "license_status": pa.Column(pa.Category, coerce=True),
    "license_status_change_date": pa.Column(pa.DateTime, coerce=True, nullable=True),
    "ssa": pa.Column(str, coerce=True, nullable=True),
    "latitude": pa.Column(pa.Float, coerce=True, nullable=True, checks=[
        pa.Check.between(38, 44)
    ]),
    "longitude": pa.Column(pa.Float, coerce=True, nullable=True, checks=[
        pa.Check.between(-89, -84)
    ]),
})



business_license_validated = business_license_schema.validate(business_license_tidy)
business_license_validated

Unnamed: 0,id,license_id,account_number,site_number,legal_name,doing_business_as_name,address,city,state,zip_code,ward,precinct,ward_precinct,police_district,license_code,license_description,business_activity_id,business_activity,license_number,application_type,application_created_date,application_requirements_complete,payment_date,conditional_approval,license_start_date,expiration_date,license_approved_for_issuance,date_issued,license_status,license_status_change_date,ssa,latitude,longitude
0,1000000-20020221,1000000,200001,1,MARK BOSTON,COLORS IN MOTION,6421 N DAMEN AVE,CHICAGO,IL,60645,50,28,50-28,24,1011,Home Repair,,,1000000,ISSUE,2000-06-19T00:00:00.000,2002-02-15,2002-02-15,False,2002-02-21,2002-11-15,2002-02-21,2002-02-22,AAI,NaT,,41.998514,-87.680011
1,1000049-20010816,1162772,200068,1,ANTONIA CASTREJON,ILLUSIONS HAIR DESIGN,3800 W DIVERSEY AVE,CHICAGO,IL,60647,31,999,31-999,25,1010,Limited Business License,,,1000049,RENEW,,2001-06-25,2001-08-20,False,2001-08-16,2002-08-15,2001-08-20,2002-04-30,AAI,NaT,,41.931960,-87.722150
2,1000049-20020516,1233615,10141,2,"PEPE""S RETAIL MEATS, INC.",PEREZ MEXICAN FOOD,853-855 W RANDOLPH ST 1ST,CHICAGO,IL,60607,27,1,27-1,12,1006,Retail Food Establishment,775,Retail Sales of Perishable Foods,1000049,RENEW,,2002-03-27,2002-04-17,False,2002-05-16,2003-05-15,2002-04-17,2002-04-18,AAI,NaT,,41.884261,-87.649534
3,1000049-20020816,1265665,200068,1,ANTONIA CASTREJON,ILLUSIONS HAIR DESIGN,3800 W DIVERSEY AVE,CHICAGO,IL,60647,31,999,31-999,25,1010,Limited Business License,,,1000049,RENEW,,2002-06-28,2002-08-13,False,2002-08-16,2003-08-15,2002-08-13,2002-08-14,AAI,NaT,,41.931960,-87.722150
4,1000049-20030516,1342680,10141,2,"PEPE""S RETAIL MEATS, INC.",PEREZ MEXICAN FOOD,853-855 W RANDOLPH ST 1ST,CHICAGO,IL,60607,27,1,27-1,12,1006,Retail Food Establishment,775,Retail Sales of Perishable Foods,1000049,RENEW,,2003-03-25,2003-04-17,False,2003-05-16,2004-05-15,2003-04-17,2003-04-18,AAI,NaT,,41.884261,-87.649534
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1021679,9999-20140916,2343163,26256,1,CHURCH & CHAPEL METAL ARTS INC,CHURCH & CHAPEL METAL ARTS INC,2616 W GRAND AVE 1ST,CHICAGO,IL,60612,36,17,36-17,12,1010,Limited Business License,,,9999,RENEW,,2014-07-15,2014-12-26,False,2014-09-16,2016-09-15,2014-12-26,2014-12-29,AAI,NaT,,41.892721,-87.692332
1021680,9999-20160916,2478055,26256,1,CHURCH & CHAPEL METAL ARTS INC,CHURCH & CHAPEL METAL ARTS INC,2616 W GRAND AVE 1ST,CHICAGO,IL,60612,36,17,36-17,12,1010,Limited Business License,,,9999,RENEW,,2016-07-15,2016-09-08,False,2016-09-16,2018-09-15,2016-09-08,2016-09-09,AAI,NaT,,41.892721,-87.692332
1021681,9999-20180916,2610578,26256,1,CHURCH & CHAPEL METAL ARTS INC,CHURCH & CHAPEL METAL ARTS INC,2616 W GRAND AVE 1ST,CHICAGO,IL,60612,36,17,36-17,12,1010,Limited Business License,,,9999,RENEW,,2018-07-15,2018-09-10,False,2018-09-16,2020-09-15,2018-09-10,2018-09-11,AAI,NaT,,41.892721,-87.692332
1021682,9999-20200916,2739432,26256,1,CHURCH & CHAPEL METAL ARTS INC,CHURCH & CHAPEL METAL ARTS INC,2616 W GRAND AVE 1ST,CHICAGO,IL,60612,36,17,36-17,12,1010,Limited Business License,,,9999,RENEW,,2020-07-15,2020-08-05,False,2020-09-16,2022-09-15,2020-08-05,2020-08-06,AAI,NaT,,41.892721,-87.692332


Save the data as a pin.

In [7]:
# Pin the data to Connect
pin_name = f"{user_name}/chicago-business-license-data-validated"

board.pin_write(
    business_license_validated, 
    name=pin_name, 
    type="arrow", 
    versioned=True,
    title="City of Chicago - Business License Data (VALIDATED)"
)

Writing pin:
Name: 'sam.edwardes/chicago-business-license-data-validated'
Version: 20230626T124605Z-e5b7d


Meta(title='City of Chicago - Business License Data (VALIDATED)', description=None, created='20230626T124605Z', pin_hash='e5b7dceee945bdd5', file='chicago-business-license-data-validated.arrow', file_size=122846682, type='arrow', api_version=1, version=VersionRaw(version='76400'), tags=None, name='sam.edwardes/chicago-business-license-data-validated', user={}, local={})

In [8]:
board.pin_versions(pin_name)

Unnamed: 0,version
0,75839
1,75842
2,76398
3,76400


## Data set (2): Food Inspection Data

<https://data.cityofchicago.org/Health-Human-Services/Food-Inspections/4ijn-s7e5>

In [9]:
pin_name = f"{user_name}/chicago-food-inspection-data-raw"
food_inspection_raw = board.pin_read(pin_name)
food_inspection_raw

Unnamed: 0,inspection_id,dba_name,aka_name,license_,facility_type,risk,address,city,state,zip,inspection_date,inspection_type,results,violations,latitude,longitude,location
0,70269,mr.daniel's,mr.daniel's,1899292,Restaurant,Risk 1 (High),5645 W BELMONT AVE,CHICAGO,IL,60634,2010-01-04T00:00:00.000,License Re-Inspection,Pass,,41.93844282365204,-87.76831838068422,"(41.93844282365204, -87.76831838068422)"
1,52234,Cafe 608,Cafe 608,2013328,Restaurant,Risk 1 (High),608 W BARRY AVE,CHICAGO,IL,60657,2010-01-04T00:00:00.000,License Re-Inspection,Pass,,41.938006880423615,-87.6447545707008,"(41.938006880423615, -87.6447545707008)"
2,67733,WOLCOTT'S,TROQUET,1992040,Restaurant,Risk 1 (High),1834 W MONTROSE AVE,CHICAGO,IL,60613,2010-01-04T00:00:00.000,License Re-Inspection,Pass,,41.961605669949854,-87.67596676683779,"(41.961605669949854, -87.67596676683779)"
3,67738,MICHAEL'S ON MAIN CAFE,MICHAEL'S ON MAIN CAFE,2008948,Restaurant,Risk 1 (High),8750 W BRYN WAWR AVE,CHICAGO,IL,60631,2010-01-04T00:00:00.000,License,Fail,18. NO EVIDENCE OF RODENT OR INSECT OUTER OPEN...,,,
4,67732,WOLCOTT'S,TROQUET,1992039,Restaurant,Risk 1 (High),1834 W MONTROSE AVE,CHICAGO,IL,60613,2010-01-04T00:00:00.000,License Re-Inspection,Pass,,41.961605669949854,-87.67596676683779,"(41.961605669949854, -87.67596676683779)"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
255658,2577663,"WEST TOWN DAYCARE ,LLC","WEST TOWN DAYCARE, LLC",2570038,Children's Services Facility,Risk 1 (High),2751 W CORTEZ ST,CHICAGO,IL,60622,2023-06-22T00:00:00.000,License,Pass,,41.9000918605704,-87.69640926723595,"(41.9000918605704, -87.69640926723595)"
255659,2577673,MONTESSORI GIFTED PREP LLC.,MONTESSORI GIFTED PREP LLC.,2405576,Children's Services Facility,Risk 1 (High),4754 N LEAVITT ST,CHICAGO,IL,60625,2023-06-22T00:00:00.000,License,Pass,49. NON-FOOD/FOOD CONTACT SURFACES CLEAN - Com...,41.96850906331291,-87.6841827473811,"(41.96850906331291, -87.6841827473811)"
255660,2577660,CHARMING CHILDREN LEARNING ACADEMY,CHARMING CHILDREN LEARNING ACADEMY,2641758,Children's Services Facility,Risk 1 (High),3337-3341 W Chicago AVE,CHICAGO,IL,60651,2023-06-22T00:00:00.000,Canvass Re-Inspection,Pass,10. ADEQUATE HANDWASHING SINKS PROPERLY SUPPLI...,41.89540586598236,-87.71043217599598,"(41.89540586598236, -87.71043217599598)"
255661,2577674,BIRRIERIA DON LUIS,BIRRIERIA DON LUIS,2796827,Restaurant,Risk 1 (High),3544 E 106TH ST,CHICAGO,IL,60617,2023-06-22T00:00:00.000,Canvass Re-Inspection,Pass,39. CONTAMINATION PREVENTED DURING FOOD PREPAR...,41.70285190603626,-87.537139292445,"(41.70285190603626, -87.537139292445)"


**Data cleaning**

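Apply some basic cleaning steps: keep only Chicago, IL records, drop columns that also exist in the business license data, upper-case the text columns, order the `risk` categories, and split `violations` into a list.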

In [10]:
food_inspection_tidy = (food_inspection_raw

    # Filter on the relevant state and city only. The callables are evaluated
    # against the intermediate frame, so each mask matches the rows being filtered.
    .loc[lambda x: x["state"] == "IL"]
    .loc[lambda x: x["city"] == "CHICAGO"]

    # Drop columns that also exist in the business license data.
    .drop(columns=["address", "city", "state", "latitude", "longitude", "location"])

    # Convert categorical columns to be all upper case for consistency
    .assign(
        dba_name=lambda x: x["dba_name"].str.upper(),
        aka_name=lambda x: x["aka_name"].str.upper(),
        facility_type=lambda x: x["facility_type"].str.upper(),
        risk=lambda x: x["risk"].str.upper(),
        inspection_type=lambda x: x["inspection_type"].str.upper(),
        results=lambda x: x["results"].str.upper(),
        violations=lambda x: x["violations"].str.upper(),
    )

    # Specify the order of categorical columns.
    .assign(risk=lambda x: x["risk"].astype("category").cat.set_categories(["ALL", "RISK 1 (HIGH)", "RISK 2 (MEDIUM)", "RISK 3 (LOW)"], ordered=True))

    # The "violations" column can hold several violations separated by " | ", e.g.
    # "32. FOOD AND NON-FOOD ... REPLACED. | 33. FOOD AND NON-FOOD CONTACT E"
    # To make the data easier to work with, split each violation into its own item
    # so that the column contains a list of strings.
    .assign(violations=lambda x: x["violations"].str.split(pat=r" \| "))

    # Reset the index.
    .reset_index(drop=True)
)

food_inspection_tidy

Unnamed: 0,inspection_id,dba_name,aka_name,license_,facility_type,risk,zip,inspection_date,inspection_type,results,violations
0,70269,MR.DANIEL'S,MR.DANIEL'S,1899292,RESTAURANT,RISK 1 (HIGH),60634,2010-01-04T00:00:00.000,LICENSE RE-INSPECTION,PASS,
1,52234,CAFE 608,CAFE 608,2013328,RESTAURANT,RISK 1 (HIGH),60657,2010-01-04T00:00:00.000,LICENSE RE-INSPECTION,PASS,
2,67733,WOLCOTT'S,TROQUET,1992040,RESTAURANT,RISK 1 (HIGH),60613,2010-01-04T00:00:00.000,LICENSE RE-INSPECTION,PASS,
3,67738,MICHAEL'S ON MAIN CAFE,MICHAEL'S ON MAIN CAFE,2008948,RESTAURANT,RISK 1 (HIGH),60631,2010-01-04T00:00:00.000,LICENSE,FAIL,[18. NO EVIDENCE OF RODENT OR INSECT OUTER OPE...
4,67732,WOLCOTT'S,TROQUET,1992039,RESTAURANT,RISK 1 (HIGH),60613,2010-01-04T00:00:00.000,LICENSE RE-INSPECTION,PASS,
...,...,...,...,...,...,...,...,...,...,...,...
254587,2577663,"WEST TOWN DAYCARE ,LLC","WEST TOWN DAYCARE, LLC",2570038,CHILDREN'S SERVICES FACILITY,RISK 1 (HIGH),60622,2023-06-22T00:00:00.000,LICENSE,PASS,
254588,2577673,MONTESSORI GIFTED PREP LLC.,MONTESSORI GIFTED PREP LLC.,2405576,CHILDREN'S SERVICES FACILITY,RISK 1 (HIGH),60625,2023-06-22T00:00:00.000,LICENSE,PASS,[49. NON-FOOD/FOOD CONTACT SURFACES CLEAN - CO...
254589,2577660,CHARMING CHILDREN LEARNING ACADEMY,CHARMING CHILDREN LEARNING ACADEMY,2641758,CHILDREN'S SERVICES FACILITY,RISK 1 (HIGH),60651,2023-06-22T00:00:00.000,CANVASS RE-INSPECTION,PASS,[10. ADEQUATE HANDWASHING SINKS PROPERLY SUPPL...
254590,2577674,BIRRIERIA DON LUIS,BIRRIERIA DON LUIS,2796827,RESTAURANT,RISK 1 (HIGH),60617,2023-06-22T00:00:00.000,CANVASS RE-INSPECTION,PASS,[39. CONTAMINATION PREVENTED DURING FOOD PREPA...
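
The two less obvious steps, the ordered `risk` category and the `violations` split, can be sketched on a toy frame. The rows below are made up (the violation text reuses the truncated example from the code comment), and the split uses a literal separator with `regex=False`, which assumes pandas 1.4 or later.

```python
import pandas as pd

# Toy rows for illustration only.
raw = pd.DataFrame({
    "risk": ["Risk 3 (Low)", "Risk 1 (High)", "Risk 2 (Medium)"],
    "violations": [
        "32. FOOD AND NON-FOOD ... REPLACED. | 33. FOOD AND NON-FOOD CONTACT E",
        None,
        "18. NO EVIDENCE OF RODENT OR INSECT OUTER OPEN...",
    ],
})

risk_levels = ["ALL", "RISK 1 (HIGH)", "RISK 2 (MEDIUM)", "RISK 3 (LOW)"]

tidy = raw.assign(
    # Upper-case, then convert to a category whose order follows risk_levels.
    risk=lambda x: (
        x["risk"].str.upper()
        .astype("category")
        .cat.set_categories(risk_levels, ordered=True)
    ),
    # Split on the literal " | " separator; missing values stay missing.
    violations=lambda x: x["violations"].str.split(pat=" | ", regex=False),
)

# Sorting and comparisons follow the order of risk_levels, not the alphabet.
sorted_risk = tidy.sort_values("risk")["risk"].tolist()
high_or_medium = tidy["risk"] <= "RISK 2 (MEDIUM)"
```

An ordered category makes sorting and threshold comparisons such as `<=` meaningful for `risk`.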


**Data validation**

Use pandera to validate the data and convert each column to the correct type.

In [11]:
food_inspection_schema = pa.DataFrameSchema({
    "inspection_id": pa.Column(str, coerce=True, unique=True), # Primary Key
    "dba_name": pa.Column(str, coerce=True),
    "aka_name": pa.Column(str, coerce=True, nullable=True),
    "license_": pa.Column(str, coerce=True, nullable=True), # Foreign Key
    "facility_type": pa.Column(pa.Category, coerce=True, nullable=True),
    "risk": pa.Column(str, coerce=True, nullable=True, checks=[
        pa.Check.isin(["ALL", "RISK 1 (HIGH)", "RISK 2 (MEDIUM)", "RISK 3 (LOW)"])
    ]),
    "zip": pa.Column(str, coerce=True, nullable=True),
    "inspection_date": pa.Column(pa.DateTime, coerce=True),
    "inspection_type": pa.Column(pa.Category, coerce=True, nullable=True),
    "results": pa.Column(pa.Category, coerce=True),
    "violations": pa.Column(pa.Object, coerce=True, nullable=True)
})

food_inspection_validated = food_inspection_schema.validate(food_inspection_tidy)
food_inspection_validated

Unnamed: 0,inspection_id,dba_name,aka_name,license_,facility_type,risk,zip,inspection_date,inspection_type,results,violations
0,70269,MR.DANIEL'S,MR.DANIEL'S,1899292,RESTAURANT,RISK 1 (HIGH),60634,2010-01-04,LICENSE RE-INSPECTION,PASS,
1,52234,CAFE 608,CAFE 608,2013328,RESTAURANT,RISK 1 (HIGH),60657,2010-01-04,LICENSE RE-INSPECTION,PASS,
2,67733,WOLCOTT'S,TROQUET,1992040,RESTAURANT,RISK 1 (HIGH),60613,2010-01-04,LICENSE RE-INSPECTION,PASS,
3,67738,MICHAEL'S ON MAIN CAFE,MICHAEL'S ON MAIN CAFE,2008948,RESTAURANT,RISK 1 (HIGH),60631,2010-01-04,LICENSE,FAIL,[18. NO EVIDENCE OF RODENT OR INSECT OUTER OPE...
4,67732,WOLCOTT'S,TROQUET,1992039,RESTAURANT,RISK 1 (HIGH),60613,2010-01-04,LICENSE RE-INSPECTION,PASS,
...,...,...,...,...,...,...,...,...,...,...,...
254587,2577663,"WEST TOWN DAYCARE ,LLC","WEST TOWN DAYCARE, LLC",2570038,CHILDREN'S SERVICES FACILITY,RISK 1 (HIGH),60622,2023-06-22,LICENSE,PASS,
254588,2577673,MONTESSORI GIFTED PREP LLC.,MONTESSORI GIFTED PREP LLC.,2405576,CHILDREN'S SERVICES FACILITY,RISK 1 (HIGH),60625,2023-06-22,LICENSE,PASS,[49. NON-FOOD/FOOD CONTACT SURFACES CLEAN - CO...
254589,2577660,CHARMING CHILDREN LEARNING ACADEMY,CHARMING CHILDREN LEARNING ACADEMY,2641758,CHILDREN'S SERVICES FACILITY,RISK 1 (HIGH),60651,2023-06-22,CANVASS RE-INSPECTION,PASS,[10. ADEQUATE HANDWASHING SINKS PROPERLY SUPPL...
254590,2577674,BIRRIERIA DON LUIS,BIRRIERIA DON LUIS,2796827,RESTAURANT,RISK 1 (HIGH),60617,2023-06-22,CANVASS RE-INSPECTION,PASS,[39. CONTAMINATION PREVENTED DURING FOOD PREPA...
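
The schema marks `license_` as a foreign key, but pandera validates one frame at a time, so the relationship itself is not checked here. A rough sketch of such a check in plain pandas is shown below. It assumes that `license_` refers to `license_number` in the business license data, which this notebook does not confirm, and all rows are made up.

```python
import pandas as pd

# Toy frames for illustration; the matching column is an assumption.
licenses = pd.DataFrame({"license_number": ["1000049", "9999"]})
inspections = pd.DataFrame({
    "inspection_id": ["101", "102", "103"],
    "license_": ["1000049", "1899292", None],
})

# True where the inspection's license appears in the license data.
has_license = inspections["license_"].isin(licenses["license_number"])

# Inspections that name a license that cannot be found (nulls are excluded).
orphans = inspections.loc[inspections["license_"].notna() & ~has_license]
```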


Save the data as a pin.

In [12]:
pin_name = f"{user_name}/chicago-food-inspection-data-validated"

board.pin_write(
    food_inspection_validated, 
    name=pin_name, 
    type="arrow", 
    versioned=True,
    title="City of Chicago - Food Inspection Data (VALIDATED)"
)

Writing pin:
Name: 'sam.edwardes/chicago-food-inspection-data-validated'
Version: 20230626T124636Z-c31f1


Meta(title='City of Chicago - Food Inspection Data (VALIDATED)', description=None, created='20230626T124636Z', pin_hash='c31f13e676db1f07', file='chicago-food-inspection-data-validated.arrow', file_size=81639314, type='arrow', api_version=1, version=VersionRaw(version='76401'), tags=None, name='sam.edwardes/chicago-food-inspection-data-validated', user={}, local={})

In [13]:
board.pin_versions(pin_name)

Unnamed: 0,version
0,75843
1,76390
2,76401
