# Raw Data Ingestion

This workshop uses two datasets from the City of Chicago Open Data Portal (<https://data.cityofchicago.org>):

1. Business license data: <https://data.cityofchicago.org/Community-Economic-Development/Business-Licenses/r5kz-chrr>
2. Food inspections: <https://data.cityofchicago.org/Health-Human-Services/Food-Inspections/4ijn-s7e5>

## Setup

In [2]:
import os
from urllib.parse import urlencode

import pandas as pd
from sqlalchemy import create_engine, text

In [3]:
user = os.environ["USER"]
user

'sam.edwardes'

In [None]:
# General settings: the users listed below download up to 5,000,000 rows;
# everyone else is limited to 5,000 rows.
if user in {"sam.edwardes", "gagan"}:
    max_rows_from_data_portal = 5_000_000
else:
    max_rows_from_data_portal = 5_000

# Jupyter / Pandas settings
pd.options.display.max_columns = 999

In [7]:
# Set up postgresql connection
db_user = "posit"
db_password = os.environ["CONF23_DB_PASSWORD"]
db_host = os.environ["CONF23_DB_HOST"]
db_port = 5432
db_database = "conf23_python"
engine = create_engine(f"postgresql+psycopg2://{db_user}:{db_password}@{db_host}/{db_database}")
engine

Engine(postgresql+psycopg2://posit:***@database.conf23workflows.training.posit.co/conf23_python)
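
The connection string above interpolates the password directly and never uses `db_port`. As a hedged sketch, the helper below (the name `build_postgres_url` is hypothetical, not part of the workshop code) percent-encodes the credentials with the standard library, so that a password containing characters such as `@` or `/` does not corrupt the URL, and it includes the port explicitly. The credentials in the example are made up.

```python
from urllib.parse import quote_plus


def build_postgres_url(user, password, host, port, database):
    """Build a SQLAlchemy-style PostgreSQL URL with escaped credentials.

    Hypothetical helper for illustration; not part of the workshop code.
    """
    # quote_plus escapes reserved characters such as "@", ":" and "/".
    return (
        f"postgresql+psycopg2://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Example with made-up credentials (assumed values, not the workshop's).
example_url = build_postgres_url(
    "posit", "p@ss/word", "localhost", 5432, "conf23_python"
)
print(example_url)
```

The resulting string can be passed to `create_engine` in the same way as the f-string above. SQLAlchemy also provides `sqlalchemy.engine.URL.create`, which builds the URL from separate fields and may be preferable when SQLAlchemy 1.4 or later is available.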

## Data set (1): Business License Data

<https://data.cityofchicago.org/Community-Economic-Development/Business-Licenses/r5kz-chrr>

**Step 1:** Get the raw data from the data portal

In [4]:
base_url = "https://data.cityofchicago.org/resource/r5kz-chrr.csv"
params = {"$order": "id", "$limit": max_rows_from_data_portal}
url = f"{base_url}?{urlencode(params)}"
url

'https://data.cityofchicago.org/resource/r5kz-chrr.csv?%24order=id&%24limit=5000000'
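
The request above asks for everything in a single call by setting a large `$limit`. As an alternative sketch, the Socrata API also accepts an `$offset` parameter, so the same data could be fetched in pages. The generator name `page_urls`, the row total, and the page size are assumptions chosen for illustration.

```python
from urllib.parse import urlencode


def page_urls(base_url, order_by, total_rows, page_size):
    """Yield one URL per page, using $limit and $offset for paging."""
    for offset in range(0, total_rows, page_size):
        params = {"$order": order_by, "$limit": page_size, "$offset": offset}
        yield f"{base_url}?{urlencode(params)}"


base_url = "https://data.cityofchicago.org/resource/r5kz-chrr.csv"
urls = list(page_urls(base_url, order_by="id", total_rows=5_000, page_size=2_000))
for page_url in urls:
    print(page_url)
```

Each URL could then be read with `pd.read_csv(page_url, dtype=str)` and the pieces combined with `pd.concat`. Keeping a stable `$order` matters here, since paging over an unordered result may return overlapping or missing rows.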

In [5]:
# Read everything as a string so that the formatting is preserved. We will use 
# pandera later to convert everything to the correct type.
business_license_data = pd.read_csv(url, dtype=str)
business_license_data

Unnamed: 0,id,license_id,account_number,site_number,legal_name,doing_business_as_name,address,city,state,zip_code,ward,precinct,ward_precinct,police_district,license_code,license_description,business_activity_id,business_activity,license_number,application_type,application_created_date,application_requirements_complete,payment_date,conditional_approval,license_start_date,expiration_date,license_approved_for_issuance,date_issued,license_status,license_status_change_date,ssa,latitude,longitude,location
0,1000000-20020221,1000000,200001,1,MARK BOSTON,COLORS IN MOTION,6421 N DAMEN AVE,CHICAGO,IL,60645,50,28,50-28,24,1011,Home Repair,,,1000000,ISSUE,2000-06-19T00:00:00.000,2002-02-15T00:00:00.000,2002-02-15T00:00:00.000,N,2002-02-21T00:00:00.000,2002-11-15T00:00:00.000,2002-02-21T00:00:00.000,2002-02-22T00:00:00.000,AAI,,,41.998514371,-87.680010905,"\n, \n(41.99851437112669, -87.68001090539342)"
1,1000049-20010816,1162772,200068,1,ANTONIA CASTREJON,ILLUSIONS HAIR DESIGN,3800 W DIVERSEY AVE,CHICAGO,IL,60647,31,999,31-999,25,1010,Limited Business License,,,1000049,RENEW,,2001-06-25T00:00:00.000,2001-08-20T00:00:00.000,N,2001-08-16T00:00:00.000,2002-08-15T00:00:00.000,2001-08-20T00:00:00.000,2002-04-30T00:00:00.000,AAI,,,41.931960333,-87.722150366,"\n, \n(41.931960332638006, -87.72215036594574)"
2,1000049-20020516,1233615,10141,2,"PEPE""S RETAIL MEATS, INC.",PEREZ MEXICAN FOOD,853-855 W RANDOLPH ST 1ST,CHICAGO,IL,60607,27,1,27-1,12,1006,Retail Food Establishment,775,Retail Sales of Perishable Foods,1000049,RENEW,,2002-03-27T00:00:00.000,2002-04-17T00:00:00.000,N,2002-05-16T00:00:00.000,2003-05-15T00:00:00.000,2002-04-17T00:00:00.000,2002-04-18T00:00:00.000,AAI,,,41.884261422,-87.649534131,"\n, \n(41.88426142200001, -87.6495341312589)"
3,1000049-20020816,1265665,200068,1,ANTONIA CASTREJON,ILLUSIONS HAIR DESIGN,3800 W DIVERSEY AVE,CHICAGO,IL,60647,31,999,31-999,25,1010,Limited Business License,,,1000049,RENEW,,2002-06-28T00:00:00.000,2002-08-13T00:00:00.000,N,2002-08-16T00:00:00.000,2003-08-15T00:00:00.000,2002-08-13T00:00:00.000,2002-08-14T00:00:00.000,AAI,,,41.931960333,-87.722150366,"\n, \n(41.931960332638006, -87.72215036594574)"
4,1000049-20030516,1342680,10141,2,"PEPE""S RETAIL MEATS, INC.",PEREZ MEXICAN FOOD,853-855 W RANDOLPH ST 1ST,CHICAGO,IL,60607,27,1,27-1,12,1006,Retail Food Establishment,775,Retail Sales of Perishable Foods,1000049,RENEW,,2003-03-25T00:00:00.000,2003-04-17T00:00:00.000,N,2003-05-16T00:00:00.000,2004-05-15T00:00:00.000,2003-04-17T00:00:00.000,2003-04-18T00:00:00.000,AAI,,,41.884261422,-87.649534131,"\n, \n(41.88426142200001, -87.6495341312589)"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1108304,9999-20140916,2343163,26256,1,CHURCH & CHAPEL METAL ARTS INC,CHURCH & CHAPEL METAL ARTS INC,2616 W GRAND AVE 1ST,CHICAGO,IL,60612,36,17,36-17,12,1010,Limited Business License,,,9999,RENEW,,2014-07-15T00:00:00.000,2014-12-26T00:00:00.000,N,2014-09-16T00:00:00.000,2016-09-15T00:00:00.000,2014-12-26T00:00:00.000,2014-12-29T00:00:00.000,AAI,,,41.892720807,-87.692331754,"\n, \n(41.89272080716665, -87.69233175444906)"
1108305,9999-20160916,2478055,26256,1,CHURCH & CHAPEL METAL ARTS INC,CHURCH & CHAPEL METAL ARTS INC,2616 W GRAND AVE 1ST,CHICAGO,IL,60612,36,17,36-17,12,1010,Limited Business License,,,9999,RENEW,,2016-07-15T00:00:00.000,2016-09-08T00:00:00.000,N,2016-09-16T00:00:00.000,2018-09-15T00:00:00.000,2016-09-08T00:00:00.000,2016-09-09T00:00:00.000,AAI,,,41.892720807,-87.692331754,"\n, \n(41.89272080716665, -87.69233175444906)"
1108306,9999-20180916,2610578,26256,1,CHURCH & CHAPEL METAL ARTS INC,CHURCH & CHAPEL METAL ARTS INC,2616 W GRAND AVE 1ST,CHICAGO,IL,60612,36,17,36-17,12,1010,Limited Business License,,,9999,RENEW,,2018-07-15T00:00:00.000,2018-09-10T00:00:00.000,N,2018-09-16T00:00:00.000,2020-09-15T00:00:00.000,2018-09-10T00:00:00.000,2018-09-11T00:00:00.000,AAI,,,41.892720807,-87.692331754,"\n, \n(41.89272080716665, -87.69233175444906)"
1108307,9999-20200916,2739432,26256,1,CHURCH & CHAPEL METAL ARTS INC,CHURCH & CHAPEL METAL ARTS INC,2616 W GRAND AVE 1ST,CHICAGO,IL,60612,36,17,36-17,12,1010,Limited Business License,,,9999,RENEW,,2020-07-15T00:00:00.000,2020-08-05T00:00:00.000,N,2020-09-16T00:00:00.000,2022-09-15T00:00:00.000,2020-08-05T00:00:00.000,2020-08-06T00:00:00.000,AAI,,,41.892720807,-87.692331754,"\n, \n(41.89272080716665, -87.69233175444906)"
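
To illustrate why everything is read as a string, here is a minimal sketch with a small made-up CSV (the rows are invented for the example, not taken from the portal). Without `dtype=str`, pandas infers numeric types and a ZIP code such as `02134` loses its leading zero.

```python
import io

import pandas as pd

# Invented sample data for illustration only.
csv_text = (
    "id,zip_code,ward\n"
    "1000000-20020221,02134,50\n"
    "1000049-20010816,60647,\n"
)

inferred = pd.read_csv(io.StringIO(csv_text))
as_text = pd.read_csv(io.StringIO(csv_text), dtype=str)

print(inferred["zip_code"].tolist())    # integers: the leading zero is lost
print(as_text["zip_code"].tolist())     # strings: original formatting is kept
print(as_text["ward"].isna().tolist())  # empty fields are still missing values
```

Note that `dtype=str` does not turn empty fields into empty strings; they remain missing values, which a later validation step has to handle.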


**Step 2:** Save data to Postgres

In [8]:
# Insert the data into postgres. Inserting large amounts of data can be slow, so
# iterate over 10,000 rows at a time.

n_rows = business_license_data.shape[0]
step_size = 10_000

for i in range(0, n_rows, step_size):
    index_start = i
    # .loc slices by label and includes both endpoints, hence the "- 1".
    index_end = min(n_rows, i + step_size - 1)
    
    if i == 0:
        if_exists = "replace"
    else:
        if_exists = "append"

    print(f"Inserting rows: {index_start:,} - {index_end:,}")
    
    business_license_data \
        .loc[index_start:index_end, :] \
        .to_sql("business_license_raw", engine, if_exists=if_exists, index=False)

Inserting rows: 0 - 9,999
Inserting rows: 10,000 - 19,999
Inserting rows: 20,000 - 29,999
Inserting rows: 30,000 - 39,999
Inserting rows: 40,000 - 49,999
Inserting rows: 50,000 - 59,999
Inserting rows: 60,000 - 69,999
Inserting rows: 70,000 - 79,999
Inserting rows: 80,000 - 89,999
Inserting rows: 90,000 - 99,999
Inserting rows: 100,000 - 109,999
Inserting rows: 110,000 - 119,999
Inserting rows: 120,000 - 129,999
Inserting rows: 130,000 - 139,999
Inserting rows: 140,000 - 149,999
Inserting rows: 150,000 - 159,999
Inserting rows: 160,000 - 169,999
Inserting rows: 170,000 - 179,999
Inserting rows: 180,000 - 189,999
Inserting rows: 190,000 - 199,999
Inserting rows: 200,000 - 209,999
Inserting rows: 210,000 - 219,999
Inserting rows: 220,000 - 229,999
Inserting rows: 230,000 - 239,999
Inserting rows: 240,000 - 249,999
Inserting rows: 250,000 - 259,999
Inserting rows: 260,000 - 269,999
Inserting rows: 270,000 - 279,999
Inserting rows: 280,000 - 289,999
Inserting rows: 290,000 - 299,999
Insert
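
The loop above relies on `.loc`, whose slices include both endpoints. One alternative, sketched here, is to compute half-open `(start, stop)` pairs and slice by position with `.iloc`, which avoids the `- 1` adjustment. The helper name `chunk_bounds` is hypothetical, and the commented usage assumes a DataFrame `df` and an SQLAlchemy `engine`.

```python
def chunk_bounds(n_rows, step_size):
    """Yield half-open (start, stop) row positions covering range(n_rows)."""
    for start in range(0, n_rows, step_size):
        yield start, min(start + step_size, n_rows)


bounds = list(chunk_bounds(n_rows=25_000, step_size=10_000))
print(bounds)

# Assumed usage with a DataFrame `df` and an SQLAlchemy `engine`:
# for start, stop in chunk_bounds(len(df), 10_000):
#     mode = "replace" if start == 0 else "append"
#     df.iloc[start:stop].to_sql("some_table", engine, if_exists=mode, index=False)
```

`DataFrame.to_sql` also accepts a `chunksize` argument that batches the writes internally, which may remove the need for a manual loop when progress messages are not required.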

In [20]:
# Confirm number of rows
with engine.connect() as conn:
    query = text("SELECT COUNT(*) FROM business_license_raw")
    row_count = pd.read_sql_query(query, conn)

print(row_count)

     count
0  1108309
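
The count check can also be turned into an assertion that fails loudly when rows go missing. The sketch below uses an in-memory SQLite database from the standard library as a stand-in for Postgres so that it runs anywhere. The table name `demo_raw` and the helper `table_row_count` are made up for illustration.

```python
import sqlite3


def table_row_count(conn, table_name):
    """Return the number of rows in table_name."""
    # Table names cannot be bound as query parameters, so validate first.
    if not table_name.isidentifier():
        raise ValueError(f"unsafe table name: {table_name!r}")
    return conn.execute(f"SELECT COUNT(*) FROM {table_name}").fetchone()[0]


# Invented rows standing in for the downloaded data.
rows = [("1000000-20020221",), ("1000049-20010816",), ("1000049-20020516",)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_raw (id TEXT)")
conn.executemany("INSERT INTO demo_raw VALUES (?)", rows)

loaded = table_row_count(conn, "demo_raw")
assert loaded == len(rows), f"expected {len(rows)} rows, found {loaded}"
print(loaded)
conn.close()
```

In the notebook itself, the equivalent comparison would be between the `COUNT(*)` result and `business_license_data.shape[0]`.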


## Data set (2): Food Inspections

<https://data.cityofchicago.org/Health-Human-Services/Food-Inspections/4ijn-s7e5>

**Step 1:** Get the raw data from the data portal

In [22]:
base_url = "https://data.cityofchicago.org/resource/4ijn-s7e5.csv"
params = {"$order": "inspection_date", "$limit": max_rows_from_data_portal}
url = f"{base_url}?{urlencode(params)}"
url

'https://data.cityofchicago.org/resource/4ijn-s7e5.csv?%24order=inspection_date&%24limit=5000000'

In [23]:
# Read everything as a string so that the formatting is preserved. We will use 
# pandera later to convert everything to the correct type.
food_inspection_data = pd.read_csv(url, dtype=str)
food_inspection_data

Unnamed: 0,inspection_id,dba_name,aka_name,license_,facility_type,risk,address,city,state,zip,inspection_date,inspection_type,results,violations,latitude,longitude,location
0,104236,TEMPO CAFE,TEMPO CAFE,80916,Restaurant,Risk 1 (High),6 E CHESTNUT ST,CHICAGO,IL,60611,2010-01-04T00:00:00.000,Canvass,Fail,18. NO EVIDENCE OF RODENT OR INSECT OUTER OPEN...,41.89843137207629,-87.6280091630558,"(41.89843137207629, -87.6280091630558)"
1,67733,WOLCOTT'S,TROQUET,1992040,Restaurant,Risk 1 (High),1834 W MONTROSE AVE,CHICAGO,IL,60613,2010-01-04T00:00:00.000,License Re-Inspection,Pass,,41.961605669949854,-87.67596676683779,"(41.961605669949854, -87.67596676683779)"
2,67738,MICHAEL'S ON MAIN CAFE,MICHAEL'S ON MAIN CAFE,2008948,Restaurant,Risk 1 (High),8750 W BRYN WAWR AVE,CHICAGO,IL,60631,2010-01-04T00:00:00.000,License,Fail,18. NO EVIDENCE OF RODENT OR INSECT OUTER OPEN...,,,
3,52234,Cafe 608,Cafe 608,2013328,Restaurant,Risk 1 (High),608 W BARRY AVE,CHICAGO,IL,60657,2010-01-04T00:00:00.000,License Re-Inspection,Pass,,41.938006880423615,-87.6447545707008,"(41.938006880423615, -87.6447545707008)"
4,67757,DUNKIN DONUTS/BASKIN-ROBBINS,DUNKIN DONUTS/BASKIN-ROBBINS,1380279,Restaurant,Risk 2 (Medium),100 W RANDOLPH ST,CHICAGO,IL,60601,2010-01-04T00:00:00.000,Tag Removal,Pass,,41.88458626715456,-87.63101044588599,"(41.88458626715456, -87.63101044588599)"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
257753,2579666,WENDYS OLD FASHIONED HAMBURGERS,WENDYS OLD FASHIONED HAMBURGERS,2308240,Restaurant,Risk 1 (High),758 W 117TH ST,CHICAGO,IL,60628,2023-08-07T00:00:00.000,Complaint Re-Inspection,Pass,40. PERSONAL CLEANLINESS - Comments: FOUND FOO...,41.68194036015946,-87.64195936290173,"(41.68194036015946, -87.64195936290173)"
257754,2579698,ONE MORE SUSHI EXPRESS,ONE MORE SUSHI EXPRESS,2918465,,Risk 1 (High),1519 W TAYLOR ST,CHICAGO,IL,60607,2023-08-07T00:00:00.000,License,Not Ready,,41.86917813329073,-87.66490728096332,"(41.86917813329073, -87.66490728096332)"
257755,2579644,7-ELEVEN #33731J,7-ELEVEN #33731J,2796489,Grocery Store,Risk 2 (Medium),954 W MONROE ST,CHICAGO,IL,60607,2023-08-07T00:00:00.000,Canvass,Fail,2. CITY OF CHICAGO FOOD SERVICE SANITATION CER...,41.880522170643594,-87.65185446297886,"(41.880522170643594, -87.65185446297886)"
257756,2579659,"BALMORAL HOME, INC.",BALMORAL HOME,2204252,Long Term Care,Risk 1 (High),2055 W BALMORAL AVE,CHICAGO,IL,60625,2023-08-07T00:00:00.000,Canvass,Fail,52. SEWAGE & WASTE WATER PROPERLY DISPOSED - C...,41.97952632945265,-87.68168537626536,"(41.97952632945265, -87.68168537626536)"


**Step 2:** Save data to Postgres

In [24]:
# Insert the data into postgres. Inserting large amounts of data can be slow, so
# iterate over 10,000 rows at a time.

n_rows = food_inspection_data.shape[0]
step_size = 10_000

for i in range(0, n_rows, step_size):
    index_start = i
    # .loc slices by label and includes both endpoints, hence the "- 1".
    index_end = min(n_rows, i + step_size - 1)
    
    if i == 0:
        if_exists = "replace"
    else:
        if_exists = "append"

    print(f"Inserting rows: {index_start:,} - {index_end:,}")

    food_inspection_data \
        .loc[index_start:index_end, :] \
        .to_sql("food_inspection_raw", engine, if_exists=if_exists, index=False)

Inserting rows: 0 - 9,999
Inserting rows: 10,000 - 19,999
Inserting rows: 20,000 - 29,999
Inserting rows: 30,000 - 39,999
Inserting rows: 40,000 - 49,999
Inserting rows: 50,000 - 59,999
Inserting rows: 60,000 - 69,999
Inserting rows: 70,000 - 79,999
Inserting rows: 80,000 - 89,999
Inserting rows: 90,000 - 99,999
Inserting rows: 100,000 - 109,999
Inserting rows: 110,000 - 119,999
Inserting rows: 120,000 - 129,999
Inserting rows: 130,000 - 139,999
Inserting rows: 140,000 - 149,999
Inserting rows: 150,000 - 159,999
Inserting rows: 160,000 - 169,999
Inserting rows: 170,000 - 179,999
Inserting rows: 180,000 - 189,999
Inserting rows: 190,000 - 199,999
Inserting rows: 200,000 - 209,999
Inserting rows: 210,000 - 219,999
Inserting rows: 220,000 - 229,999
Inserting rows: 230,000 - 239,999
Inserting rows: 240,000 - 249,999
Inserting rows: 250,000 - 257,758


In [25]:
# Confirm number of rows
with engine.connect() as conn:
    query = text("SELECT COUNT(*) FROM food_inspection_raw")
    row_count = pd.read_sql_query(query, conn)

print(row_count)

    count
0  257758
