# Raw Data Ingestion

This workshop will use data from the City of Chicago Open Data Portal: <https://data.cityofchicago.org>. The following datasets will be used:

1. Food inspections: <https://data.cityofchicago.org/Health-Human-Services/Food-Inspections/4ijn-s7e5>
2. Business license data: <https://data.cityofchicago.org/Community-Economic-Development/Business-Licenses/r5kz-chrr>

## Setup

In [1]:
import os
import requests
import re
from urllib.parse import urlencode

import pandas as pd
from sqlalchemy import create_engine, text

In [2]:
# Jupyter / Pandas settings
pd.options.display.max_columns = 999

In [3]:
# Set up postgresql connection
db_user = "posit"
db_password = os.environ["CONF23_DB_PASSWORD"]
db_host = os.environ["CONF23_DB_HOST"]
db_port = 5432  # Not interpolated into the URL below, so the driver's default port (normally 5432) applies.
db_database = "conf23_python"
engine = create_engine(f"postgresql+psycopg2://{db_user}:{db_password}@{db_host}/{db_database}")
engine

Engine(postgresql+psycopg2://posit:***@database.conf23workflows.training.posit.co/conf23_python)

Set dynamic variables. To avoid overloading the database or the server, only the instructor's scripts run on the full data set.

In [4]:
connect_username = requests.get(
    f"{os.environ['CONNECT_SERVER']}/__api__/v1/user",
    headers={"Authorization": f"Key {os.environ['CONNECT_API_KEY']}"}
).json()["username"]

connect_username

'sam.edwardes'

In [5]:
if connect_username == "sam.edwardes":
    max_rows = 99_999_999
else:
    max_rows = 10_000

max_rows

99999999

## Data set (1): Food inspections

<https://data.cityofchicago.org/Health-Human-Services/Food-Inspections/4ijn-s7e5>

**Step 1:** Get the raw data from the data portal

In [6]:
base_url = "https://data.cityofchicago.org/resource/4ijn-s7e5.csv"
params = {"$order": "inspection_date", "$limit": max_rows}
url = f"{base_url}?{urlencode(params)}"
url

'https://data.cityofchicago.org/resource/4ijn-s7e5.csv?%24order=inspection_date&%24limit=99999999'
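The `%24` in the URL above is the percent-encoded `$` from the `$order` and `$limit` parameter names. The following is a minimal sketch, using only the standard library, of how `urlencode` produces that query string and how `parse_qs` recovers the parameters. The limit of 10,000 is the non-instructor value of `max_rows` from above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

base_url = "https://data.cityofchicago.org/resource/4ijn-s7e5.csv"
params = {"$order": "inspection_date", "$limit": 10_000}

# urlencode percent-encodes reserved characters such as "$" (-> %24).
url = f"{base_url}?{urlencode(params)}"
print(url)

# parse_qs decodes the query string back into a dict of lists of strings.
decoded = parse_qs(urlsplit(url).query)
print(decoded)
```

Note that the decoded values come back as strings, so the integer limit becomes `"10000"`.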

In [7]:
# Read everything as a string so that the formatting is preserved. We will use 
# pandera later to convert everything to the correct type.
food_inspection_data = pd.read_csv(url, dtype=str)
food_inspection_data

Unnamed: 0,inspection_id,dba_name,aka_name,license_,facility_type,risk,address,city,state,zip,inspection_date,inspection_type,results,violations,latitude,longitude,location
0,52234,Cafe 608,Cafe 608,2013328,Restaurant,Risk 1 (High),608 W BARRY AVE,CHICAGO,IL,60657,2010-01-04T00:00:00.000,License Re-Inspection,Pass,,41.938006880423615,-87.6447545707008,"(41.938006880423615, -87.6447545707008)"
1,70269,mr.daniel's,mr.daniel's,1899292,Restaurant,Risk 1 (High),5645 W BELMONT AVE,CHICAGO,IL,60634,2010-01-04T00:00:00.000,License Re-Inspection,Pass,,41.93844282365204,-87.76831838068422,"(41.93844282365204, -87.76831838068422)"
2,67733,WOLCOTT'S,TROQUET,1992040,Restaurant,Risk 1 (High),1834 W MONTROSE AVE,CHICAGO,IL,60613,2010-01-04T00:00:00.000,License Re-Inspection,Pass,,41.961605669949854,-87.67596676683779,"(41.961605669949854, -87.67596676683779)"
3,67732,WOLCOTT'S,TROQUET,1992039,Restaurant,Risk 1 (High),1834 W MONTROSE AVE,CHICAGO,IL,60613,2010-01-04T00:00:00.000,License Re-Inspection,Pass,,41.961605669949854,-87.67596676683779,"(41.961605669949854, -87.67596676683779)"
4,104236,TEMPO CAFE,TEMPO CAFE,80916,Restaurant,Risk 1 (High),6 E CHESTNUT ST,CHICAGO,IL,60611,2010-01-04T00:00:00.000,Canvass,Fail,18. NO EVIDENCE OF RODENT OR INSECT OUTER OPEN...,41.89843137207629,-87.6280091630558,"(41.89843137207629, -87.6280091630558)"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
258048,2579943,EGGSPERIENCE,EGGSPERIENCE,2583306,Restaurant,Risk 1 (High),3231-3233 N BROADWAY,CHICAGO,IL,60657,2023-08-11T00:00:00.000,Canvass Re-Inspection,Pass,"55. PHYSICAL FACILITIES INSTALLED, MAINTAINED ...",41.941021951821966,-87.64428253705525,"(41.941021951821966, -87.64428253705525)"
258049,2579952,MAXWELL STREET HARRISON,ORIGINAL MAXWELL STREET,2247280,Restaurant,Risk 2 (Medium),3801 W HARRISON ST,CHICAGO,IL,60624,2023-08-11T00:00:00.000,Complaint Re-Inspection,Pass,49. NON-FOOD/FOOD CONTACT SURFACES CLEAN - Com...,41.87340464534467,-87.72039962829741,"(41.87340464534467, -87.72039962829741)"
258050,2579939,LA FIESTA BAKERY,LA FIESTA BAKERY/TAQUERIA,1488177,Restaurant,Risk 1 (High),6424 S PULASKI RD,CHICAGO,IL,60629,2023-08-11T00:00:00.000,Canvass,Pass,37. FOOD PROPERLY LABELED; ORIGINAL CONTAINER ...,41.77607320206961,-87.72284124538348,"(41.77607320206961, -87.72284124538348)"
258051,2579946,GROTA RESTAURANT,GROTA RESTAURANT,6753,Restaurant,Risk 1 (High),3108-3112 N CENTRAL AVE,CHICAGO,IL,60634,2023-08-11T00:00:00.000,Canvass Re-Inspection,Pass,,41.93706710815131,-87.76659233846347,"(41.93706710815131, -87.76659233846347)"
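To illustrate why everything is read as a string, here is a small sketch comparing pandas' default type inference with `dtype=str`. The two-row CSV is made up for illustration and is not taken from the portal.

```python
import io

import pandas as pd

# Hypothetical two-row CSV; the values are invented for this example.
csv_text = "license_,zip\n0012345,02134\n2013328,60657\n"

inferred = pd.read_csv(io.StringIO(csv_text))
as_text = pd.read_csv(io.StringIO(csv_text), dtype=str)

# With type inference the columns become integers and leading zeros are lost.
print(inferred["zip"].tolist())
# With dtype=str every value is kept exactly as it appears in the file.
print(as_text["zip"].tolist())
```

Keeping the raw text intact leaves the decision about each column's type to the later validation step.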


**Step 2:** Save the data to Postgres

Determine a prefix for the table name so that we do not overwrite each other's data.

In [None]:
# determine the table name
if connect_username == "sam.edwardes":
    table_name_prefix = ""
else:
    table_name_prefix = re.sub('[^0-9a-zA-Z]+', '_', connect_username) + "_"

In [8]:
table_name = f"{table_name_prefix}food_inspection_raw"
table_name

'food_inspection_raw'
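The prefix logic can be tried in isolation. The sketch below wraps it in a hypothetical helper, `table_prefix` (not part of the workshop code), and applies it to a made-up username to show how every run of non-alphanumeric characters collapses into a single underscore.

```python
import re


def table_prefix(username, instructor="sam.edwardes"):
    """Return "" for the instructor, else a sanitized "<username>_" prefix."""
    if username == instructor:
        return ""
    # Collapse every run of characters outside [0-9a-zA-Z] into one underscore.
    return re.sub("[^0-9a-zA-Z]+", "_", username) + "_"


# "jane.doe@example.com" is a made-up username for illustration.
instructor_table = table_prefix("sam.edwardes") + "food_inspection_raw"
attendee_table = table_prefix("jane.doe@example.com") + "food_inspection_raw"
print(instructor_table)
print(attendee_table)
```

Restricting the prefix to letters, digits, and underscores keeps the resulting table name usable in SQL without quoting.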

In [9]:
# Insert the data into postgres. Inserting large amounts of data can be slow, so
# iterate over 10,000 rows at a time.

n_rows = food_inspection_data.shape[0]
step_size = 10_000

for i in range(0, n_rows, step_size):
    index_start = i
    index_end = min(n_rows, i + step_size - 1)
    
    if i == 0:
        if_exists = "replace"
    else:
        if_exists = "append"

    print(f"Inserting rows: {index_start:,} - {index_end:,}")

    food_inspection_data \
        .loc[index_start:index_end, :] \
        .to_sql(table_name, engine, if_exists=if_exists, index=False)

Inserting rows: 0 - 9,999
Inserting rows: 10,000 - 19,999
Inserting rows: 20,000 - 29,999
Inserting rows: 30,000 - 39,999
Inserting rows: 40,000 - 49,999
Inserting rows: 50,000 - 59,999
Inserting rows: 60,000 - 69,999
Inserting rows: 70,000 - 79,999
Inserting rows: 80,000 - 89,999
Inserting rows: 90,000 - 99,999
Inserting rows: 100,000 - 109,999
Inserting rows: 110,000 - 119,999
Inserting rows: 120,000 - 129,999
Inserting rows: 130,000 - 139,999
Inserting rows: 140,000 - 149,999
Inserting rows: 150,000 - 159,999
Inserting rows: 160,000 - 169,999
Inserting rows: 170,000 - 179,999
Inserting rows: 180,000 - 189,999
Inserting rows: 190,000 - 199,999
Inserting rows: 200,000 - 209,999
Inserting rows: 210,000 - 219,999
Inserting rows: 220,000 - 229,999
Inserting rows: 230,000 - 239,999
Inserting rows: 240,000 - 249,999
Inserting rows: 250,000 - 258,053
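Because `.loc` slices by label and includes both endpoints, each chunk runs from `i` to `i + step_size - 1`. The sketch below computes these bounds with a hypothetical helper, `chunk_bounds`, which clamps the last chunk to `n_rows - 1`, the last label of a default index. The loop above clamps to `n_rows` instead, which `.loc` slicing on a sorted index should accept, so only the printed upper bound of the final chunk differs.

```python
def chunk_bounds(n_rows, step_size):
    """Yield (start, end) label pairs, both inclusive, covering 0..n_rows-1."""
    for start in range(0, n_rows, step_size):
        end = min(n_rows - 1, start + step_size - 1)
        yield start, end


# 25 rows in chunks of 10; the numbers are chosen only for illustration.
bounds = list(chunk_bounds(25, 10))
print(bounds)

# The chunk sizes add up to the total number of rows.
total = sum(end - start + 1 for start, end in bounds)
print(total)
```

As an alternative to a manual loop, `DataFrame.to_sql` also accepts a `chunksize` argument that writes rows in batches.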


In [10]:
# Confirm number of rows
with engine.begin() as conn:
    query = text(f"SELECT COUNT(*) FROM {table_name}")
    data_from_sql = pd.read_sql_query(query, conn)

print(data_from_sql)

    count
0  258053
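The overall pattern (replace the table on the first chunk, append the remaining chunks, then confirm the row count) can be rehearsed without a Postgres server. The sketch below swaps in the standard library's `sqlite3` with an in-memory database in place of pandas and Postgres. The table name `demo_raw` and the rows are made up.

```python
import sqlite3

rows = [(i, f"name_{i}") for i in range(25)]  # made-up data
step_size = 10

conn = sqlite3.connect(":memory:")
for start in range(0, len(rows), step_size):
    if start == 0:
        # First chunk: start from a clean table (the "replace" case).
        conn.execute("DROP TABLE IF EXISTS demo_raw")
        conn.execute("CREATE TABLE demo_raw (id INTEGER, name TEXT)")
    # Every chunk, including the first, is then appended.
    chunk = rows[start:start + step_size]
    conn.executemany("INSERT INTO demo_raw VALUES (?, ?)", chunk)
conn.commit()

# Confirm that the table holds exactly as many rows as the source.
(count,) = conn.execute("SELECT COUNT(*) FROM demo_raw").fetchone()
print(count)
conn.close()
```

Comparing the count against the source length is a cheap check that no chunk was skipped or written twice.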


## Data set (2): Business License Data

<https://data.cityofchicago.org/Community-Economic-Development/Business-Licenses/r5kz-chrr>

**Step 1:** Get the raw data from the data portal

In [11]:
base_url = "https://data.cityofchicago.org/resource/r5kz-chrr.csv"
params = {"$order": "id", "$limit": max_rows}
url = f"{base_url}?{urlencode(params)}"
url

'https://data.cityofchicago.org/resource/r5kz-chrr.csv?%24order=id&%24limit=99999999'

In [12]:
# Read everything as a string so that the formatting is preserved. We will use 
# pandera later to convert everything to the correct type.
business_license_data = pd.read_csv(url, dtype=str)
business_license_data

Unnamed: 0,id,license_id,account_number,site_number,legal_name,doing_business_as_name,address,city,state,zip_code,ward,precinct,ward_precinct,police_district,license_code,license_description,business_activity_id,business_activity,license_number,application_type,application_created_date,application_requirements_complete,payment_date,conditional_approval,license_start_date,expiration_date,license_approved_for_issuance,date_issued,license_status,license_status_change_date,ssa,latitude,longitude,location
0,1000000-20020221,1000000,200001,1,MARK BOSTON,COLORS IN MOTION,6421 N DAMEN AVE,CHICAGO,IL,60645,50,28,50-28,24,1011,Home Repair,,,1000000,ISSUE,2000-06-19T00:00:00.000,2002-02-15T00:00:00.000,2002-02-15T00:00:00.000,N,2002-02-21T00:00:00.000,2002-11-15T00:00:00.000,2002-02-21T00:00:00.000,2002-02-22T00:00:00.000,AAI,,,41.998514371,-87.680010905,"\n, \n(41.99851437112669, -87.68001090539342)"
1,1000049-20010816,1162772,200068,1,ANTONIA CASTREJON,ILLUSIONS HAIR DESIGN,3800 W DIVERSEY AVE,CHICAGO,IL,60647,31,999,31-999,25,1010,Limited Business License,,,1000049,RENEW,,2001-06-25T00:00:00.000,2001-08-20T00:00:00.000,N,2001-08-16T00:00:00.000,2002-08-15T00:00:00.000,2001-08-20T00:00:00.000,2002-04-30T00:00:00.000,AAI,,,41.931960333,-87.722150366,"\n, \n(41.931960332638006, -87.72215036594574)"
2,1000049-20020516,1233615,10141,2,"PEPE""S RETAIL MEATS, INC.",PEREZ MEXICAN FOOD,853-855 W RANDOLPH ST 1ST,CHICAGO,IL,60607,27,1,27-1,12,1006,Retail Food Establishment,775,Retail Sales of Perishable Foods,1000049,RENEW,,2002-03-27T00:00:00.000,2002-04-17T00:00:00.000,N,2002-05-16T00:00:00.000,2003-05-15T00:00:00.000,2002-04-17T00:00:00.000,2002-04-18T00:00:00.000,AAI,,,41.884261422,-87.649534131,"\n, \n(41.88426142200001, -87.6495341312589)"
3,1000049-20020816,1265665,200068,1,ANTONIA CASTREJON,ILLUSIONS HAIR DESIGN,3800 W DIVERSEY AVE,CHICAGO,IL,60647,31,999,31-999,25,1010,Limited Business License,,,1000049,RENEW,,2002-06-28T00:00:00.000,2002-08-13T00:00:00.000,N,2002-08-16T00:00:00.000,2003-08-15T00:00:00.000,2002-08-13T00:00:00.000,2002-08-14T00:00:00.000,AAI,,,41.931960333,-87.722150366,"\n, \n(41.931960332638006, -87.72215036594574)"
4,1000049-20030516,1342680,10141,2,"PEPE""S RETAIL MEATS, INC.",PEREZ MEXICAN FOOD,853-855 W RANDOLPH ST 1ST,CHICAGO,IL,60607,27,1,27-1,12,1006,Retail Food Establishment,775,Retail Sales of Perishable Foods,1000049,RENEW,,2003-03-25T00:00:00.000,2003-04-17T00:00:00.000,N,2003-05-16T00:00:00.000,2004-05-15T00:00:00.000,2003-04-17T00:00:00.000,2003-04-18T00:00:00.000,AAI,,,41.884261422,-87.649534131,"\n, \n(41.88426142200001, -87.6495341312589)"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1108776,9999-20140916,2343163,26256,1,CHURCH & CHAPEL METAL ARTS INC,CHURCH & CHAPEL METAL ARTS INC,2616 W GRAND AVE 1ST,CHICAGO,IL,60612,36,17,36-17,12,1010,Limited Business License,,,9999,RENEW,,2014-07-15T00:00:00.000,2014-12-26T00:00:00.000,N,2014-09-16T00:00:00.000,2016-09-15T00:00:00.000,2014-12-26T00:00:00.000,2014-12-29T00:00:00.000,AAI,,,41.892720807,-87.692331754,"\n, \n(41.89272080716665, -87.69233175444906)"
1108777,9999-20160916,2478055,26256,1,CHURCH & CHAPEL METAL ARTS INC,CHURCH & CHAPEL METAL ARTS INC,2616 W GRAND AVE 1ST,CHICAGO,IL,60612,36,17,36-17,12,1010,Limited Business License,,,9999,RENEW,,2016-07-15T00:00:00.000,2016-09-08T00:00:00.000,N,2016-09-16T00:00:00.000,2018-09-15T00:00:00.000,2016-09-08T00:00:00.000,2016-09-09T00:00:00.000,AAI,,,41.892720807,-87.692331754,"\n, \n(41.89272080716665, -87.69233175444906)"
1108778,9999-20180916,2610578,26256,1,CHURCH & CHAPEL METAL ARTS INC,CHURCH & CHAPEL METAL ARTS INC,2616 W GRAND AVE 1ST,CHICAGO,IL,60612,36,17,36-17,12,1010,Limited Business License,,,9999,RENEW,,2018-07-15T00:00:00.000,2018-09-10T00:00:00.000,N,2018-09-16T00:00:00.000,2020-09-15T00:00:00.000,2018-09-10T00:00:00.000,2018-09-11T00:00:00.000,AAI,,,41.892720807,-87.692331754,"\n, \n(41.89272080716665, -87.69233175444906)"
1108779,9999-20200916,2739432,26256,1,CHURCH & CHAPEL METAL ARTS INC,CHURCH & CHAPEL METAL ARTS INC,2616 W GRAND AVE 1ST,CHICAGO,IL,60612,36,17,36-17,12,1010,Limited Business License,,,9999,RENEW,,2020-07-15T00:00:00.000,2020-08-05T00:00:00.000,N,2020-09-16T00:00:00.000,2022-09-15T00:00:00.000,2020-08-05T00:00:00.000,2020-08-06T00:00:00.000,AAI,,,41.892720807,-87.692331754,"\n, \n(41.89272080716665, -87.69233175444906)"


In [13]:
# To speed things up, we will only keep the license data for establishments
# that have food inspection data.
business_license_data = pd.merge(
    business_license_data,
    food_inspection_data[["license_"]].drop_duplicates(),
    how="inner",
    left_on="license_id",
    right_on="license_"
).drop(
    columns="license_"
)

business_license_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 34928 entries, 0 to 34927
Data columns (total 34 columns):
 #   Column                             Non-Null Count  Dtype 
---  ------                             --------------  ----- 
 0   id                                 34928 non-null  object
 1   license_id                         34928 non-null  object
 2   account_number                     34928 non-null  object
 3   site_number                        34928 non-null  object
 4   legal_name                         34928 non-null  object
 5   doing_business_as_name             34928 non-null  object
 6   address                            34928 non-null  object
 7   city                               34928 non-null  object
 8   state                              34928 non-null  object
 9   zip_code                           34898 non-null  object
 10  ward                               34791 non-null  object
 11  precinct                           30594 non-null  object
 12  ward
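The merge above acts as a filter: it keeps only the license rows whose `license_id` appears among the inspected licenses. The sketch below reproduces this on two tiny, made-up data frames and compares it with a boolean mask built with `isin`, which is one possible alternative rather than the approach the notebook uses.

```python
import pandas as pd

# Made-up stand-ins for the two data sets.
licenses = pd.DataFrame(
    {"license_id": ["1", "2", "3", "3"], "name": ["a", "b", "c", "d"]}
)
inspections = pd.DataFrame({"license_": ["3", "1", "3", "9"]})

# Inner merge against the de-duplicated keys, as in the cell above.
merged = pd.merge(
    licenses,
    inspections[["license_"]].drop_duplicates(),
    how="inner",
    left_on="license_id",
    right_on="license_",
).drop(columns="license_")

# Same filter expressed as a boolean mask; keeps the original index.
masked = licenses[licenses["license_id"].isin(inspections["license_"])]

merged_names = sorted(merged["name"])
masked_names = sorted(masked["name"])
print(merged_names)
print(masked_names)
```

The `drop_duplicates()` call matters for the merge: without it, a license that was inspected several times would be repeated once per inspection.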

**Step 2:** Save data to Postgres

In [14]:
table_name = f"{table_name_prefix}business_license_raw"
table_name

'business_license_raw'

In [15]:
# Insert the data into postgres. Inserting large amounts of data can be slow, so
# iterate over 10,000 rows at a time.

# TODO: dynamically insert user name into table if not PROD

n_rows = business_license_data.shape[0]
step_size = 10_000

for i in range(0, n_rows, step_size):
    index_start = i
    index_end = min(n_rows, i + step_size - 1)
    
    if i == 0:
        if_exists = "replace"
    else:
        if_exists = "append"

    print(f"Inserting rows: {index_start:,} - {index_end:,}")
    
    business_license_data \
        .loc[index_start:index_end, :] \
        .to_sql(table_name, engine, if_exists=if_exists, index=False)

Inserting rows: 0 - 9,999
Inserting rows: 10,000 - 19,999
Inserting rows: 20,000 - 29,999
Inserting rows: 30,000 - 34,928


In [16]:
# Confirm number of rows
with engine.begin() as conn:
    query = text(f"SELECT COUNT(*) FROM {table_name}")
    data_from_sql = pd.read_sql_query(query, conn)

print(data_from_sql)

   count
0  34928
