# Raw Data Ingestion

This workshop uses data from the City of Chicago Open Data Portal (<https://data.cityofchicago.org>). Two datasets are ingested:

1. Business license data: <https://data.cityofchicago.org/Community-Economic-Development/Business-Licenses/r5kz-chrr>
2. Food inspections: <https://data.cityofchicago.org/Health-Human-Services/Food-Inspections/4ijn-s7e5>

## Setup

In [1]:
from urllib.parse import urlencode
import pins
import pandas as pd

In [2]:
pd.options.display.max_columns = 999

In [3]:
# Set up the board
board = pins.board_connect()
user_name = "sam.edwardes"

## Data set (1): Business License Data

<https://data.cityofchicago.org/Community-Economic-Development/Business-Licenses/r5kz-chrr>

**Step 1:** Get the raw data from the data portal

In [4]:
base_url = "https://data.cityofchicago.org/resource/r5kz-chrr.csv"

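params = {
    # Sort by id so that repeated downloads return rows in the same order
    "$order": "id",
    # Set the limit above the dataset's row count so one request returns every row
    "$limit": 5_000_000
}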

url = f"{base_url}?{urlencode(params)}"
print(url)


https://data.cityofchicago.org/resource/r5kz-chrr.csv?%24order=id&%24limit=5000000
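
The `%24` in the printed URL is the percent-encoded form of `$`, which `urlencode` escapes because it is a reserved character. A minimal sketch of that round trip, using only the standard library and the same parameters as above (the `query` and `decoded` names are introduced here for illustration):

```python
from urllib.parse import parse_qs, urlencode, urlsplit

base_url = "https://data.cityofchicago.org/resource/r5kz-chrr.csv"
params = {"$order": "id", "$limit": 5_000_000}

# urlencode percent-encodes reserved characters, so "$" becomes "%24".
query = urlencode(params)
url = f"{base_url}?{query}"

# Decoding the query string recovers the original parameter names.
# parse_qs returns every value as a list of strings.
decoded = parse_qs(urlsplit(url).query)

print(query)
print(decoded)
```

The encoded and decoded forms name the same parameters, so the server receives `$order` and `$limit` as written in `params`.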


In [5]:
# Read everything as a string so that the formatting is preserved. We will use 
# pandera later to convert everything to the correct type.
business_license_data = pd.read_csv(url, dtype=str)
business_license_data

Unnamed: 0,id,license_id,account_number,site_number,legal_name,doing_business_as_name,address,city,state,zip_code,ward,precinct,ward_precinct,police_district,license_code,license_description,business_activity_id,business_activity,license_number,application_type,application_created_date,application_requirements_complete,payment_date,conditional_approval,license_start_date,expiration_date,license_approved_for_issuance,date_issued,license_status,license_status_change_date,ssa,latitude,longitude,location
0,1000000-20020221,1000000,200001,1,MARK BOSTON,COLORS IN MOTION,6421 N DAMEN AVE,CHICAGO,IL,60645,50,28,50-28,24,1011,Home Repair,,,1000000,ISSUE,2000-06-19T00:00:00.000,2002-02-15T00:00:00.000,2002-02-15T00:00:00.000,N,2002-02-21T00:00:00.000,2002-11-15T00:00:00.000,2002-02-21T00:00:00.000,2002-02-22T00:00:00.000,AAI,,,41.998514371,-87.680010905,"\n, \n(41.99851437112669, -87.68001090539342)"
1,1000049-20010816,1162772,200068,1,ANTONIA CASTREJON,ILLUSIONS HAIR DESIGN,3800 W DIVERSEY AVE,CHICAGO,IL,60647,31,999,31-999,25,1010,Limited Business License,,,1000049,RENEW,,2001-06-25T00:00:00.000,2001-08-20T00:00:00.000,N,2001-08-16T00:00:00.000,2002-08-15T00:00:00.000,2001-08-20T00:00:00.000,2002-04-30T00:00:00.000,AAI,,,41.931960333,-87.722150366,"\n, \n(41.931960332638006, -87.72215036594574)"
2,1000049-20020516,1233615,10141,2,"PEPE""S RETAIL MEATS, INC.",PEREZ MEXICAN FOOD,853-855 W RANDOLPH ST 1ST,CHICAGO,IL,60607,27,1,27-1,12,1006,Retail Food Establishment,775,Retail Sales of Perishable Foods,1000049,RENEW,,2002-03-27T00:00:00.000,2002-04-17T00:00:00.000,N,2002-05-16T00:00:00.000,2003-05-15T00:00:00.000,2002-04-17T00:00:00.000,2002-04-18T00:00:00.000,AAI,,,41.884261422,-87.649534131,"\n, \n(41.88426142200001, -87.6495341312589)"
3,1000049-20020816,1265665,200068,1,ANTONIA CASTREJON,ILLUSIONS HAIR DESIGN,3800 W DIVERSEY AVE,CHICAGO,IL,60647,31,999,31-999,25,1010,Limited Business License,,,1000049,RENEW,,2002-06-28T00:00:00.000,2002-08-13T00:00:00.000,N,2002-08-16T00:00:00.000,2003-08-15T00:00:00.000,2002-08-13T00:00:00.000,2002-08-14T00:00:00.000,AAI,,,41.931960333,-87.722150366,"\n, \n(41.931960332638006, -87.72215036594574)"
4,1000049-20030516,1342680,10141,2,"PEPE""S RETAIL MEATS, INC.",PEREZ MEXICAN FOOD,853-855 W RANDOLPH ST 1ST,CHICAGO,IL,60607,27,1,27-1,12,1006,Retail Food Establishment,775,Retail Sales of Perishable Foods,1000049,RENEW,,2003-03-25T00:00:00.000,2003-04-17T00:00:00.000,N,2003-05-16T00:00:00.000,2004-05-15T00:00:00.000,2003-04-17T00:00:00.000,2003-04-18T00:00:00.000,AAI,,,41.884261422,-87.649534131,"\n, \n(41.88426142200001, -87.6495341312589)"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1104097,9999-20140916,2343163,26256,1,CHURCH & CHAPEL METAL ARTS INC,CHURCH & CHAPEL METAL ARTS INC,2616 W GRAND AVE 1ST,CHICAGO,IL,60612,36,17,36-17,12,1010,Limited Business License,,,9999,RENEW,,2014-07-15T00:00:00.000,2014-12-26T00:00:00.000,N,2014-09-16T00:00:00.000,2016-09-15T00:00:00.000,2014-12-26T00:00:00.000,2014-12-29T00:00:00.000,AAI,,,41.892720807,-87.692331754,"\n, \n(41.89272080716665, -87.69233175444906)"
1104098,9999-20160916,2478055,26256,1,CHURCH & CHAPEL METAL ARTS INC,CHURCH & CHAPEL METAL ARTS INC,2616 W GRAND AVE 1ST,CHICAGO,IL,60612,36,17,36-17,12,1010,Limited Business License,,,9999,RENEW,,2016-07-15T00:00:00.000,2016-09-08T00:00:00.000,N,2016-09-16T00:00:00.000,2018-09-15T00:00:00.000,2016-09-08T00:00:00.000,2016-09-09T00:00:00.000,AAI,,,41.892720807,-87.692331754,"\n, \n(41.89272080716665, -87.69233175444906)"
1104099,9999-20180916,2610578,26256,1,CHURCH & CHAPEL METAL ARTS INC,CHURCH & CHAPEL METAL ARTS INC,2616 W GRAND AVE 1ST,CHICAGO,IL,60612,36,17,36-17,12,1010,Limited Business License,,,9999,RENEW,,2018-07-15T00:00:00.000,2018-09-10T00:00:00.000,N,2018-09-16T00:00:00.000,2020-09-15T00:00:00.000,2018-09-10T00:00:00.000,2018-09-11T00:00:00.000,AAI,,,41.892720807,-87.692331754,"\n, \n(41.89272080716665, -87.69233175444906)"
1104100,9999-20200916,2739432,26256,1,CHURCH & CHAPEL METAL ARTS INC,CHURCH & CHAPEL METAL ARTS INC,2616 W GRAND AVE 1ST,CHICAGO,IL,60612,36,17,36-17,12,1010,Limited Business License,,,9999,RENEW,,2020-07-15T00:00:00.000,2020-08-05T00:00:00.000,N,2020-09-16T00:00:00.000,2022-09-15T00:00:00.000,2020-08-05T00:00:00.000,2020-08-06T00:00:00.000,AAI,,,41.892720807,-87.692331754,"\n, \n(41.89272080716665, -87.69233175444906)"
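
Reading with `dtype=str` matters most for columns that look numeric but are really identifiers. The sketch below illustrates the difference on a small, made-up CSV (the rows are hypothetical and are not taken from the portal):

```python
import io

import pandas as pd

# Hypothetical sample with leading zeros; not real portal data.
sample_csv = "id,zip_code,ward\n0001,02134,07\n0002,60645,50\n"

# With type inference, digit-only columns are parsed as integers,
# which drops the leading zeros.
inferred = pd.read_csv(io.StringIO(sample_csv))

# With dtype=str, every value is kept exactly as it appears in the file.
as_text = pd.read_csv(io.StringIO(sample_csv), dtype=str)

print(inferred["zip_code"].tolist())
print(as_text["zip_code"].tolist())
```

Keeping the raw text at ingestion time leaves the decision about each column's type to the later validation step.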


**Step 2:** Save as a pin to Connect

In [6]:
# Pin the data to Connect
pin_name = f"{user_name}/chicago-business-license-data-raw"

board.pin_write(
    business_license_data, 
    name=pin_name, 
    # Use arrow so that types are preserved
    type="arrow", 
    versioned=True,
    title="City of Chicago - Business License Data (RAW)"
)


Writing pin:
Name: 'sam.edwardes/chicago-business-license-data-raw'
Version: 20230626T113541Z-8311e


Meta(title='City of Chicago - Business License Data (RAW)', description=None, created='20230626T113541Z', pin_hash='8311e270b4fa355a', file='chicago-business-license-data-raw.arrow', file_size=211331426, type='arrow', api_version=1, version=VersionRaw(version='76395'), tags=None, name='sam.edwardes/chicago-business-license-data-raw', user={}, local={})
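
In the output above, the version label `20230626T113541Z-8311e` appears to combine the `created` timestamp with the first five characters of `pin_hash`. The sketch below splits a label on that basis. The format is inferred from this output only and is not a documented contract, `split_version_label` is a hypothetical helper, and treating the trailing `Z` as UTC is an assumption.

```python
from datetime import datetime, timezone
from typing import Tuple


def split_version_label(label: str) -> Tuple[datetime, str]:
    """Split a label like '20230626T113541Z-8311e' into (created, hash prefix)."""
    stamp, hash_prefix = label.split("-", 1)
    # Assumption: the trailing "Z" means the timestamp is in UTC.
    created = datetime.strptime(stamp, "%Y%m%dT%H%M%SZ").replace(tzinfo=timezone.utc)
    return created, hash_prefix


created, hash_prefix = split_version_label("20230626T113541Z-8311e")

# pin_hash value copied from the Meta output above.
pin_hash = "8311e270b4fa355a"

print(created.isoformat())
print(pin_hash.startswith(hash_prefix))
```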

In [7]:
board.pin_versions(pin_name)

Unnamed: 0,version
0,75833
1,75868
2,75871
3,76320
4,76392
5,76394
6,76395


## Data set (2): Food Inspection Data

<https://data.cityofchicago.org/Health-Human-Services/Food-Inspections/4ijn-s7e5>

**Step 1:** Get the raw data from the data portal

In [8]:
base_url = "https://data.cityofchicago.org/resource/4ijn-s7e5.csv"

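params = {
    # Sort by inspection date so the rows arrive in chronological order
    "$order": "inspection_date",
    # Set the limit above the dataset's row count so one request returns every row
    "$limit": 5_000_000
}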

url = f"{base_url}?{urlencode(params)}"
print(url)

https://data.cityofchicago.org/resource/4ijn-s7e5.csv?%24order=inspection_date&%24limit=5000000
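
The request above depends on `$limit` being larger than the dataset. If a single large download is impractical, one alternative is to page through the rows with `$limit` and `$offset`. The sketch below only builds the page URLs and makes no network calls. The `page_urls` helper and the page size of 50,000 are assumptions for illustration, and the row count is the one shown in this notebook's output.

```python
from urllib.parse import urlencode


def page_urls(base_url, order_by, total_rows, page_size):
    """Build one URL per page, each covering page_size rows."""
    urls = []
    for offset in range(0, total_rows, page_size):
        params = {"$order": order_by, "$limit": page_size, "$offset": offset}
        urls.append(f"{base_url}?{urlencode(params)}")
    return urls


urls = page_urls(
    "https://data.cityofchicago.org/resource/4ijn-s7e5.csv",
    order_by="inspection_date",
    total_rows=255_662,  # rows in the food inspection output of this notebook
    page_size=50_000,    # assumed page size
)

for u in urls:
    print(u)
```

Because many rows share an `inspection_date`, a paged read would likely need a tie-breaking column in `$order` to keep page boundaries stable between requests.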


In [9]:
# Read everything as a string so that the formatting is preserved. We will use 
# pandera later to convert everything to the correct type.
food_inspection_data = pd.read_csv(url, dtype=str)
food_inspection_data

Unnamed: 0,inspection_id,dba_name,aka_name,license_,facility_type,risk,address,city,state,zip,inspection_date,inspection_type,results,violations,latitude,longitude,location
0,70269,mr.daniel's,mr.daniel's,1899292,Restaurant,Risk 1 (High),5645 W BELMONT AVE,CHICAGO,IL,60634,2010-01-04T00:00:00.000,License Re-Inspection,Pass,,41.93844282365204,-87.76831838068422,"(41.93844282365204, -87.76831838068422)"
1,52234,Cafe 608,Cafe 608,2013328,Restaurant,Risk 1 (High),608 W BARRY AVE,CHICAGO,IL,60657,2010-01-04T00:00:00.000,License Re-Inspection,Pass,,41.938006880423615,-87.6447545707008,"(41.938006880423615, -87.6447545707008)"
2,67733,WOLCOTT'S,TROQUET,1992040,Restaurant,Risk 1 (High),1834 W MONTROSE AVE,CHICAGO,IL,60613,2010-01-04T00:00:00.000,License Re-Inspection,Pass,,41.961605669949854,-87.67596676683779,"(41.961605669949854, -87.67596676683779)"
3,67738,MICHAEL'S ON MAIN CAFE,MICHAEL'S ON MAIN CAFE,2008948,Restaurant,Risk 1 (High),8750 W BRYN WAWR AVE,CHICAGO,IL,60631,2010-01-04T00:00:00.000,License,Fail,18. NO EVIDENCE OF RODENT OR INSECT OUTER OPEN...,,,
4,67732,WOLCOTT'S,TROQUET,1992039,Restaurant,Risk 1 (High),1834 W MONTROSE AVE,CHICAGO,IL,60613,2010-01-04T00:00:00.000,License Re-Inspection,Pass,,41.961605669949854,-87.67596676683779,"(41.961605669949854, -87.67596676683779)"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
255658,2577663,"WEST TOWN DAYCARE ,LLC","WEST TOWN DAYCARE, LLC",2570038,Children's Services Facility,Risk 1 (High),2751 W CORTEZ ST,CHICAGO,IL,60622,2023-06-22T00:00:00.000,License,Pass,,41.9000918605704,-87.69640926723595,"(41.9000918605704, -87.69640926723595)"
255659,2577673,MONTESSORI GIFTED PREP LLC.,MONTESSORI GIFTED PREP LLC.,2405576,Children's Services Facility,Risk 1 (High),4754 N LEAVITT ST,CHICAGO,IL,60625,2023-06-22T00:00:00.000,License,Pass,49. NON-FOOD/FOOD CONTACT SURFACES CLEAN - Com...,41.96850906331291,-87.6841827473811,"(41.96850906331291, -87.6841827473811)"
255660,2577660,CHARMING CHILDREN LEARNING ACADEMY,CHARMING CHILDREN LEARNING ACADEMY,2641758,Children's Services Facility,Risk 1 (High),3337-3341 W Chicago AVE,CHICAGO,IL,60651,2023-06-22T00:00:00.000,Canvass Re-Inspection,Pass,10. ADEQUATE HANDWASHING SINKS PROPERLY SUPPLI...,41.89540586598236,-87.71043217599598,"(41.89540586598236, -87.71043217599598)"
255661,2577674,BIRRIERIA DON LUIS,BIRRIERIA DON LUIS,2796827,Restaurant,Risk 1 (High),3544 E 106TH ST,CHICAGO,IL,60617,2023-06-22T00:00:00.000,Canvass Re-Inspection,Pass,39. CONTAMINATION PREVENTED DURING FOOD PREPAR...,41.70285190603626,-87.537139292445,"(41.70285190603626, -87.537139292445)"
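
Because every column is read as text, values such as dates and coordinates stay as strings until they are converted explicitly. The sketch below shows the conversion for one row using plain Python rather than pandera, purely to illustrate the idea. The values are copied from the first row of the output above, and the `row` dictionary is a stand-in for a DataFrame record.

```python
from datetime import datetime

# Values copied from the first row of the food inspection output.
row = {
    "inspection_date": "2010-01-04T00:00:00.000",
    "latitude": "41.93844282365204",
    "longitude": "-87.76831838068422",
}

# The portal's timestamps carry a fractional-seconds part, matched by %f.
inspection_date = datetime.strptime(row["inspection_date"], "%Y-%m-%dT%H:%M:%S.%f")

# Coordinates are plain decimal strings.
latitude = float(row["latitude"])
longitude = float(row["longitude"])

print(inspection_date.date(), latitude, longitude)
```

Rows with missing coordinates, such as the fourth row in the output, would need to be handled before calling `float`.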


**Step 2:** Save as a pin to Connect

In [10]:
pin_name = f"{user_name}/chicago-food-inspection-data-raw"

board.pin_write(
    food_inspection_data, 
    name=pin_name, 
    # Use arrow so that types are preserved
    type="arrow", 
    versioned=True,
    title="City of Chicago - Food Inspection Data (RAW)"
)

Writing pin:
Name: 'sam.edwardes/chicago-food-inspection-data-raw'
Version: 20230626T113641Z-542ac


Meta(title='City of Chicago - Food Inspection Data (RAW)', description=None, created='20230626T113641Z', pin_hash='542ac8c1d4d82ca2', file='chicago-food-inspection-data-raw.arrow', file_size=104791362, type='arrow', api_version=1, version=VersionRaw(version='76396'), tags=None, name='sam.edwardes/chicago-food-inspection-data-raw', user={}, local={})

In [11]:
board.pin_versions(pin_name)

Unnamed: 0,version
0,75834
1,75872
2,76321
3,76393
4,76396
