# Reading data

In this exercise we will cover how to use pandas to read data from external data sources. To perform our analysis, we will need to use two different data sets:

1. Business licenses data: <https://data.cityofchicago.org/Community-Economic-Development/Business-Licenses/r5kz-chrr>
2. Food inspections data: <https://data.cityofchicago.org/Health-Human-Services/Food-Inspections/4ijn-s7e5>

Both data sets are hosted on <https://data.cityofchicago.org>.

---

### Activity 1 - read data

#### 🔄 Task

- Explore the two data sources.
- Can you load the data into pandas?

#### ✅ Solution

The City of Chicago data portal exposes its data through the Socrata Open Data API (SODA), which the portal describes as follows:

> The Socrata Open Data API (SODA) provides programmatic access to this dataset including the ability to filter, query, and aggregate data.

There is a Python package to interact with SODA, but it is no longer maintained: <https://github.com/xmunoz/sodapy>.

Instead of using the Python library, we can call the SODA API directly. The documentation provides some examples of how to use it: <https://dev.socrata.com/foundry/data.cityofchicago.org/4ijn-s7e5>.

For example, we can use `curl` to request the data in JSON form. This has a few problems though:

- The data is in JSON. We can work with this, but CSV may be more convenient.
- We only get the first 1,000 rows.
- We are using the shell instead of Python.




In [20]:
%%sh
curl 'https://data.cityofchicago.org/resource/r5kz-chrr.json'

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed


[{"id":"16570-20000216","license_id":"76522","account_number":"51755","site_number":"1","legal_name":"THORNDALE CONSTRUCTION","doing_business_as_name":"THORNDALE CONSTRUCTION","address":"11243  CHESAPEAKE PLAC   1ST","city":"WESTCHESTER","state":"IL","zip_code":"60154","license_code":"1010","license_description":"Limited Business License","license_number":"16570","application_type":"RENEW","application_requirements_complete":"2000-06-16T00:00:00.000","payment_date":"2009-08-21T00:00:00.000","conditional_approval":"N","license_start_date":"2000-02-16T00:00:00.000","expiration_date":"2001-02-15T00:00:00.000","license_approved_for_issuance":"2003-12-22T00:00:00.000","date_issued":"2009-08-24T00:00:00.000","license_status":"AAI"}
,{"id":"25710-19960216","license_id":"119268","account_number":"52896","site_number":"1","legal_name":"PAT HAMILTON, INC","doing_business_as_name":"PAT HAMILTON CO.","address":"17021 S MAGNOLIA DR  1ST","city":"HAZEL CREST","state":"IL","zip_code":"60429","license

100 1153k    0 1153k    0     0   794k      0 --:--:--  0:00:01 --:--:--  796k
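
If you do want to work with the JSON response, a minimal sketch with the standard library might look like the following. The `payload` string is a hand-abbreviated copy of the two records shown above (most fields are omitted, and the second record is truncated in the output), so treat it as illustrative sample data rather than the full API response.

```python
import json

# Two abbreviated records in the shape returned by the .json endpoint.
# Values are copied from the output above; most fields are left out.
payload = """
[{"id":"16570-20000216","license_id":"76522","legal_name":"THORNDALE CONSTRUCTION","city":"WESTCHESTER","license_start_date":"2000-02-16T00:00:00.000"},
 {"id":"25710-19960216","license_id":"119268","legal_name":"PAT HAMILTON, INC","city":"HAZEL CREST"}]
"""

records = json.loads(payload)

# Every value arrives as a string, including numeric-looking IDs and dates.
first = records[0]
print(type(first["license_id"]).__name__, first["license_id"])

# Only the fields visible above are reproduced here, so use .get() for
# keys that may be missing from a record.
start_dates = [r.get("license_start_date") for r in records]
print(start_dates)
```

A list of dicts like `records` can be passed to `pd.DataFrame(records)`, but the type conversion is still left to you, which is one reason the CSV route below is more convenient.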


We can instead use Python and pandas to make the request. pandas can read CSV data directly from a URL with `pd.read_csv`, so our first task is to construct the URL in Python.

In [21]:
from urllib.parse import urlencode


base_url = "https://data.cityofchicago.org"

# Note the .csv extension
path = "resource/r5kz-chrr.csv"

# To make our code easier to read we can define the parameters in a dict. To know
# what parameters are available you must consult the docs: https://dev.socrata.com/docs/queries/
params = {
    "$order": "id", 
    "$limit": 5
}

# Then use an f-string to construct the URL. The standard library's urlencode
# function formats and escapes the params correctly.
url = f"{base_url}/{path}?{urlencode(params)}"
url

'https://data.cityofchicago.org/resource/r5kz-chrr.csv?%24order=id&%24limit=5'
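
Since we need two datasets and the API returns a limited number of rows per request, it may help to wrap the URL construction in a small function. The sketch below is one possible shape: `soda_csv_url` is a hypothetical helper name, and `$offset` is a paging parameter from the SODA query docs that the cell above does not use.

```python
from typing import Optional
from urllib.parse import urlencode

BASE_URL = "https://data.cityofchicago.org"


def soda_csv_url(
    dataset_id: str,
    limit: int = 1000,
    offset: int = 0,
    order: Optional[str] = None,
) -> str:
    """Build a CSV request URL for one page of a dataset (hypothetical helper)."""
    params = {}
    if order is not None:
        # A stable sort order keeps pages consistent between requests.
        params["$order"] = order
    params["$limit"] = limit
    params["$offset"] = offset  # assumption: not used in the original cell
    return f"{BASE_URL}/resource/{dataset_id}.csv?{urlencode(params)}"


# Dataset IDs are taken from the two links at the top of this exercise.
business_licenses_url = soda_csv_url("r5kz-chrr", limit=5, order="id")

# The food inspections columns are not shown in this exercise, so no
# $order column is assumed here.
food_inspections_url = soda_csv_url("4ijn-s7e5", limit=5)

print(business_licenses_url)
print(food_inspections_url)
```

Requesting the next page is then a matter of calling the function again with a larger `offset`.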

Then, install pandas:

```bash
python -m pip install pandas
```

You can then pass the newly constructed URL directly to pandas.

In [22]:
import pandas as pd

df = pd.read_csv(url)
df

Unnamed: 0,id,license_id,account_number,site_number,legal_name,doing_business_as_name,address,city,state,zip_code,...,license_start_date,expiration_date,license_approved_for_issuance,date_issued,license_status,license_status_change_date,ssa,latitude,longitude,location
0,1000000-20020221,1000000,200001,1,MARK BOSTON,COLORS IN MOTION,6421 N DAMEN AVE,CHICAGO,IL,60645,...,2002-02-21T00:00:00.000,2002-11-15T00:00:00.000,2002-02-21T00:00:00.000,2002-02-22T00:00:00.000,AAI,,,41.998514,-87.680011,"\n, \n(41.99851437112669, -87.68001090539342)"
1,1000049-20010816,1162772,200068,1,ANTONIA CASTREJON,ILLUSIONS HAIR DESIGN,3800 W DIVERSEY AVE,CHICAGO,IL,60647,...,2001-08-16T00:00:00.000,2002-08-15T00:00:00.000,2001-08-20T00:00:00.000,2002-04-30T00:00:00.000,AAI,,,41.93196,-87.72215,"\n, \n(41.931960332638006, -87.72215036594574)"
2,1000049-20020516,1233615,10141,2,"PEPE""S RETAIL MEATS, INC.",PEREZ MEXICAN FOOD,853-855 W RANDOLPH ST 1ST,CHICAGO,IL,60607,...,2002-05-16T00:00:00.000,2003-05-15T00:00:00.000,2002-04-17T00:00:00.000,2002-04-18T00:00:00.000,AAI,,,41.884261,-87.649534,"\n, \n(41.88426142200001, -87.6495341312589)"
3,1000049-20020816,1265665,200068,1,ANTONIA CASTREJON,ILLUSIONS HAIR DESIGN,3800 W DIVERSEY AVE,CHICAGO,IL,60647,...,2002-08-16T00:00:00.000,2003-08-15T00:00:00.000,2002-08-13T00:00:00.000,2002-08-14T00:00:00.000,AAI,,,41.93196,-87.72215,"\n, \n(41.931960332638006, -87.72215036594574)"
4,1000049-20030516,1342680,10141,2,"PEPE""S RETAIL MEATS, INC.",PEREZ MEXICAN FOOD,853-855 W RANDOLPH ST 1ST,CHICAGO,IL,60607,...,2003-05-16T00:00:00.000,2004-05-15T00:00:00.000,2003-04-17T00:00:00.000,2003-04-18T00:00:00.000,AAI,,,41.884261,-87.649534,"\n, \n(41.88426142200001, -87.6495341312589)"


---

### Activity 2 - write data to SQL

#### 🔄 Task

- Save the raw data to the PostgreSQL database so that we do not need to hit the API every time we need to interact with the raw data.
- You can connect to the database using the following credentials:
  - host: `posit-conf-2023-ds-workflowsf5086c0.cpbvczwgws3n.us-east-2.rds.amazonaws.com`
  - user: `posit`
  - password: ???
  - database: `python_workshop`

🚨 Please prefix any tables you create with your name! For example:

- `sam_business_license_raw`
- `sam_food_inspections_raw`


#### ✅ Solution

There are many different ways to interact with SQL databases in Python. For writing data, we prefer to use [SQLAlchemy](https://docs.sqlalchemy.org/en/14/dialects/postgresql.html) with pandas. Make sure you have the following packages installed:

```bash
python -m pip install SQLAlchemy psycopg2-binary keyring
```

You will first need to create a connection to the database.

In [23]:
import keyring
from sqlalchemy import create_engine

db_user = "posit"
db_password = keyring.get_password("conf23_db", "posit")
db_host = "posit-conf-2023-ds-workflowsf5086c0.cpbvczwgws3n.us-east-2.rds.amazonaws.com"
db_port = 5432
db_database = "python_workshop"
engine = create_engine(f"postgresql+psycopg2://{db_user}:{db_password}@{db_host}:{db_port}/{db_database}")
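
One caveat with building the connection string from an f-string: if the password contains characters such as `@` or `/`, the URL will not parse correctly unless they are URL-encoded. The sketch below shows one way to handle that with the standard library. `postgres_url` is a hypothetical helper, and the host and password are placeholders, not the workshop credentials.

```python
from urllib.parse import quote_plus


def postgres_url(user: str, password: str, host: str, port: int, database: str) -> str:
    """Build a SQLAlchemy PostgreSQL URL, escaping the user and password."""
    return (
        f"postgresql+psycopg2://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Placeholder values for illustration only.
example_url = postgres_url("posit", "p@ss/word", "db.example.com", 5432, "python_workshop")
print(example_url)
# postgresql+psycopg2://posit:p%40ss%2Fword@db.example.com:5432/python_workshop
```

The resulting string can be passed to `create_engine` as before. SQLAlchemy 1.4 and later also provide `sqlalchemy.engine.URL.create`, which takes the parts as separate arguments and handles the escaping for you.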

You can then use pandas' built-in SQL functions to write data to SQL.

In [24]:
df

Unnamed: 0,id,license_id,account_number,site_number,legal_name,doing_business_as_name,address,city,state,zip_code,...,license_start_date,expiration_date,license_approved_for_issuance,date_issued,license_status,license_status_change_date,ssa,latitude,longitude,location
0,1000000-20020221,1000000,200001,1,MARK BOSTON,COLORS IN MOTION,6421 N DAMEN AVE,CHICAGO,IL,60645,...,2002-02-21T00:00:00.000,2002-11-15T00:00:00.000,2002-02-21T00:00:00.000,2002-02-22T00:00:00.000,AAI,,,41.998514,-87.680011,"\n, \n(41.99851437112669, -87.68001090539342)"
1,1000049-20010816,1162772,200068,1,ANTONIA CASTREJON,ILLUSIONS HAIR DESIGN,3800 W DIVERSEY AVE,CHICAGO,IL,60647,...,2001-08-16T00:00:00.000,2002-08-15T00:00:00.000,2001-08-20T00:00:00.000,2002-04-30T00:00:00.000,AAI,,,41.93196,-87.72215,"\n, \n(41.931960332638006, -87.72215036594574)"
2,1000049-20020516,1233615,10141,2,"PEPE""S RETAIL MEATS, INC.",PEREZ MEXICAN FOOD,853-855 W RANDOLPH ST 1ST,CHICAGO,IL,60607,...,2002-05-16T00:00:00.000,2003-05-15T00:00:00.000,2002-04-17T00:00:00.000,2002-04-18T00:00:00.000,AAI,,,41.884261,-87.649534,"\n, \n(41.88426142200001, -87.6495341312589)"
3,1000049-20020816,1265665,200068,1,ANTONIA CASTREJON,ILLUSIONS HAIR DESIGN,3800 W DIVERSEY AVE,CHICAGO,IL,60647,...,2002-08-16T00:00:00.000,2003-08-15T00:00:00.000,2002-08-13T00:00:00.000,2002-08-14T00:00:00.000,AAI,,,41.93196,-87.72215,"\n, \n(41.931960332638006, -87.72215036594574)"
4,1000049-20030516,1342680,10141,2,"PEPE""S RETAIL MEATS, INC.",PEREZ MEXICAN FOOD,853-855 W RANDOLPH ST 1ST,CHICAGO,IL,60607,...,2003-05-16T00:00:00.000,2004-05-15T00:00:00.000,2003-04-17T00:00:00.000,2003-04-18T00:00:00.000,AAI,,,41.884261,-87.649534,"\n, \n(41.88426142200001, -87.6495341312589)"


In [26]:
table_name = "samedwardes_business_license_test"
df.to_sql(table_name, engine, if_exists="replace")

5

You can verify that it worked by reading the data from SQL:

In [27]:
pd.read_sql(f"SELECT * FROM {table_name}", engine)

Unnamed: 0,index,id,license_id,account_number,site_number,legal_name,doing_business_as_name,address,city,state,...,license_start_date,expiration_date,license_approved_for_issuance,date_issued,license_status,license_status_change_date,ssa,latitude,longitude,location
0,0,1000000-20020221,1000000,200001,1,MARK BOSTON,COLORS IN MOTION,6421 N DAMEN AVE,CHICAGO,IL,...,2002-02-21T00:00:00.000,2002-11-15T00:00:00.000,2002-02-21T00:00:00.000,2002-02-22T00:00:00.000,AAI,,,41.998514,-87.680011,"\n, \n(41.99851437112669, -87.68001090539342)"
1,1,1000049-20010816,1162772,200068,1,ANTONIA CASTREJON,ILLUSIONS HAIR DESIGN,3800 W DIVERSEY AVE,CHICAGO,IL,...,2001-08-16T00:00:00.000,2002-08-15T00:00:00.000,2001-08-20T00:00:00.000,2002-04-30T00:00:00.000,AAI,,,41.93196,-87.72215,"\n, \n(41.931960332638006, -87.72215036594574)"
2,2,1000049-20020516,1233615,10141,2,"PEPE""S RETAIL MEATS, INC.",PEREZ MEXICAN FOOD,853-855 W RANDOLPH ST 1ST,CHICAGO,IL,...,2002-05-16T00:00:00.000,2003-05-15T00:00:00.000,2002-04-17T00:00:00.000,2002-04-18T00:00:00.000,AAI,,,41.884261,-87.649534,"\n, \n(41.88426142200001, -87.6495341312589)"
3,3,1000049-20020816,1265665,200068,1,ANTONIA CASTREJON,ILLUSIONS HAIR DESIGN,3800 W DIVERSEY AVE,CHICAGO,IL,...,2002-08-16T00:00:00.000,2003-08-15T00:00:00.000,2002-08-13T00:00:00.000,2002-08-14T00:00:00.000,AAI,,,41.93196,-87.72215,"\n, \n(41.931960332638006, -87.72215036594574)"
4,4,1000049-20030516,1342680,10141,2,"PEPE""S RETAIL MEATS, INC.",PEREZ MEXICAN FOOD,853-855 W RANDOLPH ST 1ST,CHICAGO,IL,...,2003-05-16T00:00:00.000,2004-05-15T00:00:00.000,2003-04-17T00:00:00.000,2003-04-18T00:00:00.000,AAI,,,41.884261,-87.649534,"\n, \n(41.88426142200001, -87.6495341312589)"


### Activity 3 - put it all together

#### 🔄 Task

- Clean up your code from Activity 1 to pull both datasets from the City of Chicago into pandas DataFrames.
- Write both DataFrames into the Postgres database.


🚨 Writing data to Postgres can be slow. Do not insert all of the rows in one go. Instead, write a loop that inserts 10,000 rows at a time.


#### ✅ Solution

See [materials/example/01-etl-raw-data/notebook.ipynb](../example/01-etl-raw-data/notebook.ipynb) for a complete solution.
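
As a rough illustration of the chunked insert described in the task (not necessarily how the linked notebook does it), the loop could be sketched as below. `chunk_bounds` and `write_in_chunks` are hypothetical helper names. `df` is assumed to be a pandas DataFrame and `engine` a SQLAlchemy engine like the one created in Activity 2.

```python
def chunk_bounds(n_rows: int, chunk_size: int = 10_000):
    """Yield (start, stop) row positions covering n_rows in chunk_size steps."""
    for start in range(0, n_rows, chunk_size):
        yield start, min(start + chunk_size, n_rows)


def write_in_chunks(df, table_name, engine, chunk_size: int = 10_000) -> int:
    """Write a pandas DataFrame to SQL one chunk at a time; return rows written."""
    written = 0
    for start, stop in chunk_bounds(len(df), chunk_size):
        # Replace the table on the first chunk, then append the rest.
        mode = "replace" if start == 0 else "append"
        df.iloc[start:stop].to_sql(table_name, engine, if_exists=mode, index=False)
        written += stop - start
    return written


print(list(chunk_bounds(25_000)))
# [(0, 10000), (10000, 20000), (20000, 25000)]
```

A call such as `write_in_chunks(df, "sam_business_license_raw", engine)` would then insert 10,000 rows per iteration. Note that an empty DataFrame produces no chunks, so no table is created in that case. `DataFrame.to_sql` also accepts a `chunksize` argument, which batches the writes without an explicit loop.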