# Reading data

In this exercise we will cover how to use pandas to read data from external data sources. To perform our analysis, we will need to use two different data sets:

1. Business licenses data: <https://data.cityofchicago.org/Community-Economic-Development/Business-Licenses/r5kz-chrr>
2. Food inspections data: <https://data.cityofchicago.org/Health-Human-Services/Food-Inspections/4ijn-s7e5>

Both data sets are hosted on <https://data.cityofchicago.org>.

---

### Task 1 - read data

#### 🔄 Task

- Download the **business license** data
- Convert the data into a pandas dataframe

#### 🧑‍💻 Code

The City of Chicago data portal makes data available over an API. The API has lots of features; you can read more about how to use it here: <https://dev.socrata.com/foundry/data.cityofchicago.org/r5kz-chrr>.


To download the data, many people's first instinct is to either:

- Click through the data portal in a web browser.
- Use the curl command in the terminal.

```bash
curl 'https://data.cityofchicago.org/resource/r5kz-chrr?$limit=10'
```

There is a better way though! Using base Python and pandas, we can construct a URL and read it directly into a dataframe. pandas has a built-in function, `read_csv`, that reads CSV data directly from a URL. So our first task is to construct the URL in Python.

In [None]:
from urllib.parse import urlencode

base_url = "https://data.cityofchicago.org"

# Note the .csv extension
path = "resource/r5kz-chrr.csv"

# To make our code easier to read we can define the parameters in a dict. To know
# what parameters are available you must consult the docs: https://dev.socrata.com/docs/queries/
params = {
    "$order": "id", 
    "$limit": 500
}

# Then use an f-string to construct the URL. You can use the built-in urlencode
# function to correctly format the params.
url = f"{base_url}/{path}?{urlencode(params)}"
url
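
To see what `urlencode` does to the parameters, the short sketch below encodes the same dict and then decodes the query string again. The variable names `query`, `example_url`, and `decoded` are only for illustration. Note that `$` is percent-encoded as `%24`; the server decodes it back, so the request is unchanged.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Same parameters as in the cell above
params = {"$order": "id", "$limit": 500}

# urlencode percent-encodes reserved characters such as "$"
query = urlencode(params)
print(query)  # %24order=id&%24limit=500

example_url = f"https://data.cityofchicago.org/resource/r5kz-chrr.csv?{query}"

# Decoding the query string recovers the original parameter names
decoded = parse_qs(urlsplit(example_url).query)
print(decoded)  # {'$order': ['id'], '$limit': ['500']}
```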

You can then pass the newly constructed URL directly to pandas.

In [None]:
import pandas as pd

business_license_raw = pd.read_csv(url)
business_license_raw
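
The request above returns at most 500 rows because of `$limit`. If you ever need more, one possible approach is to page through the data with the Socrata `$offset` parameter, which is not otherwise used in this exercise. The sketch below only builds the URLs; the helper name `page_urls` is hypothetical.

```python
from urllib.parse import urlencode


def page_urls(base, page_size, n_pages):
    """Build one URL per page, ordered by id so the pages are stable."""
    urls = []
    for page in range(n_pages):
        params = {
            "$order": "id",
            "$limit": page_size,
            "$offset": page * page_size,  # skip the rows of earlier pages
        }
        urls.append(f"{base}?{urlencode(params)}")
    return urls


urls = page_urls("https://data.cityofchicago.org/resource/r5kz-chrr.csv", 500, 3)
for u in urls:
    print(u)
```

Each URL could then be passed to `pd.read_csv`, and the resulting dataframes combined with `pd.concat`.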

---

### Task 2 - write data to SQL

#### 🔄 Task

- Save `business_license_raw` to the PostgreSQL database.
- This way, we do not need to hit the API every time we need to interact with the raw data.
- 🚨 Please prefix any tables you create with your name! For example: `sam_edwardes_business_license_raw`

#### 🧑‍💻 Code

There are many different ways to interact with SQL databases in Python. For writing data, we prefer to use [SQLAlchemy](https://docs.sqlalchemy.org/en/14/dialects/postgresql.html) with pandas. Make sure the packages used below are installed: `sqlalchemy` and the `psycopg2` driver named in the connection string.

You will first need to create a connection to the database.

In [None]:
import os

from sqlalchemy import create_engine, text

db_user = "posit"
db_password = os.environ["CONF23_DB_PASSWORD"]
db_host = os.environ["CONF23_DB_HOST"]
db_port = 5432
db_database = "conf23_python"
engine = create_engine(f"postgresql+psycopg2://{db_user}:{db_password}@{db_host}:{db_port}/{db_database}")
engine
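
An f-string connection URL works as long as the password contains no characters such as `@` or `/`, which would need to be URL-escaped. As an alternative sketch, SQLAlchemy's `URL.create` builds the URL from separate fields and handles the escaping. The password and host below are hypothetical placeholders, not the workshop's real values.

```python
from sqlalchemy.engine import URL

demo_url = URL.create(
    drivername="postgresql+psycopg2",
    username="posit",
    password="p@ss/word",   # hypothetical password with special characters
    host="db.example.com",  # hypothetical host
    port=5432,
    database="conf23_python",
)

# hide_password=True masks the password, so this is safe to print or log
print(demo_url.render_as_string(hide_password=True))
```

The resulting object can be passed to `create_engine` in place of the f-string.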

Give the table a unique name, prefixed with your name (e.g. `sam_edwardes_business_license_raw`):

In [None]:
# your code here

Use the built-in `DataFrame.to_sql` method to write the data to Postgres (<https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html>).

In [None]:
# your code here

You can verify that it worked by reading the data from SQL:

In [None]:
with engine.begin() as conn:
    data_from_sql = pd.read_sql(text(f"SELECT * FROM {table_name}"), conn)

data_from_sql
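
If you want to try the write-then-read round trip without touching the shared database, the sketch below does the same thing against an in-memory SQLite database. This swaps SQLite in for Postgres purely so the example runs without credentials; the dataframe contents and the names `demo_engine` and `demo_table_name` are made up for illustration.

```python
import pandas as pd
from sqlalchemy import create_engine, text

# In-memory SQLite stands in for Postgres (assumption for this demo only)
demo_engine = create_engine("sqlite://")

# A tiny made-up dataframe in place of business_license_raw
demo = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "license_description": ["Retail Food", "Tavern", "Limited Business License"],
    }
)

demo_table_name = "your_name_demo_raw"  # hypothetical table name

# if_exists="replace" overwrites the table on re-runs;
# index=False skips writing the dataframe index as a column
demo.to_sql(demo_table_name, demo_engine, if_exists="replace", index=False)

with demo_engine.connect() as conn:
    round_trip = pd.read_sql(text(f"SELECT * FROM {demo_table_name}"), conn)

print(round_trip)
```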

---

### Task 3 - Publish the solution notebook to Connect

#### 🔄 Task

- Publish the solution notebook to Posit Connect.
- Share the notebook with the rest of the workshop.
- Schedule the notebook to run once every week.
- 🚨 We are going to publish the solution notebook, because it reads and writes the data for both tables that we require.

#### 🧑‍💻 Code

- First, let's take a look at the solution. You can find it in this directory: `~/ds-workflows-python/materials/solutions/01-etl-raw-data`.
- Notice that we have a `requirements.txt` file for a virtual environment. For every piece of content we deploy to Connect, we will create a virtual environment.

Run the following code from the terminal:

```bash
# Deactivate your current virtual environment
deactivate

# Navigate to the correct directory
cd ~/ds-workflows-python/materials/solutions/01-etl-raw-data/

# Create a virtual environment
python -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip wheel setuptools

# Install all of the requirements
python -m pip install -r requirements.txt

# Check that you have the required environment variables set
echo $CONNECT_SERVER
echo $CONNECT_API_KEY

# Publish the notebook
rsconnect deploy notebook --title "01 - Raw Data ETL" notebook.ipynb
```

**Pro Tip**

Create an alias:

```bash
alias py-venv='python -m venv .venv && source .venv/bin/activate && python -m pip install --upgrade pip wheel setuptools'
```

Instead of typing all of the commands above, you can use this shortcut to create and activate a virtual environment.

```bash
py-venv
```