# Data Exploration and Validation

In this exercise we will cover how to use Ibis, Pandas, and Pandera to explore, tidy, and validate the data.

### Activity 1 - Load data from SQL

#### 🔄 Task

- Use `ibis` to load the data from SQL into a pandas dataframe.
- 🚨 Only load the first 50,000 rows. This will speed up our ETL and testing.

#### 🧑‍💻 Code

In the first exercise we used SQLAlchemy to interact with SQL. Ibis is another Python package for interacting with SQL databases. Ibis is specially designed for analytics workloads.

In [None]:
import os

import ibis

# Set up ibis for reading data
con = ibis.postgres.connect(
    user="posit",
    password=os.environ["CONF23_DB_PASSWORD"],
    host=os.environ["CONF23_DB_HOST"],
    port=5432,
    database="conf23_python"
)

con

In [None]:
row_limit = 50_000

Load the business license data.

In [None]:
business_license_raw = con.table(name="business_license_raw") \
    .limit(row_limit) \
    .to_pandas()
    
business_license_raw

Load the food inspection data.

In [None]:
food_inspection_raw = con.table(name="food_inspection_raw") \
    .limit(row_limit) \
    .to_pandas()
    
food_inspection_raw

💡 See the `ibis` docs for all of the different methods you can use to modify your SQL query: <https://ibis-project.org/tutorials/ibis-for-pandas-users>

In [None]:
# For example, only get rows where `dba_name` is `JUICE BAR` and sort by `inspection_date`
table = con.table(name="food_inspection_raw")
table = table.filter([table.dba_name.upper() == "JUICE BAR"])
table = table.order_by(["inspection_date"])
table.to_pandas()


### Activity 2 - Explore the data

#### 🔄 Task

Begin exploring the data. You will want to understand:

- What columns exist in the data?
- How do the two data sets relate to one another?
- What is the type of each column (e.g. string, number, category, date)?
- Which columns could be useful for the model?
- What is the cardinality of categorical data?
- Is all of the data in scope?
- What steps will I need to perform to clean the data?
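
Several of the questions above can be answered with a few pandas calls. The sketch below runs on a small made-up pair of frames, not the real tables; the column names `license_number`, `license_description`, and `results` are assumptions chosen for illustration.

```python
import pandas as pd

# Toy stand-ins for the two raw tables (values and some column names are made up).
licenses = pd.DataFrame({
    "license_number": [101, 102, 103, 104],
    "license_description": ["Retail Food", "Retail Food", "Tavern", None],
    "city": ["CHICAGO", "CHICAGO", "EVANSTON", "CHICAGO"],
})
inspections = pd.DataFrame({
    "license_number": [101, 101, 103],
    "results": ["Pass", "Fail", "Pass"],
})

# What columns exist, and what is the type of each?
column_types = licenses.dtypes

# Cardinality: number of distinct non-null values per column.
cardinality = licenses.nunique()

# Share of missing values per column.
missing_share = licenses.isna().mean()

# Columns present in both tables are candidate join keys.
shared_columns = sorted(set(licenses.columns) & set(inspections.columns))

print(column_types, cardinality, missing_share, shared_columns, sep="\n\n")
```

On the real data, the same calls can be applied to `business_license_raw` and `food_inspection_raw`.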

🚨 We are not performing feature engineering at this stage. But it is a good time to start thinking about what features you can create from the data.

#### 🧑‍💻 Code

In [None]:
import pandas as pd

##### Business license data

Distribution of business locations:

In [None]:
# your code here

Most common license types:

In [None]:
# your code here

Do businesses have multiple licenses?

In [None]:
# your code here

Does each license have only one row in the table?

In [None]:
# your code here
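
One possible way to approach the two questions above is to count rows per key. This is a sketch on made-up data; `account_number` (one per business) and `license_id` (one per license) are assumed column names, so check them against the real table first.

```python
import pandas as pd

# Made-up data: account_number and license_id are assumed column names.
licenses = pd.DataFrame({
    "account_number": [1, 1, 2, 3, 3, 3],
    "license_id": [10, 11, 20, 30, 30, 31],
})

# Number of distinct licenses held by each business.
licenses_per_business = licenses.groupby("account_number")["license_id"].nunique()
businesses_with_many = licenses_per_business[licenses_per_business > 1]

# Rows whose license_id appears more than once in the table.
repeated_rows = licenses[licenses.duplicated(subset="license_id", keep=False)]

print(businesses_with_many)
print(repeated_rows)
```

`keep=False` marks every copy of a duplicated key, which makes it easier to inspect how the repeated rows differ.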

Does all the data relate to Chicago?

In [None]:
# your code here

In [None]:
# your code here

##### Food inspection data 

What are the different risk levels?

In [None]:
# your code here

What are the most common violations?

In [None]:
# your code here

What are the most common outcomes?

In [None]:
# your code here

What are the most common facility types?

In [None]:
# your code here
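
For columns that hold one category per row, questions like the ones above usually come down to a frequency count. A minimal sketch on made-up data follows; the column names `risk` and `facility_type` and all values are assumptions for illustration.

```python
import pandas as pd

# Made-up data: column names and values are assumptions.
inspections = pd.DataFrame({
    "risk": ["Risk 1 (High)", "Risk 1 (High)", "Risk 2 (Medium)", None],
    "facility_type": ["Restaurant", "restaurant ", "Grocery Store", "Restaurant"],
})

# Counts per category; dropna=False keeps missing values visible.
risk_counts = inspections["risk"].value_counts(dropna=False)

# Normalising case and whitespace first avoids splitting one category in two.
facility_counts = (
    inspections["facility_type"].str.strip().str.upper().value_counts()
)

print(risk_counts)
print(facility_counts.head(10))
```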

### Activity 3 - Tidy Data

#### 🔄 Task

Now that you have a basic understanding of the data, the next step is to tidy the data.

**Tip**

Use multiple cursors in VS Code to easily edit many lines at the same time (<https://code.visualstudio.com/docs/getstarted/tips-and-tricks#_column-box-selection>).

#### 🧑‍💻 Code

See the solution 2 notebook for examples.

**Business license data**

In [None]:
business_license_tidy = business_license_raw.copy()
business_license_tidy

Filter the tidy data to only keep the state of `IL`.

In [None]:
# your code here

Filter the tidy data to only keep the city of `CHICAGO`.

In [None]:
# your code here
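
Both filters above follow the same boolean-mask pattern. The sketch below shows it on a made-up frame; the column name `state` is an assumption, and the solution notebook may take a different approach.

```python
import pandas as pd

# Made-up data; the column name `state` is an assumption.
toy = pd.DataFrame({
    "state": ["IL", "IL", "WI", "IL"],
    "city": ["CHICAGO", "EVANSTON", "MADISON", "CHICAGO"],
})

# Boolean masks combine with & (and); each comparison needs its own parentheses.
in_scope = (toy["state"] == "IL") & (toy["city"] == "CHICAGO")
toy_filtered = toy.loc[in_scope]

print(toy_filtered)
```

Note that the filtered frame keeps the original row labels, which is why the index is reset at the end of this activity.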

Convert the `conditional_approval` column from a `str` to a `bool` value.

In [None]:
business_license_tidy["conditional_approval"].value_counts()

In [None]:
# your code here
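
One way to do the conversion is with an explicit mapping. The sketch below assumes the column holds the strings `"Y"` and `"N"`; that is an assumption, so check the `value_counts()` output above for the values actually present.

```python
import pandas as pd

# Assumed values: "Y" and "N". Adjust the mapping to what value_counts() shows.
flags = pd.Series(["Y", "N", "N", "Y"], name="conditional_approval")

# Map each known string to a boolean; any unmapped value becomes NaN,
# which makes unexpected categories easy to spot.
as_bool = flags.map({"Y": True, "N": False})

# Fail loudly if anything was left unmapped, then fix the dtype.
assert as_bool.notna().all(), "found a value other than Y or N"
as_bool = as_bool.astype(bool)

print(as_bool.dtype)
```

An explicit mapping is stricter than `astype(bool)` on the raw strings, which would turn every non-empty string, including `"N"`, into `True`.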

Drop the `location` column; the same data is already stored in the `latitude` and `longitude` columns.

In [None]:
# your code here

In [None]:
# Reset the index
business_license_tidy = business_license_tidy.reset_index(drop=True)
business_license_tidy
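
To see why the index is reset after tidying, here is a small sketch on made-up data. It drops a column and filters rows, then compares the row labels before and after `reset_index(drop=True)`.

```python
import pandas as pd

# Made-up data for illustration only.
toy = pd.DataFrame({
    "city": ["CHICAGO", "EVANSTON", "CHICAGO"],
    "location": ["(41.9, -87.6)", "(42.0, -87.7)", "(41.8, -87.6)"],
})

# Filtering keeps the original row labels, so the index now has a gap.
filtered = toy[toy["city"] == "CHICAGO"].drop(columns=["location"])
labels_before = filtered.index.tolist()

# drop=True discards the old labels instead of adding them as a column.
filtered = filtered.reset_index(drop=True)
labels_after = filtered.index.tolist()

print(labels_before, labels_after)
```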

### Activity 4 - Validate Data (Quick Start)

#### 🔄 Task

In the previous activity we tidied the dataset. For some projects, this may be enough. However, for this project we plan to refresh the data on a regular basis. We would like to gain additional comfort that the data we are using is correct. Data validation can help prove that our data tidying was correct, and find any potential issues if the upstream data changes.

[Pandera](https://pandera.readthedocs.io/en/stable/) is a Python library for validating Pandas dataframes. There are two steps:

1. Define a schema for your data. For example:
   - Define the type for each column
   - Confirm if null values are allowed
   - Define custom checks
2. Run your data through the schema validator.

You will find these links useful when defining your schema:

- List of built in checks: https://pandera.readthedocs.io/en/stable/reference/generated/pandera.api.checks.Check.html#pandera.api.checks.Check
- List of schema level options: https://pandera.readthedocs.io/en/stable/reference/generated/pandera.api.pandas.container.DataFrameSchema.html

Take a few minutes and play around with the example below:

- Can you run the code as is?
- Try changing some of the values in the `DataFrame` so that the schema validation fails.
- Try updating the schema so that it passes again.

#### 🧑‍💻 Code

In [None]:
import pandas as pd
import pandera as pa

# data to validate
df = pd.DataFrame({
    "column1": [1, 11, 0, 10, 9],
    "column2": [-1.3, -1.4, -2.9, -10.1, -5.2],
    "column3": ["value_1", "value_2", "value_3", "value_2", "value_1"],
})

df


In [None]:
# define schema
schema = pa.DataFrameSchema({
    "column1": pa.Column(int, checks=pa.Check.le(11)),
    "column2": pa.Column(float, checks=pa.Check.lt(-1.2)),
    "column3": pa.Column(str, checks=[
        pa.Check.str_startswith("value_"),
        # define custom checks as functions that take a Series as input and
        # return a boolean or a boolean Series
        pa.Check(lambda s: s.str.split("_", expand=True).shape[1] == 2)
    ]),
})

schema

In [None]:
validated_df = schema(df)
validated_df

### Activity 5 - Validate Data (Real Data)

#### 🔄 Task

Now that you understand how Pandera works, let's validate our tidy data! In your notebook where you tidy the data, create schemas to validate both data sets. There are a lot of columns in this data, so we will focus on validating just a few interesting and key columns.

Tips:

- Most of the columns have null values.
- Use the `nullable` keyword to allow for missing values.
- Use the `coerce` keyword option to automatically convert columns to the correct type.

#### 🧑‍💻 Code

See the solution 2 notebook for examples.

In [None]:
business_license_tidy[["city", "zip_code", "license_start_date", "latitude"]].sample(5)

In [None]:
# Write a validation that validates the following columns:
#
# - `city` is always equal to "CHICAGO"
# - `latitude` is between 38 and 44 (tip, use between)
# - `license_start_date` is a date type (tip, use coerce and pa.DateTime)
# - `zip_code` is in a valid format (tip, use a lambda)

# These links will help you find the built in checks:
#
# - List of built in checks: https://pandera.readthedocs.io/en/stable/reference/generated/pandera.api.checks.Check.html#pandera.api.checks.Check
# - List of schema level options: https://pandera.readthedocs.io/en/stable/reference/generated/pandera.api.pandas.container.DataFrameSchema.html

business_license_schema = pa.DataFrameSchema(
    strict=False,
    columns={
        "city": pa.Column(),
        "zip_code": pa.Column(),
        "license_start_date": pa.Column(),
        "latitude": pa.Column(),
    }
)

business_license_validated = business_license_schema.validate(business_license_tidy)
business_license_validated

### Activity 6 - Publish the Solution notebook

#### 🔄 Task

- Publish the solution notebook to Posit Connect.
    - The solution notebook will read, tidy, and validate both data sets.
- Share the notebook with the rest of the workshop.
- Schedule the notebook to run once every week.

#### 🧑‍💻 Code

```bash
# Navigate to the correct directory
cd ~/ds-workflows-python/materials/solutions/02-etl-data-validation/

# Create a virtual environment
# Use our alias!
py-venv
python -m pip install -r requirements.txt

# Deploy the notebook
rsconnect deploy notebook --title "02 - Chicago Food Inspections - Data Validation" notebook.ipynb
```