# Data Exploration and Validation

In this exercise we will cover how to use Ibis, Pandas, and Pandera to explore, tidy, and validate the data.

### Activity 1 - Load data from SQL

#### 🔄 Task

- Use `ibis` to load the data from SQL into a pandas dataframe.

🚨 Only load the first 10,000 rows. This will speed up our ETL and testing.

#### 🧑‍💻 Code

In the first exercise we used SQLAlchemy to interact with SQL. Ibis is another Python package for interacting with SQL databases. Ibis is specially designed for analytics workloads.

In [None]:
import os

import ibis

# Set up ibis for reading data
con = ibis.postgres.connect(
    user="posit",
    password=os.environ["CONF23_DB_PASSWORD"],
    host=os.environ["CONF23_DB_HOST"],
    port=5432,
    database="conf23_python"
)

Load the business license data.

In [None]:
business_license_raw = con.table(name="business_license_raw").limit(10_000).to_pandas()
business_license_raw

Load the food inspection data.

In [None]:
food_inspection_raw = con.table(name="food_inspection_raw").limit(10_000).to_pandas()
food_inspection_raw

### Activity 2 - Explore the data

#### 🔄 Task

Begin exploring the data. You will want to understand:

- What columns exist in the data?
- How do the two data sets relate to one another?
- What is the type of each column (e.g. string, number, category, date)?
- Which columns could be useful for the model?
- What is the cardinality of categorical data?
- Is all of the data in scope?
- What steps will I need to perform to clean the data?

🚨 We are not performing feature engineering at this stage. But it is a good time to start thinking about what features you can create from the data.
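The questions above map onto a handful of pandas calls. The sketch below uses a small made-up dataset, so the column names (`license_number`, `license_description`, `city`, `risk`) are assumptions for illustration and may not match the real tables.

```python
import pandas as pd

# Hypothetical toy data standing in for the raw tables; the column names
# are assumptions and may not match the real schema.
licenses = pd.DataFrame({
    "license_number": [101, 101, 102, 103],
    "license_description": ["Retail Food", "Retail Food", "Tavern", "Retail Food"],
    "city": ["CHICAGO", "CHICAGO", "CHICAGO", "EVANSTON"],
})
inspections = pd.DataFrame({
    "license_number": [101, 102, 104],
    "risk": ["Risk 1 (High)", "Risk 2 (Medium)", None],
})

# What columns exist, and what type did pandas infer for each?
print(licenses.dtypes)

# Cardinality: number of distinct non-null values per column
cardinality = licenses.nunique()
print(cardinality)

# Most common values of a categorical column
license_counts = licenses["license_description"].value_counts()
print(license_counts)

# Does each license appear on exactly one row?
rows_per_license = licenses.groupby("license_number").size()
has_duplicates = bool((rows_per_license > 1).any())
print("Duplicate license rows:", has_duplicates)

# Share of missing values per column
missing_share = inspections.isna().mean()
print(missing_share)

# How do the two tables relate? Count inspections with a matching license.
matched = int(inspections["license_number"].isin(licenses["license_number"]).sum())
print("Inspections with a known license:", matched)
```

The same calls should work on `business_license_raw` and `food_inspection_raw` once you substitute the real column names.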

#### 🧑‍💻 Code

In [None]:
import pandas as pd

##### Business license data

Distribution of business locations:

In [None]:
# your code here

Most common license types:

In [None]:
# your code here

Do businesses have multiple licenses?

In [None]:
# your code here

Does each license have only one row in the table?

In [None]:
# your code here

Does all the data relate to Chicago?

In [None]:
# your code here

In [None]:
# your code here

##### Food inspection data 

What are the different risk levels?

In [None]:
# your code here

What are the most common violations?

In [None]:
# your code here

What are the most common outcomes?

In [None]:
# your code here

What are the most common facility types?

In [None]:
# your code here

### Activity 3 - Tidy Data

#### 🔄 Task

Now that you have a basic understanding of the data, the next step is to tidy the data. Create a new notebook that:

- Reads in the raw data from the postgres database.
- Tidies the dataset.

Tips:

- Remove unnecessary rows.
- Remove unnecessary columns.
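As a rough illustration of the two tips, here is a minimal sketch on made-up data. The column names and the choice of which rows and columns count as "unnecessary" are assumptions; decide what is in scope based on your own exploration.

```python
import pandas as pd

# Hypothetical toy data; column names are assumptions for illustration.
raw = pd.DataFrame({
    "license_number": [1, 2, 3, 4],
    "city": ["CHICAGO", "chicago ", "EVANSTON", None],
    "results": ["Pass", "Fail", "Pass", "Pass"],
    "location": ["(41.8, -87.6)"] * 4,
})

# Work on a copy so the raw data stays untouched
tidy = raw.copy()

# Normalise text before filtering so "chicago " and "CHICAGO" match
tidy["city"] = tidy["city"].str.strip().str.upper()

# Remove unnecessary rows: keep only in-scope records
tidy = tidy.loc[tidy["city"] == "CHICAGO"]

# Remove unnecessary columns
tidy = tidy.drop(columns=["location"])

# Rebuild a clean 0..n-1 index after filtering
tidy = tidy.reset_index(drop=True)
print(tidy)
```

Rows with a missing `city` are dropped by the filter, because a missing value never compares equal to `"CHICAGO"`.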

#### 🧑‍💻 Code

See notebook [example/02-etl-data-validation/notebook.ipynb](../example/02-etl-data-validation/notebook.ipynb) for examples.

**Business license data**

In [None]:
business_license_tidy = business_license_raw.copy()

In [None]:
# Your data cleaning steps here...

In [None]:
business_license_tidy = business_license_tidy.reset_index(drop=True)

**Food inspection data**

In [None]:
food_inspection_tidy = food_inspection_raw.copy()

In [None]:
# For example:
# food_inspection_tidy = food_inspection_tidy.loc[food_inspection_tidy["city"] == "CHICAGO"]

In [None]:
# Your data cleaning steps here...

In [None]:
food_inspection_tidy = food_inspection_tidy.reset_index(drop=True)

### Activity 4 - Validate Data (Quick Start)

#### 🔄 Task

In the previous activity we tidied the dataset. For some projects, this may be enough. However, for this project we plan to refresh the data on a regular basis. We would like to gain additional comfort that the data we are using is correct. Data validation can help prove that our data tidying was correct, and find any potential issues if the upstream data changes.

[Pandera](https://pandera.readthedocs.io/en/stable/) is a Python library for validating Pandas dataframes. There are two steps:

1. Define a schema for your data:
   - Define the type for each column
   - Confirm if null values are allowed
   - Define custom checks
2. Run your data through the schema validator.

Take 5 minutes and work through the quick start section of the Pandera docs: https://pandera.readthedocs.io/en/stable/index.html#quick-start.

#### 🧑‍💻 Code

In [None]:
import pandas as pd
import pandera as pa

# data to validate
df = pd.DataFrame({
    "column1": [1, 4, 0, 10, 9],
    "column2": [-1.3, -1.4, -2.9, -10.1, -20.4],
    "column3": ["value_1", "value_2", "value_3", "value_2", "value_1"],
})

# define schema
schema = pa.DataFrameSchema({
    "column1": pa.Column(int, checks=pa.Check.le(10)),
    "column2": pa.Column(float, checks=pa.Check.lt(-1.2)),
    "column3": pa.Column(str, checks=[
        pa.Check.str_startswith("value_"),
        # define custom checks as functions that take a Series as input and
        # output a boolean or a boolean Series
        pa.Check(lambda s: s.str.split("_", expand=True).shape[1] == 2)
    ]),
})

validated_df = schema(df)
print(validated_df)

### Activity 5 - Validate Data (Real Data)

#### 🔄 Task

Now that you understand how Pandera works, let's validate our tidy data! In the notebook where you tidy the data, create a schema to validate each of the two data sets.

Tips:

- Most of the columns have null values.
- Use the `coerce` keyword option to automatically convert columns to the correct type.
- For categorical data, confirm that only the expected categories exist.
- Think about custom checks that you can add to validate the data.

Once your data is validated, write the validated data back to the SQL database.

🚨 Please prefix any tables you create with your name! For example:

- `sam_business_license_validated`
- `sam_food_inspections_validated`
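One possible way to write a dataframe to a prefixed table is pandas' `DataFrame.to_sql`. The sketch below uses an in-memory SQLite database as a stand-in so that it runs anywhere. For the workshop's Postgres database you would pass a suitable connection (for example a SQLAlchemy engine) instead. The `user_prefix` variable and the toy data are assumptions for illustration.

```python
import sqlite3

import pandas as pd

# Hypothetical validated data; in your notebook this would be the output
# of the Pandera schema.
validated = pd.DataFrame({
    "license_number": [1, 2],
    "risk": ["Risk 1 (High)", "Risk 3 (Low)"],
})

user_prefix = "sam"  # replace with your own name
table_name = f"{user_prefix}_business_license_validated"

# In-memory SQLite stands in for the Postgres database (an assumption made
# so the example is runnable without credentials).
sqlite_con = sqlite3.connect(":memory:")

# if_exists="replace" overwrites the table on each scheduled refresh
validated.to_sql(table_name, sqlite_con, if_exists="replace", index=False)

# Read the table back to confirm the write succeeded
round_trip = pd.read_sql(f"SELECT * FROM {table_name}", sqlite_con)
sqlite_con.close()
print(round_trip)
```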

#### 🧑‍💻 Code

See notebook [example/02-etl-data-validation/notebook.ipynb](../example/02-etl-data-validation/notebook.ipynb) for examples.

In [None]:
# Your code here

### Activity 6 - Publish the Solution notebook

#### 🔄 Task

Publish your Jupyter Notebook to Connect and schedule it to re-run every Sunday at 3:00 AM. This time, make sure you are using a `requirements.txt` and a virtual environment.

#### 🧑‍💻 Code

```bash
# Navigate to the correct directory
cd ~/ds-workflows-python/materials/solutions/02-etl-data-validation/

# Create a virtual environment
python -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip wheel setuptools
python -m pip install -r requirements.txt

# Deploy the notebook
rsconnect deploy notebook --title "02 - Data Validation" notebook.ipynb
```