## Introduction to Data Cleaning with the Relational Model

In this exercise, we apply the relational model to data cleaning using DuckDB. The exercise assumes that you have already cleaned the example City of Chicago Food Inspection data with a tool such as OpenRefine, so that:
* License numbers are all numeric values
* Inspection dates are all valid dates in a consistent format
* City, State, and Zip have been parsed into separate fields

### Learning Objectives

* Apply strategies for data cleaning using the relational model, including implementing a relational schema, integrity constraints, and queries

### Prerequisites

* Install `pandas` and `duckdb` Python libraries
* **Optional**:  Install the [DuckDB command line interface](https://duckdb.org/docs/installation/?version=stable&environment=cli) to use `duckdb` from the terminal.


In [None]:
! pip install pandas duckdb

In [None]:
import duckdb
import pandas as pd

### Inspect the Data

The data is in `datasets/inspections/food-inspections-dirty.csv`. We can use the DuckDB Python API to query the CSV file directly and return a pandas DataFrame:

In [None]:
duckdb.sql("SELECT * FROM '../datasets/inspections/food-inspections-dirty.csv' limit 5").df()

### Create the Schema

The next step is to create a relational schema that can be used to import the dataset.

* Create a file named `schema.sql` that will be used to create a database.
* The file must define the schema for a table named `INSPECTIONS` that will store the inspections data. 
* The table should have a primary key.

In [None]:
with duckdb.connect("inspections.db") as con:
    with open("schema.sql", "r") as f:
        # Create the DB schema
        con.sql(f.read())
        # Import data
        con.sql("INSERT INTO inspections (SELECT * FROM read_csv('../datasets/inspections/food-inspections-cleaned.csv'))")

**Discussion**
* What are the column names and datatypes?
* What can we use for the primary key?

### Query the data

Define a query that can be used to count the number of rows in the `INSPECTIONS` table:

In [None]:
q1 = """
    SELECT COUNT(*) AS COUNT from INSPECTIONS
"""

In [None]:
#test
with duckdb.connect("inspections.db") as con:
    count = con.sql(q1).fetchone()[0]
    assert count == 720
    print("Success")

Inspect the contents of the `Violations` column. Note that the column contains multiple nested records delimited with pipe (`|`). 


According to the first normal form, columns should not contain nested values, so our next step is to normalize the violations data. But first, we need to understand what each record contains. Use the DuckDB `regexp_split_to_table()` function to split each violation into a separate row.

In [None]:
q3 = """
   SELECT inspection_id, regexp_split_to_table(Violations, '\\|') 
   FROM INSPECTIONS
   LIMIT 10
"""
with duckdb.connect("inspections.db") as con:
    df = con.sql(q3).df()

df

We can see that the violations have a consistent format:

```
(30). (FOOD IN ORIGINAL CONTAINER, PROPERLY LABELED: CUSTOMER ADVISORY POSTED AS NEEDED) - Comments: (LABEL ALL BULK CONTAINERS IN PREP AREA)
```

The violations refer to a list of standard codes and descriptions found on the [Food Inspection Report Form](https://www.chicago.gov/dam/city/depts/cdph/food_env/general/Food_Protection/Blankinspectionreport.pdf) and consist of:
1. A numeric violation code or identifier
2. A standard description of the violation
3. An optional comment from the inspector at the time of inspection

The next step is to create the new table(s) and use SQL or Python to parse and transform the source data.


**Discussion**
* What should the new table(s) be?
* How should we handle duplicated data such as the violation code and description?

### Transform the Data

Update your `schema.sql` to create additional table(s) to store the violations data:
* `VIOLATIONS`: Table to store unique information about violations. It should have a primary key that serves as a foreign key for the `INSPECTION_VIOLATIONS` table.
* `INSPECTION_VIOLATIONS`: Table to store **unique** information about violations associated with each inspection. It should have a primary key and foreign keys to `INSPECTIONS` and `VIOLATIONS`.

First, we need to write a query that can parse the `Violations` column into its separate parts:
* For each inspection, use `regexp_split_to_table` to split the violations records by the `|` delimiter
* For each violation record, use `regexp_extract` to extract the individual parts (violation code, description, and comment)

In [None]:
q4 = """
    select inspection_id, regexp_extract(
        trim(regexp_split_to_table(inspections.violations, '\\|')),
        '^(\\d+)\\.(.*?)(?:-\\s+Comments?:(.*))?$',
        ['violation_id', 'description', 'comments'], 's'
    ) as violation
    from inspections
"""

with duckdb.connect("inspections.db") as con:
    df = con.sql(q4).df()
df

This rather complex query uses DuckDB's [text functions](https://duckdb.org/docs/sql/functions/char.html) to parse the column:
* `regexp_split_to_table` splits the delimited string into new rows
* `regexp_extract` uses regular expression capture groups to extract the parts of the violation string

The regular expression consists of:
* `^(\d+)\.`: Anchored at the start of the string (`^`), the first capture group matches one or more digits (`\d+`) and is followed by a literal period (`\.`).
* `(.*?)`: The second capture group lazily matches everything between the period and the optional `- Comments:` block.
* `(?:-\s+Comments?:(.*))?$`: A non-capturing group (`(?:...)`) made optional by the trailing `?`. It makes the `- Comments:` part optional, since some violations do not contain a comment.
* `(.*)`: The third capture group, nested inside the optional group, matches everything between `Comments:` and the end of the string (`$`).

`regexp_extract` returns a `STRUCT` whose fields can now be referenced in queries:

```
{'violation_id': '', 'description': '', 'comments': ''}
```

Note that the `s` option to `regexp_extract` makes matching [non-newline sensitive](https://duckdb.org/docs/sql/functions/regular_expressions.html#options-for-regular-expression-functions), so `.` also matches newline characters.
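
To experiment with the pattern outside the database, the same expression can be tried with Python's built-in `re` module. This is a rough equivalent rather than the DuckDB implementation: `re.DOTALL` stands in for the `s` option, the helper `parse_violations` is a hypothetical name, and the sample string is made up to follow the format described above.

```python
import re

# Same pattern as in the SQL query; re.DOTALL stands in for the 's' option.
VIOLATION_RE = re.compile(r"^(\d+)\.(.*?)(?:-\s+Comments?:(.*))?$", re.DOTALL)


def parse_violations(field):
    """Split a pipe-delimited Violations value and parse each record."""
    records = []
    for part in field.split("|"):
        match = VIOLATION_RE.match(part.strip())
        if match is None:
            continue  # skip fragments that do not look like a violation
        code, description, comment = match.groups()
        records.append({
            "violation_id": code,
            "description": description.strip(),
            # The comment group is None when the optional block is absent.
            "comments": comment.strip() if comment is not None else None,
        })
    return records


# Made-up value in the format described above
sample = (
    "30. FOOD IN ORIGINAL CONTAINER, PROPERLY LABELED"
    " - Comments: LABEL ALL BULK CONTAINERS"
    " | 32. FOOD AND NON-FOOD CONTACT SURFACES PROPERLY DESIGNED"
)
parsed = parse_violations(sample)
for record in parsed:
    print(record)
```

The second record has no comment, and the hyphen in `NON-FOOD` is not mistaken for the start of a comment block because the pattern requires whitespace and the word `Comments` after the hyphen.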

In [None]:
q5 = """
    select inspection_id, violation['violation_id'] as violation_id, trim(violation['description']) as description
    from
    (
        select inspection_id, regexp_extract(
            trim(regexp_split_to_table(inspections.violations, '\\|')),
            '^(\\d+)\\.(.*?)(?:-\\s+Comments?:(.*))?$',
            ['violation_id', 'description', 'comments'], 's'
        ) as violation
        from inspections
    ) as inspections
"""
with duckdb.connect("inspections.db") as con:
    df = con.sql(q5).df()
df

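We can see here that the violation codes and descriptions are indeed repeated across inspections. The next step is to determine whether each code always appears with the same description, that is, whether the distinct code and description pairs are unique per code.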

In [None]:
q5 = """
    select distinct violation['violation_id'] as violation_id, trim(violation['description']) as description
    from
    (
        select regexp_extract(
            trim(regexp_split_to_table(inspections.violations, '\\|')),
            '^(\\d+)\\.(.*?)(?:-\\s+Comments?:(.*))?$',
            ['violation_id', 'description', 'comments'], 's'
        ) as violation
        from inspections
    ) as inspections
    order by cast(violation_id as integer);
"""
with duckdb.connect("inspections.db") as con:
    df = con.sql(q5).df()
df

At this point, we're ready to populate the new `VIOLATIONS` table. If `violation_id` is specified as the primary key, the insert will fail if any code appears with more than one description, so it also tells us whether the codes are truly unique:

In [None]:
q6 = """
    insert into violations
    select distinct violation['violation_id'] as violation_id, trim(violation['description']) as description
    from
    (
        select regexp_extract(
            trim(regexp_split_to_table(inspections.violations, '\\|')),
            '^(\\d+)\\.(.*?)(?:-\\s+Comments?:(.*))?$',
            ['violation_id', 'description', 'comments'], 's'
        ) as violation
        from inspections
    ) as inspections
    order by cast(violation_id as integer);
"""
with duckdb.connect("inspections.db") as con:
    con.sql(q6)

In [None]:
with duckdb.connect("inspections.db") as con:
    df = con.sql("SELECT * FROM VIOLATIONS").df()
df

The next step is to populate the `INSPECTION_VIOLATIONS` table. This should be easy now, since we already have the query to parse the violations data.

The table should contain the `INSPECTION_ID` referencing the `INSPECTIONS` table, the `VIOLATION_ID` referencing the `VIOLATIONS` table, and the inspection-specific `COMMENT`:

In [None]:
q7 = """
    insert into inspection_violations
    select inspection_id, violation['violation_id'] as violation_id, trim(violation['comments']) as comment
    from
    (
        select DISTINCT inspection_id, regexp_extract(
            trim(regexp_split_to_table(inspections.violations, '\\|')),
            '^(\\d+)\\.(.*?)(?:-\\s+Comments?:\\s+(.*))?$',
            ['violation_id', 'description', 'comments'], 's'
        ) as violation
        from inspections 
    ) 
    order by inspection_id, cast(violation_id as integer);
"""
with duckdb.connect("inspections.db") as con:
    con.sql(q7)

In [None]:
with duckdb.connect("inspections.db") as con:
    df = con.sql("SELECT * FROM INSPECTION_VIOLATIONS").df()
df

### Integrity Constraints

We can use our knowledge of the domain to define integrity constraints that help identify further errors in the data. In fact, we have already defined some constraints. For example, specifying `violation_id` as the primary key of the violations table asserts the functional dependency $ViolationID \rightarrow ViolationDescription$. If this constraint had been violated, our inserts into the table would have failed.

Below we use SQL queries to identify potential errors based on several functional dependencies.

#### Zip -> City, State

Based on our knowledge of US ZIP Codes, we expect that, across all inspection records, the ZIP Code will determine the City and State. We can confirm this with the following query:

In [None]:
c1 = """
    SELECT i1.zip, i1.city, i2.city, i1.state, i2.state 
    FROM inspections i1, inspections i2
    WHERE i1.zip = i2.zip
    AND (i1.city != i2.city or i1.state != i2.state)
"""
with duckdb.connect("inspections.db") as con:
    df = con.sql(c1).df()
df

This query returns rows where the ZIP Code matches between records but the city or state does not. We see a few instances of case differences, which could be corrected by consistently using upper or lower case. We also see what appear to be typos in the data that should be corrected.

#### DBAName -> Zip

We also expect that a given business name should determine its location:

In [None]:
c2 = """
    SELECT i1.DBA_Name, i1.zip, i2.zip 
    FROM inspections i1, inspections i2
    WHERE i1.DBA_Name = i2.DBA_Name
    AND i1.zip != i2.zip
"""
with duckdb.connect("inspections.db") as con:
    df = con.sql(c2).df()

df

We see several potential issues here, but the cause is unclear. Perhaps a restaurant moved between inspection dates? Perhaps food trucks are inspected at different locations? Let's include the inspection date and address:

(Perhaps DBAName, Inspection Date -> Zip?)

In [None]:
c3 = """
    SELECT i1.DBA_Name, i1.inspection_date, i2.inspection_date, i1.address, i2.address, i1.zip, i2.zip 
    FROM inspections i1, inspections i2
    WHERE i1.DBA_Name = i2.DBA_Name
    AND i1.zip != i2.zip
"""
with duckdb.connect("inspections.db") as con:
    df = con.sql(c3).df()

df

Here we can see that the addresses are the same but the ZIP Codes differ, suggesting an error in the data.

## Summary

The relational model is a powerful tool for data cleaning.
...