## Introduction to Data Cleaning with the Relational Model

### Learning Objectives

* Apply strategies for data cleaning using the relational model, including implementing a relational schema, integrity constraints, and queries

### Prerequisites

* Optional: To use the command line interface (CLI), install DuckDB for your operating system from https://duckdb.org/docs/installation/?version=stable&environment=cli
* Install `pandas` and `duckdb` Python libraries

In [1]:
! pip install pandas duckdb

In [2]:
import duckdb
import pandas as pd

### 1. Inspect the Data

Read the [About](https://data.cityofchicago.org/Health-Human-Services/Food-Inspections/4ijn-s7e5/about_data) page on the City of Chicago Data Portal. Use the Data Portal, this workspace environment, and/or the `duckdb` API to inspect the dataset, focusing on its structure and attributes.

For example, you can use DuckDB's built-in ability to query the CSV data directly and return a dataframe:

In [37]:
duckdb.sql("SELECT * FROM '../datasets/inspections/food-inspections-dirty.csv' limit 5").df()

| Unnamed: 0 | Inspection ID | DBA Name | AKA Name | License # | Facility Type | Risk | Address | City State Zip | Inspection Date | Inspection Type | Results | Violations |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2078801 | THREE CHEFS RESTURANT | THREE CHEFS RESTURANT | 2009471 | Restaurant | Risk 1 (High) | 8125 S HALSTED ST | ,IL,60620 | 08/22/2017 | Complaint | Fail | 2. FACILITIES TO MAINTAIN PROPER TEMPERATURE -... |
| 1 | 2078380 | THE GODDESS GOLD COAST | THE GODDESS AND GROCER | 2397687 | Restaurant | Risk 1 (HIGH) | 1127 N STATE ST | CCHICAGO,IL,60610 | 08/14/2017 | Complaint | Pass w/ Conditions | 2. FACILITIES TO MAINTAIN PROPER TEMPERATURE -... |
| 2 | 2015423 | HAROLD'S CHICKEN 57 LLC. | HAROLD'S CHICKEN | 2363519 | Restaurant | Risk 1 (High) | 6606 W NORTH AVE | Chicago,IL,60707 | 07/31/2017 | Complaint | Pass | 32. FOOD AND NON-FOOD CONTACT SURFACES PROPERL... |
| 3 | 2071437 | CHINA CAFE | CHINA CAFE | 2535883 | Restaurant | Risk 1 (High) | 2300-2302 S WENTWORTH AVE BLDG | ,IL,60616 | 07/26/2017 | License | Pass | |
| 4 | 2071348 | MARKET SELECT | MARKET SELECT | 2523569 | Grocery Store | Risk 3 (Low) | 912 N ASHLAND AVE | ,IL,60622 | 07/25/2017 | License | Pass | 37. TOILET ROOM DOORS SELF CLOSING: DRESSING R... |


### 2. Create Your Schema
* Create a file named `schema.sql` that can be used to create your database.
* Your schema must contain a table named `INSPECTIONS` that stores the inspection data. This table must have a primary key.

In [79]:
with duckdb.connect("inspections.db") as con:
    with open("schema.sql", "r") as f:
        # Create the DB schema
        con.sql(f.read())
        # Import data
        con.sql("INSERT INTO inspections (SELECT * FROM read_csv('../datasets/inspections/food-inspections-cleaned.csv'))")

Implement a function `count_inspections()` that uses the `sql()` method of a DuckDB connection to query the database and return the total number of inspections.

In [80]:
def count_inspections(con):
    """
    Use con.sql() to execute a query that returns a count of the total number of inspections.
    Return the count as a single number.
    """
    # fetchone() returns a one-element tuple holding the count
    return int(con.sql("SELECT COUNT(*) FROM inspections").fetchone()[0])

In [81]:
#test
with duckdb.connect("inspections.db") as con:
    assert(count_inspections(con) == 720)
    print("Success")

Success


### 3. Transform the Data

As you may have noticed, the `Violations` field contains nested records for each of the violations found during a single inspection. Note that the violations refer to a list of standard [codes and descriptions](https://www.chicago.gov/city/en/depts/cdph/provdrs/food_safety/svcs/understand_healthcoderequirementsforfoodestablishments.html) but also contain information specific to the inspection.

Update your `schema.sql` to create additional tables to store the violations data.

* `VIOLATIONS`: Table to store unique information about violations. It must have a primary key that serves as a foreign key for the `INSPECTION_VIOLATIONS` table.
* `INSPECTION_VIOLATIONS`: Table to store **unique** information about violations associated with each inspection (hint: check for duplicates). It must have a primary key and foreign keys to `INSPECTIONS` and `VIOLATIONS`.
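
The key requirements above can be sketched as DDL. This is a minimal illustration rather than the expected `schema.sql`: the column names, the reduced `inspections` table, and the composite primary key on `inspection_violations` are all assumptions. To keep the sketch runnable with only the standard library it is executed with Python's `sqlite3` instead of DuckDB; the `CREATE TABLE` statements use plain SQL constraints that should carry over, but check them against your DuckDB version.

```python
import sqlite3

# Hypothetical, simplified schema: column names and key choices are assumptions.
DDL = """
CREATE TABLE inspections (
    inspection_id INTEGER PRIMARY KEY,
    dba_name      TEXT,
    violations    TEXT
);
CREATE TABLE violations (
    violation_id INTEGER PRIMARY KEY,
    description  TEXT NOT NULL
);
CREATE TABLE inspection_violations (
    inspection_id INTEGER NOT NULL REFERENCES inspections (inspection_id),
    violation_id  INTEGER NOT NULL REFERENCES violations (violation_id),
    comment       TEXT,
    PRIMARY KEY (inspection_id, violation_id)
);
"""

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
con.executescript(DDL)

# Made-up rows, not taken from the dataset.
con.execute("INSERT INTO inspections VALUES (1, 'EXAMPLE CAFE', NULL)")
con.execute("INSERT INTO violations VALUES (32, 'EXAMPLE DESCRIPTION')")
con.execute("INSERT INTO inspection_violations VALUES (1, 32, 'example comment')")


def try_insert(sql):
    """Run an INSERT and report whether an integrity constraint rejected it."""
    try:
        con.execute(sql)
        return "ok"
    except sqlite3.IntegrityError:
        return "rejected"


# Same (inspection_id, violation_id) pair again: violates the primary key.
duplicate = try_insert("INSERT INTO inspection_violations VALUES (1, 32, 'same pair again')")
# violation_id 99 does not exist in violations: violates the foreign key.
orphan = try_insert("INSERT INTO inspection_violations VALUES (1, 99, NULL)")
print(duplicate, orphan)
```

With constraints like these, the duplicate check hinted at above is enforced by the database itself instead of relying on the loading code alone.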

Write a function `parse_violations()` that uses the DuckDB API to parse the `INSPECTIONS.Violations` field and populate the two new tables. This function is tested based on the number of records in the `VIOLATIONS` and `INSPECTION_VIOLATIONS` tables.

Note that in the source data each delimited violation contains an `identifier`, `description`, and *optional* `comment` that is specific to the inspection. This information all needs to be represented in your database. 

Also, be aware of potential whitespace variations in violations that would otherwise be identical.
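
To make the record structure concrete, here is a small sketch in plain Python that splits one `Violations` value on `|` and pulls out the identifier, description, and optional comment with a regular expression of the same shape as the one used in the SQL in this notebook. The helper name `split_violations` and the sample string are made up for illustration, and the pattern may need adjusting for edge cases in the real data.

```python
import re

# "<number>. <description>" optionally followed by "- Comments: <text>"
VIOLATION_RE = re.compile(r"^(\d+)\.(.*?)(?:-\s+Comments?:\s*(.*))?$")


def split_violations(field):
    """Split one raw Violations value into (id, description, comment) tuples."""
    records = []
    for part in (field or "").split("|"):
        match = VIOLATION_RE.match(part.strip())
        if match is None:
            continue  # empty field, or text that does not start with "<number>."
        number, description, comment = match.groups()
        records.append((
            int(number),
            description.strip(),  # trimming removes whitespace-only differences
            comment.strip() if comment else None,
        ))
    return records


# Made-up sample value, not taken from the dataset.
sample = "1. EXAMPLE RULE ONE - Comments: fridge too warm | 2.  EXAMPLE RULE TWO  "
parsed = split_violations(sample)
print(parsed)
```

Trimming both the whole entry and the description is what keeps two violations that differ only in surrounding whitespace from being stored as distinct rows.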

In [82]:
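def parse_violations(con):
    """
    Split INSPECTIONS.Violations on '|' and populate the VIOLATIONS and
    INSPECTION_VIOLATIONS tables. Each delimited entry is parsed into a
    violation id, a description, and an optional inspection-specific comment.
    """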
    violations_sql = """
        insert into violations
        select distinct violation['violation_id'] as violation_id, trim(violation['description']) as description
        from
        (
            select inspection_id, regexp_extract(
                trim(regexp_split_to_table(inspections.violations, '\\|')),
                '^(\\d+)\\.(.*?)(?:-\\s+Comments?:(.*))?$',
                ['violation_id', 'description', 'comments']
            ) as violation
            from inspections
        ) 
        where violation_id != ''
        order by cast(violation_id as integer);
    """

    inspection_violations_sql = """
        insert into inspection_violations
        select inspection_id, violation['violation_id']  as violation_id, trim(violation['comments']) as comment
        from
        (
            select DISTINCT inspection_id, regexp_extract(
                trim(regexp_split_to_table(inspections.violations, '\\|')),
                '^(\\d+)\\.(.*?)(?:-\\s+Comments?:\\s+(.*))?$',
                ['violation_id', 'description', 'comments']
            ) as violation
            from inspections 
        ) 
        where violation_id != ''
        order by inspection_id, cast(violation_id as integer);
    """
    con.sql(violations_sql)
    con.sql(inspection_violations_sql)    

In [83]:
#test
with duckdb.connect("inspections.db") as con:
    con.sql("DELETE FROM INSPECTION_VIOLATIONS")
    con.sql("DELETE FROM VIOLATIONS")
    parse_violations(con)    
    assert(con.sql("SELECT COUNT(*) FROM INSPECTION_VIOLATIONS").fetchone()[0] == 1457)
    assert(con.sql("SELECT COUNT(*) FROM VIOLATIONS").fetchone()[0] == 37)
    print("Success")

Success


### 4. Functional Dependencies

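`DBA_Name -> zip`: if this dependency held, each business name would map to exactly one ZIP code. The self-join below lists pairs of rows that violate it.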

In [88]:
c1 = """
SELECT i1.DBA_Name, i1.zip, i2.zip FROM inspections i1, inspections i2
WHERE i1.DBA_Name = i2.DBA_Name
AND i1.zip != i2.zip
"""
with duckdb.connect("inspections.db") as con:
    df = con.sql(c1).df()
    print(df.head())

                         dba_name    zip  zip_1
0  LA COCINITA FOOD TRUCK CHICAGO  60201  60077
1             ABUNDANT RESTAURANT  60409  60429
2             PIKO STREET KITCHEN  60076  60193
3  LA COCINITA FOOD TRUCK CHICAGO  60201  60077
4  LA COCINITA FOOD TRUCK CHICAGO  60201  60077


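`zip -> city, state`: if this dependency held, each ZIP code would map to exactly one city and state. The self-join below lists pairs of rows that violate it.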

In [87]:
c1 = """
SELECT i1.zip, i1.city, i2.city, i1.state, i2.state FROM inspections i1, inspections i2
WHERE i1.zip = i2.zip
AND (i1.city != i2.city or i1.state != i2.state)
"""
with duckdb.connect("inspections.db") as con:
    df = con.sql(c1).df()
    print(df.head())

     zip        city           city_1 state state_1
0  60610    CCHICAGO  CHESTNUT STREET    IL      IL
1  60153     MAYWOOD          Maywood    IL      IL
2  60618     Chicago         CCHICAGO    IL      IL
3  60606  312CHICAGO          Chicago    IL      IL
4  60618     Chicago         CCHICAGO    IL      IL
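
The self-joins above return every violating pair, and the same pair can appear more than once. As a complementary sketch, a dependency `X -> Y` can also be checked by grouping rows on `X` and collecting the distinct `Y` values seen for each group. The helper `fd_violations` and its `normalize` parameter are hypothetical names introduced for this illustration; the four sample rows reuse ZIP and city values from the output above.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs, normalize=lambda value: value):
    """Return {lhs value: sorted distinct rhs values} for each lhs value
    that maps to more than one rhs value, i.e. that violates lhs -> rhs."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[lhs]].add(normalize(row[rhs]))
    return {key: sorted(values) for key, values in seen.items() if len(values) > 1}


# ZIP and city values copied from the query output above.
rows = [
    {"zip": "60153", "city": "MAYWOOD"},
    {"zip": "60153", "city": "Maywood"},
    {"zip": "60618", "city": "Chicago"},
    {"zip": "60618", "city": "CCHICAGO"},
]

raw = fd_violations(rows, "zip", "city")
folded = fd_violations(rows, "zip", "city", normalize=str.upper)
print(raw)
print(folded)
```

Comparing the two results separates differences in letter case (`MAYWOOD` versus `Maywood`), which disappear after normalization, from misspellings such as `CCHICAGO`, which remain and need an explicit correction.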
