# DineSafe Toronto: Exploratory Data Analysis (EDA)

This notebook explores restaurant inspection data from the City of Toronto's DineSafe program.

## Data Download Instructions

Before running this notebook, ensure the latest DineSafe dataset is available in `data/raw/`. To download the dataset:

```bash
python src/download_data.py
```

This will:
* Fetch metadata for the package from Toronto's Open Data portal
* Automatically find the latest available resources
* Save the CSV with a timestamped filename to `data/raw/`

Once the data is saved, this notebook will automatically detect and load the most recent file.
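The script itself is not shown in this notebook, so the following is only a minimal sketch of the "find the latest resource, save with a timestamped filename" steps described above. The helper names (`pick_latest_csv_resource`, `timestamped_filename`) are hypothetical, and the `format` and `last_modified` keys are an assumption about the shape of the portal's resource metadata. Only the filename pattern is taken from the file this notebook actually loads.

```python
from datetime import datetime


def pick_latest_csv_resource(resources):
    """Return the most recently modified CSV entry from a list of resource dicts.

    Hypothetical helper: assumes each dict has "format" and "last_modified" keys.
    """
    csv_resources = [r for r in resources if str(r.get("format", "")).upper() == "CSV"]
    if not csv_resources:
        raise ValueError("No CSV resources found in package metadata")
    # ISO 8601 timestamps in one consistent format compare correctly as strings
    return max(csv_resources, key=lambda r: r.get("last_modified") or "")


def timestamped_filename(now, prefix="dinesafe"):
    """Build a name such as dinesafe_20250606_120907.csv."""
    return f"{prefix}_{now:%Y%m%d_%H%M%S}.csv"


# Toy metadata for illustration only (not real portal output)
resources = [
    {"name": "dinesafe.csv", "format": "CSV", "last_modified": "2025-06-05T10:00:00"},
    {"name": "dinesafe.json", "format": "JSON", "last_modified": "2025-06-06T12:00:00"},
    {"name": "dinesafe-new.csv", "format": "CSV", "last_modified": "2025-06-06T11:30:00"},
]

latest = pick_latest_csv_resource(resources)
filename = timestamped_filename(datetime(2025, 6, 6, 12, 9, 7))
print(latest["name"], filename)
```

Keeping the selection logic separate from the network request makes it easy to check against toy metadata like this.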

## Load the latest raw DineSafe CSV Data


(generated using the `download_data.py` script)

In [14]:
import pandas as pd
from pathlib import Path

PROJECT_ROOT = Path.cwd().parent
RAW_DIR = PROJECT_ROOT / "data" / "raw"

csv_files = list(RAW_DIR.glob("dinesafe_*.csv")) # finds all files matching this pattern

if not csv_files:
    raise FileNotFoundError(f"No raw DineSafe CSV files found in {RAW_DIR.resolve()}") # .resolve() shows the absolute path

latest_file = max(csv_files, key=lambda f: f.stat().st_mtime) # pick the file with the most recent modification time

print(f"Loading {latest_file.name}")
df = pd.read_csv(latest_file)

Loading dinesafe_20250606_120907.csv
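Modification times can change when files are copied or re-checked-out, so one alternative is to rank files by the timestamp embedded in their names. This is a sketch only: the helper `timestamp_from_name` is hypothetical, and it assumes every file follows the `dinesafe_YYYYMMDD_HHMMSS.csv` pattern seen in the output above.

```python
from datetime import datetime
from pathlib import Path


def timestamp_from_name(path):
    """Parse the timestamp out of a name like dinesafe_20250606_120907.csv."""
    stamp = path.stem.split("_", 1)[1]  # -> "20250606_120907"
    return datetime.strptime(stamp, "%Y%m%d_%H%M%S")


# Toy paths for illustration; in the notebook these would come from RAW_DIR.glob(...)
candidates = [
    Path("dinesafe_20250101_090000.csv"),
    Path("dinesafe_20250606_120907.csv"),
    Path("dinesafe_20241231_235959.csv"),
]

latest_by_name = max(candidates, key=timestamp_from_name)
print(latest_by_name.name)
```

A file whose name does not match the pattern raises an error here (`IndexError` or `ValueError`, depending on the name), which surfaces unexpected files in `data/raw/` instead of silently ignoring them.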


# Data Cleaning

We'll first inspect the data for any:
* Wrong data types
* Missing values
* Inconsistent category labels
* Duplicate rows

and then perform the necessary actions on the data.

## Initial Data Inspection

In [40]:
# Count the number of non-null values in each column and show each column's data type

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 129695 entries, 0 to 129694
Data columns (total 17 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   _id                        129695 non-null  int64  
 1   Establishment ID           129695 non-null  int64  
 2   Inspection ID              127150 non-null  float64
 3   Establishment Name         129695 non-null  object 
 4   Establishment Type         129695 non-null  object 
 5   Establishment Address      129695 non-null  object 
 6   Establishment Status       129695 non-null  object 
 7   Min. Inspections Per Year  129695 non-null  object 
 8   Infraction Details         80635 non-null   object 
 9   Inspection Date            127150 non-null  object 
 10  Severity                   80635 non-null   object 
 11  Action                     80635 non-null   object 
 12  Outcome                    425 non-null     object 
 13  Amount Fined               29

In [None]:
# Counts the null values in each column, sorted in descending order

df.isnull().sum().sort_values(ascending=False)

Amount Fined                 129398
Outcome                      129270
Infraction Details            49060
Action                        49060
Severity                      49060
Inspection Date                2545
Inspection ID                  2545
Longitude                         0
Latitude                          0
_id                               0
Establishment ID                  0
Min. Inspections Per Year         0
Establishment Status              0
Establishment Address             0
Establishment Type                0
Establishment Name                0
unique_id                         0
dtype: int64
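Counts and percentages can also be viewed side by side in a single table. The sketch below uses a small toy frame with made-up values rather than the real dataset. The names `toy` and `null_summary` are illustrative, and in the notebook `toy` would be replaced by `df`.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "Establishment ID": [1, 2, 3, 4],
    "Severity": ["Minor", None, None, "Significant"],
    "Amount Fined": [None, None, None, 50.0],
})

null_summary = (
    pd.DataFrame({
        "n_null": toy.isnull().sum(),           # number of missing values per column
        "pct_null": toy.isnull().mean() * 100,  # share of missing values, in percent
    })
    .sort_values("n_null", ascending=False)
)
print(null_summary)
```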

In [None]:
# Calculates the fraction of the `Amount Fined` column that's null

df['Amount Fined'].isnull().sum()/len(df)

np.float64(0.9977100119511161)

In [66]:
# Calculates the fraction of the `Outcome` column that's null (rounded to 4 decimal places)

round(df['Outcome'].isnull().mean(), 4)

np.float64(0.9967)

In [62]:
# Calculates the fraction of the 'Infraction Details' column that's null

df['Infraction Details'].isnull().mean()

np.float64(0.3782720999267512)

In [64]:
# Counts the different values we have for the 'Outcome' column

df.groupby('Outcome').Outcome.count()

Outcome
Cancelled                           17
Charges Withdrawn                   20
Conviction - Fined                 261
Conviction - Suspended Sentence      1
Pending                            126
Name: Outcome, dtype: int64

In [36]:
# Counts any duplicate rows

df.duplicated().sum()

np.int64(0)

### Key Findings
- No duplicate rows found
- Key columns such as `Latitude`, `Longitude` (useful for mapping), `_id`, and `Establishment ID` have no missing values
- `Amount Fined`: 99.8% missing (129,398 of 129,695 rows), likely only populated when a fine is issued
- `Outcome`: 99.7% missing (129,270 of 129,695 rows), with 5 distinct values among the 425 populated rows:
    - Most frequent: **Conviction - Fined** (261 cases)
    - Others: Cancelled, Charges Withdrawn, Conviction - Suspended Sentence, Pending
- Consider moving the two sparsely populated columns (`Amount Fined` and `Outcome`) into their own feature group
- `Inspection Date` is stored as `object` (strings), not as a datetime type
- `Inspection ID` is stored as a float because the column has 2,545 missing values; a nullable integer type is a better fit
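The float dtype on `Inspection ID` is what pandas produces when an integer column contains missing values, because `NaN` is itself a float. Here is a small sketch with toy values (not rows from the dataset) showing that effect, along with the nullable `Int64` dtype and datetime parsing:

```python
import pandas as pd

# Toy values for illustration; the missing entry forces a float64 column
ids = pd.Series([101, 102, None])
print(ids.dtype)

# The nullable integer dtype keeps whole numbers and marks the gap as <NA>
nullable_ids = ids.astype("Int64")
print(nullable_ids.dtype)

# Date strings parse to a datetime column; missing entries become NaT
dates = pd.to_datetime(pd.Series(["2025-06-06", None]))
print(dates.isna().sum())
```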

## Cleaning Plan

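- Convert columns to the correct types (`Inspection Date` to datetime, `Inspection ID` to a nullable integer)
- Decide how to handle missing values (impute only where it makes sense)
- Normalize text labels such as `Establishment Type`
- Possibly drop the sparse columns (`Outcome`, `Amount Fined`), which suit a targeted legal analysis better than general EDA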
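One way to check whether label normalization is needed is to compare the number of distinct labels before and after cleaning. The labels below are toy values chosen to illustrate the idea, not values observed in the dataset.

```python
import pandas as pd

# Toy labels with inconsistent case and stray whitespace
types = pd.Series(["Restaurant", "restaurant ", "Food Take Out", "FOOD TAKE OUT", "Bakery"])

normalized = types.str.strip().str.title()

raw_labels = types.nunique()         # distinct labels before cleaning
clean_labels = normalized.nunique()  # distinct labels after cleaning
print(raw_labels, clean_labels)
print(normalized.value_counts())
```

If the two counts match on the real column, normalization changes nothing and is harmless. Note that `str.title()` also capitalizes the letter after an apostrophe (for example `farmer's` becomes `Farmer'S`), so it is worth checking the results by eye when labels contain apostrophes.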

## Cleaning Operations

We now apply the changes based on our inspection.

In [69]:
# Convert types
df['Inspection Date'] = pd.to_datetime(df['Inspection Date'])
df['Inspection ID'] = df['Inspection ID'].astype('Int64')

# Normalize text
df['Establishment Type'] = df['Establishment Type'].str.strip().str.title()

# Sparse columns: keeping `Outcome` and `Amount Fined` for now,
# but excluding them from visualizations and analysis
# df = df.drop(columns=['Outcome', 'Amount Fined'])

# Check
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 129695 entries, 0 to 129694
Data columns (total 17 columns):
 #   Column                     Non-Null Count   Dtype         
---  ------                     --------------   -----         
 0   _id                        129695 non-null  int64         
 1   Establishment ID           129695 non-null  int64         
 2   Inspection ID              127150 non-null  Int64         
 3   Establishment Name         129695 non-null  object        
 4   Establishment Type         129695 non-null  object        
 5   Establishment Address      129695 non-null  object        
 6   Establishment Status       129695 non-null  object        
 7   Min. Inspections Per Year  129695 non-null  object        
 8   Infraction Details         80635 non-null   object        
 9   Inspection Date            127150 non-null  datetime64[ns]
 10  Severity                   80635 non-null   object        
 11  Action                     80635 non-null   object  