# DineSafe Toronto: Feature Engineering

This notebook derives new columns and aggregations to help answer deeper questions:
- Which establishments are higher risk? (risk scores)
- Do some establishments repeatedly violate safety codes? 
- Which establishments have crucial infractions? (binary flags)
- How do violations vary over time, geography, and establishment type?
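To make the kinds of features listed above concrete, here is a minimal sketch on a toy frame. The column names follow the DineSafe export, but the severity labels, the `weights` mapping, and the derived column names (`Is Crucial`, `Severity Weight`) are assumptions made for illustration, not the scoring used later in this notebook.

```python
import pandas as pd

# Toy data: column names mirror the DineSafe export; severity labels are assumed.
toy = pd.DataFrame({
    "Establishment ID": [1, 1, 2, 3, 3, 3],
    "Severity": ["M - Minor", "C - Crucial", None,
                 "S - Significant", "M - Minor", "M - Minor"],
})

# Binary flag: does this row record a crucial infraction?
toy["Is Crucial"] = toy["Severity"].fillna("").str.startswith("C")

# Hypothetical weights for a simple additive risk score (rows with no infraction score 0).
weights = {"M - Minor": 1, "S - Significant": 2, "C - Crucial": 3}
toy["Severity Weight"] = toy["Severity"].map(weights).fillna(0)

# One row per establishment: infraction count, crucial flag, summed risk score.
per_establishment = toy.groupby("Establishment ID").agg(
    infraction_count=("Severity", "count"),
    has_crucial=("Is Crucial", "any"),
    risk_score=("Severity Weight", "sum"),
)
print(per_establishment)
```

Counting non-null `Severity` values per establishment is one simple way to spot repeat violators; a real score would likely also account for how recent each inspection was.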

## Load the Latest Processed DineSafe CSV Data

In [14]:
import pandas as pd
from pathlib import Path

PROJECT_ROOT = Path.cwd().parent
PROCESSED_DIR = PROJECT_ROOT / "data" / "processed"

csv_files = list(PROCESSED_DIR.glob("dinesafe_*.csv"))

if not csv_files:
    raise FileNotFoundError(f"No processed DineSafe CSV files found in {PROCESSED_DIR.resolve()}")

latest_file = max(csv_files, key=lambda f: f.stat().st_mtime)

print(f"Loading {latest_file.name}")
df = pd.read_csv(latest_file)

Loading dinesafe_20250606_120907.csv
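Selecting by modification time can pick the wrong file if files are copied or checked out again, since that resets `st_mtime`. A possible alternative, sketched below, is to sort on the timestamp embedded in the file name. It assumes every processed file follows the `dinesafe_YYYYMMDD_HHMMSS.csv` pattern seen in the output above; `timestamp_from_name` is a hypothetical helper, and the first file name in `candidates` is made up for the example.

```python
from datetime import datetime
from pathlib import Path


def timestamp_from_name(path: Path) -> datetime:
    """Parse the YYYYMMDD_HHMMSS stamp from a name like dinesafe_20250606_120907.csv."""
    stamp = path.stem.removeprefix("dinesafe_")  # requires Python 3.9+
    return datetime.strptime(stamp, "%Y%m%d_%H%M%S")


# Stand-in for list(PROCESSED_DIR.glob("dinesafe_*.csv")).
candidates = [
    Path("dinesafe_20250101_090000.csv"),  # hypothetical older export
    Path("dinesafe_20250606_120907.csv"),
]

latest = max(candidates, key=timestamp_from_name)
print(f"Loading {latest.name}")
```

A file whose name does not match the pattern raises `ValueError` from `strptime`, which surfaces stray files early instead of silently loading them.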


In [15]:
# Parse dates and cast IDs to the nullable integer dtype, since both columns contain missing values
df['Inspection Date'] = pd.to_datetime(df['Inspection Date'])
df['Inspection ID'] = df['Inspection ID'].astype('Int64')
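The capitalized `'Int64'` matters here: it is pandas' nullable integer dtype, which can hold missing values, whereas NumPy's `int64` cannot. The toy series below is invented to illustrate the difference and is not taken from the DineSafe data.

```python
import pandas as pd

# A missing value forces an integer-like column to float64 when loaded.
ids = pd.Series([101.0, None, 103.0])
print(ids.dtype)

# The nullable dtype keeps whole numbers as integers and marks the gap as <NA>.
nullable = ids.astype("Int64")
print(nullable)

# Casting to plain int64 fails because NaN has no integer representation.
try:
    ids.astype("int64")
    cast_failed = False
except ValueError:
    cast_failed = True
print(cast_failed)
```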

## Create Time-Based Features

In [16]:
df["Inspection Month"] = df["Inspection Date"].dt.month

In [17]:
df["Inspection Year"] = df["Inspection Date"].dt.year
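Because some rows have no inspection date, `.dt.month` and `.dt.year` return floats with `NaN` for those rows. The sketch below shows one way to keep the features as integers and to derive a couple of additional time features. The toy dates and the `Inspection Quarter` and `Inspection Day Of Week` columns are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd

# Toy dates, including one missing value (NaT) like the real column.
dates = pd.to_datetime(pd.Series(["2025-06-06", None, "2024-12-31"]))

features = pd.DataFrame({
    "Inspection Date": dates,
    # Cast to nullable Int64 so missing dates become <NA> instead of forcing floats.
    "Inspection Month": dates.dt.month.astype("Int64"),
    "Inspection Year": dates.dt.year.astype("Int64"),
    # Hypothetical extra features for seasonal and weekday analysis.
    "Inspection Quarter": dates.dt.quarter.astype("Int64"),
    "Inspection Day Of Week": dates.dt.day_name(),
})
print(features.dtypes)
```

Keeping these columns as integers makes later group-bys and plot labels cleaner, since months show up as `6` rather than `6.0`.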

In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 129695 entries, 0 to 129694
Data columns (total 19 columns):
 #   Column                     Non-Null Count   Dtype         
---  ------                     --------------   -----         
 0   _id                        129695 non-null  int64         
 1   Establishment ID           129695 non-null  int64         
 2   Inspection ID              127150 non-null  Int64         
 3   Establishment Name         129695 non-null  object        
 4   Establishment Type         129695 non-null  object        
 5   Establishment Address      129695 non-null  object        
 6   Establishment Status       129695 non-null  object        
 7   Min. Inspections Per Year  129469 non-null  float64       
 8   Infraction Details         80635 non-null   object        
 9   Inspection Date            127150 non-null  datetime64[ns]
 10  Severity                   80635 non-null   object        
 11  Action                     80635 non-null   object  