# DineSafe Toronto: Feature Engineering

This notebook creates new columns and aggregations that help answer deeper questions:
- Which establishments are higher risk? (risk scores)
- Do some establishments repeatedly violate safety codes? 
- Which establishments have crucial infractions? (binary flags)
- How do violations vary over time, geography, and establishment type?
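
As a rough illustration of the first three questions, the sketch below builds a binary flag and a per-establishment count on toy data. The column names `Establishment ID` and `Severity`, the severity labels, and the new column names `Is Crucial` and `Infraction Count` are assumptions for illustration; check them against the actual CSV before reusing this.

```python
import pandas as pd

# Toy data: column names and severity labels are assumptions, not taken from the real file
toy = pd.DataFrame({
    "Establishment ID": [1, 1, 2, 3, 3, 3],
    "Severity": ["M - Minor", "C - Crucial", None,
                 "S - Significant", "M - Minor", "M - Minor"],
})

# Binary flag: 1 if the row records a crucial infraction, else 0
toy["Is Crucial"] = toy["Severity"].fillna("").str.startswith("C").astype(int)

# Repeat violations: number of non-null infractions recorded per establishment
toy["Infraction Count"] = (
    toy.groupby("Establishment ID")["Severity"].transform("count")
)

print(toy)
```

Using `transform` keeps the result aligned with the original rows, so the count can sit next to each inspection record instead of living in a separate table.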

## Load the latest raw DineSafe CSV Data

In [1]:
import pandas as pd
from pathlib import Path

PROJECT_ROOT = Path.cwd().parent
RAW_DIR = PROJECT_ROOT / "data" / "raw"

csv_files = list(RAW_DIR.glob("dinesafe_*.csv")) # finds all files matching this pattern

if not csv_files:
    raise FileNotFoundError(f"No raw DineSafe CSV files found in {RAW_DIR.resolve()}") # .resolve() shows the absolute path

latest_file = max(csv_files, key=lambda f: f.stat().st_mtime) # pick the most recently modified file

print(f"Loading {latest_file.name}")
df = pd.read_csv(latest_file)

Loading dinesafe_20250606_120907.csv
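
Modification time can change when files are copied or re-synced, so an alternative is to pick the latest file by the timestamp embedded in its name. The sketch below assumes every file follows the `dinesafe_YYYYMMDD_HHMMSS.csv` pattern seen in the output above; the helper `timestamp_from_name` and the first example filename are hypothetical.

```python
from datetime import datetime
from pathlib import Path


def timestamp_from_name(path: Path) -> datetime:
    """Parse the YYYYMMDD_HHMMSS stamp from a name like dinesafe_20250606_120907.csv."""
    stamp = path.stem.removeprefix("dinesafe_")  # requires Python 3.9+
    return datetime.strptime(stamp, "%Y%m%d_%H%M%S")


# Example filenames; only the second one appears in this notebook's output
names = [
    Path("dinesafe_20250101_080000.csv"),
    Path("dinesafe_20250606_120907.csv"),
]

# Pick the file whose embedded timestamp is the most recent
latest = max(names, key=timestamp_from_name)
print(latest.name)
```

A filename that does not match the pattern raises `ValueError` in `strptime`, which surfaces unexpected files early rather than silently ranking them.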


## Create Time-Based Features

In [3]:
df["Inspection Month"] = pd.to_datetime(df["Inspection Date"]).dt.month # month number, 1-12

In [4]:
df["Inspection Year"] = pd.to_datetime(df["Inspection Date"]).dt.year # four-digit calendar year
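
Both cells above parse `Inspection Date` separately. A minimal sketch of parsing it once and reusing the result is shown below on toy data. The ISO date strings and the use of `errors="coerce"` are assumptions about how the real column looks and how unparseable values should be handled.

```python
import pandas as pd

# Toy data: the date format and the missing value are assumptions for illustration
toy = pd.DataFrame({"Inspection Date": ["2024-11-30", "2025-01-15", None]})

# Parse once; values that cannot be parsed (or are missing) become NaT
dates = pd.to_datetime(toy["Inspection Date"], errors="coerce")

# Derive both features from the same parsed Series
toy["Inspection Month"] = dates.dt.month
toy["Inspection Year"] = dates.dt.year

print(toy)
```

When any date is missing, the derived columns come back as floats with `NaN`, so a later cast to integer needs those rows handled first.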