# Dataset Pipeline — Data Preparation Only
This notebook prepares two original datasets:

- **NYC Yellow Taxi Trips (2023)** via NYC Open Data API (`4b4i-vvec.json`) — predict high_tip (≥20%).
- **Chicago Food Inspections** via Chicago Open Data API (`4ijn-s7e5.json`) — predict pass_binary (Pass vs Not Pass).

Includes: API downloads, cleaning, feature engineering, EDA with matplotlib, dashboards, and Markdown summaries.


In [None]:
import sys, platform, random
import numpy as np, pandas as pd, matplotlib.pyplot as plt, requests
np.random.seed(42); random.seed(42)
print({'python': sys.version.split()[0], 'platform': platform.platform()})

In [None]:
from pathlib import Path
Path("data").mkdir(exist_ok=True)
TLC_BASE = "https://data.cityofnewyork.us/resource/4b4i-vvec.json"
params = {"$limit": 5000}  # cap this pull at 5,000 rows
resp = requests.get(TLC_BASE, params=params, timeout=60)
resp.raise_for_status()  # fail fast on HTTP errors instead of parsing an error payload
rows = resp.json()
taxi_raw = pd.DataFrame(rows)
print("Taxi rows:", taxi_raw.shape)
taxi_raw.to_csv("data/taxi_raw_2023.csv", index=False)
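
The download above takes a single page of 5,000 rows. If more rows are needed, one option is to page through the endpoint with the SODA `$limit`, `$offset`, and `$order` parameters. The sketch below is only an illustration: `fetch_page` and `fetch_all` are hypothetical helper names, not part of this notebook, and the page size is an arbitrary choice.

```python
from typing import Callable, List


def fetch_page(base_url: str, limit: int, offset: int, timeout: int = 60) -> List[dict]:
    """Fetch one page of rows from a Socrata JSON endpoint (hypothetical helper)."""
    import requests  # imported here so the paging logic below has no network dependency

    # A stable sort order keeps pages from overlapping between requests.
    params = {"$limit": limit, "$offset": offset, "$order": ":id"}
    resp = requests.get(base_url, params=params, timeout=timeout)
    resp.raise_for_status()
    return resp.json()


def fetch_all(get_page: Callable[[int, int], List[dict]], page_size: int, max_rows: int) -> List[dict]:
    """Collect pages until max_rows is reached or a short page signals the end."""
    rows: List[dict] = []
    while len(rows) < max_rows:
        limit = min(page_size, max_rows - len(rows))
        page = get_page(limit, len(rows))  # offset = number of rows collected so far
        rows.extend(page)
        if len(page) < limit:  # fewer rows than requested: no more data
            break
    return rows
```

A call such as `fetch_all(lambda limit, offset: fetch_page(TLC_BASE, limit, offset), page_size=1000, max_rows=5000)` would return the same kind of list of dicts that `pd.DataFrame(rows)` consumes above.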

In [None]:
taxi = taxi_raw.copy()
# Parse timestamps and numeric fields; unparseable values become NaT/NaN.
for c in ["tpep_pickup_datetime","tpep_dropoff_datetime"]:
    taxi[c] = pd.to_datetime(taxi[c], errors="coerce")
num_cols = ["passenger_count","trip_distance","fare_amount","tip_amount","total_amount"]
for c in num_cols:
    taxi[c] = pd.to_numeric(taxi[c], errors="coerce")
taxi["duration_min"] = (taxi["tpep_dropoff_datetime"] - taxi["tpep_pickup_datetime"]).dt.total_seconds()/60
# Keep plausible trips: 0.1 to 50 miles, at least one minute, and a positive fare
# (a zero or negative fare would make the tip ratio meaningless).
taxi = taxi[(taxi["trip_distance"]>0.1)&(taxi["trip_distance"]<50)&(taxi["duration_min"]>=1)&(taxi["fare_amount"]>0)]
# payment_type "1" is credit card; cash tips are not recorded in the TLC data.
taxi = taxi[taxi["payment_type"]=="1"].copy()
taxi["speed_mph"] = taxi["trip_distance"]/(taxi["duration_min"]/60)
taxi["tip_percent"] = taxi["tip_amount"]/taxi["fare_amount"]  # fraction of fare: 0.2 means 20%
taxi["hour"] = taxi["tpep_pickup_datetime"].dt.hour
taxi["is_weekend"] = taxi["tpep_pickup_datetime"].dt.dayofweek.isin([5,6]).astype(int)
taxi["high_tip"] = (taxi["tip_percent"]>=0.2).astype(int)
print("Taxi cleaned:", taxi.shape)
taxi.head()
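
To make the `high_tip` rule concrete, here is a small worked example on made-up trips (the values are illustrative, not rows from the TLC data). It assumes the tip ratio is taken against `fare_amount` and that non-positive fares are dropped before dividing.

```python
import pandas as pd

# Toy trips with invented values; the third row has a zero fare.
toy = pd.DataFrame({
    "fare_amount": [10.0, 20.0, 0.0, 8.0],
    "tip_amount": [2.0, 3.0, 1.0, 0.0],
})

# Drop non-positive fares first so the ratio is always well defined.
valid = toy[toy["fare_amount"] > 0].copy()
valid["tip_percent"] = valid["tip_amount"] / valid["fare_amount"]
valid["high_tip"] = (valid["tip_percent"] >= 0.20).astype(int)

print(valid)
print("Share of high-tip trips:", valid["high_tip"].mean())
```

Only the first toy trip (a 2.00 tip on a 10.00 fare) reaches the 20% threshold. Checking the share of positives this way is a quick guard against a badly imbalanced target.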

In [None]:
fig, axes = plt.subplots(1,3,figsize=(12,3))
axes[0].hist(taxi["trip_distance"].dropna(), bins=40); axes[0].set_title("Trip Distance (miles)")
axes[1].hist(taxi["duration_min"].dropna(), bins=40); axes[1].set_title("Trip Duration (minutes)")
axes[2].hist(taxi["tip_percent"].dropna().clip(0,1), bins=40); axes[2].set_title("Tip % (clipped)")
plt.tight_layout(); plt.show()

In [None]:
CHI_BASE = "https://data.cityofchicago.org/resource/4ijn-s7e5.json"
params = {"$limit": 5000}  # cap this pull at 5,000 rows
resp = requests.get(CHI_BASE, params=params, timeout=60)
resp.raise_for_status()  # fail fast on HTTP errors instead of parsing an error payload
rows = resp.json()
chi_raw = pd.DataFrame(rows)
print("Chicago rows:", chi_raw.shape)
chi_raw.to_csv("data/chi_food_raw.csv", index=False)

In [None]:
chi = chi_raw.copy()
chi["inspection_date"] = pd.to_datetime(chi["inspection_date"], errors="coerce")
# Only an exact "Pass" maps to 1; "Pass w/ Conditions", "Fail", and other outcomes map to 0.
chi["pass_binary"] = (chi["results"].astype(str).str.lower()=="pass").astype(int)
# Assumes violations are separated by "|"; counting entries avoids picking up
# unrelated numbers that appear inside the inspector comments.
chi["violation_count"] = chi["violations"].fillna("").apply(lambda s: sum(1 for v in str(s).split("|") if v.strip()))
chi["viol_text_len"] = chi["violations"].fillna("").str.len()
chi["month"] = chi["inspection_date"].dt.month
# Risk labels outside the three mapped values become NaN.
chi["risk_level"] = chi["risk"].astype(str).str.lower().map({"risk 1 (high)":1,"risk 2 (medium)":2,"risk 3 (low)":3})
print("Chicago cleaned:", chi.shape)
chi.head()
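
Counting violations depends on how the `violations` text is laid out. The sketch below assumes each entry starts with its violation number followed by a period and that entries are separated by `|`. `violation_numbers` is a hypothetical helper and the sample string is invented for illustration. Under that assumption, a pattern that matches any multi-digit number would also count numbers that appear inside comments.

```python
import re
from typing import List


def violation_numbers(text) -> List[int]:
    """Return the leading violation number of each pipe-separated entry."""
    if not isinstance(text, str) or not text.strip():
        return []  # missing or empty field: no violations recorded
    numbers = []
    for entry in text.split("|"):
        match = re.match(r"\s*(\d+)\.", entry)  # number at the start of the entry
        if match:
            numbers.append(int(match.group(1)))
    return numbers


# Invented sample in the assumed layout; the comments contain extra numbers.
sample = (
    "3. EXAMPLE VIOLATION TEXT - Comments: OBSERVED 12 ITEMS. | "
    "38. ANOTHER EXAMPLE - Comments: SEE CODE 7-38-005."
)
print(violation_numbers(sample))  # [3, 38]
print(violation_numbers(None))    # []
```

`len(violation_numbers(s))` then gives a per-inspection count, and the list itself could feed indicator features for specific violation numbers.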

In [None]:
fig, axes = plt.subplots(1,2,figsize=(8,3))
axes[0].hist(chi["violation_count"], bins=30); axes[0].set_title("Violation Count")
axes[1].hist(chi["viol_text_len"], bins=30); axes[1].set_title("Violation Text Length")
plt.tight_layout(); plt.show()

## 📊 Data Preparation Summary

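- **NYC Taxi:** Target = high_tip (tip ≥ 20% of fare). Features: duration, speed, hour, weekend flag. Tip % defines the target, so keep it out of the model inputs; location freqs are not computed in the cells above.  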
- **Chicago Inspections:** Target = pass_binary. Features: violation count, text length, risk level, month.  
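
The location frequency features named in the taxi bullet could be built with a simple frequency encoding. This is a sketch under assumptions: the column name `pulocationid` is assumed rather than confirmed by the cells above, `add_frequency` is a hypothetical helper, and the zone IDs are toy values.

```python
import pandas as pd


def add_frequency(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Add `<col>_freq`: the share of rows that carry each value of `col`."""
    freq = df[col].value_counts(normalize=True)  # value -> relative frequency
    out = df.copy()
    out[f"{col}_freq"] = out[col].map(freq)
    return out


# Toy pickup zones; the column name is an assumption about the API schema.
toy = pd.DataFrame({"pulocationid": ["132", "132", "161", "237", "132", "161"]})
encoded = add_frequency(toy, "pulocationid")
print(encoded)
```

When the data is later split for modeling, computing the frequencies on the training portion only and mapping them onto the test portion avoids leaking test-set information into the feature.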

Both datasets are cleaned and explored. The raw API pulls are saved as CSVs under `data/`; the cleaned frames `taxi` and `chi` stay in memory.
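
The code cells above write only the raw pulls to disk. One way to persist the cleaned frames for a later modeling notebook is sketched here; `save_clean` and the `_clean.csv` naming are hypothetical choices, and the demo writes a toy frame to a temporary directory rather than to `data/`.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_clean(df: pd.DataFrame, name: str, out_dir: str = "data") -> Path:
    """Write a cleaned frame to `<out_dir>/<name>_clean.csv` and return the path."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / f"{name}_clean.csv"
    df.to_csv(path, index=False)
    return path


# Demo on a toy frame in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    demo = pd.DataFrame({"hour": [8, 17], "high_tip": [1, 0]})
    saved = save_clean(demo, "taxi", out_dir=tmp)
    reloaded = pd.read_csv(saved)
    print(saved.name, reloaded.shape)
```

In the notebook itself this would be called as `save_clean(taxi, "taxi")` and `save_clean(chi, "chi_food")` after the cleaning cells.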

### Slide Version
- NYC Taxi: predict high tips (≥20%) from trip features.  
- Chicago Inspections: predict Pass vs Not Pass from inspection records.  
