# Product Overview

This repository aims to provide tools to the NYC Department of Health and Mental Hygiene (DOHMH) that aid them in:
- Identifying areas of NYC that frequently generate tickets
- Forecasting future (1-week ahead) ticket flow
- Comparing predicted patterns with historical data

DOHMH was selected for several reasons:

- **Interesting problem domain:** The DOHMH dataset contains compelling issues such as rodent complaints and food safety inspections
- **Manageable data size:** With approximately 1M records, the dataset is large enough to be meaningful but small enough to process efficiently without big data infrastructure

## Technical Problem Statement

**Core question:** How many tickets will be opened over the coming week?

## Technical Deliverables

1. **Predictive Models** - Machine learning models to forecast ticket volume and characteristics
2. **Interactive Dashboard** - Visualization tool to explore predictions and run historical scenario analyses

## Business Value
DOHMH agents’ schedules are often set at the last minute because their work is request-based. A tool that provides well-informed estimates of future ticket demand can make a significant difference in planning and resource allocation. While point estimates are useful for technical users, ranges are generally more impactful. Therefore, this tool also provides quantile estimates to offer greater clarity and actionable insights.

---


## Original Model Architecture Plan

The initial vision for this project included **three types of predictive models** working in concert:

### 1. Forecast Model
Predicts the number of new requests expected on a daily basis (1-7 days ahead) by location and complaint type.

### 2. Severity/Triage Model
Assigns priority scores to tickets based on the probability of key risk factors:
- **Inspection requirement:** Whether the resolution requires an inspection
- **SLA breach:** Whether the ticket will exceed its service level agreement due date

### 3. Duration Model
Estimates how long it will take for a ticket to be closed once it's picked up (with censoring for tickets still in progress).

### Prioritization Strategy

The three models would work together to create a comprehensive resource allocation strategy:

1. **Risk-adjusted hours:** Combine severity probability with duration estimates: `Severity × Duration = Risk-adjusted hours`

2. **Spatial aggregation:** Summarize tickets at the H3 hexagon level to calculate total severity-weighted hours for each geographic cell

3. **Dynamic prioritization:** Take the number of available inspectors as input and show which H3 cells should be prioritized based on sorted severity-weighted workload
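
The three steps above can be sketched in a few lines. This is only an illustration of the planned logic, not code from the repository: the field names (`h3_cell`, `severity_prob`, `expected_hours`), the `rank_cells` helper, and the one-cell-per-inspector default are all assumptions.

```python
from collections import defaultdict

# Hypothetical per-ticket model outputs; field names are illustrative only.
tickets = [
    {"h3_cell": "cell_a", "severity_prob": 0.8, "expected_hours": 5.0},
    {"h3_cell": "cell_a", "severity_prob": 0.5, "expected_hours": 2.0},
    {"h3_cell": "cell_b", "severity_prob": 0.9, "expected_hours": 10.0},
    {"h3_cell": "cell_c", "severity_prob": 0.1, "expected_hours": 4.0},
]


def rank_cells(tickets, n_inspectors, cells_per_inspector=1):
    """Return the H3 cells to prioritize, highest severity-weighted workload first."""
    workload = defaultdict(float)
    for t in tickets:
        # Step 1: risk-adjusted hours = severity x duration
        # Step 2: sum the risk-adjusted hours per H3 cell
        workload[t["h3_cell"]] += t["severity_prob"] * t["expected_hours"]
    # Step 3: sort by workload and keep as many cells as the inspectors can cover
    ranked = sorted(workload.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[: n_inspectors * cells_per_inspector]


top_cells = rank_cells(tickets, n_inspectors=2)
```

With two inspectors, `top_cells` holds the two cells with the largest severity-weighted workload; the low-severity cell is left out.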

### Implementation Status

Due to time constraints and the complexity of these problems (particularly the severity/triage and duration models), **the current implementation focuses on the forecasting model**. The severity and duration models, while architecturally designed, remain future work.

---


## Environment Setup

The NYC Open Data API (Socrata) requires authentication credentials. This notebook assumes that `SOCRATA_APP_TOKEN`, `SOCRATA_API_KEY_ID`, and `SOCRATA_API_KEY_SECRET` are set in a `.env` file in the project root directory.

In [1]:
from dotenv import load_dotenv
load_dotenv()

True
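
Once `load_dotenv()` has run, a quick check like the one below can confirm that all three credentials are present before any fetch is attempted. The `missing_credentials` helper is hypothetical and not part of the repository; only the variable names come from this notebook.

```python
import os

REQUIRED_VARS = ("SOCRATA_APP_TOKEN", "SOCRATA_API_KEY_ID", "SOCRATA_API_KEY_SECRET")


def missing_credentials(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_credentials()
if missing:
    print("Missing Socrata credentials:", ", ".join(missing))
```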

### Load Packages

In [2]:
import os
import sys

PACKAGE_PATH = os.path.abspath(os.path.join(os.getcwd(), '..'))
sys.path.insert(0, PACKAGE_PATH)

from src import train



## Train

The training pipeline orchestrates the end-to-end process of building forecast models for NYC 311 service requests. This includes data fetching, preprocessing, feature engineering, and training multiple quantile regression models to provide both point estimates and uncertainty bounds.
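
To make the idea of a quantile model concrete, the sketch below implements the pinball (quantile) loss, the standard objective for quantile regression. LightGBM exposes this loss through `objective="quantile"` with an `alpha` parameter; whether this repository configures its models that way is an assumption, and the numbers are made up.

```python
def pinball_loss(y_true, y_pred, quantile):
    """Average pinball (quantile) loss for a single quantile level in (0, 1)."""
    total = 0.0
    for actual, predicted in zip(y_true, y_pred):
        error = actual - predicted
        # Under-prediction is weighted by q, over-prediction by (1 - q)
        total += quantile * error if error >= 0 else (quantile - 1) * error
    return total / len(y_true)


actual_counts = [0, 1, 2, 5]
flat_forecast = [1, 1, 1, 1]

loss_p10 = pinball_loss(actual_counts, flat_forecast, 0.10)
loss_p50 = pinball_loss(actual_counts, flat_forecast, 0.50)
loss_p90 = pinball_loss(actual_counts, flat_forecast, 0.90)
```

The same flat forecast is penalized most at the 90th percentile because it under-predicts the busiest cell. That asymmetry is what pushes an upper-bound model above a median model.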

### Pipeline Overview

The `train_models()` function executes the following steps:

1. **Data Fetching (Optional):**
   - NYC 311 Service Requests from Socrata Open Data API (DOHMH agency only)
   - ACS Census Population data (2013-2023) via `censusdata` package
   - NOAA Weather data (2010-2025) via NCEI/NOAA Climate Data API

2. **Data Preprocessing:**
   - Merges 311 requests with external data sources (weather, population)
   - Cleans and standardizes data types
   - Filters relevant date ranges and removes invalid records

3. **Feature Engineering:**
   - Creates temporal features: lags (1, 4 weeks), rolling averages (4, 12 weeks), momentum
   - Adds weather features: temperature, precipitation, heating/cooling degree days, 3-day/7-day rain
   - Includes spatial features: neighborhood rolling averages, log population
   - Builds categorical features: month, heat/freeze flags, COVID era flag
   - Constructs weekly forecast panel at the (week, H3 hex, complaint family) grain

4. **Model Training:**
   - Trains 4 separate LightGBM models, one for the mean and three for quantiles:
     - **Mean model:** Point estimate of expected requests
     - **50th percentile:** Median forecast
     - **10th percentile:** Lower uncertainty bound
     - **90th percentile:** Upper uncertainty bound
   - Each model predicts 1-week ahead service request counts
   - Uses optimal hyperparameters from prior tuning (stored in `model_optimal_params.json`)

5. **Model Persistence:**
   - Saves trained model bundles with timestamp (format: `YYYYMMDD_HHMMSS`)
   - Each bundle includes preprocessor and trained model
   - Stored in `models/{timestamp}/full_bundle/`
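
As an illustration of the temporal features in step 3, the sketch below builds lag and rolling-average features for a single (H3 hex, complaint family) series. The function and feature names are hypothetical, and momentum is left out because its definition is not given here. The point being shown is that each feature uses only weeks before the target week.

```python
def add_temporal_features(weekly_counts):
    """Build lag and rolling-mean features for one (hex, complaint family) series.

    `weekly_counts` is a list of counts ordered by week. Every feature for
    week t uses only weeks before t, so the target never leaks into its inputs.
    """
    rows = []
    for t, target in enumerate(weekly_counts):
        history = weekly_counts[:t]
        rows.append({
            "target": target,
            "lag_1": history[-1] if len(history) >= 1 else None,
            "lag_4": history[-4] if len(history) >= 4 else None,
            "roll_mean_4": sum(history[-4:]) / 4 if len(history) >= 4 else None,
            "roll_mean_12": sum(history[-12:]) / 12 if len(history) >= 12 else None,
        })
    return rows


counts = [3, 0, 2, 5, 4, 1]
features = add_temporal_features(counts)
```

Rows without enough history get `None`, which a gradient-boosting library can treat as a missing value once converted to NaN.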

### Parameters

- **`run_fetch`** (bool, default=False): 
  - If `True`, fetches fresh data from external APIs (311, ACS, NOAA)
  - If `False`, uses existing local data files
  - Note: API credentials required via environment variables

- **`save_data`** (bool, default=False): 
  - If `True`, saves preprocessed data and feature-engineered datasets to S3 bucket
  - If `False`, only processes data in-memory
  - Note: S3 bucket (`s3://hbc-technical-assessment-gk/`) requires AWS credentials with write access

- **`save_models`** (bool, default=False): 
  - If `True`, persists trained model bundles to local disk
  - If `False`, models exist only in-memory during the session
  - Recommended: `True` for production runs

### Recommended Settings

For typical retraining runs (using existing data, saving new models):

```python
train.train_models(run_fetch=False, save_data=False, save_models=True)
```

**Rationale:**
- **`run_fetch=False`**: Data fetching is time-intensive (15-30 minutes) and should be done separately when fresh data is needed
- **`save_data=False`**: S3 bucket write access is restricted; local data files are sufficient for development/retraining
- **`save_models=True`**: Persisting models enables deployment to Streamlit app and future inference

### Continuous Retraining Strategy

This pipeline is designed to support **automated continuous retraining** for production use:

#### Scheduled Retraining Workflow

1. **Weekly/Monthly Schedule:**
   - Set up a cron job or orchestration tool (Airflow, Prefect, GitHub Actions) to run the pipeline periodically
   - Recommended cadence: Weekly (Monday mornings) to incorporate the latest week's closed tickets

2. **Full Pipeline Run:**
   ```python
   # Fetch latest data, retrain models, save everything
   train.train_models(run_fetch=True, save_data=True, save_models=True)
   ```

3. **Model Versioning:**
   - Each training run creates a timestamped model directory
   - Update `config.MODEL_TIMESTAMP` to point to the latest version

4. **Validation & Deployment:**
   - Run automated tests on new models (accuracy thresholds, prediction ranges)
   - Compare performance metrics against previous version

This approach ensures models stay current with evolving patterns (seasonality, COVID transitions, demographic changes) without manual intervention.
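
For the model versioning step, one possible way to find the newest run is to parse the timestamped directory names. The helper below is a hypothetical sketch; how `config.MODEL_TIMESTAMP` is actually read or updated in the repository is not shown in this notebook.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional

TIMESTAMP_FORMAT = "%Y%m%d_%H%M%S"  # matches the YYYYMMDD_HHMMSS directory names


def latest_model_timestamp(models_dir: Path) -> Optional[str]:
    """Return the newest timestamped subdirectory name, or None if there is none."""
    candidates = []
    for child in models_dir.iterdir():
        if not child.is_dir():
            continue
        try:
            parsed = datetime.strptime(child.name, TIMESTAMP_FORMAT)
        except ValueError:
            continue  # skip directories that are not training runs
        candidates.append((parsed, child.name))
    return max(candidates)[1] if candidates else None
```

Calling `latest_model_timestamp(Path("models"))` after a training run would return the value to assign to `config.MODEL_TIMESTAMP`, assuming the models are stored under `models/` as described above.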

In [3]:
# train.train_models(run_fetch = False, save_data = False, save_models = True)


Starting training pipeline...
Preprocessing data...
Loading DOHMH data...
Data Shape: (1029875, 27)
Preprocessing DOHMH data...
Data Shape: (929358, 44)
Merging census data...
Data Shape: (929358, 48)
Merging weather data...
Data Shape: (909090, 59)

Final Data Shape: (909090, 59)
Building forecast panel...
Training models...
Training mean model...
Training model for horizon 1
X shape pre-filtering: (547535, 29)
X shape post-filtering: (547535, 29)
Train dates [2009-12-29 00:00:00 to 2023-12-26 00:00:00], Test dates [2024-01-02 00:00:00 to 2025-07-29 00:00:00]
X training shape: (487161, 29)
X test shape: (60374, 29)
CV (neg_mean_absolute_error) scores: [-0.94076096 -0.6097193  -0.6403069  -0.6437102  -0.67708157 -0.7194397
 -0.7614965  -0.76723794 -0.75462297 -0.75051399 -0.71579421 -0.83070775
 -0.85251991 -0.838712  ]
CV mean: -0.7501874206680085
train metrics
  h=1: RMSE=1.149, MAE=0.722, Poisson Dev=1.091
test metrics
  h=1: RMSE=1.427, MAE=0.819, Poisson Dev=1.245

Training 90th p

## Streamlit App

The Streamlit app provides an interactive dashboard for exploring model performance and forecasting results. Users can visualize predictions versus actual service request patterns across different geographic areas and complaint types.

### Key Features

- **Weekly Analysis:** View predictions and actual service requests for any week in the past 52 weeks
- **Complaint Family Breakdown:** Explore 10 different complaint categories including:
  - Food Safety
  - Vector Control (rodents, mosquitoes, pigeons)
  - Housing Health
  - Animal Control
  - Air/Smoke/Mold
  - Hazmat/Lead/Asbestos
  - Childcare/Recreation
  - COVID-19 violations
  - Water Quality
  - Miscellaneous
- **Multiple Prediction Views:** Compare actual values against:
  - Mean predictions
  - 50th percentile (median) predictions
  - 10th and 90th percentile predictions (uncertainty bounds)
- **Geographic Visualization:** Interactive H3 hexagon maps showing spatial distribution of service requests
- **Performance Metrics:** Summary statistics, including total requests and hexagon-level aggregations
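
One way to read the 10th and 90th percentile views together is to check how often the actual count falls between them. The sketch below uses made-up numbers and a hypothetical helper; it is not a metric the app is known to report.

```python
def interval_coverage(actuals, lower, upper):
    """Share of observations that fall inside [lower, upper]."""
    inside = sum(1 for y, lo, hi in zip(actuals, lower, upper) if lo <= y <= hi)
    return inside / len(actuals)


# Made-up numbers for five (hex, complaint family) cells in one week
actual = [0, 2, 1, 7, 3]
p10 = [0, 1, 0, 2, 1]
p90 = [1, 4, 2, 5, 4]

coverage = interval_coverage(actual, p10, p90)
```

If the two quantile models are well calibrated, roughly 80% of actuals should fall inside the band over a large sample. Coverage well below that suggests the bounds are too narrow.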

### How to Run

```bash
streamlit run streamlit_app/app.py
```

The first time you run the app, it will take 10-15 seconds to load because it reads the data for the first time.

### Prerequisites

- Trained models must be available in the `models/` directory
- Preprocessed data (`streamlit_data.parquet`) must be in `streamlit_app/resources/`
- All required packages from `requirements.txt` must be installed
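
The first two prerequisites can be checked with a short preflight script run from the project root. This is a minimal sketch: the helper is hypothetical, and treating any subdirectory of `models/` as a trained model is an assumption.

```python
from pathlib import Path


def missing_prerequisites(project_root: Path) -> list:
    """Return a description of each prerequisite that is not in place."""
    problems = []
    models_dir = project_root / "models"
    # At least one subdirectory is taken as evidence of a saved training run
    if not models_dir.is_dir() or not any(p.is_dir() for p in models_dir.iterdir()):
        problems.append("no trained models found in models/")
    data_file = project_root / "streamlit_app" / "resources" / "streamlit_data.parquet"
    if not data_file.is_file():
        problems.append(f"missing {data_file.relative_to(project_root)}")
    return problems


for problem in missing_prerequisites(Path.cwd()):
    print("Prerequisite check:", problem)
```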
