# Step 5: Feature Engineering for NYC 311 Modeling

This notebook demonstrates the complete feature engineering pipeline for three modeling tracks:
1. **Forecast** - Time-series forecasting of ticket arrivals
2. **Triage** - Ticket prioritization at creation time
3. **Duration** - Survival modeling for time-to-close

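All features are **leakage-safe** and use **H3-based spatial grouping**.

The cells below cover the Forecast track only: they tune hyperparameters for the Poisson (mean) model and for the quantile models at α = 0.1, 0.5, and 0.9, then save the best parameters to JSON.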
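
As a rough illustration of what "leakage-safe" can mean for lag features, the sketch below builds `lag1` and `roll4` on a toy weekly panel so that each row only sees counts from earlier weeks in the same cell. The column names `week` and `y`, the toy data, and the exact definitions (one-week lag, trailing four-week mean) are assumptions for illustration; the actual logic in `src/features.py` may differ.

```python
import pandas as pd

# Toy weekly panel: two hypothetical hex cells, four weeks each (illustrative data).
panel = pd.DataFrame({
    "hex6": ["a"] * 4 + ["b"] * 4,
    "week": pd.to_datetime(
        ["2024-01-02", "2024-01-09", "2024-01-16", "2024-01-23"] * 2
    ),
    "y": [3, 5, 4, 6, 1, 0, 2, 1],
})
panel = panel.sort_values(["hex6", "week"]).reset_index(drop=True)

grouped = panel.groupby("hex6")["y"]

# lag1: the previous week's count within the same cell.
panel["lag1"] = grouped.shift(1)

# roll4: mean of up to four prior weeks. Shifting before rolling keeps the
# current week's target out of its own feature.
panel["roll4"] = grouped.transform(
    lambda s: s.shift(1).rolling(4, min_periods=1).mean()
)

print(panel)
```

The first week of each cell has no history, so both features are missing there; rows like these would typically be dropped or imputed before fitting.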


In [16]:
import json
import os
import sys
from datetime import datetime
from importlib import reload
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Add the parent directory to sys.path so that `src` can be imported from the notebook.
PACKAGE_PATH = os.path.abspath(os.path.join(os.getcwd(), '..'))
sys.path.insert(0, PACKAGE_PATH)

from src import config, features, forecast, preprocessing

pd.set_option('display.max_columns', 50)
sns.set_style('whitegrid')

## Usage Instructions

This notebook uses the feature engineering module from `src/features.py` and the tuning routine `forecast.tune` from `src.forecast`.

To run this notebook:
1. Ensure you have data in `data/landing/311-service-requests/`
2. Run `pip install -r requirements.txt` to install dependencies
3. Execute cells sequentially

For detailed documentation, see `src/FEATURE_ENGINEERING_README.md`.


## Load Data

In [3]:
forecast_panel = pd.read_parquet(os.path.join(config.PRESENTATION_DATA_PATH, 'model_fitting_data.parquet'))
df_streamlit = pd.read_parquet(os.path.join(config.PRESENTATION_DATA_PATH, 'streamlit_data.parquet'))


## Inputs

In [4]:
# Numerical features passed to forecast.tune
numerical_columns = [
    'lag1', 'lag4', 'roll4', 'roll12',
    'momentum', 'weeks_since_last',
    'tavg', 'prcp', 'heating_degree', 'cooling_degree',
    'rain_3d', 'rain_7d', 'log_pop', 'nbr_roll4', 'nbr_roll12',
]

# Categorical features passed to forecast.tune
categorical_columns = ['week_of_year', 'month', 'heat_flag', 'freeze_flag', 'hex6', 'complaint_family']
horizons = [1]

# Work on a copy so the loaded panel stays unchanged
df_input = forecast_panel.copy()
horizon = 1
input_columns = df_input.columns


## Hyperparameter Tuning

In [5]:
dict_params_mean = forecast.tune(
    df_input,
    horizon,
    input_columns,
    numerical_columns,
    categorical_columns,
)

Tuning forecast model for poisson (mean) with 30 trials...
X shape pre-filtering: (442520, 27)
X shape post-filtering: (425330, 27)


[I 2025-10-05 22:44:44,829] A new study created in memory with name: no-name-826ab2b9-d675-4b2f-8883-3fd5bbfafbc2


Train dates [2010-03-16 00:00:00 to 2023-12-26 00:00:00], Test dates [2024-01-02 00:00:00 to 2025-07-29 00:00:00]
X training shape: (392634, 27)
X test shape: (32696, 27)


[I 2025-10-05 22:44:49,684] Trial 0 finished with value: 1.0701026137583192 and parameters: {'learning_rate': 0.030710573677773714, 'num_leaves': 79, 'max_depth': 8, 'min_child_samples': 38, 'subsample': 0.6468055921327309, 'colsample_bytree': 0.6467983561008608}. Best is trial 0 with value: 1.0701026137583192.
[I 2025-10-05 22:44:55,024] Trial 1 finished with value: 1.0753129332585867 and parameters: {'learning_rate': 0.011900590783184251, 'num_leaves': 76, 'max_depth': 7, 'min_child_samples': 41, 'subsample': 0.6061753482887408, 'colsample_bytree': 0.8909729556485984}. Best is trial 0 with value: 1.0701026137583192.
[I 2025-10-05 22:44:57,938] Trial 2 finished with value: 1.0736037455818181 and parameters: {'learning_rate': 0.12106896936002161, 'num_leaves': 56, 'max_depth': 6, 'min_child_samples': 25, 'subsample': 0.6912726728878613, 'colsample_bytree': 0.7574269294896714}. Best is trial 0 with value: 1.0701026137583192.
[I 2025-10-05 22:45:01,755] Trial 3 finished with value: 1.070
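
The log above reports training dates that end on 2023-12-26 and test dates that start on 2024-01-02, which suggests a chronological split rather than a random one. A minimal sketch of such a split follows; the helper `time_split`, the column name `week`, and the toy data are hypothetical, and `forecast.tune` may implement the split differently.

```python
import pandas as pd

def time_split(df, date_col, cutoff):
    """Rows dated before `cutoff` go to train; rows on or after it go to test."""
    cutoff = pd.Timestamp(cutoff)
    train = df[df[date_col] < cutoff]
    test = df[df[date_col] >= cutoff]
    return train, test

# Eight consecutive weekly dates straddling the cutoff (illustrative data).
weeks = pd.date_range("2023-12-05", periods=8, freq="7D")
toy = pd.DataFrame({"week": weeks, "y": range(8)})

train, test = time_split(toy, "week", "2024-01-02")
print(train["week"].max(), test["week"].min())
```

Splitting on a date keeps every test row strictly later than every training row, so the evaluation mimics forecasting on unseen weeks.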

In [6]:
dict_params_90 = forecast.tune(
    df_input,
    horizon,
    input_columns,
    numerical_columns,
    categorical_columns,
    alpha = 0.9
)

Tuning forecast model for quantile (α=0.9) with 30 trials...
X shape pre-filtering: (442520, 27)
X shape post-filtering: (425330, 27)


[I 2025-10-05 22:47:01,178] A new study created in memory with name: no-name-7d2109da-da2c-4b3b-9b80-12489419626e


Train dates [2010-03-16 00:00:00 to 2023-12-26 00:00:00], Test dates [2024-01-02 00:00:00 to 2025-07-29 00:00:00]
X training shape: (392634, 27)
X test shape: (32696, 27)


[I 2025-10-05 22:47:07,274] Trial 0 finished with value: 0.20372893941406953 and parameters: {'learning_rate': 0.030710573677773714, 'num_leaves': 79, 'max_depth': 8, 'min_child_samples': 38, 'subsample': 0.6468055921327309, 'colsample_bytree': 0.6467983561008608}. Best is trial 0 with value: 0.20372893941406953.
[I 2025-10-05 22:47:12,954] Trial 1 finished with value: 0.20440326062157987 and parameters: {'learning_rate': 0.011900590783184251, 'num_leaves': 76, 'max_depth': 7, 'min_child_samples': 41, 'subsample': 0.6061753482887408, 'colsample_bytree': 0.8909729556485984}. Best is trial 0 with value: 0.20372893941406953.
[I 2025-10-05 22:47:17,859] Trial 2 finished with value: 0.20529805025981399 and parameters: {'learning_rate': 0.12106896936002161, 'num_leaves': 56, 'max_depth': 6, 'min_child_samples': 25, 'subsample': 0.6912726728878613, 'colsample_bytree': 0.7574269294896714}. Best is trial 0 with value: 0.20372893941406953.
[I 2025-10-05 22:47:23,091] Trial 3 finished with value:
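
Quantile models are commonly scored with the pinball (quantile) loss, which penalizes under-prediction by α and over-prediction by 1 − α. The sketch below shows that loss in NumPy; whether `forecast.tune` reports exactly this metric as the trial value is an assumption, and the function name `pinball_loss` and the toy inputs are hypothetical.

```python
import numpy as np

def pinball_loss(y_true, y_pred, alpha):
    """Mean pinball loss for quantile level `alpha` in (0, 1)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    diff = y_true - y_pred
    # Under-prediction (diff > 0) costs alpha * diff;
    # over-prediction (diff < 0) costs (1 - alpha) * |diff|.
    return float(np.mean(np.maximum(alpha * diff, (alpha - 1) * diff)))

y = [0, 2, 5, 1]          # illustrative weekly counts
pred = [1, 1, 1, 1]       # a constant forecast of one ticket per week

for alpha in (0.1, 0.5, 0.9):
    print(alpha, pinball_loss(y, pred, alpha))
```

Because the penalty is asymmetric, loss values at different α levels are not directly comparable with each other; each quantile model is best compared only against other trials at the same α.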

In [7]:
dict_params_50 = forecast.tune(
    df_input,
    horizon,
    input_columns,
    numerical_columns,
    categorical_columns,
    alpha = 0.5
)

Tuning forecast model for quantile (α=0.5) with 30 trials...


[I 2025-10-05 22:49:52,986] A new study created in memory with name: no-name-d1939878-4b78-4f81-9ba3-84c7d9c9d6ad


X shape pre-filtering: (442520, 27)
X shape post-filtering: (425330, 27)
Train dates [2010-03-16 00:00:00 to 2023-12-26 00:00:00], Test dates [2024-01-02 00:00:00 to 2025-07-29 00:00:00]
X training shape: (392634, 27)
X test shape: (32696, 27)


[I 2025-10-05 22:50:00,020] Trial 0 finished with value: 0.27948264260104966 and parameters: {'learning_rate': 0.030710573677773714, 'num_leaves': 79, 'max_depth': 8, 'min_child_samples': 38, 'subsample': 0.6468055921327309, 'colsample_bytree': 0.6467983561008608}. Best is trial 0 with value: 0.27948264260104966.
[I 2025-10-05 22:50:06,357] Trial 1 finished with value: 0.27970310623193195 and parameters: {'learning_rate': 0.011900590783184251, 'num_leaves': 76, 'max_depth': 7, 'min_child_samples': 41, 'subsample': 0.6061753482887408, 'colsample_bytree': 0.8909729556485984}. Best is trial 0 with value: 0.27948264260104966.
[I 2025-10-05 22:50:11,504] Trial 2 finished with value: 0.2778100019892509 and parameters: {'learning_rate': 0.12106896936002161, 'num_leaves': 56, 'max_depth': 6, 'min_child_samples': 25, 'subsample': 0.6912726728878613, 'colsample_bytree': 0.7574269294896714}. Best is trial 2 with value: 0.2778100019892509.
[I 2025-10-05 22:50:17,253] Trial 3 finished with value: 0

In [8]:
dict_params_10 = forecast.tune(
    df_input,
    horizon,
    input_columns,
    numerical_columns,
    categorical_columns,
    alpha = 0.1
)

Tuning forecast model for quantile (α=0.1) with 30 trials...


[I 2025-10-05 22:52:52,763] A new study created in memory with name: no-name-82964169-1e7a-4fbc-8aff-9e093690b902


X shape pre-filtering: (442520, 27)
X shape post-filtering: (425330, 27)
Train dates [2010-03-16 00:00:00 to 2023-12-26 00:00:00], Test dates [2024-01-02 00:00:00 to 2025-07-29 00:00:00]
X training shape: (392634, 27)
X test shape: (32696, 27)


[I 2025-10-05 22:52:59,656] Trial 0 finished with value: 0.07132541919557504 and parameters: {'learning_rate': 0.030710573677773714, 'num_leaves': 79, 'max_depth': 8, 'min_child_samples': 38, 'subsample': 0.6468055921327309, 'colsample_bytree': 0.6467983561008608}. Best is trial 0 with value: 0.07132541919557504.
[I 2025-10-05 22:53:05,827] Trial 1 finished with value: 0.07128314450565694 and parameters: {'learning_rate': 0.011900590783184251, 'num_leaves': 76, 'max_depth': 7, 'min_child_samples': 41, 'subsample': 0.6061753482887408, 'colsample_bytree': 0.8909729556485984}. Best is trial 1 with value: 0.07128314450565694.
[I 2025-10-05 22:53:10,592] Trial 2 finished with value: 0.07133880026108735 and parameters: {'learning_rate': 0.12106896936002161, 'num_leaves': 56, 'max_depth': 6, 'min_child_samples': 25, 'subsample': 0.6912726728878613, 'colsample_bytree': 0.7574269294896714}. Best is trial 1 with value: 0.07128314450565694.
[I 2025-10-05 22:53:16,163] Trial 3 finished with value:

### Convert to a Single Dict

In [19]:
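# Collect the best parameters of each tuned model under one key per objective:
# 'mean' for the Poisson model, and the quantile level in percent for the quantile models.
dict_params = {
    'mean': dict_params_mean['best_params'],
    '90': dict_params_90['best_params'],
    '50': dict_params_50['best_params'],
    '10': dict_params_10['best_params'],
}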



In [None]:
with open("../src/resources/model_optimal_params.json", "w") as f:
    json.dump(dict_params, f, indent=4)
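
To confirm that the saved file can be read back unchanged, a round-trip check along the following lines may help. It writes to a temporary directory rather than `../src/resources/`, and `example_params` is a hypothetical stand-in for `dict_params` with made-up values, not tuned results.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical stand-in for dict_params; the values are illustrative only.
example_params = {
    "mean": {"learning_rate": 0.05, "num_leaves": 63, "max_depth": 8},
    "90": {"learning_rate": 0.05, "num_leaves": 31, "max_depth": 6},
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model_optimal_params.json"
    # Write with the same indent as the notebook cell above.
    path.write_text(json.dumps(example_params, indent=4))
    loaded = json.loads(path.read_text())

# String keys plus plain int/float values survive the round trip unchanged.
print(loaded == example_params)
```

Note that `json.dump` raises `TypeError` on NumPy scalar types, so if any parameter value were a NumPy number it would need converting to a built-in `int` or `float` first.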