# Feature Engineering

This notebook compares multiple aggregation strategies for 311 noise complaints
and produces the final dataset used for modeling.

We evaluate tradeoffs between:
- spatial granularity
- temporal resolution
- data sparsity

Final output:
- `../data/processed/aggregated_data.csv`

## Imports & Load Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv("../data/processed/noise_cleaned.csv")

## Neighborhood Uniqueness Check

In [None]:
neighborhood_borough = df.groupby('Neighborhood')['Borough'].nunique()
duplicate_neighborhoods = neighborhood_borough[neighborhood_borough > 1]

duplicate_neighborhoods.head(10)


Series([], Name: Borough, dtype: int64)

After ZIP-based corrections, all neighborhood names map to exactly one borough.

This confirms that the data cleaning process successfully resolved
geographic ambiguity present in the raw 311 dataset.
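
The same check can be wrapped in a small helper and used as a guard before aggregating. The sketch below runs on made-up toy data, and the helper name `ambiguous_neighborhoods` is hypothetical rather than part of this notebook.

```python
import pandas as pd

# Toy data (hypothetical): "Chelsea" is listed under two boroughs on purpose.
toy = pd.DataFrame({
    "Neighborhood": ["Chelsea", "Chelsea", "Astoria", "Astoria"],
    "Borough": ["Manhattan", "Staten Island", "Queens", "Queens"],
})


def ambiguous_neighborhoods(frame: pd.DataFrame) -> pd.Series:
    """Return the neighborhoods that map to more than one borough."""
    boroughs_per_name = frame.groupby("Neighborhood")["Borough"].nunique()
    return boroughs_per_name[boroughs_per_name > 1]


conflicts = ambiguous_neighborhoods(toy)
print(conflicts)

# On the cleaned data the result is empty, so a guard like this would pass:
# assert ambiguous_neighborhoods(df).empty
```

Failing fast here matters because every aggregation below groups on both `Borough` and `Neighborhood`, so an ambiguous name would silently split one neighborhood into several groups.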

## Aggregation Comparison Definitions

In [None]:
agg1 = (
    df.groupby(['Borough', 'Neighborhood', 'Day_of_Week', 'Time_Bucket'])
    .size()
    .reset_index(name='Complaint_Count')
)

agg2 = (
    df.groupby(['Borough', 'Neighborhood', 'Season', 'Day_of_Week', 'Time_Bucket'])
    .size()
    .reset_index(name='Complaint_Count')
)

agg3 = (
    df.groupby(['Borough', 'Neighborhood', 'Month', 'Day_of_Week', 'Time_Bucket'])
    .size()
    .reset_index(name='Complaint_Count')
)

agg4 = (
    df.groupby(['Borough', 'Incident Zip', 'Season', 'Day_of_Week', 'Time_Bucket'])
    .size()
    .reset_index(name='Complaint_Count')
)


## Strategy Comparison Table

In [None]:
comparison = pd.DataFrame({
    'Strategy': [
        'Neighborhood × Day × Time',
        'Neighborhood × Season × Day × Time',
        'Neighborhood × Month × Day × Time',
        'Zip × Season × Day × Time'
    ],
    'Groups': [len(agg1), len(agg2), len(agg3), len(agg4)],
    'Avg Complaints': [
        agg1['Complaint_Count'].mean(),
        agg2['Complaint_Count'].mean(),
        agg3['Complaint_Count'].mean(),
        agg4['Complaint_Count'].mean()
    ],
    'Sparse Groups (%)': [
        (agg1['Complaint_Count'] < 5).mean() * 100,
        (agg2['Complaint_Count'] < 5).mean() * 100,
        (agg3['Complaint_Count'] < 5).mean() * 100,
        (agg4['Complaint_Count'] < 5).mean() * 100
    ]
})

comparison


| | Strategy | Groups | Avg Complaints | Sparse Groups (%) |
|---|---|---|---|---|
| 0 | Neighborhood × Day × Time | 1470 | 3131.839456 | 0.0 |
| 1 | Neighborhood × Season × Day × Time | 5880 | 782.959864 | 0.0 |
| 2 | Neighborhood × Month × Day × Time | 17640 | 260.986621 | 0.147392 |
| 3 | Zip × Season × Day × Time | 26072 | 177.918878 | 6.351642 |


Strategy 0: Neighborhood × Day × Time
- Very dense: no sparse groups and about 3,132 complaints per group on average
- Too coarse: loses seasonality, which matters for urban noise

Strategy 1: Neighborhood × Season × Day × Time
- No sparse groups
- Still enough data per group (about 783 complaints on average) for stable baselines
- Captures seasonal patterns

Strategy 2: Neighborhood × Month × Day × Time
- Data is more fragmented, with slight sparsity (0.15% of groups have fewer than 5 complaints)

Strategy 3: Zip × Season × Day × Time
- Too sparse: 6.35% of groups have fewer than 5 complaints
- Too granular
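
The three metrics in the comparison can be computed by one helper applied to any set of grouping keys. This is a minimal sketch on made-up toy data. The function name `summarize_strategy` and the `min_count` parameter are hypothetical, with the default of 5 taken from the sparsity cutoff used above.

```python
import pandas as pd


def summarize_strategy(frame: pd.DataFrame, keys: list, min_count: int = 5) -> dict:
    """Count rows per group and report group count, mean size, and % sparse."""
    counts = frame.groupby(keys).size()
    return {
        "Groups": len(counts),
        "Avg Complaints": counts.mean(),
        "Sparse Groups (%)": (counts < min_count).mean() * 100,
    }


# Toy data (hypothetical): 8 complaints spread over two neighborhoods.
toy = pd.DataFrame({
    "Neighborhood": ["A"] * 6 + ["B"] * 2,
    "Season": ["Summer"] * 5 + ["Winter"] + ["Summer"] * 2,
})

coarse = summarize_strategy(toy, ["Neighborhood"])
fine = summarize_strategy(toy, ["Neighborhood", "Season"])
print(coarse)
print(fine)
```

Adding a key can only split groups, so the group count rises and the average group size falls, which is the same pattern as in the comparison table. Note that `size()` only returns combinations that occur in the data, so combinations with zero complaints are not counted as sparse groups.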

## Seasonal Signal Check

In [None]:
seasonal_totals = df.groupby('Season').size().reindex(
    ['Winter', 'Spring', 'Summer', 'Fall']
)

seasonal_cv = seasonal_totals.std() / seasonal_totals.mean()

seasonal_totals, seasonal_cv

(Season
 Winter     888954
 Spring    1092125
 Summer    1405612
 Fall      1252647
 dtype: int64,
 np.float64(0.19084235972915742))
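
As a cross-check, the coefficient of variation can be reproduced from the four totals above using only the standard library. pandas' `std()` defaults to the sample standard deviation (`ddof=1`), which is what `statistics.stdev` computes. The summer-to-winter ratio is an extra illustration, not something the notebook calculates.

```python
from statistics import mean, stdev

# Seasonal totals copied from the output above.
seasonal_totals = {
    "Winter": 888954,
    "Spring": 1092125,
    "Summer": 1405612,
    "Fall": 1252647,
}

values = list(seasonal_totals.values())

# Sample standard deviation (n - 1 in the denominator) divided by the mean.
cv = stdev(values) / mean(values)

# How much busier the peak season is than the quietest one.
peak_to_trough = max(values) / min(values)

print(f"CV: {cv:.3f}")
print(f"Peak-to-trough ratio: {peak_to_trough:.2f}")
```

A CV of about 0.19 means the seasonal totals deviate from their mean by roughly 19% on average, and summer has about 1.58 times as many complaints as winter.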

## Strategy Selection Logic

In [None]:
if seasonal_cv > 0.10 and (agg2['Complaint_Count'] < 5).mean() < 0.70:
    groupby_cols = ['Borough', 'Neighborhood', 'Season', 'Day_of_Week', 'Time_Bucket']
    strategy_name = "with_season"
elif (agg1['Complaint_Count'] < 5).mean() < 0.60:
    groupby_cols = ['Borough', 'Neighborhood', 'Day_of_Week', 'Time_Bucket']
    strategy_name = "no_season"
else:
    groupby_cols = ['Borough', 'Incident Zip', 'Season', 'Day_of_Week', 'Time_Bucket']
    strategy_name = "zip_level"

groupby_cols, strategy_name

(['Borough', 'Neighborhood', 'Season', 'Day_of_Week', 'Time_Bucket'],
 'with_season')
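
The selection rule is easier to test when written as a pure function. The sketch below mirrors the branches above. The function name `choose_strategy` and its parameter names are hypothetical, and the defaults are the thresholds from the cell. The sparsity inputs are fractions between 0 and 1, not the percentages shown in the comparison table.

```python
def choose_strategy(
    seasonal_cv: float,
    sparse_frac_with_season: float,
    sparse_frac_no_season: float,
    cv_threshold: float = 0.10,
    max_sparse_with_season: float = 0.70,
    max_sparse_no_season: float = 0.60,
) -> tuple:
    """Return (strategy_name, groupby_cols) using the notebook's thresholds."""
    base = ["Borough", "Neighborhood"]
    tail = ["Day_of_Week", "Time_Bucket"]

    # Prefer the seasonal grouping when seasons differ enough and groups stay dense.
    if seasonal_cv > cv_threshold and sparse_frac_with_season < max_sparse_with_season:
        return "with_season", base + ["Season"] + tail

    # Otherwise fall back to the non-seasonal grouping if it is dense enough.
    if sparse_frac_no_season < max_sparse_no_season:
        return "no_season", base + tail

    # Last resort: the ZIP-level grouping.
    return "zip_level", ["Borough", "Incident Zip", "Season"] + tail


# Values from this notebook: CV of about 0.191 and no sparse groups.
name, cols = choose_strategy(0.191, 0.0, 0.0)
print(name, cols)
```

With the observed values the first branch fires, matching the `with_season` result above.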

## Weekly Aggregation

In [None]:
df['Year_Week'] = (
    df['Year'].astype(str) + '_W' +
    df['Week'].astype(str).str.zfill(2)
)

weekly_counts = (
    df.groupby(groupby_cols + ['Year_Week'])
    .size()
    .reset_index(name='Weekly_Complaints')
)

final_agg = (
    weekly_counts
    .groupby(groupby_cols)
    .agg(
        Avg_Complaints_Per_Week=('Weekly_Complaints', 'mean'),
        Std_Complaints=('Weekly_Complaints', 'std'),
        Num_Weeks=('Weekly_Complaints', 'count')
    )
    .reset_index()
)

final_agg = final_agg[final_agg['Num_Weeks'] >= 3]
final_agg.shape


(5880, 8)

In [None]:
output_file = "../data/processed/aggregated_data.csv"
final_agg.to_csv(output_file, index=False)

output_file, final_agg.columns.tolist()


('../data/processed/aggregated_data.csv',
 ['Borough',
  'Neighborhood',
  'Season',
  'Day_of_Week',
  'Time_Bucket',
  'Avg_Complaints_Per_Week',
  'Std_Complaints',
  'Num_Weeks'])
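
To make the two-step aggregation concrete, here is a minimal sketch on made-up toy data with a single grouping key. The data and the names `toy` and `kept` are hypothetical. The column names and the three-week cutoff follow the cells above.

```python
import pandas as pd

# Toy data (hypothetical): one row per complaint.
# Group "A" has complaints in three weeks (2, 4, and 3 of them); group "B" in one.
toy = pd.DataFrame({
    "Neighborhood": ["A"] * 9 + ["B"],
    "Year_Week": (
        ["2023_W01"] * 2 + ["2023_W02"] * 4 + ["2023_W03"] * 3 + ["2023_W01"]
    ),
})
keys = ["Neighborhood"]

# Step 1: count complaints per group per week.
weekly_counts = (
    toy.groupby(keys + ["Year_Week"])
    .size()
    .reset_index(name="Weekly_Complaints")
)

# Step 2: summarize the weekly counts within each group.
summary = (
    weekly_counts.groupby(keys)
    .agg(
        Avg_Complaints_Per_Week=("Weekly_Complaints", "mean"),
        Std_Complaints=("Weekly_Complaints", "std"),
        Num_Weeks=("Weekly_Complaints", "count"),
    )
    .reset_index()
)

# Drop groups observed in fewer than 3 weeks.
kept = summary[summary["Num_Weeks"] >= 3]
print(summary)
print(kept)
```

Group "B" has a single week, so its standard deviation is `NaN` and the `Num_Weeks >= 3` filter removes it. Group "A" keeps a mean of 3.0 and a standard deviation of 1.0.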

## Seasonal Variation Check

In [None]:
# Check if season actually matters enough to include
seasonal_totals = df.groupby('Season').size().reindex(['Winter', 'Spring', 'Summer', 'Fall'])
seasonal_cv = seasonal_totals.std() / seasonal_totals.mean()

print(seasonal_totals.to_string())
print(f"\nCoefficient of variation: {seasonal_cv:.3f}")


Season
Winter     888954
Spring    1092125
Summer    1405612
Fall      1252647

Coefficient of variation: 0.191


## Create Weekly Averages

In [None]:
# Use season-based grouping (CV = 0.191 exceeds the 0.10 threshold, so seasonal variation is worth modeling)
groupby_cols = ['Borough', 'Neighborhood', 'Season', 'Day_of_Week', 'Time_Bucket']

# Step 1: count per week
df['Year_Week'] = df['Year'].astype(str) + '_W' + df['Week'].astype(str).str.zfill(2)

weekly_counts = df.groupby(groupby_cols + ['Year_Week']).size().reset_index(name='Weekly_Complaints')

# Step 2: average across weeks
agg = weekly_counts.groupby(groupby_cols).agg(
    Avg_Complaints_Per_Week=('Weekly_Complaints', 'mean'),
    Std_Complaints=('Weekly_Complaints', 'std'),
    Num_Weeks=('Weekly_Complaints', 'count')
).reset_index()

# Drop groups with fewer than 3 weeks of data
agg = agg[agg['Num_Weeks'] >= 3].copy()

print(f"{len(agg):,} groups")
print(f"Weekly avg range: {agg['Avg_Complaints_Per_Week'].min():.1f} - {agg['Avg_Complaints_Per_Week'].max():.1f}")
print(f"Median: {agg['Avg_Complaints_Per_Week'].median():.1f} complaints/week")

5,880 groups
Weekly avg range: 1.0 - 180.3
Median: 6.1 complaints/week
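
The minimum weekly average of 1.0 reflects how the counts are built: `size()` returns a row only for weeks that contain at least one complaint, so each average is taken over weeks with complaints rather than over all weeks. If zero-complaint weeks should count, one possible approach is to fill in the missing weeks with zeros before averaging. The sketch below is an assumption about how that could look on made-up toy data, not what this notebook does.

```python
import pandas as pd

# Toy data (hypothetical): group "A" has no complaints in week 2,
# group "B" has none in weeks 1 and 3.
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B"],
    "Year_Week": ["2023_W01", "2023_W01", "2023_W03", "2023_W02"],
})
all_weeks = ["2023_W01", "2023_W02", "2023_W03"]

# Counts for observed (group, week) pairs only, as in the notebook.
observed = toy.groupby(["Neighborhood", "Year_Week"]).size()
observed_mean = observed.groupby(level="Neighborhood").mean()

# Pivot weeks into columns and fill the missing weeks with 0 before averaging.
full = (
    observed.unstack("Year_Week", fill_value=0)
    .reindex(columns=all_weeks, fill_value=0)
)
zero_filled_mean = full.mean(axis=1)

print(observed_mean)
print(zero_filled_mean)
```

With a `Season` key in the grouping, the list of weeks would need to be restricted to the weeks that fall in each season, which this sketch does not attempt.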


## Save to New CSV File

In [None]:
agg.to_csv("../data/processed/aggregated_data.csv", index=False)