# Data Loading & Cleaning

This notebook loads the raw NYC 311 noise complaint data, derives temporal features, and merges in neighborhood reference data.

## Imports & Load Data

Note: The 311 Noise Complaint dataset has more than 4 million rows and may take some time to load.

In [1]:
import pandas as pd
import numpy as np

print("Loading 311 Noise Complaints data... Large file takes about 2 minutes!")
# The hf:// path is resolved through fsspec and needs the huggingface_hub package installed
df = pd.read_csv("hf://datasets/idakam/311-nyc-noise-complaints/311_noise_complaints.csv")
print("Loading neighborhood and zipcode data..")
neighborhoods = pd.read_csv("../data/raw/neighborhoods.csv")

Loading 311 Noise Complaints data... Large file takes about 2 minutes!
Loading neighborhood and zipcode data..
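
Because the file is large, one option is to read only the columns that later cells use and to declare their types up front. The sketch below is a minimal illustration on an in-memory stand-in, not the loading code used in this notebook. The column names come from later cells; the `Unique Key` column and all sample values are assumptions made up for the example.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for the real CSV (values are made up for illustration).
sample_csv = io.StringIO(
    "Unique Key,Created Date,Incident Zip,Borough,Latitude,Longitude\n"
    "1,2026-02-09 02:04:44,10001,MANHATTAN,40.75,-73.99\n"
    "2,2026-02-09 02:03:44,,Unspecified,,\n"
)

# Reading only the needed columns reduces memory use, and keeping ZIPs as
# strings avoids them being parsed as floats (10001 -> 10001.0).
sample = pd.read_csv(
    sample_csv,
    usecols=["Created Date", "Incident Zip", "Borough", "Latitude", "Longitude"],
    dtype={"Incident Zip": "string", "Borough": "string"},
    parse_dates=["Created Date"],
)
print(sample.dtypes)
```

With the `string` dtype, an empty ZIP field is read as a missing value rather than as text.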


## Parse Datetime & Temporal Features

In [2]:
df['Created Date'] = pd.to_datetime(df['Created Date'])

df['Year'] = df['Created Date'].dt.year
df['Month'] = df['Created Date'].dt.month
df['Week'] = df['Created Date'].dt.isocalendar().week
df['Day_of_Week'] = df['Created Date'].dt.dayofweek
df['Day_Name'] = df['Created Date'].dt.day_name()
df['Hour'] = df['Created Date'].dt.hour
df['Date'] = df['Created Date'].dt.date
df[['Created Date', 'Day_of_Week', 'Hour']].head()

          Created Date  Day_of_Week  Hour
0  2026-02-09 02:04:44            0     2
1  2026-02-09 02:03:44            0     2
2  2026-02-09 02:03:09            0     2
3  2026-02-09 02:02:03            0     2
4  2026-02-09 02:00:28            0     2
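
`pd.to_datetime` infers the format when none is given and raises on values it cannot parse. A possible hardening step is sketched below: pass an explicit format and coerce bad values to `NaT` so they can be counted. The format string is an assumption based on the timestamps displayed above, and the sample values (including the invalid one) are made up.

```python
import pandas as pd

raw = pd.Series(["2026-02-09 02:04:44", "2026-02-09 02:03:44", "not a date"])

# errors="coerce" turns unparseable values into NaT instead of raising.
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S", errors="coerce")

n_bad = parsed.isna().sum()
print(f"{n_bad} of {len(parsed)} timestamps could not be parsed")

# dayofweek uses Monday=0 ... Sunday=6, so 2026-02-09 (a Monday) maps to 0.
print(parsed.dt.dayofweek.head(2).tolist())
```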


## Create Time Buckets

In [3]:
def get_time_bucket(hour):
    if 6 <= hour < 12:
        return 'morning'
    elif 12 <= hour < 18:
        return 'afternoon'
    elif 18 <= hour < 22:
        return 'evening'
    elif hour >= 22 or hour < 2:
        return 'night'
    else:
        return 'overnight'

df['Time_Bucket'] = df['Hour'].apply(get_time_bucket)

# Distribution check
df['Time_Bucket'].value_counts()

Time_Bucket
night        1704997
evening      1125961
afternoon     819821
morning       516154
overnight     473533
Name: count, dtype: int64
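
`apply` calls the Python function once per row, which can be slow on millions of rows. Since there are only 24 possible hours, one alternative is to build a lookup table once and use `Series.map`. This is a sketch of that idea, not the approach used above; `HOUR_TO_BUCKET` is a hypothetical name, and the function is repeated here only so the example is self-contained.

```python
import pandas as pd

def get_time_bucket(hour):
    # Same boundaries as the cell above.
    if 6 <= hour < 12:
        return 'morning'
    elif 12 <= hour < 18:
        return 'afternoon'
    elif 18 <= hour < 22:
        return 'evening'
    elif hour >= 22 or hour < 2:
        return 'night'
    else:
        return 'overnight'

# Evaluate the function once per possible hour, then map the whole column.
HOUR_TO_BUCKET = {h: get_time_bucket(h) for h in range(24)}

hours = pd.Series([0, 1, 2, 5, 6, 11, 12, 17, 18, 21, 22, 23])
buckets = hours.map(HOUR_TO_BUCKET)
print(buckets.tolist())
```

Note that the 'night' bucket wraps around midnight (22:00 to 01:59), so it spans two calendar dates.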

## Season Feature Creation

In [4]:
df['Season'] = df['Month'].map({
    12: 'Winter', 1: 'Winter', 2: 'Winter',
    3: 'Spring', 4: 'Spring', 5: 'Spring',
    6: 'Summer', 7: 'Summer', 8: 'Summer',
    9: 'Fall', 10: 'Fall', 11: 'Fall'
})
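
`Series.map` with a dictionary returns `NaN` for any key that is missing from the dictionary, so a typo in the mapping would silently produce missing seasons. A small check along these lines could guard against that; `SEASON_BY_MONTH` is a hypothetical name for the same mapping.

```python
import pandas as pd

SEASON_BY_MONTH = {
    12: 'Winter', 1: 'Winter', 2: 'Winter',
    3: 'Spring', 4: 'Spring', 5: 'Spring',
    6: 'Summer', 7: 'Summer', 8: 'Summer',
    9: 'Fall', 10: 'Fall', 11: 'Fall',
}

# Map every calendar month and confirm none falls through to NaN.
months = pd.Series(range(1, 13))
seasons = months.map(SEASON_BY_MONTH)
assert seasons.notna().all(), "some months have no season assigned"

season_counts = seasons.value_counts().to_dict()
print(season_counts)
```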

## Clean ZIPS & Boroughs

In [5]:
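# Normalize ZIPs to 5-character strings (handles floats like 10001.0 and ZIP+4 values).
# Note: astype(str) turns missing ZIPs into the string 'nan'; those rows will not match any ZipCode in the merge.
df['Incident Zip'] = df['Incident Zip'].astype(str).str.strip().str[:5]
neighborhoods['ZipCode'] = neighborhoods['ZipCode'].astype(str).str.strip().str[:5]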

df['Borough'] = df['Borough'].str.upper()
neighborhoods['Borough'] = neighborhoods['Borough'].str.upper()

# Drop unspecified boroughs
df = df[df['Borough'] != 'UNSPECIFIED'].copy()
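
The ZIP normalization above leaves placeholder text such as 'nan' in the column as ordinary strings. If true missing values are preferred, one option is to keep only values that are exactly five digits. This is a sketch on made-up sample values, not a step this notebook performs.

```python
import numpy as np
import pandas as pd

# Made-up examples of the kinds of raw values a ZIP column might contain.
raw_zips = pd.Series([10001.0, "11201-1234", " 10453 ", np.nan, "N/A"])

# Same normalization as the cell above.
cleaned = raw_zips.astype(str).str.strip().str[:5]

# Keep only values that are exactly five digits; everything else becomes missing.
cleaned = cleaned.where(cleaned.str.fullmatch(r"\d{5}"))
print(cleaned.tolist())
```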

## Merge Neighborhood Data

This section merges the neighborhood data into the 311 data to map ZIP codes to neighborhoods.

In [6]:
# Preserve original 311 borough
df['Borough_311'] = df['Borough']

# Rename neighborhood borough BEFORE merge
neighborhoods_renamed = neighborhoods.rename(
    columns={'Borough': 'Borough_Zip'}
)

df = df.merge(
    neighborhoods_renamed[['ZipCode', 'Neighborhood', 'Borough_Zip']],
    left_on='Incident Zip',
    right_on='ZipCode',
    how='left'
)

# Prefer ZIP-based borough when available
df['Borough'] = df['Borough_Zip'].fillna(df['Borough_311'])

df.drop(
    columns=['ZipCode', 'Borough_311', 'Borough_Zip'],
    inplace=True,
    errors='ignore'
)
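
A left merge duplicates complaint rows if a ZIP code appears more than once in the lookup table. The sketch below shows how the `validate` and `indicator` arguments of `DataFrame.merge` can surface that problem and count unmatched rows. The frames and their values are made up for illustration; only the column names follow the cell above.

```python
import pandas as pd

complaints = pd.DataFrame({
    "Incident Zip": ["10001", "11201", "99999"],
    "Borough": ["MANHATTAN", "BROOKLYN", "QUEENS"],
})
zip_lookup = pd.DataFrame({
    "ZipCode": ["10001", "11201"],
    "Neighborhood": ["Chelsea and Clinton", "Northwest Brooklyn"],
    "Borough_Zip": ["MANHATTAN", "BROOKLYN"],
})

merged = complaints.merge(
    zip_lookup,
    left_on="Incident Zip",
    right_on="ZipCode",
    how="left",
    validate="many_to_one",  # raises MergeError if a ZIP appears twice in the lookup
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# A many-to-one left merge must not change the number of complaint rows.
assert len(merged) == len(complaints)
print(merged["_merge"].value_counts())
```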

## Data Quality Checks

In [7]:
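# Share of rows with a mapped neighborhood.
# Not displayed as written: a cell only echoes its last expression, so wrap this in print() to see it.
df['Neighborhood'].notna().mean()

# Ensure each neighborhood belongs to only one borough (every value should be 1)
df.groupby('Neighborhood')['Borough'].nunique().sort_values(ascending=False).head()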


Neighborhood
Borough Park        1
Southeast Bronx     1
North Queens        1
Northeast Bronx     1
Northeast Queens    1
Name: Borough, dtype: int64
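
When the mapped share is below 100%, it can help to see which ZIP codes failed to match. The following sketch prints the share and lists the most frequent unmatched ZIPs. It runs on a made-up sample frame, and `mapped_share` and `unmatched_zips` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Made-up sample standing in for the merged complaints frame.
sample = pd.DataFrame({
    "Incident Zip": ["10001", "99999", "99999", "nan", "11201"],
    "Neighborhood": ["Chelsea and Clinton", np.nan, np.nan, np.nan, "Northwest Brooklyn"],
})

# Fraction of rows that received a neighborhood from the merge.
mapped_share = sample["Neighborhood"].notna().mean()
print(f"Rows with a neighborhood: {mapped_share:.1%}")

# Most common ZIP values among the rows that did not match.
unmatched_zips = (
    sample.loc[sample["Neighborhood"].isna(), "Incident Zip"]
    .value_counts()
    .head(10)
)
print(unmatched_zips)
```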

## Save Cleaned Data

In [8]:
output_path = "../data/processed/noise_cleaned.csv"
df.to_csv(output_path, index=False)

df.shape


(4639338, 19)

## Create Neighborhood Coordinates CSV for Future Hotspot Mapping

This dataset is not used for the model, but the interactive tool built later needs the neighborhood coordinates. It is the only dataset the deployed tool needs at runtime.

In [9]:
neighborhood_coords = df.groupby(['Borough', 'Neighborhood']).agg({
    'Latitude': 'mean',
    'Longitude': 'mean'
}).reset_index()

# Clean
neighborhood_coords = neighborhood_coords.dropna(subset=['Neighborhood', 'Latitude', 'Longitude'])
neighborhood_coords = neighborhood_coords[neighborhood_coords['Borough'] != 'UNKNOWN']

# Save
neighborhood_coords.to_csv("../data/processed/neighborhood_coordinates.csv", index=False)
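
As a rough illustration of how the interactive tool might consume this file, the sketch below loads coordinates and looks one up by neighborhood name. The helper `get_coords` is hypothetical, the coordinate values are made up, and an in-memory CSV stands in for `neighborhood_coordinates.csv`.

```python
import io
import pandas as pd

# In-memory stand-in for neighborhood_coordinates.csv (coordinates are made up).
coords_csv = io.StringIO(
    "Borough,Neighborhood,Latitude,Longitude\n"
    "BROOKLYN,Borough Park,40.5,-74.0\n"
    "QUEENS,North Queens,40.75,-73.75\n"
)
coords = pd.read_csv(coords_csv).set_index("Neighborhood")

def get_coords(neighborhood):
    """Return (latitude, longitude) for a neighborhood, or None if it is not in the file."""
    if neighborhood not in coords.index:
        return None
    row = coords.loc[neighborhood]
    return float(row["Latitude"]), float(row["Longitude"])

print(get_coords("Borough Park"))
print(get_coords("Nowhere"))
```

This lookup assumes each neighborhood name appears only once in the file, which matches the one-borough-per-neighborhood check above.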