# NYC Data Wrangling Pipeline

This notebook cleans NYC 311 complaint data and median asking-rent data, maps both to a shared set of neighborhoods, and merges them into a single monthly dataset for analysis.

## Pipeline Structure
1. **Data Import** - Load required libraries and datasets
2. **Data Cleaning** - Handle missing values, duplicates, and data quality issues
3. **Data Transformation** - Create new features, map records to neighborhoods, aggregate, and reshape
4. **Data Integration** - Merge datasets, check data quality, and export the final output

## 1. Data Import

### Import Required Libraries

In [64]:
# Import necessary libraries for data manipulation and analysis
import pandas as pd
import json

### Load Datasets

In [65]:
# Load NYC 311 complaints data and median rent data
df_nyc_311 = pd.read_csv('data/nyc_311_2024_2025_sample.csv', index_col="unique_key")
df_median_rent = pd.read_csv('data/medianAskingRent_All.csv')

print(f"NYC 311 data shape: {df_nyc_311.shape}")
print(f"Median rent data shape: {df_median_rent.shape}")

NYC 311 data shape: (71755, 42)
Median rent data shape: (198, 191)


### Load Mapping Files

In [66]:
# Load ZIP code to neighborhood mapping
with open('nyc_uhf_zipcodes.json', 'r') as f:
    uhf_data = json.load(f)

# Load manual mapping for area names to neighborhoods
with open('manual_map.json', 'r') as f:
    manual_map = json.load(f)

### Initial Data Exploration

In [67]:
# Display basic information about the datasets
print("=== NYC 311 Dataset Sample ===")
print(df_nyc_311.head())
print("\n=== Median Rent Dataset Sample ===")
print(df_median_rent.head())

=== NYC 311 Dataset Sample ===
                       created_date              closed_date agency  \
unique_key                                                            
59945453    2024-01-06T13:06:35.000  2024-01-19T18:34:37.000    HPD   
59949590    2024-01-06T14:31:27.000  2024-01-07T13:03:02.000    HPD   
59957878    2024-01-08T07:45:53.000  2024-01-08T07:58:13.000   NYPD   
59940883    2024-01-05T14:54:38.000  2024-01-05T20:48:26.000    HPD   
59957240    2024-01-07T09:51:51.000  2024-03-08T16:55:31.000    HPD   

                                                  agency_name  \
unique_key                                                      
59945453    Department of Housing Preservation and Develop...   
59949590    Department of Housing Preservation and Develop...   
59957878                      New York City Police Department   
59940883    Department of Housing Preservation and Develop...   
59957240    Department of Housing Preservation and Develop...   

               

## 2. Data Cleaning

### NYC 311 Data - Column Selection and Initial Cleaning

In [68]:
# Select relevant columns for analysis
list_of_relevant_columns = ['created_date', 'closed_date', 'complaint_type',
                            'descriptor', 'status', 'resolution_description',
                            'resolution_action_updated_date', 'borough',
                            'community_board', 'incident_zip', 
                            'incident_address', 'street_name', 'city',
                            'latitude', 'longitude']

df_nyc_311_selected = df_nyc_311[list_of_relevant_columns].copy()
print(f"Selected {len(list_of_relevant_columns)} columns from NYC 311 data")

Selected 15 columns from NYC 311 data


### Median Rent Data - Column Selection

In [69]:
# Select relevant date columns (2024-2025) and basic info columns
date_columns = [col for col in df_median_rent.columns if col.startswith('2024') or col.startswith('2025')]
df_median_rent_selected = df_median_rent[df_median_rent.columns[:3].to_list() + date_columns].copy()
print(f"Selected {len(date_columns)} date columns plus 3 info columns from rent data")

Selected 20 date columns plus 3 info columns from rent data


### Missing Values Analysis

In [70]:
# Analyze missing values in NYC 311 data
missing_values = df_nyc_311_selected.isna().sum().sort_values(ascending=False)
missing_percentage = (df_nyc_311_selected.isna().sum() / len(df_nyc_311_selected) * 100).sort_values(ascending=False)

missing_data = pd.DataFrame({
    'Missing_Count': missing_values,
    'Missing_Percentage': missing_percentage
})

# Only show columns with missing values
missing_data = missing_data[missing_data['Missing_Count'] > 0]

print(f"Total number of rows in NYC 311 dataset: {len(df_nyc_311_selected)}")
print("\nMissing values analysis:")
missing_data.round(2)

Total number of rows in NYC 311 dataset: 71755

Missing values analysis:


|  | Missing_Count | Missing_Percentage |
|---|---|---|
| descriptor | 3764 | 5.25 |
| city | 3450 | 4.81 |
| street_name | 2237 | 3.12 |
| incident_address | 2235 | 3.11 |
| closed_date | 2065 | 2.88 |
| resolution_description | 2019 | 2.81 |
| latitude | 1055 | 1.47 |
| longitude | 1055 | 1.47 |
| resolution_action_updated_date | 919 | 1.28 |
| incident_zip | 603 | 0.84 |


In [71]:
# Analyze missing values in rent data
missing_values_rent = df_median_rent_selected.isna().sum().sort_values(ascending=False)
missing_percentage_rent = (df_median_rent_selected.isna().sum() / len(df_median_rent_selected) * 100).sort_values(ascending=False)

missing_data_rent = pd.DataFrame({
    'Missing_Count': missing_values_rent,
    'Missing_Percentage': missing_percentage_rent
})

missing_data_rent = missing_data_rent[missing_data_rent['Missing_Count'] > 0]

print(f"Total number of rows in rent dataset: {len(df_median_rent_selected)}")
print("\nMissing values analysis for rent data:")
missing_data_rent.round(2)

Total number of rows in rent dataset: 198

Missing values analysis for rent data:


|  | Missing_Count | Missing_Percentage |
|---|---|---|
| 2024-09 | 62 | 31.31 |
| 2025-08 | 62 | 31.31 |
| 2024-06 | 61 | 30.81 |
| 2024-10 | 59 | 29.8 |
| 2024-07 | 59 | 29.8 |
| 2024-02 | 58 | 29.29 |
| 2025-03 | 58 | 29.29 |
| 2024-04 | 58 | 29.29 |
| 2025-01 | 58 | 29.29 |
| 2025-07 | 58 | 29.29 |
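
The two missing-value cells above repeat the same logic for each dataset. A minimal sketch of how it could be factored into one helper follows; `summarize_missing` and the toy frame are hypothetical names introduced here for illustration, not part of the notebook.

```python
import pandas as pd

def summarize_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values per column, worst first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "Missing_Count": counts,
        "Missing_Percentage": (counts / len(df) * 100).round(2),
    })
    # Keep only the columns that actually have gaps
    summary = summary[summary["Missing_Count"] > 0]
    return summary.sort_values("Missing_Count", ascending=False)

# Toy frame for illustration only (not the real data)
toy = pd.DataFrame({
    "a": [1, None, 3, None],
    "b": [1, 2, 3, 4],
    "c": [None, 2, 3, 4],
})
missing_summary = summarize_missing(toy)
print(missing_summary)
```

Under this assumption, the same call would work for both `df_nyc_311_selected` and `df_median_rent_selected`.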


### Duplicate Removal

In [72]:
# Check and remove duplicate rows
print(f"Duplicate rows in rent data: {df_median_rent_selected.duplicated().sum()}")
print(f"Duplicate rows in 311 data: {df_nyc_311_selected.duplicated().sum()}")

# Remove duplicates from NYC 311 data
original_shape = df_nyc_311_selected.shape
df_nyc_311_selected = df_nyc_311_selected.drop_duplicates()
print(f"Removed {original_shape[0] - df_nyc_311_selected.shape[0]} duplicate rows from NYC 311 data")
print(f"New shape: {df_nyc_311_selected.shape}")

Duplicate rows in rent data: 0
Duplicate rows in 311 data: 168
Removed 168 duplicate rows from NYC 311 data
New shape: (71587, 15)
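
One point worth checking here: `duplicated()` and `drop_duplicates()` compare column values only and ignore the index. Because `unique_key` was loaded as the index, two separate complaints with identical selected columns are treated as duplicates. The sketch below illustrates this on a toy frame (values are made up); whether the 168 removed rows are true duplicates or distinct complaints is not something the notebook output shows.

```python
import pandas as pd

# Toy frame: two different keys, identical column values
toy = pd.DataFrame(
    {
        "complaint_type": ["Noise", "Noise", "Heat"],
        "incident_zip": [10001, 10001, 10002],
    },
    index=pd.Index([101, 102, 103], name="unique_key"),
)

# Index is ignored, so row 102 is flagged as a duplicate of row 101
dupes_by_columns = int(toy.duplicated().sum())

# Moving the key into a column makes the two rows distinct
dupes_with_key = int(toy.reset_index().duplicated().sum())

print(dupes_by_columns, dupes_with_key)
```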


### Date Data Cleaning

In [73]:
# Convert date columns to datetime format
df_nyc_311_selected['created_date'] = pd.to_datetime(df_nyc_311_selected['created_date'], errors='coerce')
df_nyc_311_selected['closed_date'] = pd.to_datetime(df_nyc_311_selected['closed_date'], errors='coerce')
df_nyc_311_selected['resolution_action_updated_date'] = pd.to_datetime(df_nyc_311_selected['resolution_action_updated_date'], errors='coerce')

# Remove invalid date records (created_date > closed_date)
invalid_dates = df_nyc_311_selected[df_nyc_311_selected['created_date'] > df_nyc_311_selected['closed_date']]
print(f"Number of rows with created_date > closed_date (will be removed): {invalid_dates.shape[0]}")

df_nyc_311_selected = df_nyc_311_selected[
    (df_nyc_311_selected['created_date'] <= df_nyc_311_selected['closed_date']) | 
    (df_nyc_311_selected['closed_date'].isna())
]

print(f"Final NYC 311 data shape after date cleaning: {df_nyc_311_selected.shape}")

Number of rows with created_date > closed_date (will be removed): 21
Final NYC 311 data shape after date cleaning: (71566, 15)
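
The filter above needs the explicit `isna()` clause because any comparison against `NaT` evaluates to `False`. Without it, complaints that are still open (no `closed_date`) would be dropped along with the invalid rows. A small sketch on made-up dates:

```python
import pandas as pd

toy = pd.DataFrame({
    "created_date": pd.to_datetime(["2024-01-01", "2024-01-05", "2024-01-03"]),
    "closed_date": pd.to_datetime(["2024-01-02", "2024-01-04", None]),
})

valid_order = toy["created_date"] <= toy["closed_date"]  # False for the NaT row
still_open = toy["closed_date"].isna()

kept_without_isna = toy[valid_order]             # loses the open complaint
kept_with_isna = toy[valid_order | still_open]   # keeps it

print(len(kept_without_isna), len(kept_with_isna))
```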


### Geographic Data Cleaning

In [74]:
# Standardize city names: trim whitespace and convert to uppercase
df_nyc_311_selected['city'] = df_nyc_311_selected['city'].str.strip().str.upper()

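# Replace selected postal city names with 'OUTSIDE NYC'
# (caution: Breezy Point is in Queens, and Floral Park / New Hyde Park
# addresses can fall on either side of the city line)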
outside_nyc_locations = ['FLORAL PARK', 'NEW HYDE PARK', 'BREEZY POINT']
df_nyc_311_selected['city'] = df_nyc_311_selected['city'].replace(outside_nyc_locations, 'OUTSIDE NYC')

print("City names standardized")
print(f"Unique cities after cleaning: {df_nyc_311_selected['city'].nunique()}")

City names standardized
Unique cities after cleaning: 45


## 3. Data Transformation

### Feature Engineering - NYC 311 Data

In [75]:
# Calculate resolution time in hours
df_nyc_311_selected['resolution_time_hours'] = (
    df_nyc_311_selected['closed_date'] - df_nyc_311_selected['created_date']
).dt.total_seconds() / 3600

# Extract month and year from created_date
df_nyc_311_selected['month'] = df_nyc_311_selected['created_date'].dt.month
df_nyc_311_selected['year'] = df_nyc_311_selected['created_date'].dt.year

print("Created new features: resolution_time_hours, month, year")
print("Resolution time statistics:")
print(df_nyc_311_selected['resolution_time_hours'].describe())

Created new features: resolution_time_hours, month, year
Resolution time statistics:
count    69507.000000
mean       219.524183
std        886.413518
min          0.000000
25%          0.857500
50%          4.824444
75%         52.773194
max      15091.083889
Name: resolution_time_hours, dtype: float64
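
The statistics above are heavily right-skewed (mean of about 220 hours against a median of about 5), which is one reason the later aggregation uses the median. The sketch below shows the timedelta-to-hours conversion on made-up timestamps and how a single long-running complaint pulls the mean away from the median.

```python
import pandas as pd

created = pd.Series(pd.to_datetime(
    ["2024-01-06 13:00", "2024-01-06 14:00", "2024-01-07 09:00"]
))
closed = pd.Series(pd.to_datetime(
    ["2024-01-06 14:30", "2024-01-06 16:00", "2024-02-06 09:00"]
))

# Timedelta -> seconds -> hours
hours = (closed - created).dt.total_seconds() / 3600

print(hours.tolist())
print("mean:", hours.mean(), "median:", hours.median())
```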


### Geographic Mapping - ZIP to Neighborhood

In [76]:
# Create ZIP code to neighborhood mapping dictionary
zip_to_neighborhood = {}

for borough, neighborhoods in uhf_data.items():
    for neighborhood_info in neighborhoods:
        neighborhood_name = neighborhood_info['neighborhood']
        zip_codes = neighborhood_info['zip_codes']
        
        for zip_code in zip_codes:
            zip_to_neighborhood[zip_code] = neighborhood_name

print(f"Created mapping for {len(zip_to_neighborhood)} ZIP codes to neighborhoods")

Created mapping for 176 ZIP codes to neighborhoods
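
The loop above implies that the JSON file is a dictionary of boroughs, each holding a list of records with `neighborhood` and `zip_codes` keys. A sketch of that assumed structure, with placeholder names and ZIP codes, and an equivalent dictionary comprehension:

```python
# Toy stand-in for the structure the loop expects; all values are placeholders
uhf_toy = {
    "Borough X": [
        {"neighborhood": "Neighborhood A", "zip_codes": ["00001", "00002"]},
    ],
    "Borough Y": [
        {"neighborhood": "Neighborhood B", "zip_codes": ["00003", "00004"]},
    ],
}

# Same flattening as the nested loop, written as a comprehension
zip_lookup = {
    zip_code: info["neighborhood"]
    for neighborhoods in uhf_toy.values()
    for info in neighborhoods
    for zip_code in info["zip_codes"]
}

print(zip_lookup)
```

If a ZIP code appeared under two neighborhoods, the later entry would silently overwrite the earlier one in either version.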


In [77]:
# Map ZIP codes to neighborhoods for NYC 311 data
df_nyc_311_selected['incident_zip_str'] = (
    df_nyc_311_selected['incident_zip'].fillna(0).astype(int).astype(str).str.zfill(5)
)
df_nyc_311_selected.loc[df_nyc_311_selected['incident_zip'].isna(), 'incident_zip_str'] = None

df_nyc_311_selected['neighborhood'] = df_nyc_311_selected['incident_zip_str'].map(zip_to_neighborhood)

# Report mapping results
mapped_records = df_nyc_311_selected['neighborhood'].notna().sum()
total_records = len(df_nyc_311_selected)
coverage_percentage = (mapped_records / total_records * 100)

print("Neighborhood mapping results:")
print(f"Records with neighborhood: {mapped_records:,}")
print(f"Records without neighborhood: {total_records - mapped_records:,}")
print(f"Coverage percentage: {coverage_percentage:.2f}%")

# Clean up temporary column
df_nyc_311_selected = df_nyc_311_selected.drop('incident_zip_str', axis=1)

Neighborhood mapping results:
Records with neighborhood: 70,130
Records without neighborhood: 1,436
Coverage percentage: 97.99%
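
The ZIP normalization above fills missing values with 0 and then resets them to `None`. An alternative sketch, assuming the ZIP column was read as floats with `NaN` for missing values, uses pandas nullable dtypes so missing values stay missing throughout. The ZIP values and lookup below are placeholders.

```python
import pandas as pd

# Placeholder lookup and ZIP values, for illustration only
zip_lookup_toy = {"12345": "Neighborhood A", "00123": "Neighborhood B"}
zips = pd.Series([12345.0, None, 123.0])

# Float -> nullable integer -> nullable string, then left-pad to five digits
zip_str = zips.astype("Int64").astype("string").str.zfill(5)

# Missing ZIPs stay missing and simply map to no neighborhood
neighborhoods = zip_str.map(zip_lookup_toy)

print(zip_str.tolist())
print(neighborhoods.tolist())
```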


### Geographic Mapping - Rent Data

In [78]:
# Map area names to neighborhoods for rent data
df_median_rent_selected['neighborhood'] = df_median_rent_selected['areaName'].str.lower().map(manual_map)

# Report mapping results for rent data
mapped_rent_records = df_median_rent_selected['neighborhood'].notna().sum()
total_rent_records = len(df_median_rent_selected)

print("Rent data neighborhood mapping results:")
print(f"Records with neighborhood: {mapped_rent_records}")
print(f"Records without neighborhood: {total_rent_records - mapped_rent_records}")
print(f"Coverage percentage: {(mapped_rent_records / total_rent_records * 100):.2f}%")

Rent data neighborhood mapping results:
Records with neighborhood: 138
Records without neighborhood: 60
Coverage percentage: 69.70%
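
With about 30% of rent rows left unmapped, listing the unmatched `areaName` values is a quick way to see whether `manual_map` is missing entries or whether those rows are not neighborhoods at all. A sketch on placeholder data:

```python
import pandas as pd

# Placeholder area names and mapping, for illustration only
rent_toy = pd.DataFrame({"areaName": ["Area One", "Area Two", "Area Three"]})
manual_map_toy = {"area one": "Neighborhood A", "area two": "Neighborhood A"}

rent_toy["neighborhood"] = rent_toy["areaName"].str.lower().map(manual_map_toy)

# Area names with no entry in the mapping
unmapped_areas = sorted(
    rent_toy.loc[rent_toy["neighborhood"].isna(), "areaName"].unique()
)

print(unmapped_areas)
```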


### Data Aggregation - NYC 311 Complaints

In [79]:
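# Aggregate complaints by neighborhood, complaint type, year, and month.
# Note: 'count' tallies only rows with a non-null resolution_time_hours, so
# complaints that are still open are not included in complaint_count.
# Rows without a neighborhood are dropped by groupby (dropna=True by default).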
complaints_by_neighborhood = df_nyc_311_selected.groupby(
    ['neighborhood', 'complaint_type', 'year', 'month']
).agg({
    'resolution_time_hours': ['count', 'median']
}).reset_index()

# Flatten column names
complaints_by_neighborhood.columns = [
    'neighborhood', 'complaint_type', 'year', 'month', 
    'complaint_count', 'median_resolution_time_hours'
]

# Sort by neighborhood, year, month, and complaint count
complaints_by_neighborhood = complaints_by_neighborhood.sort_values(
    by=['neighborhood', 'year', 'month', 'complaint_count'], 
    ascending=[True, True, True, False]
)

print(f"Aggregated complaints data shape: {complaints_by_neighborhood.shape}")
print(f"Unique neighborhoods in complaints: {complaints_by_neighborhood['neighborhood'].nunique()}")

Aggregated complaints data shape: (13062, 6)
Unique neighborhoods in complaints: 42
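
In pandas, `count` skips missing values while `size` counts every row in the group. Since the aggregation above applies `count` to `resolution_time_hours`, open complaints do not contribute to `complaint_count`. The sketch below contrasts the two on a toy frame using named aggregation; the `closed_count` column name is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "neighborhood": ["A", "A", "A", "B"],
    "resolution_time_hours": [1.0, 3.0, None, None],
})

agg = toy.groupby("neighborhood").agg(
    complaint_count=("resolution_time_hours", "size"),   # all rows
    closed_count=("resolution_time_hours", "count"),     # non-null rows only
    median_resolution_time_hours=("resolution_time_hours", "median"),
).reset_index()

print(agg)
```

Switching the notebook to `size` would change the complaint counts shown in later outputs, so it is a design decision rather than a drop-in fix.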


### Data Aggregation - Rent Data

In [80]:
# Aggregate rent data by neighborhood (median across all areas in same neighborhood)
date_columns = [col for col in df_median_rent_selected.columns if col.startswith('2024') or col.startswith('2025')]
median_rent_by_neighborhood = df_median_rent_selected.groupby('neighborhood')[date_columns].median()

print(f"Aggregated rent data shape: {median_rent_by_neighborhood.shape}")
print(f"Unique neighborhoods in rent data: {median_rent_by_neighborhood.index.nunique()}")

Aggregated rent data shape: (38, 20)
Unique neighborhoods in rent data: 38


### Reshape Rent Data

In [81]:
# Reshape rent data from wide to long format
rent_melted = median_rent_by_neighborhood.reset_index().melt(
    id_vars='neighborhood', 
    var_name='date', 
    value_name='median_rent'
)

# Convert date column and extract year/month
rent_melted['date'] = pd.to_datetime(rent_melted['date'])
rent_melted['year'] = rent_melted['date'].dt.year
rent_melted['month'] = rent_melted['date'].dt.month

print(f"Reshaped rent data shape: {rent_melted.shape}")
print("Sample of reshaped rent data:")
rent_melted.head()

Reshaped rent data shape: (760, 5)
Sample of reshaped rent data:


|  | neighborhood | date | median_rent | year | month |
|---|---|---|---|---|---|
| 0 | Bayside - Little Neck | 2024-01-01 | 2473.0 | 2024 | 1 |
| 1 | Bedford Stuyvesant - Crown Heights | 2024-01-01 | 3000.0 | 2024 | 1 |
| 2 | Bensonhurst - Bay Ridge | 2024-01-01 | 2137.5 | 2024 | 1 |
| 3 | Borough Park | 2024-01-01 | 2275.0 | 2024 | 1 |
| 4 | Canarsie - Flatlands | 2024-01-01 | 2625.0 | 2024 | 1 |
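
The wide-to-long reshape turns each month column into its own row, which is why 38 neighborhoods and 20 month columns yield 760 rows. A sketch on a two-neighborhood, two-month toy frame with placeholder values:

```python
import pandas as pd

# Toy wide frame: one row per neighborhood, one column per month
wide = pd.DataFrame(
    {"2024-01": [2000.0, 3000.0], "2024-02": [2100.0, None]},
    index=pd.Index(["Neighborhood A", "Neighborhood B"], name="neighborhood"),
)

long_rent = wide.reset_index().melt(
    id_vars="neighborhood", var_name="date", value_name="median_rent"
)

# An explicit format avoids relying on date inference for 'YYYY-MM' labels
long_rent["date"] = pd.to_datetime(long_rent["date"], format="%Y-%m")
long_rent["year"] = long_rent["date"].dt.year
long_rent["month"] = long_rent["date"].dt.month

print(long_rent)
```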


## 4. Data Integration

### Merge Complaints and Rent Data

In [82]:
# Merge complaints and rent data on neighborhood, year, and month
df_merged_monthly = pd.merge(
    complaints_by_neighborhood, 
    rent_melted[['neighborhood', 'year', 'month', 'median_rent']], 
    on=['neighborhood', 'year', 'month'], 
    how='left'
)

print(f"Final merged dataset shape: {df_merged_monthly.shape}")
print(f"Records with rent data: {df_merged_monthly['median_rent'].notna().sum()}")
print(f"Records without rent data: {df_merged_monthly['median_rent'].isna().sum()}")

# Display sample of merged data
print("\nSample of merged data:")
df_merged_monthly.head(10)

Final merged dataset shape: (13062, 7)
Records with rent data: 11576
Records without rent data: 1486

Sample of merged data:


|  | neighborhood | complaint_type | year | month | complaint_count | median_resolution_time_hours | median_rent |
|---|---|---|---|---|---|---|---|
| 0 | Bayside - Little Neck | Illegal Parking | 2024 | 1 | 7 | 1.985833 | 2473.0 |
| 1 | Bayside - Little Neck | Abandoned Vehicle | 2024 | 1 | 3 | 2.465833 | 2473.0 |
| 2 | Bayside - Little Neck | Building/Use | 2024 | 1 | 2 | 886.584028 | 2473.0 |
| 3 | Bayside - Little Neck | APPLIANCE | 2024 | 1 | 1 | 260.415278 | 2473.0 |
| 4 | Bayside - Little Neck | Bike/Roller/Skate Chronic | 2024 | 1 | 1 | 0.145833 | 2473.0 |
| 5 | Bayside - Little Neck | Blocked Driveway | 2024 | 1 | 1 | 0.598889 | 2473.0 |
| 6 | Bayside - Little Neck | Commercial Disposal Complaint | 2024 | 1 | 1 | 172.530556 | 2473.0 |
| 7 | Bayside - Little Neck | Damaged Tree | 2024 | 1 | 1 | 638.201667 | 2473.0 |
| 8 | Bayside - Little Neck | General Construction/Plumbing | 2024 | 1 | 1 | 0.0 | 2473.0 |
| 9 | Bayside - Little Neck | HEAT/HOT WATER | 2024 | 1 | 1 | 47.84 | 2473.0 |
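
The left merge keeps every complaint row and repeats the rent value for each complaint type within a neighborhood-month. As a sketch, `validate` can guard against accidental row multiplication if the rent side ever had duplicate keys, and `indicator` shows which rows found a rent match. The frames below are toy data.

```python
import pandas as pd

complaints = pd.DataFrame({
    "neighborhood": ["A", "A", "B"],
    "year": [2024, 2024, 2024],
    "month": [1, 1, 1],
    "complaint_type": ["Noise", "Heat", "Noise"],
})
rent = pd.DataFrame({
    "neighborhood": ["A"],
    "year": [2024],
    "month": [1],
    "median_rent": [2500.0],
})

merged = complaints.merge(
    rent,
    on=["neighborhood", "year", "month"],
    how="left",
    validate="many_to_one",  # raises if rent has duplicate keys
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)
match_counts = merged["_merge"].value_counts()

print(merged)
print(match_counts)
```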


### Data Quality Check

In [83]:
# Final data quality checks
print("=== Final Dataset Summary ===")
print(f"Total records: {len(df_merged_monthly):,}")
print(f"Date range: {df_merged_monthly['year'].min()}-{df_merged_monthly['year'].max()}")
print(f"Unique neighborhoods: {df_merged_monthly['neighborhood'].nunique()}")
print(f"Unique complaint types: {df_merged_monthly['complaint_type'].nunique()}")

print("\n=== Data Completeness ===")
completeness = (df_merged_monthly.notna().sum() / len(df_merged_monthly) * 100).round(2)
print(completeness)

print("\n=== Top 10 Neighborhoods by Complaint Volume ===")
top_neighborhoods = df_merged_monthly.groupby('neighborhood')['complaint_count'].sum().sort_values(ascending=False).head(10)
print(top_neighborhoods)

=== Final Dataset Summary ===
Total records: 13,062
Date range: 2024-2025
Unique neighborhoods: 42
Unique complaint types: 154

=== Data Completeness ===
neighborhood                    100.00
complaint_type                  100.00
year                            100.00
month                           100.00
complaint_count                 100.00
median_resolution_time_hours     95.73
median_rent                      88.62
dtype: float64

=== Top 10 Neighborhoods by Complaint Volume ===
neighborhood
Northeast Bronx                       3925
West Queens                           3495
Washington Heights - Inwood           3241
Downtown - Heights - Park Slope       2790
Bedford Stuyvesant - Crown Heights    2705
Southwest Queens                      2614
Fordham - Bronx Park                  2534
East Flatbush - Flatbush              2507
Jamaica                               2365
Long Island City - Astoria            2263
Name: complaint_count, dtype: int64


### Export Final Dataset

In [84]:
# Export the final merged dataset
output_path = 'data/data_snapshot_for_gdv.csv'
df_merged_monthly.to_csv(output_path, index=False)

print(f"Final dataset exported to: {output_path}")
print(f"Dataset shape: {df_merged_monthly.shape}")
print("\nData wrangling pipeline completed successfully!")

Final dataset exported to: data/data_snapshot_for_gdv.csv
Dataset shape: (13062, 7)

Data wrangling pipeline completed successfully!
