## Library Import

First, we import the necessary components from the simplified `libs_daw` library. The main component is the `NYCDataPipeline` class, which wraps all of the data processing functionality in a single interface.

The library is lightweight and focuses specifically on Method 2 (step-by-step pipeline usage), which makes it well suited to educational use and to workflows where each intermediate result needs to be inspected.

In [None]:
# Import simplified pipeline
from libs_daw import NYCDataPipeline
import pandas as pd

print("Simplified libs_daw library successfully imported!")

## Step-by-Step Pipeline Usage

The pipeline runs four main data processing steps, each building on the output of the previous one:

1. **Data Loading**: Import raw data from various sources (CSV files, JSON mappings)
2. **Data Cleaning**: Remove inconsistencies, handle missing values, standardize formats
3. **Data Transformation**: Create new features, apply geographic mappings, prepare for analysis
4. **Aggregation and Integration**: Combine datasets and create the final analytical dataset

This modular approach allows for:
- **Transparency**: Each step can be inspected and validated independently
- **Flexibility**: Individual steps can be modified without affecting others
- **Debugging**: Easy identification of issues at specific processing stages
- **Learning**: Clear understanding of the data transformation process

In [None]:
# Pipeline initialization
pipeline = NYCDataPipeline()

print("Pipeline initialized!")

### Step 1: Data Loading

**Purpose**: Import all necessary data sources into memory for processing.

**What happens in this step**:
- Loads NYC 311 service requests data from CSV file
- Imports median rent data by neighborhood and time period
- Reads UHF (United Hospital Fund) geographic mapping data from JSON
- Loads manual mapping corrections for data quality improvements

**Data Sources**:
- `nyc_311_2024_2025_sample.csv`: Sample of citizen service requests
- `medianAskingRent_All.csv`: Rental market data by area
- `nyc_uhf_zipcodes.json`: Geographic boundary definitions
- `manual_map.json`: Manual corrections for geographic mappings

**Expected Outcome**: Four separate datasets ready for cleaning and processing.

In [None]:
# Load all four data sources (two CSV files and two JSON mappings)
df_nyc_311, df_median_rent, uhf_data, manual_map = pipeline.load_data()
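
To make the loading step more concrete, here is a minimal sketch of reading one CSV table and one JSON mapping with pandas and the standard library. It is not the implementation of `load_data()`: the helper name `load_sources_sketch`, the column names, and the JSON layout (area name mapped to a list of ZIP codes) are all assumptions made for illustration, and in-memory text stands in for the real files.

```python
import io
import json

import pandas as pd

# Hypothetical stand-ins for the real files. Column names and the JSON
# layout are assumptions, not the actual schema used by libs_daw.
csv_text = (
    "unique_key,complaint_type,incident_zip\n"
    "1,Noise,10001\n"
    "2,Heating,11201\n"
)
json_text = '{"Area A": ["10001", "10011"], "Area B": ["11201"]}'


def load_sources_sketch(csv_buffer, json_buffer):
    """Read one CSV table and one JSON mapping from file-like objects."""
    # Read ZIP codes as strings so leading zeros are not lost.
    table = pd.read_csv(csv_buffer, dtype={"incident_zip": str})
    mapping = json.load(json_buffer)
    return table, mapping


df_demo, mapping_demo = load_sources_sketch(
    io.StringIO(csv_text), io.StringIO(json_text)
)
print(df_demo.shape, len(mapping_demo))
```

Reading ZIP codes as strings is a deliberate choice in this sketch: parsed as integers, a code such as `07030` would lose its leading zero and no longer match the mapping keys.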

### Step 2: Data Cleaning

**Purpose**: Standardize and clean the raw data to ensure quality and consistency.

**What happens in this step**:

**For NYC 311 Data**:
- Removes records with missing critical information (location, complaint type)
- Standardizes date formats and extracts temporal features
- Cleans and normalizes text fields (complaint descriptions, addresses)
- Validates and corrects geographic coordinates
- Removes duplicate entries and obvious data entry errors

**For Median Rent Data**:
- Handles missing rent values using appropriate imputation methods
- Standardizes neighborhood names for consistent mapping
- Validates date ranges and removes outliers
- Ensures proper numeric formatting for rent amounts

**Expected Outcome**: Clean, standardized datasets ready for feature engineering and transformation.

In [None]:
# Clean both datasets
df_nyc_311_cleaned, df_median_rent_cleaned = pipeline.clean_data(
    df_nyc_311, df_median_rent
)
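
The cleaning operations listed above for the 311 data can be sketched on a toy DataFrame. This is an illustration under stated assumptions, not the body of `clean_data()`: the helper `clean_311_sketch` and the column names `unique_key`, `created_date`, `complaint_type`, and `incident_zip` are hypothetical.

```python
import pandas as pd

# Toy input with one duplicate, one unparseable date, and one missing
# complaint type. Column names are assumptions made for illustration.
raw_demo = pd.DataFrame({
    "unique_key": [1, 2, 2, 3, 4],
    "created_date": [
        "2024-01-05 10:00:00",
        "2024-02-10 12:30:00",
        "2024-02-10 12:30:00",
        "not a date",
        "2024-03-01 08:15:00",
    ],
    "complaint_type": [" noise ", "HEAT/HOT WATER", "HEAT/HOT WATER", "Noise", None],
    "incident_zip": ["10001", "11201", "11201", "10002", "10003"],
})


def clean_311_sketch(df):
    """Apply a few typical cleaning steps and return a new DataFrame."""
    out = df.dropna(subset=["complaint_type", "incident_zip"]).copy()
    # Unparseable dates become NaT and are then dropped.
    out["created_date"] = pd.to_datetime(out["created_date"], errors="coerce")
    out = out.dropna(subset=["created_date"])
    # Normalize whitespace and capitalization in the text field.
    out["complaint_type"] = out["complaint_type"].str.strip().str.title()
    # Keep the first record for each request identifier.
    out = out.drop_duplicates(subset="unique_key")
    return out.reset_index(drop=True)


cleaned_demo = clean_311_sketch(raw_demo)
print(cleaned_demo)
```

Comparing `len(raw_demo)` with `len(cleaned_demo)` after each step is a simple way to see how many records a given rule removes.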

### Step 3: Data Transformation

**Purpose**: Create new features and apply geographic mappings to prepare data for analysis.

**What happens in this step**:

**Geographic Transformation**:
- Maps ZIP codes to UHF neighborhoods using the geographic data
- Applies manual mapping corrections for edge cases
- Creates standardized neighborhood identifiers
- Validates geographic consistency across datasets

**Feature Engineering**:
- Extracts temporal features (year, month, season) from dates
- Categorizes complaint types into broader analytical groups
- Creates derived metrics (complaints per capita, rent change rates)
- Generates location-based features for spatial analysis

**Data Harmonization**:
- Ensures both datasets use consistent geographic boundaries
- Aligns temporal periods between 311 and rent data
- Creates join keys for later integration

**Expected Outcome**: Transformed datasets with rich features and consistent geographic mappings, ready for integration.

In [None]:
# Transform data and create new features
df_nyc_311_transformed, df_median_rent_transformed = pipeline.transform_data(
    df_nyc_311_cleaned, df_median_rent_cleaned, uhf_data, manual_map
)
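
The geographic mapping and temporal feature extraction described above can be sketched as a dictionary lookup followed by datetime accessors. The structure of the UHF and manual-mapping data here (area name to ZIP list, and ZIP to area name) is an assumption, as are the helper `build_zip_lookup` and the placeholder area names; the real `transform_data()` may work differently.

```python
import pandas as pd

# Assumed layouts: UHF data maps an area name to its ZIP codes, and the
# manual map assigns individual ZIP codes to an area. Both are hypothetical.
uhf_demo = {"Area A": ["10001", "10011"], "Area B": ["11201"]}
manual_demo = {"10000": "Area A"}


def build_zip_lookup(uhf, manual):
    """Invert the area -> ZIPs mapping, then let manual entries override it."""
    lookup = {zip_code: name for name, zips in uhf.items() for zip_code in zips}
    lookup.update(manual)
    return lookup


requests_demo = pd.DataFrame({
    "incident_zip": ["10001", "11201", "10000", "99999"],
    "created_date": pd.to_datetime(
        ["2024-01-05", "2024-02-10", "2024-07-04", "2025-03-01"]
    ),
})

zip_lookup = build_zip_lookup(uhf_demo, manual_demo)
# ZIP codes missing from the lookup map to NaN and can be reviewed separately.
requests_demo["neighborhood"] = requests_demo["incident_zip"].map(zip_lookup)
requests_demo["year"] = requests_demo["created_date"].dt.year
requests_demo["month"] = requests_demo["created_date"].dt.month
print(requests_demo)
```

Applying the manual corrections last means they take precedence over the base mapping, which matches their described role as fixes for edge cases.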

### Step 4: Aggregation and Integration

**Purpose**: Combine the transformed datasets into a single, analysis-ready dataset.

**What happens in this step**:

**Data Aggregation**:
- Aggregates 311 complaints by neighborhood and time period
- Calculates complaint frequency metrics and trends
- Summarizes rent data by geographic area and temporal windows
- Computes statistical measures (means, medians, trends) for both datasets

**Dataset Integration**:
- Joins 311 and rent data using standardized geographic keys
- Handles temporal alignment between different data collection periods
- Resolves any remaining data inconsistencies
- Creates composite metrics combining both data sources

**Quality Assurance**:
- Validates the integrated dataset for completeness
- Checks for logical consistency across combined features
- Ensures proper data types and value ranges
- Documents any limitations or caveats in the final dataset

**Expected Outcome**: A single, comprehensive dataset ready for analysis, visualization, and modeling, containing both service request patterns and housing market information by NYC neighborhood.

In [None]:
# Combine and create final dataset
final_dataset = pipeline.aggregate_and_integrate(
    df_nyc_311_transformed, df_median_rent_transformed
)

print(f"\nFinal dataset created: {final_dataset.shape}")
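
As a rough illustration of the aggregation and join described above, the sketch below counts complaints per neighborhood, year, and complaint type, then attaches rent values with a left join. The `neighborhood`, `year`, and `complaint_type` columns match those used later in this notebook; `complaint_count`, `median_rent`, and the toy values are assumptions, and this is not the implementation of `aggregate_and_integrate()`.

```python
import pandas as pd

# Toy inputs; the values are made up for illustration.
complaints_demo = pd.DataFrame({
    "neighborhood": ["Area A", "Area A", "Area A", "Area B"],
    "year": [2024, 2024, 2024, 2024],
    "complaint_type": ["Noise", "Noise", "Heating", "Noise"],
})
rent_demo = pd.DataFrame({
    "neighborhood": ["Area A", "Area B"],
    "year": [2024, 2024],
    "median_rent": [3500.0, 2800.0],
})

# Aggregate: one row per neighborhood, year, and complaint type.
counts_demo = (
    complaints_demo
    .groupby(["neighborhood", "year", "complaint_type"])
    .size()
    .reset_index(name="complaint_count")
)

# Integrate: a left join keeps every complaint group even without rent data.
# validate="many_to_one" raises if the rent table has duplicate join keys.
merged_demo = counts_demo.merge(
    rent_demo, on=["neighborhood", "year"], how="left", validate="many_to_one"
)
print(merged_demo)
```

The `validate` argument is a cheap safeguard: a duplicated key on the rent side would otherwise silently multiply rows in the result.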

## Results Overview

**Purpose**: Examine the final integrated dataset to understand its structure and quality.

This section provides a comprehensive overview of the processed data, including:

**Dataset Structure Analysis**:
- Dimensions (rows and columns) of the final dataset
- Column names and their meanings
- Data types and value ranges

**Data Quality Assessment**:
- Completeness rates for each variable
- Distribution of key categorical variables
- Temporal and geographic coverage

**Sample Data Review**:
- Representative sample of the integrated records
- Examples of how 311 complaints and rent data are combined
- Validation of the integration logic

This overview helps verify that the data processing pipeline has successfully created a usable dataset for downstream analysis.

In [None]:
# Review final dataset structure
print("=== FINAL DATASET STRUCTURE ===")
print(f"Size: {final_dataset.shape}")
print(f"Columns: {list(final_dataset.columns)}")

print("\n=== DATA SAMPLE ===")
display(final_dataset.head(10))

In [None]:
# Basic statistics
print("=== BASIC STATISTICS ===")
print(f"Total records: {len(final_dataset):,}")
print(f"Unique neighborhoods: {final_dataset['neighborhood'].nunique()}")
print(f"Unique complaint types: {final_dataset['complaint_type'].nunique()}")
print(f"Year range: {final_dataset['year'].min()}-{final_dataset['year'].max()}")

print("\n=== DATA COMPLETENESS ===")
completeness = (final_dataset.notna().sum() / len(final_dataset) * 100).round(1)
for col, pct in completeness.items():
    print(f"{col}: {pct}%")

In [None]:
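# Save a snapshot of the final dataset; the `data/` directory must already exist.
# By default, to_csv also writes the DataFrame index as the first column.
final_dataset.to_csv('data/data_snapshot_for_gdv.csv')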
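
When the snapshot is read back later, keep in mind that `to_csv` writes the index by default, so `read_csv` returns an extra `Unnamed: 0` column. The sketch below shows the difference on a toy DataFrame, using an in-memory buffer in place of the real file; the data is made up for illustration.

```python
import io

import pandas as pd

demo = pd.DataFrame({"neighborhood": ["Area A", "Area B"], "year": [2024, 2024]})

# Default behavior: the index is written as an unnamed first column.
buffer = io.StringIO()
demo.to_csv(buffer)
buffer.seek(0)
with_index = pd.read_csv(buffer)

# With index=False, only the data columns are written.
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)
without_index = pd.read_csv(buffer)

print(list(with_index.columns))     # ['Unnamed: 0', 'neighborhood', 'year']
print(list(without_index.columns))  # ['neighborhood', 'year']
```

Either form works for downstream use; reading with `pd.read_csv(path, index_col=0)` restores the index when it was written.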
