# DS Project 1: NYC Housing & Neighborhood Factors

**Authors:** Jason Qin; Xiao Teng; Guichen Zheng; Yaxuan Hu

---

## 1. Introduction and Dataset Description

This project examines the relationship between housing prices and neighborhood characteristics in New York City. The goal is to build a clean ZIP-level monthly panel dataset and explore factors that influence property values, for use by real estate analysts, policymakers, and urban planners.

The analysis integrates four primary data sources:

**1. Housing Data (Zillow Home Value Index, ZHVI)**
- *Source:* Zillow Research Data
- *Coverage:* 26,307 records across NYC ZIP codes; January 2000–December 2025 (312 months)
- *Key variables:* ZIP code, city, metro area, monthly home values
- *Description:* Typical home value by geography, smoothed and seasonally adjusted.

**2. Crime Data (NYPD Complaint Records)**
- *Source:* NYC Open Data – NYPD Complaint Database
- *Coverage:* 577,674 crime incidents
- *Key variables:* Complaint number, date, borough, offense type, law category
- *Description:* Official NYPD criminal complaints across the five NYC boroughs.

**3. Demographics (Census ACS)**
- *Source:* U.S. Census Bureau American Community Survey (5-year estimates)
- *Coverage:* 33,772 census areas (ZCTA-level)
- *Key variables:* Population, median income, poverty, labor force, unemployment, education
- *Description:* Socioeconomic and demographic characteristics by area.

**4. Economic Indicators (FRED)**
- *Source:* Federal Reserve Economic Data
- *Coverage:* 197 time points (January 2023 onward)
- *Key variables:* 30-year mortgage rate (MORTGAGE30US), Federal Funds rate (FEDFUNDS)
- *Description:* Macroeconomic indicators affecting housing affordability and demand.

---

## 2. Data Acquisition Methodology

Raw data were acquired through a dedicated acquisition layer (Python 3.11) that downloads or ingests from each source and writes timestamped raw files to `data/raw/<source>/`.

- **Zillow:** CSV downloaded from Zillow Research Data (ZIP-level, smoothed seasonally adjusted ZHVI). Supported via inbox mode (user places CSV in `data/raw/zillow/inbox/`) or direct download; URLs are configurable via environment variables.
- **NYC Crime:** NYPD complaint data retrieved via NYC Open Data Socrata API with date-range filters and pagination (50,000 rows per request). Output saved as Parquet.
- **Census ACS:** ACS 5-year estimates at ZCTA level pulled from the Census API (nationwide ZCTAs; filtering to NYC is done in preprocessing). Variables include population, median income, poverty, labor force, unemployment, and education. Output saved as Parquet. Optional `CENSUS_API_KEY` for higher rate limits.
- **FRED:** 30-year mortgage rate and Fed funds rate series fetched from the FRED API for a specified date range. Requires `FRED_API_KEY` in `.env`. Output saved as CSV.
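
The paginated Socrata retrieval described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's acquisition code: `paginate`, `socrata_params`, and the `fetch_page` callable are hypothetical names, and the column names `cmplnt_fr_dt` and `cmplnt_num` are assumptions about the NYPD dataset schema. `$limit`, `$offset`, `$order`, and `$where` are standard SoQL query parameters.

```python
from typing import Callable, Iterator


def paginate(
    fetch_page: Callable[[int, int], list[dict]],
    page_size: int = 50_000,
) -> Iterator[dict]:
    """Yield rows page by page; a short or empty page signals the end."""
    offset = 0
    while True:
        rows = fetch_page(page_size, offset)  # hypothetical: wraps the HTTP call
        yield from rows
        if len(rows) < page_size:
            break
        offset += page_size


def socrata_params(limit: int, offset: int, start: str, end: str) -> dict[str, str]:
    """Build SoQL parameters for one page (column names are assumed)."""
    return {
        "$limit": str(limit),
        "$offset": str(offset),
        "$order": "cmplnt_num",  # a stable order keeps pages from overlapping
        "$where": f"cmplnt_fr_dt between '{start}' and '{end}'",
    }
```

Passing the fetch function in as an argument keeps the paging logic testable without network access; in real use, `fetch_page` would issue a request with the parameters from `socrata_params`.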

All runs are logged to `data/metadata/ingest_log.json` (row counts, columns, null counts) and `data/metadata/sources.md`. A smoke-test notebook (`00_Data_Collection_Sanity_Check.ipynb`) loads the newest raw file from each source to verify shapes. Dependencies: pandas, requests, pyarrow, python-dotenv; see project README for setup and CLI usage.
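
As a rough sketch of what one ingest-log entry might hold (row counts, columns, null counts), assuming pandas DataFrames as the in-memory format. The function names and the exact JSON layout are hypothetical; the real `ingest_log.json` schema may differ.

```python
import json

import pandas as pd


def summarize_frame(df: pd.DataFrame, source: str) -> dict:
    """Collect basic shape and null statistics for one ingested file."""
    return {
        "source": source,
        "rows": int(len(df)),
        "columns": [str(c) for c in df.columns],
        # Cast to plain int so the values are JSON-serializable.
        "null_counts": {str(c): int(n) for c, n in df.isna().sum().items()},
    }


def to_log_json(entries: list[dict]) -> str:
    """Serialize log entries as indented JSON text."""
    return json.dumps(entries, indent=2)
```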

---

## 3. Cleaning and Preprocessing Steps

**Objective:** Construct a clean ZIP × Month panel from January 2023 onward, integrating Zillow, ACS, crime, and FRED.

**Zillow restructuring:** The raw Zillow dataset was wide (one column per month). We identified date columns, standardized ZIPs to five-digit strings, used `melt()` to convert to long format, converted dates to `YYYY-MM`, and kept observations from January 2023 onward. Result: base panel `zip | month | zhvi`.
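
The wide-to-long step might look roughly like this. It is a minimal sketch assuming the ZIP identifier sits in a column named `RegionName` and that date columns are named like `2023-01-31`; both are assumptions about the raw file, and `zillow_wide_to_long` is a hypothetical helper name.

```python
import re

import pandas as pd

DATE_COL = re.compile(r"\d{4}-\d{2}-\d{2}")  # assumed date-column naming


def zillow_wide_to_long(wide: pd.DataFrame, start_month: str = "2023-01") -> pd.DataFrame:
    """Reshape one-column-per-month ZHVI data into zip | month | zhvi rows."""
    date_cols = [c for c in wide.columns if DATE_COL.fullmatch(str(c))]
    # Standardize ZIPs to five-digit strings (restores leading zeros).
    out = wide.assign(zip=wide["RegionName"].astype(str).str.zfill(5))
    long = out.melt(
        id_vars=["zip"], value_vars=date_cols, var_name="month", value_name="zhvi"
    )
    long["month"] = long["month"].str.slice(0, 7)  # YYYY-MM-DD -> YYYY-MM
    long = long[long["month"] >= start_month]  # string order matches date order
    return long.sort_values(["zip", "month"]).reset_index(drop=True)
```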

**ACS integration:** We selected variables (population, median household income, poverty base/count, labor force, unemployment, educational attainment), renamed coded ACS variables to readable names, standardized ZIP format, and built derived variables (poverty rate, unemployment rate, and a higher-education count). ACS was merged into the Zillow panel with a left join on ZIP. Because the ACS 5-year estimates do not vary by month, each ZIP-month row carries the same static socioeconomic characteristics for its ZIP.
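
A sketch of the rate construction and the left join, assuming the ACS columns have already been renamed. The column names (`poverty_count`, `poverty_base`, `unemployed`, `labor_force`) and function names are hypothetical placeholders for whatever readable names the project chose.

```python
import numpy as np
import pandas as pd


def add_acs_rates(acs: pd.DataFrame) -> pd.DataFrame:
    """Standardize ZIPs and derive rate variables from raw counts."""
    out = acs.copy()
    out["zip"] = out["zip"].astype(str).str.zfill(5)
    # Replace zero denominators with NaN so rates are undefined, not infinite.
    out["poverty_rate"] = out["poverty_count"] / out["poverty_base"].replace(0, np.nan)
    out["unemployment_rate"] = out["unemployed"] / out["labor_force"].replace(0, np.nan)
    return out


def merge_acs(panel: pd.DataFrame, acs: pd.DataFrame) -> pd.DataFrame:
    """Left-join static ACS attributes onto the ZIP x Month panel."""
    # validate raises if a ZIP appears more than once on the ACS side.
    return panel.merge(acs, on="zip", how="left", validate="many_to_one")
```

The `validate="many_to_one"` check is a design choice worth considering here: a duplicated ZIP in the ACS table would otherwise silently multiply panel rows.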

**Crime spatial integration:** Crime data are incident-level and carry no ZIP field. We converted each incident's latitude/longitude to a point geometry, parsed MODZCTA polygon geometries from WKT, and performed a point-in-polygon spatial join to assign each incident to a ZIP. Crime dates were converted to datetime and truncated to month (`YYYY-MM`). We then computed `crime_count` by ZIP and month; ZIP-months with no matching incidents after the merge were set to 0.
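
To illustrate the idea behind the point-in-polygon assignment, here is a dependency-free sketch using the classic ray-casting test. It is not the project's implementation, which would more likely rely on a geospatial library; it skips WKT parsing and assumes each zone is a single ring of `(lon, lat)` vertices. All names are hypothetical.

```python
from collections import Counter


def point_in_polygon(x: float, y: float, ring: list[tuple[float, float]]) -> bool:
    """Ray casting: count edge crossings of a horizontal ray going right."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the ray's y-level
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def monthly_crime_counts(incidents: list[dict], zones: dict) -> Counter:
    """Count incidents per (zip, YYYY-MM); unmatched points are skipped."""
    counts: Counter = Counter()
    for inc in incidents:
        for zip_code, ring in zones.items():
            if point_in_polygon(inc["lon"], inc["lat"], ring):
                counts[(zip_code, inc["date"][:7])] += 1  # ISO date -> month
                break
    return counts
```

A `Counter` returns 0 for keys it has never seen, which mirrors the rule of setting missing ZIP-month counts to 0.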

**FRED integration:** FRED series were converted to monthly frequency (dates to `YYYY-MM`, daily/weekly values aggregated to monthly averages) and merged on month. FRED variables are national, so values repeat across ZIPs within the same month.
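
The monthly aggregation and broadcast merge could be sketched as below, assuming each FRED series arrives as a frame with a `date` column and one value column. The function names and the `date` column name are assumptions for illustration.

```python
import pandas as pd


def to_monthly(series_df: pd.DataFrame, value_col: str) -> pd.DataFrame:
    """Average daily or weekly observations within each YYYY-MM."""
    out = series_df.copy()
    out["month"] = pd.to_datetime(out["date"]).dt.strftime("%Y-%m")
    return out.groupby("month", as_index=False)[value_col].mean()


def merge_macro(panel: pd.DataFrame, *monthly_frames: pd.DataFrame) -> pd.DataFrame:
    """Attach national series by month; values repeat across ZIPs."""
    for monthly in monthly_frames:
        panel = panel.merge(monthly, on="month", how="left")
    return panel
```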

**Missing values:** The Census sentinel value (-666666666) in `median_income` was treated as missing. Missing rates were computed for all variables (all below 1%), and rows with missing values were dropped. After cleaning, the dataset has no missing values.
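
A possible shape for this step, shown as a sketch. The sentinel is passed in as a parameter rather than hard-coded, and `clean_missing` is a hypothetical name; the order of operations here (sentinel to missing, then rates, then drop) is an assumption about how the steps fit together.

```python
import pandas as pd


def clean_missing(
    panel: pd.DataFrame, sentinel: int, col: str = "median_income"
) -> tuple[pd.DataFrame, pd.Series]:
    """Convert the sentinel to NaN, report missing rates, drop incomplete rows."""
    out = panel.copy()
    # mask() blanks out entries equal to the sentinel (integer columns become float).
    out[col] = out[col].mask(out[col] == sentinel)
    missing_rate = out.isna().mean()  # share of missing values per column
    cleaned = out.dropna().reset_index(drop=True)
    return cleaned, missing_rate
```

Returning the missing rates alongside the cleaned frame makes it easy to confirm that the share of dropped rows is as small as expected before discarding anything.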

**Outliers:** We used descriptive statistics, logical checks, and statistical rules (IQR, Z-score). Many flagged values reflect real cross-sectional variation across ZIPs rather than errors, so no statistical outliers were removed.
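
For reference, the two statistical rules can be sketched as follows. The thresholds (1.5 × IQR and |z| > 3) are the conventional defaults and are assumptions here, since the report does not state which cutoffs were used.

```python
import pandas as pd


def iqr_flags(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)


def zscore_flags(s: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Flag values more than `threshold` standard deviations from the mean."""
    z = (s - s.mean()) / s.std(ddof=0)
    return z.abs() > threshold
```

The two rules can disagree on skewed data: a single extreme value inflates the standard deviation and may escape the z-score rule while still being flagged by the IQR rule, which is one reason flags are better treated as prompts for inspection than as grounds for automatic removal.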

**Final structure:** `zip | month | zhvi | socioeconomic variables | crime_count | macro variables` — spatially and temporally aligned, with no missing values.

---

## 4. Exploratory Data Analysis (EDA)

**Data overview:** Housing data (State = "NY") contain 321 columns (identifiers plus monthly ZHVI from Jan 2000–Dec 2025). Crime data: 577,674 records, 36 variables (offense, borough, law category, etc.). Demographics: 33,772 areas, 12 socioeconomic variables. Economic data: two series (MORTGAGE30US, FEDFUNDS) over time.

**Visualizations and interpretations:**

- **Housing price trends:** Sample ZIPs show strong appreciation 2000–2025, a dip around 2008–2009 (financial crisis), recovery by ~2012–2013, and steep growth in 2020–2025. Location drives both level and growth.
- **Home value distribution:** Right-skewed, with a median of $306,920 and most values between $200k and $400k; the tail extends to ~$6M. The mean ($458,458) sits well above the median because high-value ZIPs pull it up.
- **Crime types:** Petit larceny and harassment dominate; top categories largely property-related; assault/felony assault in top 15; quality-of-life offenses present.
- **Crime by borough:** Brooklyn highest by count, then Manhattan, Bronx, Queens; Staten Island much lower. Counts are not per capita.
- **Economic indicators:** The mortgage rate peaked at ~7.5% in late 2023 and fell to ~6% by early 2026; the Fed Funds rate peaked at ~5.3% and fell to ~3.5–3.8%. Lower rates likely supported demand and prices.
- **Box plot (home values):** Median ~$300k; IQR ~$200k–$500k; many high-value outliers ($3M–$6M); long upper tail.

---

## 5. Feature Engineering Process and Justification

Feature engineering turns raw data into structured, economically meaningful signals to improve model performance and interpretability.

**Rates and proportions:** Raw counts (e.g., crime, education) scale with population. We use **crime per 1,000 residents**, **education shares**, and **graduate-level share** so comparisons across ZIPs reflect intensity and structure, not size.
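
A sketch of the normalization, with hypothetical column names (`population`, `crime_count`, `grad_count`, `edu_base`). The choice of denominator for the education share is an assumption; the report does not specify it.

```python
import pandas as pd


def add_rate_features(df: pd.DataFrame) -> pd.DataFrame:
    """Turn raw counts into size-independent rates and shares."""
    out = df.copy()
    # where() leaves NaN for non-positive denominators, avoiding division by zero.
    pop = out["population"].where(out["population"] > 0)
    base = out["edu_base"].where(out["edu_base"] > 0)
    out["crime_per_1000"] = out["crime_count"] / pop * 1000
    out["graduate_share"] = out["grad_count"] / base
    return out
```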

**Income:** Median income is right-skewed. We apply a **log transformation** to reduce skew, stabilize variance, and align with elasticity-style interpretation.
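
In code, the transformation is a one-liner. The sketch below assumes the natural log and guards against non-positive values; `add_log_income` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def add_log_income(df: pd.DataFrame, col: str = "median_income") -> pd.DataFrame:
    """Add a natural-log version of an income column."""
    out = df.copy()
    # Non-positive incomes become NaN instead of -inf or a runtime warning.
    out["log_" + col] = np.log(out[col].where(out[col] > 0))
    return out
```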

**Temporal dynamics:** We add **lag features** (e.g., 1-, 3-, 6-, 12-month), **month-over-month and year-over-year percentage changes**, **rolling means**, and **rolling standard deviations** (volatility). Lags capture persistence; rolling means smooth noise, and rolling standard deviations measure volatility. All features use only past information to avoid leakage.
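
One way to build these features without leakage is sketched below. It assumes that "past information" means the features for month t use values only through month t-1, so changes and rolling statistics are computed on the series shifted by one month; that reading, the window length, and the function name are assumptions.

```python
import pandas as pd


def add_temporal_features(
    panel: pd.DataFrame,
    col: str = "zhvi",
    lags: tuple[int, ...] = (1, 3, 6, 12),
    window: int = 3,
) -> pd.DataFrame:
    """Add per-ZIP lags, percentage changes, and rolling stats from past values."""
    out = panel.sort_values(["zip", "month"]).copy()
    grouped = out[col].groupby(out["zip"])  # shifts never cross ZIP boundaries
    for k in lags:
        out[f"{col}_lag{k}"] = grouped.shift(k)
    prev = grouped.shift(1)
    # Changes are measured between past months only (t-1 vs t-2, t-1 vs t-13).
    out[f"{col}_mom_pct"] = (prev / grouped.shift(2) - 1) * 100
    out[f"{col}_yoy_pct"] = (prev / grouped.shift(13) - 1) * 100
    out[f"{col}_roll_mean{window}"] = grouped.transform(
        lambda s: s.shift(1).rolling(window).mean()
    )
    out[f"{col}_roll_std{window}"] = grouped.transform(
        lambda s: s.shift(1).rolling(window).std()
    )
    return out
```

Early months in each ZIP will have missing values for longer lags, so a modeling step would need to drop or otherwise handle those rows.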

**Seasonality:** Month is cyclical (Dec and Jan are adjacent). We encode **month using sine and cosine** so the model can capture seasonality without a discrete jump at year-end.
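
The cyclical encoding can be sketched as follows, assuming `month` is stored as a `YYYY-MM` string as in the panel; the output column names are hypothetical.

```python
import numpy as np
import pandas as pd


def add_month_cyclical(df: pd.DataFrame, month_col: str = "month") -> pd.DataFrame:
    """Map calendar month onto the unit circle via sine and cosine."""
    out = df.copy()
    month_num = out[month_col].str.slice(5, 7).astype(int)  # 1..12
    angle = 2 * np.pi * (month_num - 1) / 12
    out["month_sin"] = np.sin(angle)
    out["month_cos"] = np.cos(angle)
    return out
```

With this encoding, December and January are as close to each other as any other pair of consecutive months, which a plain 1 to 12 integer code would not achieve.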

**Impact:** The final feature set captures structure (rates/shares), dynamics (lags, changes), stability (volatility), and seasonality (cyclical month), supporting more interpretable and robust modeling.

---

## 6. Summary of Key Findings

- **Housing:** Sustained appreciation over 2000–2025; post-2008 recovery; strong growth in 2020–2025; large variation across ZIPs.
- **Crime:** High incident counts; petit larceny and harassment are the most frequent offenses, and the top categories are largely property-related; borough-level variation; relevance for neighborhood desirability and prices to be quantified in modeling.
- **Economic:** Mortgage and Fed Funds rates declined from 2023 to 2026; lower borrowing costs likely supported demand and price growth.
- **Demographics:** ACS coverage (33,772 ZCTAs nationwide, filtered to NYC in preprocessing) on income, poverty, education, and employment provides a basis for linking neighborhood characteristics to housing outcomes.
- **Final dataset:** Clean ZIP × Month panel with ZHVI, socioeconomic variables, crime_count, and macro variables, ready for statistical and predictive modeling.

---

## 7. Challenges Faced and Future Recommendations

**Challenges:**
- *Data quality:* Some Parquet reads failed; data were re-exported as CSV and loaded successfully.
- *Scale:* Large crime file (577k+ rows) required efficient pandas usage and aggregated visualizations.
- *Geography:* Housing uses ZIP codes and demographics use ZCTAs, so the two had to be aligned; crime incidents carry no ZIP and required a spatial join against MODZCTA polygons.
- *Time alignment:* Different sources covered different periods; analysis focused on overlapping windows (e.g., from Jan 2023).

**Future recommendations:**
- Maintain a single geographic standard (ZIP) and document ZCTA–ZIP mapping.
- Extend EDA with formal correlation/regression, spatial maps, and time-series forecasting.
- Consider per-capita crime rates, price-change metrics, and composite neighborhood indicators in modeling.
- Keep acquisition scripts and metadata (ingest_log, sources.md) updated for reproducibility.

---

## 8. Link to GitHub Repository

**Repository:** https://github.com/jaysonedu/DS_Project1

The repository is public. All project files (source code, notebooks, data metadata, and this report) are stored in the repo and accessible.

---

## 9. Each Member's Contribution to the Project

**Data Acquisition - Jason Qin**
- Designed and implemented the data acquisition layer (Zillow, NYC Crime, Census ACS, FRED, geo).
- Built CLI entry points, shared utilities (HTTP retries, env, metadata logging), and loaders for downstream use.
- Wrote README, cleaning/merging guide, and contribution documentation.

**Exploratory Data Analysis - Xiao Teng**
- Loaded and validated all four datasets (housing, crime, demographics, economic indicators).
- Performed EDA and created multiple visualizations (housing trends, crime patterns, economic conditions, distributions).
- Produced summary statistics, identified patterns, and documented findings and data quality issues for preprocessing.

**Data Cleaning and Preprocessing - Guichen Zheng**
- Restructured Zillow data (wide to long), integrated ACS (renaming, rates, merge on ZIP), and built the base panel.
- Implemented spatial integration of crime (point-in-polygon with MODZCTA), temporal aggregation, and crime_count by ZIP-month.
- Integrated FRED (monthly aggregation, merge on month), handled missing values and Census sentinels, and evaluated outliers.
- Delivered the final clean ZIP × Month dataset and the Data Preprocessing Report.

**Feature Engineering - Yaxuan Hu**
- Defined and justified structural features (rates, proportions), income transformation, temporal features (lags, changes, rolling stats), and seasonality encoding.
- Documented rationale and impact on modeling (featureEngineer.md and report section above).