# Step 1: Fetch Data

This notebook handles data acquisition from multiple sources to build our NYC 311 analysis dataset:

1. **NYC 311 Service Requests** - Historical service request data from NYC Open Data (Socrata API)
2. **US Census ACS Data** - Population estimates by census block group for per-capita analysis
3. **NOAA Weather Data** - Daily temperature and precipitation data to understand environmental factors

All data is fetched using custom functions from the `src.fetch` module and can be saved locally or to cloud storage.



## Environment Setup

### Set Environment Variables

The NYC Open Data API (Socrata) requires authentication credentials. This notebook assumes that `SOCRATA_APP_TOKEN`, `SOCRATA_API_KEY_ID`, and `SOCRATA_API_KEY_SECRET` are set in a `.env` file in the project root directory.

To obtain these credentials:
1. Create an account at [NYC Open Data](https://data.cityofnewyork.us/)
2. Generate API credentials from your account settings
3. Add them to a `.env` file (never commit this file to version control)

In [1]:
from dotenv import load_dotenv

load_dotenv()


True
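
The `True` above only indicates that `load_dotenv()` loaded something from a `.env` file; it does not confirm that all three Socrata variables are present. A minimal check could look like the sketch below. `missing_env_vars` is a hypothetical helper written for this illustration, not part of `src.fetch`.

```python
import os

# Variable names taken from the setup notes above.
REQUIRED_VARS = ("SOCRATA_APP_TOKEN", "SOCRATA_API_KEY_ID", "SOCRATA_API_KEY_SECRET")


def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or empty in the given environment mapping.

    Hypothetical helper for illustration; not part of src.fetch.
    """
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing credentials:", ", ".join(missing))
```

Running this right after `load_dotenv()` surfaces a missing or misspelled variable before any API call is made.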

### Load Packages

The `src.fetch` module contains custom functions for retrieving data from various APIs:
- **Socrata API** for NYC 311 data
- **US Census API** for population estimates  
- **AWS Open Data** for NOAA weather records

The `src.config` module defines constants like column selections, date ranges, and file paths.


In [2]:
import os
import sys
import pandas as pd
import matplotlib.pyplot as plt

PACKAGE_PATH = os.path.abspath(os.path.join(os.getcwd(), '..'))
sys.path.insert(0, PACKAGE_PATH)

from src import fetch
from src import config


## NYC 311 Service Request Data

### Full Data Pull - All Agencies

This option fetches **all** service requests from **all NYC agencies** from 2010 to present. 

**Runtime:** ~48 minutes (depending on network speed and API rate limits)

**Data Volume:** 41+ million records across all agencies

**Use Case:** Comprehensive city-wide analysis across all service types

The data is partitioned by year and month for efficient storage and querying. Uncomment the cell below to execute.

In [3]:
# await fetch.fetch_all_service_requests(save = True)
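
To picture the year/month partitioning, the sketch below enumerates monthly windows and builds a Hive-style path for each one. The helper names `month_starts` and `partition_path` are hypothetical; the `year=YYYY/month=MM/part-0000.parquet` pattern follows the paths that appear in this notebook's log output, and the root folder is the landing location listed in the summary. How `src.fetch` builds its paths internally is not shown here.

```python
from datetime import date


def month_starts(start, end):
    """Yield the first day of each month from start's month through end's month."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield date(year, month, 1)
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)


def partition_path(root, month_start, filename="part-0000.parquet"):
    """Build a Hive-style year=/month= path for one monthly partition."""
    return f"{root}/year={month_start.year}/month={month_start.month:02d}/{filename}"


months = list(month_starts(date(2010, 1, 1), date(2010, 3, 15)))
for m in months:
    print(partition_path("data/landing/311-service-requests", m))
```

Because the partition keys are encoded in the folder names, a reader that understands Hive partitioning can skip whole months without opening their files.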


### Filtered Data Pull - DOHMH Only

This option fetches service requests for the **Department of Health and Mental Hygiene (DOHMH)** only.

**Runtime:** ~90 seconds

**Data Volume:** ~1 million records (DOHMH subset)

**Why DOHMH?**
- Consistent data quality and processing workflows
- Clear complaint type categorization
- Sufficient volume for statistical modeling
- Well-defined resolution outcomes

This filtered approach is ideal for focused analysis and predictive modeling. The data includes only DOHMH-relevant columns to reduce storage requirements.
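
For readers unfamiliar with Socrata, the column selection and agency filter roughly correspond to the SoQL `$select` and `$where` query parameters. The sketch below shows how one month's query might be assembled. It is an assumption-laden illustration: `build_soql_params` is a hypothetical helper, the real values of `config.DOHMH_COLUMNS` and `config.AGENCY_FILTER` are not shown in this notebook, and the field names `created_date` and `agency` are assumed from the public 311 dataset.

```python
from datetime import date


def build_soql_params(month_start, next_month_start, columns=None,
                      additional_filter=None, limit=50000):
    """Assemble Socrata query parameters for one month of requests.

    Hypothetical helper; the page size of 50000 is an arbitrary example value.
    """
    where = (
        f"created_date >= '{month_start.isoformat()}T00:00:00' "
        f"AND created_date < '{next_month_start.isoformat()}T00:00:00'"
    )
    if additional_filter:
        # Parenthesize both parts so an OR inside the extra filter cannot leak out.
        where = f"({where}) AND ({additional_filter})"
    params = {"$where": where, "$limit": limit}
    if columns:
        params["$select"] = ", ".join(columns)
    return params


params = build_soql_params(
    date(2010, 1, 1),
    date(2010, 2, 1),
    columns=["unique_key", "created_date", "complaint_type"],  # example columns
    additional_filter="agency = 'DOHMH'",  # assumed shape of the agency filter
)
print(params["$where"])
```

Filtering and selecting on the server side is what keeps the DOHMH pull small: only matching rows and the requested columns travel over the network.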

In [None]:
# await fetch.fetch_all_service_requests(save = True,
#     columns=config.DOHMH_COLUMNS,
#     additional_filter=config.AGENCY_FILTER
# )

Fetching 2010-01 ...
Fetching 2010-02 ...
Fetching 2010-03 ...
Fetching 2010-04 ...
Fetching 2010-05 ...
Fetching 2010-06 ...
Fetching 2010-07 ...
Fetching 2010-08 ...
Fetching 2010-09 ...
Fetching 2010-10 ...
Fetching 2010-11 ...
Fetching 2010-12 ...
Fetching 2011-01 ...
Fetching 2011-02 ...
Fetching 2011-03 ...
Fetching 2011-04 ...
Fetching 2011-05 ...
Fetching 2011-06 ...
Fetching 2011-07 ...
Fetching 2011-08 ...
Saved s3://hbc-technical-assessment-gk/landing/DOHMH/year=2011/month=01/part-0000.parquet (3,220 rows, 25 columns)
Fetching 2011-09 ...
Saved s3://hbc-technical-assessment-gk/landing/DOHMH/year=2010/month=10/part-0000.parquet (3,728 rows, 25 columns)
Saved s3://hbc-technical-assessment-gk/landing/DOHMH/year=2010/month=02/part-0000.parquet (2,821 rows, 25 columns)
Fetching 2011-10 ...
Fetching 2011-11 ...
Saved s3://hbc-technical-assessment-gk/landing/DOHMH/year=2011/month=03/part-0000.parquet (3,869 rows, 25 columns)
Fetching 2011-12 ...
Saved s3://hbc-technical-assessment-
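
In the log above, the `Saved` lines arrive out of order and interleaved with `Fetching` lines, which is consistent with several months being fetched concurrently. How `src.fetch` actually schedules its requests is not shown in this notebook; the following is only a generic sketch of bounded concurrency with `asyncio`, using a stand-in coroutine instead of a real API call. Both function names and the concurrency limit are hypothetical.

```python
import asyncio


async def fetch_month(label, semaphore):
    """Stand-in for one monthly API call; the semaphore caps in-flight requests."""
    async with semaphore:
        print(f"Fetching {label} ...")
        await asyncio.sleep(0.01)  # placeholder for network I/O
        return label


async def fetch_months(labels, max_concurrent=4):
    """Run all monthly fetches with at most max_concurrent active at once."""
    semaphore = asyncio.Semaphore(max_concurrent)
    tasks = [fetch_month(label, semaphore) for label in labels]
    # gather returns results in the order the tasks were passed in.
    return await asyncio.gather(*tasks)


# In a plain script; inside a notebook cell use `await fetch_months([...])` instead.
results = asyncio.run(fetch_months(["2010-01", "2010-02", "2010-03"]))
```

Capping the number of simultaneous requests is one common way to stay within API rate limits while still overlapping network waits.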

### Incremental Data Pull - Current Month

For ongoing analysis, you can fetch only the current month's data rather than reprocessing the entire historical dataset.

**Runtime:** ~1-2 minutes (depending on month volume)

**Use Case:** 
- Regular data updates for dashboards
- Incremental model retraining
- Near real-time monitoring

The function automatically detects the current month and fetches only new records. This can be run daily or weekly to keep your dataset up to date without re-downloading historical data.

In [5]:
# await fetch.fetch_current_month_service_requests(save = True)



**Option 1:** Fetch all agencies for the current month (uncomment the cell above)

**Option 2:** Fetch DOHMH only for the current month (uncomment the cell below)

Choose the option that matches your initial data pull strategy.


In [6]:
# await fetch.fetch_current_month_service_requests(save = True,
#     columns=config.DOHMH_COLUMNS,
#     additional_filter=config.AGENCY_FILTER
# )
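
The "current month" window used by an incremental pull can be computed with the standard library alone. The sketch below is an illustration of that date arithmetic, not the logic inside `fetch.fetch_current_month_service_requests`, which is not shown here; `current_month_bounds` is a hypothetical name.

```python
from datetime import date


def current_month_bounds(today=None):
    """Return (first day of this month, first day of next month)."""
    today = today or date.today()
    start = today.replace(day=1)
    if start.month == 12:
        end = date(start.year + 1, 1, 1)  # roll over into January
    else:
        end = date(start.year, start.month + 1, 1)
    return start, end


start, end = current_month_bounds()
print(f"Fetching records with {start} <= created_date < {end}")
```

Using a half-open interval (inclusive start, exclusive end) avoids having to know how many days the month has.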

## US Census ACS Population Data

The American Community Survey (ACS) provides annual population estimates at the census block group level. This data enables per-capita analysis of service requests, helping us understand demand patterns relative to population density.

**Data Source:** US Census Bureau API (ACS 5-Year Estimates)

**Geographic Level:** Block groups for NYC counties (Bronx, Brooklyn, Manhattan, Queens, Staten Island)

**Time Range:** 2013-2023 (historical) + interpolated values for 2010-2012 and 2024-2025

**Key Fields:**
- `GEOID`: Census block group identifier
- `population`: Total population estimate
- `year`: Survey year

**Runtime:** ~2-3 minutes

This data will be joined with 311 service requests in later steps to calculate per-capita complaint rates.
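
As a rough illustration of how these fields fit together: a block group `GEOID` is the concatenation of state (2 digits), county (3), tract (6), and block group (1) FIPS codes, and a per-capita rate divides a request count by the matching population. The helpers below are hypothetical, and expressing the rate per 1,000 residents is an illustrative choice rather than something this notebook specifies.

```python
# County FIPS codes for the five boroughs (state FIPS for New York is 36).
NYC_COUNTY_FIPS = {
    "005": "Bronx",
    "047": "Brooklyn",       # Kings County
    "061": "Manhattan",      # New York County
    "081": "Queens",
    "085": "Staten Island",  # Richmond County
}


def block_group_geoid(state, county, tract, block_group):
    """Concatenate FIPS components into a 12-digit block group GEOID."""
    return f"{state:0>2}{county:0>3}{tract:0>6}{block_group}"


def rate_per_1000(request_count, population):
    """Complaints per 1,000 residents; None when population is missing or zero."""
    if not population:
        return None
    return 1000 * request_count / population


geoid = block_group_geoid("36", "061", "000100", "1")
print(geoid, NYC_COUNTY_FIPS[geoid[2:5]], rate_per_1000(5, 2000))
```

Keeping `GEOID` as a zero-padded string matters: reading it as an integer drops leading zeros in the tract portion and breaks later joins.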

In [7]:
df_pop = fetch.fetch_acs_census_population_data(start_year=2013, end_year=2023, save=False)

Downloading ACS 5-year data for 2013...
  - County 005...
  - County 047...
  - County 061...
  - County 081...
  - County 085...
Downloading ACS 5-year data for 2014...
  - County 005...
  - County 047...
  - County 061...
  - County 081...
  - County 085...
Downloading ACS 5-year data for 2015...
  - County 005...
  - County 047...
  - County 061...
  - County 081...
  - County 085...
Downloading ACS 5-year data for 2016...
  - County 005...
  - County 047...
  - County 061...
  - County 081...
  - County 085...
Downloading ACS 5-year data for 2017...
  - County 005...
  - County 047...
  - County 061...
  - County 081...
  - County 085...
Downloading ACS 5-year data for 2018...
  - County 005...
  - County 047...
  - County 061...
  - County 081...
  - County 085...
Downloading ACS 5-year data for 2019...
  - County 005...
  - County 047...
  - County 061...
  - County 081...
  - County 085...
Downloading ACS 5-year data for 2020...
  - County 005...
  - County 047...
  - County 061

## NOAA Weather Data

Weather conditions can significantly influence service request patterns. For example:
- **Heat complaints** increase during hot weather
- **Standing water/mosquito complaints** increase after rainfall
- **Rodent activity** varies with temperature and precipitation

**Data Source:** NOAA nClimGrid Daily (via AWS Open Data)

**Geographic Level:** County-level (NYC's 5 counties/boroughs)

**Time Range:** 2010-2025 (historical + recent data)

**Key Fields:**
- `fips`: County FIPS code
- `date`: Daily observation date
- `tmax`: Maximum temperature (°C)
- `tmin`: Minimum temperature (°C)
- `tavg`: Average temperature (°C)
- `prcp`: Precipitation (mm)

**Runtime:** ~1-2 minutes

Weather data is joined with service requests by date and borough to create weather-aware features for predictive modeling.
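
Since the weather data is keyed by county FIPS while 311 requests carry a borough name, the join needs a mapping between the two. The sketch below shows one way to do that with plain dictionaries. The record layouts are assumptions for illustration (five-digit string FIPS codes, uppercase borough names, ISO date strings); the actual join in later steps may differ.

```python
# Full county FIPS -> borough name; the exact formats used by the
# project's datasets are an assumption here.
FIPS_TO_BOROUGH = {
    "36005": "BRONX",
    "36047": "BROOKLYN",
    "36061": "MANHATTAN",
    "36081": "QUEENS",
    "36085": "STATEN ISLAND",
}


def index_weather(weather_rows):
    """Index weather records by (borough, date) for constant-time lookup."""
    return {
        (FIPS_TO_BOROUGH[row["fips"]], row["date"]): row
        for row in weather_rows
        if row["fips"] in FIPS_TO_BOROUGH
    }


def attach_weather(requests, weather_index, fields=("tmax", "tmin", "tavg", "prcp")):
    """Return copies of requests with weather fields added (None if no match)."""
    joined = []
    for req in requests:
        weather = weather_index.get((req["borough"], req["date"]), {})
        joined.append({**req, **{f: weather.get(f) for f in fields}})
    return joined


weather = [{"fips": "36061", "date": "2010-01-01", "tmax": 3.0, "tmin": -2.0, "tavg": 0.5, "prcp": 1.2}]
requests = [{"borough": "MANHATTAN", "date": "2010-01-01"}, {"borough": "QUEENS", "date": "2010-01-01"}]
joined = attach_weather(requests, index_weather(weather))
```

Requests without a matching weather record keep `None` in the weather fields, which makes gaps in coverage easy to spot before modeling.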

In [8]:
df_weather = fetch.fetch_noaa_weather_data(start_year=2010, end_year=2025, save=False)

Loaded 201001, rows: 155
Loaded 201002, rows: 155
Loaded 201003, rows: 155
Loaded 201004, rows: 155
Loaded 201005, rows: 155
Loaded 201006, rows: 155
Loaded 201007, rows: 155
Loaded 201008, rows: 155
Loaded 201009, rows: 155
Loaded 201010, rows: 155
Loaded 201011, rows: 155
Loaded 201012, rows: 155
Loaded 201101, rows: 155
Loaded 201102, rows: 155
Loaded 201103, rows: 155
Loaded 201104, rows: 155
Loaded 201105, rows: 155
Loaded 201106, rows: 155
Loaded 201107, rows: 155
Loaded 201108, rows: 155
Loaded 201109, rows: 155
Loaded 201110, rows: 155
Loaded 201111, rows: 155
Loaded 201112, rows: 155
Loaded 201201, rows: 155
Loaded 201202, rows: 155
Loaded 201203, rows: 155
Loaded 201204, rows: 155
Loaded 201205, rows: 155
Loaded 201206, rows: 155
Loaded 201207, rows: 155
Loaded 201208, rows: 155
Loaded 201209, rows: 155
Loaded 201210, rows: 155
Loaded 201211, rows: 155
Loaded 201212, rows: 155
Loaded 201301, rows: 155
Loaded 201302, rows: 155
Loaded 201303, rows: 155
Loaded 201304, rows: 155


---

## Data Fetching Summary

After running this notebook, you will have three key datasets:

### 1. NYC 311 Service Requests
- **Location:** `data/landing/311-service-requests/` (partitioned by year/month)
- **Format:** Parquet files with Hive partitioning
- **Size:** Variable depending on scope (all agencies vs. DOHMH only)

### 2. US Census Population Data
- **Location:** `data/landing/acs-population/combined_population_data.csv`
- **Format:** CSV with GEOID, year, and population columns
- **Coverage:** 2010-2025 (with interpolation)

### 3. NOAA Weather Data
- **Location:** `data/landing/noaa-nclimgrid-daily/nyc_fips_weather_data.csv`
- **Format:** CSV with daily weather observations
- **Coverage:** 2010-2025, all NYC counties

## Next Steps

Proceed to **Step 2 - Basic EDA** to explore the data and understand patterns before cleaning and modeling.
