# Exploration of the Dataset
This notebook performs the **first stage of data wrangling** for the NYC 311 Service Requests dataset.  
The goal is to **collect, clean, and prepare representative samples** from the raw data to support further **exploration and analysis** in later steps.

We connect to the official [NYC Open Data API](https://data.cityofnewyork.us/), download the data in manageable chunks (each quarter is split into monthly ranges),  
and draw samples proportional to each borough's share of requests. This keeps the sample representative of the borough distribution while keeping it computationally manageable.

This notebook lays the groundwork for:
- **Merging and unifying** large raw datasets  
- **Sampling** data per borough and time period  

In [None]:
# Imports
import logging
import re
from itertools import chain

import pandas as pd

from libs.calculator import calc_sample_size
from libs.fetcher import fetch_count_of_grouping, fetch_all_samples_from_plan
from libs.utils import generate_quarters, month_ranges

# Without this, logging.info() messages are dropped (the default level is WARNING)
logging.basicConfig(level=logging.INFO)

## ⚙️ Constants and Configuration

The following section defines key constants used throughout this notebook:
- **BASE_URL:** the NYC Open Data API endpoint for 311 Service Requests  
- **DEFAULT_SINCE / DEFAULT_UNTIL:** default year range for sampling  
- **TARGET_SAMPLE:** target number of records per sampled time range (the fetch loop below applies it to each monthly range)  
- **DEFAULT_DB_PATH / DEFAULT_TABLE:** optional configuration for local DuckDB storage  
- **MAX_RETRIES / TIMEOUT / BASE_DELAY:** network parameters for reliable API requests  
- **SELECT_COLUMNS:** the list of columns (fields) to retrieve from the API  
- **data_sets:** any additional CSV datasets used for contextual enrichment (e.g., housing, demographic, or rent data)
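
As an illustration of how `MAX_RETRIES` and `BASE_DELAY` could interact, here is a minimal sketch of a retry loop with exponential backoff. The helper `with_retries` is hypothetical and not part of `libs.fetcher`; the doubling delay is an assumption, and the actual fetcher may wait differently. `TIMEOUT` would be passed to the HTTP call itself.

```python
import time

MAX_RETRIES = 5
BASE_DELAY = 2.0  # seconds


def with_retries(func, max_retries=MAX_RETRIES, base_delay=BASE_DELAY, sleep=time.sleep):
    """Call func() until it succeeds or max_retries attempts have failed.

    Hypothetical helper for illustration; not part of libs.fetcher.
    """
    for attempt in range(max_retries):
        try:
            return func()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error
            # Assumed backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)
```

Passing `sleep` as a parameter makes the waiting behaviour easy to replace in tests, so the backoff schedule can be checked without actually pausing.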


In [None]:
# Constants
BASE_URL = "https://data.cityofnewyork.us/resource/erm2-nwe9.csv"
DEFAULT_SINCE = 2024
DEFAULT_UNTIL = 2025
TARGET_SAMPLE = 10_000
DEFAULT_DB_PATH = "./mydb.duckdb"
DEFAULT_TABLE = "nyc311_2024_2025"
MAX_RETRIES = 5
TIMEOUT = 60  # seconds
BASE_DELAY = 2.0  # seconds

SELECT_COLUMNS = [
    "unique_key", "created_date", "closed_date", "agency", "agency_name", 
    "complaint_type", "descriptor", "location_type", "incident_zip", 
    "incident_address", "street_name", "cross_street_1", "cross_street_2",
    "intersection_street_1", "intersection_street_2", "address_type", "city", 
    "landmark", "facility_type", "status", "due_date", "resolution_description", 
    "resolution_action_updated_date", "community_board", "bbl", "borough", 
    "x_coordinate_state_plane", "y_coordinate_state_plane", "open_data_channel_type",
    "park_facility_name", "park_borough", "vehicle_type", "taxi_company_borough", 
    "taxi_pick_up_location", "bridge_highway_name", "bridge_highway_direction", 
    "road_ramp", "bridge_highway_segment", "latitude", "longitude", "location"
]

data_sets = [
     "data/medianAskingRent_All.csv",
]

## 1. Workflow Overview

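1. **Generate Quarterly Ranges:**  
   For each year between the selected start and end years, we create one `(start, end)` date pair per quarter; the fetch loop below further splits each quarter into monthly ranges with `month_ranges`.  
   Example: `2024-01-01T00:00:00` → `2024-03-31T23:59:59` (Q1 2024)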

2. **Fetch Borough Counts:**  
   Using the Socrata API, we retrieve the total number of service requests per `borough` within each time range.  
   → Output: a list or DataFrame with the columns `['borough', 'total']`

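3. **Compute Sampling Plan:**  
   Based on each borough’s proportion of total records, we calculate how many samples to take per borough:  
   $$
   n_i = N_\text{sample} \times \frac{\text{total}_i}{\text{total}_\text{overall}}
   $$
   Here $N_\text{sample}$ is `TARGET_SAMPLE`, $\text{total}_i$ is the request count for borough $i$, and $\text{total}_\text{overall}$ is the sum of the counts over all boroughs.  
   The result is a sampling plan with one `sample_size` value per borough.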

4. **Fetch Random Samples:**  
   For each borough, we randomly pull `sample_size` records from the corresponding time range using the Socrata API.  
   - Data is retrieved via the `.csv` endpoint, whose responses are typically more compact than the JSON output.  
   - Optionally, a random `$offset` combined with local random sampling (`.sample()` in pandas) adds randomness to the selection.

5. **Combine All Quarters:**  
   The sampled data from all boroughs and time ranges is concatenated into a single combined DataFrame using  
   `pd.concat(data_frames, ignore_index=True)`.
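
The proportional allocation in step 3 can be sketched as follows. `proportional_plan` is a hypothetical stand-in for `calc_sample_size`, whose actual rounding rule is not shown in this notebook. The sketch assumes largest-remainder rounding so that the sample sizes add up exactly to the target, and the borough counts are made up for illustration.

```python
import pandas as pd


def proportional_plan(counts: pd.DataFrame, target: int) -> pd.DataFrame:
    """Allocate `target` samples across groups in proportion to `total`.

    Hypothetical stand-in for calc_sample_size; the real helper may round differently.
    """
    plan = counts.copy()
    exact = target * plan["total"] / plan["total"].sum()
    plan["sample_size"] = exact.astype(int)  # floor of each share (values are non-negative)
    # Largest-remainder rounding (an assumption): give the leftover samples to
    # the groups with the largest fractional parts so the sizes sum to `target`.
    leftover = target - int(plan["sample_size"].sum())
    top = (exact - plan["sample_size"]).nlargest(leftover).index
    plan.loc[top, "sample_size"] += 1
    return plan


counts = pd.DataFrame({
    "borough": ["BRONX", "BROOKLYN", "MANHATTAN", "QUEENS", "STATEN ISLAND"],
    "total": [180, 310, 250, 220, 40],  # made-up counts, not real 311 data
})
plan = proportional_plan(counts, target=10)
print(plan)
```

Plain rounding of each share can leave the plan one or two records above or below the target, which is why the sketch distributes the remainder explicitly.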

---

## ⚙️ 2. Key Functions

| Function | Description |
|-----------|--------------|
| `generate_quarters(start_year, end_year)` | Generates quarterly date ranges |
| `fetch_count_of_grouping(BASE_URL, group_by, start, end)` | Retrieves counts per group (e.g., borough) |
| `calc_sample_size(count_result, target_sample)` | Computes proportional sample sizes |
| `fetch_random_sample(...)` | Fetches a random subset for one borough and quarter |
| `fetch_all_samples_from_plan(...)` | Iterates over boroughs and collects their samples |
| `fetch_all_quarters(...)` *(optional)* | Runs the entire pipeline across all quarters |
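
To make the count step more concrete, here is a sketch of how a per-group count request could be expressed with SoQL query parameters (`$select`, `$where`, `$group`). The helper `build_count_url` is hypothetical; the query that `fetch_count_of_grouping` actually sends is defined in `libs.fetcher` and may differ.

```python
from urllib.parse import urlencode

BASE_URL = "https://data.cityofnewyork.us/resource/erm2-nwe9.csv"


def build_count_url(base_url: str, group_by: str, start: str, end: str) -> str:
    """Build a SoQL URL that counts rows per group between start and end.

    Hypothetical helper; the query issued by fetch_count_of_grouping may differ.
    """
    params = {
        "$select": f"{group_by}, count(*) AS total",
        "$where": f"created_date between '{start}' and '{end}'",
        "$group": group_by,
    }
    # urlencode percent-encodes the parameter names and values
    return f"{base_url}?{urlencode(params)}"


url = build_count_url(BASE_URL, "borough", "2024-01-01T00:00:00", "2024-03-31T23:59:59")
print(url)
```

Letting the API aggregate with `$group` means only one small row per borough is transferred, rather than every record in the time range.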

---

In [None]:
# Fetch samples for every time range and combine them into one DataFrame

# 1. Generate the quarterly time ranges
quarters = generate_quarters(DEFAULT_SINCE, DEFAULT_UNTIL)

data_frames = []
months = []

# Split each quarter into monthly (start, end) ranges
for start, end in quarters:
    months.append(month_ranges(start, end))
    logging.info(f"Quarter from {start} to {end}")

df_plan = None

print(months)

# Iterate over the flattened list of monthly ranges
for start, end in chain.from_iterable(months):
    # 2. Count the requests per borough within this range
    count_result = fetch_count_of_grouping(BASE_URL, "borough", start, end)
    # 3. Compute the proportional sample size per borough
    df_plan = calc_sample_size(count_result, TARGET_SAMPLE)
    print(df_plan)

    # 4. Fetch the samples according to the plan
    df_311_calls = fetch_all_samples_from_plan(
        BASE_URL=BASE_URL,
        selectors=SELECT_COLUMNS,
        df_plan=df_plan,
        group_by="borough",
        time_start=start,
        time_end=end,
        sleep_seconds=BASE_DELAY,
    )

    data_frames.append(df_311_calls)

# 5. Combine the samples from all ranges into a single DataFrame
df_all_calls = pd.concat(data_frames, ignore_index=True)
logging.info(f"Total records fetched: {len(df_all_calls)}")

# Note: this shows only the plan of the last range processed
print(df_plan)

#for month_start, month_end in months:
    #print(f"  Month from {month_start} to {month_end}")
    # 2. fetch count of each borough -> (This can be changed according to the needs)in the time range
    #count_result = fetch_count_of_grouping(BASE_URL, "borough", month_start, month_end)
    #print(count_result)

    # 3. calculate sample sizes
    #df_plan = calc_sample_size(count_result, TARGET_SAMPLE)
    # 4. fetch samples according to the plan
    #df_311_calls = fetch_all_samples_from_plan(
    #                BASE_URL=BASE_URL,
    #                selectors=SELECT_COLUMNS,
    #                df_plan=df_plan,
    #                group_by="borough", 
    #                time_start=start,
    #                time_end=end,
    #                sleep_seconds=BASE_DELAY
    #)
    #df_311_calls["quarter_start"] = start
    #df_311_calls["quarter_end"] = end
    #data_frames.append(df_311_calls)

# Combine all quarters into a single DataFrame
#df_all_calls = pd.concat(data_frames, ignore_index=True)
#logging.info(f"Total records fetched: {len(df_all_calls)}")

In [7]:
df_all_calls.to_csv("data/nyc_311_2024_2025_sample.csv", index=False)
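
One way to sanity-check that the sample stays proportional across boroughs is to compare borough shares in the sample against the shares implied by the API counts. The sketch below shows only the share computation, on made-up rows; `borough_shares` is a hypothetical helper and not part of `libs`.

```python
import pandas as pd


def borough_shares(df: pd.DataFrame) -> pd.Series:
    """Return each borough's share of rows, sorted by borough name."""
    return df["borough"].value_counts(normalize=True).sort_index()


# Made-up rows standing in for the sampled data
sample = pd.DataFrame({"borough": ["BROOKLYN", "BROOKLYN", "QUEENS", "BRONX"]})
shares = borough_shares(sample)
print(shares)
```

Applied to `df_all_calls`, the same function would give the sampled shares, which could then be compared with the `total` column of the count results.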