# Exploration of the Dataset
This notebook performs the **first stage of data wrangling** for the NYC 311 Service Requests dataset.  
The goal is to **collect, clean, and prepare representative samples** from the raw data to support further **exploration and analysis** in later steps.

We connect to the official [NYC Open Data API](https://data.cityofnewyork.us/), download the data in manageable monthly chunks,  
and draw samples proportional to each borough's share of requests, so that the sample keeps the borough distribution of the full dataset while staying small enough to process efficiently.

This notebook lays the groundwork for:
- **Merging and unifying** large raw datasets  
- **Sampling** data per borough and time period  

In [1]:
# Imports
import pandas as pd
import logging
import re
from libs.fetcher import fetch_count_of_grouping, fetch_all_samples_from_plan
from libs.utils import generate_quarters, month_ranges
from libs.calculator import calc_sample_size
from itertools import chain

## ⚙️ Constants and Configuration

The following section defines key constants used throughout this notebook:
- **BASE_URL:** the NYC Open Data API endpoint for 311 Service Requests  
- **DEFAULT_SINCE / DEFAULT_UNTIL:** default year range for sampling  
- **TARGET_SAMPLE:** target number of records for each monthly sample  
- **DEFAULT_DB_PATH / DEFAULT_TABLE:** optional configuration for local DuckDB storage  
- **MAX_RETRIES / TIMEOUT / BASE_DELAY:** network parameters for reliable API requests  
- **SELECT_COLUMNS:** the list of columns (fields) to retrieve from the API  
- **data_sets:** any additional CSV datasets used for contextual enrichment (e.g., housing, demographic, or rent data)


In [2]:
# Constants
BASE_URL = "https://data.cityofnewyork.us/resource/erm2-nwe9.csv"
DEFAULT_SINCE = 2024
DEFAULT_UNTIL = 2025
TARGET_SAMPLE = 10_000
DEFAULT_DB_PATH = "./mydb.duckdb"
DEFAULT_TABLE = "nyc311_2024_2025"
MAX_RETRIES = 5
TIMEOUT = 60  # seconds
BASE_DELAY = 2.0  # seconds

SELECT_COLUMNS = [
    "unique_key", "created_date", "closed_date", "agency", "agency_name", 
    "complaint_type", "descriptor", "location_type", "incident_zip", 
    "incident_address", "street_name", "cross_street_1", "cross_street_2",
    "intersection_street_1", "intersection_street_2", "address_type", "city", 
    "landmark", "facility_type", "status", "due_date", "resolution_description", 
    "resolution_action_updated_date", "community_board", "bbl", "borough", 
    "x_coordinate_state_plane", "y_coordinate_state_plane", "open_data_channel_type",
    "park_facility_name", "park_borough", "vehicle_type", "taxi_company_borough", 
    "taxi_pick_up_location", "bridge_highway_name", "bridge_highway_direction", 
    "road_ramp", "bridge_highway_segment", "latitude", "longitude", "location"
]

data_sets = [
     "data/medianAskingRent_All.csv",
]
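
As a rough illustration of how `MAX_RETRIES` and `BASE_DELAY` could work together, the sketch below retries a failing call with exponentially growing pauses. The helper `with_retries` is hypothetical; the retry logic inside `libs.fetcher` is not shown in this notebook and may differ.

```python
import time

MAX_RETRIES = 5
BASE_DELAY = 2.0  # seconds


def with_retries(func, max_retries=MAX_RETRIES, base_delay=BASE_DELAY, sleep=time.sleep):
    """Call func() until it succeeds or max_retries attempts have failed.

    Hypothetical helper: waits base_delay * 2**attempt seconds between attempts.
    """
    last_error = RuntimeError("max_retries must be at least 1")
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as exc:  # real code should catch only network errors
            last_error = exc
            if attempt < max_retries - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_error
```

`TIMEOUT` would typically be handed to the HTTP call itself, for example `requests.get(url, timeout=TIMEOUT)`, so that a hanging request fails and triggers the next retry.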

## 1. Workflow Overview

1. **Generate month ranges:**  
   For each year between the selected start and end years, we first create quarterly `(start, end)` date pairs and then split each quarter into months.  
   Example: `2024-01-01T00:00:00` → `2024-02-01T00:00:00` (January 2024). The last month of a quarter ends at the quarter boundary, e.g. `2024-03-31T23:59:59`.

2. **Fetch Borough Counts:**  
   Using the Socrata API, we retrieve the total number of service requests per `borough` within each month.  
   → Output: a list or DataFrame with columns  
   `['borough', 'total']`

3. **Compute Sampling Plan:**  
   Based on each borough's proportion of total records, we calculate how many samples to take per borough:  
   $$
   n_i = N_\text{sample} \times \frac{\text{total}_i}{\text{total}_\text{overall}}
   $$
   The result is a sampling plan with one `sample_size` value per borough.

4. **Fetch Random Samples:**  
   For each borough, we randomly pull `sample_size` records from the corresponding month using the openly accessible API.  
   - Data is retrieved via the `.csv` endpoint (faster than JSON).  
   - A random `$offset` and a randomly chosen sort direction (ascending or descending) are used to randomize which records are drawn.

5. **Combine All Months:**  
   The sampled data from all boroughs and months are concatenated into a single combined DataFrame using  
   `pd.concat(data_frames, ignore_index=True)`.
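
The allocation formula in step 3 can be sketched in plain Python. This is a minimal illustration, not the code of `calc_sample_size`, whose implementation is not shown here. The function name `proportional_plan` and the largest-remainder rounding are assumptions, chosen so that the per-borough sizes add up to exactly the target.

```python
def proportional_plan(totals: dict[str, int], target: int) -> dict[str, int]:
    """Split `target` samples across groups in proportion to their totals.

    Hypothetical helper: rounds down first, then hands the leftover samples
    to the groups with the largest fractional remainders.
    """
    overall = sum(totals.values())
    exact = {group: target * count / overall for group, count in totals.items()}
    plan = {group: int(value) for group, value in exact.items()}  # round down
    shortfall = max(target - sum(plan.values()), 0)
    # Groups sorted by how much was lost to rounding, largest loss first.
    by_remainder = sorted(exact, key=lambda g: exact[g] - plan[g], reverse=True)
    for group in by_remainder[:shortfall]:
        plan[group] += 1
    return plan


# Made-up totals, for illustration only (not real 311 counts).
example = proportional_plan({"A": 50, "B": 30, "C": 20}, target=10)
```

With the made-up totals above, each group receives a share of the 10 samples equal to its share of the 100 records.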

---

## ⚙️ 2. Key Functions

| Function | Description |
|-----------|--------------|
| `generate_quarters(start_year, end_year)` | Generates quarterly date ranges |
| `fetch_count_of_grouping(BASE_URL, group_by, start, end)` | Retrieves counts per group (e.g., borough) |
| `calc_sample_size(count_result, TARGET_SAMPLE)` | Computes proportional sample sizes |
| `fetch_random_sample(...)` | Fetches a random subset for one borough and time range |
| `fetch_all_samples_from_plan(...)` | Iterates over boroughs and collects their samples |
| `fetch_all_quarters(...)` *(optional)* | Runs the entire pipeline across all quarters |
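
To make the count and sampling requests more concrete, the sketch below builds the kind of SoQL query parameters such calls might send. The helper names (`count_query`, `sample_query`, `build_url`) and the choice of `unique_key` as the sort column are assumptions for illustration; the actual queries inside `libs.fetcher` are not shown in this notebook.

```python
import random
from urllib.parse import urlencode

BASE_URL = "https://data.cityofnewyork.us/resource/erm2-nwe9.csv"


def count_query(group_by, start, end):
    """Parameters for counting requests per group between start and end."""
    return {
        "$select": f"{group_by}, count(*) AS total",
        "$where": f"created_date between '{start}' and '{end}'",
        "$group": group_by,
    }


def sample_query(group_by, value, start, end, sample_size, total, rng=random):
    """Parameters for one window of rows at a random offset and sort direction.

    Assumption: rows are ordered by unique_key before the offset is applied.
    """
    direction = rng.choice(["ASC", "DESC"])
    max_offset = max(total - sample_size, 0)  # keep the window inside the data
    return {
        "$where": (
            f"{group_by} = '{value}' "
            f"AND created_date between '{start}' and '{end}'"
        ),
        "$order": f"unique_key {direction}",
        "$limit": sample_size,
        "$offset": rng.randint(0, max_offset),
    }


def build_url(params):
    """Attach URL-encoded query parameters to the endpoint."""
    return f"{BASE_URL}?{urlencode(params)}"
```

Note that a window at a random offset returns consecutive rows in the chosen order, so it approximates a random draw rather than guaranteeing a uniform one.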

---
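
The splitting of a quarter into months can be sketched as follows. `split_into_months` is a hypothetical stand-in for `month_ranges`, written only to reproduce the `(start, end)` format this notebook prints; the real helper in `libs.utils` may be implemented differently.

```python
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%S"


def split_into_months(start: str, end: str) -> list[tuple[str, str]]:
    """Split the range from start to end into consecutive month ranges.

    Each range ends where the next month begins; the last one ends at `end`.
    """
    cursor = datetime.strptime(start, FMT)
    stop = datetime.strptime(end, FMT)
    ranges = []
    while cursor < stop:
        # First instant of the following month (December rolls over to January).
        if cursor.month == 12:
            nxt = cursor.replace(year=cursor.year + 1, month=1, day=1,
                                 hour=0, minute=0, second=0)
        else:
            nxt = cursor.replace(month=cursor.month + 1, day=1,
                                 hour=0, minute=0, second=0)
        upper = min(nxt, stop)
        ranges.append((cursor.strftime(FMT), upper.strftime(FMT)))
        cursor = nxt
    return ranges


q1_2024 = split_into_months("2024-01-01T00:00:00", "2024-03-31T23:59:59")
```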

In [3]:
# Fetch sample of datasets and parse to Data Frame 

# 1. generate the time ranges:
quarters = generate_quarters(DEFAULT_SINCE, DEFAULT_UNTIL)

data_frames = []
months = []

for start, end in quarters:
    months.append(month_ranges(start, end))
    logging.info(f"Quarter from {start} to {end}")

df_plan = None

print(months)

for start, end in chain.from_iterable(months):
    count_result = fetch_count_of_grouping(BASE_URL, "borough", start, end)
    df_plan = calc_sample_size(count_result, TARGET_SAMPLE)

    df_311_calls = fetch_all_samples_from_plan(
                    BASE_URL=BASE_URL,
                    selectors=SELECT_COLUMNS,
                    df_plan=df_plan,
                    group_by="borough", 
                    time_start=start,
                    time_end=end,
                    sleep_seconds=BASE_DELAY
    )
    df_311_calls["month_start"] = start
    df_311_calls["month_end"] = end
    data_frames.append(df_311_calls)

df_all_calls = pd.concat(data_frames, ignore_index=True)
logging.info(f"Total records fetched: {len(df_all_calls)}")
df_all_calls.to_csv("data/nyc_311_2024_2025_sample.csv", index=False)

[[('2024-01-01T00:00:00', '2024-02-01T00:00:00'), ('2024-02-01T00:00:00', '2024-03-01T00:00:00'), ('2024-03-01T00:00:00', '2024-03-31T23:59:59')], [('2024-04-01T00:00:00', '2024-05-01T00:00:00'), ('2024-05-01T00:00:00', '2024-06-01T00:00:00'), ('2024-06-01T00:00:00', '2024-06-30T23:59:59')], [('2024-07-01T00:00:00', '2024-08-01T00:00:00'), ('2024-08-01T00:00:00', '2024-09-01T00:00:00'), ('2024-09-01T00:00:00', '2024-09-30T23:59:59')], [('2024-10-01T00:00:00', '2024-11-01T00:00:00'), ('2024-11-01T00:00:00', '2024-12-01T00:00:00'), ('2024-12-01T00:00:00', '2024-12-31T23:59:59')], [('2025-01-01T00:00:00', '2025-02-01T00:00:00'), ('2025-02-01T00:00:00', '2025-03-01T00:00:00'), ('2025-03-01T00:00:00', '2025-03-31T23:59:59')], [('2025-04-01T00:00:00', '2025-05-01T00:00:00'), ('2025-05-01T00:00:00', '2025-06-01T00:00:00'), ('2025-06-01T00:00:00', '2025-06-30T23:59:59')], [('2025-07-01T00:00:00', '2025-08-01T00:00:00'), ('2025-08-01T00:00:00', '2025-09-01T00:00:00'), ('2025-09-01T00:00:00', '2

  return pd.concat(all_samples, ignore_index=True)


<class 'pandas.core.series.Series'>
📥 Loading 2057 rows for borough = 'BRONX' ...
2057 rows loaded.
📥 Loading 3091 rows for borough = 'BROOKLYN' ...
3091 rows loaded.
📥 Loading 2086 rows for borough = 'MANHATTAN' ...
2086 rows loaded.
📥 Loading 2418 rows for borough = 'QUEENS' ...
2418 rows loaded.
📥 Loading 339 rows for borough = 'STATEN ISLAND' ...
339 rows loaded.
📥 Loading 9 rows for borough = 'Unspecified' ...
0 rows loaded.


  return pd.concat(all_samples, ignore_index=True)


<class 'pandas.core.series.Series'>
📥 Loading 1952 rows for borough = 'BRONX' ...
1952 rows loaded.
📥 Loading 3063 rows for borough = 'BROOKLYN' ...
3063 rows loaded.
📥 Loading 2110 rows for borough = 'MANHATTAN' ...
2110 rows loaded.
📥 Loading 2499 rows for borough = 'QUEENS' ...
2499 rows loaded.
📥 Loading 368 rows for borough = 'STATEN ISLAND' ...
368 rows loaded.
📥 Loading 8 rows for borough = 'Unspecified' ...
0 rows loaded.


  return pd.concat(all_samples, ignore_index=True)


<class 'pandas.core.series.Series'>
📥 Loading 1947 rows for borough = 'BRONX' ...
1947 rows loaded.
📥 Loading 3003 rows for borough = 'BROOKLYN' ...
3003 rows loaded.
📥 Loading 2233 rows for borough = 'MANHATTAN' ...
2233 rows loaded.
📥 Loading 2458 rows for borough = 'QUEENS' ...
2458 rows loaded.
📥 Loading 352 rows for borough = 'STATEN ISLAND' ...
352 rows loaded.
📥 Loading 7 rows for borough = 'Unspecified' ...
0 rows loaded.


  return pd.concat(all_samples, ignore_index=True)


<class 'pandas.core.series.Series'>
📥 Loading 1809 rows for borough = 'BRONX' ...
1809 rows loaded.
📥 Loading 3131 rows for borough = 'BROOKLYN' ...
3131 rows loaded.
📥 Loading 2217 rows for borough = 'MANHATTAN' ...
2217 rows loaded.
📥 Loading 2480 rows for borough = 'QUEENS' ...
2480 rows loaded.
📥 Loading 356 rows for borough = 'STATEN ISLAND' ...
356 rows loaded.
📥 Loading 7 rows for borough = 'Unspecified' ...
0 rows loaded.


  return pd.concat(all_samples, ignore_index=True)


<class 'pandas.core.series.Series'>
📥 Loading 1938 rows for borough = 'BRONX' ...
1938 rows loaded.
📥 Loading 3075 rows for borough = 'BROOKLYN' ...
3075 rows loaded.
📥 Loading 1996 rows for borough = 'MANHATTAN' ...
1996 rows loaded.
📥 Loading 2595 rows for borough = 'QUEENS' ...
2595 rows loaded.
📥 Loading 389 rows for borough = 'STATEN ISLAND' ...
389 rows loaded.
📥 Loading 7 rows for borough = 'Unspecified' ...
0 rows loaded.


  return pd.concat(all_samples, ignore_index=True)


<class 'pandas.core.series.Series'>
📥 Loading 2119 rows for borough = 'BRONX' ...
2119 rows loaded.
📥 Loading 2986 rows for borough = 'BROOKLYN' ...
2986 rows loaded.
📥 Loading 2010 rows for borough = 'MANHATTAN' ...
2010 rows loaded.
📥 Loading 2481 rows for borough = 'QUEENS' ...
2481 rows loaded.
📥 Loading 395 rows for borough = 'STATEN ISLAND' ...
395 rows loaded.
📥 Loading 9 rows for borough = 'Unspecified' ...
0 rows loaded.


  return pd.concat(all_samples, ignore_index=True)


<class 'pandas.core.series.Series'>
📥 Loading 2047 rows for borough = 'BRONX' ...
2047 rows loaded.
📥 Loading 3005 rows for borough = 'BROOKLYN' ...
3005 rows loaded.
📥 Loading 2032 rows for borough = 'MANHATTAN' ...
2032 rows loaded.
📥 Loading 2525 rows for borough = 'QUEENS' ...
2525 rows loaded.
📥 Loading 380 rows for borough = 'STATEN ISLAND' ...
380 rows loaded.
📥 Loading 11 rows for borough = 'Unspecified' ...
11 rows loaded.
<class 'pandas.core.series.Series'>
📥 Loading 2258 rows for borough = 'BRONX' ...
2258 rows loaded.
📥 Loading 2983 rows for borough = 'BROOKLYN' ...
2983 rows loaded.
📥 Loading 2077 rows for borough = 'MANHATTAN' ...
2077 rows loaded.
📥 Loading 2335 rows for borough = 'QUEENS' ...
2335 rows loaded.
📥 Loading 339 rows for borough = 'STATEN ISLAND' ...
339 rows loaded.
📥 Loading 8 rows for borough = 'Unspecified' ...
0 rows loaded.


  return pd.concat(all_samples, ignore_index=True)


<class 'pandas.core.series.Series'>
📥 Loading 2109 rows for borough = 'BRONX' ...
2109 rows loaded.
📥 Loading 2965 rows for borough = 'BROOKLYN' ...
2965 rows loaded.
📥 Loading 2203 rows for borough = 'MANHATTAN' ...
2203 rows loaded.
📥 Loading 2369 rows for borough = 'QUEENS' ...
2369 rows loaded.
📥 Loading 347 rows for borough = 'STATEN ISLAND' ...
347 rows loaded.
📥 Loading 7 rows for borough = 'Unspecified' ...
0 rows loaded.


  return pd.concat(all_samples, ignore_index=True)


<class 'pandas.core.series.Series'>
📥 Loading 2354 rows for borough = 'BRONX' ...
2354 rows loaded.
📥 Loading 3001 rows for borough = 'BROOKLYN' ...
3001 rows loaded.
📥 Loading 2088 rows for borough = 'MANHATTAN' ...
2088 rows loaded.
📥 Loading 2223 rows for borough = 'QUEENS' ...
2223 rows loaded.
📥 Loading 327 rows for borough = 'STATEN ISLAND' ...
327 rows loaded.
📥 Loading 7 rows for borough = 'Unspecified' ...
0 rows loaded.


  return pd.concat(all_samples, ignore_index=True)


<class 'pandas.core.series.Series'>
📥 Loading 2781 rows for borough = 'BRONX' ...
2781 rows loaded.
📥 Loading 2853 rows for borough = 'BROOKLYN' ...
2853 rows loaded.
📥 Loading 1984 rows for borough = 'MANHATTAN' ...
1984 rows loaded.
📥 Loading 2087 rows for borough = 'QUEENS' ...
2087 rows loaded.
📥 Loading 288 rows for borough = 'STATEN ISLAND' ...
288 rows loaded.
📥 Loading 7 rows for borough = 'Unspecified' ...
0 rows loaded.


  return pd.concat(all_samples, ignore_index=True)


<class 'pandas.core.series.Series'>
📥 Loading 3305 rows for borough = 'BRONX' ...
3305 rows loaded.
📥 Loading 2669 rows for borough = 'BROOKLYN' ...
2669 rows loaded.
📥 Loading 1762 rows for borough = 'MANHATTAN' ...
1762 rows loaded.
📥 Loading 1968 rows for borough = 'QUEENS' ...
1968 rows loaded.
📥 Loading 291 rows for borough = 'STATEN ISLAND' ...
291 rows loaded.
📥 Loading 5 rows for borough = 'Unspecified' ...
0 rows loaded.


  return pd.concat(all_samples, ignore_index=True)


<class 'pandas.core.series.Series'>
📥 Loading 2276 rows for borough = 'BRONX' ...
2276 rows loaded.
📥 Loading 3045 rows for borough = 'BROOKLYN' ...
3045 rows loaded.
📥 Loading 1990 rows for borough = 'MANHATTAN' ...
1990 rows loaded.
📥 Loading 2324 rows for borough = 'QUEENS' ...
2324 rows loaded.
📥 Loading 358 rows for borough = 'STATEN ISLAND' ...
0 rows loaded.
📥 Loading 7 rows for borough = 'Unspecified' ...
0 rows loaded.


  return pd.concat(all_samples, ignore_index=True)


<class 'pandas.core.series.Series'>
📥 Loading 2031 rows for borough = 'BRONX' ...
2031 rows loaded.
📥 Loading 3180 rows for borough = 'BROOKLYN' ...
3180 rows loaded.
📥 Loading 1963 rows for borough = 'MANHATTAN' ...
1963 rows loaded.
📥 Loading 2444 rows for borough = 'QUEENS' ...
2444 rows loaded.
📥 Loading 375 rows for borough = 'STATEN ISLAND' ...
375 rows loaded.
📥 Loading 7 rows for borough = 'Unspecified' ...
0 rows loaded.


  return pd.concat(all_samples, ignore_index=True)


<class 'pandas.core.series.Series'>
📥 Loading 1919 rows for borough = 'BRONX' ...
1919 rows loaded.
📥 Loading 3115 rows for borough = 'BROOKLYN' ...
3115 rows loaded.
📥 Loading 2124 rows for borough = 'MANHATTAN' ...
2124 rows loaded.
📥 Loading 2441 rows for borough = 'QUEENS' ...
2441 rows loaded.
📥 Loading 393 rows for borough = 'STATEN ISLAND' ...
393 rows loaded.
📥 Loading 8 rows for borough = 'Unspecified' ...
0 rows loaded.


  return pd.concat(all_samples, ignore_index=True)


<class 'pandas.core.series.Series'>
📥 Loading 1847 rows for borough = 'BRONX' ...
1847 rows loaded.
📥 Loading 3123 rows for borough = 'BROOKLYN' ...
3123 rows loaded.
📥 Loading 2077 rows for borough = 'MANHATTAN' ...
2077 rows loaded.
📥 Loading 2530 rows for borough = 'QUEENS' ...
2530 rows loaded.
📥 Loading 415 rows for borough = 'STATEN ISLAND' ...
415 rows loaded.
📥 Loading 8 rows for borough = 'Unspecified' ...
0 rows loaded.


  return pd.concat(all_samples, ignore_index=True)


<class 'pandas.core.series.Series'>
📥 Loading 1910 rows for borough = 'BRONX' ...
1910 rows loaded.
📥 Loading 2966 rows for borough = 'BROOKLYN' ...
2966 rows loaded.
📥 Loading 2040 rows for borough = 'MANHATTAN' ...
2040 rows loaded.
📥 Loading 2658 rows for borough = 'QUEENS' ...
2658 rows loaded.
📥 Loading 416 rows for borough = 'STATEN ISLAND' ...
416 rows loaded.
📥 Loading 10 rows for borough = 'Unspecified' ...
0 rows loaded.


  return pd.concat(all_samples, ignore_index=True)


<class 'pandas.core.series.Series'>
📥 Loading 1861 rows for borough = 'BRONX' ...
1861 rows loaded.
📥 Loading 2901 rows for borough = 'BROOKLYN' ...
2901 rows loaded.
📥 Loading 1905 rows for borough = 'MANHATTAN' ...
1905 rows loaded.
📥 Loading 2863 rows for borough = 'QUEENS' ...
2863 rows loaded.
📥 Loading 460 rows for borough = 'STATEN ISLAND' ...
460 rows loaded.
📥 Loading 10 rows for borough = 'Unspecified' ...
0 rows loaded.


  return pd.concat(all_samples, ignore_index=True)


<class 'pandas.core.series.Series'>
📥 Loading 1769 rows for borough = 'BRONX' ...
1769 rows loaded.
📥 Loading 2900 rows for borough = 'BROOKLYN' ...
2900 rows loaded.
📥 Loading 1935 rows for borough = 'MANHATTAN' ...
1935 rows loaded.
📥 Loading 2962 rows for borough = 'QUEENS' ...
2962 rows loaded.
📥 Loading 426 rows for borough = 'STATEN ISLAND' ...
426 rows loaded.
📥 Loading 8 rows for borough = 'Unspecified' ...
0 rows loaded.


  return pd.concat(all_samples, ignore_index=True)


<class 'pandas.core.series.Series'>
📥 Loading 1987 rows for borough = 'BRONX' ...
1987 rows loaded.
📥 Loading 3028 rows for borough = 'BROOKLYN' ...
3028 rows loaded.
📥 Loading 2018 rows for borough = 'MANHATTAN' ...
2018 rows loaded.
📥 Loading 2580 rows for borough = 'QUEENS' ...
2580 rows loaded.
📥 Loading 380 rows for borough = 'STATEN ISLAND' ...
380 rows loaded.
📥 Loading 7 rows for borough = 'Unspecified' ...
0 rows loaded.


  return pd.concat(all_samples, ignore_index=True)


<class 'pandas.core.series.Series'>
📥 Loading 2367 rows for borough = 'BRONX' ...
2367 rows loaded.
📥 Loading 2888 rows for borough = 'BROOKLYN' ...
2888 rows loaded.
📥 Loading 2069 rows for borough = 'MANHATTAN' ...
2069 rows loaded.
📥 Loading 2327 rows for borough = 'QUEENS' ...
2327 rows loaded.
📥 Loading 342 rows for borough = 'STATEN ISLAND' ...
342 rows loaded.
📥 Loading 7 rows for borough = 'Unspecified' ...
0 rows loaded.


  return pd.concat(all_samples, ignore_index=True)


<class 'pandas.core.series.Series'>
📥 Loading 1731 rows for borough = 'BRONX' ...
0 rows loaded.
📥 Loading 2390 rows for borough = 'MANHATTAN' ...
0 rows loaded.
📥 Loading 2445 rows for borough = 'QUEENS' ...
0 rows loaded.
📥 Loading 3063 rows for borough = 'BROOKLYN' ...
0 rows loaded.
📥 Loading 8 rows for borough = 'Unspecified' ...
0 rows loaded.
📥 Loading 363 rows for borough = 'STATEN ISLAND' ...
0 rows loaded.
<class 'pandas.core.series.Series'>


  df_all_calls = pd.concat(data_frames, ignore_index=True)
