Final Submission DAW
<br>
Authors: Mychailo & Roberto

#  **New York State of Mind**

##  **Introduction**

This notebook presents the **final report** for our project in the module  
**Data Wrangling (DAW)**.  

It is based on two primary datasets:  

- [NYC 311 Service Requests](https://data.cityofnewyork.us/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9/about_data)  
- [StreetEasy Data Dashboard](https://streeteasy.com/blog/data-dashboard/?utm_source=chatgpt.com)  

---

The project demonstrates a complete **data wrangling pipeline**, covering the topics of the LEs in this module:

1. **Importing** – retrieving and sampling real-world open data  
2. **Data Cleaning** – preparing data for analysis  
3. **Transforming** – further preparation and feature engineering  
4. **Joining** – joining the datasets  
5. **Data Pipelines** – bla bla
6. **Reproducibility** – bla bla

---

Our work follows the structure of a **commonly used data exploration framework**, shown in the figure below.

---
<img src="./img/data_exploring.png">


-------------------------

## LE1 Importing the Data
###  1. Workflow Overview

1. **Generate month ranges:**  
   For each year between the selected start and end years, we create `(start, end)` date pairs.  
   We first create quarterly ranges and then split them into months.  
   Example: `2024-01-01T00:00:00` → `2024-01-31T23:59:59` (M1 2024)

2. **Fetch Entries per Day:**  
   For each month we pick a fixed number of days. Which days are picked is random, but the number of days is the same for every month. The number of entries to fetch per day is calculated as follows:
   $$ 
      \text{per\_day} = \left\lfloor \frac{ \text{target} \times \text{per\_day\_mult} }{ \lvert \text{days} \rvert } \right\rfloor
   $$

3. **Fetch Random Samples:**  
   For each borough, we randomly pull the calculated per-day number of records for the corresponding day from the openly accessible API. 
   - Data is retrieved via the `.csv` endpoint (faster than JSON).  
   - A random `$offset` and a randomly chosen sort order (ascending or descending) are used to randomize the sample.

4. **Combine:**  
   The sampled data from all boroughs and days in each month are concatenated into a single DataFrame using  
   `pd.concat(all_quarters, ignore_index=True)`.
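
Steps 1 and 2 can be sketched as follows. This is a minimal illustration, not the code in `libs.utils` or `libs.fetcher`: the helper names `month_bounds`, `pick_days` and `per_day_quota` are hypothetical, and the values for `target`, `per_day_mult` and the number of days are made up for the example.

```python
import calendar
import math
import random
from datetime import datetime


def month_bounds(year: int, month: int) -> tuple[datetime, datetime]:
    """Return the first and the last second of the given month."""
    last_day = calendar.monthrange(year, month)[1]
    return (
        datetime(year, month, 1, 0, 0, 0),
        datetime(year, month, last_day, 23, 59, 59),
    )


def pick_days(year: int, month: int, n_days: int, rng: random.Random) -> list[int]:
    """Pick n_days distinct random days of the month, sorted."""
    last_day = calendar.monthrange(year, month)[1]
    return sorted(rng.sample(range(1, last_day + 1), n_days))


def per_day_quota(target: int, per_day_mult: float, days: list[int]) -> int:
    """floor(target * per_day_mult / |days|), as in the formula above."""
    return math.floor(target * per_day_mult / len(days))


rng = random.Random(42)  # fixed seed so the sketch is repeatable
start, end = month_bounds(2024, 1)
days = pick_days(2024, 1, n_days=10, rng=rng)  # assumed: 10 days per month
quota = per_day_quota(target=1000, per_day_mult=1.5, days=days)  # assumed values

print(start.isoformat(), "->", end.isoformat())
print(days, quota)
```

With these assumed values, each of the 10 sampled days gets a quota of 150 records.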

---

### 2. Key Functions

| Function | Description |
|-----------|--------------|
| `generate_quarters(start_year, end_year)` | Generates quarterly date ranges |
| `month_range(start, end)` | Generates the month range |
| `fetch_month_strat_data(...)` | Fetches a random subset for one borough and day |
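
To illustrate what a fetch for one borough and day might look like, the sketch below builds a query URL with a random `$offset` and a random sort order. It is not the project's fetcher: `build_sample_url` is a hypothetical helper, the resource URL is assumed from the dataset id `erm2-nwe9` in the link above, and sorting by `unique_key` as well as the value of `max_offset` are assumptions.

```python
import random
from urllib.parse import urlencode

# Assumption: standard Socrata resource URL for dataset id erm2-nwe9
BASE_URL = "https://data.cityofnewyork.us/resource/erm2-nwe9.csv"


def build_sample_url(borough, day, per_day, max_offset, rng):
    """Build a query URL for one borough and one day (hypothetical helper)."""
    direction = rng.choice(["ASC", "DESC"])  # random sort order
    offset = rng.randint(0, max_offset)      # random starting row
    params = {
        "$where": (
            f"borough = '{borough}' AND "
            f"created_date >= '{day}T00:00:00' AND "
            f"created_date <= '{day}T23:59:59'"
        ),
        "$order": f"unique_key {direction}",
        "$limit": per_day,
        "$offset": offset,
    }
    # safe="$" keeps the SoQL parameter names readable in the URL
    return f"{BASE_URL}?{urlencode(params, safe='$')}"


rng = random.Random(0)  # fixed seed so the sketch is repeatable
url = build_sample_url("BROOKLYN", "2024-01-15", per_day=150, max_offset=500, rng=rng)
print(url)
```

The resulting URL could be passed to `pd.read_csv(url)`. In practice `max_offset` would have to be derived from the number of matching rows for that borough and day, which this sketch does not do.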

---

### Importing

In [None]:
# Imports for LE1
import pandas as pd
from itertools import chain
from config import Settings, get_settings
from libs.utils import generate_quarters, generate_month_ranges
from libs.fetcher import get_dataset_stratified
from libs_tidy.distribution import test_imported_data_distribution_light, plot_distribution
from libs_tidy.tidying import prepare_date_time

### Constants

In [None]:
# Constants 
SETTINGS: Settings = get_settings()

print("Loaded config:")
print(f"GROUP_BY: {SETTINGS.GROUP_BY}")
print(f"GROUP_BY_VALUE: {SETTINGS.GROUP_BY_VALUE}")
print(f"PLOT_DIST: {SETTINGS.PLOT_DIST}")

# TODO: Maybe move this into an environment variable
# Selection
SELECT_COLUMNS = [
    "unique_key", "created_date", "closed_date", "agency", "agency_name", 
    "complaint_type", "descriptor", "location_type", "incident_zip", 
    "incident_address", "street_name", "cross_street_1", "cross_street_2",
    "intersection_street_1", "intersection_street_2", "address_type", "city", 
    "landmark", "facility_type", "status", "due_date", "resolution_description", 
    "resolution_action_updated_date", "community_board", "bbl", "borough", 
    "x_coordinate_state_plane", "y_coordinate_state_plane", "open_data_channel_type",
    "park_facility_name", "park_borough", "vehicle_type", "taxi_company_borough", 
    "taxi_pick_up_location", "bridge_highway_name", "bridge_highway_direction", 
    "road_ramp", "bridge_highway_segment", "latitude", "longitude", "location"
]

### Fetch and Sample

In [None]:
# Fetch a sample of the dataset and parse it into a DataFrame

# 1. generate the time ranges:
quarters = generate_quarters(SETTINGS.DEFAULT_SINCE, SETTINGS.DEFAULT_UNTIL)
months = generate_month_ranges(quarters)
# 2. fetch the data 
df_all_calls = get_dataset_stratified(months, SETTINGS, SELECT_COLUMNS)

### Save data to CSV

In [None]:
# index=False avoids writing the RangeIndex as an extra unnamed column
df_all_calls.to_csv("data/nyc_311_2024_2025_sample.csv", index=False)

### Testing the import of data

To verify that our function fetched a representative sample, it is necessary to:
1. verify that the distribution over time is the same.

This check could also be considered part of the tidy step in the exploratory data analysis workflow.
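
One way such a check could work is via the coefficient of variation (CV) of the number of records per day. The sketch below is an assumption about the idea behind the `max_cv` threshold, not the code of `test_imported_data_distribution_light`; `daily_count_cv` is a hypothetical helper and the data is a toy example.

```python
import pandas as pd


def daily_count_cv(df: pd.DataFrame, date_col: str = "created_date") -> float:
    """Coefficient of variation (std / mean) of the number of records per day."""
    dates = pd.to_datetime(df[date_col])
    counts = dates.dt.date.value_counts()
    return counts.std(ddof=0) / counts.mean()


# Toy data: 10, 12 and 8 records on three consecutive days
demo = pd.DataFrame(
    {"created_date": ["2024-01-01"] * 10 + ["2024-01-02"] * 12 + ["2024-01-03"] * 8}
)
cv = daily_count_cv(demo)
print(round(cv, 3))

# Fail loudly if the daily counts vary more than the chosen threshold
assert cv <= 0.5, "daily record counts vary more than expected"
```

A low CV means the sampled days contribute similar numbers of records. A high CV would point to days that were under- or over-sampled.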

In [None]:
# Testing the import 

# TODO: Consider moving this test into the tidying step. Open question: which test would remain here if the import is only tested during tidying? One option is a test of the endpoint connection.

# Read the saved dataset
df_nyc_311_2024_2025 = pd.read_csv("data/nyc_311_2024_2025_sample.csv")
df_nyc_311_2024_2025 = prepare_date_time(df_nyc_311_2024_2025)


for month, group in df_nyc_311_2024_2025.groupby('create_month'):
    # Plot the distribution only if plotting is enabled in the settings
    if SETTINGS.PLOT_DIST:
        plot_distribution(group, month)


# Call the test function    
test_imported_data_distribution_light(df_nyc_311_2024_2025, max_cv=0.5)

## LE2 Tidying the Data

In [None]:
# For tidying, we first look at the shape and basic information of the datasets

df_nyc_311 = pd.read_csv('data/nyc_311_2024_2025_sample.csv', index_col="unique_key")
df_median_rent = pd.read_csv('data/medianAskingRent_All.csv')

print(f"NYC 311 data shape: {df_nyc_311.shape}")
print(f"Median rent data shape: {df_median_rent.shape}")