<small> We import the `os` module to handle file system operations like creating directories and managing file paths. We also import `pandas`, the core Python library for data manipulation and analysis, which enables us to read, write, and transform tabular data efficiently. </small>

In [6]:
# os: create directories and build file paths
import os
# pandas: read, write, and transform tabular data
import pandas as pd

<small> This line ensures that the directory `"data/raw"` exists before saving any files there. The parameter `exist_ok=True` prevents an error if the folder already exists, making the operation safe to run multiple times without interruption. </small>

In [7]:
# Ensure the folder exists before saving files there.
# exist_ok=True prevents an error if the folder already exists.
os.makedirs("data/raw", exist_ok=True)
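
<small> As a minimal sketch of why this call is safe to repeat, the snippet below runs the same `os.makedirs` call twice inside a throwaway temporary directory, then shows that omitting `exist_ok=True` raises `FileExistsError` once the folder exists. The helper name `ensure_dir` and the temporary location are illustrative assumptions, not part of the notebook. </small>

```python
import os
import tempfile


def ensure_dir(path):
    """Create path (and any missing parents); do nothing if it already exists."""
    os.makedirs(path, exist_ok=True)
    return os.path.isdir(path)


with tempfile.TemporaryDirectory() as tmp:
    raw_dir = os.path.join(tmp, "data", "raw")

    first = ensure_dir(raw_dir)   # creates data/ and data/raw/
    second = ensure_dir(raw_dir)  # folder already exists: no error

    # Without exist_ok=True, the same call fails on an existing folder.
    try:
        os.makedirs(raw_dir)
        raised = False
    except FileExistsError:
        raised = True

print(first, second, raised)
```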

<small> In this step, we load two datasets directly from online sources:

The ZIP code-level population data is loaded from a public NYC dataset URL. We explicitly read the `ZCTA` column as a string so that it can be merged with the complaints data later on.

The NYC 311 complaints data is also loaded from an online source, with the `$limit=1000` query parameter restricting the download to 1,000 records for initial analysis. The `incident_zip` column is read as a string to keep ZIP code formatting consistent.

These datasets form the foundation for our analysis, linking complaint records to population data by ZIP code.
</small>

In [None]:
# === Step 1: Load Population and Complaints Data ===
# Load ZIP code-level population data (read 'ZCTA' as a string for merging later)
data_url1 = "https://data.cityofnewyork.us/api/views/pri4-ifjk/rows.csv?accessType=DOWNLOAD"
pop_df = pd.read_csv(data_url1, dtype={"ZCTA": str})

# Load NYC 311 complaints data (read 'incident_zip' as a string to match ZCTA formatting)
data_url2 = "https://data.cityofnewyork.us/resource/erm2-nwe9.csv?$limit=1000"
complaints_df = pd.read_csv(data_url2, dtype={"incident_zip": str})
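
<small> To illustrate why the ZIP columns are read as strings, here is a small sketch that parses the same CSV text twice with `pd.read_csv`. The rows are made up for illustration and are not taken from the 311 dataset; only the `incident_zip` column name comes from the code above. With default type inference, a blank ZIP turns the whole column into floats and a leading zero is lost; with `dtype=str`, the text is kept as written. </small>

```python
import io

import pandas as pd

# Made-up sample rows (assumption): one blank ZIP and one ZIP with a leading zero.
csv_text = "incident_zip,complaint_type\n10001,Noise\n,Heating\n00501,Noise\n"

# Default inference: the column is parsed as numbers.
as_num = pd.read_csv(io.StringIO(csv_text))

# Explicit string dtype: values stay as text, blanks stay missing.
as_str = pd.read_csv(io.StringIO(csv_text), dtype={"incident_zip": str})

print(as_num["incident_zip"].tolist())
print(as_str["incident_zip"].tolist())
```

<small> Keeping both join keys as strings means a later merge compares like with like, rather than a float such as `10001.0` against the text `"10001"`. </small>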


<small> After loading the datasets, we save each one locally as a CSV file for future use. We define separate file paths for the complaints data and the population data within the `"data/raw"` directory. Using `.to_csv()` with `index=False`, we export each DataFrame without its index column, which keeps an extra unnamed column out of the files. Printing the file paths records where each file was written and provides an easy reference for subsequent processing steps. </small>

In [11]:
# Define separate paths
complaints_path = "data/raw/nyc_311_raw.csv"
pop_path = "data/raw/nyc_zcta_population.csv"

# Save each dataset to its own file
complaints_df.to_csv(complaints_path, index=False)
pop_df.to_csv(pop_path, index=False)

print(f"Saved complaints to: {complaints_path}")
print(f"Saved population data to: {pop_path}")

Saved complaints to: data/raw/nyc_311_raw.csv
Saved population data to: data/raw/nyc_zcta_population.csv
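
<small> A CSV file does not store column types, so the string dtype has to be requested again whenever a saved file is read back. The sketch below shows one possible round-trip check on a made-up two-row DataFrame written to a temporary folder; the file name `sample_raw.csv` and the sample values are assumptions for illustration only. </small>

```python
import os
import tempfile

import pandas as pd

# Made-up rows for illustration; not taken from the real datasets.
sample_df = pd.DataFrame(
    {"incident_zip": ["10001", "00501"], "complaint_type": ["Noise", "Heating"]}
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample_raw.csv")
    sample_df.to_csv(path, index=False)  # index=False: no extra index column

    file_exists = os.path.exists(path)
    # Re-apply the string dtype on read; the CSV itself carries no type information.
    reloaded = pd.read_csv(path, dtype={"incident_zip": str})
    # For comparison: default inference parses the ZIP codes as integers.
    reloaded_default = pd.read_csv(path)

print(file_exists)
print(list(reloaded.columns))
print(reloaded["incident_zip"].tolist())
print(reloaded_default["incident_zip"].tolist())
```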
