# Bronze Data Load

**Purpose:**  
Ingest raw data from all sources into the Bronze layer with **no business logic** or feature engineering—only the bare minimum of cleaning required for schema alignment.

**What this notebook does:**  
1. **Reads** data from:  
   - San Jose API (JSON → DataFrame)  
   - Dallas CSV  
   - SoCo CSV  
2. **Materializes** the data into our "tables":
   - `data-assets/bronze/dallas_df.parquet`
   - `data-assets/bronze/san_jose_df.parquet`
   - `data-assets/bronze/soco_df.parquet`

This data will be used when creating [Silver](./2_1_silver.ipynb), where it will be cleaned and pre-processed to allow us to work with higher quality data.

For more on Medallion Architecture, see [Databricks Glossary: Medallion Architecture](https://www.databricks.com/glossary/medallion-architecture) (Databricks, n.d.).

---

### References  
Databricks. (n.d.). *Medallion Architecture*. Retrieved May 10, 2025, from https://www.databricks.com/glossary/medallion-architecture


---
## Table of Contents

1. [Setup](#setup)  
   - Install project dependencies from requirements.txt
   - Import essential libraries (os, requests, pandas)

2. [Configuration](#configuration)  
   - Define data directory paths
   - Set up API endpoints and parameters
   - Configure date column mappings
   - Centralize all file paths for reproducibility

3. [Data Loading](#data-loading)  
   - Fetch and parse San Jose animal shelter data from API
   - Read Dallas shelter data from CSV
   - Import Sonoma County shelter data from CSV

4. [Materialize Bronze](#materialize-bronze)  
   - Save all datasets as parquet files in bronze layer

----

## 1. Setup

**Purpose:**  
Ensure the environment has all necessary libraries installed and imported.  
- `%pip install -r ../../requirements.txt` installs dependencies. 

> **Note:** We use a project-wide `requirements.txt` for consistency.

In [21]:
%pip install -r ../../requirements.txt

Collecting pyarrow==20.0.0 (from -r ../../requirements.txt (line 6))
  Downloading pyarrow-20.0.0-cp311-cp311-macosx_12_0_arm64.whl.metadata (3.3 kB)
Downloading pyarrow-20.0.0-cp311-cp311-macosx_12_0_arm64.whl (30.9 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 30.9/30.9 MB 44.6 MB/s eta 0:00:00
Installing collected packages: pyarrow
Successfully installed pyarrow-20.0.0
Note: you may need to restart the kernel to use updated packages.


In [34]:
import os
import requests
import pandas as pd

-----

## 2. Configuration

**Purpose:**  
Centralize all “magic” values (file paths, API endpoints, parameters, and date-column names) in one place, which makes it easy to load everything locally.
- Makes the notebook reproducible.  
- Keeps the loading cells concise.

In [30]:
# ─── Configuration ───

# File paths for the CSV files
DATA_DIR = "../../data-assets/_raw"
CSV_PATHS = {
    "dallas": os.path.join(
        DATA_DIR,
        "Dallas_Animal_Shelter_Data_Fiscal_Year_2023_-_2025_20250516.csv"
    ),
    "soco": os.path.join(
        DATA_DIR,
        "SoCo_Animal_Shelter_Intake_and_Outcome_20250519.csv"
    ),
}
API_PATHS = {
    "san_jose": {
        "base_url":    "https://data.sanjoseca.gov/api/3/action/datastore_search",
        # Setting this limit tells the API to return up to 10,000 records in a single call without
        # overwhelming the API or the network with an unbounded payload.
        # This works around the CKAN API's default of 100 records per request.
        "params":      {"resource_id": "f3354a37-7e03-41f8-a94d-3f720389a68a", "limit": 10000}, 
    }
}

# ─── Date columns to parse ───
DATE_COLS = {
    "san_jose": ["IntakeDate", "OutcomeDate"],
    "dallas":   ["Intake_Date", "Outcome_Date"],
    "soco":     ["Intake Date", "Outcome Date"],
}
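
The `limit` of 10,000 configured above caps a single call, so any rows beyond that number would not be returned. If the resource ever grows past the limit, one option is to page through it with CKAN's `offset` parameter. The sketch below is a minimal illustration, not part of the original notebook: `fetch_all_records` and the `fetch_page` callable are hypothetical names, and the default `page_size` of 1,000 is an arbitrary assumption.

```python
from typing import Callable, Dict, List


def fetch_all_records(
    fetch_page: Callable[[int, int], Dict],
    page_size: int = 1000,
) -> List[dict]:
    """Collect records page by page until the source is exhausted.

    `fetch_page(offset, limit)` is a hypothetical callable that returns the
    `result` object of a CKAN `datastore_search` response: a dict with a
    `records` list and, usually, a `total` count.
    """
    records: List[dict] = []
    offset = 0
    while True:
        result = fetch_page(offset, page_size)
        batch = result.get("records", [])
        if not batch:
            break  # an empty page means there is nothing left to read
        records.extend(batch)
        offset += len(batch)
        total = result.get("total")
        if total is not None and offset >= total:
            break  # everything the API reported has been collected
    return records


# Hypothetical usage with the San Jose settings from the configuration cell:
#
# def san_jose_page(offset, limit):
#     params = {**API_PATHS["san_jose"]["params"], "offset": offset, "limit": limit}
#     resp = requests.get(API_PATHS["san_jose"]["base_url"], params=params)
#     resp.raise_for_status()
#     return resp.json()["result"]
#
# san_jose_df = pd.DataFrame(fetch_all_records(san_jose_page))
```

Passing the page fetcher in as a callable keeps the paging loop independent of the network, so it can be exercised with an in-memory stub before pointing it at the real endpoint.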

-----

## 3. Data Loading

**Purpose:**  
Pull raw data into pandas DataFrames:  
1. Call the San Jose API, parse its JSON response, and convert the date columns with `pd.to_datetime`.  
2. Read the Dallas and SoCo CSV files, converting date strings to `datetime64`.  

In [31]:
# ─── Data Loading ───

# San Jose API
resp = requests.get(
    API_PATHS["san_jose"]["base_url"], 
    params=API_PATHS["san_jose"]["params"]
)
resp.raise_for_status()
san_jose_df = pd.DataFrame(resp.json()["result"]["records"])

# Parse as datetimes
for col in DATE_COLS["san_jose"]:
    san_jose_df[col] = pd.to_datetime(san_jose_df[col], errors="coerce")

# Dallas and SoCo CSVs
dallas_df = pd.read_csv(
    CSV_PATHS["dallas"],
    parse_dates=DATE_COLS["dallas"],
    low_memory=False
)
soco_df = pd.read_csv(
    CSV_PATHS["soco"],
    parse_dates=DATE_COLS["soco"],
    low_memory=False
)
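
Because the San Jose date columns are parsed with `errors="coerce"`, any value that cannot be parsed silently becomes `NaT`. A quick way to see how many values were affected is to compare missing values before and after parsing. The sketch below is an illustration only: `count_coerced` is a hypothetical helper and the sample values are made up.

```python
import pandas as pd


def count_coerced(raw: pd.Series) -> int:
    """Count values that were present in the raw column but became NaT after parsing."""
    parsed = pd.to_datetime(raw, errors="coerce")
    # A value was coerced if it was not missing before parsing but is missing afterwards
    return int((raw.notna() & parsed.isna()).sum())


# Made-up sample: one unparseable string and one value that was missing to begin with
raw_dates = pd.Series(["2025-01-05", "not a date", None, "2025-03-01"])
print(count_coerced(raw_dates))
```

To apply this to the real data, the check would need to run on the raw string column before it is overwritten with the parsed values.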

-----

## 4. Materialize Bronze

**Purpose:**  
Materialize the Bronze data as-is from the source.

This gives us a copy of the data for reproducibility and for rebuilding Silver when needed. By materializing the data, we avoid re-incurring the cost of pulling it from an API or downloading a CSV, and we store it as-is, in Parquet format, for future use cases.

A Medallion architecture is commonly backed by a database. Since we do not have one, we materialize the data by writing it to `.parquet` files. Parquet preserves column data types, is typically faster to read for analysis than CSV, and is an efficient, widely used storage format.

In [32]:
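BRONZE_DIR = "../../data-assets/bronze"

# Create the output directory if it does not exist yet
os.makedirs(BRONZE_DIR, exist_ok=True)

# Map each output name to its DataFrame explicitly instead of looking names up
# in globals(), which can match other variables bound to the same object.
dfs = {
    "san_jose_df": san_jose_df,
    "dallas_df": dallas_df,
    "soco_df": soco_df,
}

for df_name, df in dfs.items():
    df.to_parquet(os.path.join(BRONZE_DIR, f"{df_name}.parquet"), index=False)
    print(f"Saved {df_name} to {BRONZE_DIR}/{df_name}.parquet")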

Saved san_jose_df to ../../data-assets/bronze/san_jose_df.parquet
Saved dallas_df to ../../data-assets/bronze/dallas_df.parquet
Saved soco_df to ../../data-assets/bronze/soco_df.parquet


-----