# **Bronze Data Load**

**Purpose:**  Ingest raw data from all sources into the Bronze layer with **no business logic** or feature engineering—only the bare minimum of cleaning required for schema alignment.

**What this notebook does:**  
1. **Reads** data from:  
   - San Jose API (JSON → DataFrame)  
   - Dallas CSV  
   - SoCo CSV  
2. **Materializes** the data into our "tables":
   - `data-assets/bronze/dallas_df.parquet`
   - `data-assets/bronze/san_jose_df.parquet`
   - `data-assets/bronze/soco_df.parquet`

This data feeds the [Silver](./2_silver.ipynb) notebook, where it is cleaned and pre-processed so that later steps work with higher-quality data.


```mermaid
flowchart LR
    A([Bronze])
    %% Bronze styling - brown/bronze color
    style A fill:#cd7f32,stroke:#8b4513,stroke-width:3px,color:#fff
    B([Silver])
    C([Gold])

    A --> B
    B --> C
```


For more on Medallion Architecture, see [Databricks Glossary: Medallion Architecture](https://www.databricks.com/glossary/medallion-architecture) (Databricks, n.d.).

---

## **Table of Contents**

1. [Setup](#1-setup)  
   Install dependencies and import essential libraries for data processing

1. [Configuration](#2-configuration)  
   Define file paths, API endpoints, and date columns

1. [Data Loading](#3-data-loading)   
   Fetch regional shelter data from the API and CSV files into DataFrames

1. [Materialize Bronze](#4-materialize-bronze)   
   Write the raw DataFrames to Bronze Parquet files for Silver to clean

1. [References](#5-references)   

---

## **1. Setup**

**Purpose:**  Ensure the environment has all necessary libraries installed and imported.  
```python
# Install project-wide dependencies
%pip install -r ../../requirements.txt
``` 

> **Note:** We use a project-wide `requirements.txt` for consistency.

In [None]:
%pip install -r ../../requirements.txt


In [2]:
import os
import requests
import pandas as pd

---

## **2. Configuration**

**Purpose:**  
Centralize all “magic” values (file paths, API endpoints, request parameters, and date-column names) in one place so that everything can be loaded locally. This:
- Makes the notebook reproducible.  
- Keeps the loading cells concise.

In [3]:
# ─── Configuration ───

# File paths for the CSV files
DATA_DIR = "../../data-assets/_raw"
CSV_PATHS = {
    "dallas": os.path.join(
        DATA_DIR, "Dallas_Animal_Shelter_Data_Fiscal_Year_2023_-_2025_20250516.csv"
    ),
    "soco": os.path.join(
        DATA_DIR, "SoCo_Animal_Shelter_Intake_and_Outcome_20250519.csv"
    ),
}
API_PATHS = {
    "san_jose": {
        "base_url": "https://data.sanjoseca.gov/api/3/action/datastore_search",
        # The API returns only 100 rows by default, so we set a high limit
        # to fetch every row in a single request.
        # The expected number of rows was 16,274 as of 2025-05-25.
        "params": {
            "resource_id": "f3354a37-7e03-41f8-a94d-3f720389a68a",
            "limit": 1000000,
        },
    }
}

# ─── Date columns to parse ───
DATE_COLS = {
    "san_jose": ["IntakeDate", "OutcomeDate"],
    "dallas": ["Intake_Date", "Outcome_Date"],
    "soco": ["Intake Date", "Outcome Date"],
}
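
Because every input path lives in one mapping, a quick pre-flight check can report missing files before any loading cell runs. The following is a minimal sketch, not part of the original notebook: the helper name `missing_files` and the example path are hypothetical, and it assumes a mapping shaped like `CSV_PATHS` (name to file path).

```python
import os


def missing_files(paths: dict[str, str]) -> dict[str, str]:
    """Return the entries of `paths` whose file does not exist on disk."""
    return {name: path for name, path in paths.items() if not os.path.isfile(path)}


# Hypothetical mapping shaped like CSV_PATHS; the path is illustrative only.
example_paths = {"dallas": "does/not/exist.csv"}

missing = missing_files(example_paths)
if missing:
    print(f"Missing input files: {missing}")
```

In the notebook, the same call would be `missing_files(CSV_PATHS)`, run just before the loading cell.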

---

## **3. Data Loading**

**Purpose:**  
Pull raw data into pandas DataFrames:  
-  Call the San Jose API and parse its JSON response.  
-  Read the Dallas + SoCo CSV files, converting date strings to `datetime64`.  
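
To illustrate the date conversion described above, here is a small sketch of how `pd.to_datetime(..., errors="coerce")` behaves on messy input: values that cannot be parsed become `NaT` instead of raising an error. The helper name `coerce_dates` and the sample values are made up for illustration.

```python
import pandas as pd


def coerce_dates(df: pd.DataFrame, cols: list[str]) -> pd.DataFrame:
    """Return a copy of `df` with the given columns parsed as datetimes."""
    out = df.copy()
    for col in cols:
        # Unparseable or missing values become NaT rather than raising
        out[col] = pd.to_datetime(out[col], errors="coerce")
    return out


# Illustrative sample, not real shelter data
sample = pd.DataFrame({"IntakeDate": ["2025-05-01", "not a date", None]})
parsed = coerce_dates(sample, ["IntakeDate"])

print(parsed["IntakeDate"].isna().sum())  # → 2
```

Counting the `NaT` values after parsing is a cheap way to see how many source values could not be interpreted as dates.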

In [4]:
# ─── Data Loading ───

# San Jose API
resp = requests.get(
    API_PATHS["san_jose"]["base_url"],
    params=API_PATHS["san_jose"]["params"],
    timeout=60,  # fail instead of hanging indefinitely if the API is unresponsive
)
resp.raise_for_status()
san_jose_df = pd.DataFrame(resp.json()["result"]["records"])

# Parse as datetimes
for col in DATE_COLS["san_jose"]:
    san_jose_df[col] = pd.to_datetime(san_jose_df[col], errors="coerce")


def read(name: str) -> pd.DataFrame:
    """Read the CSV registered under `name`, parsing its configured date columns."""
    return pd.read_csv(CSV_PATHS[name], parse_dates=DATE_COLS[name], low_memory=False)


# Dallas and SoCo CSVs
dallas_df = read("dallas")
soco_df = read("soco")
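
The San Jose request above relies on one very large `limit`. If the server ever caps the page size, an alternative is to page through the results. The sketch below is an assumption-laden illustration rather than the notebook's method: it assumes the endpoint follows the CKAN `datastore_search` conventions (an `offset` parameter and a `total` field in `result`), and the names `fetch_page`, `fetch_all`, and `page_size` are hypothetical.

```python
from typing import Callable

import pandas as pd
import requests

BASE_URL = "https://data.sanjoseca.gov/api/3/action/datastore_search"
RESOURCE_ID = "f3354a37-7e03-41f8-a94d-3f720389a68a"


def fetch_page(offset: int, limit: int) -> dict:
    """Fetch one page and return the `result` dict (assumed CKAN-style)."""
    resp = requests.get(
        BASE_URL,
        params={"resource_id": RESOURCE_ID, "limit": limit, "offset": offset},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["result"]


def fetch_all(
    get_page: Callable[[int, int], dict] = fetch_page, page_size: int = 5000
) -> pd.DataFrame:
    """Collect every record by requesting pages until the data runs out."""
    records: list[dict] = []
    offset = 0
    while True:
        result = get_page(offset, page_size)
        page = result["records"]
        records.extend(page)
        offset += len(page)
        total = result.get("total")
        # Stop on an empty page, or once the reported total has been reached
        if not page or (total is not None and offset >= total):
            break
    return pd.DataFrame(records)


# Usage (performs network requests): san_jose_df = fetch_all()
```

Passing the page fetcher in as `get_page` keeps the loop testable without network access, since a stub can stand in for the real API.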

---

## **4. Materialize Bronze**

**Purpose:**  
Materialize the Bronze data as-is from the source.

This gives us a copy of the data for reproducibility and for rebuilding Silver if needed. By materializing the data, we avoid re-incurring the cost of calling the API or downloading a CSV, and we store it as-is, in Parquet format, for future use cases.

Since we do not have a database, as is common in a Medallion architecture, we materialize the data by writing it to `.parquet` files. Parquet allows for faster analysis, preserves column data types, and is an efficient, widely used storage format.

In [5]:
BRONZE_DIR = "../../data-assets/bronze"
os.makedirs(BRONZE_DIR, exist_ok=True)

# Map each output name to its DataFrame explicitly; looking names up in
# globals() is fragile because other variables can alias the same object.
dfs = {
    "san_jose_df": san_jose_df,
    "dallas_df": dallas_df,
    "soco_df": soco_df,
}

for df_name, df in dfs.items():
    df.to_parquet(os.path.join(BRONZE_DIR, f"{df_name}.parquet"), index=False)
    print(f"Saved {df_name} to {BRONZE_DIR}/{df_name}.parquet")

Saved san_jose_df to ../../data-assets/bronze/san_jose_df.parquet
Saved dallas_df to ../../data-assets/bronze/dallas_df.parquet
Saved soco_df to ../../data-assets/bronze/soco_df.parquet


In [6]:
for df_name, df in dfs.items():
    print(f"{df_name} shape: {df.shape}")

san_jose_df shape: (16490, 22)
dallas_df shape: (65079, 34)
soco_df shape: (30554, 24)


> -> Click to go to [Silver Layer](./2_silver.ipynb).

-----

## **5. References**  
Databricks. (n.d.). *Medallion Architecture*. Retrieved May 10, 2025, from https://www.databricks.com/glossary/medallion-architecture
