# Bronze Data Load

This notebook extracts the source datasets and loads them into the Bronze layer.

The data will be stored as-is from the source in its raw state, without any transformations, cleaning, or modification.

For more information on Medallion Architecture, see [Databricks Glossary](https://www.databricks.com/glossary/medallion-architecture) (Databricks, n.d.).

-----

### References  
Databricks. (n.d.). *Medallion Architecture*. Retrieved May 10, 2025, from [https://www.databricks.com/glossary/medallion-architecture](https://www.databricks.com/glossary/medallion-architecture)

In [1]:
%pip install -r ../../requirements.txt

Note: you may need to restart the kernel to use updated packages.


In [5]:
import kagglehub
import os
import subprocess
import requests
import json
import pandas as pd
from sodapy import Socrata

In [None]:
# San Jose: CKAN datastore API endpoint and resource ID
BASE     = "https://data.sanjoseca.gov/api/3/action/datastore_search"
resource = "f3354a37-7e03-41f8-a94d-3f720389a68a"

# Request up to 10,000 records from the San Jose resource
params = {
    "resource_id": resource,
    "limit": 10000
}
resp = requests.get(BASE, params=params, timeout=30)
resp.raise_for_status()  # fail fast on HTTP errors
records = resp.json()["result"]["records"]
san_jose_df = pd.DataFrame.from_records(records)

# Dallas: CSV export already stored in the bronze folder
dallas_df = pd.read_csv("../../data-assets/bronze/Dallas_Animal_Shelter_Data_Fiscal_Year_2023_-_2025_20250516.csv")

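The San Jose request asks for at most 10,000 records in a single call, so a larger resource would be silently truncated. The sketch below shows one way to page through a CKAN `datastore_search` resource, assuming the portal follows the standard CKAN behaviour of accepting an `offset` parameter and reporting a `total` count in `result`. The helper names `fetch_all_records` and `san_jose_page` are hypothetical and not part of the original notebook.

```python
from typing import Callable, Dict, List


def fetch_all_records(fetch_page: Callable[[int, int], Dict], page_size: int = 10000) -> List[Dict]:
    """Collect every record by advancing the offset until the reported total is reached."""
    records: List[Dict] = []
    offset = 0
    while True:
        result = fetch_page(offset, page_size)
        page = result["records"]
        records.extend(page)
        offset += len(page)
        total = result.get("total")  # assumed to be present, as in standard CKAN responses
        # Stop on an empty page, or once the reported total has been collected.
        if not page or (total is not None and offset >= total):
            break
    return records


def san_jose_page(offset: int, limit: int) -> Dict:
    """Fetch one page from the San Jose datastore (same endpoint and resource ID as above)."""
    import requests  # imported here so the paging helper itself has no dependencies

    resp = requests.get(
        "https://data.sanjoseca.gov/api/3/action/datastore_search",
        params={
            "resource_id": "f3354a37-7e03-41f8-a94d-3f720389a68a",
            "limit": limit,
            "offset": offset,
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["result"]


# Usage (performs network requests):
# records = fetch_all_records(san_jose_page)
```

Passing the page fetcher in as a function keeps the loop easy to check against a fake data source before pointing it at the live API.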


In [45]:
pd.set_option('display.max_columns', None)
san_jose_df.head(10)

Unnamed: 0,_id,AnimalID,AnimalName,AnimalType,PrimaryColor,SecondaryColor,PrimaryBreed,Sex,DOB,Age,IntakeDate,IntakeCondition,IntakeType,IntakeSubtype,IntakeReason,OutcomeDate,OutcomeType,OutcomeSubtype,OutcomeCondition,Crossing,Jurisdiction,LastUpdate
0,1,A0075579,BAILEY,DOG,BLACK,RED,LABRADOR RETR,SPAYED,1994-01-16T00:00:00,16 YEARS,2024-10-15T00:00:00,MED R,STRAY,OTC,,2024-10-15T00:00:00,RTO,,MED R,SENTER RD X TULLY RD,SAN JOSE,2024-10-15T00:00:00
1,2,A0533827,PATCHES,DOG,TRICOLOR,BLACK,PARSON RUSS TER,NEUTERED,2006-02-06T00:00:00,19 YEARS,2024-08-28T00:00:00,MED SEV,EUTH REQ,,,2024-08-28T00:00:00,RTO,,MED SEV,,SANTA CLARA,2024-08-28T00:00:00
2,3,A0538570,CARAMEL,CAT,CALICO-TRI,,DOMESTIC SH,SPAYED,,NO AGE,2024-09-30T00:00:00,DEAD,DISPO REQ,OTC,,2024-09-30T00:00:00,DISPOSAL,,DEAD,HEIMGARTNER LN,SAN JOSE,2024-09-30T00:00:00
3,4,A0564053,BISCUIT,CAT,ORANGE,,DOMESTIC SH,NEUTERED,2007-06-01T00:00:00,17 YEARS,2024-10-29T00:00:00,DEAD,DISPO REQ,OTC,,2024-10-29T00:00:00,DISPOSAL,,DEAD,STARLITE DR,MILPITAS,2024-10-29T00:00:00
4,5,A0569573,SHALE,CAT,BLACK,,DOMESTIC SH,SPAYED,2007-10-12T00:00:00,16 YEARS,2024-09-25T00:00:00,MED SEV,STRAY,OTC,,2024-09-25T00:00:00,EUTH,,MED SEV,1600 BLOCK ALMADEN RD,SAN JOSE,2024-09-25T00:00:00
5,6,A0608333,BLACK,CAT,TORBI-BRN,,DOMESTIC SH,SPAYED,2008-05-05T00:00:00,17 YEARS,2024-08-31T00:00:00,MED R,STRAY,OTC,,2024-09-03T00:00:00,RTF,,FERAL,RIVER VIEW DR,SAN JOSE,2024-09-03T00:00:00
6,7,A0636780,SPONGIE,DOG,FAWN,,POODLE MIN,NEUTERED,,NO AGE,2024-07-22T00:00:00,MED SEV,STRAY,OTC,,2024-07-22T00:00:00,RTO,,MED SEV,SNELL AVE,SAN JOSE,2024-08-02T00:00:00
7,8,A0643984,TAMIA,DOG,BROWN,,CHIHUAHUA SH,SPAYED,2009-02-02T00:00:00,15 YEARS,2024-10-30T00:00:00,DEAD,DISPO REQ,OTC OWNED,,2024-10-30T00:00:00,DISPOSAL,,DEAD,,SAN JOSE,2024-10-30T00:00:00
8,9,A0652502,KUJO,CAT,BLACK,,DOMESTIC SH,NEUTERED,2009-04-16T00:00:00,15 YEARS,2024-11-06T00:00:00,DEAD,DISPO REQ,FIELD,,2025-02-01T00:00:00,DISPOSAL,,DEAD,JASMINE X YERBA BUENA,SAN JOSE,2025-02-01T00:00:00
9,10,A0663870,ERNIE,CAT,TABBY-ORG,,DOMESTIC SH,NEUTERED,2005-12-29T00:00:00,18 YEARS,2024-11-21T00:00:00,DEAD,DISPO REQ,OTC OWNED,,2024-11-21T00:00:00,DISPOSAL,,DEAD,,SAN JOSE,2024-11-21T00:00:00
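The San Jose records above exist only in memory. A minimal sketch of persisting them unchanged to the bronze folder follows. It assumes a date-stamped JSON file is an acceptable raw format, mirroring the `20250516` suffix on the Dallas CSV. The function name `save_raw_records` and the file name pattern are hypothetical.

```python
import json
from datetime import date
from pathlib import Path
from typing import Dict, List


def save_raw_records(records: List[Dict], bronze_dir: str, source_name: str, retrieved: date) -> Path:
    """Write records unchanged to <bronze_dir>/<source_name>_<YYYYMMDD>.json."""
    out_dir = Path(bronze_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{source_name}_{retrieved:%Y%m%d}.json"
    with out_path.open("w", encoding="utf-8") as fh:
        # ensure_ascii=False keeps any non-ASCII characters exactly as received
        json.dump(records, fh, ensure_ascii=False)
    return out_path


# Usage (hypothetical source name):
# save_raw_records(records, "../../data-assets/bronze", "san_jose_animal_shelter", date.today())
```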


In [47]:
pd.set_option('display.max_columns', None)
dallas_df.head(10)

Unnamed: 0,Animal_Id,Animal_Type,Animal_Breed,Kennel_Number,Kennel_Status,Tag_Type,Activity_Number,Activity_Sequence,Source_Id,Census_Tract,Council_District,Intake_Type,Intake_Subtype,Intake_Total,Reason,Staff_Id,Intake_Date,Intake_Time,Due_Out,Intake_Condition,Hold_Request,Outcome_Type,Outcome_Subtype,Outcome_Date,Outcome_Time,Receipt_Number,Impound_Number,Service_Request_Number,Outcome_Condition,Chip_Status,Animal_Origin,Additional_Information,Month,Year
0,A0011910,DOG,PIT BULL,RESC FOST,UNAVAILABLE,,A23-412044,1,P9998533,4900.0,4.0,STRAY,CONFINED,1,OTHRINTAKS,CAB8533,12/21/2023,20:07:00,12/25/2023,APP WNL,EMERGENCY RESCUE,TRANSFER,MEDICAL,01/09/2024,11:27:00,,K23-609562,,APP SICK,SCAN NO CHIP,FIELD,,DEC.2023,FY2024
1,A0011910,DOG,PIT BULL,DC 24,AVAILABLE,,A23-412044,1,P0737656,,,TREATMENT,SPAY/NEUT,1,SURGERY,JFP,02/19/2024,09:29:00,02/19/2024,APP WNL,EMERGENCY RESCUE,TREATMENT,COMPLETED,02/19/2024,16:07:00,,K24-615492,,APP WNL,SCAN CHIP,OVER THE COUNTER,RGE,FEB.2024,FY2024
2,A0178985,DOG,ROTTWEILER,2708,LAB,,A24-443161,1,P1096837,8701.0,4.0,STRAY,AT LARGE,1,OTHRINTAKS,CAB8533,06/20/2024,19:03:00,06/26/2024,APP INJ,ADOP RESCU,EUTHANIZED,HUMANE,06/21/2024,16:21:00,,K24-631913,,APP INJ,SCAN CHIP,FIELD,,JUN.2024,FY2024
3,A0180810,DOG,MIXED BREED,G05,AVAILABLE,,,1,P0886821,4000.0,7.0,OWNER SURRENDER,WALK IN,1,PERSNLISSU,GRA,10/07/2024,12:51:00,10/07/2024,APP WNL,ADOP RESCU,FOSTER,TO ADOPT,10/18/2024,18:25:00,R24-622191,K24-644844,,APP WNL,SCAN CHIP,OVER THE COUNTER,,FY2024,FY2024
4,A0180810,DOG,MIXED BREED,G05,AVAILABLE,,,1,P1112628,,,FOSTER,APPOINT,1,FOR ADOPT,KDC,11/13/2024,16:21:00,11/13/2024,APP WNL,ADOP RESCU,ADOPTION,BY FOSTER,11/13/2024,16:22:00,R24-622191,K24-648974,,APP WNL,SCAN CHIP,OVER THE COUNTER,,FY2024,FY2024
5,A0329215,DOG,ROTTWEILER,LAB 03,LAB,,A24-443134,1,P0964738,12702.0,9.0,OWNER SURRENDER,URGENT,1,MEDICAL,VDB8516,06/20/2024,13:28:00,06/20/2024,APP SICK,DD/AGG,EUTHANIZED,DD/AGG,06/22/2024,15:25:00,,K24-631842,,APP WNL,SCAN CHIP,FIELD,JAP,JUN.2024,FY2024
6,A0521350,DOG,LABRADOR RETR,HOME,UNAVAILABLE,,,1,P0523745,19034.0,10.0,RESOURCE,SUPPLIES,1,FINANCIAL,ZOJ,05/31/2024,11:21:00,05/31/2024,APP WNL,,CLOSED,REMAIN HOM,05/31/2024,00:00:00,,K24-628809,,APP WNL,SCAN CHIP,OVER THE COUNTER,,MAY.2024,FY2024
7,A0524426,DOG,CHIHUAHUA SH,1214,AVAILABLE,,A25-489143,1,P0525590,5500.0,4.0,OWNER SURRENDER,URGENT,1,MEDICAL,GMM1713,04/11/2025,13:37:00,04/11/2025,APP WNL,ADOP RESCU,TRANSFER,GENERAL,04/12/2025,14:47:00,,K25-663323,,APP SICK,SCAN CHIP,FIELD,,FY2024,FY2024
8,A0569318,DOG,SKYE TERRIER,FREEZER,LAB,,,1,P1103275,20500.0,6.0,OWNER SURRENDER,EUTHANASIA REQUESTED,1,MEDICAL,GRA,08/06/2024,14:12:00,08/12/2024,GERIATRIC,ADOP RESCU,EUTHANIZED,HUMANE,08/06/2024,15:59:00,,K24-637625,,APP SICK,SCAN CHIP,OVER THE COUNTER,,AUG.2024,FY2024
9,A0575921,CAT,DOMESTIC MH,2333,AVAILABLE,,A24-435367,1,P1090127,8703.0,4.0,KEEPSAFE,OWN DECEAS,1,OTHER,RVB1213,05/08/2024,13:28:00,05/09/2024,APP WNL,ADOP RESCU,TRANSFER,GENERAL,05/21/2024,11:28:00,,K24-625764,,APP WNL,SCAN NO CHIP,FIELD,AGA,MAY.2024,FY2024
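In the preview above, `Census_Tract` and `Council_District` appear as `4900.0` and `4.0`, with blanks in other rows. This is what pandas produces when it infers a numeric column containing missing values as float. Whether the raw file stores the values that way is not shown here. If the goal is to keep bronze values exactly as they appear in the file, one option is to read every column as text. The sketch below uses a small made-up CSV to illustrate the difference; it is not the Dallas file itself.

```python
import io

import pandas as pd

# Two hypothetical rows shaped like the Dallas file; the second has an empty Census_Tract.
sample_csv = "Animal_Id,Census_Tract\nA0011910,4900\nA0178985,\n"

# Default parsing infers types: the column becomes float64 and the blank becomes NaN.
inferred_df = pd.read_csv(io.StringIO(sample_csv))

# Reading every column as text keeps each value as written, and
# keep_default_na=False leaves empty fields as empty strings instead of NaN.
raw_df = pd.read_csv(io.StringIO(sample_csv), dtype=str, keep_default_na=False)

print(inferred_df["Census_Tract"].tolist())
print(raw_df["Census_Tract"].tolist())
```

Type conversion can then be deferred to a later layer, where it is an explicit and repeatable step.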


##########

In [10]:
# Configurations
DATASET_NAMES = [
    "data-assets/Dallas_Animal_Shelter_Data_Fiscal_Year_2023_-_2025_20250516.csv"
]
DATA_DIRECTORY = "data-assets"

# Set this to True if you modify files and need to force the dataset to be re-downloaded
CLEAR_CACHE = False

Below, we load the datasets into our `data-assets/bronze` folder.

This seeds our **bronze** layer, which has raw, unprocessed datasets. Keeping these is valuable as it allows us to reprocess data if our methods change, and helps with auditing and troubleshooting.
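To support the auditing use mentioned above, one option is to record a checksum for each raw file so that later changes to the bronze folder can be detected. This is a sketch under assumptions, not part of the original pipeline: the helper names `sha256_of_file` and `build_manifest` are hypothetical, and SHA-256 is simply one reasonable choice of hash.

```python
import hashlib
from pathlib import Path
from typing import Dict


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large CSVs do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(bronze_dir: str) -> Dict[str, str]:
    """Map each file under bronze_dir (by relative path) to its SHA-256 hex digest."""
    root = Path(bronze_dir)
    return {
        p.relative_to(root).as_posix(): sha256_of_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


# Usage:
# manifest = build_manifest("data-assets/bronze")
```

Comparing a stored manifest with a freshly built one shows which raw files were added, removed, or modified since the last load.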

In [11]:
import shutil

# Download the latest version of the dataset from Kaggle, once per configured dataset name
for dataset_name in DATASET_NAMES:

    # Clear the kagglehub download cache if present, to ensure we are getting fresh data.
    # This helps ensure consistent results across different machines.
    cache_dir = os.path.join(os.path.expanduser("~"), ".cache/kagglehub/datasets", dataset_name)
    if CLEAR_CACHE and os.path.exists(cache_dir):
        print("Cache found. Deleting cache...")
        shutil.rmtree(cache_dir)

    # Download the dataset from Kaggle. The handle is hard-coded here;
    # dataset_name only determines the cache path above and the destination folder below.
    temp = kagglehub.dataset_download("aaronschlegel/austin-animal-center-shelter-intakes-and-outcomes")
    print("Downloaded dataset files in temp path:", temp)

    # Move the dataset files into the bronze folder. Create the directory if it does not exist
    data_dir = f"{DATA_DIRECTORY}/bronze/{dataset_name}"
    if not os.path.exists(data_dir):
        os.makedirs(data_dir)
        print(f"Created directory {data_dir}.")

    for file in os.listdir(temp):
        # shutil.move also works when the cache and the project are on different filesystems
        shutil.move(os.path.join(temp, file), os.path.join(data_dir, file))
        print(f"Moved {file} to current working directory in bronze.")

Downloading from https://www.kaggle.com/api/v1/datasets/download/aaronschlegel/austin-animal-center-shelter-intakes-and-outcomes?dataset_version_number=1...


100%|██████████| 9.25M/9.25M [00:00<00:00, 15.5MB/s]

Extracting files...





Downloaded dataset files in temp path: /Users/mariamckay/.cache/kagglehub/datasets/aaronschlegel/austin-animal-center-shelter-intakes-and-outcomes/versions/1
Created directory data-assets/bronze/data-assets/Dallas_Animal_Shelter_Data_Fiscal_Year_2023_-_2025_20250516.csv.
Moved aac_intakes_outcomes.csv to current working directory in bronze.
Moved aac_intakes.csv to current working directory in bronze.
Moved aac_outcomes.csv to current working directory in bronze.
