# Bronze Data Load

This notebook extracts the source datasets and loads them into the Bronze layer.

The data is stored exactly as it arrives from the source, in its raw state, with no transformation, cleaning, or other modification.

For more information on Medallion Architecture, see [Databricks Glossary](https://www.databricks.com/glossary/medallion-architecture) (Databricks, n.d.).

-----

### References  
Databricks. (n.d.). *Medallion Architecture*. Retrieved May 10, 2025, from [https://www.databricks.com/glossary/medallion-architecture](https://www.databricks.com/glossary/medallion-architecture)

In [1]:
%pip install kagglehub

Collecting kagglehub
  Downloading kagglehub-0.3.12-py3-none-any.whl.metadata (38 kB)
Collecting tqdm (from kagglehub)
  Downloading tqdm-4.67.1-py3-none-any.whl.metadata (57 kB)
Downloading kagglehub-0.3.12-py3-none-any.whl (67 kB)
Downloading tqdm-4.67.1-py3-none-any.whl (78 kB)
Installing collected packages: tqdm, kagglehub
Successfully installed kagglehub-0.3.12 tqdm-4.67.1
Note: you may need to restart the kernel to use updated packages.


In [None]:
import kagglehub
import os
import subprocess

In [None]:
# Configurations
KAGGLE_DATASET_NAMES = [
    "aaronschlegel/austin-animal-center-shelter-intakes-and-outcomes"
]
DATA_DIRECTORY = "data-assets"

# Set this to True if you modify files and need to force the dataset to be re-downloaded
CLEAR_CACHE = False

Below, we load the datasets into our `data-assets/bronze` folder.

This seeds our **bronze** layer, which has raw, unprocessed datasets. Keeping these is valuable as it allows us to reprocess data if our methods change, and helps with auditing and troubleshooting.
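
Since one reason for keeping raw files is auditing, it can help to record a checksum for each file when it lands in bronze. The following is a minimal sketch using only the standard library; the helper names `sha256_of_file` and `build_manifest` are hypothetical and are not part of this notebook or of `kagglehub`.

```python
import hashlib
from pathlib import Path
from typing import Dict


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(bronze_dir: str) -> Dict[str, str]:
    """Map each file under bronze_dir (relative POSIX path) to its SHA-256 digest."""
    root = Path(bronze_dir)
    return {
        path.relative_to(root).as_posix(): sha256_of_file(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }
```

Calling `build_manifest("data-assets/bronze")` after the load and saving the result next to the data would let a later run detect whether any raw file has changed.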

In [None]:
# Download the latest version of each dataset from Kaggle
for dataset_name in KAGGLE_DATASET_NAMES:

    # Clear downloads cache if present, to ensure we are getting fresh data
    # This helps ensure we are getting consistent results across different machines
    cache_dir = os.path.join(os.path.expanduser("~"), ".cache/kagglehub/datasets", dataset_name)
    if CLEAR_CACHE and os.path.exists(cache_dir):
        print("Cache found. Deleting cache...")
        subprocess.run(["rm", "-rf", cache_dir])
 
    # Download the dataset from Kaggle
    temp = kagglehub.dataset_download(dataset_name)
    print("Downloaded dataset files in temp path:", temp)

    # Move the dataset files into the bronze directory, creating it if it does not exist
    data_dir = f"{DATA_DIRECTORY}/bronze/{dataset_name}"
    if not os.path.exists(data_dir):
        os.makedirs(data_dir)
        print(f"Created directory {data_dir}.")

    for file in os.listdir(temp):
        os.rename(os.path.join(temp, file), os.path.join(data_dir, file))
        print(f"Moved {file} to current working directory in bronze.")

Cache found. Deleting cache...
Downloading from https://www.kaggle.com/api/v1/datasets/download/aaronschlegel/austin-animal-center-shelter-intakes-and-outcomes?dataset_version_number=1...


100%|██████████| 9.25M/9.25M [00:01<00:00, 8.51MB/s]

Extracting files...

Downloaded dataset files in temp path: /Users/mariamckay/.cache/kagglehub/datasets/aaronschlegel/austin-animal-center-shelter-intakes-and-outcomes/versions/1
Moved aac_intakes_outcomes.csv to current working directory.
Moved aac_intakes.csv to current working directory.
Moved aac_outcomes.csv to current working directory.
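
The download loop above shells out to `rm -rf`, which is not available by default on Windows, and moves files with `os.rename`, which can fail when the cache and the project directory sit on different filesystems. A more portable variant could rely on `shutil` instead. This is a sketch under those assumptions, not a change to the notebook's behavior; the helper names `clear_directory` and `move_directory_contents` are hypothetical.

```python
import shutil
from pathlib import Path
from typing import List


def clear_directory(path: str) -> bool:
    """Delete a directory tree if it exists; return True if something was removed."""
    target = Path(path)
    if target.exists():
        shutil.rmtree(target)
        return True
    return False


def move_directory_contents(src_dir: str, dst_dir: str) -> List[str]:
    """Move every entry in src_dir into dst_dir and return the moved names, sorted."""
    destination = Path(dst_dir)
    destination.mkdir(parents=True, exist_ok=True)
    moved = []
    for entry in sorted(Path(src_dir).iterdir()):
        # shutil.move falls back to copy-then-delete when a plain rename is not possible
        shutil.move(str(entry), str(destination / entry.name))
        moved.append(entry.name)
    return moved
```

In the loop, `clear_directory` would stand in for the `subprocess.run(["rm", "-rf", ...])` call, and `move_directory_contents(temp, data_dir)` would stand in for the `os.makedirs` and `os.rename` steps.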
