# Bronze Data Load

This notebook extracts the source datasets and loads them into the Bronze layer.

The data is stored exactly as received from the source, in its raw state, without any transformation, cleaning, or other modification.

For more information on the Medallion Architecture, see the [Databricks Glossary](https://www.databricks.com/glossary/medallion-architecture) (Databricks, n.d.).

-----

### References  
Databricks. (n.d.). *Medallion Architecture*. Retrieved May 10, 2025, from [https://www.databricks.com/glossary/medallion-architecture](https://www.databricks.com/glossary/medallion-architecture)

In [1]:
%pip install -r ../../requirements.txt

Collecting kagglehub==0.3.12 (from -r ../../requirements.txt (line 1))
  Using cached kagglehub-0.3.12-py3-none-any.whl.metadata (38 kB)
Collecting seaborn==0.13.2 (from -r ../../requirements.txt (line 2))
  Using cached seaborn-0.13.2-py3-none-any.whl.metadata (5.4 kB)
Collecting pandas==2.2.2 (from -r ../../requirements.txt (line 3))
  Using cached pandas-2.2.2-cp311-cp311-macosx_11_0_arm64.whl.metadata (19 kB)
Collecting matplotlib==3.9.2 (from -r ../../requirements.txt (line 4))
  Downloading matplotlib-3.9.2-cp311-cp311-macosx_11_0_arm64.whl.metadata (11 kB)
Collecting contourpy>=1.0.1 (from matplotlib==3.9.2->-r ../../requirements.txt (line 4))
  Using cached contourpy-1.3.2-cp311-cp311-macosx_11_0_arm64.whl.metadata (5.5 kB)
Collecting cycler>=0.10 (from matplotlib==3.9.2->-r ../../requirements.txt (line 4))
  Using cached cycler-0.12.1-py3-none-any.whl.metadata (3.8 kB)
Collecting fonttools>=4.22.0 (from matplotlib==3.9.2->-r ../../requirements.txt (line 4))
  Using cached font

In [None]:
import kagglehub
import os
import subprocess

In [None]:
# Configurations
KAGGLE_DATASET_NAMES = [
    "aaronschlegel/austin-animal-center-shelter-intakes-and-outcomes"
]
DATA_DIRECTORY = "data-assets"

# Set this to True if you modify files and need to force the dataset to be re-downloaded
CLEAR_CACHE = False

Below, we load the datasets into our `data-assets/bronze` folder.

This seeds our **bronze** layer, which holds the raw, unprocessed datasets. Keeping them is valuable because it lets us reprocess the data if our methods change, and it helps with auditing and troubleshooting.
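One way to make the auditing benefit concrete is to record a checksum for every raw file once it has landed in the bronze folder. The following is a minimal sketch and not part of the original notebook: the helper names `sha256_of_file` and `build_manifest` are hypothetical, and it assumes the bronze files are ordinary files on the local disk.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large CSV files are never fully in memory
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(bronze_dir: str) -> dict[str, str]:
    """Map each file under bronze_dir (by relative path) to its SHA-256 digest."""
    root = Path(bronze_dir)
    return {
        path.relative_to(root).as_posix(): sha256_of_file(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }
```

After the download cell has run, calling `build_manifest("data-assets/bronze")` would give a dictionary that can be saved and compared later to confirm the raw files have not changed.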

In [None]:
# Download the latest version of each dataset from Kaggle
for dataset_name in KAGGLE_DATASET_NAMES:

    # Clear the kagglehub download cache if requested, to force a fresh download.
    # This helps ensure we are getting consistent results across different machines.
    cache_dir = os.path.join(os.path.expanduser("~"), ".cache/kagglehub/datasets", dataset_name)
    if CLEAR_CACHE and os.path.exists(cache_dir):
        print("Cache found. Deleting cache...")
        subprocess.run(["rm", "-rf", cache_dir])

    # Download the dataset from Kaggle (use the loop variable, not a hard-coded name)
    temp = kagglehub.dataset_download(dataset_name)
    print("Downloaded dataset files in temp path:", temp)

    # Move the dataset files into the bronze directory. Create the directory if it does not exist
    data_dir = f"{DATA_DIRECTORY}/bronze/{dataset_name}"
    if not os.path.exists(data_dir):
        os.makedirs(data_dir)
        print(f"Created directory {data_dir}.")

    for file in os.listdir(temp):
        # data_dir is already relative to the current working directory
        os.rename(os.path.join(temp, file), os.path.join(data_dir, file))
        print(f"Moved {file} to current working directory in bronze.")

Cache found. Deleting cache...
Downloading from https://www.kaggle.com/api/v1/datasets/download/aaronschlegel/austin-animal-center-shelter-intakes-and-outcomes?dataset_version_number=1...


100%|██████████| 9.25M/9.25M [00:01<00:00, 8.51MB/s]

Extracting files...

Downloaded dataset files in temp path: /Users/mariamckay/.cache/kagglehub/datasets/aaronschlegel/austin-animal-center-shelter-intakes-and-outcomes/versions/1
Moved aac_intakes_outcomes.csv to current working directory.
Moved aac_intakes.csv to current working directory.
Moved aac_outcomes.csv to current working directory.
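
The download cell above shells out to `rm -rf` and uses `os.rename`, which ties it to Unix-like systems and can fail when the cache and the project live on different filesystems. As a hedged alternative, the same two steps can be sketched with the standard library's `shutil` module. The helper names `clear_cache` and `move_into` are hypothetical and are not part of the original notebook.

```python
import shutil
from pathlib import Path


def clear_cache(cache_dir: str) -> bool:
    """Delete cache_dir if it exists; return True when something was removed."""
    path = Path(cache_dir)
    if path.is_dir():
        shutil.rmtree(path)
        return True
    return False


def move_into(src_dir: str, dest_dir: str) -> list[str]:
    """Move every entry in src_dir into dest_dir and return the moved names."""
    dest = Path(dest_dir)
    # Create the destination (and any missing parents) if needed
    dest.mkdir(parents=True, exist_ok=True)
    moved = []
    for entry in sorted(Path(src_dir).iterdir()):
        # shutil.move falls back to copy-and-delete across filesystems
        shutil.move(str(entry), str(dest / entry.name))
        moved.append(entry.name)
    return moved
```

In the loop above, `move_into(temp, data_dir)` could stand in for the `os.makedirs` and `os.rename` steps, and `clear_cache` for the `subprocess.run` call, assuming the same cache path is passed in.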
