# Ames Housing Dataset Ingestion to Unity Catalog (AWS)

This notebook demonstrates a best-practice workflow for copying the Ames Housing dataset from a local folder (`../fixtures/datasets/`) into Databricks on AWS, using Unity Catalog for secure, governed data management. You will learn how to copy the dataset files to a Unity Catalog volume and prepare them for analysis.

## What is Unity Catalog?
Unity Catalog is Databricks' unified governance solution for all data assets, providing fine-grained access control, auditing, and lineage across workspaces and clouds.

## Prerequisites
- You have a Databricks workspace on AWS with Unity Catalog enabled.
- You have access to the target catalog, schema, and volume.
- The Ames Housing dataset files are available in the `../fixtures/datasets/` directory relative to the notebook.

## Workflow Steps
1. Set up widgets for catalog, schema, and volume selection.
2. Retrieve the widget values for catalog, schema, and volume path.
3. List files in the local dataset directory.
4. Display the files to be copied.
5. Copy the dataset files to the specified Unity Catalog volume.
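
Unity Catalog volume paths follow the pattern `/Volumes/<catalog>/<schema>/<volume>`. The notebook reads the full path from a single widget, but the same path could be derived from the catalog and schema values. The following is a minimal sketch of that idea; the helper name `build_volume_path` and the separate `volume` argument are assumptions for illustration and do not appear in the notebook.

```python
from pathlib import PurePosixPath


def build_volume_path(catalog: str, schema: str, volume: str) -> str:
    """Return the root path of a Unity Catalog volume (hypothetical helper)."""
    parts = [catalog, schema, volume]
    for part in parts:
        # Reject empty names and names that would add extra path segments
        if not part or "/" in part:
            raise ValueError(f"Invalid name component: {part!r}")
    return str(PurePosixPath("/Volumes", *parts))


# Matches the default value of the volume path widget
print(build_volume_path("main", "default", "landing"))  # /Volumes/main/default/landing
```

Deriving the path this way keeps the three widgets consistent with each other, at the cost of requiring a separate widget or argument for the volume name.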

In [0]:
"""
Step 1: Set up widgets for Unity Catalog location selection
Widgets allow users to specify the catalog, schema, and volume path for storing the dataset.
"""
# Create widgets for catalog, schema, and volume path selection

# Catalog selection (default: 'main')
dbutils.widgets.text("catalog_use", "main", "Catalog")

# Schema selection (default: 'default')
dbutils.widgets.text("schema_use", "default", "Schema")

# Volume path selection (default: '/Volumes/main/default/landing')
dbutils.widgets.text(
    "volume_path_use",
    "/Volumes/main/default/landing",
    "Volume Path"
)

In [0]:
"""
Step 2: Retrieve widget values for catalog, schema, and volume path
"""
# Retrieve the value selected by the user for the catalog
catalog_use = dbutils.widgets.get("catalog_use")
# Retrieve the value selected by the user for the schema
schema_use = dbutils.widgets.get("schema_use")
# Retrieve the value selected by the user for the volume path
volumes_path_use = dbutils.widgets.get("volume_path_use")

In [0]:
"""
Step 3: Set source path and list files in the local dataset directory
"""
import os

source_path = os.path.abspath(os.path.join('..', 'fixtures', 'datasets'))

if not os.path.isdir(source_path):
    raise FileNotFoundError(f"Source directory not found: {source_path}")

print(f"Source dataset directory: {source_path}")
file_list = os.listdir(source_path)

In [0]:
"""
Step 4: List and display files found in the specified local dataset directory
"""
import os
import pandas as pd

# List regular files in the local dataset directory, skipping subdirectories
file_list = sorted(
    f for f in os.listdir(source_path)
    if os.path.isfile(os.path.join(source_path, f))
)
file_info = [
    {"file_name": f, "full_path": os.path.join(source_path, f)}  # File name plus absolute local path
    for f in file_list
]

# Display file information as a DataFrame for better visualization
display(pd.DataFrame(file_info))

In [0]:
"""
Step 5: Copy dataset files from source directory to Unity Catalog volume
Each file is copied individually, so a failure on one file does not stop the remaining copies.
"""
# Reuse file_info from Step 4 and volumes_path_use from Step 2
for file in file_info:
    local_path = file["full_path"]  # Absolute path on the driver's local filesystem
    volume_path = f"{volumes_path_use.rstrip('/')}/{file['file_name']}"  # Destination path in the volume
    try:
        dbutils.fs.cp(f"file:{local_path}", volume_path)  # The file: prefix marks a local source
        print(f"Successfully copied {local_path} to {volume_path}")
    except Exception as e:
        print(f"Failed to copy {local_path} to {volume_path}. Error: {e}")
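
The copy loop above reports each outcome with `print`, which makes failures easy to miss. The following is a minimal sketch of an alternative that returns the results instead. It assumes the volume path can be reached through ordinary Python file APIs on the cluster, and the helper name `copy_files` is hypothetical, not part of the notebook or of `dbutils`.

```python
import shutil
from pathlib import Path


def copy_files(source_dir, target_dir):
    """Copy every regular file in source_dir into target_dir.

    Returns (copied, failed): a list of copied file names and a dict
    mapping each failed file name to its error message.
    """
    source = Path(source_dir)
    target = Path(target_dir)
    copied, failed = [], {}
    for src in sorted(source.iterdir()):
        if not src.is_file():
            continue  # Skip subdirectories
        try:
            # copyfile copies contents only, without permission bits
            shutil.copyfile(src, target / src.name)
            copied.append(src.name)
        except OSError as exc:
            failed[src.name] = str(exc)
    return copied, failed
```

In the notebook this could be called as `copied, failed = copy_files(source_path, volumes_path_use)`, after which a non-empty `failed` dictionary can be used to raise an error or retry.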