# Ames Housing Dataset Ingestion to Unity Catalog (AWS)

This notebook demonstrates a best-practice workflow for copying the Ames Housing dataset from a local folder (`../fixtures/datasets/`) into Databricks on AWS, using Unity Catalog for secure, governed data management. You will learn how to copy the dataset files to a Unity Catalog volume and prepare them for analysis.

## What is Unity Catalog?
Unity Catalog is Databricks' unified governance solution for all data assets, providing fine-grained access control, auditing, and lineage across workspaces and clouds.

## Prerequisites
- You have a Databricks workspace on AWS with Unity Catalog enabled.
- You have `USE CATALOG` on the target catalog, `USE SCHEMA` on the target schema, and `WRITE VOLUME` on the target volume.
- The Ames Housing dataset files are available in the `../fixtures/datasets/` directory relative to the notebook.

## Workflow Steps
1. Set up widgets for selecting the catalog, schema, and volume.
2. Retrieve the values from the widgets, including the full volume path.
3. List the files in the local dataset directory (`../fixtures/datasets/`).
4. Display the files that will be copied to the Unity Catalog volume.
5. Copy the dataset files to the specified Unity Catalog volume.
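
Unity Catalog volume paths take the form `/Volumes/<catalog>/<schema>/<volume>`. The sketch below derives that path from the three names, so the catalog, schema, and volume path cannot drift apart. `build_volume_path` is a hypothetical helper, not part of this notebook. Its name check (letters, digits, underscores) is a deliberately conservative assumption, not Unity Catalog's actual naming rule.

```python
import re

# Assumption for this sketch: accept only letters, digits, and underscores.
# This is stricter than necessary and is not Unity Catalog's own rule.
_SAFE_NAME = re.compile(r"[A-Za-z0-9_]+")


def build_volume_path(catalog, schema, volume):
    """Return the /Volumes/<catalog>/<schema>/<volume> path for a volume."""
    for label, value in (("catalog", catalog), ("schema", schema), ("volume", volume)):
        if not _SAFE_NAME.fullmatch(value):
            raise ValueError(f"Invalid {label} name: {value!r}")
    return f"/Volumes/{catalog}/{schema}/{volume}"


print(build_volume_path("main", "default", "landing"))
# prints /Volumes/main/default/landing
```

With the widget defaults used in this notebook (`main`, `default`, and a volume named `landing`), the result matches the default value of the volume path widget.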

In [0]:
"""
Step 1: Set up widgets for Unity Catalog location selection
These widgets allow users to specify the catalog, schema, and volume path for storing the dataset.
"""
# Create widgets for selecting the catalog, schema, and volume path

# Catalog selection (default: 'main')
dbutils.widgets.text("catalog_use", "main", "Catalog")

# Schema selection (default: 'default')
dbutils.widgets.text("schema_use", "default", "Schema")

# Volume path selection (default: '/Volumes/main/default/landing')
dbutils.widgets.text(
    "volume_path_use",
    "/Volumes/main/default/landing",
    "Volume Path"
)

In [0]:
"""
Step 2: Retrieve widget values for catalog, schema, and volume path.
This section gets the user selections from the widgets defined above.
"""
# Read the values entered in the widgets
catalog_use = dbutils.widgets.get("catalog_use")
schema_use = dbutils.widgets.get("schema_use")
# Full /Volumes/... path that the files will be written to
volumes_path_use = dbutils.widgets.get("volume_path_use")

In [0]:
"""
Step 3: Set the source path and list the contents of the local dataset directory.
This step verifies that the source directory exists before listing it.
The directory must match the `../fixtures/datasets/` location described above.
"""
import os

# Relative paths resolve against the current working directory
source_path = os.path.abspath(os.path.join('..', 'fixtures', 'datasets'))

if not os.path.isdir(source_path):
    raise FileNotFoundError(f"Source directory not found: {source_path}")

print(f"Source dataset directory: {source_path}")
file_list = sorted(os.listdir(source_path))

In [0]:
"""
Step 4: Display the files found in the local dataset directory.
This step builds a DataFrame of file names and their full paths for easier review.
"""
import pandas as pd

# Re-list the directory so the table reflects its current contents
file_list = sorted(os.listdir(source_path))

# Keep regular files only; subdirectories are skipped
file_info = [
    {"file_name": file_name, "full_path": os.path.join(source_path, file_name)}
    for file_name in file_list
    if os.path.isfile(os.path.join(source_path, file_name))
]

# Display the file information as a DataFrame
display(pd.DataFrame(file_info))

In [0]:
"""
Step 5: Copy dataset files from the local source directory to the Unity Catalog volume.
Each file is written to the volume through its /Volumes/... path, so access is governed by Unity Catalog.
Uses shutil.copyfile(), which copies file contents only and leaves the source files in place.
"""
import shutil

# Copy each file in file_info to the Unity Catalog volume
for file in file_info:
    local_path = file["full_path"]  # Full local path to the file
    volume_path = f"{volumes_path_use.rstrip('/')}/{file['file_name']}"  # Destination path inside the volume
    try:
        shutil.copyfile(local_path, volume_path)
        print(f"Successfully copied {local_path} to {volume_path}")
    except OSError as e:
        # Report the failure and continue with the remaining files
        print(f"Failed to copy {local_path} to {volume_path}. Error: {e}")
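
Step 5 can also be written as a reusable function that leaves the source files in place and checks each destination file's size after writing. The following is a minimal sketch, not part of the original notebook: `copy_and_verify` and its return shape are hypothetical, and a size comparison is only a basic integrity check that assumes the destination reports file sizes accurately.

```python
import shutil
from pathlib import Path


def copy_and_verify(source_dir, dest_dir):
    """Copy every regular file in source_dir to dest_dir and compare sizes.

    Hypothetical helper: returns (copied, failed), where copied is a list of
    file names and failed maps a file name to the reason it did not succeed.
    """
    source_dir, dest_dir = Path(source_dir), Path(dest_dir)
    copied, failed = [], {}
    for src in sorted(p for p in source_dir.iterdir() if p.is_file()):
        dst = dest_dir / src.name
        try:
            shutil.copyfile(src, dst)  # contents only; the source is left untouched
        except OSError as exc:
            failed[src.name] = str(exc)
            continue
        if dst.stat().st_size != src.stat().st_size:
            failed[src.name] = "size mismatch after copy"
        else:
            copied.append(src.name)
    return copied, failed
```

In this notebook it could be called as `copied, failed = copy_and_verify(source_path, volumes_path_use)`, after which a non-empty `failed` signals that the ingestion needs attention.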