# Bronze Layer - Raw Data Ingestion

This notebook demonstrates loading raw Parquet files into Delta Lake tables.

**Key Concepts:**
- Delta Lake table creation
- Reading the schema embedded in Parquet files
- File access on Databricks storage (Volumes, DBFS, workspace files)

---

**Prerequisites:**
- Parquet files generated in `databricks/data/` (run `python generate_synthetic_data.py`)
- Those files uploaded to Databricks storage (DBFS, Volumes, or workspace)
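
Before uploading, it can help to confirm locally that every expected file was generated. The following is a minimal sketch, assuming one `<table>.parquet` file per table in `databricks/data/` (the naming the load cell later in this notebook relies on); `missing_parquet_files` is a hypothetical helper, not part of the project.

```python
from pathlib import Path

# Table names taken from the load step later in this notebook.
EXPECTED_TABLES = [
    "conversations",
    "messages",
    "lessons",
    "lesson_progress",
    "assessments",
    "assessment_attempts",
    "user_levels",
]


def missing_parquet_files(data_dir, tables=EXPECTED_TABLES):
    """Return the expected '<table>.parquet' names that are absent from data_dir."""
    data_dir = Path(data_dir)
    return [
        f"{table}.parquet"
        for table in tables
        if not (data_dir / f"{table}.parquet").exists()
    ]


# Assumed local output folder; adjust if your generator writes elsewhere.
missing = missing_parquet_files("databricks/data")
print("Missing files:", missing or "none")
```

Run this on your local machine, not in Databricks; an empty result means all seven files are ready to upload.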

## Setup: Define paths and database

In [None]:
# Configuration
# For Free Edition: Use Volumes or upload files directly to workspace
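# Try these paths in order (the first accessible one is used).
# Note: Unity Catalog volume paths have the form /Volumes/<catalog>/<schema>/<volume>,
# so adjust the first entry to match your own catalog, schema, and volume names.
POSSIBLE_PATHS = [
    "/Volumes/babblr/bronze",  # Volumes (Free Edition compatible)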
    "/FileStore/babblr/bronze",  # DBFS FileStore (may be restricted in Free Edition)
    "/Workspace/babblr/bronze",  # Workspace files (alternative)
]

DATABASE_NAME = "babblr_bronze"

# Detect which path is available
BRONZE_PATH = None
for path in POSSIBLE_PATHS:
    try:
        dbutils.fs.ls(path)
        BRONZE_PATH = path
        print(f"[OK] Found accessible path: {BRONZE_PATH}")
        break
    except Exception:
        continue

if BRONZE_PATH is None:
    print("[WARNING] No accessible storage path found. For Free Edition:")
    print("   1. Upload files using 'Upload Data' in the workspace")
    print("   2. Or use Volumes: Create a Volume at /Volumes/babblr/bronze")
    print("   3. Or copy files that are already in Databricks storage using:")
    print("      dbutils.fs.cp('<source_path>', '<target_path>')")
    print("      (dbutils.fs.put writes text only, so it cannot upload Parquet files)")
    print("\n   Then update BRONZE_PATH above to match your upload location.")
    BRONZE_PATH = "/tmp/babblr/bronze"  # Fallback to temp location
    print(f"\n   Using fallback path: {BRONZE_PATH}")

# Create database if not exists
spark.sql(f"CREATE DATABASE IF NOT EXISTS {DATABASE_NAME}")
spark.sql(f"USE {DATABASE_NAME}")

print(f"Using database: {DATABASE_NAME}")
print(f"Using bronze path: {BRONZE_PATH}")
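
The first candidate path above has only two components after `/Volumes`. Unity Catalog volume paths take the form `/Volumes/<catalog>/<schema>/<volume>/...`, so that entry may need adjusting for your workspace. A minimal sketch of a shape check follows; `is_volume_path` is a hypothetical helper, not a Databricks API, and the `main` catalog in the example is an assumed name.

```python
def is_volume_path(path: str) -> bool:
    """Return True if path looks like /Volumes/<catalog>/<schema>/<volume>[/...]."""
    parts = [p for p in path.split("/") if p]
    return len(parts) >= 4 and parts[0] == "Volumes"


# The path used in this notebook lacks one component.
print(is_volume_path("/Volumes/babblr/bronze"))

# "main" is an assumed catalog name, used here only for illustration.
print(is_volume_path("/Volumes/main/babblr/bronze"))
```

This checks only the shape of the path; whether the volume exists and is readable is still decided by the `dbutils.fs.ls` call in the detection loop.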

## List available data files

In [None]:
# Check what files are available
try:
    files = dbutils.fs.ls(BRONZE_PATH)
    print("Available files in bronze layer:")
    for f in files:
        print(f"  - {f.name} ({f.size / 1024:.1f} KB)")
except Exception as e:
    print(f"Error: {e}")
    print(f"\nPlease upload Parquet files to {BRONZE_PATH}")
    print("Run generate_synthetic_data.py locally first, then upload the data/ folder")

## Load tables into Delta format

Delta Lake provides:
- ACID transactions
- Time travel (versioning)
- Schema enforcement

In [None]:
def load_to_delta(table_name: str) -> int:
    """Load a Parquet file into a Delta table and return the row count (0 if skipped)."""
    parquet_path = f"{BRONZE_PATH}/{table_name}.parquet"

    try:
        # Read Parquet; the schema comes from the file's own metadata
        df = spark.read.parquet(parquet_path)

        # Write as Delta table (overwrite for demo purposes)
        df.write.format("delta").mode("overwrite").saveAsTable(table_name)

        row_count = spark.table(table_name).count()
        print(f"[OK] {table_name}: {row_count} rows loaded")
        return row_count
    except Exception as e:
        print(f"[SKIP] {table_name}: {e}")
        return 0

In [None]:
# Load all tables
tables = [
    "conversations",
    "messages",
    "lessons",
    "lesson_progress",
    "assessments",
    "assessment_attempts",
    "user_levels"
]

total_rows = 0
for table in tables:
    total_rows += load_to_delta(table)

print(f"\nTotal: {total_rows} rows loaded into Delta tables")
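
Because `load_to_delta` returns 0 both for an empty table and for a failed load, the total above cannot show which tables were skipped. The sketch below records failures separately. It assumes a loader that raises on failure instead of returning 0; `load_all` and `fake_loader` are hypothetical names, and the fake loader stands in for Spark so the example runs anywhere.

```python
from typing import Callable, Dict, Iterable, Optional


def load_all(
    tables: Iterable[str], loader: Callable[[str], int]
) -> Dict[str, Optional[int]]:
    """Call loader for each table; store the row count, or None if it raised."""
    results: Dict[str, Optional[int]] = {}
    for name in tables:
        try:
            results[name] = loader(name)
        except Exception as exc:
            print(f"[FAIL] {name}: {exc}")
            results[name] = None
    return results


def fake_loader(name: str) -> int:
    """Stand-in loader for illustration: fails for one table, loads 10 rows otherwise."""
    if name == "lessons":
        raise FileNotFoundError(f"{name}.parquet not found")
    return 10


results = load_all(["conversations", "lessons"], fake_loader)
failed = [table for table, rows in results.items() if rows is None]
loaded_rows = sum(rows for rows in results.values() if rows is not None)
print(f"Loaded {loaded_rows} rows; failed tables: {failed}")
```

Keeping `None` distinct from `0` lets a later cell stop the pipeline when any table is missing, rather than continuing silently with partial data.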

## Verify Delta tables

In [None]:
%%sql
-- Show all tables in the bronze database
SHOW TABLES

In [None]:
%%sql
-- Quick preview of conversations table
SELECT * FROM conversations LIMIT 5

In [None]:
%%sql
-- Check data distribution by language
SELECT
    language,
    COUNT(*) as conversation_count,
    COUNT(DISTINCT user_id) as unique_users
FROM conversations
GROUP BY language
ORDER BY conversation_count DESC

## Delta Lake Feature: Table History

Delta Lake records every write to a table as a new numbered version in its transaction log.
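
Each recorded version can also be queried directly with the `VERSION AS OF` clause. The sketch below builds such a query; `version_query` is a hypothetical helper, and the Spark call is guarded so the cell still runs outside a Databricks notebook, where `spark` is not predefined.

```python
def version_query(table: str, version: int) -> str:
    """Build a Delta time-travel query for a given table version."""
    if version < 0:
        raise ValueError("version must be non-negative")
    return f"SELECT * FROM {table} VERSION AS OF {version}"


query = version_query("conversations", 0)
print(query)

# `spark` is predefined in Databricks notebooks; skip the query elsewhere.
if "spark" in globals():
    spark.sql(query).show(5)
```

Version 0 is the first write to the table; later version numbers appear in the `DESCRIBE HISTORY` output.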

In [None]:
%%sql
-- View table history (time travel metadata)
DESCRIBE HISTORY conversations

## Delta Lake Feature: Schema Information

In [None]:
%%sql
-- View schema of a table
DESCRIBE TABLE EXTENDED conversations

## Summary

In this notebook we:
1. Created a Bronze database for raw data
2. Loaded Parquet files into Delta Lake tables
3. Verified data with basic queries
4. Demonstrated Delta Lake features (history, schema)

**Next:** Run `02_silver_layer` to clean and transform the data.