# Lakehouse Lab - Getting Started (DuckDB 1.3.0)

Welcome to your lakehouse environment! This notebook walks through connecting to MinIO, querying CSV data on S3 with DuckDB 1.3.0, and starting a Spark session.

## What's Available

- **MinIO**: S3-compatible object storage
- **Apache Spark**: Distributed data processing
- **DuckDB 1.3.0**: Fast analytics database with enhanced S3 support
- **Apache Airflow**: Workflow orchestration
- **Apache Superset**: Business intelligence and visualization
- **Portainer**: Container management


In [None]:
# Import the libraries used throughout this notebook
import os
import sys

import boto3
import duckdb
import pandas as pd
from pyspark.sql import SparkSession

print("✅ Lakehouse Lab Environment Ready!")
print(f"📊 DuckDB version: {duckdb.__version__}")  # Expected: 1.3.0
print(f"🐍 Python version: {sys.version}")
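
The version printed above is only checked by eye. As a minimal sketch, the comparison could be automated with a small helper. The names `version_tuple` and `meets_minimum` are hypothetical, and the parsing is deliberately simple: it stops at the first non-numeric component, so suffixes glued to a number such as `1.3.1-dev` are not handled.

```python
def version_tuple(version: str) -> tuple:
    """Turn '1.3.0' into (1, 3, 0), stopping at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "1.3.0") -> bool:
    """Return True when the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


# In the notebook this could be called as: meets_minimum(duckdb.__version__)
print(meets_minimum("1.3.0"))
print(meets_minimum("1.2.2"))
```

Comparing tuples of integers avoids the string-comparison trap where `"1.10.0"` sorts before `"1.3.0"`.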

## Connect to MinIO (S3-Compatible Storage)

In [None]:
# Configure the MinIO connection from environment variables
# (the second argument to os.environ.get is the fallback used when the variable is unset)
s3_client = boto3.client(
    's3',
    endpoint_url='http://minio:9000',
    aws_access_key_id=os.environ.get('MINIO_ROOT_USER', 'minio'),
    aws_secret_access_key=os.environ.get('MINIO_ROOT_PASSWORD', 'minio123')
)

# List buckets
buckets = s3_client.list_buckets()
print("Available buckets:")
for bucket in buckets['Buckets']:
    print(f"  - {bucket['Name']}")
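
Listing buckets confirms connectivity, but it does not show what is inside them. The sketch below lists object keys under a prefix using the boto3 `list_objects_v2` paginator. The helper name `list_keys` is hypothetical; the bucket `lakehouse` and prefix `raw-data/` in the usage comment are taken from the S3 path queried later in this notebook.

```python
def list_keys(client, bucket, prefix=""):
    """Collect every object key under a prefix, following pagination."""
    paginator = client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages for an empty prefix carry no "Contents" entry at all.
        for obj in page.get("Contents", []):
            keys.append(obj["Key"])
    return keys


# Usage in the notebook, assuming s3_client from the cell above:
# for key in list_keys(s3_client, "lakehouse", "raw-data/"):
#     print(key)
```

The paginator matters once a prefix holds more objects than a single response returns; for a handful of sample files a single `list_objects_v2` call would behave the same.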

## Query Data with DuckDB 1.3.0

In [None]:
# Connect to DuckDB
conn = duckdb.connect()

# Configure S3 access for DuckDB with environment variables
minio_user = os.environ.get('MINIO_ROOT_USER', 'minio')
minio_password = os.environ.get('MINIO_ROOT_PASSWORD', 'minio123')

conn.execute("""
    INSTALL httpfs;
    LOAD httpfs;
    SET s3_endpoint='minio:9000';
    SET s3_use_ssl=false;
    SET s3_url_style='path';
""")

conn.execute(f"SET s3_access_key_id='{minio_user}';")
conn.execute(f"SET s3_secret_access_key='{minio_password}';")

print("✅ DuckDB 1.3.0 configured for S3 access!")
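
The cell above interpolates the credentials directly into the `SET` statements, which would produce invalid SQL if a password contained a single quote. One possible way around that, sketched here with the hypothetical helpers `sql_quote` and `s3_set_statements`, is to double any embedded quotes before building the statements. The fallback values mirror the ones used above.

```python
import os


def sql_quote(value: str) -> str:
    """Wrap a value in single quotes, doubling any embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def s3_set_statements(env=os.environ) -> list:
    """Build the two credential SET statements from an environment mapping."""
    user = env.get("MINIO_ROOT_USER", "minio")
    password = env.get("MINIO_ROOT_PASSWORD", "minio123")
    return [
        f"SET s3_access_key_id={sql_quote(user)};",
        f"SET s3_secret_access_key={sql_quote(password)};",
    ]


# Demonstration with made-up credentials; in the notebook each statement
# would be passed to conn.execute(...) instead of being printed.
for stmt in s3_set_statements({"MINIO_ROOT_USER": "lab", "MINIO_ROOT_PASSWORD": "pa'ss"}):
    print(stmt)
```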

In [None]:
# Query sample data
result = conn.execute("""
    SELECT * FROM read_csv_auto('s3://lakehouse/raw-data/sample_orders.csv')
    LIMIT 10
""").fetchdf()

print("Sample data from MinIO:")
display(result)

## Analytics with DuckDB 1.3.0

In [None]:
# Run analytics query
analytics = conn.execute("""
    SELECT 
        product_category,
        COUNT(*) as order_count,
        SUM(total_amount) as total_revenue,
        AVG(total_amount) as avg_order_value
    FROM read_csv_auto('s3://lakehouse/raw-data/sample_orders.csv')
    GROUP BY product_category
    ORDER BY total_revenue DESC
""").fetchdf()

print("Sales by Product Category:")
display(analytics)
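
To make the aggregation concrete, here is a plain-Python sketch of what the `GROUP BY` query computes. The three rows are made up for illustration and are not taken from `sample_orders.csv`; the function name `sales_by_category` is hypothetical.

```python
from collections import defaultdict

# Hypothetical rows standing in for the CSV contents.
rows = [
    {"product_category": "Books", "total_amount": 20.0},
    {"product_category": "Books", "total_amount": 40.0},
    {"product_category": "Toys", "total_amount": 100.0},
]


def sales_by_category(rows):
    """Mirror the query: count, sum, and average per category, highest revenue first."""
    totals = defaultdict(lambda: [0, 0.0])  # category -> [order_count, total_revenue]
    for row in rows:
        entry = totals[row["product_category"]]
        entry[0] += 1
        entry[1] += row["total_amount"]
    summary = [
        {
            "product_category": category,
            "order_count": count,
            "total_revenue": revenue,
            "avg_order_value": revenue / count,
        }
        for category, (count, revenue) in totals.items()
    ]
    return sorted(summary, key=lambda r: r["total_revenue"], reverse=True)


for line in sales_by_category(rows):
    print(line)
```

DuckDB does the same work in a single scan of the file, which is why the SQL version is preferable for anything beyond toy data.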

## Initialize Spark Session

In [None]:
# Create Spark session with environment variables
spark = SparkSession.builder \
    .appName("Lakehouse Lab") \
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000") \
    .config("spark.hadoop.fs.s3a.access.key", os.environ.get('MINIO_ROOT_USER', 'minio')) \
    .config("spark.hadoop.fs.s3a.secret.key", os.environ.get('MINIO_ROOT_PASSWORD', 'minio123')) \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false") \
    .getOrCreate()

print(f"Spark version: {spark.version}")
print("✅ Spark session created successfully!")
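
The S3A settings above can also be kept in one dictionary and applied in a loop, which may be easier to reuse across notebooks. This is only a sketch: `s3a_options` and `apply_options` are hypothetical helper names, while the option keys, endpoint, and fallback credentials are copied from the cell above.

```python
import os


def s3a_options(env=os.environ, endpoint="http://minio:9000"):
    """Return the S3A settings for MinIO as a plain dictionary."""
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": env.get("MINIO_ROOT_USER", "minio"),
        "spark.hadoop.fs.s3a.secret.key": env.get("MINIO_ROOT_PASSWORD", "minio123"),
        "spark.hadoop.fs.s3a.path.style.access": "true",
        "spark.hadoop.fs.s3a.connection.ssl.enabled": "false",
    }


def apply_options(builder, options):
    """Call builder.config(key, value) for each option and return the builder."""
    for key, value in options.items():
        builder = builder.config(key, value)
    return builder


# Usage in the notebook, assuming SparkSession is already imported:
# builder = SparkSession.builder.appName("Lakehouse Lab")
# spark = apply_options(builder, s3a_options()).getOrCreate()
```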

## Next Steps

1. **Explore Superset**: Open http://localhost:9030 to create dashboards
2. **Check Airflow**: Visit http://localhost:9020 to see workflow orchestration
3. **Monitor with Portainer**: Use http://localhost:9060 for container management
4. **Access MinIO Console**: Visit http://localhost:9001 for file management
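
Before opening the links above, it can help to confirm that each port is accepting connections. The sketch below does a plain TCP connect with the standard library. The port numbers come from the list above; the helper names are hypothetical, and the check assumes it is run on the host machine, since `localhost` inside a container would likely point somewhere else.

```python
import socket

# Ports taken from the "Next Steps" list.
SERVICES = {
    "Superset": 9030,
    "Airflow": 9020,
    "Portainer": 9060,
    "MinIO Console": 9001,
}


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(host="localhost"):
    """Map each service name to whether its port is reachable."""
    return {name: port_open(host, port) for name, port in SERVICES.items()}


# Usage:
# for name, ok in check_services().items():
#     print(f"{name}: {'up' if ok else 'not reachable'}")
```

An open port only shows that something is listening; it does not confirm that the service behind it has finished starting.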

## Issues Fixed

✅ **Issue #1**: Superset S3 configuration is now persistent  
✅ **Issue #2**: Airflow DuckDB import errors are resolved  
✅ **Latest packages**: DuckDB 1.3.0 + duckdb-engine 0.17.0  
✅ **Credentials**: Now read from environment variables for security  

Happy data engineering!