# Module 1: Databricks Lakehouse Platform Fundamentals 
## Laboratory Exercises

Welcome to the hands-on laboratory exercises for Module 1! This notebook contains all the practical exercises referenced in the student materials.

### Prerequisites
- Access to a Databricks workspace
- Basic familiarity with Python
- Approximately 2.5 hours for completion

### Lab Structure
1. **Lab 1**: Exploring Your Databricks Workspace (30 minutes)
2. **Lab 2**: Creating and Managing Compute Resources (45 minutes)
3. **Lab 3**: Working with DBFS and Delta Lake (45 minutes)
4. **Lab 4**: Collaboration and Development Workflows (30 minutes)

### Instructions
- Complete the labs in order, because each one builds on the previous one
- Run each cell sequentially within a section
- Take time to understand the output before proceeding
- Document any issues or questions for the Wednesday follow-up session

---
## Lab 1: Exploring Your Databricks Workspace (30 minutes)

### Objectives
- Navigate the Databricks interface
- Create your personal workspace structure
- Understand notebooks and their capabilities
- Set up your learning environment

### Note
Some exercises in this lab require UI interaction. Follow the instructions and use the code cells where provided.

### Exercise 1.1: Workspace Setup

🟢 **Required Steps:**
1. Log into your Databricks workspace
2. Navigate to your user folder under Workspace > Users > [your-email]
3. Create a folder structure:
   - Right-click your user folder
   - Create folder: `Databricks_Course`
   - Inside that, create: `Module_01_Fundamentals`
4. Upload this notebook to the Module_01_Fundamentals folder

Once complete, review the following cell. You will run it to verify your environment after attaching this notebook to a cluster in Lab 2:

In [0]:
# Verify Databricks environment
# Note: this cell runs only after the notebook is attached to a cluster (Lab 2)
import sys
from datetime import datetime

print("Hello, Databricks!")
print(f"Current date: {datetime.now()}")
print(f"Python version: {sys.version}")

### Exercise 1.2: Understanding Notebook Features

Notebooks support multiple languages and cell types. Let's explore:

In [0]:
# Python cell - Default language
welcome_message = "Welcome to Databricks!"
print(welcome_message)

# Create a simple function
def greet_user(name):
    return f"Hello {name}, welcome to the Lakehouse journey!"

print(greet_user("Data Engineer"))

In [0]:
%sql
-- SQL cell - Using magic command
-- This will work after cluster attachment
SELECT 'SQL is also available in notebooks!' as message

In [0]:
%scala
// Scala cell - Another language option
println("Scala is available too!")
val courseModules = List("Fundamentals", "ETL", "Delta Lake", "Production", "Governance")
courseModules.foreach(println)

### Exercise 1.3: Exploring Databricks Utilities

Databricks provides utilities (dbutils) for file system operations, secrets, and more:

In [0]:
# Explore available utilities
# Note: Requires cluster attachment
dbutils.help()

In [0]:
# Get help on specific utility
dbutils.fs.help()

### 💡 Lab 1 Checkpoint

By now you should have:
- ✅ Successfully logged into Databricks
- ✅ Created your course folder structure
- ✅ Uploaded and opened this notebook
- ✅ Understood basic notebook operations

**Note**: Code execution requires a cluster, which we'll create in Lab 2.

---
## Lab 2: Creating and Managing Compute Resources (45 minutes)

### Objectives
- Understand different types of compute resources
- Create and configure your first cluster
- Learn cost optimization strategies
- Practice cluster management operations

### Exercise 2.1: Create Your Learning Cluster

🟢 **Required Steps:**

1. Navigate to **Compute** in the left sidebar
2. Click **Create Cluster**
3. Configure with these settings:
   - **Cluster Name**: `learning-[yourname]-module01`
   - **Cluster Mode**: Standard
   - **Databricks Runtime**: Latest LTS version (e.g., 14.3 LTS)
   - **Worker Type**: 
     - AWS: `i3.xlarge`
     - Azure: `Standard_DS3_v2`
     - GCP: `n1-standard-4`
   - **Workers**: Min 1, Max 2
   - **Enable autoscaling**: ✓
   - **Auto termination**: 120 minutes
   - **Enable spot instances**: ✓ (for cost savings)
4. Click **Create Cluster**
5. Wait for the cluster to start (5-7 minutes)

While waiting, review the configuration options and their implications.
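
The settings from step 3 can also be written down as a configuration dictionary, which is handy for reviewing or comparing cluster definitions. The sketch below is plain Python and runs without a cluster. The helper name `learning_cluster_config` is hypothetical, the field names follow the Databricks Clusters API as commonly documented, and the `14.3.x-scala2.12` runtime key is an assumed example, so confirm all of them against the JSON view in your own workspace.

```python
import json

# Worker types listed in Exercise 2.1, keyed by cloud provider
WORKER_TYPES = {
    "aws": "i3.xlarge",
    "azure": "Standard_DS3_v2",
    "gcp": "n1-standard-4",
}


def learning_cluster_config(your_name: str, cloud: str) -> dict:
    """Return the Exercise 2.1 settings as a dict (hypothetical helper)."""
    if cloud not in WORKER_TYPES:
        raise ValueError(f"Unknown cloud {cloud!r}; expected one of {sorted(WORKER_TYPES)}")
    return {
        "cluster_name": f"learning-{your_name}-module01",
        # Assumed runtime key for 14.3 LTS; use the value your workspace lists
        "spark_version": "14.3.x-scala2.12",
        "node_type_id": WORKER_TYPES[cloud],
        "autoscale": {"min_workers": 1, "max_workers": 2},
        "autotermination_minutes": 120,
        # Spot/preemptible settings are cloud-specific and omitted from this sketch
    }


config = learning_cluster_config("alex", "aws")
print(json.dumps(config, indent=2))
```

Keeping the worker type in a lookup table makes it easy to see which single setting changes between clouds while the rest of the configuration stays the same.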

### Exercise 2.2: Attach Notebook to Cluster

Once your cluster is running:
1. At the top of this notebook, click the **Detached** dropdown
2. Select your newly created cluster
3. Wait for "Connected" status
4. Run the following verification cells:

In [0]:
# Verify cluster attachment and Spark session
print(f"Spark version: {spark.version}")
print(f"Python version: {sc.pythonVer}")
print(f"Cluster: {spark.conf.get('spark.databricks.clusterUsageTags.clusterName')}")

### Exercise 2.3: Understanding Cluster Metrics

Let's run some basic operations to see cluster metrics:

In [0]:
# Create a simple DataFrame to test the cluster
from pyspark.sql.functions import rand

# Generate sample data on the cluster instead of building a million Row objects on the driver
df = spark.range(1000000).withColumn("value", rand())

df.createOrReplaceTempView("test_data")
# Perform operations
print(f"Total records: {df.count():,}")
num_partitions = spark.sql("SELECT spark_partition_id() as partition_id FROM test_data").distinct().count()
print(f"Number of partitions: {num_partitions}")

# Simple aggregation
result = df.agg({"value": "avg"}).collect()[0][0]
print(f"Average value: {result:.4f}")

In [0]:
# Show a reminder about the Spark UI (this cell only renders HTML; it does not open the UI)
displayHTML("""
<h4>Cluster Monitoring</h4>
<p>Click the <b>Spark UI</b> tab at the top of this notebook to see:</p>
<ul>
  <li>Job execution details</li>
  <li>Stage information</li>
  <li>Executor status</li>
  <li>Environment configurations</li>
</ul>
<p>This is invaluable for performance tuning!</p>
""")

### Exercise 2.4: Cluster Management Operations

🟡 **Practice these operations from the Compute UI:**

1. **View Event Log**: Check cluster startup events
2. **Metrics**: Monitor CPU, memory usage
3. **Configuration**: Review and understand settings
4. **Clone**: Create a copy of cluster config

🔴 **Important**: Don't terminate the cluster yet - we need it for remaining labs!

### 💡 Lab 2 Checkpoint

You should now have:
- ✅ Successfully created a cluster
- ✅ Attached this notebook to the cluster
- ✅ Verified Spark is running
- ✅ Understood basic cluster operations
- ✅ Explored the Spark UI

---
## Lab 3: Working with DBFS and Delta Lake (45 minutes)

### Objectives
- Explore DBFS and understand its relationship to cloud storage
- Read various file formats using Spark
- Create your first Delta table
- Practice time travel queries
- Understand transaction logs

### Exercise 3.1: Exploring DBFS

In [0]:
# List DBFS root contents
display(dbutils.fs.ls("/"))

In [0]:
# Explore sample datasets
display(dbutils.fs.ls("/databricks-datasets"))

In [0]:
# Look at retail dataset structure
display(dbutils.fs.ls("/databricks-datasets/retail-org"))

In [0]:
# Preview the first 1,000 bytes of the customers file
print(dbutils.fs.head("/databricks-datasets/retail-org/customers/customers.csv", 1000))
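
The same DBFS location can be written in more than one way: Spark and `dbutils` accept the `dbfs:/...` form (or a bare `/...` path), while Python's local file APIs on the driver use a `/dbfs/...` prefix on clusters where that mount is exposed. The sketch below converts between the two forms in plain Python. Both helper names are hypothetical, and the `/dbfs/` mount is an assumption that does not hold for every cluster access mode.

```python
def to_dbfs_uri(path: str) -> str:
    """Normalize a DBFS path to the dbfs:/ URI form (hypothetical helper)."""
    if path.startswith("dbfs:/"):
        return "dbfs:/" + path[len("dbfs:/"):].lstrip("/")
    if path.startswith("/dbfs/"):
        # Treat a leading /dbfs/ as the local mount prefix, not a DBFS folder name
        return "dbfs:/" + path[len("/dbfs/"):]
    return "dbfs:/" + path.lstrip("/")


def to_local_path(path: str) -> str:
    """Convert a DBFS path to the /dbfs/ local-file form (hypothetical helper)."""
    return "/dbfs/" + to_dbfs_uri(path)[len("dbfs:/"):]


print(to_dbfs_uri("/databricks-datasets/retail-org"))
print(to_local_path("dbfs:/tmp/format_test.parquet"))
```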

### Exercise 3.2: Reading Different File Formats

In [0]:
# Read CSV file
customers_csv = spark.read.csv(
    "/databricks-datasets/retail-org/customers/customers.csv",
    header=True,
    inferSchema=True
)

print(f"CSV Records: {customers_csv.count():,}")
customers_csv.printSchema()
display(customers_csv.limit(5))

In [0]:
# Read JSON file (if available)
try:
    events_json = spark.read.json("/databricks-datasets/structured-streaming/events")
    print(f"JSON Records: {events_json.count():,}")
    display(events_json.limit(5))
except Exception as e:
    print(f"JSON dataset not available ({type(e).__name__}), creating sample...")
    # Create sample JSON data
    json_data = spark.range(100).selectExpr(
        "id",
        "rand() * 100 as value",
        "current_timestamp() as timestamp"
    )
    display(json_data.limit(5))

In [0]:
# Compare file format performance
import time

# Write same data in different formats
test_data = spark.range(1000000).selectExpr("id", "rand() * 1000 as value")

# CSV write/read
csv_path = "/tmp/format_test.csv"
start = time.time()
test_data.write.mode("overwrite").csv(csv_path)
csv_write_time = time.time() - start

start = time.time()
spark.read.csv(csv_path).count()
csv_read_time = time.time() - start

# Parquet write/read
parquet_path = "/tmp/format_test.parquet"
start = time.time()
test_data.write.mode("overwrite").parquet(parquet_path)
parquet_write_time = time.time() - start

start = time.time()
spark.read.parquet(parquet_path).count()
parquet_read_time = time.time() - start

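print("Format Performance Comparison:")
print(f"CSV - Write: {csv_write_time:.2f}s, Read: {csv_read_time:.2f}s")
print(f"Parquet - Write: {parquet_write_time:.2f}s, Read: {parquet_read_time:.2f}s")
# Parquet is usually faster here, partly because count() can rely on file metadata;
# timings vary between runs, so report the measured ratio rather than assuming a winner
print(f"\nCSV read time / Parquet read time: {csv_read_time/parquet_read_time:.1f}x")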
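
The comparison above repeats the same start/stop timing pattern four times. One way to reduce that repetition is a small timing helper, sketched below in plain Python. The name `time_call` is hypothetical, and the workload is a stand-in so the example runs without Spark.

```python
import time
from typing import Callable, Dict


def time_call(label: str, fn: Callable[[], object], timings: Dict[str, float]) -> object:
    """Run fn(), store its wall-clock duration in timings[label], and return its result."""
    start = time.perf_counter()
    result = fn()
    timings[label] = time.perf_counter() - start
    return result


timings: Dict[str, float] = {}
# Stand-in workload; in the notebook this could be
# lambda: spark.read.parquet(parquet_path).count()
total = time_call("sum_range", lambda: sum(range(100_000)), timings)
print(f"sum_range: {timings['sum_range']:.4f}s")
```

Collecting the durations in one dictionary also makes it simple to print or compare all measurements at the end of the cell.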

### Exercise 3.3: Creating Your First Delta Table

In [0]:
# Create a database for our work
spark.sql("CREATE DATABASE IF NOT EXISTS training")
spark.sql("USE training")

# Show current database
print(f"Current database: {spark.catalog.currentDatabase()}")

In [0]:
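# Create Delta table from DataFrame
# First, let's create some sample sales data
# Note: importing round from pyspark.sql.functions shadows Python's built-in round() in this notebook
from pyspark.sql.functions import col, rand, round, date_add, current_date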

sales_data = (
    spark.range(1000)
    .selectExpr("id as transaction_id")
    .withColumn("customer_id", (rand() * 100).cast("int"))
    .withColumn("product_id", (rand() * 50).cast("int"))
    .withColumn("quantity", (rand() * 10 + 1).cast("int"))
    .withColumn("price", round(rand() * 100 + 10, 2))
    .withColumn("transaction_date", date_add(current_date(), -(rand() * 30).cast("int")))
)

# Write as Delta table
delta_path = "/tmp/delta_sales"
sales_data.write.format("delta").mode("overwrite").save(delta_path)

print("Delta table created successfully!")
display(spark.read.format("delta").load(delta_path).limit(10))

In [0]:
# Create managed Delta table
sales_data.write.format("delta").mode("overwrite").saveAsTable("sales_transactions")

# Query the table
display(spark.sql("""
    SELECT 
        transaction_date,
        COUNT(*) as num_transactions,
        SUM(quantity * price) as total_revenue
    FROM sales_transactions
    GROUP BY transaction_date
    ORDER BY transaction_date DESC
    LIMIT 10
"""))

### Exercise 3.4: Understanding Delta Lake Transaction Logs

In [0]:
# Examine Delta transaction log
display(dbutils.fs.ls(f"{delta_path}/_delta_log"))

In [0]:
# Read transaction log content
import json

log_file = f"{delta_path}/_delta_log/00000000000000000000.json"
log_content = dbutils.fs.head(log_file, 2000)

print("Transaction Log Preview:")
print("=" * 50)
# Parse and pretty-print the first 3 log entries (one JSON object per line)
for line in log_content.split('\n')[:3]:
    if line.strip():
        try:
            log_entry = json.loads(line)
        except json.JSONDecodeError:
            # The 2,000-byte preview can cut the last line short
            print("(entry truncated by the preview limit)")
            break
        print(json.dumps(log_entry, indent=2))
        print("-" * 30)
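
Each line of a Delta log file is a JSON object whose single top-level key names the action it records, such as `commitInfo`, `protocol`, `metaData`, or `add`. The sketch below counts the action types in a hand-written, heavily simplified sample so it runs without a cluster. The sample values and the helper name `summarize_actions` are invented for illustration; real log entries carry many more fields.

```python
import json
from collections import Counter

# Hand-written, simplified sample; not copied from a real table
sample_log = """
{"commitInfo": {"operation": "WRITE", "operationParameters": {"mode": "Overwrite"}}}
{"protocol": {"minReaderVersion": 1, "minWriterVersion": 2}}
{"metaData": {"id": "example-table-id", "format": {"provider": "parquet"}}}
{"add": {"path": "part-00000.snappy.parquet", "size": 1024, "dataChange": true}}
{"add": {"path": "part-00001.snappy.parquet", "size": 2048, "dataChange": true}}
""".strip()


def summarize_actions(log_text: str) -> Counter:
    """Count how often each action type appears in newline-delimited log text."""
    counts: Counter = Counter()
    for line in log_text.splitlines():
        if line.strip():
            counts.update(json.loads(line).keys())
    return counts


summary = summarize_actions(sample_log)
print(dict(summary))
```

In the notebook, the same function could be applied to the `log_content` string read above, provided the preview is not cut off mid-line.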

In [0]:
# View Delta table details
from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, delta_path)
display(deltaTable.detail())

### Exercise 3.5: Delta Lake Operations and Updates

In [0]:
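# Perform UPDATE operation
# Record which transactions qualify first: after the discount, some rows drop below the threshold
high_value_ids = [
    row.transaction_id
    for row in spark.sql("""
        SELECT transaction_id FROM sales_transactions
        WHERE quantity * price > 500
        ORDER BY transaction_id
        LIMIT 5
    """).collect()
]

print("Before update - High value transactions:")
display(spark.table("sales_transactions").filter(col("transaction_id").isin(high_value_ids)))

# Apply 10% discount to high-value transactions
spark.sql("""
    UPDATE sales_transactions 
    SET price = price * 0.9 
    WHERE quantity * price > 500
""")

print("\nAfter update - Same transactions with discount:")
display(spark.table("sales_transactions").filter(col("transaction_id").isin(high_value_ids)))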

In [0]:
# Add new data (simulating daily batch)
new_sales = (
    spark.range(100)
    .selectExpr("id + 1000 as transaction_id")
    .withColumn("customer_id", (rand() * 100).cast("int"))
    .withColumn("product_id", (rand() * 50).cast("int"))
    .withColumn("quantity", (rand() * 10 + 1).cast("int"))
    .withColumn("price", round(rand() * 100 + 10, 2))
    .withColumn("transaction_date", current_date())
)

# Append new data
new_sales.write.format("delta").mode("append").saveAsTable("sales_transactions")

print(f"Total records after append: {spark.table('sales_transactions').count():,}")

In [0]:
# View table history
display(spark.sql("DESCRIBE HISTORY sales_transactions"))

### Exercise 3.6: Time Travel with Delta Lake

In [0]:
# Get current record count
current_count = spark.table("sales_transactions").count()
print(f"Current record count: {current_count:,}")

# Query table as of version 0 (original load)
version_0_count = spark.read.format("delta").option("versionAsOf", 0).table("sales_transactions").count()
print(f"Version 0 record count: {version_0_count:,}")

# Show the difference
print(f"\nRecords added since version 0: {current_count - version_0_count:,}")

In [0]:
# Time travel using timestamp
# Get timestamp of version 1
history_df = spark.sql("DESCRIBE HISTORY sales_transactions")
version_1_timestamp = history_df.filter("version = 1").select("timestamp").collect()[0][0]

print(f"Version 1 timestamp: {version_1_timestamp}")

# Query as of that timestamp
df_at_timestamp = spark.read.format("delta") \
    .option("timestampAsOf", version_1_timestamp) \
    .table("sales_transactions")

print(f"\nRecords at {version_1_timestamp}: {df_at_timestamp.count():,}")

In [0]:
# Compare data between versions
print("Comparing average prices between versions:")

# Current version
current_avg = spark.sql("SELECT AVG(price) as avg_price FROM sales_transactions").collect()[0][0]

# Version 0 (before discount)
v0_avg = spark.sql("""
    SELECT AVG(price) as avg_price 
    FROM sales_transactions VERSION AS OF 0
""").collect()[0][0]

print(f"Version 0 average price: ${v0_avg:.2f}")
print(f"Current average price: ${current_avg:.2f}")
print(f"Difference: ${v0_avg - current_avg:.2f} (reflects both the discount and the 100 appended rows)")
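
Time travel works because every commit in the transaction log records which data files it added and removed, so an older version can be rebuilt by replaying the log up to that point. The sketch below is a deliberately simplified model of that idea in plain Python. The file names, the commit list, and the helper `files_at_version` are invented for illustration, and real Delta tables add checkpoints and other details that are left out here.

```python
from typing import Dict, List, Set

# Each commit lists the data files it adds and removes (invented names)
commits: List[Dict[str, List[str]]] = [
    {"add": ["file_a", "file_b"], "remove": []},   # version 0: initial write
    {"add": ["file_c"], "remove": ["file_a"]},     # version 1: UPDATE rewrites file_a
    {"add": ["file_d"], "remove": []},             # version 2: append
]


def files_at_version(log: List[Dict[str, List[str]]], version: int) -> Set[str]:
    """Replay commits 0..version and return the set of files that make up that snapshot."""
    if not 0 <= version < len(log):
        raise ValueError(f"Version {version} does not exist")
    active: Set[str] = set()
    for commit in log[: version + 1]:
        active -= set(commit["remove"])
        active |= set(commit["add"])
    return active


for v in range(len(commits)):
    print(f"Version {v}: {sorted(files_at_version(commits, v))}")
```

Note that the update in version 1 does not change `file_a` in place: it writes `file_c` and marks `file_a` as removed, which is why version 0 remains readable.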

### Exercise 3.7: Delta Lake Optimization

In [0]:
# Check file statistics before optimization
# Use the managed table here: deltaTable (Exercise 3.4) points at the separate copy in /tmp/delta_sales
display(spark.sql("DESCRIBE DETAIL sales_transactions").select("numFiles", "sizeInBytes", "properties"))

# Optimize the table (compact small files)
display(spark.sql("OPTIMIZE sales_transactions"))

# Check after optimization
print("\nAfter optimization:")
display(spark.sql("DESCRIBE DETAIL sales_transactions").select("numFiles", "sizeInBytes"))
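
Compaction can be pictured as a bin-packing problem: many small files are grouped so that each rewritten file approaches a target size. The sketch below shows that idea with a first-fit-decreasing heuristic in plain Python. It is an illustration only, not the algorithm `OPTIMIZE` uses internally, and the helper name, file sizes, and target size are invented.

```python
from typing import List


def plan_compaction(file_sizes_mb: List[int], target_mb: int) -> List[List[int]]:
    """Group files so each group's total stays at or below target_mb (first-fit decreasing)."""
    groups: List[List[int]] = []
    for size in sorted(file_sizes_mb, reverse=True):
        for group in groups:
            if sum(group) + size <= target_mb:
                group.append(size)
                break
        else:
            # No existing group has room, so start a new output file
            groups.append([size])
    return groups


small_files = [10, 40, 25, 60, 5, 30]
plan = plan_compaction(small_files, target_mb=100)
print(f"{len(small_files)} small files -> {len(plan)} compacted files: {plan}")
```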

In [0]:
# Z-Order optimization for better query performance
# This co-locates related data in the same files
display(spark.sql("""
    OPTIMIZE sales_transactions 
    ZORDER BY (customer_id, transaction_date)
"""))

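Z-ordering maps several column values onto a single sort key so that rows close together in every dimension tend to land near each other in storage. A common way to build such a key is to interleave the bits of the values (a Morton code), sketched below in plain Python. This illustrates the general idea only and makes no claim about how `ZORDER BY` is implemented internally; the helper name `interleave_bits` is hypothetical.

```python
def interleave_bits(x: int, y: int, bits: int = 8) -> int:
    """Interleave the low `bits` bits of x and y into one Z-order (Morton) value."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # bits of x go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)   # bits of y go to odd positions
    return z


# Sort a 4x4 grid of points by their Z-order value
points = [(x, y) for x in range(4) for y in range(4)]
ordered = sorted(points, key=lambda p: interleave_bits(*p))
print(ordered[:4])
```

The first four points in the sorted order form a 2x2 block, which is the locality property that lets a query filtering on either column skip unrelated files.
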
### 💡 Lab 3 Checkpoint

You've now experienced:
- ✅ DBFS navigation and file operations
- ✅ Reading different file formats (CSV, JSON, Parquet)
- ✅ Creating and managing Delta tables
- ✅ Understanding transaction logs
- ✅ Performing updates and inserts
- ✅ Time travel queries
- ✅ Table optimization techniques

---
## Lab 4: Collaboration and Development Workflows (30 minutes)

### Objectives
- Use collaboration features effectively
- Understand revision history
- Create reusable components
- Establish development best practices

### Exercise 4.1: Creating Reusable Functions

In [0]:
# Create utility functions for common operations
from pyspark.sql.functions import current_timestamp, lit, col
from pyspark.sql import DataFrame
import logging

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("DataEngineering")

def add_audit_columns(df: DataFrame, process_name: str = "Unknown") -> DataFrame:
    """
    Add standard audit columns to any DataFrame.
    
    Parameters:
    - df: Input DataFrame
    - process_name: Name of the process adding these columns
    
    Returns:
    - DataFrame with audit columns added
    """
    logger.info(f"Adding audit columns for process: {process_name}")
    
    return df.withColumn("processed_timestamp", current_timestamp()) \
             .withColumn("process_name", lit(process_name)) \
             .withColumn("processing_cluster", lit(spark.conf.get("spark.databricks.clusterUsageTags.clusterName")))

# Test the function
test_df = spark.range(5)
audited_df = add_audit_columns(test_df, "Lab4_Testing")
display(audited_df)

In [0]:
# Create data quality validation function
def validate_data_quality(df: DataFrame, required_columns: list, max_null_percentage: float = 0.1) -> dict:
    """
    Validate data quality for a DataFrame.
    
    Parameters:
    - df: DataFrame to validate
    - required_columns: List of columns that must exist
    - max_null_percentage: Maximum allowed percentage of nulls (default 10%)
    
    Returns:
    - Dictionary with validation results
    """
    results = {
        "row_count": df.count(),
        "column_count": len(df.columns),
        "missing_columns": [],
        "null_percentages": {},
        "validation_passed": True
    }
    
    # Check for missing columns
    existing_columns = set(df.columns)
    required_set = set(required_columns)
    results["missing_columns"] = list(required_set - existing_columns)
    
    if results["missing_columns"]:
        results["validation_passed"] = False
        logger.warning(f"Missing required columns: {results['missing_columns']}")
    
    # Check null percentages in a single pass over the data
    from pyspark.sql.functions import sum as spark_sum  # aliased to avoid shadowing built-in sum

    total_rows = results["row_count"]
    if total_rows > 0:
        null_counts = df.select(
            [spark_sum(col(c).isNull().cast("int")).alias(c) for c in df.columns]
        ).first()
        for col_name in df.columns:
            null_percentage = null_counts[col_name] / total_rows
            results["null_percentages"][col_name] = null_percentage
            
            if null_percentage > max_null_percentage:
                results["validation_passed"] = False
                logger.warning(f"Column {col_name} has {null_percentage:.2%} nulls")
    
    return results

# Test validation
validation_results = validate_data_quality(
    spark.table("sales_transactions"),
    required_columns=["transaction_id", "customer_id", "price"],
    max_null_percentage=0.05
)

print("Data Quality Validation Results:")
print(f"Validation Passed: {validation_results['validation_passed']}")
print(f"Row Count: {validation_results['row_count']:,}")
print(f"Null Percentages: {validation_results['null_percentages']}")
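
Printing the raw dictionary becomes hard to read once a table has many columns. The sketch below formats a results dictionary of the same shape into a short report. It is plain Python and runs without Spark; the helper name `format_validation_report` is hypothetical and the sample numbers are invented, not output from the lab table.

```python
def format_validation_report(results: dict, max_null_percentage: float = 0.1) -> str:
    """Render a validation results dict as a readable multi-line report."""
    lines = [
        f"Validation passed: {results['validation_passed']}",
        f"Rows: {results['row_count']:,}  Columns: {results['column_count']}",
    ]
    if results["missing_columns"]:
        lines.append("Missing columns: " + ", ".join(sorted(results["missing_columns"])))
    for name, pct in sorted(results["null_percentages"].items()):
        flag = "FAIL" if pct > max_null_percentage else "ok"
        lines.append(f"  {name:<20} {pct:>7.2%}  {flag}")
    return "\n".join(lines)


# Invented sample input shaped like the dict returned by validate_data_quality
sample_results = {
    "row_count": 1100,
    "column_count": 3,
    "missing_columns": [],
    "null_percentages": {"transaction_id": 0.0, "customer_id": 0.02, "price": 0.12},
    "validation_passed": False,
}
report = format_validation_report(sample_results, max_null_percentage=0.05)
print(report)
```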

### Exercise 4.2: Documentation Best Practices

In [0]:
# Example of well-documented data processing function
def process_daily_sales(date_str: str, source_path: str = None) -> DataFrame:
    """
    Process daily sales data with standardized transformations.
    
    This function reads raw sales data for a specific date, applies business rules,
    and returns a cleaned DataFrame ready for analysis or storage.
    
    Parameters:
    -----------
    date_str : str
        Date in 'YYYY-MM-DD' format
    source_path : str, optional
        Override default source path for testing
    
    Returns:
    --------
    DataFrame
        Processed sales data with the following columns:
        - transaction_id: Unique identifier
        - customer_id: Customer identifier
        - product_id: Product identifier
        - quantity: Number of items
        - unit_price: Price per item
        - total_amount: quantity * unit_price
        - transaction_date: Date of transaction
        - processing_timestamp: When this record was processed
    
    Raises:
    -------
    ValueError
        If date_str is not in correct format
    FileNotFoundError
        If source data doesn't exist
    
    Example:
    --------
    >>> df = process_daily_sales('2024-01-15')
    >>> df.show(5)
    
    Notes:
    ------
    - Applies 10% discount for quantities > 10
    - Filters out test transactions (customer_id = 0)
    - Adds audit columns for lineage tracking
    """
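    # Validate the date format up front so bad input fails fast with ValueError
    from datetime import datetime
    datetime.strptime(date_str, "%Y-%m-%d")

    logger.info(f"Processing sales data for {date_str}")
    
    # For demo purposes, return the matching rows from the sample table;
    # the transformations described in the docstring are not implemented here
    return spark.table("sales_transactions").filter(col("transaction_date") == date_str)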

# Display the docstring
help(process_daily_sales)

### Exercise 4.3: Development Standards Checklist

In [0]:
# Create a development standards template
development_standards = """
# Databricks Development Standards Checklist

## Code Quality
- [ ] Functions have descriptive names following snake_case convention
- [ ] All functions include docstrings with parameters and returns documented
- [ ] Complex logic includes inline comments
- [ ] No hardcoded values - use configuration or parameters
- [ ] Error handling implemented with try-except blocks
- [ ] Logging added for key operations

## Data Quality
- [ ] Input data validated before processing
- [ ] Null checks implemented where appropriate
- [ ] Data types verified and cast explicitly
- [ ] Row counts logged at each transformation step
- [ ] Output data quality validated

## Performance
- [ ] Appropriate partitioning strategy implemented
- [ ] Broadcast joins used for small lookup tables
- [ ] Caching applied for repeatedly used DataFrames
- [ ] File sizes optimized (100-200MB per file)
- [ ] Z-ordering applied for frequently filtered columns

## Security
- [ ] No credentials hardcoded in notebooks
- [ ] Secrets stored in secret scopes
- [ ] Access controls reviewed and documented
- [ ] Sensitive data columns identified and protected

## Documentation
- [ ] README created explaining the solution
- [ ] Data flow diagram included
- [ ] Dependencies documented
- [ ] Runbook created for operations team

## Testing
- [ ] Unit tests for utility functions
- [ ] Integration tests for end-to-end flow
- [ ] Performance tests with production-scale data
- [ ] Edge cases identified and tested
"""

# Save to a file for team reference
dbutils.fs.put("/tmp/development_standards.md", development_standards, overwrite=True)
print("Development standards checklist created!")
print("\nPreview:")
print(development_standards[:500] + "...")
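
Because the checklist uses Markdown task-list syntax, progress per section can be tallied with a few lines of string handling. The sketch below is plain Python; the helper name `count_checklist_items` is hypothetical, and the short sample (with one item ticked) is made up to keep the example small.

```python
from typing import Dict


def count_checklist_items(markdown: str) -> Dict[str, Dict[str, int]]:
    """Count done and total task-list items under each '## ' section heading."""
    progress: Dict[str, Dict[str, int]] = {}
    section = None
    for raw_line in markdown.splitlines():
        line = raw_line.strip()
        if line.startswith("## "):
            section = line[3:]
            progress[section] = {"done": 0, "total": 0}
        elif section is not None and line.startswith("- ["):
            progress[section]["total"] += 1
            if line.lower().startswith("- [x]"):
                progress[section]["done"] += 1
    return progress


# Shortened, made-up sample in the same format as the checklist
sample_checklist = """
## Code Quality
- [x] Functions have descriptive names following snake_case convention
- [ ] Logging added for key operations

## Security
- [ ] No credentials hardcoded in notebooks
"""
progress = count_checklist_items(sample_checklist)
for name, counts in progress.items():
    print(f"{name}: {counts['done']}/{counts['total']} done")
```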

### Exercise 4.4: Creating a Module Summary

In [0]:
# Generate a summary of what we've learned
module_summary = {
    "Module": "01 - Databricks Fundamentals",
    "Topics Covered": [
        "Databricks Workspace Navigation",
        "Cluster Creation and Management",
        "DBFS and Cloud Storage Integration",
        "Delta Lake Fundamentals",
        "Time Travel and Versioning",
        "Collaboration Features"
    ],
    "Key Commands Learned": [
        "dbutils.fs for file operations",
        "spark.read/write for data I/O",
        "Delta table operations",
        "Time travel queries"
    ],
    "Hands-on Achievements": [
        "Created and configured a cluster",
        "Read multiple file formats",
        "Created and optimized Delta tables",
        "Performed time travel queries",
        "Built reusable functions"
    ],
    "Ready for Module 2": True
}

# Display summary
import json
print("Module 1 Learning Summary")
print("=" * 50)
print(json.dumps(module_summary, indent=2))

# Save summary as Delta table for tracking
# current_timestamp() is a Spark Column, so add it with withColumn instead of str()
summary_df = spark.createDataFrame(
    [("Module_01", "Completed")],
    ["module", "status"]
).withColumn("completion_time", current_timestamp().cast("string"))
summary_df.write.format("delta").mode("append").saveAsTable("training.course_progress")

In [0]:
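# Create a personal reflection template
from datetime import date  # Spark's current_date() is a Column and cannot be formatted into text

reflection_template = f"""
# Module 1 Reflection - {date.today().isoformat()}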

## What I Learned
- [Add your key learnings here]
- 
- 

## What Surprised Me
- [What was unexpected?]
- 

## Questions for Follow-up
- [What needs clarification?]
- 

## How I'll Apply This
- [Real-world applications]
- 

## Technical Challenges Faced
- [What was difficult?]
- 

## Next Steps
- Review Delta Lake documentation
- Practice more with time travel
- Prepare for Module 2 (ETL with PySpark)
"""

print("Personal Reflection Template:")
print(reflection_template)
print("\n💡 Take 5 minutes to fill this out - it will help consolidate your learning!")

### 💡 Lab 4 Checkpoint

By now you have:
- ✅ Created reusable utility functions
- ✅ Implemented data quality validation
- ✅ Practiced documentation standards
- ✅ Established development best practices
- ✅ Generated a module summary

---
## 🎉 Congratulations!

You've successfully completed all Module 1 laboratory exercises! 

### Your Achievements:
1. **Workspace Mastery**: Navigated and organized your Databricks environment
2. **Cluster Management**: Created and configured compute resources
3. **Data Engineering Fundamentals**: Worked with DBFS, various file formats, and Delta Lake
4. **Advanced Features**: Implemented time travel, optimization, and versioning
5. **Professional Practices**: Built reusable code and documentation

### Before Module 2:
1. Review any sections that were challenging
2. Experiment with the sample datasets
3. Practice Delta Lake operations
4. Prepare questions for the Wednesday session

### Remember:
- **Terminate your cluster** if you're done practicing (save costs!)
- **Download this notebook** for future reference
- **Join the Wednesday session** for Q&A and additional support

See you in Module 2, where we'll dive deep into ETL development with PySpark! 🚀

In [0]:
# Final cleanup reminder
print("⚠️  IMPORTANT REMINDERS:")
print("1. Save your work - File > Save")
print("2. Export this notebook - File > Export > IPython Notebook")
print("3. Terminate your cluster - Go to Compute tab")
print("\nGreat job today! 👏")