# Lab 01: Big Data Volume Demonstration
## Banking Analytics Workshop - Data Volume Challenges

**Duration:** 90 minutes  
**Learning Objectives:**
- Understand Volume, one of the 5 V's of Big Data
- Experience performance differences between traditional and distributed processing
- Work with realistic banking transaction volumes
- Explore memory limitations and scaling solutions

### 📊 Real Banking Volume Context

**Industry Examples:**
- **Deutsche Bank:** ~1 Billion transactions/day
- **PayPal:** 19 million transactions/day  
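- **Visa:** up to ~65,000 transactions/second (stated network capacity, not average load)
- **Our Lab:** 10,000 to 500,000 synthetic banking transactions (raise `dataset_sizes` toward 1M+ if your machine allows)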

Let's explore how traditional tools break down and Big Data solutions scale!
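
The figures above mix per-day and per-second units. The sketch below converts them to a common scale, taking the quoted numbers at face value and assuming traffic is spread evenly over 24 hours (real payment traffic has peaks, so this is only a rough comparison). The names `quoted_volumes` and `normalize` are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Volumes as quoted above, keyed by the unit used in the text
quoted_volumes = {
    "Deutsche Bank": {"per_day": 1_000_000_000},
    "PayPal": {"per_day": 19_000_000},
    "Visa": {"per_second": 65_000},
}

def normalize(volume):
    """Return (transactions per day, transactions per second) for one entry."""
    if "per_day" in volume:
        per_day = volume["per_day"]
        return per_day, per_day / SECONDS_PER_DAY
    per_second = volume["per_second"]
    return per_second * SECONDS_PER_DAY, per_second

normalized = {name: normalize(v) for name, v in quoted_volumes.items()}

for name, (per_day, per_second) in normalized.items():
    print(f"{name:<14} {per_day:>16,.0f} /day  {per_second:>10,.0f} /second")
```

On an even-load basis, one billion transactions per day works out to roughly 11,600 per second.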

## 🚀 Environment Setup and Data Loading

First, let's set up our environment and understand the tools we'll be working with.

In [None]:
# Essential imports for Big Data processing
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import time
import psutil
import os
from datetime import datetime, timedelta
import random
from faker import Faker

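# Locate the Spark installation first; this must run before pyspark is imported
import findspark
findspark.init()

# PySpark imports
from pyspark.sql import SparkSession
# NOTE: the wildcard import below also exports Spark column functions named
# sum, max, min, round, abs, ... which shadow the Python built-ins of the same name.
from pyspark.sql.functions import *
from pyspark.sql.types import *
# Restore the built-ins that this notebook only ever calls as plain Python.
# sum() and max() stay bound to the Spark versions because later cells apply them to columns.
from builtins import round, min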

print("✅ All imports successful!")
print(f"📊 Pandas version: {pd.__version__}")
print(f"🔥 NumPy version: {np.__version__}")

# Check system resources
memory_gb = psutil.virtual_memory().total / (1024**3)
cpu_cores = psutil.cpu_count()
print(f"💻 System Memory: {memory_gb:.1f} GB")
print(f"⚡ CPU Cores: {cpu_cores}")

In [None]:
# Configure a local Spark session for this lab
def create_spark_session():
    """Create a local Spark session with settings suited to this lab"""
    
    # local[*] runs Spark inside a single JVM using all available cores. In this
    # mode the executors live in the driver process, so spark.driver.memory is the
    # setting that matters; spark.executor.memory only takes effect on a real cluster.
    spark = SparkSession.builder \
        .appName("Banking-BigData-Volume-Lab") \
        .master("local[*]") \
        .config("spark.driver.memory", "4g") \
        .config("spark.executor.memory", "2g") \
        .config("spark.sql.adaptive.enabled", "true") \
        .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
        .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "128MB") \
        .getOrCreate()
    
    # Set log level to reduce noise
    spark.sparkContext.setLogLevel("WARN")
    
    return spark

# Create Spark session
spark = create_spark_session()

print("🔥 Spark Session Created Successfully!")
print(f"📊 Spark Version: {spark.version}")
print(f"⚡ Available Cores: {spark.sparkContext.defaultParallelism}")
print(f"💾 Driver Memory: {spark.conf.get('spark.driver.memory')}")

# Display Spark UI URL for monitoring
print(f"🌐 Spark Web UI: {spark.sparkContext.uiWebUrl}")

In [None]:
# Generate synthetic banking data for volume demonstration
fake = Faker()
Faker.seed(42)  # For reproducible results

def generate_banking_transactions(num_records):
    """
    Generate synthetic banking transaction data
    
    Args:
        num_records (int): Number of transaction records to generate
    
    Returns:
        list: List of dictionaries containing transaction data
    """
    
    print(f"🏦 Generating {num_records:,} synthetic banking transactions...")
    
    # Banking-specific data pools
    merchants = [
        "REWE Supermarket", "Amazon", "Deutsche Bahn", "Shell Tankstelle",
        "MediaMarkt", "H&M Fashion", "McDonald's", "Sparkasse ATM",
        "Edeka", "Zalando", "Netflix", "Spotify", "Apple Store",
        "Google Play", "Vodafone", "BMW Service", "IKEA", "Booking.com"
    ]
    
    merchant_categories = [
        "Grocery", "Online Shopping", "Transport", "Gas Station",
        "Electronics", "Fashion", "Fast Food", "ATM Withdrawal",
        "Grocery", "Online Fashion", "Streaming", "Music",
        "Technology", "Apps", "Telecommunications", "Automotive",
        "Home & Garden", "Travel"
    ]
    
    channels = ["card", "online", "mobile", "atm", "bank_transfer"]
    currencies = ["EUR", "USD", "GBP"]
    countries = ["DE", "US", "GB", "FR", "IT", "ES", "NL"]
    
    transactions = []
    customer_ids = [f"CUST_{i:08d}" for i in range(1, min(100000, num_records//10) + 1)]
    
    for i in range(num_records):
        # Realistic transaction patterns: draw the value tier at random so that
        # high-value transactions are spread through the dataset instead of
        # being clustered in the last records
        tier = fake.random.random()
        if tier < 0.80:  # ~80% normal transactions
            amount = round(fake.random.uniform(5.0, 500.0), 2)
        elif tier < 0.95:  # ~15% higher value transactions
            amount = round(fake.random.uniform(500.0, 5000.0), 2)
        else:  # ~5% very high value transactions
            amount = round(fake.random.uniform(5000.0, 50000.0), 2)
        
        merchant_idx = fake.random.randint(0, len(merchants) - 1)
        
        # Draw the timestamp first so dependent fields stay consistent with it
        timestamp = fake.date_time_between(start_date='-2y', end_date='now')
        
        # Create transaction record
        transaction = {
            "transaction_id": f"TXN_{timestamp.year}_{i+1:08d}",
            "customer_id": fake.random.choice(customer_ids),
            "amount": amount,
            "currency": fake.random.choice(currencies),
            "merchant": merchants[merchant_idx],
            "merchant_category": merchant_categories[merchant_idx],
            "transaction_type": fake.random.choice(["purchase", "withdrawal", "transfer", "payment"]),
            "channel": fake.random.choice(channels),
            "timestamp": timestamp,
            "location_country": fake.random.choice(countries),
            "location_city": fake.city(),
            "is_weekend": timestamp.weekday() >= 5,  # Saturday=5, Sunday=6
            "description": f"Transaction at {merchants[merchant_idx]}"
        }
        
        transactions.append(transaction)
        
        # Progress indicator
        if (i + 1) % 100000 == 0:
            print(f"   Generated {i+1:,} transactions...")
    
    print(f"✅ Generated {len(transactions):,} transactions successfully!")
    return transactions

# Start with smaller dataset for comparison
print("Creating datasets of different sizes for comparison...")
small_transactions = generate_banking_transactions(10000)  # 10K
print("Small dataset created.")
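
The generator aims for roughly 80% everyday, 15% higher-value, and 5% very high-value amounts. One way to draw such a mix per transaction is weighted sampling with a seeded generator. The sketch below uses only the standard library; `TIERS`, `WEIGHTS`, and `draw_amount` are illustrative names, not part of the lab code.

```python
import random

# (lower bound, upper bound, weight) for each amount tier, mirroring the lab's ranges
TIERS = [
    (5.0, 500.0, 0.80),       # everyday purchases
    (500.0, 5000.0, 0.15),    # higher-value transactions
    (5000.0, 50000.0, 0.05),  # very high-value transactions
]
WEIGHTS = [weight for _, _, weight in TIERS]

def draw_amount(rng):
    """Pick a tier by weight, then draw a uniform amount inside that tier."""
    low, high, _ = rng.choices(TIERS, weights=WEIGHTS, k=1)[0]
    return rng.uniform(low, high)

rng = random.Random(42)  # seeded so repeated runs give the same sample
amounts = [draw_amount(rng) for _ in range(100_000)]

share_small = len([a for a in amounts if a < 500]) / len(amounts)
share_large = len([a for a in amounts if a >= 5000]) / len(amounts)
print(f"under 500: {share_small:.1%}, 5000 and above: {share_large:.1%}")
```

Because the tier is drawn independently for each record, any slice of the dataset (for example the first 1,000 rows) contains a similar mix of small and large amounts.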

## 📊 Big Data Fundamentals: Pandas vs Spark Comparison

Now let's experience the fundamental difference between traditional data processing (Pandas) and Big Data processing (Spark). We'll analyze the same banking data using both approaches and measure performance differences.
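
A single wall-clock measurement is noisy, and memory deltas taken from the process's resident set size can come out as zero or negative when memory is reused. The sketch below is one alternative under stated assumptions: it uses only the standard library (`time.perf_counter`, `tracemalloc`, `statistics`), repeats the call, and reports the median time and the peak Python-level allocation. `benchmark` and `build_squares` are hypothetical helpers, not the functions used in this lab. `tracemalloc` sees only Python allocations, so memory held by the Spark JVM would not be counted, and tracing itself slows the measured code somewhat.

```python
import statistics
import time
import tracemalloc

def benchmark(func, *args, repeats=3, **kwargs):
    """Run func several times; return (last result, median seconds, peak MB)."""
    durations = []
    peak_bytes = 0
    result = None
    for _ in range(repeats):
        tracemalloc.start()
        started = time.perf_counter()
        result = func(*args, **kwargs)
        durations.append(time.perf_counter() - started)
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        if peak > peak_bytes:
            peak_bytes = peak
    return result, statistics.median(durations), peak_bytes / (1024 * 1024)

def build_squares(n):
    """Toy workload: allocate a list of n squared integers."""
    return [i * i for i in range(n)]

squares, seconds, peak_mb = benchmark(build_squares, 200_000)
print(f"{len(squares):,} items in {seconds:.4f}s, peak {peak_mb:.1f} MB")
```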

In [None]:
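def performance_benchmark(func, *args, **kwargs):
    """Benchmark wall-clock time and the change in this Python process's memory (RSS).
    
    Note: RSS covers only the Python process. Spark does its work in a separate
    JVM, so the Spark figure reflects driver-side Python objects, not JVM memory.
    The delta can also be zero or negative if memory is released during the call.
    """
    
    # Memory before execution
    process = psutil.Process()
    mem_before = process.memory_info().rss / 1024 / 1024  # MB
    
    # Time execution with a monotonic, high-resolution clock
    start_time = time.perf_counter()
    result = func(*args, **kwargs)
    end_time = time.perf_counter()
    
    # Memory after execution
    mem_after = process.memory_info().rss / 1024 / 1024  # MB
    
    execution_time = end_time - start_time
    memory_used = mem_after - mem_before
    
    return result, execution_time, memory_used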

def analyze_transactions_pandas(transactions_data):
    """Traditional pandas-based analysis of banking transactions"""
    
    print("🐼 Starting Pandas Analysis...")
    
    # Convert to DataFrame
    df = pd.DataFrame(transactions_data)
    
    # Convert timestamp to datetime
    df['timestamp'] = pd.to_datetime(df['timestamp'])
    df['date'] = df['timestamp'].dt.date
    df['hour'] = df['timestamp'].dt.hour
    
    # Banking analytics queries
    results = {}
    
    # 1. Daily transaction volume
    daily_volume = df.groupby('date').agg({
        'transaction_id': 'count',
        'amount': ['sum', 'mean']
    }).reset_index()
    results['daily_volume'] = daily_volume.shape[0]
    
    # 2. Customer transaction patterns
    customer_patterns = df.groupby('customer_id').agg({
        'amount': ['sum', 'mean', 'count'],
        'merchant_category': 'nunique'
    }).reset_index()
    results['customers'] = customer_patterns.shape[0]
    
    # 3. Fraud indicators (high-value, unusual patterns)
    fraud_indicators = df[
        (df['amount'] > df['amount'].quantile(0.95)) |
        (df['hour'].isin([0, 1, 2, 3, 4, 5]))
    ]
    results['fraud_indicators'] = fraud_indicators.shape[0]
    
    # 4. Merchant analysis
    merchant_stats = df.groupby(['merchant', 'merchant_category']).agg({
        'amount': ['sum', 'count', 'mean'],
        'customer_id': 'nunique'
    }).reset_index()
    results['merchants'] = merchant_stats.shape[0]
    
    print(f"   ✅ Processed {len(df):,} transactions")
    print(f"   📈 Generated {len(results)} analytical views")
    
    return results

def analyze_transactions_spark(spark_session, transactions_data):
    """Distributed Spark-based analysis of banking transactions"""
    
    print("🔥 Starting Spark Analysis...")
    
    # Create DataFrame from data
    df = spark_session.createDataFrame(transactions_data)
    
    # Convert timestamp and add derived columns
    df = df.withColumn("timestamp", to_timestamp(col("timestamp"))) \
          .withColumn("date", to_date(col("timestamp"))) \
          .withColumn("hour", hour(col("timestamp")))
    
    # Cache for multiple operations
    df.cache()
    
    results = {}
    
    # 1. Daily transaction volume using Spark SQL
    df.createOrReplaceTempView("transactions")
    daily_volume = spark_session.sql("""
        SELECT date,
               COUNT(*) as transaction_count,
               SUM(amount) as total_amount,
               AVG(amount) as avg_amount
        FROM transactions
        GROUP BY date
    """)
    results['daily_volume'] = daily_volume.count()
    
    # 2. Customer transaction patterns
    customer_patterns = spark_session.sql("""
        SELECT customer_id,
               SUM(amount) as total_spent,
               AVG(amount) as avg_transaction,
               COUNT(*) as transaction_count,
               COUNT(DISTINCT merchant_category) as categories_used
        FROM transactions
        GROUP BY customer_id
    """)
    results['customers'] = customer_patterns.count()
    
    # 3. Fraud indicators using DataFrame API
    quantile_95 = df.select(expr("percentile_approx(amount, 0.95)").alias("q95")).collect()[0]["q95"]
    fraud_indicators = df.filter(
        (col("amount") > quantile_95) |
        (col("hour").isin([0, 1, 2, 3, 4, 5]))
    )
    results['fraud_indicators'] = fraud_indicators.count()
    
    # 4. Merchant analysis
    merchant_stats = df.groupBy("merchant", "merchant_category").agg(
        sum("amount").alias("total_amount"),
        count("*").alias("transaction_count"),
        avg("amount").alias("avg_amount"),
        countDistinct("customer_id").alias("unique_customers")
    )
    results['merchants'] = merchant_stats.count()
    
    print(f"   ✅ Processed {df.count():,} transactions")
    print(f"   📈 Generated {len(results)} analytical views")
    
    # Unpersist cache
    df.unpersist()
    
    return results

# Run performance comparison
print("🏁 Starting Performance Comparison...")
print("="*60)

# Pandas benchmark
pandas_result, pandas_time, pandas_memory = performance_benchmark(
    analyze_transactions_pandas, small_transactions
)

print(f"🐼 Pandas Results:")
print(f"   ⏱️  Execution Time: {pandas_time:.2f} seconds")
print(f"   💾 Memory Used: {pandas_memory:.1f} MB")
print()

# Spark benchmark
spark_result, spark_time, spark_memory = performance_benchmark(
    analyze_transactions_spark, spark, small_transactions
)

print(f"🔥 Spark Results:")
print(f"   ⏱️  Execution Time: {spark_time:.2f} seconds")
print(f"   💾 Memory Used: {spark_memory:.1f} MB")
print()

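# Performance comparison
if pandas_time > 0 and spark_time > 0:
    print(f"📊 Performance Comparison:")
    if spark_time < pandas_time:
        print(f"   🚀 Spark is {pandas_time / spark_time:.1f}x faster than Pandas")
    else:
        print(f"   🐼 Pandas is {spark_time / pandas_time:.1f}x faster than Spark")
    # RSS deltas can be zero or negative, so report the ratio only when it is meaningful
    if pandas_memory > 0 and spark_memory > 0:
        print(f"   💡 Memory ratio (Pandas/Spark, Python process only): {pandas_memory / spark_memory:.1f}x")
else:
    print("⚠️ Performance comparison not available")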

## 📈 Scaling to Real Banking Volumes

Now let's experience what happens when we scale up to more realistic banking volumes. We'll generate progressively larger datasets and observe how each system performs.
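
Before scaling up, it helps to estimate how much memory the raw Python records alone might need. The sketch below measures one sample record with `sys.getsizeof` and extrapolates linearly. It is a rough figure only: the sample record and the helper names are illustrative, dictionary keys are shared between records in practice (so this overcounts them), and a pandas DataFrame stores the same data differently.

```python
import sys
from datetime import datetime

# One illustrative record with a subset of the lab's fields
sample_record = {
    "transaction_id": "TXN_2024_00000001",
    "customer_id": "CUST_00000042",
    "amount": 129.99,
    "currency": "EUR",
    "merchant": "REWE Supermarket",
    "channel": "card",
    "timestamp": datetime(2024, 1, 15, 10, 30),
}

def record_bytes(record):
    """Shallow size of the dict plus the sizes of its keys and values."""
    size = sys.getsizeof(record)
    for key, value in record.items():
        size += sys.getsizeof(key) + sys.getsizeof(value)
    return size

def estimate_gib(num_records, per_record_bytes):
    """Linear extrapolation: records x bytes per record, in GiB."""
    return num_records * per_record_bytes / 1024**3

per_record = record_bytes(sample_record)
for n in (10_000, 500_000, 1_000_000_000):
    print(f"{n:>13,} records ~ {estimate_gib(n, per_record):,.2f} GiB as Python dicts")
```

Comparing the last line with the system memory printed during setup gives a feel for why a day of transactions at a large bank does not fit on one machine.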

In [None]:
# Volume scaling experiment
dataset_sizes = [10_000, 50_000, 100_000, 500_000]  # Start conservative for demo
performance_results = []

print("🎯 Volume Scaling Experiment")
print("="*50)
print("Testing both Pandas and Spark with increasing data volumes...")
print()

for size in dataset_sizes:
    print(f"📊 Testing with {size:,} transactions...")
    
    # Generate dataset of current size
    transactions = generate_banking_transactions(size)
    
    result_dict = {
        'dataset_size': size,
        'data_size_mb': len(str(transactions)) / 1024 / 1024  # Rough estimate
    }
    
    try:
        # Test Pandas
        pandas_result, pandas_time, pandas_memory = performance_benchmark(
            analyze_transactions_pandas, transactions
        )
        result_dict['pandas_time'] = pandas_time
        result_dict['pandas_memory'] = pandas_memory
        result_dict['pandas_success'] = True
        print(f"   🐼 Pandas: {pandas_time:.2f}s, {pandas_memory:.1f}MB")
        
    except Exception as e:
        print(f"   🐼 Pandas: Failed - {str(e)[:50]}...")
        result_dict['pandas_time'] = None
        result_dict['pandas_memory'] = None
        result_dict['pandas_success'] = False
    
    try:
        # Test Spark
        spark_result, spark_time, spark_memory = performance_benchmark(
            analyze_transactions_spark, spark, transactions
        )
        result_dict['spark_time'] = spark_time
        result_dict['spark_memory'] = spark_memory
        result_dict['spark_success'] = True
        print(f"   🔥 Spark: {spark_time:.2f}s, {spark_memory:.1f}MB")
        
    except Exception as e:
        print(f"   🔥 Spark: Failed - {str(e)[:50]}...")
        result_dict['spark_time'] = None
        result_dict['spark_memory'] = None
        result_dict['spark_success'] = False
    
    performance_results.append(result_dict)
    print()

# Create performance visualization
results_df = pd.DataFrame(performance_results)
print("📈 Performance Results Summary:")
print(results_df)
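
To judge how run time grows with volume, one option is to fit a line to log(time) against log(size): a slope near 1 suggests linear scaling, a slope near 2 quadratic. The sketch below is a plain least-squares fit using the standard library. `scaling_exponent` is an illustrative helper, and the timings are made-up placeholders, not measurements from this lab.

```python
import math

def scaling_exponent(sizes, times):
    """Least-squares slope of log(time) against log(size)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(t) for t in times]
    mean_x = math.fsum(xs) / len(xs)
    mean_y = math.fsum(ys) / len(ys)
    covariance = math.fsum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = math.fsum((x - mean_x) ** 2 for x in xs)
    return covariance / variance

sizes = [10_000, 50_000, 100_000, 500_000]

# Placeholder timings in seconds, invented for illustration only
linear_times = [0.02 * n / 10_000 for n in sizes]
quadratic_times = [0.02 * (n / 10_000) ** 2 for n in sizes]

print(f"linear example slope:    {scaling_exponent(sizes, linear_times):.2f}")
print(f"quadratic example slope: {scaling_exponent(sizes, quadratic_times):.2f}")
```

The same function could be applied to the `dataset_size` and timing columns of `results_df`, provided every run succeeded so that no timing is missing.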

In [None]:
# Create comprehensive performance visualization
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=('Execution Time vs Dataset Size', 'Memory Usage vs Dataset Size',
                   'Performance Comparison', 'Banking Volume Context'),
    specs=[[{"secondary_y": False}, {"secondary_y": False}],
           [{"secondary_y": False}, {"secondary_y": False}]]
)

# Filter successful results for plotting
successful_results = results_df[results_df['pandas_success'] | results_df['spark_success']]

if not successful_results.empty:
    # Execution Time Comparison
    fig.add_trace(
        go.Scatter(
            x=successful_results['dataset_size'],
            y=successful_results['pandas_time'],
            mode='lines+markers',
            name='Pandas Time',
            line=dict(color='blue', width=2),
            marker=dict(size=8)
        ),
        row=1, col=1
    )
    
    fig.add_trace(
        go.Scatter(
            x=successful_results['dataset_size'],
            y=successful_results['spark_time'],
            mode='lines+markers',
            name='Spark Time',
            line=dict(color='red', width=2),
            marker=dict(size=8)
        ),
        row=1, col=1
    )
    
    # Memory Usage Comparison
    fig.add_trace(
        go.Scatter(
            x=successful_results['dataset_size'],
            y=successful_results['pandas_memory'],
            mode='lines+markers',
            name='Pandas Memory',
            line=dict(color='lightblue', width=2),
            marker=dict(size=8),
            showlegend=False
        ),
        row=1, col=2
    )
    
    fig.add_trace(
        go.Scatter(
            x=successful_results['dataset_size'],
            y=successful_results['spark_memory'],
            mode='lines+markers',
            name='Spark Memory',
            line=dict(color='lightcoral', width=2),
            marker=dict(size=8),
            showlegend=False
        ),
        row=1, col=2
    )

# Performance efficiency chart
if len(successful_results) > 0:
    # Calculate efficiency (records/second) at the largest dataset size.
    # Failed runs are stored as None/NaN, and NaN is truthy, so test for it explicitly.
    latest_result = successful_results.iloc[-1]
    if pd.notna(latest_result['pandas_time']) and pd.notna(latest_result['spark_time']):
        pandas_efficiency = latest_result['dataset_size'] / latest_result['pandas_time']
        spark_efficiency = latest_result['dataset_size'] / latest_result['spark_time']
        
        fig.add_trace(
            go.Bar(
                x=['Pandas', 'Spark'],
                y=[pandas_efficiency, spark_efficiency],
                name='Records/Second',
                marker_color=['blue', 'red'],
                showlegend=False
            ),
            row=2, col=1
        )

# Banking industry context
banking_volumes = {
    'Our Lab': successful_results['dataset_size'].max() if not successful_results.empty else 100_000,
    'Small Bank': 1_000_000,
    'Regional Bank': 10_000_000,
    'PayPal (Daily)': 19_000_000,
    'Deutsche Bank (Daily)': 1_000_000_000
}

fig.add_trace(
    go.Bar(
        x=list(banking_volumes.keys()),
        y=list(banking_volumes.values()),
        name='Daily Transactions',
        marker_color='green',
        showlegend=False
    ),
    row=2, col=2
)

# Update layout
fig.update_layout(
    height=800,
    title_text="Big Data Volume Performance Analysis - Banking Context",
    title_x=0.5
)

# Update axis labels
fig.update_xaxes(title_text="Dataset Size (Records)", row=1, col=1)
fig.update_yaxes(title_text="Time (Seconds)", row=1, col=1)
fig.update_xaxes(title_text="Dataset Size (Records)", row=1, col=2)
fig.update_yaxes(title_text="Memory (MB)", row=1, col=2)
fig.update_xaxes(title_text="Technology", row=2, col=1)
fig.update_yaxes(title_text="Records/Second", row=2, col=1)
fig.update_xaxes(title_text="Banking Institution", row=2, col=2)
fig.update_yaxes(title_text="Daily Transactions", row=2, col=2, type="log")

fig.show()

# Print key insights
print("\n🔍 Key Insights from Volume Analysis:")
print("="*50)

if not successful_results.empty:
    max_size = successful_results['dataset_size'].max()
    pandas_worked = successful_results['pandas_success'].any()
    spark_worked = successful_results['spark_success'].any()
    
    print(f"✅ Maximum dataset size processed: {max_size:,} records")
    print(f"🐼 Pandas handled: {'✅' if pandas_worked else '❌'}")
    print(f"🔥 Spark handled: {'✅' if spark_worked else '❌'}")
    
    if pandas_worked and spark_worked:
        # Compare efficiency at the largest size that both engines completed
        both_ok = successful_results['pandas_success'] & successful_results['spark_success']
        last_common = successful_results[both_ok].iloc[-1] if both_ok.any() else None
        
        if last_common is not None:
            pandas_rate = last_common['dataset_size'] / last_common['pandas_time']
            spark_rate = last_common['dataset_size'] / last_common['spark_time']
            print(f"⚡ Pandas processing rate: {pandas_rate:,.0f} records/second")
            print(f"🚀 Spark processing rate: {spark_rate:,.0f} records/second")
            
            if spark_rate > pandas_rate:
                print(f"🏆 Spark is {spark_rate/pandas_rate:.1f}x faster for processing")
            else:
                print(f"🏆 Pandas is {pandas_rate/spark_rate:.1f}x faster for processing")

print("\n💡 Real-World Banking Context:")
print(f"🏦 A major bank processes ~1B transactions/day")
print(f"📈 At current rates (assuming linear scaling and enough memory), that would take:")
if not successful_results.empty and successful_results['spark_success'].any():
    # Use the largest dataset that Spark completed
    last_spark = successful_results[successful_results['spark_success']].iloc[-1]
    spark_rate = last_spark['dataset_size'] / last_spark['spark_time']
    days_needed = 1_000_000_000 / spark_rate / 86400  # Convert to days
    print(f"   🔥 Spark on this machine: {days_needed:.1f} days per pass (a cluster spreads this across nodes)")
else:
    print("   🔥 Spark: Unable to estimate (processing failed)")
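
The estimate above divides one billion records by the measured single-machine rate. The same arithmetic can be turned around to ask how many workers would be needed to finish within a time budget. The sketch below assumes ideal linear scaling, which real clusters do not reach because of shuffles, skew, and coordination overhead. The rate is a hypothetical placeholder, not a result from this lab.

```python
import math

DAILY_TRANSACTIONS = 1_000_000_000
SECONDS_PER_DAY = 86_400

def days_on_one_machine(records_per_second):
    """Days needed to process one day's volume at a fixed rate."""
    return DAILY_TRANSACTIONS / records_per_second / SECONDS_PER_DAY

def workers_needed(records_per_second, budget_seconds):
    """Workers required to finish within the budget under ideal linear scaling."""
    single_machine_seconds = DAILY_TRANSACTIONS / records_per_second
    return math.ceil(single_machine_seconds / budget_seconds)

hypothetical_rate = 50_000  # records/second per worker, placeholder value
print(f"One machine: {days_on_one_machine(hypothetical_rate):.2f} days")
print(f"Workers for a 1-hour batch window: {workers_needed(hypothetical_rate, 3600)}")
```

Treat the worker count as a lower bound: it ignores memory limits and any work that cannot be parallelized.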

## 🏗️ Hadoop Ecosystem Demo

Let's explore the foundational concepts of the Hadoop ecosystem, including distributed file storage concepts and the MapReduce programming paradigm applied to banking data.

In [None]:
# Demonstrate MapReduce paradigm with banking fraud detection
def mapreduce_fraud_detection_demo(transactions_data):
    """
    Demonstrate MapReduce thinking for fraud detection
    
    MapReduce Pattern:
    1. MAP: Extract key-value pairs from each transaction
    2. SHUFFLE & SORT: Group by key
    3. REDUCE: Aggregate values for each key
    """
    
    print("🗺️  MapReduce Fraud Detection Demo")
    print("="*50)
    
    # STEP 1: MAP PHASE
    print("1️⃣ MAP Phase: Extract suspicious patterns from each transaction")
    
    mapped_data = []
    
    for transaction in transactions_data[:1000]:  # Demo with first 1000
        # Map function: identify potential fraud indicators
        amount = transaction['amount']
        hour = pd.to_datetime(transaction['timestamp']).hour
        
        # Fraud indicators as key-value pairs
        flagged = False
        
        if amount > 1000:  # High value transaction
            mapped_data.append(("high_value_fraud", 1))
            flagged = True
        
        if hour in [0, 1, 2, 3, 4, 5]:  # Late night transaction
            mapped_data.append(("late_night_fraud", 1))
            flagged = True
        
        if transaction['channel'] == 'atm' and amount > 500:
            mapped_data.append(("suspicious_atm", 1))
            flagged = True
        
        # Count a transaction as normal only when no indicator fired
        if not flagged:
            mapped_data.append(("normal_transaction", 1))
    
    print(f"   📊 Generated {len(mapped_data)} key-value pairs")
    print(f"   🔍 Sample mappings: {mapped_data[:10]}")
    
    # STEP 2: SHUFFLE & SORT PHASE
    print("\n2️⃣ SHUFFLE & SORT Phase: Group by fraud type")
    
    from collections import defaultdict
    grouped_data = defaultdict(list)
    
    for key, value in mapped_data:
        grouped_data[key].append(value)
    
    print(f"   📚 Grouped into {len(grouped_data)} categories")
    for key, values in grouped_data.items():
        print(f"   🏷️  {key}: {len(values)} occurrences")
    
    # STEP 3: REDUCE PHASE
    print("\n3️⃣ REDUCE Phase: Calculate final fraud statistics")
    
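    # Use the built-in sum explicitly: the wildcard import from
    # pyspark.sql.functions rebinds the bare name sum() to Spark's column aggregate
    import builtins
    fraud_summary = {}
    for key, values in grouped_data.items():
        fraud_summary[key] = builtins.sum(values)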
    
    print(f"   📈 Fraud Detection Results:")
    for fraud_type, count in fraud_summary.items():
        percentage = (count / len(transactions_data[:1000])) * 100
        print(f"      {fraud_type}: {count} ({percentage:.1f}%)")
    
    return fraud_summary

# Run MapReduce demo
fraud_results = mapreduce_fraud_detection_demo(small_transactions)

# Visualize MapReduce results
fig, ax = plt.subplots(1, 1, figsize=(12, 6))

fraud_types = list(fraud_results.keys())
fraud_counts = list(fraud_results.values())

colors = ['red' if 'fraud' in ft or 'suspicious' in ft else 'green' for ft in fraud_types]

bars = ax.bar(fraud_types, fraud_counts, color=colors, alpha=0.7)
ax.set_title('MapReduce Fraud Detection Results\nDistributed Pattern Recognition', fontsize=14, fontweight='bold')
ax.set_xlabel('Fraud Pattern Type')
ax.set_ylabel('Number of Occurrences')
ax.tick_params(axis='x', rotation=45)

# Add value labels on bars
for bar in bars:
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2., height + 0.5,
           f'{int(height)}', ha='center', va='bottom')

plt.tight_layout()
plt.show()

print("\n💡 MapReduce Key Concepts (simulated here on a single machine):")
print("✅ Independent map step: each transaction is processed on its own")
print("✅ Scalability: the same logic can be spread over millions of transactions")
print("✅ Fault tolerance: on a real cluster, failed tasks are rerun on other nodes")
print("✅ Data locality: on a real cluster, map tasks run where the data blocks are stored")

# Compare with traditional approach
print("\n⚖️ Traditional vs MapReduce Approach:")
print("🐌 Traditional: Process all data on single machine")
print("🚀 MapReduce: Distribute processing across cluster")
print("💾 Traditional: Limited by single machine memory")
print("🌐 MapReduce: Scales with cluster size")
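
The demo above runs all three phases in one loop on one machine. To show why the pattern distributes well, the sketch below splits the records into partitions, maps and pre-aggregates each partition independently (the role a combiner plays in Hadoop), and then merges the partial results. It uses only the standard library. The records, `split_into_partitions`, `map_partition`, and `merge_counts` are illustrative, and the partitions are processed one after another here rather than on separate nodes.

```python
from collections import Counter
from functools import reduce

# Tiny illustrative dataset: (channel, amount) pairs
records = [
    ("card", 42.50), ("atm", 800.00), ("online", 1500.00), ("atm", 120.00),
    ("mobile", 15.99), ("card", 2300.00), ("atm", 650.00), ("online", 75.00),
]

def split_into_partitions(items, num_partitions):
    """Round-robin split, standing in for blocks stored on different nodes."""
    return [items[i::num_partitions] for i in range(num_partitions)]

def map_partition(partition):
    """MAP plus local combine: count indicators within one partition."""
    counts = Counter()
    for channel, amount in partition:
        if amount > 1000:
            counts["high_value"] += 1
        if channel == "atm" and amount > 500:
            counts["suspicious_atm"] += 1
    return counts

def merge_counts(left, right):
    """REDUCE: combine two partial results; the order does not matter."""
    return left + right

partials = [map_partition(p) for p in split_into_partitions(records, 3)]
totals = reduce(merge_counts, partials, Counter())
print(dict(totals))
```

Because each partition is mapped without reference to the others and the merge is associative, the partial counts could be computed on different machines and combined in any order with the same result.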

## 🔥 Apache Spark Core Concepts

Now let's dive deep into Spark's core concepts: RDDs, DataFrames, lazy evaluation, and in-memory computing. We'll demonstrate these with practical banking examples.
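
Lazy evaluation can feel abstract before seeing it in Spark. As a rough analogy only (Python generators are not how Spark is implemented, and they lack its query optimizer), the sketch below builds a pipeline whose steps record when they actually run. Nothing executes until a consuming step, comparable to a Spark action, pulls data through. All names here are illustrative.

```python
executed = []  # records which steps actually ran, in order

def read_amounts(amounts):
    """Source step: yields amounts one at a time and logs each read."""
    for amount in amounts:
        executed.append(f"read {amount}")
        yield amount

def keep_high_value(amounts, threshold):
    """Filter step: passes on only amounts above the threshold."""
    for amount in amounts:
        if amount > threshold:
            executed.append(f"keep {amount}")
            yield amount

# "Transformations": building the pipeline runs none of the code above
pipeline = keep_high_value(read_amounts([50, 150, 75, 300]), threshold=100)
steps_before_action = len(executed)

# "Action": consuming the generator finally triggers the work
high_value = list(pipeline)

print(f"steps run before the action: {steps_before_action}")
print(f"result: {high_value}")
print(f"execution order: {executed}")
```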

In [None]:
# Demonstrate Spark Core Concepts with Banking Data

print("🔥 Apache Spark Core Concepts Demo")
print("="*50)

# 1. RDDs (Resilient Distributed Datasets) - The Foundation
print("\n1️⃣ RDDs: Resilient Distributed Datasets")
print("-" * 40)

# Create RDD from transaction data
transaction_amounts = [t['amount'] for t in small_transactions[:1000]]
amounts_rdd = spark.sparkContext.parallelize(transaction_amounts)

print(f"✅ Created RDD with {amounts_rdd.count()} transaction amounts")
print(f"📊 Partitions: {amounts_rdd.getNumPartitions()}")
print(f"🎯 Sample amounts: {amounts_rdd.take(5)}")

# RDD transformations are lazy: these two lines only record lineage, nothing runs yet
high_value_rdd = amounts_rdd.filter(lambda x: x > 100)
squared_amounts_rdd = high_value_rdd.map(lambda x: x * x)

# Actions such as count(), take() and reduce() trigger the actual computation
print(f"\n🔄 RDD Transformations (lazy), evaluated by actions:")
print(f"   📈 High value transactions: {high_value_rdd.count()}")
print(f"   ⚡ Squared amounts sample: {squared_amounts_rdd.take(3)}")

total_high_value = high_value_rdd.reduce(lambda x, y: x + y)
print(f"   💰 Total high value amount: €{total_high_value:,.2f}")

# 2. DataFrames - Structured Data Processing
print("\n2️⃣ DataFrames: Structured Data with Schema")
print("-" * 45)

# Create DataFrame from transactions
df_transactions = spark.createDataFrame(small_transactions[:1000])

# DataFrame schema
print("📋 DataFrame Schema:")
df_transactions.printSchema()

# DataFrame operations
print("\n🔍 DataFrame Analysis:")
print(f"   📊 Total rows: {df_transactions.count()}")
print(f"   📈 Columns: {len(df_transactions.columns)}")

# Statistical summary
print("\n📈 Statistical Summary of Transaction Amounts:")
df_transactions.select("amount").describe().show()

# 3. Lazy Evaluation Demonstration
print("\n3️⃣ Lazy Evaluation: Building Execution Plan")
print("-" * 45)

# Build complex transformation chain (no execution yet)
processed_df = df_transactions \
    .filter(col("amount") > 50) \
    .withColumn("amount_category", 
               when(col("amount") < 100, "small")
               .when(col("amount") < 500, "medium")
               .otherwise("large")) \
    .groupBy("merchant_category", "amount_category") \
    .agg(count("*").alias("transaction_count"),
         avg("amount").alias("avg_amount")) \
    .orderBy("merchant_category", "amount_category")

print("🔄 Transformation chain created (not executed yet)")
print("📋 Execution plan:")
processed_df.explain(True)

# Trigger execution
print("\n🚀 Triggering execution with .show():")
processed_df.show(10, truncate=False)

# 4. In-Memory Computing with Caching
print("\n4️⃣ In-Memory Computing: Caching for Performance")
print("-" * 50)

# Create a DataFrame that will be used multiple times
active_df = df_transactions.filter(col("amount") > 10)

# Time without caching: each action recomputes the filter from the source data
start_time = time.time()
count1 = active_df.count()
avg_amount1 = active_df.select(avg("amount")).collect()[0][0]
max_amount1 = active_df.select(max("amount")).collect()[0][0]
time_without_cache = time.time() - start_time

print(f"⏱️ Operations without cache: {time_without_cache:.3f} seconds")

# Mark the DataFrame for caching (lazy: nothing is stored until the next action)
active_df.cache()
print("💾 DataFrame marked for caching")

# The first action after cache() materializes the cache, so time it separately
start_time = time.time()
active_df.count()
cache_build_time = time.time() - start_time

# Repeat the same three actions so the comparison is like for like
start_time = time.time()
count2 = active_df.count()
avg_amount2 = active_df.select(avg("amount")).collect()[0][0]
max_amount2 = active_df.select(max("amount")).collect()[0][0]
time_with_cache = time.time() - start_time

print(f"⏱️ Cache building time: {cache_build_time:.3f} seconds")
print(f"⚡ Operations with cache: {time_with_cache:.3f} seconds")

# With only 1,000 rows the difference is mostly noise; the effect grows with data size.
# A plain comparison replaces max() here because that name is bound to Spark's max().
if time_without_cache > 0:
    safe_cached_time = time_with_cache if time_with_cache > 0.001 else 0.001
    speedup = time_without_cache / safe_cached_time
    print(f"🚀 Speedup from caching: {speedup:.1f}x")

# Verify that cached and uncached results agree
print(f"\n📊 Results verification:")
print(f"   Count: {count1} vs {count2} ({'✅' if count1 == count2 else '❌'})")
print(f"   Avg amount: €{avg_amount1:.2f} vs €{avg_amount2:.2f}")
print(f"   Max amount: €{max_amount1:.2f} vs €{max_amount2:.2f}")

# Clean up cache
active_df.unpersist()
print("🧹 Cache cleared")

# 5. Spark SQL Integration
print("\n5️⃣ Spark SQL: SQL Interface to DataFrames")
print("-" * 45)

# Register DataFrame as SQL table
df_transactions.createOrReplaceTempView("banking_transactions")

# Use SQL to query the data
sql_result = spark.sql("""
    SELECT 
        merchant_category,
        COUNT(*) as transaction_count,
        AVG(amount) as avg_amount,
        MAX(amount) as max_amount,
        MIN(amount) as min_amount
    FROM banking_transactions
    WHERE amount > 20
    GROUP BY merchant_category
    ORDER BY transaction_count DESC
    LIMIT 5
""")

print("💻 SQL Query Results:")
sql_result.show(truncate=False)

# 6. Partitioning and Distribution
print("\n6️⃣ Data Partitioning: Distribution Strategy")
print("-" * 45)

print(f"📊 Original partitions: {df_transactions.rdd.getNumPartitions()}")

# Repartition data
repartitioned_df = df_transactions.repartition(4, col("merchant_category"))
print(f"🔄 After repartitioning: {repartitioned_df.rdd.getNumPartitions()}")

# Show partition distribution
partition_info = repartitioned_df.rdd.mapPartitionsWithIndex(
    lambda idx, iterator: [(idx, len(list(iterator)))]
).collect()

print("📈 Records per partition:")
for partition_id, record_count in partition_info:
    print(f"   Partition {partition_id}: {record_count} records")

print("\n🎯 Key Spark Concepts Demonstrated:")
print("✅ RDDs: Low-level distributed data structures")
print("✅ DataFrames: Structured data with schema optimization")
print("✅ Lazy Evaluation: Builds execution plan before running")
print("✅ In-Memory Computing: Caching for iterative algorithms")
print("✅ Spark SQL: SQL interface for familiar querying")
print("✅ Partitioning: Data distribution across cluster")
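
Repartitioning by a column sends every row with the same key to the same partition, by hashing the key and taking the result modulo the partition count. The sketch below imitates that idea with `zlib.crc32` as a stand-in hash (Spark uses its own hash function, so actual partition numbers will differ). It also lets you see how a low-cardinality key such as `merchant_category` can give uneven partitions. `partition_for` and the row counts are illustrative.

```python
import zlib
from collections import Counter

def partition_for(key, num_partitions):
    """Deterministic partition index for a string key (stand-in hash: CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

categories = ["Grocery", "Online Shopping", "Transport", "Gas Station",
              "Electronics", "Fashion", "Fast Food", "ATM Withdrawal"]

# Illustrative rows: the category at index i appears (i + 1) * 10 times
rows = [cat for i, cat in enumerate(categories) for _ in range((i + 1) * 10)]

rows_per_partition = Counter(partition_for(cat, 4) for cat in rows)
for partition_id in range(4):
    print(f"Partition {partition_id}: {rows_per_partition[partition_id]} rows")
```

With only a handful of distinct keys, some partitions may receive several categories and others none, which is one reason to inspect partition sizes after repartitioning by a column.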

## 🎯 Lab 01 Summary & Key Takeaways

### What We've Learned

**Big Data Volume Challenges:**
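- Single-machine tools (Pandas) must hold the working data in RAM and often need several times the raw data size while processing
- Real banking volumes (1B+ transactions/day) require distributed solutions
- Once data outgrows memory, single-machine tools slow down sharply (swapping) or fail with out-of-memory errors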

**Technology Comparison:**
- **Pandas:** Great for small-medium datasets, familiar interface, single-machine limited
- **Spark:** Designed for Big Data, distributed processing, in-memory computing, unified platform

**Hadoop Ecosystem:**
- **MapReduce:** Foundational distributed computing paradigm
- **HDFS:** Distributed file system for massive datasets
- **Fault Tolerance:** Automatic recovery from hardware failures

**Spark Advantages:**
- **RDDs:** Fault-tolerant distributed datasets
- **DataFrames:** Structured data with schema optimization
- **Lazy Evaluation:** Optimized execution plans
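- **In-Memory Computing:** Avoids repeated disk I/O; often cited as 10-100x faster than disk-based MapReduce for iterative workloads, though actual gains depend on the job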
- **Unified Platform:** Batch, streaming, ML, and graph processing

### Real-World Banking Applications

1. **Fraud Detection:** Process millions of transactions in real-time
2. **Customer Analytics:** Segment customers using behavioral patterns
3. **Risk Assessment:** Monte Carlo simulations with billions of scenarios
4. **Regulatory Reporting:** Process historical data for compliance

### Next Steps

- **Lab 02:** Dive deeper into Spark SQL for complex analytics
- **Lab 03:** Build real-time fraud detection systems
- **Lab 04:** Customer segmentation with machine learning
- **Lab 05:** Data integration and ETL pipelines

### 🔧 Clean Up Resources

In [None]:
# Clean up Spark resources
print("🧹 Cleaning up resources...")

# Stop Spark session
spark.stop()

print("✅ Spark session stopped")
print("💡 Lab 01 completed successfully!")

print("\n📚 Additional Resources:")
print("   📖 Spark Documentation: https://spark.apache.org/docs/latest/")
print("   🎓 Banking Analytics Best Practices: See Day 2 materials")
print("   🔧 Performance Tuning: Check Spark UI during exercises")
print("\n🎉 Ready for Lab 02: Spark SQL for Banking Analytics!")