# Lab 01.2: Data Variety & Multi-Format Processing
## Big Data Analytics Workshop - Banking Use Cases

### 🎯 **Learning Objectives**
After completing this lab, you will understand:
- The "Variety" dimension of Big Data (structured, semi-structured, unstructured data)
- How different data formats impact processing strategies
- Spark's unified approach to multi-format data processing
- Real-world banking data integration challenges
- Performance implications of different data formats

### 📊 **Banking Context: Multi-Format Data Reality**
Modern banks process data from multiple sources and formats:
- **Structured**: Core banking systems, transaction databases
- **Semi-structured**: API responses, configuration files, logs
- **Unstructured**: Customer emails, chat logs, regulatory documents

### 🛠 **Technical Skills Developed**
- Multi-format data ingestion with Spark
- Schema inference and evolution
- Data format conversion and optimization
- Performance comparison across formats
- Banking data integration patterns

---

In [None]:
# Environment Setup and Imports
import os
import sys
import time
import json
import pandas as pd
import numpy as np
from pyspark.sql import SparkSession
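from pyspark.sql.types import *
# Note: this star import rebinds names such as sum, max, min, abs, and round
# to Spark column functions for the rest of the notebook.
from pyspark.sql.functions import *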
import matplotlib.pyplot as plt
import seaborn as sns

# Add utils to path
sys.path.append('../utils')
from banking_data_generator import BankingDataGenerator
from performance_monitor import PerformanceMonitor

print("🚀 Lab 01.2: Data Variety & Multi-Format Processing")
print("=" * 60)
print("📊 Focus: Structured, Semi-structured, and Unstructured Data")
print("🏦 Context: Banking Data Integration")
print("=" * 60)

In [None]:
# Initialize Spark Session with Multi-Format Support
spark = SparkSession.builder \
    .appName("Banking-Data-Variety-Lab") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
    .config("spark.sql.execution.arrow.pyspark.enabled", "true") \
    .config("spark.driver.memory", "4g") \
    .config("spark.driver.maxResultSize", "2g") \
    .getOrCreate()

# Set log level to reduce noise
spark.sparkContext.setLogLevel("WARN")

print("✅ Spark Session Initialized")
print(f"🎯 Spark Version: {spark.version}")
print(f"💻 Application Name: {spark.sparkContext.appName}")
print(f"🧮 Available Cores: {spark.sparkContext.defaultParallelism}")

# Initialize utilities
data_generator = BankingDataGenerator(seed=42)
performance_monitor = PerformanceMonitor()

print("\n🛠 Utilities Ready:")
print("   📊 Banking Data Generator")
print("   📈 Performance Monitor")

## 1. Understanding Data Variety in Banking

### 📚 **The Three Types of Data**

**Structured Data** 🗃️
- Fixed schema (rows/columns)
- Examples: Transaction tables, customer records, account balances
- Formats: CSV, Parquet, SQL databases
- ~20% of banking data

**Semi-Structured Data** 📋
- Flexible schema with some organization
- Examples: JSON API responses, XML files, log files
- Formats: JSON, XML, YAML, Avro
- ~10% of banking data

**Unstructured Data** 📄
- No predefined schema
- Examples: Customer emails, chat transcripts, documents, images
- Formats: Text files, PDFs, emails, audio, video
- ~70% of banking data

### 🏦 **Banking Data Integration Challenge**
Banks must combine all three types to get complete customer insights!
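
To make the three categories concrete, the sketch below parses one hypothetical transaction in each form using only the Python standard library. The field names and sample values are illustrative assumptions and are not output of the workshop's `BankingDataGenerator`.

```python
import csv
import io
import json
import re

# Hypothetical sample records; field names and values are illustrative assumptions.
structured = "transaction_id,customer_id,amount\nT001,C042,125.50\n"
semi_structured = '{"transaction_id": "T001", "customer": {"id": "C042"}, "amount": 125.5}'
unstructured = "Customer C042 called about a 125.50 EUR charge on transaction T001."

# Structured: a fixed header tells us exactly which columns exist.
row = next(csv.DictReader(io.StringIO(structured)))

# Semi-structured: self-describing, but fields can be nested or optional.
doc = json.loads(semi_structured)

# Unstructured: no schema at all, so fields must be extracted with patterns.
match = re.search(r"Customer (\w+) .*? ([\d.]+) EUR", unstructured)

customer_ids = {
    "structured": row["customer_id"],
    "semi_structured": doc["customer"]["id"],
    "unstructured": match.group(1),
}
amounts = [float(row["amount"]), doc["amount"], float(match.group(2))]
print(customer_ids)
print(amounts)
```

The same customer ID and amount are recoverable from all three, but the effort shifts from reading a column, to navigating a nested key, to writing a pattern that may break when the wording changes.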

In [None]:
# Generate Sample Data in Multiple Formats
print("🔄 Generating banking data in multiple formats...")
print("=" * 50)

# 1. Generate core datasets
customers = data_generator.generate_customers(5000)
transactions = data_generator.generate_transactions(50000, customers)
market_data = data_generator.generate_market_data(365)
unstructured_data = data_generator.generate_unstructured_data(1000)

print(f"✅ Generated datasets:")
print(f"   👥 Customers: {len(customers):,}")
print(f"   💳 Transactions: {len(transactions):,}")
print(f"   📈 Market Data: {len(market_data):,}")
print(f"   📄 Unstructured: {len(unstructured_data):,}")

# 2. Create sample directory
sample_dir = "../data/variety_sample"
os.makedirs(sample_dir, exist_ok=True)
print(f"\n📁 Sample data directory: {sample_dir}")

In [None]:
# Save Data in Different Formats
print("💾 Saving data in multiple formats...")

# Convert to DataFrames for easier handling
customers_df = pd.DataFrame(customers)
transactions_df = pd.DataFrame(transactions)
market_df = pd.DataFrame(market_data)

# 1. STRUCTURED DATA FORMATS
print("\n1️⃣ Structured Data Formats:")

# CSV Format
customers_df.to_csv(f"{sample_dir}/customers.csv", index=False)
transactions_df.to_csv(f"{sample_dir}/transactions.csv", index=False)
print("   ✅ CSV files saved")

# Parquet Format (columnar, optimized)
customers_df.to_parquet(f"{sample_dir}/customers.parquet")
transactions_df.to_parquet(f"{sample_dir}/transactions.parquet")
print("   ✅ Parquet files saved")

# 2. SEMI-STRUCTURED DATA FORMATS
print("\n2️⃣ Semi-Structured Data Formats:")

# JSON Format
with open(f"{sample_dir}/customers.json", 'w') as f:
    json.dump(customers[:100], f, indent=2, default=str)  # Sample subset

with open(f"{sample_dir}/transactions.json", 'w') as f:
    json.dump(transactions[:1000], f, indent=2, default=str)  # Sample subset

with open(f"{sample_dir}/market_data.json", 'w') as f:
    json.dump(market_data, f, indent=2, default=str)
print("   ✅ JSON files saved")

# NDJSON (Newline Delimited JSON) - Common in streaming
with open(f"{sample_dir}/transactions.ndjson", 'w') as f:
    for transaction in transactions[:1000]:
        f.write(json.dumps(transaction, default=str) + '\n')
print("   ✅ NDJSON files saved")

# 3. UNSTRUCTURED DATA FORMATS
print("\n3️⃣ Unstructured Data Formats:")

# Text logs (simulated banking system logs)
with open(f"{sample_dir}/system_logs.txt", 'w') as f:
    for i, record in enumerate(unstructured_data[:500]):
        if record['type'] == 'email':
            f.write(f"[{record['timestamp']}] EMAIL - Customer: {record['customer_id']} - Subject: {record['subject'][:50]}...\n")
        elif record['type'] == 'chat_log':
            f.write(f"[{record['timestamp']}] CHAT - Customer: {record['customer_id']} - Duration: {record['session_duration']}s\n")
        else:
            f.write(f"[{record['timestamp']}] CALL - Customer: {record['customer_id']} - Duration: {record['duration']}s\n")
print("   ✅ Text logs saved")

# Customer feedback (unstructured text)
with open(f"{sample_dir}/customer_feedback.txt", 'w') as f:
    feedback_samples = [
        "Great service! Very satisfied with the new mobile app features.",
        "ATM was out of order for three days. Poor maintenance.",
        "Credit card approval process was very fast and efficient.",
        "Online banking system is slow during peak hours.",
        "Excellent customer support. Representative was very helpful.",
        "Transaction fees are too high compared to other banks.",
        "New security features give me peace of mind.",
        "Mortgage application process needs improvement."
    ]
    for i, feedback in enumerate(feedback_samples * 50):  # Repeat to create larger dataset
        f.write(f"Feedback-{i+1:04d}: {feedback}\n")

print("   ✅ Unstructured feedback saved")

print(f"\n📊 Data files created in: {sample_dir}")

# List all created files with sizes
for file in os.listdir(sample_dir):
    file_path = os.path.join(sample_dir, file)
    size_mb = os.path.getsize(file_path) / (1024 * 1024)
    print(f"   📄 {file}: {size_mb:.2f} MB")

## 2. Structured Data Processing with Spark

### 📋 **CSV vs Parquet: The Great Format Debate**

**CSV Characteristics:**
- ✅ Human-readable
- ✅ Universal compatibility
- ❌ No schema enforcement
- ❌ No built-in compression (files must be compressed externally, e.g. with gzip)
- ❌ Row-based (inefficient for analytics)

**Parquet Characteristics:**
- ✅ Schema enforcement
- ✅ Excellent compression
- ✅ Columnar format (analytics-optimized)
- ✅ Predicate pushdown
- ❌ Not human-readable
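
The "no schema enforcement" point for CSV can be seen without Spark. The following is a minimal sketch using the standard library and hypothetical records: typed values go into a CSV and come back as plain strings, so the reader has to guess or be told the types.

```python
import csv
import io

# Hypothetical typed records (illustrative values).
records = [
    {"transaction_id": "T001", "amount": 125.5, "is_flagged": False},
    {"transaction_id": "T002", "amount": 9800.0, "is_flagged": True},
]

# Write to CSV in memory, then read it back.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
writer.writeheader()
writer.writerows(records)
buffer.seek(0)
round_tripped = list(csv.DictReader(buffer))

# Every value comes back as a string: the file itself carries no type information.
types_after = {k: type(v).__name__ for k, v in round_tripped[0].items()}
print(types_after)

# The reader must supply the types; here they are given explicitly.
schema = {"transaction_id": str, "amount": float, "is_flagged": lambda s: s == "True"}
typed = [{k: schema[k](v) for k, v in r.items()} for r in round_tripped]
print(typed == records)
```

Schema inference automates the guessing step, while a format that stores column types alongside the data avoids it.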

In [None]:
# Performance Comparison: CSV vs Parquet
print("🏁 Performance Comparison: CSV vs Parquet")
print("=" * 50)

def read_csv_transactions():
    """Read transactions from CSV format"""
    df = spark.read \
        .option("header", "true") \
        .option("inferSchema", "true") \
        .csv(f"{sample_dir}/transactions.csv")
    
    # Perform analytics operation
    result = df.groupBy("merchant_category") \
               .agg(count("*").alias("transaction_count"),
                   sum("amount").alias("total_amount"),
                   avg("amount").alias("avg_amount")) \
               .orderBy(desc("total_amount"))
    
    return result.collect()

def read_parquet_transactions():
    """Read transactions from Parquet format"""
    df = spark.read.parquet(f"{sample_dir}/transactions.parquet")
    
    # Same analytics operation
    result = df.groupBy("merchant_category") \
               .agg(count("*").alias("transaction_count"),
                   sum("amount").alias("total_amount"),
                   avg("amount").alias("avg_amount")) \
               .orderBy(desc("total_amount"))
    
    return result.collect()

# Run performance comparison
approaches = [
    {
        "name": "CSV Format",
        "function": read_csv_transactions
    },
    {
        "name": "Parquet Format", 
        "function": read_parquet_transactions
    }
]

print("🔄 Running format comparison...")
for approach in approaches:
    result = performance_monitor.benchmark_function(
        approach["function"],
        approach["name"]
    )
    
    if result["result"]:
        print(f"   📊 {approach['name']}: {len(result['result'])} categories analyzed")

print("\n💡 Key Insights:")
print("   • Parquet is typically faster for analytics queries (2-5x is a rule of thumb; compare the timings above)")
print("   • Parquet files are usually much smaller due to compression (often 60-80%, depending on the data)")
print("   • CSV better for data exchange and debugging")
print("   • Parquet optimized for column-oriented operations")

In [None]:
# Schema Analysis and Enforcement
print("🔍 Schema Analysis: CSV vs Parquet")
print("=" * 40)

# Load both formats
csv_df = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .csv(f"{sample_dir}/transactions.csv")

parquet_df = spark.read.parquet(f"{sample_dir}/transactions.parquet")

print("📋 CSV Schema (inferred):")
csv_df.printSchema()

print("\n📋 Parquet Schema (stored):")
parquet_df.printSchema()

print("\n🔬 Schema Comparison:")
csv_schema = {field.name: str(field.dataType) for field in csv_df.schema.fields}
parquet_schema = {field.name: str(field.dataType) for field in parquet_df.schema.fields}

schema_matches = all(csv_schema.get(field) == parquet_schema.get(field) for field in parquet_schema.keys())
print(f"   Schema consistency: {'✅ Match' if schema_matches else '❌ Mismatch'}")

# Show data type differences if any
print("\n📊 Data Types:")
for field in parquet_df.schema.fields:
    csv_type = csv_schema.get(field.name, "Missing")
    parquet_type = str(field.dataType)
    status = "✅" if csv_type == parquet_type else "⚠️"
    print(f"   {status} {field.name}: CSV={csv_type}, Parquet={parquet_type}")

# Summarize the schema takeaways
print(f"\n💡 Key Points:")
print(f"   • CSV schema inference can be inconsistent")
print(f"   • Parquet enforces schema at write time")
print(f"   • Type mismatches can cause processing errors")
print(f"   • Schema evolution easier with Parquet")

## 3. Semi-Structured Data: JSON Processing

### 🔧 **JSON in Banking: API Responses & Configuration**

JSON is everywhere in modern banking:
- **API responses** from payment processors
- **Configuration files** for banking applications  
- **Log entries** from web services
- **Mobile app** transaction data

### 💡 **Spark's JSON Capabilities**
- Automatic schema inference
- Nested structure handling
- Complex data type support
- Integration with DataFrame API
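
Two details matter when reading JSON: whether the file is a single multi-line document or one object per line (NDJSON), and how nested fields map to columns. The sketch below illustrates both in plain Python with hypothetical records; the `flatten` helper is an illustrative assumption and not part of Spark.

```python
import json

# Hypothetical nested API-style records (illustrative values).
records = [
    {"transaction_id": "T001", "amount": 125.5, "merchant": {"name": "Cafe", "category": "food"}},
    {"transaction_id": "T002", "amount": 40.0, "merchant": {"name": "Metro", "category": "transport"}},
]

# One pretty-printed JSON array: a single document spanning many lines.
array_text = json.dumps(records, indent=2)

# NDJSON: one complete JSON object per line, so each line parses on its own.
ndjson_text = "\n".join(json.dumps(r) for r in records)

def flatten(obj, prefix=""):
    """Flatten nested dicts into dotted column names (e.g. merchant.name)."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat

from_array = [flatten(r) for r in json.loads(array_text)]
from_ndjson = [flatten(json.loads(line)) for line in ndjson_text.splitlines()]
print(sorted(from_array[0]))
print(from_array == from_ndjson)

# A single line of the pretty-printed array is not valid JSON by itself.
first_line = array_text.splitlines()[0]
try:
    json.loads(first_line)
    line_parses = True
except json.JSONDecodeError:
    line_parses = False
print(line_parses)
```

This is why Spark's JSON reader, which treats input as line-delimited by default, needs its `multiLine` option for pretty-printed files.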

In [None]:
# JSON Data Processing with Spark
print("📄 Processing JSON Banking Data")
print("=" * 40)

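# 1. Read JSON transaction data
# The file was written with json.dump(..., indent=2), i.e. one pretty-printed array,
# so multiLine is required; the default reader expects one JSON object per line.
json_df = spark.read.option("multiLine", "true").json(f"{sample_dir}/transactions.json")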

print("📋 JSON Schema (auto-inferred):")
json_df.printSchema()

print(f"\n📊 JSON DataFrame Info:")
print(f"   Rows: {json_df.count():,}")
print(f"   Columns: {len(json_df.columns)}")

# 2. Read Market Data JSON (also a pretty-printed array, so multiLine is needed)
market_json_df = spark.read.option("multiLine", "true").json(f"{sample_dir}/market_data.json")

print(f"\n📈 Market Data JSON:")
print(f"   Rows: {market_json_df.count():,}")
print(f"   Columns: {len(market_json_df.columns)}")
market_json_df.show(5, truncate=False)

# 3. Complex JSON Operations
print("\n🔧 JSON Data Transformations:")

# Extract date components from JSON timestamps
json_enhanced = json_df.withColumn("transaction_date", 
                                  to_date(col("timestamp"))) \
                       .withColumn("transaction_hour", 
                                  hour(col("timestamp")))

# Group by date and calculate daily metrics
daily_metrics = json_enhanced.groupBy("transaction_date") \
    .agg(
        count("*").alias("daily_transactions"),
        sum("amount").alias("daily_volume"),
        avg("amount").alias("avg_transaction_amount"),
        countDistinct("customer_id").alias("unique_customers")
    ) \
    .orderBy("transaction_date")

print("📊 Daily Transaction Metrics from JSON:")
daily_metrics.show(10)

# 4. JSON vs Structured Performance
print("\n⚡ Performance: JSON vs Structured Formats")

def process_json_data():
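    # multiLine is needed because the file is a single pretty-printed JSON array
    df = spark.read.option("multiLine", "true").json(f"{sample_dir}/transactions.json")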
    return df.groupBy("merchant_category").count().collect()

def process_parquet_data():
    df = spark.read.parquet(f"{sample_dir}/transactions.parquet")
    return df.groupBy("merchant_category").count().collect()

# Compare processing times
json_result = performance_monitor.benchmark_function(process_json_data, "JSON Processing")
parquet_result = performance_monitor.benchmark_function(process_parquet_data, "Parquet Processing")

json_time = json_result["metrics"]["execution_time"]
parquet_time = parquet_result["metrics"]["execution_time"]
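# Caveat: transactions.json holds only the first 1,000 transactions, while the
# Parquet file holds all 50,000, so the ratio below is indicative, not like-for-like.
performance_ratio = json_time / parquet_time

print(f"\n📈 Performance Comparison:")
print(f"   JSON Time: {json_time:.4f}s")
print(f"   Parquet Time: {parquet_time:.4f}s") 
print(f"   JSON/Parquet time ratio: {performance_ratio:.2f}x")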

In [None]:
# NDJSON Processing (Common for Streaming)
print("🌊 NDJSON Processing for Streaming Data")
print("=" * 45)

# NDJSON (Newline Delimited JSON) is common in streaming/logging systems
ndjson_df = spark.read.json(f"{sample_dir}/transactions.ndjson")

print(f"📊 NDJSON Data:")
print(f"   Format: One JSON object per line")
print(f"   Use case: Streaming, log processing, real-time analytics")
print(f"   Records: {ndjson_df.count():,}")

# Real-time analytics simulation
print("\n📊 Real-time Banking Analytics from NDJSON:")

# Fraud detection simulation
fraud_indicators = ndjson_df.select(
    "transaction_id",
    "customer_id", 
    "amount",
    "timestamp",
    "channel",
    # Flag high-value transactions
    when(col("amount") > 1000, 1).otherwise(0).alias("high_value_flag"),
    # Flag off-hours transactions
    when(hour(col("timestamp")).isin([22, 23, 0, 1, 2, 3, 4, 5]), 1).otherwise(0).alias("off_hours_flag"),
    # Flag foreign transactions
    when(col("currency") != "EUR", 1).otherwise(0).alias("foreign_currency_flag")
)

# Calculate composite risk score
risk_analysis = fraud_indicators.withColumn(
    "risk_score", 
    col("high_value_flag") + col("off_hours_flag") + col("foreign_currency_flag")
)

# Show high-risk transactions
high_risk = risk_analysis.filter(col("risk_score") >= 2).orderBy(desc("risk_score"))

print("🚨 High-Risk Transactions Detected:")
high_risk.show(10, truncate=False)

print(f"📈 Risk Summary:")
risk_summary = risk_analysis.groupBy("risk_score").count().orderBy("risk_score")
risk_summary.show()

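# Count once and reuse the results instead of triggering repeated Spark jobs
high_risk_count = high_risk.count()
total_count = ndjson_df.count()
high_risk_pct = high_risk_count / total_count * 100 if total_count else 0.0
print(f"⚠️ High-risk transactions: {high_risk_count:,} out of {total_count:,} ({high_risk_pct:.1f}%)")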

## 4. Unstructured Data: Text Analytics

### 📄 **The Challenge of Unstructured Banking Data**

70% of banking data is unstructured:
- **Customer communications** (emails, chat logs)
- **Regulatory documents** (compliance reports)
- **Call center transcripts**
- **Social media mentions**
- **Financial news** and market reports

### 🧠 **Text Analytics for Banking**
- Sentiment analysis of customer feedback
- Topic modeling for customer complaints  
- Named entity recognition for compliance
- Document classification and routing
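
As a first approximation of sentiment analysis, text can be labeled by counting keyword hits. The sketch below is a deliberately simple plain-Python illustration; the keyword lists and the `score_sentiment` helper are assumptions for demonstration, and production systems would normally use a trained model.

```python
# Illustrative keyword lists (assumptions, not a validated lexicon).
POSITIVE = ("great", "excellent", "fast", "helpful")
NEGATIVE = ("poor", "slow", "too high", "out of order")

def score_sentiment(text):
    """Label text by comparing counts of positive and negative keyword hits."""
    lowered = text.lower()
    positive_hits = sum(1 for word in POSITIVE if word in lowered)
    negative_hits = sum(1 for word in NEGATIVE if word in lowered)
    if positive_hits > negative_hits:
        return "positive"
    if negative_hits > positive_hits:
        return "negative"
    return "neutral"

feedback = [
    "Excellent customer support. Representative was very helpful.",
    "ATM was out of order for three days. Poor maintenance.",
    "Mortgage application process needs improvement.",
]
labels = [score_sentiment(text) for text in feedback]
print(labels)
```

Substring matching is cheap, but it ignores negation ("not fast") and matches inside longer words, so results should be treated as a rough signal.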

In [None]:
# Text Analytics on Banking Data
print("📄 Processing Unstructured Banking Text Data")
print("=" * 50)

# 1. Load system logs as text data
logs_df = spark.read.text(f"{sample_dir}/system_logs.txt")
print(f"📋 System Logs:")
print(f"   Total lines: {logs_df.count():,}")
logs_df.show(5, truncate=False)

# 2. Parse log entries using regex
from pyspark.sql.functions import regexp_extract, split, trim

parsed_logs = logs_df.select(
    regexp_extract("value", r'\[(.*?)\]', 1).alias("timestamp"),
    regexp_extract("value", r'\] (\w+) -', 1).alias("event_type"),
    regexp_extract("value", r'Customer: (\w+)', 1).alias("customer_id"),
    regexp_extract("value", r'Duration: (\d+)', 1).cast("integer").alias("duration"),
    "value"
).filter(col("customer_id") != "")

print("\n📊 Parsed Log Entries:")
parsed_logs.show(5, truncate=False)

# 3. Analyze log patterns
print("\n📈 Log Analysis:")
event_counts = parsed_logs.groupBy("event_type").count().orderBy(desc("count"))
print("Event Type Distribution:")
event_counts.show()

# Customer activity analysis
customer_activity = parsed_logs.groupBy("customer_id") \
    .agg(count("*").alias("total_events"),
         countDistinct("event_type").alias("event_types"),
         avg("duration").alias("avg_duration")) \
    .orderBy(desc("total_events"))

print("Top Active Customers:")
customer_activity.show(10)

# 4. Customer Feedback Sentiment Analysis (Simple)
feedback_df = spark.read.text(f"{sample_dir}/customer_feedback.txt")

print(f"\n💬 Customer Feedback Analysis:")
print(f"   Total feedback entries: {feedback_df.count():,}")

# Simple sentiment analysis using keyword matching
positive_keywords = ["great", "excellent", "satisfied", "fast", "efficient", "helpful", "peace"]
negative_keywords = ["poor", "slow", "out of order", "too high", "needs improvement"]

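# Create sentiment flags
# Note: `from pyspark.sql.functions import *` rebinds `sum` to Spark's aggregate,
# which does not accept a Python list, so the per-keyword flag columns are
# added together with reduce() instead.
from functools import reduce
from operator import add

sentiment_analysis = feedback_df.select(
    "value",
    # Count positive keywords present in the text
    reduce(add, [when(lower(col("value")).contains(word), 1).otherwise(0) for word in positive_keywords]).alias("positive_score"),
    # Count negative keywords present in the text
    reduce(add, [when(lower(col("value")).contains(word), 1).otherwise(0) for word in negative_keywords]).alias("negative_score")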
).withColumn(
    "sentiment",
    when(col("positive_score") > col("negative_score"), "positive")
    .when(col("positive_score") < col("negative_score"), "negative") 
    .otherwise("neutral")
)

print("📊 Sentiment Analysis Results:")
sentiment_summary = sentiment_analysis.groupBy("sentiment").count()
sentiment_summary.show()

# Show examples of each sentiment
print("\n💡 Sample Feedback by Sentiment:")
for sentiment in ["positive", "negative", "neutral"]:
    print(f"\n{sentiment.upper()} Examples:")
    examples = sentiment_analysis.filter(col("sentiment") == sentiment).limit(3)
    for row in examples.collect():
        print(f"   • {row.value[:80]}...")

# 5. Topic Analysis (Simple keyword extraction)
print("\n🏷 Topic Analysis - Key Themes:")

# Define banking topics
topics = {
    "mobile_app": ["mobile", "app", "application"],
    "atm_services": ["atm", "cash", "withdrawal", "machine"],
    "customer_service": ["service", "support", "representative", "help"],
    "fees_charges": ["fee", "charge", "cost", "expensive", "high"],
    "online_banking": ["online", "website", "internet", "digital"],
    "credit_mortgage": ["credit", "mortgage", "loan", "approval"],
    "security": ["security", "fraud", "safe", "protection"]
}

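# Calculate topic scores
# reduce() adds the per-keyword flag columns; the star import rebinds `sum`
# to Spark's aggregate function, which does not accept a Python list.
from functools import reduce
from operator import add

topic_scores = feedback_df
for topic, keywords in topics.items():
    topic_scores = topic_scores.withColumn(
        f"{topic}_score",
        reduce(add, [when(lower(col("value")).contains(word), 1).otherwise(0) for word in keywords])
    )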

# Aggregate topic mentions
topic_summary = topic_scores.select(*[sum(col(f"{topic}_score")).alias(topic) for topic in topics.keys()])

print("📊 Banking Topics Mentioned:")
topic_results = topic_summary.collect()[0].asDict()
sorted_topics = sorted(topic_results.items(), key=lambda x: x[1], reverse=True)

# Use `mentions` rather than `count` so Spark's count() function is not overwritten
for topic, mentions in sorted_topics:
    print(f"   {topic.replace('_', ' ').title()}: {mentions} mentions")

## 5. Multi-Format Data Integration

### 🔗 **The Banking Data Integration Challenge**

Real banking analytics requires combining:
- **Transaction data** (structured) from core banking systems
- **Market data** (semi-structured) from external APIs
- **Customer feedback** (unstructured) from various channels
- **System logs** (semi-structured) for operational insights

### 💡 **Spark's Unified Data Processing**
Spark provides a single API to:
- Read from multiple data sources
- Apply consistent transformations
- Join across different formats
- Output to optimal formats for downstream use
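
The integration pattern is the same regardless of engine: parse each source with the reader suited to its format, aggregate, then join on a shared key. The following plain-Python sketch shows it on hypothetical CSV and NDJSON inputs; all names and values are illustrative assumptions.

```python
import csv
import io
import json

# Hypothetical inputs in two formats (illustrative values).
customers_csv = "customer_id,segment\nC001,retail\nC002,business\nC003,retail\n"
transactions_ndjson = "\n".join([
    '{"customer_id": "C001", "amount": 120.0}',
    '{"customer_id": "C001", "amount": 30.0}',
    '{"customer_id": "C002", "amount": 900.0}',
])

# Read each source with the parser that suits its format.
customers = list(csv.DictReader(io.StringIO(customers_csv)))
transactions = [json.loads(line) for line in transactions_ndjson.splitlines()]

# Aggregate transactions per customer.
totals = {}
for tx in transactions:
    totals[tx["customer_id"]] = totals.get(tx["customer_id"], 0.0) + tx["amount"]

# Left join: keep every customer, even those with no transactions.
customer_view = [
    {**c, "total_spent": totals.get(c["customer_id"])} for c in customers
]
for row in customer_view:
    print(row)
```

Customer `C003` has no transactions and ends up with `total_spent` set to `None`, which mirrors the nulls a left join produces; downstream segmentation logic needs to decide how to treat such rows.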

In [None]:
# Comprehensive Multi-Format Data Integration
print("🔗 Banking Data Integration: Combining All Formats")
print("=" * 55)

# 1. Load all data formats
print("📊 Loading data from multiple formats...")

# Structured data
transactions_df = spark.read.parquet(f"{sample_dir}/transactions.parquet")
customers_df = spark.read.parquet(f"{sample_dir}/customers.parquet")

# Semi-structured data
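# market_data.json is a single pretty-printed array, so multiLine is required
market_data_df = spark.read.option("multiLine", "true").json(f"{sample_dir}/market_data.json")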
logs_parsed_df = parsed_logs  # From previous cell

# Processed unstructured data
feedback_sentiment_df = sentiment_analysis

print("✅ All data sources loaded")
print(f"   📊 Transactions: {transactions_df.count():,} records")
print(f"   👥 Customers: {customers_df.count():,} records")
print(f"   📈 Market Data: {market_data_df.count():,} records")
print(f"   📋 Logs: {logs_parsed_df.count():,} records")
print(f"   💬 Feedback: {feedback_sentiment_df.count():,} records")

# 2. Create a unified customer 360 view
print(f"\n🎯 Creating Unified Customer 360 View...")

# Start with customer base data
customer_360 = customers_df

# Add transaction metrics
transaction_metrics = transactions_df.groupBy("customer_id").agg(
    count("*").alias("total_transactions"),
    sum("amount").alias("total_spent"),
    avg("amount").alias("avg_transaction_amount"),
    max("amount").alias("max_transaction"),
    countDistinct("merchant").alias("unique_merchants"),
    countDistinct("channel").alias("channels_used")
)

customer_360 = customer_360.join(transaction_metrics, "customer_id", "left")

# Add activity metrics from logs
activity_metrics = logs_parsed_df.groupBy("customer_id").agg(
    count("*").alias("system_interactions"),
    countDistinct("event_type").alias("interaction_types"),
    avg("duration").alias("avg_interaction_duration")
)

customer_360 = customer_360.join(activity_metrics, "customer_id", "left")

# Create customer segments based on multiple data sources
customer_360_segmented = customer_360.withColumn(
    "customer_tier",
    when(col("total_spent") > 50000, "VIP")
    .when(col("total_spent") > 20000, "Premium")
    .when(col("total_spent") > 5000, "Standard")
    .otherwise("Basic")
).withColumn(
    "digital_engagement",
    when(col("channels_used") >= 3, "High")
    .when(col("channels_used") >= 2, "Medium")
    .otherwise("Low")
)

print("✅ Customer 360 view created")
print(f"   📊 Integrated customers: {customer_360_segmented.count():,}")

# Show sample of integrated data
print("\n📋 Sample Customer 360 View:")
customer_360_segmented.select(
    "customer_id", "customer_segment", "customer_tier", "digital_engagement",
    "total_spent", "total_transactions", "system_interactions"
).show(10)

# 3. Market-influenced analytics
print("\n📈 Market Data Integration for Risk Analysis...")

# Get market data for risk correlation
market_stats = market_data_df.agg(
    avg("dax_index").alias("avg_dax"),
    avg("market_volatility").alias("avg_volatility"),
    avg("interest_rate_ecb").alias("avg_interest_rate")
).collect()[0]

print(f"📊 Market Context:")
print(f"   Average DAX: {market_stats.avg_dax:.2f}")
print(f"   Average Volatility: {market_stats.avg_volatility:.2f}%")
print(f"   Average Interest Rate: {market_stats.avg_interest_rate:.2f}%")

# Classify customers based on market conditions and behavior
risk_profile = customer_360_segmented.withColumn(
    "market_risk_profile",
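    # avg_volatility is a plain Python float from collect(); wrap the comparison in lit()
    # so it combines with the Column condition as a Spark expression
    when((col("max_transaction") > 10000) & lit(market_stats.avg_volatility > 20), "High Risk")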
    .when((col("total_spent") > 30000) & (col("channels_used") == 1), "Medium Risk")
    .otherwise("Low Risk")
)

print("\n⚠️ Customer Risk Distribution:")
risk_distribution = risk_profile.groupBy("market_risk_profile").count().orderBy(desc("count"))
risk_distribution.show()

# 4. Performance comparison: Integrated vs Single-format queries
print("\n⚡ Performance Analysis: Multi-Format Integration")

def single_format_analysis():
    """Query using only transaction data"""
    return transactions_df.groupBy("customer_id") \
        .agg(sum("amount").alias("total_spent")) \
        .orderBy(desc("total_spent")) \
        .limit(100) \
        .collect()

def integrated_analysis():
    """Query using integrated multi-format data"""
    return risk_profile.select("customer_id", "total_spent", "market_risk_profile") \
        .orderBy(desc("total_spent")) \
        .limit(100) \
        .collect()

# Benchmark both approaches
single_result = performance_monitor.benchmark_function(
    single_format_analysis, 
    "Single Format Analysis"
)

integrated_result = performance_monitor.benchmark_function(
    integrated_analysis,
    "Multi-Format Integration"
)

print(f"\n📊 Integration Performance Impact:")
integration_overhead = integrated_result["metrics"]["execution_time"] / single_result["metrics"]["execution_time"]
print(f"   Integration overhead: {integration_overhead:.2f}x")
print(f"   Trade-off: {integration_overhead:.1f}x the runtime for {len(risk_profile.columns) - len(transactions_df.columns)} additional columns")

# 5. Data Quality Assessment Across Formats
print(f"\n🔍 Data Quality Assessment Across Formats:")

# Completeness is only measured where an expected record count is known
# (365 daily market records); the other sources have no independent baseline,
# so they are reported as 100% by construction.
formats_quality = {
    "Structured (Parquet)": {
        "completeness": 100.0,
        "consistency": "High - Schema enforced",
        "timeliness": "Real-time capable"
    },
    "Semi-Structured (JSON)": {
        "completeness": market_data_df.count() / 365 * 100,  # Expected daily records
        "consistency": "Medium - Schema flexibility",
        "timeliness": "Daily updates"
    },
    "Unstructured (Text)": {
        "completeness": 100.0,
        "consistency": "Low - Free text",
        "timeliness": "Batch processing"
    }
}

for format_type, quality in formats_quality.items():
    print(f"\n   📊 {format_type}:")
    print(f"      Completeness: {quality['completeness']:.1f}%")
    print(f"      Consistency: {quality['consistency']}")
    print(f"      Timeliness: {quality['timeliness']}")

print(f"\n💡 Integration Best Practices:")
print(f"   • Use Parquet for analytical workloads")
print(f"   • JSON for flexible schema evolution")
print(f"   • Process unstructured data into structured insights")
print(f"   • Cache frequently joined datasets")
print(f"   • Monitor data quality across all sources")

## 🎯 Lab Summary & Key Takeaways

### 📊 **What We Accomplished**

1. **Multi-Format Data Processing**: Successfully processed structured (CSV/Parquet), semi-structured (JSON), and unstructured (text) banking data
2. **Performance Analysis**: Compared processing speeds across different formats
3. **Schema Management**: Understood schema inference, enforcement, and evolution
4. **Text Analytics**: Applied basic NLP techniques to customer feedback
5. **Data Integration**: Created unified customer views combining multiple data sources

### 🏦 **Banking Industry Insights**

- Roughly **70%** of banking data is unstructured (customer communications, documents), per the estimate used in this lab
- **Parquet format** is typically faster for analytics workloads (2-5x is a rule of thumb; actual gains depend on data and query)
- **Multi-format integration** enables comprehensive customer 360° views
- **Real-time processing** requires different strategies for different data types

### 🛠 **Technical Skills Developed**

- ✅ Multi-format data ingestion with Spark
- ✅ Schema inference and management
- ✅ Text processing and sentiment analysis
- ✅ Cross-format data joins and integration
- ✅ Performance optimization strategies

### 🎯 **Next Steps**

In **Lab 01.3**, we'll focus on data quality assessment and cleansing techniques for banking data.

In [None]:
# Cleanup and Resource Management
print("🧹 Cleaning up resources...")

# Cache cleanup
spark.catalog.clearCache()
print("✅ Spark cache cleared")

# Unpersist any cached DataFrames
try:
    if 'customer_360_segmented' in locals():
        customer_360_segmented.unpersist()
    if 'risk_profile' in locals():
        risk_profile.unpersist()
except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
    pass

print("✅ DataFrames unpersisted")

# Performance monitoring summary
print(f"\n📊 Lab Performance Summary:")
print(f"   Total benchmarks run: {len(performance_monitor.results)}")

if performance_monitor.results:
    avg_time = np.mean([r["execution_time"] for r in performance_monitor.results if r["success"]])
    print(f"   Average execution time: {avg_time:.4f}s")
    
    successful_runs = len([r for r in performance_monitor.results if r["success"]])
    print(f"   Successful operations: {successful_runs}/{len(performance_monitor.results)}")

print(f"\n🎓 Lab 01.2 Complete!")
print(f"   ✅ Multi-format data processing")
print(f"   ✅ Performance analysis across formats")
print(f"   ✅ Banking data integration patterns")
print(f"   ✅ Text analytics fundamentals")

print(f"\n🔄 Ready for Lab 01.3: Data Quality Assessment")

# Keep Spark session running for next lab
# spark.stop()  # Uncomment if you want to stop Spark

# Lab 01.2: Data Variety - Multiple Formats in Banking
## Banking Analytics Workshop - The 3rd V of Big Data

**Duration:** 45 minutes  
**Learning Objectives:**
- Understand data variety challenges in banking
- Work with structured, semi-structured, and unstructured data
- Demonstrate schema-on-read vs schema-on-write
- Integrate multiple data sources for comprehensive analytics

### 📊 Banking Data Variety Reality

**Modern banking data comes in many formats (shares are rough estimates and vary by source):**
- **Structured (20%):** Core banking transactions, customer tables
- **Semi-structured (30%):** JSON APIs, XML messages, log files
- **Unstructured (50%):** Emails, chat logs, documents, social media

Let's explore how Big Data tools handle this variety!