# Banking Data Analysis - Live Coding Workshop
## Big Data Analytics im Banking | 13:00-15:40

### 🎯 **Workshop Agenda**
- **13:00-13:45:** Einführung in Datenanalyse + Banking Transaction Analysis
- **13:55-14:40:** Spark Deep-Dive & GCP Setup
- **14:50-15:40:** Datenbeschaffung und -integration

### 🛠 **Was wir heute lernen:**
1. **Datenanalyseprozess** in der Praxis
2. **Data Mining** für Banking-Patterns
3. **Spark Setup** und SQL-Queries
4. **GCP/Databricks** Configuration
5. **Web Scraping** für Financial Data
6. **Multi-Source Integration**

### 📋 **Live Coding Approach**
- **Instructor demonstrates** → **Students modify/extend**
- **Short code blocks** with thorough comments
- **Interactive exercises** at each step

## 1. Load Large Banking Transactions (PySpark) 🏦
**Goal:** Load a >1GB CSV efficiently using PySpark and prepare it for analysis

Dataset: `transactions_data.csv` (set the path below)

### 🎓 Live Coding Exercise:
- **Instructor:** Sets up Spark and loads the dataset with an explicit schema or fast inference
- **Students:** Add derived columns and validate data quality

In [1]:
# Essential PySpark setup for large-scale banking data
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

# Path to the large CSV dataset (>1GB), relative to this notebook's working directory
# Ensure the file 'transactions_data.csv' is in the same folder as this notebook
dataset_path = "transactions_data.csv"

# Create Spark session optimized for local analysis of large CSVs
spark = (
    SparkSession.builder
    .appName("Banking Transactions Analysis")
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    .config("spark.sql.shuffle.partitions", "200")  # tune based on cores
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.sql.session.timeZone", "Europe/Berlin")
    .getOrCreate()
)

spark.sparkContext.setLogLevel("WARN")
print("✅ Spark initialized for large dataset processing")
print(f"🔧 Spark version: {spark.version}")

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/08/10 12:06:05 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


✅ Spark initialized for large dataset processing
🔧 Spark version: 3.5.3


In [2]:
# 📦 LOAD LARGE DATASET - Banking Transactions CSV (>1GB)
# This cell loads the dataset using PySpark and prepares standard columns

from pyspark.sql import functions as F
from pyspark.sql.types import *

# Optional: Define an explicit schema for best performance (fill in when known)
# Example (adjust to your dataset columns):
# explicit_schema = StructType([
#     StructField("transaction_id", StringType(), True),
#     StructField("customer_id", StringType(), True),
#     StructField("merchant", StringType(), True),
#     StructField("amount", DoubleType(), True),
#     StructField("currency", StringType(), True),
#     StructField("timestamp", StringType(), True),
#     # ... add other fields
# ])
explicit_schema = None  # set to the StructType above when ready

read_builder = (
    spark.read
        .option("header", True)
        .option("inferSchema", explicit_schema is None)
        .option("multiLine", False)
        .option("mode", "PERMISSIVE")
)

transactions_raw = (
    read_builder.csv(dataset_path) if explicit_schema is None
    else read_builder.schema(explicit_schema).csv(dataset_path)
)

print("📋 Raw schema:")
transactions_raw.printSchema()

cols = set([c.lower() for c in transactions_raw.columns])

# Identify and standardize key columns
# 1) transaction_date (timestamp)
candidate_date_cols = [c for c in ["transaction_date", "timestamp", "event_time", "date", "datetime"] if c in cols]
if candidate_date_cols:
    date_col = [c for c in transactions_raw.columns if c.lower() == candidate_date_cols[0]][0]
    transactions_std = transactions_raw.withColumn(
        "transaction_date",
        F.to_timestamp(F.col(date_col))
    )
else:
    transactions_std = transactions_raw  # proceed without date if missing

# 2) amount (double)
candidate_amount_cols = [c for c in ["amount", "amt", "value", "transaction_amount"] if c in cols]
if candidate_amount_cols:
    amount_src = [c for c in transactions_raw.columns if c.lower() == candidate_amount_cols[0]][0]
    if amount_src != "amount":
        transactions_std = transactions_std.withColumn("amount", F.col(amount_src).cast("double"))
    else:
        transactions_std = transactions_std.withColumn("amount", F.col("amount").cast("double"))

# 3) merchant (string)
if "merchant" not in cols:
    for alt in ["merchant_name", "store", "vendor"]:
        if alt in cols:
            alt_src = [c for c in transactions_raw.columns if c.lower() == alt][0]
            transactions_std = transactions_std.withColumnRenamed(alt_src, "merchant")
            break

# 4) customer_id (string)
if "customer_id" not in cols:
    for alt in ["customer", "customerid", "cust_id", "account_id"]:
        if alt in cols:
            alt_src = [c for c in transactions_raw.columns if c.lower() == alt][0]
            transactions_std = transactions_std.withColumnRenamed(alt_src, "customer_id")
            break

# Light-weight normalization/derivations (safe for large data)
transactions_std = (
    transactions_std
    .withColumn("transaction_date", F.col("transaction_date"))  # ensure exists if created
)

# Repartition and persist for interactive analysis
spark_banking_df = transactions_std.repartition(200).persist()

print("\n🔎 Sample rows:")
spark_banking_df.show(5, truncate=False)

# Create a temp view for SQL queries
spark_banking_df.createOrReplaceTempView("banking_transactions")
print("✅ Temp view 'banking_transactions' is ready for SQL queries")

                                                                                

📋 Raw schema:
root
 |-- id: integer (nullable = true)
 |-- date: timestamp (nullable = true)
 |-- client_id: integer (nullable = true)
 |-- card_id: integer (nullable = true)
 |-- amount: string (nullable = true)
 |-- use_chip: string (nullable = true)
 |-- merchant_id: integer (nullable = true)
 |-- merchant_city: string (nullable = true)
 |-- merchant_state: string (nullable = true)
 |-- zip: double (nullable = true)
 |-- mcc: integer (nullable = true)
 |-- errors: string (nullable = true)


🔎 Sample rows:


25/08/10 12:07:28 WARN MemoryStore: Not enough space to cache rdd_17_123 in memory! (computed 3.5 MiB so far)
25/08/10 12:07:28 WARN BlockManager: Persisting block rdd_17_123 to disk instead.
25/08/10 12:07:28 WARN MemoryStore: Not enough space to cache rdd_17_123 in memory! (computed 3.5 MiB so far)
25/08/10 12:07:28 WARN MemoryStore: Not enough space to cache rdd_17_121 in memory! (computed 3.5 MiB so far)
25/08/10 12:07:28 WARN BlockManager: Persisting block rdd_17_121 to disk instead.
25/08/10 12:07:28 WARN MemoryStore: Not enough space to cache rdd_17_122 in memory! (computed 3.5 MiB so far)
25/08/10 12:07:28 WARN BlockManager: Persisting block rdd_17_122 to disk instead.
25/08/10 12:07:28 WARN MemoryStore: Not enough space to cache rdd_17_121 in memory! (computed 3.5 MiB so far)
25/08/10 12:07:28 WARN MemoryStore: Not enough space to cache rdd_17_124 in memory! (computed 3.5 MiB so far)
25/08/10 12:07:28 WARN BlockManager: Persisting block rdd_17_124 to disk instead.
25/08/10 12:

+-------+-------------------+---------+-------+------+-----------------+-----------+-------------+--------------+-------+----+------+-------------------+
|id     |date               |client_id|card_id|amount|use_chip         |merchant_id|merchant_city|merchant_state|zip    |mcc |errors|transaction_date   |
+-------+-------------------+---------+-------+------+-----------------+-----------+-------------+--------------+-------+----+------+-------------------+
|9123100|2011-02-07 06:44:00|1664     |5147   |NULL  |Swipe Transaction|83480      |Ann Arbor    |MI            |48103.0|9402|NULL  |2011-02-07 06:44:00|
|9048269|2011-01-20 10:18:00|1575     |2112   |NULL  |Swipe Transaction|61195      |Sarasota     |FL            |34232.0|5541|NULL  |2011-01-20 10:18:00|
|8263643|2010-07-16 11:36:00|1857     |5089   |NULL  |Swipe Transaction|91128      |Morris Plains|NJ            |7950.0 |5411|NULL  |2010-07-16 11:36:00|
|8791129|2010-11-20 10:52:00|96       |3695   |NULL  |Swipe Transaction|4178

In [None]:
# 👨‍💻 STUDENT EXERCISE: Derive features in Spark (big dataset)
# Goal: Enrich the loaded DataFrame without collecting to the driver
print("🎯 TASK: Add useful derived columns at scale (Spark-only)")
print("=" * 60)

# TODO 1: Create time features
# - txn_date (to_date(transaction_date))
# - txn_hour (hour(transaction_date))
# - weekday_short (Mon, Tue, ...)
# - is_weekend (Sat/Sun)
# HINT: use to_date, hour, date_format, when
# YOUR CODE HERE:
# spark_banking_df = spark_banking_df.withColumn("txn_date", ...)
# ...

# TODO 2: Clean/standardize merchant values
# - merchant_std = upper(trim(merchant))
# - Optional: remove obvious noise values (NULL, '')
# YOUR CODE HERE:

# TODO 3: Amount quality flags
# - is_amount_null = amount IS NULL
# - is_amount_negative = amount < 0
# - Optional: amount_abs = abs(amount)
# YOUR CODE HERE:

# TODO 4: Recreate/refresh temp view after enrichment
# YOUR CODE HERE:
# spark_banking_df.createOrReplaceTempView("banking_transactions")

print("✅ After you complete the TODOs, preview a few rows with .show(5)")

## 2. Basic Data Exploration with Spark 🐼➡️🔥
**Goal:** Explore the 1GB+ dataset with Spark (no pandas copies)

### 🎓 Live Coding Exercise:
- **Instructor:** Demonstrates Spark actions and SQL
- **Students:** Build aggregations and quality checks at scale

In [None]:
# 🧑‍🏫 INSTRUCTOR: Basic Spark exploration (precoded)
def explore_banking_data_spark(df):
    """
    Scalable data exploration using Spark
    - Schema, counts, ranges, basic distributions
    - No driver-side collect() on large datasets
    """
    from pyspark.sql import functions as F
    
    print("📊 BANKING DATA OVERVIEW (Spark)")
    print("=" * 50)
    
    print(f"Total rows: {df.count():,}")
    df.printSchema()
    
    # Columns we expect (best effort)
    available_cols = set([c.lower() for c in df.columns])
    
    if "transaction_date" in available_cols:
        print("\n📅 Date range:")
        df.select(F.min("transaction_date").alias("min_date"), F.max("transaction_date").alias("max_date")).show()
        
        print("\n📆 Transactions by weekday:")
        df.withColumn("weekday", F.date_format(F.col("transaction_date"), "E")).groupBy("weekday").count().orderBy("weekday").show()
    
    if "customer_id" in available_cols:
        print("\n👥 Unique customers:")
        df.select(F.countDistinct("customer_id").alias("unique_customers")).show()
    
    if "amount" in available_cols:
        print("\n💰 Amount stats:")
        df.select(
            F.count("amount").alias("n"),
            F.mean("amount").alias("avg"),
            F.expr("percentile_approx(amount, array(0.25,0.5,0.75), 10000)").alias("quantiles"),
            F.min("amount").alias("min"),
            F.max("amount").alias("max")
        ).show(truncate=False)
    
    if "merchant" in available_cols:
        print("\n🏪 Top merchants:")
        df.groupBy("merchant").count().orderBy(F.desc("count")).show(10, truncate=False)

# Run the exploration
explore_banking_data_spark(spark_banking_df)

In [None]:
# 👨‍💻 STUDENT EXERCISE: Custom data queries (Spark)
print("🎯 YOUR TURN: Find interesting patterns at scale!")
print("=" * 50)

# TODO 1: Find customers with highest spending (Spark)
print("💸 TOP SPENDERS:")
# HINT: df.groupBy('customer_id').agg(sum('amount').alias('total')).orderBy(desc('total'))
# YOUR CODE HERE:


print("\n💳 SPENDING BY MERCHANT CATEGORY:")
# TODO 2: Create merchant categories and analyze spending (Spark-only)
# HINT: Build a mapping with when().otherwise() without collecting
# Example:
# df.withColumn('category', when(col('merchant').isin('REWE','EDEKA'), 'Food')
#                              .when(col('merchant').isin('Deutsche Bahn','BVG'),'Transport')
#                              .otherwise('Other'))
# YOUR CODE HERE:


print("\n📈 DAILY SPENDING TRENDS:")
# TODO 3: Show daily total spending (Spark)
# HINT: groupBy(to_date('transaction_date')).agg(sum('amount'))
# YOUR CODE HERE:

## 3. Spark Session Recap 🚀
Spark is already initialized. We’ll keep this short and move to SQL analytics.

- Session tuned for local development and large CSVs
- Temp view `banking_transactions` is ready
- Proceed to analytics at scale

In [None]:
# (Optional) Spark utilities
from pyspark.sql import functions as F
from pyspark.sql.types import *

print("ℹ️ Spark utilities available. Session already created above.")

In [None]:
# 👨‍💻 STUDENT EXERCISE: Spark DataFrame operations
print("🎯 YOUR TURN: Practice Spark DataFrame operations!")
print("=" * 50)

# TODO 1: Basic Spark DataFrame exploration
print("📊 BASIC SPARK OPERATIONS:")
# HINT: Use .count(), .show(5), .describe()
print(f"Total rows: {spark_banking_df.count():,}")

print("\nFirst 5 rows:")
# YOUR CODE HERE to show first 5 rows:


print("\n📈 SPARK vs PANDAS COMPARISON:")
# TODO 2: Compare Spark and pandas operations
print("Counting unique customers:")
print(f"Pandas: {banking_data['customer_id'].nunique()} customers")
# YOUR CODE HERE to count distinct customer_id in Spark:


# TODO 3: Create a temporary view for SQL queries
print("\n🗄️  CREATING TEMPORARY VIEW:")
# HINT: Use createOrReplaceTempView()
# YOUR CODE HERE to create view named 'banking_transactions':


# Test the view
print("✅ Testing the view with SQL:")
spark.sql("SELECT COUNT(*) as total_transactions FROM banking_transactions").show()

## 4. Advanced Spark SQL Analytics 🔍
**Goal:** Complex banking analytics using SQL on big data

### 🏦 Real Banking Use Cases:
- **Fraud Detection:** Unusual spending patterns
- **Customer Segmentation:** Spending behavior analysis
- **Risk Assessment:** Transaction pattern analysis

In [None]:
# 🧑‍🏫 INSTRUCTOR: Advanced SQL analytics (precoded)
def run_banking_analytics():
    print("🔍 ADVANCED BANKING ANALYTICS (Spark SQL)")
    print("=" * 50)

    # 1. Customer spending ranking with window functions
    print("👑 TOP CUSTOMERS BY SPENDING:")
    query1 = """
    SELECT 
        customer_id,
        SUM(amount) as total_spent,
        COUNT(*) as transaction_count,
        AVG(amount) as avg_transaction,
        RANK() OVER (ORDER BY SUM(amount) DESC) as spending_rank
    FROM banking_transactions 
    GROUP BY customer_id 
    ORDER BY total_spent DESC 
    LIMIT 10
    """
    spark.sql(query1).show(truncate=False)

    # 2. Merchant performance analysis
    print("\n🏪 MERCHANT REVENUE ANALYSIS:")
    query2 = """
    SELECT 
        merchant,
        COUNT(*) as transactions,
        SUM(amount) as total_revenue,
        AVG(amount) as avg_transaction,
        STDDEV_POP(amount) as amount_volatility,
        MIN(amount) as min_amount,
        MAX(amount) as max_amount
    FROM banking_transactions 
    GROUP BY merchant 
    HAVING COUNT(*) >= 100
    ORDER BY total_revenue DESC
    """
    spark.sql(query2).show(truncate=False)

    # 3. Time-based patterns (fraud detection)
    print("\n⏰ HOURLY TRANSACTION PATTERNS:")
    query3 = """
    SELECT 
        HOUR(transaction_date) as hour,
        COUNT(*) as transactions,
        SUM(amount) as total_amount,
        AVG(amount) as avg_amount,
        CASE 
            WHEN HOUR(transaction_date) BETWEEN 9 AND 17 THEN 'Business Hours'
            WHEN HOUR(transaction_date) BETWEEN 18 AND 22 THEN 'Evening'
            ELSE 'Off Hours'
        END as time_category
    FROM banking_transactions 
    GROUP BY HOUR(transaction_date)
    ORDER BY hour
    """
    spark.sql(query3).show()

    print("✅ Advanced analytics complete!")

# Run the analytics
run_banking_analytics()

In [None]:
# 👨‍💻 STUDENT EXERCISE: Write complex SQL queries
print("🎯 YOUR TURN: Build fraud detection queries!")
print("=" * 50)

# TODO 1: Fraud Detection - Unusual spending patterns
print("🚨 POTENTIAL FRAUD DETECTION:")
print("Find customers with transactions > 3 standard deviations from their average")
# HINT: Use window functions with AVG() and STDDEV() OVER (PARTITION BY customer_id)
fraud_query = """
-- YOUR SQL QUERY HERE
-- HINT: 
-- 1. Calculate customer's average and standard deviation
-- 2. Find transactions that are unusually large
-- 3. Use ABS(amount - avg_amount) > 3 * stddev_amount
"""

# YOUR CODE HERE to execute the query:
# spark.sql(fraud_query).show()

print("\n💳 CUSTOMER BEHAVIOR SEGMENTATION:")
# TODO 2: Segment customers by spending behavior
segmentation_query = """
-- YOUR SQL QUERY HERE
-- Create customer segments based on:
-- 1. Total spending (High/Medium/Low)
-- 2. Transaction frequency (Frequent/Regular/Occasional)
-- 3. Average transaction size (Large/Medium/Small)
-- HINT: Use CASE WHEN with NTILE() or percentile-based bucketing
"""

# YOUR CODE HERE:


print("\n📊 WEEKEND vs WEEKDAY SPENDING:")
# TODO 3: Compare spending patterns
weekend_query = """
-- YOUR SQL QUERY HERE
-- Compare weekend vs weekday spending:
-- 1. Use DAYOFWEEK() function
-- 2. Calculate totals, averages, transaction counts
-- 3. Show difference in spending behavior
"""

# YOUR CODE HERE:

## 5. GCP Databricks Setup ☁️
**Goal:** Deploy our banking analysis to Google Cloud Platform

### 🌟 Why GCP + Databricks?
- **Scalability:** Handle millions of banking transactions
- **Security:** Enterprise-grade data protection
- **Compliance:** Meet banking regulatory requirements
- **Integration:** Connect to various data sources

### 📋 Pre-requisites:
- GCP Account with billing enabled
- Databricks workspace access
- Service account with proper permissions

In [None]:
# 🧑‍🏫 INSTRUCTOR: GCP Databricks configuration (precoded)
import os
import json
from google.cloud import storage

def setup_gcp_connection():
    """
    Configure GCP connection for banking data upload
    
    This demonstrates:
    - Service account authentication
    - Cloud Storage bucket creation
    - Data upload preparation
    """
    
    print("☁️  GCP DATABRICKS SETUP")
    print("=" * 50)
    
    # Configuration for GCP
    gcp_config = {
        "project_id": "your-banking-project-id",  # Change this
        "bucket_name": "banking-data-analytics",   # Change this
        "dataset_location": "europe-west3",       # Frankfurt region
        "service_account_path": "/path/to/service-account.json"
    }
    
    print("📋 GCP Configuration:")
    for key, value in gcp_config.items():
        print(f"  {key}: {value}")
    
    # Sample Databricks cluster configuration
    databricks_config = {
        "cluster_name": "banking-analytics-cluster",
        "spark_version": "11.3.x-scala2.12",
        "node_type_id": "n1-standard-4",
        "num_workers": 2,
        "autotermination_minutes": 60,
        "spark_conf": {
            "spark.sql.adaptive.enabled": "true",
            "spark.sql.adaptive.coalescePartitions.enabled": "true",
            "spark.serializer": "org.apache.spark.serializer.KryoSerializer"
        }
    }
    
    print("\n🚀 Databricks Cluster Config:")
    print(json.dumps(databricks_config, indent=2))
    
    return gcp_config, databricks_config

def prepare_data_for_upload(df, output_path="banking_data.parquet"):
    """
    Prepare banking data for GCP upload
    
    Best practices:
    - Use Parquet format for efficiency
    - Partition by date for query performance
    - Add metadata for governance
    """
    
    print(f"\n📦 PREPARING DATA FOR UPLOAD")
    print("-" * 30)
    
    # Convert to Spark DataFrame if pandas
    if hasattr(df, 'to_pandas'):
        print("✅ Already Spark DataFrame")
        spark_df = df
    else:
        print("🔄 Converting pandas to Spark")
        spark_df = spark.createDataFrame(df)
    
    # Add metadata columns
    spark_df_enhanced = spark_df.withColumn("upload_date", current_date()) \
                               .withColumn("data_source", lit("synthetic_banking")) \
                               .withColumn("data_version", lit("v1.0"))
    
    # Write to local parquet (simulate GCS upload)
    print(f"💾 Writing to {output_path}...")
    spark_df_enhanced.coalesce(1).write.mode("overwrite").parquet(output_path)
    
    print(f"✅ Data prepared: {spark_df_enhanced.count():,} records")
    return spark_df_enhanced

# Run the setup
gcp_config, databricks_config = setup_gcp_connection()
enhanced_data = prepare_data_for_upload(spark_banking_df)

In [None]:
# 👨‍💻 STUDENT EXERCISE: Customize GCP deployment
print("🎯 YOUR TURN: Customize the deployment configuration!")
print("=" * 50)

# TODO 1: Update configuration for your environment
print("⚙️ CUSTOMIZE YOUR CONFIGURATION:")
my_gcp_config = {
    "project_id": "",  # YOUR GCP PROJECT ID
    "bucket_name": "",  # YOUR BUCKET NAME
    "region": "",      # YOUR PREFERRED REGION
    # Add more config as needed
}

print("📝 Your GCP Config:")
# YOUR CODE HERE to display your configuration


# TODO 2: Create a deployment checklist
print("\n✅ DEPLOYMENT CHECKLIST:")
deployment_checklist = [
    # Add your checklist items here
    # Example: "GCP project created and billing enabled"
]

# YOUR CODE HERE to create and display checklist:


# TODO 3: Estimate costs for your banking analytics
print("\n💰 COST ESTIMATION:")
# Calculate estimated costs based on:
# - Data size
# - Query frequency
# - Cluster running time
# - Storage requirements

data_size_gb = 0  # YOUR CALCULATION
queries_per_day = 0  # YOUR ESTIMATE
cluster_hours_per_day = 0  # YOUR ESTIMATE

# YOUR CODE HERE to calculate estimated monthly costs:

print("\n🎓 NEXT STEPS:")
print("1. Set up your GCP account")
print("2. Create Databricks workspace")
print("3. Configure service account")
print("4. Upload sample data")
print("5. Test connection from Databricks")

## 6. Web Scraping for Financial Data 🕷️
**Goal:** Integrate external financial data sources

### 💡 Real-World Banking Use Cases:
- **Exchange Rates:** Currency conversion for international transactions  
- **Stock Prices:** Portfolio valuation and risk assessment
- **Economic Indicators:** Market analysis and forecasting
- **Regulatory Updates:** Compliance monitoring

### ⚖️ Ethical Considerations:
- Always check `robots.txt` and terms of service
- Respect rate limits and server resources
- Use APIs when available instead of scraping
- Consider data privacy and compliance requirements

In [None]:
# 🧑‍🏫 INSTRUCTOR: Web scraping setup (precoded)
import requests
from bs4 import BeautifulSoup
import time
from datetime import datetime, timedelta
import pandas as pd

def scrape_exchange_rates():
    """
    Simulate scraping EUR/USD exchange rates
    
    In production, use:
    - Official APIs (ECB, Federal Reserve, etc.)
    - Financial data providers (Alpha Vantage, Yahoo Finance API)
    - Respect rate limits and terms of service
    """
    
    print("💱 SIMULATING EXCHANGE RATE SCRAPING")
    print("=" * 40)
    
    # Simulate exchange rate data (in production, scrape from real source)
    print("⚠️  NOTE: This is simulated data for demo purposes")
    print("🔗 Real sources: European Central Bank API, Yahoo Finance, etc.")
    
    # Generate realistic EUR/USD rates for past 30 days
    base_rate = 1.10
    dates = pd.date_range(start=datetime.now() - timedelta(days=30), 
                         end=datetime.now(), freq='D')
    
    exchange_rates = []
    for i, date in enumerate(dates):
        # Simulate rate fluctuation
        rate = base_rate + (i * 0.001) + (0.02 * (i % 7 - 3) / 7)
        exchange_rates.append({
            'date': date.strftime('%Y-%m-%d'),
            'currency_pair': 'EUR/USD',
            'rate': round(rate, 4),
            'source': 'simulated_ecb'
        })
    
    rates_df = pd.DataFrame(exchange_rates)
    
    print(f"📊 Retrieved {len(rates_df)} exchange rates")
    print("\n📈 Sample rates:")
    print(rates_df.tail())
    
    return rates_df

def scrape_economic_indicators():
    """
    Simulate scraping economic indicators relevant to banking
    
    Real indicators to track:
    - Interest rates (ECB, Federal Reserve)
    - Inflation rates
    - GDP growth
    - Unemployment rates
    """
    
    print("\n📊 SIMULATING ECONOMIC INDICATORS SCRAPING")
    print("=" * 45)
    
    # Simulated economic data
    indicators = [
        {'indicator': 'ECB_Interest_Rate', 'value': 4.25, 'date': '2024-01-15'},
        {'indicator': 'EUR_Inflation_Rate', 'value': 2.8, 'date': '2024-01-15'},
        {'indicator': 'DE_Unemployment_Rate', 'value': 5.9, 'date': '2024-01-15'},
        {'indicator': 'EUR_GDP_Growth', 'value': 1.2, 'date': '2024-01-15'},
    ]
    
    indicators_df = pd.DataFrame(indicators)
    
    print("🏛️ Key Economic Indicators:")
    print(indicators_df)
    
    return indicators_df

# Execute the scraping functions
exchange_rates_df = scrape_exchange_rates()
economic_indicators_df = scrape_economic_indicators()

print("\n✅ External data sources ready for integration!")

In [None]:
# 👨‍💻 STUDENT EXERCISE: Build your own scraper
print("🎯 YOUR TURN: Create custom financial data scrapers!")
print("=" * 50)

# TODO 1: Create a scraper for stock prices (simulated)
def scrape_stock_prices(symbols=['DAX', 'BMW', 'SAP']):
    """
    Simulate scraping German stock prices
    
    YOUR TASK: Create a function that would scrape stock prices
    for major German companies relevant to banking portfolios
    """
    
    print("📈 STOCK PRICE SCRAPER:")
    # YOUR CODE HERE
    # Create simulated stock data for German companies
    # Include: symbol, price, change, volume, timestamp
    
    stock_data = []  # YOUR DATA STRUCTURE
    
    return pd.DataFrame(stock_data)

# YOUR CODE HERE to test your function:
# stock_prices_df = scrape_stock_prices()

# TODO 2: Implement rate limiting and error handling
def safe_scraper(url, delay=1, retries=3):
    """
    Create a robust scraper with proper error handling
    
    YOUR TASK: Implement:
    - Rate limiting (time delays)
    - Retry logic for failed requests
    - User-agent rotation
    - Timeout handling
    """
    
    print("🛡️ IMPLEMENTING SAFE SCRAPING:")
    # YOUR CODE HERE
    
    pass

# TODO 3: Create a data validation function
def validate_financial_data(df, data_type='exchange_rates'):
    """
    Validate scraped financial data
    
    YOUR TASK: Check for:
    - Missing values
    - Unrealistic values (e.g., negative exchange rates)
    - Date format consistency
    - Duplicate entries
    """
    
    print(f"✅ VALIDATING {data_type.upper()} DATA:")
    # YOUR VALIDATION CODE HERE
    
    return True  # Return validation status

print("\n🎓 BONUS CHALLENGE:")
print("Research and list 3 official financial APIs that would be better than scraping:")
print("1. ")  # YOUR ANSWER
print("2. ")  # YOUR ANSWER  
print("3. ")  # YOUR ANSWER

## 7. Multi-Source Data Integration 🔗
**Goal:** Combine banking transactions with external financial data

### 🏦 Enterprise Banking Reality:
- **Internal Data:** Transactions, customer profiles, account balances
- **External Data:** Market data, economic indicators, regulatory feeds  
- **Real-time Streams:** Payment networks, fraud detection systems
- **Historical Archives:** Years of transaction history for analysis

### 🎯 Integration Challenges:
- **Schema Variations:** Different data formats and structures
- **Data Quality:** Missing values, duplicates, inconsistencies
- **Time Synchronization:** Aligning data from different time zones
- **Scale:** Processing billions of records efficiently

In [None]:
# 🧑‍🏫 INSTRUCTOR: Multi-source data integration (precoded)
from pyspark.sql.functions import *
from pyspark.sql.types import *

def integrate_financial_data():
    """
    Demonstrate enterprise-level data integration
    
    This function shows:
    - Converting external data to Spark DataFrames
    - Schema alignment and data type conversions  
    - Time-based joins for financial analysis
    - Data quality checks and validation
    """
    
    print("🔗 MULTI-SOURCE DATA INTEGRATION")
    print("=" * 40)
    
    # 1. Convert external data to Spark DataFrames
    print("📊 Converting external data to Spark...")
    
    # Exchange rates to Spark DataFrame
    exchange_rates_spark = spark.createDataFrame(exchange_rates_df) \
        .withColumn("rate_date", to_date(col("date"), "yyyy-MM-dd")) \
        .withColumn("rate", col("rate").cast("double"))
    
    # Economic indicators to Spark DataFrame  
    indicators_spark = spark.createDataFrame(economic_indicators_df) \
        .withColumn("indicator_date", to_date(col("date"), "yyyy-MM-dd")) \
        .withColumn("value", col("value").cast("double"))
    
    # 2. Prepare banking data for joins
    print("🏦 Preparing banking transactions...")
    
    banking_with_date = spark_banking_df \
        .withColumn("transaction_date_only", 
                   to_date(col("transaction_date"))) \
        .withColumn("month_year", 
                   date_format(col("transaction_date"), "yyyy-MM"))
    
    # 3. Join banking data with exchange rates (for international analysis)
    print("💱 Integrating exchange rate data...")
    
    banking_with_rates = banking_with_date.join(
        exchange_rates_spark.select("rate_date", "rate", "currency_pair"),
        banking_with_date.transaction_date_only == exchange_rates_spark.rate_date,
        "left"
    ).withColumn("amount_usd", 
                col("amount") * col("rate")) \
     .drop("rate_date")
    
    # 4. Add economic context
    print("📈 Adding economic indicators...")
    
    # Get monthly economic data (simplified join)
    monthly_indicators = indicators_spark \
        .withColumn("month_year", date_format(col("indicator_date"), "yyyy-MM")) \
        .groupBy("month_year") \
        .agg(
            avg(when(col("indicator") == "ECB_Interest_Rate", col("value"))).alias("interest_rate"),
            avg(when(col("indicator") == "EUR_Inflation_Rate", col("value"))).alias("inflation_rate")
        )
    
    # Final integrated dataset
    integrated_banking_data = banking_with_rates.join(
        monthly_indicators,
        "month_year",
        "left"
    )
    
    # 5. Create summary view
    print("\n📋 INTEGRATED DATA SUMMARY:")
    integrated_banking_data.select(
        "customer_id", "merchant", "amount", "amount_usd", 
        "rate", "interest_rate", "inflation_rate", "transaction_date"
    ).show(5, truncate=False)
    
    print(f"\n✅ Integration complete: {integrated_banking_data.count():,} enriched transactions")
    
    return integrated_banking_data

# Execute integration
integrated_data = integrate_financial_data()

# Create temporary view for final analysis
integrated_data.createOrReplaceTempView("integrated_banking_data")

In [None]:
# 👨‍💻 FINAL STUDENT CHALLENGE: Comprehensive Banking Analytics
print("🎯 FINAL CHALLENGE: Build a complete banking analytics dashboard!")
print("=" * 60)

# TODO 1: Economic Impact Analysis
print("📊 ECONOMIC IMPACT ANALYSIS:")
economic_analysis_query = """
-- YOUR FINAL SQL QUERY HERE
-- Analyze how economic indicators affect spending patterns:
-- 1. Compare spending in high vs low interest rate periods
-- 2. Correlate inflation rates with transaction amounts
-- 3. Show currency impact on international transactions (amount_usd vs amount)
-- 4. Identify economic-sensitive merchant categories
"""

# YOUR CODE HERE:
# spark.sql(economic_analysis_query).show()

# TODO 2: Advanced Customer Segmentation
print("\n👥 ADVANCED CUSTOMER SEGMENTATION:")
segmentation_query = """
-- YOUR ADVANCED SEGMENTATION QUERY HERE
-- Create sophisticated customer segments based on:
-- 1. Spending behavior during different economic conditions
-- 2. Currency usage patterns (EUR vs USD spending)  
-- 3. Merchant category preferences
-- 4. Time-based spending patterns (business hours vs off-hours)
-- 5. Risk profile (transaction volatility)
"""

# YOUR CODE HERE:

# TODO 3: Predictive Risk Indicators  
print("\n🚨 RISK AND FRAUD INDICATORS:")
risk_analysis_query = """
-- YOUR RISK ANALYSIS QUERY HERE
-- Build risk indicators combining all data sources:
-- 1. Unusual spending during economic volatility periods
-- 2. Currency arbitrage patterns (suspicious EUR/USD transactions)
-- 3. Off-hours high-value transactions
-- 4. Merchant pattern deviations
-- 5. Cross-reference with economic indicator anomalies
"""

# YOUR CODE HERE:

# TODO 4: Executive Summary Dashboard
print("\n📈 EXECUTIVE SUMMARY DASHBOARD:")
print("Create a summary for bank executives showing:")
print("- Total portfolio value in EUR and USD")
print("- Customer segment breakdown")  
print("- Economic risk exposure")
print("- Fraud alerts summary")
print("- Regulatory compliance metrics")

dashboard_metrics = {
    "total_transactions": 0,  # YOUR CALCULATION
    "total_volume_eur": 0,    # YOUR CALCULATION  
    "total_volume_usd": 0,    # YOUR CALCULATION
    "high_risk_customers": 0, # YOUR CALCULATION
    "economic_exposure": 0,   # YOUR CALCULATION
}

# YOUR CODE HERE to populate dashboard_metrics:

print("\n🎉 CONGRATULATIONS!")
print("You've completed a full big data banking analytics pipeline:")
print("✅ Data Generation & Quality Assessment")  
print("✅ Scalable Processing with Spark")
print("✅ Advanced SQL Analytics")
print("✅ Cloud Deployment Preparation")
print("✅ External Data Integration")
print("✅ Multi-source Analytics")

print("\n🚀 NEXT STEPS FOR PRODUCTION:")
print("1. Implement real-time streaming with Kafka")
print("2. Add machine learning for fraud detection")  
print("3. Create automated reporting pipelines")
print("4. Implement data governance and lineage")
print("5. Add regulatory compliance monitoring")