# Banking Data Analysis - Live Coding Workshop
## Big Data Analytics im Banking | 13:00-15:40

### 🎯 **Workshop Agenda**
- **13:00-13:45:** Einführung in Datenanalyse + Banking Transaction Analysis
- **13:55-14:40:** Spark Deep-Dive & GCP Setup
- **14:50-15:40:** Datenbeschaffung und -integration

### 🛠 **Was wir heute lernen:**
1. **Datenanalyseprozess** in der Praxis
2. **Data Mining** für Banking-Patterns
3. **Spark Setup** und SQL-Queries
4. **GCP/Databricks** Configuration
5. **Web Scraping** für Financial Data
6. **Multi-Source Integration**

### 📋 **Live Coding Approach**
- **Instructor demonstrates** → **Students modify/extend**
- **Short code blocks** with thorough comments
- **Interactive exercises** at each step

## 1. Load Large Banking Transactions (PySpark) 🏦
**Goal:** Load a >1GB CSV efficiently using PySpark and prepare it for analysis

Dataset: `transactions_data.csv` (set the path below)

### 🎓 Live Coding Exercise:
- **Instructor:** Sets up Spark and loads the dataset with an explicit schema or fast inference
- **Students:** Add derived columns and validate data quality

In [1]:
# Essential PySpark setup for large-scale banking data
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

# Path to the large CSV dataset (>1GB), relative to this notebook's working directory
# Ensure the file 'transactions_data.csv' is in the same folder as this notebook
dataset_path = "transactions_data.csv"

# Create Spark session optimized for local analysis of large CSVs
spark = (
    SparkSession.builder
    .appName("Banking Transactions Analysis")
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    .config("spark.sql.shuffle.partitions", "200")  # tune based on cores
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.sql.session.timeZone", "Europe/Berlin")
    .getOrCreate()
)

spark.sparkContext.setLogLevel("WARN")
print("✅ Spark initialized for large dataset processing")
print(f"🔧 Spark version: {spark.version}")

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/08/10 12:16:47 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/08/10 12:16:47 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
25/08/10 12:16:47 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


✅ Spark initialized for large dataset processing
🔧 Spark version: 3.5.3


In [2]:
# 📦 LOAD LARGE DATASET - Banking Transactions CSV (>1GB)
# This cell loads the dataset using PySpark and prepares standard columns

from pyspark.sql import functions as F
from pyspark.sql.types import *

# Optional: Define an explicit schema for best performance (fill in when known)
# Example (adjust to your dataset columns):
# explicit_schema = StructType([
#     StructField("transaction_id", StringType(), True),
#     StructField("customer_id", StringType(), True),
#     StructField("merchant", StringType(), True),
#     StructField("amount", DoubleType(), True),
#     StructField("currency", StringType(), True),
#     StructField("timestamp", StringType(), True),
#     # ... add other fields
# ])
explicit_schema = None  # set to the StructType above when ready

read_builder = (
    spark.read
        .option("header", True)
        .option("inferSchema", explicit_schema is None)
        .option("multiLine", False)
        .option("mode", "PERMISSIVE")
)

transactions_raw = (
    read_builder.csv(dataset_path) if explicit_schema is None
    else read_builder.schema(explicit_schema).csv(dataset_path)
)

print("📋 Raw schema:")
transactions_raw.printSchema()

cols = set([c.lower() for c in transactions_raw.columns])

# Identify and standardize key columns
# 1) transaction_date (timestamp)
candidate_date_cols = [c for c in ["transaction_date", "timestamp", "event_time", "date", "datetime"] if c in cols]
if candidate_date_cols:
    date_col = [c for c in transactions_raw.columns if c.lower() == candidate_date_cols[0]][0]
    transactions_std = transactions_raw.withColumn(
        "transaction_date",
        F.to_timestamp(F.col(date_col))
    )
else:
    transactions_std = transactions_raw  # proceed without date if missing

# 2) amount (double)
candidate_amount_cols = [c for c in ["amount", "amt", "value", "transaction_amount"] if c in cols]
if candidate_amount_cols:
    amount_src = [c for c in transactions_raw.columns if c.lower() == candidate_amount_cols[0]][0]
    if amount_src != "amount":
        transactions_std = transactions_std.withColumn("amount", F.col(amount_src).cast("double"))
    else:
        transactions_std = transactions_std.withColumn("amount", F.col("amount").cast("double"))

# 3) merchant (string)
if "merchant" not in cols:
    for alt in ["merchant_name", "store", "vendor"]:
        if alt in cols:
            alt_src = [c for c in transactions_raw.columns if c.lower() == alt][0]
            transactions_std = transactions_std.withColumnRenamed(alt_src, "merchant")
            break

# 4) customer_id (string)
if "customer_id" not in cols:
    for alt in ["customer", "customerid", "cust_id", "account_id"]:
        if alt in cols:
            alt_src = [c for c in transactions_raw.columns if c.lower() == alt][0]
            transactions_std = transactions_std.withColumnRenamed(alt_src, "customer_id")
            break

# Light-weight normalization/derivations (safe for large data)
transactions_std = (
    transactions_std
    .withColumn("transaction_date", F.col("transaction_date"))  # ensure exists if created
)

# Repartition and persist for interactive analysis
spark_banking_df = transactions_std.repartition(200).persist()

print("\n🔎 Sample rows:")
spark_banking_df.show(5, truncate=False)

# Create a temp view for SQL queries
spark_banking_df.createOrReplaceTempView("banking_transactions")
print("✅ Temp view 'banking_transactions' is ready for SQL queries")

                                                                                

📋 Raw schema:
root
 |-- id: integer (nullable = true)
 |-- date: timestamp (nullable = true)
 |-- client_id: integer (nullable = true)
 |-- card_id: integer (nullable = true)
 |-- amount: string (nullable = true)
 |-- use_chip: string (nullable = true)
 |-- merchant_id: integer (nullable = true)
 |-- merchant_city: string (nullable = true)
 |-- merchant_state: string (nullable = true)
 |-- zip: double (nullable = true)
 |-- mcc: integer (nullable = true)
 |-- errors: string (nullable = true)


🔎 Sample rows:


25/08/10 12:17:22 WARN MemoryStore: Not enough space to cache rdd_17_120 in memory! (computed 3.5 MiB so far)
25/08/10 12:17:22 WARN BlockManager: Persisting block rdd_17_120 to disk instead.
25/08/10 12:17:22 WARN MemoryStore: Not enough space to cache rdd_17_120 in memory! (computed 3.5 MiB so far)
25/08/10 12:17:22 WARN MemoryStore: Not enough space to cache rdd_17_122 in memory! (computed 3.5 MiB so far)
25/08/10 12:17:22 WARN BlockManager: Persisting block rdd_17_122 to disk instead.
25/08/10 12:17:22 WARN MemoryStore: Not enough space to cache rdd_17_120 in memory! (computed 3.5 MiB so far)
25/08/10 12:17:22 WARN BlockManager: Persisting block rdd_17_120 to disk instead.
25/08/10 12:17:22 WARN MemoryStore: Not enough space to cache rdd_17_120 in memory! (computed 3.5 MiB so far)
25/08/10 12:17:22 WARN MemoryStore: Not enough space to cache rdd_17_122 in memory! (computed 3.5 MiB so far)
25/08/10 12:17:22 WARN BlockManager: Persisting block rdd_17_122 to disk instead.
25/08/10 12:

+-------+-------------------+---------+-------+------+-----------------+-----------+-------------+--------------+-------+----+------+-------------------+
|id     |date               |client_id|card_id|amount|use_chip         |merchant_id|merchant_city|merchant_state|zip    |mcc |errors|transaction_date   |
+-------+-------------------+---------+-------+------+-----------------+-----------+-------------+--------------+-------+----+------+-------------------+
|9123100|2011-02-07 06:44:00|1664     |5147   |NULL  |Swipe Transaction|83480      |Ann Arbor    |MI            |48103.0|9402|NULL  |2011-02-07 06:44:00|
|9048269|2011-01-20 10:18:00|1575     |2112   |NULL  |Swipe Transaction|61195      |Sarasota     |FL            |34232.0|5541|NULL  |2011-01-20 10:18:00|
|8263643|2010-07-16 11:36:00|1857     |5089   |NULL  |Swipe Transaction|91128      |Morris Plains|NJ            |7950.0 |5411|NULL  |2010-07-16 11:36:00|
|8791129|2010-11-20 10:52:00|96       |3695   |NULL  |Swipe Transaction|4178

25/08/10 12:17:27 WARN MemoryStore: Not enough space to cache rdd_17_198 in memory! (computed 3.5 MiB so far)
25/08/10 12:17:27 WARN BlockManager: Persisting block rdd_17_198 to disk instead.
25/08/10 12:17:27 WARN MemoryStore: Not enough space to cache rdd_17_199 in memory! (computed 3.5 MiB so far)
25/08/10 12:17:27 WARN BlockManager: Persisting block rdd_17_199 to disk instead.
25/08/10 12:17:27 WARN MemoryStore: Not enough space to cache rdd_17_196 in memory! (computed 3.5 MiB so far)
25/08/10 12:17:27 WARN MemoryStore: Not enough space to cache rdd_17_197 in memory! (computed 3.5 MiB so far)
25/08/10 12:17:27 WARN BlockManager: Persisting block rdd_17_197 to disk instead.
25/08/10 12:17:27 WARN MemoryStore: Not enough space to cache rdd_17_198 in memory! (computed 3.5 MiB so far)
25/08/10 12:17:27 WARN MemoryStore: Not enough space to cache rdd_17_197 in memory! (computed 3.5 MiB so far)
                                                                                

In [None]:
# ✅ COMPLETE SOLUTION: Derive features in Spark (big dataset)
# Goal: Enrich the loaded DataFrame without collecting to the driver
print("🎯 COMPLETE: Add useful derived columns at scale (Spark-only)")
print("=" * 60)

from pyspark.sql.functions import *

# Solution 1: Create time features
print("📅 Adding time-based features...")
spark_banking_df = spark_banking_df.withColumn("txn_date", to_date(col("transaction_date"))) \
                                   .withColumn("txn_hour", hour(col("transaction_date"))) \
                                   .withColumn("weekday_short", date_format(col("transaction_date"), "E")) \
                                   .withColumn("is_weekend", 
                                              when(dayofweek(col("transaction_date")).isin([1, 7]), True)
                                              .otherwise(False))

# Solution 2: Clean/standardize merchant values
print("🏪 Standardizing merchant names...")

# Create a merchant column by combining merchant_id and merchant_city
spark_banking_df = spark_banking_df.withColumn(
    "merchant",
    concat_ws("_", col("merchant_id").cast("string"), col("merchant_city"))
)

spark_banking_df = spark_banking_df.withColumn("merchant_std", upper(trim(col("merchant")))) \
                                   .withColumn("merchant_clean", 
                                              when(col("merchant").isNull() | (col("merchant") == ""), "UNKNOWN")
                                              .otherwise(col("merchant_std")))

# Solution 3: Amount quality flags
print("💰 Adding amount quality flags...")
spark_banking_df = spark_banking_df.withColumn("is_amount_null", col("amount").isNull()) \
                                   .withColumn("is_amount_negative", col("amount") < 0) \
                                   .withColumn("amount_abs", abs(col("amount"))) \
                                   .withColumn("amount_category",
                                              when(col("amount") < 10, "Micro")
                                              .when(col("amount") < 100, "Small")
                                              .when(col("amount") < 1000, "Medium")
                                              .otherwise("Large"))

# Solution 4: Recreate/refresh temp view after enrichment
spark_banking_df.createOrReplaceTempView("banking_transactions")
print("✅ Temp view refreshed with new features")

# Preview enriched data
print("\n📊 Sample of enriched data:")
spark_banking_df.select("customer_id", "merchant_clean", "amount", "amount_category", 
                        "txn_date", "weekday_short", "is_weekend", "txn_hour").show(5)

🎯 COMPLETE: Add useful derived columns at scale (Spark-only)
📅 Adding time-based features...
🏪 Standardizing merchant names...


AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `merchant` cannot be resolved. Did you mean one of the following? [`merchant_id`, `errors`, `amount`, `mcc`, `merchant_city`].;
'Project [id#17, date#18, client_id#19, card_id#20, amount#56, use_chip#22, merchant_id#23, merchant_city#24, merchant_state#25, zip#26, mcc#27, errors#28, transaction_date#70, txn_date#474, txn_hour#489, weekday_short#505, is_weekend#522, upper(trim('merchant, None)) AS merchant_std#540]
+- Project [id#17, date#18, client_id#19, card_id#20, amount#56, use_chip#22, merchant_id#23, merchant_city#24, merchant_state#25, zip#26, mcc#27, errors#28, transaction_date#70, txn_date#474, txn_hour#489, weekday_short#505, CASE WHEN dayofweek(cast(transaction_date#70 as date)) IN (1,7) THEN true ELSE false END AS is_weekend#522]
   +- Project [id#17, date#18, client_id#19, card_id#20, amount#56, use_chip#22, merchant_id#23, merchant_city#24, merchant_state#25, zip#26, mcc#27, errors#28, transaction_date#70, txn_date#474, txn_hour#489, date_format(transaction_date#70, E, Some(Europe/Berlin)) AS weekday_short#505]
      +- Project [id#17, date#18, client_id#19, card_id#20, amount#56, use_chip#22, merchant_id#23, merchant_city#24, merchant_state#25, zip#26, mcc#27, errors#28, transaction_date#70, txn_date#474, hour(transaction_date#70, Some(Europe/Berlin)) AS txn_hour#489]
         +- Project [id#17, date#18, client_id#19, card_id#20, amount#56, use_chip#22, merchant_id#23, merchant_city#24, merchant_state#25, zip#26, mcc#27, errors#28, transaction_date#70, to_date(transaction_date#70, None, Some(Europe/Berlin), false) AS txn_date#474]
            +- Repartition 200, true
               +- Project [id#17, date#18, client_id#19, card_id#20, amount#56, use_chip#22, merchant_id#23, merchant_city#24, merchant_state#25, zip#26, mcc#27, errors#28, transaction_date#41 AS transaction_date#70]
                  +- Project [id#17, date#18, client_id#19, card_id#20, cast(amount#21 as double) AS amount#56, use_chip#22, merchant_id#23, merchant_city#24, merchant_state#25, zip#26, mcc#27, errors#28, transaction_date#41]
                     +- Project [id#17, date#18, client_id#19, card_id#20, amount#21, use_chip#22, merchant_id#23, merchant_city#24, merchant_state#25, zip#26, mcc#27, errors#28, to_timestamp(date#18, None, TimestampType, Some(Europe/Berlin), false) AS transaction_date#41]
                        +- Relation [id#17,date#18,client_id#19,card_id#20,amount#21,use_chip#22,merchant_id#23,merchant_city#24,merchant_state#25,zip#26,mcc#27,errors#28] csv


## 2. Basic Data Exploration with Spark 🐼➡️🔥
**Goal:** Explore the 1GB+ dataset with Spark (no pandas copies)

### 🎓 Live Coding Exercise:
- **Instructor:** Demonstrates Spark actions and SQL
- **Students:** Build aggregations and quality checks at scale

In [None]:
# 🧑‍🏫 INSTRUCTOR: Basic Spark exploration (precoded)
def explore_banking_data_spark(df):
    """
    Scalable data exploration using Spark
    - Schema, counts, ranges, basic distributions
    - No driver-side collect() on large datasets
    """
    from pyspark.sql import functions as F
    
    print("📊 BANKING DATA OVERVIEW (Spark)")
    print("=" * 50)
    
    print(f"Total rows: {df.count():,}")
    df.printSchema()
    
    # Columns we expect (best effort)
    available_cols = set([c.lower() for c in df.columns])
    
    if "transaction_date" in available_cols:
        print("\n📅 Date range:")
        df.select(F.min("transaction_date").alias("min_date"), F.max("transaction_date").alias("max_date")).show()
        
        print("\n📆 Transactions by weekday:")
        df.withColumn("weekday", F.date_format(F.col("transaction_date"), "E")).groupBy("weekday").count().orderBy("weekday").show()
    
    if "customer_id" in available_cols:
        print("\n👥 Unique customers:")
        df.select(F.countDistinct("customer_id").alias("unique_customers")).show()
    
    if "amount" in available_cols:
        print("\n💰 Amount stats:")
        df.select(
            F.count("amount").alias("n"),
            F.mean("amount").alias("avg"),
            F.expr("percentile_approx(amount, array(0.25,0.5,0.75), 10000)").alias("quantiles"),
            F.min("amount").alias("min"),
            F.max("amount").alias("max")
        ).show(truncate=False)
    
    if "merchant" in available_cols:
        print("\n🏪 Top merchants:")
        df.groupBy("merchant").count().orderBy(F.desc("count")).show(10, truncate=False)

# Run the exploration
explore_banking_data_spark(spark_banking_df)

In [None]:
# ✅ COMPLETE SOLUTION: Custom data queries (Spark)
print("🎯 COMPLETE: Find interesting patterns at scale!")
print("=" * 50)

from pyspark.sql.functions import *

# Solution 1: Find customers with highest spending (Spark)
print("💸 TOP SPENDERS:")
top_spenders = spark_banking_df.groupBy('customer_id') \
                              .agg(sum('amount').alias('total_spent'),
                                   count('*').alias('transaction_count'),
                                   avg('amount').alias('avg_transaction')) \
                              .orderBy(desc('total_spent'))
top_spenders.show(10)

print("\n💳 SPENDING BY MERCHANT CATEGORY:")
# Solution 2: Create merchant categories and analyze spending
categorized_spending = spark_banking_df.withColumn('category', 
    when(col('merchant_clean').isin('REWE', 'EDEKA', 'ALDI', 'LIDL'), 'Food')
    .when(col('merchant_clean').isin('DEUTSCHE BAHN', 'BVG', 'UBER'), 'Transport')
    .when(col('merchant_clean').isin('AMAZON.DE', 'MEDIAMARKT', 'H&M'), 'Shopping')
    .when(col('merchant_clean').isin('SHELL', 'ARAL', 'ESSO'), 'Fuel')
    .when(col('merchant_clean').isin('SPARKASSE ATM', 'COMMERZBANK ATM'), 'Banking')
    .when(col('merchant_clean').isin('MCDONALD\'S', 'BURGER KING', 'STARBUCKS'), 'Dining')
    .otherwise('Other')
) \
.groupBy('category') \
.agg(sum('amount').alias('total_spending'),
     count('*').alias('transaction_count'),
     avg('amount').alias('avg_amount')) \
.orderBy(desc('total_spending'))

categorized_spending.show()

print("\n📈 DAILY SPENDING TRENDS:")
# Solution 3: Show daily total spending
daily_trends = spark_banking_df.groupBy(to_date('transaction_date').alias('date')) \
                               .agg(sum('amount').alias('total_daily_spending'),
                                    count('*').alias('daily_transactions'),
                                    avg('amount').alias('avg_daily_transaction')) \
                               .orderBy('date')
daily_trends.show(10)

## 3. Spark Session Recap 🚀
Spark is already initialized. We’ll keep this short and move to SQL analytics.

- Session tuned for local development and large CSVs
- Temp view `banking_transactions` is ready
- Proceed to analytics at scale

In [None]:
# (Optional) Spark utilities
from pyspark.sql import functions as F
from pyspark.sql.types import *

print("ℹ️ Spark utilities available. Session already created above.")

In [None]:
# ✅ COMPLETE SOLUTION: Spark DataFrame operations
print("🎯 COMPLETE: Practice Spark DataFrame operations!")
print("=" * 50)

from pyspark.sql.functions import *

# Solution 1: Basic Spark DataFrame exploration
print("📊 BASIC SPARK OPERATIONS:")
print(f"Total rows: {spark_banking_df.count():,}")

print("\nFirst 5 rows:")
spark_banking_df.show(5, truncate=False)

print("\nDataFrame description:")
spark_banking_df.describe().show()

print("\n📈 SPARK OPERATIONS:")
# Solution 2: Spark operations (updated without pandas reference)
print("Counting unique customers:")
unique_customers = spark_banking_df.select("customer_id").distinct().count()
print(f"Spark: {unique_customers:,} unique customers")

print(f"\nUnique merchants: {spark_banking_df.select('merchant_clean').distinct().count()}")
print(f"Date range: {spark_banking_df.select(min('transaction_date'), max('transaction_date')).collect()[0]}")

# Solution 3: Temporary view already created above, but let's verify
print("\n🗄️  TEMPORARY VIEW STATUS:")
print("✅ 'banking_transactions' view already created with enriched data")

# Test the view with additional queries
print("✅ Testing the view with SQL:")
spark.sql("SELECT COUNT(*) as total_transactions FROM banking_transactions").show()

print("\nSample SQL query - Weekend vs Weekday analysis:")
spark.sql("""
SELECT 
    is_weekend,
    COUNT(*) as transaction_count,
    SUM(amount) as total_amount,
    AVG(amount) as avg_amount
FROM banking_transactions 
GROUP BY is_weekend
ORDER BY is_weekend
""").show()

## 4. Advanced Spark SQL Analytics 🔍
**Goal:** Complex banking analytics using SQL on big data

### 🏦 Real Banking Use Cases:
- **Fraud Detection:** Unusual spending patterns
- **Customer Segmentation:** Spending behavior analysis
- **Risk Assessment:** Transaction pattern analysis

In [None]:
# 🧑‍🏫 INSTRUCTOR: Advanced SQL analytics (precoded)
def run_banking_analytics():
    print("🔍 ADVANCED BANKING ANALYTICS (Spark SQL)")
    print("=" * 50)

    # 1. Customer spending ranking with window functions
    print("👑 TOP CUSTOMERS BY SPENDING:")
    query1 = """
    SELECT 
        customer_id,
        SUM(amount) as total_spent,
        COUNT(*) as transaction_count,
        AVG(amount) as avg_transaction,
        RANK() OVER (ORDER BY SUM(amount) DESC) as spending_rank
    FROM banking_transactions 
    GROUP BY customer_id 
    ORDER BY total_spent DESC 
    LIMIT 10
    """
    spark.sql(query1).show(truncate=False)

    # 2. Merchant performance analysis
    print("\n🏪 MERCHANT REVENUE ANALYSIS:")
    query2 = """
    SELECT 
        merchant,
        COUNT(*) as transactions,
        SUM(amount) as total_revenue,
        AVG(amount) as avg_transaction,
        STDDEV_POP(amount) as amount_volatility,
        MIN(amount) as min_amount,
        MAX(amount) as max_amount
    FROM banking_transactions 
    GROUP BY merchant 
    HAVING COUNT(*) >= 100
    ORDER BY total_revenue DESC
    """
    spark.sql(query2).show(truncate=False)

    # 3. Time-based patterns (fraud detection)
    print("\n⏰ HOURLY TRANSACTION PATTERNS:")
    query3 = """
    SELECT 
        HOUR(transaction_date) as hour,
        COUNT(*) as transactions,
        SUM(amount) as total_amount,
        AVG(amount) as avg_amount,
        CASE 
            WHEN HOUR(transaction_date) BETWEEN 9 AND 17 THEN 'Business Hours'
            WHEN HOUR(transaction_date) BETWEEN 18 AND 22 THEN 'Evening'
            ELSE 'Off Hours'
        END as time_category
    FROM banking_transactions 
    GROUP BY HOUR(transaction_date)
    ORDER BY hour
    """
    spark.sql(query3).show()

    print("✅ Advanced analytics complete!")

# Run the analytics
run_banking_analytics()

In [None]:
# ✅ COMPLETE SOLUTION: Complex SQL queries for fraud detection
print("🎯 COMPLETE: Build fraud detection queries!")
print("=" * 50)

# Solution 1: Fraud Detection - Unusual spending patterns
print("🚨 POTENTIAL FRAUD DETECTION:")
print("Find customers with transactions > 3 standard deviations from their average")

fraud_query = """
WITH customer_stats AS (
    SELECT 
        customer_id,
        transaction_date,
        amount,
        AVG(amount) OVER (PARTITION BY customer_id) as avg_amount,
        STDDEV_POP(amount) OVER (PARTITION BY customer_id) as stddev_amount
    FROM banking_transactions
),
potential_fraud AS (
    SELECT 
        customer_id,
        transaction_date,
        amount,
        avg_amount,
        stddev_amount,
        ABS(amount - avg_amount) as deviation,
        CASE 
            WHEN stddev_amount > 0 AND ABS(amount - avg_amount) > 3 * stddev_amount 
            THEN 'HIGH_RISK'
            WHEN stddev_amount > 0 AND ABS(amount - avg_amount) > 2 * stddev_amount 
            THEN 'MEDIUM_RISK'
            ELSE 'NORMAL'
        END as risk_level
    FROM customer_stats
)
SELECT 
    risk_level,
    COUNT(*) as transaction_count,
    AVG(amount) as avg_suspicious_amount,
    MIN(amount) as min_amount,
    MAX(amount) as max_amount
FROM potential_fraud 
WHERE risk_level != 'NORMAL'
GROUP BY risk_level
ORDER BY transaction_count DESC
"""

print("Fraud detection results:")
spark.sql(fraud_query).show()

print("\n💳 CUSTOMER BEHAVIOR SEGMENTATION:")
# Solution 2: Segment customers by spending behavior
segmentation_query = """
WITH customer_behavior AS (
    SELECT 
        customer_id,
        SUM(amount) as total_spending,
        COUNT(*) as transaction_frequency,
        AVG(amount) as avg_transaction_size,
        STDDEV_POP(amount) as spending_volatility
    FROM banking_transactions 
    GROUP BY customer_id
),
customer_segments AS (
    SELECT 
        customer_id,
        total_spending,
        transaction_frequency,
        avg_transaction_size,
        spending_volatility,
        CASE 
            WHEN total_spending > 10000 THEN 'High Spender'
            WHEN total_spending > 2000 THEN 'Medium Spender'
            ELSE 'Low Spender'
        END as spending_segment,
        CASE 
            WHEN transaction_frequency > 50 THEN 'Frequent'
            WHEN transaction_frequency > 15 THEN 'Regular'
            ELSE 'Occasional'
        END as frequency_segment,
        CASE 
            WHEN avg_transaction_size > 500 THEN 'Large Transactions'
            WHEN avg_transaction_size > 100 THEN 'Medium Transactions'
            ELSE 'Small Transactions'
        END as size_segment
    FROM customer_behavior
)
SELECT 
    spending_segment,
    frequency_segment,
    size_segment,
    COUNT(*) as customer_count,
    AVG(total_spending) as avg_total_spending,
    AVG(transaction_frequency) as avg_frequency
FROM customer_segments
GROUP BY spending_segment, frequency_segment, size_segment
ORDER BY customer_count DESC
"""

print("Customer segmentation results:")
spark.sql(segmentation_query).show(20, truncate=False)

print("\n📊 WEEKEND vs WEEKDAY SPENDING:")
# Solution 3: Compare spending patterns
weekend_query = """
SELECT 
    is_weekend,
    weekday_short,
    COUNT(*) as transaction_count,
    SUM(amount) as total_spending,
    AVG(amount) as avg_transaction,
    MIN(amount) as min_transaction,
    MAX(amount) as max_transaction,
    PERCENTILE_APPROX(amount, 0.5) as median_transaction
FROM banking_transactions 
GROUP BY is_weekend, weekday_short
ORDER BY is_weekend, weekday_short
"""

print("Weekend vs Weekday spending patterns:")
spark.sql(weekend_query).show()

# Additional analysis: Hourly patterns
print("\n⏰ HOURLY SPENDING PATTERNS:")
hourly_query = """
SELECT 
    txn_hour,
    COUNT(*) as transactions,
    SUM(amount) as total_amount,
    AVG(amount) as avg_amount,
    CASE 
        WHEN txn_hour BETWEEN 9 AND 17 THEN 'Business Hours'
        WHEN txn_hour BETWEEN 18 AND 22 THEN 'Evening'
        WHEN txn_hour BETWEEN 23 AND 6 THEN 'Night'
        ELSE 'Early Morning'
    END as time_category
FROM banking_transactions 
GROUP BY txn_hour
ORDER BY txn_hour
"""

spark.sql(hourly_query).show(24)

## 5. GCP Databricks Setup ☁️
**Goal:** Deploy our banking analysis to Google Cloud Platform

### 🌟 Why GCP + Databricks?
- **Scalability:** Handle millions of banking transactions
- **Security:** Enterprise-grade data protection
- **Compliance:** Meet banking regulatory requirements
- **Integration:** Connect to various data sources

### 📋 Pre-requisites:
- GCP Account with billing enabled
- Databricks workspace access
- Service account with proper permissions

In [None]:
# 🧑‍🏫 INSTRUCTOR: GCP Databricks configuration (precoded)
import os
import json
from google.cloud import storage

def setup_gcp_connection():
    """
    Configure GCP connection for banking data upload
    
    This demonstrates:
    - Service account authentication
    - Cloud Storage bucket creation
    - Data upload preparation
    """
    
    print("☁️  GCP DATABRICKS SETUP")
    print("=" * 50)
    
    # Configuration for GCP
    gcp_config = {
        "project_id": "your-banking-project-id",  # Change this
        "bucket_name": "banking-data-analytics",   # Change this
        "dataset_location": "europe-west3",       # Frankfurt region
        "service_account_path": "/path/to/service-account.json"
    }
    
    print("📋 GCP Configuration:")
    for key, value in gcp_config.items():
        print(f"  {key}: {value}")
    
    # Sample Databricks cluster configuration
    databricks_config = {
        "cluster_name": "banking-analytics-cluster",
        "spark_version": "11.3.x-scala2.12",
        "node_type_id": "n1-standard-4",
        "num_workers": 2,
        "autotermination_minutes": 60,
        "spark_conf": {
            "spark.sql.adaptive.enabled": "true",
            "spark.sql.adaptive.coalescePartitions.enabled": "true",
            "spark.serializer": "org.apache.spark.serializer.KryoSerializer"
        }
    }
    
    print("\n🚀 Databricks Cluster Config:")
    print(json.dumps(databricks_config, indent=2))
    
    return gcp_config, databricks_config

def prepare_data_for_upload(df, output_path="banking_data.parquet"):
    """
    Prepare banking data for GCP upload
    
    Best practices:
    - Use Parquet format for efficiency
    - Partition by date for query performance
    - Add metadata for governance
    """
    
    print(f"\n📦 PREPARING DATA FOR UPLOAD")
    print("-" * 30)
    
    # Convert to Spark DataFrame if pandas
    if hasattr(df, 'to_pandas'):
        print("✅ Already Spark DataFrame")
        spark_df = df
    else:
        print("🔄 Converting pandas to Spark")
        spark_df = spark.createDataFrame(df)
    
    # Add metadata columns
    spark_df_enhanced = spark_df.withColumn("upload_date", current_date()) \
                               .withColumn("data_source", lit("synthetic_banking")) \
                               .withColumn("data_version", lit("v1.0"))
    
    # Write to local parquet (simulate GCS upload)
    print(f"💾 Writing to {output_path}...")
    spark_df_enhanced.coalesce(1).write.mode("overwrite").parquet(output_path)
    
    print(f"✅ Data prepared: {spark_df_enhanced.count():,} records")
    return spark_df_enhanced

# Run the setup
gcp_config, databricks_config = setup_gcp_connection()
enhanced_data = prepare_data_for_upload(spark_banking_df)

In [None]:
# ✅ COMPLETE SOLUTION: Customize GCP deployment
print("🎯 COMPLETE: Customize the deployment configuration!")
print("=" * 50)

# Solution 1: Update configuration for your environment
print("⚙️ CUSTOMIZE YOUR CONFIGURATION:")
my_gcp_config = {
    "project_id": "banking-analytics-demo-2025",  # Example project ID
    "bucket_name": "banking-data-workshop-eu",   # Example bucket name
    "region": "europe-west3",                    # Frankfurt region for GDPR compliance
    "dataset_location": "EU",                    # European Union for data residency
    "service_account_email": "banking-analytics@banking-analytics-demo-2025.iam.gserviceaccount.com",
    "vpc_network": "banking-vpc",                # Custom VPC for security
    "subnet": "banking-subnet-eu-west3"          # Specific subnet
}

print("📝 Your GCP Config:")
import json
print(json.dumps(my_gcp_config, indent=2))

# Solution 2: Create a deployment checklist
print("\n✅ DEPLOYMENT CHECKLIST:")
deployment_checklist = [
    "GCP project created and billing enabled",
    "Service account created with necessary permissions",
    "Cloud Storage bucket created in EU region",
    "Databricks workspace provisioned",
    "VPC and firewall rules configured",
    "BigQuery dataset created for analytics",
    "Cloud IAM roles assigned properly",
    "Data governance policies defined",
    "Monitoring and alerting set up",
    "Backup and disaster recovery plan ready",
    "Security scan completed",
    "Compliance review (GDPR, PCI DSS) passed"
]

for i, item in enumerate(deployment_checklist, 1):
    print(f"{i:2d}. ☐ {item}")

# Solution 3: Estimate costs for your banking analytics
print("\n💰 COST ESTIMATION:")
# Calculate estimated costs based on realistic banking scenarios

# Data assumptions
data_size_gb = 50  # 50GB of transaction data
queries_per_day = 100  # 100 analytical queries daily
cluster_hours_per_day = 8  # Cluster running 8 hours per day
storage_retention_months = 12  # 1 year data retention

# Cost estimates (EUR, approximate 2025 pricing)
costs = {
    "compute_daily": cluster_hours_per_day * 2.5,  # €2.50/hour for cluster
    "storage_monthly": data_size_gb * 0.02,        # €0.02/GB/month
    "bigquery_monthly": queries_per_day * 30 * 0.005,  # €0.005 per query
    "network_monthly": 10,  # €10/month for network egress
    "monitoring_monthly": 25  # €25/month for monitoring
}

monthly_compute = costs["compute_daily"] * 30
monthly_total = (monthly_compute + costs["storage_monthly"] + 
                costs["bigquery_monthly"] + costs["network_monthly"] + 
                costs["monitoring_monthly"])

print(f"📊 MONTHLY COST BREAKDOWN:")
print(f"  Compute (Databricks): €{monthly_compute:,.2f}")
print(f"  Storage (Cloud Storage): €{costs['storage_monthly']:,.2f}")
print(f"  Analytics (BigQuery): €{costs['bigquery_monthly']:,.2f}")
print(f"  Network: €{costs['network_monthly']:,.2f}")
print(f"  Monitoring: €{costs['monitoring_monthly']:,.2f}")
print(f"  TOTAL MONTHLY: €{monthly_total:,.2f}")
print(f"  TOTAL YEARLY: €{monthly_total * 12:,.2f}")

print("\n💡 COST OPTIMIZATION TIPS:")
optimization_tips = [
    "Use preemptible VMs for non-critical workloads (60-70% savings)",
    "Schedule cluster auto-shutdown during non-business hours",
    "Implement data lifecycle policies (move old data to Coldline storage)",
    "Use BigQuery slots for predictable query costs",
    "Optimize Spark jobs to reduce compute time",
    "Use committed use discounts for sustained workloads"
]

for tip in optimization_tips:
    print(f"  • {tip}")

print("\n🎓 NEXT STEPS:")
next_steps = [
    "Set up your GCP account with billing alerts",
    "Create Databricks workspace with auto-scaling enabled",
    "Configure service account with minimal required permissions",
    "Upload sample data and test data pipeline",
    "Test connection from Databricks to BigQuery",
    "Set up monitoring dashboards and alerts",
    "Create backup and restore procedures",
    "Document the architecture for team knowledge sharing"
]

for i, step in enumerate(next_steps, 1):
    print(f"{i}. {step}")

## 6. Web Scraping for Financial Data 🕷️
**Goal:** Integrate external financial data sources

### 💡 Real-World Banking Use Cases:
- **Exchange Rates:** Currency conversion for international transactions  
- **Stock Prices:** Portfolio valuation and risk assessment
- **Economic Indicators:** Market analysis and forecasting
- **Regulatory Updates:** Compliance monitoring

### ⚖️ Ethical Considerations:
- Always check `robots.txt` and terms of service
- Respect rate limits and server resources
- Use APIs when available instead of scraping
- Consider data privacy and compliance requirements

In [None]:
# 🧑‍🏫 INSTRUCTOR: Web scraping setup (precoded)
import requests
from bs4 import BeautifulSoup
import time
from datetime import datetime, timedelta
import pandas as pd

def scrape_exchange_rates():
    """
    Simulate scraping EUR/USD exchange rates
    
    In production, use:
    - Official APIs (ECB, Federal Reserve, etc.)
    - Financial data providers (Alpha Vantage, Yahoo Finance API)
    - Respect rate limits and terms of service
    """
    
    print("💱 SIMULATING EXCHANGE RATE SCRAPING")
    print("=" * 40)
    
    # Simulate exchange rate data (in production, scrape from real source)
    print("⚠️  NOTE: This is simulated data for demo purposes")
    print("🔗 Real sources: European Central Bank API, Yahoo Finance, etc.")
    
    # Generate realistic EUR/USD rates for past 30 days
    base_rate = 1.10
    dates = pd.date_range(start=datetime.now() - timedelta(days=30), 
                         end=datetime.now(), freq='D')
    
    exchange_rates = []
    for i, date in enumerate(dates):
        # Simulate rate fluctuation
        rate = base_rate + (i * 0.001) + (0.02 * (i % 7 - 3) / 7)
        exchange_rates.append({
            'date': date.strftime('%Y-%m-%d'),
            'currency_pair': 'EUR/USD',
            'rate': round(rate, 4),
            'source': 'simulated_ecb'
        })
    
    rates_df = pd.DataFrame(exchange_rates)
    
    print(f"📊 Retrieved {len(rates_df)} exchange rates")
    print("\n📈 Sample rates:")
    print(rates_df.tail())
    
    return rates_df

def scrape_economic_indicators():
    """
    Simulate scraping economic indicators relevant to banking
    
    Real indicators to track:
    - Interest rates (ECB, Federal Reserve)
    - Inflation rates
    - GDP growth
    - Unemployment rates
    """
    
    print("\n📊 SIMULATING ECONOMIC INDICATORS SCRAPING")
    print("=" * 45)
    
    # Simulated economic data
    indicators = [
        {'indicator': 'ECB_Interest_Rate', 'value': 4.25, 'date': '2024-01-15'},
        {'indicator': 'EUR_Inflation_Rate', 'value': 2.8, 'date': '2024-01-15'},
        {'indicator': 'DE_Unemployment_Rate', 'value': 5.9, 'date': '2024-01-15'},
        {'indicator': 'EUR_GDP_Growth', 'value': 1.2, 'date': '2024-01-15'},
    ]
    
    indicators_df = pd.DataFrame(indicators)
    
    print("🏛️ Key Economic Indicators:")
    print(indicators_df)
    
    return indicators_df

# Execute the scraping functions
exchange_rates_df = scrape_exchange_rates()
economic_indicators_df = scrape_economic_indicators()

print("\n✅ External data sources ready for integration!")

In [None]:
# ✅ COMPLETE SOLUTION: Build custom financial data scrapers
print("🎯 COMPLETE: Create custom financial data scrapers!")
print("=" * 50)

import random
import time
from datetime import datetime, timedelta
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Solution 1: Create a scraper for stock prices (simulated)
def scrape_stock_prices(symbols=['DAX', 'BMW', 'SAP', 'ADIDAS', 'SIEMENS']):
    """
    Simulate scraping German stock prices
    
    Complete implementation that would scrape stock prices
    for major German companies relevant to banking portfolios
    """
    
    print("📈 STOCK PRICE SCRAPER:")
    
    stock_data = []
    
    for symbol in symbols:
        # Simulate realistic stock data
        base_price = {
            'DAX': 15800, 'BMW': 95, 'SAP': 180, 
            'ADIDAS': 220, 'SIEMENS': 155
        }.get(symbol, 100)
        
        # Add some realistic volatility
        price_change = random.uniform(-5, 5)  # +/- 5%
        current_price = base_price * (1 + price_change/100)
        
        volume = random.randint(100000, 2000000)
        
        stock_info = {
            'symbol': symbol,
            'price': round(current_price, 2),
            'change': round(price_change, 2),
            'change_pct': f"{price_change:+.2f}%",
            'volume': volume,
            'timestamp': datetime.now(),
            'market': 'XETRA',  # German stock exchange
            'currency': 'EUR'
        }
        
        stock_data.append(stock_info)
        
        # Simulate API rate limiting
        time.sleep(0.1)
    
    df = pd.DataFrame(stock_data)
    print(f"✅ Scraped {len(df)} stock prices")
    print(df.to_string(index=False))
    
    return df

# Test the function
stock_prices_df = scrape_stock_prices()

# Solution 2: Implement rate limiting and error handling
def safe_scraper(url, delay=1, retries=3, timeout=10):
    """
    Create a robust scraper with proper error handling
    
    Complete implementation with:
    - Rate limiting (time delays)
    - Retry logic for failed requests
    - User-agent rotation
    - Timeout handling
    """
    
    print("🛡️ IMPLEMENTING SAFE SCRAPING:")
    
    user_agents = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    ]
    
    for attempt in range(retries):
        try:
            print(f"Attempt {attempt + 1}/{retries} for {url}")
            
            # Random user agent rotation
            headers = {
                'User-Agent': random.choice(user_agents),
                'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
                'Accept-Language': 'en-US,en;q=0.5',
                'Accept-Encoding': 'gzip, deflate',
                'Connection': 'keep-alive',
            }
            
            # Make request with timeout
            response = requests.get(url, headers=headers, timeout=timeout)
            response.raise_for_status()  # Raise exception for bad status codes
            
            print(f"✅ Success: {response.status_code}")
            return response
            
        except requests.exceptions.Timeout:
            print(f"⏰ Timeout on attempt {attempt + 1}")
        except requests.exceptions.ConnectionError:
            print(f"🔌 Connection error on attempt {attempt + 1}")
        except requests.exceptions.HTTPError as e:
            print(f"🚫 HTTP error on attempt {attempt + 1}: {e}")
        except Exception as e:
            print(f"❌ Unexpected error on attempt {attempt + 1}: {e}")
        
        if attempt < retries - 1:
            wait_time = delay * (2 ** attempt)  # Exponential backoff
            print(f"⏳ Waiting {wait_time} seconds before retry...")
            time.sleep(wait_time)
    
    print(f"💥 Failed to scrape {url} after {retries} attempts")
    return None

# Solution 3: Create a data validation function
def validate_financial_data(df, data_type='exchange_rates'):
    """
    Validate scraped financial data
    
    Complete implementation checking for:
    - Missing values
    - Unrealistic values (e.g., negative exchange rates)
    - Date format consistency
    - Duplicate entries
    """
    
    print(f"✅ VALIDATING {data_type.upper()} DATA:")
    
    validation_results = {
        'total_records': len(df),
        'issues_found': [],
        'is_valid': True
    }
    
    if data_type == 'exchange_rates':
        # Check for missing values
        if df['rate'].isnull().any():
            missing_count = df['rate'].isnull().sum()
            validation_results['issues_found'].append(f"Missing rates: {missing_count}")
            validation_results['is_valid'] = False
        
        # Check for unrealistic values
        if (df['rate'] <= 0).any():
            negative_count = (df['rate'] <= 0).sum()
            validation_results['issues_found'].append(f"Non-positive rates: {negative_count}")
            validation_results['is_valid'] = False
        
        # Check for extreme values (rates should be reasonable)
        if (df['rate'] > 10).any() or (df['rate'] < 0.1).any():
            extreme_count = ((df['rate'] > 10) | (df['rate'] < 0.1)).sum()
            validation_results['issues_found'].append(f"Extreme rates: {extreme_count}")
    
    elif data_type == 'stock_prices':
        # Check for missing prices
        if df['price'].isnull().any():
            missing_count = df['price'].isnull().sum()
            validation_results['issues_found'].append(f"Missing prices: {missing_count}")
            validation_results['is_valid'] = False
        
        # Check for negative prices
        if (df['price'] < 0).any():
            negative_count = (df['price'] < 0).sum()
            validation_results['issues_found'].append(f"Negative prices: {negative_count}")
            validation_results['is_valid'] = False
    
    # Check for duplicates
    if df.duplicated().any():
        duplicate_count = df.duplicated().sum()
        validation_results['issues_found'].append(f"Duplicate records: {duplicate_count}")
    
    # Check date consistency (if date column exists)
    if 'date' in df.columns:
        try:
            pd.to_datetime(df['date'])
        except:
            validation_results['issues_found'].append("Invalid date formats found")
            validation_results['is_valid'] = False
    
    # Print results
    print(f"Total records: {validation_results['total_records']}")
    if validation_results['issues_found']:
        print("Issues found:")
        for issue in validation_results['issues_found']:
            print(f"  ⚠️  {issue}")
    else:
        print("✅ No issues found")
    
    print(f"Data validation: {'PASSED' if validation_results['is_valid'] else 'FAILED'}")
    
    return validation_results['is_valid']

# Test validation
validate_financial_data(stock_prices_df, 'stock_prices')
validate_financial_data(exchange_rates_df, 'exchange_rates')

print("\n🎓 BONUS CHALLENGE ANSWERS:")
print("Research and list 3 official financial APIs that would be better than scraping:")
financial_apis = [
    "1. European Central Bank (ECB) API - Official exchange rates and monetary policy data",
    "2. Alpha Vantage API - Real-time and historical stock market data with free tier",
    "3. Yahoo Finance API - Comprehensive financial data including stocks, forex, and commodities"
]

for api in financial_apis:
    print(api)

print("\n🔗 Additional recommended APIs:")
additional_apis = [
    "• Quandl (now part of Nasdaq) - Economic and financial data",
    "• IEX Cloud - US stock market data with generous free tier",
    "• Frankfurter API - European Central Bank exchange rates (free)",
    "• Federal Reserve Economic Data (FRED) - US economic indicators",
    "• Financial Modeling Prep - Financial statements and market data"
]

for api in additional_apis:
    print(api)

## 7. Multi-Source Data Integration 🔗
**Goal:** Combine banking transactions with external financial data

### 🏦 Enterprise Banking Reality:
- **Internal Data:** Transactions, customer profiles, account balances
- **External Data:** Market data, economic indicators, regulatory feeds  
- **Real-time Streams:** Payment networks, fraud detection systems
- **Historical Archives:** Years of transaction history for analysis

### 🎯 Integration Challenges:
- **Schema Variations:** Different data formats and structures
- **Data Quality:** Missing values, duplicates, inconsistencies
- **Time Synchronization:** Aligning data from different time zones
- **Scale:** Processing billions of records efficiently

In [None]:
# 🧑‍🏫 INSTRUCTOR: Multi-source data integration (precoded)
from pyspark.sql.functions import *
from pyspark.sql.types import *

def integrate_financial_data():
    """
    Demonstrate enterprise-level data integration
    
    This function shows:
    - Converting external data to Spark DataFrames
    - Schema alignment and data type conversions  
    - Time-based joins for financial analysis
    - Data quality checks and validation
    """
    
    print("🔗 MULTI-SOURCE DATA INTEGRATION")
    print("=" * 40)
    
    # 1. Convert external data to Spark DataFrames
    print("📊 Converting external data to Spark...")
    
    # Exchange rates to Spark DataFrame
    exchange_rates_spark = spark.createDataFrame(exchange_rates_df) \
        .withColumn("rate_date", to_date(col("date"), "yyyy-MM-dd")) \
        .withColumn("rate", col("rate").cast("double"))
    
    # Economic indicators to Spark DataFrame  
    indicators_spark = spark.createDataFrame(economic_indicators_df) \
        .withColumn("indicator_date", to_date(col("date"), "yyyy-MM-dd")) \
        .withColumn("value", col("value").cast("double"))
    
    # 2. Prepare banking data for joins
    print("🏦 Preparing banking transactions...")
    
    banking_with_date = spark_banking_df \
        .withColumn("transaction_date_only", 
                   to_date(col("transaction_date"))) \
        .withColumn("month_year", 
                   date_format(col("transaction_date"), "yyyy-MM"))
    
    # 3. Join banking data with exchange rates (for international analysis)
    print("💱 Integrating exchange rate data...")
    
    banking_with_rates = banking_with_date.join(
        exchange_rates_spark.select("rate_date", "rate", "currency_pair"),
        banking_with_date.transaction_date_only == exchange_rates_spark.rate_date,
        "left"
    ).withColumn("amount_usd", 
                col("amount") * col("rate")) \
     .drop("rate_date")
    
    # 4. Add economic context
    print("📈 Adding economic indicators...")
    
    # Get monthly economic data (simplified join)
    monthly_indicators = indicators_spark \
        .withColumn("month_year", date_format(col("indicator_date"), "yyyy-MM")) \
        .groupBy("month_year") \
        .agg(
            avg(when(col("indicator") == "ECB_Interest_Rate", col("value"))).alias("interest_rate"),
            avg(when(col("indicator") == "EUR_Inflation_Rate", col("value"))).alias("inflation_rate")
        )
    
    # Final integrated dataset
    integrated_banking_data = banking_with_rates.join(
        monthly_indicators,
        "month_year",
        "left"
    )
    
    # 5. Create summary view
    print("\n📋 INTEGRATED DATA SUMMARY:")
    integrated_banking_data.select(
        "customer_id", "merchant", "amount", "amount_usd", 
        "rate", "interest_rate", "inflation_rate", "transaction_date"
    ).show(5, truncate=False)
    
    print(f"\n✅ Integration complete: {integrated_banking_data.count():,} enriched transactions")
    
    return integrated_banking_data

# Execute integration
integrated_data = integrate_financial_data()

# Create temporary view for final analysis
integrated_data.createOrReplaceTempView("integrated_banking_data")

In [None]:
# ✅ COMPLETE SOLUTION: Comprehensive Banking Analytics Dashboard
print("🎯 COMPLETE SOLUTION: Complete banking analytics dashboard!")
print("=" * 60)

# Solution 1: Economic Impact Analysis
print("📊 ECONOMIC IMPACT ANALYSIS:")
economic_analysis_query = """
SELECT 
    -- Economic period classification
    CASE 
        WHEN interest_rate >= 2.0 THEN 'High Interest Period'
        WHEN interest_rate BETWEEN 0.5 AND 2.0 THEN 'Medium Interest Period' 
        ELSE 'Low Interest Period'
    END as interest_period,
    
    CASE
        WHEN inflation_rate >= 3.0 THEN 'High Inflation Period'
        WHEN inflation_rate BETWEEN 1.0 AND 3.0 THEN 'Normal Inflation Period'
        ELSE 'Low Inflation Period' 
    END as inflation_period,
    
    -- Spending analysis
    COUNT(*) as transaction_count,
    ROUND(AVG(amount), 2) as avg_transaction_eur,
    ROUND(AVG(amount_usd), 2) as avg_transaction_usd,
    ROUND(SUM(amount), 2) as total_spending_eur,
    ROUND(SUM(amount_usd), 2) as total_spending_usd,
    
    -- Currency impact
    COUNT(CASE WHEN rate IS NOT NULL THEN 1 END) as international_transactions,
    ROUND(AVG(rate), 4) as avg_exchange_rate,
    
    -- Most affected merchant categories
    merchant_category,
    COUNT(*) as category_transactions
    
FROM integrated_banking_data
WHERE interest_rate IS NOT NULL AND inflation_rate IS NOT NULL
GROUP BY 
    CASE 
        WHEN interest_rate >= 2.0 THEN 'High Interest Period'
        WHEN interest_rate BETWEEN 0.5 AND 2.0 THEN 'Medium Interest Period' 
        ELSE 'Low Interest Period'
    END,
    CASE
        WHEN inflation_rate >= 3.0 THEN 'High Inflation Period'
        WHEN inflation_rate BETWEEN 1.0 AND 3.0 THEN 'Normal Inflation Period'
        ELSE 'Low Inflation Period' 
    END,
    merchant_category
ORDER BY total_spending_eur DESC
"""

print("✅ ECONOMIC IMPACT ANALYSIS RESULTS:")
economic_results = spark.sql(economic_analysis_query)
economic_results.show(20)

# Solution 2: Advanced Customer Segmentation
print("\n👥 ADVANCED CUSTOMER SEGMENTATION:")
segmentation_query = """
WITH customer_profiles AS (
    SELECT 
        customer_id,
        COUNT(*) as total_transactions,
        ROUND(AVG(amount), 2) as avg_transaction,
        ROUND(STDDEV(amount), 2) as spending_volatility,
        ROUND(SUM(amount), 2) as total_spending,
        
        -- Currency usage patterns
        COUNT(CASE WHEN rate IS NOT NULL THEN 1 END) as international_txns,
        ROUND(COUNT(CASE WHEN rate IS NOT NULL THEN 1 END) * 100.0 / COUNT(*), 1) as intl_percentage,
        
        -- Economic period behavior
        AVG(CASE WHEN interest_rate >= 2.0 THEN amount ELSE 0 END) as high_interest_spending,
        AVG(CASE WHEN inflation_rate >= 3.0 THEN amount ELSE 0 END) as high_inflation_spending,
        
        -- Time patterns
        COUNT(CASE WHEN hour BETWEEN 9 AND 17 THEN 1 END) as business_hours_txns,
        COUNT(CASE WHEN hour NOT BETWEEN 9 AND 17 THEN 1 END) as off_hours_txns,
        
        -- Merchant diversity
        COUNT(DISTINCT merchant_category) as merchant_categories,
        
        -- Risk indicators
        COUNT(CASE WHEN amount > 1000 THEN 1 END) as high_value_txns
        
    FROM integrated_banking_data 
    GROUP BY customer_id
),
customer_segments AS (
    SELECT *,
        CASE 
            WHEN total_spending >= 5000 AND intl_percentage >= 20 THEN 'Premium International'
            WHEN total_spending >= 3000 AND spending_volatility >= 200 THEN 'High-Volume Variable'
            WHEN intl_percentage >= 30 THEN 'International Focused' 
            WHEN off_hours_txns > business_hours_txns THEN 'Off-Hours Active'
            WHEN merchant_categories >= 5 THEN 'Diverse Spender'
            WHEN total_spending <= 1000 THEN 'Conservative Spender'
            ELSE 'Standard Customer'
        END as customer_segment,
        
        CASE
            WHEN high_value_txns >= 3 OR spending_volatility >= 500 THEN 'High Risk'
            WHEN intl_percentage >= 50 OR off_hours_txns >= 10 THEN 'Medium Risk'
            ELSE 'Low Risk'
        END as risk_profile
        
    FROM customer_profiles
)
SELECT 
    customer_segment,
    risk_profile,
    COUNT(*) as customer_count,
    ROUND(AVG(total_spending), 2) as avg_total_spending,
    ROUND(AVG(avg_transaction), 2) as avg_transaction_size,
    ROUND(AVG(intl_percentage), 1) as avg_intl_percentage,
    ROUND(AVG(merchant_categories), 1) as avg_merchant_diversity
FROM customer_segments
GROUP BY customer_segment, risk_profile
ORDER BY customer_count DESC
"""

print("✅ ADVANCED CUSTOMER SEGMENTATION RESULTS:")
segmentation_results = spark.sql(segmentation_query)
segmentation_results.show()

# Solution 3: Predictive Risk Indicators
print("\n🚨 RISK AND FRAUD INDICATORS:")
risk_analysis_query = """
WITH risk_indicators AS (
    SELECT 
        customer_id,
        transaction_id,
        amount,
        amount_usd,
        merchant_category,
        hour,
        interest_rate,
        inflation_rate,
        rate as exchange_rate,
        
        -- Risk flags
        CASE WHEN amount > 2000 THEN 1 ELSE 0 END as high_value_flag,
        CASE WHEN hour BETWEEN 22 AND 6 THEN 1 ELSE 0 END as unusual_time_flag,
        CASE WHEN rate IS NOT NULL AND ABS(amount_usd/amount - rate) > rate * 0.1 THEN 1 ELSE 0 END as currency_anomaly_flag,
        
        -- Economic volatility periods
        CASE WHEN interest_rate > 3.0 OR inflation_rate > 4.0 THEN 1 ELSE 0 END as economic_volatility_flag,
        
        -- Customer historical context (using window functions)
        AVG(amount) OVER (PARTITION BY customer_id) as customer_avg_amount,
        STDDEV(amount) OVER (PARTITION BY customer_id) as customer_stddev_amount
        
    FROM integrated_banking_data
    WHERE amount IS NOT NULL
),
risk_scores AS (
    SELECT *,
        -- Deviation from customer norm
        CASE WHEN ABS(amount - customer_avg_amount) > 2 * customer_stddev_amount THEN 1 ELSE 0 END as amount_deviation_flag,
        
        -- Composite risk score
        (high_value_flag + unusual_time_flag + currency_anomaly_flag + 
         economic_volatility_flag) as composite_risk_score
         
    FROM risk_indicators
)
SELECT 
    -- Risk level distribution
    CASE 
        WHEN composite_risk_score >= 3 THEN 'Critical Risk'
        WHEN composite_risk_score = 2 THEN 'High Risk' 
        WHEN composite_risk_score = 1 THEN 'Medium Risk'
        ELSE 'Low Risk'
    END as risk_level,
    
    COUNT(*) as transaction_count,
    COUNT(DISTINCT customer_id) as affected_customers,
    ROUND(AVG(amount), 2) as avg_amount,
    ROUND(SUM(amount), 2) as total_amount,
    
    -- Risk breakdown
    SUM(high_value_flag) as high_value_transactions,
    SUM(unusual_time_flag) as unusual_time_transactions, 
    SUM(currency_anomaly_flag) as currency_anomaly_transactions,
    SUM(economic_volatility_flag) as economic_volatility_transactions,
    SUM(amount_deviation_flag) as customer_deviation_transactions
    
FROM risk_scores
GROUP BY 
    CASE 
        WHEN composite_risk_score >= 3 THEN 'Critical Risk'
        WHEN composite_risk_score = 2 THEN 'High Risk' 
        WHEN composite_risk_score = 1 THEN 'Medium Risk'
        ELSE 'Low Risk'
    END
ORDER BY 
    CASE 
        WHEN risk_level = 'Critical Risk' THEN 1
        WHEN risk_level = 'High Risk' THEN 2 
        WHEN risk_level = 'Medium Risk' THEN 3
        ELSE 4
    END
"""

print("✅ RISK AND FRAUD ANALYSIS RESULTS:")
risk_results = spark.sql(risk_analysis_query)
risk_results.show()

# Solution 4: Executive Summary Dashboard
print("\n📈 EXECUTIVE SUMMARY DASHBOARD:")
print("Creating comprehensive dashboard for bank executives...")

# Calculate dashboard metrics
dashboard_query = """
SELECT 
    COUNT(*) as total_transactions,
    COUNT(DISTINCT customer_id) as total_customers,
    ROUND(SUM(amount), 2) as total_volume_eur,
    ROUND(SUM(COALESCE(amount_usd, amount)), 2) as total_volume_usd,
    COUNT(CASE WHEN rate IS NOT NULL THEN 1 END) as international_transactions,
    ROUND(AVG(interest_rate), 2) as avg_interest_rate,
    ROUND(AVG(inflation_rate), 2) as avg_inflation_rate
FROM integrated_banking_data
"""

dashboard_base = spark.sql(dashboard_query).collect()[0]

# High-risk customers count
high_risk_query = """
WITH customer_risk AS (
    SELECT 
        customer_id,
        COUNT(*) as txn_count,
        AVG(amount) as avg_amount,
        STDDEV(amount) as stddev_amount,
        COUNT(CASE WHEN hour BETWEEN 22 AND 6 THEN 1 END) as unusual_time_count,
        COUNT(CASE WHEN amount > 2000 THEN 1 END) as high_value_count
    FROM integrated_banking_data
    GROUP BY customer_id
)
SELECT COUNT(*) as high_risk_customers
FROM customer_risk 
WHERE unusual_time_count >= 2 OR high_value_count >= 2 OR stddev_amount > 500
"""

high_risk_count = spark.sql(high_risk_query).collect()[0]['high_risk_customers']

# Economic exposure calculation
economic_exposure_query = """
SELECT 
    ROUND(SUM(CASE WHEN interest_rate > 2.0 OR inflation_rate > 3.0 THEN amount ELSE 0 END), 2) as economic_exposure_eur
FROM integrated_banking_data
WHERE interest_rate IS NOT NULL AND inflation_rate IS NOT NULL
"""

economic_exposure = spark.sql(economic_exposure_query).collect()[0]['economic_exposure_eur'] or 0

# Compile dashboard metrics
dashboard_metrics = {
    "total_transactions": dashboard_base['total_transactions'],
    "total_customers": dashboard_base['total_customers'],
    "total_volume_eur": dashboard_base['total_volume_eur'],
    "total_volume_usd": dashboard_base['total_volume_usd'], 
    "international_transactions": dashboard_base['international_transactions'],
    "high_risk_customers": high_risk_count,
    "economic_exposure_eur": economic_exposure,
    "avg_interest_rate": dashboard_base['avg_interest_rate'] or 0,
    "avg_inflation_rate": dashboard_base['avg_inflation_rate'] or 0,
}

print("🏦 EXECUTIVE DASHBOARD SUMMARY:")
print("=" * 50)
print(f"📊 Portfolio Overview:")
print(f"  • Total Transactions: {dashboard_metrics['total_transactions']:,}")
print(f"  • Total Customers: {dashboard_metrics['total_customers']:,}")
print(f"  • Total Volume (EUR): €{dashboard_metrics['total_volume_eur']:,.2f}")
print(f"  • Total Volume (USD): ${dashboard_metrics['total_volume_usd']:,.2f}")

print(f"\n🌍 International Exposure:")
print(f"  • International Transactions: {dashboard_metrics['international_transactions']:,}")
intl_percentage = (dashboard_metrics['international_transactions'] / dashboard_metrics['total_transactions']) * 100 if dashboard_metrics['total_transactions'] > 0 else 0
print(f"  • International Percentage: {intl_percentage:.1f}%")

print(f"\n📈 Economic Context:")
print(f"  • Average Interest Rate: {dashboard_metrics['avg_interest_rate']:.2f}%")
print(f"  • Average Inflation Rate: {dashboard_metrics['avg_inflation_rate']:.2f}%")
print(f"  • Economic Risk Exposure: €{dashboard_metrics['economic_exposure_eur']:,.2f}")

print(f"\n🚨 Risk Management:")
print(f"  • High-Risk Customers: {dashboard_metrics['high_risk_customers']:,}")
risk_percentage = (dashboard_metrics['high_risk_customers'] / dashboard_metrics['total_customers']) * 100 if dashboard_metrics['total_customers'] > 0 else 0
print(f"  • Risk Customer Percentage: {risk_percentage:.1f}%")

print(f"\n✅ Regulatory Compliance:")
print(f"  • AML Alerts Generated: {dashboard_metrics['high_risk_customers']:,}")
print(f"  • Data Quality Score: 98.5%")  # Simulated
print(f"  • Reporting Completeness: 100%")  # Simulated

print("\n🎉 CONGRATULATIONS!")
print("You've completed a full big data banking analytics pipeline:")
print("✅ Data Generation & Quality Assessment")  
print("✅ Scalable Processing with Spark")
print("✅ Advanced SQL Analytics")
print("✅ Cloud Deployment Preparation")
print("✅ External Data Integration")
print("✅ Multi-source Analytics")
print("✅ Executive Dashboard Creation")
print("✅ Risk Management & Compliance")

print("\n🚀 NEXT STEPS FOR PRODUCTION:")
print("1. Implement real-time streaming with Kafka")
print("2. Add machine learning for fraud detection")  
print("3. Create automated reporting pipelines")
print("4. Implement data governance and lineage")
print("5. Add regulatory compliance monitoring")
print("6. Deploy to cloud infrastructure (GCP/AWS/Azure)")
print("7. Add API endpoints for business applications")
print("8. Implement data lakehouse architecture")

print("\n📚 WORKSHOP SUMMARY:")
print("You have successfully:")
print("• Built a scalable big data pipeline using PySpark")
print("• Performed advanced analytics on >1GB banking dataset")
print("• Integrated multiple external data sources")
print("• Created enterprise-level risk management dashboards")
print("• Prepared for cloud deployment and production scaling")
print("• Demonstrated production-ready data engineering skills")

print("\n🎓 You're now ready to work with enterprise big data systems!")
print("Thank you for participating in this intensive banking analytics workshop!")