# Banking Data Analysis - Live Coding Workshop
## Big Data Analytics im Banking | 13:00-15:40

### 🎯 **Workshop Agenda**
- **13:00-13:45:** Einführung in Datenanalyse + Banking Transaction Analysis
- **13:55-14:40:** Spark Deep-Dive & GCP Setup
- **14:50-15:40:** Datenbeschaffung und -integration

### 🛠 **Was wir heute lernen:**
1. **Datenanalyseprozess** in der Praxis
2. **Data Mining** für Banking-Patterns
3. **Spark Setup** und SQL-Queries
4. **GCP/Databricks** Configuration
5. **Web Scraping** für Financial Data
6. **Multi-Source Integration**

### 📋 **Live Coding Approach**
- **Instructor demonstrates** → **Students modify/extend**
- **Short code blocks** with thorough comments
- **Interactive exercises** at each step

## 1. Load Large Banking Transactions (PySpark) 🏦
**Goal:** Load a >1GB CSV efficiently using PySpark and prepare it for analysis

Dataset: `transactions_data.csv` (set the path below)

### 🎓 Live Coding Exercise:
- **Instructor:** Sets up Spark and loads the dataset with an explicit schema or fast inference
- **Students:** Add derived columns and validate data quality

In [1]:
# Simple PySpark setup for banking data
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

# Dataset path
dataset_path = "../data/transactions_data.csv"

# Create basic Spark session
spark = SparkSession.builder.appName("Banking Analysis").getOrCreate()
spark.sparkContext.setLogLevel("WARN")

print("✅ Spark ready!")
print(f"Version: {spark.version}")

25/08/10 21:34:12 WARN Utils: Your hostname, Maclook-Bro.local resolves to a loopback address: 127.0.0.1; using 172.18.160.138 instead (on interface en0)
25/08/10 21:34:12 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/08/10 21:34:12 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/08/10 21:34:13 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


✅ Spark ready!
Version: 3.5.3


In [2]:
# Load and prepare banking data
print("📦 Loading banking dataset...")

# Simple data loading
df = spark.read.option("header", True).option("inferSchema", True).csv(dataset_path)

print("📋 Schema:")
df.printSchema()

# Basic column mapping
df = df.withColumnRenamed("client_id", "customer_id") \
       .withColumnRenamed("id", "transaction_id") \
       .withColumn("transaction_date", to_timestamp(col("date")))

print(f"📊 Loaded {df.count():,} transactions")
df.show(5)

📦 Loading banking dataset...


                                                                                

📋 Schema:
root
 |-- id: integer (nullable = true)
 |-- date: timestamp (nullable = true)
 |-- client_id: integer (nullable = true)
 |-- card_id: integer (nullable = true)
 |-- amount: string (nullable = true)
 |-- use_chip: string (nullable = true)
 |-- merchant_id: integer (nullable = true)
 |-- merchant_city: string (nullable = true)
 |-- merchant_state: string (nullable = true)
 |-- zip: double (nullable = true)
 |-- mcc: integer (nullable = true)
 |-- errors: string (nullable = true)





📊 Loaded 13,305,915 transactions
+--------------+-------------------+-----------+-------+-------+-----------------+-----------+-------------+--------------+-------+----+------+-------------------+
|transaction_id|               date|customer_id|card_id| amount|         use_chip|merchant_id|merchant_city|merchant_state|    zip| mcc|errors|   transaction_date|
+--------------+-------------------+-----------+-------+-------+-----------------+-----------+-------------+--------------+-------+----+------+-------------------+
|       7475327|2010-01-01 00:01:00|       1556|   2972|$-77.00|Swipe Transaction|      59935|       Beulah|            ND|58523.0|5499|  NULL|2010-01-01 00:01:00|
|       7475328|2010-01-01 00:02:00|        561|   4575| $14.57|Swipe Transaction|      67570|   Bettendorf|            IA|52722.0|5311|  NULL|2010-01-01 00:02:00|
|       7475329|2010-01-01 00:02:00|       1129|    102| $80.00|Swipe Transaction|      27092|        Vista|            CA|92084.0|4829|  NULL|2010

                                                                                

In [112]:
# Add basic features
print("🔧 Adding basic features...")

# First, ensure amount is properly numeric - convert to amount_usd as double
df = df.withColumn("amount_usd", regexp_replace(col("amount"), "[$]", "").cast("double"))

print(f"✅ Amount data types - USD: {dict(df.dtypes)['amount_usd']}")

# Add simple time features
df = df.withColumn("hour", hour(col("transaction_date"))) \
       .withColumn("is_weekend", dayofweek(col("transaction_date")).isin([1, 7]))

# Add merchant category (simplified)
df = df.withColumn("merchant_category",
                   when(col("mcc").isin(5411, 5441), "Grocery")
                   .when(col("mcc").isin(5812, 5813), "Restaurant") 
                   .when(col("mcc").isin(5541, 5542), "Gas Station")
                   .otherwise("Other"))

# Add zip_region column
df = df.withColumn("zip_region",
                   when(col("zip").substr(1,1).isin(["0", "1", "2"]), "Northeast")
                   .when(col("zip").substr(1,1).isin(["3", "4", "5"]), "Southeast") 
                   .when(col("zip").substr(1,1).isin(["6", "7"]), "Central")
                   .when(col("zip").substr(1,1).isin(["8", "9"]), "West")
                   .otherwise("Unknown"))

# Create temp view for SQL
df.createOrReplaceTempView("transactions")
print("✅ Features added and temp view created!")

# Show sample with numeric amounts
df.select("customer_id", "amount_usd", "merchant_category", "hour", "is_weekend", "zip_region").show()

🔧 Adding basic features...
✅ Amount data types - USD: double
✅ Features added and temp view created!
+-----------+----------+-----------------+----+----------+----------+
|customer_id|amount_usd|merchant_category|hour|is_weekend|zip_region|
+-----------+----------+-----------------+----+----------+----------+
|       1556|     -77.0|            Other|   0|     false| Southeast|
|        561|     14.57|            Other|   0|     false| Southeast|
|       1129|      80.0|            Other|   0|     false|      West|
|        430|     200.0|            Other|   0|     false| Southeast|
|        848|     46.41|       Restaurant|   0|     false| Northeast|
|       1807|      4.81|            Other|   0|     false| Northeast|
|       1556|      77.0|            Other|   0|     false| Southeast|
|       1684|     26.46|            Other|   0|     false|   Unknown|
|        335|    261.58|            Other|   0|     false|   Unknown|
|        351|     10.74|       Restaurant|   0|     false| 

## 2. Basic Data Exploration with Spark 🐼➡️🔥
**Goal:** Explore the 1GB+ dataset with Spark (no pandas copies)

### 🎓 Live Coding Exercise:
- **Instructor:** Demonstrates Spark actions and SQL
- **Students:** Build aggregations and quality checks at scale

In [113]:
# 🧑‍🏫 INSTRUCTOR: Basic Spark exploration (precoded)
def explore_banking_data_spark(df):
    """
    Scalable data exploration using Spark
    - Schema, counts, ranges, basic distributions
    - No driver-side collect() on large datasets
    """
    from pyspark.sql import functions as F
    
    print("📊 BANKING DATA OVERVIEW (Spark)")
    print("=" * 50)
    
    print(f"Total rows: {df.count():,}")
    df.printSchema()
    
    # Columns we expect (best effort)
    available_cols = set([c.lower() for c in df.columns])
    
    if "transaction_date" in available_cols:
        print("\n📅 Date range:")
        df.select(F.min("transaction_date").alias("min_date"), F.max("transaction_date").alias("max_date")).show()
        
        print("\n📆 Transactions by weekday:")
        df.withColumn("weekday", F.date_format(F.col("transaction_date"), "E")).groupBy("weekday").count().orderBy("weekday").show()
    
    if "customer_id" in available_cols:
        print("\n👥 Unique customers:")
        df.select(F.countDistinct("customer_id").alias("unique_customers")).show()
    
    if "amount" in available_cols:
        print("\n💰 Amount stats:")
        df.select(
            F.count("amount").alias("n"),
            F.mean("amount").alias("avg"),
            F.expr("percentile_approx(amount, array(0.25,0.5,0.75), 10000)").alias("quantiles"),
            F.min("amount").alias("min"),
            F.max("amount").alias("max")
        ).show(truncate=False)
    
    if "merchant_id" in available_cols:
        print("\n🏪 Top merchant IDs:")
        df.groupBy("merchant_id").count().orderBy(F.desc("count")).show(10, truncate=False)

# Run the exploration
explore_banking_data_spark(df)

📊 BANKING DATA OVERVIEW (Spark)


                                                                                

Total rows: 13,305,915
root
 |-- transaction_id: integer (nullable = true)
 |-- date: timestamp (nullable = true)
 |-- customer_id: integer (nullable = true)
 |-- card_id: integer (nullable = true)
 |-- amount: string (nullable = true)
 |-- use_chip: string (nullable = true)
 |-- merchant_id: integer (nullable = true)
 |-- merchant_city: string (nullable = true)
 |-- merchant_state: string (nullable = true)
 |-- zip: double (nullable = true)
 |-- mcc: integer (nullable = true)
 |-- errors: string (nullable = true)
 |-- transaction_date: timestamp (nullable = true)
 |-- amount_usd: double (nullable = true)
 |-- hour: integer (nullable = true)
 |-- is_weekend: boolean (nullable = true)
 |-- merchant_category: string (nullable = false)
 |-- zip_region: string (nullable = false)


📅 Date range:


                                                                                

+-------------------+-------------------+
|           min_date|           max_date|
+-------------------+-------------------+
|2010-01-01 00:01:00|2019-10-31 23:59:00|
+-------------------+-------------------+


📆 Transactions by weekday:


                                                                                

+-------+-------+
|weekday|  count|
+-------+-------+
|    Fri|1895372|
|    Mon|1896914|
|    Sat|1902370|
|    Sun|1899044|
|    Thu|1918666|
|    Tue|1897678|
|    Wed|1895871|
+-------+-------+


👥 Unique customers:


                                                                                

+----------------+
|unique_customers|
+----------------+
|            1219|
+----------------+


💰 Amount stats:


                                                                                

+--------+----+---------+------+-------+
|n       |avg |quantiles|min   |max    |
+--------+----+---------+------+-------+
|13305915|NULL|NULL     |$-0.00|$999.97|
+--------+----+---------+------+-------+


🏪 Top merchant IDs:




+-----------+------+
|merchant_id|count |
+-----------+------+
|59935      |610053|
|27092      |589140|
|61195      |562410|
|39021      |440281|
|43293      |362842|
|22204      |347511|
|14528      |333505|
|60569      |301657|
|50783      |298231|
|75781      |273351|
+-----------+------+
only showing top 10 rows



                                                                                

In [114]:
# Basic data exploration
print("📊 Basic dataset overview:")
print(f"Total transactions: {df.count():,}")
print(f"Unique customers: {df.select('customer_id').distinct().count():,}")

# Simple aggregations using numeric amount columns
print("\n💰 Spending by merchant category:")
df.groupBy("merchant_category") \
  .agg(sum("amount_usd").alias("total_spending_usd"),
       count("*").alias("transaction_count")) \
  .orderBy(desc("total_spending_usd")) \
  .show()

print("✅ Basic exploration complete!")

📊 Basic dataset overview:


                                                                                

Total transactions: 13,305,915


                                                                                

Unique customers: 1,219

💰 Spending by merchant category:




+-----------------+--------------------+-----------------+
|merchant_category|  total_spending_usd|transaction_count|
+-----------------+--------------------+-----------------+
|            Other| 4.686805615000004E8|          9040312|
|          Grocery| 4.097075415000012E7|          1592584|
|       Restaurant| 3.261377996999998E7|          1248308|
|      Gas Station|2.9570426660000026E7|          1424711|
+-----------------+--------------------+-----------------+

✅ Basic exploration complete!


                                                                                

## 3. Spark Session Recap 🚀
Spark is already initialized. We’ll keep this short and move to SQL analytics.

- Session tuned for local development and large CSVs
- Temp view `banking_transactions` is ready
- Proceed to analytics at scale

In [115]:
# (Optional) Spark utilities
from pyspark.sql import functions as F
from pyspark.sql.types import *

print("ℹ️ Spark utilities available. Session already created above.")

ℹ️ Spark utilities available. Session already created above.


In [116]:
# Simple Spark DataFrame operations
print("🔧 Basic Spark operations:")

# Show dataset info for numeric columns
df.select("amount_usd").describe().show()

# Weekend vs weekday analysis using SQL with numeric amounts
print("📅 Weekend vs Weekday spending:")
spark.sql("""
SELECT 
    is_weekend,
    COUNT(*) as transaction_count,
    ROUND(SUM(amount_usd), 2) as total_amount_usd
FROM transactions 
GROUP BY is_weekend
""").show()

print("✅ Basic operations complete!")

🔧 Basic Spark operations:


                                                                                

+-------+-----------------+
|summary|       amount_usd|
+-------+-----------------+
|  count|         13305915|
|   mean|42.97603902324682|
| stddev|81.65574765375871|
|    min|           -500.0|
|    max|           6820.2|
+-------+-----------------+

📅 Weekend vs Weekday spending:




+----------+-----------------+----------------+
|is_weekend|transaction_count|total_amount_usd|
+----------+-----------------+----------------+
|      true|          3801414|  1.6384547355E8|
|     false|          9504501|  4.0799004873E8|
+----------+-----------------+----------------+

✅ Basic operations complete!


                                                                                

## 4. Advanced Spark SQL Analytics 🔍
**Goal:** Complex banking analytics using SQL on big data

### 🏦 Real Banking Use Cases:
- **Fraud Detection:** Unusual spending patterns
- **Customer Segmentation:** Spending behavior analysis
- **Risk Assessment:** Transaction pattern analysis

In [117]:
# 🧑‍🏫 INSTRUCTOR: Simple banking analytics demo
print("🔍 SIMPLE BANKING ANALYTICS")

# Top customers using numeric amounts
print("👑 TOP 3 CUSTOMERS BY SPENDING:")
spark.sql("""
SELECT customer_id, 
       ROUND(SUM(amount_usd), 2) as total_spent_usd
FROM transactions 
GROUP BY customer_id 
ORDER BY total_spent_usd DESC 
LIMIT 3
""").show()

# Simple merchant analysis
print("🏪 TOP 3 MERCHANTS BY TRANSACTIONS:")
spark.sql("""
SELECT merchant_id, 
       COUNT(*) as transaction_count,
       ROUND(SUM(amount_usd), 2) as total_revenue_usd
FROM transactions 
GROUP BY merchant_id 
ORDER BY transaction_count DESC 
LIMIT 3
""").show()

print("✅ Simple analytics complete!")

🔍 SIMPLE BANKING ANALYTICS
👑 TOP 3 CUSTOMERS BY SPENDING:


                                                                                

+-----------+---------------+
|customer_id|total_spent_usd|
+-----------+---------------+
|         96|     2445773.25|
|       1686|      2167880.9|
|       1340|     2039921.23|
+-----------+---------------+

🏪 TOP 3 MERCHANTS BY TRANSACTIONS:




+-----------+-----------------+-----------------+
|merchant_id|transaction_count|total_revenue_usd|
+-----------+-----------------+-----------------+
|      59935|           610053|       8937586.07|
|      27092|           589140|    5.315851564E7|
|      61195|           562410|    1.201308365E7|
+-----------+-----------------+-----------------+

✅ Simple analytics complete!


                                                                                

In [None]:
# Simple fraud detection demo
print("🚨 BASIC FRAUD DETECTION")

# High amount transactions (potential fraud) using numeric amounts
print("💰 Transactions above $200:")
spark.sql("""
SELECT customer_id, 
       ROUND(amount_usd, 2) as amount_usd,
       merchant_id
FROM transactions 
WHERE amount_usd > 200
ORDER BY amount_usd DESC
LIMIT 5
""").show()

print("✅ Basic fraud detection complete!")
print("🚨 POTENTIAL FRAUD DETECTION:")
print("Find customers with transactions > 3 standard deviations from their average")

fraud_query = """
WITH customer_stats AS (
    SELECT 
        customer_id,
        transaction_date,
        amount,
        AVG(amount) OVER (PARTITION BY customer_id) as avg_amount,
        STDDEV_POP(amount) OVER (PARTITION BY customer_id) as stddev_amount
    FROM banking_transactions
),
potential_fraud AS (
    SELECT 
        customer_id,
        transaction_date,
        amount,
        avg_amount,
        stddev_amount,
        ABS(amount - avg_amount) as deviation,
        CASE 
            WHEN stddev_amount > 0 AND ABS(amount - avg_amount) > 3 * stddev_amount 
            THEN 'HIGH_RISK'
            WHEN stddev_amount > 0 AND ABS(amount - avg_amount) > 2 * stddev_amount 
            THEN 'MEDIUM_RISK'
            ELSE 'NORMAL'
        END as risk_level
    FROM customer_stats
)
SELECT 
    risk_level, 
    COUNT(*) as transaction_count,
    SUM(amount_usd) as total_amount,
    AVG(amount_usd) as avg_amount
FROM potential_fraud
GROUP BY risk_level
ORDER BY risk_level
"""
spark.sql(fraud_query).show()
print("✅ Basic fraud detection complete!")


🚨 BASIC FRAUD DETECTION
💰 Transactions above $200:


                                                                                

+-----------+----------+-----------+
|customer_id|amount_usd|merchant_id|
+-----------+----------+-----------+
|        708|    6820.2|      34524|
|       1081|   6613.44|       9026|
|       1259|   5913.37|      85983|
|       1487|   5813.78|       9026|
|        278|   5696.78|       7202|
+-----------+----------+-----------+

✅ Basic fraud detection complete!
🚨 POTENTIAL FRAUD DETECTION:
Find customers with transactions > 3 standard deviations from their average


[Stage 798:>                                                        (0 + 8) / 8]

+----------+-----------------+------------+----------+
|risk_level|transaction_count|total_amount|avg_amount|
+----------+-----------------+------------+----------+
|    NORMAL|         13305915|        NULL|      NULL|
+----------+-----------------+------------+----------+

✅ Basic fraud detection complete!


                                                                                

## 5. Simple Cloud Deployment 🌥️
**Goal:** Basic overview of deploying to cloud

### 🌟 Why Cloud?
- **Scale:** Handle big datasets
- **Storage:** Secure data storage  
- **Compute:** More processing power

In [119]:
# 🧑‍🏫 INSTRUCTOR: Simple cloud overview (demo only)
print("☁️ CLOUD DEPLOYMENT BASICS")
print("Key concepts:")
print("1. Upload data to cloud storage")
print("2. Create compute cluster") 
print("3. Run Spark jobs")
print("4. Store results")

print("\n✅ Cloud overview complete!")

☁️ CLOUD DEPLOYMENT BASICS
Key concepts:
1. Upload data to cloud storage
2. Create compute cluster
3. Run Spark jobs
4. Store results

✅ Cloud overview complete!


In [120]:
# ✅ COMPLETE SOLUTION: Customize GCP deployment
print("🎯 COMPLETE: Customize the deployment configuration!")
print("=" * 50)

# Solution 1: Update configuration for your environment
print("⚙️ CUSTOMIZE YOUR CONFIGURATION:")
my_gcp_config = {
    "project_id": "banking-analytics-demo-2025",  # Example project ID
    "bucket_name": "banking-data-workshop-eu",   # Example bucket name
    "region": "europe-west3",                    # Frankfurt region for GDPR compliance
    "dataset_location": "EU",                    # European Union for data residency
    "service_account_email": "banking-analytics@banking-analytics-demo-2025.iam.gserviceaccount.com",
    "vpc_network": "banking-vpc",                # Custom VPC for security
    "subnet": "banking-subnet-eu-west3"          # Specific subnet
}

print("📝 Your GCP Config:")
import json
print(json.dumps(my_gcp_config, indent=2))

# Solution 2: Create a deployment checklist
print("\n✅ DEPLOYMENT CHECKLIST:")
deployment_checklist = [
    "GCP project created and billing enabled",
    "Service account created with necessary permissions",
    "Cloud Storage bucket created in EU region",
    "Databricks workspace provisioned"
]
print("\n✅ Cloud overview complete!")

🎯 COMPLETE: Customize the deployment configuration!
⚙️ CUSTOMIZE YOUR CONFIGURATION:
📝 Your GCP Config:
{
  "project_id": "banking-analytics-demo-2025",
  "bucket_name": "banking-data-workshop-eu",
  "region": "europe-west3",
  "dataset_location": "EU",
  "service_account_email": "banking-analytics@banking-analytics-demo-2025.iam.gserviceaccount.com",
  "vpc_network": "banking-vpc",
  "subnet": "banking-subnet-eu-west3"
}

✅ DEPLOYMENT CHECKLIST:

✅ Cloud overview complete!


## 6. Simple Data Integration 🔗
**Goal:** Combine banking data with external sources

### 💡 Real Banking Examples:
- **Exchange Rates:** Convert international transactions
- **Interest Rates:** Economic impact on spending
- **Merchant Data:** Enhanced merchant information

In [121]:
# 🧑‍🏫 INSTRUCTOR: Simple API integration for live exchange rates
print("🔗 SIMPLE API INTEGRATION")

# Get live exchange rates
print("📊 Fetching live exchange rates...")
import requests

# Simple API call
api_url = "https://api.exchangeratesapi.io/v1/latest?access_key=24da234d4ded987472b5ece3b4981c9b&format=1"

try:
    response = requests.get(api_url)
    data = response.json()

    if data.get('success'):
        print("✅ API call successful!")
        print(f"📅 Date: {data['date']}")
        print(f"💱 Base: {data['base']}")
        
        # Get key rates we need
        usd_rate = data['rates']['USD']
        gbp_rate = data['rates']['GBP']
        
        print(f"\n💰 Today's rates:")
        print(f"EUR → USD: {usd_rate}")
        print(f"EUR → GBP: {gbp_rate}")
        
        # Create EUR column from USD using live rate
        print("🔄 Creating EUR amounts using live rates...")
        df = df.withColumn("amount_eur", col("amount_usd") / usd_rate)
        
        # Recreate temp view with updated data
        df.createOrReplaceTempView("transactions")
        print("✅ Temp view updated with live exchange rates!")
        
    else:
        print("🔄 Using fallback rates...")
        usd_rate = 1.16
        gbp_rate = 0.87
        df = df.withColumn("amount_eur", col("amount_usd") / usd_rate)
        df.createOrReplaceTempView("transactions")
        
except Exception as e:
    print(f"🚨 Error: {e}")
    print("🔄 Using fallback rates...")
    usd_rate = 1.16
    gbp_rate = 0.87
    df = df.withColumn("amount_eur", col("amount_usd") / usd_rate)
    df.createOrReplaceTempView("transactions")

# Simple live currency conversion example
print("\n💡 Sample transactions with live exchange rates:")
conversion_query = """
SELECT 
    customer_id,
    ROUND(amount_usd, 2) as amount_usd,
    ROUND(amount_eur, 2) as amount_eur,
    merchant_category
FROM transactions 
LIMIT 5
"""

spark.sql(conversion_query).show()

print("✅ Simple live integration complete!")

🔗 SIMPLE API INTEGRATION
📊 Fetching live exchange rates...
✅ API call successful!
📅 Date: 2025-08-10
💱 Base: EUR

💰 Today's rates:
EUR → USD: 1.164821
EUR → GBP: 0.867295
🔄 Creating EUR amounts using live rates...
✅ Temp view updated with live exchange rates!

💡 Sample transactions with live exchange rates:
+-----------+----------+----------+-----------------+
|customer_id|amount_usd|amount_eur|merchant_category|
+-----------+----------+----------+-----------------+
|       1556|     -77.0|     -66.1|            Other|
|        561|     14.57|     12.51|            Other|
|       1129|      80.0|     68.68|            Other|
|        430|     200.0|     171.7|            Other|
|        848|     46.41|     39.84|       Restaurant|
+-----------+----------+----------+-----------------+

✅ Simple live integration complete!
✅ API call successful!
📅 Date: 2025-08-10
💱 Base: EUR

💰 Today's rates:
EUR → USD: 1.164821
EUR → GBP: 0.867295
🔄 Creating EUR amounts using live rates...
✅ Temp view u

In [122]:
# Integration analysis using live exchange rates
print("📊 LIVE EXCHANGE RATE ANALYSIS")

print("💡 Let's analyze our banking data using LIVE exchange rates from the API!")
print("Business question: How do live currency fluctuations affect spending patterns?")

# Analysis using live EUR/USD conversion
print(f"\n🔍 Analysis using LIVE rate: 1 EUR = {usd_rate} USD")

live_rate_analysis = """
SELECT 
    merchant_category,
    COUNT(*) as transactions,
    ROUND(SUM(amount_usd), 2) as total_usd,
    ROUND(SUM(amount_eur), 2) as total_eur,
    ROUND(SUM(amount_usd) - SUM(amount_eur), 2) as usd_eur_difference,
    ROUND(AVG(amount_usd), 2) as avg_usd,
    ROUND(AVG(amount_eur), 2) as avg_eur
FROM transactions 
WHERE amount_usd > 50
GROUP BY merchant_category
ORDER BY total_usd DESC
"""

result = spark.sql(live_rate_analysis)
result.show()

# Advanced analysis: Currency impact by spending category
print(f"\n📈 CURRENCY IMPACT ANALYSIS (Live Rate: {usd_rate}):")

currency_impact_query = """
SELECT 
    merchant_category,
    ROUND(AVG(amount_usd), 2) as avg_usd_per_transaction,
    ROUND(AVG(amount_eur), 2) as avg_eur_per_transaction,
    ROUND((AVG(amount_usd) - AVG(amount_eur)) / AVG(amount_eur) * 100, 2) as currency_impact_percent
FROM transactions 
WHERE amount_usd > 20
GROUP BY merchant_category
ORDER BY currency_impact_percent DESC
"""

spark.sql(currency_impact_query).show()

# Business insights with live data
print("\n💡 LIVE EXCHANGE RATE INSIGHTS:")
print(f"• Today's EUR→USD rate: {usd_rate}")
print("• 'USD-EUR Difference' shows currency conversion impact")
print("• 'Currency Impact %' shows relative cost difference for EUR vs USD customers") 
print("• Higher impact % = more expensive for EUR-based customers")
print("• This analysis updates with LIVE market rates!")

print("✅ Live exchange rate analysis complete!")

📊 LIVE EXCHANGE RATE ANALYSIS
💡 Let's analyze our banking data using LIVE exchange rates from the API!
Business question: How do live currency fluctuations affect spending patterns?

🔍 Analysis using LIVE rate: 1 EUR = 1.164821 USD


                                                                                

+-----------------+------------+--------------+--------------+------------------+-------+-------+
|merchant_category|transactions|     total_usd|     total_eur|usd_eur_difference|avg_usd|avg_eur|
+-----------------+------------+--------------+--------------+------------------+-------+-------+
|            Other|     3469683|4.0457538372E8|3.4732837382E8|      5.72470099E7|  116.6|  100.1|
|      Gas Station|      509646| 4.101957026E7| 3.521534232E7|        5804227.94|  80.49|   69.1|
|          Grocery|      259596| 2.451713686E7| 2.104798665E7|        3469150.21|  94.44|  81.08|
|       Restaurant|      208070| 1.465861958E7| 1.258443965E7|        2074179.93|  70.45|  60.48|
+-----------------+------------+--------------+--------------+------------------+-------+-------+


📈 CURRENCY IMPACT ANALYSIS (Live Rate: 1.164821):




+-----------------+-----------------------+-----------------------+-----------------------+
|merchant_category|avg_usd_per_transaction|avg_eur_per_transaction|currency_impact_percent|
+-----------------+-----------------------+-----------------------+-----------------------+
|          Grocery|                  63.69|                  54.68|                  16.48|
|            Other|                  81.58|                  70.04|                  16.48|
|      Gas Station|                  68.39|                  58.71|                  16.48|
|       Restaurant|                  47.26|                  40.58|                  16.48|
+-----------------+-----------------------+-----------------------+-----------------------+


💡 LIVE EXCHANGE RATE INSIGHTS:
• Today's EUR→USD rate: 1.164821
• 'USD-EUR Difference' shows currency conversion impact
• 'Currency Impact %' shows relative cost difference for EUR vs USD customers
• Higher impact % = more expensive for EUR-based customers
• Thi

                                                                                

## 🎯 Workshop Summary

You've learned the essentials of big data analytics:

1. **PySpark Basics:** Loading and processing data
2. **SQL Analysis:** Simple aggregations and insights
3. **Banking Analytics:** Basic fraud detection
4. **Cloud Concepts:** Deployment overview