# Banking Data Analysis - Live Coding Workshop
## Big Data Analytics in Banking | 13:00-15:40

### 🎯 **Workshop Agenda**
- **13:00-13:45:** Introduction to Data Analysis + Banking Transaction Analysis
- **13:55-14:40:** Spark Deep-Dive & GCP Setup
- **14:50-15:40:** Data Acquisition and Integration

### 🛠 **What we'll learn today:**
1. **Data analysis process** in practice
2. **Data Mining** für Banking-Patterns
3. **Spark Setup** und SQL-Queries
4. **GCP/Databricks** Configuration
5. **Web Scraping** für Financial Data
6. **Multi-Source Integration**

### 📋 **Live Coding Approach**
- **Instructor demonstrates** → **Students modify/extend**
- **Short code blocks** with thorough comments
- **Interactive exercises** at each step

## 1. Load Large Banking Transactions (PySpark) 🏦
**Goal:** Load a >1GB CSV efficiently using PySpark and prepare it for analysis

Dataset: `transactions_data.csv` (set the path below)

### 🎓 Live Coding Exercise:
- **Instructor:** Sets up Spark and loads the dataset with an explicit schema or fast inference
- **Students:** Add derived columns and validate data quality

In [2]:
# Simple PySpark setup for banking data
from pyspark.sql import SparkSession
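# Note: the wildcard import replaces Python built-ins such as min, max, sum and round
# with Spark column functions for the rest of the notebook
from pyspark.sql.functions import *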

# Dataset path
dataset_path = "../data/transactions_data.csv"

if 'spark' in locals():
    print("Stopping existing Spark session...")
    spark.stop()  # Stop any existing Spark session


# Create basic Spark session
# Driver memory is set to 4 GB; the setting may not take effect if a JVM is
# already running in this Python process (restarting the kernel avoids that)
# local[*] uses all available cores on this machine
spark = (
    SparkSession.builder
    .appName("Banking Analysis")
    .master("local[*]")
    .config("spark.driver.memory", "4g")
    .getOrCreate()
)
spark.sparkContext.setLogLevel("WARN") # Set log level to WARN to reduce noise

print("✅ Spark ready!")
print(f"Version: {spark.version}")

25/08/11 14:29:19 WARN Utils: Your hostname, Maclook-Bro.local resolves to a loopback address: 127.0.0.1; using 192.168.222.131 instead (on interface en0)
25/08/11 14:29:19 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/08/11 14:29:19 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/08/11 14:29:20 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


✅ Spark ready!
Version: 3.5.3


In [3]:
# Load and prepare banking data
print("📦 Loading banking dataset...")

# Simple data loading
df = spark.read.option("header", True).option("inferSchema", True).csv(dataset_path)

print("📋 Schema:")
df.printSchema()


print("type:", type(df))


📦 Loading banking dataset...




📋 Schema:
root
 |-- id: integer (nullable = true)
 |-- date: timestamp (nullable = true)
 |-- client_id: integer (nullable = true)
 |-- card_id: integer (nullable = true)
 |-- amount: string (nullable = true)
 |-- use_chip: string (nullable = true)
 |-- merchant_id: integer (nullable = true)
 |-- merchant_city: string (nullable = true)
 |-- merchant_state: string (nullable = true)
 |-- zip: double (nullable = true)
 |-- mcc: integer (nullable = true)
 |-- errors: string (nullable = true)

type: <class 'pyspark.sql.dataframe.DataFrame'>
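The exercise description mentions an explicit schema, while the load cell relies on `inferSchema`, which makes Spark read the CSV an additional time to guess the types. A minimal sketch of the explicit alternative, assuming the column names and types printed in the schema above: it builds a DDL schema string in plain Python. Reading `zip` as `STRING` instead of `double` is an assumption made here so that leading zeros survive; the notebook itself does not do this.

```python
# Column names and types copied from the inferred schema printed above.
# Assumption: zip is declared as STRING (not double) to keep leading zeros.
columns = [
    ("id", "INT"),
    ("date", "TIMESTAMP"),
    ("client_id", "INT"),
    ("card_id", "INT"),
    ("amount", "STRING"),  # still contains the "$" sign, cleaned up later
    ("use_chip", "STRING"),
    ("merchant_id", "INT"),
    ("merchant_city", "STRING"),
    ("merchant_state", "STRING"),
    ("zip", "STRING"),
    ("mcc", "INT"),
    ("errors", "STRING"),
]

# Backticks quote the column names so that a name like `date` is never parsed as a keyword
schema_ddl = ", ".join(f"`{name}` {dtype}" for name, dtype in columns)
print(schema_ddl)

# With an active session, the string can be handed to the reader instead of inferSchema:
# df = spark.read.option("header", True).schema(schema_ddl).csv(dataset_path)
```

If `zip` is read as a string this way, later cells that expect a numeric `zip` would need to be adjusted accordingly.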


                                                                                

In [4]:
# Basic column mapping
df = (
    df.withColumnRenamed("client_id", "customer_id")
    .withColumnRenamed("id", "transaction_id")
    .withColumn("transaction_date", to_timestamp(col("date")))
)


print(f"📊 Loaded {df.count():,} transactions")
df.show(5)



📊 Loaded 13,305,915 transactions
+--------------+-------------------+-----------+-------+-------+-----------------+-----------+-------------+--------------+-------+----+------+-------------------+
|transaction_id|               date|customer_id|card_id| amount|         use_chip|merchant_id|merchant_city|merchant_state|    zip| mcc|errors|   transaction_date|
+--------------+-------------------+-----------+-------+-------+-----------------+-----------+-------------+--------------+-------+----+------+-------------------+
|       7475327|2010-01-01 00:01:00|       1556|   2972|$-77.00|Swipe Transaction|      59935|       Beulah|            ND|58523.0|5499|  NULL|2010-01-01 00:01:00|
|       7475328|2010-01-01 00:02:00|        561|   4575| $14.57|Swipe Transaction|      67570|   Bettendorf|            IA|52722.0|5311|  NULL|2010-01-01 00:02:00|
|       7475329|2010-01-01 00:02:00|       1129|    102| $80.00|Swipe Transaction|      27092|        Vista|            CA|92084.0|4829|  NULL|2010

                                                                                

In [None]:
# Add basic features
print("🔧 Adding basic features...")

# First, ensure amount is properly numeric - convert to amount_usd as double
df = df.withColumn("amount_usd", regexp_replace(col("amount"), "[$]", "").cast("double"))

print(f"✅ Amount data types - USD: {dict(df.dtypes)['amount_usd']}")

# Add simple time features
df = df.withColumn("hour", hour(col("transaction_date"))) \
       .withColumn("is_weekend", dayofweek(col("transaction_date")).isin([1, 7]))

# Add merchant category (simplified)
df = df.withColumn("merchant_category",
                   when(col("mcc").isin(5411, 5441), "Grocery")
                   .when(col("mcc").isin(5812, 5813), "Restaurant") 
                   .when(col("mcc").isin(5541, 5542), "Gas Station")
                   .otherwise("Other"))

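# Add zip_region column
# Caveat: zip was inferred as double, so leading zeros are lost (02134 is stored as 2134.0)
# and the first character is not always the true first ZIP digit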
df = df.withColumn("zip_region",
                   when(col("zip").substr(1,1).isin(["0", "1", "2"]), "Northeast")
                   .when(col("zip").substr(1,1).isin(["3", "4", "5"]), "Southeast") 
                   .when(col("zip").substr(1,1).isin(["6", "7"]), "Central")
                   .when(col("zip").substr(1,1).isin(["8", "9"]), "West")
                   .otherwise("Unknown"))



# Show sample with numeric amounts
df.select("customer_id", "amount_usd", "merchant_category", "hour", "is_weekend", "zip_region").show()

🔧 Adding basic features...
✅ Amount data types - USD: double
✅ Features added and temp view created!
+-----------+----------+-----------------+----+----------+----------+
|customer_id|amount_usd|merchant_category|hour|is_weekend|zip_region|
+-----------+----------+-----------------+----+----------+----------+
|       1556|     -77.0|            Other|   0|     false| Southeast|
|        561|     14.57|            Other|   0|     false| Southeast|
|       1129|      80.0|            Other|   0|     false|      West|
|        430|     200.0|            Other|   0|     false| Southeast|
|        848|     46.41|       Restaurant|   0|     false| Northeast|
|       1807|      4.81|            Other|   0|     false| Northeast|
|       1556|      77.0|            Other|   0|     false| Southeast|
|       1684|     26.46|            Other|   0|     false|   Unknown|
|        335|    261.58|            Other|   0|     false|   Unknown|
|        351|     10.74|       Restaurant|   0|     false| 
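Because `zip` was inferred as `double`, the `zip_region` logic looks at the first character of values such as `58523.0`, and ZIP codes with leading zeros lose them. The following plain-Python sketch reproduces the region mapping from the cell above but pads the code to five digits first; the function name `zip_region` is hypothetical and the region buckets are copied unchanged from the notebook. In Spark, the same padding could probably be expressed with `lpad(col("zip").cast("int").cast("string"), 5, "0")`, which is not tested here.

```python
def zip_region(zip_value):
    """Map a ZIP code (float, int, str or None) to the workshop's coarse region labels."""
    if zip_value is None:
        return "Unknown"
    try:
        # 2134.0 -> 2134 -> "02134": restore the leading zero lost by the double type
        zip5 = f"{int(float(zip_value)):05d}"
    except (TypeError, ValueError):
        # Non-numeric strings and NaN end up here
        return "Unknown"
    first_digit = zip5[0]
    if first_digit in "012":
        return "Northeast"
    if first_digit in "345":
        return "Southeast"
    if first_digit in "67":
        return "Central"
    if first_digit in "89":
        return "West"
    return "Unknown"


for sample in (58523.0, 92084.0, 501.0, None):
    print(sample, "->", zip_region(sample))
```

The value `501.0` (ZIP `00501`) shows the difference: its first character as a double is `5`, while the padded code starts with `0`.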

## 2. Basic Data Exploration with Spark 🐼➡️🔥
**Goal:** Explore the 1GB+ dataset with Spark (no pandas copies)

### 🎓 Live Coding Exercise:
- **Instructor:** Demonstrates Spark actions and SQL
- **Students:** Build aggregations and quality checks at scale

In [11]:
# 🧑‍🏫 INSTRUCTOR: Basic Spark exploration (LIVE CODING)
# TODO: Live code the data exploration function

from pyspark.sql import functions as F

print("📊 BANKING DATA OVERVIEW")
print("🎯 LIVE CODING: We'll build a comprehensive exploration function together")

# - Date range analysis with min/max (F.min/F.max make it explicit that these are Spark functions)
df.select(
    F.min("transaction_date").alias("min_date"),
    F.max("transaction_date").alias("max_date"),
).show()
# - groupBy for weekday analysis (orderBy sorts the weekday labels alphabetically)
df.withColumn("weekday", F.date_format(F.col("transaction_date"), "E")).groupBy("weekday").count().orderBy("weekday").show()
# - Distinct count of unique customers (.show() is needed to run and display the result)
df.select(F.countDistinct("customer_id").alias("unique_customers")).show()
# - Summary statistics and approximate quartiles
# - Top N analysis with orderBy (TODO during the session)
df.select(
    F.count("amount_usd").alias("n"),
    F.mean("amount_usd").alias("avg"),
    F.expr("percentile_approx(amount_usd, array(0.25,0.5,0.75), 10000)").alias("quantiles"),
    F.min("amount_usd").alias("min"),
    F.max("amount_usd").alias("max"),
).show(truncate=False)



print("✅ Ready for live coding session!")

📊 BANKING DATA OVERVIEW
🎯 LIVE CODING: We'll build a comprehensive exploration function together


                                                                                

+-------------------+-------------------+
|           min_date|           max_date|
+-------------------+-------------------+
|2010-01-01 00:01:00|2019-10-31 23:59:00|
+-------------------+-------------------+



                                                                                

+-------+-------+
|weekday|  count|
+-------+-------+
|    Fri|1895372|
|    Mon|1896914|
|    Sat|1902370|
|    Sun|1899044|
|    Thu|1918666|
|    Tue|1897678|
|    Wed|1895871|
+-------+-------+





+--------+-----------------+--------------------+------+------+
|n       |avg              |quantiles           |min   |max   |
+--------+-----------------+--------------------+------+------+
|13305915|42.97603902324682|[8.93, 28.99, 63.71]|-500.0|6820.2|
+--------+-----------------+--------------------+------+------+

✅ Ready for live coding session!


                                                                                

In [None]:
# 🧑‍🎓 STUDENT EXERCISE: Basic data exploration
print("📊 EXERCISE: Complete the basic dataset analysis")

# TODO: Students complete these basic operations
print("✏️ YOUR TASKS:")
print("1. Show total transaction count")
print("2. Show unique customer count") 
print("3. Create spending analysis by merchant category")
print("4. Order results by total spending")

print("\n🎯 EXERCISE OBJECTIVES:")
print("• Practice DataFrame operations")
print("• Learn aggregation functions (sum, count)")
print("• Use groupBy and orderBy")
print("• Work with column aliases")

# Hint: Use df.count(), df.select().distinct().count()
# Hint: Use df.groupBy().agg(sum(), count()).orderBy()

print("✅ Ready for your solution!")

## 3. Spark Session Recap 🚀
Spark is already initialized. We’ll keep this short and move to SQL analytics.

- Session tuned for local development and large CSVs
- Temp view `transactions` is created in the cells below
- Proceed to analytics at scale

In [None]:
# (Optional) Spark utilities
from pyspark.sql import functions as F
from pyspark.sql.types import *

print("ℹ️ Spark utilities available. Session already created above.")

In [None]:
# Simple Spark DataFrame operations
print("🔧 LIVE CODING: Basic Spark operations")

# Show basic dataset statistics
print("💰 Amount statistics:")
df.select("amount_usd").describe().show()

# Create temp view for SQL
df.createOrReplaceTempView("transactions")
print("✅ Features added and temp view created!")

print("✏️ EXERCISE: Weekend vs Weekday Analysis")
print("TODO: Write SQL query to compare weekend vs weekday spending")
print("Hint: Use is_weekend column, COUNT(*), SUM(amount_usd)")
print("Hint: GROUP BY is_weekend")

# Template:
spark.sql(
    """
SELECT 
    is_weekend,
    COUNT(*) AS transaction_count,
    AVG(amount_usd) AS avg_spending,
    SUM(amount_usd) AS total_spending,
    MEDIAN(amount_usd) AS median_spending 
FROM transactions 
GROUP BY is_weekend
"""
).show()

print("✅ Ready for your SQL solution!")

🔧 LIVE CODING: Basic Spark operations
💰 Amount statistics:


                                                                                

+-------+-----------------+
|summary|       amount_usd|
+-------+-----------------+
|  count|         13305915|
|   mean|42.97603902324682|
| stddev|81.65574765375871|
|    min|           -500.0|
|    max|           6820.2|
+-------+-----------------+

✅ Features added and temp view created!
✏️ EXERCISE: Weekend vs Weekday Analysis
TODO: Write SQL query to compare weekend vs weekday spending
Hint: Use is_weekend column, COUNT(*), SUM(amount_usd)
Hint: GROUP BY is_weekend




+----------+-----------------+------------------+--------------------+---------------+
|is_weekend|transaction_count|      avg_spending|      total_spending|median_spending|
+----------+-----------------+------------------+--------------------+---------------+
|      true|          3801414|43.101191701298404|1.6384547354999956E8|          29.06|
|     false|          9504501| 42.92598304003574| 4.079900487300027E8|          28.96|
+----------+-----------------+------------------+--------------------+---------------+

✅ Ready for your SQL solution!


                                                                                

## 4. Advanced Spark SQL Analytics 🔍
**Goal:** Complex banking analytics using SQL on big data

### 🏦 Real Banking Use Cases:
- **Fraud Detection:** Unusual spending patterns
- **Customer Segmentation:** Spending behavior analysis
- **Risk Assessment:** Transaction pattern analysis

In [None]:
# 🧑‍🏫 INSTRUCTOR: Banking analytics (LIVE CODING)
print("🔍 LIVE CODING: Banking Analytics Session")

print("🎯 We'll build together:")
print("1. Top customers by spending")
print("2. Top merchants by transaction volume")
print("3. Revenue analysis by merchant")

print("• Complex SQL with aggregate functions")
print("• Window functions and ranking")
print("• Business KPI calculations")
print("• ORDER BY with LIMIT for top-N queries")

# TODO: Live code the customer analysis
print("\n👑 COMING UP: Top customers analysis")

# TODO: Live code the merchant analysis  
print("🏪 COMING UP: Merchant performance analysis")

print("✅ Ready for live banking analytics session!")
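The first item on the list, top customers by spending, can be sketched with `GROUP BY`, `ORDER BY` and `LIMIT`. So that the block runs without a Spark session, it executes the query against a tiny in-memory SQLite table with made-up rows; the table layout is an assumption that mirrors three columns of the `transactions` view. The `SELECT` uses only standard SQL, so the same text should also be usable with `spark.sql(...)`, although that is not verified here.

```python
import sqlite3

# Top-N query: aggregate per customer, then keep the biggest spenders
top_customers_sql = """
SELECT customer_id,
       COUNT(*) AS transaction_count,
       ROUND(SUM(amount_usd), 2) AS total_spending
FROM transactions
GROUP BY customer_id
ORDER BY total_spending DESC
LIMIT 3
"""

# Made-up sample rows (customer_id, merchant_id, amount_usd), for illustration only
sample_rows = [
    (1, 10, 50.0),
    (1, 11, 25.5),
    (2, 10, 200.0),
    (3, 12, 5.0),
    (3, 12, 7.5),
    (4, 13, 1.0),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (customer_id INTEGER, merchant_id INTEGER, amount_usd REAL)"
)
conn.executemany("INSERT INTO transactions VALUES (?, ?, ?)", sample_rows)
top_customers = conn.execute(top_customers_sql).fetchall()
conn.close()

for customer_id, transaction_count, total_spending in top_customers:
    print(customer_id, transaction_count, total_spending)
```

Swapping `customer_id` for `merchant_id` in the query gives a starting point for the merchant ranking.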

In [15]:
# 🧑‍🎓 STUDENT EXERCISE: Fraud Detection
print("🚨 BASIC FRAUD DETECTION")

# High amount transactions (potential fraud) using numeric amounts
print("💰 Transactions above $200:")
spark.sql(
    """
SELECT customer_id, 
       ROUND(amount_usd, 2) as amount_usd,
       merchant_id
FROM transactions 
WHERE amount_usd > 200
ORDER BY amount_usd DESC
LIMIT 5
"""
).show()

print("✅ Basic fraud detection complete!")
print("🚨 POTENTIAL FRAUD DETECTION:")
print("Find customers with transactions > 3 standard deviations from their average")

fraud_query = """
WITH customer_stats AS (
    SELECT 
        customer_id,
        transaction_date,
        amount_usd,
        AVG(amount_usd) OVER (PARTITION BY customer_id) as avg_amount,
        STDDEV_POP(amount_usd) OVER (PARTITION BY customer_id) as stddev_amount
    FROM transactions
),
potential_fraud AS (
    SELECT 
        customer_id,
        transaction_date,
        amount_usd,
        avg_amount,
        stddev_amount,
        ABS(amount_usd - avg_amount) as deviation,
        CASE 
            WHEN ABS(amount_usd - avg_amount) > 3 * stddev_amount 
            THEN 'HIGH_RISK'
            WHEN ABS(amount_usd - avg_amount) > 2 * stddev_amount 
            THEN 'MEDIUM_RISK'
            ELSE 'NORMAL'
        END as risk_level
    FROM customer_stats
)
SELECT 
    risk_level, 
    COUNT(*) as transaction_count,
    SUM(amount_usd) as total_amount,
    AVG(amount_usd) as avg_amount
FROM potential_fraud
GROUP BY risk_level
ORDER BY risk_level
"""
spark.sql(fraud_query).show()
print("✅ Basic fraud detection complete!")

🚨 BASIC FRAUD DETECTION
💰 Transactions above $200:


                                                                                

+-----------+----------+-----------+
|customer_id|amount_usd|merchant_id|
+-----------+----------+-----------+
|        708|    6820.2|      34524|
|       1081|   6613.44|       9026|
|       1259|   5913.37|      85983|
|       1487|   5813.78|       9026|
|        278|   5696.78|       7202|
+-----------+----------+-----------+

✅ Basic fraud detection complete!
🚨 POTENTIAL FRAUD DETECTION:
Find customers with transactions > 3 standard deviations from their average



+-----------+-----------------+--------------------+------------------+
| risk_level|transaction_count|        total_amount|        avg_amount|
+-----------+-----------------+--------------------+------------------+
|  HIGH_RISK|           209928| 5.057713439000009E7|240.92610032963725|
|MEDIUM_RISK|           249851|2.8490776459999997E7| 114.0310683567406|
|     NORMAL|         12846136|4.9276761142999566E8|  38.3592086702177|
+-----------+-----------------+--------------------+------------------+

✅ Basic fraud detection complete!
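The rule inside `fraud_query` compares each transaction with the customer's own mean and population standard deviation. A plain-Python sketch of the same thresholds for a single customer's history may help when checking the SQL by hand; the function name `risk_levels` and the sample amounts are made up for illustration.

```python
from statistics import mean, pstdev


def risk_levels(amounts):
    """Label each amount by its distance from the mean, in population standard deviations."""
    avg = mean(amounts)
    sd = pstdev(amounts)  # population standard deviation, like STDDEV_POP in the query
    labels = []
    for amount in amounts:
        deviation = abs(amount - avg)
        if deviation > 3 * sd:
            labels.append("HIGH_RISK")
        elif deviation > 2 * sd:
            labels.append("MEDIUM_RISK")
        else:
            labels.append("NORMAL")
    return labels


# Made-up history: twenty small purchases and one very large one
history = [10.0] * 20 + [1000.0]
labels = risk_levels(history)
print("large purchase:", labels[-1])
print("small purchase:", labels[0])
```

When every amount is identical, the standard deviation is zero and no deviation exceeds it, so all transactions stay `NORMAL`, matching the strict `>` comparisons in the SQL.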


                                                                                

## 5. Simple Cloud Deployment 🌥️
**Goal:** Basic overview of deploying to cloud

### 🌟 Why Cloud?
- **Scale:** Handle big datasets
- **Storage:** Secure data storage  
- **Compute:** More processing power

In [None]:
# 🧑‍🏫 INSTRUCTOR: Simple cloud overview (demo only)
print("☁️ CLOUD DEPLOYMENT BASICS")
print("Key concepts:")
print("1. Upload data to cloud storage")
print("2. Create compute cluster") 
print("3. Run Spark jobs")
print("4. Store results")

print("\n✅ Cloud overview complete!")

In [None]:
# ✅ COMPLETE SOLUTION: Customize GCP deployment
print("🎯 COMPLETE: Customize the deployment configuration!")
print("=" * 50)

# Solution 1: Update configuration for your environment
print("⚙️ CUSTOMIZE YOUR CONFIGURATION:")
my_gcp_config = {
    "project_id": "banking-analytics-demo-2025",  # Example project ID
    "bucket_name": "banking-data-workshop-eu",   # Example bucket name
    "region": "europe-west3",                    # Frankfurt region for GDPR compliance
    "dataset_location": "EU",                    # European Union for data residency
    "service_account_email": "banking-analytics@banking-analytics-demo-2025.iam.gserviceaccount.com",
    "vpc_network": "banking-vpc",                # Custom VPC for security
    "subnet": "banking-subnet-eu-west3"          # Specific subnet
}

print("📝 Your GCP Config:")
import json
print(json.dumps(my_gcp_config, indent=2))

# Solution 2: Create a deployment checklist
print("\n✅ DEPLOYMENT CHECKLIST:")
deployment_checklist = [
    "GCP project created and billing enabled",
    "Service account created with necessary permissions",
    "Cloud Storage bucket created in EU region",
    "Databricks workspace provisioned"
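]
# Print the checklist so it actually appears under the heading above
for item in deployment_checklist:
    print(f"- {item}")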
print("\n✅ Cloud overview complete!")
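Before the first step of the cloud overview (uploading data to cloud storage), it can help to check that the configuration is complete and to derive the target location. A small sketch, assuming a config dictionary with the same keys as `my_gcp_config`; the helper names and the `raw/` object prefix are hypothetical, and no GCP call is made.

```python
# Keys this sketch treats as mandatory (an assumption, not a GCP requirement)
REQUIRED_KEYS = ("project_id", "bucket_name", "region")


def missing_keys(config):
    """Return the required keys that are absent or empty in the config."""
    return [key for key in REQUIRED_KEYS if not config.get(key)]


def gcs_uri(config, object_path):
    """Build the gs:// URI of an object inside the configured bucket."""
    return f"gs://{config['bucket_name']}/{object_path.lstrip('/')}"


# Example values copied from the configuration cell above
example_config = {
    "project_id": "banking-analytics-demo-2025",
    "bucket_name": "banking-data-workshop-eu",
    "region": "europe-west3",
}

print("Missing keys:", missing_keys(example_config))
print("Upload target:", gcs_uri(example_config, "raw/transactions_data.csv"))
```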

## 6. Simple Data Integration 🔗
**Goal:** Combine banking data with external sources

### 💡 Real Banking Examples:
- **Exchange Rates:** Convert international transactions
- **Interest Rates:** Economic impact on spending
- **Merchant Data:** Enhanced merchant information

In [None]:
# 🧑‍🏫 INSTRUCTOR: API Integration (LIVE CODING)
print("🔗 LIVE CODING: External Data Integration")

print("🎯 We'll demonstrate:")
print("1. Making API calls to financial services")
print("2. Handling JSON responses")
print("3. Error handling and fallbacks")
print("4. Integrating live data into Spark DataFrames")

# Students will learn:
import requests

print("📊 COMING UP: Live exchange rate API integration")
print("💡 Key concepts:")
print("• REST API calls with requests library")
print("• JSON data parsing")
print("• Try/except error handling")
print("• DataFrame transformations with live data")

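# TODO: Live code the API integration
# The access key is read from an environment variable instead of being hard-coded
# (the variable name EXCHANGERATES_API_KEY is an assumption; set it before running)
import os
api_key = os.environ.get("EXCHANGERATES_API_KEY", "")
api_url = f"https://api.exchangeratesapi.io/v1/latest?access_key={api_key}&format=1"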

print("✅ Ready for live API integration session!")
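Points 2 and 3 of the list, handling the JSON response and falling back on errors, can be sketched without a network call. The block assumes the response body contains a `base` currency of `EUR` and a `rates` object with a `USD` entry; that layout is an assumption about the API, and the fallback rate is a made-up placeholder, not a real quote. The function names are hypothetical.

```python
import json

FALLBACK_USD_PER_EUR = 1.10  # hypothetical placeholder used when the response is unusable


def usd_per_eur(response_text, fallback=FALLBACK_USD_PER_EUR):
    """Extract the USD-per-EUR rate from a JSON response body, or return the fallback."""
    try:
        payload = json.loads(response_text)
        rate = float(payload["rates"]["USD"])
        if payload.get("base", "EUR") != "EUR" or rate <= 0:
            raise ValueError("unexpected base currency or non-positive rate")
        return rate
    except (ValueError, KeyError, TypeError):
        # Covers malformed JSON, missing keys and non-numeric values
        return fallback


def usd_to_eur(amount_usd, rate):
    """Convert a USD amount to EUR using a USD-per-EUR rate."""
    return round(amount_usd / rate, 2)


good_body = '{"success": true, "base": "EUR", "rates": {"USD": 1.25}}'
print(usd_per_eur(good_body))
print(usd_per_eur("service unavailable"))
print(usd_to_eur(100.0, usd_per_eur(good_body)))
```

In the live session, `response_text` would come from the HTTP response body, and the resulting rate could be applied to `amount_usd` to derive an `amount_eur` column.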

In [None]:
# 🧑‍🎓 FINAL EXERCISE: Integration Analysis
print("📊 FINAL CHALLENGE: Live Exchange Rate Analysis")

print("🎯 CAPSTONE EXERCISE:")
print("Combine everything you've learned to analyze currency impact!")

print("\n✏️ YOUR MISSION:")
print("1. Use the live exchange rate data we just integrated")
print("2. Compare USD vs EUR spending by merchant category") 
print("3. Calculate currency impact percentages")
print("4. Generate business insights")

print("\n💡 REQUIRED QUERIES:")
print("Query 1: Basic comparison")
print("• SELECT merchant_category, SUM(amount_usd), SUM(amount_eur)")
print("• Calculate totals and averages for both currencies")
print("• GROUP BY merchant_category")

print("\nQuery 2: Advanced analysis")  
print("• Calculate (amount_usd - amount_eur) differences")
print("• Compute percentage impact: (usd-eur)/eur * 100")
print("• Order by currency impact")

print("\n🎯 LEARNING OBJECTIVES:")
print("• Integration of live external data")
print("• Multi-currency financial analysis") 
print("• Business insight generation")
print("• Advanced SQL calculations")

# TODO: Students write their analysis queries
live_rate_analysis = """
-- Write your currency comparison query here
-- Include: merchant_category, totals, averages, differences
"""

currency_impact_query = """
-- Write your advanced impact analysis here
-- Calculate percentage impacts and ranking
"""

print("✅ Ready for your final analysis challenge!")
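The percentage formula from Query 2, `(usd-eur)/eur * 100`, is worth checking on a few numbers before writing the SQL. The sketch below assumes a single conversion rate (USD per EUR) applied to every transaction; the rate and the category totals are made up for illustration, and the function name is hypothetical.

```python
def currency_impact_pct(total_usd, usd_per_eur):
    """Percentage difference between a USD total and its EUR equivalent."""
    total_eur = total_usd / usd_per_eur
    return (total_usd - total_eur) / total_eur * 100


rate = 1.25  # hypothetical USD-per-EUR rate
category_totals_usd = {"Grocery": 1000.0, "Restaurant": 250.0, "Gas Station": 80.0}

impacts = {
    category: round(currency_impact_pct(total, rate), 6)
    for category, total in category_totals_usd.items()
}
print(impacts)
```

Under that assumption the impact works out to `(rate - 1) * 100` for every category, regardless of the totals, so ordering by currency impact is only informative when the rate varies, for example by transaction date.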

## 🎯 Workshop Summary

You've learned the essentials of big data analytics:

1. **PySpark Basics:** Loading and processing data
2. **SQL Analysis:** Simple aggregations and insights
3. **Banking Analytics:** Basic fraud detection
4. **Cloud Concepts:** Deployment overview
5. **Data Integration:** Combining banking data with external sources