# Banking Data Analysis - Live Coding Workshop
## Big Data Analytics in Banking | 13:00-15:40

### 🎯 **Workshop Agenda**
- **13:00-13:45:** Introduction to Data Analysis + Banking Transaction Analysis
- **13:55-14:40:** Spark Deep-Dive & GCP Setup
- **14:50-15:40:** Data Acquisition and Integration

### 🛠 **What we'll learn today:**
1. **Data analysis process** in practice
2. **Data mining** for banking patterns
3. **Spark setup** and SQL queries
4. **GCP/Databricks** Configuration
5. **Web scraping** for financial data
6. **Multi-Source Integration**

### 📋 **Live Coding Approach**
- **Instructor demonstrates** → **Students modify/extend**
- **Short code blocks** with thorough comments
- **Interactive exercises** at each step

## 1. Load Large Banking Transactions (PySpark) 🏦
**Goal:** Load a >1GB CSV efficiently using PySpark and prepare it for analysis

Dataset: `transactions_data.csv` (set the path below)

### 🎓 Live Coding Exercise:
- **Instructor:** Sets up Spark and loads the dataset with an explicit schema or fast inference
- **Students:** Add derived columns and validate data quality

In [None]:
# Simple PySpark setup for banking data
from pyspark.sql import SparkSession
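# Note: this wildcard import shadows Python built-ins such as sum, min, max and round
# with the Spark column functions of the same name
from pyspark.sql.functions import *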

# Dataset path
dataset_path = "transactions_data.csv"

# Create basic Spark session
spark = SparkSession.builder.appName("Banking Analysis").getOrCreate()
spark.sparkContext.setLogLevel("WARN")

print("✅ Spark ready!")
print(f"Version: {spark.version}")

In [None]:
# Load and prepare banking data
print("📦 Loading banking dataset...")

# Simple data loading
df = spark.read.option("header", True).option("inferSchema", True).csv(dataset_path)

print("📋 Schema:")
df.printSchema()

# Basic column mapping
df = df.withColumnRenamed("client_id", "customer_id") \
       .withColumnRenamed("id", "transaction_id") \
       .withColumn("transaction_date", to_timestamp(col("date")))

print(f"📊 Loaded {df.count():,} transactions")
df.show(5)
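
The exercise notes mention loading with an explicit schema. With `inferSchema`, Spark reads the file an extra time to guess the column types, which is noticeable on a >1GB CSV. A minimal sketch of the explicit alternative follows, using a DDL schema string. The column list is an assumption: it holds only the columns this notebook references, in a guessed order, and has to be replaced with the file's complete header. `build_ddl_schema` and `load_with_schema` are hypothetical helper names. `amount` and `date` stay strings because the later cells parse them, and reading `zip` as a string keeps any leading zeros.

```python
# ASSUMPTION: illustrative column names, order and types. Replace this list
# with every column of the real CSV header, in file order.
ASSUMED_COLUMNS = [
    ("id", "BIGINT"),
    ("date", "STRING"),       # parsed later with to_timestamp()
    ("client_id", "BIGINT"),
    ("amount", "STRING"),     # contains a "$" prefix, cleaned later
    ("merchant_id", "BIGINT"),
    ("zip", "STRING"),        # string keeps leading zeros
    ("mcc", "INT"),
]


def build_ddl_schema(columns):
    """Turn (name, type) pairs into a DDL schema string for DataFrameReader.schema()."""
    return ", ".join(f"`{name}` {dtype}" for name, dtype in columns)


def load_with_schema(spark, path, columns=ASSUMED_COLUMNS):
    """Read the CSV with a fixed schema so Spark skips the inference pass."""
    return (
        spark.read
        .option("header", True)
        # Validate the schema against the header instead of forcing it
        .option("enforceSchema", False)
        .schema(build_ddl_schema(columns))
        .csv(path)
    )

# In the notebook: df = load_with_schema(spark, dataset_path)
```

With `enforceSchema` set to `False`, Spark should check the schema's field names against the CSV header by position, so a wrong column list fails loudly rather than silently mislabeling columns.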

In [None]:
# Add basic features
print("🔧 Adding basic features...")

# First, ensure amount is properly numeric - convert to amount_usd as double
df = df.withColumn("amount_usd", regexp_replace(col("amount"), "[$]", "").cast("double"))

print(f"✅ Amount data types - USD: {dict(df.dtypes)['amount_usd']}")

# Add simple time features
# Spark's dayofweek() returns 1 for Sunday through 7 for Saturday, so 1 and 7 mark the weekend
df = df.withColumn("hour", hour(col("transaction_date"))) \
       .withColumn("is_weekend", dayofweek(col("transaction_date")).isin([1, 7]))

# Add merchant category (simplified)
df = df.withColumn("merchant_category",
                   when(col("mcc").isin(5411, 5441), "Grocery")
                   .when(col("mcc").isin(5812, 5813), "Restaurant") 
                   .when(col("mcc").isin(5541, 5542), "Gas Station")
                   .otherwise("Other"))

# Add zip_region column
df = df.withColumn("zip_region",
                   when(col("zip").substr(1,1).isin(["0", "1", "2"]), "Northeast")
                   .when(col("zip").substr(1,1).isin(["3", "4", "5"]), "Southeast") 
                   .when(col("zip").substr(1,1).isin(["6", "7"]), "Central")
                   .when(col("zip").substr(1,1).isin(["8", "9"]), "West")
                   .otherwise("Unknown"))

# Create temp view for SQL
df.createOrReplaceTempView("transactions")
print("✅ Features added and temp view created!")

# Show sample with numeric amounts
df.select("customer_id", "amount_usd", "merchant_category", "hour", "is_weekend", "zip_region").show()
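
The student task for this section is to validate data quality. One possible sketch is a single SQL pass over the `transactions` view that counts rows where the derived columns came out empty or suspicious. `QUALITY_SQL` and `run_quality_check` are hypothetical names, and which checks matter (for example, whether negative amounts are legitimate refunds) is an assumption about the dataset.

```python
# One scan over the temp view; each SUM(CASE ...) counts rows failing a check
QUALITY_SQL = """
SELECT
    COUNT(*) AS total_rows,
    SUM(CASE WHEN amount_usd IS NULL THEN 1 ELSE 0 END) AS null_amounts,
    SUM(CASE WHEN transaction_date IS NULL THEN 1 ELSE 0 END) AS null_dates,
    SUM(CASE WHEN amount_usd < 0 THEN 1 ELSE 0 END) AS negative_amounts,
    SUM(CASE WHEN merchant_category = 'Other' THEN 1 ELSE 0 END) AS uncategorized
FROM transactions
"""


def run_quality_check(spark):
    """Run the quality summary against the `transactions` temp view."""
    return spark.sql(QUALITY_SQL)

# In the notebook: run_quality_check(spark).show()
```

A non-zero `null_amounts` would indicate that the `"$"` cleanup and cast to double failed for some rows, and `null_dates` would point to values that `to_timestamp` could not parse.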

## 2. Basic Data Exploration with Spark 🐼➡️🔥
**Goal:** Explore the 1GB+ dataset with Spark (no pandas copies)

### 🎓 Live Coding Exercise:
- **Instructor:** Demonstrates Spark actions and SQL
- **Students:** Build aggregations and quality checks at scale

In [None]:
# 🧑‍🏫 INSTRUCTOR: Basic Spark exploration (LIVE CODING)
# TODO: Live code the data exploration function
print("📊 BANKING DATA OVERVIEW")
print("🎯 LIVE CODING: We'll build a comprehensive exploration function together")

# Students will learn:
# - Basic DataFrame operations (count, printSchema)
# - Date range analysis with min/max
# - Groupby operations for weekday analysis
# - Distinct counts for unique customers
# - Statistical functions and percentiles
# - Top N analysis with orderBy

print("✅ Ready for live coding session!")
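
As a starting point for the live session, the count, distinct-customer and date-range items from the list above can be sketched in one query against the `transactions` view. `OVERVIEW_SQL` and `dataset_overview` are hypothetical names; the weekday and percentile steps are left for the live coding.

```python
# Row count, unique customers and date range in a single pass
OVERVIEW_SQL = """
SELECT
    COUNT(*) AS total_transactions,
    COUNT(DISTINCT customer_id) AS unique_customers,
    MIN(transaction_date) AS first_transaction,
    MAX(transaction_date) AS last_transaction
FROM transactions
"""


def dataset_overview(spark):
    """Return a one-row DataFrame summarizing the `transactions` temp view."""
    return spark.sql(OVERVIEW_SQL)

# In the notebook: dataset_overview(spark).show(truncate=False)
```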

In [None]:
# 🧑‍🎓 STUDENT EXERCISE: Basic data exploration
print("📊 EXERCISE: Complete the basic dataset analysis")

# TODO: Students complete these basic operations
print("✏️ YOUR TASKS:")
print("1. Show total transaction count")
print("2. Show unique customer count") 
print("3. Create spending analysis by merchant category")
print("4. Order results by total spending")

print("\n🎯 EXERCISE OBJECTIVES:")
print("• Practice DataFrame operations")
print("• Learn aggregation functions (sum, count)")
print("• Use groupBy and orderBy")
print("• Work with column aliases")

# Hint: Use df.count(), df.select().distinct().count()
# Hint: Use df.groupBy().agg(sum(), count()).orderBy()

print("✅ Ready for your solution!")

## 3. Spark Session Recap 🚀
Spark is already initialized. We’ll keep this short and move to SQL analytics.

- Session created with default settings for local development
- Temp view `transactions` is ready
- Proceed to analytics at scale

In [None]:
# (Optional) Spark utilities
from pyspark.sql import functions as F
from pyspark.sql.types import *

print("ℹ️ Spark utilities available. Session already created above.")

In [None]:
# Simple Spark DataFrame operations
print("🔧 LIVE CODING: Basic Spark operations")

# Show basic dataset statistics
print("💰 Amount statistics:")
df.select("amount_usd").describe().show()

print("✏️ EXERCISE: Weekend vs Weekday Analysis")
print("TODO: Write SQL query to compare weekend vs weekday spending")
print("Hint: Use is_weekend column, COUNT(*), SUM(amount_usd)")
print("Hint: GROUP BY is_weekend")

# Template:
# spark.sql("""
# SELECT 
#     is_weekend,
#     -- Add your aggregations here
# FROM transactions 
# -- Add your GROUP BY here
# """).show()

print("✅ Ready for your SQL solution!")
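
A filled-in version of the template, as one possible solution. It uses the `is_weekend` column and the `transactions` view from Section 1; the alias names and the added average are illustrative choices, and `WEEKEND_SQL` and `weekend_vs_weekday` are hypothetical names.

```python
# Compare transaction volume and spending between weekdays and weekends
WEEKEND_SQL = """
SELECT
    is_weekend,
    COUNT(*) AS txn_count,
    ROUND(SUM(amount_usd), 2) AS total_spent,
    ROUND(AVG(amount_usd), 2) AS avg_amount
FROM transactions
GROUP BY is_weekend
ORDER BY is_weekend
"""


def weekend_vs_weekday(spark):
    """Run the comparison against the `transactions` temp view."""
    return spark.sql(WEEKEND_SQL)

# In the notebook: weekend_vs_weekday(spark).show()
```

Because a week has five weekdays and two weekend days, `avg_amount` is usually the fairer column to compare than `total_spent`.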

## 4. Advanced Spark SQL Analytics 🔍
**Goal:** Complex banking analytics using SQL on big data

### 🏦 Real Banking Use Cases:
- **Fraud Detection:** Unusual spending patterns
- **Customer Segmentation:** Spending behavior analysis
- **Risk Assessment:** Transaction pattern analysis

In [None]:
# 🧑‍🏫 INSTRUCTOR: Banking analytics (LIVE CODING)
print("🔍 LIVE CODING: Banking Analytics Session")

print("🎯 We'll build together:")
print("1. Top customers by spending")
print("2. Top merchants by transaction volume")
print("3. Revenue analysis by merchant")

print("💡 Students will learn:")
print("• Complex SQL with aggregate functions")
print("• Window functions and ranking")
print("• Business KPI calculations")
print("• ORDER BY with LIMIT for top-N queries")

# TODO: Live code the customer analysis
print("\n👑 COMING UP: Top customers analysis")

# TODO: Live code the merchant analysis  
print("🏪 COMING UP: Merchant performance analysis")

print("✅ Ready for live banking analytics session!")
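
A sketch of what the top-customers query could look like, combining an aggregate, a ranking window function and a top-N limit. The cutoff of 10 and the alias names are assumptions, as are the names `TOP_CUSTOMERS_SQL` and `top_customers`.

```python
# Aggregate per customer first, then rank the (much smaller) result
TOP_CUSTOMERS_SQL = """
WITH customer_totals AS (
    SELECT
        customer_id,
        COUNT(*) AS txn_count,
        ROUND(SUM(amount_usd), 2) AS total_spent
    FROM transactions
    GROUP BY customer_id
)
SELECT
    customer_id,
    txn_count,
    total_spent,
    RANK() OVER (ORDER BY total_spent DESC) AS spend_rank
FROM customer_totals
ORDER BY spend_rank, customer_id
LIMIT 10
"""


def top_customers(spark):
    """Return the ten highest-spending customers from the `transactions` view."""
    return spark.sql(TOP_CUSTOMERS_SQL)

# In the notebook: top_customers(spark).show()
```

The window has no `PARTITION BY`, so Spark moves the ranked rows to a single partition. That is acceptable here because the ranking runs over one row per customer, not over every transaction.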

In [None]:
# 🧑‍🎓 STUDENT EXERCISE: Fraud Detection
print("🚨 EXERCISE: Build a Basic Fraud Detection System")

print("✏️ YOUR CHALLENGES:")
print("1. Find transactions above $200 (potential high-value fraud)")
print("2. ADVANCED: Find customers with unusual spending patterns")

print("\n🎯 BASIC EXERCISE (Start here):")
print("Write SQL to find high-amount transactions")
print("• SELECT customer_id, amount_usd, merchant_id") 
print("• WHERE amount_usd > 200")
print("• ORDER BY amount_usd DESC")
print("• LIMIT 5")

# TODO: Write your SQL query here
basic_fraud_query = """
-- Write your high-amount transaction query here
"""

print("\n🎯 ADVANCED EXERCISE (Optional):")
print("Statistical fraud detection using window functions")
print("• Calculate customer average and standard deviation")  
print("• Find transactions > 3 standard deviations from average")
print("• Classify as HIGH_RISK, MEDIUM_RISK, NORMAL")

# TODO: Advanced students can attempt window functions
advanced_fraud_query = """
-- Advanced: Use window functions for statistical analysis
-- WITH customer_stats AS (...)
"""

print("✅ Ready to detect fraud!")


## 5. Simple Cloud Deployment 🌥️
**Goal:** Basic overview of deploying to cloud

### 🌟 Why Cloud?
- **Scale:** Handle big datasets
- **Storage:** Secure data storage  
- **Compute:** More processing power

In [None]:
# 🧑‍🏫 INSTRUCTOR: Simple cloud overview (demo only)
print("☁️ CLOUD DEPLOYMENT BASICS")
print("Key concepts:")
print("1. Upload data to cloud storage")
print("2. Create compute cluster") 
print("3. Run Spark jobs")
print("4. Store results")

print("\n✅ Cloud overview complete!")

In [None]:
# ✅ COMPLETE SOLUTION: Customize GCP deployment
print("🎯 COMPLETE: Customize the deployment configuration!")
print("=" * 50)

# Solution 1: Update configuration for your environment
print("⚙️ CUSTOMIZE YOUR CONFIGURATION:")
my_gcp_config = {
    "project_id": "banking-analytics-demo-2025",  # Example project ID
    "bucket_name": "banking-data-workshop-eu",   # Example bucket name
    "region": "europe-west3",                    # Frankfurt region for GDPR compliance
    "dataset_location": "EU",                    # European Union for data residency
    "service_account_email": "banking-analytics@banking-analytics-demo-2025.iam.gserviceaccount.com",
    "vpc_network": "banking-vpc",                # Custom VPC for security
    "subnet": "banking-subnet-eu-west3"          # Specific subnet
}

print("📝 Your GCP Config:")
import json
print(json.dumps(my_gcp_config, indent=2))

# Solution 2: Create a deployment checklist
print("\n✅ DEPLOYMENT CHECKLIST:")
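deployment_checklist = [
    "GCP project created and billing enabled",
    "Service account created with necessary permissions",
    "Cloud Storage bucket created in EU region",
    "Databricks workspace provisioned"
]
# Print each item so the checklist is actually displayed
for item in deployment_checklist:
    print(f"  [ ] {item}")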
print("\n✅ Cloud overview complete!")
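
To connect the configuration with the four deployment steps (upload, cluster, run, store), the sketch below derives Cloud Storage URIs from the bucket name. The `raw` and `results` folder names and the `gcs_uri` helper are hypothetical; only the bucket name is taken from the example configuration above.

```python
def gcs_uri(bucket_name, *parts):
    """Build a gs:// URI from a bucket name and path segments."""
    return "gs://" + "/".join([bucket_name, *parts])


# Minimal copy of the example configuration
config = {"bucket_name": "banking-data-workshop-eu", "region": "europe-west3"}

# Hypothetical folder layout inside the bucket
input_uri = gcs_uri(config["bucket_name"], "raw", "transactions_data.csv")
output_uri = gcs_uri(config["bucket_name"], "results", "category_spending")

print(f"Input:  {input_uri}")
print(f"Output: {output_uri}")

# On a cluster where the Cloud Storage connector is available
# (Dataproc ships with it), the same reader and writer calls apply:
# df = spark.read.option("header", True).csv(input_uri)
# result_df.write.mode("overwrite").parquet(output_uri)
```

Writing results as Parquet instead of CSV keeps the column types, so later jobs do not need to infer or redeclare the schema.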

## 6. Simple Data Integration 🔗
**Goal:** Combine banking data with external sources

### 💡 Real Banking Examples:
- **Exchange Rates:** Convert international transactions
- **Interest Rates:** Economic impact on spending
- **Merchant Data:** Enhanced merchant information

In [None]:
# 🧑‍🏫 INSTRUCTOR: API Integration (LIVE CODING)
print("🔗 LIVE CODING: External Data Integration")

print("🎯 We'll demonstrate:")
print("1. Making API calls to financial services")
print("2. Handling JSON responses")
print("3. Error handling and fallbacks")
print("4. Integrating live data into Spark DataFrames")

# HTTP client library used for the API calls
import requests

print("📊 COMING UP: Live exchange rate API integration")
print("💡 Key concepts:")
print("• REST API calls with requests library")
print("• JSON data parsing")
print("• Try/except error handling")
print("• DataFrame transformations with live data")

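# TODO: Live code the API integration
# The access key is read from the environment instead of being hardcoded in the notebook.
# EXCHANGERATES_API_KEY is a hypothetical variable name; set it before running this cell.
import os
api_key = os.environ.get("EXCHANGERATES_API_KEY", "")
api_url = f"https://api.exchangeratesapi.io/v1/latest?access_key={api_key}&format=1"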

print("✅ Ready for live API integration session!")
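
The four steps listed in this cell (API call, JSON handling, error handling with a fallback, use in Spark) can be sketched as a small function. It assumes the response is JSON with a `rates` object keyed by currency code and quoted against EUR, so that `rates["USD"]` is the number of USD per EUR. That response shape, the `fetch_usd_per_eur` name and the `http_get` parameter are assumptions, and the fallback rate is supplied by the caller.

```python
import requests


def fetch_usd_per_eur(api_url, fallback_rate, http_get=requests.get, timeout=10):
    """Return (rate, source) where source is 'live' or 'fallback'."""
    try:
        response = http_get(api_url, timeout=timeout)
        response.raise_for_status()           # HTTP errors become exceptions
        payload = response.json()             # parse the JSON body
        rate = float(payload["rates"]["USD"])  # assumed response shape
        return rate, "live"
    except (requests.RequestException, KeyError, TypeError, ValueError) as exc:
        # Network failure, bad status, malformed JSON or missing field
        print(f"⚠️ Exchange rate lookup failed ({exc!r}); using fallback rate")
        return fallback_rate, "fallback"

# In the notebook (1.0 is a placeholder, not a real rate):
# rate, source = fetch_usd_per_eur(api_url, fallback_rate=1.0)
```

The returned rate could then feed a new column, for example `df.withColumn("amount_eur", col("amount_usd") / lit(rate))`, after which the `transactions` view has to be registered again so that SQL queries can see `amount_eur`.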

In [None]:
# 🧑‍🎓 FINAL EXERCISE: Integration Analysis
print("📊 FINAL CHALLENGE: Live Exchange Rate Analysis")

print("🎯 CAPSTONE EXERCISE:")
print("Combine everything you've learned to analyze currency impact!")

print("\n✏️ YOUR MISSION:")
print("1. Use the live exchange rate data we just integrated")
print("2. Compare USD vs EUR spending by merchant category") 
print("3. Calculate currency impact percentages")
print("4. Generate business insights")

print("\n💡 REQUIRED QUERIES:")
print("Query 1: Basic comparison")
print("• SELECT merchant_category, SUM(amount_usd), SUM(amount_eur)")
print("• Calculate totals and averages for both currencies")
print("• GROUP BY merchant_category")

print("\nQuery 2: Advanced analysis")  
print("• Calculate (amount_usd - amount_eur) differences")
print("• Compute percentage impact: (usd-eur)/eur * 100")
print("• Order by currency impact")

print("\n🎯 LEARNING OBJECTIVES:")
print("• Integration of live external data")
print("• Multi-currency financial analysis") 
print("• Business insight generation")
print("• Advanced SQL calculations")

# TODO: Students write their analysis queries
live_rate_analysis = """
-- Write your currency comparison query here
-- Include: merchant_category, totals, averages, differences
"""

currency_impact_query = """
-- Write your advanced impact analysis here
-- Calculate percentage impacts and ranking
"""

print("✅ Ready for your final analysis challenge!")
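
One possible shape for the two required queries, merged into a single statement. It assumes an `amount_eur` column has already been added to the DataFrame (for example `amount_usd` divided by a USD-per-EUR rate) and that the `transactions` view was registered again afterwards. `CURRENCY_SQL` and `currency_comparison` are hypothetical names.

```python
# Totals, averages and percentage impact per merchant category
CURRENCY_SQL = """
SELECT
    merchant_category,
    ROUND(SUM(amount_usd), 2) AS total_usd,
    ROUND(SUM(amount_eur), 2) AS total_eur,
    ROUND(AVG(amount_usd), 2) AS avg_usd,
    ROUND(AVG(amount_eur), 2) AS avg_eur,
    ROUND(
        (SUM(amount_usd) - SUM(amount_eur))
        / NULLIF(SUM(amount_eur), 0) * 100, 2
    ) AS impact_pct
FROM transactions
GROUP BY merchant_category
ORDER BY total_usd DESC
"""


def currency_comparison(spark):
    """Run the USD vs EUR comparison against the `transactions` temp view."""
    return spark.sql(CURRENCY_SQL)

# In the notebook: currency_comparison(spark).show()
```

When every amount is converted with the same rate `r`, the percentage impact `(usd - eur) / eur * 100` equals `(r - 1) * 100` for every category. Differences between categories then show up in the totals and averages, not in `impact_pct`.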

## 🎯 Workshop Summary

You've learned the essentials of big data analytics:

1. **PySpark Basics:** Loading and processing data
2. **SQL Analysis:** Simple aggregations and insights
3. **Banking Analytics:** Basic fraud detection
4. **Cloud Concepts:** Deployment overview
5. **Data Integration:** Combining banking data with external sources