# Banking Data Analysis - Live Coding Workshop
## Big Data Analytics in Banking | 13:00-15:40

### 🎯 **Workshop Agenda**
- **13:00-13:45:** Introduction to Data Analysis + Banking Transaction Analysis
- **13:55-14:40:** Spark Deep-Dive & GCP Setup
- **14:50-15:40:** Data Acquisition and Integration

### 🛠 **What we will learn today:**
1. **Data analysis process** in practice
2. **Data mining** for banking patterns
3. **Spark setup** and SQL queries
4. **GCP/Databricks** configuration
5. **Web scraping** for financial data
6. **Multi-source integration**

### 📋 **Live Coding Approach**
- **Instructor demonstrates** → **Students modify/extend**
- **Short code blocks** with thorough comments
- **Interactive exercises** at each step

## 1. Load Large Banking Transactions (PySpark) 🏦
**Goal:** Load a >1GB CSV efficiently using PySpark and prepare it for analysis

Dataset: `transactions_data.csv` (set the path below)

### 🎓 Live Coding Exercise:
- **Instructor:** Sets up Spark and loads the dataset with an explicit schema or fast inference
- **Students:** Add derived columns and validate data quality
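
The instructor step mentions an explicit schema as an alternative to inference. Inference needs an extra pass over the file, which is noticeable on a >1GB CSV. The following is a minimal sketch of the explicit-schema route. The names `TRANSACTION_SCHEMA_DDL` and `load_transactions` are hypothetical, and the column order and types are assumptions: the notebook only confirms that columns named `id`, `date`, `client_id`, `amount`, `merchant_id`, `zip`, and `mcc` are used. With Spark's default CSV settings a user-supplied schema is applied by position, so the list has to contain every column of the real file in file order.

```python
# Hypothetical column list: the names come from the notebook's code, but their
# order and types are assumptions. Adjust to match the real CSV header.
TRANSACTION_SCHEMA_DDL = (
    "id BIGINT, date STRING, client_id BIGINT, amount STRING, "
    "merchant_id BIGINT, zip STRING, mcc INT"
)


def load_transactions(spark, path, schema_ddl=TRANSACTION_SCHEMA_DDL):
    """Read the CSV with a fixed schema instead of inferSchema."""
    return (
        spark.read
        .option("header", True)  # the first row holds column names, not data
        .schema(schema_ddl)      # DDL string; a StructType works as well
        .csv(path)
    )


# Usage in the notebook (assumes the `spark` session and `dataset_path` below):
# df = load_transactions(spark, dataset_path)
```

Keeping `amount` and `zip` as strings is deliberate in this sketch: the amount still carries a `$` prefix that is stripped later, and a string ZIP code keeps its leading zeros.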

In [None]:
# Simple PySpark setup for banking data
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

# Dataset path
dataset_path = "transactions_data.csv"

# Create basic Spark session
spark = SparkSession.builder.appName("Banking Analysis").getOrCreate()
spark.sparkContext.setLogLevel("WARN")

print("✅ Spark ready!")
print(f"Version: {spark.version}")

In [None]:
# Load and prepare banking data
print("📦 Loading banking dataset...")

# Simple data loading
df = spark.read.option("header", True).option("inferSchema", True).csv(dataset_path)

print("📋 Schema:")
df.printSchema()

# Basic column mapping
df = df.withColumnRenamed("client_id", "customer_id") \
       .withColumnRenamed("id", "transaction_id") \
       .withColumn("transaction_date", to_timestamp(col("date")))

print(f"📊 Loaded {df.count():,} transactions")
df.show(5)

In [None]:
# Add basic features
print("🔧 Adding basic features...")

# Strip the "$" prefix from amount, then cast the result to double as amount_usd
df = df.withColumn("amount_usd", regexp_replace(col("amount"), "[$]", "").cast("double"))

print(f"✅ Amount data type - amount_usd: {dict(df.dtypes)['amount_usd']}")

# Add simple time features
# dayofweek() returns 1 for Sunday through 7 for Saturday, so [1, 7] marks the weekend
df = df.withColumn("hour", hour(col("transaction_date"))) \
       .withColumn("is_weekend", dayofweek(col("transaction_date")).isin([1, 7]))

# Add merchant category (simplified)
df = df.withColumn("merchant_category",
                   when(col("mcc").isin(5411, 5441), "Grocery")
                   .when(col("mcc").isin(5812, 5813), "Restaurant") 
                   .when(col("mcc").isin(5541, 5542), "Gas Station")
                   .otherwise("Other"))

# Add zip_region column
df = df.withColumn("zip_region",
                   when(col("zip").substr(1,1).isin(["0", "1", "2"]), "Northeast")
                   .when(col("zip").substr(1,1).isin(["3", "4", "5"]), "Southeast") 
                   .when(col("zip").substr(1,1).isin(["6", "7"]), "Central")
                   .when(col("zip").substr(1,1).isin(["8", "9"]), "West")
                   .otherwise("Unknown"))

# Create temp view for SQL
df.createOrReplaceTempView("transactions")
print("✅ Features added and temp view created!")

# Show sample with numeric amounts
df.select("customer_id", "amount_usd", "merchant_category", "hour", "is_weekend", "zip_region").show()

## 2. Basic Data Exploration with Spark 🐼➡️🔥
**Goal:** Explore the 1GB+ dataset with Spark (no pandas copies)

### 🎓 Live Coding Exercise:
- **Instructor:** Demonstrates Spark actions and SQL
- **Students:** Build aggregations and quality checks at scale

In [None]:
# 🧑‍🏫 INSTRUCTOR: Basic Spark exploration (LIVE CODING)
# TODO: Live code the data exploration function
print("📊 BANKING DATA OVERVIEW")
print("🎯 LIVE CODING: We'll build a comprehensive exploration function together")

# Students will learn:
# - Basic DataFrame operations (count, printSchema)
# - Date range analysis with min/max
# - Groupby operations for weekday analysis
# - Distinct counts for unique customers
# - Statistical functions and percentiles
# - Top N analysis with orderBy

print("✅ Ready for live coding session!")

In [None]:
# 🧑‍🎓 STUDENT EXERCISE: Basic data exploration
print("📊 EXERCISE: Complete the basic dataset analysis")

# TODO: Students complete these basic operations
print("✏️ YOUR TASKS:")
print("1. Show total transaction count")
print("2. Show unique customer count")
print("3. Create spending analysis by merchant category")
print("4. Order results by total spending")

print("\n🎯 EXERCISE OBJECTIVES:")
print("• Practice DataFrame operations")
print("• Learn aggregation functions (sum, count)")
print("• Use groupBy and orderBy")
print("• Work with column aliases")

# Hint: Use df.count(), df.select().distinct().count()
# Hint: Use df.groupBy().agg(sum(), count()).orderBy()

print("✅ Ready for your solution!")
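
One possible shape for the four tasks above, written as Spark SQL against the `transactions` temp view instead of the DataFrame methods named in the hints (the `groupBy().agg()` route gives the same numbers). This is a sketch, not the workshop's reference solution. The names `OVERVIEW_SQL`, `CATEGORY_SPEND_SQL`, and `explore` are hypothetical, and the null count is an added data-quality check that assumes unparseable amounts became NULL during the cast.

```python
# Row count, distinct customers, and NULL amounts in a single pass.
OVERVIEW_SQL = """
SELECT
    COUNT(*)                                            AS total_transactions,
    COUNT(DISTINCT customer_id)                         AS unique_customers,
    SUM(CASE WHEN amount_usd IS NULL THEN 1 ELSE 0 END) AS null_amounts
FROM transactions
"""

# Spending per merchant category, largest total first.
CATEGORY_SPEND_SQL = """
SELECT
    merchant_category,
    COUNT(*)        AS transaction_count,
    SUM(amount_usd) AS total_spent
FROM transactions
GROUP BY merchant_category
ORDER BY total_spent DESC
"""


def explore(spark):
    """Run both queries against the `transactions` temp view."""
    spark.sql(OVERVIEW_SQL).show()
    spark.sql(CATEGORY_SPEND_SQL).show()


# Usage in the notebook: explore(spark)
```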

## 3. Spark Session Recap 🚀
Spark is already initialized. We'll keep this short and move to SQL analytics.

- The basic local session created in Section 1 is still active
- Temp view `transactions` is ready
- Proceed to analytics at scale

In [None]:
# (Optional) Spark utilities
from pyspark.sql import functions as F
from pyspark.sql.types import *

print("ℹ️ Spark utilities available. Session already created above.")

In [None]:
# Simple Spark DataFrame operations
print("🔧 LIVE CODING: Basic Spark operations")

# Show basic dataset statistics
print("💰 Amount statistics:")
df.select("amount_usd").describe().show()

print("✏️ EXERCISE: Weekend vs Weekday Analysis")
print("TODO: Write SQL query to compare weekend vs weekday spending")
print("Hint: Use is_weekend column, COUNT(*), SUM(amount_usd)")
print("Hint: GROUP BY is_weekend")

# Template:
# spark.sql("""
# SELECT 
#     is_weekend,
#     -- Add your aggregations here
# FROM transactions 
# -- Add your GROUP BY here
# """).show()

print("✅ Ready for your SQL solution!")
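
A filled-in version of the template could look like the sketch below. `WEEKEND_SQL` and `weekend_vs_weekday` are hypothetical names, and the average column is an addition beyond the hints.

```python
# Compare weekend and weekday activity on the `transactions` temp view.
WEEKEND_SQL = """
SELECT
    is_weekend,
    COUNT(*)        AS transaction_count,
    SUM(amount_usd) AS total_spent,
    AVG(amount_usd) AS avg_amount
FROM transactions
GROUP BY is_weekend
ORDER BY is_weekend
"""


def weekend_vs_weekday(spark):
    """Return one row per is_weekend value."""
    return spark.sql(WEEKEND_SQL)


# Usage in the notebook: weekend_vs_weekday(spark).show()
```

Since a week has two weekend days and five weekdays, the totals are not directly comparable. The average per transaction is usually the fairer comparison.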

## 4. Advanced Spark SQL Analytics 🔍
**Goal:** Complex banking analytics using SQL on big data

### 🏦 Real Banking Use Cases:
- **Fraud Detection:** Unusual spending patterns
- **Customer Segmentation:** Spending behavior analysis
- **Risk Assessment:** Transaction pattern analysis

In [None]:
# 🧑‍🏫 INSTRUCTOR: Banking analytics (LIVE CODING)
print("🔍 LIVE CODING: Banking Analytics Session")

print("🎯 We'll build together:")
print("1. Top customers by spending")
print("2. Top merchants by transaction volume")
print("3. Revenue analysis by merchant")

print("💡 Students will learn:")
print("• Complex SQL with aggregate functions")
print("• Window functions and ranking")
print("• Business KPI calculations")
print("• ORDER BY with LIMIT for top-N queries")

# TODO: Live code the customer analysis
print("\n👑 COMING UP: Top customers analysis")

# TODO: Live code the merchant analysis
print("🏪 COMING UP: Merchant performance analysis")

print("✅ Ready for live banking analytics session!")
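
As a rough preview of the session above, the two top-N analyses might be sketched as follows. The query names are hypothetical, and the choice of `DENSE_RANK` (ties share a rank) is an assumption about what the live coding will use.

```python
# Top 10 customers by total spend, ranked with a window function.
# The window runs over the already aggregated rows (one per customer).
TOP_CUSTOMERS_SQL = """
SELECT
    customer_id,
    total_spent,
    transaction_count,
    DENSE_RANK() OVER (ORDER BY total_spent DESC) AS spend_rank
FROM (
    SELECT
        customer_id,
        SUM(amount_usd) AS total_spent,
        COUNT(*)        AS transaction_count
    FROM transactions
    GROUP BY customer_id
) AS customer_totals
ORDER BY spend_rank
LIMIT 10
"""

# Top 10 merchants by transaction volume, with revenue alongside.
TOP_MERCHANTS_SQL = """
SELECT
    merchant_id,
    COUNT(*)        AS transaction_count,
    SUM(amount_usd) AS revenue
FROM transactions
GROUP BY merchant_id
ORDER BY transaction_count DESC
LIMIT 10
"""

# Usage in the notebook:
# spark.sql(TOP_CUSTOMERS_SQL).show()
# spark.sql(TOP_MERCHANTS_SQL).show()
```

A window without `PARTITION BY` makes Spark collect its input into a single partition. That is acceptable here because the subquery has already reduced the data to one row per customer.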

In [None]:
# 🧑‍🎓 STUDENT EXERCISE: Fraud Detection
print("🚨 EXERCISE: Build a Basic Fraud Detection System")

print("✏️ YOUR CHALLENGES:")
print("1. Find transactions above $200 (potential high-value fraud)")
print("2. ADVANCED: Find customers with unusual spending patterns")

print("\n🎯 BASIC EXERCISE (Start here):")
print("Write SQL to find high-amount transactions")
print("• SELECT customer_id, amount_usd, merchant_id")
print("• WHERE amount_usd > 200")
print("• ORDER BY amount_usd DESC")
print("• LIMIT 5")

# TODO: Write your SQL query here
basic_fraud_query = """
-- Write your high-amount transaction query here
"""

print("\n🎯 ADVANCED EXERCISE (Optional):")
print("Statistical fraud detection using window functions")
print("• Calculate customer average and standard deviation")
print("• Find transactions > 3 standard deviations from average")
print("• Classify as HIGH_RISK, MEDIUM_RISK, NORMAL")

# TODO: Advanced students can attempt window functions
advanced_fraud_query = """
-- Advanced: Use window functions for statistical analysis
-- WITH customer_stats AS (...)
"""

print("✅ Ready to detect fraud!")
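
For orientation, here is one way both exercises could be approached. The basic query follows the bullet points literally. The advanced query is a sketch that uses a `customer_stats` CTE joined back to the transactions instead of a window function, and its 2-standard-deviation cutoff for `MEDIUM_RISK` is an assumption, since the exercise only fixes the 3-standard-deviation rule. The names `BASIC_FRAUD_SQL` and `ADVANCED_FRAUD_SQL` are hypothetical.

```python
# Basic: the five largest transactions above $200.
BASIC_FRAUD_SQL = """
SELECT customer_id, amount_usd, merchant_id
FROM transactions
WHERE amount_usd > 200
ORDER BY amount_usd DESC
LIMIT 5
"""

# Advanced: flag transactions far from the customer's own average.
# STDDEV is the sample standard deviation in Spark SQL.
ADVANCED_FRAUD_SQL = """
WITH customer_stats AS (
    SELECT
        customer_id,
        AVG(amount_usd)    AS avg_amount,
        STDDEV(amount_usd) AS std_amount
    FROM transactions
    GROUP BY customer_id
)
SELECT
    t.customer_id,
    t.amount_usd,
    (t.amount_usd - s.avg_amount) / s.std_amount AS z_score,
    CASE
        WHEN ABS(t.amount_usd - s.avg_amount) > 3 * s.std_amount THEN 'HIGH_RISK'
        WHEN ABS(t.amount_usd - s.avg_amount) > 2 * s.std_amount THEN 'MEDIUM_RISK'
        ELSE 'NORMAL'
    END AS risk_level
FROM transactions AS t
JOIN customer_stats AS s ON t.customer_id = s.customer_id
WHERE s.std_amount > 0
ORDER BY ABS(t.amount_usd - s.avg_amount) / s.std_amount DESC
"""

# Usage in the notebook:
# spark.sql(BASIC_FRAUD_SQL).show()
# spark.sql(ADVANCED_FRAUD_SQL).show(20)
```

The `WHERE s.std_amount > 0` filter is meant to skip customers whose standard deviation is zero or undefined, for example those with a single transaction.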


## 5. Simple Cloud Deployment 🌥️
**Goal:** Basic overview of deploying to cloud

### 🌟 Why Cloud?
- **Scale:** Handle big datasets
- **Storage:** Secure data storage  
- **Compute:** More processing power

In [None]:
# 🧑‍🏫 INSTRUCTOR: Simple cloud overview (demo only)
print("☁️ CLOUD DEPLOYMENT BASICS")
print("Key concepts:")
print("1. Upload data to cloud storage")
print("2. Create compute cluster")
print("3. Run Spark jobs")
print("4. Store results")

print("\n✅ Cloud overview complete!")

In [None]:
# ✅ COMPLETE SOLUTION: Customize GCP deployment
import json

print("🎯 COMPLETE: Customize the deployment configuration!")
print("=" * 50)

# Solution 1: Update configuration for your environment
print("⚙️ CUSTOMIZE YOUR CONFIGURATION:")
my_gcp_config = {
    "project_id": "banking-analytics-demo-2025",  # Example project ID
    "bucket_name": "banking-data-workshop-eu",   # Example bucket name
    "region": "europe-west3",                    # Frankfurt region for GDPR compliance
    "dataset_location": "EU",                    # European Union for data residency
    "service_account_email": "banking-analytics@banking-analytics-demo-2025.iam.gserviceaccount.com",
    "vpc_network": "banking-vpc",                # Custom VPC for security
    "subnet": "banking-subnet-eu-west3"          # Specific subnet
}

print("📝 Your GCP Config:")
print(json.dumps(my_gcp_config, indent=2))

# Solution 2: Create a deployment checklist and print it
print("\n✅ DEPLOYMENT CHECKLIST:")
deployment_checklist = [
    "GCP project created and billing enabled",
    "Service account created with necessary permissions",
    "Cloud Storage bucket created in EU region",
    "Databricks workspace provisioned"
]
for step_number, item in enumerate(deployment_checklist, start=1):
    print(f"{step_number}. {item}")

print("\n✅ Deployment configuration complete!")
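
Before using a configuration like the one above, a small sanity check can catch missing keys or a region that contradicts the EU residency goal. This is a minimal sketch in plain Python. `REQUIRED_KEYS`, `validate_gcp_config`, and `gcs_uri` are hypothetical helpers, and the rule that an `EU` dataset location needs a `europe-*` region is an illustrative simplification, not an official GCP check.

```python
REQUIRED_KEYS = ("project_id", "bucket_name", "region", "dataset_location")


def validate_gcp_config(config):
    """Return a list of problems; an empty list means the checks passed."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if not config.get(key)]
    region = config.get("region", "")
    # Illustrative residency rule: EU datasets should sit in a europe-* region.
    if config.get("dataset_location") == "EU" and not region.startswith("europe-"):
        problems.append(f"region {region!r} is outside Europe")
    return problems


def gcs_uri(config, object_name):
    """Build the gs:// path a Spark job would read from."""
    return f"gs://{config['bucket_name']}/{object_name}"


example_config = {
    "project_id": "banking-analytics-demo-2025",
    "bucket_name": "banking-data-workshop-eu",
    "region": "europe-west3",
    "dataset_location": "EU",
}
print(validate_gcp_config(example_config))               # []
print(gcs_uri(example_config, "transactions_data.csv"))  # gs://banking-data-workshop-eu/transactions_data.csv
```

Reading the resulting `gs://` path with `spark.read` assumes the cluster has the Cloud Storage connector available.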

## 6. Simple Data Integration 🔗
**Goal:** Combine banking data with external sources

### 💡 Real Banking Examples:
- **Exchange Rates:** Convert international transactions
- **Interest Rates:** Economic impact on spending
- **Merchant Data:** Enhanced merchant information

In [None]:
# 🧑‍🏫 INSTRUCTOR: API Integration (LIVE CODING)
import requests  # HTTP client for the REST calls in this session

print("🔗 LIVE CODING: External Data Integration")

print("🎯 We'll demonstrate:")
print("1. Making API calls to financial services")
print("2. Handling JSON responses")
print("3. Error handling and fallbacks")
print("4. Integrating live data into Spark DataFrames")

print("📊 COMING UP: Live exchange rate API integration")
print("💡 Key concepts:")
print("• REST API calls with requests library")
print("• JSON data parsing")
print("• Try/except error handling")
print("• DataFrame transformations with live data")

# TODO: Live code the API integration
api_url = "https://api.exchangeratesapi.io/v1/latest?access_key=24da234d4ded987472b5ece3b4981c9b&format=1"

print("✅ Ready for live API integration session!")
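
The four concepts listed above (API call, JSON parsing, error handling, fallback) can be sketched as below. To stay dependency-free, the sketch uses `urllib` from the standard library in place of `requests`. All function names and the fallback value are hypothetical, and the payload shape (`base` set to `EUR` with a `rates` mapping) is an assumption about the exchange-rate service's response.

```python
import json
import urllib.request

FALLBACK_USD_TO_EUR = 0.92  # placeholder for demos only, not a real quote


def fetch_json(url, timeout=10):
    """GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def parse_usd_to_eur(payload):
    """Assumes a EUR-based payload shaped like {"base": "EUR", "rates": {"USD": x}}."""
    if payload.get("base", "EUR") != "EUR":
        raise ValueError("expected EUR as base currency")
    # rates["USD"] is USD per 1 EUR, so the USD->EUR factor is its inverse.
    return 1.0 / float(payload["rates"]["USD"])


def usd_to_eur_rate(url, fetch=fetch_json, fallback=FALLBACK_USD_TO_EUR):
    """Return the live rate, or the fallback if the call or the parsing fails."""
    try:
        return parse_usd_to_eur(fetch(url))
    except (OSError, ValueError, KeyError, TypeError, ZeroDivisionError) as error:
        print(f"Rate lookup failed ({error!r}); using fallback {fallback}")
        return fallback


# Usage in the notebook (assumes `api_url`, `df`, and `col` from earlier cells):
# rate = usd_to_eur_rate(api_url)
# df = df.withColumn("amount_eur", col("amount_usd") * rate)
```

Passing `fetch` as a parameter keeps the parsing and fallback logic testable without network access.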

In [None]:
# 🧑‍🎓 FINAL EXERCISE: Integration Analysis
print("📊 FINAL CHALLENGE: Live Exchange Rate Analysis")

print("🎯 CAPSTONE EXERCISE:")
print("Combine everything you've learned to analyze currency impact!")

print("\n✏️ YOUR MISSION:")
print("1. Use the live exchange rate data we just integrated")
print("2. Compare USD vs EUR spending by merchant category")
print("3. Calculate currency impact percentages")
print("4. Generate business insights")

print("\n💡 REQUIRED QUERIES:")
print("Query 1: Basic comparison")
print("• SELECT merchant_category, SUM(amount_usd), SUM(amount_eur)")
print("• Calculate totals and averages for both currencies")
print("• GROUP BY merchant_category")

print("\nQuery 2: Advanced analysis")
print("• Calculate (amount_usd - amount_eur) differences")
print("• Compute percentage impact: (usd-eur)/eur * 100")
print("• Order by currency impact")

print("\n🎯 LEARNING OBJECTIVES:")
print("• Integration of live external data")
print("• Multi-currency financial analysis")
print("• Business insight generation")
print("• Advanced SQL calculations")

# TODO: Students write their analysis queries
live_rate_analysis = """
-- Write your currency comparison query here
-- Include: merchant_category, totals, averages, differences
"""

currency_impact_query = """
-- Write your advanced impact analysis here
-- Calculate percentage impacts and ranking
"""

print("✅ Ready for your final analysis challenge!")
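
A possible starting point for both queries is sketched below. Rather than assuming an `amount_eur` column already exists on the `transactions` view, it converts inline with a USD-to-EUR factor passed in as a parameter. The function name `currency_comparison_sql` is hypothetical.

```python
def currency_comparison_sql(usd_to_eur):
    """Build the per-category USD vs EUR comparison for a given conversion factor."""
    rate = float(usd_to_eur)  # also rejects non-numeric input before it reaches SQL
    return f"""
SELECT
    merchant_category,
    SUM(amount_usd)          AS total_usd,
    SUM(amount_usd * {rate}) AS total_eur,
    AVG(amount_usd)          AS avg_usd,
    AVG(amount_usd * {rate}) AS avg_eur,
    (SUM(amount_usd) - SUM(amount_usd * {rate}))
        / SUM(amount_usd * {rate}) * 100 AS impact_pct
FROM transactions
GROUP BY merchant_category
ORDER BY total_usd DESC
"""


# Usage in the notebook (the rate value here is only an example):
# spark.sql(currency_comparison_sql(0.92)).show()
```

Because one factor is applied to every row, `impact_pct` comes out the same for every category, namely `(1 / rate - 1) * 100`. It would only differ between categories if rates were joined per transaction date or per currency.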

## 🎯 Workshop Summary

You've learned the essentials of big data analytics:

1. **PySpark Basics:** Loading and processing data
2. **SQL Analysis:** Simple aggregations and insights
3. **Banking Analytics:** Basic fraud detection
4. **Cloud Concepts:** Deployment overview
5. **Data Integration:** Combining transactions with live exchange rate data