# Day 2: Advanced Analytics & Machine Learning

**Workshop Schedule:**
- 13:00-13:45: Regression Analysis
- 13:55-14:40: Unstructured Data Analytics  
- 14:50-15:40: Data Visualization & Pandas Deep-Dive

## Dataset Overview

We'll be working with a large transactions dataset containing:
- **13M+ transaction records**
- **Fields:** id, date, client_id, card_id, amount, use_chip, merchant_id, merchant_city, merchant_state, zip, mcc, errors
- **Time Range:** Banking transactions from 2010 onward
- **Use Cases:** Fraud detection, customer analytics, risk scoring

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import mean_squared_error, r2_score, classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler, LabelEncoder
import warnings
warnings.filterwarnings('ignore')

plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

In [None]:
# Load the dataset (sample for performance)
# In production, you'd use Spark for the full 13M records
df = pd.read_csv('../data/transactions_data.csv', nrows=100000)
print(f"Dataset shape: {df.shape}")
print(f"\nColumns: {list(df.columns)}")
df.head()
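
If Spark is not at hand, pandas can also stream a large CSV in pieces. The following is a minimal sketch, not part of the workshop pipeline: the in-memory CSV and its rows are made up, and only `id`, `client_id`, and `amount` are borrowed from the field list above.

```python
import io
import pandas as pd

# Made-up rows standing in for transactions_data.csv
csv_text = (
    "id,client_id,amount\n"
    "1,10,$5.00\n"
    "2,10,$7.50\n"
    "3,11,$-2.00\n"
    "4,12,$1.25\n"
)

total_rows = 0
partial_sums = []
# chunksize makes read_csv yield DataFrames of at most that many rows
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    amounts = chunk["amount"].str.replace("$", "", regex=False).astype(float)
    total_rows += len(chunk)
    partial_sums.append(amounts.groupby(chunk["client_id"]).sum())

# Combine per-chunk results; a client may appear in several chunks
spend_by_client = pd.concat(partial_sums).groupby(level=0).sum()
print(total_rows)
print(spend_by_client)
```

With the real file, the same loop would take the file path instead of `io.StringIO(csv_text)` and a much larger `chunksize`.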

---

# 📊 13:00-13:45: Regression Analysis

## 🎯 Learning Objectives:
- Linear Regression for Customer Lifetime Value
- Logistic Regression for Fraud Detection
- Credit Risk Scoring
- Model Evaluation Metrics

## 1. Data Preparation for Regression

In [None]:
# Data preprocessing
df['date'] = pd.to_datetime(df['date'])
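# regex=False treats '$' as a literal character; as a regex it is an end-of-string anchor
df['amount_numeric'] = df['amount'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float)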
df['is_online'] = (df['merchant_city'] == 'ONLINE').astype(int)
df['is_weekend'] = df['date'].dt.weekday.isin([5, 6]).astype(int)
df['hour'] = df['date'].dt.hour

print("Preprocessing completed!")
df[['amount', 'amount_numeric', 'is_online', 'is_weekend', 'hour']].head()
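
One pitfall when stripping the currency symbol: in a regular expression, `$` is an end-of-string anchor, so on pandas versions where `str.replace` treats the pattern as a regex by default, the symbol is left in place and the float conversion fails. A minimal sketch on made-up values (the `raw` Series is hypothetical) that passes `regex=False` explicitly:

```python
import pandas as pd

# Hypothetical amounts in the same "$1,234.56" style as the `amount` column
raw = pd.Series(["$1,234.56", "$-77.00", "$14.57"])

# Treat "$" and "," as literal characters, not regex patterns
cleaned = (
    raw.str.replace("$", "", regex=False)
       .str.replace(",", "", regex=False)
       .astype(float)
)
print(cleaned.tolist())
```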

## 2. Customer Lifetime Value Prediction (Linear Regression)

### 📝 **EXERCISE 1: Customer Analytics**

Create features for Customer Lifetime Value prediction:

In [None]:
# Customer aggregation for CLV prediction
customer_features = df.groupby('client_id').agg({
    'amount_numeric': ['sum', 'mean', 'count', 'std'],
    'is_online': 'mean',
    'is_weekend': 'mean',
    'merchant_id': 'nunique',
    'date': ['min', 'max']
}).round(2)

customer_features.columns = ['total_spend', 'avg_transaction', 'transaction_count', 'spend_volatility',
                           'online_ratio', 'weekend_ratio', 'merchant_diversity', 'first_transaction', 'last_transaction']

customer_features['days_active'] = (customer_features['last_transaction'] - customer_features['first_transaction']).dt.days
customer_features['spend_per_day'] = customer_features['total_spend'] / (customer_features['days_active'] + 1)

# Remove customers with insufficient data
customer_features = customer_features[customer_features['transaction_count'] >= 5]

print(f"Customer features shape: {customer_features.shape}")
customer_features.head()

### 🔍 **YOUR TASK:** 
Complete the linear regression model to predict customer lifetime value:

In [None]:
# TODO: Complete the linear regression implementation

# 1. Define features (X) and target (y)
# Features: avg_transaction, transaction_count, merchant_diversity, online_ratio, weekend_ratio
# Target: total_spend (as proxy for CLV)

X = customer_features[['avg_transaction', 'transaction_count', 'merchant_diversity', 'online_ratio', 'weekend_ratio']]
y = customer_features['total_spend']

# 2. Split data into train/test
# YOUR CODE HERE

# 3. Create and train the model
# YOUR CODE HERE

# 4. Make predictions
# YOUR CODE HERE

print("Linear Regression Model trained!")
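
One possible way to fill in steps 2 to 4 is sketched below. The helper `fit_clv_model` and the toy data are hypothetical; only the returned names (`lr_model`, `y_test`, `y_pred`) match what the evaluation cell expects. The 80/20 split and the fixed seed are assumptions, not workshop requirements.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def fit_clv_model(X, y, test_size=0.2, random_state=42):
    """Split the data, fit a linear model, and predict on the held-out part."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    lr_model = LinearRegression()
    lr_model.fit(X_train, y_train)
    y_pred = lr_model.predict(X_test)
    return lr_model, X_test, y_test, y_pred

# Toy stand-in for customer_features (synthetic numbers, illustration only)
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    "avg_transaction": rng.uniform(10, 200, 200),
    "transaction_count": rng.integers(5, 100, 200),
})
toy_y = toy_X["avg_transaction"] * toy_X["transaction_count"]

toy_model, toy_X_test, toy_y_test, toy_y_pred = fit_clv_model(toy_X, toy_y)
print(f"Toy R²: {toy_model.score(toy_X_test, toy_y_test):.3f}")
```

In the notebook this would be called as `lr_model, X_test, y_test, y_pred = fit_clv_model(X, y)`. Keep in mind that `total_spend` is the product of `avg_transaction` and `transaction_count`, so a high R² partly reflects that built-in relationship rather than predictive skill.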

In [None]:
# Model evaluation
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

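print("Model Performance:")
# MSE is in squared dollars; RMSE is back on the dollar scale
print(f"MSE: {mse:,.2f} (squared dollars)")
print(f"RMSE: ${np.sqrt(mse):,.2f}")
print(f"R² Score: {r2:.3f}")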

# Feature importance
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'coefficient': lr_model.coef_
}).sort_values('coefficient', key=abs, ascending=False)

print(f"\nFeature Importance:")
print(feature_importance)
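
Raw coefficients depend on each feature's unit, so ranking them by magnitude can mislead when features live on different scales. The sketch below uses made-up data and hypothetical variable names to show how standardizing the features first can change the ranking; it illustrates the idea and is not the workshop's prescribed method.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical features on very different scales
toy_X = pd.DataFrame({
    "online_ratio": rng.uniform(0, 1, 500),          # values between 0 and 1
    "transaction_count": rng.integers(5, 500, 500),  # values up to several hundred
})
toy_y = 100 * toy_X["online_ratio"] + 2 * toy_X["transaction_count"] + rng.normal(0, 5, 500)

# Coefficients per original unit
raw_coef = LinearRegression().fit(toy_X, toy_y).coef_
# Coefficients per standard deviation of each feature
std_coef = LinearRegression().fit(StandardScaler().fit_transform(toy_X), toy_y).coef_

print(pd.DataFrame({"feature": toy_X.columns, "raw": raw_coef, "standardized": std_coef}))
```

Here the raw coefficient favors `online_ratio`, while the standardized one favors `transaction_count`, because the count varies over a far wider range.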

In [None]:
# Visualization
plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.scatter(y_test, y_pred, alpha=0.6)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
plt.xlabel('Actual CLV')
plt.ylabel('Predicted CLV')
plt.title(f'CLV Prediction (R² = {r2:.3f})')

plt.subplot(1, 2, 2)
feature_importance.plot(x='feature', y='coefficient', kind='barh', ax=plt.gca())
plt.title('Feature Importance')
plt.xlabel('Coefficient Value')

plt.tight_layout()
plt.show()

## 3. Fraud Detection (Logistic Regression)

### 📝 **EXERCISE 2: Binary Classification**

Create a fraud detection model using suspicious transaction patterns:

In [None]:
# Create fraud indicators (synthetic for demonstration)
np.random.seed(42)

df['unusual_amount'] = (df['amount_numeric'] > df['amount_numeric'].quantile(0.95)).astype(int)
df['night_transaction'] = ((df['hour'] < 6) | (df['hour'] > 22)).astype(int)
df['round_amount'] = (df['amount_numeric'] % 10 == 0).astype(int)

# Synthetic fraud labels based on risk factors
fraud_probability = (
    df['unusual_amount'] * 0.3 + 
    df['night_transaction'] * 0.2 + 
    df['is_online'] * 0.1 + 
    df['round_amount'] * 0.1
)

df['is_fraud'] = (np.random.random(len(df)) < fraud_probability * 0.1).astype(int)

print(f"Fraud rate: {df['is_fraud'].mean():.1%}")
print(f"Total fraud cases: {df['is_fraud'].sum()}")

### 🔍 **YOUR TASK:** 
Implement logistic regression for fraud detection:

In [None]:
# TODO: Complete the logistic regression implementation

# 1. Prepare features for fraud detection
fraud_features = ['amount_numeric', 'is_online', 'is_weekend', 'hour', 'unusual_amount', 'night_transaction', 'round_amount']

X_fraud = df[fraud_features]
y_fraud = df['is_fraud']

# 2. Split the data
# YOUR CODE HERE

# 3. Scale the features
# YOUR CODE HERE

# 4. Create and train logistic regression model
# YOUR CODE HERE

# 5. Make predictions
# YOUR CODE HERE

print("Fraud Detection Model trained!")
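
One possible way to fill in steps 2 to 5 is sketched below. The helper `fit_fraud_model` and the toy data are hypothetical; the returned objects correspond to the names the evaluation cells use (`log_model`, `y_test_fraud`, `y_pred_fraud`, `y_pred_proba`). Stratified splitting and `max_iter=1000` are assumptions chosen for a rare positive class.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def fit_fraud_model(X, y, test_size=0.2, random_state=42):
    """Split, scale, fit a logistic regression, and predict on the held-out part."""
    # Stratify so the rare positive class appears in both splits
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y
    )
    # Fit the scaler on training data only to avoid leaking test statistics
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    log_model = LogisticRegression(max_iter=1000)
    log_model.fit(X_train_scaled, y_train)

    y_pred = log_model.predict(X_test_scaled)
    y_pred_proba = log_model.predict_proba(X_test_scaled)[:, 1]
    return log_model, y_test, y_pred, y_pred_proba

# Toy stand-in data (synthetic, illustration only)
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    "amount_numeric": rng.exponential(50, 1000),
    "night_transaction": rng.integers(0, 2, 1000),
})
toy_p = 0.05 + 0.2 * toy_X["night_transaction"].to_numpy()
toy_y = pd.Series((rng.random(1000) < toy_p).astype(int), name="is_fraud")

toy_model, toy_y_test, toy_pred, toy_proba = fit_fraud_model(toy_X, toy_y)
print(f"Predicted positives in toy test set: {toy_pred.sum()}")
```

In the notebook this would be called as `log_model, y_test_fraud, y_pred_fraud, y_pred_proba = fit_fraud_model(X_fraud, y_fraud)`. Because the synthetic fraud rate is low, the default 0.5 threshold may flag few or no transactions; passing `class_weight='balanced'` to `LogisticRegression` is one option worth trying.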

In [None]:
# Model evaluation with detailed metrics
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score, roc_curve

precision = precision_score(y_test_fraud, y_pred_fraud)
recall = recall_score(y_test_fraud, y_pred_fraud)
f1 = f1_score(y_test_fraud, y_pred_fraud)
auc = roc_auc_score(y_test_fraud, y_pred_proba)

print("Fraud Detection Performance:")
print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
print(f"F1-Score: {f1:.3f}")
print(f"AUC-ROC: {auc:.3f}")

print("\nClassification Report:")
print(classification_report(y_test_fraud, y_pred_fraud))
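
With a rare positive class, precision and recall depend heavily on the decision threshold. The sketch below uses made-up labels and scores (all names are hypothetical) to show how lowering the threshold trades precision for recall; in the notebook, `y_test_fraud` and `y_pred_proba` would take their place.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)
# Hypothetical labels and scores: about 2% positives, which tend to score higher
y_true = (rng.random(5000) < 0.02).astype(int)
scores = np.clip(rng.normal(0.05 + 0.25 * y_true, 0.1), 0, 1)

results = {}
for threshold in (0.5, 0.3, 0.1):
    y_hat = (scores >= threshold).astype(int)
    # zero_division=0 avoids a warning when nothing is predicted positive
    p = precision_score(y_true, y_hat, zero_division=0)
    r = recall_score(y_true, y_hat, zero_division=0)
    results[threshold] = (p, r)
    print(f"threshold={threshold:.1f}  precision={p:.3f}  recall={r:.3f}")
```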

In [None]:
# Visualizations
plt.figure(figsize=(15, 4))

# Confusion Matrix
plt.subplot(1, 3, 1)
cm = confusion_matrix(y_test_fraud, y_pred_fraud)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.ylabel('Actual')
plt.xlabel('Predicted')

# ROC Curve
plt.subplot(1, 3, 2)
fpr, tpr, _ = roc_curve(y_test_fraud, y_pred_proba)
plt.plot(fpr, tpr, label=f'ROC Curve (AUC = {auc:.3f})')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()

# Feature Importance
plt.subplot(1, 3, 3)
feature_importance_fraud = pd.DataFrame({
    'feature': fraud_features,
    'importance': abs(log_model.coef_[0])
}).sort_values('importance', ascending=True)

plt.barh(feature_importance_fraud['feature'], feature_importance_fraud['importance'])
plt.title('Feature Importance (Fraud Detection)')
plt.xlabel('Absolute Coefficient')

plt.tight_layout()
plt.show()

---

# 📄 13:55-14:40: Unstructured Data Analytics

## 🎯 Learning Objectives:
- Text Data Processing for Banking
- Transaction Description Mining
- Sentiment Analysis
- Text-based Fraud Indicators

In [None]:
# Text processing libraries
import re
from collections import Counter
try:
    from wordcloud import WordCloud
except ImportError:
    print("WordCloud not available - install with: pip install wordcloud")

try:
    import nltk
    # Download NLTK data (run once)
    try:
        nltk.data.find('tokenizers/punkt')
    except LookupError:
        nltk.download('punkt')
        nltk.download('stopwords')
        nltk.download('vader_lexicon')

    from nltk.corpus import stopwords
    from nltk.sentiment import SentimentIntensityAnalyzer
except ImportError:
    print("NLTK not available - install with: pip install nltk")

## 1. Synthetic Transaction Descriptions

Create realistic transaction descriptions for text analysis:

In [None]:
# Generate synthetic transaction descriptions
np.random.seed(42)

merchant_types = {
    5411: ['GROCERY STORE', 'SUPERMARKET', 'FOOD MART'],
    5812: ['RESTAURANT', 'FAST FOOD', 'CAFE', 'DINER'],
    4121: ['TAXI SERVICE', 'UBER', 'LYFT', 'CAB COMPANY'],
    5541: ['GAS STATION', 'FUEL STOP', 'PETROL'],
    5942: ['BOOKSTORE', 'LIBRARY', 'READING CORNER'],
    5499: ['CONVENIENCE STORE', 'CORNER SHOP', '24/7 MART'],
    7801: ['ONLINE PAYMENT', 'DIGITAL TRANSACTION', 'WEB PURCHASE'],
    4784: ['ATM WITHDRAWAL', 'CASH ADVANCE', 'ATM TRANSACTION']
}

def generate_description(row):
    mcc = row.get('mcc', 5499)
    amount = row['amount_numeric']
    
    if mcc in merchant_types:
        base_desc = np.random.choice(merchant_types[mcc])
    else:
        base_desc = "MERCHANT TRANSACTION"
    
    # Add suspicious patterns for some transactions
    if np.random.random() < 0.05:  # 5% suspicious
        suspicious_words = ['URGENT', 'IMMEDIATE', 'FINAL NOTICE', 'VERIFY ACCOUNT', 'SECURITY ALERT']
        base_desc += ' ' + np.random.choice(suspicious_words)
    
    return f"{base_desc} ${amount:.2f}"

# Generate descriptions for sample data
sample_df = df.sample(10000, random_state=42).copy()
sample_df['description'] = sample_df.apply(generate_description, axis=1)

print("Sample transaction descriptions:")
print(sample_df[['amount', 'description', 'is_fraud']].head(10))

## 2. Text Data Processing

### 📝 **EXERCISE 4: Transaction Description Analysis**

In [None]:
# Text preprocessing function
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove special characters and digits
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Remove extra whitespace
    text = ' '.join(text.split())
    return text

sample_df['description_clean'] = sample_df['description'].apply(preprocess_text)

print("Text preprocessing examples:")
print(sample_df[['description', 'description_clean']].head())

### 🔍 **YOUR TASK:** 
Implement text mining for transaction categories:

In [None]:
# TODO: Complete text mining analysis

# 1. Extract most common words
# Define stop words (use simple list if NLTK not available)
try:
    stop_words = set(stopwords.words('english'))
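except (NameError, LookupError):
    # NameError: the NLTK import failed; LookupError: the stopwords corpus is not downloaded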
    stop_words = {'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for', 'of', 'with', 'by', 'is', 'are', 'was', 'were'}

all_words = []
for desc in sample_df['description_clean']:
    words = desc.split()
    # YOUR CODE HERE: Filter out stop words and short words
    filtered_words = [word for word in words if word not in stop_words and len(word) > 2]
    all_words.extend(filtered_words)

# 2. Count word frequencies
# YOUR CODE HERE
word_freq = Counter(all_words)
top_words = word_freq.most_common(20)

print("Top 20 words in transaction descriptions:")
for word, count in top_words:
    print(f"{word}: {count}")
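
Word counts can also be turned into a simple text-based fraud indicator. The sketch below flags descriptions containing any of the suspicious phrases that the description generator appends; the `toy_desc` Series and the column name `has_suspicious_phrase` are hypothetical.

```python
import re
import pandas as pd

# Same phrases the description generator appends to "suspicious" rows
SUSPICIOUS_PHRASES = ["URGENT", "IMMEDIATE", "FINAL NOTICE", "VERIFY ACCOUNT", "SECURITY ALERT"]
pattern = "|".join(re.escape(phrase) for phrase in SUSPICIOUS_PHRASES)

# Hypothetical descriptions in the generated format
toy_desc = pd.Series([
    "GROCERY STORE $45.10",
    "ONLINE PAYMENT VERIFY ACCOUNT $900.00",
    "cafe urgent $3.50",
])

# case=False makes the match work on both raw and lowercased descriptions
has_suspicious_phrase = toy_desc.str.contains(pattern, case=False, regex=True).astype(int)
print(has_suspicious_phrase.tolist())
```

Applied to `sample_df['description']`, the flag can be compared against `is_fraud`. In this synthetic setup the suspicious phrases are assigned at random, independently of the fraud labels, so no real association should be expected.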

In [None]:
# Word cloud visualization (if available)
try:
    wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(word_freq)

    plt.figure(figsize=(12, 6))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.title('Transaction Description Word Cloud')
    plt.show()
except Exception:  # e.g. NameError if the wordcloud import failed
    # Alternative visualization with bar chart
    plt.figure(figsize=(12, 6))
    words, counts = zip(*top_words[:15])
    plt.barh(words, counts)
    plt.title('Top 15 Words in Transaction Descriptions')
    plt.xlabel('Frequency')
    plt.gca().invert_yaxis()
    plt.show()

## 3. Sentiment Analysis for Banking Communications

### 📝 **EXERCISE 5: Sentiment-based Risk Assessment**

In [None]:
# Simple sentiment analysis (if NLTK VADER not available)
def simple_sentiment_score(text):
    positive_words = ['good', 'great', 'excellent', 'amazing', 'wonderful', 'fantastic', 'love', 'like']
    negative_words = ['bad', 'terrible', 'awful', 'hate', 'dislike', 'horrible', 'urgent', 'alert', 'warning', 'security']
    
    words = text.lower().split()
    pos_count = sum(1 for word in words if word in positive_words)
    neg_count = sum(1 for word in words if word in negative_words)
    
    if pos_count + neg_count == 0:
        return 0.0
    
    return (pos_count - neg_count) / (pos_count + neg_count)

try:
    # Use NLTK VADER if available
    sia = SentimentIntensityAnalyzer()
    
    def get_sentiment_scores(text):
        scores = sia.polarity_scores(text)
        return scores
    
    sentiment_scores = sample_df['description'].apply(get_sentiment_scores)
    sample_df['sentiment_compound'] = [score['compound'] for score in sentiment_scores]
    sample_df['sentiment_positive'] = [score['pos'] for score in sentiment_scores]
    sample_df['sentiment_negative'] = [score['neg'] for score in sentiment_scores]
    sample_df['sentiment_neutral'] = [score['neu'] for score in sentiment_scores]
    
except (NameError, LookupError):
    # Use simple sentiment if NLTK not available
    sample_df['sentiment_compound'] = sample_df['description'].apply(simple_sentiment_score)
    sample_df['sentiment_positive'] = (sample_df['sentiment_compound'] > 0).astype(float)
    sample_df['sentiment_negative'] = (sample_df['sentiment_compound'] < 0).astype(float)
    sample_df['sentiment_neutral'] = (sample_df['sentiment_compound'] == 0).astype(float)

print("Sentiment analysis results:")
print(sample_df[['description', 'sentiment_compound', 'is_fraud']].head(10))

### 🔍 **YOUR TASK:**
Analyze correlation between sentiment and fraud:

In [None]:
# TODO: Analyze sentiment patterns in fraud vs legitimate transactions

# 1. Group by fraud status and analyze sentiment
# YOUR CODE HERE
sentiment_by_fraud = sample_df.groupby('is_fraud')[['sentiment_compound', 'sentiment_positive', 'sentiment_negative']].mean()

print("Sentiment Analysis by Fraud Status:")
print(sentiment_by_fraud)

# 2. Statistical test for sentiment differences
from scipy import stats

fraud_sentiment = sample_df[sample_df['is_fraud'] == 1]['sentiment_compound']
legit_sentiment = sample_df[sample_df['is_fraud'] == 0]['sentiment_compound']

# YOUR CODE HERE: Perform t-test
if len(fraud_sentiment) > 0 and len(legit_sentiment) > 0:
    t_stat, p_value = stats.ttest_ind(fraud_sentiment, legit_sentiment)
    
    print(f"\nT-test results:")
    print(f"T-statistic: {t_stat:.4f}")
    print(f"P-value: {p_value:.4f}")
    print(f"Significant difference: {'Yes' if p_value < 0.05 else 'No'}")
else:
    print("\nInsufficient data for statistical test")
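
The fraud and legitimate groups differ greatly in size and may differ in variance, which the default equal-variance t-test does not account for. As a hedged alternative, the sketch below runs Welch's t-test and a Mann-Whitney U test on made-up scores; the `toy_*` arrays are hypothetical stand-ins for `fraud_sentiment` and `legit_sentiment`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical sentiment scores: a small "fraud" group and a large "legit" group
toy_fraud = rng.normal(-0.10, 0.30, 40)
toy_legit = rng.normal(0.00, 0.15, 4000)

# equal_var=False selects Welch's t-test (no equal-variance assumption)
t_welch, p_welch = stats.ttest_ind(toy_fraud, toy_legit, equal_var=False)

# Mann-Whitney U compares the groups without assuming normality
u_stat, p_mwu = stats.mannwhitneyu(toy_fraud, toy_legit, alternative="two-sided")

print(f"Welch t={t_welch:.3f}, p={p_welch:.4f}")
print(f"Mann-Whitney U={u_stat:.1f}, p={p_mwu:.4f}")
```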

---

# 📊 14:50-15:40: Data Visualization & Pandas Deep-Dive

## 🎯 Learning Objectives:
- Pandas Advanced Techniques
- Time Series Analysis
- Statistical Analysis Methods
- Interactive Banking KPI Dashboards

## 1. Time Series Analysis

### 📝 **EXERCISE 6: Banking Time Series Analytics**

In [None]:
# Time series preparation
df['date'] = pd.to_datetime(df['date'])
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['weekday'] = df['date'].dt.day_name()
df['hour'] = df['date'].dt.hour

# Set date as index for time series operations
df_ts = df.set_index('date').copy()

print(f"Time series data prepared: {df_ts.index.min()} to {df_ts.index.max()}")
print(f"Total time span: {(df_ts.index.max() - df_ts.index.min()).days} days")

### 🔍 **YOUR TASK:**
Implement comprehensive time series analysis:

In [None]:
# TODO: Complete time series analysis

# 1. Daily transaction patterns
# YOUR CODE HERE
daily_metrics = df_ts.groupby(df_ts.index.date).agg({
    'amount_numeric': ['sum', 'mean', 'count'],
    'is_fraud': 'sum',
    'is_online': 'sum'
})

daily_metrics.columns = ['daily_volume', 'avg_transaction', 'transaction_count', 'fraud_count', 'online_count']
daily_metrics['fraud_rate'] = daily_metrics['fraud_count'] / daily_metrics['transaction_count'] * 100
daily_metrics['online_rate'] = daily_metrics['online_count'] / daily_metrics['transaction_count'] * 100

print("Daily metrics calculated:")
print(daily_metrics.head())

# 2. Weekly patterns
# YOUR CODE HERE
weekly_patterns = df.groupby('weekday').agg({
    'amount_numeric': ['mean', 'count'],
    'is_fraud': 'mean',
    'is_online': 'mean'
}).round(3)

weekly_patterns.columns = ['avg_amount', 'transaction_count', 'fraud_rate', 'online_rate']
print("\nWeekly patterns:")
print(weekly_patterns)
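
An alternative to grouping by `index.date` is `resample`, which keeps a proper `DatetimeIndex` and also emits rows for days without transactions. A minimal sketch on made-up hourly data (the `toy` frame is hypothetical; column names follow the notebook):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical transactions: one per hour over exactly ten days
toy = pd.DataFrame({
    "date": pd.date_range("2010-01-01", periods=240, freq=pd.Timedelta(hours=1)),
    "amount_numeric": rng.uniform(1, 100, 240),
    "is_fraud": rng.integers(0, 2, 240),
}).set_index("date")

# Aggregate per calendar day
toy_daily = toy.resample("D").agg({
    "amount_numeric": ["sum", "count"],
    "is_fraud": "sum",
})
toy_daily.columns = ["daily_volume", "transaction_count", "fraud_count"]

# A DatetimeIndex allows time-based rolling windows such as "7D"
toy_daily["count_7d_avg"] = toy_daily["transaction_count"].rolling("7D").mean()
print(toy_daily.head())
```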

In [None]:
# Advanced time series visualizations
plt.figure(figsize=(15, 12))

# Daily transaction volume
plt.subplot(3, 2, 1)
daily_metrics['daily_volume'].plot()
plt.title('Daily Transaction Volume')
plt.ylabel('Total Volume ($)')

# Daily fraud rate
plt.subplot(3, 2, 2)
daily_metrics['fraud_rate'].plot(color='red')
plt.title('Daily Fraud Rate')
plt.ylabel('Fraud Rate (%)')

# Hourly patterns
plt.subplot(3, 2, 3)
hourly_patterns = df.groupby('hour')['amount_numeric'].sum()
hourly_patterns.plot(kind='bar')
plt.title('Transaction Volume by Hour')
plt.xlabel('Hour of Day')
plt.ylabel('Volume ($)')

# Weekly patterns
plt.subplot(3, 2, 4)
days_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
weekly_ordered = weekly_patterns.reindex(days_order)
weekly_ordered['avg_amount'].plot(kind='bar', color='green')
plt.title('Average Transaction Amount by Day')
plt.xlabel('Day of Week')
plt.ylabel('Average Amount ($)')
plt.xticks(rotation=45)

# Moving averages
plt.subplot(3, 2, 5)
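# min_periods=1 keeps the lines visible when the sample covers fewer days than the window
daily_metrics['transaction_count'].rolling(window=7, min_periods=1).mean().plot(label='7-day MA')
daily_metrics['transaction_count'].rolling(window=30, min_periods=1).mean().plot(label='30-day MA')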
plt.title('Transaction Count Moving Averages')
plt.ylabel('Transaction Count')
plt.legend()

# Correlation heatmap
plt.subplot(3, 2, 6)
correlation_matrix = daily_metrics[['daily_volume', 'avg_transaction', 'fraud_rate', 'online_rate']].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Daily Metrics Correlation')

plt.tight_layout()
plt.show()

## 2. Banking KPI Dashboard

### 📝 **EXERCISE 7: Comprehensive KPI Visualization**

In [None]:
# Calculate key banking KPIs
def calculate_banking_kpis(df):
    kpis = {}
    
    # Volume metrics
    kpis['total_volume'] = df['amount_numeric'].sum()
    kpis['total_transactions'] = len(df)
    kpis['avg_transaction_size'] = df['amount_numeric'].mean()
    
    # Customer metrics
    kpis['active_customers'] = df['client_id'].nunique()
    kpis['avg_transactions_per_customer'] = len(df) / df['client_id'].nunique()
    kpis['avg_customer_value'] = df.groupby('client_id')['amount_numeric'].sum().mean()
    
    # Risk metrics
    kpis['fraud_rate'] = df['is_fraud'].mean() * 100
    kpis['fraud_volume'] = df[df['is_fraud'] == 1]['amount_numeric'].sum()
    kpis['high_value_transactions'] = (df['amount_numeric'] > df['amount_numeric'].quantile(0.95)).mean() * 100
    
    # Channel metrics
    kpis['online_percentage'] = df['is_online'].mean() * 100
    kpis['weekend_percentage'] = df['is_weekend'].mean() * 100
    
    return kpis

# Calculate current KPIs
current_kpis = calculate_banking_kpis(df)

print("Banking KPI Dashboard")
print("=" * 50)
# KPIs stored as percentages whose names contain neither 'rate' nor 'percentage'
percent_kpis = {'high_value_transactions'}
for kpi, value in current_kpis.items():
    label = kpi.replace('_', ' ').title()
    if 'rate' in kpi or 'percentage' in kpi or kpi in percent_kpis:
        print(f"{label}: {value:.2f}%")
    elif 'volume' in kpi or 'value' in kpi or 'size' in kpi:
        print(f"{label}: ${value:,.2f}")
    else:
        print(f"{label}: {value:,.0f}")

### 🔍 **YOUR TASK:**
Create a comprehensive KPI dashboard with visualizations:

In [None]:
# TODO: Create comprehensive KPI dashboard

# 1. Create figure with multiple subplots
# YOUR CODE HERE
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
fig.suptitle('Banking KPI Dashboard', fontsize=16, fontweight='bold')

# 2. Daily volume trend
axes[0, 0].plot(daily_metrics.index, daily_metrics['daily_volume'], linewidth=2)
axes[0, 0].set_title('Daily Transaction Volume')
axes[0, 0].set_ylabel('Volume ($)')
axes[0, 0].tick_params(axis='x', rotation=45)

# 3. Fraud rate by day of week
fraud_by_day = df.groupby('weekday')['is_fraud'].mean() * 100
fraud_by_day_ordered = fraud_by_day.reindex(days_order)
axes[0, 1].bar(range(len(fraud_by_day_ordered)), fraud_by_day_ordered.values, color='red', alpha=0.7)
axes[0, 1].set_title('Fraud Rate by Day of Week')
axes[0, 1].set_ylabel('Fraud Rate (%)')
axes[0, 1].set_xticks(range(len(days_order)))
axes[0, 1].set_xticklabels([day[:3] for day in days_order])

# 4. Transaction volume by hour
hourly_volume = df.groupby('hour')['amount_numeric'].sum()
axes[0, 2].bar(hourly_volume.index, hourly_volume.values, color='blue', alpha=0.7)
axes[0, 2].set_title('Transaction Volume by Hour')
axes[0, 2].set_xlabel('Hour of Day')
axes[0, 2].set_ylabel('Volume ($)')

# 5. Online vs Offline transactions
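# Reindex so the counts match the label order regardless of which channel is more frequent
channel_data = df['is_online'].value_counts().reindex([0, 1], fill_value=0)
axes[1, 0].pie(channel_data.values, labels=['Offline', 'Online'], autopct='%1.1f%%', colors=['lightblue', 'orange'])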
axes[1, 0].set_title('Transaction Channels')

# 6. Top states by volume
top_states_volume = df.groupby('merchant_state')['amount_numeric'].sum().nlargest(10)
axes[1, 1].barh(range(len(top_states_volume)), top_states_volume.values)
axes[1, 1].set_title('Top 10 States by Volume')
axes[1, 1].set_yticks(range(len(top_states_volume)))
axes[1, 1].set_yticklabels(top_states_volume.index)
axes[1, 1].set_xlabel('Volume ($)')

# 7. Amount distribution
axes[1, 2].hist(df['amount_numeric'], bins=50, alpha=0.7, color='green')
axes[1, 2].set_title('Transaction Amount Distribution')
axes[1, 2].set_xlabel('Amount ($)')
axes[1, 2].set_ylabel('Frequency')
axes[1, 2].set_yscale('log')

plt.tight_layout()
plt.show()

---

# 🎯 Workshop Summary & Next Steps

## ✅ What We Accomplished Today:

### 1. **Regression Analysis (13:00-13:45)**
- ✅ Customer Lifetime Value Prediction with Linear Regression
- ✅ Fraud Detection using Logistic Regression
- ✅ Model Evaluation: Precision, Recall, F1-Score, ROC-AUC

### 2. **Unstructured Data Analytics (13:55-14:40)**
- ✅ Transaction Description Text Mining
- ✅ Sentiment Analysis for Risk Assessment
- ✅ Text-based Feature Engineering

### 3. **Data Visualization & Pandas Deep-Dive (14:50-15:40)**
- ✅ Advanced Time Series Analysis
- ✅ Banking KPI Dashboard Creation
- ✅ Statistical Analysis and Correlation Studies

## 🚀 **Key Skills Developed:**
1. **Machine Learning:** Linear/Logistic Regression, Classification metrics
2. **Text Analytics:** Text mining, Sentiment Analysis, Feature Engineering
3. **Data Visualization:** Advanced Matplotlib/Seaborn, Dashboard Design
4. **Pandas Mastery:** GroupBy, Time Series, Statistical Functions
5. **Banking Domain:** Risk Assessment, Fraud Detection, Customer Analytics

## 🎓 **Homework Challenges:**
1. Complete all the TODO sections in the exercises
2. Extend the fraud detection model with additional features
3. Create a time-series forecasting model for transaction volumes
4. Implement clustering algorithms for customer segmentation
5. Build a Streamlit dashboard for interactive analytics

## 📚 **Additional Resources:**
- **Documentation:** pandas.pydata.org, scikit-learn.org
- **Books:** "Python for Data Analysis" by Wes McKinney
- **Practice:** Kaggle competitions on financial data
- **Tools:** Apache Spark for big data processing

**Great work today! You've built a comprehensive analytics pipeline for banking data! 🎉**