# Exploratory Data Analysis (EDA)
## Transaction Fraud Detection System

This notebook performs comprehensive exploratory data analysis on the PaySim dataset to understand:
1. Fraud distribution and class imbalance
2. Transaction type patterns
3. Amount and balance distributions
4. Data quality issues
5. Key fraud indicators

In [None]:
# Import libraries
import sys
sys.path.append('../src')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from data_loader import load_raw_data, clean_data

# Set plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

import warnings
warnings.filterwarnings('ignore')

print("Libraries imported successfully!")

## 1. Load Data

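**Note:** For development or testing, pass the `nrows` parameter to load a subset of rows.
For the full analysis, omit `nrows` or set it to `None`.
If `load_raw_data` forwards `nrows` to `pandas.read_csv`, the subset is the first rows of the file (the earliest `step` values), not a random sample, so fraud rates and temporal patterns may differ from the full dataset.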
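
If a random subset is preferred over the first rows, one option is the callable form of `skiprows` in `pandas.read_csv`. The sketch below is only an illustration: it reads a small in-memory CSV with made-up columns rather than the real PaySim file, and it does not go through `load_raw_data`, whose internals are not shown in this notebook.

```python
import io
import random

import pandas as pd

# Tiny in-memory CSV standing in for the real file (hypothetical data).
csv_text = "step,amount,isFraud\n" + "\n".join(
    f"{i},{i * 10.0},{int(i % 50 == 0)}" for i in range(1, 1001)
)

rng = random.Random(42)   # fixed seed so the sample is reproducible
keep_fraction = 0.1       # keep roughly 10% of the data rows

# skiprows accepts a callable that returns True for rows to skip.
# Row 0 is the header line, so it is always kept.
sample = pd.read_csv(
    io.StringIO(csv_text),
    skiprows=lambda i: i > 0 and rng.random() > keep_fraction,
)

print(f"Rows kept: {len(sample)} of 1000")
print(f"Step range in sample: {sample['step'].min()} to {sample['step'].max()}")
```

Unlike `nrows`, this keeps rows from the whole file, so later time steps are represented in the sample.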

In [None]:
# Load data - adjust nrows based on your system memory
# For full dataset: df = load_raw_data()
# For sample: df = load_raw_data(nrows=500000)

df = load_raw_data(nrows=500000)  # Load 500k transactions for EDA

print(f"\nDataset shape: {df.shape}")
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

In [None]:
# Display first few rows
df.head(10)

In [None]:
# Dataset info
df.info()

In [None]:
# Statistical summary
df.describe()

## 2. Fraud Distribution Analysis

Understanding the class imbalance is critical for fraud detection.
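
To make the point about metrics concrete, the sketch below uses synthetic labels with roughly the same fraud rate and shows that a classifier labelling everything as legitimate scores very high accuracy while catching no fraud. The `precision_recall_at_k` helper and the random scores are illustrative assumptions, not part of this project's code.

```python
import numpy as np


def precision_recall_at_k(y_true, scores, k):
    """Precision and recall among the k highest-scored transactions."""
    y_true = np.asarray(y_true)
    top_k = np.argsort(scores)[::-1][:k]   # indices of the k largest scores
    hits = y_true[top_k].sum()             # true frauds among the top k
    return hits / k, hits / y_true.sum()


rng = np.random.default_rng(0)
n = 100_000
y = np.zeros(n, dtype=int)
y[rng.choice(n, size=130, replace=False)] = 1   # 0.13% positives

# A "model" that predicts legitimate for every row.
always_legit = np.zeros(n, dtype=int)
accuracy = (always_legit == y).mean()

# Hypothetical model scores: noise, shifted upward for fraud rows.
scores = rng.normal(size=n) + 3.0 * y
p_at_k, r_at_k = precision_recall_at_k(y, scores, k=500)

print(f"Accuracy of always-legitimate baseline: {accuracy:.4%}")
print(f"Precision@500: {p_at_k:.3f}  Recall@500: {r_at_k:.3f}")
```

Precision@K and Recall@K describe how useful the top of the ranked list is to a review team with a fixed capacity of K cases, which accuracy cannot capture.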

In [None]:
# Fraud distribution (reindex so class 0 always comes before class 1, even if a class is absent)
fraud_counts = df['isFraud'].value_counts().reindex([0, 1], fill_value=0)
fraud_pct = fraud_counts / fraud_counts.sum() * 100

print("Fraud Distribution:")
print("="*50)
print(f"Legitimate transactions: {fraud_counts[0]:,} ({fraud_pct[0]:.4f}%)")
print(f"Fraudulent transactions: {fraud_counts[1]:,} ({fraud_pct[1]:.4f}%)")
print(f"\nImbalance ratio: {fraud_counts[0] / fraud_counts[1]:.2f}:1")
print("\n⚠️  This is a HIGHLY IMBALANCED dataset!")
print("   Accuracy is NOT a good metric here.")
print("   We need Precision@K, Recall@K, and PR-AUC.")

In [None]:
# Visualize fraud distribution
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

# Count plot
fraud_counts.plot(kind='bar', ax=ax1, color=['#2ecc71', '#e74c3c'])
ax1.set_title('Fraud Distribution (Count)', fontsize=14, fontweight='bold')
ax1.set_xlabel('Is Fraud', fontsize=12)
ax1.set_ylabel('Count', fontsize=12)
ax1.set_xticklabels(['Legitimate (0)', 'Fraud (1)'], rotation=0)
ax1.grid(axis='y', alpha=0.3)

# Add count labels on bars
for i, v in enumerate(fraud_counts):
    ax1.text(i, v + max(fraud_counts)*0.02, f'{v:,}', ha='center', fontweight='bold')

# Pie chart
colors = ['#2ecc71', '#e74c3c']
ax2.pie(fraud_counts, labels=['Legitimate', 'Fraud'], autopct='%1.4f%%', 
        colors=colors, startangle=90, explode=(0, 0.1))
ax2.set_title('Fraud Distribution (Percentage)', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

## 3. Transaction Type Analysis

Different transaction types have different fraud patterns.
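
As a small illustration of the per-type comparison, the sketch below computes fraud counts and rates on a toy frame and derives a binary flag from the types in which fraud was observed. The rows and the `is_risky_type` column name are made up for the example.

```python
import pandas as pd

# Toy transactions (hypothetical values, same column names as the dataset).
toy = pd.DataFrame({
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT", "TRANSFER",
             "CASH_IN", "CASH_OUT", "DEBIT", "PAYMENT"],
    "isFraud": [0, 1, 0, 0, 0, 1, 0, 0],
})

# Count, fraud count, and fraud rate per transaction type.
summary = toy.groupby("type")["isFraud"].agg(total="count", fraud="sum", rate="mean")
summary["rate_pct"] = summary["rate"] * 100

# Types in which at least one fraud was observed.
types_with_fraud = sorted(summary.index[summary["fraud"] > 0])

# Binary flag marking rows whose type has shown fraud.
toy["is_risky_type"] = toy["type"].isin(types_with_fraud).astype(int)

print(summary)
print("Types with fraud:", types_with_fraud)
```

In a real pipeline the list of types should be derived from the training split only, so that the flag does not leak information from the evaluation period.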

In [None]:
# Transaction type distribution
print("Transaction Type Distribution:")
print("="*50)
print(df['type'].value_counts())
print("\n" + "="*50)
print(df['type'].value_counts(normalize=True) * 100)

In [None]:
# Fraud by transaction type
fraud_by_type = pd.crosstab(df['type'], df['isFraud'], normalize='index') * 100

print("\nFraud Rate by Transaction Type:")
print("="*50)
print(fraud_by_type)

print("\n🔍 Key Insight:")
print("   In PaySim, fraud appears only in TRANSFER and CASH_OUT transactions.")
print("   Confirm this in the table above; transaction type is a critical feature for our model.")

In [None]:
# Visualize fraud by transaction type
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Count by type
type_fraud = df.groupby(['type', 'isFraud']).size().unstack(fill_value=0)
type_fraud.plot(kind='bar', ax=ax1, color=['#2ecc71', '#e74c3c'], width=0.8)
ax1.set_title('Transaction Count by Type and Fraud Status', fontsize=14, fontweight='bold')
ax1.set_xlabel('Transaction Type', fontsize=12)
ax1.set_ylabel('Count', fontsize=12)
ax1.legend(['Legitimate', 'Fraud'], loc='upper right')
ax1.set_xticklabels(ax1.get_xticklabels(), rotation=45, ha='right')
ax1.grid(axis='y', alpha=0.3)

# Fraud rate by type
fraud_rate_by_type = df.groupby('type')['isFraud'].mean() * 100
fraud_rate_by_type.plot(kind='bar', ax=ax2, color='#e74c3c', width=0.6)
ax2.set_title('Fraud Rate by Transaction Type', fontsize=14, fontweight='bold')
ax2.set_xlabel('Transaction Type', fontsize=12)
ax2.set_ylabel('Fraud Rate (%)', fontsize=12)
ax2.set_xticklabels(ax2.get_xticklabels(), rotation=45, ha='right')
ax2.grid(axis='y', alpha=0.3)

# Add value labels
for i, v in enumerate(fraud_rate_by_type):
    ax2.text(i, v + max(fraud_rate_by_type)*0.02, f'{v:.3f}%', ha='center', fontweight='bold')

plt.tight_layout()
plt.show()

## 4. Amount Analysis

Analyzing transaction amounts for fraud vs legitimate transactions.
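
Transaction amounts are typically heavy-tailed, which is why the plots in this section use `np.log1p`. The sketch below shows the effect on synthetic log-normal amounts (an assumption made for illustration, not a claim about the real distribution) by comparing skewness before and after the transform.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic heavy-tailed amounts (hypothetical, not PaySim values).
amounts = rng.lognormal(mean=10, sigma=2, size=50_000)


def skewness(x):
    """Sample skewness: third central moment over variance^1.5."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5


raw_skew = skewness(amounts)
log_skew = skewness(np.log1p(amounts))

# log1p(x) = log(1 + x) is defined at x = 0, and expm1 inverts it.
restored = np.expm1(np.log1p(amounts))

print(f"Skewness of raw amounts:   {raw_skew:.2f}")
print(f"Skewness of log1p amounts: {log_skew:.2f}")
```

`log1p` is used rather than `log` because zero-amount rows would otherwise map to negative infinity.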

In [None]:
# Amount statistics by fraud status
print("Amount Statistics by Fraud Status:")
print("="*70)
print(df.groupby('isFraud')['amount'].describe())

In [None]:
# Visualize amount distribution
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# Histogram - all transactions
axes[0, 0].hist(df['amount'], bins=50, color='steelblue', alpha=0.7, edgecolor='black')
axes[0, 0].set_title('Amount Distribution (All Transactions)', fontsize=12, fontweight='bold')
axes[0, 0].set_xlabel('Amount')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].grid(alpha=0.3)

# Log scale histogram
axes[0, 1].hist(np.log1p(df['amount']), bins=50, color='steelblue', alpha=0.7, edgecolor='black')
axes[0, 1].set_title('Amount Distribution (Log Scale)', fontsize=12, fontweight='bold')
axes[0, 1].set_xlabel('Log(Amount + 1)')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].grid(alpha=0.3)

# Box plot by fraud status
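# Sample legitimate rows for faster plotting, but keep every fraud row so the rare class stays visible
legit_rows = df[df['isFraud'] == 0]
df_sample = pd.concat([
    legit_rows.sample(n=min(10000, len(legit_rows)), random_state=42),
    df[df['isFraud'] == 1],
])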
sns.boxplot(data=df_sample, x='isFraud', y='amount', ax=axes[1, 0])
axes[1, 0].set_title('Amount Distribution by Fraud Status', fontsize=12, fontweight='bold')
axes[1, 0].set_xlabel('Is Fraud')
axes[1, 0].set_ylabel('Amount')
axes[1, 0].set_xticklabels(['Legitimate', 'Fraud'])
axes[1, 0].grid(alpha=0.3)

# Violin plot (log scale)
df_sample['log_amount'] = np.log1p(df_sample['amount'])
sns.violinplot(data=df_sample, x='isFraud', y='log_amount', ax=axes[1, 1])
axes[1, 1].set_title('Log Amount Distribution by Fraud Status', fontsize=12, fontweight='bold')
axes[1, 1].set_xlabel('Is Fraud')
axes[1, 1].set_ylabel('Log(Amount + 1)')
axes[1, 1].set_xticklabels(['Legitimate', 'Fraud'])
axes[1, 1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

## 5. Balance Analysis

Balance inconsistencies are key fraud indicators.
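
A worked example may help in reading the two error columns computed below. Each is zero when the recorded balances reconcile with the amount and non-zero otherwise. The three toy rows are invented for illustration.

```python
import pandas as pd

# Three hypothetical transfers (values made up for illustration).
toy = pd.DataFrame({
    "amount":         [100.0, 100.0, 500.0],
    "oldbalanceOrg":  [300.0,   0.0, 500.0],
    "newbalanceOrig": [200.0,   0.0,   0.0],
    "oldbalanceDest": [ 50.0,  50.0,   0.0],
    "newbalanceDest": [150.0, 150.0,   0.0],
})

# Same definitions as in this notebook.
toy["errorBalanceOrig"] = toy["newbalanceOrig"] + toy["amount"] - toy["oldbalanceOrg"]
toy["errorBalanceDest"] = toy["oldbalanceDest"] + toy["amount"] - toy["newbalanceDest"]

# Row 0: both sides reconcile, so both errors are 0.
# Row 1: the sender's balance did not change, so errorBalanceOrig = 100.
# Row 2: the recipient was never credited, so errorBalanceDest = 500.
print(toy[["errorBalanceOrig", "errorBalanceDest"]])
```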

In [None]:
# Calculate balance errors
df['errorBalanceOrig'] = df['newbalanceOrig'] + df['amount'] - df['oldbalanceOrg']
df['errorBalanceDest'] = df['oldbalanceDest'] + df['amount'] - df['newbalanceDest']

print("Balance Error Analysis:")
print("="*70)
print("\nOrigin Balance Error:")
print(df.groupby('isFraud')['errorBalanceOrig'].describe())
print("\nDestination Balance Error:")
print(df.groupby('isFraud')['errorBalanceDest'].describe())

In [None]:
# Percentage of transactions with balance errors
has_error_orig = (df['errorBalanceOrig'].abs() > 0.01).groupby(df['isFraud']).mean() * 100
has_error_dest = (df['errorBalanceDest'].abs() > 0.01).groupby(df['isFraud']).mean() * 100

print("\nPercentage with Balance Errors:")
print("="*70)
print("\nOrigin Balance Error:")
print(f"  Legitimate: {has_error_orig[0]:.2f}%")
print(f"  Fraud:      {has_error_orig[1]:.2f}%")
print("\nDestination Balance Error:")
print(f"  Legitimate: {has_error_dest[0]:.2f}%")
print(f"  Fraud:      {has_error_dest[1]:.2f}%")

print("\n🔍 Key Insight:")
print("   Compare the rates above: a large gap between the two classes,")
print("   in either direction, makes the balance error a useful feature for our model.")

In [None]:
# Visualize balance errors
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Origin balance error
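# Sample legitimate rows for faster plotting, but keep every fraud row so the rare class stays visible
legit_rows = df[df['isFraud'] == 0]
df_sample = pd.concat([
    legit_rows.sample(n=min(10000, len(legit_rows)), random_state=42),
    df[df['isFraud'] == 1],
])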
sns.boxplot(data=df_sample, x='isFraud', y='errorBalanceOrig', ax=axes[0])
axes[0].set_title('Origin Balance Error by Fraud Status', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Is Fraud')
axes[0].set_ylabel('Balance Error')
axes[0].set_xticklabels(['Legitimate', 'Fraud'])
axes[0].grid(alpha=0.3)

# Destination balance error
sns.boxplot(data=df_sample, x='isFraud', y='errorBalanceDest', ax=axes[1])
axes[1].set_title('Destination Balance Error by Fraud Status', fontsize=12, fontweight='bold')
axes[1].set_xlabel('Is Fraud')
axes[1].set_ylabel('Balance Error')
axes[1].set_xticklabels(['Legitimate', 'Fraud'])
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

## 6. Temporal Analysis

Analyzing fraud patterns over time.
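
The time features below are derived from `step` with modulo and integer division. The sketch shows what that mapping produces for a few values, alongside a zero-based alternative. It assumes that `step` counts hours starting at 1; the `hour_0based` and `day_0based` names are hypothetical and only used here.

```python
import pandas as pd

steps = pd.DataFrame({"step": [1, 23, 24, 25, 48, 744]})

# Mapping used in this notebook: step 24 becomes hour 0 of day 1.
steps["hour"] = steps["step"] % 24
steps["day"] = steps["step"] // 24

# Alternative, assuming step 1 is the first hour of day 0.
steps["hour_0based"] = (steps["step"] - 1) % 24
steps["day_0based"] = (steps["step"] - 1) // 24

print(steps)
```

Both mappings give a consistent 24-hour cycle; they differ only in which step is treated as the start of a day, so the hour labels are relative to the start of the simulation rather than to a wall-clock time.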

In [None]:
# Create time features
df['hour'] = df['step'] % 24
df['day'] = df['step'] // 24

# Fraud rate by hour
fraud_by_hour = df.groupby('hour')['isFraud'].agg(['sum', 'count', 'mean'])
fraud_by_hour.columns = ['fraud_count', 'total_count', 'fraud_rate']
fraud_by_hour['fraud_rate'] *= 100

print("Fraud Rate by Hour of Day:")
print("="*70)
print(fraud_by_hour)

In [None]:
# Visualize temporal patterns
fig, axes = plt.subplots(2, 1, figsize=(16, 10))

# Fraud count by hour
axes[0].bar(fraud_by_hour.index, fraud_by_hour['fraud_count'], color='#e74c3c', alpha=0.7)
axes[0].set_title('Fraud Count by Hour of Day', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Hour of Day', fontsize=12)
axes[0].set_ylabel('Fraud Count', fontsize=12)
axes[0].grid(axis='y', alpha=0.3)

# Fraud rate by hour
axes[1].plot(fraud_by_hour.index, fraud_by_hour['fraud_rate'], 
             marker='o', linewidth=2, markersize=8, color='#e74c3c')
axes[1].set_title('Fraud Rate by Hour of Day', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Hour of Day', fontsize=12)
axes[1].set_ylabel('Fraud Rate (%)', fontsize=12)
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

## 7. Data Quality Checks

In [None]:
# Missing values
print("Missing Values:")
print("="*70)
missing = df.isnull().sum()
if missing.any():
    print(missing[missing > 0])
else:
    print("✓ No missing values found!")

In [None]:
# Negative balances (data errors)
print("\nNegative Balance Check:")
print("="*70)
neg_old_orig = (df['oldbalanceOrg'] < 0).sum()
neg_new_orig = (df['newbalanceOrig'] < 0).sum()
neg_old_dest = (df['oldbalanceDest'] < 0).sum()
neg_new_dest = (df['newbalanceDest'] < 0).sum()

print(f"Negative oldbalanceOrg:  {neg_old_orig:,}")
print(f"Negative newbalanceOrig: {neg_new_orig:,}")
print(f"Negative oldbalanceDest: {neg_old_dest:,}")
print(f"Negative newbalanceDest: {neg_new_dest:,}")

if any([neg_old_orig, neg_new_orig, neg_old_dest, neg_new_dest]):
    print("\n⚠️  Negative balances found - these will be removed during cleaning.")
else:
    print("\n✓ No negative balances found!")

In [None]:
# Duplicate transactions
print("\nDuplicate Check:")
print("="*70)
duplicates = df.duplicated().sum()
print(f"Duplicate rows: {duplicates:,}")
if duplicates > 0:
    print("⚠️  Duplicates found - review needed.")
else:
    print("✓ No duplicates found!")

## 8. Correlation Analysis

In [None]:
# Select numeric columns for correlation
numeric_cols = ['step', 'amount', 'oldbalanceOrg', 'newbalanceOrig', 
                'oldbalanceDest', 'newbalanceDest', 'isFraud', 
                'errorBalanceOrig', 'errorBalanceDest']

# Calculate correlation matrix
corr_matrix = df[numeric_cols].corr()

# Plot correlation heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm', center=0,
            square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Feature Correlation Matrix', fontsize=14, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()

# Show correlations with fraud
print("\nCorrelation with Fraud:")
print("="*70)
fraud_corr = corr_matrix['isFraud'].sort_values(ascending=False)
print(fraud_corr)

## 9. Key Findings Summary

### Class Imbalance
- The dataset is **highly imbalanced** (~0.13% fraud rate in the full PaySim dataset; the rate in an `nrows` subset may differ)
- This makes **accuracy a misleading metric**
- We must use **Precision@K, Recall@K, and PR-AUC**

### Transaction Types
- **Fraud appears only in TRANSFER and CASH_OUT transactions**
- This is a critical feature for classification

### Balance Inconsistencies
- **Balance-error rates differ between fraudulent and legitimate transactions** (see the percentages in Section 5)
- This will be a powerful predictive feature

### Amount Patterns
- Fraudulent transactions show different amount distributions
- Log transformation may help normalize the data

### Data Quality
- Any negative balances flagged by the check in Section 7 need to be removed
- No missing values
- Dataset is generally clean

### Next Steps
1. Clean data (remove any negative balances)
2. Engineer features (balance errors, amount ratios, etc.)
3. Use time-based split (no random splitting)
4. Handle class imbalance with weights/SMOTE
5. Evaluate with Precision@K and PR-AUC
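
Steps 3 and 4 can be sketched as follows. This is a minimal illustration on synthetic data, not the project's implementation: the `time_based_split` helper, the 80/20 cutoff, and the toy frame are assumptions. The weights use the common "balanced" heuristic, `n_samples / (n_classes * class_count)`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 10_000
# Synthetic stand-in for the dataset (hypothetical values).
df_toy = pd.DataFrame({
    "step": np.sort(rng.integers(1, 745, size=n)),
    "isFraud": (rng.random(n) < 0.01).astype(int),
})


def time_based_split(frame, time_col="step", train_frac=0.8):
    """Train on earlier steps, test on later ones (no shuffling)."""
    cutoff = frame[time_col].quantile(train_frac)
    train = frame[frame[time_col] <= cutoff]
    test = frame[frame[time_col] > cutoff]
    return train, test


train, test = time_based_split(df_toy)

# Balanced class weights computed from the training split only.
counts = train["isFraud"].value_counts()
class_weight = {
    int(cls): len(train) / (len(counts) * cnt) for cls, cnt in counts.items()
}

print(f"Train steps: up to {train['step'].max()}, test steps: from {test['step'].min()}")
print(f"Class weights: {class_weight}")
```

Splitting on `step` keeps every test transaction later than every training transaction, which mirrors how a deployed model would be scored on future data.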

In [None]:
print("\n" + "="*70)
print("EDA COMPLETE!")
print("="*70)
print("\nNext notebook: 02_feature_engineering.ipynb")