# Fraud Detection - Exploratory Data Analysis

This notebook explores the credit card transaction dataset to understand:
- Data structure and quality
- Class imbalance (fraud vs legitimate)
- Feature distributions and patterns
- Temporal and geographic trends

**Dataset:** Kaggle Credit Card Fraud Detection (simulated)

## 1. Setup & Data Loading

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings

warnings.filterwarnings('ignore')

# Plot settings
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 11

# Paths
DATA_DIR = Path('../data/interim')
FIGURES_DIR = Path('../outputs/figures')
FIGURES_DIR.mkdir(parents=True, exist_ok=True)

In [None]:
# Load merged dataset
df = pd.read_csv(DATA_DIR / 'transactions_merged.csv')
print(f"Dataset shape: {df.shape}")
print(f"Rows: {df.shape[0]:,}")
print(f"Columns: {df.shape[1]}")

## 2. Initial Exploration

In [None]:
# Column names and types
df.info()

In [None]:
# First few rows
df.head()

In [None]:
# Last few rows
df.tail()

In [None]:
# Statistical summary
df.describe()

In [None]:
# Column list with dtypes
pd.DataFrame({
    'dtype': df.dtypes,
    'non_null': df.count(),
    'null_count': df.isnull().sum(),
    'unique': df.nunique()
})

## 3. Data Quality

In [None]:
# Missing values
missing = df.isnull().sum()
missing_pct = (missing / len(df)) * 100

missing_df = pd.DataFrame({
    'missing_count': missing,
    'missing_pct': missing_pct
}).sort_values('missing_count', ascending=False)

print("Missing Values:")
print(missing_df[missing_df['missing_count'] > 0])

if missing.sum() == 0:
    print("No missing values found!")

In [None]:
# Duplicate rows
duplicates = df.duplicated().sum()
print(f"Duplicate rows: {duplicates}")

# Duplicate transaction IDs
dup_trans = df['trans_num'].duplicated().sum()
print(f"Duplicate transaction IDs: {dup_trans}")

## 4. Target Analysis (Class Imbalance)

In [None]:
# Fraud distribution
fraud_counts = df['is_fraud'].value_counts()
fraud_pct = df['is_fraud'].value_counts(normalize=True) * 100

print("Class Distribution:")
print(f"  Legitimate (0): {fraud_counts[0]:,} ({fraud_pct[0]:.2f}%)")
print(f"  Fraudulent (1): {fraud_counts[1]:,} ({fraud_pct[1]:.2f}%)")
print(f"\nImbalance Ratio: {fraud_counts[0]/fraud_counts[1]:.0f}:1")

In [None]:
# Visualize class imbalance
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Bar plot
colors = ['#2ecc71', '#e74c3c']
bars = axes[0].bar(['Legitimate', 'Fraud'], fraud_counts.values, color=colors)
axes[0].set_title('Transaction Class Distribution', fontsize=14)
axes[0].set_ylabel('Count')
for bar, count in zip(bars, fraud_counts.values):
    axes[0].text(bar.get_x() + bar.get_width()/2, bar.get_height(), 
                 f'{count:,}', ha='center', va='bottom', fontsize=11)

# Pie chart
axes[1].pie(fraud_counts.values, labels=['Legitimate', 'Fraud'],
            autopct='%1.2f%%', colors=colors, explode=[0, 0.1],
            shadow=True, startangle=90)
axes[1].set_title('Fraud Percentage', fontsize=14)

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'class_distribution.png', dpi=150, bbox_inches='tight')
plt.show()

**Key Insight:** Severe class imbalance (~99.5% legitimate, ~0.5% fraud)
- Accuracy is NOT a suitable metric
- Need: SMOTE, class weights, PR-AUC metric

## 5. Numerical Features

In [None]:
# Transaction amount distribution
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Overall distribution
axes[0, 0].hist(df['amt'], bins=50, color='steelblue', edgecolor='white')
axes[0, 0].set_title('Transaction Amount Distribution')
axes[0, 0].set_xlabel('Amount ($)')
axes[0, 0].set_ylabel('Count')

# Log scale
axes[0, 1].hist(np.log1p(df['amt']), bins=50, color='steelblue', edgecolor='white')
axes[0, 1].set_title('Log Transaction Amount')
axes[0, 1].set_xlabel('Log(Amount + 1)')
axes[0, 1].set_ylabel('Count')

# By fraud status (boxplot)
df.boxplot(column='amt', by='is_fraud', ax=axes[1, 0])
axes[1, 0].set_title('Amount by Fraud Status')
axes[1, 0].set_xlabel('Is Fraud')
axes[1, 0].set_ylabel('Amount ($)')
plt.suptitle('')

# Density by fraud status
for fraud_val, color, label in [(0, '#2ecc71', 'Legitimate'), (1, '#e74c3c', 'Fraud')]:
    subset = df[df['is_fraud'] == fraud_val]['amt']
    axes[1, 1].hist(np.log1p(subset), bins=50, alpha=0.6, color=color, label=label, density=True)
axes[1, 1].set_title('Amount Distribution by Fraud Status')
axes[1, 1].set_xlabel('Log(Amount + 1)')
axes[1, 1].set_ylabel('Density')
axes[1, 1].legend()

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'amount_distribution.png', dpi=150, bbox_inches='tight')
plt.show()

In [None]:
# Amount statistics by fraud status
df.groupby('is_fraud')['amt'].describe()

## 6. Categorical Features

In [None]:
# Transaction categories
print(f"Number of categories: {df['category'].nunique()}")
print(f"\nCategories:")
print(df['category'].value_counts())

In [None]:
# Fraud rate by category
category_fraud = df.groupby('category').agg(
    total=('is_fraud', 'count'),
    fraud_count=('is_fraud', 'sum'),
    fraud_rate=('is_fraud', 'mean')
).sort_values('fraud_rate', ascending=False)

category_fraud['fraud_rate_pct'] = category_fraud['fraud_rate'] * 100
print(category_fraud)

In [None]:
# Visualize fraud rate by category
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Transaction count by category
cat_counts = df['category'].value_counts()
axes[0].barh(cat_counts.index, cat_counts.values, color='steelblue')
axes[0].set_title('Transactions by Category')
axes[0].set_xlabel('Count')

# Fraud rate by category
fraud_rates = category_fraud['fraud_rate_pct'].sort_values(ascending=True)
colors = plt.cm.RdYlGn_r(np.linspace(0.2, 0.8, len(fraud_rates)))
axes[1].barh(fraud_rates.index, fraud_rates.values, color=colors)
axes[1].set_title('Fraud Rate by Category (%)')
axes[1].set_xlabel('Fraud Rate (%)')

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'category_analysis.png', dpi=150, bbox_inches='tight')
plt.show()

In [None]:
# Gender distribution
print("Gender Distribution:")
print(df['gender'].value_counts())
print(f"\nFraud rate by gender:")
print(df.groupby('gender')['is_fraud'].mean() * 100)

## 7. Temporal Patterns

In [None]:
# Parse datetime
df['trans_datetime'] = pd.to_datetime(df['trans_date_trans_time'])
df['trans_hour'] = df['trans_datetime'].dt.hour
df['trans_day'] = df['trans_datetime'].dt.day
df['trans_month'] = df['trans_datetime'].dt.month
df['trans_dayofweek'] = df['trans_datetime'].dt.dayofweek
df['trans_year'] = df['trans_datetime'].dt.year

print(f"Date range: {df['trans_datetime'].min()} to {df['trans_datetime'].max()}")

In [None]:
# Hourly patterns
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Transactions by hour
hourly_counts = df.groupby('trans_hour').size()
axes[0, 0].bar(hourly_counts.index, hourly_counts.values, color='steelblue')
axes[0, 0].set_title('Transactions by Hour')
axes[0, 0].set_xlabel('Hour of Day')
axes[0, 0].set_ylabel('Count')

# Fraud rate by hour
hourly_fraud = df.groupby('trans_hour')['is_fraud'].mean() * 100
axes[0, 1].plot(hourly_fraud.index, hourly_fraud.values, 'ro-', linewidth=2, markersize=6)
axes[0, 1].set_title('Fraud Rate by Hour (%)')
axes[0, 1].set_xlabel('Hour of Day')
axes[0, 1].set_ylabel('Fraud Rate (%)')
axes[0, 1].axhline(y=df['is_fraud'].mean()*100, color='gray', linestyle='--', label='Overall')
axes[0, 1].legend()

# Day of week
dow_labels = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
dow_fraud = df.groupby('trans_dayofweek')['is_fraud'].mean() * 100
axes[1, 0].bar(dow_labels, dow_fraud.values, color='coral')
axes[1, 0].set_title('Fraud Rate by Day of Week')
axes[1, 0].set_ylabel('Fraud Rate (%)')

# Monthly trend
monthly = df.groupby(['trans_year', 'trans_month'])['is_fraud'].mean() * 100
monthly = monthly.reset_index()
monthly['period'] = monthly['trans_year'].astype(str) + '-' + monthly['trans_month'].astype(str).str.zfill(2)
axes[1, 1].plot(range(len(monthly)), monthly['is_fraud'].values, 'b-o', markersize=4)
axes[1, 1].set_title('Monthly Fraud Rate Trend')
axes[1, 1].set_xlabel('Month')
axes[1, 1].set_ylabel('Fraud Rate (%)')
axes[1, 1].set_xticks(range(0, len(monthly), 3))
axes[1, 1].set_xticklabels(monthly['period'].iloc[::3], rotation=45)

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'temporal_patterns.png', dpi=150, bbox_inches='tight')
plt.show()

**Key Insight:** Fraud rates vary by time
- Higher fraud rates during late night/early morning hours
- Time-based features will be useful

## 8. Geographic Analysis

In [None]:
# Calculate distance between customer and merchant
df['distance'] = np.sqrt(
    (df['lat'] - df['merch_lat'])**2 + 
    (df['long'] - df['merch_long'])**2
)

print("Distance statistics:")
print(df['distance'].describe())

In [None]:
# Distance by fraud status
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Boxplot
df.boxplot(column='distance', by='is_fraud', ax=axes[0])
axes[0].set_title('Customer-Merchant Distance by Fraud Status')
axes[0].set_xlabel('Is Fraud')
axes[0].set_ylabel('Distance')
plt.suptitle('')

# Density
for fraud_val, color, label in [(0, '#2ecc71', 'Legitimate'), (1, '#e74c3c', 'Fraud')]:
    subset = df[df['is_fraud'] == fraud_val]['distance']
    axes[1].hist(subset, bins=50, alpha=0.6, color=color, label=label, density=True)
axes[1].set_title('Distance Distribution by Fraud Status')
axes[1].set_xlabel('Distance')
axes[1].set_ylabel('Density')
axes[1].legend()

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'geographic_analysis.png', dpi=150, bbox_inches='tight')
plt.show()

In [None]:
# Distance stats by fraud
print("Distance by fraud status:")
print(df.groupby('is_fraud')['distance'].describe())

## 9. Correlation Analysis

In [None]:
# Numerical columns for correlation
numerical_cols = ['amt', 'lat', 'long', 'city_pop', 'merch_lat', 'merch_long', 
                  'trans_hour', 'trans_dayofweek', 'distance', 'is_fraud']

corr_matrix = df[numerical_cols].corr()

# Heatmap
plt.figure(figsize=(10, 8))
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
sns.heatmap(corr_matrix, mask=mask, annot=True, cmap='RdBu_r', center=0,
            fmt='.2f', square=True, linewidths=0.5)
plt.title('Correlation Matrix', fontsize=14)
plt.tight_layout()
plt.savefig(FIGURES_DIR / 'correlation_matrix.png', dpi=150, bbox_inches='tight')
plt.show()

In [None]:
# Correlation with target
target_corr = df[numerical_cols].corr()['is_fraud'].drop('is_fraud').sort_values(key=abs, ascending=False)
print("Correlation with Fraud:")
print(target_corr)

## 10. Key Insights Summary

### Dataset Overview
- **1.85M transactions** from 2019-2020
- **23 features** including transaction, customer, and merchant info
- **No missing values**

### Class Imbalance
- **~99.5% legitimate, ~0.5% fraud** (ratio ~190:1)
- Need specialized handling: SMOTE, class weights
- Use PR-AUC, not accuracy

### Feature Insights
1. **Amount:** Fraudulent transactions show different amount patterns
2. **Category:** Some categories have higher fraud rates (shopping_net, misc_net)
3. **Time:** Higher fraud rates at night (00:00-06:00)
4. **Distance:** Customer-merchant distance may indicate fraud

### Feature Engineering Ideas
- Time features: hour, day, weekend, is_night
- Amount: log transform
- Geographic: customer-merchant distance
- Category encoding: target encoding for fraud rate

In [None]:
# Save processed dataframe with new features for later use
print(f"Final dataframe shape: {df.shape}")
print(f"\nNew columns added: trans_datetime, trans_hour, trans_day, trans_month, trans_dayofweek, trans_year, distance")