⚖️ Task 1 - Class Imbalance Handling
# ## Balancing Fraud Detection Data for Better Model Performance
# 
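# **Objective**: Address severe class imbalance (about 10:1 in the e-commerce data and about 599:1 in the credit card data) using resampling, class weighting, and ensemble techniques.
# 
# **Key Challenges**:
# 1. Fraud is rare: 9.36% of e-commerce transactions and 0.17% of credit card transactions
# 2. Models trained on such data are biased toward the majority class
# 3. Fraud detection must be balanced against false positives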
# 
# **Techniques**:
# 1. SMOTE (Synthetic Minority Oversampling)
# 2. ADASYN (Adaptive Synthetic Sampling)
# 3. Class weighting
# 4. Ensemble methods
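
To make the first technique in the list concrete, here is a minimal sketch of the interpolation step at the heart of SMOTE, using only the Python standard library. It is an illustration under stated assumptions rather than a production implementation: the function name `smote_sketch`, its parameters, and the toy data are all hypothetical. In practice the `imbalanced-learn` package provides `SMOTE` and `ADASYN` classes, and resampling is normally applied to the training split only, so that synthetic points never leak into evaluation data.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest minority neighbours, excluding the base point itself.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy 2-D minority class (made-up values, not taken from the fraud data).
toy_minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote_sketch(toy_minority, n_new=6, k=2)
print(f"Generated {len(new_points)} synthetic minority points")
```

Each synthetic point lies on the line segment between two real minority samples, which is why SMOTE tends to fill in the minority region instead of duplicating existing rows.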

In [8]:
# Basic imports that should work
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, precision_recall_curve
import warnings
warnings.filterwarnings('ignore')

# Custom styling
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (16, 10)

print("✅ Basic libraries imported successfully")

✅ Basic libraries imported successfully


In [2]:
import os
from pathlib import Path
from datetime import datetime

# Define paths
base_path = Path("D:/10 acadamy/fraud-detection-ml-system")
data_dir = base_path / "data/processed"

# Load the most recent cleaned data files
fraud_file = data_dir / "fraud_data_cleaned_20251221_110457.csv"  # Most recent
credit_file = data_dir / "creditcard_cleaned_20251221_110457.csv"  # Most recent
ip_file = data_dir / "ip_country_mapping_20251221_110457.csv"  # Most recent

# Output directories
output_dir = base_path / "outputs/data_analysis_processing"
reports_dir = output_dir / "reports"
visualizations_dir = output_dir / "visualizations"
processed_data_dir = output_dir / "processed_data"
balanced_data_dir = output_dir / "balanced_data"

# Create directories
for directory in [output_dir, reports_dir, visualizations_dir, processed_data_dir, balanced_data_dir]:
    directory.mkdir(parents=True, exist_ok=True)

print("📁 Data files:")
print(f"Fraud data: {fraud_file}")
print(f"Credit data: {credit_file}")
print(f"IP data: {ip_file}")
print(f"Output directory: {output_dir}")

📁 Data files:
Fraud data: D:\10 acadamy\fraud-detection-ml-system\data\processed\fraud_data_cleaned_20251221_110457.csv
Credit data: D:\10 acadamy\fraud-detection-ml-system\data\processed\creditcard_cleaned_20251221_110457.csv
IP data: D:\10 acadamy\fraud-detection-ml-system\data\processed\ip_country_mapping_20251221_110457.csv
Output directory: D:\10 acadamy\fraud-detection-ml-system\outputs\data_analysis_processing


In [3]:
# Load data
print("="*80)
print("📥 LOADING AND ANALYZING DATA")
print("="*80)

# Load fraud data
fraud_df = pd.read_csv(fraud_file)
print(f"✅ Fraud data loaded: {fraud_df.shape[0]:,} rows × {fraud_df.shape[1]} columns")

# Find fraud indicator column
fraud_col = None
for col in ['class', 'is_fraud', 'fraud', 'Class', 'isFraud']:
    if col in fraud_df.columns:
        fraud_col = col
        print(f"🔍 Found fraud indicator column: '{fraud_col}'")
        break

if fraud_col is None:
    print("⚠️ No fraud indicator column found in fraud data")

# Load credit card data
credit_df = pd.read_csv(credit_file)
print(f"✅ Credit card data loaded: {credit_df.shape[0]:,} rows × {credit_df.shape[1]} columns")

# Find fraud indicator column for credit data
credit_fraud_col = None
for col in ['Class', 'class', 'is_fraud', 'fraud', 'isFraud']:
    if col in credit_df.columns:
        credit_fraud_col = col
        print(f"🔍 Found fraud indicator column: '{credit_fraud_col}'")
        break

# Display column information
print(f"\n📋 Fraud data columns ({len(fraud_df.columns)}):")
for i, col in enumerate(fraud_df.columns[:15], 1):
    print(f"  {i:2}. {col} ({fraud_df[col].dtype})")
if len(fraud_df.columns) > 15:
    print(f"  ... and {len(fraud_df.columns) - 15} more")

print(f"\n📋 Credit data columns ({len(credit_df.columns)}):")
for i, col in enumerate(credit_df.columns[:15], 1):
    print(f"  {i:2}. {col} ({credit_df[col].dtype})")
if len(credit_df.columns) > 15:
    print(f"  ... and {len(credit_df.columns) - 15} more")

📥 LOADING AND ANALYZING DATA
✅ Fraud data loaded: 151,112 rows × 12 columns
🔍 Found fraud indicator column: 'class'
✅ Credit card data loaded: 283,726 rows × 31 columns
🔍 Found fraud indicator column: 'Class'

📋 Fraud data columns (12):
   1. user_id (int64)
   2. signup_time (object)
   3. purchase_time (object)
   4. purchase_value (int64)
   5. device_id (object)
   6. source (object)
   7. browser (object)
   8. sex (object)
   9. age (int64)
  10. ip_address (float64)
  11. class (int64)
  12. country (object)

📋 Credit data columns (31):
   1. Time (float64)
   2. V1 (float64)
   3. V2 (float64)
   4. V3 (float64)
   5. V4 (float64)
   6. V5 (float64)
   7. V6 (float64)
   8. V7 (float64)
   9. V8 (float64)
  10. V9 (float64)
  11. V10 (float64)
  12. V11 (float64)
  13. V12 (float64)
  14. V13 (float64)
  15. V14 (float64)
  ... and 16 more


In [4]:
print("="*80)
print("📊 DETAILED CLASS DISTRIBUTION ANALYSIS")
print("="*80)

# Fraud data statistics
if fraud_col:
    fraud_counts = fraud_df[fraud_col].value_counts().sort_index()
    total_fraud = len(fraud_df)
    fraud_cases = fraud_counts.get(1, 0)
    legit_cases = fraud_counts.get(0, total_fraud - fraud_cases)
    
    fraud_percentage = (fraud_cases / total_fraud) * 100
    imbalance_ratio = legit_cases / fraud_cases if fraud_cases > 0 else float('inf')
    
    print(f"\n🔍 FRAUD DATA (E-commerce Transactions):")
    print(f"  • Total transactions: {total_fraud:,}")
    print(f"  • Legitimate cases: {legit_cases:,} ({100 - fraud_percentage:.2f}%)")
    print(f"  • Fraud cases: {fraud_cases:,} ({fraud_percentage:.2f}%)")
    print(f"  • Imbalance ratio: {imbalance_ratio:.1f}:1")
    print(f"  • For every fraud case, there are {imbalance_ratio:.0f} legitimate transactions")
    
    # Additional statistics
    print(f"\n📈 FRAUD DATA STATISTICS:")
    print(f"  • Data types distribution:")
    for dtype in fraud_df.dtypes.unique():
        cols = [col for col in fraud_df.columns if fraud_df[col].dtype == dtype]
        print(f"    - {dtype}: {len(cols)} columns")
    
    # Missing values
    missing = fraud_df.isnull().sum()
    if missing.sum() > 0:
        print(f"\n⚠️  Missing values found:")
        for col in missing[missing > 0].index[:5]:
            missing_pct = (missing[col] / total_fraud) * 100
            print(f"    • {col}: {missing[col]:,} ({missing_pct:.2f}%)")
    else:
        print(f"\n✅ No missing values in fraud data")

# Credit card data statistics
if credit_fraud_col:
    credit_counts = credit_df[credit_fraud_col].value_counts().sort_index()
    total_credit = len(credit_df)
    credit_fraud_cases = credit_counts.get(1, 0)
    credit_legit_cases = credit_counts.get(0, total_credit - credit_fraud_cases)
    
    credit_fraud_percentage = (credit_fraud_cases / total_credit) * 100
    credit_imbalance_ratio = credit_legit_cases / credit_fraud_cases if credit_fraud_cases > 0 else float('inf')
    
    print(f"\n💳 CREDIT CARD DATA:")
    print(f"  • Total transactions: {total_credit:,}")
    print(f"  • Legitimate cases: {credit_legit_cases:,} ({100 - credit_fraud_percentage:.4f}%)")
    print(f"  • Fraud cases: {credit_fraud_cases:,} ({credit_fraud_percentage:.4f}%)")
    print(f"  • Imbalance ratio: {credit_imbalance_ratio:.1f}:1")
    print(f"  • For every fraud case, there are {credit_imbalance_ratio:.0f} legitimate transactions")

📊 DETAILED CLASS DISTRIBUTION ANALYSIS

🔍 FRAUD DATA (E-commerce Transactions):
  • Total transactions: 151,112
  • Legitimate cases: 136,961 (90.64%)
  • Fraud cases: 14,151 (9.36%)
  • Imbalance ratio: 9.7:1
  • For every fraud case, there are 10 legitimate transactions

📈 FRAUD DATA STATISTICS:
  • Data types distribution:
    - int64: 4 columns
    - object: 7 columns
    - float64: 1 columns

✅ No missing values in fraud data

💳 CREDIT CARD DATA:
  • Total transactions: 283,726
  • Legitimate cases: 283,253 (99.8333%)
  • Fraud cases: 473 (0.1667%)
  • Imbalance ratio: 598.8:1
  • For every fraud case, there are 599 legitimate transactions
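
The class counts reported above translate directly into class weights, the third technique listed in the introduction. The sketch below assumes the common "balanced" heuristic, `n_samples / (n_classes * class_count)`, which is the formula scikit-learn documents for `class_weight='balanced'`. The helper name `balanced_class_weights` is hypothetical, and the counts are copied by hand from the output above.

```python
def balanced_class_weights(class_counts):
    """Return one weight per class using n_samples / (n_classes * class_count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }


# Counts copied from the class distribution output (0 = legitimate, 1 = fraud).
ecommerce_weights = balanced_class_weights({0: 136_961, 1: 14_151})
credit_weights = balanced_class_weights({0: 283_253, 1: 473})

for name, weights in [("E-commerce", ecommerce_weights), ("Credit card", credit_weights)]:
    print(f"{name}: legitimate={weights[0]:.3f}, fraud={weights[1]:.3f}")
```

Under this heuristic the ratio of the fraud weight to the legitimate weight equals the imbalance ratio, so each class contributes the same total weight to the loss. A dictionary like this could be passed as the `class_weight` argument of estimators that accept it, such as `RandomForestClassifier`.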


In [5]:
print("="*80)
print("📈 VISUALIZING EXTREME CLASS IMBALANCE")
print("="*80)

# Create comprehensive imbalance visualization
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=('E-commerce Fraud: Class Distribution',
                    'Credit Card Fraud: Class Distribution',
                    'Imbalance Ratio Comparison',
                    'Business Impact Analysis'),
    specs=[[{'type': 'pie'}, {'type': 'pie'}],
           [{'type': 'bar'}, {'type': 'bar'}]],
    vertical_spacing=0.15,
    horizontal_spacing=0.15
)

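# 1. Fraud data pie chart
if fraud_col:
    # Fix the order to [0, 1] so the counts always line up with the ['Legitimate', 'Fraud'] labels
    fraud_counts = fraud_df[fraud_col].value_counts().reindex([0, 1], fill_value=0)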
    fig.add_trace(
        go.Pie(labels=['Legitimate', 'Fraud'], 
               values=fraud_counts.values,
               hole=0.5,
               marker_colors=['#2ECC71', '#E74C3C'],
               textinfo='percent+label+value',
               textposition='inside',
               name='E-commerce',
               hovertemplate='<b>%{label}</b><br>Count: %{value:,}<br>Percentage: %{percent:.2%}<extra></extra>'),
        row=1, col=1
    )
    
    # Add annotation for imbalance ratio
    fig.add_annotation(
        x=0.12, y=0.5,
        text=f"Ratio: {imbalance_ratio:.0f}:1",
        showarrow=False,
        font=dict(size=16, color="red", family="Arial Black"),
        xref="paper",
        yref="paper"
    )

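# 2. Credit card data pie chart
if credit_fraud_col:
    # Fix the order to [0, 1] so the counts always line up with the ['Legitimate', 'Fraud'] labels
    credit_counts = credit_df[credit_fraud_col].value_counts().reindex([0, 1], fill_value=0)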
    fig.add_trace(
        go.Pie(labels=['Legitimate', 'Fraud'], 
               values=credit_counts.values,
               hole=0.5,
               marker_colors=['#2ECC71', '#E74C3C'],
               textinfo='percent+label+value',
               textposition='inside',
               name='Credit Card',
               hovertemplate='<b>%{label}</b><br>Count: %{value:,}<br>Percentage: %{percent:.4%}<extra></extra>'),
        row=1, col=2
    )
    
    # Add annotation for imbalance ratio
    fig.add_annotation(
        x=0.88, y=0.5,
        text=f"Ratio: {credit_imbalance_ratio:.0f}:1",
        showarrow=False,
        font=dict(size=16, color="red", family="Arial Black"),
        xref="paper",
        yref="paper"
    )

# 3. Imbalance ratio comparison bar chart
if fraud_col and credit_fraud_col:
    datasets = ['E-commerce', 'Credit Card']
    ratios = [imbalance_ratio, credit_imbalance_ratio]
    
    fig.add_trace(
        go.Bar(x=datasets,
               y=ratios,
               marker_color=['#3498DB', '#9B59B6'],
               text=[f"{r:.0f}:1" for r in ratios],
               textposition='auto',
               hovertemplate='<b>%{x}</b><br>Imbalance Ratio: %{text}<extra></extra>'),
        row=2, col=1
    )
    
    fig.update_yaxes(title_text="Imbalance Ratio (Legitimate: Fraud)", row=2, col=1)

# 4. Business impact analysis
business_costs = {
    'Fraud Loss': 100,          # Direct financial loss
    'False Positive Cost': 25,  # Customer service, manual review
    'False Negative Cost': 200, # Fraud + reputational damage
    'Customer Churn': 150       # Lost future revenue
}

fig.add_trace(
    go.Bar(x=list(business_costs.keys()),
           y=list(business_costs.values()),
           marker_color=['#E74C3C', '#F39C12', '#8E44AD', '#16A085'],
           text=[f"${v}" for v in business_costs.values()],
           textposition='auto',
           hovertemplate='<b>%{x}</b><br>Cost: %{text}<extra></extra>'),
    row=2, col=2
)

fig.update_layout(
    height=800,
    title_text="⚖️ EXTREME CLASS IMBALANCE ANALYSIS FOR FRAUD DETECTION",
    title_font=dict(size=22, family="Arial Black"),
    showlegend=False,
    paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)'
)

fig.update_xaxes(tickangle=45, row=2, col=2)

# Save visualization
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
viz_path = visualizations_dir / f"class_imbalance_analysis_{timestamp}.html"
fig.write_html(str(viz_path))
print(f"💾 Visualization saved: {viz_path}")

fig.show()

📈 VISUALIZING EXTREME CLASS IMBALANCE
💾 Visualization saved: D:\10 acadamy\fraud-detection-ml-system\outputs\data_analysis_processing\visualizations\class_imbalance_analysis_20251221_111643.html
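
The per-case costs plotted in the business impact panel can also be used to score a set of predictions. The sketch below is an illustration only: it reuses the placeholder values from the `business_costs` dictionary ($25 per false positive, $200 per false negative), assumes that correct predictions cost nothing, and uses the hypothetical helper names `expected_cost` and `break_even_threshold`.

```python
# Illustrative per-case costs copied from the `business_costs` dictionary.
FALSE_POSITIVE_COST = 25
FALSE_NEGATIVE_COST = 200


def expected_cost(fp, fn, fp_cost=FALSE_POSITIVE_COST, fn_cost=FALSE_NEGATIVE_COST):
    """Total cost of a set of predictions, counting only the two error types."""
    return fp * fp_cost + fn * fn_cost


def break_even_threshold(fp_cost=FALSE_POSITIVE_COST, fn_cost=FALSE_NEGATIVE_COST):
    """Fraud probability above which flagging is cheaper than not flagging,
    assuming correct decisions cost nothing."""
    return fp_cost / (fp_cost + fn_cost)


# E-commerce counts from the class distribution analysis.
legit_cases, fraud_cases = 136_961, 14_151

cost_flag_nothing = expected_cost(fp=0, fn=fraud_cases)
cost_flag_everything = expected_cost(fp=legit_cases, fn=0)

print(f"Flag nothing:    ${cost_flag_nothing:,}")
print(f"Flag everything: ${cost_flag_everything:,}")
print(f"Break-even fraud probability: {break_even_threshold():.3f}")
```

With these placeholder costs, flagging pays off once the estimated fraud probability exceeds 25 / (25 + 200), about 0.11, which is one reason the default 0.5 decision threshold is rarely the right choice for imbalanced, cost-sensitive problems.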


In [6]:
print("="*80)
print("⚠️ THE DANGER OF IMBALANCED DATA - MODEL SIMULATION")
print("="*80)

if fraud_col:
    print(f"\n🤖 NAIVE MODEL PERFORMANCE SIMULATION (E-commerce Data):")
    print("-"*50)
    
    # Simulation 1: Always predict legitimate
    print(f"\n1️⃣ Model that ALWAYS predicts 'Legitimate':")
    accuracy = (total_fraud - fraud_cases) / total_fraud * 100
    print(f"   📊 Accuracy: {accuracy:.2f}%")
    print(f"   🎯 Fraud Detection Rate: 0.00%")
    print(f"   🚫 False Positive Rate: 0.00%")
    print(f"   💰 Business Impact: 100% of fraud cases missed!")
    print(f"   ⚠️  Financial Loss: ${fraud_cases * 100:,} (assuming $100 avg fraud)")
    
    # Simulation 2: Always predict fraud
    print(f"\n2️⃣ Model that ALWAYS predicts 'Fraud':")
    accuracy = fraud_cases / total_fraud * 100
    print(f"   📊 Accuracy: {accuracy:.2f}%")
    print(f"   🎯 Fraud Detection Rate: 100.00%")
    print(f"   🚫 False Positive Rate: 100.00%")
    print(f"   💰 Business Impact: All {legit_cases:,} legitimate transactions blocked!")
    print(f"   ⚠️  Customer Churn Cost: ${legit_cases * 150:,}")
    
    # Simulation 3: Random prediction (50/50)
    print(f"\n3️⃣ Model that predicts RANDOMLY (50/50):")
    expected_fraud_detected = fraud_cases * 0.5
    expected_fp = legit_cases * 0.5
    accuracy = ((fraud_cases * 0.5) + (legit_cases * 0.5)) / total_fraud * 100
    print(f"   📊 Expected Accuracy: ~50.00%")
    print(f"   🎯 Expected Fraud Detection: {expected_fraud_detected:,.0f} cases")
    print(f"   🚫 Expected False Positives: {expected_fp:,.0f} cases")
    
    print(f"\n🎯 THE CHALLENGE:")
    print("   Need to balance multiple objectives:")
    print("   • ✅ Catch as many fraud cases as possible")
    print("   • ✅ Minimize false positives (don't block legitimate customers)")
    print("   • ✅ Optimize business costs (fraud loss vs customer churn)")
    print("   • ✅ Maintain customer experience")

⚠️ THE DANGER OF IMBALANCED DATA - MODEL SIMULATION

🤖 NAIVE MODEL PERFORMANCE SIMULATION (E-commerce Data):
--------------------------------------------------

1️⃣ Model that ALWAYS predicts 'Legitimate':
   📊 Accuracy: 90.64%
   🎯 Fraud Detection Rate: 0.00%
   🚫 False Positive Rate: 0.00%
   💰 Business Impact: 100% of fraud cases missed!
   ⚠️  Financial Loss: $1,415,100 (assuming $100 avg fraud)

2️⃣ Model that ALWAYS predicts 'Fraud':
   📊 Accuracy: 9.36%
   🎯 Fraud Detection Rate: 100.00%
   🚫 False Positive Rate: 100.00%
   💰 Business Impact: All 136,961 legitimate transactions blocked!
   ⚠️  Customer Churn Cost: $20,544,150

3️⃣ Model that predicts RANDOMLY (50/50):
   📊 Expected Accuracy: ~50.00%
   🎯 Expected Fraud Detection: 7,076 cases
   🚫 Expected False Positives: 68,480 cases

🎯 THE CHALLENGE:
   Need to balance multiple objectives:
   • ✅ Catch as many fraud cases as possible
   • ✅ Minimize false p