# Credit Card Fraud Detection - Exploratory Data Analysis

This notebook provides an interactive exploration of the credit card fraud detection dataset and demonstrates the model training and evaluation process.

## Dataset Information

The dataset contains transactions made by credit cards in September 2013 by European cardholders.
It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions.
The dataset is highly imbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
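
The imbalance can be checked with a few lines of arithmetic. The sketch below uses only the two counts quoted above; the "always predict legitimate" baseline is an illustrative assumption, not a model trained in this notebook.

```python
# Counts quoted in the dataset description above
n_total = 284_807
n_fraud = 492

fraud_pct = n_fraud / n_total * 100

# Accuracy of a trivial classifier that labels every transaction as legitimate
baseline_accuracy_pct = (n_total - n_fraud) / n_total * 100

print(f"Fraud share: {fraud_pct:.4f}%")
print(f"Accuracy of always predicting 'legitimate': {baseline_accuracy_pct:.4f}%")
```

Because such a baseline already scores above 99.8% accuracy, accuracy alone says little on this dataset; precision and recall on the fraud class are more informative.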

### Features:
- **Time**: Number of seconds elapsed between this transaction and the first transaction in the dataset
- **V1-V28**: Principal components obtained with PCA (anonymized features)
- **Amount**: Transaction amount
- **Class**: Response variable (1 for fraud, 0 for legitimate)
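
Since `Time` is an offset in seconds, one common way to make it easier to read is to fold it into a 24-hour cycle. The following is a minimal sketch on made-up values; the `Hour` column is a hypothetical derived feature that does not exist in the dataset.

```python
import pandas as pd

# Made-up Time values (seconds since the first transaction), for illustration only
toy = pd.DataFrame({"Time": [0, 3_599, 3_600, 86_399, 86_400, 172_000]})

# Hours elapsed since the first transaction, wrapped to a 24-hour cycle.
# This is an offset from the first transaction, not a true clock hour,
# because the description does not say when the first transaction occurred.
toy["Hour"] = (toy["Time"] // 3600) % 24

print(toy)
```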

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
import warnings
warnings.filterwarnings('ignore')

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

## 1. Load and Explore the Dataset

In [None]:
# Load the dataset
df = pd.read_csv('creditcard.csv')

# Display basic information
print("Dataset Shape:", df.shape)
print("\nFirst few rows:")
df.head()

In [None]:
# Dataset information
df.info()

In [None]:
# Statistical summary
df.describe()

In [None]:
# Check for missing values
print("Missing values:")
print(df.isnull().sum())

## 2. Class Distribution Analysis

In [None]:
# Class distribution
class_counts = df['Class'].value_counts().sort_index()  # fixed order: 0 (normal), 1 (fraud)
print("Class Distribution:")
print(class_counts)
print(f"\nPercentage of Fraudulent Transactions: {(class_counts[1] / len(df)) * 100:.4f}%")

# Visualize class distribution
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Bar plot
class_counts.plot(kind='bar', ax=ax1, color=['green', 'red'], alpha=0.7)
ax1.set_title('Class Distribution', fontsize=14, fontweight='bold')
ax1.set_xlabel('Class (0: Normal, 1: Fraud)')
ax1.set_ylabel('Count')
ax1.set_xticklabels(['Normal', 'Fraud'], rotation=0)

# Pie chart
ax2.pie(class_counts, labels=['Normal', 'Fraud'], autopct='%1.4f%%', 
        colors=['green', 'red'], startangle=90)
ax2.set_title('Class Distribution (Percentage)', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()
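
With so few positives, a plain random split can leave the test set with noticeably more or fewer frauds than the training set. One common remedy is a stratified split. The sketch below runs `train_test_split` with `stratify` on made-up data; whether the `fraud_detection` module splits the data this way is not shown in this notebook, and `X_toy` and `y_toy` are hypothetical names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data with a 1% positive rate (made up; not the real dataset)
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(10_000, 3))
y_toy = np.zeros(10_000, dtype=int)
y_toy[:100] = 1

# stratify=y_toy keeps the class ratio the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

print(f"Train fraud rate: {y_train.mean():.4f}")
print(f"Test fraud rate:  {y_test.mean():.4f}")
```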

## 3. Feature Analysis

In [None]:
# Time distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Normal transactions
axes[0].hist(df[df['Class'] == 0]['Time'], bins=50, alpha=0.7, color='green')
axes[0].set_title('Normal Transactions - Time Distribution')
axes[0].set_xlabel('Time (seconds)')
axes[0].set_ylabel('Frequency')

# Fraudulent transactions
axes[1].hist(df[df['Class'] == 1]['Time'], bins=50, alpha=0.7, color='red')
axes[1].set_title('Fraudulent Transactions - Time Distribution')
axes[1].set_xlabel('Time (seconds)')
axes[1].set_ylabel('Frequency')

plt.tight_layout()
plt.show()

In [None]:
# Amount distribution
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Normal transactions
axes[0, 0].hist(df[df['Class'] == 0]['Amount'], bins=50, alpha=0.7, color='green')
axes[0, 0].set_title('Normal Transactions - Amount Distribution')
axes[0, 0].set_xlabel('Amount')
axes[0, 0].set_ylabel('Frequency')

# Fraudulent transactions
axes[0, 1].hist(df[df['Class'] == 1]['Amount'], bins=50, alpha=0.7, color='red')
axes[0, 1].set_title('Fraudulent Transactions - Amount Distribution')
axes[0, 1].set_xlabel('Amount')
axes[0, 1].set_ylabel('Frequency')

# Box plots
df.boxplot(column='Amount', by='Class', ax=axes[1, 0])
fig.suptitle('')  # remove the "Boxplot grouped by Class" title that pandas adds to the figure
axes[1, 0].set_title('Amount by Class')
axes[1, 0].set_xlabel('Class (0: Normal, 1: Fraud)')
axes[1, 0].set_ylabel('Amount')

# Statistics
stats_text = f"""Normal Transactions Amount:
Mean: ${df[df['Class'] == 0]['Amount'].mean():.2f}
Median: ${df[df['Class'] == 0]['Amount'].median():.2f}
Std: ${df[df['Class'] == 0]['Amount'].std():.2f}

Fraudulent Transactions Amount:
Mean: ${df[df['Class'] == 1]['Amount'].mean():.2f}
Median: ${df[df['Class'] == 1]['Amount'].median():.2f}
Std: ${df[df['Class'] == 1]['Amount'].std():.2f}"""

axes[1, 1].text(0.1, 0.5, stats_text, fontsize=12, verticalalignment='center')
axes[1, 1].axis('off')

plt.tight_layout()
plt.show()
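
If `Amount` turns out to be strongly right-skewed, as transaction amounts often are, a few large values will dominate the raw histograms. A log transform is a common way to spread out the bulk of the distribution. This is a minimal sketch on made-up amounts, not values taken from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up amounts with a long right tail, for illustration only
amounts = pd.Series([0.0, 1.0, 9.99, 22.0, 88.35, 250.0, 2_125.87, 25_691.16])

# log1p(x) = log(1 + x) is defined at 0, unlike log(x)
log_amounts = np.log1p(amounts)

print(pd.DataFrame({"Amount": amounts, "log1p(Amount)": log_amounts.round(3)}))
```

The same transform could be applied to `df['Amount']` before plotting, for example `np.log1p(df['Amount'])`.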

## 4. Correlation Analysis

In [None]:
# Correlation with target variable
correlations = df.corr()['Class'].sort_values(ascending=False)
print("Top 10 features most positively correlated with Class:")
print(correlations.drop('Class').head(10))  # drop Class itself (correlation 1.0)

print("\n10 features with the lowest (most negative) correlation with Class:")
print(correlations.tail(10))

In [None]:
# Visualize correlation with target
plt.figure(figsize=(10, 12))
correlations.drop('Class').plot(kind='barh')  # exclude Class itself
plt.title('Feature Correlation with Fraud (Class)', fontsize=14, fontweight='bold')
plt.xlabel('Correlation Coefficient')
plt.ylabel('Features')
plt.axvline(x=0, color='black', linestyle='--', linewidth=0.5)
plt.tight_layout()
plt.show()

## 5. Train Models Using the fraud_detection Module

In [None]:
# Import the fraud detection module
from fraud_detection import FraudDetectionModel

# Create and run the model
fraud_detector = FraudDetectionModel(data_path='creditcard.csv')
fraud_detector.run_full_pipeline(use_smote=True)
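
The `use_smote=True` flag suggests that the pipeline oversamples the minority class with SMOTE (Synthetic Minority Over-sampling Technique), although the internals of the `fraud_detection` module are not shown here. The core idea of SMOTE is to create synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The sketch below illustrates only that idea on made-up 2-D points; `smote_like_sample` is a hypothetical helper, not the module's code or a library API.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up minority-class points in 2-D (illustration only)
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [3.0, 3.0], [3.1, 2.9]])

def smote_like_sample(points, k, rng):
    """Create one synthetic point between a random point and one of its k nearest neighbours."""
    i = rng.integers(len(points))
    # Euclidean distances from the chosen point to every minority point
    dists = np.linalg.norm(points - points[i], axis=1)
    # Position 0 of the sorted order is the point itself (distance 0), so skip it.
    # This assumes the points contain no exact duplicates.
    neighbours = np.argsort(dists)[1:k + 1]
    j = rng.choice(neighbours)
    lam = rng.random()  # interpolation weight in [0, 1)
    return points[i] + lam * (points[j] - points[i])

synthetic = np.array([smote_like_sample(minority, k=2, rng=rng) for _ in range(5)])
print(synthetic)
```

Oversampling of this kind is normally applied to the training split only, so that the test set keeps the original class ratio.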

## 6. Make Predictions on New Data

In [None]:
# Load the prediction module
from predict import FraudPredictor

# Initialize predictor
predictor = FraudPredictor()
predictor.load_model()

# Make a sample prediction
sample_transaction = df.drop('Class', axis=1).iloc[0].values
result = predictor.predict_single(sample_transaction)
predictor.display_prediction(result)

## 7. Batch Predictions

In [None]:
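# Make predictions on a random sample of 1,000 transactions
# (drawn from the full dataset, so it may include rows seen during training)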
test_sample = df.sample(1000, random_state=42)
predictions = predictor.predict_batch(test_sample)

# Display high-risk transactions
high_risk = predictions[predictions['Risk_Level'].isin(['High', 'Critical'])]
print(f"\nHigh-risk transactions found: {len(high_risk)}")
print("\nSample of high-risk transactions:")
high_risk[['Time', 'Amount', 'Class', 'Predicted_Class', 'Fraud_Probability', 'Risk_Level']].head(10)
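
The table above shows `Class` next to `Predicted_Class`, so the two columns can be compared directly. As a minimal sketch, assuming both columns hold 0/1 labels, precision and recall for the fraud class can be computed as follows. The data here is made up and stands in for the real `predictions` frame.

```python
import pandas as pd

# Made-up labels standing in for the `Class` and `Predicted_Class` columns
toy = pd.DataFrame({
    "Class":           [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
    "Predicted_Class": [0, 0, 0, 0, 1, 0, 1, 1, 0, 1],
})

tp = ((toy["Class"] == 1) & (toy["Predicted_Class"] == 1)).sum()  # frauds flagged
fp = ((toy["Class"] == 0) & (toy["Predicted_Class"] == 1)).sum()  # false alarms
fn = ((toy["Class"] == 1) & (toy["Predicted_Class"] == 0)).sum()  # frauds missed

precision = tp / (tp + fp)  # share of flagged transactions that are fraud
recall = tp / (tp + fn)     # share of frauds that were flagged

print(pd.crosstab(toy["Class"], toy["Predicted_Class"]))
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}")
```

If the sampled rows overlap the data the model was trained on, metrics computed this way are likely to be optimistic.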

## Conclusion

This notebook demonstrated:
1. Loading and exploring the credit card fraud detection dataset
2. Analyzing class distribution and feature characteristics
3. Training machine learning models for fraud detection
4. Making predictions on new transactions
5. Evaluating model performance

The trained models can now be used to score new transactions for potential fraud.