# Exploratory Data Analysis (EDA) for Fraud Detection Project

This notebook performs exploratory data analysis on the `Fraud_Data.csv` and `creditcard.csv` datasets to understand their structure, identify patterns, and highlight class imbalance. The goal is to generate insights and visualizations for the fraud detection project, fulfilling the requirements for Task 1 (Data Analysis and Preprocessing).

## Objectives
- Analyze class distribution to understand the imbalance.
- Perform univariate analysis (distributions of key features: `purchase_value`, `age`, `Amount`).
- Perform bivariate analysis (relationships between features and target `class`/`Class`).
- Analyze categorical features (`source`, `browser`, `sex`) in `Fraud_Data`.
- Generate a correlation heatmap for the `creditcard` dataset.
- Save visualizations to the `plots/` directory for inclusion in the project report.

In [1]:
# Import necessary libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import os

# Create the plots directory if it doesn't exist
output_dir = '../plots'
os.makedirs(output_dir, exist_ok=True)

# Set seaborn style for better visuals
sns.set_theme(style='whitegrid')

# Load datasets
fraud_df = pd.read_csv('../data/Fraud_Data.csv')
creditcard_df = pd.read_csv('../data/creditcard.csv')

# Display basic info about the datasets
print('Fraud_Data Info:')
print(fraud_df.info())
print('\nCreditcard Info:')
print(creditcard_df.info())

Fraud_Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 151112 entries, 0 to 151111
Data columns (total 11 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   user_id         151112 non-null  int64  
 1   signup_time     151112 non-null  object 
 2   purchase_time   151112 non-null  object 
 3   purchase_value  151112 non-null  int64  
 4   device_id       151112 non-null  object 
 5   source          151112 non-null  object 
 6   browser         151112 non-null  object 
 7   sex             151112 non-null  object 
 8   age             151112 non-null  int64  
 9   ip_address      151112 non-null  float64
 10  class           151112 non-null  int64  
dtypes: float64(1), int64(4), object(6)
memory usage: 12.7+ MB
None

Creditcard Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 

## Class Distribution Analysis

Both datasets have imbalanced classes (few fraudulent transactions). We visualize the distribution of the target variables (`class` for `Fraud_Data`, `Class` for `creditcard`) to quantify the imbalance, which will inform the use of techniques like SMOTE and metrics like AUC-PR and F1-Score.

In [3]:
# Class distribution for Fraud_Data
plt.figure(figsize=(8, 6))
sns.countplot(x='class', data=fraud_df)
plt.title('Class Distribution in Fraud_Data (0: Non-Fraud, 1: Fraud)')
plt.xlabel('Class')
plt.ylabel('Count')
plt.savefig(os.path.join(output_dir, 'fraud_class_distribution.png'))
plt.close()  # Close the plot to free memory

# Class distribution for creditcard
plt.figure(figsize=(8, 6))
sns.countplot(x='Class', data=creditcard_df)
plt.title('Class Distribution in Creditcard (0: Non-Fraud, 1: Fraud)')
plt.xlabel('Class')
plt.ylabel('Count')
plt.savefig(os.path.join(output_dir, 'creditcard_class_distribution.png'))
plt.close()

# Print class imbalance ratios
fraud_ratio = fraud_df['class'].value_counts(normalize=True)
creditcard_ratio = creditcard_df['Class'].value_counts(normalize=True)
print('Fraud_Data Class Distribution:\n', fraud_ratio)
print('\nCreditcard Class Distribution:\n', creditcard_ratio)

Fraud_Data Class Distribution:
 class
0    0.906354
1    0.093646
Name: proportion, dtype: float64

Creditcard Class Distribution:
 Class
0    0.998273
1    0.001727
Name: proportion, dtype: float64
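
To make the imbalance concrete, the sketch below takes the proportions printed above and derives two quick numbers: the accuracy of a trivial model that always predicts non-fraud, and the non-fraud to fraud ratio. The helper name `imbalance_summary` is hypothetical and introduced only for illustration.

```python
# Class proportions copied from the printed output above.
class_proportions = {
    "Fraud_Data": {0: 0.906354, 1: 0.093646},
    "creditcard": {0: 0.998273, 1: 0.001727},
}


def imbalance_summary(proportions):
    """Return (majority-class baseline accuracy, non-fraud:fraud ratio)."""
    majority_accuracy = max(proportions.values())
    ratio = proportions[0] / proportions[1]
    return majority_accuracy, ratio


for name, props in class_proportions.items():
    acc, ratio = imbalance_summary(props)
    print(
        f"{name}: always predicting non-fraud gives {acc:.1%} accuracy "
        f"(about {ratio:.0f} non-fraud rows per fraud row)"
    )
```

Because a model that never flags fraud already reaches these accuracies, accuracy alone says little here, which is why metrics such as AUC-PR and F1-Score are the more informative choice.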


## Univariate Analysis

We analyze the distributions of key numerical features (`purchase_value` and `age` for `Fraud_Data`, `Amount` for `creditcard`) to understand their ranges and shapes. This helps identify skewed features, which may benefit from a transformation (e.g., a log transform) before scaling (e.g., StandardScaler), since standardization alone does not change a distribution's shape.

In [4]:
# Purchase value distribution (Fraud_Data)
plt.figure(figsize=(8, 6))
sns.histplot(fraud_df['purchase_value'], bins=30, kde=True)
plt.title('Purchase Value Distribution (Fraud_Data)')
plt.xlabel('Purchase Value ($)')
plt.ylabel('Count')
plt.savefig(os.path.join(output_dir, 'purchase_value_distribution.png'))
plt.close()

# Age distribution (Fraud_Data)
plt.figure(figsize=(8, 6))
sns.histplot(fraud_df['age'], bins=30, kde=True)
plt.title('Age Distribution (Fraud_Data)')
plt.xlabel('Age')
plt.ylabel('Count')
plt.savefig(os.path.join(output_dir, 'age_distribution.png'))
plt.close()

# Amount distribution (creditcard)
plt.figure(figsize=(8, 6))
sns.histplot(creditcard_df['Amount'], bins=30, kde=True)
plt.title('Transaction Amount Distribution (Creditcard)')
plt.xlabel('Amount ($)')
plt.ylabel('Count')
plt.savefig(os.path.join(output_dir, 'creditcard_amount_distribution.png'))
plt.close()
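
Skewness can also be checked numerically rather than only read off a histogram. The following is a minimal sketch on synthetic data: the log-normal sample is an assumption standing in for a right-skewed column such as `Amount`, and is not drawn from the real datasets.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a right-skewed monetary column (assumption);
# with the real data this would be creditcard_df['Amount'].
rng = np.random.default_rng(42)
amount = pd.Series(rng.lognormal(mean=3.0, sigma=1.5, size=10_000), name="Amount")

# Sample skewness of the raw values, after log1p, and after standardization.
raw_skew = amount.skew()
log_skew = np.log1p(amount).skew()
standardized_skew = ((amount - amount.mean()) / amount.std()).skew()

print(f"skew (raw):          {raw_skew:.2f}")
print(f"skew (log1p):        {log_skew:.2f}")
print(f"skew (standardized): {standardized_skew:.2f}")
```

Standardizing is only a shift and rescale, so it leaves skewness unchanged, whereas `log1p` compresses the long right tail. Whether a transformation is actually needed depends on the model used downstream.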

## Bivariate Analysis

We explore relationships between numerical features and the target variable to identify patterns associated with fraud. Boxplots are used to compare `purchase_value` and `age` vs. `class` in `Fraud_Data`, and `Amount` vs. `Class` in `creditcard`.

In [5]:
# Purchase value vs. Class (Fraud_Data)
plt.figure(figsize=(8, 6))
sns.boxplot(x='class', y='purchase_value', data=fraud_df)
plt.title('Purchase Value vs. Class (Fraud_Data)')
plt.xlabel('Class (0: Non-Fraud, 1: Fraud)')
plt.ylabel('Purchase Value ($)')
plt.savefig(os.path.join(output_dir, 'purchase_value_vs_class.png'))
plt.close()

# Age vs. Class (Fraud_Data)
plt.figure(figsize=(8, 6))
sns.boxplot(x='class', y='age', data=fraud_df)
plt.title('Age vs. Class (Fraud_Data)')
plt.xlabel('Class (0: Non-Fraud, 1: Fraud)')
plt.ylabel('Age')
plt.savefig(os.path.join(output_dir, 'age_vs_class.png'))
plt.close()

# Amount vs. Class (creditcard)
plt.figure(figsize=(8, 6))
sns.boxplot(x='Class', y='Amount', data=creditcard_df)
plt.title('Transaction Amount vs. Class (Creditcard)')
plt.xlabel('Class (0: Non-Fraud, 1: Fraud)')
plt.ylabel('Amount ($)')
plt.savefig(os.path.join(output_dir, 'creditcard_amount_vs_class.png'))
plt.close()
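
Boxplots show the comparison visually; a grouped summary table gives the same comparison as numbers. The sketch below uses a tiny made-up frame (the values are illustrative, not taken from `Fraud_Data`) to show the pattern.

```python
import pandas as pd

# Toy data for illustration only; with the real data, use fraud_df instead.
toy = pd.DataFrame({
    "class": [0, 0, 0, 0, 1, 1],
    "purchase_value": [10, 20, 30, 40, 25, 35],
})

# Per-class count, median, and mean of the numeric feature.
summary = toy.groupby("class")["purchase_value"].agg(["count", "median", "mean"])
print(summary)
```

The same call applied to `age` or to `Amount` grouped by `Class` would complement the corresponding boxplots.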

## Correlation Analysis (Creditcard Dataset)

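The `creditcard.csv` dataset contains anonymized PCA features (V1-V28). We use a correlation heatmap to identify relationships between features and the target `Class`, guiding feature selection. Pearson correlation captures only linear association, so a low value does not by itself rule a feature out.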

In [6]:
# Correlation heatmap for creditcard
plt.figure(figsize=(12, 8))
corr = creditcard_df.corr()
sns.heatmap(corr, cmap='coolwarm', annot=False, vmin=-1, vmax=1)
plt.title('Correlation Heatmap for Creditcard Dataset')
plt.savefig(os.path.join(output_dir, 'creditcard_correlation_heatmap.png'))
plt.close()

# Print correlations with Class
print('Correlations with Class (Creditcard):')
print(corr['Class'].sort_values(ascending=False))

Correlations with Class (Creditcard):
Class     1.000000
V11       0.154876
V4        0.133447
V2        0.091289
V21       0.040413
V19       0.034783
V20       0.020090
V8        0.019875
V27       0.017580
V28       0.009536
Amount    0.005632
V26       0.004455
V25       0.003308
V22       0.000805
V23      -0.002685
V15      -0.004223
V13      -0.004570
V24      -0.007221
Time     -0.012323
V6       -0.043643
V5       -0.094974
V9       -0.097733
V1       -0.101347
V18      -0.111485
V7       -0.187257
V3       -0.192961
V16      -0.196539
V10      -0.216883
V12      -0.260593
V14      -0.302544
V17      -0.326481
Name: Class, dtype: float64
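
Since strongly negative correlations are as informative as strongly positive ones, it can help to rank features by absolute value. The sketch below does this for a handful of values copied from the output above (not the full list); with the real data, `corr['Class'].drop('Class')` could be used in place of the hand-built Series.

```python
import pandas as pd

# A subset of the correlations printed above, copied by hand.
corr_with_class = pd.Series({
    "V11": 0.154876,
    "V4": 0.133447,
    "Amount": 0.005632,
    "V10": -0.216883,
    "V12": -0.260593,
    "V14": -0.302544,
    "V17": -0.326481,
})

# Rank by strength of linear association, ignoring sign.
ranked = corr_with_class.abs().sort_values(ascending=False)
top_features = ranked.head(3).index.tolist()
print(top_features)
```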


## Categorical Feature Analysis (Fraud_Data)

We analyze categorical features (`source`, `browser`, `sex`) in `Fraud_Data` to see their distribution and relationship with `class`. This informs feature engineering (e.g., one-hot encoding).

In [7]:
# Source vs. Class (Fraud_Data)
plt.figure(figsize=(8, 6))
sns.countplot(x='source', hue='class', data=fraud_df)
plt.title('Source vs. Class (Fraud_Data)')
plt.xlabel('Source')
plt.ylabel('Count')
plt.savefig(os.path.join(output_dir, 'source_vs_class.png'))
plt.close()

# Browser vs. Class (Fraud_Data)
plt.figure(figsize=(8, 6))
sns.countplot(x='browser', hue='class', data=fraud_df)
plt.title('Browser vs. Class (Fraud_Data)')
plt.xlabel('Browser')
plt.ylabel('Count')
plt.savefig(os.path.join(output_dir, 'browser_vs_class.png'))
plt.close()

# Sex vs. Class (Fraud_Data)
plt.figure(figsize=(8, 6))
sns.countplot(x='sex', hue='class', data=fraud_df)
plt.title('Sex vs. Class (Fraud_Data)')
plt.xlabel('Sex')
plt.ylabel('Count')
plt.savefig(os.path.join(output_dir, 'sex_vs_class.png'))
plt.close()
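
With imbalanced classes, raw counts per category are dominated by the non-fraud bars, so the fraud rate per category is often easier to compare. The sketch below uses a made-up frame (values are illustrative, not from `Fraud_Data`); because `class` is coded 0/1, its mean within each group is the fraud rate.

```python
import pandas as pd

# Toy data for illustration only; with the real data, use fraud_df instead.
toy = pd.DataFrame({
    "browser": ["Chrome", "Chrome", "Chrome", "Chrome", "Safari", "Safari"],
    "class": [0, 0, 0, 1, 0, 1],
})

# Fraud rate and group size per category, highest rate first.
fraud_rate = (
    toy.groupby("browser")["class"]
    .agg(fraud_rate="mean", n="size")
    .sort_values("fraud_rate", ascending=False)
)
print(fraud_rate)
```

Keeping the group size `n` next to the rate helps avoid over-reading a high rate that comes from very few rows.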

## Summary of Findings

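* Class Imbalance: Fraud makes up about 9.4% of `Fraud_Data` and about 0.17% of `creditcard`, justifying resampling (e.g., SMOTE) and imbalance-aware metrics (AUC-PR, F1-Score).
* Feature Distributions: `purchase_value` and `Amount` are skewed, suggesting a transformation or scaling; `age` may indicate fraud-prone groups.
* Bivariate Insights: Differences in `purchase_value` or `Amount` between classes hint at predictive potential.
* Correlations: Several PCA features in `creditcard` correlate with `Class`, most strongly V17 (-0.33), V14 (-0.30), V12 (-0.26), and V10 (-0.22) on the negative side and V11 (0.15) and V4 (0.13) on the positive side, aiding feature selection; `Amount` is almost uncorrelated (0.006).
* Categorical Features: Patterns in `source`, `browser`, and `sex` vs. `class` suggest feature engineering (e.g., one-hot encoding).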