# Exploratory Data Analysis for Fraud Detection System
This notebook performs an initial exploration of the fraud detection dataset. The goal is to understand the dataset's characteristics and variable distributions, and to identify patterns or anomalies that could inform subsequent modeling.


In [None]:
import pandas as pd

# Load the dataset
file_path = '../data/dataset.csv'  # Update with your dataset path
data = pd.read_csv(file_path)

# Display the first few rows
data.head()


In [None]:
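# Summary statistics for the numeric columns.
# Only the last expression in a cell is displayed, so print this one explicitly.
print(data.describe())

# Column dtypes and non-null counts (info() prints its own output)
data.info()

# Number of missing values per column
data.isnull().sum()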


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

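# Class Distribution
# Pass the column by name; recent seaborn versions treat a lone positional argument as `data`.
sns.countplot(x='Class', data=data)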
plt.title('Class Distribution (0: Non-Fraud, 1: Fraud)')
plt.show()

# Time Analysis
plt.figure(figsize=(10,6))
plt.title('Distribution of Transaction Time')
sns.histplot(data['Time'], bins=40)
plt.show()

# Amount Analysis
plt.figure(figsize=(10,6))
plt.title('Distribution of Transaction Amount')
sns.histplot(data['Amount'], bins=40)
plt.show()

# Feature Correlation
plt.figure(figsize=(12,10))
sns.heatmap(data.corr(), cmap='coolwarm', annot=False)
plt.title('Feature Correlation Heatmap')
plt.show()


In [None]:
# Histograms for a selection of features (V1, V2, V3, V4)
features = ['V1', 'V2', 'V3', 'V4']
data[features].hist(bins=20, figsize=(15, 6), layout=(2, 2))
plt.tight_layout()  # keep subplot titles and tick labels from overlapping
plt.show()


## Preliminary Observations
- **Class Imbalance**: Fraudulent transactions are far less frequent than non-fraudulent ones.
- **Time Feature**: The histogram of transaction time shows no clear trend or pattern.
- **Amount Feature**: Most transactions are small, but the distribution has a long right tail with a few very high amounts.
- **Feature Correlations**: Most features are not strongly correlated with one another, which is expected for PCA-transformed features because principal components are uncorrelated by construction.
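
The observations above are read off the plots. A minimal sketch of how they could be backed with numbers follows. The helper names (`class_balance`, `iqr_outlier_count`, `strongest_correlations`) are hypothetical and not part of the original notebook, and the 1.5 × IQR fence is just the conventional box-plot rule, not a threshold taken from this analysis.

```python
import numpy as np
import pandas as pd


def class_balance(df: pd.DataFrame, target: str = "Class") -> pd.DataFrame:
    """Return the count and percentage share of each class label."""
    counts = df[target].value_counts().sort_index()
    percent = (counts / counts.sum() * 100).round(3)
    return pd.DataFrame({"count": counts, "percent": percent})


def iqr_outlier_count(series: pd.Series, k: float = 1.5) -> int:
    """Count values above Q3 + k * IQR (the usual box-plot upper fence)."""
    q1, q3 = series.quantile([0.25, 0.75])
    upper_fence = q3 + k * (q3 - q1)
    return int((series > upper_fence).sum())


def strongest_correlations(df: pd.DataFrame, top: int = 5) -> pd.Series:
    """Return the `top` largest absolute pairwise correlations."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair appears once
    # and the diagonal (always 1.0) is excluded.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(ascending=False).head(top)


# Possible usage with the DataFrame loaded earlier in this notebook:
# print(class_balance(data))
# print(iqr_outlier_count(data["Amount"]))
# print(strongest_correlations(data.drop(columns=["Class"])))
```

If the reported class percentages or outlier counts disagree with the impression from the plots, the bin counts and axis scales of the plots are worth revisiting.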


## Conclusion
This exploratory analysis establishes the dataset's structure and key distributions: a strong class imbalance, a right-skewed `Amount` feature, and weakly correlated features. The next steps are data preprocessing, feature engineering, and model selection informed by these observations.
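
As one illustration of what a preprocessing step might look like, the sketch below log-transforms a skewed amount column. It is an assumption, not a step the notebook prescribes: the function `add_log_amount` and the column name `Amount_log` are hypothetical, and `log1p` only makes sense if amounts are non-negative.

```python
import numpy as np
import pandas as pd


def add_log_amount(
    df: pd.DataFrame,
    column: str = "Amount",
    new_column: str = "Amount_log",  # hypothetical name, chosen for illustration
) -> pd.DataFrame:
    """Return a copy of `df` with log(1 + amount) added as a new column.

    log1p compresses the long right tail while keeping zero amounts at zero.
    Assumes the values in `column` are non-negative.
    """
    out = df.copy()
    out[new_column] = np.log1p(out[column])
    return out


# Possible usage with the DataFrame loaded earlier in this notebook:
# data_prepared = add_log_amount(data)
# data_prepared["Amount_log"].hist(bins=40)
```

Returning a copy leaves the original `data` frame untouched, so the plots above can be re-run without side effects.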
