## 01_eda — Exploratory Data Analysis
This dataset contains anonymized credit card transactions, where the target variable
(Class) indicates whether a transaction is fraudulent (1) or normal (0).

## Dataset Loading and Initial Preview

In this section, we load the dataset and take an initial look at its structure
and sample records.

In [None]:
import pandas as pd
from pathlib import Path
...
import seaborn as sns
import matplotlib.pyplot as plt

# Assumes the notebook is run from a directory one level below the project root
PROJECT_ROOT = Path().resolve().parent
DATA_PATH = PROJECT_ROOT / "data" / "creditcard.csv"

# Load the dataset
df = pd.read_csv(DATA_PATH)

# Display the first few rows
df.head()

## Dataset Dimensions

This step checks the number of rows and columns in the dataset.

In [None]:
# Display dataset shape (rows, columns)
df.shape

## Data Types and Missing Values

Here we inspect data types and verify whether the dataset contains any missing values.

In [None]:
# Display column data types and non-null counts
df.info()
# Check for missing values
df.isna().sum()

## Feature Correlation with Target

In this section, we examine the correlation between numerical features and the
target variable to identify features that may carry predictive signal.

In [None]:
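# Rank features by absolute correlation with the target, excluding 'Class' itself,
# so that strong negative correlations are not pushed to the bottom of the ranking
corr_with_target = df.corr()['Class'].drop('Class')
corr_with_target = corr_with_target.sort_values(key=abs, ascending=False)

corr_with_target.head(10)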

In [None]:
# Visualize global correlation patterns across all features
corr = df.corr()

plt.figure(figsize=(12, 8))
sns.heatmap(
    corr,
    cmap="coolwarm",
    center=0,
    linewidths=0.1
)
plt.title("Feature Correlation Heatmap")
plt.show()

**Note:**
- Since most features are PCA-transformed (V1–V28), individual feature interpretability
  is limited. Correlation analysis and the heatmap are used to observe global patterns
  rather than explain specific feature behavior.
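- Individual correlations with the target are relatively low. Pearson correlation only
  captures linear association, so low values do not rule out complex, non-linear fraud
  patterns. Correlations among V1–V28 themselves should be close to zero, because PCA
  components are uncorrelated by construction.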
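
To illustrate why a low linear correlation does not rule out predictive signal, the sketch below uses purely synthetic data (nothing here is taken from `creditcard.csv`). The names `feature` and `label` and the labelling rule are hypothetical: a row is marked positive whenever the feature is extreme in either direction.

```python
import numpy as np
import pandas as pd

# Synthetic illustration only: these values do not come from the dataset.
rng = np.random.default_rng(seed=0)
feature = pd.Series(rng.normal(size=100_000), name="feature")

# Hypothetical rule: the label is 1 when the feature is extreme in either
# direction, so the label is fully determined by the feature.
label = (feature.abs() > 2).astype(int)

# Pearson correlation measures linear association, so it stays near zero here.
pearson = feature.corr(label)

# A simple non-linear transform of the same feature exposes the relationship.
pearson_abs = feature.abs().corr(label)

print(f"corr(feature, label)   = {pearson:.3f}")
print(f"corr(|feature|, label) = {pearson_abs:.3f}")
```

The first coefficient is close to zero even though the feature determines the label completely, which is why the correlation ranking is treated here as a rough screen rather than a measure of usefulness.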

## Data Quality Summary

- All features in the dataset are numeric, with no categorical variables requiring encoding.

- No missing values were detected across any columns, indicating a complete and clean dataset.

- The dataset is suitable for machine learning modeling, with preprocessing efforts
primarily focused on feature scaling and class imbalance handling rather than data cleaning.
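
The first two points can be checked programmatically. The following is a minimal sketch, assuming a hypothetical helper `summarize_quality` and a tiny made-up frame `sample` that stands in for `df`; in the notebook the helper would be called on `df` itself.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def summarize_quality(frame: pd.DataFrame) -> dict:
    """Return the non-numeric columns and the total number of missing values."""
    non_numeric = [col for col in frame.columns if not is_numeric_dtype(frame[col])]
    return {
        "non_numeric_columns": non_numeric,
        "missing_values": int(frame.isna().sum().sum()),
    }


# Made-up rows for illustration; not actual records from creditcard.csv.
sample = pd.DataFrame({
    "Time": [0.0, 1.0, 2.0],
    "V1": [-1.2, 0.8, 0.3],
    "Amount": [10.0, 2.5, 300.0],
    "Class": [0, 0, 1],
})

report = summarize_quality(sample)
print(report)
```

An empty `non_numeric_columns` list and a `missing_values` count of zero correspond to the two statements in the summary above.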

## Class Distribution

In this section, we examine the distribution of the target variable to understand
the balance between normal and fraudulent transactions.

In [None]:
# Display absolute class counts
df['Class'].value_counts()

In [None]:
# Display class proportions
df['Class'].value_counts(normalize=True)

In [None]:
# Plot class distribution using a logarithmic scale
class_counts = df['Class'].value_counts()

plt.figure()
class_counts.plot(kind='bar', logy=True)
plt.xlabel('Class (0 = Normal, 1 = Fraud)')
plt.ylabel('Count (log scale)')
plt.title('Class Distribution')
plt.show()

In [None]:
# Compare transaction time distribution between normal and fraudulent transactions
plt.figure(figsize=(10, 4))

plt.hist(
    df[df['Class'] == 0]['Time'],
    bins=100,
    alpha=0.6,
    label='Normal',
    log=True
)

plt.hist(
    df[df['Class'] == 1]['Time'],
    bins=100,
    alpha=0.6,
    label='Fraud',
    log=True
)

plt.xlabel("Time")
plt.ylabel("Frequency (log scale)")
plt.title("Transaction Time Distribution by Class")
plt.legend()
plt.show()


## Key Observations
- The class distribution shows a severe imbalance, with fraudulent transactions
  representing only 0.17% of the dataset.
- No strong temporal separation is observed between fraudulent and normal transactions.
- Temporal patterns may still provide weak signal when combined with other features.
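
One way to probe for weak temporal signal is to compare the fraud rate across hours. The sketch below is illustrative only: it runs on a synthetic frame `toy` rather than `df`, and it assumes that `Time` holds seconds elapsed since the first transaction, which the notebook does not state. Under that assumption the derived `hour` is an offset from the start of the recording, not a wall-clock hour.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df: two days of random timestamps, about 0.2% positives.
rng = np.random.default_rng(seed=42)
n_rows = 50_000
toy = pd.DataFrame({
    "Time": rng.uniform(0, 2 * 24 * 3600, size=n_rows),
    "Class": (rng.random(n_rows) < 0.002).astype(int),
})

# Assumption: Time is measured in seconds since the first transaction.
toy["hour"] = ((toy["Time"] // 3600) % 24).astype(int)

# The mean of a 0/1 label within each hour is the fraud rate for that hour.
hourly_fraud_rate = toy.groupby("hour")["Class"].mean()
print(hourly_fraud_rate.round(4))
```

Because the synthetic labels are independent of time, any variation between hours in this output is noise; on the real data, the same table would show whether fraud concentrates in particular hours.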

## Transaction Amount Distribution

In this section, we analyze the distribution of transaction amounts to understand
their overall range, skewness, and presence of extreme values.

In [None]:
# Plot the distribution of transaction amounts
plt.figure()
plt.hist(df['Amount'], bins=100)

# Axis labels and title
plt.xlabel('Transaction Amount')
plt.ylabel('Frequency')
plt.title('Transaction Amount Distribution')

# Use logarithmic scale to better visualize skewed data
plt.yscale('log')

plt.show()

## Key Observations
- The transaction amount distribution is highly skewed, with the majority of transactions
occurring at low amounts and a small number of very large transactions.

- The logarithmic scale reveals the presence of extreme values (outliers), which should be
considered during preprocessing and model evaluation.
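
A common way to tame this kind of skew is a `log1p` transform, which also handles zero amounts. The sketch below is a hedged illustration on synthetic, log-normally distributed amounts (the distribution parameters are arbitrary assumptions, not estimates from the dataset); it compares sample skewness before and after the transform.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed amounts; parameters are arbitrary, for illustration only.
rng = np.random.default_rng(seed=1)
amount = pd.Series(rng.lognormal(mean=3.0, sigma=1.5, size=20_000), name="Amount")

# Sample skewness of the raw values (large and positive for a long right tail).
raw_skew = amount.skew()

# log1p(x) = log(1 + x) compresses large values and is defined at x = 0.
log_amount = np.log1p(amount)
log_skew = log_amount.skew()

print(f"skewness before log1p: {raw_skew:.2f}")
print(f"skewness after log1p:  {log_skew:.2f}")
```

Whether a log transform or a scaler is the better choice is a preprocessing decision this notebook leaves open; the sketch only shows how the effect on skewness could be measured.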

## Transaction Amount by Class

This section compares transaction amounts between normal and fraudulent transactions
to identify potential differences in spending patterns.

In [None]:
# Compare transaction amounts across classes using a boxplot
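# Pass an explicit Axes: with `by=`, pandas otherwise opens its own figure,
# and a preceding plt.figure() call would leave an empty figure behind
fig, ax = plt.subplots()
df.boxplot(column='Amount', by='Class', ax=ax)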

# Axis labels and title
plt.xlabel('Class (0 = Normal, 1 = Fraud)')
plt.ylabel('Transaction Amount')
plt.title('Transaction Amount by Class')

# Remove the automatic pandas subtitle
plt.suptitle('')

plt.show()

## Key Observations
- The boxplot shows that both normal and fraudulent transactions exhibit a wide range of
transaction amounts, with the presence of extreme outliers in both classes.

- There is no clear separation between the two classes based on transaction amount alone,
suggesting that amount by itself is insufficient for reliable fraud detection.


### Transaction Amount Statistics by Class

To better understand how transaction amounts differ between normal and fraudulent
transactions, we examine summary statistics grouped by the target class.


In [None]:
df.groupby('Class')['Amount'].describe()

## Key Observations
- Although fraudulent transactions do not show a clearly distinct amount range,
summary statistics indicate differences in central tendency and dispersion,
suggesting that transaction amount may contribute useful signal when combined
with other features.
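
Because the mean and standard deviation reported by `describe()` are sensitive to extreme amounts, it can help to read central tendency and dispersion from quantiles instead. The sketch below uses a small made-up frame `toy` (not real transactions) to show one way of doing this per class.

```python
import pandas as pd

# Made-up amounts for illustration; not actual records from the dataset.
toy = pd.DataFrame({
    "Class":  [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
    "Amount": [5.0, 12.5, 20.0, 49.9, 80.0, 2500.0, 1.0, 1.0, 99.99, 310.0],
})

# Quartiles per class: rows are classes, columns are the requested quantiles.
quantiles = toy.groupby("Class")["Amount"].quantile([0.25, 0.5, 0.75]).unstack()

# Interquartile range as an outlier-resistant measure of dispersion.
iqr = quantiles[0.75] - quantiles[0.25]

print(quantiles)
print(iqr)
```

In this toy example the single 2500.0 value inflates the class 0 mean but leaves its median and interquartile range almost unaffected, which is the reason for preferring quantiles on heavily skewed amounts.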

## EDA Summary and Modeling Implications

- The dataset represents a highly imbalanced fraud detection problem, with fraudulent
  transactions accounting for a very small fraction of the data.
- Transaction amounts are heavily skewed and contain extreme outliers, requiring careful
  handling during preprocessing and model evaluation.
- No single feature provides clear separation between fraudulent and normal transactions,
  indicating that effective fraud detection will rely on combining multiple features.
- Due to the severe class imbalance, evaluation metrics such as precision, recall, and
  PR-AUC will be more informative than accuracy.
- Extreme transaction amounts were observed; however, outliers were not removed during EDA,
  as fraudulent transactions are often extreme by nature. Instead, scaling and robust
  evaluation metrics are preferred.
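
The point about accuracy can be made concrete with a trivial baseline. The sketch below builds synthetic labels with the same 0.17% fraud share reported earlier (17 positives in 10,000 rows; the rows themselves are made up) and scores a "model" that predicts normal for every transaction.

```python
import numpy as np

# Synthetic labels: 17 frauds among 10,000 transactions (0.17%).
n_total, n_fraud = 10_000, 17
y_true = np.zeros(n_total, dtype=int)
y_true[:n_fraud] = 1

# Trivial baseline: predict "normal" (0) for every transaction.
y_pred = np.zeros(n_total, dtype=int)

# Accuracy: share of predictions that match the true label.
accuracy = (y_pred == y_true).mean()

# Recall: share of actual frauds that were flagged as fraud.
true_positives = int(((y_pred == 1) & (y_true == 1)).sum())
recall = true_positives / int((y_true == 1).sum())

print(f"accuracy = {accuracy:.4f}")  # accuracy = 0.9983
print(f"recall   = {recall:.4f}")    # recall   = 0.0000
```

The baseline reaches 99.83% accuracy while detecting no fraud at all, which is why recall, precision, and PR-AUC are the more informative metrics here.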

## EDA Impact on Modeling Decisions

- The severe class imbalance motivates the use of class-aware training strategies and
  evaluation on predicted probabilities (for example, PR-AUC) rather than accuracy.
- The absence of clear single-feature separation supports the use of multivariate models
  rather than rule-based approaches.
- The heavily skewed transaction amount distribution motivates feature scaling during
  preprocessing.
- PCA-transformed features simplify preprocessing while limiting interpretability, making
  model evaluation and threshold tuning more critical than feature-level explanations.
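
As one possible example of a class-aware training strategy, the sketch below computes "balanced" class weights with the formula `n_samples / (n_classes * count_per_class)`, which is the heuristic scikit-learn documents for `class_weight="balanced"`. The notebook does not say which strategy will be used downstream, so this is an assumption for illustration, and the labels are synthetic.

```python
import numpy as np

# Synthetic labels with a 0.17% positive share; not the real target column.
y = np.array([0] * 9_983 + [1] * 17)

# Count how many samples fall into each class.
classes, counts = np.unique(y, return_counts=True)

# Balanced heuristic: rarer classes receive proportionally larger weights.
weights = len(y) / (len(classes) * counts)
class_weight = {int(c): float(w) for c, w in zip(classes, weights)}

print(class_weight)
```

A dictionary of this shape could be passed to estimators that accept a `class_weight` argument, so that errors on the rare fraud class are penalized more heavily during training.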