In [1]:
#! mkdir -p src data scripts data/raw data/processed src/eda/ src/data_processing/ src/utils src/feature_engineering

### Google Colab Setup for RAPIDS/cuDF
Install RAPIDS (which includes cuDF).

In [2]:
# This cell installs the RAPIDS library, including cuDF, on Google Colab.
# It's recommended to run this at the start of your notebook.

import sys
import os
import shutil

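# Check if an NVIDIA GPU is available.
# `cuda_version` is not a valid --query-gpu field, so only the name and driver version are queried.
gpu_info = !nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
if shutil.which("nvidia-smi") is None or not gpu_info: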
    print("No GPU found. Please ensure you have a GPU runtime selected (Runtime -> Change runtime type -> GPU).")
else:
    print(f"GPU detected: {gpu_info[0]}")

    # Install RAPIDS (cuDF, cuML, etc.)
    # The command below installs the cuDF wheel for CUDA 12 from NVIDIA's package index.
    # No version is pinned, so pip picks the latest available release; pin one if you need reproducibility.
    # Check https://docs.rapids.ai/install for the latest recommended Colab installation.
    print("Installing RAPIDS... This may take a few minutes.")
    !pip install --upgrade pip
    !pip install cudf-cu12 --extra-index-url=https://pypi.nvidia.com # For CUDA 12.x
    # For older CUDA versions, you might need:
    # !pip install cudf-cu11 --extra-index-url=https://pypi.nvidia.com # For CUDA 11.x

    # Verify installation
    try:
        import cudf
        print("\ncuDF installed and imported successfully!")
        print(f"cuDF version: {cudf.__version__}")
    except ImportError:
        print("\nError: cuDF installation failed or could not be imported.")
        print("Please restart the runtime (Runtime -> Restart runtime) and try again.")

    # A common issue: ensure the correct CUDA path is set
    cuda_path = '/usr/local/cuda'
    if os.path.exists(cuda_path):
        os.environ['PATH'] = f"{cuda_path}/bin:{os.environ['PATH']}"
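        # LD_LIBRARY_PATH may be unset, so fall back to an empty string instead of raising KeyError
        os.environ['LD_LIBRARY_PATH'] = f"{cuda_path}/lib64:{os.environ.get('LD_LIBRARY_PATH', '')}"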
        print(f"CUDA path set: {cuda_path}")
    else:
        print("Warning: CUDA path not found, some cuDF features might not work optimally.")

print("\nRAPIDS installation attempt complete. You may need to restart the runtime for changes to take full effect.")

GPU detected: Field "cuda_version" is not a valid field to query.
Installing RAPIDS... This may take a few minutes.
Looking in indexes: https://pypi.org/simple, https://pypi.nvidia.com

cuDF installed and imported successfully!
cuDF version: 25.06.00
CUDA path set: /usr/local/cuda

RAPIDS installation attempt complete. You may need to restart the runtime for changes to take full effect.


# **Exploratory Data Analysis (EDA) for E-commerce and Bank Transaction Fraud Detection**

This notebook serves as the core of our Exploratory Data Analysis (EDA) phase for improving the detection of fraud cases for e-commerce and bank credit transactions. Its primary goal is to develop a foundational understanding of both datasets, assess their quality, and uncover initial patterns that will inform our feature engineering and model development.

We will leverage a modular set of Python scripts (`src/data_processing/` and `src/eda/`) for data loading, preprocessing, summarization, and various analytical techniques (univariate, bivariate, multivariate, missing value, outlier, and temporal analysis). This modular approach promotes clean code, reusability, and reproducibility, aligning with best practices in Data Engineering and Machine Learning Engineering.

## **Table of Contents**

1. [Setup and Data Loading](#1-setup-and-data-loading)
2. [Data Preprocessing and Feature Engineering](#2-data-preprocessing-and-feature-engineering)
3. [EDA for E-commerce Fraud Data (Fraud_Data.csv)](#3-eda-for-e-commerce-fraud-data-fraud_data.csv)
    - [Data Understanding and Initial Quality Check](#3.1-data-understanding-and-initial-quality-check)
    - [Missing Values Analysis](#3.2-missing-values-analysis)
    - [Univariate Analysis](#3.3-univariate-analysis)
    - [Bivariate Analysis](#3.4-bivariate-analysis)
    - [Multivariate Analysis](#3.5-multivariate-analysis)
    - [Outlier Analysis](#3.6-outlier-analysis)
    - [Temporal Trend Analysis](#3.7-temporal-trend-analysis)
4. [EDA for Bank Credit Card Fraud Data (creditcard.csv)](#4-eda-for-bank-credit-card-fraud-data-creditcard.csv)
    - [Data Understanding and Initial Quality Check](#4.1-data-understanding-and-initial-quality-check)
    - [Missing Values Analysis](#4.2-missing-values-analysis)
    - [Univariate Analysis](#4.3-univariate-analysis)
    - [Bivariate Analysis](#4.4-bivariate-analysis)
    - [Multivariate Analysis](#4.5-multivariate-analysis)
    - [Outlier Analysis](#4.6-outlier-analysis)
    - [Temporal Trend Analysis](#4.7-temporal-trend-analysis)
5. [Class Imbalance Handling Demonstration](#5-class-imbalance-handling-demonstration)
6. [Key Insights & Summary](#6-key-insights--summary)

## **1. Setup and Data Loading**

This section establishes our analytical environment by importing essential Python libraries for data manipulation, visualization, and numerical operations. Crucially, we also import our custom modular functions and classes from the `src/` directory. This ensures that data loading, preprocessing, and various analytical steps are handled by dedicated, reusable components.

**Note on Data:** This notebook expects the `Fraud_Data.csv`, `IpAddress_to_Country.csv`, and `creditcard.csv` files to be placed in the `data/raw/` directory of your project. Please ensure the files are downloaded and correctly placed before running this notebook.

### Import Necessary Libraries and Custom Modules

In [3]:
# Core Libraries for Data Manipulation and Numerical Operations
import pandas as pd
import numpy as np
from pathlib import Path
import os # For path operations

# Visualization Libraries
import matplotlib.pyplot as plt
import seaborn as sns

# For handling class imbalance (SMOTE)
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# Add the project root directory to the system path.
# This allows Python to locate and import our custom modules
# (e.g., `src.data_processing.loader`), provided the notebook is run from the project root.
import sys
project_root = Path.cwd()  # Assumes the current working directory is the project root
if str(project_root) not in sys.path:
    sys.path.append(str(project_root))

# Import Custom Modular Functions and Classes
from src.data_processing.loader import load_data
from src.data_processing.preprocessor import FraudDataProcessor, CreditCardDataProcessor
from src.utils.helpers import merge_ip_to_country

# Data Inspection Utilities (includes strategies for dtypes and summary statistics)
from src.eda.data_inspection import DataInspector, DataTypesAndNonNullInspectionStrategy, SummaryStatisticsInspectionStrategy

# Univariate Analysis Utilities
from src.eda.univariate_analysis import UnivariateAnalyzer, NumericalUnivariateAnalysis, CategoricalUnivariateAnalysis

# Bivariate Analysis Utilities
from src.eda.bivariate_analysis import BivariateAnalyzer, NumericalVsNumericalAnalysis, CategoricalVsNumericalAnalysis, CategoricalVsCategoricalAnalysis

# Multivariate Analysis Utilities
from src.eda.multivariate_analysis import SimpleMultivariateAnalysis

# Missing Values Analysis Utilities
from src.eda.missing_values_analysis import SimpleMissingValuesAnalysis

# Outlier Analysis Utilities
from src.eda.outlier_analysis import OutlierAnalyzer, IQRBasedOutlierAnalysis

# Temporal Analysis Utilities
from src.eda.temporal_analysis import TemporalAnalyzer, MonthlyTrendAnalysis


cuDF is available. Data loading can be accelerated on GPU.
cuDF is available in preprocessor.py. Transformers can use GPU.
cuDF is available in engineer.py. Transformers can use GPU.
cuDF is available in helpers.py.


#### Set Plotting Style

In [4]:
# Configure Matplotlib and Seaborn for consistent and aesthetic plots
sns.set_style("whitegrid") # Provides a clean, modern look with a grid
plt.rcParams['figure.figsize'] = (10, 6) # Default figure size for plots
plt.rcParams['font.size'] = 12 # Base font size for readability
plt.rcParams['axes.labelsize'] = 14 # Font size for axis labels
plt.rcParams['xtick.labelsize'] = 12 # Font size for x-axis tick labels
plt.rcParams['ytick.labelsize'] = 12 # Font size for y-axis tick labels
plt.rcParams['legend.fontsize'] = 12 # Font size for plot legends
plt.rcParams['font.family'] = 'sans-serif' # Use a default sans-serif font

### Loading the Raw Transaction Data
We will load both the e-commerce fraud data (`Fraud_Data.csv`) and the bank credit card fraud data (`creditcard.csv`), along with the IP address to country mapping (`IpAddress_to_Country.csv`).

In [5]:
# Define data directories and file paths
RAW_DATA_DIR = project_root / "data" / "raw"
PROCESSED_DATA_DIR = project_root / "data" / "processed"

FRAUD_DATA_PATH = RAW_DATA_DIR / "Fraud_Data.csv"
IP_TO_COUNTRY_PATH = RAW_DATA_DIR / "IpAddress_to_Country.csv"
CREDITCARD_DATA_PATH = RAW_DATA_DIR / "creditcard.csv"

# Ensure raw data directory exists
if not RAW_DATA_DIR.exists():
    print(f"Error: Raw data directory '{RAW_DATA_DIR}' not found. Please ensure 'data/raw' exists and contains the datasets.")
    # You might want to raise an exception here if the directory is absolutely critical
    # raise FileNotFoundError(f"Raw data directory '{RAW_DATA_DIR}' not found.")

# Check if all required raw data files exist
required_raw_files = [FRAUD_DATA_PATH, IP_TO_COUNTRY_PATH, CREDITCARD_DATA_PATH]
all_files_exist = True
for file_path in required_raw_files:
    if not file_path.exists():
        print(f"Error: Required raw data file '{file_path.name}' not found in '{RAW_DATA_DIR}'. Please place it there.")
        all_files_exist = False

if not all_files_exist:
    print("\nSkipping data loading due to missing required raw data files.")
else:
    print("\n--- Loading Raw Datasets ---")
    # Explicitly specify dtypes for IP address columns to prevent string accessor errors
    fraud_data_df_raw = load_data(
        FRAUD_DATA_PATH,
        use_gpu=True,
        column_dtypes={'ip_address': str}
    )
    ip_country_df_raw = load_data(
        IP_TO_COUNTRY_PATH,
        use_gpu=True,
        column_dtypes={'lower_bound_ip_address': str, 'upper_bound_ip_address': str}
    )
    creditcard_df_raw = load_data(CREDITCARD_DATA_PATH, use_gpu=True)

    print("Raw data loading complete.")




--- Loading Raw Datasets ---
Attempting to load data from /content/data/raw/Fraud_Data.csv using cuDF (GPU)...
Successfully loaded data from /content/data/raw/Fraud_Data.csv
Attempting to load data from /content/data/raw/IpAddress_to_Country.csv using cuDF (GPU)...
Successfully loaded data from /content/data/raw/IpAddress_to_Country.csv
Attempting to load data from /content/data/raw/creditcard.csv using cuDF (GPU)...
Successfully loaded data from /content/data/raw/creditcard.csv
Raw data loading complete.


## **2. Data Preprocessing and Feature Engineering**

This section applies the preprocessing and feature engineering pipelines defined in `src/data_processing/preprocessor.py` to both datasets. This includes handling missing values, correcting data types, creating new temporal and transaction-based features, and scaling/encoding features.

After processing, the engineered datasets will be saved to the `data/processed/` directory for use in subsequent modeling phases.
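
As a rough illustration of the temporal features mentioned above, the sketch below derives hour, day-of-week, and time-since-signup columns with plain pandas. It is not the implementation in `src/data_processing/preprocessor.py`: the function name `add_temporal_features` is hypothetical, and expressing `time_since_signup` in seconds is an assumption made for this example.

```python
import pandas as pd

def add_temporal_features(df, purchase_col="purchase_time", signup_col="signup_time"):
    """Return a copy of df with basic time-derived columns (illustrative only)."""
    out = df.copy()
    purchase = pd.to_datetime(out[purchase_col])
    signup = pd.to_datetime(out[signup_col])
    out["TransactionHour"] = purchase.dt.hour
    out["TransactionDayOfWeek"] = purchase.dt.dayofweek  # Monday = 0
    # Assumption: elapsed time is measured in seconds
    out["time_since_signup"] = (purchase - signup).dt.total_seconds()
    return out

# Invented rows that mimic the two timestamp columns in Fraud_Data.csv
sample = pd.DataFrame({
    "signup_time": ["2015-01-01 10:00:00", "2015-02-01 08:30:00"],
    "purchase_time": ["2015-01-01 10:00:05", "2015-03-15 21:45:00"],
})
features = add_temporal_features(sample)
print(features[["TransactionHour", "TransactionDayOfWeek", "time_since_signup"]])
```

A very short gap between signup and purchase, as in the first invented row, is the kind of pattern this feature is meant to expose.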

### Preprocessing E-commerce Fraud Data (`Fraud_Data.csv`)
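
The first preprocessing step maps each transaction's IP address to a country by matching it against the ranges in `IpAddress_to_Country.csv`. The sketch below shows one way to do such a range lookup with `pandas.merge_asof`. It is an assumption about the general approach, not the code of `merge_ip_to_country`; the function `map_ip_to_country`, the integer column `ip_int`, and the `"Unknown"` fallback label are hypothetical.

```python
import pandas as pd

def map_ip_to_country(transactions, ip_ranges):
    """Attach a country to each transaction via a range lookup (illustrative only)."""
    # merge_asof requires both frames to be sorted on their join keys
    tx = transactions.sort_values("ip_int")
    ranges = ip_ranges.sort_values("lower_bound_ip_address")
    # For each IP, pick the range with the largest lower bound that is <= the IP
    merged = pd.merge_asof(
        tx,
        ranges,
        left_on="ip_int",
        right_on="lower_bound_ip_address",
        direction="backward",
    )
    # The candidate range only counts if the IP also falls under its upper bound
    in_range = merged["ip_int"] <= merged["upper_bound_ip_address"]
    merged["country"] = merged["country"].where(in_range, "Unknown")
    return merged[["user_id", "ip_int", "country"]]

# Invented ranges and transactions with small integers standing in for IPs
ip_ranges = pd.DataFrame({
    "lower_bound_ip_address": [100, 300],
    "upper_bound_ip_address": [199, 399],
    "country": ["A", "B"],
})
transactions = pd.DataFrame({"user_id": [1, 2, 3, 4], "ip_int": [150, 250, 310, 50]})
result = map_ip_to_country(transactions, ip_ranges)
print(result)
```

IPs that fall in a gap between ranges or below the first range end up labeled `"Unknown"` rather than inheriting a neighboring country.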

In [None]:
print("\n--- Preprocessing E-commerce Fraud Data (Fraud_Data.csv) ---")
if fraud_data_df_raw.empty:
    print("Raw Fraud_Data.csv is empty. Skipping preprocessing.")
    fraud_processed_df = pd.DataFrame()
else:
    # Merge IP addresses to countries first (this is a helper function)
    fraud_df_merged = merge_ip_to_country(fraud_data_df_raw.copy(), ip_country_df_raw.copy())

    # Separate features and target
    FRAUD_TARGET_COL = 'class'
    if FRAUD_TARGET_COL in fraud_df_merged.columns:
        X_fraud = fraud_df_merged.drop(columns=[FRAUD_TARGET_COL])
        y_fraud = fraud_df_merged[FRAUD_TARGET_COL]
    else:
        print(f"Warning: Target column '{FRAUD_TARGET_COL}' not found in Fraud_Data.csv. Using a dummy target.")
        X_fraud = fraud_df_merged.copy()
        y_fraud = pd.Series([0] * len(X_fraud), index=X_fraud.index) # Dummy target

    # Rename columns in X_fraud to match the generic names expected by FraudDataProcessor.
    # Each source column may appear only once as a key: a repeated 'user_id' key would
    # override the earlier mapping and leave 'CustomerId' (used for aggregation below) missing.
    # If the processor also needs a 'TransactionId' column, copy 'CustomerId' into it after the rename.
    X_fraud_renamed = X_fraud.rename(columns={
        'user_id': 'CustomerId',
        'purchase_value': 'Amount',
        'purchase_time': 'TransactionStartTime'
    }).copy()

    # Define columns for FraudDataProcessor using the *renamed* names
    fraud_numerical_features_renamed = ['Amount', 'age']
    fraud_categorical_features_renamed = ['source', 'browser', 'sex', 'country']
    fraud_purchase_time_col_renamed = 'TransactionStartTime'
    fraud_signup_time_col_renamed = 'signup_time'
    fraud_amount_col_renamed = 'Amount'
    fraud_id_cols_for_agg_renamed = ['CustomerId', 'device_id', 'ip_address']

    fraud_processor = FraudDataProcessor(
        numerical_cols_after_rename=fraud_numerical_features_renamed,
        categorical_cols_after_merge=fraud_categorical_features_renamed,
        time_col_after_rename=fraud_purchase_time_col_renamed,
        signup_time_col_after_rename=fraud_signup_time_col_renamed,
        amount_col_after_rename=fraud_amount_col_renamed,
        id_cols_for_agg_after_rename=fraud_id_cols_for_agg_renamed
    )

    print("Applying preprocessing pipeline to Fraud_Data.csv...")
    X_fraud_processed = fraud_processor.fit_transform(X_fraud_renamed, y_fraud)

    # Re-attach target
    fraud_processed_df = pd.concat([X_fraud_processed, y_fraud.rename(FRAUD_TARGET_COL)], axis=1)
    print("E-commerce Fraud Data preprocessing complete.")

# Save processed Fraud Data
PROCESSED_DATA_DIR.mkdir(parents=True, exist_ok=True)
if not fraud_processed_df.empty:
    fraud_output_path = PROCESSED_DATA_DIR / "fraud_processed.csv"
    fraud_processed_df.to_csv(fraud_output_path, index=False)
    print(f"Processed E-commerce Fraud Data saved to: {fraud_output_path}")
else:
    print("No processed E-commerce Fraud Data to save.")



--- Preprocessing E-commerce Fraud Data (Fraud_Data.csv) ---

--- Performing IP Address to Country Merging ---
Converting IP addresses to integer format...
Matching IP addresses to countries using efficient merge...
Converting to pandas for IP merge (cuDF merge_asof for ranges is limited)...
IP-to-Country merge complete.
Applying preprocessing pipeline to Fraud_Data.csv...
No duplicate rows found.
Converting 'TransactionStartTime' to datetime using cuDF...
Converting 'signup_time' to datetime using cuDF...
Converting 'Amount' to numerical using cuDF...
Converting 'age' to numerical using cuDF...
Converting 'Amount' to numerical using cuDF...


### Preprocessing Bank Credit Card Fraud Data (`creditcard.csv`)

In [None]:
print("\n--- Preprocessing Bank Credit Card Fraud Data (creditcard.csv) ---")
if creditcard_df_raw.empty:
    print("Raw creditcard.csv is empty. Skipping preprocessing.")
    creditcard_processed_df = pd.DataFrame()
else:
    # Separate features and target
    CREDITCARD_TARGET_COL = 'Class'
    if CREDITCARD_TARGET_COL in creditcard_df_raw.columns:
        X_creditcard = creditcard_df_raw.drop(columns=[CREDITCARD_TARGET_COL])
        y_creditcard = creditcard_df_raw[CREDITCARD_TARGET_COL]
    else:
        print(f"Warning: Target column '{CREDITCARD_TARGET_COL}' not found in creditcard.csv. Using a dummy target.")
        X_creditcard = creditcard_df_raw.copy()
        y_creditcard = pd.Series([0] * len(X_creditcard), index=X_creditcard.index) # Dummy target

    # All remaining columns (V1-V28, Time, Amount) are numerical.
    creditcard_numerical_features = list(X_creditcard.columns)

    creditcard_processor = CreditCardDataProcessor(
        numerical_cols=creditcard_numerical_features
    )

    print("Applying preprocessing pipeline to creditcard.csv...")
    X_creditcard_processed = creditcard_processor.fit_transform(X_creditcard, y_creditcard)

    # Re-attach target
    creditcard_processed_df = pd.concat([X_creditcard_processed, y_creditcard.rename(CREDITCARD_TARGET_COL)], axis=1)
    print("Bank Credit Card Fraud Data preprocessing complete.")

# Save processed Credit Card Data
if not creditcard_processed_df.empty:
    creditcard_output_path = PROCESSED_DATA_DIR / "creditcard_processed.csv"
    creditcard_processed_df.to_csv(creditcard_output_path, index=False)
    print(f"Processed Bank Credit Card Fraud Data saved to: {creditcard_output_path}")
else:
    print("No processed Bank Credit Card Fraud Data to save.")


## **3. EDA for E-commerce Fraud Data (Fraud_Data.csv)**

This section focuses on the Exploratory Data Analysis of the preprocessed E-commerce Fraud Data. We will apply various analytical techniques to understand the data's characteristics, distributions, relationships between features, and the nature of fraud within this dataset.

### **3.1 Data Understanding and Initial Quality Check**
This step provides an initial overview of the dataset's structure, including data types and non-null counts, and then presents summary statistics for numerical features. This helps in quickly grasping the scale, distribution, and potential issues within the data.

#### Data Structure and Quality Assessment (using `DataTypesAndNonNullInspectionStrategy`)

In [None]:
if not fraud_processed_df.empty:
    print("\n--- Data Types and Non-null Counts for E-commerce Fraud Data ---")
    inspector = DataInspector(DataTypesAndNonNullInspectionStrategy())
    inspector.execute_inspection(fraud_processed_df)
else:
    print("E-commerce Fraud Data is empty, skipping Data Structure and Quality Assessment.")


#### Descriptive Statistics & Variability (using `SummaryStatisticsInspectionStrategy`)

In [None]:
if not fraud_processed_df.empty:
    print("\n--- Summary Statistics for E-commerce Fraud Data ---")
    inspector = DataInspector(SummaryStatisticsInspectionStrategy())
    inspector.execute_inspection(fraud_processed_df)
else:
    print("E-commerce Fraud Data is empty, skipping Descriptive Statistics.")


### **3.2 Missing Values Analysis**
Missing data can significantly impact model performance and lead to biased results. This section identifies the extent of missing values and visualizes their patterns, guiding imputation or removal strategies in the feature engineering phase.
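
For orientation, a per-column missing-value summary can be computed with a few lines of pandas. This is a minimal sketch on invented rows, not the `SimpleMissingValuesAnalysis` implementation; the helper name `summarize_missing` is hypothetical.

```python
import numpy as np
import pandas as pd

def summarize_missing(df):
    """Count and percentage of missing values per column, worst first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": counts / len(df) * 100,
    })
    # Keep only columns that actually have gaps
    summary = summary[summary["missing_count"] > 0]
    return summary.sort_values("missing_count", ascending=False)

# Invented sample with gaps in a numerical and a categorical column
sample = pd.DataFrame({
    "Amount": [10.0, np.nan, 30.0, 40.0],
    "age": [25, 31, 40, 52],
    "country": ["US", None, None, "DE"],
})
missing_summary = summarize_missing(sample)
print(missing_summary)
```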

In [None]:
if not fraud_processed_df.empty:
    print("\n--- Missing Values Analysis for E-commerce Fraud Data ---")
    missing_analyzer = SimpleMissingValuesAnalysis()
    missing_analyzer.analyze(fraud_processed_df)
else:
    print("E-commerce Fraud Data is empty, skipping Missing Values Analysis.")


### **3.3 Univariate Analysis**
Univariate analysis explores the distribution of individual features. This helps in understanding the range, central tendency, and spread of numerical data, and the frequency of categories in categorical data.

In [None]:
if not fraud_processed_df.empty:
    print("\n--- Univariate Analysis for E-commerce Fraud Data ---")

    # Define features for EDA based on the *expected output* of the FraudDataProcessor
    # These lists must accurately reflect the columns present and their types AFTER preprocessing
    fraud_numerical_features_for_eda = [
        'Amount', 'age', 'IsRefund', 'TransactionHour', 'TransactionDayOfWeek',
        'TransactionMonth', 'TransactionYear', 'time_since_signup'
    ] + [
        f'{id_col}_transactions_last_{window}d' for id_col in ['CustomerId', 'device_id', 'ip_address'] for window in [1, 7, 30]
    ] + [
        f'{id_col}_total_amount_last_{window}d' for id_col in ['CustomerId', 'device_id', 'ip_address'] for window in [1, 7, 30]
    ] + [
        f'{id_col}_avg_amount_last_{window}d' for id_col in ['CustomerId', 'device_id', 'ip_address'] for window in [1, 7, 30]
    ]
    # Filter to only include columns that actually exist in the DataFrame
    fraud_numerical_features_for_eda = [col for col in fraud_numerical_features_for_eda if col in fraud_processed_df.columns]

    # The preprocessor may one-hot encode categorical features. Distributions are easier to read
    # on the original columns, so we list those here; the filter below keeps only the ones that
    # are still present after preprocessing. The 'country' column is added during the IP merge.
    fraud_categorical_features_for_eda = [
        'source', 'browser', 'sex', 'country'
    ]
    # Filter to only include columns that actually exist in the DataFrame
    fraud_categorical_features_for_eda = [col for col in fraud_categorical_features_for_eda if col in fraud_processed_df.columns]

    # Numerical Univariate Analysis
    univariate_analyzer_num = UnivariateAnalyzer(NumericalUnivariateAnalysis())
    for col in fraud_numerical_features_for_eda:
        univariate_analyzer_num.execute_analysis(fraud_processed_df, col)

    # Categorical Univariate Analysis
    univariate_analyzer_cat = UnivariateAnalyzer(CategoricalUnivariateAnalysis())
    for col in fraud_categorical_features_for_eda:
        univariate_analyzer_cat.execute_analysis(fraud_processed_df, col)

    # Class imbalance check for the target column
    FRAUD_TARGET_COL = 'class'  # Target column in Fraud_Data.csv
    if FRAUD_TARGET_COL in fraud_processed_df.columns:
        print(f"\n--- Class Distribution for '{FRAUD_TARGET_COL}' in E-commerce Fraud Data ---")
        class_counts = fraud_processed_df[FRAUD_TARGET_COL].value_counts()
        print(class_counts)
        print(f"Fraudulent transactions: {class_counts.get(1, 0)} ({class_counts.get(1, 0) / len(fraud_processed_df) * 100:.2f}%) ")
        print(f"Non-fraudulent transactions: {class_counts.get(0, 0)} ({class_counts.get(0, 0) / len(fraud_processed_df) * 100:.2f}%) ")
        plt.figure(figsize=(6, 4))
        sns.countplot(x=FRAUD_TARGET_COL, data=fraud_processed_df, palette='coolwarm')
        plt.title(f'Class Distribution of {FRAUD_TARGET_COL} (E-commerce Fraud Data)')
        plt.show()
    else:
        print(f"Target column '{FRAUD_TARGET_COL}' not found in E-commerce Fraud Data. Skipping class distribution check.")
else:
    print("E-commerce Fraud Data is empty, skipping Univariate Analysis.")


### **3.4 Bivariate Analysis**
Bivariate analysis explores the relationships between pairs of features. This helps in identifying potential correlations, dependencies, and interactions that might be important for model building.
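
One bivariate view that is often informative for fraud data is the fraud rate per category. The sketch below is a minimal pandas illustration on invented rows, not the `CategoricalVsCategoricalAnalysis` implementation; the column names follow `Fraud_Data.csv`, while the values and the output column names `fraud_rate` and `n` are made up for the example.

```python
import pandas as pd

# Invented sample: traffic source and fraud label (1 = fraud)
sample = pd.DataFrame({
    "source": ["SEO", "Ads", "SEO", "Direct", "Ads", "Ads"],
    "class": [0, 1, 0, 0, 1, 0],
})

# The mean of a 0/1 label is the fraud rate; the count shows how much data backs it
fraud_rate = (
    sample.groupby("source")["class"]
    .agg(["mean", "count"])
    .rename(columns={"mean": "fraud_rate", "count": "n"})
    .sort_values("fraud_rate", ascending=False)
)
print(fraud_rate)
```

Reporting the group size next to the rate helps avoid over-reading categories with only a handful of transactions.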

In [None]:
if not fraud_processed_df.empty:
    print("\n--- Bivariate Analysis for E-commerce Fraud Data ---")
    FRAUD_TARGET_COL = 'class'  # Target column in Fraud_Data.csv

    # Numerical vs Numerical
    bivariate_analyzer_num_num = BivariateAnalyzer(NumericalVsNumericalAnalysis())
    # Example: Amount vs age
    if 'Amount' in fraud_processed_df.columns and 'age' in fraud_processed_df.columns:
        bivariate_analyzer_num_num.execute_analysis(fraud_processed_df, 'Amount', 'age')
    # Example: Amount vs time_since_signup
    if 'Amount' in fraud_processed_df.columns and 'time_since_signup' in fraud_processed_df.columns:
        bivariate_analyzer_num_num.execute_analysis(fraud_processed_df, 'Amount', 'time_since_signup')

    # Categorical vs Numerical
    bivariate_analyzer_cat_num = BivariateAnalyzer(CategoricalVsNumericalAnalysis())
    # Example: source vs Amount
    if 'source' in fraud_processed_df.columns and 'Amount' in fraud_processed_df.columns:
        bivariate_analyzer_cat_num.execute_analysis(fraud_processed_df, 'source', 'Amount')
    # Example: country vs Amount
    if 'country' in fraud_processed_df.columns and 'Amount' in fraud_processed_df.columns:
        bivariate_analyzer_cat_num.execute_analysis(fraud_processed_df, 'country', 'Amount')
    # Example: Target vs Amount
    if FRAUD_TARGET_COL in fraud_processed_df.columns and 'Amount' in fraud_processed_df.columns:
        bivariate_analyzer_cat_num.execute_analysis(fraud_processed_df, FRAUD_TARGET_COL, 'Amount')
    # Example: Target vs time_since_signup
    if FRAUD_TARGET_COL in fraud_processed_df.columns and 'time_since_signup' in fraud_processed_df.columns:
        bivariate_analyzer_cat_num.execute_analysis(fraud_processed_df, FRAUD_TARGET_COL, 'time_since_signup')

    # Categorical vs Categorical
    bivariate_analyzer_cat_cat = BivariateAnalyzer(CategoricalVsCategoricalAnalysis())
    # Example: source vs sex
    if 'source' in fraud_processed_df.columns and 'sex' in fraud_processed_df.columns:
        bivariate_analyzer_cat_cat.execute_analysis(fraud_processed_df, 'source', 'sex')
    # Example: country vs target
    if 'country' in fraud_processed_df.columns and FRAUD_TARGET_COL in fraud_processed_df.columns:
        bivariate_analyzer_cat_cat.execute_analysis(fraud_processed_df, 'country', FRAUD_TARGET_COL)
else:
    print("E-commerce Fraud Data is empty, skipping Bivariate Analysis.")


### **3.5 Multivariate Analysis**
Multivariate analysis examines the relationships among three or more variables. This helps in understanding complex interactions and patterns that might not be visible in univariate or bivariate analyses.
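
A correlation matrix is a common starting point for this kind of analysis. The following is a minimal sketch on invented data, independent of `SimpleMultivariateAnalysis`; the column `total_amount_7d` is a hypothetical stand-in for an engineered rolling-window feature.

```python
from itertools import combinations

import pandas as pd

# Invented sample: one raw feature, one engineered feature, one unrelated feature
sample = pd.DataFrame({
    "Amount": [10.0, 20.0, 30.0, 40.0, 50.0],
    "total_amount_7d": [12.0, 22.0, 29.0, 41.0, 52.0],
    "age": [45, 23, 36, 51, 30],
})

# Pairwise Pearson correlations between all numerical columns
corr = sample.corr(method="pearson")
print(corr.round(3))

# Find the most strongly correlated pair of distinct features
strongest = max(
    combinations(corr.columns, 2),
    key=lambda pair: abs(corr.loc[pair[0], pair[1]]),
)
print("Strongest pair:", strongest)
```

Highly correlated pairs such as a raw amount and its rolling total are candidates for pruning before modeling.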

In [None]:
if not fraud_processed_df.empty:
    print("\n--- Multivariate Analysis for E-commerce Fraud Data ---")
    multivariate_analyzer = SimpleMultivariateAnalysis()

    # Select a subset of numerical features for correlation heatmap and pair plot
    # Include some original numericals and some engineered features
    fraud_multivariate_features = [
        'Amount', 'age', 'time_since_signup',
        'CustomerId_transactions_last_7d', 'CustomerId_total_amount_last_7d',
        'device_id_transactions_last_7d', 'ip_address_transactions_last_7d'
    ]
    fraud_multivariate_features = [col for col in fraud_multivariate_features if col in fraud_processed_df.columns]

    if fraud_multivariate_features:
        multivariate_analyzer.analyze(fraud_processed_df, features=fraud_multivariate_features)
    else:
        print("No suitable numerical features found for multivariate analysis in E-commerce Fraud Data.")
else:
    print("E-commerce Fraud Data is empty, skipping Multivariate Analysis.")


### **3.6 Outlier Analysis**
Outliers are data points that significantly differ from other observations. Identifying and understanding outliers is crucial as they can skew statistical analyses and model training. This section uses IQR-based outlier detection and visualization.
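
For reference, the IQR rule flags values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR, where IQR = Q3 - Q1. The sketch below applies that rule to invented values; `iqr_outliers` is a hypothetical helper, not the `IQRBasedOutlierAnalysis` class, whose internals are not shown in this notebook.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return a boolean outlier mask plus the lower and upper fences."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    mask = (series < lower) | (series > upper)
    return mask, lower, upper

# Invented purchase amounts with one extreme value
amounts = pd.Series([10, 12, 11, 13, 12, 14, 11, 200])
mask, lower, upper = iqr_outliers(amounts)
print(f"Fences: [{lower}, {upper}]")
print("Outliers:", amounts[mask].tolist())
```

The multiplier `k=1.5` is the conventional default; a larger value such as 3 flags only more extreme points.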

In [None]:
if not fraud_processed_df.empty:
    print("\n--- Outlier Analysis for E-commerce Fraud Data ---")
    outlier_analyzer = OutlierAnalyzer(IQRBasedOutlierAnalysis())

    # Focus on key numerical features including engineered ones
    fraud_outlier_cols = [
        'Amount', 'age', 'time_since_signup', 'IsRefund',
        'CustomerId_transactions_last_1d', 'CustomerId_total_amount_last_1d',
        'device_id_transactions_last_1d', 'ip_address_transactions_last_1d'
    ]
    fraud_outlier_cols = [col for col in fraud_outlier_cols if col in fraud_processed_df.columns]

    for col in fraud_outlier_cols:
        outlier_analyzer.execute_analysis(fraud_processed_df, col)
else:
    print("E-commerce Fraud Data is empty, skipping Outlier Analysis.")


### **3.7 Temporal Trend Analysis**
Temporal analysis examines how features and fraud patterns change over time. This is particularly important for transaction data, as fraud often exhibits temporal trends or seasonality.
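
A monthly trend can be obtained by grouping on the year-month of the timestamp and averaging each metric; for a 0/1 target the monthly mean is the monthly fraud rate. This is a minimal sketch on invented rows, not the `MonthlyTrendAnalysis` implementation; the helper name `monthly_trend` is hypothetical.

```python
import pandas as pd

def monthly_trend(df, time_col, metrics):
    """Average each metric per calendar month (illustrative only)."""
    timestamps = pd.to_datetime(df[time_col])
    year_month = timestamps.dt.to_period("M")
    return df.groupby(year_month)[metrics].mean()

# Invented sample using the raw column names from Fraud_Data.csv
sample = pd.DataFrame({
    "purchase_time": ["2015-01-05", "2015-01-20", "2015-02-03", "2015-02-28"],
    "purchase_value": [20.0, 40.0, 10.0, 30.0],
    "class": [0, 1, 0, 0],
})
trend = monthly_trend(sample, "purchase_time", ["purchase_value", "class"])
print(trend)
```

This approach needs the original timestamp, which is why the availability of the time column matters in the cell that follows.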

In [None]:
if not fraud_processed_df.empty:
    print("\n--- Temporal Analysis for E-commerce Fraud Data ---")
    FRAUD_TARGET_COL = 'class'
    time_col_for_temporal = 'TransactionStartTime'

    # MonthlyTrendAnalysis aggregates by year-month from the original timestamp column
    # ('purchase_time', renamed to 'TransactionStartTime').
    # TemporalFeatureEngineer (src/feature_engineering/engineer.py) currently drops that column
    # after extracting TransactionHour, TransactionMonth, etc., so this plot only runs once the
    # engineer is changed to keep it. Alternatives are to re-merge the raw timestamp for
    # plotting only, or to aggregate on the extracted Year/Month/DayOfWeek features instead.
    # If the column is absent, the check below prints a notice and skips the analysis.
    if time_col_for_temporal in fraud_processed_df.columns:
        temporal_analyzer = TemporalAnalyzer(MonthlyTrendAnalysis())
        temporal_metrics = ['Amount', FRAUD_TARGET_COL]
        temporal_metrics = [col for col in temporal_metrics if col in fraud_processed_df.columns]
        if temporal_metrics:
            temporal_analyzer.execute_analysis(fraud_processed_df, time_col_for_temporal, temporal_metrics)
        else:
            print("No valid metrics for temporal analysis in E-commerce Fraud Data.")
    else:
        print(f"Time column '{time_col_for_temporal}' not found in E-commerce Fraud Data. Skipping temporal analysis.")
else:
    print("E-commerce Fraud Data is empty, skipping Temporal Analysis.")


## **4. EDA for Bank Credit Card Fraud Data (creditcard.csv)**

This section focuses on the Exploratory Data Analysis of the preprocessed Bank Credit Card Fraud Data. We will apply similar analytical techniques to understand its unique characteristics, distributions, and fraud patterns.

### **4.1 Data Understanding and Initial Quality Check**

#### Data Structure and Quality Assessment (using `DataTypesAndNonNullInspectionStrategy`)

In [None]:
if not creditcard_processed_df.empty:
    print("\n--- Data Types and Non-null Counts for Bank Credit Card Fraud Data ---")
    inspector = DataInspector(DataTypesAndNonNullInspectionStrategy())
    inspector.execute_inspection(creditcard_processed_df)
else:
    print("Bank Credit Card Fraud Data is empty, skipping Data Structure and Quality Assessment.")


#### Descriptive Statistics & Variability (using `SummaryStatisticsInspectionStrategy`)

In [None]:
if not creditcard_processed_df.empty:
    print("\n--- Summary Statistics for Bank Credit Card Fraud Data ---")
    inspector = DataInspector(SummaryStatisticsInspectionStrategy())
    inspector.execute_inspection(creditcard_processed_df)
else:
    print("Bank Credit Card Fraud Data is empty, skipping Descriptive Statistics.")


### **4.2 Missing Values Analysis**

In [None]:
if not creditcard_processed_df.empty:
    print("\n--- Missing Values Analysis for Bank Credit Card Fraud Data ---")
    missing_analyzer = SimpleMissingValuesAnalysis()
    missing_analyzer.analyze(creditcard_processed_df)
else:
    print("Bank Credit Card Fraud Data is empty, skipping Missing Values Analysis.")


### **4.3 Univariate Analysis**

In [None]:
if not creditcard_processed_df.empty:
    print("\n--- Univariate Analysis for Bank Credit Card Fraud Data ---")

    CREDITCARD_TARGET_COL = 'Class'
    creditcard_numerical_features_for_eda = [f'V{i}' for i in range(1, 29)] + ['Time', 'Amount', 'IsRefund']
    creditcard_numerical_features_for_eda = [col for col in creditcard_numerical_features_for_eda if col in creditcard_processed_df.columns]

    creditcard_categorical_features_for_eda = [CREDITCARD_TARGET_COL]
    creditcard_categorical_features_for_eda = [col for col in creditcard_categorical_features_for_eda if col in creditcard_processed_df.columns]

    # Numerical Univariate Analysis
    univariate_analyzer_num = UnivariateAnalyzer(NumericalUnivariateAnalysis())
    for col in creditcard_numerical_features_for_eda:
        univariate_analyzer_num.execute_analysis(creditcard_processed_df, col)

    # Categorical Univariate Analysis (mainly for the target variable)
    univariate_analyzer_cat = UnivariateAnalyzer(CategoricalUnivariateAnalysis())
    for col in creditcard_categorical_features_for_eda:
        univariate_analyzer_cat.execute_analysis(creditcard_processed_df, col)

    # Class Imbalance Check for Class
    if CREDITCARD_TARGET_COL in creditcard_processed_df.columns:
        print(f"\n--- Class Distribution for '{CREDITCARD_TARGET_COL}' in Bank Credit Card Fraud Data ---")
        class_counts = creditcard_processed_df[CREDITCARD_TARGET_COL].value_counts()
        print(class_counts)
        # Look up each count once and reuse it for the percentage.
        n_rows = len(creditcard_processed_df)
        n_fraud, n_legit = class_counts.get(1, 0), class_counts.get(0, 0)
        print(f"Fraudulent transactions: {n_fraud} ({n_fraud / n_rows * 100:.2f}%)")
        print(f"Non-fraudulent transactions: {n_legit} ({n_legit / n_rows * 100:.2f}%)")
        plt.figure(figsize=(6, 4))
        sns.countplot(x=CREDITCARD_TARGET_COL, data=creditcard_processed_df, palette='coolwarm')
        plt.title(f'Class Distribution of {CREDITCARD_TARGET_COL} (Bank Credit Card Fraud Data)')
        plt.show()
    else:
        print(f"Target column '{CREDITCARD_TARGET_COL}' not found in Bank Credit Card Fraud Data. Skipping class distribution check.")
else:
    print("Bank Credit Card Fraud Data is empty, skipping Univariate Analysis.")


### **4.4 Bivariate Analysis**

In [None]:
if not creditcard_processed_df.empty:
    print("\n--- Bivariate Analysis for Bank Credit Card Fraud Data ---")
    CREDITCARD_TARGET_COL = 'Class'

    # Numerical vs Numerical
    bivariate_analyzer_num_num = BivariateAnalyzer(NumericalVsNumericalAnalysis())
    # Example: Amount vs V1
    if 'Amount' in creditcard_processed_df.columns and 'V1' in creditcard_processed_df.columns:
        bivariate_analyzer_num_num.execute_analysis(creditcard_processed_df, 'Amount', 'V1')
    # Example: Time vs Amount
    if 'Time' in creditcard_processed_df.columns and 'Amount' in creditcard_processed_df.columns:
        bivariate_analyzer_num_num.execute_analysis(creditcard_processed_df, 'Time', 'Amount')

    # Categorical vs Numerical (Target vs Amount/Time/V-features)
    bivariate_analyzer_cat_num = BivariateAnalyzer(CategoricalVsNumericalAnalysis())
    if CREDITCARD_TARGET_COL in creditcard_processed_df.columns:
        if 'Amount' in creditcard_processed_df.columns:
            bivariate_analyzer_cat_num.execute_analysis(creditcard_processed_df, CREDITCARD_TARGET_COL, 'Amount')
        if 'Time' in creditcard_processed_df.columns:
            bivariate_analyzer_cat_num.execute_analysis(creditcard_processed_df, CREDITCARD_TARGET_COL, 'Time')
        # Example with a V feature
        if 'V17' in creditcard_processed_df.columns:
            bivariate_analyzer_cat_num.execute_analysis(creditcard_processed_df, CREDITCARD_TARGET_COL, 'V17')

    # No significant categorical vs categorical analysis expected for this dataset beyond target
else:
    print("Bank Credit Card Fraud Data is empty, skipping Bivariate Analysis.")


### **4.5 Multivariate Analysis**

In [None]:
if not creditcard_processed_df.empty:
    print("\n--- Multivariate Analysis for Bank Credit Card Fraud Data ---")
    multivariate_analyzer = SimpleMultivariateAnalysis()

    # Use Amount, Time, and all 28 PCA components for the correlation heatmap and pair plot
    creditcard_multivariate_features = ['Amount', 'Time'] + [f'V{i}' for i in range(1, 29)]
    creditcard_multivariate_features = [col for col in creditcard_multivariate_features if col in creditcard_processed_df.columns]

    if creditcard_multivariate_features:
        multivariate_analyzer.analyze(creditcard_processed_df, features=creditcard_multivariate_features)
    else:
        print("No suitable numerical features found for multivariate analysis in Bank Credit Card Fraud Data.")
else:
    print("Bank Credit Card Fraud Data is empty, skipping Multivariate Analysis.")


### **4.6 Outlier Analysis**

In [None]:
if not creditcard_processed_df.empty:
    print("\n--- Outlier Analysis for Bank Credit Card Fraud Data ---")
    outlier_analyzer = OutlierAnalyzer(IQRBasedOutlierAnalysis())

    # Focus on key numerical features
    creditcard_outlier_cols = ['Amount', 'Time', 'V1', 'V2', 'V3', 'IsRefund']
    creditcard_outlier_cols = [col for col in creditcard_outlier_cols if col in creditcard_processed_df.columns]

    for col in creditcard_outlier_cols:
        outlier_analyzer.execute_analysis(creditcard_processed_df, col)
else:
    print("Bank Credit Card Fraud Data is empty, skipping Outlier Analysis.")


### **4.7 Temporal Trend Analysis**

In [None]:
if not creditcard_processed_df.empty:
    print("\n--- Temporal Analysis for Bank Credit Card Fraud Data ---")
    CREDITCARD_TARGET_COL = 'Class'

    # Time is in seconds, so direct line plot is more appropriate than monthly trends
    if 'Time' in creditcard_processed_df.columns and 'Amount' in creditcard_processed_df.columns:
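        print("Plotting Transaction Amount over Time for Bank Credit Card Fraud Data...")
        plt.figure(figsize=(14, 7))
        # estimator=None draws the raw observations. The default (mean per x value with
        # bootstrapped confidence intervals) can be very slow on a dataset of this size.
        sns.lineplot(x='Time', y='Amount', data=creditcard_processed_df, estimator=None, alpha=0.6)
        plt.title('Transaction Amount Over Time (Bank Credit Card Fraud Data)')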
        plt.xlabel('Time (seconds from first transaction)')
        plt.ylabel('Transaction Amount')
        plt.show()
    else:
        print("Required columns 'Time' or 'Amount' not found for temporal analysis of amount.")

    # Plot fraud incidents over time
    if CREDITCARD_TARGET_COL in creditcard_processed_df.columns and 'Time' in creditcard_processed_df.columns:
        fraud_over_time = creditcard_processed_df[creditcard_processed_df[CREDITCARD_TARGET_COL] == 1]
        if not fraud_over_time.empty:
            print("Plotting Fraud Incidents over Time for Bank Credit Card Fraud Data...")
            plt.figure(figsize=(14, 7))
            sns.histplot(x='Time', data=fraud_over_time, bins=50, kde=True, color='red')
            plt.title('Fraud Incidents Over Time (Bank Credit Card Fraud Data)')
            plt.xlabel('Time (seconds from first transaction)')
            plt.ylabel('Number of Fraud Incidents')
            plt.show()
        else:
            print("No fraud incidents to plot over time in Bank Credit Card Fraud Data.")
    else:
        print("Required columns 'Time' or 'Class' not found for temporal analysis of fraud incidents.")
else:
    print("Bank Credit Card Fraud Data is empty, skipping Temporal Analysis.")


## **5. Class Imbalance Handling Demonstration**

Both fraud detection datasets are highly imbalanced: fraudulent transactions are far rarer than legitimate ones. This section demonstrates a common technique for addressing this imbalance, SMOTE (Synthetic Minority Over-sampling Technique). SMOTE creates synthetic minority-class samples by interpolating between an existing minority sample and one of its nearest minority-class neighbors.

**Note:** In a real machine learning pipeline, SMOTE or similar techniques should only be applied to the *training data* to prevent data leakage and ensure a realistic evaluation of the model's performance on unseen data.
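
To make the interpolation idea concrete, here is a minimal NumPy sketch of the step at the core of SMOTE. It is an illustration only, not the `imblearn` implementation used in the next cell: the helper name `smote_like_samples`, the brute-force neighbor search, and the toy points are all assumptions made for this example.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=5, seed=0):
    """Hypothetical helper: create n_new synthetic minority points by interpolation."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    if n < 2:
        raise ValueError("Need at least two minority samples to interpolate.")
    k = min(k, n - 1)

    # Brute-force pairwise squared distances between minority samples.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = (diffs ** 2).sum(axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]

    # Pick a random base point, one of its k neighbors, and a random position between them.
    base_idx = rng.integers(0, n, size=n_new)
    nb_idx = neighbors[base_idx, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    return X_minority[base_idx] + gap * (X_minority[nb_idx] - X_minority[base_idx])

# Toy minority class: the four corners of the unit square.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_samples(X_min, n_new=10, k=2)
print(synthetic.shape)
```

Every synthetic point lies on a line segment between two real minority points, which is why SMOTE needs purely numerical features and why it must never see test rows.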

In [None]:
def demonstrate_class_imbalance_handling(df: pd.DataFrame, dataset_name: str, target_col: str):
    """
    Demonstrates class imbalance handling using SMOTE on a conceptual training set.
    """
    print(f"\n--- Demonstrating Class Imbalance Handling for {dataset_name} ---")
    if df.empty or target_col not in df.columns:
        print(f"Cannot demonstrate imbalance handling for empty or missing target column in {dataset_name}.")
        return

    X = df.drop(columns=[target_col])
    y = df[target_col]

    # Split data into training and testing sets (conceptual split for demonstration)
    # In a real scenario, this split would happen *before* any preprocessing that uses target info
    # and before applying imbalance techniques.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

    print(f"\nOriginal training set class distribution for {dataset_name}:")
    print(y_train.value_counts())
    n_fraud_train = y_train.value_counts().get(1, 0)
    print(f"Fraudulent transactions: {n_fraud_train} ({n_fraud_train / len(y_train) * 100:.2f}%)")

    # Apply SMOTE to the training data only
    print(f"\nApplying SMOTE to the training data for {dataset_name}...")
    smote = SMOTE(random_state=42)

    try:
        # SMOTE requires numerical input. Ensure all columns in X_train are numerical.
        # If there are object columns, they need to be handled (e.g., one-hot encoded) prior to SMOTE.
        # The preprocessor pipeline should have already handled this.
        X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
        print(f"Resampled training set class distribution for {dataset_name}:")
        print(y_train_resampled.value_counts())
        n_fraud_resampled = y_train_resampled.value_counts().get(1, 0)
        print(f"Fraudulent transactions: {n_fraud_resampled} ({n_fraud_resampled / len(y_train_resampled) * 100:.2f}%)")
        print("SMOTE applied successfully.")
    except Exception as e:
        print(f"Error applying SMOTE: {e}")
        print("Ensure all features are numerical (float, int) before applying SMOTE.")
        print("Non-numeric columns in X_train:")
        # List object, datetime, and other non-numeric columns (bool columns are accepted).
        print(X_train.select_dtypes(exclude=['number', 'bool']).dtypes)

    print(f"\n--- Class Imbalance Handling Demonstration for {dataset_name} Complete ---")

# Demonstrate for E-commerce Fraud Data
if not fraud_processed_df.empty:
    demonstrate_class_imbalance_handling(fraud_processed_df.copy(), "E-commerce Fraud Data (Fraud_Data.csv)", 'class')

# Demonstrate for Bank Credit Card Fraud Data
if not creditcard_processed_df.empty:
    demonstrate_class_imbalance_handling(creditcard_processed_df.copy(), "Bank Credit Card Fraud Data (creditcard.csv)", 'Class')


## **6. Key Insights & Summary**

This section summarizes the key findings from the EDA for both datasets, highlighting important characteristics, potential challenges, and recommendations for the subsequent modeling phases.

### E-commerce Fraud Data (Fraud_Data.csv)

- **Data Quality:** The dataset is remarkably clean with no missing values after initial preprocessing. However, negative `Amount` values were identified, which were handled by converting to absolute value and creating an `IsRefund` flag. This flag can be a crucial feature for distinguishing transaction types.
- **Feature Distributions:**
    - `Amount` and `age` show highly skewed distributions with significant outliers, indicating a need for robust scaling or transformation during model training.
    - Categorical features like `source`, `browser`, `sex`, and `country` have varying distributions, with some having dominant categories and others being more spread out. The `country` feature, derived from IP addresses, provides geographical context.
- **Engineered Features:**
    - Temporal features (`TransactionHour`, `TransactionDayOfWeek`, `TransactionMonth`, `TransactionYear`) were successfully extracted, allowing for analysis of time-based patterns.
    - `time_since_signup` provides insight into user tenure, which could be related to fraud risk.
    - Transaction frequency and velocity features (e.g., `CustomerId_transactions_last_X_d`, `CustomerId_total_amount_last_X_d`) are critical for identifying unusual spending behaviors over time, often a strong indicator of fraud.
- **Class Imbalance:** The `class` target variable is highly imbalanced (very few fraud cases). This is a major challenge that requires specific handling during model training (e.g., SMOTE, class weights, or advanced sampling techniques) and careful selection of evaluation metrics (e.g., Precision, Recall, F1-score, AUC-ROC) rather than accuracy.
- **Relationships:** Initial bivariate and multivariate analyses suggest potential relationships between transaction amounts, temporal features, and fraud. Features like `Amount` and engineered velocity features are expected to be highly indicative of fraudulent activity.
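
The transaction frequency and velocity features mentioned above can be sketched with a per-customer rolling time window in pandas. This is a minimal illustration, not the project's feature engineering code: the toy data, the 7-day window, and the resulting column names are assumptions, and the `CustomerId`, `TransactionStartTime`, and `Amount` columns are assumed to exist under those names.

```python
import pandas as pd

# Toy transactions (illustrative values only).
tx = pd.DataFrame({
    "CustomerId": ["a", "a", "a", "b", "b"],
    "TransactionStartTime": pd.to_datetime(
        ["2024-01-01", "2024-01-03", "2024-01-20", "2024-01-02", "2024-01-08"]
    ),
    "Amount": [10.0, 20.0, 5.0, 7.0, 3.0],
})

# Sort so that the group-wise rolling result lines up with the row order.
tx = tx.sort_values(["CustomerId", "TransactionStartTime"]).reset_index(drop=True)
grouped = tx.set_index("TransactionStartTime").groupby("CustomerId")["Amount"]

# Each window covers the 7 days up to and including the current transaction.
tx["CustomerId_transactions_last_7_d"] = grouped.rolling("7D").count().to_numpy()
tx["CustomerId_total_amount_last_7_d"] = grouped.rolling("7D").sum().to_numpy()

print(tx)
```

Because each window includes the current transaction, a count of 1 means the customer had no other activity in the preceding 7 days. The windows only look backwards in time, so the features do not use information from future transactions.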

### Bank Credit Card Fraud Data (creditcard.csv)

- **Data Anonymity:** The `V` features are PCA-transformed, which means their direct interpretation is not possible. Analysis relies on their statistical properties and relationships.
- **Data Quality:** This dataset also appears to be very clean, with no missing values after preprocessing. The `Amount` column was processed in the same way as in the e-commerce data to handle potential negative values. The summary statistics showed no negative amounts here, so the `IsRefund` flag is retained mainly for consistency between the two datasets.
- **Feature Distributions:**
    - `Time` and `Amount` are the only features that were not PCA-transformed, with `Time` representing seconds elapsed since the first transaction. Their distributions may reveal temporal and monetary patterns in fraudulent activity.
    - The `V` features have varying distributions, some appearing more Gaussian-like, others skewed.
- **Class Imbalance:** Similar to the e-commerce data, the `Class` target variable is extremely imbalanced, with a very small percentage of fraudulent transactions. This is the primary challenge for modeling this dataset and necessitates robust imbalance handling strategies.
- **Relationships:** Because PCA produces uncorrelated components, correlations among the `V` features themselves are expected to be close to zero. The informative relationships are therefore those between each `V` feature and `Amount`, `Time`, and the `Class` target. Some `V` features (e.g., V17, V14, V12, V10) are often reported as strongly correlated with fraud in similar datasets.
- **Temporal Patterns:** While `Time` is not a standard datetime, its numerical nature allows for plotting trends of fraud occurrences over time, which can reveal specific windows of increased fraudulent activity.
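
One way to look for such windows is to fold the `Time` offset into hour-of-day buckets and compare fraud rates per bucket. The sketch below uses made-up rows and a hypothetical `HourOfDay` column; because `Time` counts seconds from the first transaction, bucket 0 is the first hour of the recording, not necessarily midnight.

```python
import pandas as pd

# Toy rows standing in for creditcard.csv (illustrative values only).
cc = pd.DataFrame({
    "Time": [0, 1800, 3600, 7300, 86400, 90000],
    "Class": [0, 1, 0, 0, 1, 0],
})

# Convert the seconds offset into an hour bucket that wraps every 24 hours.
cc["HourOfDay"] = ((cc["Time"] // 3600) % 24).astype(int)

# Transaction volume and fraud rate per hour bucket.
hourly = cc.groupby("HourOfDay")["Class"].agg(transactions="count", fraud_rate="mean")
print(hourly)
```

On the real data, buckets with a high `fraud_rate` but a low `transactions` count deserve caution, since a handful of fraud cases can dominate the rate.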

### Overall Recommendations for Modeling

- **Imbalance Handling:** Given the severe class imbalance in both datasets, robust techniques like SMOTE, ADASYN, or Borderline-SMOTE should be applied to the training data. Alternatively, using class weights in models (e.g., Logistic Regression, XGBoost, LightGBM) or employing anomaly detection algorithms could be considered.
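- **Evaluation Metrics:** Accuracy is misleading for imbalanced datasets: a model that labels every transaction as legitimate achieves near-perfect accuracy while catching no fraud. Focus on Precision, Recall, F1-score, AUC-ROC, the area under the precision-recall curve, and the confusion matrix to evaluate model performance effectively.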
- **Feature Scaling:** All numerical features (including engineered ones) should be scaled (e.g., `StandardScaler`, `MinMaxScaler`) before training most machine learning models, especially those sensitive to feature scales (e.g., SVMs, Logistic Regression, Neural Networks).
- **Categorical Encoding:** One-Hot Encoding has been applied for categorical features. For high-cardinality features not handled by aggregation (e.g., `AccountId` if not aggregated), alternative encoding methods like Target Encoding or Frequency Encoding might be explored if direct OHE leads to too many sparse features.
- **Outlier Treatment:** While outliers were visualized, a decision on how to treat them (e.g., winsorization, removal, or using models robust to outliers like tree-based models) should be made based on further experimentation.
- **Temporal Aspects:** For the e-commerce data, the engineered temporal features are valuable. For both datasets, consider time-based splitting of data (e.g., training on older data, testing on newer data) to ensure the model generalizes well to future, unseen transactions, as fraud patterns can evolve over time.
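
The time-based splitting and evaluation recommendations above can be sketched together as follows. This is a minimal example on synthetic data, not part of the project's pipeline: `time_ordered_split` is a hypothetical helper, and the 1% fraud rate and two-day time span are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, average_precision_score, recall_score

def time_ordered_split(df, time_col, test_fraction=0.3):
    """Hypothetical helper: train on the earliest rows, test on the latest."""
    ordered = df.sort_values(time_col)
    cut = int(len(ordered) * (1 - test_fraction))
    return ordered.iloc[:cut], ordered.iloc[cut:]

# Synthetic transactions with roughly 1% fraud over a two-day span (in seconds).
rng = np.random.default_rng(42)
n = 10_000
demo = pd.DataFrame({
    "Time": rng.uniform(0, 172_800, size=n),
    "Class": (rng.random(n) < 0.01).astype(int),
})
train, test = time_ordered_split(demo, "Time")

# Baseline "model" that labels every test transaction as legitimate.
y_test = test["Class"].to_numpy()
always_legit = np.zeros_like(y_test)

acc = accuracy_score(y_test, always_legit)
rec = recall_score(y_test, always_legit, zero_division=0)
# With a constant score, average precision equals the fraud rate in the test set.
ap = average_precision_score(y_test, np.zeros(len(y_test)))

print(f"Accuracy: {acc:.3f}, Recall: {rec:.3f}, Average precision: {ap:.3f}")
```

The baseline scores high on accuracy yet has zero recall, and its average precision equals the test-set fraud rate. That fraud rate is a useful floor: a real model should beat it on the precision-recall curve before its accuracy is considered at all.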