---
# **Project Goal**

The primary goal of this project is to systematically evaluate the impact of various preprocessing techniques and machine learning models on the performance of a predictive model. Specifically, I aim to:

1. **Identify the Best Scaling Method**: Determine which scaling technique (e.g., StandardScaler, MinMaxScaler, or no scaling) optimally prepares the data for model training.
   
2. **Determine the Most Effective Class Imbalance Handling Strategy**: Explore different methods for managing class imbalance, such as using SMOTE (Synthetic Minority Over-sampling Technique), class weights, or no imbalance handling, to enhance the model’s ability to predict minority class instances.
   
3. **Evaluate and Compare Machine Learning Models**: Assess the performance of various models, including Logistic Regression, Random Forest, and Hist Gradient Boosting, to identify the most suitable model for this dataset.

By keeping all other parameters constant while varying one component at a time, this approach will enable a clear understanding of how each preprocessing step and model choice affects the overall performance. This systematic evaluation will not only help in selecting the best-performing model but also provide insights into the role of each method in the machine learning pipeline.
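
The one-component-at-a-time idea can be sketched roughly as below. This is only an illustration on synthetic data from `make_classification`, not the credit card dataset; the `SEED` constant, the `mean_f1` helper, and the `variants` dictionary are hypothetical names introduced for this example. Only the scaler changes between runs, while the model and the cross-validation folds stay fixed.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

SEED = 0  # hypothetical fixed seed so every variant sees the same folds

# Synthetic imbalanced stand-in for the real data (roughly 5% positives).
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=SEED
)

def mean_f1(steps):
    """Cross-validated F1 of the positive class for a pipeline built from `steps`."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
    return cross_val_score(Pipeline(steps), X, y, cv=cv, scoring="f1").mean()

# Only the scaling step differs; everything else is held constant.
variants = {
    "no scaling": [],
    "standard": [("scaler", StandardScaler())],
    "minmax": [("scaler", MinMaxScaler())],
}
results = {
    name: mean_f1(pre + [("model", LogisticRegression(max_iter=1000))])
    for name, pre in variants.items()
}
for name, score in results.items():
    print(f"{name:>10}: F1 = {score:.3f}")
```

Fixing the seed for both the data split and the folds is what makes the comparison attributable to the single component that changed.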

---

In [None]:
import pandas as pd 
import numpy as np 

import matplotlib.pyplot as plt 
import seaborn as sns

from sklearn.preprocessing import StandardScaler,MinMaxScaler
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import cross_val_score , KFold, train_test_split
from sklearn.linear_model import LogisticRegression 
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import HistGradientBoostingClassifier

from sklearn.tree import DecisionTreeClassifier


# **Data Loading**

In [None]:
df=pd.read_csv("/kaggle/input/creditcardfraud/creditcard.csv")
df.head()

# **Exploratory Data Analysis (EDA)**

In [None]:
df.shape  # 284,807 samples and 31 columns (30 features plus the Class target)

In [None]:
df.info()  # no missing values, and all features are numerical

In [None]:
df.describe().iloc[:,:10]  # even the first 10 columns show a wide variety of feature ranges, so scaling will be needed

In [None]:
df.nunique()  # all features have a variety of unique values, no transformation to object type is needed.

In [None]:
sns.catplot(x="Class",data=df,kind="count")
plt.show() 
# The data is heavily imbalanced: class 1 (fraud) is very rare compared with class 0, so we will need techniques to handle this.
# Recommended order for handling imbalanced data:
# 1. Start by setting class weights in your models to handle imbalance without altering the data.
# 2. If class weights alone aren't sufficient, apply SMOTE to generate synthetic samples for the minority class.
# 3. Evaluate model performance using appropriate metrics like F1-score and ROC AUC throughout the process.
# 4. Consider using ensemble methods like Balanced Random Forest if class weights and SMOTE do not yield satisfactory results.
# 5. Use stratified cross-validation to maintain class distribution across folds for reliable evaluation.
# 6. Finally, fine-tune the model by adjusting the decision threshold based on your evaluation metrics to optimize the trade-off between precision and recall.
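
Step 6 above (adjusting the decision threshold) could look roughly like the sketch below. It uses synthetic data and a held-out validation split rather than the test set, so the threshold is not tuned on the data used for the final report. The variable names and the choice of F1 as the selection criterion are assumptions for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the real dataset.
X, y = make_classification(
    n_samples=3000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=0
)

model = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)
proba = model.predict_proba(X_val)[:, 1]  # probability of the positive class

precision, recall, thresholds = precision_recall_curve(y_val, proba)
# The final precision/recall pair has no matching threshold, so drop it.
f1 = 2 * precision[:-1] * recall[:-1] / np.maximum(precision[:-1] + recall[:-1], 1e-12)

best = int(np.argmax(f1))
best_threshold = thresholds[best]
y_pred = (proba >= best_threshold).astype(int)  # predictions at the tuned threshold
print(f"best threshold = {best_threshold:.3f}, F1 = {f1[best]:.3f}")
```

A different criterion (for example, the highest recall subject to a minimum precision) can be swapped in by changing how `best` is chosen.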


In [None]:
plt.figure(figsize=(20,20))
sns.heatmap(df.corr(),annot=True,fmt="0.02f",cmap="Blues")
plt.show()
# 'Amount' and 'Time' columns have high correlation with other features, which could introduce multicollinearity and affect model performance.
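
A heatmap with this many cells is hard to read by eye, so a ranked list of correlations for a single column can help back up a claim like the one above. The sketch below uses a small synthetic frame; the `strongest_correlations` helper and the column names are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
a = rng.normal(size=n)
demo = pd.DataFrame({
    "a": a,
    "b": rng.normal(size=n),                            # independent of everything else
    "amount": 2.0 * a + rng.normal(scale=0.5, size=n),  # built to correlate with "a"
})

def strongest_correlations(frame, column, top=5):
    """Correlations of `column` with every other column, ordered by absolute value."""
    corr = frame.corr()[column].drop(column)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order).head(top)

ranked = strongest_correlations(demo, "amount")
print(ranked)
```

On the real data the same helper could be called as `strongest_correlations(df, "Amount")` before the columns are dropped.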


## **Insights**
* We need to perform scaling.
* We need to handle the class imbalance.
* We need to drop the Amount and Time features.

In [None]:
df.drop(columns=["Amount", "Time"], inplace=True)

# **Data Preprocessing**

In [None]:
x_train, x_test, y_train, y_test = train_test_split(
    df.drop("Class", axis=1),
    df[["Class"]],
    shuffle=True,
    stratify=df[["Class"]],
    test_size=0.2,
)
x_train.shape,x_test.shape,y_train.shape,y_test.shape

In [None]:
# from sklearn.model_selection import StratifiedKFold
# kf = StratifiedKFold(n_splits=10, shuffle=True, random_state=8)
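
For reference, here is a small sketch of what the commented-out `StratifiedKFold` would provide. The labels are synthetic (2% positives) and the feature matrix is a placeholder, since only the labels drive the stratification. Each test fold ends up with the same share of the minority class.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic labels: 1000 samples, 20 of them positive (2%).
y = np.array([0] * 980 + [1] * 20)
X = np.zeros((len(y), 1))  # placeholder features; the split depends only on y

kf = StratifiedKFold(n_splits=10, shuffle=True, random_state=8)

# Count the positives that land in each test fold.
positives_per_fold = [int(y[test_idx].sum()) for _, test_idx in kf.split(X, y)]
print(positives_per_fold)  # every fold holds 2 of the 20 positives
```

With a plain `KFold`, some folds could contain very few positives or none at all, which makes minority-class metrics unstable.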


## **1. Baseline Pipeline**
   - **Components**: No Scaling, No Imbalance Handling, Simple Model (e.g., Logistic Regression)
   - **Purpose**: Establish a baseline performance to compare with other pipelines.


In [None]:
# Define the baseline pipeline
base_pipeline = Pipeline([
    ("model", LogisticRegression())
])

# Evaluate the model
base_pipeline.fit(x_train,y_train.values.reshape(-1)) 
y_pred=base_pipeline.predict(x_test)

print(classification_report(y_test,y_pred))




## **2. Scaling Pipelines**
   - **Pipeline 1: Standard Scaling**
     - **Components**: StandardScaler for scaling, No Imbalance Handling, Simple Model (e.g., Logistic Regression)
     - **Purpose**: Evaluate the effect of standard scaling on model performance.
   - **Pipeline 2: MinMax Scaling**
     - **Components**: MinMaxScaler for scaling, No Imbalance Handling, Simple Model (e.g., Logistic Regression)
     - **Purpose**: Evaluate the effect of min-max scaling on model performance.

In [None]:
# Define the standard-scaling pipeline
standerd_scaler_pipeline = Pipeline([
    ("standard scaling",StandardScaler()),
    ("model", LogisticRegression())
])

# Evaluate the model
standerd_scaler_pipeline.fit(x_train,y_train.values.reshape(-1)) 
y_pred=standerd_scaler_pipeline.predict(x_test)

print(classification_report(y_test,y_pred))



In [None]:
# Define the min-max scaling pipeline
minmax_scaler_pipeline = Pipeline([
    ("minmax scaling",MinMaxScaler()),
    ("model", LogisticRegression())
])

# Evaluate the model
minmax_scaler_pipeline.fit(x_train,y_train.values.reshape(-1)) 
y_pred=minmax_scaler_pipeline.predict(x_test)

print(classification_report(y_test,y_pred))
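
To make the difference between the two scalers concrete, here is a tiny numeric sketch on a made-up array. `StandardScaler` centers each column to mean 0 and unit standard deviation, while `MinMaxScaler` maps each column onto the range [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two made-up columns on very different scales; the second has an outlier.
X = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 1000.0]])

standardized = StandardScaler().fit_transform(X)
rescaled = MinMaxScaler().fit_transform(X)

# StandardScaler: per-column mean of 0 and standard deviation of 1.
print(standardized.mean(axis=0), standardized.std(axis=0))
# MinMaxScaler: per-column minimum of 0 and maximum of 1.
print(rescaled.min(axis=0), rescaled.max(axis=0))
```

Note how the outlier in the second column squeezes the other min-max values toward 0, which is one reason the two scalers can behave differently on skewed features.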



## **3. Imbalance Handling Pipelines**
   - **Pipeline 3: Standard Scaling + SMOTE**
     - **Components**: StandardScaler for scaling, SMOTE for imbalance handling, Simple Model (e.g., Logistic Regression)
     - **Purpose**: Evaluate the effect of SMOTE on a scaled dataset.
   - **Pipeline 4: Standard Scaling + Class Weights**
     - **Components**: StandardScaler for scaling, Class Weights for imbalance handling, Simple Model (e.g., Logistic Regression)
     - **Purpose**: Evaluate the effect of using class weights on a scaled dataset.

In [None]:
# Define the class-weight pipeline
class_weight_pipeline = Pipeline([
    ("standard scaling",StandardScaler()),
    ("model", LogisticRegression(class_weight="balanced"))
])

# Evaluate the model
class_weight_pipeline.fit(x_train,y_train.values.reshape(-1)) 
y_pred=class_weight_pipeline.predict(x_test)

print(classification_report(y_test,y_pred))
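
As a side note on what `class_weight="balanced"` does: scikit-learn documents the weights as `n_samples / (n_classes * np.bincount(y))`. The sketch below checks that formula against `compute_class_weight` on synthetic labels (1% positives); the label array is made up for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels: 990 negatives and 10 positives.
y = np.array([0] * 990 + [1] * 10)
classes = np.unique(y)

# The "balanced" heuristic: rarer classes receive proportionally larger weights.
manual = len(y) / (len(classes) * np.bincount(y))
from_sklearn = compute_class_weight(class_weight="balanced", classes=classes, y=y)

print(manual)        # weight for class 0, weight for class 1
print(from_sklearn)
```

Here the minority class gets a weight of 50, so each positive sample contributes to the loss about 99 times as much as a negative one.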


In [None]:
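# SMOTE is a resampler, not a transformer, so it needs imblearn's Pipeline;
# sklearn's Pipeline raises a TypeError for steps without a `transform` method.
from imblearn.pipeline import Pipeline as ImbPipeline

# Define the SMOTE pipeline (resampling is applied only while fitting)
smote_pipeline = ImbPipeline([
    ("standard scaling", StandardScaler()),
    ("smote", SMOTE()),
    ("model", LogisticRegression())
])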

# Fit the pipeline on the full training set
smote_pipeline.fit(x_train, y_train.values.reshape(-1))

# Predict on the test set
y_pred = smote_pipeline.predict(x_test)

# Print the classification report for the test set
print(classification_report(y_test, y_pred))
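
To show the idea behind SMOTE, here is a simplified NumPy sketch of its interpolation step. This is not imblearn's implementation; the `smote_like_samples` function and its parameters are hypothetical. Each synthetic point lies on the segment between a minority sample and one of its nearest minority neighbors.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=5, seed=0):
    """Create `n_new` synthetic points by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    k = min(k, n - 1)  # a point cannot be its own neighbor

    # Pairwise squared distances, with the diagonal masked out.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dist = (diffs ** 2).sum(axis=2)
    np.fill_diagonal(dist, np.inf)
    neighbors = np.argsort(dist, axis=1)[:, :k]  # k nearest neighbors per point

    base = rng.integers(0, n, size=n_new)                     # random starting points
    picked = neighbors[base, rng.integers(0, k, size=n_new)]  # one neighbor for each
    gap = rng.random((n_new, 1))                              # position along the segment
    return X_minority[base] + gap * (X_minority[picked] - X_minority[base])

minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_samples(minority, n_new=6, k=2)
print(synthetic)
```

Because the new points are interpolations, they always fall inside the region spanned by the existing minority samples. Resampling like this should only ever touch the training data, never the test set.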


## **4. Model Evaluation Pipelines**

   - **Pipeline 5: Standard Scaling + Class Balanced + Random Forest**
     - **Components**: Best Scaling Method, Best Imbalance Handling Method, Random Forest Model
     - **Purpose**: Test the performance of Random Forest using the best identified preprocessing methods.

   - **Pipeline 6: Standard Scaling + Class Balanced + AdaBoost**
     - **Components**: Best Scaling Method, Best Imbalance Handling Method, AdaBoost Model with a base `DecisionTreeClassifier` (using `class_weight="balanced"` for the base estimator)
     - **Purpose**: Test the performance of AdaBoost with a decision tree base estimator that handles class imbalance.

   - **Pipeline 7: Standard Scaling + Class Balanced + Hist Gradient Boosting Classifier**
     - **Components**: Best Scaling Method, `class_weight="balanced"` in `HistGradientBoostingClassifier`
     - **Purpose**: Test the performance of HistGradientBoostingClassifier, which natively supports class balancing, using the best identified preprocessing methods.

In [None]:
# Define the Random Forest pipeline
random_forest_pipeline = Pipeline([
    ("standard scaling",StandardScaler()),
    ("model", RandomForestClassifier(class_weight="balanced"))
])

# Evaluate the model
random_forest_pipeline.fit(x_train,y_train.values.reshape(-1)) 
y_pred=random_forest_pipeline.predict(x_test)

print(classification_report(y_test,y_pred))



In [None]:
base_estimator = DecisionTreeClassifier(class_weight="balanced")
# Define the AdaBoost pipeline
adav_boosting_pipeline = Pipeline([
    ("standard scaling",StandardScaler()),
    ("model", AdaBoostClassifier(base_estimator))
])

# Evaluate the model
adav_boosting_pipeline.fit(x_train,y_train.values.reshape(-1)) 
y_pred=adav_boosting_pipeline.predict(x_test)

print(classification_report(y_test,y_pred))



In [None]:
# Define the Hist Gradient Boosting pipeline
hist_grid_pipeline = Pipeline([
    ("standard scaling",StandardScaler()),
    ("model", HistGradientBoostingClassifier(class_weight="balanced"))
])

# Evaluate the model
hist_grid_pipeline.fit(x_train,y_train.values.reshape(-1)) 
y_pred=hist_grid_pipeline.predict(x_test)

print(classification_report(y_test,y_pred))




# **Final Evaluation Pipeline**


In [None]:
# Conclusion:
# Among the models tested, Random Forest emerged as the top performer, particularly excelling in handling class imbalance by achieving a strong balance between precision and recall for the minority class.
# The Random Forest model provided a high F1-score for the minority class, making it the most reliable choice for this specific dataset.
# HistGradientBoostingClassifier, while expected to perform well, struggled with low precision for the minority class, indicating potential overcompensation for the class imbalance. 
# Future work could involve further hyperparameter tuning for HistGradientBoosting to improve its precision, or exploring ensemble methods that combine the strengths of both Random Forest and HistGradientBoosting.