# Baseline Model

## Table of Contents
1. [Model Choice](#model-choice)
2. [Feature Selection](#feature-selection)
3. [Implementation](#implementation)
4. [Evaluation](#evaluation)


In [6]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.utils.class_weight import compute_class_weight
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split


## Model Choice

**Logistic regression** was chosen as the baseline model for this binary classification task because it is simple, interpretable, and fast to train. As a linear model, its coefficients show directly how each feature shifts the predicted log-odds of the target variable.

Because the dataset is heavily imbalanced, a **cost-sensitive approach** was also used. Each class is assigned a weight so that misclassifying a minority-class sample costs more than misclassifying a majority-class one. This pushes the model toward higher recall on the minority class, usually at the expense of precision.
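As a rough illustration of how the class weights are derived, the sketch below reproduces scikit-learn's `"balanced"` heuristic, `n_samples / (n_classes * count_per_class)`, by hand. The toy labels (`y_toy`) are made up for the example and are not the project's data.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels (illustrative only): 8 negatives and 2 positives
y_toy = np.array([0] * 8 + [1] * 2)

classes = np.unique(y_toy)
counts = np.bincount(y_toy)

# "balanced" heuristic: n_samples / (n_classes * count_per_class)
manual_weights = len(y_toy) / (len(classes) * counts)

# The same weights as computed by scikit-learn
sklearn_weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

print(dict(zip(classes.tolist(), manual_weights.tolist())))  # {0: 0.625, 1: 2.5}
```

The rarer a class is, the larger its weight, which is why an extremely rare positive class ends up with a weight in the thousands.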

This combination of logistic regression and cost-sensitive learning offers a straightforward yet robust approach to tackle the classification problem while accounting for the imbalanced nature of the data. It serves as a solid baseline to compare the performance of more complex models.

**Key reasons for choosing this baseline:**

- **Simplicity:** Logistic regression is easy to implement and understand.
- **Interpretability:** With standardized inputs, the model's coefficients indicate the direction and relative strength of each feature's effect.
- **Efficiency:** Relatively fast training and prediction times.
- **Handles imbalance:** The cost-sensitive approach directly addresses the class imbalance issue.

By establishing a strong baseline, we can effectively evaluate the performance gains of more sophisticated models and make informed decisions about model selection.
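The interpretability point above can be sketched on synthetic data. The column names (`signal`, `noise`) and the data itself are hypothetical; the pipeline mirrors the scaler-plus-logistic-regression setup used in this notebook, but uses `class_weight="balanced"` for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative only): the label depends on "signal", not on "noise"
rng = np.random.default_rng(0)
n_rows = 2000
X_demo = pd.DataFrame({
    "signal": rng.normal(size=n_rows),
    "noise": rng.normal(size=n_rows),
})
y_demo = (X_demo["signal"] + 0.3 * rng.normal(size=n_rows) > 1.0).astype(int)

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("logisticregression", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
demo_pipeline.fit(X_demo, y_demo)

# Coefficients on standardized features, ordered by absolute magnitude
coefs = pd.Series(
    demo_pipeline.named_steps["logisticregression"].coef_[0],
    index=X_demo.columns,
).sort_values(key=np.abs, ascending=False)
print(coefs)
```

Because the features are standardized inside the pipeline, the coefficient magnitudes are comparable across features; here `signal` should dominate.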


## Feature Selection

All features are used, since the data has already been profiled and dropping features could hurt performance.

In [3]:
# Load the cleaned dataset
df = pd.read_csv('clean_data.csv')

# Separate the features from the target (every column except 'is_fraud' is a feature)
X = df.drop('is_fraud', axis=1)
y = df['is_fraud']

# Shuffle the rows reproducibly
X, y = shuffle(X, y, random_state=42)


## Implementation




In [5]:
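# Assign class weights for cost-sensitive learning
# ("balanced" = inversely proportional to class frequency)
classes = np.unique(y)
class_weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
# Key each weight by its actual class label rather than by its position
class_weight_dict = {int(label): weight for label, weight in zip(classes, class_weights)}
print("Weights for each class", class_weight_dict)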

Weights for each class {0: 0.5000945, 1: 2646.0026455026455}


In [8]:
# Split the data into training (80%) and test (20%) sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
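The split above is not stratified, so the number of positive rows that land in the test set depends on the random seed. As a possible alternative (not what this notebook ran), the sketch below shows the `stratify` argument of `train_test_split` on synthetic labels; `X_syn` and `y_syn` are made-up stand-ins for the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in (illustrative only): 10,000 rows, 20 of them positive
y_syn = np.zeros(10_000, dtype=int)
y_syn[:20] = 1
X_syn = np.arange(10_000).reshape(-1, 1)

# stratify=y_syn keeps the class ratio the same in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

print(y_tr.sum(), y_te.sum())  # 16 4
```

With so few positives, stratifying makes the test-set metrics less sensitive to how the split happens to fall.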

In [9]:
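# Initialize and train the baseline model.
# The scaler sits inside the pipeline, so it is refit on the training part of each CV fold;
# the regularization strength C is tuned with 5-fold cross-validation, scored on F1.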
param_grid = {
    "logisticregression__C": [0.01, 0.1, 1.0]
}

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('logisticregression', LogisticRegression(class_weight=class_weight_dict, max_iter=1000))
])

grid_search = GridSearchCV(pipeline, param_grid, scoring='f1', cv=5)
grid_search.fit(X_train, y_train)

best_model = grid_search.best_estimator_
best_params = grid_search.best_params_

print(best_model)
print("Best Parameters", best_params)


Pipeline(steps=[('scaler', StandardScaler()),
                ('logisticregression',
                 LogisticRegression(class_weight={0: 0.5000945,
                                                  1: 2646.0026455026455},
                                    max_iter=1000))])
Best Parameters {'logisticregression__C': 1.0}
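The selected value `C = 1.0` sits at the upper edge of the grid, so it can be useful to look at the scores of all candidates and not only the best one. The sketch below shows one way to do that through `cv_results_`. It runs on synthetic data from `make_classification`, and `search` and `cv_table` are hypothetical names; in the notebook the same columns would be read from `grid_search.cv_results_`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data (illustrative only, roughly 5% positives)
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=8, weights=[0.95, 0.05], random_state=0
)

search = GridSearchCV(
    Pipeline([
        ("scaler", StandardScaler()),
        ("logisticregression", LogisticRegression(class_weight="balanced", max_iter=1000)),
    ]),
    {"logisticregression__C": [0.01, 0.1, 1.0]},
    scoring="f1",
    cv=5,
)
search.fit(X_syn, y_syn)

# One row per candidate value of C, with its mean and spread across the folds
cv_table = pd.DataFrame(search.cv_results_)[
    ["param_logisticregression__C", "mean_test_score", "std_test_score", "rank_test_score"]
]
print(cv_table.sort_values("rank_test_score"))
```

If the mean score is still rising at the largest `C`, extending the grid upward is a reasonable next step.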


## Evaluation

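**F1 score** for the positive class was used as the primary evaluation metric. It is the harmonic mean of precision and recall, so unlike accuracy it is not inflated by the large majority class in an imbalanced dataset.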



In [10]:
# Evaluate the baseline model on the held-out test set,
# using the F1 score of the positive class as the evaluation metric
predictions = best_model.predict(X_test)
f1_score(y_test, predictions)

0.004407173899736793

In [11]:
y_pred = best_model.predict(X_test)

# Generate confusion matrix
cm = confusion_matrix(y_test, y_pred)

In [12]:
# Create pandas DataFrame from confusion matrix
cm_df = pd.DataFrame(cm, index=best_model.classes_, columns=best_model.classes_)

# Print or display the confusion matrix DataFrame
print(cm_df)

# Generate classification report
report = classification_report(y_test, y_pred, output_dict=True)

# Print classification report in a more readable format (optional)
print(pd.DataFrame(report).transpose())

        0      1
0  183737  16260
1       5     36
              precision    recall  f1-score       support
0              0.999973  0.918699  0.957614  199997.00000
1              0.002209  0.878049  0.004407      41.00000
accuracy       0.918690  0.918690  0.918690       0.91869
macro avg      0.501091  0.898374  0.481011  200038.00000
weighted avg   0.999768  0.918690  0.957419  200038.00000
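The reported F1 score can be checked by hand from the confusion matrix above, where rows are actual classes and columns are predicted classes. The model catches 36 of the 41 positive cases but also flags 16,260 negatives, which is why recall is high while precision and F1 are very low. The variable names in this short sketch are chosen for the example only.

```python
# Counts copied from the confusion matrix above (rows = actual, columns = predicted)
tn, fp = 183737, 16260
fn, tp = 5, 36

precision = tp / (tp + fp)  # share of flagged rows that are truly positive
recall = tp / (tp + fn)     # share of truly positive rows that were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.6f} recall={recall:.6f} f1={f1:.6f}")
# precision=0.002209 recall=0.878049 f1=0.004407
```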
