# LAB | Imbalanced

**Load the data**

In this challenge, we will be working with a credit card fraud dataset.

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv

Metadata

- **distance_from_home:** Distance from home to where the transaction took place.
- **distance_from_last_transaction:** Distance from where the previous transaction took place.
- **ratio_to_median_purchase_price:** Ratio of the transaction's purchase price to the median purchase price.
- **repeat_retailer:** Whether the transaction was made at the same retailer.
- **used_chip:** Whether the transaction was made using the chip (credit card).
- **used_pin_number:** Whether the transaction was made using a PIN number.
- **online_order:** Whether the transaction is an online order.
- **fraud:** Whether the transaction is fraudulent. **0=legit** - **1=fraud**


In [1]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [2]:
fraud = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv")
fraud.head()

| | distance_from_home | distance_from_last_transaction | ratio_to_median_purchase_price | repeat_retailer | used_chip | used_pin_number | online_order | fraud |
|---|---|---|---|---|---|---|---|---|
| 0 | 57.877857 | 0.31114 | 1.94594 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 10.829943 | 0.175592 | 1.294219 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 5.091079 | 0.805153 | 0.427715 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 2.247564 | 5.600044 | 0.362663 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 4 | 44.190936 | 0.566486 | 2.222767 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |


**Steps:**

- **1.** What is the distribution of our target variable? Can we say we're dealing with an imbalanced dataset?
- **2.** Train a LogisticRegression.
- **3.** Evaluate your model. Take class importance into consideration, and evaluate it by selecting the correct metric.
- **4.** Run **Oversample** in order to balance our target variable and repeat the steps above, now with balanced data. Does it improve the performance of our model? 
- **5.** Now, run **Undersample** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?
- **6.** Finally, run **SMOTE** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model? 

In [3]:
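# STEP 1: Check balance of data
# Note: fraud count / legit count is a class ratio, not the share of all rows
class_counts = fraud['fraud'].value_counts()
print(f"Fraudulent transactions amount to about {round(class_counts[1] / class_counts[0] * 100, 1)} % of the legitimate ones")

Fraudulent transactions amount to about 9.6 % of the legitimate ones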
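
The calculation above divides the fraud count by the legitimate count, which gives a ratio between the two classes rather than the share of all rows. As a minimal sketch of the difference, assuming a small toy target column (`y_toy` and its values are made up for illustration and are not the lab data), pandas can report both:

```python
import pandas as pd

# Toy target column (hypothetical values): 8 legitimate rows, 2 fraud rows
y_toy = pd.Series([0.0] * 8 + [1.0] * 2, name="fraud")

counts = y_toy.value_counts()                # absolute counts per class
shares = y_toy.value_counts(normalize=True)  # fraction of all rows per class

fraud_to_legit = counts[1.0] / counts[0.0]   # class ratio: 2 / 8
fraud_share = shares[1.0]                    # share of the data: 2 / 10

print(f"Fraud-to-legit ratio: {fraud_to_legit:.1%}")
print(f"Fraud share of all rows: {fraud_share:.1%}")
```

On the lab data, the same idea would be `fraud['fraud'].value_counts(normalize=True)`, which answers the "distribution of the target" question directly.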


In [4]:
# STEP 2: Train Logistic Regression Model
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Define X and y
X = fraud.drop(columns = 'fraud')
y = fraud['fraud']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)

# Train model
model = LogisticRegression()
model.fit(X_train, y_train)

In [5]:
# STEP 3: Evaluate Model

# Predict on test data
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {round(accuracy, 2)}")

Accuracy: 0.96
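
Step 3 asks for a metric that reflects class importance, and accuracy alone can look high on imbalanced data even when no fraud is caught. The sketch below illustrates this with toy labels (made up for illustration, not the lab data) and a "model" that always predicts the majority class:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Toy labels (hypothetical): 90 legitimate (0) and 10 fraud (1) transactions
y_true = [0] * 90 + [1] * 10

# A "model" that always predicts the majority class
y_majority = [0] * 100

acc = accuracy_score(y_true, y_majority)
rec = recall_score(y_true, y_majority, zero_division=0)      # share of frauds that were caught
prec = precision_score(y_true, y_majority, zero_division=0)  # undefined here, reported as 0

print(f"Accuracy: {acc:.2f}, recall: {rec:.2f}, precision: {prec:.2f}")
print(confusion_matrix(y_true, y_majority))  # rows = true class, columns = predicted class
```

For the lab model, one option would be `recall_score(y_test, y_pred)` or `classification_report(y_test, y_pred)` from `sklearn.metrics`, since missing a fraud is usually the costlier error.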


In [6]:
# STEP 4: Oversample fraud class

from imblearn.over_sampling import RandomOverSampler

# Define oversampling strategy
oversample = RandomOverSampler(sampling_strategy = 0.5)

# Fit and apply the transform
X_over, y_over = oversample.fit_resample(X, y)

# Split data into training and testing sets
X_train_over, X_test_over, y_train_over, y_test_over = train_test_split(X_over, y_over, test_size = 0.3, random_state = 42)

# Train model
model_over = LogisticRegression()
model_over.fit(X_train_over, y_train_over)

# Predict on test data
y_pred_over = model_over.predict(X_test_over)

# Calculate accuracy
accuracy_over = accuracy_score(y_test_over, y_pred_over)
print(f"Accuracy (oversampled): {round(accuracy_over, 2)}")

Accuracy (oversampled): 0.94


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
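
The warning above suggests either raising `max_iter` or scaling the data. A minimal sketch of doing both, assuming a synthetic stand-in for the lab data (`X_demo`, `y_demo` and the feature scales are invented for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the lab data (assumption): 7 features, about 10 % positives
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=7, n_informative=4,
    weights=[0.9, 0.1], random_state=42,
)
# Put the features on very different scales, which tends to slow lbfgs down
X_demo = X_demo * [1, 10, 100, 1000, 1, 1, 1]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# The scaler is fitted on the training split only, then reused on the test split
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_tr, y_tr)

n_iter = clf.named_steps["logisticregression"].n_iter_[0]
print(f"lbfgs iterations used: {n_iter} of 1000 allowed")
```

Putting the scaler inside a pipeline keeps the test split out of the scaling statistics. Whether this removes the warning on the lab data would need to be checked by rerunning the cells.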


In [7]:
# STEP 5: Undersample non-fraud class

from imblearn.under_sampling import RandomUnderSampler

# Define undersampling strategy
undersample = RandomUnderSampler(sampling_strategy = 0.5)

# Fit and apply the transform
X_under, y_under = undersample.fit_resample(X, y)

# Split data into training and testing sets
X_train_under, X_test_under, y_train_under, y_test_under = train_test_split(X_under, y_under, test_size = 0.3, random_state = 42)

# Train model
model_under = LogisticRegression()
model_under.fit(X_train_under, y_train_under)

# Predict on test data
y_pred_under = model_under.predict(X_test_under)

# Calculate accuracy
accuracy_under = accuracy_score(y_test_under, y_pred_under)
print(f"Accuracy (undersampled): {round(accuracy_under, 2)}")

Accuracy (undersampled): 0.93


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
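
In the resampling cells, `fit_resample` is applied to `X` and `y` before `train_test_split`, so the test sets are rebalanced too and their scores are not measured on the original class distribution. One common alternative is to split first and resample only the training split. The sketch below shows that order with a hand-rolled undersampler in plain Python; `undersample_majority` is a hypothetical helper standing in for `RandomUnderSampler`, and the toy data is invented. With a float `sampling_strategy` such as 0.5, the target is n_minority / n_majority = 0.5 after resampling.

```python
import random

def undersample_majority(rows, labels, ratio=0.5, seed=42):
    """Keep all minority rows (label 1) and a random subset of majority rows (label 0)
    so that n_minority / n_majority ends up at `ratio`, where enough rows exist."""
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(labels) if label == 1]
    majority = [i for i, label in enumerate(labels) if label == 0]
    n_keep = min(len(majority), int(len(minority) / ratio))
    kept = sorted(minority + rng.sample(majority, n_keep))
    return [rows[i] for i in kept], [labels[i] for i in kept]

# Toy data (hypothetical, not the lab data): 100 rows, every 10th one is fraud
rows = [[float(i)] for i in range(100)]
labels = [1 if i % 10 == 0 else 0 for i in range(100)]

# 1) Split first, so the test rows keep the original class balance
indices = list(range(100))
random.Random(0).shuffle(indices)
train_idx, test_idx = indices[:70], indices[70:]
X_train_toy = [rows[i] for i in train_idx]
y_train_toy = [labels[i] for i in train_idx]
y_test_toy = [labels[i] for i in test_idx]

# 2) Resample the training split only
X_train_res, y_train_res = undersample_majority(X_train_toy, y_train_toy, ratio=0.5)

print("train before:", y_train_toy.count(1), "fraud /", y_train_toy.count(0), "legit")
print("train after: ", y_train_res.count(1), "fraud /", y_train_res.count(0), "legit")
print("test (untouched):", y_test_toy.count(1), "fraud /", y_test_toy.count(0), "legit")
```

In the lab itself, the equivalent would be calling `undersample.fit_resample(X_train, y_train)` after the split and evaluating on the untouched `X_test`, `y_test`. The same order applies to the oversampling and SMOTE steps.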


In [8]:
# STEP 6: SMOTE

from imblearn.over_sampling import SMOTE

# Define oversampling strategy
smote = SMOTE(sampling_strategy = 0.5)


# Fit and apply the transform
X_smote, y_smote = smote.fit_resample(X, y)

# Split data into training and testing sets
X_train_smote, X_test_smote, y_train_smote, y_test_smote = train_test_split(X_smote, y_smote, test_size = 0.3, random_state = 42)

# Train model
model_smote = LogisticRegression()
model_smote.fit(X_train_smote, y_train_smote)

# Predict on test data
y_pred_smote = model_smote.predict(X_test_smote)

# Calculate accuracy
accuracy_smote = accuracy_score(y_test_smote, y_pred_smote)
print(f"Accuracy (SMOTE): {round(accuracy_smote, 2)}")

Accuracy (SMOTE): 0.94


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
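
Unlike random oversampling, which duplicates existing minority rows, SMOTE creates new minority points by interpolating between a minority point and one of its nearest minority neighbours. The following is a simplified sketch of that interpolation step only, not imblearn's implementation; `smote_like_sample` and the toy points are hypothetical.

```python
import math
import random

def smote_like_sample(minority, k=2, seed=42):
    """Create one synthetic point on the segment between a random minority
    point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    base = rng.choice(minority)
    # k nearest neighbours of `base` among the other minority points (Euclidean distance)
    others = [p for p in minority if p is not base]
    neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
    neighbour = rng.choice(neighbours)
    lam = rng.random()  # interpolation weight in [0, 1)
    synthetic = [b + lam * (n - b) for b, n in zip(base, neighbour)]
    return synthetic, base, neighbour

# Toy minority-class points (hypothetical values)
minority_points = [[1.0, 1.0], [1.5, 1.2], [2.0, 2.0], [1.2, 1.8]]

synthetic, base, neighbour = smote_like_sample(minority_points)
print("base:", base, "neighbour:", neighbour, "synthetic:", synthetic)
```

Because each synthetic point is built from two real ones, resampling before the split can place a synthetic point in the test set that was interpolated from training rows, which is another reason to resample the training split only.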
