# LAB | Imbalanced

**Load the data**

In this challenge, we will be working with Credit Card Fraud dataset.

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv

Metadata

- **distance_from_home:** the distance from home where the transaction happened.
- **distance_from_last_transaction:** the distance from last transaction happened.
- **ratio_to_median_purchase_price:** Ratio of purchased price transaction to median purchase price.
- **repeat_retailer:** Is the transaction happened from same retailer.
- **used_chip:** Is the transaction through chip (credit card).
- **used_pin_number:** Is the transaction happened by using PIN number.
- **online_order:** Is the transaction an online order.
- **fraud:** Is the transaction fraudulent. **0=legit** -  **1=fraud**


In [19]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.utils import resample
from imblearn.over_sampling import SMOTE

In [2]:
fraud = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv")
fraud.head()

Unnamed: 0,distance_from_home,distance_from_last_transaction,ratio_to_median_purchase_price,repeat_retailer,used_chip,used_pin_number,online_order,fraud
0,57.877857,0.31114,1.94594,1.0,1.0,0.0,0.0,0.0
1,10.829943,0.175592,1.294219,1.0,0.0,0.0,0.0,0.0
2,5.091079,0.805153,0.427715,1.0,0.0,0.0,1.0,0.0
3,2.247564,5.600044,0.362663,1.0,1.0,0.0,1.0,0.0
4,44.190936,0.566486,2.222767,1.0,1.0,0.0,1.0,0.0


**Steps:**

- **1.** What is the distribution of our target variable? Can we say we're dealing with an imbalanced dataset?
- **2.** Train a LogisticRegression.
- **3.** Evaluate your model. Take in consideration class importance, and evaluate it by selection the correct metric.
- **4.** Run **Oversample** in order to balance our target variable and repeat the steps above, now with balanced data. Does it improve the performance of our model? 
- **5.** Now, run **Undersample** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?
- **6.** Finally, run **SMOTE** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model? 

In [4]:
# 1. What is the distribution of our target variable? Can we say we're dealing with an imbalanced dataset? Yes

fraud["fraud"].value_counts()

0.0    912597
1.0     87403
Name: fraud, dtype: int64

In [32]:
# 2. Train a LogisticRegression.
features = fraud.drop(columns = ["fraud"])
target = fraud["fraud"]

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.20, random_state=0)

log = LogisticRegression()
log.fit(X_train, y_train)

pred = log.predict(X_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [11]:
# 3. Evaluate your model. Take in consideration class importance, and evaluate it by selection the correct metric.
pred = log.predict(X_test)


print("Accuracy:", accuracy_score(y_test, pred))
print("Classification Report:\n", classification_report(y_test, pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, pred))


print("Logistic Regression Model Accuracy:", log.score(X_test, y_test))

Accuracy: 0.957715
Classification Report:
               precision    recall  f1-score   support

         0.0       0.96      0.99      0.98    182615
         1.0       0.89      0.59      0.71     17385

    accuracy                           0.96    200000
   macro avg       0.92      0.79      0.84    200000
weighted avg       0.96      0.96      0.95    200000

Confusion Matrix:
 [[181303   1312]
 [  7145  10240]]
Logistic Regression Model Accuracy: 0.957715


In [16]:
# 4. Run **Oversample** in order to balance our target variable and repeat the steps above, now with balanced data. Does it improve the performance of our model? #In general, the performance is worse


train = pd.DataFrame(X_train, columns = X_train.columns)
train["fraud"] = y_train.values
frauded = train[train["fraud"] == 1]
no_frauded = train[train["fraud"] == 0]

fraud_oversampled = resample(frauded, 
                                    replace=True, 
                                    n_samples = len(no_frauded),
                                    random_state=0)

train_over = pd.concat([fraud_oversampled, no_frauded])

X_train_over = train_over.drop(columns = ["fraud"])
y_train_over = train_over["fraud"]
log_reg = LogisticRegression()
log_reg.fit(X_train_over, y_train_over)
pred = log_reg.predict(X_test)

print("Accuracy:", accuracy_score(y_test, pred))
print("Classification Report:\n", classification_report(y_test, pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, pred))


print("Logistic Regression Model Accuracy:", log.score(X_test, y_test))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Accuracy: 0.930575
Classification Report:
               precision    recall  f1-score   support

         0.0       0.99      0.93      0.96    182615
         1.0       0.56      0.95      0.70     17385

    accuracy                           0.93    200000
   macro avg       0.78      0.94      0.83    200000
weighted avg       0.96      0.93      0.94    200000

Confusion Matrix:
 [[169642  12973]
 [   912  16473]]
Logistic Regression Model Accuracy: 0.957715


In [None]:
# 5. Now, run **Undersample** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model? # This one is better than the oversampling but still worse than the original
train = pd.DataFrame(X_train, columns = X_train.columns)
train["fraud"] = y_train.values
frauded = train[train["fraud"] == 1]
no_frauded = train[train["fraud"] == 0]

fraud_Undersample = resample(no_frauded, 
                                    replace=True, 
                                    n_samples = len(no_frauded),
                                    random_state=0)

train_under = pd.concat([frauded, fraud_Undersample])

X_train_under = train_under.drop(columns = ["fraud"])
y_train_under = train_under["fraud"]
log_reg = LogisticRegression()
log_reg.fit(X_train_under, y_train_under)
pred = log_reg.predict(X_test)

print("Accuracy:", accuracy_score(y_test, pred))
print("Classification Report:\n", classification_report(y_test, pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, pred))


print("Logistic Regression Model Accuracy:", log.score(X_test, y_test))


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Accuracy: 0.95596
Classification Report:
               precision    recall  f1-score   support

         0.0       0.96      0.99      0.98    182615
         1.0       0.89      0.57      0.69     17385

    accuracy                           0.96    200000
   macro avg       0.92      0.78      0.83    200000
weighted avg       0.95      0.96      0.95    200000

Confusion Matrix:
 [[181360   1255]
 [  7553   9832]]
Logistic Regression Model Accuracy: 0.957715


In [33]:
# 6. Finally, run **SMOTE** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model? This is the second worst model


sm = SMOTE(random_state = 1,sampling_strategy=1.0)
X_train_sm,y_train_sm = sm.fit_resample(X_train,y_train)
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train_sm, y_train_sm)
pred = log_reg.predict(X_test)

print("Accuracy:", accuracy_score(y_test, pred))
print("Classification Report:\n", classification_report(y_test, pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, pred))


print("Logistic Regression Model Accuracy:", log.score(X_test, y_test))

Accuracy: 0.93452
Classification Report:
               precision    recall  f1-score   support

         0.0       0.99      0.93      0.96    182615
         1.0       0.57      0.95      0.72     17385

    accuracy                           0.93    200000
   macro avg       0.78      0.94      0.84    200000
weighted avg       0.96      0.93      0.94    200000

Confusion Matrix:
 [[170442  12173]
 [   923  16462]]
Logistic Regression Model Accuracy: 0.957715
