## Week 4: Imbalanced Data

## Load dataset

In [None]:
import pandas as pd
data = pd.read_csv(r'C:\Users\veronicali\Downloads\Handle-imbalanced-data-master\Handle-imbalanced-data-master\creditcard.csv')

In [None]:
data.head()

In [None]:
data.shape

In [None]:
data.Class.value_counts()

In [None]:
# check the number of 1s and 0s
count = data['Class'].value_counts()

print('Fraudulent "1" :', count[1])
print('Not Fraudulent "0":', count[0])

# print the percentage of transactions where Class == 1
print(count[1] / count.sum() * 100)

This shows that the data is highly imbalanced: only about 0.17% of the transactions are fraudulent.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
# plot the number of 1s and 0s
g = sns.countplot(x='Class',data=data)
g.set_xticklabels(['Not Fraud', 'Fraud'])
plt.show()

In [None]:
# check for null values
data.isnull().sum()

## Predictor and target variables

In [None]:
import numpy as np
x = data.iloc[:, :-1]
y = data.iloc[:, -1]

# count the 1s and 0s
one = np.where(y==1)
zero = np.where(y==0)
len(one[0]), len(zero[0])

## Train test split

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

In [None]:
Train_one = np.where(y_train==1)
Train_zero = np.where(y_train==0)
len(Train_one[0]), len(Train_zero[0])

In [None]:
Test_one = np.where(y_test==1)
Test_zero = np.where(y_test==0)
len(Test_one[0]), len(Test_zero[0])


## Fit the model using Logistic Regression

In [None]:
# create the object
from sklearn.linear_model import LogisticRegression
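model = LogisticRegression()

# note: this baseline is fit and evaluated on the full dataset,
# not on the train/test split created above
model.fit(x, y)

y_predict = model.predict(x)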

In [None]:
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, roc_auc_score, roc_curve
accuracy_score(y, y_predict)

An accuracy of 99.89% is achieved on this dataset. Let's have a look at the confusion matrix.

In [None]:
# scikit-learn expects (y_true, y_pred): rows are true classes, columns are predictions
confusion_matrix(y, y_predict)

In [None]:
# ROC AUC takes the true labels first and scores (class-1 probabilities) second
roc_auc_score(y, model.predict_proba(x)[:, 1])

In [None]:
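# use the predicted probability of the positive class, not the hard labels
fpr, tpr, thresholds = roc_curve(y, model.predict_proba(x)[:, 1])
plt.plot(fpr, tpr)
plt.show()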

In [None]:
f1_score(y, y_predict)

### What can you conclude from the confusion matrix, ROC and F1 score?
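
One way to approach this question is to compare the model with a trivial baseline that never predicts fraud. The sketch below uses made-up labels (9,983 legitimate and 17 fraudulent transactions, roughly the 0.17% fraud rate seen earlier) and a hypothetical helper `binary_metrics`; it is an illustration, not output from the credit card data.

```python
def binary_metrics(y_true, y_pred):
    """Return confusion counts and common metrics for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn,
            "accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels: 9,983 legitimate (0) and 17 fraudulent (1) transactions.
y_true = [0] * 9983 + [1] * 17
# A "model" that never predicts fraud.
y_pred = [0] * len(y_true)

baseline = binary_metrics(y_true, y_pred)
print(baseline)
```

The baseline is almost always "right" in terms of accuracy, yet it misses every fraudulent transaction, so its recall and F1 score are both zero. A high accuracy on its own therefore says little about how well the minority class is detected.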

## Resampling Techniques

In [None]:
# class count
class_count_0, class_count_1 = data['Class'].value_counts()

# divide the data by class
class_0 = data[data['Class'] == 0]
class_1 = data[data['Class'] == 1]

In [None]:
# print the shape of the class
print('class 0:', class_0.shape)
print('\nclass 1:', class_1.shape)

## 1. Random under sampling

In [None]:
class_0_under = class_0.sample(class_count_1)

test_under = pd.concat([class_0_under, class_1], axis=0)

print("class count of 1 and 0:\n", test_under['Class'].value_counts())

test_under['Class'].value_counts().plot(kind='bar', title='Count (target)')
plt.show()

In [None]:
# Let's split the data into training and test sets
# Train the model using Logistic Regression
# Get the performance metrics: accuracy, confusion matrix, ROC and F1 score
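
The three steps in this exercise recur in every section below, so it may help to wrap them in one function. The following is a minimal sketch, not a reference solution: `evaluate_logreg` is a hypothetical helper name, the data comes from `make_classification` as a synthetic stand-in, and `max_iter=1000` and `stratify=target` are assumptions rather than settings taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             f1_score, roc_auc_score)
from sklearn.model_selection import train_test_split


def evaluate_logreg(features, target, seed=42):
    """Split, fit logistic regression, and return test-set metrics."""
    x_tr, x_te, y_tr, y_te = train_test_split(
        features, target, test_size=0.2, random_state=seed, stratify=target)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(x_tr, y_tr)
    labels = clf.predict(x_te)
    # ROC AUC needs scores, so use the predicted probability of class 1.
    scores = clf.predict_proba(x_te)[:, 1]
    return {
        "accuracy": accuracy_score(y_te, labels),
        "confusion_matrix": confusion_matrix(y_te, labels),
        "roc_auc": roc_auc_score(y_te, scores),
        "f1": f1_score(y_te, labels),
    }


# Synthetic stand-in for a balanced sample such as test_under.
demo_x, demo_y = make_classification(
    n_samples=1000, n_features=10, weights=[0.5, 0.5], random_state=0)
metrics = evaluate_logreg(demo_x, demo_y)
print(metrics)
```

On the notebook's data, the same function could be called as `evaluate_logreg(test_under.drop(columns='Class'), test_under['Class'])`.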

## 2. Random over sampling

In [None]:
class_1_over = class_1.sample(class_count_0, replace=True)

test_over = pd.concat([class_1_over, class_0], axis=0)

# print the class counts
print('class count of 1 and 0:\n', test_over['Class'].value_counts())

# plot the count
test_over['Class'].value_counts().plot(kind='bar', title='Count (target)')
plt.show()

In [None]:
# Let's split the data into training and test sets
# Train the model using Logistic Regression
# Get the performance metrics: accuracy, confusion matrix, ROC AUC, F1 score and ROC curve



## Balance data with the imbalanced-learn module

In [None]:
# import library
import imblearn

## 3. Random under-sampling with imblearn

In [None]:
# import library
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=42, replacement=True)

# fit predictor and target variable
x_rus, y_rus = rus.fit_resample(x, y)

print('Original dataset shape:', Counter(y))
print('Resampled dataset shape:', Counter(y_rus))

In [None]:
# Let's split the data into training and test sets
# Train the model using Logistic Regression
# Get the performance metrics: accuracy, confusion matrix, ROC AUC, F1 score and ROC curve



## 4. Random over-sampling with imblearn

In [None]:
# import library
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=42)

# fit predictor and target variable
x_ros, y_ros = ros.fit_resample(x, y)

print('Original dataset shape:', Counter(y))
print('Resampled dataset shape:', Counter(y_ros))

In [None]:
# Let's split the data into training and test sets
# Train the model using Logistic Regression
# Get the performance metrics: accuracy, confusion matrix, ROC AUC, F1 score and ROC curve



## 5. Under-sampling Tomek links

In [None]:
# load library
from imblearn.under_sampling import TomekLinks

tl = TomekLinks(sampling_strategy='majority')

# fit predictor and target variable
x_tl, y_tl = tl.fit_resample(x, y)

print('Original dataset shape:', Counter(y))
print('Resampled dataset shape:', Counter(y_tl))
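
A Tomek link is a pair of points from different classes that are each other's nearest neighbour. With `sampling_strategy='majority'`, the majority-class member of each link is removed. The sketch below illustrates that idea in plain Python on made-up 2-D points; the helper names `nearest_index` and `tomek_links` are hypothetical, and this is not how imbalanced-learn implements it internally.

```python
import math


def nearest_index(points, i):
    """Index of the point closest to points[i], excluding i itself."""
    return min((j for j in range(len(points)) if j != i),
               key=lambda j: math.dist(points[i], points[j]))


def tomek_links(points, labels):
    """Return index pairs (i, j), i < j, that form Tomek links."""
    links = []
    for i in range(len(points)):
        j = nearest_index(points, i)
        # Mutual nearest neighbours with different labels form a link.
        if i < j and labels[i] != labels[j] and nearest_index(points, j) == i:
            links.append((i, j))
    return links


# Made-up 2-D points: class 0 is the majority, class 1 the minority.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0),
          (1.1, 1.0), (3.0, 3.0), (3.2, 3.0)]
labels = [0, 0, 0, 1, 0, 0]

links = tomek_links(points, labels)
# Drop the majority-class member of every link.
to_drop = {i if labels[i] == 0 else j for i, j in links}
kept = [k for k in range(len(points)) if k not in to_drop]
print(links, kept)
```

Only the majority point sitting right next to the minority point is removed, which is why Tomek links usually shrink the majority class by a small amount rather than balancing the classes.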

In [None]:
# Let's split the data into training and test sets
# Train the model using Logistic Regression
# Get the performance metrics: accuracy, confusion matrix, ROC AUC, F1 score and ROC curve



## 6. Synthetic minority over-sampling technique (SMOTE)

In [None]:
# load library
from imblearn.over_sampling import SMOTE

smote = SMOTE()

# fit target and predictor variable
x_smote , y_smote = smote.fit_resample(x, y)

print('Original dataset shape:', Counter(y))
print('Resampled dataset shape:', Counter(y_smote))
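
SMOTE does not duplicate minority rows. It creates new points on the line segment between a minority sample and one of its nearest minority neighbours. The sketch below shows that interpolation step in plain Python; `smote_sketch` is a hypothetical name, the points are made up, and `k=2` is chosen only because the toy data is tiny.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen point (excluding itself).
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b)
                               for b, n in zip(base, neighbour)))
    return synthetic


# Made-up minority-class points in two dimensions.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_new=5)
print(new_points)
```

Every synthetic point lies between two existing minority points, so the new samples stay inside the region the minority class already occupies.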

In [None]:
# Let's split the data into training and test sets
# Train the model using Logistic Regression
# Get the performance metrics: accuracy, confusion matrix, ROC AUC, F1 score and ROC curve



## 7. NearMiss

In [None]:
from imblearn.under_sampling import NearMiss

nm = NearMiss()

x_nm, y_nm = nm.fit_resample(x, y)

print('Original dataset shape:', Counter(y))
print('Resampled dataset shape:', Counter(y_nm))

In [None]:
# Let's split the data into training and test sets
# Train the model using Logistic Regression
# Get the performance metrics: accuracy, confusion matrix, ROC AUC, F1 score and ROC curve



Let's check out what NearMiss does.
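
NearMiss under-samples the majority class by keeping the majority points that lie closest to the minority class. In the NearMiss-1 variant, each majority point is scored by its average distance to its k nearest minority points, and the points with the smallest scores are kept. The sketch below is a plain-Python illustration on made-up points; `near_miss_1` is a hypothetical helper, and `k=2` is an assumption made for the toy data.

```python
import math


def near_miss_1(majority, minority, k=2):
    """Keep len(minority) majority points closest, on average, to their
    k nearest minority points (the NearMiss-1 heuristic)."""
    def mean_dist_to_nearest_minority(point):
        nearest = sorted(math.dist(point, m) for m in minority)[:k]
        return sum(nearest) / len(nearest)

    ranked = sorted(majority, key=mean_dist_to_nearest_minority)
    return ranked[:len(minority)]


# Made-up 2-D points.
majority = [(0.0, 0.0), (0.9, 1.0), (5.0, 5.0), (1.2, 1.1), (4.0, 0.0)]
minority = [(1.0, 1.0), (1.1, 1.2)]

kept_majority = near_miss_1(majority, minority)
print(kept_majority)
```

The two majority points nearest the minority cluster survive, while the distant ones are discarded. The retained majority samples are therefore the ones hardest to tell apart from the minority class.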

## 8. Penalized algorithms (cost-sensitive training)

In [None]:
"""
# load library
from sklearn.svm import SVC

# class_weight='balanced' penalizes mistakes on the minority class more heavily
svc_model = SVC(class_weight='balanced', probability=True)

svc_model.fit(x_train, y_train)

svc_predict = svc_model.predict(x_test)
"""
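
With `class_weight='balanced'`, scikit-learn weights each class by `n_samples / (n_classes * class_count)`, so errors on the rare class cost more. The sketch below reproduces that formula in plain Python on made-up labels; `balanced_class_weights` is a hypothetical helper written for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weights computed as n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count)
            for cls, count in counts.items()}


# Made-up labels: 990 legitimate (0) and 10 fraudulent (1) transactions.
labels = [0] * 990 + [1] * 10
weights = balanced_class_weights(labels)
print(weights)
```

Here a misclassified fraud case weighs about 99 times as much as a misclassified legitimate one, so both classes contribute the same total weight to the loss.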

In [None]:
# Let's split the data into training and test sets
# Train the model using Logistic Regression
# Get the performance metrics: accuracy, confusion matrix, ROC AUC, F1 score and ROC curve



## 9. Tree-based algorithms

Trying a variety of algorithms is a good rule of thumb in any machine learning problem, and it can be especially beneficial with imbalanced datasets.

Decision trees frequently perform well on imbalanced data. In modern machine learning, tree ensembles (Random Forests, Gradient Boosted Trees, etc.) almost always outperform single decision trees, so we’ll jump right into those.

Tree-based algorithms work by learning a hierarchy of if/else questions, which can force both classes to be addressed.
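
To make the "if/else questions" idea concrete, the sketch below searches for the single best question of the form `value <= t` on one feature, scored by weighted Gini impurity. This is only the first split of a tree, written in plain Python; the amounts, labels, and the helper names `gini` and `best_threshold` are made up for illustration.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_threshold(values, labels):
    """Find the 'value <= t' question with the lowest weighted Gini."""
    best_t, best_score = None, float("inf")
    for t in sorted(set(values))[:-1]:
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        score = (len(left) * gini(left)
                 + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score


# Made-up transaction amounts; the two largest are labelled as fraud (1).
amounts = [5, 12, 20, 35, 50, 80, 900, 1500]
is_fraud = [0, 0, 0, 0, 0, 0, 1, 1]

threshold, impurity = best_threshold(amounts, is_fraud)
print(threshold, impurity)
```

The chosen question isolates the two fraud cases in their own branch even though they are a small share of the rows, which is one way a split can address the minority class directly.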

In [None]:
# Let's train the model using Random Forest
# Get the performance metrics: accuracy, confusion matrix, ROC AUC, F1 score and ROC curve
# fit the predictor and target

In [None]:
# Let's train the model using XGBoost
# Get the performance metrics: accuracy, confusion matrix, ROC AUC, F1 score and ROC curve
# fit the predictor and target

What are the advantages and disadvantages of under-sampling and over-sampling?
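
As a starting point for this question, the sketch below measures two side effects on made-up row ids: how much majority data random under-sampling throws away, and how few distinct rows random over-sampling actually contains. The counts (1,000 majority rows, 20 minority rows) are assumptions chosen for illustration.

```python
import random

rng = random.Random(42)
# Made-up row ids: 1,000 majority rows and 20 minority rows.
majority_ids = list(range(1000))
minority_ids = list(range(1000, 1020))

# Under-sampling: keep only as many majority rows as there are minority rows.
under = rng.sample(majority_ids, len(minority_ids))
discarded_fraction = 1 - len(under) / len(majority_ids)

# Over-sampling: draw minority rows with replacement up to the majority size.
over = rng.choices(minority_ids, k=len(majority_ids))
unique_fraction = len(set(over)) / len(over)

print(discarded_fraction, unique_fraction)
```

Under-sampling is fast but discards most of the majority information, while over-sampling keeps all rows but repeats the same few minority rows many times, which can encourage overfitting.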