### Imbalanced Dataset
### References
- https://www.kaggle.com/mlg-ulb/creditcardfraud
- Krish Naik Feature Engineering https://www.youtube.com/watch?v=pDw_JHHvj-0&list=PLZoTAELRMXVPwYGE2PXD3x0bfKnR0cJjN&index=13

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('creditcard.csv')

In [3]:
df.shape

(284807, 31)

In [4]:
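## intended as a missing-value check; note that count() on the boolean mask
## returns the number of rows per column, while df.isnull().sum() counts the missing values
df.isnull().count()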

Time      284807
V1        284807
V2        284807
V3        284807
V4        284807
V5        284807
V6        284807
V7        284807
V8        284807
V9        284807
V10       284807
V11       284807
V12       284807
V13       284807
V14       284807
V15       284807
V16       284807
V17       284807
V18       284807
V19       284807
V20       284807
V21       284807
V22       284807
V23       284807
V24       284807
V25       284807
V26       284807
V27       284807
V28       284807
Amount    284807
Class     284807
dtype: int64
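The output above shows 284807 for every column because `count()` counts the non-null cells of the boolean mask, which is always the number of rows. A minimal sketch of the difference between `count()` and `sum()` on a mask, using a small hypothetical frame (`toy` and its values are made up for illustration):

```python
import pandas as pd

# Tiny hypothetical frame with two missing values in column "a"
toy = pd.DataFrame({"a": [1.0, None, None], "b": [4.0, 5.0, 6.0]})

mask = toy.isnull()  # boolean frame: True where a value is missing

rows_per_column = mask.count()    # non-null cells of the mask, i.e. the row count
missing_per_column = mask.sum()   # True counts as 1, so this is the number of missing values

print("count():", int(rows_per_column["a"]), int(rows_per_column["b"]))
print("sum():  ", int(missing_per_column["a"]), int(missing_per_column["b"]))
```

On the real data, `df.isnull().sum()` is the call that would confirm whether any column has missing values.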

In [5]:
##### 0 : no fraud
##### 1: fraud
df['Class'].value_counts()

0    284315
1       492
Name: Class, dtype: int64
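To put the imbalance in numbers, the sketch below derives the fraud share from the two counts printed above, along with the accuracy of a trivial model that always predicts class 0. The variable names are illustrative only.

```python
# Class counts copied from the value_counts() output above
n_legit = 284315
n_fraud = 492
n_total = n_legit + n_fraud

fraud_share = n_fraud / n_total          # fraction of rows labelled 1
baseline_accuracy = n_legit / n_total    # accuracy of always predicting 0

print(f"fraud share:       {fraud_share:.4%}")
print(f"baseline accuracy: {baseline_accuracy:.4%}")
print(f"legit per fraud:   {n_legit / n_fraud:.0f}")
```

Since the do-nothing baseline already scores above 99.8% accuracy, accuracy alone says little here; precision and recall on class 1 are the numbers to watch.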

In [6]:
## separate into dependent(y) and independent features(x)
x = df.drop('Class',axis=1)
y = df.Class

#### Techniques to handle imbalanced datasets
1. Try different algorithms with hyperparameter tuning
2. Undersampling
3. Oversampling
4. SMOTE

- As a rule of thumb, follow this workflow to handle imbalanced datasets:

    - SMOTE --> Oversampling --> Ensemble and Hyperparameter Tuning
    - Always have in mind the goal (reduce False Negatives or False Positives)

#### 1.1 Logistic Regression

In [7]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix,classification_report
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
import numpy as np

In [8]:
##split into train and test datasets
x_train, x_test, y_train, y_test = train_test_split(x,y,train_size=0.7)

In [9]:
## instantiate the classifier
lr = LogisticRegression()

In [10]:
## set of parameters to pass into GridSearchCV
params = {'C':10.0**np.arange(-2,3),'penalty':['l1','l2'],'solver':['liblinear']}
## cross-validation
cv = KFold(n_splits=5)
## instantiate GridSearchCV and fit it on the training set
grid = GridSearchCV(lr,params,cv=cv,scoring='f1_macro')
grid.fit(x_train,y_train)

GridSearchCV(cv=KFold(n_splits=5, random_state=None, shuffle=False),
             estimator=LogisticRegression(),
             param_grid={'C': array([1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02]),
                         'penalty': ['l1', 'l2'], 'solver': ['liblinear']},
             scoring='f1_macro')
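To get a feel for how much work this grid search does, the sketch below enumerates the same grid in plain Python and counts the model fits. It only mirrors the parameter values above; it does not call scikit-learn.

```python
from itertools import product

# Same grid as the cell above, written without NumPy
c_values = [10.0 ** k for k in range(-2, 3)]   # five values from 0.01 to 100
penalties = ["l1", "l2"]
solvers = ["liblinear"]
n_splits = 5

candidates = list(product(c_values, penalties, solvers))
n_fits = len(candidates) * n_splits   # one fit per candidate per fold

print(len(candidates), "candidates,", n_fits, "cross-validation fits")
```

With the default `refit=True`, `GridSearchCV` additionally fits the best candidate once more on the whole training set, which is the model that `grid.predict` uses.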

In [11]:
y_pred = grid.predict(x_test)
print(confusion_matrix(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
print(classification_report(y_test,y_pred))

[[85272    11]
 [   58   102]]
0.9991924440855307
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     85283
           1       0.90      0.64      0.75       160

    accuracy                           1.00     85443
   macro avg       0.95      0.82      0.87     85443
weighted avg       1.00      1.00      1.00     85443
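The class-1 row of the report can be recomputed by hand from the confusion matrix, which makes it clearer where the false negatives and false positives sit. A minimal sketch, assuming scikit-learn's layout (rows are true classes, columns are predicted classes):

```python
# Confusion matrix printed above: rows = true class, columns = predicted class
cm = [[85272, 11],
      [58, 102]]

tn, fp = cm[0]   # true 0: correctly passed, wrongly flagged
fn, tp = cm[1]   # true 1: missed frauds, caught frauds

precision = tp / (tp + fp)   # share of flagged transactions that are fraud
recall = tp / (tp + fn)      # share of frauds that were flagged
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f} accuracy={accuracy:.4f}")
```

The 58 false negatives are what keep recall near 0.64 even though accuracy is above 0.999.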



#### 1.2 Random Forest
- Precision and recall on the fraud class improve over logistic regression (0.95 / 0.78 vs. 0.90 / 0.64) without any hyperparameter tuning

In [12]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(x_train,y_train)
y_pred = rf.predict(x_test)

In [13]:
print(confusion_matrix(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
print(classification_report(y_test,y_pred))

[[85276     7]
 [   36   124]]
0.9994967405170698
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     85283
           1       0.95      0.78      0.85       160

    accuracy                           1.00     85443
   macro avg       0.97      0.89      0.93     85443
weighted avg       1.00      1.00      1.00     85443



### 2. Undersampling
- reduces the number of samples in the majority class
- pitfall: discards observations, and with them information about the majority class
- rarely used! It's worth a try if the dataset is small

In [14]:
y_train.value_counts()

0    199032
1       332
Name: Class, dtype: int64

In [15]:
from imblearn.under_sampling import NearMiss
## make labels=1 around 80% of labels=0, so that 0 is reduced to 415 (332 / 0.8)
nm = NearMiss(sampling_strategy=0.8)
x_train_nm, y_train_nm = nm.fit_resample(x_train,y_train)

In [16]:
y_train_nm.value_counts()

0    415
1    332
Name: Class, dtype: int64
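`NearMiss` does not drop majority samples at random; it keeps the ones that lie closest to the minority class. The sketch below is a simplified, pure-Python take on the NearMiss-1 idea (rank majority points by their mean distance to the `k` nearest minority points). The function `near_miss_1` and the toy points are hypothetical, and the imbalanced-learn implementation may differ in details such as tie handling.

```python
import math

def near_miss_1(majority, minority, n_keep, k=3):
    """Keep the n_keep majority points that sit closest to the minority class.

    Closeness is the mean Euclidean distance to the k nearest minority points.
    """
    def mean_dist_to_nearest_minority(point):
        dists = sorted(math.dist(point, m) for m in minority)
        nearest = dists[:k]
        return sum(nearest) / len(nearest)

    ranked = sorted(majority, key=mean_dist_to_nearest_minority)
    return ranked[:n_keep]

# Hypothetical 2-D toy data
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
majority = [(5.0, 5.0), (0.5, 0.5), (10.0, 10.0), (1.0, 1.0), (-8.0, 0.0)]

kept = near_miss_1(majority, minority, n_keep=2)
print(kept)  # [(0.5, 0.5), (1.0, 1.0)]
```

Because only majority points near the minority class survive, a model trained on the result sees very little of the "easy" majority region.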

In [17]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(x_train_nm,y_train_nm)
y_pred = rf.predict(x_test)

In [18]:
print(confusion_matrix(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
print(classification_report(y_test,y_pred))

[[61544 23739]
 [    8   152]]
0.7220720246246035
              precision    recall  f1-score   support

           0       1.00      0.72      0.84     85283
           1       0.01      0.95      0.01       160

    accuracy                           0.72     85443
   macro avg       0.50      0.84      0.43     85443
weighted avg       1.00      0.72      0.84     85443



### 3. Oversampling
- increases the number of samples in the minority class by duplicating existing points
- generally a better choice than undersampling, since no observations are discarded

In [19]:
from imblearn.over_sampling import RandomOverSampler
## make labels=1 around 50% of labels=0, so that 1 is increased to 99516 (199032 * 0.5)
ros = RandomOverSampler(sampling_strategy=0.5)
x_train_os,y_train_os = ros.fit_resample(x_train,y_train)

In [20]:
y_train_os.value_counts()

0    199032
1     99516
Name: Class, dtype: int64
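Random oversampling amounts to drawing minority rows with replacement until the requested minority/majority ratio is reached. A minimal pure-Python sketch of that idea, assuming binary labels; `random_oversample` and the toy data are hypothetical and not part of imbalanced-learn:

```python
import random
from collections import Counter

def random_oversample(rows, labels, minority_label, ratio, seed=0):
    """Duplicate randomly chosen minority rows until minority/majority reaches ratio."""
    rng = random.Random(seed)
    minority_rows = [r for r, l in zip(rows, labels) if l == minority_label]
    n_majority = len(rows) - len(minority_rows)
    n_target = int(n_majority * ratio)                 # desired minority count
    n_extra = max(0, n_target - len(minority_rows))    # copies still needed
    extra = rng.choices(minority_rows, k=n_extra)      # sampling with replacement
    return rows + extra, labels + [minority_label] * n_extra

# Hypothetical toy data: 10 majority rows, 2 minority rows
rows = [[i] for i in range(12)]
labels = [0] * 10 + [1] * 2

new_rows, new_labels = random_oversample(rows, labels, minority_label=1, ratio=0.5)
print(Counter(new_labels))
```

The same arithmetic explains the counts above: 199032 majority rows times 0.5 gives 99516 minority rows. Every added row is an exact copy, so no new information enters the training set.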

In [21]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(x_train_os,y_train_os)
y_pred = rf.predict(x_test)

In [22]:
print(confusion_matrix(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
print(classification_report(y_test,y_pred))

[[85276     7]
 [   31   129]]
0.9995552590615966
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     85283
           1       0.95      0.81      0.87       160

    accuracy                           1.00     85443
   macro avg       0.97      0.90      0.94     85443
weighted avg       1.00      1.00      1.00     85443



### 4. SMOTETomek
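- SMOTE creates new synthetic minority points between existing ones; Tomek links then remove neighbouring pairs of points from opposite classes (https://machinelearningmastery.com/undersampling-algorithms-for-imbalanced-classification/)
- pitfall: can take a long time to run on large datasets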
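The core SMOTE step can be sketched as linear interpolation between a minority sample and one of its minority neighbours. The snippet below shows only that step with hypothetical points; the neighbour search and the Tomek-link cleaning that `SMOTETomek` performs afterwards are left out.

```python
import random

def smote_point(x, neighbor, rng):
    """Return a synthetic point on the segment between x and a minority neighbor."""
    lam = rng.random()   # uniform in [0, 1)
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]

rng = random.Random(42)
x = [1.0, 2.0]          # hypothetical minority sample
neighbor = [3.0, 6.0]   # hypothetical nearby minority sample

synthetic = smote_point(x, neighbor, rng)
print(synthetic)
```

Unlike random oversampling, the new point is not a copy of an existing row, although it always lies between two real minority samples.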

In [None]:
from imblearn.combine import SMOTETomek
sm = SMOTETomek(sampling_strategy=.75)
x_train_sm, y_train_sm = sm.fit_resample(x_train,y_train)

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(x_train_sm,y_train_sm)
y_pred = rf.predict(x_test)

In [None]:
print(confusion_matrix(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
print(classification_report(y_test,y_pred))