# ML FOR IMBALANCED CLASSIFICATION

**CONTENTS**

- [Example Dataset](#example)
- [Methods for Imbalanced Classification](#methods)
    + [Improve The Dataset](#dataset)
        + [Up-sampling The Minority Class](#up_sampling)
        + [Down-sampling The Majority Class](#down_sampling)
    + [Changing The Performance Metric](#metrics)
    + [Use Specific Models for Imbalanced Classification](#model)
        + [Penalizing Algorithms (Cost-Sensitive Training)](#penalizing)
        + [Tree-Based Algorithms](#tree-base)

- [Good reference for handling imbalanced classification](https://elitedatascience.com/imbalanced-classes)
- [Second reference: learning from imbalanced classes](https://www.svds.com/learning-imbalanced-classes/)
- Examples of imbalanced classes:
    + Fraud detection
    + Spam filtering
    + Disease screening
    + SaaS subscription churn
    + Advertising click-throughs

<a id='example'></a>
## 1. EXAMPLE DATASET OF IMBALANCED CLASSIFICATION

### 1.1. DATASET DESCRIPTION

- [Balance Scale Weight & Distance Database](http://archive.ics.uci.edu/ml/datasets/balance+scale)
- Number of Attributes: 4 (numeric) + class name = 5
- Attribute Information:
	1. Class Name: 3 (L, B, R)
	2. Left-Weight: 5 (1, 2, 3, 4, 5)
	3. Left-Distance: 5 (1, 2, 3, 4, 5)
	4. Right-Weight: 5 (1, 2, 3, 4, 5)
	5. Right-Distance: 5 (1, 2, 3, 4, 5)
- Number of Instances: 625 (49 balanced, 288 left, 288 right)
- Class Distribution: 
   1. 46.08 percent are L
   2. 07.84 percent are B
   3. 46.08 percent are R
   
   
**=> In this study, the dataset is converted into a binary classification problem: balanced (B) versus not balanced (L or R).**

### 1.2. PROCESS THE DATA

In [90]:
import pandas as pd
import numpy as np

In [97]:
data = pd.read_csv("data/balance-scale.data", header = None, \
                   names = ['class', 'x1', 'x2', 'x3', 'x4'])
print(data.shape)
data.head(3)

(625, 5)


|   | class | x1 | x2 | x3 | x4 |
|---|-------|----|----|----|----|
| 0 | B     | 1  | 1  | 1  | 1  |
| 1 | R     | 1  | 1  | 1  | 2  |
| 2 | R     | 1  | 1  | 1  | 3  |


In [98]:
data['class'].value_counts()

R    288
L    288
B     49
Name: class, dtype: int64

In [99]:
### Convert the 3 classes into a binary target (1 = balanced, 0 = not balanced)
data['balance'] = [1 if b == 'B' else 0 for b in data['class']]
data.drop(['class'], axis = 1, inplace = True)
data.head(3)

|   | x1 | x2 | x3 | x4 | balance |
|---|----|----|----|----|---------|
| 0 | 1  | 1  | 1  | 1  | 1       |
| 1 | 1  | 1  | 1  | 2  | 0       |
| 2 | 1  | 1  | 1  | 3  | 0       |


In [100]:
data.balance.value_counts()

0    576
1     49
Name: balance, dtype: int64

### 1.3. BRIEFLY TRAIN THE MODEL

In [85]:
from sklearn import tree
from sklearn import ensemble
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# sklearn metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score

def fitModels(X, y):
    """ Train several ML models on a stratified 80/20 split of (X, y)
    Print the accuracy, recall, precision and f1 score of each model on the test split
    Return a list of [model_name, fitted_model] pairs for later use
    """
    #### Prepare the dataset
    ### Use option stratify so both splits keep the class ratio
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2,
                                                        random_state = 42, stratify = y)

    ### The order matters: later cells pick a model by its index in the returned list
    candidates = [
        ("Naive Bayes", GaussianNB()),
        ("Support Vector Machine", svm.SVC(kernel = "linear", gamma = "auto")),
        ("K-nearest Neighbor", KNeighborsClassifier(n_neighbors = 2)),
        ("Logistic Regression", LogisticRegression()),
        ("Decision Tree", tree.DecisionTreeClassifier()),
        ("Random Forest", ensemble.RandomForestClassifier(n_estimators = 100)),
        ("AdaBoost", ensemble.AdaBoostClassifier(n_estimators = 100)),
        ("Gradient Boosting", ensemble.GradientBoostingClassifier(n_estimators = 100)),
    ]

    models = []
    for model_name, clf in candidates:
        clf = clf.fit(X_train, y_train)
        models.append([model_name, clf])
        # Evaluate the model on the test split
        y_pred = clf.predict(X_test)
        accuracy = accuracy_score(y_test, y_pred)
        recall = recall_score(y_test, y_pred)
        precision = precision_score(y_test, y_pred)
        f1 = f1_score(y_test, y_pred)
        print("{} - Accuracy score: {}".format(model_name, accuracy))
        print("Recall: {}, Precision: {}, F1 score: {}\n".format(recall, precision, f1))

    return models

In [86]:
X = np.array(data[['x1', 'x2', 'x3', 'x4']])
y = np.array(data['balance'])
models = fitModels(X, y)

Naive Bayes - Accuracy score: 0.92
Recall: 0.0, Precision: 0.0, F1 score: 0.0

Support Vector Machine - Accuracy score: 0.92
Recall: 0.0, Precision: 0.0, F1 score: 0.0

K-nearest Neighbor - Accuracy score: 0.904
Recall: 0.0, Precision: 0.0, F1 score: 0.0

Logistic Regression - Accuracy score: 0.92
Recall: 0.0, Precision: 0.0, F1 score: 0.0

Decision Tree - Accuracy score: 0.84
Recall: 0.0, Precision: 0.0, F1 score: 0.0

Random Forest - Accuracy score: 0.912
Recall: 0.0, Precision: 0.0, F1 score: 0.0

AdaBoost - Accuracy score: 0.92
Recall: 0.0, Precision: 0.0, F1 score: 0.0

Gradient Boosting - Accuracy score: 0.896
Recall: 0.0, Precision: 0.0, F1 score: 0.0




**=> No model correctly identifies a single minority-class sample (recall is 0 for all of them), even though the accuracy looks high.**
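
The accuracy scores above are close to what a trivial predictor that always answers "not balanced" would reach, which is why accuracy alone is misleading here. The sketch below computes that baseline for the test split of this notebook (125 samples, 10 of them in class 1, as seen in the printed `y_test` further down). The helper name `majority_baseline_accuracy` is made up for this illustration.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of a model that always predicts the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Test split of this notebook: 115 samples of class 0 and 10 of class 1
y_test_labels = [0] * 115 + [1] * 10
print(majority_baseline_accuracy(y_test_labels))
```

The baseline is 115 / 125 = 0.92, the same accuracy that several of the models report while having zero recall.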

<a id='methods'></a>
## 2. METHODS FOR IMBALANCED CLASSIFICATION

<a id='dataset'></a>
### 2.1. IMPROVE THE DATASET

- Some methods to improve the dataset (make the class distribution more balanced):
    + For the minority class:
        + Up-sampling the minority class
        + Combine minority classes (for multi-class problems):
            + Example: predicting credit card fraud: there are several fraud classes => combine them into one class.
        + Data augmentation: create synthetic samples.
    + For the majority class: down-sampling
- Below are the techniques for up-sampling and down-sampling
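
The data augmentation item in the list above is not demonstrated in this notebook. As a rough sketch of the idea, the code below creates synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours. This is a simplified SMOTE-style illustration written in plain Python, not the implementation of any particular library. The function name `synthesize_minority`, its parameters, and the example points are assumptions made for this sketch.

```python
import math
import random

def synthesize_minority(samples, n_new, k=3, seed=0):
    """Create n_new synthetic points between minority samples and their neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(samples)
        # k nearest other minority samples, by Euclidean distance
        others = [s for s in samples if s is not base]
        nearest = sorted(others, key=lambda s: math.dist(base, s))[:k]
        neighbor = rng.choice(nearest)
        # Pick a random point on the segment between base and neighbor
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic

# Illustrative minority samples in the (x1, x2, x3, x4) feature space
minority = [(1, 1, 1, 1), (2, 2, 2, 2), (3, 3, 3, 3), (1, 2, 2, 1), (2, 1, 1, 2)]
for point in synthesize_minority(minority, n_new=3):
    print(point)
```

Note that interpolation produces non-integer feature values, which may or may not make sense for a dataset with discrete attributes such as this one.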

<a id='up_sampling'></a>
#### 2.1.1. UP-SAMPLING THE MINORITY CLASS

- Randomly duplicate observations from the minority class to reinforce its signal.
- Simple and most common method: resample the minority class with replacement using `resample` from `sklearn.utils`.

In [113]:
from sklearn.utils import resample

data_major = data.query('balance == 0')
data_minor = data.query('balance == 1')
print("Majority class: {}, Minority class: {}".format(data_major.shape, data_minor.shape))
### Up-sampling the minority class
data_minor_upsampled = resample(data_minor, replace = True, 
                                n_samples = data_major.shape[0], random_state = 42)
print("Minority class - upsampled.shape", data_minor_upsampled.shape)
# combine the class
data_upsampled = pd.concat([data_major, data_minor_upsampled])
print('Count target y in each class:')
data_upsampled.balance.value_counts()

Majority class: (576, 5), Minority class: (49, 5)
Minority class - upsampled.shape (576, 5)
Count target y in each class:


1    576
0    576
Name: balance, dtype: int64

In [110]:
X = np.array(data_upsampled[['x1', 'x2', 'x3', 'x4']])
y = np.array(data_upsampled['balance'])
models = fitModels(X, y)

Naive Bayes - Accuracy score: 0.5064935064935064
Recall: 0.5043478260869565, Precision: 0.5043478260869565, F1 score: 0.5043478260869565

Support Vector Machine - Accuracy score: 0.5194805194805194
Recall: 0.5826086956521739, Precision: 0.5153846153846153, F1 score: 0.5469387755102041

K-nearest Neighbor - Accuracy score: 0.9567099567099567
Recall: 1.0, Precision: 0.92, F1 score: 0.9583333333333334

Logistic Regression - Accuracy score: 0.4935064935064935
Recall: 0.4782608695652174, Precision: 0.49107142857142855, F1 score: 0.4845814977973568

Decision Tree - Accuracy score: 0.9567099567099567
Recall: 1.0, Precision: 0.92, F1 score: 0.9583333333333334

Random Forest - Accuracy score: 0.974025974025974
Recall: 1.0, Precision: 0.9504132231404959, F1 score: 0.9745762711864406





AdaBoost - Accuracy score: 0.5324675324675324
Recall: 0.5826086956521739, Precision: 0.5275590551181102, F1 score: 0.5537190082644627

Gradient Boosting - Accuracy score: 0.8874458874458875
Recall: 0.9739130434782609, Precision: 0.8296296296296296, F1 score: 0.896



In [112]:
#### Choose the random forest model (models[5], best performer above) => predict on the original, imbalanced test set
X = np.array(data[['x1', 'x2', 'x3', 'x4']])
y = np.array(data['balance'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, 
                                                    random_state = 42, stratify = y)

print(models[5][0])
# Evaluate the model
y_pred = models[5][1].predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print("Accuracy: {}\nRecall: {}\nPrecision: {}\nF1 score: {}".format(accuracy, recall, precision, f1))

print("y_test", y_test)
print("y_pred", y_pred)

Random Forest
Accuracy: 0.96
Recall: 1.0
Precision: 0.6666666666666666
F1 score: 0.8
y_test [0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
y_pred [0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
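
One caveat about the evaluation above: the data was up-sampled before it was split, and the model is then tested on rows of the original data. Copies of the same original row can therefore end up in both the training data and the test data, which can make the scores look better than they would on unseen samples. A common way to avoid this is to split first and up-sample only the training part. The sketch below shows that order of operations in plain Python. The function `split_then_upsample` and the `(row_id, label)` row format are assumptions made for this illustration, not part of the notebook.

```python
import random
from collections import Counter, defaultdict

def split_then_upsample(rows, label_of, test_fraction=0.2, seed=42):
    """Stratified split first, then up-sample each class in the training part only."""
    rng = random.Random(seed)

    # Group the rows by class label
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)

    # Stratified split: take the same fraction of every class for the test set
    train_by_class, test = {}, []
    for label, members in by_class.items():
        members = list(members)
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train_by_class[label] = members[n_test:]

    # Up-sample with replacement until every class matches the largest one
    target = max(len(members) for members in train_by_class.values())
    train = []
    for members in train_by_class.values():
        train.extend(members)
        train.extend(rng.choices(members, k=target - len(members)))
    return train, test

# Same class sizes as this notebook: 576 majority rows and 49 minority rows
rows = [(i, 0) for i in range(576)] + [(i, 1) for i in range(576, 625)]
train, test = split_then_upsample(rows, label_of=lambda row: row[1])
print(Counter(label for _, label in train), len(test))
```

With this order the test set keeps the original class ratio and shares no row with the training data.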


<a id='down_sampling'></a>
#### 2.1.2. DOWN-SAMPLING THE MAJORITY CLASS

- Simple and most common method: resample the majority class without replacement using `resample` from `sklearn.utils`.
- Down-sampling significantly reduces the number of observations (here from 625 to 98) => could lead to an underfitting model.

In [115]:
from sklearn.utils import resample

data_major = data.query('balance == 0')
data_minor = data.query('balance == 1')
print("Majority class: {}, Minority class: {}".format(data_major.shape, data_minor.shape))
### Down-sampling the majority class
data_major_downsampled = resample(data_major, replace = False, 
                                n_samples = data_minor.shape[0], random_state = 42)
print("Majority class, downsampled.shape", data_major_downsampled.shape)
# combine the class
data_downsampled = pd.concat([data_minor, data_major_downsampled])
print('Count target y in each class:')
data_downsampled.balance.value_counts()

Majority class: (576, 5), Minority class: (49, 5)
Majority class, downsampled.shape (49, 5)
Count target y in each class:


1    49
0    49
Name: balance, dtype: int64

In [116]:
X = np.array(data_downsampled[['x1', 'x2', 'x3', 'x4']])
y = np.array(data_downsampled['balance'])
models = fitModels(X, y)

Naive Bayes - Accuracy score: 0.35
Recall: 0.3, Precision: 0.3333333333333333, F1 score: 0.3157894736842105

Support Vector Machine - Accuracy score: 0.25
Recall: 0.2, Precision: 0.2222222222222222, F1 score: 0.2105263157894737

K-nearest Neighbor - Accuracy score: 0.55
Recall: 0.3, Precision: 0.6, F1 score: 0.4

Logistic Regression - Accuracy score: 0.25
Recall: 0.2, Precision: 0.2222222222222222, F1 score: 0.2105263157894737

Decision Tree - Accuracy score: 0.5
Recall: 0.4, Precision: 0.5, F1 score: 0.4444444444444445

Random Forest - Accuracy score: 0.4
Recall: 0.4, Precision: 0.4, F1 score: 0.4000000000000001





AdaBoost - Accuracy score: 0.4
Recall: 0.5, Precision: 0.4166666666666667, F1 score: 0.45454545454545453

Gradient Boosting - Accuracy score: 0.5
Recall: 0.5, Precision: 0.5, F1 score: 0.5



In [117]:
#### Use the random forest model again (models[5]) => predict on the original, imbalanced test set
X = np.array(data[['x1', 'x2', 'x3', 'x4']])
y = np.array(data['balance'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, 
                                                    random_state = 42, stratify = y)

print(models[5][0])
# Evaluate the model
y_pred = models[5][1].predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print("Accuracy: {}\nRecall: {}\nPrecision: {}\nF1 score: {}".format(accuracy, recall, precision, f1))

print("y_test", y_test)
print("y_pred", y_pred)

Random Forest
Accuracy: 0.616
Recall: 0.7
Precision: 0.1346153846153846
F1 score: 0.22580645161290322
y_test [0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
y_pred [0 1 1 0 0 0 0 0 1 0 1 0 0 1 1 1 1 1 1 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0
 1 0 1 0 0 0 0 1 1 1 0 1 0 0 0 1 0 0 0 1 1 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 1
 1 0 0 1 0 0 0 0 0 1 1 1 1 0 1 0 0 0 0 1 1 1 0 0 0 1 1 0 0 0 1 1 0 1 1 1 0
 1 1 0 0 1 1 0 1 1 0 1 1 0 1]


<a id='metrics'></a>
### 2.2. CHANGING THE PERFORMANCE METRICS

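- Use the AUROC (area under the ROC curve) value, which is mostly used for binary classification. It measures how well the predicted probabilities rank positive samples above negative ones, independently of the decision threshold.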
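
To make the metric concrete, the AUROC can be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it directly from that definition in plain Python. The function name `auroc_by_pairs` is made up for this illustration, and the pairwise loop is only suitable for small inputs.

```python
def auroc_by_pairs(y_true, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one sample of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half
    return wins / (len(positives) * len(negatives))

# Three of the four (positive, negative) pairs are ranked correctly
print(auroc_by_pairs([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.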

In [118]:
#### Using logistic regression to have the predict_proba value 
### Train on the upsampled data

X = np.array(data_upsampled[['x1', 'x2', 'x3', 'x4']])
y = np.array(data_upsampled['balance'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

#### LogisticRegression
model_name = "Logistic Regression"
clf = LogisticRegression(solver="liblinear")
clf = clf.fit(X_train, y_train)
# Evaluate the model
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print("{} - Accuracy score: {}".format(model_name,\
                            accuracy))
print("Recall: {}, Precision: {}, F1 score: {}\n".format(recall, precision, f1))

Logistic Regression - Accuracy score: 0.5151515151515151
Recall: 0.6605504587155964, Precision: 0.4897959183673469, F1 score: 0.5624999999999999



In [127]:
### Calculate the AUROC score. The classifier must provide predict_proba()
# predict probability for each class
from sklearn.metrics import roc_auc_score
### keep only the positive class (class 1, the minority class in the original data), which is column [1]
prob_y_train_pred = clf.predict_proba(X_train)[:, 1] 
prob_y_test_pred = clf.predict_proba(X_test)[:, 1]
print("AUROC val on train data", roc_auc_score(y_train,prob_y_train_pred )) 
print("AUROC val on test data", roc_auc_score(y_test, prob_y_test_pred))

AUROC val on train data 0.5445905536322387
AUROC val on test data 0.5156414498420815


<a id='model'></a>
### 2.3. USE SPECIFIC MODELS FOR IMBALANCED CLASSIFICATION
- Some models work well for imbalanced classification:
    + SVC with penalizing
    + Decision tree algorithms
- We can also reframe the problem as Anomaly Detection/Outlier Detection (clustering, k-nearest neighbor, etc.)

<a id='penalizing'></a>
#### 2.3.1. PENALIZING ALGORITHMS (COST-SENSITIVE TRAINING)
  
- Penalized learning algorithms increase the cost of classification mistakes on the minority class.
- Common technique: Penalized-SVM:
    + SVC with `class_weight = "balanced"` => mistakes on each class are penalized in inverse proportion to the class frequency
    + Set `probability = True` so that `predict_proba` is available for computing the AUROC score
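
For intuition about the size of the penalty, scikit-learn documents the `"balanced"` mode as weighting each class by `n_samples / (n_classes * class_count)`. The sketch below reproduces that formula in plain Python. The helper name `balanced_class_weights` is made up for this illustration, and the counts used are those of the full dataset rather than of the training split.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight per class: n_samples / (n_classes * number of samples in the class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Full dataset of this notebook: 576 samples of class 0 and 49 of class 1
labels = [0] * 576 + [1] * 49
for label, weight in sorted(balanced_class_weights(labels).items()):
    print(label, round(weight, 4))
```

With these counts a mistake on a minority sample weighs roughly 12 times (576 / 49) as much as a mistake on a majority sample.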
    

In [132]:
#### Using SVC 
### Train on the unbalanced dataset

X = np.array(data[['x1', 'x2', 'x3', 'x4']])
y = np.array(data['balance'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42, stratify = y)

#### SVC
from sklearn.svm import SVC
model_name = "Penalized SVM"
clf_svc = SVC(kernel = 'linear',
              class_weight = 'balanced',
              probability = True)
clf_svc = clf_svc.fit(X_train, y_train)
# Evaluate the model
y_pred = clf_svc.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print("{} - Accuracy score: {}".format(model_name,\
                            accuracy))
print("Recall: {}, Precision: {}, F1 score: {}\n".format(recall, precision, f1))

Penalized SVM - Accuracy score: 0.48
Recall: 0.2, Precision: 0.03389830508474576, F1 score: 0.05797101449275363



In [133]:
prob_y_train_pred = clf_svc.predict_proba(X_train)[:, 1] 
prob_y_test_pred = clf_svc.predict_proba(X_test)[:, 1]
print("AUROC val on train data", roc_auc_score(y_train,prob_y_train_pred )) 
print("AUROC val on test data", roc_auc_score(y_test, prob_y_test_pred))

AUROC val on train data 0.47099393737137774
AUROC val on test data 0.6765217391304348


In [134]:
y_test

array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [135]:
y_pred

array([1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1,
       0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1,
       0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1,
       0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1,
       1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1])

<a id='tree-base'></a>
#### 2.3.2. TREE-BASED ALGORITHMS 
- Decision trees often work well on imbalanced datasets: their hierarchical structure allows them to learn signals from both classes.

=> As shown in **Part 2.1**, the random forest algorithm performs best among the models tried on the up-sampled data.


In [138]:
#### Using RandomForest
### Train on the upsampled dataset
X = np.array(data_upsampled[['x1', 'x2', 'x3', 'x4']])
y = np.array(data_upsampled['balance'])
X_train_up, X_test_up, y_train_up, y_test_up = train_test_split(X, y, test_size = 0.2, random_state = 42, stratify = y)
### Test on unbalanced test dataset
X = np.array(data[['x1', 'x2', 'x3', 'x4']])
y = np.array(data['balance'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, 
                                                    random_state = 42, stratify = y)

#### Random forest
model_name = "RandomForest"
clf_rf = ensemble.RandomForestClassifier(n_estimators = 100)
clf_rf = clf_rf.fit(X_train_up, y_train_up)
# Evaluate the model
y_pred = clf_rf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print("{} - Accuracy score: {}".format(model_name,\
                            accuracy))
print("Recall: {}, Precision: {}, F1 score: {}\n".format(recall, precision, f1))
prob_y_train_pred = clf_rf.predict_proba(X_train_up)[:, 1] 
prob_y_test_pred = clf_rf.predict_proba(X_test)[:, 1]
print("AUROC val on train data", roc_auc_score(y_train_up,prob_y_train_pred )) 
print("AUROC val on test data", roc_auc_score(y_test, prob_y_test_pred))

RandomForest - Accuracy score: 0.96
Recall: 1.0, Precision: 0.6666666666666666, F1 score: 0.8

AUROC val on train data 0.9999999999999999
AUROC val on test data 1.0


In [139]:
y_test

array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [140]:
y_pred

array([0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,
       0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

=> Recall on the test set = 1: all the positive cases are identified.

=> Precision is lower (0.67): 5 of the 15 positive predictions are false positives.
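
The scores reported for the random forest can be reproduced from the confusion counts read off the `y_test` and `y_pred` arrays above: 10 true positives, 5 false positives, 0 false negatives and 110 true negatives. The sketch below recomputes the metrics from those four numbers in plain Python. The function name `metrics_from_counts` is made up for this illustration.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Accuracy, precision, recall, F1 and false positive rate from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "false_positive_rate": fp / (fp + tn),
    }

# Counts taken from the y_test and y_pred arrays printed above
for name, value in metrics_from_counts(tp=10, fp=5, fn=0, tn=110).items():
    print(name, round(value, 4))
```

Precision looks at the predicted positives (5 wrong out of 15), while the false positive rate looks at the actual negatives (5 wrong out of 115), so the two can differ a lot on imbalanced data.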