<a href="https://colab.research.google.com/github/kochlisGit/Advanced-ML/blob/main/Cost-Sensitive-Learning/cost_sensitive_learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
# Installing required libraries.

!pip install scikit-learn==0.22.2.post1
!pip install costcla

Collecting scikit-learn==0.22.2.post1
  Downloading scikit_learn-0.22.2.post1-cp37-cp37m-manylinux1_x86_64.whl (7.1 MB)
[K     |████████████████████████████████| 7.1 MB 6.2 MB/s 
Installing collected packages: scikit-learn
  Attempting uninstall: scikit-learn
    Found existing installation: scikit-learn 1.0.2
    Uninstalling scikit-learn-1.0.2:
      Successfully uninstalled scikit-learn-1.0.2
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
yellowbrick 1.4 requires scikit-learn>=1.0.0, but you have scikit-learn 0.22.2.post1 which is incompatible.
imbalanced-learn 0.8.1 requires scikit-learn>=0.24, but you have scikit-learn 0.22.2.post1 which is incompatible.[0m
Successfully installed scikit-learn-0.22.2.post1
Collecting costcla
  Downloading costcla-0.6-py3-none-any.whl (4.0 MB)
[K     |████████████████████████████████| 4.0 MB 7.0 MB/s 
Collecting pye

In [None]:
# Downloading Statlog (Heart) Data Set from UCI
import pandas as pd

column_names = [
                'Age',
                'Sex',
                'Chest Pain Type',
                'Resting Blood Pressure',
                'Serum Cholesterol (mg/dl)',
                'Fasting Blood Sugar > 120 mg/dl',
                'Resting Electrocardiographic Results',
                'Maximum Heart Rate Achieved',
                'Exercise Induced Angina',
                'Depression by Exercise/Rest',
                'Slope of Peak Exercise',
                'Number of Major Vessels',
                'Thal',
                'Presence'
]
delimiter = ' '
data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/heart/heart.dat', names=column_names, delimiter=delimiter)

data

Unnamed: 0,Age,Sex,Chest Pain Type,Resting Blood Pressure,Serum Cholesterol (mg/dl),Fasting Blood Sugar > 120 mg/dl,Resting Electrocardiographic Results,Maximum Heart Rate Achieved,Exercise Induced Angina,Depression by Exercise/Rest,Slope of Peak Exercise,Number of Major Vessels,Thal,Presence
0,70.0,1.0,4.0,130.0,322.0,0.0,2.0,109.0,0.0,2.4,2.0,3.0,3.0,2
1,67.0,0.0,3.0,115.0,564.0,0.0,2.0,160.0,0.0,1.6,2.0,0.0,7.0,1
2,57.0,1.0,2.0,124.0,261.0,0.0,0.0,141.0,0.0,0.3,1.0,0.0,7.0,2
3,64.0,1.0,4.0,128.0,263.0,0.0,0.0,105.0,1.0,0.2,2.0,1.0,7.0,1
4,74.0,0.0,2.0,120.0,269.0,0.0,2.0,121.0,1.0,0.2,1.0,1.0,3.0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
265,52.0,1.0,3.0,172.0,199.0,1.0,0.0,162.0,0.0,0.5,1.0,0.0,7.0,1
266,44.0,1.0,2.0,120.0,263.0,0.0,0.0,173.0,0.0,0.0,1.0,0.0,7.0,1
267,56.0,0.0,2.0,140.0,294.0,0.0,2.0,153.0,0.0,1.3,2.0,0.0,3.0,1
268,57.0,1.0,4.0,140.0,192.0,0.0,0.0,148.0,0.0,0.4,2.0,0.0,6.0,1


In [None]:
# Generating training dataset.

from sklearn import model_selection

TEST_SIZE = 0.3
RANDOM_STATE = 0
SHUFFLE = True

data = data.dropna()
targets = data['Presence'].replace({1: 0, 2: 1})
inputs = data.drop(columns=['Presence'])

x_train, x_test, y_train, y_test = model_selection.train_test_split(inputs, targets, test_size=TEST_SIZE, random_state=RANDOM_STATE, shuffle=SHUFFLE)
x_train.shape, y_train.shape, x_test.shape, y_test.shape

((189, 13), (189,), (81, 13), (81,))

In this dataset, we have 189 samples available for training and 81 samples for evaluation. Unfortunately, we are not provided with enough data, so our training will be extremely difficult. The special issue about this dataset is the fact that the classifier that we are going to build is intended to be used for medical purposes, so we have different costs for each kind of prediction. For example, a classifier that outputs lots of **False Positives** could be dangerous and even inappropriate to use.

In [None]:
# Training classifiers & Evaluating using the Cost

from sklearn import ensemble, svm, naive_bayes
from sklearn import metrics
import numpy as np

# Defining the cost according to the recommended cost matrix of the dataset.
# It is important not to output "absence" if a patient suffers from a heart desease.

TP = 0
TN = 0
FP = 5
FN = 1
cost = np.float32([[TN , FP], [FN, TP]])

# Defining the classifiers.

classifiers = {
    'Random Forest Classifier': ensemble.RandomForestClassifier(n_estimators=100, random_state=RANDOM_STATE),
    'Support Vector Classifier': svm.LinearSVC(random_state=RANDOM_STATE),
    'Naive Bayes Classifier': naive_bayes.GaussianNB()
}

# Training & Evaluating the classifiers.

for classifier_name, clf in classifiers.items():
  clf.fit(x_train, y_train)
  y_pred = clf.predict(x_test)
  
  confusion_matrix = metrics.confusion_matrix(y_test, y_pred).T
  cost_loss = np.sum(confusion_matrix * cost)

  print('\nEvaluating {}'.format(classifier_name))
  print(metrics.classification_report(y_test, y_pred))
  print('Cost loss = {}'.format(cost_loss))
  print('By Confusion Matrix: TP: {}, TN: {}, FP: {}, FN: {}'.format(
      confusion_matrix[1][1],
      confusion_matrix[0][0],
      confusion_matrix[0][1],
      confusion_matrix[1][0]
  ))



Evaluating Random Forest Classifier
              precision    recall  f1-score   support

           0       0.88      0.77      0.82        48
           1       0.72      0.85      0.78        33

    accuracy                           0.80        81
   macro avg       0.80      0.81      0.80        81
weighted avg       0.81      0.80      0.80        81

Cost loss = 36.0
By Confusion Matrix: TP: 28, TN: 37, FP: 5, FN: 11

Evaluating Support Vector Classifier
              precision    recall  f1-score   support

           0       0.89      0.85      0.87        48
           1       0.80      0.85      0.82        33

    accuracy                           0.85        81
   macro avg       0.85      0.85      0.85        81
weighted avg       0.85      0.85      0.85        81

Cost loss = 32.0
By Confusion Matrix: TP: 28, TN: 41, FP: 5, FN: 7

Evaluating Naive Bayes Classifier
              precision    recall  f1-score   support

           0       0.85      0.81      0.83   

As we can see, in this dataset the Random Forest classifier has the highest `Accuracy` along with Naive Bayes Classifier and SVC performs the worst. More specifically, Random Forest has `Accuracy = 0.80, F1-Score(0) = 0.82, F1-Score(1) = 0.78` and Naive Bayes Classifier has `Accuracy = 0.80, F1-Score(0) = 0.83, F1-Score(1) = 0.76` percent However, we are mainly insterested in the classifier with the minimum cost loss. In this case, the rankings are:

1.   Support Vector Classifier:` Cost loss = 36`
2.   Random Forest Classifier: `Cost loss = 91`
3.   Naive Bayes Classifier: `Cost loss = 41 `

However, a model with cost of 36 seems to be quite high for medical applications. We can minimize the cost loss even further using the following techniques:



1.   Calibration
2.   Weighting
3.   Weighting + Calibration on Data
4.   Rebalancing


1. **Calibration (Sigmoid/Isotonic)**

In [None]:
# 1. Adding calibration to the model's output.
# Calibration methods: a) Sigmoid, b) Isotonic

from sklearn import calibration
from costcla.models import BayesMinimumRiskClassifier
from costcla.metrics import cost_loss

TP = np.zeros((y_test.shape[0], 1))
TN = np.zeros((y_test.shape[0], 1))
FP = np.full((y_test.shape[0],1), 5)
FN = np.full((y_test.shape[0],1), 1)
cost_matrix = np.hstack((FP, FN, TP, TN))
CV = 3

calibration_methods = {
    'Sigmoid': 'sigmoid',
    'Isotonic': 'isotonic'
}


for calibration_method, cc in calibration_methods.items():
  classifiers = {
      'Random Forest Classifier': ensemble.RandomForestClassifier(n_estimators=100, random_state=RANDOM_STATE),
      'Support Vector Classifier': svm.SVC(kernel='linear',probability=True, random_state=RANDOM_STATE),
      'Naive Bayes Classifier': naive_bayes.GaussianNB()
  }

  for classifier_name, clf in classifiers.items():
    cc_clf = calibration.CalibratedClassifierCV(clf, method=cc, cv=CV)
    model = clf.fit(x_train, y_train)
    prob_test = model.predict_proba(x_test)
    bmr = BayesMinimumRiskClassifier(calibration=False)
    y_pred = bmr.predict(prob_test, cost_matrix)

    loss = cost_loss(y_test, y_pred, cost_matrix)
    print('\nEvaluating {} using {} Calibration'.format(classifier_name, calibration_method))
    print(metrics.classification_report(y_test, y_pred))
    print('Cost loss = {}'.format(loss))

    confusion_matrix = metrics.confusion_matrix(y_test, y_pred).T
    print('By Confusion Matrix: TP: {}, TN: {}, FP: {}, FN: {}'.format(
      confusion_matrix[1][1],
      confusion_matrix[0][0],
      confusion_matrix[0][1],
      confusion_matrix[1][0]
    ))


Evaluating Random Forest Classifier using Sigmoid Calibration
              precision    recall  f1-score   support

           0       0.71      0.98      0.82        48
           1       0.93      0.42      0.58        33

    accuracy                           0.75        81
   macro avg       0.82      0.70      0.70        81
weighted avg       0.80      0.75      0.73        81

Cost loss = 24.0
By Confusion Matrix: TP: 14, TN: 47, FP: 19, FN: 1

Evaluating Support Vector Classifier using Sigmoid Calibration
              precision    recall  f1-score   support

           0       0.72      1.00      0.83        48
           1       1.00      0.42      0.60        33

    accuracy                           0.77        81
   macro avg       0.86      0.71      0.72        81
weighted avg       0.83      0.77      0.74        81

Cost loss = 19.0
By Confusion Matrix: TP: 14, TN: 48, FP: 19, FN: 0

Evaluating Naive Bayes Classifier using Sigmoid Calibration
              precisio

As we can see, using the calibration methods we built a model with a cost of 19, while our best previous model made predictions with a cost of 36. Using the sigmoid or the isotonic calibration method we managed to reduce the cost almost by half. Additionally, all our classifiers performed better in terms of cost minimization. More specifically:

Classifier     | No Calibration | Sigmoid Calibration | Isotonic Calibration 
---------------|----------------|---------------------|---------------------
Random Forest  |36              |24                   |24
Support Vector |91              |19                   |19
Naive Bayes    |44              |32                   |32

We also notice that the cost of SVM is dramatically dropped from 91 to 19. This is because SVM overestimates the low probabilities and underestimates the high probabilities, as we have seen in this lecture. By calibrating the classifier, we correctly estimate the probabilities of SVM.

Another thing we notice is that by minimizing the risk, we changed the accuracies of the classifiers. The accuracy of SVM and Random Forest has dropped by a very small percentage, however, Naive Bayes reduced its cost while also improved its prediction score metrics.

2. **Example Weighting**

In [None]:
# 2. Adding class weights during training.
# According to the cost matrix, it is much important that a patient with heart disease is not misclassified.

weights = np.full(y_train.shape[0], 1)
weights[np.where(y_train == 0)] = 5;

classifiers = {
    'Random Forest Classifier': ensemble.RandomForestClassifier(n_estimators=100, random_state=RANDOM_STATE),
    'Support Vector Classifier': svm.LinearSVC(random_state=RANDOM_STATE),
    'Naive Bayes Classifier': naive_bayes.GaussianNB()
}

# Training & Evaluating the classifiers.

for classifier_name, clf in classifiers.items():
  clf.fit(x_train, y_train, weights)
  y_pred = clf.predict(x_test)

  confusion_matrix = metrics.confusion_matrix(y_test, y_pred).T
  cost_loss = np.sum(confusion_matrix * cost)

  print('\nEvaluating {}'.format(classifier_name))
  print(metrics.classification_report(y_test, y_pred))
  print('Cost loss = {}'.format(cost_loss))

  confusion_matrix = metrics.confusion_matrix(y_test, y_pred).T
  print('By Confusion Matrix: TP: {}, TN: {}, FP: {}, FN: {}'.format(
      confusion_matrix[1][1],
      confusion_matrix[0][0],
      confusion_matrix[0][1],
      confusion_matrix[1][0]
  ))


Evaluating Random Forest Classifier
              precision    recall  f1-score   support

           0       0.86      0.79      0.83        48
           1       0.73      0.82      0.77        33

    accuracy                           0.80        81
   macro avg       0.80      0.80      0.80        81
weighted avg       0.81      0.80      0.80        81

Cost loss = 40.0
By Confusion Matrix: TP: 27, TN: 38, FP: 6, FN: 10

Evaluating Support Vector Classifier
              precision    recall  f1-score   support

           0       0.83      0.90      0.86        48
           1       0.83      0.73      0.77        33

    accuracy                           0.83        81
   macro avg       0.83      0.81      0.82        81
weighted avg       0.83      0.83      0.83        81

Cost loss = 50.0
By Confusion Matrix: TP: 24, TN: 43, FP: 9, FN: 5

Evaluating Naive Bayes Classifier
              precision    recall  f1-score   support

           0       0.86      0.90      0.88   

As we can see, by adding class weights to our training, it improved the overall accuracy metrics of the classifiers. We used the cost matrix to define the weights for each class (**5 for absence and 1 for presence**). However, the costs for each classifier still remains high, so probably needs to be combined with a calibration method.

3. **Weighting + Calibration on Data**

In [None]:
# Bonus Method: Calibrating the classifiers on the training set with a neural network.
# 1. Classifiers will trained, without weights.
# 2. Then, the output probabilities will be computed for each training example and will be concatenated with the training inputs.
# 3. The new training inputs will be fed into a neural network.
# 4. Loss weights will be added to the final classifier.

import tensorflow as tf
tf.random.set_seed(0)

EPOCHS = 100
BATCH_SIZE = 16

def build_model(input_shape):
  inputs = tf.keras.layers.Input(input_shape)
  x = tf.keras.layers.Dense(units=64, use_bias=False)(inputs)
  x = tf.keras.layers.BatchNormalization()(x)
  x = tf.keras.activations.gelu(x)
  outputs = tf.keras.layers.Dense(units=1, activation='sigmoid')(x)
  nnet = tf.keras.Model(inputs, outputs)
  nnet.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001), loss='binary_crossentropy', metrics=['accuracy'], loss_weights=weights)
  return nnet


classifiers = {
    'Random Forest Classifier': ensemble.RandomForestClassifier(n_estimators=100, random_state=RANDOM_STATE),
    'Support Vector Classifier': svm.SVC(kernel='linear', probability=True, random_state=RANDOM_STATE),
    'Naive Bayes Classifier': naive_bayes.GaussianNB()
}

for classifier_name, clf in classifiers.items():
  # Buildin & Training the model.

  model = clf.fit(x_train, y_train)
  prob_train = model.predict_proba(x_train)
  nnet_train_inputs = np.hstack((x_train, prob_train))
  nnet = build_model(input_shape=[nnet_train_inputs.shape[1]])
  nnet.fit(nnet_train_inputs, y_train, batch_size=BATCH_SIZE, epochs=EPOCHS, verbose=0)

  # Evaluating the model.

  prob_test = model.predict_proba(x_test)
  nnet_test_inputs = np.hstack((x_test, prob_test))

  y_pred_probabilities = nnet.predict(nnet_test_inputs)
  y_pred = np.round(y_pred_probabilities)
  
  confusion_matrix = metrics.confusion_matrix(y_test, y_pred).T
  cost_loss = np.sum(confusion_matrix * cost)

  print('\nEvaluating {}'.format(classifier_name))
  print(metrics.classification_report(y_test, y_pred))
  print('Cost loss = {}'.format(cost_loss))
  print('By Confusion Matrix: TP: {}, TN: {}, FP: {}, FN: {}'.format(
      confusion_matrix[1][1],
      confusion_matrix[0][0],
      confusion_matrix[0][1],
      confusion_matrix[1][0]
  ))


Evaluating Random Forest Classifier
              precision    recall  f1-score   support

           0       0.86      0.75      0.80        48
           1       0.69      0.82      0.75        33

    accuracy                           0.78        81
   macro avg       0.77      0.78      0.78        81
weighted avg       0.79      0.78      0.78        81

Cost loss = 42.0
By Confusion Matrix: TP: 27, TN: 36, FP: 6, FN: 12

Evaluating Support Vector Classifier
              precision    recall  f1-score   support

           0       0.90      0.58      0.71        48
           1       0.60      0.91      0.72        33

    accuracy                           0.72        81
   macro avg       0.75      0.75      0.72        81
weighted avg       0.78      0.72      0.71        81

Cost loss = 35.0
By Confusion Matrix: TP: 30, TN: 28, FP: 3, FN: 20

Evaluating Naive Bayes Classifier
              precision    recall  f1-score   support

           0       0.93      0.52      0.67  

Below, we present the final costs by combining example weighting with calibration on dataset:

Classifier     | No Cost Min.   | Weighting           | Calibration + Weighting 
---------------|----------------|---------------------|---------------------
Random Forest  |36              |40                   |42
Support Vector |91              |50                   |35
Naive Bayes    |44              |40                   |33

4. **Train/Test Stratification (aka Rebalancing)**

In [None]:
from collections import Counter
from imblearn import under_sampling, over_sampling

sampling_methods = {
    'Under Sampling': under_sampling.RandomUnderSampler(sampling_strategy={0: 10, 1: 50}, random_state=RANDOM_STATE),
    'Over Sampling': over_sampling.RandomOverSampler(sampling_strategy={0: 1000, 1: 5000}, random_state=RANDOM_STATE)
}

for method, sampler in sampling_methods.items():
  x_rs, y_rs = sampler.fit_resample(x_train, y_train)

  print('\n\n\n------------- Using {} -------------'.format(method))
  print('Training data: {}'.format(Counter(y_rs)))

  classifiers = {
    'Random Forest Classifier': ensemble.RandomForestClassifier(n_estimators=100, random_state=RANDOM_STATE),
    'Support Vector Classifier': svm.LinearSVC(random_state=RANDOM_STATE),
    'Naive Bayes Classifier': naive_bayes.GaussianNB()
  }

  for classifier_name, clf in classifiers.items():
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    
    confusion_matrix = metrics.confusion_matrix(y_test, y_pred).T
    cost_loss = np.sum(confusion_matrix * cost)

    print('\nEvaluating {}'.format(classifier_name))
    print(metrics.classification_report(y_test, y_pred))
    print('Cost loss = {}'.format(cost_loss))
    print('By Confusion Matrix: TP: {}, TN: {}, FP: {}, FN: {}'.format(
      confusion_matrix[1][1],
      confusion_matrix[0][0],
      confusion_matrix[0][1],
      confusion_matrix[1][0]
    ))

print('\n\n\n ------------- Combining methods -------------')

sampler = under_sampling.RandomUnderSampler(sampling_strategy={0: 10, 1: 50}, random_state=RANDOM_STATE)
x_rs, y_rs = sampler.fit_resample(x_train, y_train)
sampler = over_sampling.RandomOverSampler(sampling_strategy={0: 1000, 1: 5000}, random_state=RANDOM_STATE)
x_rs, y_rs = sampler.fit_resample(x_train, y_train)
print('Training data: {}'.format(Counter(y_rs)))

classifiers = {
  'Random Forest Classifier': ensemble.RandomForestClassifier(n_estimators=100, random_state=RANDOM_STATE),
  'Support Vector Classifier': svm.LinearSVC(random_state=RANDOM_STATE),
  'Naive Bayes Classifier': naive_bayes.GaussianNB()
}

for classifier_name, clf in classifiers.items():
  clf.fit(x_train, y_train)
  y_pred = clf.predict(x_test)
  
  confusion_matrix = metrics.confusion_matrix(y_test, y_pred).T
  cost_loss = np.sum(confusion_matrix * cost)

  print('\nEvaluating {}'.format(classifier_name))
  print(metrics.classification_report(y_test, y_pred))
  print('Cost loss = {}'.format(cost_loss))
  print('By Confusion Matrix: TP: {}, TN: {}, FP: {}, FN: {}'.format(
      confusion_matrix[0][0],
      confusion_matrix[0][1],
      confusion_matrix[1][0],
      confusion_matrix[1][1]
  ))




------------- Using Under Sampling -------------
Training data: Counter({1: 50, 0: 10})

Evaluating Random Forest Classifier
              precision    recall  f1-score   support

           0       0.88      0.77      0.82        48
           1       0.72      0.85      0.78        33

    accuracy                           0.80        81
   macro avg       0.80      0.81      0.80        81
weighted avg       0.81      0.80      0.80        81

Cost loss = 36.0
By Confusion Matrix: TP: 28, TN: 37, FP: 5, FN: 11

Evaluating Support Vector Classifier
              precision    recall  f1-score   support

           0       0.89      0.85      0.87        48
           1       0.80      0.85      0.82        33

    accuracy                           0.85        81
   macro avg       0.85      0.85      0.85        81
weighted avg       0.85      0.85      0.85        81

Cost loss = 32.0
By Confusion Matrix: TP: 28, TN: 41, FP: 5, FN: 7

Evaluating Naive Bayes Classifier
          

Below, we present the final costs of each sampling method separately as well as the combination of both UnderSampling & OverSampling. It also worths mentioning that the accuracy metrics of some classifiers have also been improved. 

Classifier     | UnderSampling  | OverSampling | UnderSampling + OverSampling 
---------------|----------------|---------------------|---------------------
Random Forest  |36              |36                   |36
Support Vector |32              |32                   |32
Naive Bayes    |44              |44                   |44

**Conclusion**

In this dataset, we have tested 4 cost minimization techniques, which are ***Probability Calibration, Weighting, Calibration on Training Data, Rebalancing**. From all the above methods, *Probability Calibration* achieved remarkable results on minimizing the desired costs. However, all 4 methods successfully managed to drop the prediction costs. In some cases, using these methods after the training of the classifiers increased their score metrics as well, while in some other cases the accuracy dropped by a small percentage. However, in medical applications it is important to build models with the lowest possible False Positive rate, as the decisions of such models could have an impact on human lives.