### Cost-Sensitive Learning on Statlog(Heart) dataset

### Panagiotis Doupidis, DWS 89
[Link to dataset](https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/heart/heart.dat)


---

In [1]:
!pip install scikit-learn==0.22.2 costcla


In [2]:
import pandas as pd

#### Import the dataset and transform the response (heart_disease) to 0 (absence) and 1 (presence)

In [3]:

header_ = ['age', 'sex', 'chest_pain', 'resting_bp', 'cholesterol', 'blood_sugar', 
           'resting_ekg', 'max_hr', 'angina', 'oldpeak', 'slope_ST', 'no_of_vessels', 'thal', 'heart_disease']

# File is fetched from the UCI webpage
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/heart/heart.dat',
                 names=header_, header=None, sep=r'\s+')

# Subtract 1 from the response variable so that 0 indicates absence and 1
# indicates presence of heart disease (the raw file codes them as 1 and 2).

df.heart_disease = df.heart_disease - 1

df.heart_disease.value_counts() # 0=absence, 1=presence

0    150
1    120
Name: heart_disease, dtype: int64

### No cost minimization

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from costcla.metrics import cost_loss

import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)

X, y = df.drop(columns='heart_disease'), df.heart_disease

# 75% train, 25% test
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)

fp = np.full((y_test.shape[0],1), 1)
fn = np.full((y_test.shape[0],1), 5)
tp = np.zeros((y_test.shape[0],1))
tn = np.zeros((y_test.shape[0],1))

cost_matrix = np.hstack((fp, fn, tp, tn))

# rows are the predicted values, columns the actual like in the slides
cost_matrix_uci = np.array(((0,5),(1,0)))

classifiers = [RandomForestClassifier(n_estimators=100), 
               SVC(kernel='linear', C=1, probability=True), MultinomialNB()]

for clf in classifiers:
    print(clf.__class__.__name__)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(classification_report(y_test, y_pred, 
                                digits=3, target_names=['absence', 'presence']))
    conf_m = confusion_matrix(y_test, y_pred).T # transpose to align with slides
    print(conf_m) 
    print(f"Native implementation : {np.sum(conf_m * cost_matrix_uci)}")
    loss = cost_loss(y_test, y_pred, cost_matrix)
    print(f"Using the library function : {loss:.0f}", end='\n\n')
    print('-' * 50)

RandomForestClassifier
              precision    recall  f1-score   support

     absence      0.892     0.868     0.880        38
    presence      0.839     0.867     0.852        30

    accuracy                          0.868        68
   macro avg      0.865     0.868     0.866        68
weighted avg      0.868     0.868     0.868        68

[[33  4]
 [ 5 26]]
Native implementation : 25
Using the library function : 25

--------------------------------------------------
SVC
              precision    recall  f1-score   support

     absence      0.943     0.868     0.904        38
    presence      0.848     0.933     0.889        30

    accuracy                          0.897        68
   macro avg      0.896     0.901     0.896        68
weighted avg      0.901     0.897     0.897        68

[[33  2]
 [ 5 28]]
Native implementation : 15
Using the library function : 15

--------------------------------------------------
MultinomialNB
              precision    recall  f1-score  

Before applying any cost-minimization method, the linear SVM (with default parameters) is the classifier with the lowest cost on the test set, and the naive Bayes classifier has the worst scores of the three. The cost matrix is taken straight from the UCI website: a false positive costs 1, whereas a false negative costs 5. In other words, predicting that an individual with heart disease is healthy is penalized 5 times more heavily than the opposite mistake. True positives and true negatives have a cost of 0. Although we print the classification report with the precision, recall and f1 metrics, we do not base our decisions on it, since we care mostly about the total cost.
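The totals printed above can be checked by hand. The following is a minimal sketch that multiplies the transposed confusion matrices from the output (rows = predicted, columns = actual) by the UCI cost matrix; the helper name `total_cost` is only illustrative and is not part of costcla.

```python
import numpy as np

# Cost matrix in the same layout as the transposed confusion matrices:
# rows = predicted class, columns = actual class.
COST = np.array([[0, 5],   # predicted absence:  TN costs 0, FN costs 5
                 [1, 0]])  # predicted presence: FP costs 1, TP costs 0


def total_cost(conf_pred_by_actual, cost=COST):
    """Sum of the element-wise product of cell counts and per-cell costs."""
    return int(np.sum(np.asarray(conf_pred_by_actual) * cost))


rf_conf = [[33, 4], [5, 26]]   # random forest counts from the output above
svc_conf = [[33, 2], [5, 28]]  # linear SVC counts from the output above

print(total_cost(rf_conf))   # 4 false negatives * 5 + 5 false positives * 1
print(total_cost(svc_conf))  # 2 false negatives * 5 + 5 false positives * 1
```

The two classifiers make the same number of false positives, so the entire gap between them comes from the two extra false negatives of the random forest.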


---




Next, we try to minimize the expected cost using costcla's Bayes Minimum Risk Classifier (no calibration).

In [None]:
from sklearn.calibration import CalibratedClassifierCV
from costcla.models import BayesMinimumRiskClassifier
import joblib

np.random.seed(42)

# No calibration using Costcla's classifier
for clf in classifiers:
    print(clf.__class__.__name__)
    clf.fit(X_train, y_train)
    y_pred_prob = clf.predict_proba(X_test)

    bmr = BayesMinimumRiskClassifier(calibration=False)
    y_pred = bmr.predict(y_pred_prob, cost_matrix)
    print(classification_report(y_test, y_pred, 
                                digits=3, target_names=['absence', 'presence']))
    conf_m = confusion_matrix(y_test, y_pred).T # transpose to align with slides
    print(conf_m) 
    loss = cost_loss(y_test, y_pred, cost_matrix)
    print(f"\nCost : {loss:.0f}", end='\n\n')
    print('-' * 50)

RandomForestClassifier
              precision    recall  f1-score   support

     absence      0.938     0.395     0.556        38
    presence      0.558     0.967     0.707        30

    accuracy                          0.647        68
   macro avg      0.748     0.681     0.631        68
weighted avg      0.770     0.647     0.623        68

[[15  1]
 [23 29]]

Cost : 28

--------------------------------------------------
SVC
              precision    recall  f1-score   support

     absence      1.000     0.526     0.690        38
    presence      0.625     1.000     0.769        30

    accuracy                          0.735        68
   macro avg      0.812     0.763     0.729        68
weighted avg      0.835     0.735     0.725        68

[[20  0]
 [18 30]]

Cost : 18

--------------------------------------------------
MultinomialNB
              precision    recall  f1-score   support

     absence      0.903     0.737     0.812        38
    presence      0.730     0.90

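The linear SVM again has the lowest cost of the three classifiers (18), although both it and the random forest now cost slightly more than their baselines (15 -> 18 and 25 -> 28). Naive Bayes, on the other hand, has a much lower cost than before (37 -> 25), which puts it ahead of the random forest classifier.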
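The large number of false positives in these results follows from the decision rule itself. Below is a minimal sketch of the standard minimum-expected-cost rule for a binary problem, assuming the same costs as above; it is an illustration of the idea, not costcla's actual implementation, and `min_risk_predict` is a hypothetical helper name.

```python
import numpy as np

C_FP, C_FN, C_TP, C_TN = 1.0, 5.0, 0.0, 0.0


def min_risk_predict(p_pos, c_fp=C_FP, c_fn=C_FN, c_tp=C_TP, c_tn=C_TN):
    """Predict 1 whenever the expected cost of predicting 1 is lower."""
    p_pos = np.asarray(p_pos, dtype=float)
    risk_pos = (1 - p_pos) * c_fp + p_pos * c_tp  # expected cost of predicting presence
    risk_neg = p_pos * c_fn + (1 - p_pos) * c_tn  # expected cost of predicting absence
    return (risk_pos < risk_neg).astype(int)


# Equivalent probability threshold implied by the costs.
threshold = (C_FP - C_TN) / (C_FP - C_TN + C_FN - C_TP)

probs = np.array([0.05, 0.10, 0.20, 0.50, 0.90])  # toy probabilities of presence
preds = min_risk_predict(probs)
print(threshold, preds)
```

With a 1:5 cost ratio the threshold drops from 0.5 to 1/6, so any patient whose estimated probability of heart disease exceeds roughly 0.17 is labelled positive. This is why the quality of the probability estimates matters so much for this method.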

So far, we have some indication that SVMs are a good fit for this task, but let's see how things change when we calibrate the probabilities.

In [None]:
# Calibration on training set using Costcla's classifier
np.random.seed(42)

for clf in classifiers:
    print(clf.__class__.__name__)
    
    clf.fit(X_train, y_train)
    y_train_prob = clf.predict_proba(X_train)
    
    bmr = BayesMinimumRiskClassifier(calibration=True)
    bmr.fit(y_train.values.reshape(-1,1), y_train_prob)

    y_test_prob = clf.predict_proba(X_test)
    y_pred = bmr.predict(y_test_prob, cost_matrix)
    print(classification_report(y_test, y_pred, 
                                digits=3, target_names=['absence', 'presence']))
    conf_m = confusion_matrix(y_test, y_pred).T 
    print(conf_m) 
    loss = cost_loss(y_test, y_pred, cost_matrix)
    print(f"\nCost : {loss:.0f}", end='\n\n')
    print('-' * 50)

RandomForestClassifier
              precision    recall  f1-score   support

     absence      0.892     0.868     0.880        38
    presence      0.839     0.867     0.852        30

    accuracy                          0.868        68
   macro avg      0.865     0.868     0.866        68
weighted avg      0.868     0.868     0.868        68

[[33  4]
 [ 5 26]]

Cost : 25

--------------------------------------------------
SVC
              precision    recall  f1-score   support

     absence      0.963     0.684     0.800        38
    presence      0.707     0.967     0.817        30

    accuracy                          0.809        68
   macro avg      0.835     0.825     0.808        68
weighted avg      0.850     0.809     0.807        68

[[26  1]
 [12 29]]

Cost : 17

--------------------------------------------------
MultinomialNB
              precision    recall  f1-score   support

     absence      0.944     0.447     0.607        38
    presence      0.580     0.96

After calibrating the probabilities on the training set we observe a slight improvement in the expected cost of the SVM (18 -> 17) and of the random forest (28 -> 25). SVMs are known to overestimate low probabilities and underestimate high ones; after calibration the confusion matrix shows fewer false positives than before (18 -> 12) and more true negatives (20 -> 26), at the price of a single false negative. Naive Bayes, which is also known for producing inaccurate probabilities, ends up with just one false negative at the expense of more false positives. These are not as costly as false negatives, but ideally we would like that number to be lower. The random forest, whose confusion matrix is now identical to its baseline one, keeps false positives low (5) and detects most of the true positives, but its 4 false negatives account for 20 of its 25 cost units.

In [None]:
# sigmoid calibration
np.random.seed(42)

for clf in classifiers:
    print(clf.__class__.__name__)
    calib_clf = CalibratedClassifierCV(base_estimator=clf, method='sigmoid',cv=5)
    
    calib_clf.fit(X_train, y_train)

    y_test_prob = calib_clf.predict_proba(X_test)
    y_pred = BayesMinimumRiskClassifier(calibration=False).predict(y_test_prob, cost_matrix)

    print(classification_report(y_test, y_pred, 
                                digits=3, target_names=['absence', 'presence']))
    conf_m = confusion_matrix(y_test, y_pred).T 
    print(conf_m) 
    loss = cost_loss(y_test, y_pred, cost_matrix)
    print(f"\nCost : {loss:.0f}", end='\n\n')
    print('-' * 50)

RandomForestClassifier
              precision    recall  f1-score   support

     absence      0.938     0.395     0.556        38
    presence      0.558     0.967     0.707        30

    accuracy                          0.647        68
   macro avg      0.748     0.681     0.631        68
weighted avg      0.770     0.647     0.623        68

[[15  1]
 [23 29]]

Cost : 28

--------------------------------------------------
SVC
              precision    recall  f1-score   support

     absence      1.000     0.395     0.566        38
    presence      0.566     1.000     0.723        30

    accuracy                          0.662        68
   macro avg      0.783     0.697     0.644        68
weighted avg      0.809     0.662     0.635        68

[[15  0]
 [23 30]]

Cost : 23

--------------------------------------------------
MultinomialNB
              precision    recall  f1-score   support

     absence      0.000     0.000     0.000        38
    presence      0.441     1.00

Here we use Platt scaling (sigmoid calibration) to calibrate the probabilities. At first glance, we can see an increase in cost across all classifiers compared with costcla's calibration (random forest 25 -> 28, SVM 17 -> 23). Although this method deals with the false negatives by minimizing them on all 3 classifiers, it seems a bit too aggressive, especially in the case of naive Bayes, where every test example is classified as positive (presence). That removes the false negatives entirely, but of course results in a recall of 0 for the absence class.
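For reference, the idea behind Platt scaling can be sketched in a few lines: fit a sigmoid `p = 1 / (1 + exp(-(a * score + b)))` that maps raw classifier scores to probabilities. The sketch below uses plain gradient descent on invented toy scores and omits the label smoothing of Platt's original method, so it should be read as an illustration of the technique rather than what `CalibratedClassifierCV` does internally. The names `fit_platt` and `apply_platt` are hypothetical.

```python
import numpy as np


def fit_platt(scores, labels, lr=0.1, n_iter=5000):
    """Fit p = sigmoid(a * score + b) by gradient descent on the log loss."""
    s = np.asarray(scores, dtype=float)
    y = np.asarray(labels, dtype=float)
    a, b = 0.0, 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(a * s + b)))
        a -= lr * np.mean((p - y) * s)  # gradient of the mean log loss w.r.t. a
        b -= lr * np.mean(p - y)        # gradient of the mean log loss w.r.t. b
    return a, b


def apply_platt(scores, a, b):
    """Map raw scores to calibrated probabilities with the fitted sigmoid."""
    return 1.0 / (1.0 + np.exp(-(a * np.asarray(scores, dtype=float) + b)))


# Toy, non-separable data: raw scores (sorted) and their true labels.
scores = np.array([-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5])
labels = np.array([0, 0, 0, 1, 0, 1, 0, 1, 1, 1])

a, b = fit_platt(scores, labels)
calibrated = apply_platt(scores, a, b)
print(np.round(calibrated, 3))
```

Because the mapping is a single monotone sigmoid, it preserves the ranking of the examples and only changes where the 1/6 cost threshold falls among them.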

---



In [None]:
# isotonic calibration
np.random.seed(42)

for clf in classifiers:
    print(clf.__class__.__name__)
    calib_clf = CalibratedClassifierCV(base_estimator=clf, method='isotonic',cv=5)
    
    calib_clf.fit(X_train, y_train)

    y_test_prob = calib_clf.predict_proba(X_test)
    y_pred = BayesMinimumRiskClassifier(calibration=False).predict(y_test_prob, cost_matrix)

    print(classification_report(y_test, y_pred, 
                                digits=3, target_names=['absence', 'presence']))
    conf_m = confusion_matrix(y_test, y_pred).T 
    print(conf_m) 
    loss = cost_loss(y_test, y_pred, cost_matrix)
    print(f"\nCost : {loss:.0f}", end='\n\n')
    print('-' * 50)

RandomForestClassifier
              precision    recall  f1-score   support

     absence      0.955     0.553     0.700        38
    presence      0.630     0.967     0.763        30

    accuracy                          0.735        68
   macro avg      0.792     0.760     0.732        68
weighted avg      0.812     0.735     0.728        68

[[21  1]
 [17 29]]

Cost : 22

--------------------------------------------------
SVC
              precision    recall  f1-score   support

     absence      1.000     0.579     0.733        38
    presence      0.652     1.000     0.789        30

    accuracy                          0.765        68
   macro avg      0.826     0.789     0.761        68
weighted avg      0.847     0.765     0.758        68

[[22  0]
 [16 30]]

Cost : 16

--------------------------------------------------
MultinomialNB
              precision    recall  f1-score   support

     absence      0.913     0.553     0.689        38
    presence      0.622     0.93

Isotonic regression is another method for calibrating the predicted probabilities. Based on the output, it outperforms Platt scaling here (random forest 28 -> 22, SVM 23 -> 16), although it is only slightly better than the calibration provided by the costcla library (25 and 17 respectively).
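Isotonic calibration fits a non-decreasing step function from scores to probabilities, typically with the pool adjacent violators (PAV) algorithm. The following is a minimal pure-Python sketch of PAV on invented toy labels that are assumed to be already sorted by increasing classifier score; `pav` is a hypothetical helper, not the scikit-learn implementation.

```python
def pav(values):
    """Pool Adjacent Violators: least-squares non-decreasing fit to a sequence."""
    blocks = []  # each block is [sum of values, number of values]
    for v in values:
        blocks.append([float(v), 1])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)  # every member gets the block mean
    return fitted


# True labels ordered by increasing classifier score (toy data).
labels_sorted_by_score = [0, 0, 1, 0, 1, 0, 1, 1]
calibrated = pav(labels_sorted_by_score)
print(calibrated)
```

Each pooled block receives the fraction of positives it contains as its calibrated probability. Unlike the sigmoid, this makes no assumption about the shape of the distortion, but it needs more data to avoid overfitting.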

---




### Rebalancing

### **Runtime needs to be restarted at this point to load a newer version of sklearn that is compatible with the imbalanced-learn library**

In [None]:
!pip install "scikit-learn>=1.0" imbalanced-learn

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler
from collections import Counter

In [None]:

header_ = ['age', 'sex', 'chest_pain', 'resting_bp', 'cholesterol', 'blood_sugar', 
           'resting_ekg', 'max_hr', 'angina', 'oldpeak', 'slope_ST', 'no_of_vessels', 'thal', 'heart_disease']

df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/heart/heart.dat',
                 names=header_, header=None, sep=r'\s+')

# Subtract 1 from the response variable so that 0 indicates absence and 1
# indicates presence of heart disease (the raw file codes them as 1 and 2).

df.heart_disease = df.heart_disease - 1

df.heart_disease.value_counts() # 0=absence, 1=presence

0    150
1    120
Name: heart_disease, dtype: int64

In [None]:
np.random.seed(42)

# Undersample majority class (absence)

X, y = df.drop(columns='heart_disease'), df.heart_disease

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)

fp = np.full((y_test.shape[0],1), 1)
fn = np.full((y_test.shape[0],1), 5)
tp = np.zeros((y_test.shape[0],1))
tn = np.zeros((y_test.shape[0],1))

cost_matrix = np.hstack((fp, fn, tp, tn))

# rows are the predicted values, columns the actual like in the slides
cost_matrix_uci = np.array(((0,5),(1,0)))

classifiers = [RandomForestClassifier(n_estimators=100), 
               SVC(kernel='linear', C=1, probability=True), MultinomialNB()]

sampler = RandomUnderSampler(sampling_strategy='majority')

print('before', Counter(y_train))
X_train_rs, y_train_rs = sampler.fit_resample(X_train, y_train)

print('after', Counter(y_train_rs), end='\n\n') # both classes have equal number of samples

for clf in classifiers:
    print(clf.__class__.__name__)
    clf.fit(X_train_rs, y_train_rs)
    y_pred = clf.predict(X_test)
    print(classification_report(y_test, y_pred, 
                                digits=3, target_names=['absence', 'presence']))
    conf_m = confusion_matrix(y_test, y_pred).T # transpose to align with slides
    print(conf_m) 
    print(f"Cost : {np.sum(conf_m * cost_matrix_uci)}")
    print('-' * 50)


before Counter({0: 112, 1: 90})
after Counter({0: 90, 1: 90})

RandomForestClassifier
              precision    recall  f1-score   support

     absence      0.912     0.816     0.861        38
    presence      0.794     0.900     0.844        30

    accuracy                          0.853        68
   macro avg      0.853     0.858     0.852        68
weighted avg      0.860     0.853     0.853        68

[[31  3]
 [ 7 27]]
Cost : 22
--------------------------------------------------
SVC
              precision    recall  f1-score   support

     absence      0.941     0.842     0.889        38
    presence      0.824     0.933     0.875        30

    accuracy                          0.882        68
   macro avg      0.882     0.888     0.882        68
weighted avg      0.889     0.882     0.883        68

[[32  2]
 [ 6 28]]
Cost : 16
--------------------------------------------------
MultinomialNB
              precision    recall  f1-score   support

     absence      0.833    

Here we use another technique, rebalancing, which alters the class distribution of the training set either by undersampling the majority class or by oversampling the minority class.

For undersampling, we randomly choose samples (without replacement) from the majority class so that we end up with the same number of training examples in both classes (90 positives and 90 negatives). For the random forest and the SVM the costs (22 and 16) match those of the isotonic regression method, but for naive Bayes the 6 false negatives drive the total expected cost up.
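Random undersampling itself is simple enough to sketch with NumPy alone. The example below assumes toy labels with the same class counts as the training split (112 vs. 90); `random_undersample` is a hypothetical helper that mimics the behaviour described above and is not the imbalanced-learn implementation.

```python
import numpy as np


def random_undersample(y, rng):
    """Return sorted indices that keep all minority examples and an equally
    sized random subset (drawn without replacement) of every larger class."""
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = [rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
            for c in classes]
    return np.sort(np.concatenate(keep))


rng = np.random.default_rng(42)
y_toy = np.array([0] * 112 + [1] * 90)  # same class counts as the training split

idx = random_undersample(y_toy, rng)
print(np.bincount(y_toy[idx]))  # class counts after undersampling
```

The returned indices would then be used to select the matching rows of the feature matrix. The price of this method is that 22 majority-class examples are discarded, which is noticeable on a dataset this small.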

In [None]:
# Oversample 
np.random.seed(42)

sampler = RandomOverSampler(sampling_strategy='minority')

print('before', Counter(y_train))
X_train_rs, y_train_rs = sampler.fit_resample(X_train, y_train)

print('after', Counter(y_train_rs), end='\n\n') # both classes have equal number of samples

for clf in classifiers:
    print(clf.__class__.__name__)
    clf.fit(X_train_rs, y_train_rs)
    y_pred = clf.predict(X_test)
    print(classification_report(y_test, y_pred, 
                                digits=3, target_names=['absence', 'presence']))
    conf_m = confusion_matrix(y_test, y_pred).T # transpose to align with slides
    print(conf_m) 
    print(f"Cost : {np.sum(conf_m * cost_matrix_uci)}")
    print('-' * 50)

before Counter({0: 112, 1: 90})
after Counter({0: 112, 1: 112})

RandomForestClassifier
              precision    recall  f1-score   support

     absence      0.886     0.816     0.849        38
    presence      0.788     0.867     0.825        30

    accuracy                          0.838        68
   macro avg      0.837     0.841     0.837        68
weighted avg      0.843     0.838     0.839        68

[[31  4]
 [ 7 26]]
Cost : 27
--------------------------------------------------
SVC
              precision    recall  f1-score   support

     absence      0.939     0.816     0.873        38
    presence      0.800     0.933     0.862        30

    accuracy                          0.868        68
   macro avg      0.870     0.875     0.867        68
weighted avg      0.878     0.868     0.868        68

[[31  2]
 [ 7 28]]
Cost : 17
--------------------------------------------------
MultinomialNB
              precision    recall  f1-score   support

     absence      0.857  

Here we use oversampling instead, increasing the number of minority-class examples until it matches the majority class. Again, the SVM consistently delivers a low cost (17), followed by the random forest, whose expected cost increases (22 -> 27). Oversampling benefits the naive Bayes classifier, lowering its expected cost from 38 to 33, although it is still considerably worse than the other 2.

As a note on oversampling and undersampling: naive Bayes, a method that relies heavily on good probability estimates, does not benefit much from simply balancing the classes. Next, we'll try to address that by combining the 2 methods (undersampling followed by oversampling).

In [None]:
# Combine the 2 methods
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)

sampler1 = RandomUnderSampler(sampling_strategy='majority')
sampler2 = RandomOverSampler(sampling_strategy={0:120, 1:250})

print('before', Counter(y_train))
X_train_rs, y_train_rs = sampler1.fit_resample(X_train, y_train)

print('after under sample', Counter(y_train_rs)) # both classes have equal number of samples

X_train_rs2, y_train_rs2 = sampler2.fit_resample(X_train_rs, y_train_rs)

print('after oversample', Counter(y_train_rs2), end='\n\n') # positives now outnumber negatives roughly 2:1

for clf in classifiers:
    print(clf.__class__.__name__)
    clf.fit(X_train_rs2, y_train_rs2)
    y_pred = clf.predict(X_test)
    print(classification_report(y_test, y_pred, 
                                digits=3, target_names=['absence', 'presence']))
    conf_m = confusion_matrix(y_test, y_pred).T # transpose to align with slides
    print(conf_m) 
    print(f"Cost : {np.sum(conf_m * cost_matrix_uci)}")
    print('-' * 50)

before Counter({0: 112, 1: 90})
after under sample Counter({0: 90, 1: 90})
after oversample Counter({1: 250, 0: 120})

RandomForestClassifier
              precision    recall  f1-score   support

     absence      0.903     0.737     0.812        38
    presence      0.730     0.900     0.806        30

    accuracy                          0.809        68
   macro avg      0.816     0.818     0.809        68
weighted avg      0.827     0.809     0.809        68

[[28  3]
 [10 27]]
Cost : 25
--------------------------------------------------
SVC
              precision    recall  f1-score   support

     absence      0.935     0.763     0.841        38
    presence      0.757     0.933     0.836        30

    accuracy                          0.838        68
   macro avg      0.846     0.848     0.838        68
weighted avg      0.857     0.838     0.838        68

[[29  2]
 [ 9 28]]
Cost : 19
--------------------------------------------------
MultinomialNB
              precision   

After combining the 2 sampling strategies, first undersampling the majority class and then oversampling both classes so that the positive class ends up with approximately twice as many samples as the negative class (250 vs. 120), we see a large decrease in the expected cost of the naive Bayes classifier (from 33 or more down to 23). By putting more emphasis on the class with the highest misclassification cost, the classifier has managed to reduce the false negatives.
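The oversampling step with explicit target counts can be sketched as follows. This is a minimal NumPy illustration that starts from toy labels with the balanced 90/90 counts produced by the undersampling step; `random_oversample` is a hypothetical helper and assumes every target is at least as large as the current class size.

```python
import numpy as np


def random_oversample(y, target_counts, rng):
    """Return indices in which each class is topped up to its target count by
    drawing extra examples with replacement from that class."""
    y = np.asarray(y)
    parts = []
    for cls, target in target_counts.items():
        members = np.flatnonzero(y == cls)
        extra = rng.choice(members, size=target - len(members), replace=True)
        parts.append(np.concatenate([members, extra]))
    return np.concatenate(parts)


rng = np.random.default_rng(42)
y_balanced = np.array([0] * 90 + [1] * 90)  # counts after the undersampling step

idx = random_oversample(y_balanced, {0: 120, 1: 250}, rng)
counts = np.bincount(y_balanced[idx])
ratio = counts[1] / counts[0]
print(counts, round(ratio, 2))
```

The resulting positive-to-negative ratio is a little over 2:1, which is still well below the 5:1 cost ratio. The target counts are therefore best seen as a tunable choice rather than something derived from the cost matrix.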


---




### Weighting

Weight each training example by its misclassification cost: examples whose response is 0 (absence) get a weight of 1, and examples whose response is 1 (presence) get a weight of 5.

In [None]:
np.random.seed(42)

# Weight 5 for presence (a missed case is a costly false negative), 1 for absence
weights = np.where(y_train == 1, 5, 1)

for clf in classifiers:
    print(clf.__class__.__name__)
    clf.fit(X_train, y_train, sample_weight=weights)
    y_pred = clf.predict(X_test)
    print(classification_report(y_test, y_pred, 
                                digits=3, target_names=['absence', 'presence']))
    conf_m = confusion_matrix(y_test, y_pred).T # transpose to align with slides
    print(conf_m) 
    print(f"\nCost w/ weighting : {np.sum(conf_m * cost_matrix_uci)}")
    print('-' * 50)


RandomForestClassifier
              precision    recall  f1-score   support

     absence      0.846     0.868     0.857        38
    presence      0.828     0.800     0.814        30

    accuracy                          0.838        68
   macro avg      0.837     0.834     0.835        68
weighted avg      0.838     0.838     0.838        68

[[33  6]
 [ 5 24]]

Cost w/ weighting : 35
--------------------------------------------------
SVC
              precision    recall  f1-score   support

     absence      0.926     0.658     0.769        38
    presence      0.683     0.933     0.789        30

    accuracy                          0.779        68
   macro avg      0.804     0.796     0.779        68
weighted avg      0.819     0.779     0.778        68

[[25  2]
 [13 28]]

Cost w/ weighting : 23
--------------------------------------------------
MultinomialNB
              precision    recall  f1-score   support

     absence      0.903     0.737     0.812        38
    pres

Here we see a noticeable increase in the expected cost of the random forest (35) and the SVM (23), worse than most of the results obtained with the other 2 methods (calibration and rebalancing). Naive Bayes seems to benefit most from weighting, since its expected cost is much lower than with no cost minimization at all and on par with calibration and with rebalancing (using the combination of sampling strategies).
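To get a feeling for how strongly the 1:5 weights tilt the training signal, one can compare the total weight carried by each class. The sketch below assumes toy labels with the same class counts as the training split (112 absence, 90 presence).

```python
import numpy as np

y_toy = np.array([0] * 112 + [1] * 90)  # same class counts as the training split
weights = np.where(y_toy == 1, 5, 1)    # cost-based weight for each example

# Total weight carried by each class, and the share held by the positive class.
weight_per_class = np.bincount(y_toy, weights=weights)
share_positive = weight_per_class[1] / weight_per_class.sum()

print(weight_per_class, round(share_positive, 3))
```

The positive class ends up with about 80% of the total weight, a much stronger shift than the roughly 2:1 ratio used in the combined resampling experiment.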

##### We can conclude that the linear Support Vector Machine consistently outperformed the other 2 algorithms in almost all experiments on this particular dataset. It delivered the lowest expected cost and did not seem very sensitive to the method used, whether calibration, rebalancing or weighting (it did not perform well with weighting, but it was still the best of the 3). Random forests did not benefit greatly from calibration, plausibly because the probabilities they emit are already fairly accurate. Some of their best results were achieved with isotonic regression and undersampling, although these were still close to the baseline (without cost minimization). Naive Bayes was strongly affected by calibration, since the probabilities it outputs are not very accurate.