# Pattern Recognition - Project #2

*   **Course:** Pattern Recognition - Fall 2020
*   **Instructor:** [Dr. Morteza Analoui](https://www.scopus.com/authid/detail.uri?authorId=16835800400)
*   **Teaching Assistants:** Seyed Hassan Tabatabaei, Pedram Dadkhah
*   **Student:** [Parsa Abbasi](https://parsa-abbasi.ir/)
*   ***Iran University of Science and Technology (IUST)***




## Libraries

In [1]:
# General
from scipy.io import arff
import pandas as pd
import numpy as np
from sklearn.utils import shuffle
from tqdm.notebook import trange, tqdm

# Visualization
import plotly.express as px
import plotly.graph_objects as go

# Machine Learning
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import KFold
from sklearn import svm
from sklearn.neighbors import NearestNeighbors

# Metrics
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve, auc, roc_auc_score

## 1) Dataset

The dataset we will use in this project is called ***habermanImb*** and the name refers to the *S.J. Haberman* research in 1976.   
The dataset was obtained from a study conducted between 1958 and 1970 at the University of *Chicago's Billings Hospital* on the survival of patients after breast cancer surgery [[1]](https://lib.ugent.be/fulltxt/RUG01/002/163/676/RUG01-002163676_2014_0001_AC.pdf).   
It contains two classes that are already imbalanced. The <font color='green'>positive</font> class means that the patient survived 5 years or longer, while the <font color='red'>negative</font> class means that the patient passed away within 5 years.

**Attributes (Columns) :**
*   *Age:* Age of patient at the time of operation
*   *Year:* Patient's year of operation (year - 1900)
*   *Positive:* Number of positive axillary nodes detected
    *    If lymph nodes have some cancer cells in them, they are called positive [[2]](https://www.breastcancer.org/symptoms/diagnosis/lymph_nodes).
    *   A positive axillary lymph node is a lymph node in the area of the armpit (axilla) to which cancer has spread [[3]](https://en.wikipedia.org/wiki/Positive_axillary_lymph_node#:~:text=A%20positive%20axillary%20lymph%20node,whether%20cancer%20cells%20are%20present). 
*   *Class:* Survival status
    *   <font color='green'>Positive = +1:</font> The patient survived 5 years or longer
    *   <font color='red'>Negative = -1:</font> The patient passed away within 5 years





### 1.1) Loading

In [2]:
# The paths to dataset files from my google drive
files_addresses = [
                   '/content/drive/MyDrive/Dataset/HabermanImbalanced/habermanImb-5-1tra.arff',
                   '/content/drive/MyDrive/Dataset/HabermanImbalanced/habermanImb-5-1tst.arff',
                   '/content/drive/MyDrive/Dataset/HabermanImbalanced/habermanImb-5-2tra.arff',
                   '/content/drive/MyDrive/Dataset/HabermanImbalanced/habermanImb-5-2tst.arff',
                   '/content/drive/MyDrive/Dataset/HabermanImbalanced/habermanImb-5-3tra.arff',
                   '/content/drive/MyDrive/Dataset/HabermanImbalanced/habermanImb-5-3tst.arff',
                   '/content/drive/MyDrive/Dataset/HabermanImbalanced/habermanImb-5-4tra.arff',
                   '/content/drive/MyDrive/Dataset/HabermanImbalanced/habermanImb-5-4tst.arff',
                   '/content/drive/MyDrive/Dataset/HabermanImbalanced/habermanImb-5-5tra.arff',
                   '/content/drive/MyDrive/Dataset/HabermanImbalanced/habermanImb-5-5tst.arff'
]

In [3]:
# Creating a dataframe that contains all the extracted data
data = pd.DataFrame(columns=['Age', 'Year', 'Positive', 'Class'])

In [4]:
# Read each of the dataset files
for address in files_addresses:
  file_arff = arff.loadarff(address)
  file_df = pd.DataFrame(file_arff[0])
  data = data.append(file_df,ignore_index=True)

In [5]:
# Take a look at the data
data[:10]

Unnamed: 0,Age,Year,Positive,Class
0,38.0,59.0,2.0,b'negative'
1,39.0,63.0,4.0,b'negative'
2,49.0,62.0,1.0,b'negative'
3,56.0,67.0,0.0,b'negative'
4,64.0,58.0,0.0,b'negative'
5,55.0,69.0,22.0,b'negative'
6,45.0,66.0,0.0,b'positive'
7,52.0,61.0,0.0,b'negative'
8,61.0,65.0,8.0,b'negative'
9,54.0,62.0,0.0,b'negative'


### 1.2) Encode classes

In [6]:
# Change the class labels to a real value
data.loc[data['Class'] == b'negative', 'Class'] = -1
data.loc[data['Class'] == b'positive', 'Class'] = 1

In [7]:
data[:10]

Unnamed: 0,Age,Year,Positive,Class
0,38.0,59.0,2.0,-1
1,39.0,63.0,4.0,-1
2,49.0,62.0,1.0,-1
3,56.0,67.0,0.0,-1
4,64.0,58.0,0.0,-1
5,55.0,69.0,22.0,-1
6,45.0,66.0,0.0,1
7,52.0,61.0,0.0,-1
8,61.0,65.0,8.0,-1
9,54.0,62.0,0.0,-1


In [8]:
# Extract positive and negative samples separately

positive_samples = data.loc[data['Class'] == 1]
negative_samples = data.loc[data['Class'] == -1]

positive_counts = len(positive_samples)
negative_counts = len(negative_samples)

dif_counts = negative_counts - positive_counts
ratio = negative_counts/positive_counts
num_data = len(data)

print("The number of data:", num_data)
print("The number of positive samples:", positive_counts)
print("The number of negative samples:", negative_counts)
print("The difference between the majority class to the minority class:", dif_counts)
print("The ratio of majority class to the minority class:", ratio)

The number of data: 1530
The number of positive samples: 405
The number of negative samples: 1125
The difference between the majority class to the minority class: 720
The ratio of majority class to the minority class: 2.7777777777777777


In [9]:
duplicates_num = len(data)-len(data.drop_duplicates())
print('Number of duplicate data:', duplicates_num)
print('Number of unique data:', len(data.drop_duplicates()))

Number of duplicate data: 1241
Number of unique data: 289


Most of the examples in the dataset are repetitive. The question is should we remove the duplicate samples? In most of the situation, we should do that, but here as the data related to diseases, it's more cautious to suppose each of them is identical.

In [10]:
# 3D visualization
px.scatter_3d(data, x='Positive', y='Age', z='Year',
              title='Original Data', color='Class',
              color_discrete_sequence=px.colors.qualitative.Dark2)

### 1.3) Normalization

In [11]:
def min_max_normalization(dataframe, column):
  # Finding the minimum and maximum values of the feature
  minimum_value = dataframe[column].min()
  maximum_value = dataframe[column].max()
  # Subtracting the minimum value of the feature
  numerator = dataframe[column] - minimum_value
  # Dividing by the range
  denominator = maximum_value - minimum_value
  # Putting all together
  dataframe[column] = numerator/denominator

In [12]:
# Apply min-max normalization on the Age, Year, and Positive columns
min_max_normalization(data, 'Age')
min_max_normalization(data, 'Year')
min_max_normalization(data, 'Positive')

In [13]:
data.head()

Unnamed: 0,Age,Year,Positive,Class
0,0.150943,0.090909,0.038462,-1
1,0.169811,0.454545,0.076923,-1
2,0.358491,0.363636,0.019231,-1
3,0.490566,0.818182,0.0,-1
4,0.641509,0.0,0.0,-1


In [14]:
# 3D visualization
px.scatter_3d(data, x='Positive', y='Age', z='Year',
              title='Normalized Data', color='Class',
              color_discrete_sequence=px.colors.qualitative.Dark2)

### 1.4) Prepare data

Splitting features from the label.

In [15]:
# Splitting feature vectors and class labels
X = data[['Age','Year','Positive']].values
y = data['Class'].values
y = y.astype('int')

### 1.5) K-fold

In [16]:
# Setting 5-fold
five_fold = KFold(n_splits=5, random_state=42, shuffle=True)

### 1.6) Visualization and Metric functions

In [17]:
def get_metrics(true_labels, predicted_labels):
  accuracy = accuracy_score(true_labels, predicted_labels)
  precision = precision_score(true_labels, predicted_labels)
  recall = recall_score(true_labels, predicted_labels)
  fscore = f1_score(true_labels, predicted_labels, labels=np.unique(predicted_labels))
  result = {'Precision': precision, 'Recall': recall, 'F1score':fscore, 'Accuracy':accuracy}
  cm = confusion_matrix(true_labels, predicted_labels).ravel()
  # Output = (tn, fp, fn, tp)
  result['TN'], result['FP'], result['FN'], result['TP'] = cm
  return result

In [18]:
class LossPlotter:
  def __init__(self, title, xaxis, yaxis):
    self.fig = go.Figure()
    self.fig.update_layout(title=title,
                    xaxis_title=xaxis,
                    yaxis_title=yaxis
                    )

  def add_scatter(self, losses, name):
    fig_x = list(range(len(losses)))
    fig_x = [i+1 for i in fig_x]
    self.fig = self.fig.add_trace(go.Scatter(x=fig_x, 
                                        y=losses,
                                        name=name))

  def show(self):
    self.fig.show()

In [19]:
class RocPlotter:
  def __init__(self, title):
    self.fig = go.Figure()
    self.fig.add_shape(
        type='line', line=dict(dash='dash'),
        x0=0, x1=1, y0=0, y1=1
    )
    self.fig.update_layout(title=title)

  def add_scatter(self, true_label, pred_scores, name):
    fpr, tpr, thresholds = roc_curve(true_label, pred_scores)
    auc_score = roc_auc_score(true_label, pred_scores)
    scatter_name = f"{name} (AUC={auc_score:.4f})"
    self.fig.add_trace(go.Scatter(x=fpr, y=tpr, name=scatter_name))
    return auc_score

  def show(self):
    self.fig.update_layout(
        xaxis_title='False Positive Rate',
        yaxis_title='True Positive Rate',
        yaxis=dict(scaleanchor="x", scaleratio=1),
        xaxis=dict(constrain='domain'),
        width=700, height=500
    )
    self.fig.show()

In [20]:
def plot_histogram(true_label, pred_scores, title):
  fpr, tpr, thresholds = roc_curve(true_label, pred_scores)

  # The histogram of scores compared to true labels
  fig_hist = px.histogram(
      x=pred_scores, color=true_label, nbins=50,
      labels=dict(color='True Labels', x='Score')
  )
  fig_hist.update_layout(title=title)
  fig_hist.show()

The following function takes the labels and ensemble values then compute exponential loss.

In [21]:
def ensemble_loss_binary(true_labels, pred_labels, values):

  # The number of samples
  m = len(true_labels)

  # The exponential loss value
  loss = 0

  # For each sample
  for index, hfin in enumerate(values):
    current_loss = np.exp(-true_labels[index]*hfin)
    loss += current_loss

  loss = (1/m)*loss

  return loss

### 1.7) Hyperparameter and Variables

The hyperparameters we should set are following :

*   **max_depth:** It defines the maximum depth for the decision tree. If it will set to 1, then the decision tree act like a stump.
*   **max_features:** It defines the maximum features that the decision tree should be considered. We usually set it equal to 1 in Boosting algorithms.
*   **k_in_knn:** The number of nearest neighbors. As in this particular dataset, there are many duplicate examples, and also we need a good diversity in SMOTE algorithm this hyperparameter was set to 20.
*   **smote_ratio:** The SMOTE ratio when creating synthetic examples. For example, if it was set to 200, then the number of synthetic examples will be equal to double the minority class size.
*   **T:** A list of the number of hypothesis/classifier we should make in each boosting algorithm.



In [22]:
max_depth = 1
max_features = 1
k_in_knn = 20
smote_ratio = 100

# Each value of T is the number of classifiers
T = [10, 50, 100]

# A dataframe containing the results for 5-fold
five_fold_results = pd.DataFrame(
    columns=['Algorithm', 'T', 'Loss', 'Precision', 'Recall',
             'F1score', 'AUC', 'Accuracy', 'TN', 'FP', 'FN', 'TP']
    )

# A dataframe containing the results for whole training data
train_results = pd.DataFrame(
    columns=['Algorithm', 'T', 'Loss', 'Precision', 'Recall',
             'F1score', 'AUC', 'Accuracy', 'TN', 'FP', 'FN', 'TP']
    )

## 2) AdaBoost.M2

**AdaBoost** is a practical and easy to implement boosting algorithm. It can be used to significantly reduce the error of any learning algorithm that consistently generates classifiers whose performance is a little better than random guessing [[4]](https://cseweb.ucsd.edu/~yfreund/papers/boostingexperiments.pdf).

In multi-label problems with $k>2$ the requirement that the error be less than $1/2$ is may often be hard to meet in **AdaBoost.M1**, but in the second version of the algorithm, **AdaBoost.M2**, this difficulty solved by using a more sophisticated error measure called *pseudo-loss*.   
Pseudo-loss is a method for forcing a learning algorithm of multi-label concepts to concentrate on the labels that are hardest to discriminate.

The pseudo-code shown in the bellow is from a [paper](https://arxiv.org/pdf/1708.03704.pdf) by *Alan Mosca* and *George D. Magoulas*.

![picture](https://drive.google.com/uc?id=1smHu06WuANhTiJ4EqO0XIrn7ud5PeuAV)

**Inputs:**
*   $X$: sequence of $m$ samples feature vectors $[x_1, x_2, ..., x_m]$
*   $y$: sequence of $m$ samples labels $[y_1, y_2, ..., y_m]$
*   $T$: an integer specifying the number of iterations


### 2.1) Algorithm implementation

In [23]:
class AdaBoostM2:
  """
    Functions:
      run(self) -> Run the algorithm of AdaBoost.M2
      predict(self, input) -> Predict the labels for given samples using created ensemble
      pseudo_losses(self) -> The pseudo-loss for each hypothesis
  """


  # Initilizition
  def __init__(self, X, y, T):
    """
      X: Feature vectors
      y: Labels
      m: Number of samples
      T: Number of iterations/classifiers/hypothesis
      H: Ensemble of classifiers
      beta_list: A list containing the beta value for each classifier
      loss_list: A list containing the pseudo-loss of each classifier
    """
    self.X = X
    self.y = y
    self.m = len(X)
    self.T = T
    self.H = []
    self.beta_list = []
    self.loss_list = []
    self.index_to_label = {}
    self.label_to_index = {}
  

  # Running the AdaBoost.M2 algorithm
  def run(self):

    # Initilize distribution
    D_t = (1./self.m) * np.ones(self.m)

    for t in trange(self.T):

      # Train a new classifier on training data with distibution D_t
      weak_learn = DecisionTreeClassifier(max_depth=max_depth, max_features=max_features)
      weak_learn.fit(self.X, self.y, sample_weight=D_t)
      
      # Get predictions and probabilities for the training data
      predictions = weak_learn.predict(self.X)
      probabilities = weak_learn.predict_proba(self.X)

      if t == 0:
        for index, value in enumerate(weak_learn.classes_):
          self.index_to_label[index] = value
          self.label_to_index[value] = index
      
      # Compute pseudo-loss
      pseudo_loss = 0

      # For each prediction
      for index, pr in enumerate(predictions):

        # Find the true label of the sample
        true_label = self.y[index]
        
        # Check if it's misclassified
        if pr != true_label:
          # Probability of assigning the predicted label
          pr_index = self.label_to_index[pr]
          pred_prob = probabilities[index][pr_index]
          # Probability of assigning the true label
          true_index = self.label_to_index[true_label]
          true_prob = probabilities[index][true_index]
          # Compute loss for the given sample
          current_loss = D_t[index] * (1 - true_prob + pred_prob)
          # Add it to pseudo_loss
          pseudo_loss += current_loss

      # Final step to compute pseudo-loss
      pseudo_loss = (1/2)*pseudo_loss

      # Compute beta
      beta = pseudo_loss / (1 - pseudo_loss)
      
      # Update distribution
      for index, value in enumerate(D_t):
        predicted_label = predictions[index]
        true_label = self.y[index]
        # Probability of assigning the predicted label
        pr_index = self.label_to_index[predicted_label]
        pred_prob = probabilities[index][pr_index]
        # Probability of assigning the true label
        true_index = self.label_to_index[true_label]
        true_prob = probabilities[index][true_index]
        power = (1/2) * (1 + true_prob - pred_prob)
        D_t[index] = (value) * pow(beta,power)
      
      # Normalization factor
      #Z_t = np.sum(D_t)
      D_t = D_t / np.sum(D_t)

      # Append weak learner to the ensemble
      self.H.append(weak_learn)
      # Append computed loss of the weak learner
      self.loss_list.append(pseudo_loss)
      # Append computed beta of the weak learner
      self.beta_list.append(beta)
  
      # print('t =', t, '--> pseudo-loss =', pseudo_loss)
    
    # Convert beta and loss lists to numpy array
    self.beta_list = np.array(self.beta_list)
    self.loss_list = np.array(self.loss_list)


  def predict(self, input):
    # Sum of the probabilities made by each hypothesis for a given sample and label
    sum_of_h = None
    # Compute the multiply factors
    alpha_list = np.log(1/self.beta_list)
    # For each hypothesis
    for index, h in enumerate(self.H):
      # Predict probabilities
      probabilities = h.predict_proba(input)
      # Multiply probabilities by the hypothesis computed alpha
      weighted_prob = alpha_list[index] * probabilities
      # If it's the first hypothesis
      if index == 0:
        sum_of_h = weighted_prob
      else:
        sum_of_h += weighted_prob
    # Choose the label using argmax

    predicted_idx = np.argmax(sum_of_h, axis=1)
    predicted_labels = []
    for pr in predicted_idx:

      predicted_labels.append(self.index_to_label[pr])
    predicted_labels = np.array(predicted_labels)
    #print(predicted_labels)
    # Remember the h_finite value
    h_fin = [sum_of_h[index][label]*predicted_labels[index] for index, label in enumerate(predicted_idx)]

    #print(h_fin)
    alpha_norm = np.linalg.norm(alpha_list, ord=1)
    h_fin = h_fin/alpha_norm
    return predicted_labels, h_fin

  # Print the computed pseudo-loss of each classifier
  def pseudo_losses(self):
    return self.loss_list


The final loss of the ensemble will be computed using a exponential loss function. It will only suitable for binary classification (-1 and +1 labels). The exponential loss function is sensitive to outliers/label noise.

$Loss(X) = \frac{1}{m}\sum_{i=1}^{m}e^{-y_i\bar{h}(x_i)}$, where $\bar{h}(x_i)=y\sum_{t=1}^{T}\alpha_th_t(x_i)$ for $y$ that maximizes the $h_{fin}(x_i)$.

### 2.2) Run on 5-fold

Train the algorithm using 4-fold data and evaluate its performance using the remaining 1-fold data.

In [24]:
# Remove last results
five_fold_results = five_fold_results[five_fold_results['Algorithm'] != 'AdaBoostM2']

for t in T:

  # Print the pseudo-loss of each classifier
  loss_plotter = LossPlotter('Loss of each classifier for T='+str(t), 't', 'pseudo-loss')
  roc_plotter = RocPlotter('ROC curve for T='+str(t))
  # Fold counter
  fold_count = 1
  # A list containing all the fold results
  all_fold_results = []
  print('T=',str(t))
  # Print a horizontal line
  print ('-' * 128)

  # For each fold
  for train_index, test_index in five_fold.split(X):
    # Split the train and test set
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # Create an AdaBoostM2 object
    adaboostm2 = AdaBoostM2(X_train, y_train, t)
    # Run the algorithm
    adaboostm2.run()
    # Add scatter of this pseudo-loss to the loss plotter
    pseudo_losses = adaboostm2.pseudo_losses()
    loss_plotter.add_scatter(pseudo_losses, 'fold-'+str(fold_count))
    # Prediction on the test data
    predictions, hfin_values = adaboostm2.predict(X_test)
    model_margin = abs(hfin_values[0])
    for value in hfin_values:
      current_margin = abs(value)
      if current_margin < model_margin:
        model_margin = current_margin
    # Plot ROC curve
    auc_score = roc_plotter.add_scatter(y_test, hfin_values, 'Fold-'+str(fold_count))
    # Compute the loss on the test data
    loss = ensemble_loss_binary(y_test, predictions, hfin_values)
    results = {'Loss': loss, 'Margin': model_margin}
    results['AUC'] = auc_score
    results.update(get_metrics(y_test, predictions))
    all_fold_results.append(results)
    print('Fold-'+str(fold_count), results, '\n')
    fold_count += 1
  
  # Compute the average
  df_results = pd.DataFrame(all_fold_results)
  avg_results = dict(df_results.mean())
  # Print a horizontal line
  print ('-' * 128)

  # Remember the results
  avg_results['Algorithm'] = 'AdaBoostM2'
  avg_results['T'] = t
  five_fold_results = five_fold_results.append(avg_results, ignore_index=True)
  print('Average', avg_results)

  # Show the loss plot
  loss_plotter.show()
  roc_plotter.show()

T= 10
--------------------------------------------------------------------------------------------------------------------------------


HBox(children=(FloatProgress(value=0.0, max=10.0), HTML(value='')))


Fold-1 {'Loss': 0.8396406711472865, 'Margin': 0.5044568489811636, 'AUC': 0.7167810355583281, 'Precision': 1.0, 'Recall': 0.05194805194805195, 'F1score': 0.09876543209876544, 'Accuracy': 0.761437908496732, 'TN': 229, 'FP': 0, 'FN': 73, 'TP': 4} 



HBox(children=(FloatProgress(value=0.0, max=10.0), HTML(value='')))


Fold-2 {'Loss': 0.8423715581836357, 'Margin': 0.5150196848340229, 'AUC': 0.721030042918455, 'Precision': 0.0, 'Recall': 0.0, 'F1score': 0.0, 'Accuracy': 0.761437908496732, 'TN': 233, 'FP': 0, 'FN': 73, 'TP': 0} 




Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.



HBox(children=(FloatProgress(value=0.0, max=10.0), HTML(value='')))


Fold-3 {'Loss': 0.789353235578546, 'Margin': 0.5049659659543426, 'AUC': 0.7893815289648624, 'Precision': 1.0, 'Recall': 0.1527777777777778, 'F1score': 0.26506024096385544, 'Accuracy': 0.8006535947712419, 'TN': 234, 'FP': 0, 'FN': 61, 'TP': 11} 



HBox(children=(FloatProgress(value=0.0, max=10.0), HTML(value='')))


Fold-4 {'Loss': 0.904876179297535, 'Margin': 0.5244152617647644, 'AUC': 0.7258841195049966, 'Precision': 0.0, 'Recall': 0.0, 'F1score': 0.0, 'Accuracy': 0.7091503267973857, 'TN': 217, 'FP': 0, 'FN': 89, 'TP': 0} 




Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.



HBox(children=(FloatProgress(value=0.0, max=10.0), HTML(value='')))


Fold-5 {'Loss': 0.9200641919971729, 'Margin': 0.5042644447407374, 'AUC': 0.7499749096748294, 'Precision': 0.0, 'Recall': 0.0, 'F1score': 0.0, 'Accuracy': 0.6928104575163399, 'TN': 212, 'FP': 0, 'FN': 94, 'TP': 0} 

--------------------------------------------------------------------------------------------------------------------------------
Average {'Loss': 0.8592611672408351, 'Margin': 0.5106244412550062, 'AUC': 0.7406103273242943, 'Precision': 0.4, 'Recall': 0.04094516594516595, 'F1score': 0.07276513461252418, 'Accuracy': 0.7450980392156863, 'TN': 225.0, 'FP': 0.0, 'FN': 78.0, 'TP': 3.0, 'Algorithm': 'AdaBoostM2', 'T': 10}



Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.



T= 50
--------------------------------------------------------------------------------------------------------------------------------


HBox(children=(FloatProgress(value=0.0, max=50.0), HTML(value='')))


Fold-1 {'Loss': 0.8348766025528134, 'Margin': 0.5002841869241916, 'AUC': 0.7431804003856406, 'Precision': 0.6176470588235294, 'Recall': 0.2727272727272727, 'F1score': 0.3783783783783784, 'Accuracy': 0.7745098039215687, 'TN': 216, 'FP': 13, 'FN': 56, 'TP': 21} 



HBox(children=(FloatProgress(value=0.0, max=50.0), HTML(value='')))


Fold-2 {'Loss': 0.8515135092509271, 'Margin': 0.5028538626134305, 'AUC': 0.7227644188370862, 'Precision': 0.4838709677419355, 'Recall': 0.2054794520547945, 'F1score': 0.28846153846153844, 'Accuracy': 0.7581699346405228, 'TN': 217, 'FP': 16, 'FN': 58, 'TP': 15} 



HBox(children=(FloatProgress(value=0.0, max=50.0), HTML(value='')))


Fold-3 {'Loss': 0.795272332439189, 'Margin': 0.5003885584313804, 'AUC': 0.8056742640075973, 'Precision': 0.7058823529411765, 'Recall': 0.3333333333333333, 'F1score': 0.45283018867924524, 'Accuracy': 0.8104575163398693, 'TN': 224, 'FP': 10, 'FN': 48, 'TP': 24} 



HBox(children=(FloatProgress(value=0.0, max=50.0), HTML(value='')))


Fold-4 {'Loss': 0.8638848045580193, 'Margin': 0.5005755433738264, 'AUC': 0.7319939936830114, 'Precision': 0.6666666666666666, 'Recall': 0.2696629213483146, 'F1score': 0.384, 'Accuracy': 0.7483660130718954, 'TN': 205, 'FP': 12, 'FN': 65, 'TP': 24} 



HBox(children=(FloatProgress(value=0.0, max=50.0), HTML(value='')))


Fold-5 {'Loss': 0.8836907306554217, 'Margin': 0.5008839618718305, 'AUC': 0.7478422320353272, 'Precision': 0.6666666666666666, 'Recall': 0.23404255319148937, 'F1score': 0.3464566929133858, 'Accuracy': 0.7287581699346405, 'TN': 201, 'FP': 11, 'FN': 72, 'TP': 22} 

--------------------------------------------------------------------------------------------------------------------------------
Average {'Loss': 0.8458475958912741, 'Margin': 0.5009972226429319, 'AUC': 0.7502910617897326, 'Precision': 0.628146742567995, 'Recall': 0.2630491065310409, 'F1score': 0.37002535968650957, 'Accuracy': 0.7640522875816993, 'TN': 212.6, 'FP': 12.4, 'FN': 59.8, 'TP': 21.2, 'Algorithm': 'AdaBoostM2', 'T': 50}


T= 100
--------------------------------------------------------------------------------------------------------------------------------


HBox(children=(FloatProgress(value=0.0), HTML(value='')))


Fold-1 {'Loss': 0.8328284213656552, 'Margin': 0.5004276025486022, 'AUC': 0.7608177848352521, 'Precision': 0.6363636363636364, 'Recall': 0.2727272727272727, 'F1score': 0.3818181818181818, 'Accuracy': 0.7777777777777778, 'TN': 217, 'FP': 12, 'FN': 56, 'TP': 21} 



HBox(children=(FloatProgress(value=0.0), HTML(value='')))


Fold-2 {'Loss': 0.8541824086113001, 'Margin': 0.5004997492409027, 'AUC': 0.7442824387089187, 'Precision': 0.4838709677419355, 'Recall': 0.2054794520547945, 'F1score': 0.28846153846153844, 'Accuracy': 0.7581699346405228, 'TN': 217, 'FP': 16, 'FN': 58, 'TP': 15} 



HBox(children=(FloatProgress(value=0.0), HTML(value='')))


Fold-3 {'Loss': 0.7874428362027867, 'Margin': 0.5003327281832215, 'AUC': 0.8041607312440646, 'Precision': 0.7931034482758621, 'Recall': 0.3194444444444444, 'F1score': 0.4554455445544554, 'Accuracy': 0.8202614379084967, 'TN': 228, 'FP': 6, 'FN': 49, 'TP': 23} 



HBox(children=(FloatProgress(value=0.0), HTML(value='')))


Fold-4 {'Loss': 0.8754823744372592, 'Margin': 0.5001556787789521, 'AUC': 0.7432817273339202, 'Precision': 0.6153846153846154, 'Recall': 0.2696629213483146, 'F1score': 0.37499999999999994, 'Accuracy': 0.738562091503268, 'TN': 202, 'FP': 15, 'FN': 65, 'TP': 24} 



HBox(children=(FloatProgress(value=0.0), HTML(value='')))


Fold-5 {'Loss': 0.8955267187204742, 'Margin': 0.5000241173173546, 'AUC': 0.7575270975511842, 'Precision': 0.6111111111111112, 'Recall': 0.23404255319148937, 'F1score': 0.3384615384615385, 'Accuracy': 0.7189542483660131, 'TN': 198, 'FP': 14, 'FN': 72, 'TP': 22} 

--------------------------------------------------------------------------------------------------------------------------------
Average {'Loss': 0.8490925518674951, 'Margin': 0.5002879752138066, 'AUC': 0.7620139559346679, 'Precision': 0.627966755775432, 'Recall': 0.2602713287532631, 'F1score': 0.36783736065914285, 'Accuracy': 0.7627450980392156, 'TN': 212.4, 'FP': 12.6, 'FN': 60.0, 'TP': 21.0, 'Algorithm': 'AdaBoostM2', 'T': 100}


### 2.3) Run on whole training data

Train and evaluate the algorithm both on the training data.

In [25]:
# Remove last results
train_results = train_results[train_results['Algorithm'] != 'AdaBoostM2']

for t in T:

  # Print the pseudo-loss of each classifier
  loss_plotter = LossPlotter('Loss of each classifier for T='+str(t), 't', 'pseudo-loss')
  roc_plotter = RocPlotter('ROC curve for T='+str(t))
  print('T=',str(t))
  # Print a horizontal line
  print ('-' * 128)

  # Create an AdaBoostM2 object
  adaboostm2 = AdaBoostM2(X, y, t)
  # Run the algorithm
  adaboostm2.run()
  # Add scatter of this pseudo-loss to the loss plotter
  pseudo_losses = adaboostm2.pseudo_losses()
  loss_plotter.add_scatter(pseudo_losses, 'train data')
  # Prediction on the test data
  predictions, hfin_values = adaboostm2.predict(X)
  model_margin = abs(hfin_values[0])
  for value in hfin_values:
    current_margin = abs(value)
    if current_margin < model_margin:
      model_margin = current_margin
  # Plot ROC curve
  auc_score = roc_plotter.add_scatter(y, hfin_values, 'train data')
  # Compute the loss on the test data
  loss = ensemble_loss_binary(y, predictions, hfin_values)
  # Compute metrics
  results = {'Loss': loss, 'Margin': model_margin}
  results.update(get_metrics(y, predictions))

  # Remember the results
  results['Algorithm'] = 'AdaBoostM2'
  results['T'] = t
  results['AUC'] = auc_score
  train_results = train_results.append(results, ignore_index=True)
  print(results)

  # Show the loss plot
  loss_plotter.show()
  roc_plotter.show()
  plot_histogram(y, hfin_values, 'The histogram of scores compared to true labels for T='+str(t))

T= 10
--------------------------------------------------------------------------------------------------------------------------------


HBox(children=(FloatProgress(value=0.0, max=10.0), HTML(value='')))


{'Loss': 0.8735696856084166, 'Margin': 0.5389236257710969, 'Precision': 0.0, 'Recall': 0.0, 'F1score': 0.0, 'Accuracy': 0.7352941176470589, 'TN': 1125, 'FP': 0, 'FN': 405, 'TP': 0, 'Algorithm': 'AdaBoostM2', 'T': 10, 'AUC': 0.7382167352537723}



Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.



T= 50
--------------------------------------------------------------------------------------------------------------------------------


HBox(children=(FloatProgress(value=0.0, max=50.0), HTML(value='')))


{'Loss': 0.8451630507923947, 'Margin': 0.5001877338973072, 'Precision': 0.6363636363636364, 'Recall': 0.25925925925925924, 'F1score': 0.368421052631579, 'Accuracy': 0.7647058823529411, 'TN': 1065, 'FP': 60, 'FN': 300, 'TP': 105, 'Algorithm': 'AdaBoostM2', 'T': 50, 'AUC': 0.769519890260631}


T= 100
--------------------------------------------------------------------------------------------------------------------------------


HBox(children=(FloatProgress(value=0.0), HTML(value='')))


{'Loss': 0.8430391210672405, 'Margin': 0.5004520203527987, 'Precision': 0.65625, 'Recall': 0.25925925925925924, 'F1score': 0.37168141592920345, 'Accuracy': 0.7679738562091504, 'TN': 1070, 'FP': 55, 'FN': 300, 'TP': 105, 'Algorithm': 'AdaBoostM2', 'T': 100, 'AUC': 0.7721810699588477}


## 3) RUSBoost

**RUSBoost** provides a simpler and faster alternative to *SMOTEBoost*. The way it works is so similar to *AdaBoost.M2*. The only difference is in each iteration (training a new hypothesis), Random Under Sampling (RUS) is applied to remove the majority class examples until $N\%$ of the temporary training data set $S_t'$ belongs to the minority class [[5]](https://ieeexplore.ieee.org/document/5299216).

![picture](https://drive.google.com/uc?id=1KD6uEISQzB5eoEuac1JfuRZhKizza_Yq)

### 3.1) Algorithm implementation

In [26]:
class RUSBoost:
  """
    Functions:
      run(self) -> Run the algorithm of RUSBoost
      RUS(self, D_t) -> Choose the samples using the Random Undersampling technique
      predict(self, input) -> Predict the labels for given samples using created ensemble
      pseudo_losses(self) -> The pseudo-loss for each hypothesis
  """


  # Initilizition
  def __init__(self, X, y, T, N):
    """
      X: Feature vectors
      y: Labels
      m: Number of samples
      T: Number of iterations/classifiers/hypothesis
      N: The N% of the new training set in RUS belongs to the minority class
      H: Ensemble of classifiers
      beta_list: A list containing the beta value for each classifier
      loss_list: A list containing the pseudo-loss of each classifier
    """
    self.X = X
    self.y = y
    self.m = len(X)
    self.T = T
    self.N = N
    self.H = []
    self.beta_list = []
    self.loss_list = []
    self.minority_label = 0
    self.majority_label = 0
    self.minority_size = 0
    self.majority_size = 0
    self.minority_idx = []
    self.majority_idx = []
    self.label_to_index = {}
    self.index_to_label = {}
    self.set_minority_majority()

  # Find which label is minority and which majority
  def set_minority_majority(self):
    labels = np.unique(self.y)
    count = []
    for label in labels:
      cnt = len((self.y==label).nonzero()[0])
      count.append(cnt)
    minority_idx = 0
    majority_idx = 0
    if count[0]>count[1]:
      minority_idx = 1
    else:
      majority_idx = 1
    self.majority_label = labels[majority_idx]
    self.majority_size = count[majority_idx]
    self.majority_idx = (self.y == labels[majority_idx]).nonzero()[0]
    self.minority_label = labels[minority_idx]
    self.minority_size = count[minority_idx]
    self.minority_idx = (self.y == labels[minority_idx]).nonzero()[0]

  # Choose the samples using RUS
  def RUS(self, D_t):
    needed_majority = round(((100-self.N) * self.minority_size)/self.N)
    # Make sure that the maximum possible number for needed data 
    # is the number of data in majority class 
    if needed_majority > self.majority_size:
      needed_majority = self.majority_size
    # Randomly choose the needed number unique data from majority class
    chosen_idx = np.random.choice(self.majority_idx, needed_majority, replace=False)
    # Make the data ready
    chosen_majority = self.X[chosen_idx]
    minority_data = self.X[self.minority_idx]
    # Concatenate the randomly chosen majority data with minority data
    data = np.concatenate((chosen_majority, minority_data), axis=0)
    # Make the weights ready
    chosen_weight = D_t[chosen_idx]
    minority_weight = D_t[self.minority_idx]
    # Concatenate the randomly chosen majority data with minority data
    weight = np.concatenate((chosen_weight, minority_weight), axis=0)
    # Make the label lists
    label_majority = [self.majority_label for i in chosen_majority]
    label_minority = [self.minority_label for i in minority_data]
    label = label_majority + label_minority
    label = np.array(label)
    # Shuffle the data
    data, label, weight = shuffle(data, label, weight)
    return data, label, weight
  

  # Running the RUSBoost algorithm
  def run(self):

    # Initilize distribution
    D_t = (1./self.m) * np.ones(self.m)

    for t in trange(self.T):

      # Choose the new (temporary) training set
      X_temp, y_temp, D_temp = self.RUS(D_t)

      # Train a new classifier on training data with distibution D_t
      weak_learn = DecisionTreeClassifier(max_depth=max_depth, max_features=max_features)
      weak_learn.fit(X_temp, y_temp, sample_weight=D_temp)
      
      # Get predictions and probabilities for the training data
      predictions = weak_learn.predict(self.X)
      probabilities = weak_learn.predict_proba(self.X)
      
      if t == 0:
        for index, value in enumerate(weak_learn.classes_):
          self.index_to_label[index] = value
          self.label_to_index[value] = index
      
      # Compute pseudo-loss
      pseudo_loss = 0

      # For each prediction
      for index, pr in enumerate(predictions):

        # Find the true label of the sample
        true_label = self.y[index]
        
        # Check if it's misclassified
        if pr != true_label:
          # Probability of assigning the predicted label
          pr_index = self.label_to_index[pr]
          pred_prob = probabilities[index][pr_index]
          # Probability of assigning the true label
          true_index = self.label_to_index[true_label]
          true_prob = probabilities[index][true_index]
          # Compute loss for the given sample
          current_loss = D_t[index] * (1 - true_prob + pred_prob)
          # Add it to pseudo_loss
          pseudo_loss += current_loss

      # Final step to compute pseudo-loss
      pseudo_loss = (1/2)*pseudo_loss

      # Compute beta
      beta = pseudo_loss / (1 - pseudo_loss)
      
      # Update distribution
      for index, value in enumerate(D_t):
        predicted_label = predictions[index]
        true_label = self.y[index]
        # Probability of assigning the predicted label
        pr_index = self.label_to_index[predicted_label]
        pred_prob = probabilities[index][pr_index]
        # Probability of assigning the true label
        true_index = self.label_to_index[true_label]
        true_prob = probabilities[index][true_index]
        power = (1/2) * (1 + true_prob - pred_prob)
        D_t[index] = (value) * pow(beta,power)
      
      # Normalization factor
      Z_t = np.sum(D_t)
      D_t = D_t / sum(D_t)
      
      # Append weak learner to the ensemble
      self.H.append(weak_learn)
      # Append computed loss of the weak learner
      self.loss_list.append(pseudo_loss)
      # Append computed beta of the weak learner
      self.beta_list.append(beta)
  
      # print('t =', t, '--> pseudo-loss =', pseudo_loss)
    
    # Convert beta and loss lists to numpy array
    self.beta_list = np.array(self.beta_list)
    self.loss_list = np.array(self.loss_list)


  def predict(self, input):
    # Sum of the probabilities made by each hypothesis for a given sample and label
    sum_of_h = None
    # Compute the multiply factors
    alpha_list = np.log(1/self.beta_list)
    # For each hypothesis
    for index, h in enumerate(self.H):
      # Predict probabilities
      probabilities = h.predict_proba(input)
      # Multiply probabilities by the hypothesis computed alpha
      weighted_prob = alpha_list[index] * probabilities
      # If it's the first hypothesis
      if index == 0:
        sum_of_h = weighted_prob
      else:
        sum_of_h += weighted_prob
    # Choose the label using argmax
    predicted_idx = np.argmax(sum_of_h, axis=1)
    predicted_labels = []
    for pr in predicted_idx:
      predicted_labels.append(self.index_to_label[pr])
    predicted_labels = np.array(predicted_labels)
    #print(predicted_labels)
    # Remember the h_finite value
    h_fin = [sum_of_h[index][label]*predicted_labels[index] for index, label in enumerate(predicted_idx)]
    #print(h_fin)
    alpha_norm = np.linalg.norm(alpha_list, ord=1)
    h_fin = h_fin/alpha_norm
    return predicted_labels, h_fin

  # Print the computed pseudo-loss of each classifier
  def pseudo_losses(self):
    return self.loss_list


### 3.2) Run on 5-fold

In [27]:
# Remove last results
five_fold_results = five_fold_results[five_fold_results['Algorithm'] != 'RUSBoost']

for t in T:

  # Print the pseudo-loss of each classifier
  loss_plotter = LossPlotter('Loss of each classifier for T='+str(t), 't', 'pseudo-loss')
  roc_plotter = RocPlotter('ROC curve for T='+str(t))
  # Fold counter
  fold_count = 1
  # A list containing all the fold results
  all_fold_results = []
  print('T=',str(t))
  # Print a horizontal line
  print ('-' * 128)

  # For each fold
  for train_index, test_index in five_fold.split(X):
    # Split the train and test set
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # Create an RUSBoost object
    rusboost = RUSBoost(X_train, y_train, t, 50)
    # Run the algorithm
    rusboost.run()
    # Add scatter of this pseudo-loss to the loss plotter
    pseudo_losses = rusboost.pseudo_losses()
    loss_plotter.add_scatter(pseudo_losses, 'fold-'+str(fold_count))
    # Prediction on the test data
    predictions, hfin_values = rusboost.predict(X_test)
    model_margin = abs(hfin_values[0])
    for value in hfin_values:
      current_margin = abs(value)
      if current_margin < model_margin:
        model_margin = current_margin
    # Plot ROC curve
    auc_score = roc_plotter.add_scatter(y_test, hfin_values, 'Fold-'+str(fold_count))
    # Compute the loss on the test data
    loss = ensemble_loss_binary(y_test, predictions, hfin_values)
    results = {'Loss': loss, 'Margin': model_margin}
    results['AUC'] = auc_score
    results.update(get_metrics(y_test, predictions))
    all_fold_results.append(results)
    print('Fold-'+str(fold_count), results, '\n')
    fold_count += 1
  
  # Compute the average
  df_results = pd.DataFrame(all_fold_results)
  avg_results = dict(df_results.mean())
  # Print a horizontal line
  print ('-' * 128)

  # Remember the results
  avg_results['Algorithm'] = 'RUSBoost'
  avg_results['T'] = t
  five_fold_results = five_fold_results.append(avg_results, ignore_index=True)
  print('Average', avg_results)

  # Show the loss plot
  loss_plotter.show()
  roc_plotter.show()

T= 10
--------------------------------------------------------------------------------------------------------------------------------


HBox(children=(FloatProgress(value=0.0, max=10.0), HTML(value='')))


Fold-1 {'Loss': 0.9147466853476287, 'Margin': 0.5052820731184814, 'AUC': 0.6922248057619236, 'Precision': 0.4270833333333333, 'Recall': 0.5324675324675324, 'F1score': 0.47398843930635837, 'Accuracy': 0.7026143790849673, 'TN': 174, 'FP': 55, 'FN': 36, 'TP': 41} 



HBox(children=(FloatProgress(value=0.0, max=10.0), HTML(value='')))


Fold-2 {'Loss': 0.8772763746393035, 'Margin': 0.5016424413694379, 'AUC': 0.7617731789052855, 'Precision': 0.46153846153846156, 'Recall': 0.5753424657534246, 'F1score': 0.5121951219512195, 'Accuracy': 0.738562091503268, 'TN': 184, 'FP': 49, 'FN': 31, 'TP': 42} 



HBox(children=(FloatProgress(value=0.0, max=10.0), HTML(value='')))


Fold-3 {'Loss': 0.8648414735731987, 'Margin': 0.500195935454499, 'AUC': 0.7719907407407408, 'Precision': 0.47572815533980584, 'Recall': 0.6805555555555556, 'F1score': 0.56, 'Accuracy': 0.7483660130718954, 'TN': 180, 'FP': 54, 'FN': 23, 'TP': 49} 



HBox(children=(FloatProgress(value=0.0, max=10.0), HTML(value='')))


Fold-4 {'Loss': 0.9142907404435362, 'Margin': 0.5005928428310303, 'AUC': 0.7179361052141044, 'Precision': 0.49504950495049505, 'Recall': 0.5617977528089888, 'F1score': 0.5263157894736842, 'Accuracy': 0.7058823529411765, 'TN': 166, 'FP': 51, 'FN': 39, 'TP': 50} 



HBox(children=(FloatProgress(value=0.0, max=10.0), HTML(value='')))


Fold-5 {'Loss': 0.8703393422532005, 'Margin': 0.5059479983131102, 'AUC': 0.7531613809714973, 'Precision': 0.6, 'Recall': 0.5106382978723404, 'F1score': 0.5517241379310344, 'Accuracy': 0.7450980392156863, 'TN': 180, 'FP': 32, 'FN': 46, 'TP': 48} 

--------------------------------------------------------------------------------------------------------------------------------
Average {'Loss': 0.8882989232513735, 'Margin': 0.5027322582173118, 'AUC': 0.7394172423187103, 'Precision': 0.4918798910324192, 'Recall': 0.5721603208915683, 'F1score': 0.5248446977324592, 'Accuracy': 0.7281045751633988, 'TN': 176.8, 'FP': 48.2, 'FN': 35.0, 'TP': 46.0, 'Algorithm': 'RUSBoost', 'T': 10}


T= 50
--------------------------------------------------------------------------------------------------------------------------------


HBox(children=(FloatProgress(value=0.0, max=50.0), HTML(value='')))


Fold-1 {'Loss': 0.8941946980332097, 'Margin': 0.5004052517149855, 'AUC': 0.7149662564509726, 'Precision': 0.45652173913043476, 'Recall': 0.5454545454545454, 'F1score': 0.4970414201183432, 'Accuracy': 0.7222222222222222, 'TN': 179, 'FP': 50, 'FN': 35, 'TP': 42} 



HBox(children=(FloatProgress(value=0.0, max=50.0), HTML(value='')))


Fold-2 {'Loss': 0.8360030333075035, 'Margin': 0.5004445588578692, 'AUC': 0.7730613204773943, 'Precision': 0.5287356321839081, 'Recall': 0.6301369863013698, 'F1score': 0.5750000000000001, 'Accuracy': 0.7777777777777778, 'TN': 192, 'FP': 41, 'FN': 27, 'TP': 46} 



HBox(children=(FloatProgress(value=0.0, max=50.0), HTML(value='')))


Fold-3 {'Loss': 0.8791126870964705, 'Margin': 0.5002512128768019, 'AUC': 0.7888176638176638, 'Precision': 0.4594594594594595, 'Recall': 0.7083333333333334, 'F1score': 0.5573770491803279, 'Accuracy': 0.7352941176470589, 'TN': 174, 'FP': 60, 'FN': 21, 'TP': 51} 



HBox(children=(FloatProgress(value=0.0, max=50.0), HTML(value='')))


Fold-4 {'Loss': 0.9051147481541993, 'Margin': 0.5004565891401971, 'AUC': 0.7452493139336198, 'Precision': 0.504424778761062, 'Recall': 0.6404494382022472, 'F1score': 0.5643564356435643, 'Accuracy': 0.7124183006535948, 'TN': 161, 'FP': 56, 'FN': 32, 'TP': 57} 



HBox(children=(FloatProgress(value=0.0, max=50.0), HTML(value='')))


Fold-5 {'Loss': 0.9076827807634228, 'Margin': 0.5001495058504113, 'AUC': 0.7552940586109995, 'Precision': 0.5213675213675214, 'Recall': 0.648936170212766, 'F1score': 0.5781990521327014, 'Accuracy': 0.7091503267973857, 'TN': 156, 'FP': 56, 'FN': 33, 'TP': 61} 

--------------------------------------------------------------------------------------------------------------------------------
Average {'Loss': 0.8844215894709612, 'Margin': 0.500341423688053, 'AUC': 0.7554777226581301, 'Precision': 0.49410182618047715, 'Recall': 0.6346620947008523, 'F1score': 0.5543947914149874, 'Accuracy': 0.7313725490196079, 'TN': 172.4, 'FP': 52.6, 'FN': 29.6, 'TP': 51.4, 'Algorithm': 'RUSBoost', 'T': 50}


T= 100
--------------------------------------------------------------------------------------------------------------------------------


HBox(children=(FloatProgress(value=0.0), HTML(value='')))


Fold-1 {'Loss': 0.8662021090335115, 'Margin': 0.5000753734031289, 'AUC': 0.7600805308228888, 'Precision': 0.5, 'Recall': 0.6233766233766234, 'F1score': 0.5549132947976878, 'Accuracy': 0.7483660130718954, 'TN': 181, 'FP': 48, 'FN': 29, 'TP': 48} 



HBox(children=(FloatProgress(value=0.0), HTML(value='')))


Fold-2 {'Loss': 0.8392572372994735, 'Margin': 0.5002404673940212, 'AUC': 0.7774119583749779, 'Precision': 0.5256410256410257, 'Recall': 0.5616438356164384, 'F1score': 0.543046357615894, 'Accuracy': 0.7745098039215687, 'TN': 196, 'FP': 37, 'FN': 32, 'TP': 41} 



HBox(children=(FloatProgress(value=0.0), HTML(value='')))


Fold-3 {'Loss': 0.876684224173863, 'Margin': 0.5000989541205821, 'AUC': 0.8038046058879392, 'Precision': 0.46551724137931033, 'Recall': 0.75, 'F1score': 0.574468085106383, 'Accuracy': 0.738562091503268, 'TN': 172, 'FP': 62, 'FN': 18, 'TP': 54} 



HBox(children=(FloatProgress(value=0.0), HTML(value='')))


Fold-4 {'Loss': 0.93532475157558, 'Margin': 0.5000486431745241, 'AUC': 0.7645109511727852, 'Precision': 0.4636363636363636, 'Recall': 0.5730337078651685, 'F1score': 0.5125628140703518, 'Accuracy': 0.6830065359477124, 'TN': 158, 'FP': 59, 'FN': 38, 'TP': 51} 



HBox(children=(FloatProgress(value=0.0), HTML(value='')))


Fold-5 {'Loss': 0.8872372889252076, 'Margin': 0.5000510683717561, 'AUC': 0.7605881172219991, 'Precision': 0.5447154471544715, 'Recall': 0.7127659574468085, 'F1score': 0.6175115207373272, 'Accuracy': 0.7287581699346405, 'TN': 156, 'FP': 56, 'FN': 27, 'TP': 67} 

--------------------------------------------------------------------------------------------------------------------------------
Average {'Loss': 0.8809411222015271, 'Margin': 0.5001029012928025, 'AUC': 0.7732792326961181, 'Precision': 0.4999020155622341, 'Recall': 0.6441640248610078, 'F1score': 0.5605004144655288, 'Accuracy': 0.734640522875817, 'TN': 172.6, 'FP': 52.4, 'FN': 28.8, 'TP': 52.2, 'Algorithm': 'RUSBoost', 'T': 100}


### 3.3) Run on whole training data

In [28]:
# Remove last results
train_results = train_results[train_results['Algorithm'] != 'RUSBoost']

for t in T:

  # Print the pseudo-loss of each classifier
  loss_plotter = LossPlotter('Loss of each classifier for T='+str(t), 't', 'pseudo-loss')
  roc_plotter = RocPlotter('ROC curve for T='+str(t))
  print('T=',str(t))
  # Print a horizontal line
  print ('-' * 128)

  # Create an RUSBoost object
  rusboost = RUSBoost(X, y, t, 50)
  # Run the algorithm
  rusboost.run()
  # Add scatter of this pseudo-loss to the loss plotter
  pseudo_losses = rusboost.pseudo_losses()
  loss_plotter.add_scatter(pseudo_losses, 'train data')
  # Prediction on the test data
  predictions, hfin_values = rusboost.predict(X)
  model_margin = abs(hfin_values[0])
  for value in hfin_values:
    current_margin = abs(value)
    if current_margin < model_margin:
      model_margin = current_margin
  # Plot ROC curve
  auc_score = roc_plotter.add_scatter(y, hfin_values, 'train data')
  # Compute the loss on the test data
  loss = ensemble_loss_binary(y, predictions, hfin_values)
  # Compute metrics
  results = {'Loss': loss, 'Margin': model_margin}
  results.update(get_metrics(y, predictions))

  # Remember the results
  results['Algorithm'] = 'RUSBoost'
  results['T'] = t
  results['AUC'] = auc_score
  train_results = train_results.append(results, ignore_index=True)
  print(results)

  # Show the loss plot
  loss_plotter.show()
  roc_plotter.show()
  plot_histogram(y, hfin_values, 'The histogram of scores compared to true labels for T='+str(t))

T= 10
--------------------------------------------------------------------------------------------------------------------------------


HBox(children=(FloatProgress(value=0.0, max=10.0), HTML(value='')))


{'Loss': 0.9076226137301749, 'Margin': 0.5071202960416664, 'Precision': 0.4666666666666667, 'Recall': 0.6049382716049383, 'F1score': 0.5268817204301075, 'Accuracy': 0.7124183006535948, 'TN': 845, 'FP': 280, 'FN': 160, 'TP': 245, 'Algorithm': 'RUSBoost', 'T': 10, 'AUC': 0.7482578875171468}


T= 50
--------------------------------------------------------------------------------------------------------------------------------


HBox(children=(FloatProgress(value=0.0, max=50.0), HTML(value='')))


{'Loss': 0.8492775356977729, 'Margin': 0.5004176727092048, 'Precision': 0.5454545454545454, 'Recall': 0.6666666666666666, 'F1score': 0.6, 'Accuracy': 0.7647058823529411, 'TN': 900, 'FP': 225, 'FN': 135, 'TP': 270, 'Algorithm': 'RUSBoost', 'T': 50, 'AUC': 0.77440329218107}


T= 100
--------------------------------------------------------------------------------------------------------------------------------


HBox(children=(FloatProgress(value=0.0), HTML(value='')))


{'Loss': 0.859390827342643, 'Margin': 0.5001070428020202, 'Precision': 0.5319148936170213, 'Recall': 0.6172839506172839, 'F1score': 0.5714285714285713, 'Accuracy': 0.7549019607843137, 'TN': 905, 'FP': 220, 'FN': 155, 'TP': 250, 'Algorithm': 'RUSBoost', 'T': 100, 'AUC': 0.7920987654320988}


## 4) SMOTEBoost

SMOTEBoost goal is to utilize SMOTE [[6]](https://arxiv.org/pdf/1106.1813.pdf) for improving the prediction of the minority classes, and to utilize boosting to not sacrifice accuracy over the entire data set [[7]](https://link.springer.com/chapter/10.1007/978-3-540-39804-2_12). The new synthetic minority samples are created as follows :
*  Take the difference between a feature vector (minority class sample) and one of its k nearest neighbors (minority class samples).
*  Multiply this difference by a random number between 0 and 1.
*  Add this difference to the feature value of the original feature vector, thus creating a new feature vector.

### 4.1) K-Nearest Neighbors

The following code is a pure implementation of finding k-nearest neighbors. As expected it's not time-efficient, so it's better to use a better implementation from a well-known library like sklearn [[8]](https://scikit-learn.org/stable/modules/neighbors.html).

In [29]:
def KNN(samples, chosen, k):
  distances = []
  for index, sample in enumerate(samples):
    # np.linalg.norm computes the euclidean distance
    dist = np.linalg.norm(chosen - sample)
    distances.append((index, dist))
    distances.sort(key=lambda tup: tup[1])
  # A list containing the k neares neighbors
  k_neighbors = []
  index = 0
  created_num = 0
  while (created_num != k):
    sample_index = distances[index][0]
    neighbor = samples[sample_index]
    # The chosen sample itself (distance=0) should not be considered
    if (distances[index][1] != 0):
      k_neighbors.append(neighbor)
      created_num += 1
    index += 1
  return k_neighbors

### 4.2) SMOTE

The **SMOTE** function creates synthetic examples based on given samples.   
Consider that $\alpha = \frac{amount}{100}$ the function looks at each given sample and their $k$ neighbors, then creates $int(\alpha)$ synthetic examples based on each of them. Also for the floating part of the $\alpha$, it randomly chooses samples and creates one synthetic example per each of them.

In [30]:
def SMOTE(given_samples, amount, k):
  # The number of samples
  samples_num = len(given_samples)
  # The total number of samples need to be created
  needed_num = (amount/100)*samples_num
  # The number of synthetic samples need to be created from each sample
  full_num = int(needed_num/samples_num)
  # The number of samples need to be created from random samples
  remain_num = round(needed_num - (full_num*samples_num))
  # A list containing the new synthetic samples
  synthetic_samples = []
  # The defulat metric is euclidean distance
  nbrs = NearestNeighbors(n_neighbors=k).fit(given_samples)

  for sample in given_samples:
    # Find k nearest neighbors
    #k_neighbors = KNN(given_samples, sample, k)
    distances, indices = nbrs.kneighbors(sample.reshape(1, -1))
    non_zero_indices = []
    for i, d in enumerate(distances[0]):
      if d != 0:
        non_zero_indices.append(indices[0][i])
    # Choose 'full_num' random samples from this neighbors
    chosen_idx = np.random.choice(non_zero_indices, full_num, replace=False)
    for idx in chosen_idx:
      chosen = given_samples[idx]
      difference = abs(sample - chosen)
      rand_num = np.random.random(1)
      created_sample = sample + (rand_num*difference)
      synthetic_samples.append(created_sample)

  # For the ramaining number, we choose randomly from given_samples
  # and create just one synthetic sample from each of them
  chosen_idx = np.random.choice(len(given_samples), remain_num, replace=False)
  for idx in chosen_idx:
    non_zero_indices = []
    for i, d in enumerate(distances[0]):
      if d != 0:
        non_zero_indices.append(indices[0][i])
    chosen_neighbor_idx = np.random.choice(non_zero_indices, 1)
    chosen_neighbor = given_samples[chosen_neighbor_idx[0]]
    difference = given_samples[idx] - chosen_neighbor
    rand_num = np.random.random(1)
    created_sample = given_samples[idx] + (rand_num*difference)
    synthetic_samples.append(created_sample)
  
  # Shuffle the data
  return np.array(synthetic_samples)

### 4.3) Algorithm implementation

The procedure of **SMOTEBoost** algorithm consists of introducing SMOTE in each round of boosting.   
Therefore each learner will be able to learn from more of the minority class cases, thus learning broader decision regions for the minority class.

<font color='red'>**Note:**</font> The synthetically created minority class cases are discarded after learning a classifier at each iteration. That is, they are not added to the original data set.

![picture](https://drive.google.com/uc?id=1_n6gZPKL2MaD9CvPBeJctFkaRVUuI3gK)

In [31]:
class SMOTEBoost:
  """
    Functions:
      run(self) -> Run the algorithm of SMOTEBoost
      predict(self, input) -> Predict the labels for given samples using created ensemble
      pseudo_losses(self) -> The pseudo-loss for each hypothesis
  """


  # Initilizition
  def __init__(self, X, y, T, N, k):
    """
      X: Feature vectors
      y: Labels
      m: Number of samples
      T: Number of iterations/classifiers/hypothesis
      N: The SMOTE amount (percentage)
      k: The value of k in KNN when introducing SMOTE
      H: Ensemble of classifiers
      beta_list: A list containing the beta value for each classifier
      loss_list: A list containing the pseudo-loss of each classifier
    """
    self.X = X
    self.y = y
    self.m = len(X)
    self.T = T
    self.N = N
    self.k = k
    self.H = []
    self.beta_list = []
    self.loss_list = []
    self.minority_label = 0
    self.majority_label = 0
    self.minority_size = 0
    self.majority_size = 0
    self.minority_data = []
    self.majority_data = []
    self.index_to_label = {}
    self.label_to_index = {}
    self.set_minority_majority()

  # Find which label is minority and which majority
  def set_minority_majority(self):
    labels = np.unique(self.y)
    count = []
    for label in labels:
      cnt = len((self.y==label).nonzero()[0])
      count.append(cnt)
    minority_idx = 0
    majority_idx = 0
    if count[0]>count[1]:
      minority_idx = 1
    else:
      majority_idx = 1
    self.majority_label = labels[majority_idx]
    self.majority_size = count[majority_idx]
    self.majority_data = self.X[(self.y == labels[majority_idx]).nonzero()[0]]
    self.minority_label = labels[minority_idx]
    self.minority_size = count[minority_idx]
    self.minority_data = self.X[(self.y == labels[minority_idx]).nonzero()[0]]

  # Running the SMOTEBoost algorithm
  def run(self):

    # Initilize distribution
    D_t = (1./self.m) * np.ones(self.m)

    for t in trange(self.T):
      X_synthetic = SMOTE(self.minority_data, self.N, self.k)
      y_synthetic = np.full(X_synthetic.shape[0], dtype=np.int64,
                           fill_value=self.minority_label)
      D_synthetic = np.full(X_synthetic.shape[0], fill_value=(1./self.m))
      X_new = np.vstack((self.X, X_synthetic))
      y_new = np.concatenate((self.y, y_synthetic))
      D_new = np.concatenate((D_t, D_synthetic))
      

      # Train a new classifier on training data with distibution D_t
      weak_learn = DecisionTreeClassifier(max_depth=max_depth, max_features=max_features)
      weak_learn.fit(X_new, y_new, sample_weight=D_new)
      
      # Get predictions and probabilities for the training data
      predictions = weak_learn.predict(self.X)
      probabilities = weak_learn.predict_proba(self.X)
      
      if t == 0:
        for index, value in enumerate(weak_learn.classes_):
          self.index_to_label[index] = value
          self.label_to_index[value] = index
      
      # Compute pseudo-loss
      pseudo_loss = 0

      # For each prediction
      for index, pr in enumerate(predictions):

        # Find the true label of the sample
        true_label = self.y[index]
        
        # Check if it's misclassified
        if pr != true_label:
          # Probability of assigning the predicted label
          pr_index = self.label_to_index[pr]
          pred_prob = probabilities[index][pr_index]
          # Probability of assigning the true label
          true_index = self.label_to_index[true_label]
          true_prob = probabilities[index][true_index]
          # Compute loss for the given sample
          current_loss = D_t[index] * (1 - true_prob + pred_prob)
          # Add it to pseudo_loss
          pseudo_loss += current_loss

      # Final step to compute pseudo-loss
      pseudo_loss = (1/2)*pseudo_loss

      # Compute beta
      beta = pseudo_loss / (1 - pseudo_loss)
      
      # Update distribution
      for index, value in enumerate(D_t):
        predicted_label = predictions[index]
        true_label = self.y[index]
        # Probability of assigning the predicted label
        pr_index = self.label_to_index[predicted_label]
        pred_prob = probabilities[index][pr_index]
        # Probability of assigning the true label
        true_index = self.label_to_index[true_label]
        true_prob = probabilities[index][true_index]
        power = (1/2) * (1 + true_prob - pred_prob)
        D_t[index] = (value) * pow(beta,power)
      
      # Normalization factor
      Z_t = np.sum(D_t)
      D_t = D_t / Z_t
      
      # Append weak learner to the ensemble
      self.H.append(weak_learn)
      # Append computed loss of the weak learner
      self.loss_list.append(pseudo_loss)
      # Append computed beta of the weak learner
      self.beta_list.append(beta)
  
      # print('t =', t, '--> pseudo-loss =', pseudo_loss)
    
    # Convert beta and loss lists to numpy array
    self.beta_list = np.array(self.beta_list)
    self.loss_list = np.array(self.loss_list)


  def predict(self, input):
    # Sum of the probabilities made by each hypothesis for a given sample and label
    sum_of_h = None
    # Compute the multiply factors
    alpha_list = np.log(1/self.beta_list)
    # For each hypothesis
    for index, h in enumerate(self.H):
      # Predict probabilities
      probabilities = h.predict_proba(input)
      # Multiply probabilities by the hypothesis computed alpha
      weighted_prob = alpha_list[index] * probabilities
      # If it's the first hypothesis
      if index == 0:
        sum_of_h = weighted_prob
      else:
        sum_of_h += weighted_prob
    # Choose the label using argmax
    predicted_idx = np.argmax(sum_of_h, axis=1)
    predicted_labels = []
    for pr in predicted_idx:
      predicted_labels.append(self.index_to_label[pr])
    predicted_labels = np.array(predicted_labels)
    #print(predicted_labels)
    # Remember the h_finite value
    h_fin = [sum_of_h[index][label]*predicted_labels[index] for index, label in enumerate(predicted_idx)]
    #print(h_fin)
    alpha_norm = np.linalg.norm(alpha_list, ord=1)
    h_fin = h_fin/alpha_norm
    return predicted_labels, h_fin

  # Print the computed pseudo-loss of each classifier
  def pseudo_losses(self):
    return self.loss_list


### 4.4) Run on 5-fold

In [32]:
# Remove last results
five_fold_results = five_fold_results[five_fold_results['Algorithm'] != 'SMOTEBoost']

for t in T:

  # Print the pseudo-loss of each classifier
  loss_plotter = LossPlotter('Loss of each classifier for T='+str(t), 't', 'pseudo-loss')
  roc_plotter = RocPlotter('ROC curve for T='+str(t))
  # Fold counter
  fold_count = 1
  # A list containing all the fold results
  all_fold_results = []
  print('T=',str(t))
  # Print a horizontal line
  print ('-' * 128)

  # For each fold
  for train_index, test_index in five_fold.split(X):
    # Split the train and test set
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # Create an SMOTEBoost object
    smoteboost = SMOTEBoost(X_train, y_train, t, smote_ratio, k_in_knn)
    # Run the algorithm
    smoteboost.run()
    # Add scatter of this pseudo-loss to the loss plotter
    pseudo_losses = smoteboost.pseudo_losses()
    loss_plotter.add_scatter(pseudo_losses, 'fold-'+str(fold_count))
    # Prediction on the test data
    predictions, hfin_values = smoteboost.predict(X_test)
    model_margin = abs(hfin_values[0])
    for value in hfin_values:
      current_margin = abs(value)
      if current_margin < model_margin:
        model_margin = current_margin
    # Plot ROC curve
    auc_score = roc_plotter.add_scatter(y_test, hfin_values, 'Fold-'+str(fold_count))
    # Compute the loss on the test data
    loss = ensemble_loss_binary(y_test, predictions, hfin_values)
    results = {'Loss': loss, 'Margin': model_margin}
    results['AUC'] = auc_score
    results.update(get_metrics(y_test, predictions))
    all_fold_results.append(results)
    print('Fold-'+str(fold_count), results, '\n')
    fold_count += 1
  
  # Compute the average
  df_results = pd.DataFrame(all_fold_results)
  avg_results = dict(df_results.mean())
  # Print a horizontal line
  print ('-' * 128)

  # Remember the results
  avg_results['Algorithm'] = 'SMOTEBoost'
  avg_results['T'] = t
  five_fold_results = five_fold_results.append(avg_results, ignore_index=True)
  print('Average', avg_results)

  # Show the loss plot
  loss_plotter.show()
  roc_plotter.show()

T= 10
--------------------------------------------------------------------------------------------------------------------------------


HBox(children=(FloatProgress(value=0.0, max=10.0), HTML(value='')))


Fold-1 {'Loss': 0.8428718559979648, 'Margin': 0.5006917884975196, 'AUC': 0.681307775194238, 'Precision': 0.5333333333333333, 'Recall': 0.5194805194805194, 'F1score': 0.5263157894736841, 'Accuracy': 0.7647058823529411, 'TN': 194, 'FP': 35, 'FN': 37, 'TP': 40} 



HBox(children=(FloatProgress(value=0.0, max=10.0), HTML(value='')))


Fold-2 {'Loss': 0.8607411193959187, 'Margin': 0.5011254609536345, 'AUC': 0.6208771826679993, 'Precision': 0.45652173913043476, 'Recall': 0.2876712328767123, 'F1score': 0.3529411764705882, 'Accuracy': 0.7483660130718954, 'TN': 208, 'FP': 25, 'FN': 52, 'TP': 21} 



HBox(children=(FloatProgress(value=0.0, max=10.0), HTML(value='')))


Fold-3 {'Loss': 0.7924115418032822, 'Margin': 0.5038530065221745, 'AUC': 0.7541547958214625, 'Precision': 0.5970149253731343, 'Recall': 0.5555555555555556, 'F1score': 0.5755395683453237, 'Accuracy': 0.8071895424836601, 'TN': 207, 'FP': 27, 'FN': 32, 'TP': 40} 



HBox(children=(FloatProgress(value=0.0, max=10.0), HTML(value='')))


Fold-4 {'Loss': 0.8823276026594008, 'Margin': 0.5004372717507546, 'AUC': 0.7010821726298349, 'Precision': 0.5405405405405406, 'Recall': 0.449438202247191, 'F1score': 0.49079754601226994, 'Accuracy': 0.7287581699346405, 'TN': 183, 'FP': 34, 'FN': 49, 'TP': 40} 



HBox(children=(FloatProgress(value=0.0, max=10.0), HTML(value='')))


Fold-5 {'Loss': 0.8426668488678342, 'Margin': 0.5017115290010411, 'AUC': 0.7528101164191087, 'Precision': 0.6527777777777778, 'Recall': 0.5, 'F1score': 0.5662650602409639, 'Accuracy': 0.7647058823529411, 'TN': 187, 'FP': 25, 'FN': 47, 'TP': 47} 

--------------------------------------------------------------------------------------------------------------------------------
Average {'Loss': 0.8442037937448802, 'Margin': 0.5015638113450248, 'AUC': 0.7020464085465287, 'Precision': 0.5560376632310441, 'Recall': 0.46242910203199566, 'F1score': 0.5023718281085661, 'Accuracy': 0.7627450980392156, 'TN': 195.8, 'FP': 29.2, 'FN': 43.4, 'TP': 37.6, 'Algorithm': 'SMOTEBoost', 'T': 10}


T= 50
--------------------------------------------------------------------------------------------------------------------------------


HBox(children=(FloatProgress(value=0.0, max=50.0), HTML(value='')))


Fold-1 {'Loss': 0.8435606196432089, 'Margin': 0.5006152499197444, 'AUC': 0.7124142233312539, 'Precision': 0.5428571428571428, 'Recall': 0.4935064935064935, 'F1score': 0.5170068027210883, 'Accuracy': 0.7679738562091504, 'TN': 197, 'FP': 32, 'FN': 39, 'TP': 38} 



HBox(children=(FloatProgress(value=0.0, max=50.0), HTML(value='')))


Fold-2 {'Loss': 0.8688525861297415, 'Margin': 0.5003232871877482, 'AUC': 0.7201775530601446, 'Precision': 0.45614035087719296, 'Recall': 0.3561643835616438, 'F1score': 0.4, 'Accuracy': 0.7450980392156863, 'TN': 202, 'FP': 31, 'FN': 47, 'TP': 26} 



HBox(children=(FloatProgress(value=0.0, max=50.0), HTML(value='')))


Fold-3 {'Loss': 0.8083201583144594, 'Margin': 0.5007352368423162, 'AUC': 0.7892331433998101, 'Precision': 0.5662650602409639, 'Recall': 0.6527777777777778, 'F1score': 0.6064516129032258, 'Accuracy': 0.8006535947712419, 'TN': 198, 'FP': 36, 'FN': 25, 'TP': 47} 



HBox(children=(FloatProgress(value=0.0, max=50.0), HTML(value='')))


Fold-4 {'Loss': 0.8722835082630842, 'Margin': 0.5005827588559082, 'AUC': 0.7318127686014602, 'Precision': 0.5714285714285714, 'Recall': 0.449438202247191, 'F1score': 0.5031446540880503, 'Accuracy': 0.7418300653594772, 'TN': 187, 'FP': 30, 'FN': 49, 'TP': 40} 



HBox(children=(FloatProgress(value=0.0, max=50.0), HTML(value='')))


Fold-5 {'Loss': 0.8643246965162497, 'Margin': 0.5007413259765428, 'AUC': 0.7599859494179044, 'Precision': 0.6, 'Recall': 0.5425531914893617, 'F1score': 0.5698324022346368, 'Accuracy': 0.7483660130718954, 'TN': 178, 'FP': 34, 'FN': 43, 'TP': 51} 

--------------------------------------------------------------------------------------------------------------------------------
Average {'Loss': 0.8514683137733489, 'Margin': 0.500599571756452, 'AUC': 0.7427247275621147, 'Precision': 0.5473382250807742, 'Recall': 0.49888800971649355, 'F1score': 0.5192870943894002, 'Accuracy': 0.7607843137254902, 'TN': 192.4, 'FP': 32.6, 'FN': 40.6, 'TP': 40.4, 'Algorithm': 'SMOTEBoost', 'T': 50}


T= 100
--------------------------------------------------------------------------------------------------------------------------------


HBox(children=(FloatProgress(value=0.0), HTML(value='')))


Fold-1 {'Loss': 0.8455789888488136, 'Margin': 0.5000435512342772, 'AUC': 0.7292009300742925, 'Precision': 0.5454545454545454, 'Recall': 0.4675324675324675, 'F1score': 0.5034965034965035, 'Accuracy': 0.7679738562091504, 'TN': 199, 'FP': 30, 'FN': 41, 'TP': 36} 



HBox(children=(FloatProgress(value=0.0), HTML(value='')))


Fold-2 {'Loss': 0.8729307826802334, 'Margin': 0.5000889597044316, 'AUC': 0.7284672820271622, 'Precision': 0.45714285714285713, 'Recall': 0.4383561643835616, 'F1score': 0.44755244755244755, 'Accuracy': 0.7418300653594772, 'TN': 195, 'FP': 38, 'FN': 41, 'TP': 32} 



HBox(children=(FloatProgress(value=0.0), HTML(value='')))


Fold-3 {'Loss': 0.8202679045516166, 'Margin': 0.5001017722269397, 'AUC': 0.7882537986704653, 'Precision': 0.5465116279069767, 'Recall': 0.6527777777777778, 'F1score': 0.5949367088607594, 'Accuracy': 0.7908496732026143, 'TN': 195, 'FP': 39, 'FN': 25, 'TP': 47} 



HBox(children=(FloatProgress(value=0.0), HTML(value='')))


Fold-4 {'Loss': 0.8632130939798738, 'Margin': 0.5006327723478855, 'AUC': 0.739579557810801, 'Precision': 0.5822784810126582, 'Recall': 0.5168539325842697, 'F1score': 0.5476190476190478, 'Accuracy': 0.7516339869281046, 'TN': 184, 'FP': 33, 'FN': 43, 'TP': 46} 



HBox(children=(FloatProgress(value=0.0), HTML(value='')))


Fold-5 {'Loss': 0.8652658970328498, 'Margin': 0.5004190077621765, 'AUC': 0.7597099558410277, 'Precision': 0.6, 'Recall': 0.5425531914893617, 'F1score': 0.5698324022346368, 'Accuracy': 0.7483660130718954, 'TN': 178, 'FP': 34, 'FN': 43, 'TP': 51} 

--------------------------------------------------------------------------------------------------------------------------------
Average {'Loss': 0.8534513334186775, 'Margin': 0.5002572126551421, 'AUC': 0.7490423048847498, 'Precision': 0.5462775023034074, 'Recall': 0.5236147067534876, 'F1score': 0.5326874219526789, 'Accuracy': 0.7601307189542483, 'TN': 190.2, 'FP': 34.8, 'FN': 38.6, 'TP': 42.4, 'Algorithm': 'SMOTEBoost', 'T': 100}


### 4.5) Run on whole training data

In [33]:
# Remove last results
train_results = train_results[train_results['Algorithm'] != 'SMOTEBoost']

for t in T:

  # Print the pseudo-loss of each classifier
  loss_plotter = LossPlotter('Loss of each classifier for T='+str(t), 't', 'pseudo-loss')
  roc_plotter = RocPlotter('ROC curve for T='+str(t))
  print('T=',str(t))
  # Print a horizontal line
  print ('-' * 128)

  # Create an SMOTEBoost object
  smoteboost = SMOTEBoost(X, y, t, smote_ratio, k_in_knn)
  # Run the algorithm
  smoteboost.run()
  # Add scatter of this pseudo-loss to the loss plotter
  pseudo_losses = smoteboost.pseudo_losses()
  loss_plotter.add_scatter(pseudo_losses, 'train data')
  # Prediction on the test data
  predictions, hfin_values = smoteboost.predict(X)
  model_margin = abs(hfin_values[0])
  for value in hfin_values:
    current_margin = abs(value)
    if current_margin < model_margin:
      model_margin = current_margin
  # Plot ROC curve
  auc_score = roc_plotter.add_scatter(y, hfin_values, 'train data')
  # Compute the loss on the test data
  loss = ensemble_loss_binary(y, predictions, hfin_values)
  # Compute metrics
  results = {'Loss': loss, 'Margin': model_margin}
  results.update(get_metrics(y, predictions))

  # Remember the results
  results['Algorithm'] = 'SMOTEBoost'
  results['T'] = t
  results['AUC'] = auc_score
  train_results = train_results.append(results, ignore_index=True)
  print(results)

  # Show the loss plot
  loss_plotter.show()
  roc_plotter.show()
  plot_histogram(y, hfin_values, 'The histogram of scores compared to true labels for T='+str(t))

T= 10
--------------------------------------------------------------------------------------------------------------------------------


HBox(children=(FloatProgress(value=0.0, max=10.0), HTML(value='')))


{'Loss': 0.8676429037362485, 'Margin': 0.5054392291047144, 'Precision': 0.5131578947368421, 'Recall': 0.48148148148148145, 'F1score': 0.49681528662420377, 'Accuracy': 0.7418300653594772, 'TN': 940, 'FP': 185, 'FN': 210, 'TP': 195, 'Algorithm': 'SMOTEBoost', 'T': 10, 'AUC': 0.7252400548696845}


T= 50
--------------------------------------------------------------------------------------------------------------------------------


HBox(children=(FloatProgress(value=0.0, max=50.0), HTML(value='')))


{'Loss': 0.8476194158705956, 'Margin': 0.5000598872915386, 'Precision': 0.5692307692307692, 'Recall': 0.4567901234567901, 'F1score': 0.5068493150684931, 'Accuracy': 0.7647058823529411, 'TN': 985, 'FP': 140, 'FN': 220, 'TP': 185, 'Algorithm': 'SMOTEBoost', 'T': 50, 'AUC': 0.7588477366255144}


T= 100
--------------------------------------------------------------------------------------------------------------------------------


HBox(children=(FloatProgress(value=0.0), HTML(value='')))


{'Loss': 0.8523229072452148, 'Margin': 0.5002866642611911, 'Precision': 0.5476190476190477, 'Recall': 0.5679012345679012, 'F1score': 0.5575757575757575, 'Accuracy': 0.761437908496732, 'TN': 935, 'FP': 190, 'FN': 175, 'TP': 230, 'Algorithm': 'SMOTEBoost', 'T': 100, 'AUC': 0.7525102880658436}


## 5) RBBoost

**RBBoost** algorithm uses both *SMOTE* and *RUS* techniques. Each member of the Random Balance ensemble is trained with data sampled from the training set and augmented by artificial instances obtained using SMOTE [[9]](https://www.sciencedirect.com/science/article/abs/pii/S0950705115001720).   
First of all, we should randomly set the new size of majority and minority classes. Then *SMOTE* and *Random Undersampling* are used to respectively increase or reduce the size of classes to match the desired size.

![picture](https://drive.google.com/uc?id=1BcI9Wqghyei2UxnH7Muop_t0ECK0n0Ff)

### 5.1) Algorithm implementation

In [34]:
class RBBoost:
  """
    Functions:
      run(self) -> Run the algorithm of RBBoost
      RandomBalance(self, D_T, k) -> Make new set using Random Balance technique 
      predict(self, input) -> Predict the labels for given samples using created ensemble
      pseudo_losses(self) -> The pseudo-loss for each hypothesis
  """


  # Initilizition
  def __init__(self, X, y, T, k):
    """
      X: Feature vectors
      y: Labels
      m: Number of samples
      T: Number of iterations/classifiers/hypothesis
      k: The value of k in KNN when introducing SMOTE
      H: Ensemble of classifiers
      beta_list: A list containing the beta value for each classifier
      loss_list: A list containing the pseudo-loss of each classifier
    """
    self.X = X
    self.y = y
    self.m = len(X)
    self.T = T
    self.k = k
    self.H = []
    self.beta_list = []
    self.loss_list = []
    self.minority_label = 0
    self.majority_label = 0
    self.minority_size = 0
    self.majority_size = 0
    self.minority_idx = []
    self.majority_idx = []
    self.index_to_label = {}
    self.label_to_index = {}
    self.set_minority_majority()

  # Find which label is minority and which majority
  def set_minority_majority(self):
    labels = np.unique(self.y)
    count = []
    for label in labels:
      cnt = len((self.y==label).nonzero()[0])
      count.append(cnt)
    minority_idx = 0
    majority_idx = 0
    if count[0]>count[1]:
      minority_idx = 1
    else:
      majority_idx = 1
    self.majority_label = labels[majority_idx]
    self.majority_size = count[majority_idx]
    self.majority_idx = (self.y == labels[majority_idx]).nonzero()[0]
    self.minority_label = labels[minority_idx]
    self.minority_size = count[minority_idx]
    self.minority_idx = (self.y == labels[minority_idx]).nonzero()[0]

  # Choose the samples using Random Balance technique
  def RandomBalance(self, D_t):
    new_majority_size = np.random.randint(2, self.m - 2)
    new_minority_size = self.m - new_majority_size
    X_new = []
    y_new = []
    D_new = []
    if new_majority_size < self.majority_size:
      # Add all minority samples to the new dataset
      minority_data = self.X[self.minority_idx]
      label_minority = [self.minority_label for i in minority_data]
      minority_weights = D_t[self.minority_idx]
      X_new.extend(minority_data)
      y_new.extend(label_minority)
      D_new.extend(minority_weights)
      # Take a random sample of size new_majority_size from majority class
      # and add the sample to the new dataset
      rand_majority_idx = np.random.choice(self.majority_idx, new_majority_size, replace=False)
      rand_majority_data = self.X[rand_majority_idx]
      rand_majority_label = [self.majority_label for i in rand_majority_data]
      rand_majority_weights = D_t[rand_majority_idx]
      X_new.extend(rand_majority_data)
      y_new.extend(rand_majority_label)
      D_new.extend(rand_majority_weights)
      # Create artificial examples from minority class using SMOTE
      # and add these examples to the new dataset
      create_num = new_minority_size - self.minority_size
      create_amount = (create_num/self.minority_size)*100
      synthetic_data = SMOTE(minority_data, create_amount, self.k)
      synthetic_label = np.full(synthetic_data.shape[0], dtype=np.int64,
                                fill_value=self.minority_label)
      synthetic_weights = np.full(synthetic_data.shape[0], dtype=np.float64,
                                  fill_value=(1./self.m))
      X_new.extend(synthetic_data)
      y_new.extend(synthetic_label)
      D_new.extend(synthetic_weights)
    else:
      # Add all majority samples to the new dataset
      majority_data = self.X[self.majority_idx]
      label_majority = [self.majority_label for i in majority_data]
      majority_weights = D_t[self.majority_idx]
      X_new.extend(majority_data)
      y_new.extend(label_majority)
      D_new.extend(majority_weights)
      # Take a random sample of size new_minority_size from minority class
      # and add the sample to the new dataset
      rand_minority_idx = np.random.choice(self.minority_idx, new_minority_size, replace=False)
      rand_minority_data = self.X[rand_minority_idx]
      rand_minority_label = [self.minority_label for i in rand_minority_data]
      rand_minority_weights = D_t[rand_minority_idx]
      X_new.extend(rand_minority_data)
      y_new.extend(rand_minority_label)
      D_new.extend(rand_minority_weights)
      # Create artificial examples from majority class using SMOTE
      # and add these examples to the new dataset
      create_num = new_majority_size - self.majority_size
      create_amount = (create_num/self.majority_size)*100
      synthetic_data = SMOTE(majority_data, create_amount, self.k)
      synthetic_label = np.full(synthetic_data.shape[0], dtype=np.int64,
                                fill_value=self.majority_label)
      synthetic_weights = np.full(synthetic_data.shape[0], dtype=np.float64,
                                  fill_value=(1./self.m))
      X_new.extend(synthetic_data)
      y_new.extend(synthetic_label)
      D_new.extend(synthetic_weights)
    
    return X_new, y_new, D_new

  

  # Running the RBBoost algorithm
  def run(self):

    # Initilize distribution
    D_t = (1./self.m) * np.ones(self.m)

    for t in trange(self.T):

      # Choose the new (temporary) training set
      X_new, y_new, D_new = self.RandomBalance(D_t)

      # Train a new classifier on training data with distibution D_t
      weak_learn = DecisionTreeClassifier(max_depth=max_depth, max_features=max_features)
      weak_learn.fit(X_new, y_new, sample_weight=D_new)
      
      # Get predictions and probabilities for the training data
      predictions = weak_learn.predict(self.X)
      probabilities = weak_learn.predict_proba(self.X)
      
      if t == 0:
        for index, value in enumerate(weak_learn.classes_):
          self.index_to_label[index] = value
          self.label_to_index[value] = index
      
      # Compute pseudo-loss
      pseudo_loss = 0

      # For each prediction
      for index, pr in enumerate(predictions):

        # Find the true label of the sample
        true_label = self.y[index]
        
        # Check if it's misclassified
        if pr != true_label:
          # Probability of assigning the predicted label
          pr_index = self.label_to_index[pr]
          pred_prob = probabilities[index][pr_index]
          # Probability of assigning the true label
          true_index = self.label_to_index[true_label]
          true_prob = probabilities[index][true_index]
          # Compute loss for the given sample
          current_loss = D_t[index] * (1 - true_prob + pred_prob)
          # Add it to pseudo_loss
          pseudo_loss += current_loss

      # Final step to compute pseudo-loss
      pseudo_loss = (1/2)*pseudo_loss

      # Compute beta
      beta = pseudo_loss / (1 - pseudo_loss)
      
      # Update distribution
      for index, value in enumerate(D_t):
        predicted_label = predictions[index]
        true_label = self.y[index]
        # Probability of assigning the predicted label
        pr_index = self.label_to_index[predicted_label]
        pred_prob = probabilities[index][pr_index]
        # Probability of assigning the true label
        true_index = self.label_to_index[true_label]
        true_prob = probabilities[index][true_index]
        power = (1/2) * (1 + true_prob - pred_prob)
        D_t[index] = (value) * pow(beta,power)
      
      # Normalization factor
      Z_t = np.sum(D_t)
      D_t = D_t / sum(D_t)
      
      # Append weak learner to the ensemble
      self.H.append(weak_learn)
      # Append computed loss of the weak learner
      self.loss_list.append(pseudo_loss)
      # Append computed beta of the weak learner
      self.beta_list.append(beta)
  
      # print('t =', t, '--> pseudo-loss =', pseudo_loss)
    
    # Convert beta and loss lists to numpy array
    self.beta_list = np.array(self.beta_list)
    self.loss_list = np.array(self.loss_list)


  def predict(self, input):
    # Sum of the probabilities made by each hypothesis for a given sample and label
    sum_of_h = None
    # Compute the multiply factors
    alpha_list = np.log(1/self.beta_list)
    # For each hypothesis
    for index, h in enumerate(self.H):
      # Predict probabilities
      probabilities = h.predict_proba(input)
      # Multiply probabilities by the hypothesis computed alpha
      weighted_prob = alpha_list[index] * probabilities
      # If it's the first hypothesis
      if index == 0:
        sum_of_h = weighted_prob
      else:
        sum_of_h += weighted_prob
    # Choose the label using argmax
    predicted_idx = np.argmax(sum_of_h, axis=1)
    predicted_labels = []
    for pr in predicted_idx:
      predicted_labels.append(self.index_to_label[pr])
    predicted_labels = np.array(predicted_labels)
    #print(predicted_labels)
    # Remember the h_finite value
    h_fin = [sum_of_h[index][label]*predicted_labels[index] for index, label in enumerate(predicted_idx)]
    #print(h_fin)
    alpha_norm = np.linalg.norm(alpha_list, ord=1)
    h_fin = h_fin/alpha_norm
    return predicted_labels, h_fin

  # Print the computed pseudo-loss of each classifier
  def pseudo_losses(self):
    return self.loss_list


### 5.2) Run on 5-fold

In [35]:
# Remove last results
five_fold_results = five_fold_results[five_fold_results['Algorithm'] != 'RBBoost']

for t in T:

  # Print the pseudo-loss of each classifier
  loss_plotter = LossPlotter('Loss of each classifier for T='+str(t), 't', 'pseudo-loss')
  roc_plotter = RocPlotter('ROC curve for T='+str(t))
  # Fold counter
  fold_count = 1
  # A list containing all the fold results
  all_fold_results = []
  print('T=',str(t))
  # Print a horizontal line
  print ('-' * 128)

  # For each fold
  for train_index, test_index in five_fold.split(X):
    # Split the train and test set
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # Create an RBBoost object
    rbboost = RBBoost(X_train, y_train, t, k_in_knn)
    # Run the algorithm
    rbboost.run()
    # Add scatter of this pseudo-loss to the loss plotter
    pseudo_losses = rbboost.pseudo_losses()
    loss_plotter.add_scatter(pseudo_losses, 'fold-'+str(fold_count))
    # Prediction on the test data
    predictions, hfin_values = rbboost.predict(X_test)
    model_margin = abs(hfin_values[0])
    for value in hfin_values:
      current_margin = abs(value)
      if current_margin < model_margin:
        model_margin = current_margin
    # Plot ROC curve
    auc_score = roc_plotter.add_scatter(y_test, hfin_values, 'Fold-'+str(fold_count))
    # Compute the loss on the test data
    loss = ensemble_loss_binary(y_test, predictions, hfin_values)
    results = {'Loss': loss, 'Margin': model_margin}
    results['AUC'] = auc_score
    results.update(get_metrics(y_test, predictions))
    all_fold_results.append(results)
    print('Fold-'+str(fold_count), results, '\n')
    fold_count += 1
  
  # Compute the average
  df_results = pd.DataFrame(all_fold_results)
  avg_results = dict(df_results.mean())
  # Print a horizontal line
  print ('-' * 128)

  # Remember the results
  avg_results['Algorithm'] = 'RBBoost'
  avg_results['T'] = t
  five_fold_results = five_fold_results.append(avg_results, ignore_index=True)
  print('Average', avg_results)

  # Show the loss plot
  loss_plotter.show()
  roc_plotter.show()

T= 10
--------------------------------------------------------------------------------------------------------------------------------


HBox(children=(FloatProgress(value=0.0, max=10.0), HTML(value='')))


Fold-1 {'Loss': 0.8550768179231558, 'Margin': 0.393728361163156, 'AUC': 0.6709862190211534, 'Precision': 0.4444444444444444, 'Recall': 0.05194805194805195, 'F1score': 0.09302325581395349, 'Accuracy': 0.7450980392156863, 'TN': 224, 'FP': 5, 'FN': 73, 'TP': 4} 



HBox(children=(FloatProgress(value=0.0, max=10.0), HTML(value='')))


Fold-2 {'Loss': 0.8492785903784266, 'Margin': 0.38452382662834544, 'AUC': 0.712387559527309, 'Precision': 0.0, 'Recall': 0.0, 'F1score': 0.0, 'Accuracy': 0.761437908496732, 'TN': 233, 'FP': 0, 'FN': 73, 'TP': 0} 




Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.



HBox(children=(FloatProgress(value=0.0, max=10.0), HTML(value='')))


Fold-3 {'Loss': 0.83561317385043, 'Margin': 0.4636139908924625, 'AUC': 0.776798433048433, 'Precision': 0.0, 'Recall': 0.0, 'F1score': 0.0, 'Accuracy': 0.7647058823529411, 'TN': 234, 'FP': 0, 'FN': 72, 'TP': 0} 




Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.



HBox(children=(FloatProgress(value=0.0, max=10.0), HTML(value='')))


Fold-4 {'Loss': 0.8992004749315696, 'Margin': 0.4726187182265631, 'AUC': 0.6806037384145395, 'Precision': 0.0, 'Recall': 0.0, 'F1score': 0.0, 'Accuracy': 0.7091503267973857, 'TN': 217, 'FP': 0, 'FN': 89, 'TP': 0} 




Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.



HBox(children=(FloatProgress(value=0.0, max=10.0), HTML(value='')))


Fold-5 {'Loss': 0.9283576338987048, 'Margin': 0.5335885044448581, 'AUC': 0.7170814933761542, 'Precision': 0.0, 'Recall': 0.0, 'F1score': 0.0, 'Accuracy': 0.6928104575163399, 'TN': 212, 'FP': 0, 'FN': 94, 'TP': 0} 

--------------------------------------------------------------------------------------------------------------------------------
Average {'Loss': 0.8735053381964575, 'Margin': 0.44961468027107704, 'AUC': 0.7115714886775178, 'Precision': 0.08888888888888888, 'Recall': 0.01038961038961039, 'F1score': 0.018604651162790697, 'Accuracy': 0.734640522875817, 'TN': 224.0, 'FP': 1.0, 'FN': 80.2, 'TP': 0.8, 'Algorithm': 'RBBoost', 'T': 10}



Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.



T= 50
--------------------------------------------------------------------------------------------------------------------------------


HBox(children=(FloatProgress(value=0.0, max=50.0), HTML(value='')))


Fold-1 {'Loss': 0.825867114174237, 'Margin': 0.4609782279540166, 'AUC': 0.7507797878976918, 'Precision': 0.6410256410256411, 'Recall': 0.3246753246753247, 'F1score': 0.43103448275862066, 'Accuracy': 0.7843137254901961, 'TN': 215, 'FP': 14, 'FN': 52, 'TP': 25} 



HBox(children=(FloatProgress(value=0.0, max=50.0), HTML(value='')))


Fold-2 {'Loss': 0.8558569725325581, 'Margin': 0.43362101074768183, 'AUC': 0.7049209242165911, 'Precision': 0.48717948717948717, 'Recall': 0.2602739726027397, 'F1score': 0.33928571428571425, 'Accuracy': 0.7581699346405228, 'TN': 213, 'FP': 20, 'FN': 54, 'TP': 19} 



HBox(children=(FloatProgress(value=0.0, max=50.0), HTML(value='')))


Fold-3 {'Loss': 0.8430737291064467, 'Margin': 0.4803058575084505, 'AUC': 0.7692010921177588, 'Precision': 1.0, 'Recall': 0.013888888888888888, 'F1score': 0.0273972602739726, 'Accuracy': 0.7679738562091504, 'TN': 234, 'FP': 0, 'FN': 71, 'TP': 1} 



HBox(children=(FloatProgress(value=0.0, max=50.0), HTML(value='')))


Fold-4 {'Loss': 0.8733596146252495, 'Margin': 0.49392774222870695, 'AUC': 0.7413659193289494, 'Precision': 0.9090909090909091, 'Recall': 0.11235955056179775, 'F1score': 0.19999999999999998, 'Accuracy': 0.738562091503268, 'TN': 216, 'FP': 1, 'FN': 79, 'TP': 10} 



HBox(children=(FloatProgress(value=0.0, max=50.0), HTML(value='')))


Fold-5 {'Loss': 0.9024633286143309, 'Margin': 0.45274788375273195, 'AUC': 0.7494229225210759, 'Precision': 0.7272727272727273, 'Recall': 0.0851063829787234, 'F1score': 0.15238095238095237, 'Accuracy': 0.7091503267973857, 'TN': 209, 'FP': 3, 'FN': 86, 'TP': 8} 

--------------------------------------------------------------------------------------------------------------------------------
Average {'Loss': 0.8601241518105646, 'Margin': 0.4643161444383176, 'AUC': 0.7431381292164134, 'Precision': 0.7529137529137528, 'Recall': 0.15926082394149488, 'F1score': 0.23001968193985198, 'Accuracy': 0.7516339869281046, 'TN': 217.4, 'FP': 7.6, 'FN': 68.4, 'TP': 12.6, 'Algorithm': 'RBBoost', 'T': 50}


T= 100
--------------------------------------------------------------------------------------------------------------------------------


HBox(children=(FloatProgress(value=0.0), HTML(value='')))


Fold-1 {'Loss': 0.8345232342958149, 'Margin': 0.45647416401752733, 'AUC': 0.7385016729994897, 'Precision': 0.6041666666666666, 'Recall': 0.37662337662337664, 'F1score': 0.464, 'Accuracy': 0.7810457516339869, 'TN': 210, 'FP': 19, 'FN': 48, 'TP': 29} 



HBox(children=(FloatProgress(value=0.0), HTML(value='')))


Fold-2 {'Loss': 0.8483973380610538, 'Margin': 0.4754966107010392, 'AUC': 0.7273796225527662, 'Precision': 0.5098039215686274, 'Recall': 0.3561643835616438, 'F1score': 0.4193548387096774, 'Accuracy': 0.7647058823529411, 'TN': 208, 'FP': 25, 'FN': 47, 'TP': 26} 



HBox(children=(FloatProgress(value=0.0), HTML(value='')))


Fold-3 {'Loss': 0.7939943890550485, 'Margin': 0.48331910592328237, 'AUC': 0.7889957264957264, 'Precision': 0.7, 'Recall': 0.3888888888888889, 'F1score': 0.5, 'Accuracy': 0.8169934640522876, 'TN': 222, 'FP': 12, 'FN': 44, 'TP': 28} 



HBox(children=(FloatProgress(value=0.0), HTML(value='')))


Fold-4 {'Loss': 0.8612087118319669, 'Margin': 0.49514330966245756, 'AUC': 0.7472169005333196, 'Precision': 0.6326530612244898, 'Recall': 0.34831460674157305, 'F1score': 0.44927536231884063, 'Accuracy': 0.7516339869281046, 'TN': 199, 'FP': 18, 'FN': 58, 'TP': 31} 



HBox(children=(FloatProgress(value=0.0), HTML(value='')))


Fold-5 {'Loss': 0.8667848807798284, 'Margin': 0.4498242315493041, 'AUC': 0.7547169811320755, 'Precision': 0.6666666666666666, 'Recall': 0.3617021276595745, 'F1score': 0.4689655172413793, 'Accuracy': 0.7483660130718954, 'TN': 195, 'FP': 17, 'FN': 60, 'TP': 34} 

--------------------------------------------------------------------------------------------------------------------------------
Average {'Loss': 0.8409817108047426, 'Margin': 0.4720514843707221, 'AUC': 0.7513621807426756, 'Precision': 0.62265806322529, 'Recall': 0.36633867669501136, 'F1score': 0.4603191436539794, 'Accuracy': 0.7725490196078431, 'TN': 206.8, 'FP': 18.2, 'FN': 51.4, 'TP': 29.6, 'Algorithm': 'RBBoost', 'T': 100}


### 5.3) Run on whole training data

In [36]:
# Remove last results
train_results = train_results[train_results['Algorithm'] != 'RBBoost']

for t in T:

  # Print the pseudo-loss of each classifier
  loss_plotter = LossPlotter('Loss of each classifier for T='+str(t), 't', 'pseudo-loss')
  roc_plotter = RocPlotter('ROC curve for T='+str(t))
  print('T=',str(t))
  # Print a horizontal line
  print ('-' * 128)

  # Create an RBBoost object
  rbboost = RBBoost(X, y, t, k_in_knn)
  # Run the algorithm
  rbboost.run()
  # Add scatter of this pseudo-loss to the loss plotter
  pseudo_losses = rbboost.pseudo_losses()
  loss_plotter.add_scatter(pseudo_losses, 'train data')
  # Prediction on the test data
  predictions, hfin_values = rbboost.predict(X)
  model_margin = abs(hfin_values[0])
  for value in hfin_values:
    current_margin = abs(value)
    if current_margin < model_margin:
      model_margin = current_margin
  # Plot ROC curve
  auc_score = roc_plotter.add_scatter(y, hfin_values, 'train data')
  # Compute the loss on the test data
  loss = ensemble_loss_binary(y, predictions, hfin_values)
  # Compute metrics
  results = {'Loss': loss, 'Margin': model_margin}
  results.update(get_metrics(y, predictions))

  # Remember the results
  results['Algorithm'] = 'RBBoost'
  results['T'] = t
  results['AUC'] = auc_score
  train_results = train_results.append(results, ignore_index=True)
  print(results)

  # Show the loss plot
  loss_plotter.show()
  roc_plotter.show()
  plot_histogram(y, hfin_values, 'The histogram of scores compared to true labels for T='+str(t))

T= 10
--------------------------------------------------------------------------------------------------------------------------------


HBox(children=(FloatProgress(value=0.0, max=10.0), HTML(value='')))


{'Loss': 0.8837661858994456, 'Margin': 0.3700696799093046, 'Precision': 0.0, 'Recall': 0.0, 'F1score': 0.0, 'Accuracy': 0.7352941176470589, 'TN': 1125, 'FP': 0, 'FN': 405, 'TP': 0, 'Algorithm': 'RBBoost', 'T': 10, 'AUC': 0.5848559670781893}



Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.



T= 50
--------------------------------------------------------------------------------------------------------------------------------


HBox(children=(FloatProgress(value=0.0, max=50.0), HTML(value='')))


{'Loss': 0.8420993755946483, 'Margin': 0.4826092067605552, 'Precision': 0.6086956521739131, 'Recall': 0.345679012345679, 'F1score': 0.4409448818897638, 'Accuracy': 0.7679738562091504, 'TN': 1035, 'FP': 90, 'FN': 265, 'TP': 140, 'Algorithm': 'RBBoost', 'T': 50, 'AUC': 0.7606584362139918}


T= 100
--------------------------------------------------------------------------------------------------------------------------------


HBox(children=(FloatProgress(value=0.0), HTML(value='')))


{'Loss': 0.8341584809521311, 'Margin': 0.4572489534444815, 'Precision': 0.6129032258064516, 'Recall': 0.4691358024691358, 'F1score': 0.5314685314685315, 'Accuracy': 0.7810457516339869, 'TN': 1005, 'FP': 120, 'FN': 215, 'TP': 190, 'Algorithm': 'RBBoost', 'T': 100, 'AUC': 0.756159122085048}


## 6) SVM

### 6.1) Run on 5-fold

In [37]:
# Remove last results
five_fold_results = five_fold_results[five_fold_results['Algorithm'] != 'SVM']

roc_plotter = RocPlotter('ROC curve for T='+str(t))
# Fold counter
fold_count = 1
# A list containing all the fold results
all_fold_results = []
print('SVM')
# Print a horizontal line
print ('-' * 128)
# For each fold

for train_index, test_index in five_fold.split(X):
  X_train, X_test = X[train_index], X[test_index]
  y_train, y_test = y[train_index], y[test_index]
  svm_classifier = svm.SVC(probability=True)
  svm_classifier.fit(X_train, y_train)
  # Predict labels
  svm_pred_labels = svm_classifier.predict(X_test)
  # Predict probability
  svm_pred_prob = svm_classifier.predict_proba(X_test)
  svm_hfin_idx = np.argmax(svm_pred_prob, axis=1)
  svm_hfin = [pred[svm_hfin_idx[index]] for index, pred in enumerate(svm_pred_prob)]
  svm_hfin_label = [value * svm_pred_labels[index] for index, value in enumerate(svm_hfin)]
  # Compute the loss on the test data
  loss = ensemble_loss_binary(y_test, svm_pred_labels, svm_hfin_label)
  # Compute AUC score and ROC curve
  auc_score = roc_plotter.add_scatter(y_test, svm_hfin_label, 'train data')
  # Model margin
  model_margin = abs(svm_hfin_label[0])
  for value in svm_hfin_label:
    current_margin = abs(value)
    if current_margin < model_margin:
      model_margin = current_margin
  results = {'Loss': loss, 'Margin': model_margin}
  results['AUC'] = auc_score
  results.update(get_metrics(y_test, svm_pred_labels))
  all_fold_results.append(results)
  print('Fold-'+str(fold_count), results, '\n')
  fold_count += 1

# Compute the average
df_results = pd.DataFrame(all_fold_results)
avg_results = dict(df_results.mean())
# Print a horizontal line
print ('-' * 128)

# Remember the results
avg_results['Algorithm'] = 'SVM'
five_fold_results = five_fold_results.append(avg_results, ignore_index=True)
print('Average', avg_results)

# Show the loss plot
roc_plotter.show()

SVM
--------------------------------------------------------------------------------------------------------------------------------
Fold-1 {'Loss': 0.8141026815384366, 'Margin': 0.5057381663111769, 'AUC': 0.7743435603697613, 'Precision': 0.6774193548387096, 'Recall': 0.2727272727272727, 'F1score': 0.38888888888888884, 'Accuracy': 0.7843137254901961, 'TN': 219, 'FP': 10, 'FN': 56, 'TP': 21} 

Fold-2 {'Loss': 0.8348999783070238, 'Margin': 0.5, 'AUC': 0.7832618025751072, 'Precision': 0.5151515151515151, 'Recall': 0.2328767123287671, 'F1score': 0.320754716981132, 'Accuracy': 0.7647058823529411, 'TN': 217, 'FP': 16, 'FN': 56, 'TP': 17} 

Fold-3 {'Loss': 0.7402407552035943, 'Margin': 0.5, 'AUC': 0.8165657644824311, 'Precision': 0.8064516129032258, 'Recall': 0.3472222222222222, 'F1score': 0.48543689320388356, 'Accuracy': 0.826797385620915, 'TN': 228, 'FP': 6, 'FN': 47, 'TP': 25} 

Fold-4 {'Loss': 0.8561071862459592, 'Margin': 0.5155219660102667, 'AUC': 0.7742712162791903, 'Precision': 0.7666

### 6.2) Run on whole training data

In [38]:
# Remove last results
train_results = train_results[train_results['Algorithm'] != 'SVM']

roc_plotter = RocPlotter('ROC curve for T='+str(t))
# Fold counter
fold_count = 1
# A list containing all the fold results
all_fold_results = []
print('SVM')
# Print a horizontal line
print ('-' * 128)
# For each fold

# Train a SVM classifier on data
svm_classifier = svm.SVC(probability=True)
svm_classifier.fit(X, y)

# Predict labels
svm_pred_labels = svm_classifier.predict(X)
# Predict probability
svm_pred_prob = svm_classifier.predict_proba(X)
svm_hfin_idx = np.argmax(svm_pred_prob, axis=1)
svm_hfin = [pred[svm_hfin_idx[index]] for index, pred in enumerate(svm_pred_prob)]
svm_hfin_label = [value * svm_pred_labels[index] for index, value in enumerate(svm_hfin)]
# Compute the loss on the test data
loss = ensemble_loss_binary(y, svm_pred_labels, svm_hfin_label)
# Compute AUC score and ROC curve
auc_score = roc_plotter.add_scatter(y, svm_hfin_label, 'train data')
# Model margin
model_margin = abs(svm_hfin_label[0])
for value in svm_hfin_label:
  current_margin = abs(value)
  if current_margin < model_margin:
    model_margin = current_margin
# Remember the results
results = {'Loss': loss, 'Margin': model_margin}
results.update(get_metrics(y, svm_pred_labels))
all_fold_results.append(results)
# Remember the results
results['Algorithm'] = 'SVM'
results['AUC'] = auc_score
train_results = train_results.append(results, ignore_index=True)
print(results)
# Show the loss plot
roc_plotter.show()

SVM
--------------------------------------------------------------------------------------------------------------------------------
{'Loss': 0.8206842412976009, 'Margin': 0.5, 'Precision': 0.696969696969697, 'Recall': 0.2839506172839506, 'F1score': 0.4035087719298246, 'Accuracy': 0.7777777777777778, 'TN': 1075, 'FP': 50, 'FN': 290, 'TP': 115, 'Algorithm': 'SVM', 'AUC': 0.7916598079561042}


In [57]:
plot_histogram(y, svm_hfin_label, 'The histogram of scores compared to true label')

## 7) Results

The results of implemented models shown in this section are based on different metrics. In the 5-fold method, the algorithm trained on 4-fold of the data and tested on the other remaining fold. In the other method, both training and evaluating steps rely on the training data.

### 7.1) 5-Fold results

In [39]:
five_fold_results

Unnamed: 0,Algorithm,T,Loss,Precision,Recall,F1score,AUC,Accuracy,TN,FP,FN,TP,Margin
0,AdaBoostM2,10.0,0.859261,0.4,0.040945,0.072765,0.74061,0.745098,225.0,0.0,78.0,3.0,0.510624
1,AdaBoostM2,50.0,0.845848,0.628147,0.263049,0.370025,0.750291,0.764052,212.6,12.4,59.8,21.2,0.500997
2,AdaBoostM2,100.0,0.849093,0.627967,0.260271,0.367837,0.762014,0.762745,212.4,12.6,60.0,21.0,0.500288
3,RUSBoost,10.0,0.888299,0.49188,0.57216,0.524845,0.739417,0.728105,176.8,48.2,35.0,46.0,0.502732
4,RUSBoost,50.0,0.884422,0.494102,0.634662,0.554395,0.755478,0.731373,172.4,52.6,29.6,51.4,0.500341
5,RUSBoost,100.0,0.880941,0.499902,0.644164,0.5605,0.773279,0.734641,172.6,52.4,28.8,52.2,0.500103
6,SMOTEBoost,10.0,0.844204,0.556038,0.462429,0.502372,0.702046,0.762745,195.8,29.2,43.4,37.6,0.501564
7,SMOTEBoost,50.0,0.851468,0.547338,0.498888,0.519287,0.742725,0.760784,192.4,32.6,40.6,40.4,0.5006
8,SMOTEBoost,100.0,0.853451,0.546278,0.523615,0.532687,0.749042,0.760131,190.2,34.8,38.6,42.4,0.500257
9,RBBoost,10.0,0.873505,0.088889,0.01039,0.018605,0.711571,0.734641,224.0,1.0,80.2,0.8,0.449615


#### 7.1.1) Sort by Recall

Sort the results based on the number of Recall metric.

In [40]:
five_fold_results.sort_values(by=['Recall'], ascending=False)

Unnamed: 0,Algorithm,T,Loss,Precision,Recall,F1score,AUC,Accuracy,TN,FP,FN,TP,Margin
5,RUSBoost,100.0,0.880941,0.499902,0.644164,0.5605,0.773279,0.734641,172.6,52.4,28.8,52.2,0.500103
4,RUSBoost,50.0,0.884422,0.494102,0.634662,0.554395,0.755478,0.731373,172.4,52.6,29.6,51.4,0.500341
3,RUSBoost,10.0,0.888299,0.49188,0.57216,0.524845,0.739417,0.728105,176.8,48.2,35.0,46.0,0.502732
8,SMOTEBoost,100.0,0.853451,0.546278,0.523615,0.532687,0.749042,0.760131,190.2,34.8,38.6,42.4,0.500257
7,SMOTEBoost,50.0,0.851468,0.547338,0.498888,0.519287,0.742725,0.760784,192.4,32.6,40.6,40.4,0.5006
6,SMOTEBoost,10.0,0.844204,0.556038,0.462429,0.502372,0.702046,0.762745,195.8,29.2,43.4,37.6,0.501564
11,RBBoost,100.0,0.840982,0.622658,0.366339,0.460319,0.751362,0.772549,206.8,18.2,51.4,29.6,0.472051
12,SVM,,0.831023,0.678852,0.269059,0.384544,0.781886,0.771895,214.6,10.4,59.4,21.6,0.506244
1,AdaBoostM2,50.0,0.845848,0.628147,0.263049,0.370025,0.750291,0.764052,212.6,12.4,59.8,21.2,0.500997
2,AdaBoostM2,100.0,0.849093,0.627967,0.260271,0.367837,0.762014,0.762745,212.4,12.6,60.0,21.0,0.500288


#### 7.1.2) Sort by F-score

Sort the results based on the F-score metric.

In [41]:
five_fold_results.sort_values(by=['F1score'], ascending=False)

Unnamed: 0,Algorithm,T,Loss,Precision,Recall,F1score,AUC,Accuracy,TN,FP,FN,TP,Margin
5,RUSBoost,100.0,0.880941,0.499902,0.644164,0.5605,0.773279,0.734641,172.6,52.4,28.8,52.2,0.500103
4,RUSBoost,50.0,0.884422,0.494102,0.634662,0.554395,0.755478,0.731373,172.4,52.6,29.6,51.4,0.500341
8,SMOTEBoost,100.0,0.853451,0.546278,0.523615,0.532687,0.749042,0.760131,190.2,34.8,38.6,42.4,0.500257
3,RUSBoost,10.0,0.888299,0.49188,0.57216,0.524845,0.739417,0.728105,176.8,48.2,35.0,46.0,0.502732
7,SMOTEBoost,50.0,0.851468,0.547338,0.498888,0.519287,0.742725,0.760784,192.4,32.6,40.6,40.4,0.5006
6,SMOTEBoost,10.0,0.844204,0.556038,0.462429,0.502372,0.702046,0.762745,195.8,29.2,43.4,37.6,0.501564
11,RBBoost,100.0,0.840982,0.622658,0.366339,0.460319,0.751362,0.772549,206.8,18.2,51.4,29.6,0.472051
12,SVM,,0.831023,0.678852,0.269059,0.384544,0.781886,0.771895,214.6,10.4,59.4,21.6,0.506244
1,AdaBoostM2,50.0,0.845848,0.628147,0.263049,0.370025,0.750291,0.764052,212.6,12.4,59.8,21.2,0.500997
2,AdaBoostM2,100.0,0.849093,0.627967,0.260271,0.367837,0.762014,0.762745,212.4,12.6,60.0,21.0,0.500288


#### 7.1.3) Sort by Loss

Sort the results based on the value of Loss.

In [42]:
five_fold_results.sort_values(by=['Loss'])

Unnamed: 0,Algorithm,T,Loss,Precision,Recall,F1score,AUC,Accuracy,TN,FP,FN,TP,Margin
12,SVM,,0.831023,0.678852,0.269059,0.384544,0.781886,0.771895,214.6,10.4,59.4,21.6,0.506244
11,RBBoost,100.0,0.840982,0.622658,0.366339,0.460319,0.751362,0.772549,206.8,18.2,51.4,29.6,0.472051
6,SMOTEBoost,10.0,0.844204,0.556038,0.462429,0.502372,0.702046,0.762745,195.8,29.2,43.4,37.6,0.501564
1,AdaBoostM2,50.0,0.845848,0.628147,0.263049,0.370025,0.750291,0.764052,212.6,12.4,59.8,21.2,0.500997
2,AdaBoostM2,100.0,0.849093,0.627967,0.260271,0.367837,0.762014,0.762745,212.4,12.6,60.0,21.0,0.500288
7,SMOTEBoost,50.0,0.851468,0.547338,0.498888,0.519287,0.742725,0.760784,192.4,32.6,40.6,40.4,0.5006
8,SMOTEBoost,100.0,0.853451,0.546278,0.523615,0.532687,0.749042,0.760131,190.2,34.8,38.6,42.4,0.500257
0,AdaBoostM2,10.0,0.859261,0.4,0.040945,0.072765,0.74061,0.745098,225.0,0.0,78.0,3.0,0.510624
10,RBBoost,50.0,0.860124,0.752914,0.159261,0.23002,0.743138,0.751634,217.4,7.6,68.4,12.6,0.464316
9,RBBoost,10.0,0.873505,0.088889,0.01039,0.018605,0.711571,0.734641,224.0,1.0,80.2,0.8,0.449615


#### 7.1.4) Precision/Recall Plot

In [51]:
fig = go.Figure(data=[
    go.Bar(name='Precision', x=five_fold_results['Algorithm'], y=five_fold_results['Precision']),
    go.Bar(name='Recall', x=five_fold_results['Algorithm'], y=five_fold_results['Recall'])
])
# Change the bar mode
fig.update_layout(barmode='group')
fig.show()

### 7.2) Training results

In [43]:
train_results

Unnamed: 0,Algorithm,T,Loss,Precision,Recall,F1score,AUC,Accuracy,TN,FP,FN,TP,Margin
0,AdaBoostM2,10.0,0.87357,0.0,0.0,0.0,0.738217,0.735294,1125,0,405,0,0.538924
1,AdaBoostM2,50.0,0.845163,0.636364,0.259259,0.368421,0.76952,0.764706,1065,60,300,105,0.500188
2,AdaBoostM2,100.0,0.843039,0.65625,0.259259,0.371681,0.772181,0.767974,1070,55,300,105,0.500452
3,RUSBoost,10.0,0.907623,0.466667,0.604938,0.526882,0.748258,0.712418,845,280,160,245,0.50712
4,RUSBoost,50.0,0.849278,0.545455,0.666667,0.6,0.774403,0.764706,900,225,135,270,0.500418
5,RUSBoost,100.0,0.859391,0.531915,0.617284,0.571429,0.792099,0.754902,905,220,155,250,0.500107
6,SMOTEBoost,10.0,0.867643,0.513158,0.481481,0.496815,0.72524,0.74183,940,185,210,195,0.505439
7,SMOTEBoost,50.0,0.847619,0.569231,0.45679,0.506849,0.758848,0.764706,985,140,220,185,0.50006
8,SMOTEBoost,100.0,0.852323,0.547619,0.567901,0.557576,0.75251,0.761438,935,190,175,230,0.500287
9,RBBoost,10.0,0.883766,0.0,0.0,0.0,0.584856,0.735294,1125,0,405,0,0.37007


#### 7.2.1) Sort by Recall

Sort the results based on the number of Recall metric.

In [44]:
train_results.sort_values(by=['Recall'], ascending=False)

Unnamed: 0,Algorithm,T,Loss,Precision,Recall,F1score,AUC,Accuracy,TN,FP,FN,TP,Margin
4,RUSBoost,50.0,0.849278,0.545455,0.666667,0.6,0.774403,0.764706,900,225,135,270,0.500418
5,RUSBoost,100.0,0.859391,0.531915,0.617284,0.571429,0.792099,0.754902,905,220,155,250,0.500107
3,RUSBoost,10.0,0.907623,0.466667,0.604938,0.526882,0.748258,0.712418,845,280,160,245,0.50712
8,SMOTEBoost,100.0,0.852323,0.547619,0.567901,0.557576,0.75251,0.761438,935,190,175,230,0.500287
6,SMOTEBoost,10.0,0.867643,0.513158,0.481481,0.496815,0.72524,0.74183,940,185,210,195,0.505439
11,RBBoost,100.0,0.834158,0.612903,0.469136,0.531469,0.756159,0.781046,1005,120,215,190,0.457249
7,SMOTEBoost,50.0,0.847619,0.569231,0.45679,0.506849,0.758848,0.764706,985,140,220,185,0.50006
10,RBBoost,50.0,0.842099,0.608696,0.345679,0.440945,0.760658,0.767974,1035,90,265,140,0.482609
12,SVM,,0.820684,0.69697,0.283951,0.403509,0.79166,0.777778,1075,50,290,115,0.5
1,AdaBoostM2,50.0,0.845163,0.636364,0.259259,0.368421,0.76952,0.764706,1065,60,300,105,0.500188


#### 7.2.2) Sort by F-score

Sort the results based on the F-score metric.

In [45]:
train_results.sort_values(by=['F1score'], ascending=False)

Unnamed: 0,Algorithm,T,Loss,Precision,Recall,F1score,AUC,Accuracy,TN,FP,FN,TP,Margin
4,RUSBoost,50.0,0.849278,0.545455,0.666667,0.6,0.774403,0.764706,900,225,135,270,0.500418
5,RUSBoost,100.0,0.859391,0.531915,0.617284,0.571429,0.792099,0.754902,905,220,155,250,0.500107
8,SMOTEBoost,100.0,0.852323,0.547619,0.567901,0.557576,0.75251,0.761438,935,190,175,230,0.500287
11,RBBoost,100.0,0.834158,0.612903,0.469136,0.531469,0.756159,0.781046,1005,120,215,190,0.457249
3,RUSBoost,10.0,0.907623,0.466667,0.604938,0.526882,0.748258,0.712418,845,280,160,245,0.50712
7,SMOTEBoost,50.0,0.847619,0.569231,0.45679,0.506849,0.758848,0.764706,985,140,220,185,0.50006
6,SMOTEBoost,10.0,0.867643,0.513158,0.481481,0.496815,0.72524,0.74183,940,185,210,195,0.505439
10,RBBoost,50.0,0.842099,0.608696,0.345679,0.440945,0.760658,0.767974,1035,90,265,140,0.482609
12,SVM,,0.820684,0.69697,0.283951,0.403509,0.79166,0.777778,1075,50,290,115,0.5
2,AdaBoostM2,100.0,0.843039,0.65625,0.259259,0.371681,0.772181,0.767974,1070,55,300,105,0.500452


#### 7.2.3) Sort by Loss

Sort the results based on the value of Loss.

In [46]:
train_results.sort_values(by=['Loss'])

Unnamed: 0,Algorithm,T,Loss,Precision,Recall,F1score,AUC,Accuracy,TN,FP,FN,TP,Margin
12,SVM,,0.820684,0.69697,0.283951,0.403509,0.79166,0.777778,1075,50,290,115,0.5
11,RBBoost,100.0,0.834158,0.612903,0.469136,0.531469,0.756159,0.781046,1005,120,215,190,0.457249
10,RBBoost,50.0,0.842099,0.608696,0.345679,0.440945,0.760658,0.767974,1035,90,265,140,0.482609
2,AdaBoostM2,100.0,0.843039,0.65625,0.259259,0.371681,0.772181,0.767974,1070,55,300,105,0.500452
1,AdaBoostM2,50.0,0.845163,0.636364,0.259259,0.368421,0.76952,0.764706,1065,60,300,105,0.500188
7,SMOTEBoost,50.0,0.847619,0.569231,0.45679,0.506849,0.758848,0.764706,985,140,220,185,0.50006
4,RUSBoost,50.0,0.849278,0.545455,0.666667,0.6,0.774403,0.764706,900,225,135,270,0.500418
8,SMOTEBoost,100.0,0.852323,0.547619,0.567901,0.557576,0.75251,0.761438,935,190,175,230,0.500287
5,RUSBoost,100.0,0.859391,0.531915,0.617284,0.571429,0.792099,0.754902,905,220,155,250,0.500107
6,SMOTEBoost,10.0,0.867643,0.513158,0.481481,0.496815,0.72524,0.74183,940,185,210,195,0.505439


In [None]:
train_results.sort_values(by=['Loss'])

Unnamed: 0,Algorithm,T,Loss,Precision,Recall,F1score,AUC,Accuracy,TN,FP,FN,TP,Margin
12,SVM,,0.820684,0.69697,0.283951,0.403509,0.79166,0.777778,1075,50,290,115,0.5
11,RBBoost,100.0,0.834158,0.612903,0.469136,0.531469,0.756159,0.781046,1005,120,215,190,0.457249
10,RBBoost,50.0,0.842099,0.608696,0.345679,0.440945,0.760658,0.767974,1035,90,265,140,0.482609
2,AdaBoostM2,100.0,0.843039,0.65625,0.259259,0.371681,0.772181,0.767974,1070,55,300,105,0.500452
1,AdaBoostM2,50.0,0.845163,0.636364,0.259259,0.368421,0.76952,0.764706,1065,60,300,105,0.500188
7,SMOTEBoost,50.0,0.847619,0.569231,0.45679,0.506849,0.758848,0.764706,985,140,220,185,0.50006
4,RUSBoost,50.0,0.849278,0.545455,0.666667,0.6,0.774403,0.764706,900,225,135,270,0.500418
8,SMOTEBoost,100.0,0.852323,0.547619,0.567901,0.557576,0.75251,0.761438,935,190,175,230,0.500287
5,RUSBoost,100.0,0.859391,0.531915,0.617284,0.571429,0.792099,0.754902,905,220,155,250,0.500107
6,SMOTEBoost,10.0,0.867643,0.513158,0.481481,0.496815,0.72524,0.74183,940,185,210,195,0.505439


#### 7.2.4) Precision/Recall Plot

In [55]:
fig = go.Figure(data=[
    go.Bar(name='Precision', x=train_results['Algorithm'], y=train_results['Precision']),
    go.Bar(name='Recall', x=train_results['Algorithm'], y=train_results['Recall'])
])
# Change the bar mode
fig.update_layout(barmode='group')
fig.show()

## 8) Conclusion

First of all, we should consider that the dataset used in this project, Haberman, is a hard dataset [[10]](https://arxiv.org/ftp/arxiv/papers/1703/1703.08283.pdf). As the referenced paper proved, many classification methods couldn't achieve well-done results in terms of AUC and F-score metrics. Also, we can expect this complexity for the Haberman dataset if we look at the data shown in section 1.2, it has just three features and the positive and negative examples are really mixed together.

As the data relates to a disease detection task, and the minority class labeled as positve, the Recall metric is crucial for us. The tables in section 7.1.1 and 7.2.1 show that the RUSBoost with any chosen T outperforms any other algorithms in terms of Recall. RUSBoost also achieves a better F1-score than other algorithms. After RUSBOOST, the SMOTEBoost achieves noticeable results based on most metrics.

SVM performs better in terms of precision, which means that SVM reaches a better classifier on the majority class. So it's perfectly shown that SVM couldn't consider the imbalance between the data classes. But in general, it outperforms the AdaBoost.M2 for this specific dataset. It also predicts the training samples with higher scores, as shown in the histogram plot of the 6.2 section.

RBBoost performs better when the number of classifiers in the ensemble (T) increases. It also has a lower loss than most of the algorithms. But as sometimes happens that Random Balance makes artificial samples from the majority class, this method doesn't reach a better performance than the RUSBosst and SMOTEBoost.

The plot in the 7.1.4 and 7.2.4 sections shows that in RUSBoost, we had a greater recall than precision. The reason is RUSBoost doesn't consider many of the majority data in each iteration, so it focuses on minority data. But if we want a trade-off between precision and recall, SMOTEBoost performs reasonably well. It considers all the majority data, and also over sampled minority data.