<a href="https://colab.research.google.com/github/kochlisGit/Advanced-ML/blob/main/Active_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Title**

Assignment 2 - Multi-Label Learning

**Course**

Advanced Machine Learning Topics - Master in Artificial Intelligence

**Authors**:

1.   Anastasia Papadopoulou
2.   Vasileios Kochliaridis

In this notebook, we explore some active learning techniques. Labeling large volumes of data costs both time and money, so active learning is a valuable option when plenty of unlabeled data is available and we want to label only the most informative instances.

We split our dataset into an unlabeled pool and a test set. The active learner then picks the most useful samples from the pool using an uncertainty-based query strategy. For this purpose, we use the **modAL** library to apply the following techniques:

1.   **Uncertainty Sampling**
2.   **Margin Sampling**
3.   **Entropy Sampling**

Then we will compare the above strategies with a **random sampling strategy**.
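
The three strategies differ only in how they score a sample's predicted class probabilities. The sketch below uses the usual textbook definitions written in plain NumPy; it is an illustration, not modAL's own code, and the helper names and the `proba` array are hypothetical.

```python
import numpy as np

def least_confidence(proba):
    """1 - probability of the most likely class (large = uncertain)."""
    return 1.0 - proba.max(axis=1)

def margin(proba):
    """Gap between the two largest class probabilities (small = uncertain)."""
    ordered = np.sort(proba, axis=1)
    return ordered[:, -1] - ordered[:, -2]

def entropy(proba):
    """Shannon entropy of the predicted distribution (large = uncertain)."""
    clipped = np.clip(proba, 1e-12, 1.0)  # avoid log(0)
    return -(clipped * np.log(clipped)).sum(axis=1)

# Hypothetical predicted probabilities for four samples and two classes.
proba = np.array([
    [0.95, 0.05],
    [0.55, 0.45],
    [0.70, 0.30],
    [0.20, 0.80],
])

# Each strategy queries the sample it considers most uncertain.
pick_uncertainty = int(np.argmax(least_confidence(proba)))
pick_margin = int(np.argmin(margin(proba)))
pick_entropy = int(np.argmax(entropy(proba)))
print(pick_uncertainty, pick_margin, pick_entropy)  # prints: 1 1 1
```

On a two-class problem, all three scores are monotone in the largest class probability, so they rank the samples identically. They can disagree only when there are three or more classes.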

In [None]:
!pip install modAL

Collecting modAL
  Downloading modAL-0.4.1-py3-none-any.whl (27 kB)
Installing collected packages: modAL
Successfully installed modAL-0.4.1


In [None]:
from sklearn import datasets

# Load the breast cancer dataset as pandas objects.
cancer_data = datasets.load_breast_cancer(as_frame=True)
inputs = cancer_data['data']
targets = cancer_data['target']

inputs

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,0.07871,...,25.380,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890
1,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,0.05667,...,24.990,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902
2,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,0.05999,...,23.570,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,0.09744,...,14.910,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300
4,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,0.05883,...,22.540,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,0.05623,...,25.450,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115
565,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,0.05533,...,23.690,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637
566,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,0.05648,...,18.980,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820
567,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,0.07016,...,25.740,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400


In [None]:
targets

0      0
1      0
2      0
3      0
4      0
      ..
564    0
565    0
566    0
567    0
568    1
Name: target, Length: 569, dtype: int64

In [None]:
from sklearn.model_selection import train_test_split

RANDOM_STATE = 0

inputs = inputs.to_numpy()
targets = targets.to_numpy()
x_train, x_test, y_train, y_test = train_test_split(inputs, targets, test_size=0.5, train_size=0.5, random_state=RANDOM_STATE, shuffle=True)
x_train.shape, y_train.shape, x_test.shape, y_test.shape

((284, 30), (284,), (285, 30), (285,))

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics
from modAL.models import ActiveLearner
from modAL.uncertainty import uncertainty_sampling, margin_sampling, entropy_sampling

N_QUERIES = 50

query_strategies = {
    'Uncertainty Sampling': uncertainty_sampling,
    'Margin Sampling': margin_sampling,
    'Entropy Sampling': entropy_sampling
}

for query_strategy_name, query_strategy in query_strategies.items():
  print('\n--- Using {} strategy ---'.format(query_strategy_name))

  classifiers = {
      'Random Forest Classifier': RandomForestClassifier(random_state=RANDOM_STATE),
      'Support Vector Classifier': SVC(probability=True, random_state=RANDOM_STATE),
      'Naive Bayes Classifier': GaussianNB()
  }

  for classifier_name, classifier in classifiers.items():
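    # The learner is fitted on the full training set up front.
    learner = ActiveLearner(
        estimator=classifier,
        query_strategy=query_strategy,
        X_training=x_train, y_training=y_train
    )

    selected_samples = []

    # Every query draws from the same, never-shrinking pool,
    # so an index can be selected more than once.
    for i in range(N_QUERIES):
      query_idx, query_instance = learner.query(x_train)
      selected_samples.append(query_idx[0])
      learner.teach(query_instance, y_train[query_idx])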

    y_pred = learner.predict(x_test)
    confusion_matrix = metrics.confusion_matrix(y_test, y_pred).T

    print('\nEvaluating {}'.format(classifier_name))
    print(metrics.classification_report(y_test, y_pred))

    selected_samples.sort()
    print('\nSamples selected:', selected_samples)



--- Using Uncertainty Sampling strategy ---

Evaluating Random Forest Classifier
              precision    recall  f1-score   support

           0       0.96      0.93      0.94       101
           1       0.96      0.98      0.97       184

    accuracy                           0.96       285
   macro avg       0.96      0.95      0.96       285
weighted avg       0.96      0.96      0.96       285


Samples selected: [15, 15, 24, 30, 37, 39, 42, 43, 46, 46, 49, 52, 52, 53, 62, 66, 77, 95, 95, 96, 96, 102, 117, 118, 131, 141, 141, 150, 150, 152, 154, 164, 165, 167, 168, 170, 170, 211, 222, 224, 226, 235, 239, 250, 251, 251, 256, 262, 265, 276]

Evaluating Support Vector Classifier
              precision    recall  f1-score   support

           0       0.99      0.77      0.87       101
           1       0.89      0.99      0.94       184

    accuracy                           0.92       285
   macro avg       0.94      0.88      0.90       285
weighted avg       0.92      0.9

Now let's try random sampling (training on 50 randomly chosen samples) and no sampling at all (training on the entire training set).

In [None]:
import random

random.seed(RANDOM_STATE)
random_indices = random.sample(range(x_train.shape[0]), k=N_QUERIES)

x_random = x_train[random_indices]
y_random = y_train[random_indices]

classifiers = {
    'Random Forest Classifier': RandomForestClassifier(random_state=RANDOM_STATE),
    'Support Vector Classifier': SVC(probability=True, random_state=RANDOM_STATE),
    'Naive Bayes Classifier': GaussianNB()
}

for classifier_name, classifier in classifiers.items():
  classifier.fit(x_random, y_random)
  y_pred = classifier.predict(x_test)
  confusion_matrix = metrics.confusion_matrix(y_test, y_pred).T

  print('\nEvaluating {}'.format(classifier_name))
  print(metrics.classification_report(y_test, y_pred))


Evaluating Random Forest Classifier
              precision    recall  f1-score   support

           0       0.87      0.86      0.87       101
           1       0.92      0.93      0.93       184

    accuracy                           0.91       285
   macro avg       0.90      0.90      0.90       285
weighted avg       0.91      0.91      0.91       285


Evaluating Support Vector Classifier
              precision    recall  f1-score   support

           0       0.99      0.73      0.84       101
           1       0.87      0.99      0.93       184

    accuracy                           0.90       285
   macro avg       0.93      0.86      0.88       285
weighted avg       0.91      0.90      0.90       285


Evaluating Naive Bayes Classifier
              precision    recall  f1-score   support

           0       0.90      0.89      0.90       101
           1       0.94      0.95      0.94       184

    accuracy                           0.93       285
   macro avg      

In [None]:
classifiers = {
    'Random Forest Classifier': RandomForestClassifier(random_state=RANDOM_STATE),
    'Support Vector Classifier': SVC(probability=True, random_state=RANDOM_STATE),
    'Naive Bayes Classifier': GaussianNB()
}

for classifier_name, classifier in classifiers.items():
  classifier.fit(x_train, y_train)
  y_pred = classifier.predict(x_test)
  confusion_matrix = metrics.confusion_matrix(y_test, y_pred).T

  print('\nEvaluating {}'.format(classifier_name))
  print(metrics.classification_report(y_test, y_pred))


Evaluating Random Forest Classifier
              precision    recall  f1-score   support

           0       0.94      0.93      0.94       101
           1       0.96      0.97      0.96       184

    accuracy                           0.95       285
   macro avg       0.95      0.95      0.95       285
weighted avg       0.95      0.95      0.95       285


Evaluating Support Vector Classifier
              precision    recall  f1-score   support

           0       0.99      0.77      0.87       101
           1       0.89      0.99      0.94       184

    accuracy                           0.92       285
   macro avg       0.94      0.88      0.90       285
weighted avg       0.92      0.92      0.91       285


Evaluating Naive Bayes Classifier
              precision    recall  f1-score   support

           0       0.92      0.90      0.91       101
           1       0.95      0.96      0.95       184

    accuracy                           0.94       285
   macro avg      

Below, we present the **accuracy** results of *Active Learning vs Random Sampling and No Sampling*.

Classifier     | No Sampling | Random | Uncertainty | Margin | Entropy |
---------------|-------------|--------|-------------|--------|---------|
Random Forest  | 0.95        | 0.91   | 0.96        | 0.96   | 0.96    |
Support Vector | 0.92        | 0.90   | 0.92        | 0.92   | 0.92    |
Naive Bayes    | 0.94        | 0.93   | 0.94        | 0.94   | 0.94    |

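Random sampling has the worst accuracy for every classifier, which is expected because those models are trained on only 50 samples. The active learning methods perform as well as training on the entire training set, and all three query strategies reach the same accuracy. Because each active learner was initialized with the full training set before querying, the match with the no-sampling results is expected.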
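
In pool-based active learning, the learner usually starts from a small labeled seed, and each queried sample is removed from the pool so it cannot be selected twice. The sketch below shows such a loop with scikit-learn only. It is a hand-rolled least-confidence query, not modAL's implementation, and the seed size `N_INITIAL`, the stratified seed, and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

RANDOM_STATE = 0
N_INITIAL = 10   # hypothetical size of the labeled seed
N_QUERIES = 50   # labeling budget, as in the notebook

X, y = load_breast_cancer(return_X_y=True)
x_pool, x_test, y_pool, y_test = train_test_split(
    X, y, test_size=0.5, random_state=RANDOM_STATE, shuffle=True)

# Seed with a few samples per class so both classes are present from the start.
rng = np.random.default_rng(RANDOM_STATE)
seed_idx = np.concatenate([
    rng.choice(np.flatnonzero(y_pool == c), size=N_INITIAL // 2, replace=False)
    for c in np.unique(y_pool)
])
labeled = seed_idx.tolist()
in_seed = set(labeled)
unlabeled = [i for i in range(len(x_pool)) if i not in in_seed]

model = RandomForestClassifier(random_state=RANDOM_STATE)
for _ in range(N_QUERIES):
    model.fit(x_pool[labeled], y_pool[labeled])
    proba = model.predict_proba(x_pool[unlabeled])
    # Least-confidence query: pick the sample whose top class is least certain.
    pick = int(np.argmax(1.0 - proba.max(axis=1)))
    # Move the queried sample from the pool to the labeled set.
    labeled.append(unlabeled.pop(pick))

# Final fit on everything labeled so far, then evaluate on the held-out test set.
model.fit(x_pool[labeled], y_pool[labeled])
accuracy = model.score(x_test, y_test)
print('Labeled samples:', len(labeled), 'Test accuracy:', round(accuracy, 2))
```

With this setup the model sees only `N_INITIAL + N_QUERIES` labels, so its accuracy can be compared against a random-sampling baseline given the same labeling budget.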