### _Iris_ Active Learning example

This notebook shows a simple example of how to run an Active Learning workflow using modAL and the query strategies in this repository. We use the [_Iris_](https://archive.ics.uci.edu/dataset/53/iris) dataset, loaded from _sklearn_.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

from activelearning.AL_cycle import cycle_AL, plot_results, strategy_comparison
from activelearning.queries.representative.coreset_query import query_coreset
from activelearning.queries.representative.density_query import query_density
from activelearning.queries.representative.kmeans_query import query_kmeans_foreach
from activelearning.queries.representative.random_query import query_random

np.random.seed(100)

In [None]:
iris = load_iris()

X = iris["data"]  # attributes
y = iris["target"]  # labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

X_pool, y_pool = X_train, y_train
# uncomment the next line to split the training set into an initial labeled set and a pool
# X_start, X_pool, y_start, y_pool = train_test_split(X_train, y_train, test_size = 0.8, random_state = 0)

After splitting the data into a training set and a test set, we evaluate the performance of a random forest classifier trained on the complete training set. Its test accuracy serves as the reference metric for the active learning query strategies: the aim is to reach the same accuracy with fewer labeled instances.

In [None]:
RF_mod = RandomForestClassifier()
RF_mod.fit(X_train, y_train)

goal_acc = RF_mod.score(X_test, y_test)
print(goal_acc)

First we run the Active Learning cycle for pool-based sampling using the *cycle_AL* function. Any pool-based query strategy from modAL or from this repository can be passed to it, along with other parameters such as the number of data points to query at each iteration.

In [None]:
accuracies, instances = cycle_AL(
    X_train=None,
    y_train=None,
    X_pool=X_pool,
    y_pool=y_pool,
    X_test=X_test,
    y_test=y_test,
    classifier="randomforest",
    query_strategy=query_random,
    goal_acc=goal_acc,
)
# accuracies contains the accuracy score at each iteration
# instances contains the count of labeled instances at each iteration
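To make the pool-based cycle more concrete, here is a minimal sketch of such a loop written with scikit-learn only. It is not the implementation of *cycle_AL*: the helper name `random_pool_cycle`, its parameters, and the early stop at `goal_acc` are assumptions made for illustration, and the query step is plain random sampling.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def random_pool_cycle(X_pool, y_pool, X_test, y_test, n_instances=3, goal_acc=None, seed=0):
    """Hypothetical helper: pool-based loop with random querying (not cycle_AL itself)."""
    rng = np.random.default_rng(seed)
    unlabeled = np.arange(X_pool.shape[0])  # indices still in the pool
    labeled = np.array([], dtype=int)  # indices whose labels have been requested
    accuracies, instances = [], []

    while unlabeled.size > 0:
        # query step: pick a batch from the pool (here uniformly at random)
        batch = rng.choice(unlabeled, size=min(n_instances, unlabeled.size), replace=False)
        labeled = np.concatenate([labeled, batch])
        unlabeled = np.setdiff1d(unlabeled, batch)

        # teach step: refit on everything labeled so far, then score on the test set
        model = RandomForestClassifier(random_state=seed)
        model.fit(X_pool[labeled], y_pool[labeled])
        accuracies.append(model.score(X_test, y_test))
        instances.append(labeled.size)

        # assumed stopping rule: stop once the reference accuracy is reached
        if goal_acc is not None and accuracies[-1] >= goal_acc:
            break

    return accuracies, instances


# separate variable names so the notebook's own X_pool, X_test, etc. are not overwritten
X_demo, y_demo = load_iris(return_X_y=True)
X_p, X_t, y_p, y_t = train_test_split(X_demo, y_demo, test_size=0.33, random_state=0)
demo_acc, demo_inst = random_pool_cycle(X_p, y_p, X_t, y_t, n_instances=3)
```

The two returned lists mirror the shape described above: one accuracy score and one labeled-instance count per iteration. Replacing the random choice in the query step with an informed selection is where the query strategies differ.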

In [None]:
# visualize result

plt.plot(instances, accuracies)
plt.title("Accuracy over instances")
plt.xlabel("Instances")
plt.ylabel("Accuracy")
plt.grid(True)
plt.axhline(y=goal_acc, color="y", linestyle="--")

plt.show()
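The plot can also be summarized as a single number: the smallest count of labeled instances at which the accuracy first reaches the reference value. The sketch below assumes that `accuracies` and `instances` are equal-length sequences ordered by iteration, as described in the comments of the previous cell. `labels_to_reach` is a hypothetical helper, not part of this repository, and the toy values are illustrative only.

```python
def labels_to_reach(accuracies, instances, goal_acc):
    """Return the first labeled-instance count whose accuracy is >= goal_acc, or None."""
    for acc, n in zip(accuracies, instances):
        if acc >= goal_acc:
            return n
    return None  # the goal accuracy was never reached


# toy values for illustration only, not results from this notebook
toy_acc = [0.40, 0.72, 0.90, 0.94, 0.96]
toy_inst = [3, 6, 9, 12, 15]
print(labels_to_reach(toy_acc, toy_inst, goal_acc=0.94))  # 12
```

Calling `labels_to_reach(accuracies, instances, goal_acc)` on the output of the cycle gives a quick way to compare that count with the size of the full training set.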

To compare how different query strategies perform, we can use the *strategy_comparison* function and pass it the strategies to evaluate. We can also pass more than one value for the number of instances, to check whether the batch size influences performance. *plot_results* can be used to plot the output of *strategy_comparison* immediately, or a custom graph can be created from the scores data frame.

In [None]:
n_instances = [3]  # use [3, 9, 18] to compare different batch sizes

scores = strategy_comparison(
    X_train=None,
    y_train=None,
    X_pool=X_pool,
    y_pool=y_pool,
    X_test=X_test,
    y_test=y_test,
    classifier="randomforest",
    query_strategies=[query_kmeans_foreach, query_density, query_coreset, query_random],
    n_instances=n_instances,
    K=3,  # number of clusters for k-means query
    metric="euclidean",  # metric for density query
    goal_acc=goal_acc,
)

In [None]:
plot_results(
    scores,  # output data frame from strategy_comparison
    n_instances=n_instances,
    tot_samples=X_pool.shape[0],  # size of the original training set, for scale
    goal_acc=goal_acc,
    figsize=(7, 4),
)