Let's say that we have a dataset that we'd like to use to benchmark active learning strategies.

In [1]:
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100_000, n_classes=4, n_informative=10)

We will use modAL to handle some of the heavy lifting. That means that we also need to write a compatible sampler if we want to have a random benchmark to compare against.

In [2]:
import numpy as np
from modAL.uncertainty import entropy_sampling
from sklearn.linear_model import LogisticRegression

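def randomly(classifier, X, n_instances=1):
    # Random benchmark: ignore the classifier and pick rows uniformly at random.
    # np.random.randint excludes the upper bound, so pass X.shape[0] to keep the last row reachable.
    idx = np.random.randint(0, X.shape[0], n_instances)
    # Return the chosen indices together with the matching rows.
    return idx, X[idx]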
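
For contrast with the random sampler, here is a minimal sketch of the idea behind entropy-based sampling, written in plain NumPy. It is not modAL's implementation; the names `entropy_of_rows` and `entropy_sampling_sketch` are made up for illustration, and the function simply reuses the `(classifier, X, n_instances)` call shape of `randomly`. It assumes the classifier exposes `predict_proba`.

```python
import numpy as np

def entropy_of_rows(proba):
    """Shannon entropy of each row of a (n_samples, n_classes) probability array."""
    proba = np.asarray(proba, dtype=float)
    # Clip before taking the log so that zero probabilities contribute 0 instead of nan.
    safe = np.clip(proba, 1e-12, 1.0)
    return -(proba * np.log(safe)).sum(axis=1)

def entropy_sampling_sketch(classifier, X, n_instances=1):
    """Pick the n_instances rows whose predicted class distribution is most uncertain."""
    scores = entropy_of_rows(classifier.predict_proba(X))
    # argsort is ascending: take the last n_instances and reverse for highest-entropy-first.
    idx = np.argsort(scores)[-n_instances:][::-1]
    return idx, X[idx]
```

A row predicted as 50/50 between two classes gets a higher entropy than one predicted as 90/10, so it is queried first. The random sampler never looks at the predictions at all, which is what makes it a useful reference point.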

We can now use the `run_experiment` function from the `skteach.py` script to automate the data collection.

In [3]:
from skteach import run_experiment

In [6]:
?run_experiment

Signature:
run_experiment(
    name,
    data,
    estimator,
    strategy,
    batch_size=100,
    n_batch=10,
    start_size=50,
    idx_start=None,
)
Docstring: <no docstring>
File:      ~/Development/scikit-teach/skteach.py
Type:      function
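
Since `run_experiment` has no docstring, it may help to see what a loop with that signature could look like. The sketch below is a guess, not the code in `skteach.py`: it is a plain scikit-learn loop under the hypothetical name `run_experiment_sketch`. The parameter names come from the signature printed by `?run_experiment`, and the record keys (`name`, `run_id`, `batch`, `score`) are the columns the chart later in this notebook uses. Everything else is an assumption: the random starting set, refitting from scratch after each batch, and scoring on the full dataset.

```python
import uuid

import numpy as np
from sklearn.base import clone

def run_experiment_sketch(name, data, estimator, strategy,
                          batch_size=100, n_batch=10, start_size=50, idx_start=None):
    """Hypothetical stand-in for run_experiment; the real skteach.py may differ."""
    X, y = data
    run_id = uuid.uuid4().hex  # assumption: one id per call, so runs can be told apart

    # Assumption: begin with a random labelled seed unless start indices are given.
    if idx_start is None:
        idx_start = np.random.choice(X.shape[0], size=start_size, replace=False)
    labelled = np.zeros(X.shape[0], dtype=bool)
    labelled[idx_start] = True

    model = clone(estimator).fit(X[labelled], y[labelled])
    records = []
    for batch in range(n_batch):
        pool_idx = np.flatnonzero(~labelled)
        # The strategy returns positions within the pool it was handed.
        picked, _ = strategy(model, X[pool_idx], n_instances=batch_size)
        labelled[pool_idx[picked]] = True
        # Refit from scratch on everything labelled so far.
        model = clone(estimator).fit(X[labelled], y[labelled])
        # Assumption: score on the full dataset; a held-out split would be stricter.
        records.append({"name": name, "run_id": run_id,
                        "batch": batch, "score": model.score(X, y)})
    return records
```

Returning a list of dicts keeps the caller simple: several runs can be concatenated with `+=` and passed straight to `pd.DataFrame`.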


In [4]:
data = []
for i in range(5): 
    data += run_experiment("entropy", data=(X, y), estimator=LogisticRegression(), strategy=entropy_sampling, n_batch=50)
for i in range(5):
    data += run_experiment("randomly", data=(X, y), estimator=LogisticRegression(), strategy=randomly, n_batch=50)

100%|█████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:06<00:00,  7.59it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:06<00:00,  7.26it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:06<00:00,  7.36it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:06<00:00,  7.31it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:06<00:00,  7.30it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:03<00:00, 14.02it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:03<00:00, 13.24it/s]

We can now compare the scores in an Altair chart, drawing one line per run.

In [5]:
import pandas as pd
import altair as alt

pltr = pd.DataFrame(data)

(alt.Chart(pltr)
  .mark_line()
  .encode(x='batch', y='score', color='name', detail='run_id')
  .properties(width=600, height=250)
  .interactive())

Lesson one in active learning: random sampling is a surprisingly strong benchmark.