In [1]:
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import train_test_split
import numpy as np

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

from active_tester import ActiveTester
from active_tester.estimators.naive import Naive
from active_tester.query_strategy.random import Random
from active_tester.estimators.learned import Learned
from active_tester.label_estimation.methods import oracle_one_label

## Active Testing Overview

The basic idea of active testing (see __[Active Testing: An Efficient and Robust Framework for Estimating Accuracy](https://icml.cc/Conferences/2018/Schedule?showEvent=2681)__) is to intelligently update the labels of a noisily-labeled test set to improve estimation of a performance metric for a particular system.  The noisy labels may come, for example, from crowdsourced workers and as a result are not expected to have high accuracy.  The required inputs to use active testing are:
- A system under test
- A performance metric (accuracy, precision, recall, etc.) of interest
- A test dataset where each item has at least one noisy label and a score from the system under test
- Access to a vetter that can provide (hopefully) high quality labels

Active testing has two main steps.  In the first step, items from the test dataset are queried (according to some query_strategy) and sent to the vetter to receive a label.  In the second step, some combination of the system scores, the noisy labels, the vetted labels, and the features are used to estimate the performance metric of interest.
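
To make the two steps concrete, here is a minimal toy simulation that uses only NumPy and does not call this package.  It assumes uniformly random querying and an estimate that simply lets vetted labels override noisy ones.  The function name `simulate_active_testing`, the 0.9 system accuracy, and the 0.75 noisy label accuracy are illustrative assumptions, not the package's implementation.

```python
import numpy as np

def simulate_active_testing(budget, n_items=200, seed=0):
    """Toy simulation; returns (noisy, updated, true) accuracy estimates."""
    rng = np.random.default_rng(seed)
    # Labels a perfect vetter would provide (assumed binary)
    y_true = rng.integers(0, 2, size=n_items)
    # System predictions: correct about 90% of the time (assumed)
    system_pred = np.where(rng.random(n_items) < 0.9, y_true, 1 - y_true)
    # Noisy labels: correct about 75% of the time (assumed)
    y_noisy = np.where(rng.random(n_items) < 0.75, y_true, 1 - y_true)

    # Step 1: query `budget` items uniformly at random and vet them
    queried = rng.choice(n_items, size=budget, replace=False)
    y_updated = y_noisy.copy()
    y_updated[queried] = y_true[queried]  # vetted labels override noisy ones

    # Step 2: estimate the metric from the combined labels
    acc_noisy = float(np.mean(system_pred == y_noisy))
    acc_updated = float(np.mean(system_pred == y_updated))
    acc_true = float(np.mean(system_pred == y_true))
    return acc_noisy, acc_updated, acc_true

acc_noisy, acc_updated, acc_true = simulate_active_testing(budget=50)
print(acc_noisy, acc_updated, acc_true)
```

With a budget of 0 the updated estimate equals the noisy one, and with a budget equal to the dataset size it equals the true accuracy.  The query strategies in this package aim to get close to the true value with far fewer vetted labels.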

This package implements a variety of query strategies and two metric estimation strategies.  Other notebooks will discuss these in more detail.

## Example of basic use

Running active testing in this package requires three setup steps.

First, we initialize the `ActiveTester` object by passing an estimator object and a query strategy object through its `estimator` and `query_strategy` parameters.  Here we use the `Naive` estimator and the `Random` query strategy.  The details of these objects are explained in other notebooks.  The only relevant point here is that the estimator object requires us to specify the metric of interest.  Here we use accuracy.

```python
active_test = ActiveTester(estimator=Naive(metric=accuracy_score),
                           query_strategy=Random())
```

Next, we call `standardize_data`, which formats the relevant data for the active testing routines.  Here, we need to set `X`, the test dataset; `classes`, the names of the classes; and `Y_noisy`, the noisy labels from the experts.  Below is some additional info on the expected format for these parameters.  Later, we will discuss some additional parameters as well.
* `X` should be a (number of items) x (number of features) array.
* `classes` should be a list of strings that represent the class names
* `Y_noisy` should be a (number of items) x (number of experts) array of the noisy labels.  If there is only 1 expert, note that the shape of the array must be (number of items) x 1.  If an expert does not provide a label for a particular item, a -1 should be used as a placeholder.

```python
active_test.standardize_data(X=x, 
                             classes=c, 
                             Y_noisy=y)
```
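
As an illustration of the `Y_noisy` format described above, the following sketch builds the array for several experts, using -1 where an expert gave no label, and shows the reshape needed for a single expert.  The expert labels and variable names are made up for the example.

```python
import numpy as np

# Three hypothetical experts labeling four items; None means "no label given"
expert_labels = [
    [0, 1, None, 0],     # expert A
    [0, None, 1, 1],     # expert B
    [None, 1, 1, 0],     # expert C
]

# Replace missing labels with -1, then transpose so that
# rows are items and columns are experts
Y_noisy_demo = np.array(
    [[-1 if lab is None else lab for lab in labels] for labels in expert_labels],
    dtype=int,
).T
print(Y_noisy_demo.shape)  # (4, 3): 4 items, 3 experts

# With a single expert, a flat vector must become a column
single_expert = np.array([0, 1, 1, 0]).reshape(-1, 1)
print(single_expert.shape)  # (4, 1)
```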

Finally, we call `gen_model_predictions`, and pass it the model we wish to evaluate.
```python
active_test.gen_model_predictions(model)
```

In the following cells, we construct a training and test dataset, build a classifier, and construct some noisy labels.

### Generate dataset

In [2]:
# Features
X0 = np.random.randn(100,2)
X1 = np.random.randn(100,2) + 2
# Labels
y0 = np.zeros(100)
y1 = np.ones(100)
# Stack together and split into train and test sets
X = np.vstack((X0,X1))
y = np.hstack((y0,y1))
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.5)

### Build classifier and compute true performance on the test dataset

In [3]:
# Train a logistic regression classifier 
model = LogisticRegression(solver='lbfgs')
model.fit(X_train,y_train)
# Predict labels for the test set and compute the true accuracy of the classifier on the test set
label_predictions = model.predict(X_test)
true_accuracy = accuracy_score(y_test,label_predictions)
print(true_accuracy)

0.89


### Generate noisy labels

In [4]:
y_noisy = []
noisy_label_accuracy = 0.75
for i in range(len(y_test)):
    if np.random.rand() < noisy_label_accuracy:
        # noisy label is correct
        y_noisy.append(y_test[i])
    else:
        # noisy label is incorrect
        y_noisy.append(np.abs(1-y_test[i]))
y_noisy = np.asarray(y_noisy, dtype=int)
y_noisy = np.reshape(y_noisy,(len(y_noisy),1)) # Remember that this shape is important!
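
The same noisy labels can also be generated without an explicit loop.  The sketch below is an equivalent vectorized variant for binary 0/1 labels; the helper name `make_noisy_labels` and the seeded generator are our own additions for reproducibility.

```python
import numpy as np

def make_noisy_labels(y_true, label_accuracy, rng):
    """Flip each binary label with probability 1 - label_accuracy."""
    y_true = np.asarray(y_true, dtype=int)
    # True where the noisy label should match the true label
    keep = rng.random(len(y_true)) < label_accuracy
    # Return a column vector: (number of items) x 1
    return np.where(keep, y_true, 1 - y_true).reshape(-1, 1)

rng = np.random.default_rng(42)
y_demo = np.array([0, 1, 1, 0, 1])
y_noisy_demo = make_noisy_labels(y_demo, 0.75, rng)
print(y_noisy_demo.shape)  # (5, 1)
```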

### Setup active testing

Now that we've set up a simple dataset, we can initialize the `ActiveTester` object, and then call `standardize_data` and `gen_model_predictions`.

In [5]:
active_test = ActiveTester(estimator=Naive(metric=accuracy_score),
                           query_strategy=Random())
active_test.standardize_data(X=X_test, 
                             classes=['0', '1'], 
                             Y_noisy=y_noisy)
active_test.gen_model_predictions(model)

### Collect labels from vetter

Now we are ready to interactively collect labels from the vetter.  To do this, we call `query_vetted`, which has a few options:
* `interactive`: set to True to interactively vet labels
* `budget`: the number of items to vet
* `batch_size`: used by some of the estimators to control how often internal models are retrained.
* `raw`: files to display corresponding to items 
* `visualizer`: function that processes the raw features in X and returns a dictionary of output

Only the first two are important at the moment.  Running the cell below will select items and request a label from the user.

In [7]:
active_test.query_vetted(interactive=True, budget=10)

Beginning preprocessing to find vetted labels of each class...
[-1.72575868 -2.8470473 ]
The available labels are: ['0', '1']
Label the provided item: 0


[3.73346014 4.06301183]
The available labels are: ['0', '1']
Label the provided item: 1


Completed preprocessing
Budget reduced from "10" to "8"
[ 0.44015372 -0.08109233]
The available labels are: ['0', '1']
Label the provided item: 1


[ 2.20254539 -0.65944231]
The available labels are: ['0', '1']
Label the provided item: 1


[2.29622029 2.09121062]
The available labels are: ['0', '1']
Label the provided item: 1


[-0.10799278  1.14608362]
The available labels are: ['0', '1']
Label the provided item: 1


[3.48585406 2.81144642]
The available labels are: ['0', '1']
Label the provided item: 1


[ 0.17955305 -0.40492947]
The available labels are: ['0', '1']
Label the provided item: 0


[ 1.22562061 -0.2740566 ]
The available labels are: ['0', '1']
Label the provided item: 1


[-0.34345773  0.77914681]
The available labels are: ['0', '

## Estimate performance

After we are done querying labels from the vetter, we can run the `test()` method to compute our estimated performance and then retrieve the result.  Below, we compare this result with the accuracy we would have estimated had we not updated the noisy labels.  Running `get_test_results()` returns a dictionary containing the results.  Its keys are
* `tester_metric`: estimated value for the performance metric
* `tester_labels`: estimated labels for the dataset
* `tester_probs`: label probabilities estimated by the Learned estimator (set to None if the Naive estimator is used)

In [8]:
active_test.test()
result = active_test.get_test_results()
print('True accuracy: '+ str(true_accuracy))
print('Predicted accuracy from active testing: '+ str(result['tester_metric']))
print('Predicted accuracy without using active testing: ' + str(accuracy_score(y_noisy, label_predictions)))

True accuracy: 0.89
Predicted accuracy from active testing: 0.65
Predicted accuracy without using active testing: 0.65
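
The gap between the true accuracy and the estimate from noisy labels is expected.  As a rough check, assume binary labels and label noise that is independent of the classifier's errors.  A prediction then agrees with the noisy label either when both are correct or when both are wrong, so the expected measured accuracy is a*p + (1-a)*(1-p), where a is the true accuracy and p is the noisy label accuracy.  The helper below is our own sketch of this calculation.

```python
def expected_noisy_accuracy(true_accuracy, label_accuracy):
    """Expected agreement between predictions and noisy binary labels,
    assuming label noise is independent of the classifier's errors."""
    both_correct = true_accuracy * label_accuracy
    both_wrong = (1 - true_accuracy) * (1 - label_accuracy)
    return both_correct + both_wrong

# Values from this notebook: true accuracy 0.89, noisy label accuracy 0.75
estimate = expected_noisy_accuracy(0.89, 0.75)
print(estimate)
```

This gives approximately 0.695, which is in the same range as the 0.65 observed on this 100-item test set.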


## Additional parameters for standardize_data

Below, we discuss the full set of parameters for `standardize_data()`:
* `rearrange` : boolean value that determines whether to shuffle the dataset.  This is False by default and should be set to False if a list of file names is passed to the raw parameter in `query_vetted()`
* `is_img_byte`: boolean value that marks whether the data in `X` can be displayed as an image
* `num` : number of samples to draw from the dataset.  This is set to -1, which uses all data.
* `X` : array of features (described above)
* `classes` : list of class names (described above)
* `Y_ground_truth` : known ground truth labels, if available.  This is primarily useful for comparing algorithms when ground truth is known without needing to interactively vet labels.
* `Y_vetted` : vetted labels that have already been gathered
* `Y_noisy` : the noisy labels (described above)

## Evaluating a model when only the predicted probabilities are available

If the model itself is not available to pass to `active_tester`, but the probabilities predicted by the model are, we can set these directly with `set_prob_array()` instead of calling `gen_model_predictions()`.  The format is expected to be a (number of items) x (number of classes) array containing the predicted probabilities.  Below we show an example where this array is produced by sklearn directly.
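
If the probabilities come from somewhere other than sklearn, for example as a single vector of positive-class probabilities from a binary model, they first need to be arranged in this format.  The following is a minimal sketch; the helper `to_prob_array` is hypothetical, and it assumes that column 0 corresponds to the first entry of `classes` and column 1 to the second.  The sklearn-based example follows after it.

```python
import numpy as np

def to_prob_array(p_positive):
    """Turn positive-class probabilities into an (items x 2) array."""
    p = np.asarray(p_positive, dtype=float)
    if np.any((p < 0) | (p > 1)):
        raise ValueError("probabilities must lie in [0, 1]")
    # Column 0: probability of class '0'; column 1: probability of class '1'
    return np.column_stack((1.0 - p, p))

probs = to_prob_array([0.0125, 0.827, 0.0748])
print(probs.shape)  # (3, 2)
```

Each row of the result sums to 1, matching the output of `predict_proba` for a two-class model.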

In [9]:
active_test = ActiveTester(Naive(metric=accuracy_score), Random())
active_test.standardize_data(X=X_test, 
                             classes=['0','1'], 
                             Y_noisy=y_noisy)

# run the model on the dataset
y_predictions = model.predict_proba(X_test)
active_test.set_prob_array(y_predictions)

print(y_predictions[:5,:])

# query 5 labels from the vetter (you!)
active_test.query_vetted(True, budget=5, raw=None, visualizer=None)

# use input to estimate classifier performance and print the result
active_test.test()
result = active_test.get_test_results()
print(result)

[[9.87510038e-01 1.24899623e-02]
 [1.73037079e-01 8.26962921e-01]
 [9.25240749e-01 7.47592513e-02]
 [1.23432680e-02 9.87656732e-01]
 [9.99936729e-01 6.32707703e-05]]
Beginning preprocessing to find vetted labels of each class...
[ 1.03262418 -1.22249529]
The available labels are: ['0', '1']
Label the provided item: 0


[ 0.44015372 -0.08109233]
The available labels are: ['0', '1']
Label the provided item: 1


Completed preprocessing
Budget reduced from "5" to "3"
[ 0.47675293 -1.31891426]
The available labels are: ['0', '1']
Label the provided item: 0


[2.86072373 1.79840358]
The available labels are: ['0', '1']
Label the provided item: 1


[2.57358588 0.97990037]
The available labels are: ['0', '1']
Label the provided item: 1


{'tester_labels': array([0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0,
       0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1,
       1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0,
       0, 1, 1, 1, 0, 1