# ML Classifier Copies - Online Copies

With complex feature spaces, many dimensions, or complex original classifiers, the volume of synthetic data needed to carry out the classifier copy can be too large to fit in memory all at once. In these cases we can resort to **online copying**.

We have implemented a version that works with standalone sklearn classifiers as well as pipelines, as long as each transformer and estimator implements a *partial_fit* method. There is more than one way to train pipelines with incremental transformers and estimators. Here we simply take each data batch and train the first element of the pipeline on it, transform the batch with that element, train the second element on the transformed data, transform the data again with the second element, use the result to train the third, and so on.
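The batch-by-batch pipeline training described above can be sketched as follows. This is a minimal illustration using only standard scikit-learn calls, not the presc implementation: the helper name `partial_fit_pipeline`, the random toy data, and the step names are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def partial_fit_pipeline(pipeline, X, y, fit_kwargs=None):
    """Hypothetical helper: update every pipeline step with one data batch."""
    fit_kwargs = fit_kwargs or {}
    last = len(pipeline.steps) - 1
    for i, (name, step) in enumerate(pipeline.steps):
        kwargs = fit_kwargs.get(name, {})
        if i < last:
            # Transformer: update it with the batch, then transform the
            # batch so the next step is trained on the transformed data.
            step.partial_fit(X, **kwargs)
            X = step.transform(X)
        else:
            # Final estimator: update it with the transformed batch and labels.
            step.partial_fit(X, y, **kwargs)
    return pipeline


rng = np.random.default_rng(0)
pipeline = Pipeline([("scaler", StandardScaler()),
                     ("classifier", SGDClassifier(random_state=0))])

# Feed five toy batches of 200 samples each.
for _ in range(5):
    X_batch = rng.normal(size=(200, 3))
    y_batch = (X_batch[:, 0] > 0).astype(int)
    partial_fit_pipeline(pipeline, X_batch, y_batch,
                         {"classifier": {"classes": [0, 1]}})

predictions = pipeline.predict(rng.normal(size=(10, 3)))
```

Each step only ever sees one batch at a time, so memory use is bounded by the batch size rather than by the total number of synthetic samples.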

The implementation has two gears that can function independently: 
* *SyntheticDataStreamer()*: a continuous synthetic data generator.
* *ContinuousCopy()*: a continuous ML classifier copier.

These two elements can share a queue: the instance of the first class adds data whenever there is an empty slot, and the instance of the second class takes a batch of data as soon as one is available.
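The shared-queue arrangement is a standard producer/consumer pattern. The sketch below shows it with plain Python threads and a bounded `queue.Queue`. The `produce` and `consume` functions and the `None` sentinel are assumptions for this illustration; they only stand in for the streamer and the copier, which run until they are stopped explicitly.

```python
import threading
from queue import Queue

data_stream = Queue(maxsize=4)  # at most 4 batches wait in memory at once
n_batches = 10
consumed = []


def produce():
    """Stand-in for the data streamer: adds batches while there is room."""
    for i in range(n_batches):
        data_stream.put([i] * 3)  # blocks while the queue is full
    data_stream.put(None)  # sentinel used only to end this sketch


def consume():
    """Stand-in for the copier: takes a batch as soon as one is available."""
    while True:
        batch = data_stream.get()  # blocks while the queue is empty
        if batch is None:
            break
        consumed.append(sum(batch))


producer = threading.Thread(target=produce)
consumer = threading.Thread(target=consume)
producer.start()
consumer.start()
producer.join()
consumer.join()
```

The `maxsize` of the queue is what keeps memory use bounded: a fast producer simply waits until the consumer has freed a slot.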

The classifier copy can be trained for an arbitrary amount of time and then saved, to continue at a later time with more training.
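As a rough illustration of saving a partially trained model and resuming later, the sketch below uses the standard library `pickle` module with a plain scikit-learn classifier. How the presc classes persist a copy is not shown here; the file name, the toy data, and the `make_batch` helper are assumptions for the example.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)


def make_batch(n=200):
    """Hypothetical toy batch with two features and a binary label."""
    X = rng.normal(size=(n, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    return X, y


# First training session: one incremental update.
copy_model = SGDClassifier(random_state=0)
X, y = make_batch()
copy_model.partial_fit(X, y, classes=[0, 1])
updates_before = copy_model.t_  # internal counter of weight updates

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "copy_model.pkl")
    # Save the partially trained model to disk.
    with open(path, "wb") as f:
        pickle.dump(copy_model, f)
    # Later session: load it back.
    with open(path, "rb") as f:
        resumed_model = pickle.load(f)

# Continue training where the first session left off.
X, y = make_batch()
resumed_model.partial_fit(X, y)  # classes are only needed on the first call
```

Only load pickled models from sources you trust, and preferably with the same scikit-learn version that saved them.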

----

In [1]:
import sys
sys.path.append("../")

In [2]:
import pandas as pd
from queue import Queue

from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

from presc.dataset import Dataset
from presc.copies.continuous import SyntheticDataStreamer, ContinuousCopy
from presc.copies.copying import ClassifierCopy
from presc.copies.sampling import normal_sampling

from ML_copies_original_models import SegmentationModel

### Load "black box" model to copy

For this example, we take an existing "black box" classifier for the publicly available [Image Segmentation dataset](https://archive-beta.ics.uci.edu/ml/datasets/Image+Segmentation), which we can query in order to obtain a copy. This problem has 7 classes:

    BRICKFACE, CEMENT, FOLIAGE, GRASS, PATH, SKY, and WINDOW.

In [3]:
%%capture
# Load problem
original_model = SegmentationModel()

# Description of the feature space of the problem to carry out the sampling
feature_description = original_model.feature_description

# Original test data 
test_data = Dataset(original_model.X_test.join(original_model.y_test), label_col="class")

### Instantiate Classifier Copy

Whatever transformer or estimator family we use for the copy needs to implement a *partial_fit* method in order to be copied incrementally. Below is a list of some of the transformers and classifiers implemented in scikit-learn that we can use on their own or within a pipeline.

#### Possible sklearn incremental learning preprocessing:
* sklearn.preprocessing.StandardScaler
* sklearn.preprocessing.MinMaxScaler
* sklearn.preprocessing.MaxAbsScaler

#### Possible sklearn incremental learning classifiers:
* sklearn.naive_bayes.MultinomialNB
* sklearn.naive_bayes.BernoulliNB
* sklearn.linear_model.Perceptron
* sklearn.linear_model.SGDClassifier
* sklearn.linear_model.PassiveAggressiveClassifier
* sklearn.neural_network.MLPClassifier
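A quick way to check whether a candidate model can be trained incrementally is to look for a *partial_fit* attribute on the estimator, or on every step if it is a pipeline. The helper name `supports_incremental_copy` is an assumption for this sketch and is not part of presc or scikit-learn.

```python
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def supports_incremental_copy(model):
    """Hypothetical check: True if the model, or every pipeline step, has partial_fit."""
    if isinstance(model, Pipeline):
        return all(hasattr(step, "partial_fit") for _, step in model.steps)
    return hasattr(model, "partial_fit")


incremental_pipeline = Pipeline([("scaler", StandardScaler()),
                                 ("classifier", SGDClassifier())])
batch_only_pipeline = Pipeline([("scaler", StandardScaler()),
                                ("classifier", LogisticRegression())])

print(supports_incremental_copy(incremental_pipeline))
print(supports_incremental_copy(batch_only_pipeline))
```

`LogisticRegression` does not implement *partial_fit*, so the second pipeline is not suitable for online copying as described here.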

In [4]:
# Instantiate the copy pipeline
sdg_normal_classifier = Pipeline([('scaler', StandardScaler()),
                                  ('sdg_classifier', SGDClassifier())])

# Define the parameters for the copying balancer (which ensures an equal number of samples from each class)
balance_parameters={"max_iter": 50, "nbatch": 10000, "verbose": False}

# Instantiate the copier class
sdg_normal_copy = ClassifierCopy(original_model.model, sdg_normal_classifier, normal_sampling,
                                  enforce_balance=False, nsamples=20000, random_state=42,
                                  feature_parameters=feature_description, label_col="class",
                                  **balance_parameters)

### Instantiate shared queue

Here we instantiate the queue that will be shared between the synthetic data generator and the copying class.

In [5]:
data_stream = Queue(maxsize=4)

### Instantiate and start data streamer

The data streamer requires two parameters: the classifier copier instance and the shared queue.

In [6]:
data_streamer = SyntheticDataStreamer(sdg_normal_copy, data_stream, verbose=True)
data_streamer.start()

### Instantiate and start classifier copier

We can pass any parameters needed by *partial_fit* through the *fit_kwargs* parameter. For a single classifier, we simply put them in a dictionary. For a pipeline, *fit_kwargs* must be a dictionary with one entry per element of the pipeline, keyed by the step name, where each entry contains the parameter dictionary for that transformer or estimator.

In this example we use the SGDClassifier, for which the *classes* argument needs to be specified.
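The reason *classes* has to be given is that an incremental classifier cannot infer the full label set from a single batch, which may be missing some classes. The sketch below illustrates this with toy data; the reduced three-label list and the random features are assumptions for the example.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

all_classes = ["GRASS", "PATH", "SKY"]
rng = np.random.default_rng(0)

# The first batch happens to contain only two of the three classes.
X_first = rng.normal(size=(20, 2))
y_first = np.array(["GRASS", "PATH"] * 10)

classifier = SGDClassifier(random_state=0)
# Declaring every class up front lets the model allocate weights for all of them.
classifier.partial_fit(X_first, y_first, classes=all_classes)

# A later batch can then include the class that was missing at first.
X_later = rng.normal(size=(30, 2))
y_later = np.array(all_classes * 10)
classifier.partial_fit(X_later, y_later)

# Omitting classes on the very first call is rejected by scikit-learn.
try:
    SGDClassifier().partial_fit(X_first, y_first)
    error_message = None
except ValueError as error:
    error_message = str(error)
```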

If we want the evaluation summary of the classifier to print after each iteration, we set **verbose=True** when instantiating the online copy.

In [7]:
# Specific parameters needed to fit each element of the pipeline
fit_kwargs = {"scaler": {}, "sdg_classifier": {"classes": ['BRICKFACE', 'CEMENT', 'FOLIAGE', 
                                                           'GRASS', 'PATH', 'SKY', 'WINDOW']}}

# Instantiate and start copy
online_copy = ContinuousCopy(sdg_normal_copy, data_stream, fit_kwargs=fit_kwargs, 
                             verbose=True, test_data=test_data)
online_copy.start()


Iteration:  1
Samples:  20000 

Original Model Accuracy (test)          0.9500
Copy Model Accuracy (test)              0.6952
Empirical Fidelity Error (synthetic)    0.3502
Empirical Fidelity Error (test)         0.2833
Replacement Capability (synthetic)      0.6498
Replacement Capability (test)           0.7318

Iteration:  2
Samples:  40000 

Original Model Accuracy (test)          0.9500
Copy Model Accuracy (test)              0.7381
Empirical Fidelity Error (synthetic)    0.3351
Empirical Fidelity Error (test)         0.2405
Replacement Capability (synthetic)      0.6649
Replacement Capability (test)           0.7769

Iteration:  3
Samples:  60000 

Original Model Accuracy (test)          0.9500
Copy Model Accuracy (test)              0.5845
Empirical Fidelity Error (synthetic)    0.3615
Empirical Fidelity Error (test)         0.4095
Replacement Capability (synthetic)      0.6385
Replacement Capability (test)           0.6153

Iteration:  4
Samples:  80000 

Original Model Accurac

### Stop data streamer and online copy

Once you are finished carrying out the classifier copy, make sure to stop both the data streamer and the online copy instances, or they may stay in the background using resources from your computer.

If you stop the data streamer before stopping the online copier, the copier may keep going for a few iterations until there aren't any data batches left in the queue, so it may be better to stop the copier first.

Stopping is not always immediate, and it may take a while for the threads to finish.
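One common reason a producer thread does not stop at once is that it is blocked trying to put data into a full queue, so it cannot notice the stop request until a slot is freed. The sketch below reproduces this with a generic stoppable thread. The `StoppableProducer` class is hypothetical and is not the presc implementation; whether the presc streamer behaves the same way internally is an assumption.

```python
import threading
import time
from queue import Queue


class StoppableProducer(threading.Thread):
    """Hypothetical stand-in for a streaming thread with a stop flag."""

    def __init__(self, queue):
        super().__init__(daemon=True)
        self.queue = queue
        self._stop_requested = threading.Event()

    def run(self):
        # The stop flag is only checked between two put() calls.
        while not self._stop_requested.is_set():
            self.queue.put("batch")  # blocks while the queue is full

    def stop(self):
        self._stop_requested.set()


queue = Queue(maxsize=2)
producer = StoppableProducer(queue)
producer.start()

time.sleep(0.1)  # give the producer time to fill the queue and block
producer.stop()
producer.join(timeout=0.2)
# Typically still alive at this point, because it is waiting inside put().
alive_after_stop = producer.is_alive()

queue.get()  # free one slot so a blocked put() can return
producer.join(timeout=2)
alive_after_drain = producer.is_alive()
```

Taking one item from the queue after requesting the stop lets a blocked `put()` return, after which the loop sees the flag and the thread exits.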

In [8]:
# Check if the online copier is still running
print(f"Online copier is running: {online_copy.is_alive()}\n")

# Stop the online copier
online_copy.stop()

Online copier is running: True

Stopping online classifier copier...

The classifier copy trained for 6 iterations
with a total of 120.000 samples.

Original Model Accuracy (test)          0.9500
Copy Model Accuracy (test)              0.6274
Empirical Fidelity Error (synthetic)    0.3261
Empirical Fidelity Error (test)         0.3464
Replacement Capability (synthetic)      0.6739
Replacement Capability (test)           0.6604


In [9]:
# Check if the data streamer is still running
print(f"Data streamer is running: {data_streamer.is_alive()}\n")

# Stop the thread
data_streamer.stop()
_ = data_stream.get()

Data streamer is running: True

Stopping data streamer...



In [10]:
# Check whether either of the two threads is still running
print(f"Data streamer is running: {data_streamer.is_alive()}")
print(f"Online copier is running: {online_copy.is_alive()}\n")

Data streamer is running: False
Online copier is running: False



----