# ML Classifier Copies - Autistic Spectrum Disorder Model Example

In this example we take an existing "black box" classifier, trained on the publicly available [Autistic Spectrum Disorder Screening for Adults dataset](https://archive-beta.ics.uci.edu/ml/datasets/Autism+Screening+Adult), and query it in order to build a copy. Copying this model has the added difficulty that it uses a mix of numerical and categorical features.

Since we don't know the original model family, we build several copies from different model families and compare their fidelity and performance.

(NOTE: Since we **did** build this model, we do in fact have all the information.)

<a name="Index">
    
----
# Table of contents

    
1. [**Load original model**](#Original)
2. [**Build copies**](#Copies)
3. [**Evaluate copies**](#Evaluation)
----

In [1]:
import sys
sys.path.append("../../")

In [2]:
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

from presc.dataset import Dataset
from presc.copies.copying import ClassifierCopy
from presc.copies.sampling import (
    normal_sampling, mixed_data_sampling,
)

from ML_copies_original_models import AutismScreeningModel

[Index](#Index)  
  
  


<a name="Original">  

-----
-----
# Load original model

We load a "black box" classifier model that we can query for the labels of any points.

In [3]:
autism_model = AutismScreeningModel()
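
As a rough illustration of what "querying a black box" means here, the sketch below wraps a toy scikit-learn model in a hypothetical `BlackBoxClassifier` class that exposes nothing but `predict`. The wrapper, the toy data, and the variable names are assumptions made for this example and are not part of `AutismScreeningModel`.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


class BlackBoxClassifier:
    """Hypothetical wrapper that only allows label queries."""

    def __init__(self, fitted_model):
        self._model = fitted_model

    def predict(self, X):
        # The caller sees labels only, never parameters or training data.
        return self._model.predict(X)


# Toy training data: the label depends only on the sign of the first feature.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))
y_train = (X_train[:, 0] > 0).astype(int)

black_box = BlackBoxClassifier(
    DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
)

# Query the black box at arbitrary points of our choosing.
query_points = np.array([[-2.0, 0.0], [2.0, 0.0]])
queried_labels = black_box.predict(query_points)
```

Being able to label arbitrary points is all that the copying process needs from the original model.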

[Index](#Index)  
  
  


<a name="Copies">  

-----
# Build copies

In [4]:
# Build separate transformer pipelines for the numerical and categorical features of the copy
numerical_features = ['age']
# Build a new list rather than calling remove() on the dataset's own column list
categorical_features = [col for col in autism_model.dataset.column_names
                        if col not in numerical_features]

numerical_transformer = Pipeline([('scaler', StandardScaler())])
categorical_transformer = Pipeline([('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(transformers=[('num', numerical_transformer, numerical_features),
                                               ('cat', categorical_transformer, categorical_features)])
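
To make the effect of this preprocessor concrete, here is a small sketch on a hypothetical three-row table. The column names other than `age` and all the values are invented for illustration. It shows that the numerical column is standardized, that each categorical column is expanded into one-hot columns, and that `handle_unknown='ignore'` encodes an unseen category as all zeros instead of raising an error.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; column names and values are illustrative only.
toy = pd.DataFrame({
    "age": [20.0, 30.0, 40.0],
    "gender": ["f", "m", "f"],
    "jaundice": ["no", "no", "yes"],
})

toy_preprocessor = ColumnTransformer(transformers=[
    ("num", Pipeline([("scaler", StandardScaler())]), ["age"]),
    ("cat", Pipeline([("onehot", OneHotEncoder(handle_unknown="ignore"))]),
     ["gender", "jaundice"]),
])

transformed = toy_preprocessor.fit_transform(toy)
# The result may be sparse, so densify it before inspecting.
dense = transformed.toarray() if hasattr(transformed, "toarray") else transformed

# A row with a category that was never seen during fitting.
unseen = pd.DataFrame({"age": [30.0], "gender": ["other"], "jaundice": ["no"]})
unseen_row = toy_preprocessor.transform(unseen)
unseen_dense = unseen_row.toarray() if hasattr(unseen_row, "toarray") else unseen_row

print(dense.shape)      # 1 scaled column + 2 gender columns + 2 jaundice columns
print(unseen_dense[0])  # the two gender columns are both zero
```

Ignoring unknown categories matters for copies, because synthetic sampling may produce category combinations that the copy has to handle without failing.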

In [5]:
log_normal_classifier = Pipeline([('preprocessor', preprocessor),
                                  ('log_classifier', LogisticRegression(max_iter=1000))])
knn_normal_classifier = Pipeline([('preprocessor', preprocessor),
                                  ('KNN_classifier', KNeighborsClassifier(n_neighbors=30, weights="distance"))])
tree_normal_classifier = Pipeline([('preprocessor', preprocessor),
                                   ('tree_classifier', DecisionTreeClassifier())])
svm_normal_classifier = Pipeline([('preprocessor', preprocessor),
                                  ('svm_classifier', SVC(kernel="linear"))])

In [6]:
feature_description = autism_model.feature_description

In [7]:
balance_parameters = {"max_iter": 50, "nbatch": 10000, "verbose": False}
log_normal_copy = ClassifierCopy(autism_model.model, log_normal_classifier,
                                 sampling_function=mixed_data_sampling, numerical_sampling=normal_sampling,
                                 enforce_balance=False, nsamples=20000, random_state=42,
                                 feature_parameters=feature_description, label_col="ASD",
                                 **balance_parameters)
log_normal_copy_training_data = log_normal_copy.copy_classifier(get_training_data=True)

knn_normal_copy = ClassifierCopy(autism_model.model, knn_normal_classifier, 
                                 sampling_function=mixed_data_sampling, numerical_sampling=normal_sampling,
                                 enforce_balance=False, nsamples=20000, random_state=42,
                                 feature_parameters=feature_description, label_col="ASD",
                                 **balance_parameters) 
knn_normal_copy_training_data = knn_normal_copy.copy_classifier(get_training_data=True)

tree_normal_copy = ClassifierCopy(autism_model.model, tree_normal_classifier,
                                  sampling_function=mixed_data_sampling, numerical_sampling=normal_sampling,
                                  enforce_balance=False, nsamples=20000, random_state=42,
                                  feature_parameters=feature_description, label_col="ASD",
                                  **balance_parameters) 
tree_normal_copy_training_data = tree_normal_copy.copy_classifier(get_training_data=True)

svm_normal_copy = ClassifierCopy(autism_model.model, svm_normal_classifier,
                                 sampling_function=mixed_data_sampling, numerical_sampling=normal_sampling,
                                 enforce_balance=False, nsamples=20000, random_state=42,
                                 feature_parameters=feature_description, label_col="ASD",
                                 **balance_parameters) 
svm_normal_copy_training_data = svm_normal_copy.copy_classifier(get_training_data=True)
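
The general idea behind these calls can be sketched without PRESC: draw synthetic points, label them by querying the original model, and fit the copy on that labelled set. The code below is a simplified illustration under stated assumptions, not the `ClassifierCopy` implementation. The function `toy_original_predict`, the helper `sample_mixed_data`, and the feature `answer` are hypothetical stand-ins, and the sampling (normal for the numerical feature, uniform over known categories for the categorical one) is only a plausible reading of what a mixed-data sampler does.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier


def toy_original_predict(df):
    """Hypothetical stand-in for the black box: a fixed rule on two features."""
    return ((df["age"] > 35) & (df["answer"] == "yes")).astype(int).to_numpy()


def sample_mixed_data(nsamples, random_state):
    """Sample the numerical feature from a normal distribution and the
    categorical feature uniformly from its known categories."""
    rng = np.random.default_rng(random_state)
    return pd.DataFrame({
        "age": rng.normal(loc=35.0, scale=10.0, size=nsamples),
        "answer": rng.choice(["yes", "no"], size=nsamples),
    })


# 1. Generate synthetic points. 2. Label them by querying the original.
X_synthetic = sample_mixed_data(nsamples=5000, random_state=42)
y_synthetic = toy_original_predict(X_synthetic)

# 3. Fit the copy on the synthetic labelled set.
copy_model = Pipeline([
    ("preprocessor", ColumnTransformer([
        ("num", StandardScaler(), ["age"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["answer"]),
    ])),
    ("tree_classifier", DecisionTreeClassifier(random_state=0)),
])
copy_model.fit(X_synthetic, y_synthetic)

# Check agreement with the original on fresh synthetic points.
X_check = sample_mixed_data(nsamples=1000, random_state=43)
agreement = np.mean(copy_model.predict(X_check) == toy_original_predict(X_check))
print(f"Agreement with the original: {agreement:.4f}")
```

Note that the original training data is never used: the copy only ever sees synthetic points and the labels returned by the original model.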

[Index](#Index)  
  
  


<a name="Evaluation">  

-----
# Evaluate copies
    
### Evaluation summary

In [8]:
# Wrap the original test set once so that every copy is evaluated on the same data
test_data = Dataset(autism_model.X_test.join(autism_model.y_test), label_col="ASD")

print("\n * Logistic regression copy:")
synthetic_log_normal_test_data = log_normal_copy.generate_synthetic_data(nsamples=2000, random_state=43)
evaluation_log_normal_copy = log_normal_copy.evaluation_summary(
    test_data=test_data, synthetic_data=synthetic_log_normal_test_data)

print("\n * KNN copy:")
synthetic_knn_normal_test_data = knn_normal_copy.generate_synthetic_data(nsamples=2000, random_state=43)
evaluation_knn_normal_copy = knn_normal_copy.evaluation_summary(
    test_data=test_data, synthetic_data=synthetic_knn_normal_test_data)

print("\n * Decision tree copy:")
synthetic_tree_normal_test_data = tree_normal_copy.generate_synthetic_data(nsamples=2000, random_state=43)
evaluation_tree_normal_copy = tree_normal_copy.evaluation_summary(
    test_data=test_data, synthetic_data=synthetic_tree_normal_test_data)

print("\n * SVC copy:")
synthetic_svm_normal_test_data = svm_normal_copy.generate_synthetic_data(nsamples=2000, random_state=43)
evaluation_svm_normal_copy = svm_normal_copy.evaluation_summary(
    test_data=test_data, synthetic_data=synthetic_svm_normal_test_data)


 * Logistic regression copy:
Original Model Accuracy (test)          0.9787
Copy Model Accuracy (test)              0.9929
Empirical Fidelity Error (synthetic)    0.0335
Empirical Fidelity Error (test)         0.0142
Replacement Capability (synthetic)      0.9665
Replacement Capability (test)           1.0145

 * KNN copy:
Original Model Accuracy (test)          0.9787
Copy Model Accuracy (test)              0.9645
Empirical Fidelity Error (synthetic)    0.0240
Empirical Fidelity Error (test)         0.0142
Replacement Capability (synthetic)      0.9760
Replacement Capability (test)           0.9855

 * Decision tree copy:
Original Model Accuracy (test)          0.9787
Copy Model Accuracy (test)              0.9787
Empirical Fidelity Error (synthetic)    0.0065
Empirical Fidelity Error (test)         0.0000
Replacement Capability (synthetic)      0.9935
Replacement Capability (test)           1.0000

 * SVC copy:
Original Model Accuracy (test)          0.9787
Copy Model Accuracy (test

#### Conclusions
* All copies reach a very high accuracy on the test set.
* For the Logistic Regression and SVC copies, the test accuracy is even higher than that of the original model.
* However, the Decision Tree copy reaches the lowest empirical fidelity error (0.0065 on the synthetic data and 0.0000 on the test set). Hence, it is the copy that best mimics the original model's decision boundary.
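
To help read the summary tables, the sketch below shows one way to compute the two copy-specific metrics. It assumes that the empirical fidelity error is the fraction of points on which the copy and the original disagree, and that the replacement capability is the ratio of the copy's accuracy to the original's accuracy. These definitions are consistent with the numbers reported above, but the helper functions are written for this example and are not the PRESC implementation.

```python
import numpy as np


def empirical_fidelity_error(original_labels, copy_labels):
    """Fraction of points where the copy disagrees with the original."""
    return float(np.mean(np.asarray(original_labels) != np.asarray(copy_labels)))


def replacement_capability(copy_accuracy, original_accuracy):
    """Ratio of the copy's accuracy to the original's accuracy."""
    return copy_accuracy / original_accuracy


# Toy labels: the copy disagrees with the original on one point out of four.
print(empirical_fidelity_error([0, 1, 1, 0], [0, 1, 0, 0]))

# Test-set accuracies taken from the summary above.
print(round(replacement_capability(0.9929, 0.9787), 4))  # logistic regression copy
print(round(replacement_capability(0.9645, 0.9787), 4))  # KNN copy
```

Under these definitions, a replacement capability above 1 means that the copy is more accurate than the original on the test set, which can happen even when the fidelity error is not zero, as with the logistic regression copy.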

[Index](#Index)  
  
  


-----
-----