# CSI5155 Machine Learning Assignment 2

This notebook serves as a starting point: it first evaluates the predictions of the models trained in Assignment 1, and then calculates the SHAP values for each selected classifier.

# Importing Packages and Prerequisites

In [None]:
import numpy as np;
import constants;
import os;
from fileOrganizer import unpack;
import random;
from explainer import explainer;
import gc;

# Model selection and Applying SHAP

According to Assignment 1 (see the notebook `CSI5155 Assignment 1 Evaluation Part - Kelvin Mock 300453668.ipynb` and the report `CSI5155 Assignment 1 Report.pdf`), we compared the Area Under the Curve (AUC) of 6 models. We also concluded in the report that the AUC is a good metric for inferring the overall accuracy of a model. For simplicity, we only consider the **original** models, to which no sampling techniques were applied. We reached the following conclusions:
- On the Chocolate dataset, the best classifier (largest AUC) is the Decision Tree classifier; and
- On the Magic Mushroom dataset, the best classifier (largest AUC) is the Multi-Layer Perceptron (MLP) classifier,

whereas

- On the Chocolate dataset, the worst classifier (lowest AUC) is the Support Vector Machine (SVM) classifier; and
- On the Magic Mushroom dataset, the worst classifier (lowest AUC) is also the SVM classifier.

## Loading the Models from files

In [None]:
def load_models(path: str):
    # Choco Best: Decision Tree
    choco_bestModel = unpack(os.path.join(path, constants.filepaths["choc_posttrained_decisionTree"]));
    # Choco Worst: SVM
    choco_worstModel = unpack(os.path.join(path, constants.filepaths["choc_posttrained_SVC"]));
    # Mushroom Best: MLP
    mush_bestModel = unpack(os.path.join(path, constants.filepaths["mushrooms_posttrained_MLP"]));
    # Mushroom Worst: SVM
    mush_worstModel = unpack(os.path.join(path, constants.filepaths["mushrooms_posttrained_SVC"]));
    return choco_bestModel, choco_worstModel, mush_bestModel, mush_worstModel;

try:
    choco_bestModel, choco_worstModel, mush_bestModel, mush_worstModel = load_models(constants.ASM1_DIR);
except FileNotFoundError:
    # Fall back to the alternative path for remote runs / GitHub Actions
    choco_bestModel, choco_worstModel, mush_bestModel, mush_worstModel = load_models(constants.ASM1_DIR_ALT);

### Inspecting the Imported Models

In [None]:
choco_bestModel

Since the Decision Tree classifier was tuned with `RandomizedSearchCV`, the loaded object is the search wrapper. We extract the fitted estimator from its `best_estimator_` attribute.

In [None]:
choco_bestModel = choco_bestModel.best_estimator_;
choco_bestModel

In [None]:
choco_worstModel

Since the SVM classifier was also tuned with `RandomizedSearchCV`, we extract the fitted estimator from `best_estimator_` in the same way.

In [None]:
choco_worstModel = choco_worstModel.best_estimator_;
choco_worstModel

In [None]:
mush_bestModel

In [None]:
mush_worstModel

Since this SVM classifier was tuned with `RandomizedSearchCV` as well, we again extract the fitted estimator from `best_estimator_`.

In [None]:
mush_worstModel = mush_worstModel.best_estimator_;
mush_worstModel

## Loading the Datasets from files

In [None]:
def load_datasets(path: str):
    # Choco train set
    choco_X_train = unpack(os.path.join(path, constants.filepaths["choc_train-set_samples"]));
    choco_y_train = unpack(os.path.join(path, constants.filepaths["choc_train-set_labels"]));

    # mushroom train set
    mush_X_train = unpack(os.path.join(path, constants.filepaths["mushrooms_train-set_samples"]));
    mush_y_train = unpack(os.path.join(path, constants.filepaths["mushrooms_train-set_labels"]));

    # Choco test set
    choco_X_test = unpack(os.path.join(path, constants.filepaths["choc_test-set_samples"]));
    choco_y_test = unpack(os.path.join(path, constants.filepaths["choc_test-set_labels"]));
    # mushroom test set
    mush_X_test = unpack(os.path.join(path, constants.filepaths["mushrooms_test-set_samples"]));
    mush_y_test = unpack(os.path.join(path, constants.filepaths["mushrooms_test-set_labels"]));

    return choco_X_train, choco_y_train, mush_X_train, mush_y_train, choco_X_test, choco_y_test, mush_X_test, mush_y_test;

try:
    choco_X_train, choco_y_train, mush_X_train, mush_y_train, choco_X_test, choco_y_test, mush_X_test, mush_y_test = load_datasets(constants.ASM1_DIR);
except FileNotFoundError:
    # Fall back to the alternative path for remote runs / GitHub Actions
    choco_X_train, choco_y_train, mush_X_train, mush_y_train, choco_X_test, choco_y_test, mush_X_test, mush_y_test = load_datasets(constants.ASM1_DIR_ALT);

### Inspecting the Imported Data

In [None]:
print("-----Chocolate Dataset Training Set-----");
print(f"Size of the samples array in the training set from Chocolate dataset: {len(choco_X_train)}");
print(f"Size of the labels array in the training set from Chocolate dataset: {len(choco_y_train)}");
print(f"Number of features in a sample in the training set from Chocolate dataset: {len(choco_X_train[random.randint(0, len(choco_X_train)-1)])}");
unique, counts = np.unique(choco_y_train, return_counts=True);
for i in range(len(unique)):
    print(f"Label '{unique[i]}' has: {counts[i]} samples.");

print();

print("-----Mushroom Dataset Training Set-----");
print(f"Size of the samples array in the training set from Mushroom dataset: {len(mush_X_train)}");
print(f"Size of the labels array in the training set from Mushroom dataset: {len(mush_y_train)}");
print(f"Number of features in a sample in the training set from Mushroom dataset: {len(mush_X_train[random.randint(0, len(mush_X_train)-1)])}");
unique, counts = np.unique(mush_y_train, return_counts=True);
for i in range(len(unique)):
    print(f"Label '{unique[i]}' has: {counts[i]} samples.");

print();

print("-----Chocolate Dataset Test Set-----");
print(f"Size of the samples array in the test set from Chocolate dataset: {len(choco_X_test)}");
print(f"Size of the labels array in the test set from Chocolate dataset: {len(choco_y_test)}");
print(f"Number of features in a sample in the test set from Chocolate dataset: {len(choco_X_test[random.randint(0, len(choco_X_test)-1)])}");
unique, counts = np.unique(choco_y_test, return_counts=True);
for i in range(len(unique)):
    print(f"Label '{unique[i]}' has: {counts[i]} samples.");

print();

print("-----Mushroom Dataset Test Set-----");
print(f"Size of the samples array in the test set from Mushroom dataset: {len(mush_X_test)}");
print(f"Size of the labels array in the test set from Mushroom dataset: {len(mush_y_test)}");
print(f"Number of features in a sample in the test set from Mushroom dataset: {len(mush_X_test[random.randint(0, len(mush_X_test)-1)])}");
unique, counts = np.unique(mush_y_test, return_counts=True);
for i in range(len(unique)):
    print(f"Label '{unique[i]}' has: {counts[i]} samples.");

-----Chocolate Dataset Training Set-----
Size of the samples array in the training set from Chocolate dataset: 1256
Size of the labels array in the training set from Chocolate dataset: 1256
Number of features in a sample in the training set from Chocolate dataset: 13
Label 'non-user' has: 27 samples.
Label 'user' has: 1229 samples.

-----Mushroom Dataset Training Set-----
Size of the samples array in the training set from Mushroom dataset: 1256
Size of the labels array in the training set from Mushroom dataset: 1256
Number of features in a sample in the training set from Mushroom dataset: 13
Label 'non-user' has: 805 samples.
Label 'user' has: 451 samples.

-----Chocolate Dataset Test Set-----
Size of the samples array in the test set from Chocolate dataset: 629
Size of the labels array in the test set from Chocolate dataset: 629
Number of features in a sample in the test set from Chocolate dataset: 13
Label 'non-user' has: 8 samples.
Label 'user' has: 621 samples.

-----Mushroom Dataset Test Set-
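
The four inspection blocks differ only in the split they describe. The sketch below shows one way to factor them into a single function that also reports the minority-class fraction, which makes the imbalance visible at a glance (in the Chocolate training set above, 27 of 1256 samples, roughly 2.1%, are labelled `non-user`). The name `describe_split` and the returned dictionary layout are assumptions, not part of the original notebook.

```python
from collections import Counter
from typing import Sequence


def describe_split(name: str, X: Sequence, y: Sequence) -> dict:
    """Print and return sample count, feature count, and per-label counts."""
    if len(X) != len(y):
        raise ValueError(f"{name}: {len(X)} samples but {len(y)} labels")
    if len(X) == 0:
        raise ValueError(f"{name}: empty split")

    counts = Counter(y)
    summary = {
        "n_samples": len(X),
        "n_features": len(X[0]),  # assumes every sample has the same length
        "label_counts": dict(counts),
        "minority_fraction": min(counts.values()) / len(y),
    }

    print(f"-----{name}-----")
    print(f"Samples: {summary['n_samples']}, features per sample: {summary['n_features']}")
    for label, count in counts.items():
        print(f"Label '{label}' has: {count} samples.")
    print(f"Minority-class fraction: {summary['minority_fraction']:.3f}")
    return summary
```

A call such as `describe_split("Chocolate Dataset Training Set", choco_X_train, choco_y_train)` would then replace one of the blocks above, assuming the arrays can be indexed by position as they are in the original cell.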

## Apply SHAP Method

A SHAP value quantifies the contribution of a single feature to a model’s prediction for a single sample, measured relative to a base (expected) value.
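
To make the idea concrete, the sketch below computes exact Shapley values for a tiny hand-written model by enumerating every feature subset. It is an illustration only and is not how the `explainer` class used in this notebook works: it uses a single baseline vector instead of a background dataset, and `exact_shapley` and `toy_model` are hypothetical names.

```python
from itertools import combinations
from math import factorial
from typing import Callable, List, Sequence


def exact_shapley(f: Callable[[List[float]], float],
                  x: Sequence[float],
                  baseline: Sequence[float]) -> List[float]:
    """Exact Shapley values of f at x, with absent features set to the baseline."""
    n = len(x)

    def value(subset: frozenset) -> float:
        # Features in `subset` take their value from x, the rest from the baseline.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return f(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            # Shapley weight for coalitions of size k that exclude feature i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for coalition in combinations(others, k):
                s = frozenset(coalition)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_model(z: List[float]) -> float:
    """Toy model: one additive term plus one interaction term."""
    return 2.0 * z[0] + z[1] * z[2]
```

For `x = [1.0, 2.0, 3.0]` and an all-zero baseline, the additive feature receives its full effect and the two interacting features split the interaction term equally. The values sum to `toy_model(x) - toy_model(baseline)`, which is the additivity property that lets a base value plus the SHAP values reconstruct the model output.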

### Instantiating Explainers

In [None]:
treeexmplainer_choco = explainer(
    model=choco_bestModel, # Decision Tree
    data=choco_X_train,
    modelType="tree"
)
treeexmplainer_choco

Start instantiating an explainer.
A tree explainer is instantiated successfully.


<explainer.explainer at 0x21f726f06d0>

In [None]:
linearExplainer_choco = explainer(
    model=choco_worstModel, # SVM
    data=choco_X_train,
    modelType="svm"
);
linearExplainer_choco

Start instantiating an explainer.
A Linear explainer is instantiated successfully.


<explainer.explainer at 0x21f726f05b0>

In [None]:
permExplainer_mush_best = explainer(
    model=mush_bestModel, # MLP
    data=mush_X_train,
    modelType="neural"
)
permExplainer_mush_best

Start instantiating an explainer.
A Permutation Kernel explainer is instantiated successfully.


<explainer.explainer at 0x21f726f0dc0>

In [None]:
linearExplainer_mush_worst = explainer(
    model=mush_worstModel, # SVM
    data=mush_X_train,
    modelType="svm"
);
linearExplainer_mush_worst

Start instantiating an explainer.
A Linear explainer is instantiated successfully.


<explainer.explainer at 0x21f726f2650>

### Calculating SHAP values

In [None]:
# Decision Tree
SHAP_choco_best = treeexmplainer_choco.explain(
    X_test=choco_X_test
);
print(f"Shape of the SHAP values set results: {SHAP_choco_best.shape}");
print(f"The SHAP value for a sample: \n{SHAP_choco_best[random.randint(0, len(SHAP_choco_best)-1)]}");

A tree explainer is found.
Shape of the SHAP values set results: (629, 13, 2)
The SHAP value for a sample: 
[[ 0.          0.        ]
 [ 0.          0.        ]
 [ 0.          0.        ]
 [ 0.          0.        ]
 [-0.00432326  0.00432325]
 [ 0.          0.        ]
 [ 0.          0.        ]
 [ 0.          0.        ]
 [ 0.          0.        ]
 [ 0.          0.        ]
 [ 0.          0.        ]
 [ 0.          0.        ]
 [ 0.          0.        ]]
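
The tree explainer returns an array of shape `(629, 13, 2)`, one slice per class, while the other explainers in this notebook return `(629, 13)`. For a binary classifier that outputs class probabilities, the two slices mirror each other with opposite signs (as in the sample above), so one slice is usually enough for plotting. The helper below is a minimal sketch for bringing both layouts to a common 2-D shape. The name `positive_class_shap` is hypothetical, and treating index 1 as the class of interest is an assumption that should be checked against the model's `classes_` attribute.

```python
import numpy as np


def positive_class_shap(shap_values: np.ndarray, class_index: int = 1) -> np.ndarray:
    """Return a (n_samples, n_features) view of SHAP values.

    2-D input is returned unchanged; 3-D input of shape
    (n_samples, n_features, n_classes) is sliced at class_index.
    """
    if shap_values.ndim == 2:
        return shap_values
    if shap_values.ndim == 3:
        return shap_values[:, :, class_index]
    raise ValueError(f"unexpected SHAP array with {shap_values.ndim} dimensions")
```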


In [None]:
# SVM
SHAP_choco_worst = linearExplainer_choco.explain(
    X_test=choco_X_test
);
print(f"Shape of the SHAP values set results: {SHAP_choco_worst.shape}");
print(f"The SHAP value for a sample: \n{SHAP_choco_worst[random.randint(0, len(SHAP_choco_worst)-1)]}");

A Linear explainer is found
Shape of the SHAP values set results: (629, 13)
The SHAP value for a sample: 
[ 5.15351207e-04 -8.74319259e-06 -4.49127260e-06 -5.50523094e-06
 -6.56915009e-05  7.30367327e-07  1.64897071e-04 -1.99410852e-05
 -3.70730456e-05  1.40676129e-04  3.04204362e-05  3.70336865e-07
 -9.91021088e-08]


In [None]:
# MLP
SHAP_mush_best = permExplainer_mush_best.explain(
    X_test=mush_X_test
);
print(f"Shape of the SHAP values set results: {SHAP_mush_best.shape}");
print(f"The SHAP value for a sample: \n{SHAP_mush_best[random.randint(0, len(SHAP_mush_best)-1)]}");

A Permutation Explainer is found


PermutationExplainer explainer: 630it [02:15,  4.34it/s]                         

Shape of the SHAP values set results: (629, 13)
The SHAP value for a sample: 
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]





In [None]:
# SVM
SHAP_mush_worst = linearExplainer_mush_worst.explain(
    X_test=mush_X_test
);
print(f"Shape of the SHAP values set results: {SHAP_mush_worst.shape}");
print(f"The SHAP value for a sample: \n{SHAP_mush_worst[random.randint(0, len(SHAP_mush_worst)-1)]}");

A Linear explainer is found
Shape of the SHAP values set results: (629, 13)
The SHAP value for a sample: 
[-1.93390072e+01 -1.37142696e-01 -8.17115302e-02  1.44853568e-02
 -3.31578908e-01  1.91051821e-03 -6.24897062e-03  1.35234131e-02
 -3.48705467e-01  3.09786916e-02 -2.62908288e-02 -2.31762422e-03
 -1.17822803e-01]


### Store the SHAP values after calculation

In [None]:
np.save("SHAP_choco_best.npy", SHAP_choco_best);
np.save("SHAP_choco_worst.npy", SHAP_choco_worst);
np.save("SHAP_mush_best.npy", SHAP_mush_best);
np.save("SHAP_mush_worst.npy", SHAP_mush_worst);

#### Verify if the Data Dump is successful

In [None]:
test = SHAP_choco_best == np.load("SHAP_choco_best.npy");
print(f"Number of inconsistencies from data dump: {len(test[test == False])}");

Number of inconsistencies from data dump: 0


In [None]:
test = SHAP_choco_worst == np.load("SHAP_choco_worst.npy");
print(f"Number of inconsistencies from data dump: {len(test[test == False])}");

Number of inconsistencies from data dump: 0


In [None]:
test = SHAP_mush_best == np.load("SHAP_mush_best.npy");
print(f"Number of inconsistencies from data dump: {len(test[test == False])}");

Number of inconsistencies from data dump: 0


In [None]:
test = SHAP_mush_worst == np.load("SHAP_mush_worst.npy");
print(f"Number of inconsistencies from data dump: {len(test[test == False])}");

Number of inconsistencies from data dump: 0
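
The element-wise comparisons above can also be written as a single round-trip check with `np.array_equal`, which compares shape and values in one call. This is a sketch under the assumption that exact equality is the desired criterion; `save_and_verify` is a hypothetical helper name.

```python
import os
import tempfile

import numpy as np


def save_and_verify(path: str, array) -> bool:
    """Save `array` with np.save, load it back, and report whether it is unchanged."""
    np.save(path, array)
    # np.save appends ".npy" when the file name does not already end with it.
    saved_path = path if path.endswith(".npy") else path + ".npy"
    restored = np.load(saved_path)
    return bool(np.array_equal(np.asarray(array), restored))
```

Note that `np.array_equal` treats NaN as unequal to itself by default, so an array containing NaN would fail this check unless `equal_nan=True` is passed.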


### Derive and Store the base values for plotting

In [None]:
baseVal_choco_best = treeexmplainer_choco.calBaseVal();
baseVal_choco_best

array([0.01990399, 0.98009601])

In [None]:
baseVal_choco_worst = linearExplainer_choco.calBaseVal();
baseVal_choco_worst

1.0013511252644252

In [None]:
baseVal_mush_best = permExplainer_mush_best.calBaseVal(
    model=mush_bestModel,
    X_train=mush_X_train
);
baseVal_mush_best

0.0

In [None]:
baseVal_mush_worst = linearExplainer_mush_worst.calBaseVal();
baseVal_mush_worst

29.414241247512308

In [None]:
np.save("baseVal_choco_best.npy", baseVal_choco_best);
np.save("baseVal_choco_worst.npy", baseVal_choco_worst);
np.save("baseVal_mush_best.npy", baseVal_mush_best);
np.save("baseVal_mush_worst.npy", baseVal_mush_worst);

#### Verify if the Data Dump is successful

In [None]:
test = baseVal_choco_best == np.load("baseVal_choco_best.npy");
print(f"Number of inconsistencies from data dump: {len(test[test == False])}");

Number of inconsistencies from data dump: 0


In [None]:
test = baseVal_choco_worst == np.load("baseVal_choco_worst.npy");
print(f"Number of inconsistencies from data dump: {len(test[test == False])}");

Number of inconsistencies from data dump: 0


In [None]:
test = baseVal_mush_best == np.load("baseVal_mush_best.npy");
print(f"Number of inconsistencies from data dump: {len(test[test == False])}");

Number of inconsistencies from data dump: 0


In [None]:
test = baseVal_mush_worst == np.load("baseVal_mush_worst.npy");
print(f"Number of inconsistencies from data dump: {len(test[test == False])}");

Number of inconsistencies from data dump: 0


We have made sure that all the data are dumped to `.npy` files (and loaded back) accurately. Therefore, we can proceed with the summary plots in another notebook, `CSI5155 Assignment 2 Plots - Kelvin Mock 300453668.ipynb`.