INSPIRE Program Summer 2023 Internship Project: Jessica Kim

# Post-Analysis: ML Interpretability Using Shapley Values for Trained ML Models - (for STREAMLINE)

This notebook lets the user run additional post-analysis methods for in-depth interpretability of pipeline results using Shapley values through the SHAP framework (created by Scott M. Lundberg and Su-In Lee), to help explain model prediction outputs and feature importances.

When run, the notebook: 1) unpickles the trained models, 2) loads the respective training and testing datasets for each CV partition, 3) creates a SHAP explainer for each model, 4) computes SHAP values using the explainer and the testing dataset, 5) generates and saves several kinds of figures that give different perspectives on feature importance, 6) creates csv files of feature importances for each model per CV partition, and 7) saves a master list of SHAP values (mean Shapley values) for each model that stores the values from every CV partition.  Additionally, force plots are optional: the user must specify 'True' to generate them (force plots for some models cannot be saved and can only be displayed in the Run cell).

During the run of this notebook, all figures and csv files containing feature importance SHAP values are stored in a main folder called "shap_values" under "model_evaluation" of the user's experiment dataset used in STREAMLINE.
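
A minimal sketch of how this output location could be assembled, assuming the layout described above is `<experiment_path>/<dataset>/model_evaluation/shap_values`. The helper name `make_shap_dir` and the temporary demo directory are hypothetical and used only for illustration; they are not part of STREAMLINE.

```python
import os
import tempfile

def make_shap_dir(experiment_path, dataset_name):
    """Create (if needed) and return <experiment>/<dataset>/model_evaluation/shap_values."""
    shap_dir = os.path.join(experiment_path, dataset_name, "model_evaluation", "shap_values")
    os.makedirs(shap_dir, exist_ok=True)  # no error if the folder already exists
    return shap_dir

# Demonstration inside a throwaway directory so no real experiment folder is touched
demo_root = tempfile.mkdtemp()
shap_dir = make_shap_dir(demo_root, "hcc-data_example")
print(shap_dir)
```

Using `os.makedirs(..., exist_ok=True)` creates any missing parent folders and makes a second run of the notebook safe.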

Some figures (such as force plots) will display '0' or '1' as the 'average' model prediction; these refer to the predicted class, 0 or 1.  This simply shows that the SHAP values of the contributing features push the final prediction toward 0 or 1 by a certain magnitude.  Users will most likely see this when running Logistic Regression as well as Decision Tree, Extreme Gradient Boosting, and other tree-based models.

**NOTE: This notebook was written to apply the appropriate explainer, SHAP value computation, and plots to each respective ML model. It is still a work in progress due to the continuous changes in the SHAP package.**

# Goals
* Laid out a rough outline of how SHAP would be computed, I thought I would give SHAP methods a try
* Earlier methods work and prove that the model is unpickled and can be used
* Be able to iterate through each trained model and its respective CV dataset, create shap values, generate shap plots
* Be able to store each CV shap values for each model and store in csv file as a DataFrame
    
                LR_shap_all_CVs.csv ==> 
                                        LR_0 --> CV0
                                        LR_1 --> CV1
                                        LR_2 --> CV2




# Progress of Updates/Fixes:

**Months of June & July**
* Still need to figure out saving results into a file (pickle.dump()), create and save into designated folder **DONE**
* Figure out how to work TreeExplainer, expected_value function **DONE**
    * expected_value works for certain models/Explainers --> doesn't really work for Naive Bayes
* Find file with the feature names for corresponding dataset to load into program under 'Load Metadata" section **DONE**
* Figure out how to display other shap plots such as waterfall, force plot, etc 
    * waterfall plots might only work for certain models like tree-based ones
    * some of the summary figures functions aren't working

* Most of the program is hardcoded to specifically load one of the trained models after running STREAMLINE **resolved**
* Was able to prove that the model can be unpickled and used for .predict() and .predict_proba() **resolved**
* Was able to use model to create SHAP explainers, calculate shap_values for CV0 testing dataset, and display plots **resolved**
* However, still need to refine the SHAP methods as there were some issues for Decision Tree Classifier **resolved**
* Was able to display Decision Tree prediction using TreeExplainer or even Explainer....I might be doing something wrong  **resolved**
* XGBOOST MODEL IS COMPATIBLE WITH ALL OF THE LISTED SHAP PLOTS **resolved**
* RF MODEL NEEDED ITS OWN IF-STATEMENT FOR NOW BUT WILL CONDENSE FOR CLARITY AND EFFICIENCY **resolved**
* STILL NEED TO WORK ON LIGHTGBM, CATBOOST  **resolved**
* GO BACK TO FIX DECISION TREE  **resolved**
* Go back to double check shap plot compatibility for global and local importance for linear models **resolved**
* Work through the DecisionTreeClassifier and compare to other codes out there (if possible) **resolved**
* Currently unsure if creating dataframe for each model's shap_values should be done in compute_shap_values() or within the nested for-loop in testing cell **resolved**
* Feature names when displaying shap plots  
    * Beeswarm plots still aren't displaying feature names (everything else is fine)

**7/29/22**
* ALL given SHAP plots seems to work for NB() when not in a defined function block and if-statement **resolved**
* Bar, scatter, waterfall, and beeswarm plots don't work for LR(), other plots work fine on LinearExplainer() and shap_values = explainer.shap_values(data) **STILL NEEDS TO BE FIXED**

**8/02/22**
* Plots and shap_values for each trained model in each CV work
* Currently unsure if creating dataframe for each model's shap_values should be done in compute_shap_values() or within the nested for-loop in testing cell 
    
**8/04/22**
* Can create DataFrames for each CV but feature names most likely are not matching actual values (double check it)
* Difficult looping through to merge Dataframes for all CVs features...tried temporary variable
* Must also consider that shap_values array are returned in order of features from test/train set it was passed from...not based on feature order in test/train set  **FIXED on 8/05/22**
    * Consider mapping out and ordering the values to avoid shuffling of names and values **FIXED on 8/05/22**
    
**8/05/22**
* Saving feature importance scores for each cv
* Created two different runs, one for actual test (default) and another if the user chooses to run it on the training sets for comparison  

**8/08/22**
* Iterating through multiclass shap values for Decision Tree poses issue?...ideally we'd want to get the shap absolute average for both classes 0 and 1...same might be for XGB and any other model that has multiclass output **FIXED on 8/08/22**
    * Figured out that when running the loop in shap_feature_ranking() for Decision Tree, both classes 0 and 1 are accounted for. The shap absolute averages are summed up automatically to get the overall CV feature importances for the model (i double checked this myself through creating a loop that would output two different csv files for each class it iterated through)
* **Current issue:** Figuring out how to save multiple figures for each model when calling shap_summary()...for now, I can only save each figure individually through each CV...if model NB has 2 plot function calls & iterate through 3 CVs --> total 6 shap plots for **ONE** model .....
    * **POSSIBLE FIX**  merge all images onto one pdf per model which would entail different shap summaries **OR** create the master list of feature importance of all CVs for each model and create shap summaries for those
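
The class-summing behaviour noted in the 8/08/22 entry can be sketched as follows. The arrays are made-up toy values (an assumption, not output from a real model), laid out as one `(n_samples, n_features)` array per class, which is the list-of-arrays form the notebook indexes as `shap_values[0]` and `shap_values[1]` for Decision Tree.

```python
import numpy as np

# Hypothetical SHAP output for a two-class model: one array per class
shap_class0 = np.array([[0.2, -0.1],
                        [-0.4, 0.3]])
shap_class1 = -shap_class0  # toy assumption: class 1 values mirror class 0

shap_values = [shap_class0, shap_class1]

# Mean absolute SHAP value per feature, computed separately for each class
per_class = [np.mean(np.abs(sv), axis=0) for sv in shap_values]

# Summing over classes gives a single importance score per feature
importance = np.sum(per_class, axis=0)
print(importance)
```

Writing the per-class means out explicitly makes it easy to save them to separate csv files when double-checking the summed result.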
    
**08/09/22 - 08/12/22**
* Goal is to save shap averages from all CVs in each model in a final/master CSV file **(08/09/22)**
* **CURRENT ISSUE:** Struggling to create a method to save the mean shap values for each feature into a dataframe composed of all averages across all CVs **(08/10/22 - 08/11/22)**  **--> FIXED**
    * Was able to iterate through CV0 for NB0 but still need to figure out how to loop through all CVs  **(08/12/22)**
    * Maybe possible fix for this is either:
        * Implement this as a cell alone at the bottom of the notebook (my concern is redundancy in calling compute_shapValues() as well as other functions)
        * Leave as function and figure out how to append to an existing DataFrame after the first iteration
        * Implement within shap_feature_ranking()
        
**08/15/22**
* Still trying to figure out best way on how to loop through all CVs for each model's mean absolute shap_values 
    * **NOTE**: it's better to append each CV's values to the FI_all list before creating the dataframe....it would be inefficient to continuously append a row to an existing dataframe 
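
A short sketch of the list-then-DataFrame approach from the note above. The feature names and importance values are hypothetical; `FI_all` follows the name used in the note, and the `LR_0`, `LR_1`, `LR_2` row labels follow the master-list layout shown under Goals.

```python
import pandas as pd

feature_names = ["Age", "Albumin", "Platelets"]  # hypothetical features

# Hypothetical mean |SHAP| values per CV partition for one model (LR)
cv_means = {
    0: [0.10, 0.30, 0.05],
    1: [0.20, 0.25, 0.00],
    2: [0.15, 0.35, 0.10],
}

FI_all = []  # plain Python list: cheap to append to inside the CV loop
for cv, means in cv_means.items():
    FI_all.append(means)

# Build the DataFrame once, after the loop, with one row per CV partition
master = pd.DataFrame(FI_all, columns=feature_names,
                      index=[f"LR_{cv}" for cv in cv_means])
print(master)
```

Building the frame once avoids repeatedly copying an ever-growing DataFrame, and a call such as `master.to_csv(...)` could then write the master list in one step.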
    
**8/16/22**
* Continuing on fixing csv issue for saving absolute mean shap_values for each model across all CVs
* Worked on (and fixed) saving summary plots to a new folder within the shap_values test run as well as saving figures for force plots if user decides to run it (aka user-specified run)
    * **NOTE:** SHAP force plots for DT, RF, LR, and XGB can only be displayed in the 'Run' cell of the notebook. These figures can be saved as png but it only shows blank space .... it works for NB 
    
**8/17/22**
* Was able to loop through all CVs for the master list for the given models, but the values stored in the csv file **do not correspond to the output in the run cell**
    * **CURRENT ISSUE** Seems like when concatenating new rows, the previous row gets overwritten with different values than the original dataframe row **FIXED 08/18/22**
    
**8/18/22**
* Fixed the issue with saving each CV of mean shap values per model --> saves and creates master list of shap values for each model successfully
* Will be focused on testing out the code with other ML algorithms and cleaning up code

**8/29/22**
* Was able to generate and save waterfall plots for a single prediction (only) for models LR, XGB, DT, and RF
    * figure out why NB is labeled as 'Permutation' --> this is causing issues in displaying plots that require 'explainer.expected_value' parameter

# Final Steps

* Saving shap figures per model in each cv 
* Make sure you can loop through each pickled model, load it, create shap values and display plots
* Be able to load one model at a time, create shapley values for each CV train and test set, store shap scores in a dataframe
* Make sure to load original dataset features so that each csv file is the same length as the original dataset
    * This means when a CV dataset is missing a feature, we make sure to assign a shap score of 0 
    * each new csv file for loading shap scores of each trained model must include all features
    
    
                LR_shap_all_CVs.csv ==> 
                                        LR_0 --> CV0
                                        LR_1 --> CV1
                                        LR_2 --> CV2


* Save dataframe for each model in a csv file
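
The zero-filling step described above (every csv includes all original features, with a score of 0 for features missing from a CV dataset) could look like this minimal sketch. The feature names and scores are hypothetical; in the notebook the full list would come from `original_headers`.

```python
import pandas as pd

original_features = ["Age", "Albumin", "Platelets", "Smoking"]  # hypothetical full feature list

# Hypothetical importance scores from a CV partition where 'Platelets' is not present
cv_scores = pd.Series({"Albumin": 0.30, "Age": 0.10, "Smoking": 0.05})

# reindex() aligns by feature name, restores the original order, and fills gaps with 0
aligned = cv_scores.reindex(original_features, fill_value=0.0)
print(aligned)
```

Aligning by name rather than by position also guards against feature names and values getting shuffled between CV partitions.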

# Possible future changes to consider: changing shap_summary()

**Instead of using the CV loop to create shap figures for each model per CV....**
* Open & read masterList csv
* sum each feature over all CVs to get shap_values average of each feature for the model
    * (CV0 Feature X shap_value + CV1 Feature X shap_value + CV2 Feature X shap_value) / # of CVs = average shap_value of Feature X
    * Feature shap_values average for Model A = (SUM(shap_values) / # of CVs)
* create shap figures using finalized shap_value average of features for each model
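
A minimal sketch of the averaging step proposed above, assuming the master list holds one row per CV partition and one column per feature. The values are hypothetical; in practice the frame would be read from the master list csv rather than built inline.

```python
import pandas as pd

# Hypothetical master list: one row per CV partition, one column per feature
master = pd.DataFrame(
    {"Age": [0.10, 0.20, 0.15], "Albumin": [0.30, 0.25, 0.35]},
    index=["LR_0", "LR_1", "LR_2"],
)

n_cvs = len(master)

# (CV0 + CV1 + CV2) / number of CVs, computed per feature
avg = master.sum(axis=0) / n_cvs

# Most important feature first, ready to be plotted once per model
ranked = avg.sort_values(ascending=False)
print(ranked)
```

This would yield a single set of figures per model instead of one set per CV partition.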

In [1]:
# required packages
import os
import sys
import glob
import pickle
import warnings
warnings.filterwarnings('ignore')
import csv
import sklearn
import random
import shap
import numpy as np
import numpy.typing as npt
import pandas as pd
import scipy as sp
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages
import itertools
from itertools import chain
from fpdf import FPDF
import collections
from termcolor import colored as cl #text customization

# Model packages
import xgboost
import lightgbm as lgb
from sklearn import *
from sklearn import tree
from shap.plots import waterfall


# Jupyter Notebook Hack: This code ensures that the results of multiple commands within a given cell are all displayed, rather than just the last. 
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"


shap.initjs() # load JS visualization code to notebook. SHAP plots won't be displayed without this

# Run Parameters

In [2]:
dataset_path = "/Users/jessicakim/Desktop/STREAMLINE/DemoData"
experiment_path = "/Users/jessicakim/Desktop/STREAMLINE/DemoData/Output/hcc_demo"


# Check for Analyzed Datasets and Remove Unnecessary Files

In [3]:
datasets = os.listdir(experiment_path)
experiment_name = experiment_path.split('/')[-1] #Name of experiment folder

# Files that every STREAMLINE experiment folder is expected to contain
for required in ['metadata.csv', 'metadata.pickle', 'algInfo.pickle']:
    datasets.remove(required)

# Optional files/folders that only exist in some runs (e.g. if run previously)
optional = ['jobsCompleted', 'UsefulNotebooks', 'logs', 'jobs', 'DatasetComparisons',
            'KeyFileCopy', '.DS_Store', experiment_name+'_ML_Pipeline_Report.pdf']
datasets = [d for d in datasets if d not in optional]

datasets = sorted(datasets) #ensures consistent ordering of datasets
print("Analyzed Datasets: "+str(datasets))



Analyzed Datasets: ['hcc-data_example', 'hcc-data_example_no_covariates']


# Load Metadata and Other Necessary Variables

In [4]:
jupyterRun = 'True'
# Loading necessary variables specified earlier in the pipeline from metadata for dataPrep()
with open(experiment_path + '/' + "metadata.pickle", 'rb') as file:  # closes the file automatically
    metadata = pickle.load(file)
# print(metadata)

class_label = metadata['Class Label']
instance_label = metadata['Instance Label']
cv_partitions = int(metadata['CV Partitions'])



# # # unpickle and load in feature_names from original dataset
original_headers = pd.read_csv(experiment_path+"/hcc-data_example/exploratory/OriginalFeatureNames.csv",sep=',').columns.values.tolist() 
print(original_headers)


with open(experiment_path + '/' + "algInfo.pickle", 'rb') as alg_file:  # closes the file automatically
    algInfo = pickle.load(alg_file)
algorithms = []
   
    
abbrev = {}
for key in algInfo:  # pickling specific model while also checking for corresponding algInfo 
    if algInfo[key][0]: # If that algorithm was used
        algorithms.append(key)
        abbrev[key] = (algInfo[key][1])
        
        
        
print(f'\nChecking for algorithms used in STREAMLINE...\n{algorithms}')
print(f'\nChecking for abbrev for algorithms used in STREAMLINE...\n{abbrev}')

['Gender', 'Symptoms ', 'Alcohol', 'Hepatitis B Surface Antigen', 'Hepatitis B e Antigen', 'Hepatitis B Core Antibody', 'Hepatitis C Virus Antibody', 'Cirrhosis', 'Endemic Countries', 'Smoking', 'Diabetes', 'Obesity', 'Hemochromatosis', 'Arterial Hypertension', 'Chronic Renal Insufficiency', 'Human Immunodeficiency Virus', 'Nonalcoholic Steatohepatitis', 'Esophageal Varices', 'Splenomegaly', 'Portal Hypertension', 'Portal Vein Thrombosis', 'Liver Metastasis', 'Radiological Hallmark', 'Age at diagnosis', 'Grams of Alcohol per day', 'Packs of cigarets per year', 'Performance Status*', 'Encephalopathy degree*', 'Ascites degree*', 'International Normalised Ratio*', 'Alpha-Fetoprotein (ng/mL)', 'Haemoglobin (g/dL)', 'Mean Corpuscular Volume', 'Leukocytes(G/L)', 'Platelets', 'Albumin (mg/dL)', 'Total Bilirubin(mg/dL)', 'Alanine transaminase (U/L)', 'Aspartate transaminase (U/L)', 'Gamma glutamyl transferase (U/L)', 'Alkaline phosphatase (U/L)', 'Total Proteins (g/dL)', 'Creatinine (mg/dL)', 

# dataPrep(): Loading Target CV Training & Testing Sets

In [5]:
def dataPrep(train_file_path,instance_label,class_label, test_file_path):
    
    '''Loads target cv training and testing datasets, separates class from features and removes instance labels'''
    
    
    train = pd.read_csv(train_file_path)
    if instance_label != 'None':
        train = train.drop(instance_label,axis=1)
        
    # get feature names from train dataset
    trainFeat = list(train.drop(class_label, axis=1).columns)  #note: datatype --> list
    
    trainX = pd.DataFrame(train.drop(class_label, axis=1).values)
    trainY = pd.DataFrame(train[class_label].values)
    del train #memory cleanup
    
   
    test = pd.read_csv(test_file_path)
    if instance_label != 'None':
        test = test.drop(instance_label,axis=1)
        
    # get feature names from test dataset
    testFeat = list(test.drop(class_label, axis=1).columns)  

    testX = pd.DataFrame(test.drop(class_label, axis=1).values)
    testY = pd.DataFrame(test[class_label].values)
    del test #memory cleanup
    
    
    return trainX, trainY, testX, testY, trainFeat, testFeat




# SHAP: get_explainer() 

# (NOTE: Have not included KNN, SVM, ANN yet)
* if algorithm name matches ['list model names'], create explainers
* return explainer based on given model from parameter



# Types of SHAP Explainers 

**Source: https://shap-lrjball.readthedocs.io/en/latest/api.html#core-explainers**

**.Explainer()**
* Uses Shapley values to explain any machine learning model or python function.
* This is the primary explainer interface for the SHAP library
* It takes any combination of a model and masker and returns a callable subclass object that implements the particular estimation algorithm that was chosen.
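
To make the idea of Shapley values for an arbitrary Python function concrete, here is a brute-force sketch. It is not how the SHAP library estimates values: it enumerates every feature subset (only feasible for a handful of features) and, as a simplifying assumption, treats a "missing" feature as replaced by a single baseline value rather than by a background dataset. The names `exact_shapley` and `model` are hypothetical.

```python
from itertools import combinations
from math import factorial

def exact_shapley(f, x, baseline):
    """Exact Shapley values of f at x, with 'missing' features set to baseline."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for subset in combinations(others, size):
                # Shapley weight for a coalition of this size
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                with_i = [x[j] if (j in subset or j == i) else baseline[j] for j in range(n)]
                without_i = [x[j] if j in subset else baseline[j] for j in range(n)]
                phi[i] += weight * (f(with_i) - f(without_i))
    return phi

# Toy model with one additive term and one interaction term
def model(v):
    return 2 * v[0] + v[1] * v[2]

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = exact_shapley(model, x, baseline)
print(phi)
```

The values sum to `model(x) - model(baseline)`, and the interaction term is split equally between the two features involved.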

**.TreeExplainer()**
* Uses Tree SHAP algorithms to explain the output of ensemble tree models.
* Tree SHAP is a fast and exact method to estimate SHAP values for tree models and ensembles of trees, under several different possible assumptions about feature dependence. 
* It depends on fast C++ implementations either inside an external model package or in the local compiled C extension.

**.LinearExplainer()**
* Computes SHAP values for a linear model, optionally accounting for inter-feature correlations.
* This computes the SHAP values for a linear model and can account for the correlations among the input features.
* Assuming features are independent leads to interventional SHAP values which for a linear model are coef[i] * (x[i] - X.mean(0)[i]) for the ith feature. 
* If instead we account for correlations then we prevent any problems arising from collinearity and share credit among correlated features.
* Accounting for correlations can be computationally challenging, but LinearExplainer uses sampling to estimate a transform that can then be applied to explain any prediction of the model.
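
The independent-features formula quoted above, `coef[i] * (x[i] - X.mean(0)[i])`, can be checked by hand with NumPy. This is a sketch on synthetic data with made-up coefficients (both are assumptions) and does not call `shap.LinearExplainer` itself.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))        # hypothetical background (training) data
coef = np.array([1.5, -2.0, 0.5])    # hypothetical linear model coefficients
intercept = 0.25

def predict(data):
    """Linear model output for one row or a batch of rows."""
    return data @ coef + intercept

x = X[0]

# Interventional SHAP values for a linear model: coef[i] * (x[i] - X.mean(0)[i])
shap_vals = coef * (x - X.mean(0))

# Additivity: expected value + sum of SHAP values equals the model output for x
expected_value = predict(X).mean()
print(np.isclose(expected_value + shap_vals.sum(), predict(x)))
```

The additivity check is a useful sanity test whenever SHAP values and an expected value are combined in a plot.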

**.KernelExplainer()**
* Uses the Kernel SHAP method to explain the output of any function.
* Kernel SHAP is a method that uses a special weighted linear regression to compute the importance of each feature. 
* The computed importance values are Shapley values from game theory and also coefficients from a local linear regression.
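
The "special weighted linear regression" relies on a particular weight for each feature coalition. A small sketch of that weighting, following the Shapley kernel from the Kernel SHAP formulation by Lundberg and Lee; the function name `shapley_kernel_weight` is hypothetical and this is not the library's internal code.

```python
from math import comb

def shapley_kernel_weight(M, s):
    """Kernel SHAP weight for a coalition of size s out of M features."""
    if s == 0 or s == M:
        # Empty and full coalitions get infinite weight (treated as hard constraints)
        return float("inf")
    return (M - 1) / (comb(M, s) * s * (M - s))

M = 4
weights = {s: shapley_kernel_weight(M, s) for s in range(1, M)}
print(weights)
```

Very small and very large coalitions receive the largest weights, since they isolate the effect of individual features most directly.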

In [6]:
def get_explainer(model, abbrev, trainX):
    
    '''Pass loaded model and abbrev to match appropriate SHAP explainer.
    
       Must always use training dataset as background data in order to 
       evaluate SHAP values for either testing (usually) or training set'''
    
    explainer = None
    trained_model = model
    

    if abbrev in ["NB"]: 
        explainer = shap.Explainer(trained_model.predict, trainX)
        

        # dont use model.predict for Linear Explainer (only for Explainer using Naive Bayes) 
        # ^^^ You get a class method error when creating shap plots and values
    if abbrev in ["LR"]:
        explainer = shap.LinearExplainer(trained_model, trainX)
            
    # not tested yet with LGB, CGB
    if abbrev in ['DT', 'RF', "XGB", "LGB","CGB"]:
        explainer = shap.TreeExplainer(trained_model)
        
    # not tested yet with KNN, SVM, ANN
    if abbrev in ['KNN','SVM','ANN']:
#         explainer = shap.KernelExplainer(trained_model.predict, trainX)
        explainer = shap.KernelExplainer(trained_model.predict_proba, trainX) # SVM seen with this
            
            

    return explainer




# SHAP: compute_shapValues()

# (NOTE: Have not included KNN, SVM, ANN yet)


**NOTES**
* Parameter 'X' in this context refers to whichever training or testing dataset is passed in from the whole run below
* As mentioned earlier, the default run uses the training dataset as background data and creates shap values using the testing data
* The same follows for feature_names --> either train_feat or test_feat (default) will be passed

In [7]:
def compute_shapValues(model, abbrev, explainer, X):
    
    '''This method calculates Shapley values for X with the given explainer and returns them,
       so they can later be stored as a Pandas DataFrame for conversion to csv file (will be called by shap_summary)
    '''
  
    max_evals = max(500, (2 * len(X)) + 1)   # optional: number of permutations for shap.Explainer() (currently unused)
    shap_values = None

    if abbrev in ['NB']:
        shap_values= explainer(X)  # permutation object cannot use .expected_value function like LR
        print(shap_values)
            
            
    # not tested yet for KNN, SVM, ANN
    # SVM and KNN can use explainer.expected_value[0] when creating force plots
    if abbrev in ['LR', 'KNN', 'SVM', 'ANN']:
        shap_values = explainer.shap_values(X)
        print(shap_values)
            
            
    # i think shap_values() only works for TreeExplainer and LinearExplainer...Explainer for NB is considered a
    #       permutation object 
    # not tested yet for LGB, CGB
    if abbrev in ['DT','RF', "XGB", "LGB","CGB"]:
        shap_values = explainer.shap_values(X, approximate=False, check_additivity=False)
        print(shap_values)
            
   
        
    return shap_values 
    

**Able to confirm that shap_values are calculated in order based on data features passed in**

**shap_value feature importance method just orders features & values based on importance but feature:value remains the same**
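
A small check of the two observations above, using hypothetical feature names and SHAP values: sorting by importance reorders the features, but each name stays paired with its own score as long as both are indexed with the same `argsort` result.

```python
import numpy as np

feature_names = ["Age", "Albumin", "Platelets"]  # hypothetical feature names

# Hypothetical SHAP values, columns in the same order as feature_names
shap_values = np.array([[0.1, -0.5, 0.0],
                        [-0.3, 0.7, 0.2]])

mean_abs = np.mean(np.abs(shap_values), axis=0)  # one score per column
order = np.argsort(mean_abs)[::-1]               # column indices, most important first

# Indexing names and scores with the same index array keeps each pair together
ranking = [(feature_names[i], float(mean_abs[i])) for i in order]
print(ranking)
```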

# SHAP: shap_summary()


# (NOTE: Have not included KNN, SVM, ANN yet)



# Plot Types for SHAP v0.41.0

**Waterfall** 
* Plots an explanation of a single prediction as a waterfall plot

**Summary (type: violin & bar)**
* Summary plots of SHAP values across a whole dataset

**Dependence**
* Plots the value of the feature on the x-axis and the SHAP value of the same feature on the y-axis
* This shows how the model depends on the given feature, and is like a richer extension of the classical partial dependence plots. 
* Vertical dispersion of the data points represents interaction effects.
* Grey ticks along the y-axis are data points where the feature’s value was NaN.

**Force**
* Visualize cumulative SHAP values with an additive force layout.

**Beeswarm**
* Summary plots of SHAP values across a whole dataset
* Designed to display an information-dense summary of how the top features in a dataset impact the model’s output.

In [8]:
def shap_summary(abbrev, feature_names, shap_values, explainer, X, cvCount, save_path, dataset):
    '''Retrieve shap_values from previous method;
            this method will return and display different types of shap plots
            
        Figures for each model CV are saved as png files which will be merged into a 
            final summary report for each model
    '''
    
    # Make folder in shapFigures for all summary plots generated from this function
    # helps to differentiate summary plots from user-chosen force_plots
    # save_path is used as a parameter for saving shap figure for each model
    if not os.path.exists(save_path+'/SummaryPlots'):
        os.mkdir(save_path+'/SummaryPlots')
    
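    # picks a random row index of X (e.g. between 0 and 54 when the set has 55 instances)
    #         as a random single prediction for the single-prediction plots
    # len(X) is used because shap_values is a per-class list for multiclass-output models such as DT
    random_single_predict = random.randint(0, len(X)-1)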

    # checks algorithm in given list to execute shap summaries
    if abbrev in ["NB"]:
        
        print(f'Saving Summary Plot for SHAP Values in Class 0 & 1 in {dataset} Set...')
        shap.summary_plot(shap_values, X, feature_names, plot_type='violin', show=False)
        plt.savefig(f'{save_path}/SummaryPlots/{abbrev}_{str(cvCount)}_shapSummaryViolinPlot.png', bbox_inches='tight')
        plt.close()
        
        print(f'Saving Summary Plot for SHAP Values in Class 0 & 1 in {dataset} Set...')
        shap.summary_plot(shap_values, X, feature_names, plot_type='bar', show=False)
        plt.savefig(f'{save_path}/SummaryPlots/{abbrev}_{str(cvCount)}_shapSummaryBarPlot.png', bbox_inches='tight')
        plt.close()
            # print('SHAP Bar Plot for Summary Plot for SHAP Values in Class 0 & 1 in Test Set:\n')
#         shap.plots._bar.bar_legacy(shap_values, feature_names, show=True) # doesnt work but should for this...attribute error

        print(f'Saving SHAP Beeswarm Plot for Top 5 SHAP Values in Class 0 & 1 in {dataset} Set...')
        shap.plots.beeswarm(shap_values, max_display=5, show=False)  #max_display allows user to choose # of features to display
        plt.savefig(f'{save_path}/SummaryPlots/{abbrev}_{str(cvCount)}_beeswarmPlot.png', bbox_inches='tight')
        plt.close()
        
#         print('Waterfall Plot for SHAP Values in Class 0 in Test Set: \n')
#         shap.plots._waterfall.waterfall_legacy(explainer.expected_value[0], shap_values[0][0], testX.iloc[0], testFeat, show=True) # should work for this model



        # scatter, bar, waterfall, beeswarm plots should work for this model 
        # waterfall plot also doesnt work...i get "AttributeError: 'numpy.ndarray' object has no attribute 'base_values'"
        # Bar plot should work for this model if using .Explainer() and shap_values = explainer(data)--> 
        #     not explainer.shap_values
    elif abbrev in ["LR", 'XGB']:
        
        expected_value = explainer.expected_value
        print(f'Expected value for {abbrev}: {expected_value}')
        
        print(f'Saving Summary Plot for SHAP Values in {dataset} Set...') 
        shap.summary_plot(shap_values, X, feature_names, plot_type='violin', show=False)
        plt.savefig(f'{save_path}/SummaryPlots/{abbrev}_{str(cvCount)}_shapSummaryViolinPlot.png', bbox_inches='tight')
        plt.close()
        
        print(f'Saving SHAP Bar Plot for SHAP Values {dataset} Set...') 
        shap.summary_plot(shap_values, X, feature_names, plot_type="bar", show=False)
        plt.savefig(f'{save_path}/SummaryPlots/{abbrev}_{str(cvCount)}_shapSummaryBarPlot.png', bbox_inches='tight')
        plt.close()

        
        print(f'Saving SHAP Decision Plot for SHAP Values in {dataset} Set...')  
        shap.decision_plot(expected_value, shap_values, feature_names, show=False)
        plt.savefig(f'{save_path}/SummaryPlots/{abbrev}_{str(cvCount)}_shapDecisionPlot.png', bbox_inches='tight')
        plt.close()


        print(f'Saving SHAP Decision Plot for Single-Prediction in {dataset} Set...')
        shap.decision_plot(expected_value, shap_values[random_single_predict], feature_names, show=False)  # was hardcoded to row 54
        plt.savefig(f'{save_path}/SummaryPlots/{abbrev}_{str(cvCount)}_shapDecisionPlot_singlePredict.png', bbox_inches='tight')
        plt.close()
        
        
        print(f'Saving Waterfall Plot for SHAP Values for a Single-Prediction in {dataset} Set...')
        shap.plots._waterfall.waterfall_legacy(explainer.expected_value, shap_values[random_single_predict], X.iloc[random_single_predict], feature_names, show=False)  # use the X passed in, not the global testX
        plt.savefig(f'{save_path}/SummaryPlots/{abbrev}_{str(cvCount)}_WaterfallPlot_singlePredict.png', bbox_inches='tight')
        plt.close()
         
            
#         shap.plots._bar.bar_legacy(expected_value, shap_values[10], testX[10], feature_names, show=True)


            # waterfall plot works for DT() if it uses .Explainer() and shap_vales = explainer(data)
            # instead of using TreeExplainer but other plots listed here work 
    elif abbrev in ['DT', 'RF', 'LGB','CGB']:
        expected_value = explainer.expected_value
        print(f'Expected value for {abbrev}: {expected_value}')

        print(f'Saving Bar Summary Plot for SHAP Values in Class 0 & 1 in {dataset} Set...')
            #       #tree.tree_plot(testX)  ---> helps display Decision Tree
        shap.summary_plot(shap_values, X, feature_names, plot_type='bar', class_names=['0', '1'], show=False)
        plt.savefig(f'{save_path}/SummaryPlots/{abbrev}_{str(cvCount)}_shapSummaryBarPlot.png', bbox_inches='tight')
        plt.close()

        print(f'Saving Decision Plot for SHAP Values from Class 0 in {dataset} Set...')
        shap.decision_plot(expected_value[0], shap_values[0], feature_names=feature_names, show=False)
        plt.savefig(f'{save_path}/SummaryPlots/{abbrev}_{str(cvCount)}_DecisionPlot_Class0.png', bbox_inches='tight')
        plt.close()

        print(f'Saving Decision Plot for SHAP Values from Class 1 in {dataset} Set...')
        shap.decision_plot(expected_value[1], shap_values[1], feature_names=feature_names, show=False)
        plt.savefig(f'{save_path}/SummaryPlots/{abbrev}_{str(cvCount)}_DecisionPlot_Class1.png', bbox_inches='tight')
        plt.close()
        
        print(f'Saving Waterfall Plot for SHAP Values from Class 0 in {dataset} Set...')
        shap.plots._waterfall.waterfall_legacy(explainer.expected_value[0], shap_values[0][random_single_predict], X.iloc[random_single_predict], feature_names, show=False)  # .iloc selects the row of the X passed in
        plt.savefig(f'{save_path}/SummaryPlots/{abbrev}_{str(cvCount)}_WaterfallPlot_Class0_singlePredict.png', bbox_inches='tight')
        plt.close()
        
        print(f'Saving Waterfall Plot for SHAP Values from Class 1 in {dataset} Set...')
        shap.plots._waterfall.waterfall_legacy(explainer.expected_value[1], shap_values[1][random_single_predict], X.iloc[random_single_predict], feature_names, show=False)  # .iloc selects the row of the X passed in
        plt.savefig(f'{save_path}/SummaryPlots/{abbrev}_{str(cvCount)}_WaterfallPlot_Class1_singlePredict.png', bbox_inches='tight')
        plt.close()
        
    

# SHAP: run_force_plots()
# (NOTE: Have not included KNN, SVM, ANN yet)


**OPTIONAL**
* Only runs when user sets 'run_force=True' (default is run_force = False)


In [9]:
def run_force_plots(abbrev, explainer, shap_values, X, feature_names, cvCount, save_path, dataset):
    '''This method is optional but will save force plots for each given trained model per CV
        
        By default, run_force_plots=False but can be run if user sets run_force_plots=True in 
        the 'Run' cell at the bottom of the notebook'''
    
    # Make folder in shapFigures called ForcePlots to organize user-chosen figures
    # save_path is used as a parameter for saving shap figures for each model
    
    if not os.path.exists(save_path+'/ForcePlots'):
        os.mkdir(save_path+'/ForcePlots')
        
    # generates a random index (e.g. between 0 and 54 when shap_values has length 55)
    #         as a random single prediction for force plots
    random_single_predict = random.randint(0, len(shap_values)-1)
    
    
    if abbrev in ['NB']:
        print(f'Saving Force Plot for {abbrev} SHAP Values in {dataset} Set...\n')
        shap.force_plot(shap_values[random_single_predict], X, feature_names=feature_names, matplotlib=True, show=False)
        plt.title(f'{abbrev}{cvCount} Force Plot for Single Prediction Classification 0 or 1')
        plt.savefig(f'{save_path}/ForcePlots/{abbrev}{str(cvCount)}_singlePredictFP.png', dpi=600, bbox_inches='tight')
        plt.close()
        
        
    # can only return plot when running code, png can be saved but plots don't show up
    # matplotlib doesn't support output for multiple samples just like DT and RF
    elif abbrev in ['LR', 'XGB']: # does not allow .savefig save force plot for multiple samples
        print(f'\nDisplaying Force Plot for {abbrev} SHAP Values in  Whole {dataset} Set...')
        return shap.force_plot(explainer.expected_value, shap_values, feature_names, show=True)
#         plt.savefig(save_path+abbrev+'_'+str(cvCount)+'ForcePlot.png', bbox_inches='tight')
#         plt.close()
        
    # Decision Tree has multiclass output, so each class needs its own function call
    # Decision Tree doesn't work when just using shap_values as a parameter
    # the plots can only be shown when running the code; a png can be saved but the plot doesn't show up in it
    # matplotlib doesn't support output for multiple samples, just like LR and XGB
    else:
        # display() is used instead of return so that the Class 1 plot below is also reached
        print(f'\nDisplaying Force Plot for {abbrev} SHAP Values from Class 0 in {dataset} Set...')
        display(shap.force_plot(explainer.expected_value[0], shap_values[0], feature_names, show=True))
#         plt.savefig(f'{save_path}/ForcePlots/{abbrev}{str(cvCount)}_singlePredictFP.png', dpi=600, bbox_inches='tight')
#         plt.close()
                
        print(f'\nDisplaying Force Plot for {abbrev} SHAP Values from Class 1 in {dataset} Set...')
        display(shap.force_plot(explainer.expected_value[1], shap_values[1], feature_names, show=True))
#         plt.savefig(save_path+abbrev+'_'+str(cvCount)+'singlePredictFP.png', bbox_inches='tight')
#         plt.close()
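
The single-prediction force plot above depends on a randomly chosen row, so the saved figure can differ between runs. A minimal sketch of one way to make that choice repeatable, assuming a seed is acceptable for the analysis; `pick_sample_index` and its `seed` argument are hypothetical names and not part of this notebook or of SHAP:

```python
import random

def pick_sample_index(n_samples, seed=None):
    """Return a random row index in [0, n_samples - 1]; pass a seed to make it repeatable."""
    if n_samples < 1:
        raise ValueError("Need at least one sample to pick from.")
    rng = random.Random(seed)  # local generator, leaves the global random state untouched
    return rng.randrange(n_samples)

# Two calls with the same (hypothetical) seed select the same row
idx_a = pick_sample_index(55, seed=42)
idx_b = pick_sample_index(55, seed=42)
```

The returned index could then be used in place of `random_single_predict` if the same prediction should be plotted on every run.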
        

# SHAP: shap_feature_ranking()
# (NOTE: Have not included KNN, SVM, ANN yet)


**Calculating feature importance for each trained model per CV**

In [10]:
def shap_feature_ranking(abbrev, shap_values, X, feature_names):  # 'X' and 'feature_names' come from whichever set (test or train) is passed
        
    '''Calculate the average of the absolute SHAP values for each feature and use it to show 
       which features were the most important when making a prediction'''
    
    
    if abbrev in ['NB']:
        # NB SHAP values are read from the .values attribute
        shap_means = np.mean(np.abs(shap_values.values), axis=0)
        feature_order = np.argsort(shap_means)[::-1]  # most important feature first
        df = pd.DataFrame({"Features": [feature_names[i] for i in feature_order],"Importance": [shap_means[i] for i in feature_order]})
        
    # LR can't use shap_values.values
    elif abbrev in ['LR','LGB', 'XGB', 'CGB']: 
        shap_means = np.mean(np.abs(shap_values), axis=0)
        feature_order = np.argsort(shap_means)[::-1]  # most important feature first
        df = pd.DataFrame({"Features": [feature_names[i] for i in feature_order],"Importance": [shap_means[i] for i in feature_order]})
    
        
    else: # For models whose SHAP values come as one array per class (can be used for NB): loops through Class 0 and Class 1
          # Sums the mean absolute SHAP values from both classes to get the SHAP average for the whole CV for the model
          # The solution for the 'else' statement was found on StackOverflow 
        
        c_idxs = []
        columns = feature_names
        for column in range(0, (len(columns))): 
            if isinstance(shap_values, list): 
                c_idxs.append(X.columns.get_loc(column))
                means = [np.abs(shap_values[class_][:, c_idxs]).mean(axis=0) for class_ in range(len(shap_values))] 
                shap_means = np.sum(np.column_stack(means), 1)
            else:                               # Else there is only one 2D array of shap values
                assert len(shap_values.shape) == 2, 'Expected two-dimensional shap values array.'
                shap_means = np.abs(shap_values).mean(axis=0)
        df = pd.DataFrame({'Features': feature_names, 'Importance': shap_means}).sort_values(by='Importance', ascending=False).reset_index(drop=True)
        df.index += 1
        
    return df, shap_means
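
The three branches above differ mainly in the shape of `shap_values`: an object with a `.values` attribute, a single 2D array, or a list with one array per class. A minimal sketch of how the mean absolute SHAP value per feature could be computed for all three shapes in one place. `mean_abs_shap` and `rank_features` are hypothetical helpers, and the toy numbers are made up for illustration:

```python
import numpy as np

def mean_abs_shap(shap_values):
    """Return one mean |SHAP| value per feature.

    Accepts a 2D array (samples x features), a list of such arrays
    (one per class), or any object exposing a 2D `.values` attribute.
    """
    if isinstance(shap_values, list):
        # Per-class arrays: average |SHAP| within each class, then sum over classes
        return np.sum([np.abs(np.asarray(v)).mean(axis=0) for v in shap_values], axis=0)
    values = np.asarray(getattr(shap_values, "values", shap_values))
    if values.ndim != 2:
        raise ValueError("Expected a 2D (samples x features) array of SHAP values.")
    return np.abs(values).mean(axis=0)

def rank_features(feature_names, importances):
    """Pair feature names with importances, most important first."""
    order = np.argsort(importances)[::-1]
    return [(feature_names[i], float(importances[i])) for i in order]

# Toy SHAP values for 3 samples and 3 hypothetical features
toy = np.array([[0.1, -0.4, 0.0],
                [-0.3, 0.2, 0.0],
                [0.2, -0.6, 0.0]])
ranking = rank_features(["a", "b", "c"], mean_abs_shap(toy))
```

Summing the per-class means mirrors what the `else` branch does for two-class tree models; whether that sum or a single class is the better importance measure is a modelling choice.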

# SHAP: save_shap()
# (NOTE: Have not included KNN, SVM, ANN yet)


**Function that saves shap values using shap_means taken from shap_feature_ranking() and saves all values from all CVs into one master list per trained model**

In [11]:
def save_shap(abbrev, shap_values, shap_means, original_headers, cvCount, dataset, each): # 'shap_means' is the array of mean absolute SHAP values returned from shap_feature_ranking()
    '''Create a new dataframe that stores the model's absolute mean SHAP feature importance values over each CV
            and combines with features from original dataset later on
            
        temp_list[] is returned as an array so that each CV iterated here can be stored into FI_all in the Run 
            cell of the Notebook''' 
    
    temp_list = []  
    shap_vals = []
    
    headers = pd.read_csv(f'{experiment_path}/{each}/CVDatasets/{each}_CV_{cvCount}_{dataset}.csv').columns.values.tolist()

    if instance_label != 'None':
        headers.remove(instance_label)
    headers.remove(class_label)

    
    shap_vals = np.array(shap_means)
    
    for name in original_headers:
        if name in headers:
            index = headers.index(name)
            print(f'Checking for matches...{name} is {name}')
            print(f'{name} value is {shap_vals[index]}\n')
            temp_list.append(shap_vals[index])
        else:
            temp_list.append(0.0)


    
    return np.asarray(temp_list)
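
The alignment step in `save_shap()` can be illustrated on its own: every CV partition may keep a different subset of features, so each partition's values are placed under the original headers and missing features are filled with 0.0. A minimal sketch, assuming header names are unique within a partition; `align_to_headers` and the example values are hypothetical:

```python
def align_to_headers(original_headers, cv_headers, cv_values, fill=0.0):
    """Reorder per-CV values to match original_headers; absent features get `fill`."""
    lookup = dict(zip(cv_headers, cv_values))  # feature name -> mean |SHAP| in this CV
    return [lookup.get(name, fill) for name in original_headers]

original = ["Gender", "Symptoms", "Alcohol"]  # hypothetical full feature list
cv_rows = [
    (["Symptoms", "Alcohol"], [0.06, 0.007]),                  # partition without 'Gender'
    (["Alcohol", "Gender", "Symptoms"], [0.002, 0.01, 0.2]),   # same features, different order
]

# One row per CV partition, one column per original feature
master = [align_to_headers(original, headers, values) for headers, values in cv_rows]
```

Each row of `master` corresponds to what `save_shap()` returns for one partition, and the stacked rows correspond to the master list built in the Run cell.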
        
  

# Unit Testing Cell

In [12]:
# # # # # ^^^^^
# dr = pd.DataFrame(columns=original_headers)
# dr.head()
# dr2 = pd.DataFrame()
# dr2.head()
# if not os.path.exists(full_path+'/model_evaluation/shap_values/testResults/shapFigures'):
#         os.mkdir(full_path+'/model_evaluation/shap_values/testResults/shapFigures')
# save_path = full_path + '/model_evaluation/shap_values/testResults/shapFigures'


# FI_all = []
# shap_vals = []
# temp_df = pd.DataFrame()
    
    

# for cvCount in range(0, cv_partitions):
    
    
# result_file = experiment_path+ '/hcc-data_example/models/pickledModels/NB_0.pickle'
# file = open(result_file, 'rb')
# model = pickle.load(file)
# file.close()
# print('\nChecking if correct model is loaded...\n', model)

# train_path = experiment_path +  '/hcc-data_example/CVDatasets/hcc-data_example_CV_0_Train.csv'
# test_path = experiment_path +  '/hcc-data_example/CVDatasets/hcc-data_example_CV_0_Test.csv'
# trainX, trainY,testX, testY, trainFeat, testFeat = dataPrep(train_path,instance_label,class_label, test_path)


# explainer = get_explainer(model, 'NB', trainX)
    
# shap_values = compute_shapValues(model, 'NB', explainer, testX)
#     shap_fi_df, shap_means = shap_feature_ranking('LR', shap_values, testX, testFeat)
# #     display(shap_fi_df)
#     shapFI_path = (f"{filepath}{abbrev[algorithm]}_{str(cvCount)}_shapFIValues_Test.csv")
#     shap_fi_df.to_csv(shapFI_path, header=True, index=True)
    
    
#     headers = pd.read_csv(f'{experiment_path}/hcc-data_example/CVDatasets/hcc-data_example_CV_{str(cvCount)}_Test.csv').columns.values.tolist()
#     if instance_label != 'None':
#         headers.remove(instance_label)
#     headers.remove(class_label)
    
#     temp_list = []  

#     vals = np.mean(np.abs(shap_values), axis=0)
#     shap_vals = np.array(vals)
#     for name in original_headers:
#         if name in headers:
#             index = headers.index(name)
# #             print(f'Checking for matches...{name} is {name}')
# #     #                    print(f'Index is {index}')
# #             print(f'Shap value is {shap_vals[index]}\n')
#         #                print(f'{name} value is {shap_vals[index]}')
#             temp_list.append(shap_vals[index])

#         else:                
#             temp_list.append(0.0)
#     FI_all.append(temp_list)
    
#     del temp_list

# print(np.asarray(FI_all))
# df = pd.DataFrame(np.asarray(FI_all), columns=[original_headers])
# display(df)
#     
# #     print(np.mean(np.abs(shap_values.values)))
# shap_summary('NB', testFeat,shap_values, explainer, testX, 0, save_path, 'Test')
#     run_force_plots('DT', explainer, shap_values, testX, testFeat, cvCount, save_path)


# Run SHAP for Testing Datasets

**Loop through each hcc_demo dataset to unpickle and load trained models to create Shapley values and plots**
**Default run**
* The default setting computes the explainer and SHAP values for the TESTING datasets for each model and CV
* The user has the option below to run the loop for the training sets as well
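
The run cell creates its output folders one level at a time with `os.mkdir`, which fails if a parent folder is missing. A minimal sketch of an alternative that creates the whole folder chain in one call, assuming the same `model_evaluation/shap_values/<split>/shapFigures` layout; `make_shap_dirs` is a hypothetical helper and the temporary directory only stands in for the experiment folder:

```python
import os
import tempfile

def make_shap_dirs(dataset_path, split="testResults"):
    """Create <dataset_path>/model_evaluation/shap_values/<split>/shapFigures and return both paths."""
    results_dir = os.path.join(dataset_path, "model_evaluation", "shap_values", split)
    figures_dir = os.path.join(results_dir, "shapFigures")
    os.makedirs(figures_dir, exist_ok=True)  # also creates any missing parent folders
    return results_dir, figures_dir

with tempfile.TemporaryDirectory() as tmp:
    dataset_path = os.path.join(tmp, "hcc-data_example")
    results_dir, figures_dir = make_shap_dirs(dataset_path)
    created = os.path.isdir(figures_dir)
    # Calling it again does not raise, because exist_ok=True
    make_shap_dirs(dataset_path)
```

Using `os.path.join` also avoids the doubled leading slash that string concatenation can produce when the experiment path is already absolute.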

In [13]:
# testing all methods
run_force = True # controls whether run_force_plots() is called; set to True if user wants to display force plots for trained models
run_test = True


if run_test == True:
    for each in datasets:
        print("---------------------------------------")
        print(each)
        print("---------------------------------------")
        full_path = experiment_path+'/' + each
        filepath = f"/{full_path}/model_evaluation/shap_values/testResults/" #path to save SHAP FI value results
        
       
        
        #Make folder in experiment folder/datafolder to store all shap_values per algorithm/CV combination
        if not os.path.exists(full_path+'/model_evaluation/shap_values/testResults'):
            os.mkdir(full_path+'/model_evaluation/shap_values/testResults')

        if not os.path.exists(full_path+'/model_evaluation/shap_values/testResults/shapFigures'):
            os.mkdir(full_path+'/model_evaluation/shap_values/testResults/shapFigures')
        save_path = full_path + '/model_evaluation/shap_values/testResults/shapFigures'

        for algorithm in algorithms: #loop through algorithms
            print("---------------------------------------")
            print(abbrev[algorithm])
            print("---------------------------------------")
            
            if not os.path.exists(f'{experiment_path}/{each}/model_evaluation/shap_values/testResults/'):
                os.mkdir(f'{experiment_path}/{each}/model_evaluation/shap_values/testResults/')
            file_path = (f'{experiment_path}/{each}/model_evaluation/shap_values/testResults/{abbrev[algorithm]}_shapMasterList.csv')
     
            FI_all = []  # list to store feature importances to create the SHAP values master list
        
            for cvCount in range(0,cv_partitions): #loop through cv's
                print(f"{abbrev[algorithm]}{cvCount} In CV{cvCount}...")

                # unpickle and load model
                result_file = f"/{full_path}/models/pickledModels/{abbrev[algorithm]}_{str(cvCount)}.pickle"
                with open(result_file, 'rb') as file:  # the file is closed automatically, even on error
                    model = pickle.load(file)
                print(f'\nChecking if correct model is loaded...\n{model}')


                # Load CV datasets, paths to datasets updates with each iteration
                train_path = f"/{experiment_path}/{each}/CVDatasets/{each}_CV_{str(cvCount)}_Train.csv"
                test_path =f"/{experiment_path}/{each}/CVDatasets/{each}_CV_{str(cvCount)}_Test.csv"
                trainX, trainY,testX, testY, trainFeat, testFeat = dataPrep(train_path,instance_label,class_label, test_path)
#                 print(trainX)

                # shap computation and plots
                # Sanity check: print explainer to check if explainer exists
                explainer = get_explainer(model, abbrev[algorithm], trainX)  #explainer must always use training set
                print(f"\nChecking explainer for {abbrev[algorithm]}{cvCount}...\n{explainer}\n")  

                print(f"Checking shap values for {abbrev[algorithm]}{cvCount}...\n")               
                shap_values = compute_shapValues(model, abbrev[algorithm], explainer, testX)

                print(f"\nGenerating SHAP plots for {abbrev[algorithm]}{cvCount}...\n")
                shap_summary(abbrev[algorithm], testFeat, shap_values, explainer, testX, cvCount, save_path, 'Test')
                
                #save SHAP FI results for each model per CV 
                print('Saving feature importance ranking for {}{}...\n'.format(abbrev[algorithm], cvCount))
                shap_fi_df, shap_means = shap_feature_ranking(abbrev[algorithm], shap_values, testX, testFeat) # can either choose to pass testX or trainX
                print(f'shap means for {abbrev[algorithm]} in CV{cvCount}...\n{shap_means}\n')
                shapFI_path = (f"{filepath}{abbrev[algorithm]}_{str(cvCount)}_shapFIValues_Test.csv")
                shap_fi_df.to_csv(shapFI_path, header=True, index=True)
                
                
                # OPTIONAL: set run_force to True to save/display force plot figures
                # (no 'continue' here, so the master list below is built even when force plots are skipped)
                if run_force:
                    run_force_plots(abbrev[algorithm], explainer, shap_values, testX, testFeat, cvCount, save_path, 'Test')
                
                
                # create masterList of mean SHAP Values taken from shap_feature_ranking() for each model
                save = save_shap(abbrev[algorithm],shap_values, shap_means, original_headers, cvCount, 'Test', each)
                FI_all.append(save)
                print("------------------------------------------------------------------------------------------")
            # create master list for each model after looping through all CVs
            # (one row per CV partition, one column per original feature)
            df = pd.DataFrame(np.asarray(FI_all), columns=original_headers)
            display(df)
            
            path = (f'{filepath}{abbrev[algorithm]}_shapMasterList.csv') 
            df.to_csv(path, index=False)
           



---------------------------------------
hcc-data_example
---------------------------------------
---------------------------------------
NB
---------------------------------------
NB0 In CV0...

Checking if correct model is loaded...
GaussianNB()

Checking explainer for NB0...
shap.explainers.Permutation()

Checking shap values for NB0...

.values =
array([[ 5.83333333e-03, -2.75000000e-02, -1.83333333e-02, ...,
        -8.33333333e-04,  8.33333333e-03, -2.08333333e-02],
       [ 0.00000000e+00,  1.33333333e-02,  4.41666667e-02, ...,
         0.00000000e+00,  2.91666667e-02, -2.91666667e-02],
       [ 7.50000000e-03,  4.83333333e-02,  2.83333333e-02, ...,
         2.16666667e-02,  2.41666667e-02,  1.66666667e-03],
       ...,
       [ 3.75856753e-18, -5.00000000e-03, -1.50000000e-02, ...,
         0.00000000e+00,  3.33333333e-03, -3.33333333e-02],
       [-5.00000000e-03, -5.83333333e-03, -1.16666667e-02, ...,
         0.00000000e+00, -2.08333333e-02, -2.33333333e-02],
       [ 1.00000

Saving Summary Plot for SHAP Values in Class 0 & 1 in Test Set...
Saving SHAP Beeswarm Plot for Top 5 SHAP Values in Class 0 & 1 in Test Set...
Saving feature importance ranking for NB1...

shap means for NB in CV1...
[0.02048485 0.016      0.04098485 0.00659091 0.046      0.02551515
 0.0509697  0.00945455 0.         0.0035303  0.00990909 0.02204545
 0.0965303  0.00322727 0.01815152 0.00554545 0.02406061 0.03557576
 0.00498485 0.01537879 0.00242424 0.00760606 0.04104545 0.02640909
 0.00018182 0.02681818 0.01293939 0.01506061 0.06836364 0.00390909
 0.01209091 0.0205     0.01525758 0.06086364 0.02219697 0.0410303
 0.02730303 0.00586364 0.06130303 0.029      0.01259091]

Saving Force Plot for NB SHAP Values in Test Set...

Checking for matches...Symptoms  is Symptoms 
Symptoms  value is 0.061303030303030304

Checking for matches...Alcohol is Alcohol
Alcohol value is 0.006590909090909089

Checking for matches...Hepatitis B Surface Antigen is Hepatitis B Surface Antigen
Hepatitis B Surface 

Unnamed: 0,Gender,Symptoms,Alcohol,Hepatitis B Surface Antigen,Hepatitis B e Antigen,Hepatitis B Core Antibody,Hepatitis C Virus Antibody,Cirrhosis,Endemic Countries,Smoking,...,Gamma glutamyl transferase (U/L),Alkaline phosphatase (U/L),Total Proteins (g/dL),Creatinine (mg/dL),Number of Nodules,Major dimension of nodule (cm),Direct Bilirubin (mg/dL),Iron,Oxygen Saturation (%),Ferritin (ng/mL)
0,0.0,0.022424,0.0,0.013182,0.0,0.0,0.012212,0.008576,0.017561,0.002061,...,0.0115,0.012348,0.0,0.073515,0.004121,0.022076,0.034652,0.021076,0.014652,0.009515
1,0.0,0.061303,0.006591,0.015379,0.002424,0.004985,0.007606,0.0,0.09653,0.005864,...,0.005545,0.046,0.012591,0.00353,0.003909,0.012939,0.022045,0.026409,0.0205,0.018152
2,0.0,0.00103,0.000742,0.0,0.008818,0.004227,0.000879,0.0,0.000106,0.0,...,0.000879,0.002015,0.0,0.00047,0.001833,0.001227,0.055,0.001167,0.00053,0.017258


---------------------------------------
LR
---------------------------------------
LR0 In CV0...

Checking if correct model is loaded...
LogisticRegression(C=0.006606805070193189, dual=True,
                   max_iter=193.8544995971634, random_state=42,
                   solver='liblinear')

Checking explainer for LR0...
<shap.explainers._linear.Linear object at 0x7fcc18d93610>

Checking shap values for LR0...

[[-1.54247265e-04 -5.20008470e-02 -4.40485958e-02 ... -1.47969848e-03
   2.31157863e-02 -2.02239904e-02]
 [ 6.85778818e-04  1.94894821e-02  2.13041189e-01 ...  1.21066239e-03
   2.31157863e-02 -1.80842415e-02]
 [ 1.51216786e-04 -1.22839969e-02  9.42115197e-04 ...  1.21066239e-03
   2.31157863e-02 -2.23637392e-02]
 ...
 [ 1.44943883e-03 -2.81707337e-02 -5.11185666e-02 ...  1.21066239e-03
   2.31157863e-02 -1.73709948e-02]
 [ 2.02218390e-03 -1.22839969e-02 -3.82640794e-02 ...  1.21066239e-03
  -6.24982370e-02 -1.23782505e-02]
 [-3.83345247e-04  2.34611703e-02 -3.11941086e-02 ...

Checking for matches...Symptoms  is Symptoms 
Symptoms  value is 0.03886876657626088

Checking for matches...Hepatitis B Surface Antigen is Hepatitis B Surface Antigen
Hepatitis B Surface Antigen value is 0.01902674623276478

Checking for matches...Hepatitis B e Antigen is Hepatitis B e Antigen
Hepatitis B e Antigen value is 0.0

Checking for matches...Hepatitis C Virus Antibody is Hepatitis C Virus Antibody
Hepatitis C Virus Antibody value is 0.026838415026258885

Checking for matches...Cirrhosis is Cirrhosis
Cirrhosis value is 0.009670065204705305

Checking for matches...Endemic Countries is Endemic Countries
Endemic Countries value is 0.015877099981977692

Checking for matches...Smoking is Smoking
Smoking value is 0.0021584128121270303

Checking for matches...Diabetes is Diabetes
Diabetes value is 0.04411193743084139

Checking for matches...Obesity is Obesity
Obesity value is 0.018688002101128375

Checking for matches...Hemochromatosis is Hemochromatosis
Hemochromatosis value is 0.0

Checking for matches...Symptoms  is Symptoms 
Symptoms  value is 0.2269064171534283

Checking for matches...Alcohol is Alcohol
Alcohol value is 0.004920516143671562

Checking for matches...Hepatitis B Surface Antigen is Hepatitis B Surface Antigen
Hepatitis B Surface Antigen value is 0.011006526845122698

Checking for matches...Hepatitis B e Antigen is Hepatitis B e Antigen
Hepatitis B e Antigen value is 0.0114805528046631

Checking for matches...Hepatitis B Core Antibody is Hepatitis B Core Antibody
Hepatitis B Core Antibody value is 0.029008490739009542

Checking for matches...Hepatitis C Virus Antibody is Hepatitis C Virus Antibody
Hepatitis C Virus Antibody value is 0.13612007680249777

Checking for matches...Cirrhosis is Cirrhosis
Cirrhosis value is 0.019555993342348636

Checking for matches...Endemic Countries is Endemic Countries
Endemic Countries value is 0.12170538451100983

Checking for matches...Smoking is Smoking
Smoking value is 0.10977748998232592

Checking for matches...

Checking for matches...Symptoms  is Symptoms 
Symptoms  value is 0.0047654411054461375

Checking for matches...Alcohol is Alcohol
Alcohol value is 0.0022148213857691858

Checking for matches...Hepatitis B e Antigen is Hepatitis B e Antigen
Hepatitis B e Antigen value is 0.0009570139678777011

Checking for matches...Hepatitis B Core Antibody is Hepatitis B Core Antibody
Hepatitis B Core Antibody value is 0.0038090825853725127

Checking for matches...Hepatitis C Virus Antibody is Hepatitis C Virus Antibody
Hepatitis C Virus Antibody value is 0.0019152217497143459

Checking for matches...Endemic Countries is Endemic Countries
Endemic Countries value is 0.0003307068230837304

Checking for matches...Diabetes is Diabetes
Diabetes value is 0.005278817754709944

Checking for matches...Obesity is Obesity
Obesity value is 0.003759106278018618

Checking for matches...Arterial Hypertension is Arterial Hypertension
Arterial Hypertension value is 0.003919150479863269

Checking for matches...Chronic 

Unnamed: 0,Gender,Symptoms,Alcohol,Hepatitis B Surface Antigen,Hepatitis B e Antigen,Hepatitis B Core Antibody,Hepatitis C Virus Antibody,Cirrhosis,Endemic Countries,Smoking,...,Gamma glutamyl transferase (U/L),Alkaline phosphatase (U/L),Total Proteins (g/dL),Creatinine (mg/dL),Number of Nodules,Major dimension of nodule (cm),Direct Bilirubin (mg/dL),Iron,Oxygen Saturation (%),Ferritin (ng/mL)
0,0.0,0.038869,0.0,0.019027,0.0,0.0,0.026838,0.00967,0.015877,0.002158,...,0.027842,0.038658,0.0,0.036096,0.002635,0.04261,0.017843,0.005802,0.002473,0.035923
1,0.0,0.226906,0.004921,0.011007,0.011481,0.029008,0.13612,0.019556,0.121705,0.109777,...,0.054443,0.221572,0.025841,0.047948,0.017841,0.102912,0.077176,0.155098,0.009104,0.142575
2,0.0,0.004765,0.002215,0.0,0.000957,0.003809,0.001915,0.0,0.000331,0.0,...,0.003771,0.00777,0.0,0.000991,0.00569,0.005648,0.004723,0.005371,0.000419,0.007539


---------------------------------------
DT
---------------------------------------
DT0 In CV0...

Checking if correct model is loaded...
DecisionTreeClassifier(max_depth=17, min_samples_leaf=35, min_samples_split=45,
                       random_state=42)

Checking explainer for DT0...
<shap.explainers._tree.Tree object at 0x7fcbeb2cd640>

Checking shap values for DT0...

[array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]]), array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])]

Generating SHAP plots for DT0...

Expected value for DT: [0.57272727 0.42727273]
Saving Bar Summary Plot for SHAP Values in Class 0 & 1 in

Checking for matches...Symptoms  is Symptoms 
Symptoms  value is 0.0

Checking for matches...Hepatitis B Surface Antigen is Hepatitis B Surface Antigen
Hepatitis B Surface Antigen value is 0.0

Checking for matches...Hepatitis B e Antigen is Hepatitis B e Antigen
Hepatitis B e Antigen value is 0.0

Checking for matches...Hepatitis C Virus Antibody is Hepatitis C Virus Antibody
Hepatitis C Virus Antibody value is 0.0

Checking for matches...Cirrhosis is Cirrhosis
Cirrhosis value is 0.0

Checking for matches...Endemic Countries is Endemic Countries
Endemic Countries value is 0.0

Checking for matches...Smoking is Smoking
Smoking value is 0.0

Checking for matches...Diabetes is Diabetes
Diabetes value is 0.0

Checking for matches...Obesity is Obesity
Obesity value is 0.0

Checking for matches...Hemochromatosis is Hemochromatosis
Hemochromatosis value is 0.0

Checking for matches...Arterial Hypertension is Arterial Hypertension
Arterial Hypertension value is 0.0

Checking for matches...Chr

Checking for matches...Symptoms  is Symptoms 
Symptoms  value is 0.12766554142993597

Checking for matches...Alcohol is Alcohol
Alcohol value is 0.13587538667643773

Checking for matches...Hepatitis B Surface Antigen is Hepatitis B Surface Antigen
Hepatitis B Surface Antigen value is 0.0

Checking for matches...Hepatitis B e Antigen is Hepatitis B e Antigen
Hepatitis B e Antigen value is 0.0

Checking for matches...Hepatitis B Core Antibody is Hepatitis B Core Antibody
Hepatitis B Core Antibody value is 0.0

Checking for matches...Hepatitis C Virus Antibody is Hepatitis C Virus Antibody
Hepatitis C Virus Antibody value is 0.0

Checking for matches...Cirrhosis is Cirrhosis
Cirrhosis value is 0.0

Checking for matches...Endemic Countries is Endemic Countries
Endemic Countries value is 0.0

Checking for matches...Smoking is Smoking
Smoking value is 0.0

Checking for matches...Diabetes is Diabetes
Diabetes value is 0.0

Checking for matches...Obesity is Obesity
Obesity value is 0.0

Checki

Checking for matches...Symptoms  is Symptoms 
Symptoms  value is 0.0

Checking for matches...Alcohol is Alcohol
Alcohol value is 0.0

Checking for matches...Hepatitis B e Antigen is Hepatitis B e Antigen
Hepatitis B e Antigen value is 0.0

Checking for matches...Hepatitis B Core Antibody is Hepatitis B Core Antibody
Hepatitis B Core Antibody value is 0.0

Checking for matches...Hepatitis C Virus Antibody is Hepatitis C Virus Antibody
Hepatitis C Virus Antibody value is 0.0

Checking for matches...Endemic Countries is Endemic Countries
Endemic Countries value is 0.0

Checking for matches...Diabetes is Diabetes
Diabetes value is 0.0

Checking for matches...Obesity is Obesity
Obesity value is 0.0

Checking for matches...Arterial Hypertension is Arterial Hypertension
Arterial Hypertension value is 0.0

Checking for matches...Chronic Renal Insufficiency is Chronic Renal Insufficiency
Chronic Renal Insufficiency value is 0.0

Checking for matches...Human Immunodeficiency Virus is Human Immun

Unnamed: 0,Gender,Symptoms,Alcohol,Hepatitis B Surface Antigen,Hepatitis B e Antigen,Hepatitis B Core Antibody,Hepatitis C Virus Antibody,Cirrhosis,Endemic Countries,Smoking,...,Gamma glutamyl transferase (U/L),Alkaline phosphatase (U/L),Total Proteins (g/dL),Creatinine (mg/dL),Number of Nodules,Major dimension of nodule (cm),Direct Bilirubin (mg/dL),Iron,Oxygen Saturation (%),Ferritin (ng/mL)
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.127666,0.135875,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.046579,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.455851,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


---------------------------------------
RF
---------------------------------------
RF0 In CV0...

Checking if correct model is loaded...
RandomForestClassifier(criterion='entropy', max_depth=1, max_features=None,
                       min_samples_leaf=17, min_samples_split=41,
                       n_estimators=960, random_state=42)

Checking explainer for RF0...
<shap.explainers._tree.Tree object at 0x7fcc18d62340>

Checking shap values for RF0...

[array([[ 0.        ,  0.04181896,  0.00365806, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        , -0.01898534, -0.01999966, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        , -0.01898534, -0.01840236, ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [ 0.        ,  0.04012049,  0.02614831, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        , -0.01898534, -0.00631506, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        , -0.0189853

Checking for matches...Symptoms  is Symptoms 
Symptoms  value is 0.0

Checking for matches...Hepatitis B Surface Antigen is Hepatitis B Surface Antigen
Hepatitis B Surface Antigen value is 0.0

Checking for matches...Hepatitis B e Antigen is Hepatitis B e Antigen
Hepatitis B e Antigen value is 0.0

Checking for matches...Hepatitis C Virus Antibody is Hepatitis C Virus Antibody
Hepatitis C Virus Antibody value is 0.0

Checking for matches...Cirrhosis is Cirrhosis
Cirrhosis value is 0.0

Checking for matches...Endemic Countries is Endemic Countries
Endemic Countries value is 0.0

Checking for matches...Smoking is Smoking
Smoking value is 0.0

Checking for matches...Diabetes is Diabetes
Diabetes value is 0.0

Checking for matches...Obesity is Obesity
Obesity value is 0.0

Checking for matches...Hemochromatosis is Hemochromatosis
Hemochromatosis value is 0.0

Checking for matches...Arterial Hypertension is Arterial Hypertension
Arterial Hypertension value is 0.0

Checking for matches...Chr

Checking for matches...Symptoms  is Symptoms 
Symptoms  value is 0.019735316941878446

Checking for matches...Alcohol is Alcohol
Alcohol value is 0.0007752169838459056

Checking for matches...Hepatitis B Surface Antigen is Hepatitis B Surface Antigen
Hepatitis B Surface Antigen value is 0.0

Checking for matches...Hepatitis B e Antigen is Hepatitis B e Antigen
Hepatitis B e Antigen value is 0.0

Checking for matches...Hepatitis B Core Antibody is Hepatitis B Core Antibody
Hepatitis B Core Antibody value is 0.0008855053507077019

Checking for matches...Hepatitis C Virus Antibody is Hepatitis C Virus Antibody
Hepatitis C Virus Antibody value is 0.0007639992732988261

Checking for matches...Cirrhosis is Cirrhosis
Cirrhosis value is 0.0

Checking for matches...Endemic Countries is Endemic Countries
Endemic Countries value is 0.0

Checking for matches...Smoking is Smoking
Smoking value is 0.0015082743786287906

Checking for matches...Diabetes is Diabetes
Diabetes value is 0.0012367462089529

Checking for matches...Symptoms  is Symptoms 
Symptoms  value is 0.0

Checking for matches...Alcohol is Alcohol
Alcohol value is 0.0

Checking for matches...Hepatitis B e Antigen is Hepatitis B e Antigen
Hepatitis B e Antigen value is 0.0

Checking for matches...Hepatitis B Core Antibody is Hepatitis B Core Antibody
Hepatitis B Core Antibody value is 0.0

Checking for matches...Hepatitis C Virus Antibody is Hepatitis C Virus Antibody
Hepatitis C Virus Antibody value is 0.0

Checking for matches...Endemic Countries is Endemic Countries
Endemic Countries value is 0.0

Checking for matches...Diabetes is Diabetes
Diabetes value is 0.0

Checking for matches...Obesity is Obesity
Obesity value is 0.0

Checking for matches...Arterial Hypertension is Arterial Hypertension
Arterial Hypertension value is 0.0

Checking for matches...Chronic Renal Insufficiency is Chronic Renal Insufficiency
Chronic Renal Insufficiency value is 0.0

Checking for matches...Human Immunodeficiency Virus is Human Immun

Unnamed: 0,Gender,Symptoms,Alcohol,Hepatitis B Surface Antigen,Hepatitis B e Antigen,Hepatitis B Core Antibody,Hepatitis C Virus Antibody,Cirrhosis,Endemic Countries,Smoking,...,Gamma glutamyl transferase (U/L),Alkaline phosphatase (U/L),Total Proteins (g/dL),Creatinine (mg/dL),Number of Nodules,Major dimension of nodule (cm),Direct Bilirubin (mg/dL),Iron,Oxygen Saturation (%),Ferritin (ng/mL)
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.003386,0.033483,0.0,0.003621,0.0,0.00241,0.002852,0.022969,0.0,0.005711
1,0.0,0.019735,0.000775,0.0,0.0,0.000886,0.000764,0.0,0.0,0.001508,...,0.005091,0.034861,0.003384,0.005623,0.001991,0.010869,0.040724,0.051997,0.04454,0.019971
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.002236,0.34348,0.0,0.0,0.039206,0.040151,0.0,0.043435,0.0,0.119536


---------------------------------------
XGB
---------------------------------------
XGB0 In CV0...

Checking if correct model is loaded...
XGBClassifier(alpha=0.0002575842389979265, base_score=0.5, booster='gbtree',
              callbacks=None, colsample_bylevel=1, colsample_bynode=1,
              colsample_bytree=0.9181376162919086, early_stopping_rounds=None,
              enable_categorical=False, eta=5.623331491160975e-07,
              eval_metric=None, gamma=0.0002786718840103683, gpu_id=-1,
              grow_policy='lossguide', importance_type=None,
              interaction_constraints='', learning_rate=5.62333128e-07,
              max_bin=256, max_cat_to_onehot=4, max_delta_step=0, max_depth=27,
              max_leaves=0, min_child_weight=0.20525460238584922,
              min_samples_leaf=27, min_samples_split=37, missing=nan,
              monotone_constraints='()', n_estimators=164, n_jobs=1, nthread=1, ...)

Checking explainer for XGB0...
<shap.explainers._tree.Tree o

Checking for matches...Symptoms  is Symptoms 
Symptoms  value is 5.827763516208506e-07

Checking for matches...Hepatitis B Surface Antigen is Hepatitis B Surface Antigen
Hepatitis B Surface Antigen value is 2.4962539555417607e-06

Checking for matches...Hepatitis B e Antigen is Hepatitis B e Antigen
Hepatitis B e Antigen value is 0.0

Checking for matches...Hepatitis C Virus Antibody is Hepatitis C Virus Antibody
Hepatitis C Virus Antibody value is 4.559925059766101e-07

Checking for matches...Cirrhosis is Cirrhosis
Cirrhosis value is 1.2952527868037578e-06

Checking for matches...Endemic Countries is Endemic Countries
Endemic Countries value is 1.5119959471121547e-06

Checking for matches...Smoking is Smoking
Smoking value is 0.0

Checking for matches...Diabetes is Diabetes
Diabetes value is 5.943172709521605e-07

Checking for matches...Obesity is Obesity
Obesity value is 0.0

Checking for matches...Hemochromatosis is Hemochromatosis
Hemochromatosis value is 6.765080939885593e-08

Che

Checking for matches...Symptoms  is Symptoms 
Symptoms  value is 0.3974006175994873

Checking for matches...Alcohol is Alcohol
Alcohol value is 0.07787172496318817

Checking for matches...Hepatitis B Surface Antigen is Hepatitis B Surface Antigen
Hepatitis B Surface Antigen value is 0.0

Checking for matches...Hepatitis B e Antigen is Hepatitis B e Antigen
Hepatitis B e Antigen value is 0.0

Checking for matches...Hepatitis B Core Antibody is Hepatitis B Core Antibody
Hepatitis B Core Antibody value is 0.008401886560022831

Checking for matches...Hepatitis C Virus Antibody is Hepatitis C Virus Antibody
Hepatitis C Virus Antibody value is 0.020111560821533203

Checking for matches...Cirrhosis is Cirrhosis
Cirrhosis value is 0.009383011609315872

Checking for matches...Endemic Countries is Endemic Countries
Endemic Countries value is 0.04800781235098839

Checking for matches...Smoking is Smoking
Smoking value is 0.0

Checking for matches...Diabetes is Diabetes
Diabetes value is 0.0041534

Checking for matches...Symptoms  is Symptoms 
Symptoms  value is 5.428106760518858e-06

Checking for matches...Alcohol is Alcohol
Alcohol value is 0.0

Checking for matches...Hepatitis B e Antigen is Hepatitis B e Antigen
Hepatitis B e Antigen value is 0.0

Checking for matches...Hepatitis B Core Antibody is Hepatitis B Core Antibody
Hepatitis B Core Antibody value is 0.0

Checking for matches...Hepatitis C Virus Antibody is Hepatitis C Virus Antibody
Hepatitis C Virus Antibody value is 0.0

Checking for matches...Endemic Countries is Endemic Countries
Endemic Countries value is 0.0

Checking for matches...Diabetes is Diabetes
Diabetes value is 1.561762110213749e-05

Checking for matches...Obesity is Obesity
Obesity value is 0.0

Checking for matches...Arterial Hypertension is Arterial Hypertension
Arterial Hypertension value is 0.0

Checking for matches...Chronic Renal Insufficiency is Chronic Renal Insufficiency
Chronic Renal Insufficiency value is 0.0

Checking for matches...Human I

Unnamed: 0,Gender,Symptoms,Alcohol,Hepatitis B Surface Antigen,Hepatitis B e Antigen,Hepatitis B Core Antibody,Hepatitis C Virus Antibody,Cirrhosis,Endemic Countries,Smoking,...,Gamma glutamyl transferase (U/L),Alkaline phosphatase (U/L),Total Proteins (g/dL),Creatinine (mg/dL),Number of Nodules,Major dimension of nodule (cm),Direct Bilirubin (mg/dL),Iron,Oxygen Saturation (%),Ferritin (ng/mL)
0,0.0,5.827764e-07,0.0,2e-06,0.0,0.0,4.559925e-07,1e-06,2e-06,0.0,...,3e-06,1.4e-05,0.0,4e-06,9.484509e-07,4e-06,4e-06,1.1e-05,1e-06,8e-06
1,0.0,0.3974006,0.077872,0.0,0.0,0.008402,0.02011156,0.009383,0.048008,0.0,...,0.062831,0.477733,0.139165,0.093624,0.04733162,0.176267,0.401113,0.286994,0.764464,0.362923
2,0.0,5.428107e-06,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000292,0.002665,0.0,2.6e-05,3.972282e-05,0.000225,1.4e-05,0.000544,8e-05,0.000649


---------------------------------------
hcc-data_example_no_covariates
---------------------------------------
---------------------------------------
NB
---------------------------------------
NB0 In CV0...

Checking if correct model is loaded...
GaussianNB()

Checking explainer for NB0...
shap.explainers.Permutation()

Checking shap values for NB0...

.values =
array([[ 0.005     , -0.0225    ,  0.        , ...,  0.00416667,
        -0.0275    , -0.02916667],
       [ 0.        ,  0.02333333,  0.        , ...,  0.01416667,
        -0.0325    , -0.0275    ],
       [ 0.01833333,  0.02583333,  0.00083333, ...,  0.06833333,
        -0.0125    , -0.02416667],
       ...,
       [ 0.00333333, -0.01083333,  0.        , ...,  0.00583333,
        -0.02666667, -0.03083333],
       [ 0.00166667, -0.00416667,  0.        , ..., -0.01666667,
        -0.03083333, -0.03      ],
       [ 0.0075    ,  0.02      ,  0.        , ..., -0.035     ,
         0.40833333, -0.02      ]])

.base_values =
array

Saving Summary Plot for SHAP Values in Class 0 & 1 in Test Set...
Saving SHAP Beeswarm Plot for Top 5 SHAP Values in Class 0 & 1 in Test Set...
Saving feature importance ranking for NB1...

shap means for NB in CV1...
[0.01822727 0.04460606 0.04827273 0.02969697 0.00584848 0.05077273
 0.00806061 0.         0.00319697 0.00728788 0.02419697 0.02656061
 0.09439394 0.0145303  0.00384848 0.03477273 0.005      0.00233333
 0.00878788 0.04080303 0.02066667 0.00021212 0.0280303  0.01354545
 0.01330303 0.06634848 0.00268182 0.00822727 0.01986364 0.01792424
 0.05909091 0.01783333 0.04069697 0.02866667 0.06713636 0.02560606]

Saving Force Plot for NB SHAP Values in Test Set...

Checking for matches...Symptoms  is Symptoms 
Symptoms  value is 0.06713636363636363

Checking for matches...Hepatitis B e Antigen is Hepatitis B e Antigen
Hepatitis B e Antigen value is 0.0023333333333333327

Checking for matches...Hepatitis C Virus Antibody is Hepatitis C Virus Antibody
Hepatitis C Virus Antibody value is

Unnamed: 0,Gender,Symptoms,Alcohol,Hepatitis B Surface Antigen,Hepatitis B e Antigen,Hepatitis B Core Antibody,Hepatitis C Virus Antibody,Cirrhosis,Endemic Countries,Smoking,...,Gamma glutamyl transferase (U/L),Alkaline phosphatase (U/L),Total Proteins (g/dL),Creatinine (mg/dL),Number of Nodules,Major dimension of nodule (cm),Direct Bilirubin (mg/dL),Iron,Oxygen Saturation (%),Ferritin (ng/mL)
0,0.0,0.018894,0.000212,0.010621,0.0,0.005682,0.010409,0.006848,0.014076,0.0,...,0.00947,0.011424,0.055121,0.078227,0.003348,0.018682,0.037712,0.016348,0.011364,0.007894
1,0.0,0.067136,0.0,0.0,0.002333,0.0,0.008788,0.0,0.094394,0.0,...,0.003848,0.048273,0.0,0.003197,0.002682,0.013545,0.024197,0.020667,0.019864,0.01453
2,0.0,0.0,0.000364,0.001788,0.010364,0.0,0.001121,0.000712,0.000227,0.000591,...,0.000909,0.002227,0.0,0.000303,0.001455,0.001045,0.059803,0.000939,0.000167,0.014091


---------------------------------------
LR
---------------------------------------
LR0 In CV0...

Checking if correct model is loaded...
LogisticRegression(C=0.0076324520136090606, dual=True,
                   max_iter=383.683139958808, random_state=42,
                   solver='liblinear')

Checking explainer for LR0...
<shap.explainers._linear.Linear object at 0x7fcc187a3f10>

Checking shap values for LR0...

[[-0.00015674 -0.05535253  0.01520368 ...  0.02453187 -0.02176526
  -0.01030959]
 [ 0.00068073  0.02080485  0.01520368 ...  0.02453187 -0.01946545
  -0.01559663]
 [ 0.0001478  -0.01304287  0.01520368 ...  0.02453187 -0.02406506
  -0.01030959]
 ...
 [ 0.00144207 -0.02996674 -0.00683064 ...  0.02453187 -0.01869885
  -0.00804372]
 [ 0.00201307 -0.01304287 -0.00683064 ... -0.0663269  -0.01333263
  -0.00842137]
 [-0.00038514  0.02503582 -0.00683064 ... -0.0663269   0.05412839
  -0.00842137]]

Generating SHAP plots for LR0...

Expected value for LR: -0.032810633342135964
Saving Summ

Checking for matches...Symptoms  is Symptoms 
Symptoms  value is 0.04124988333284681

Checking for matches...Alcohol is Alcohol
Alcohol value is 0.009114194957419365

Checking for matches...Hepatitis B Surface Antigen is Hepatitis B Surface Antigen
Hepatitis B Surface Antigen value is 0.021931800198248246

Checking for matches...Hepatitis B Core Antibody is Hepatitis B Core Antibody
Hepatitis B Core Antibody value is 0.011203964193253292

Checking for matches...Hepatitis C Virus Antibody is Hepatitis C Virus Antibody
Hepatitis C Virus Antibody value is 0.028888854722906872

Checking for matches...Cirrhosis is Cirrhosis
Cirrhosis value is 0.010849640129879672

Checking for matches...Endemic Countries is Endemic Countries
Endemic Countries value is 0.01863871226140451

Checking for matches...Diabetes is Diabetes
Diabetes value is 0.049720725126536205

Checking for matches...Chronic Renal Insufficiency is Chronic Renal Insufficiency
Chronic Renal Insufficiency value is 0.02417620768565065

Checking for matches...Symptoms  is Symptoms 
Symptoms  value is 0.005671989215052704

Checking for matches...Hepatitis B e Antigen is Hepatitis B e Antigen
Hepatitis B e Antigen value is 0.0005201223783531668

Checking for matches...Hepatitis C Virus Antibody is Hepatitis C Virus Antibody
Hepatitis C Virus Antibody value is 0.001226881222520814

Checking for matches...Cirrhosis is Cirrhosis
Cirrhosis value is 2.3673895500685356e-05

Checking for matches...Endemic Countries is Endemic Countries
Endemic Countries value is 0.002159066761354565

Checking for matches...Diabetes is Diabetes
Diabetes value is 0.0013860313033

Checking for matches...Obesity is Obesity
Obesity value is 0.00044272462420167925

Checking for matches...Hemochromatosis is Hemochromatosis
Hemochromatosis value is 0.00015114264072037324

Checking for matches...Arterial Hypertension is Arterial Hypertension
Arterial Hypertension value is 0.0011054029444667148

Checking for matches...Nonalcoholic Steatohepatitis is Non

Checking for matches...Alcohol is Alcohol
Alcohol value is 3.611043410527636e-05

Checking for matches...Hepatitis B Surface Antigen is Hepatitis B Surface Antigen
Hepatitis B Surface Antigen value is 3.7483544912258247e-05

Checking for matches...Hepatitis B e Antigen is Hepatitis B e Antigen
Hepatitis B e Antigen value is 1.6258715545930826e-05

Checking for matches...Hepatitis C Virus Antibody is Hepatitis C Virus Antibody
Hepatitis C Virus Antibody value is 2.91571715918726e-05

Checking for matches...Cirrhosis is Cirrhosis
Cirrhosis value is 2.1005459891650037e-05

Checking for matches...Endemic Countries is Endemic Countries
Endemic Countries value is 3.504321104586795e-06

Checking for matches...Smoking is Smoking
Smoking value is 4.26826080104935e-05

Checking for matches...Diabetes is Diabetes
Diabetes value is 8.574631207165098e-05

Checking for matches...Hemochromatosis is Hemochromatosis
Hemochromatosis value is 1.581372030006737e-05

Checking for matches...Arterial Hyperte

Unnamed: 0,Gender,Symptoms,Alcohol,Hepatitis B Surface Antigen,Hepatitis B e Antigen,Hepatitis B Core Antibody,Hepatitis C Virus Antibody,Cirrhosis,Endemic Countries,Smoking,...,Gamma glutamyl transferase (U/L),Alkaline phosphatase (U/L),Total Proteins (g/dL),Creatinine (mg/dL),Number of Nodules,Major dimension of nodule (cm),Direct Bilirubin (mg/dL),Iron,Oxygen Saturation (%),Ferritin (ng/mL)
0,0.0,0.04125,0.009114,0.021932,0.0,0.011204,0.028889,0.01085,0.018639,0.0,...,0.03054,0.042692,0.01736,0.039875,0.001181,0.048746,0.019318,0.00279,0.00296,0.040764
1,0.0,0.005672,0.0,0.0,0.00052,0.0,0.001227,2.4e-05,0.002159,0.0,...,0.001006,0.005199,0.0,0.000488,0.000676,0.002486,0.002138,0.003656,0.001267,0.002743
2,0.0,0.0,3.6e-05,3.7e-05,1.6e-05,0.0,2.9e-05,2.1e-05,4e-06,4.3e-05,...,6.4e-05,0.00013,0.0,1.6e-05,9.5e-05,9.4e-05,8.4e-05,8.8e-05,4e-06,0.000125


---------------------------------------
DT
---------------------------------------
DT0 In CV0...

Checking if correct model is loaded...
DecisionTreeClassifier(max_depth=17, min_samples_leaf=35, min_samples_split=45,
                       random_state=42)

Checking explainer for DT0...
<shap.explainers._tree.Tree object at 0x7fcbeb369100>

Checking shap values for DT0...

[array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]]), array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])]

Generating SHAP plots for DT0...

Expected value for DT: [0.57272727 0.42727273]
Saving Bar Summary Plot for SHAP Values in Class 0 & 1 in

Checking for matches...Symptoms  is Symptoms 
Symptoms  value is 0.0

Checking for matches...Alcohol is Alcohol
Alcohol value is 0.0

Checking for matches...Hepatitis B Surface Antigen is Hepatitis B Surface Antigen
Hepatitis B Surface Antigen value is 0.0

Checking for matches...Hepatitis B Core Antibody is Hepatitis B Core Antibody
Hepatitis B Core Antibody value is 0.0

Checking for matches...Hepatitis C Virus Antibody is Hepatitis C Virus Antibody
Hepatitis C Virus Antibody value is 0.0

Checking for matches...Cirrhosis is Cirrhosis
Cirrhosis value is 0.0

Checking for matches...Endemic Countries is Endemic Countries
Endemic Countries value is 0.0

Checking for matches...Diabetes is Diabetes
Diabetes value is 0.0

Checking for matches...Chronic Renal Insufficiency is Chronic Renal Insufficiency
Chronic Renal Insufficiency value is 0.0

Checking for matches...Nonalcoholic Steatohepatitis is Nonalcoholic Steatohepatitis
Nonalcoholic Steatohepatitis value is 0.0

Checking for matches.

Checking for matches...Symptoms  is Symptoms 
Symptoms  value is 0.0

Checking for matches...Hepatitis B e Antigen is Hepatitis B e Antigen
Hepatitis B e Antigen value is 0.0

Checking for matches...Hepatitis C Virus Antibody is Hepatitis C Virus Antibody
Hepatitis C Virus Antibody value is 0.0

Checking for matches...Cirrhosis is Cirrhosis
Cirrhosis value is 0.0

Checking for matches...Endemic Countries is Endemic Countries
Endemic Countries value is 0.0

Checking for matches...Diabetes is Diabetes
Diabetes value is 0.0

Checking for matches...Obesity is Obesity
Obesity value is 0.0

Checking for matches...Hemochromatosis is Hemochromatosis
Hemochromatosis value is 0.0

Checking for matches...Arterial Hypertension is Arterial Hypertension
Arterial Hypertension value is 0.0

Checking for matches...Nonalcoholic Steatohepatitis is Nonalcoholic Steatohepatitis
Nonalcoholic Steatohepatitis value is 0.0

Checking for matches...Portal Hypertension is Portal Hypertension
Portal Hypertension v

Checking for matches...Alcohol is Alcohol
Alcohol value is 0.0

Checking for matches...Hepatitis B Surface Antigen is Hepatitis B Surface Antigen
Hepatitis B Surface Antigen value is 0.0

Checking for matches...Hepatitis B e Antigen is Hepatitis B e Antigen
Hepatitis B e Antigen value is 0.0

Checking for matches...Hepatitis C Virus Antibody is Hepatitis C Virus Antibody
Hepatitis C Virus Antibody value is 0.0

Checking for matches...Cirrhosis is Cirrhosis
Cirrhosis value is 0.0

Checking for matches...Endemic Countries is Endemic Countries
Endemic Countries value is 0.0

Checking for matches...Smoking is Smoking
Smoking value is 0.0

Checking for matches...Diabetes is Diabetes
Diabetes value is 0.0

Checking for matches...Hemochromatosis is Hemochromatosis
Hemochromatosis value is 0.0

Checking for matches...Arterial Hypertension is Arterial Hypertension
Arterial Hypertension value is 0.0

Checking for matches...Chronic Renal Insufficiency is Chronic Renal Insufficiency
Chronic Renal 

Unnamed: 0,Gender,Symptoms,Alcohol,Hepatitis B Surface Antigen,Hepatitis B e Antigen,Hepatitis B Core Antibody,Hepatitis C Virus Antibody,Cirrhosis,Endemic Countries,Smoking,...,Gamma glutamyl transferase (U/L),Alkaline phosphatase (U/L),Total Proteins (g/dL),Creatinine (mg/dL),Number of Nodules,Major dimension of nodule (cm),Direct Bilirubin (mg/dL),Iron,Oxygen Saturation (%),Ferritin (ng/mL)
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.455851,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


---------------------------------------
RF
---------------------------------------
RF0 In CV0...

Checking if correct model is loaded...
RandomForestClassifier(max_depth=9, max_features=None, min_samples_leaf=9,
                       min_samples_split=24, n_estimators=935, random_state=42)

Checking explainer for RF0...
<shap.explainers._tree.Tree object at 0x7fcc0e09ddc0>

Checking shap values for RF0...

[array([[-8.04892952e-04,  3.49123556e-02, -7.39028100e-04, ...,
        -2.55865722e-05, -4.80373900e-04, -5.51427280e-03],
       [ 8.98581287e-04, -2.36897659e-02, -7.01881775e-05, ...,
        -4.69087156e-05, -1.53604128e-03, -9.93491258e-03],
       [ 1.09297656e-04, -2.10001659e-02,  2.39694450e-04, ...,
        -4.18242045e-05, -3.61466526e-04, -6.98520090e-03],
       ...,
       [ 3.18715672e-04,  2.90540264e-02,  2.67651465e-04, ...,
        -4.18242045e-05,  7.09998438e-05,  7.60365635e-03],
       [-1.87966055e-03, -1.58333798e-02,  2.67651465e-04, ...,
         3.18309

Checking for matches...Symptoms  is Symptoms 
Symptoms  value is 0.00024380603660597525

Checking for matches...Alcohol is Alcohol
Alcohol value is 0.0004848000995061814

Checking for matches...Hepatitis B Surface Antigen is Hepatitis B Surface Antigen
Hepatitis B Surface Antigen value is 0.0

Checking for matches...Hepatitis B Core Antibody is Hepatitis B Core Antibody
Hepatitis B Core Antibody value is 0.0009227276031035558

Checking for matches...Hepatitis C Virus Antibody is Hepatitis C Virus Antibody
Hepatitis C Virus Antibody value is 0.0003894079218432087

Checking for matches...Cirrhosis is Cirrhosis
Cirrhosis value is 0.0

Checking for matches...Endemic Countries is Endemic Countries
Endemic Countries value is 0.0

Checking for matches...Diabetes is Diabetes
Diabetes value is 0.002951757024125973

Checking for matches...Chronic Renal Insufficiency is Chronic Renal Insufficiency
Chronic Renal Insufficiency value is 0.0001952405412753242

Checking for matches...Nonalcoholic Stea

Checking for matches...Symptoms  is Symptoms 
Symptoms  value is 0.015247176712858846

Checking for matches...Hepatitis B e Antigen is Hepatitis B e Antigen
Hepatitis B e Antigen value is 0.0

Checking for matches...Hepatitis C Virus Antibody is Hepatitis C Virus Antibody
Hepatitis C Virus Antibody value is 0.0

Checking for matches...Cirrhosis is Cirrhosis
Cirrhosis value is 0.0

Checking for matches...Endemic Countries is Endemic Countries
Endemic Countries value is 0.0

Checking for matches...Diabetes is Diabetes
Diabetes value is 0.001178286135152906

Checking for matches...Obesity is Obesity
Obesity value is 0.0

Checking for matches...Hemochromatosis is Hemochromatosis
Hemochromatosis value is 0.0

Checking for matches...Arterial Hypertension is Arterial Hypertension
Arterial Hypertension value is 0.0

Checking for matches...Nonalcoholic Steatohepatitis is Nonalcoholic Steatohepatitis
Nonalcoholic Steatohepatitis value is 0.0

Checking for matches...Portal Hypertension is Portal 

Checking for matches...Alcohol is Alcohol
Alcohol value is 0.0

Checking for matches...Hepatitis B Surface Antigen is Hepatitis B Surface Antigen
Hepatitis B Surface Antigen value is 0.0

Checking for matches...Hepatitis B e Antigen is Hepatitis B e Antigen
Hepatitis B e Antigen value is 0.0

Checking for matches...Hepatitis C Virus Antibody is Hepatitis C Virus Antibody
Hepatitis C Virus Antibody value is 0.0

Checking for matches...Cirrhosis is Cirrhosis
Cirrhosis value is 0.0

Checking for matches...Endemic Countries is Endemic Countries
Endemic Countries value is 0.0

Checking for matches...Smoking is Smoking
Smoking value is 0.0005183068903087402

Checking for matches...Diabetes is Diabetes
Diabetes value is 0.0006853094103879076

Checking for matches...Hemochromatosis is Hemochromatosis
Hemochromatosis value is 0.0

Checking for matches...Arterial Hypertension is Arterial Hypertension
Arterial Hypertension value is 0.0008787778330564218

Checking for matches...Chronic Renal Insuf

Unnamed: 0,Gender,Symptoms,Alcohol,Hepatitis B Surface Antigen,Hepatitis B e Antigen,Hepatitis B Core Antibody,Hepatitis C Virus Antibody,Cirrhosis,Endemic Countries,Smoking,...,Gamma glutamyl transferase (U/L),Alkaline phosphatase (U/L),Total Proteins (g/dL),Creatinine (mg/dL),Number of Nodules,Major dimension of nodule (cm),Direct Bilirubin (mg/dL),Iron,Oxygen Saturation (%),Ferritin (ng/mL)
0,0.0,0.000244,0.000485,0.0,0.0,0.000923,0.000389,0.0,0.0,0.0,...,0.007045,0.036931,0.013708,0.022242,0.01187,0.024439,0.013847,0.066051,0.008767,0.063013
1,0.0,0.015247,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.004433,0.026121,0.0,0.000519,0.003951,0.010214,0.0296,0.043139,0.027019,0.00622
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000518,...,0.003898,0.315891,0.0,0.002749,0.001432,0.005627,0.000815,0.046382,0.003094,0.048037


---------------------------------------
XGB
---------------------------------------
XGB0 In CV0...

Checking if correct model is loaded...
XGBClassifier(alpha=0.0003085901759707382, base_score=0.5, booster='gbtree',
              callbacks=None, colsample_bylevel=1, colsample_bynode=1,
              colsample_bytree=0.31595586732894876, early_stopping_rounds=None,
              enable_categorical=False, eta=0.0016131413768891527,
              eval_metric=None, gamma=1.086786493948363e-07, gpu_id=-1,
              grow_policy='lossguide', importance_type=None,
              interaction_constraints='', learning_rate=0.00161314139,
              max_bin=256, max_cat_to_onehot=4, max_delta_step=0, max_depth=3,
              max_leaves=0, min_child_weight=9.912142174935715,
              min_samples_leaf=32, min_samples_split=43, missing=nan,
              monotone_constraints='()', n_estimators=305, n_jobs=1, nthread=1, ...)

Checking explainer for XGB0...
<shap.explainers._tree.Tree obje

Checking for matches...Symptoms  is Symptoms 
Symptoms  value is 0.0

Checking for matches...Alcohol is Alcohol
Alcohol value is 0.0

Checking for matches...Hepatitis B Surface Antigen is Hepatitis B Surface Antigen
Hepatitis B Surface Antigen value is 0.0

Checking for matches...Hepatitis B Core Antibody is Hepatitis B Core Antibody
Hepatitis B Core Antibody value is 0.0

Checking for matches...Hepatitis C Virus Antibody is Hepatitis C Virus Antibody
Hepatitis C Virus Antibody value is 0.0

Checking for matches...Cirrhosis is Cirrhosis
Cirrhosis value is 0.0

Checking for matches...Endemic Countries is Endemic Countries
Endemic Countries value is 0.0

Checking for matches...Diabetes is Diabetes
Diabetes value is 0.0

Checking for matches...Chronic Renal Insufficiency is Chronic Renal Insufficiency
Chronic Renal Insufficiency value is 0.0

Checking for matches...Nonalcoholic Steatohepatitis is Nonalcoholic Steatohepatitis
Nonalcoholic Steatohepatitis value is 0.0

Checking for matches.

Checking for matches...Symptoms  is Symptoms 
Symptoms  value is 0.13446880877017975

Checking for matches...Hepatitis B e Antigen is Hepatitis B e Antigen
Hepatitis B e Antigen value is 0.0

Checking for matches...Hepatitis C Virus Antibody is Hepatitis C Virus Antibody
Hepatitis C Virus Antibody value is 0.020044488832354546

Checking for matches...Cirrhosis is Cirrhosis
Cirrhosis value is 0.0

Checking for matches...Endemic Countries is Endemic Countries
Endemic Countries value is 0.0

Checking for matches...Diabetes is Diabetes
Diabetes value is 0.022066660225391388

Checking for matches...Obesity is Obesity
Obesity value is 0.0026706834323704243

Checking for matches...Hemochromatosis is Hemochromatosis
Hemochromatosis value is 0.0

Checking for matches...Arterial Hypertension is Arterial Hypertension
Arterial Hypertension value is 0.007300143130123615

Checking for matches...Nonalcoholic Steatohepatitis is Nonalcoholic Steatohepatitis
Nonalcoholic Steatohepatitis value is 0.0

Ch

Checking for matches...Alcohol is Alcohol
Alcohol value is 1.3443283023661934e-05

Checking for matches...Hepatitis B Surface Antigen is Hepatitis B Surface Antigen
Hepatitis B Surface Antigen value is 0.0

Checking for matches...Hepatitis B e Antigen is Hepatitis B e Antigen
Hepatitis B e Antigen value is 0.0

Checking for matches...Hepatitis C Virus Antibody is Hepatitis C Virus Antibody
Hepatitis C Virus Antibody value is 0.0

Checking for matches...Cirrhosis is Cirrhosis
Cirrhosis value is 0.0

Checking for matches...Endemic Countries is Endemic Countries
Endemic Countries value is 0.0

Checking for matches...Smoking is Smoking
Smoking value is 0.0006315736100077629

Checking for matches...Diabetes is Diabetes
Diabetes value is 0.0028041654732078314

Checking for matches...Hemochromatosis is Hemochromatosis
Hemochromatosis value is 0.0

Checking for matches...Arterial Hypertension is Arterial Hypertension
Arterial Hypertension value is 0.0008339668274857104

Checking for matches...

Unnamed: 0,Gender,Symptoms,Alcohol,Hepatitis B Surface Antigen,Hepatitis B e Antigen,Hepatitis B Core Antibody,Hepatitis C Virus Antibody,Cirrhosis,Endemic Countries,Smoking,...,Gamma glutamyl transferase (U/L),Alkaline phosphatase (U/L),Total Proteins (g/dL),Creatinine (mg/dL),Number of Nodules,Major dimension of nodule (cm),Direct Bilirubin (mg/dL),Iron,Oxygen Saturation (%),Ferritin (ng/mL)
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.005661,0.062978,0.001796,0.007855,0.0,0.02483,0.001578,0.037759,0.000377,0.00434
1,0.0,0.134469,0.0,0.0,0.0,0.0,0.020044,0.0,0.0,0.0,...,0.03577,0.198403,0.0,0.030768,0.03206,0.093465,0.257519,0.27185,0.304379,0.169407
2,0.0,0.0,1.3e-05,0.0,0.0,0.0,0.0,0.0,0.0,0.000632,...,0.007202,0.017751,0.0,0.001244,0.00317,0.00716,0.002523,0.010486,0.004186,0.010484


# Run SHAP for Training Sets

**Optional**

* This runs on the training CV datasets that were partitioned during STREAMLINE
* User can set run_train to 'True' to compare SHAP results between the training and testing sets
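
One way to make the training/testing comparison concrete is to rank features by their mean absolute SHAP value in each set and look at how the ranks shift. The sketch below is only an illustration: the numbers are made up, and `rank_features` is a hypothetical helper that is not part of this notebook or of SHAP.

```python
import numpy as np

# Hypothetical mean |SHAP| values per feature, for illustration only.
features = ["Symptoms", "Alcohol", "Iron", "Ferritin (ng/mL)"]
train_means = np.array([0.040, 0.010, 0.030, 0.020])
test_means = np.array([0.035, 0.002, 0.010, 0.025])


def rank_features(means):
    """Return the rank of each feature (0 = most important) by descending value."""
    order = np.argsort(-means)            # feature indices, most important first
    ranks = np.empty_like(order)
    ranks[order] = np.arange(len(means))  # invert the ordering into per-feature ranks
    return ranks


train_rank = rank_features(train_means)
test_rank = rank_features(test_means)
rank_shift = test_rank - train_rank       # positive = less important on the test set

for name, tr, te, shift in zip(features, train_rank, test_rank, rank_shift):
    print(f"{name}: train rank {tr}, test rank {te}, shift {shift:+d}")
```

Features whose rank changes a lot between the two sets may be worth a closer look, since the model could be relying on them differently for data it has not seen.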

In [14]:
# testing all methods
run_force = False # parameter in run_force_plot(); set to True if user wants to display force plots for trained models
run_train = False


if run_train:
    for each in datasets:
        print("---------------------------------------")
        print(each)
        print("---------------------------------------")
        full_path = experiment_path+'/' + each
        filepath = f"/{full_path}/model_evaluation/shap_values/trainResults/" #path to save SHAP FI value results
        
       
        
        # Make folders in experiment folder/datafolder to store all shap_values per algorithm/CV combination
        save_path = full_path + '/model_evaluation/shap_values/trainResults/shapFigures'
        os.makedirs(save_path, exist_ok=True)  # also creates shap_values/ and trainResults/ if they are missing

        for algorithm in algorithms: #loop through algorithms
            print("---------------------------------------")
            print(abbrev[algorithm])
            print("---------------------------------------")
            
            # master list location for the dataset currently being processed
            os.makedirs(f'{full_path}/model_evaluation/shap_values/trainResults/', exist_ok=True)
            file_path = f'{full_path}/model_evaluation/shap_values/trainResults/{abbrev[algorithm]}_shapMasterList.csv'

            FI_all = []  # list to store feature importances to create shap values master list
        
            for cvCount in range(0,cv_partitions): #loop through cv's
                print(f"{abbrev[algorithm]}{cvCount} In CV{cvCount}...")

                # unpickle and load model
                result_file = f"/{full_path}/models/pickledModels/{abbrev[algorithm]}_{str(cvCount)}.pickle"
                with open(result_file, 'rb') as file:  # file is closed automatically, even on error
                    model = pickle.load(file)
                print(f'\nChecking if correct model is loaded...\n{model}')


                # Load CV datasets, paths to datasets updates with each iteration
                train_path = f"/{experiment_path}/{each}/CVDatasets/{each}_CV_{str(cvCount)}_Train.csv"
                test_path =f"/{experiment_path}/{each}/CVDatasets/{each}_CV_{str(cvCount)}_Test.csv"
                trainX, trainY,testX, testY, trainFeat, testFeat = dataPrep(train_path,instance_label,class_label, test_path)
#                 print(trainX)

                # shap computation and plots
                # Sanity check: print explainer to check if explainer exists
                explainer = get_explainer(model, abbrev[algorithm], trainX)  #explainer must always use training set
                print(f"\nChecking explainer for {abbrev[algorithm]}{cvCount}...\n{explainer}\n")  

                print(f"Checking shap values for {abbrev[algorithm]}{cvCount}...\n")               
                shap_values = compute_shapValues(model, abbrev[algorithm], explainer, trainX)

                print(f"\nGenerating SHAP plots for {abbrev[algorithm]}{cvCount}...\n")
                shap_summary(abbrev[algorithm], trainFeat, shap_values, explainer, trainX, cvCount, save_path, 'Train')
                
                #save SHAP FI results for each model per CV 
                print('Saving feature importance ranking for {}{}...\n'.format(abbrev[algorithm], cvCount))
                shap_fi_df, shap_means = shap_feature_ranking(abbrev[algorithm], shap_values, trainX, trainFeat) # can either choose to pass testX or trainX
                print(f'shap means for {abbrev[algorithm]} in CV{cvCount}...\n{shap_means}')
                shapFI_path = (f"{filepath}{abbrev[algorithm]}_{str(cvCount)}_shapFIValues_Train.csv")
                shap_fi_df.to_csv(shapFI_path, header=True, index=True)
                
                
                # OPTIONAL: set run_force to True to save force plot figures
                if run_force:
                    run_force_plots(abbrev[algorithm], explainer, shap_values, trainX, trainFeat, cvCount, save_path, 'Train')
                
                
                # create masterList of mean SHAP Values taken from shap_feature_ranking() for each model
                save = save_shap(abbrev[algorithm],shap_values, shap_means, original_headers, cvCount, 'Train', each)
                FI_all.append(save)
                print("------------------------------------------------------------------------------------------")

            # create master list for each model after looping through all CVs
            df = pd.DataFrame(np.asarray(FI_all), columns=original_headers)
            display(df)
            
            path = (f'{filepath}{abbrev[algorithm]}_shapMasterList.csv') 
            df.to_csv(path, index=True)
           




# Read in Model SHAP Value Master List to Create SHAP Figures for the Final Model Over All CVs

**NOTE: IT IS NOT YET CONFIRMED THAT THIS SECTION ACTUALLY DISPLAYS THE SHAP VALUE AVERAGES OF EACH FEATURE (OVER ALL CVs) FOR ONE MODEL**


**Areas of concern/issues:**
* It is not necessarily correct to use the models from different partitions, each with its respective CV0-CV2 dataset, to display overall SHAP value averages
    * Unless the goal is to compare figures from different Explanation objects to show how the differently trained models 'explain' the prediction
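
To make the averaging step concrete, the sketch below takes the mean over CV partitions for three features. It assumes each master list row holds the per-feature mean SHAP magnitudes for one CV partition; the numbers are copied from three columns of the LR master list printed earlier in this notebook, and the variable names are illustrative only.

```python
import numpy as np

# Rows = CV partitions (CV0-CV2), columns = features.
# Values copied from the LR master list output shown earlier (three columns only).
features = ["Symptoms", "Alcohol", "Iron"]
master = np.array([
    [0.04125,  0.009114, 0.00279],
    [0.005672, 0.0,      0.003656],
    [0.0,      3.6e-05,  8.8e-05],
])

final_mean = master.mean(axis=0)  # average importance of each feature over all CVs
final_std = master.std(axis=0)    # spread across CVs, a rough stability check

for name, mean, std in zip(features, final_mean, final_std):
    print(f"{name}: mean={mean:.6f}, std across CVs={std:.6f}")
```

The result is a global importance summary for the algorithm. It is not an explanation produced by any single trained model, which is the concern raised in the bullets above: a large spread across CVs suggests the per-partition models disagree about a feature.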

In [None]:
'''Read in the master list file for a trained model and average the SHAP values of each
       feature over all CVs.

    Used to create summative figures for the model as a whole rather than for each CV per model.'''


model_abbrev = 'LR'  # separate name so the abbrev dictionary used in earlier cells is not overwritten
read_df = pd.read_csv(f'{experiment_path}/hcc-data_example/model_evaluation/shap_values/testResults/LR_shapMasterList.csv', index_col=0)
# print(len(read_df.columns))

# mean SHAP value of each feature (column) over all CV partitions (rows)
final_mean = np.mean(read_df.values, axis=0)


# unpickle the LR model trained on CV partition 0
result_file = experiment_path + '/hcc-data_example/models/pickledModels/LR_0.pickle'
with open(result_file, 'rb') as file:
    model = pickle.load(file)
# print('\nChecking if correct model is loaded...\n', model)


# one row of mean SHAP values per CV partition
vals = np.asarray(read_df.values)
print(vals)

# original_dataset_vals = pd.read_csv(dataset_path+"/hcc-data_example.csv")
# print(original_dataset_vals)

train_path = experiment_path + '/hcc-data_example/CVDatasets/hcc-data_example_CV_0_Train.csv'
test_path = experiment_path + '/hcc-data_example/CVDatasets/hcc-data_example_CV_0_Test.csv'
trainX, trainY, testX, testY, trainFeat, testFeat = dataPrep(train_path, instance_label, class_label, test_path)

explainer = get_explainer(model, model_abbrev, trainX)


'''This might not be correct, since the dataset being used is the CV0 testing set rather than the
    original hcc-data_example dataset csv file.

    Another concern is that this may not be the correct usage of Explainer objects for creating
        figures: the explainer comes from the model trained on one CV partition (and its respective
        CV dataset), but it is used to explain the SHAP value averages of features taken over all CVs
        (from the master list).'''

shap.summary_plot(vals, original_headers, plot_type='bar')
# shap.force_plot(explainer.expected_value, vals[0], original_headers, show=True)
shap.decision_plot(explainer.expected_value, vals, original_headers, show=True)

print(f'Single Prediction of {model_abbrev} SHAP Values Master List\n')
# np.ravel handles both a scalar expected value and a per-class array of expected values
base_value = np.ravel(explainer.expected_value)[0]
shap.plots._waterfall.waterfall_legacy(base_value, vals[0], testX.iloc[0], original_headers, show=True)
shap.force_plot(explainer.expected_value, vals[0], original_headers, show=True)


# explainer = get_explainer(model, model_abbrev, trainX)

# # shap_values = compute_shapValues(model, 'NB', explainer, testX)
# shap.summary_plot(np.mean(read_df.iloc[:, :].values), testX, original_headers)
# print(final_mean)
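
The waterfall, force, and decision plots rely on SHAP's additivity property: for a single instance, the base value plus the sum of that instance's SHAP values equals the model output. The sketch below checks this by hand for a toy linear model and then shows that a row of mean absolute SHAP values (which is what the master list rows appear to contain) does not reconstruct any prediction. Everything here is synthetic and computed with NumPy only; the weights, data, and variable names are assumptions for illustration, and the closed-form linear SHAP values assume independent features.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))   # synthetic data: 50 instances, 3 features
w = np.array([0.5, -1.0, 2.0]) # hypothetical linear model weights
b = 0.1                        # hypothetical intercept

pred = X @ w + b                        # model output for each instance
base_value = X.mean(axis=0) @ w + b     # expected model output over the data
shap_vals = (X - X.mean(axis=0)) * w    # linear-model SHAP values (independent features)

# Additivity holds per instance: base value + sum of SHAP values = prediction.
print(np.allclose(base_value + shap_vals.sum(axis=1), pred))

# A summary row of mean |SHAP| values is not tied to any single instance,
# so adding it to the base value overshoots the average prediction.
mean_abs = np.abs(shap_vals).mean(axis=0)
overshoot = (base_value + mean_abs.sum()) - pred.mean()
print(f"overshoot relative to the mean prediction: {overshoot:.4f}")
```

Under these assumptions, bar-style summaries of the averaged values remain meaningful, while single-prediction plots built from a master list row should be read with caution.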
        

