<img src="../img/iris.jpg"  width="450" height="200">

### Understanding Multiclass Classification ML Models with SHAP

The iris dataset is famous in the data science community; you can learn more [here](https://archive.ics.uci.edu/ml/datasets/iris). What we need to know is that the dataset is made up of 150 rows (instances). Each row has four features describing the length and width of the sepals and petals, and there are three species (classes) to consider: Setosa, Versicolour, and Virginica. The species are the targets, and there are 50 observations for each one, so the data is balanced.
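As a quick sanity check of the figures above, the following sketch counts rows, features, and observations per species. It assumes the copy of the dataset bundled with scikit-learn; the names `bunch` and `counts` are illustrative and are not used elsewhere in this exercise.

```python
from collections import Counter
from sklearn.datasets import load_iris

# Load the copy of the iris data that ships with scikit-learn
bunch = load_iris()

# Number of rows (instances) and feature columns
n_rows, n_features = bunch.data.shape

# Count how many observations belong to each species
counts = Counter(str(bunch.target_names[t]) for t in bunch.target)

print(n_rows, n_features)
print(dict(counts))
```

Equal counts per species are what "balanced" means here.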

In this exercise you will need to go through the code and fill in any missing values (denoted with `XXXX`). There will be clues and hints.

By the end of this exercise you should know how to use the SHAP library to interpret a `RandomForestClassifier` model.

#### Library Imports

There are THREE missing values in the section below.

In [None]:
from sklearn import datasets
import XXXX as pd 
import XXXX as np 
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

import shap
from shap import Explanation

# JavaScript for SHAP plots
XXXX.initjs() 

In [None]:
# Helper function to see methods in object
# Might be useful when working through this exercise

def object_methods(obj):
    '''
    Helper function to list methods associated with an object
    '''
    try:
        methods = [method_name for method_name in dir(obj)
                   if callable(getattr(obj, method_name))]
        print('Below are the methods for object: ', obj)
        for method in methods:
            print(method)
    except Exception:
        print("Error")
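The helper above also lists private and dunder methods, which can make the output long. A minimal variant, shown here on a plain Python list, returns only the public methods. The name `public_methods` is hypothetical and is not part of the exercise.

```python
def public_methods(obj):
    """Return the names of callable attributes that do not start with an underscore."""
    return [name for name in dir(obj)
            if not name.startswith("_") and callable(getattr(obj, name))]

# Try it on an empty list to see the familiar list methods
list_methods = public_methods([])
print(list_methods)
```

The same call works on any object, including a fitted model or a SHAP explainer.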

#### Load & Clean Data

There is ONE missing value in the section below.
HINT: Assign the dataframe to the same variable name that the data was initially loaded into.

In [None]:
# Loading and cleaning the data
iris = datasets.load_iris()

# made into dataframe
XXXX = pd.DataFrame( 
    data= np.c_[iris['data'], iris['target']],
    columns= iris['feature_names'] + ['target']
    )

# Convert target from float to int
iris['target'] = iris['target'].astype(int)

In [None]:
# Define the different classes (species)
class_dict = {0 : 'setosa',
             1 : 'versicolor',
             2 : 'virginica'}

class_dict

In [None]:
# Add species into the dataframe
iris['species'] = iris['target'].apply(lambda x: class_dict.get(x))

iris.head()

In [None]:
# Take a look at some stats about the data
iris.describe()

#### Plot the Data

In [None]:
setosa = iris[iris.species == "setosa"]
versicolor = iris[iris.species=='versicolor']
virginica = iris[iris.species=='virginica']

fig, ax = plt.subplots()
fig.set_size_inches(13, 7) # adjusting the length and width of plot

# labels and scatter points
ax.scatter(setosa['petal length (cm)'], setosa['petal width (cm)'], label="Setosa", facecolor="blue")
ax.scatter(versicolor['petal length (cm)'], versicolor['petal width (cm)'], label="Versicolor", facecolor="green")
ax.scatter(virginica['petal length (cm)'], virginica['petal width (cm)'], label="Virginica", facecolor="red")


ax.set_xlabel("petal length (cm)")
ax.set_ylabel("petal width (cm)")
ax.grid()
ax.set_title("Iris petals")
ax.legend()

#### Performing Classification

There is ONE missing value in the section below.

In [None]:
# Define features (X) and target (y)
# Drop the target and species columns since the features are only the measurements
X = iris.drop(['target', 'species'], axis=1)
y = iris['target']

# get class and features names
class_names = iris.species.unique()
feature_names = X.columns

# Splitting into train and test
X_train, X_test, y_train, y_test = train_test_split(XXXX, ####
                                                    y,
                                                    test_size=0.2,
                                                    random_state=42)
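With `test_size=0.2`, 30 of the 150 rows are held out for testing. The sketch below checks those sizes on placeholder data. It also passes `stratify`, which the cell above does not use, so treat that argument as an optional assumption: it keeps the class proportions identical in both splits. The names `rows` and `labels` are illustrative.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Placeholder data: 150 single-feature rows and 3 balanced classes of 50
rows = [[float(i)] for i in range(150)]
labels = [i // 50 for i in range(150)]

# stratify=labels is an assumption added here, not part of the exercise cell
rows_train, rows_test, labels_train, labels_test = train_test_split(
    rows, labels, test_size=0.2, random_state=42, stratify=labels)

print(len(rows_train), len(rows_test))
print(Counter(labels_test))
```

Without `stratify`, the test set still has 30 rows, but the per-class counts can vary with the random seed.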

There is ONE missing value in the section below.

In [None]:
# Instantiate a RFC model and fit it
model = RandomForestClassifier(n_estimators=100,
                               n_jobs=-1,
                               class_weight='balanced',
                               random_state=42)

model.fit(XXXX, ####
          y_train)

#### Inspecting Model's 'Feature Importance' ('Out of the box')

In [None]:
# Looking at standard feature importance
importances = model.feature_importances_
indices = np.argsort(importances)

In [None]:
# Generate Feature Bar Chart of Standard Feature Importances
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='g', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
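Two details of the chart above are easy to miss: the built-in importances are normalised so that they sum to 1, and `np.argsort` sorts in ascending order, which puts the most important feature at the top of a horizontal bar chart. The sketch below checks both on a small forest fitted to the full dataset. The names `forest`, `scores`, and `order`, and the smaller `n_estimators`, are illustrative choices rather than the exercise's model.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Fit a small forest on all 150 rows (illustration only, no train/test split)
measurements, species = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=10, random_state=0)
forest.fit(measurements, species)

scores = forest.feature_importances_
order = np.argsort(scores)  # ascending: last index is the most important feature

print(scores.sum())
print(order)
```

These scores describe the model as a whole; they do not say which class a feature helps to predict, which is what the SHAP plots later in this exercise add.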

#### Predictions

There is ONE missing value in the section below.

In [None]:
# Training predictions
training_prediction = model.XXXX(X_train) ####
training_prediction

There is ONE missing value in the section below.

In [None]:
# Test predictions
test_prediction = model.XXXX(X_test) ####
test_prediction

#### Assessing Performance

In [None]:
# Performance with training data
print("Precision, Recall, Confusion matrix, in training\n")

# Precision Recall scores
print(metrics.classification_report(y_train,
                                    training_prediction,
                                    digits=3))

# Confusion matrix
print(metrics.confusion_matrix(y_train,
                               training_prediction))

In [None]:
# Performance with testing data
print("Precision, Recall, Confusion matrix, in testing\n")

# Precision Recall scores
print(metrics.classification_report(y_test,
                                    test_prediction,
                                    digits=3))

# Confusion matrix
print(metrics.confusion_matrix(y_test,
                               test_prediction))
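To read the confusion matrices above: rows are the true classes and columns are the predicted classes, so the diagonal holds the correct predictions. The sketch below uses a small hand-made set of labels (not the iris predictions) to show how per-class recall and precision follow from the matrix.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hand-made labels for three classes, three observations each
true_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]
predicted = [0, 0, 0, 1, 1, 2, 2, 2, 1]

cm = confusion_matrix(true_labels, predicted)

# Recall: correct predictions divided by the number of true members (row sums)
recall = np.diag(cm) / cm.sum(axis=1)
# Precision: correct predictions divided by the number predicted (column sums)
precision = np.diag(cm) / cm.sum(axis=0)

print(cm)
print(recall)
print(precision)
```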

#### Obtaining SHAP Values

There is ONE missing value in the section below.

HINT: Why not use the `object_methods` function to see if you can identify which method returns the Shapley values?

In [None]:
# Compute SHAP values
explainer = shap.Explainer(model)
shap_values = explainer.XXXX(X_test) ####

shap.summary_plot(shap_values,
                  X_test.values,
                  plot_type="bar",
                  class_names= class_names,
                  feature_names = feature_names)
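The bar version of the summary plot ranks features by the mean absolute SHAP value, computed separately for each class. The sketch below reproduces that calculation on a made-up array. It assumes the per-class layout (class, row, feature) that the `shap_values[class_id]` indexing in this notebook implies; some SHAP versions may arrange the axes differently, so check the shape of your own output. The array `toy_shap` is invented for illustration.

```python
import numpy as np

# Made-up SHAP values: 2 classes x 2 rows x 2 features
toy_shap = np.array([
    [[0.2, -0.1], [-0.4, 0.3]],   # class 0
    [[-0.2, 0.1], [0.4, -0.3]],   # class 1
])

# Average the absolute values over the rows (axis 1) for each class
mean_abs = np.abs(toy_shap).mean(axis=1)

print(mean_abs)  # one row per class, one column per feature
```

Taking the absolute value first matters: positive and negative contributions would otherwise cancel out and hide an influential feature.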

#### Summary Plots for each Class

There is ONE missing value in the code below.

HINT: You had to look up the method to create this variable above.

In [None]:
# Summary Plot for Each Class

for class_id in iris.target.unique():
    class_name = class_dict.get(class_id).capitalize()
    print(f"---------\n\nSummary Plot for Class {class_name}")
    shap.summary_plot(XXXX[class_id], ####
                      X_test.values,
                      feature_names = feature_names)

#### Dependence Plots for each Class (Species)

There is ONE missing value in the code below.

In [None]:
# dependence plots

for class_id in iris.target.unique():
    for idx, col_name in enumerate(feature_names):
            class_name = class_dict.get(class_id).capitalize()
            print(f"--------\n\nDependence Plot for {class_name} - {col_name}")
            shap.dependence_plot(XXXX, # Index of Column ####
                                 shap_values[class_id], # Shap values for class of interest
                                 X_test.values, # Array of data
                                 feature_names=feature_names) # Feature Names

#### Force & Waterfall Plots

You can change the `row` and `class_id` values below to see the different outputs.

In [None]:
# Force Plot
row = 2
class_id = 0

class_name = class_dict.get(class_id).capitalize()

print(f"Below is the Force Plot for {class_name} - Record {row}")
print("i.e. This shows how the predicted probability of this class was arrived at")
shap.force_plot(explainer.expected_value[class_id], # return the base or expected values from the `explainer` object
                shap_values[class_id][row], # return the shap values for the respective class and row number
                X_test.iloc[row].values, # values under the bar
                feature_names = feature_names)


There is ONE missing value in the code below.

HINT: The `data` argument will come from a single row of the testing data.

In [None]:
# Waterfall Plot
print(f"Below is the Waterfall Plot for {class_name} - Record {row}")
print("i.e. This shows how the predicted probability of this class was arrived at")
shap.waterfall_plot(shap.Explanation(values = shap_values[class_id][row], # return the shap values for the respective class and row number
                                     base_values = explainer.expected_value[class_id], # return the base or expected values from the `explainer` object
                                     data = XXXX, # feature values (light grey on left hand side) ####
                                     feature_names = feature_names))
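Both the force plot and the waterfall plot rest on the same property: the base (expected) value plus the sum of a row's SHAP values equals the model's output for that row. The sketch below demonstrates this with a brute-force Shapley calculation on a hypothetical three-feature toy model. It is not how SHAP's tree explainer computes values internally, and `model`, `background`, and `instance` are all invented for illustration.

```python
from itertools import combinations
from math import factorial

def model(row):
    # Hypothetical toy model standing in for one class's predicted output
    return 2.0 * row[0] + row[1] * row[2]

background = [(0.0, 0.0, 0.0), (1.0, 2.0, 0.0), (2.0, 1.0, 3.0)]
instance = (3.0, 1.0, 2.0)
n = len(instance)

def value(subset):
    """Average output when features in `subset` take the instance's values
    and the remaining features come from each background row."""
    total = 0.0
    for bg in background:
        mixed = tuple(instance[i] if i in subset else bg[i] for i in range(n))
        total += model(mixed)
    return total / len(background)

def shapley(i):
    """Exact Shapley value of feature i: weighted average of its marginal
    contribution over every subset of the other features."""
    others = [j for j in range(n) if j != i]
    phi = 0.0
    for size in range(n):
        weight = factorial(size) * factorial(n - size - 1) / factorial(n)
        for combo in combinations(others, size):
            phi += weight * (value(set(combo) | {i}) - value(set(combo)))
    return phi

base_value = value(set())            # average output over the background
phis = [shapley(i) for i in range(n)]
prediction = model(instance)

print(base_value, phis, prediction)
```

In the plots above, the base value is where the bars start, each bar is one feature's SHAP value, and the bars end at the model's output for the chosen row and class.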