In this mini-lecture, we study local surrogate models (LIME).

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
import os 
import pprint
import copy
import lime

# import eli5
# !pip install scikit-image

import sklearn.metrics
import skimage.io
import skimage.transform
import skimage.segmentation

from skimage.segmentation import mark_boundaries
from lime import lime_tabular
from lime import lime_image
from lime import lime_text
from lime.lime_text import LimeTextExplainer

from sklearn import svm
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import mean_squared_error,mean_absolute_error,explained_variance_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import fetch_20newsgroups
from sklearn.utils import Bunch  # sklearn.datasets.base was removed in newer scikit-learn releases
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Activation
from tensorflow.keras.layers import Dropout
from tensorflow.keras.applications.imagenet_utils import decode_predictions

%matplotlib inline

In [None]:
path="C:\\Users\\gao\\GAO_Jupyter_Notebook\\Datasets"
os.chdir(path)

#path="C:\\Users\\pgao\\Documents\\PGZ Documents\\Programming Workshop\\PYTHON\\Open Courses on Python\\Udemy Course on Python\Introduction to Data Science Using Python\\datasets"
#os.chdir(path)

### I. Theory on LIME (Local Interpretable Model-Agnostic Explanations)

In the past we have studied two global methods of model interpretation: PFI and the global surrogate method. In this lecture we focus on the local surrogate method, specifically an algorithm called **local interpretable model-agnostic explanations (LIME)**. Instead of training a global surrogate model, LIME trains local surrogate models to explain individual predictions.

The idea of LIME is quite intuitive. First, forget about the training data and imagine you only have the blackbox model, into which you can feed data points and get the model's predictions. You can probe the box as often as you want. Your goal is to understand why the machine learning model made a certain prediction. LIME tests what happens to the predictions when you feed variations of your data into the machine learning model. LIME generates a new dataset consisting of perturbed samples and the corresponding predictions of the blackbox model. On this new dataset, LIME trains an interpretable model, in which each sample is weighted by its proximity to the instance of interest. The interpretable model can be any off-the-shelf method such as a linear model or a decision tree. The learned model should be a good approximation of the machine learning model's predictions locally, but it does not have to be a good global approximation. This kind of accuracy is also called **local fidelity**.

Here is the recipe for training a local surrogate model:

   - select your instance of interest for which you want to have an explanation of its blackbox prediction;
   - perturb your dataset and get the blackbox predictions for these new points;
   - weight the new samples according to their proximity to the instance of interest;
   - train a weighted, interpretable model on the dataset with the variations;
   - explain the prediction by interpreting the local model. 
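
The recipe above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the `lime` package's actual implementation: the `black_box` function, the toy data, and the helper name `explain_locally` are all hypothetical, and the kernel form and default width are assumptions chosen to mirror the description of LIME for tabular data.

```python
import numpy as np

def black_box(X):
    """Hypothetical blackbox: probability of class 1 from a nonlinear rule."""
    return 1.0 / (1.0 + np.exp(-(X[:, 0] ** 2 - X[:, 1])))

def explain_locally(predict_fn, x, X_train, num_samples=5000, kernel_width=None, seed=0):
    """Fit a proximity-weighted linear surrogate around the instance x."""
    rng = np.random.default_rng(seed)
    mean, std = X_train.mean(axis=0), X_train.std(axis=0)
    n_features = X_train.shape[1]
    if kernel_width is None:
        kernel_width = 0.75 * np.sqrt(n_features)  # assumed default width

    # Step 2: perturb by sampling each feature from N(mean, std) of the training data
    Z = rng.normal(mean, std, size=(num_samples, n_features))
    Z[0] = x  # keep the instance of interest itself in the sample
    y = predict_fn(Z)  # blackbox predictions for the perturbed points

    # Step 3: weight samples by proximity to x (distance on standardized features)
    d = np.linalg.norm((Z - x) / std, axis=1)
    w = np.sqrt(np.exp(-(d ** 2) / kernel_width ** 2))

    # Step 4: weighted least squares on standardized features, with an intercept
    A = np.column_stack([np.ones(num_samples), (Z - mean) / std])
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return coef[0], coef[1:]

# Step 1: pick an instance of interest (toy data, purely illustrative)
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(500, 2))
x_interest = np.array([1.5, 0.0])

# Step 5: interpret the signs and sizes of the local weights
intercept, weights = explain_locally(black_box, x_interest, X_toy)
print("local weights:", weights)
```

Around this instance the first feature should receive a positive local weight and the second a negative one, which matches how the toy blackbox was constructed.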
   
How do you get the variations of the data? This depends on the type of data, which can be either text, image or tabular data. For texts and images, the solution is to turn single words or super-pixels on or off. In the case of tabular data, LIME creates new samples by perturbing each feature individually, drawing from a normal distribution with mean and standard deviation taken from the feature (assuming numeric features).

#### 1. LIME on Tabular Data

Tabular data is data that comes in tables, with each row representing an instance and each column a feature. LIME samples are not taken around the instance of interest, but from the training data's mass center, which is problematic. On the other hand, it increases the probability that the predictions for some of the sampled points differ from the prediction for the data point of interest, so that LIME can learn at least some explanation.

It is best to visually explain how sampling and local model training works. Below are four pictures that show the following:

   - A) random forest predictions given features $x_{1}$ and $x_{2}$; predicted classes $y$: 1 (dark) or 0 (light);
   - B) instance of interest (big dot) and data sampled from a normal distribution (small dots);
   - C) assignment of higher weight to points near the instance of interest;
   - D) signs of the grid showing the classifications of the locally learned model from the weighted samples, with the white line marking the decision boundary ($Pr(y=1) = 0.5$).

In [None]:
from IPython.display import Image
Image("LIME tabular.PNG", height=450, width=500)

Defining a meaningful neighborhood around a point is difficult. This is actually the biggest critique of LIME, which currently uses an exponential smoothing kernel to define the neighborhood. A smoothing kernel is a function that takes two data instances and returns a proximity measure. The kernel width determines how large the neighborhood is: a small kernel width means that an instance must be very close to influence the local model, while a larger kernel width means that instances that are farther away also influence the model. The original Python LIME implementation uses an exponential smoothing kernel (on the normalized data), and the default kernel width is 0.75 times the square root of the number of columns of the training data. Of course, there is no 'correct' answer here. Choosing the right kernel and kernel width is a big challenge in LIME implementations.
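
To get a feel for how the kernel width changes the neighborhood, here is a small sketch. The kernel formula is an assumption about the form used by the `lime` package (a square-rooted Gaussian of the distance), and the feature count of 12 and the distances are made-up illustration values.

```python
import numpy as np

def exponential_kernel(distance, kernel_width):
    """Proximity weight in (0, 1]; assumed form of LIME's exponential kernel."""
    return np.sqrt(np.exp(-(distance ** 2) / kernel_width ** 2))

n_features = 12                               # hypothetical number of columns
default_width = 0.75 * np.sqrt(n_features)    # default width described above
distances = np.array([0.0, 1.0, 2.0, 4.0])    # illustrative distances to the instance

weights_default = exponential_kernel(distances, default_width)
for width in (0.5, default_width, 5.0):
    print(round(float(width), 2), np.round(exponential_kernel(distances, width), 3))
```

With a narrow width only the closest points keep a noticeable weight, while a wide kernel lets distant points influence the local model almost as much as nearby ones.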

#### 2. LIME on Texts

LIME for texts differs from LIME for tabular data. Variations of the data are generated differently: starting from the original texts, new texts are created by randomly removing words from the original text. The dataset is represented with binary features for each word. A feature is 1 if the corresponding word is included and 0 if it has been removed.

Let's see an example where we classify YouTube comments as spam or normal. Suppose we have a blackbox model trained on the document-word matrix. Each comment is one document (one row) and each column is the number of occurrences of a given word. Short decision trees would be easy to understand, but in this case the model is complex. In other cases, the blackbox could be a recurrent neural network or a support vector machine trained on word embeddings (abstract vectors). Let us look at two comments from this dataset and the corresponding classes (1 for spam, 0 for normal comment):

In [None]:
from IPython.display import Image
Image("youtube comment snippet.PNG", height=450, width=500)

Now let's create some variations of the data to be used in a local model. For example, some variations of one of the comments may look like the table below. Each column corresponds to one word in the sentence. Each row is a variation; 1 means that the word is part of this variation and 0 means that the word has been removed. The "prob" column shows the predicted probability of spam for each of the sentence variations. The "weight" column shows the proximity of the variation to the original sentence, calculated as 1 minus the proportion of words that were removed. For example, if 1 out of 7 words was removed, the proximity is $1 - \frac{1}{7} = 0.86$.

In [None]:
from IPython.display import Image
Image("youtube comment perturbation.PNG", height=450, width=500)
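
The word-removal scheme and the proximity weight described above can be sketched as follows. The sentence is a made-up 7-word example (not taken from the YouTube dataset), and the weight follows the "1 minus proportion of removed words" rule stated in the text rather than whatever distance the `lime` package uses internally.

```python
import numpy as np

rng = np.random.default_rng(0)
words = "please check out my new channel today".split()  # hypothetical 7-word comment

num_variations = 5
# Binary mask per variation: 1 = word kept, 0 = word removed
masks = rng.integers(0, 2, size=(num_variations, len(words)))
masks[0] = 1  # the first row is the unperturbed original sentence

# Rebuild the text of each variation from its mask
variations = [" ".join(w for w, keep in zip(words, m) if keep) for m in masks]

# Proximity = 1 - (number of removed words / total number of words)
proximity = 1.0 - (1 - masks).sum(axis=1) / len(words)

for text, m, p in zip(variations, masks, proximity):
    print(m, round(float(p), 2), repr(text))
```

Each perturbed sentence would then be sent to the blackbox model to obtain the "prob" column, and the binary masks would serve as the features of the local surrogate model.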

#### 3. LIME on Images

LIME for images works differently from LIME for tabular data and text. Intuitively, it would not make much sense to perturb individual pixels, since many more than one pixel contribute to one class. Randomly changing individual pixels would probably not change the predictions by much. Therefore, variations of the images are created by segmenting the image into "superpixels" and turning superpixels off or on. Superpixels are interconnected pixels with similar colors and can be turned off by replacing each pixel with a user-defined color such as gray (you can think of superpixels as 'blobs' of similar color). The user can also specify a probability for turning off a superpixel in each perturbed sample.
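
Turning superpixels off can be sketched with plain NumPy. To keep the example self-contained, a regular 4x4 grid of blocks stands in for a real superpixel segmentation (in practice one would use a segmentation algorithm, for instance from `skimage.segmentation`); the random image, the block size, and the switch-off probability are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))  # hypothetical RGB image with values in [0, 1]

# Stand-in segmentation: label each pixel with the 16x16 block it belongs to
rows, cols = np.indices(image.shape[:2])
segments = (rows // 16) * 4 + (cols // 16)  # integer labels 0..15
n_segments = int(segments.max()) + 1

p_off = 0.5                                 # probability of switching a superpixel off
active = rng.random(n_segments) >= p_off    # True = superpixel stays on

# Replace every pixel of a switched-off superpixel with gray
perturbed = image.copy()
off_pixels = ~active[segments]              # boolean mask of shape (64, 64)
perturbed[off_pixels] = 0.5

print("superpixels kept:", int(active.sum()), "of", n_segments)
```

The binary vector `active` plays the same role as the word mask in the text case: it is the interpretable representation on which the local surrogate model is trained.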

There are certainly advantages and shortcomings of using LIME. Here are the advantages of using LIME:

   1. LIME is very flexible. Even if we replace the underlying machine learning model, we can still use the same local, interpretable model for explanation. Suppose the people looking at the explanations understand decision trees best. Because we use local surrogate models, we can use decision trees as explanations without actually having to use a decision tree to make the predictions. For example, we can use an SVM; and if it turns out that an xgboost model works better, we can replace the SVM and still use a decision tree to explain the predictions.
   2. When using surrogate models such as lasso or short trees, the resulting explanations are short (selective) and possibly contrastive. Therefore, they make human-friendly explanations. This is true for tabular data, texts and images.
   3. The fidelity measure (how well the interpretable model approximates the blackbox predictions) gives us a good idea of how reliable the interpretable model is in explaining the blackbox predictions in the neighborhood of the data instance of interest.
   4. The explanations created with local surrogate models can use other (interpretable) features than the original model was trained on. Of course, these interpretable features must be derived from the data instances. A text classifier can rely on abstract word embeddings as features, but the explanation can be based on the presence or absence of words in a sentence. A regression model can rely on a non-interpretable transformation of some attributes, but the explanations can be created with the original attributes. For example, the regression model could be trained on components of a principal component analysis (PCA) of answers to a survey, but LIME might be trained on the original survey questions. Using interpretable features for LIME can be a big advantage over other methods, especially when the model was trained with non-interpretable features.

Here are the disadvantages:

   1. The correct definition of the neighborhood is a big, unsolved problem when using LIME with tabular data. This is perhaps the biggest problem of LIME. Really, the explanations depend on the selection of many hyperparameters.
   2. Sampling could be improved in the current implementation of LIME. Data points are sampled from a Gaussian distribution, ignoring the correlation between features. This can lead to unlikely data points which can then be used to learn local explanation models.
   3. Another really big problem is the instability of the explanations (i.e. the explanations are not robust). In an article written by Alvarez-Melis et al. (2018), the authors showed that the explanations of two very close points varied greatly in a simulated setting. Also, in my experience, if you repeat the sampling process, the explanations that come out can be different. Instability means that it is difficult to trust the explanations, and you should be very critical.
   4. All surrogate models may suffer from low fidelity (that is, the surrogate model may not approximate the blackbox model well enough).
   
Well-known Python packages for LIME include 'lime' (the original library) and 'eli5', as of 2021. Let's see some real examples now.

### II. LIME for Tabular Data (Multiclass Classification)

In this example, we use a wine-quality dataset related to red and white variants of the Portuguese "Vinho Verde" wine. For more details, refer to Cortez et al. (2009). The dataset can be used for classification or regression tasks and is available on Kaggle. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones), so the dataset can also be used with outlier detection algorithms to detect the few excellent or poor wines. Also, there are many input variables which may be irrelevant, so it could be interesting to test feature selection methods. Here, we implement a classification example using a tree ensemble, and then try to use LIME to explain the model.

Here are the input variables (based on physicochemical tests):

   1. fixed acidity
   2. volatile acidity
   3. citric acid
   4. residual sugar
   5. chlorides
   6. free sulfur dioxide
   7. total sulfur dioxide
   8. density
   9. pH
   10. sulphates
   11. alcohol

There is only one output variable (based on sensory data): quality (score between 0 and 10). This will be a multiclass classification example.

##### 1. Data Preprocessing

Let's first read in the two datasets, one containing white wine information and the other containing the red wine information. Later we will combine them into one dataset. 

In [None]:
winewhite = pd.read_csv('winequality-white.csv', sep=';')
winewhite.head()

In [None]:
winered = pd.read_csv('winequality-red.csv', sep=';')
winered.head()

In [None]:
winered['type']='red'
winewhite['type']='white'

wine=pd.concat([winered, winewhite], ignore_index=True)
print(wine.info())
wine.head()

We see that the field 'quality' is an integer. Let's change it to a categorical variable. Notice below that the 'dtype' for this variable becomes 'category', yet each individual value of the column is still expressed as an integer:

In [None]:
wine['quality']=wine['quality'].astype('category') # changing the 'quality' field to categorical
print(type(wine['quality'][3])) # specific value still as an integer (int64)
wine.info() # Dtype = 'category'

Let's look at some descriptive statistics. We will also check how many missing values there are for each variable. Then we will look at the correlations between variables through a heatmap:

In [None]:
wine.describe(include='all')

In [None]:
wine.isna().sum() # checking how many missing values are there

In [None]:
plt.figure(figsize=(10,8))
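sns.heatmap(wine.select_dtypes('number').corr(), annot=True, robust=True) # restricting to numeric columns; newer pandas raises on the string column 'type'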

We can see that there are no missing values. In addition, no pair of variables is very highly correlated, so we don't have to worry about multicollinearity too much for now. Next, let's do some in-depth visual analysis. We will start with univariate analysis, then move on to bivariate analysis and finally multivariate analysis.

##### 2. Univariate Visualization 

Let's first look at continuous variables. Remember that histograms and density plots are the best ways to visualize continuous variables. We first look at histograms for each variable. Then we will pick the 'sulphates' variable specifically to look at its density plot.

In [None]:
wine.hist(bins=15, color='steelblue', edgecolor='black', linewidth=1.0, xlabelsize=8, ylabelsize=8, grid=False)
plt.tight_layout(rect=(0, 0, 1.2, 1.2))  # command to give space  

In [None]:
fig = plt.figure(figsize = (4, 2))
title = fig.suptitle("Sulphates Content in Wine", fontsize=14)
fig.subplots_adjust(top=0.85, wspace=0.3)

ax1 = fig.add_subplot(1,1, 1)
ax1.set_xlabel("Sulphates")
ax1.set_ylabel("Density") 
sns.kdeplot(wine['sulphates'], ax=ax1, fill=True, color='darkgreen') # fill= replaces the deprecated shade= argument

Now let's look at discrete variables. Since 'quality' is discrete and it indicates wines' low-medium-high quality based on a scale of 10 (all subjective values), let's look at their distribution using barplots. Below, we see that the quality in general is mediocre (6). There are very few good wines (>=9). None of the wines are rated below or equal to 2:

In [None]:
print(wine.quality.value_counts())

In [None]:
fig = plt.figure(figsize = (4, 2))
title = fig.suptitle("Wine Quality Frequency", fontsize=14)
fig.subplots_adjust(top=0.85, wspace=0.3)

ax = fig.add_subplot(1,1, 1)
ax.set_xlabel("Quality")
ax.set_ylabel("Frequency") 
w_q = wine['quality'].value_counts()
w_q = (list(w_q.index), list(w_q.values))
ax.tick_params(axis='both', which='major', labelsize=8.5)
bar = ax.bar(w_q[0], w_q[1], color='darkgreen', edgecolor='yellow', linewidth=1)

##### 3. Bivariate Visualization 

Now let's do some bivariate analysis. Let's look at some ways in which we can visualize two continuous, numeric attributes as a start. Scatter plots and joint plots in particular are good ways not only to check for patterns or relationships, but also to see the individual distributions of the attributes. A related technique is the violin plot, which is an effective way to visualize grouped numeric data using kernel density plots (depicting the probability density of the data at different values):

In [None]:
plt.scatter(wine['sulphates'], wine['alcohol'], color='darkorange', alpha=0.4, edgecolors='w')

plt.xlabel('Sulphates')
plt.ylabel('Alcohol')
plt.title('Wine Sulphates - Alcohol Content', y=1.05)

In [None]:
f, (ax) = plt.subplots(1, 1, figsize=(8, 3))
f.suptitle('Wine Quality - Sulphates Content', fontsize=12)

sns.violinplot(x="quality", y="sulphates", data=wine,  ax=ax)
ax.set_xlabel("Wine Quality",size = 10,alpha=0.8)
ax.set_ylabel("Wine Sulphates",size = 10,alpha=0.8)

For 2 discrete variables, we can create barplots like below:

In [None]:
fig = plt.figure(figsize = (10, 4))
title = fig.suptitle("Wine Type - Quality", fontsize=11)
fig.subplots_adjust(top=0.85, wspace=0.3)

ax1 = fig.add_subplot(1,2, 1)
ax1.set_title("Red Wine")
ax1.set_xlabel("Quality")
ax1.set_ylabel("Frequency") 

rw_q = winered['quality'].value_counts()
rw_q = (list(rw_q.index), list(rw_q.values))

ax1.set_ylim([0, 2500])
ax1.tick_params(axis='both', which='major', labelsize=7.5)
bar1 = ax1.bar(rw_q[0], rw_q[1], color='maroon', edgecolor='black', linewidth=1)

ax2 = fig.add_subplot(1, 2, 2)
ax2.set_title("White Wine")
ax2.set_xlabel("Quality")
ax2.set_ylabel("Frequency") 

ww_q = winewhite['quality'].value_counts()
ww_q = (list(ww_q.index), list(ww_q.values))

ax2.set_ylim([0, 2500])
ax2.tick_params(axis='both', which='major', labelsize=7.5)
bar2 = ax2.bar(ww_q[0], ww_q[1], color='beige', edgecolor='black', linewidth=1)

##### 4. High-Dimensional Visualization 

Visualizing data in up to two dimensions is pretty straightforward, but it becomes more complex as the number of dimensions increases. For three-dimensional data, we can introduce a fake notion of depth by adding a z-axis to our chart or by leveraging subplots and facets. For data with more than three dimensions, it becomes even more difficult to visualize. The best way to go beyond three dimensions is to use facets, color, shapes, sizes, depth and so on to distinguish data by different 'layers' or 'categories'.

Let's look at strategies for visualizing three continuous, numeric attributes. One way would be to have two dimensions represented as the regular length (x-axis) and breadth (y-axis) and also take the notion of depth (z-axis) for the third dimension:

In [None]:
fig = plt.figure(figsize=(7, 5))
ax = fig.add_subplot(111, projection='3d')

xs = wine['residual sugar']
ys = wine['fixed acidity']
zs = wine['alcohol']
ax.scatter(xs, ys, zs, s=50, alpha=0.6, edgecolors='w', c='red')

ax.set_xlabel('Residual Sugar')
ax.set_ylabel('Fixed Acidity')
ax.set_zlabel('Alcohol')

The picture is nice but it could be done better. In data visualization, faceting means splitting the data into a grid of subplots, one for each bin or category of a variable. Here, a better option would be to use faceting for the third dimension, where each subplot shows a specific bin of our third variable. Since we have three continuous variables, we need to create our own bins manually if we are using the scatterplot functionality from 'matplotlib' as opposed to 'seaborn' (see the following example). The plot below suggests that the higher the residual_sugar levels and the alcohol content, the lower the fixed_acidity in the wine samples:

In [None]:
quantile_list = [0, 0.25, 0.5, 0.75, 1.0]
quantile_labels = ['0', '25', '50', '75']
wine['res_sugar_labels'] = pd.qcut(wine['residual sugar'], 
                                    q=quantile_list, labels=quantile_labels)
wine['alcohol_levels'] = pd.qcut(wine['alcohol'], 
                                    q=quantile_list, labels=quantile_labels)
wine.head()

In [None]:
g = sns.FacetGrid(wine, col="res_sugar_labels", hue='alcohol_levels')
g.map(plt.scatter, "fixed acidity", "alcohol", alpha=0.7)
g.add_legend()

Now let's visualize 3 discrete categorical attributes. The trick is to use 'hues':

In [None]:
fc = sns.catplot(x="quality", hue="type", col="res_sugar_labels", data=wine, kind="count",palette={"red": "#BF9930", "white": "#CC3188"})

Using hues is very important in multidimensional data visualization in general. For example, in the 3-D case, we can use hues for a categorical attribute while using conventional visualizations like scatter plots for the two numeric attributes. Notice that below, 'type' is categorical, while the rest of the variables are continuous. So we can create some cool pairplots while using 'type' as the hue:

In [None]:
cols_t = ['density', 'residual sugar', 'total sulfur dioxide', 'fixed acidity', 'type']
pp = sns.pairplot(wine[cols_t], height=1.8, hue = 'type' , aspect=1.8,
                  plot_kws=dict(edgecolor="k", linewidth=0.5),
                  diag_kind="kde", diag_kws=dict(fill=True)) # fill= replaces the deprecated shade=

fig = pp.fig  
fig.subplots_adjust(top=0.93, wspace=0.5)
t = fig.suptitle('Wine Attributes Pairwise Plots', fontsize=14)

Here is another example:

In [None]:
lp = sns.lmplot(x='sulphates', y='alcohol', hue='type', 
                palette={"white": "lemonchiffon", "red": "firebrick"},
                data=wine, fit_reg=True, legend=True,
                scatter_kws=dict(edgecolor="k", linewidth=0.5))

The hues act as good separators for the categories or groups and while there is no or very weak correlation as observed above, we can still understand from these plots that sulphates are slightly higher for red wines as compared to white. Instead of a scatter plot, we can also use a kernel density plot to understand the data in three dimensions:

In [None]:
# recent seaborn versions require keyword x=/y= and use fill=/thresh= instead of shade=/shade_lowest=
ax = sns.kdeplot(x=winewhite['sulphates'], y=winewhite['alcohol'],
                  cmap="YlOrBr", fill=True, thresh=0.05)
ax = sns.kdeplot(x=winered['sulphates'], y=winered['alcohol'],
                  cmap="Reds", fill=True, thresh=0.05)

##### 5. Model Training

Let's train the model. We want to use all attributes to predict wine quality, which is indicated by integers. First, let's convert the 'type' variable into a dummy variable, since 'type' is categorical and takes the values 'red' or 'white'. We want to use type as an input:

In [None]:
wines = pd.get_dummies(wine, columns = ['type'], drop_first=True) # drop_first=True ensures that there is no dummy variable trap
print(wines.head())

Recall that we created two derived variables in the previous section: 'res_sugar_labels' and 'alcohol_levels'. We will drop them here and then do the train-test split. Then we use the ExtraTreesClassifier() class to train the model. This is an ensemble of extremely randomized trees, a close relative of random forests, and it is not an easily interpretable model. In extremely randomized trees, randomness goes one step further in the way splits are computed. As in random forests, a random subset of candidate features is used, but instead of looking for the most discriminative thresholds, thresholds are drawn at random for each candidate feature and the best of these randomly generated thresholds is picked as the splitting rule. This usually reduces the variance of the model a bit more, at the expense of a slightly greater increase in bias.
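
The difference between the two splitting rules can be illustrated with a toy sketch. This is not scikit-learn's implementation: the helper names, the Gini criterion, and the synthetic data are assumptions made only to contrast an exhaustive threshold search with the "one random threshold per candidate feature" idea described above.

```python
import numpy as np

def gini(y):
    """Gini impurity of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_impurity(x, y, threshold):
    """Size-weighted Gini impurity after splitting on x <= threshold."""
    left = x <= threshold
    if left.all() or not left.any():
        return gini(y)  # degenerate split: nothing is separated
    n = len(y)
    return (left.sum() * gini(y[left]) + (~left).sum() * gini(y[~left])) / n

def exhaustive_split(X, y, features):
    """Random-forest style: try every midpoint threshold of each candidate feature."""
    best = (np.inf, None, None)
    for j in features:
        values = np.unique(X[:, j])
        for t in (values[:-1] + values[1:]) / 2:
            score = split_impurity(X[:, j], y, t)
            if score < best[0]:
                best = (score, j, t)
    return best

def random_threshold_split(X, y, features, rng):
    """Extra-trees style: one random threshold per candidate feature, keep the best."""
    best = (np.inf, None, None)
    for j in features:
        t = rng.uniform(X[:, j].min(), X[:, j].max())
        score = split_impurity(X[:, j], y, t)
        if score < best[0]:
            best = (score, j, t)
    return best

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 4))
y_toy = (X_toy[:, 2] > 0.3).astype(int)
candidates = rng.choice(4, size=2, replace=False)  # random subset of candidate features

exhaustive = exhaustive_split(X_toy, y_toy, candidates)
randomized = random_threshold_split(X_toy, y_toy, candidates, rng)
print("exhaustive (impurity, feature, threshold):", exhaustive)
print("randomized (impurity, feature, threshold):", randomized)
```

A single randomized split is never better than the exhaustive one on the same candidates, which is the source of the extra bias; averaging many such trees is what brings the variance down.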

In [None]:
X = wines.drop(['quality','res_sugar_labels', 'alcohol_levels'], axis=1) # getting rid of the 2 derived variables
y = wines['quality']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=428)

Let's do a sanity check after the train-test split. And then we will train the model:

In [None]:
sanity_check=X_train.describe().loc[['min', 'max', 'mean', 'std'], :]
sanity_check.round(decimals=2)

In [None]:
model = ExtraTreesClassifier(random_state=428)
model.fit(X_train, y_train)
scores = model.score(X_test, y_test) # returning the mean accuracy on the given test data and labels
print(scores) # for a single-label multiclass problem like this one, this is the share of test wines whose quality class is predicted exactly

##### 6. Model Interpretation

Now let's use the lime_tabular.LimeTabularExplainer() class to explain the model for the 3rd observation. As we will see, the model is 41% confident this record falls in class 4 (low quality wine in this case). Meanwhile, the values of alcohol, chlorides, and fixed acidity etc. decrease the wine’s chance to be classified as medium quality (6):

In [None]:
explainer = lime_tabular.LimeTabularExplainer(
    training_data=np.array(X_train), # this has to be a 2-D numpy array
    feature_names=X_train.columns, # list of names (strings) corresponding to the columns in the training data
    class_names=[str(c) for c in model.classes_], # class names in the same order as the columns returned by model.predict_proba
    mode='classification' # classification or regression
)

In [None]:
print(X_test.iloc[2])

In [None]:
exp = explainer.explain_instance(
    data_row=X_test.iloc[2], # picking the 3rd observation
    predict_fn=model.predict_proba # prediction function
)

exp.show_in_notebook(show_table=True)

Let's look at another individual record and see whether we can explain it. This time, we will add another argument _top_labels_, which tells LIME to produce explanations for that many classes, namely the ones with the highest predicted probabilities (in classification problems):

In [None]:
exp = explainer.explain_instance(
    data_row=X_test.iloc[9], # picking the 10th observation
    predict_fn=model.predict_proba, # prediction function
    top_labels=7
)

exp.show_in_notebook(show_table=True)

### III. LIME for Tabular Data (Regression)

Now let's use the same wine data for a regression exercise. We take 'pH' as the target variable and let all the other columns predict it using a neural network model.

To start with, let's convert all the categorical variables to dummies:

In [None]:
wines2 = pd.get_dummies(wine, columns = ['type', 'quality'], drop_first=True) # drop_first=True ensures that there is no dummy variable trap
wines2.info()

Now let's split the data and standardize the data since we are planning to use a neural network model:

In [None]:
X = wines2.drop(['pH'], axis=1) # getting rid of the target variable in the input
y = wines2['pH']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=428)

scaler = StandardScaler()
X_train=pd.DataFrame(scaler.fit_transform(X_train), index=X_train.index, columns=X_train.columns)
X_test=pd.DataFrame(scaler.transform(X_test), index=X_test.index, columns=X_test.columns) # reusing the training mean/std; refitting on the test set would leak test statistics

Let's do a sanity check on the data after the train-test split. Since neural networks train best on standardized inputs, we need to make sure that each training feature has mean 0 and unit standard deviation after the transformation:

In [None]:
sanity_check=X_train.describe().loc[['min', 'max', 'mean', 'std'], :]
sanity_check.round(decimals=2)
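
As a side note on the standardization step, the sketch below shows the usual pattern of estimating the mean and standard deviation on the training split only and reusing them for the test split. The synthetic arrays and their names are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_tr = rng.normal(loc=10.0, scale=2.0, size=(1000, 3))  # hypothetical training features
X_te = rng.normal(loc=10.0, scale=2.0, size=(50, 3))    # hypothetical test features

toy_scaler = StandardScaler().fit(X_tr)  # statistics come from the training data only
X_tr_s = toy_scaler.transform(X_tr)
X_te_s = toy_scaler.transform(X_te)      # same mean/std applied to the test data

print("train mean/std:", X_tr_s.mean(axis=0).round(2), X_tr_s.std(axis=0).round(2))
print("test  mean/std:", X_te_s.mean(axis=0).round(2), X_te_s.std(axis=0).round(2))
```

The training split ends up with mean 0 and unit standard deviation by construction, while the test split is only approximately standardized, which is expected.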

In [None]:
model = Sequential()

model.add(Dense(17,activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(9,activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(17,activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(1))

model.compile(optimizer='adam',loss='mse')

In [None]:
model.fit(x=X_train.values,y=y_train.values,
          validation_data=(X_test.values,y_test.values), # passing arrays consistently for both inputs and targets
          batch_size=34,epochs=100)

In [None]:
losses = pd.DataFrame(model.history.history)
losses.plot()

In [None]:
predictions = model.predict(X_test)
print('Model predictions:\n', predictions[0:5], "\n")
print('MAE: ', mean_absolute_error(y_test,predictions))
print('Explained Variance Score: ', explained_variance_score(y_test,predictions))
print('Root Mean Squared Error (RMSE): ', np.sqrt(mean_squared_error(y_test,predictions)))

In [None]:
plt.scatter(y_test,predictions, color='darkorange') # drawing our predictions
plt.plot(y_test,y_test,'blue') # perfect predictions are indicated on the blue line

Let's make sure the errors ($y-\hat{y}$) are well-behaved. And then we will try to explain the model using an individual record. Notice that the explanation is applied to the transformed (standardized) data:

In [None]:
errors = y_test.values.reshape(len(y_test), 1)-predictions
sns.histplot(errors.ravel(), kde=True, color='green') # histplot replaces distplot, which is deprecated in recent seaborn releases

In [None]:
explainer = lime.lime_tabular.LimeTabularExplainer(
    X_train.values, feature_names=X_train.columns, class_names=['pH'], verbose=True, mode='regression')

Let's use the 13th data point so we can see how the model predicts it. The predicted value is 3.15, with "fixed acidity<=-0.63" and "residual sugar<=-0.77" providing the largest positive contributions to the predicted value (keep in mind that all the input values are already standardized). As you can see, the interpretation for regression problems is less natural than for classification problems.

In [None]:
exp = explainer.explain_instance(
    data_row=X_test.iloc[12], # picking the 13th observation
    predict_fn=model.predict, # the predict_fn argument is required and the 'top_labels' argument should be ignored
)

exp.show_in_notebook(show_table=True)

In [None]:
exp.as_list() # returning a list of tuples (representation, weight), with weight as a float.

### IV. LIME for Text Data

Now let's apply LIME to a text data example. In the following example, the text data consists of a total of 1000 posts to two Usenet newsgroups: "alt.atheism" and "soc.religion.christian". The task is to predict which of the two groups a post is from. What makes the data so interesting is that both newsgroups discuss similar themes using the same words, but from obviously rather different angles.

In theory, this should be a difficult task for a simplistic "bag-of-words (BOW)" style classifier. However, simple classifiers perform remarkably well on this task. This is because the dataset is "famous" for the presence of many confounding features, for instance the e-mail domains that are present in the headers. The classifier simply learns that a certain domain is only used in posts in one of the two classes. Therefore, the great out-of-the-box performance of simple text classifiers is not indicative of any real-world performance, since they learn to recognize particularities of this dataset. Interpretability is thus essential to understand whether the model is any good or not.

With this example, let's build two models. The first one is a BOW TF-IDF model. For the second one, we will use pre-trained GloVe word embeddings as features and apply an SVM to them for prediction.

##### 1. Data Preprocessing
 
The dataset is built into the 'sklearn' package. First, let's read in the data:

In [None]:
categories = ['alt.atheism', 'soc.religion.christian']
newsgroups_train = fetch_20newsgroups(subset='train', shuffle=True, 
                                      categories=categories, remove=[]) #('headers', 'footers', 'quotes')
newsgroups_test = fetch_20newsgroups(subset='test', shuffle=True, 
                                     categories=categories, remove=[])#('headers', 'footers', 'quotes'))

class_names = ['atheism', 'christianity']
# print(newsgroups_train.DESCR)
type(newsgroups_train)

In [None]:
print(newsgroups_train.data[0])

Now let's delete empty (or nearly empty) posts from the training and test datasets:

In [None]:
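def delete_empty_posts(data):
    """
    Changes the passed argument in-place by removing posts with fewer than
    3 tokens from data.data, together with their labels in data.target
    """
    # Collect the indices to keep first; deleting inside the loop would
    # shift the remaining elements and skip the post after each deletion
    keep = [i for i, doc in enumerate(data.data) if len(doc.split()) >= 3]
    numdeleted = len(data.data) - len(keep)
    data.data = [data.data[i] for i in keep]
    data.target = data.target[keep]
    print('Deleted {} posts'.format(numdeleted))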

delete_empty_posts(newsgroups_train)
delete_empty_posts(newsgroups_test)
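
Removing elements from a list while iterating over it is easy to get wrong, because each deletion shifts the remaining elements by one position. The sketch below uses a made-up toy list (not the newsgroup data) to contrast an in-loop `del` with a filter-based approach:

```python
docs = ["", "", "a real post here", "", "another real post"]

# Deleting inside the loop: after each `del`, the next element slides into
# position i and is skipped by the iterator.
buggy = list(docs)
for i, doc in enumerate(buggy):
    if len(doc.split()) < 3:
        del buggy[i]

# Filtering into a new list visits every element exactly once.
filtered = [doc for doc in docs if len(doc.split()) >= 3]

print(buggy)     # one empty post survives
print(filtered)  # only the two real posts remain
```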

In [None]:
print(list(newsgroups_train.target_names))
print(len(newsgroups_train.target))

In [None]:
vectorizer = TfidfVectorizer(analyzer='word', token_pattern=r'\b[a-zA-Z]{3,}\b', lowercase=False,
                            min_df=5, max_df=0.7, stop_words='english')

#vectorizer_small = TfidfVectorizer(analyzer='word', token_pattern=r'\b[a-zA-Z]{3,}\b', lowercase=True, min_df=10, max_df=0.7, stop_words='english') # an alternative to play around with

vectorizer.fit_transform(newsgroups_train.data)
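
To see which tokens the `token_pattern` above keeps, the regular expression can be tried on a made-up sentence with the standard `re` module. This is only a sketch of the tokenization step; stop-word removal and the `min_df`/`max_df` filters are not reproduced here:

```python
import re

token_pattern = r'\b[a-zA-Z]{3,}\b'   # same pattern as passed to TfidfVectorizer
text = "The Scripture says: e-mail me at bob@uni.edu in 1993!"

tokens = re.findall(token_pattern, text)
print(tokens)
# ['The', 'Scripture', 'says', 'mail', 'bob', 'uni', 'edu']
```

Words shorter than three letters and numbers are dropped, case is preserved, and the pieces of an e-mail address survive as separate tokens, which is how header domains can end up as features.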

An important property of LIME is that it is "model-agnostic": it just needs the instance that requires explanation and a function, such as a classifier's predict_proba() method, that takes a list of raw texts and returns an array of class probabilities (one row per text, one column per class). So we can use the whole family of scikit-learn classifiers, but also pipelines, which is a powerful thing.
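
To make the expected interface concrete, here is a minimal sketch of a stand-in prediction function. The function `toy_predict_proba` and its keyword rule are hypothetical; the only point is the input and output format (a list of strings in, an array of shape `(n_samples, n_classes)` out):

```python
import numpy as np

def toy_predict_proba(texts):
    """Hypothetical black box: maps a list of strings to an
    (n_samples, n_classes) array of class probabilities."""
    probs = []
    for text in texts:
        # made-up rule: the more often "church" appears, the more 'christian'
        score = text.lower().count("church")
        p_christian = 1.0 - 0.5 ** (score + 1)   # 0.5, 0.75, 0.875, ...
        probs.append([1.0 - p_christian, p_christian])
    return np.array(probs)

out = toy_predict_proba(["I went to church", "No mention here"])
print(out.shape)        # one row per text, one column per class
print(out.sum(axis=1))  # each row sums to 1
```

Any callable with this behaviour could in principle be handed to the explainer in place of `pl.predict_proba`.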

In [None]:
mnb = MultinomialNB(alpha=0.1)
pl = make_pipeline(vectorizer, mnb)

def test_classifier_performance(clf):
    """
    clf will be fitted on the newsgroup train data, measured on test
    clf can be a sklearn pipeline
    """

    clf.fit(newsgroups_train.data, newsgroups_train.target)
    pred = clf.predict(newsgroups_test.data)
    print('Accuracy: {:.3f}'.format(sklearn.metrics.accuracy_score(newsgroups_test.target, pred)))

In [None]:
alpha_grid = np.logspace(-3, 0, 4)  # alpha is the smoothing parameter for the counts
param_grid = [{'multinomialnb__alpha': alpha_grid }]
gs = GridSearchCV(pl, param_grid=param_grid, cv=5, return_train_score=True)

gs.fit(newsgroups_train.data, newsgroups_train.target)
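
The `alpha` tuned above is the additive smoothing parameter of the multinomial naive Bayes model: the per-class word probability is estimated as $(N_{wc} + \alpha) / (N_c + \alpha V)$, where $N_{wc}$ is the (possibly TF-IDF weighted) count of word $w$ in class $c$, $N_c$ the total count in class $c$, and $V$ the vocabulary size. A minimal sketch with made-up counts, where the helper `smoothed_word_probs` is hypothetical:

```python
import numpy as np

def smoothed_word_probs(counts, alpha):
    """Additive smoothing: (N_wc + alpha) / (N_c + alpha * V) per class."""
    counts = np.asarray(counts, dtype=float)
    n_vocab = counts.shape[1]
    return (counts + alpha) / (counts.sum(axis=1, keepdims=True) + alpha * n_vocab)

# rows: classes, columns: words (made-up counts; word 2 never occurs in class 0)
counts = [[8, 2, 0],
          [1, 3, 6]]

for alpha in [0.001, 0.1, 1.0]:
    print(alpha, smoothed_word_probs(counts, alpha).round(4))
```

Even a tiny `alpha` keeps the unseen word from getting probability zero, while larger values pull all word probabilities toward the uniform distribution.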

In [None]:
plt.plot(gs.cv_results_['param_multinomialnb__alpha'].data,gs.cv_results_['mean_test_score'], label='test')
plt.plot(gs.cv_results_['param_multinomialnb__alpha'].data,gs.cv_results_['mean_train_score'], label='train')
plt.legend()
plt.xlabel('alpha')
plt.ylabel('Accuracy-score')
plt.title('Accuracy TF-IDF, Multinomial NB')
plt.semilogx();

We have an impressively high accuracy, of over 97% on the test data. Only a minimal smoothing of the counts using pseudocounts is needed. Now, let's use LIME to interpret the results:

In [None]:
idx=35
explainer = LimeTextExplainer(class_names=class_names) 
pl.fit(newsgroups_train.data, newsgroups_train.target)
exp = explainer.explain_instance(newsgroups_test.data[idx], 
                                 pl.predict_proba, 
                                 num_features=10)
print('Document id: %d' % idx)
print('Probability(christian) = {:.3f}'.format(pl.predict_proba([newsgroups_test.data[idx]])[0,1]))
print('True class: %s' % class_names[newsgroups_test.target[idx]])
print('R2 score: {:.3f}'.format(exp.score))
exp.show_in_notebook(text=True)

How to interpret all of this?

   - On the left, we see the actual prediction of the base model. It is 95% sure it is a post about Christianity, and it is correct.
   - In the middle, we see a graphical display of the features that matter most, locally, for the predicted probability. One example is the word "Scripture" (note that we did not lowercase the words when vectorizing, so "scripture" and "Scripture" are distinct features).
   - At the right side, we see a very helpful graphical display of the text, and where the important features appear.

The feature importances, as displayed in the middle, can be interpreted as: the presence of the corresponding word has increased the prediction probability toward the positive class by this fraction.
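
The idea behind these weights can be sketched in a few lines of NumPy: sample versions of the text with words removed, query the model on each, and fit a weighted linear model on the word-presence indicators. Everything below is an illustrative assumption (the toy sentence, the `black_box` function with made-up effects, the kernel width of 0.25); the `lime` package additionally performs feature selection and uses a Ridge regressor by default:

```python
import numpy as np

rng = np.random.default_rng(0)
words = "the Scripture teaches that faith matters".split()

# Hypothetical black box: probability rises with certain words (made-up effects)
true_effect = {"Scripture": 0.30, "faith": 0.15}
def black_box(text):
    present = set(text.split())
    return 0.40 + sum(v for w, v in true_effect.items() if w in present)

# 1. Sample binary masks: 1 keeps a word, 0 removes it
n_samples = 500
Z = rng.integers(0, 2, size=(n_samples, len(words)))
Z[0] = 1                                   # first sample is the original text

# 2. Query the black box on each perturbed text
texts = [" ".join(w for w, keep in zip(words, z) if keep) for z in Z]
y = np.array([black_box(t) for t in texts])

# 3. Weight samples by proximity to the original (cosine distance + kernel)
k = Z.sum(axis=1)
cos_sim = k / np.sqrt(np.maximum(k, 1) * len(words))
dist = 1.0 - cos_sim
weights = np.sqrt(np.exp(-(dist ** 2) / 0.25 ** 2))

# 4. Weighted least squares with an intercept
X = np.hstack([np.ones((n_samples, 1)), Z])
sw = np.sqrt(weights)[:, None]
coef, *_ = np.linalg.lstsq(X * sw, y * sw.ravel(), rcond=None)

for w, c in zip(words, coef[1:]):
    print(f"{w:10s} {c:+.3f}")
```

Because this toy black box is exactly additive in the words, the fitted coefficients recover the made-up effects; for a real classifier they are only a local approximation.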

Let's check this, by removing the word "Scripture":

In [None]:
newsgroups_test_modified = Bunch(data=newsgroups_test.data.copy(), target=newsgroups_test.target)
newsgroups_test_modified.data[35] = ' '.join([word for word in newsgroups_test_modified.data[35].split(' ') if not 
                                    word.startswith('Scripture')])
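
One caveat with splitting on single spaces is that occurrences preceded by a newline or punctuation are not caught. A possible alternative, sketched here on a made-up string, is a regular expression with a word boundary:

```python
import re

text = "As Scripture says,\nScripture's authority (Scripture) matters. scripture too."

# Approach based on split(' '): only catches tokens that follow a single space
by_split = ' '.join(w for w in text.split(' ') if not w.startswith('Scripture'))

# Regex alternative: remove every case-sensitive word starting with "Scripture",
# regardless of surrounding whitespace or punctuation
by_regex = re.sub(r'\bScripture\w*', '', text)

print(repr(by_split))
print(repr(by_regex))
```

The lowercase "scripture" is left alone in both cases, which matches the case-sensitive vectorizer used here.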

In [None]:
# Compare the original and modified version
print(newsgroups_test.data[35][0:800])
print('*********')
print(newsgroups_test_modified.data[35][0:800])

In [None]:
idx=35
explainer = LimeTextExplainer(class_names=class_names) 
pl.fit(newsgroups_train.data, newsgroups_train.target)
exp = explainer.explain_instance(newsgroups_test_modified.data[idx], 
                                 pl.predict_proba, 
                                 num_features=10)
print('Document id: %d' % idx)
print('Probability(christian) = {:.3f}'.format(pl.predict_proba([newsgroups_test_modified.data[idx]])[0,1]))
print('True class: %s' % class_names[newsgroups_test_modified.target[idx]])
print('R2 score: {:.3f}'.format(exp.score))
exp.show_in_notebook(text=True)

As one can see, the absence of the word "Scripture" reduced the probability for Christianity. The general question is: can we make our surrogate model perform better? We can influence it in several ways:

   - increasing the number of features (the obvious measure);
   - changing the model regressors (by default, it is a Ridge regressor with alpha=1.0);
   - changing the kernel width that determines the locality (by default, it is 25).

To the first point: the number of features is obviously an important parameter. The top-10 might not be enough to build an accurate model.

Regarding the model regressor, we might reduce alpha of the Ridge regressor. The default is arbitrarily set at 1.0. Since the R-squared is that of the fit itself, a lower alpha will result in a better fit. But what value is best? This is hard to tell. Increasing the number of samples might be a good thing to reduce the variance of the fit, and getting away with lower regularization.
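
The effect of `alpha` on the fit can be illustrated with a closed-form ridge regression on synthetic data. This is a sketch under simplifying assumptions: the data are random, sample weights are omitted, and the helper `ridge_fit_r2` is hypothetical:

```python
import numpy as np

def ridge_fit_r2(X, y, alpha):
    """Closed-form ridge (unpenalized intercept via centering); returns training R^2."""
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    beta = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    resid = yc - Xc @ beta
    return 1.0 - (resid @ resid) / (yc @ yc)

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 30)).astype(float)   # binary word-presence features
y = X @ rng.normal(size=30) * 0.05 + rng.normal(scale=0.02, size=200)  # synthetic target

for alpha in [10.0, 1.0, 0.01]:
    print(f"alpha={alpha:<5} R2={ridge_fit_r2(X, y, alpha):.4f}")
```

The training R-squared can only go up as `alpha` decreases, which is why it cannot by itself tell us which `alpha` is best.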

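Finally, the value of 25 for the kernel width, which is applied to an exponential kernel on the cosine distance, is arbitrary and somewhat remarkable. Cosine similarity is bounded between -1 and 1, so the cosine distance lies between 0 and 2, and between 0 and 1 for the non-negative word-presence vectors used here. On that raw scale, a width of 25 would be practically identical to uniform weighting. Note, however, that the `lime` text explainer multiplies the distance by 100 before applying the kernel, so the default corresponds to a width of 0.25 on the raw distance. One could make it smaller, but how much smaller should it be?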
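
To get a feel for how the kernel width shapes the sample weights, the sketch below evaluates an exponential kernel of the form $\sqrt{\exp(-d^2 / w^2)}$ for a few distances and widths. The distances are taken on a 0-1 scale purely for illustration; the scale on which the distance is measured matters as much as the width itself:

```python
import numpy as np

def kernel(d, kernel_width):
    """Exponential kernel: sqrt(exp(-d^2 / width^2))."""
    return np.sqrt(np.exp(-(d ** 2) / kernel_width ** 2))

d = np.array([0.0, 0.25, 0.5, 0.75, 1.0])   # distances on a 0-1 scale
for width in [25, 1.0, 0.25]:
    print(width, kernel(d, width).round(4))
```

With a width much larger than the largest distance, all weights are close to one and the fit is effectively global; with a width comparable to the distances, far-away samples are strongly down-weighted.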

A recommended strategy is to explain the same point, and other points, multiple times with different random seeds (or without fixing the seed), and to verify that the R-squared is consistently above some minimum level. A question that remains, though, is: what is our locality, that is, what kernel width should we use? Again, there is no easy answer.

##### 3. GloVe Model

Next, let's apply the GloVe (word embedding) model. Recall that a nice aspect of LIME is that it accepts scikit-learn pipelines, so we can make things a bit more sophisticated by using word embeddings and a support vector machine.

The benefit of using word embeddings is that words with a similar meaning typically get similar vectors. We can map our words to numerical vectors whose dimensionality can be one or several orders of magnitude lower than the size of our original vocabulary. As a result, we can expect our classifier to generalize better. Note that here we simply average the individual word vectors. A possible refinement is weighting with TF-IDF weights.
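
The difference between a plain average and a weighted average can be sketched with a toy example. The 3-dimensional vectors and the per-word weights below are made up for illustration (real GloVe vectors here have 50 dimensions, and real weights would come from a fitted TF-IDF model):

```python
import numpy as np

# Made-up 3-dimensional embeddings and weights, purely for illustration
toy_vectors = {
    "god":    np.array([0.9, 0.1, 0.0]),
    "church": np.array([0.8, 0.2, 0.1]),
    "the":    np.array([0.0, 0.0, 1.0]),
}
toy_idf = {"god": 2.0, "church": 3.0, "the": 0.1}

def embed(tokens, weights=None):
    """Average the vectors of known tokens, optionally weighted per token."""
    known = [t for t in tokens if t in toy_vectors]
    if not known:
        return np.zeros(3)
    vecs = np.array([toy_vectors[t] for t in known])
    w = np.ones(len(known)) if weights is None else np.array([weights[t] for t in known])
    return (w[:, None] * vecs).sum(axis=0) / w.sum()

tokens = "the church the god unknownword".split()
print(embed(tokens))            # plain mean
print(embed(tokens, toy_idf))   # weighted mean, down-weighting "the"
```

In the plain mean the frequent but uninformative word dominates the third dimension; with the weights its contribution shrinks considerably.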

We will use the GloVe embeddings that were trained on a 6-billion-word corpus and have a dimensionality of 50. These can be downloaded here: https://nlp.stanford.edu/projects/glove/.

The following simple class makes it possible to use GloVe vectors in a pipeline:

In [None]:
class GloveVectorizer:
    def __init__(self, verbose=False, lowercase=True, minchars=3):
        # load in pre-trained word vectors
        print('Loading word vectors...')
        word2vec = {}
        embedding = []
        idx2word = []
        with open('glove.6B.50d.txt', encoding="utf8") as f:
              # is just a space-separated text file in the format:
              # word vec[0] vec[1] vec[2] ...
              for line in f:
                values = line.split()
                word = values[0]
                vec = np.asarray(values[1:], dtype='float32')
                word2vec[word] = vec
                embedding.append(vec)
                idx2word.append(word)
        print('Found %s word vectors.' % len(word2vec))

        self.word2vec = word2vec
        self.embedding = np.array(embedding)
        self.word2idx = {v:k for k,v in enumerate(idx2word)}
        self.V, self.D = self.embedding.shape
        self.verbose = verbose
        self.lowercase = lowercase
        self.minchars = minchars

    def fit(self, data, *args):
        # nothing to learn: the embeddings are pre-trained
        return self

    def transform(self, data, *args):
        X = np.zeros((len(data), self.D))
        n = 0
        emptycount = 0
        for sentence in data:
            # Note: lower-casing the words
            if self.lowercase:
                tokens = sentence.lower().split()
            else:
                tokens = sentence.split()
            vecs = []
            for word in tokens:
                if len(word) >= self.minchars and word in self.word2vec:
                    vec = self.word2vec[word]
                    vecs.append(vec)
            if len(vecs) > 0:
                vecs = np.array(vecs)
                X[n] = vecs.mean(axis=0)
            else:
                emptycount += 1
            n += 1
        if self.verbose:
            print("Number of samples with no words found / total: %s / %s" % (emptycount, len(data)))
        return X

    def fit_transform(self, X, *args):
        self.fit(X, *args)
        return self.transform(X, *args)

In [None]:
gv = GloveVectorizer()
clf_svm = svm.SVC(gamma='scale', probability=True, C=100) # C: slack-variable penalty. 

Let us make a pipeline that combines word vectorization using the GloVe word embeddings with an SVM model. Before making the predictions, let's first optimize the hyperparameter $C$:

In [None]:
p5 = make_pipeline(gv, clf_svm)

C_grid = np.logspace(1, 4, 4)
param_grid = [{'svc__C': C_grid }] # some basic parameter tuning
gs = GridSearchCV(p5, param_grid=param_grid, cv=5, return_train_score=True)

gs.fit(newsgroups_train.data, newsgroups_train.target)

In [None]:
plt.plot(gs.cv_results_['param_svc__C'].data,gs.cv_results_['mean_test_score'], label='test')
plt.plot(gs.cv_results_['param_svc__C'].data,gs.cv_results_['mean_train_score'], label='train')
plt.legend()
plt.xlabel('C')
plt.ylabel('Accuracy')
plt.semilogx();
#plt.ylim([0.8, 0.82])

In [None]:
clf_svm = svm.SVC(gamma='scale', probability=True, C=1000) 
p5 = make_pipeline(gv, clf_svm)

In [None]:
idx=35
explainer = LimeTextExplainer(class_names=class_names,
                              kernel_width=1.0,
                              random_state=1)
p5.fit(newsgroups_train.data, newsgroups_train.target)
exp = explainer.explain_instance(newsgroups_test.data[idx], 
                                 p5.predict_proba, 
                                 model_regressor=Ridge(alpha=0.01),
                                 num_features=25,
                                 num_samples=20000)
print('Document id: %d' % idx)
print('Probability(christian) = {:.3f}'.format(p5.predict_proba([newsgroups_test.data[idx]])[0,1]))
print('True class: %s' % class_names[newsgroups_test.target[idx]])
print('R2 score: {:.3f}'.format(exp.score))
exp.show_in_notebook(text=True)

### V. LIME for Image Data

One of the nicest aspects of LIME is that it can also handle image data. In this section, we first illustrate what LIME does using plain Python code, so that we can understand what each step does behind the scenes. Then we will apply the LIME package directly.

Let’s start by reading an image and using the pre-trained "InceptionV3" model available in Keras to predict the class of such image.

In [None]:
Xi = skimage.io.imread("https://arteagac.github.io/blog/lime_image/img/cat-and-dog.jpg")
Xi = skimage.transform.resize(Xi, (299,299)) 
Xi = (Xi - 0.5)*2 # inception pre-processing
skimage.io.imshow(Xi/2+0.5) # showing image before inception preprocessing

In [None]:
np.random.seed(222)
inceptionV3_model = tf.keras.applications.inception_v3.InceptionV3() # loading pretrained model
preds = inceptionV3_model.predict(Xi[np.newaxis,:,:,:])
top_pred_classes = preds[0].argsort()[-5:][::-1] # saving IDs of top 5 classes
decode_predictions(preds)[0] # printing top 5 classes

So we have:

   - Labrador Retriever (82.2%)
   - Golden Retriever (1.5%)
   - American Staffordshire Terrier (0.9%)
   - Bull Mastiff (0.8%)
   - Great Dane (0.7%)
   
With this information, the input image and the pre-trained InceptionV3 model, we can proceed to generate explanations with LIME. In this example we will generate explanations for the class Labrador Retriever.

LIME creates explanations by generating a new dataset of random perturbations (with their respective predictions) around the instance being explained and then fitting a weighted local surrogate model. This local model is usually a simpler model with intrinsic interpretability such as a linear regression model. For the first step, let's generate random perturbations for input image. The following script uses the quick-shift segmentation algorithm to compute the super-pixels in the image. In addition, it generates an array of 150 perturbations where each perturbation is a vector with zeros and ones that represent whether the super-pixel is on or off:

In [None]:
superpixels = skimage.segmentation.quickshift(Xi, kernel_size=4,max_dist=200, ratio=0.2)
num_superpixels = np.unique(superpixels).shape[0]
skimage.io.imshow(skimage.segmentation.mark_boundaries(Xi/2+0.5, superpixels))

num_perturb = 150 # generating perturbations
perturbations = np.random.binomial(1, 0.5, size=(num_perturb, num_superpixels))

def perturb_image(img,perturbation,segments): # creating a function to apply perturbations to images
    active_pixels = np.where(perturbation == 1)[0]
    mask = np.zeros(segments.shape)
    for active in active_pixels:
        mask[segments == active] = 1 
    perturbed_image = copy.deepcopy(img)
    perturbed_image = perturbed_image*mask[:,:,np.newaxis]
    return perturbed_image

print(perturbations[0]) # showing an example of perturbations
skimage.io.imshow(perturb_image(Xi/2+0.5,perturbations[0],superpixels))

The next step is to predict classes for the perturbations. The following script uses the "inceptionV3_model" to predict the class of each of the perturbed images. The shape of the predictions is (150, 1, 1000): for each of the 150 images, we get one row with the probabilities of belonging to each of the 1000 classes in "InceptionV3" (the middle axis of size 1 appears because each image is predicted as a batch of one). From these 1000 classes we will use only the "Labrador" class in further steps, since it is the prediction we want to explain. In this example, 150 perturbations were used. For real applications, however, a larger number of perturbations will produce more reliable explanations:

In [None]:
predictions = []
for pert in perturbations:
    perturbed_img = perturb_image(Xi,pert,superpixels)
    pred = inceptionV3_model.predict(perturbed_img[np.newaxis,:,:,:])
    predictions.append(pred)

predictions = np.array(predictions)
print(predictions.shape)
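
Because each call to `predict` on a single image returns an array of shape (1, 1000), stacking the results yields a 3-D array. The sketch below reproduces this with a stand-in function (`fake_predict` is hypothetical and returns zeros instead of running a network) and shows two ways to obtain a 2-D array instead:

```python
import numpy as np

def fake_predict(batch):
    """Stand-in for a Keras model: returns (batch_size, 1000) scores."""
    return np.zeros((batch.shape[0], 1000))

images = [np.zeros((299, 299, 3)) for _ in range(5)]

# One call per image, each returning shape (1, 1000)
per_image = np.array([fake_predict(img[np.newaxis]) for img in images])
print(per_image.shape)

# Dropping the singleton axis, or predicting on one stacked batch
squeezed = per_image.squeeze(axis=1)
batched = fake_predict(np.stack(images))
print(squeezed.shape, batched.shape)
```

Predicting on one stacked batch is usually also faster than looping, at the cost of holding all perturbed images in memory at once.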

Now we have everything to fit a linear model using the perturbations as input features and the predictions for the Labrador class as output. However, before we fit a local surrogate model (say, a linear model), LIME needs to give more weight (importance) to images that are closer to the image being explained.

To compute weights (importance) for the perturbations, we use a distance metric to evaluate how far each perturbation is from the original image. The original image is just a perturbation with all the super-pixels active (all elements equal to one). Given that the perturbations are multidimensional vectors, the cosine distance is a metric that can be used for this purpose. After the cosine distance has been computed, a kernel function is used to translate this distance into a value between zero and one (a weight). At the end of this process we have a weight (importance) for each perturbation in the dataset. Below is the code:

In [None]:
original_image = np.ones(num_superpixels)[np.newaxis,:] # perturbation with all superpixels enabled 
distances = sklearn.metrics.pairwise_distances(perturbations,original_image, metric='cosine').ravel()
print(distances.shape)

kernel_width = 0.25 # transforming distances to a value between 0 and 1 using a kernel function
weights = np.sqrt(np.exp(-(distances**2)/kernel_width**2)) #Kernel function
print(weights.shape)
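
For binary perturbation vectors the cosine distance to the all-ones vector has a simple closed form: with $k$ of $n$ super-pixels switched on, the cosine similarity is $k / (\sqrt{k}\sqrt{n}) = \sqrt{k/n}$, so the distance is $1 - \sqrt{k/n}$. The sketch below checks this numerically for a hypothetical $n = 40$ and prints the resulting kernel weight for a width of 0.25:

```python
import numpy as np

n = 40                                   # hypothetical number of super-pixels
ones = np.ones(n)

def cosine_distance(a, b):
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

for k in [40, 30, 20, 10]:
    z = np.zeros(n)
    z[:k] = 1                            # perturbation with k super-pixels switched on
    d = cosine_distance(z, ones)
    w = np.sqrt(np.exp(-(d ** 2) / 0.25 ** 2))
    print(k, round(d, 4), round(1 - np.sqrt(k / n), 4), round(w, 4))
```

The distance depends only on how many super-pixels are switched off, not on which ones, so all perturbations with the same number of active super-pixels receive the same weight.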

Lastly, we fit a weighted linear model (the surrogate model) using the information obtained in the previous steps. We get a coefficient for each super-pixel in the image, which represents how strong the effect of that super-pixel is on the prediction of 'Labrador':

In [None]:
class_to_explain = top_pred_classes[0] #Labrador class
simpler_model = LinearRegression()
simpler_model.fit(X=perturbations, y=predictions[:,:,class_to_explain], sample_weight=weights)
coeff = simpler_model.coef_[0]

num_top_features = 4 # using coefficients from the linear model to extract top features
top_features = np.argsort(coeff)[-num_top_features:] 

mask = np.zeros(num_superpixels)  # showing only the super-pixels corresponding to the top features
mask[top_features]= True # activating top super-pixels
skimage.io.imshow(perturb_image(Xi/2+0.5,mask,superpixels))

This is what LIME returns as an explanation: the areas of the image (super-pixels) that have the strongest association with the prediction of 'Labrador Retriever', shown above. This explanation suggests that the pre-trained 'InceptionV3' model is doing a good job predicting the 'Labrador' class for the given image. This example shows how LIME can help to increase confidence in a machine-learning model by showing why it returns certain predictions.

Now let's actually apply the LIME package. Let's look at an image of a giant panda and then apply LIME to explain the prediction:

In [None]:
url = 'https://raw.githubusercontent.com/marcellusruben/All_things_medium/main/Lime/panda_00024.jpg'

def transform_img_fn_ori(url):
    img = skimage.io.imread(url)
    img = skimage.transform.resize(img, (299,299))
    img = (img - 0.5)*2
    img = np.expand_dims(img, axis=0)
    preds = inet_model.predict(img)
    for i in decode_predictions(preds)[0]:
        print(i)
    return img

inet_model = tf.keras.applications.inception_v3.InceptionV3()
images_inc_im = transform_img_fn_ori(url)

As you can see, the pre-trained "InceptionV3" model also predicts that our image is a giant panda. Now let’s interpret the behavior of our pre-trained model:

In [None]:
explainer = lime_image.LimeImageExplainer()
explanation= explainer.explain_instance(images_inc_im[0].astype('double'), inet_model.predict,  top_labels=3, hide_color=0, num_samples=1000)

temp_1, mask_1 = explanation.get_image_and_mask(explanation.top_labels[0], positive_only=True, num_features=5, hide_rest=True)
temp_2, mask_2 = explanation.get_image_and_mask(explanation.top_labels[0], positive_only=False, num_features=10, hide_rest=False)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10,10))
ax1.imshow(mark_boundaries(temp_1, mask_1))
ax2.imshow(mark_boundaries(temp_2, mask_2))
ax1.axis('off')
ax2.axis('off')

### References: 

#### Interpretable AI:
   - https://christophm.github.io/interpretable-ml-book/
   
#### Overview on LIME:
   - Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. "Why should I trust you?: Explaining the predictions of any classifier." Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. ACM (2016).
   - Alvarez-Melis, David, and Tommi S. Jaakkola. "On the robustness of interpretability methods." arXiv preprint arXiv:1806.08049 (2018).
   - P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553 (2009).
   - https://towardsdatascience.com/understanding-model-predictions-with-lime-a582fdff3a3b
   - https://towardsdatascience.com/lime-how-to-interpret-machine-learning-models-with-python-94b0e7e4432e
   - https://towardsdatascience.com/whats-wrong-with-lime-86b335f34612
   - https://github.com/marcotcr/lime

#### LIME for Tabular Data:
   - https://www.analyticsvidhya.com/blog/2017/06/building-trust-in-machine-learning-models/?utm_source=blog&utm_medium=shapley-value-machine-learning-interpretability-game-theory
   - https://github.com/marcotcr/lime/tree/ce2db6f20f47c3330beb107bb17fd25840ca4606
   - https://marcotcr.github.io/lime/tutorials/Using%2Blime%2Bfor%2Bregression.html
   - https://www.kaggle.com/rajyellow46/wine-quality
   
#### LIME for Text Data:
   - https://donernesto.github.io/blog/explaining-text-classification-predictions-with-lime/
   - https://github.com/stanfordnlp/GloVe
   
#### LIME for Image Data:
   - https://towardsdatascience.com/interpretable-machine-learning-for-image-classification-with-lime-ea947e82ca13
