# Feature Attribution Bokeh Generation

This notebook takes the attribution and image data and produces an interactive UMAP plot based on the raw attribution data. The plot itself is generated via Bokeh and saved to an HTML file.

NOTE: This notebook only works with attributions that were done AGAINST A SINGLE CLASS. It will NOT WORK for multi-class attributions that are designed for generating confusion matrices.

## Inputs

The notebook is controlled by a set of variables written in capital letters, with underscores in place of spaces (e.g. `ORIGINAL_IMAGES_PATH`). Change the values of these variables before executing the notebook.

### Required Arguments
The notebook requires the following information to be provided.

#### File Paths

* `ORIGINAL_IMAGES_PATH` (string) - Path to the original image dataset, which should be saved as a `*.npy` file.
* `TRUE_LABELS_PATH` (string) - Path to the true labels for the dataset. These should be stored in `*.npy` format and have the shape `[1,x]`, where x is the number of data instances.
* `PREDICTED_LABELS_PATH` (string) - Path to the predicted labels for the dataset, generated by the "Classifier Labels Generation" notebook. These should be stored in `*.npy` format and have the shape `[1,x]`, where x is the number of data instances.
* `ATTRIBUTION_DATA_PATH` (string) - Path to the attribution data generated by the "Feature Attribution Data Generation" notebook. This should be a `.pt` file (a saved PyTorch tensor).
* `PLOT_PATH` (string) - Path to where you want the plot to be saved.
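
As a quick sanity check before running the notebook, the file extensions described above can be verified up front. The sketch below is only an illustration: `check_path_suffixes` is a hypothetical helper that is not part of the notebook, the example paths are placeholders, and the `.html` requirement for `PLOT_PATH` is an assumption based on the notebook writing an HTML file there.

```python
from pathlib import Path

# Expected extensions, taken from the argument descriptions above.
# The ".html" entry is an assumption (the notebook saves an HTML plot).
EXPECTED_SUFFIXES = {
    "TRUE_LABELS_PATH": ".npy",
    "PREDICTED_LABELS_PATH": ".npy",
    "ATTRIBUTION_DATA_PATH": ".pt",
    "PLOT_PATH": ".html",
}


def check_path_suffixes(paths):
    """Return a list of human-readable problems (empty if none are found)."""
    problems = []
    for name, expected in EXPECTED_SUFFIXES.items():
        if name not in paths:
            problems.append(f"{name} is not set")
        elif Path(paths[name]).suffix.lower() != expected:
            problems.append(f"{name} should end in {expected}")
    return problems


# Placeholder paths for illustration only.
example_paths = {
    "TRUE_LABELS_PATH": "data/y_true.npy",
    "PREDICTED_LABELS_PATH": "data/predicted_labels.npy",
    "ATTRIBUTION_DATA_PATH": "data/attributions.pt",
    "PLOT_PATH": "data/plot.txt",
}
print(check_path_suffixes(example_paths))  # ['PLOT_PATH should end in .html']
```

This only inspects the path strings; it does not confirm that the files exist or that the arrays inside them have the expected shapes.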

### Optional Arguments

The notebook provides defaults for these arguments, but they can be changed.


#### Image Processing Options

The notebook has default options for these, but they can be tweaked for custom results. Note that the following variables are passed DIRECTLY into a call to the `visualize_image_attr` function that Captum provides, so their values must match what the Captum documentation describes: https://captum.ai/api/utilities.html#visualization. The most relevant parts have been summarized/taken straight from the documentation and provided below.

* `METHOD` (string) (default value: "blended_heat_map") - The method used to visualize the attributions. The options are:
  * "heat_map" - display a heatmap of attributions
  * "blended_heat_map" - put the heatmap over a greyscale version of the image
  * "original_image" - Just show the original image
  * "masked_image" - mask image (pixel-wise multiply) by normalized attribution values
  * "alpha_scaling" - set the alpha channel of each pixel to normalized attribution value
* `SIGN` (string) (default value: "all") - Determines which attribution values to show. The options for this method are:
  * "positive" - only display positive attributions
  * "absolute_value" - display the absolute value of all attributions
  * "negative" - only display negative attributions
  * "all" - display both positive and negative attributions. Note that if you set `METHOD` to "masked_image" or "alpha_scaling" the "all" option is NOT supported.
* `ALPHA_OVERLAY` (float between 0 and 1) (default value: 0.8) - The opacity of the heatmap when `METHOD` is "blended_heat_map". Higher alpha values make the heatmap more opaque, so the zebrafish image in the background appears fainter.
* `SHOW_COLORBAR` (boolean) (default value: True) - Determines if a colorbar is added that shows a mapping between the color on the image (red/green) and its associated attribution value.
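
The constraints listed above can be checked before any images are rendered. The following is a minimal sketch, not part of the notebook or of Captum: `check_visualization_options` is a hypothetical helper, and the allowed values are simply copied from the lists above.

```python
# Allowed values as listed above (summarized from the Captum documentation).
VALID_METHODS = {"heat_map", "blended_heat_map", "original_image",
                 "masked_image", "alpha_scaling"}
VALID_SIGNS = {"positive", "absolute_value", "negative", "all"}
# Methods for which SIGN="all" is not supported, per the note above.
NO_ALL_SIGN = {"masked_image", "alpha_scaling"}


def check_visualization_options(method, sign, alpha_overlay):
    """Raise ValueError if the option combination is not usable."""
    if method not in VALID_METHODS:
        raise ValueError(f"Unknown METHOD: {method!r}")
    if sign not in VALID_SIGNS:
        raise ValueError(f"Unknown SIGN: {sign!r}")
    if sign == "all" and method in NO_ALL_SIGN:
        raise ValueError(f"SIGN='all' is not supported with METHOD={method!r}")
    if not 0.0 <= alpha_overlay <= 1.0:
        raise ValueError("ALPHA_OVERLAY must be between 0 and 1")


# The notebook's default values pass the check without raising.
check_visualization_options("blended_heat_map", "all", 0.8)
```

Failing early like this avoids discovering an unsupported combination only after the embedding has already been computed.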

#### UMAP options

These are passed to UMAP upon instantiation of a UMAP object and control UMAP's behavior. The following is summarized/taken from https://umap-learn.readthedocs.io/en/latest/api.html:

* `N_NEIGHBOURS` (int) (default value: 20) - the number of neighbouring sample points used for manifold approximation, i.e. the size of the local neighbourhood
* `MIN_DIST` (float) (default value: 0.5) - the effective minimum distance between embedded points
* `VERBOSE` (Boolean) (default value: True) - allows you to enable/disable verbose output from UMAP when it's constructing the data
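
Note that the control variable is spelled `N_NEIGHBOURS`, while the corresponding UMAP keyword argument is `n_neighbors`. A minimal sketch of the mapping is shown below; `build_umap_kwargs` is a hypothetical helper introduced only for illustration, and the values are copied from the Control Variables cell of this notebook.

```python
# Values copied from the notebook's Control Variables cell.
N_NEIGHBOURS = 20
MIN_DIST = 0.5
VERBOSE = True


def build_umap_kwargs(n_neighbours, min_dist, verbose):
    """Map the notebook's control variables onto UMAP's keyword names."""
    return {
        "n_neighbors": int(n_neighbours),  # UMAP uses the American spelling
        "min_dist": float(min_dist),
        "verbose": bool(verbose),
    }


umap_kwargs = build_umap_kwargs(N_NEIGHBOURS, MIN_DIST, VERBOSE)
print(umap_kwargs)

# With umap-learn installed, the reducer could then be created as:
#   reducer = umap.UMAP(**umap_kwargs)
```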

#### Plot Options

These options target the Bokeh Plot.

* `PLOT_TITLE` (string) (default value: "Deconvolution against Predicted Labels") - The title the UMAP plot will have
* `PLOT_SUB_IMAGE_WIDTH_PX` (string) (default value: "256px") - the width of each fish image upon being displayed when the user hovers on a point.
* `PLOT_SUB_IMAGE_HEIGHT_PX` (string) (default value: "90px") - the height of each fish image upon being displayed when the user hovers on a point.
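
The width and height are strings because they are inserted verbatim into the CSS of the hover tooltip, where a length needs its unit. The sketch below uses a simplified, single-image version of the tooltip template purely for illustration; the real template in the notebook contains four images plus the label fields.

```python
PLOT_SUB_IMAGE_WIDTH_PX = "256px"
PLOT_SUB_IMAGE_HEIGHT_PX = "90px"

# Simplified stand-in for the notebook's tooltip template.
template = "<img src='@image1' style='width:{WIDTH}; height:{HEIGHT};'/>"

html = template.format(WIDTH=PLOT_SUB_IMAGE_WIDTH_PX,
                       HEIGHT=PLOT_SUB_IMAGE_HEIGHT_PX)
print(html)
# <img src='@image1' style='width:256px; height:90px;'/>
```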

## Outputs

* An HTML file at the path specified by `PLOT_PATH` that can be opened in any browser, containing an interactive UMAP plot of the feature attribution data.

In [None]:
# Import dependencies
import torch 

from io import BytesIO
import base64
from PIL import Image

from bokeh import plotting, palettes
from bokeh.models import HoverTool, ColumnDataSource, CategoricalColorMapper
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

# Install Captum
!pip install -q captum
# We just need the visualization library, the actual 
# attribution itself is done in a separate notebook
# See: "Feature Attribution Data Generation"
from captum.attr import visualization as viz

# Install and import UMAP
!pip install -q umap-learn
import umap




In [None]:
# Control Variables

### File Paths
#### Path to the original dataset, should be saved as .npy file
ORIGINAL_IMAGES_PATH = "/content/drive/Shareddrives/Exploding Gradients/X_cropped_b.npy"
#### Path to the true labels of the dataset, should be saved as .npy file
TRUE_LABELS_PATH = "/content/drive/Shareddrives/Exploding Gradients/y_b.npy"
#### Path to the classifier generated labels, should be saved as .npy file (use "Labels Generation" notebook if you don't have this file!)
PREDICTED_LABELS_PATH = "/content/drive/MyDrive/Fish Attribution/model1e-050.5.2022-05-22 12:13:10.pt Work/predicted_labels.npy"
#### Path to the attribution data
ATTRIBUTION_DATA_PATH = "/content/drive/MyDrive/Fish Attribution/model1e-050.5.2022-05-22 12:13:10.pt Work/Deconvolution Single Class.pt"
#### Path to where the plot should be stored
PLOT_PATH = "/content/drive/MyDrive/Fish Attribution/model1e-050.5.2022-05-22 12:13:10.pt Work/deconvolution.html"

## Image Processing Options
### Method the attributions should be visualized as
METHOD = "blended_heat_map"
### The signs of the attribution to visualize
SIGN = "all"
### Opacity of the attribution heatmap in "blended_heat_map" mode:
### 1 means the heatmap is fully opaque (the background image is hidden) while
### lower values let the background image show through more prominently
ALPHA_OVERLAY = 0.8
### Decides whether or not to include the colorbar for each image, 
### indicating the color associated with each attribution value on the
### image.
SHOW_COLORBAR = True

## UMAP options
### Number of neighbours used for evaluation
N_NEIGHBOURS = 20
### Minimum distance between points, should range from 0 to 1 as a float
MIN_DIST = 0.5
### Decides if you want to display the progress as UMAP is crunching numbers
VERBOSE = True

## Plot Options
### Title of the Plot
PLOT_TITLE = "Deconvolution against Predicted Labels"
### Image Width in pixels, expressed as a string (this is injected into the
### final HTML that bokeh uses)
PLOT_SUB_IMAGE_WIDTH_PX = "256px"
### Image Height in pixels, expressed as a string (this is injected into the
### final HTML that bokeh uses)
PLOT_SUB_IMAGE_HEIGHT_PX = "90px"

In [None]:
## Load all necessary data

import torch

## Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

## Load Images along with True and Predicted labels
images = np.load(ORIGINAL_IMAGES_PATH)
y_true = np.load(TRUE_LABELS_PATH)
y_predicted = np.load(PREDICTED_LABELS_PATH)

## Convert the labels from numpy arrays to tensors
y_true_tensor = torch.Tensor(y_true).long()
y_predicted_tensor = torch.Tensor(y_predicted).long()

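## Convert the images to a tensor and move the last axis to position 2
## (the two swaps together are equivalent to permute(0, 1, 4, 2, 3))
images_tensor = torch.Tensor(images)
images_tensor = torch.swapaxes(images_tensor,2,4)
images_tensor = torch.swapaxes(images_tensor,3,4)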

# Get the attribution data
attributions_tensor = torch.load(ATTRIBUTION_DATA_PATH, map_location=torch.device('cpu'))

Mounted at /content/drive
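
The effect of the two `swapaxes` calls in the cell above can be traced without loading any data. The sketch below tracks axis names rather than a real tensor, so it runs without PyTorch. The starting layout (channels last) is an assumption made for illustration; the notebook does not state how the `.npy` file is laid out.

```python
def swap_axes(axes, a, b):
    """Return a copy of `axes` with positions a and b exchanged."""
    axes = list(axes)
    axes[a], axes[b] = axes[b], axes[a]
    return tuple(axes)


# Assumed on-disk layout (channels last); an assumption for illustration.
layout = ("instance", "sub_image", "height", "width", "channel")

step1 = swap_axes(layout, 2, 4)  # mirrors torch.swapaxes(t, 2, 4)
step2 = swap_axes(step1, 3, 4)   # mirrors torch.swapaxes(t, 3, 4)
print(step2)
# ('instance', 'sub_image', 'channel', 'height', 'width')

# The same reordering expressed as a single permutation of the axes.
permuted = tuple(layout[i] for i in (0, 1, 4, 2, 3))
print(permuted == step2)  # True
```

Under that assumption the result is the channels-first layout that PyTorch models typically expect, and a single `permute(0, 1, 4, 2, 3)` call would achieve the same reordering.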


In [None]:
# Once the data has been extracted from the numpy arrays and converted to
# tensor form, just delete the original arrays (saves RAM)
del images, y_true, y_predicted 

In [None]:
## This is a heavily modified helper function, originally taken from:
# https://tonio73.github.io/data-science/cnn/CnnVsDense-Part2-Visualization.html
# (which also provided the plotting code below, slightly modified) and
# https://zapaishchykova.medium.com/the-meaningful-visualization-of-clusters-2a666be0f460

# This function converts each cluster of images and attributions to
# a format that Bokeh can use to display the images
def embeddableImage(image_cluster, attribution_cluster):
    # "image_cluster" and "attribution_cluster" each hold one piece of data:
    # 4 sub-images in channels-first layout, i.e. shape [4, 3, height, width]
    encoded_subimages = []
    # For each image and attribution associated with the image, pass the data
    # to Captum visualization to get the image
    for sub_image, sub_attribution in zip(image_cluster, attribution_cluster):
        # Captum expects channels-last arrays, hence the transpose to (H, W, C)
        fig, _ = viz.visualize_image_attr(sub_attribution.transpose(1, 2, 0),
                                          sub_image.transpose(1, 2, 0),
                                          method=METHOD,
                                          sign=SIGN,
                                          show_colorbar=SHOW_COLORBAR,
                                          alpha_overlay=ALPHA_OVERLAY,
                                          use_pyplot=False)
        # encode the data into a format that Bokeh understands by saving the
        # image to a buffer instead of a file
        buffer = BytesIO()
        fig.savefig(buffer, format='png', bbox_inches='tight')
        encoded_subimages.append('data:image/png;base64,' + base64.b64encode(buffer.getvalue()).decode())
    return encoded_subimages
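
The encoding step at the end of the function turns raw PNG bytes into a data URI that can be placed directly in an `<img src=...>` attribute. A minimal sketch of that round trip, using only the standard library, is given below. The bytes written to the buffer are a stand-in for the output of `fig.savefig` (they are just the PNG file signature, not a complete image).

```python
import base64
from io import BytesIO

# Stand-in for fig.savefig(buffer, format='png'): write some bytes to an
# in-memory buffer. These 8 bytes are the PNG file signature only.
buffer = BytesIO()
buffer.write(b"\x89PNG\r\n\x1a\n")

# Same construction as in embeddableImage: prefix + base64 text.
data_uri = "data:image/png;base64," + base64.b64encode(buffer.getvalue()).decode()
print(data_uri)

# Decoding the part after the comma recovers the original bytes.
payload = data_uri.split(",", 1)[1]
assert base64.b64decode(payload) == buffer.getvalue()
```

Because every hover image is embedded this way, the resulting HTML file is self-contained but grows with the number and size of the images.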

In [None]:
# umapPlot is solely responsible for generating the plot itself
# It accepts the following arguments:
## embedding - refers to the data produced by applying UMAP, should be an array
## x - the raw images used
## y_true - the true labels
## y_predicted - predicted labels 
## attributions - the attribution data
## title - a string that indicates what the plot title is
## x and attributions MUST have the same dimensions
def umapPlot(embedding, x, y_true, y_predicted, attributions, title=''):
    """ Plot the embedding of X and y with popovers using Bokeh """
    
    # Create a two-column dataframe from the embedding data
    df = pd.DataFrame(embedding, columns=('x', 'y'))
    # for each image should be able to apply the embeddable image function
    # list of lists [rows x columns], x instances with 4 columns
    sub_images = np.array(list(map(embeddableImage, x, attributions)))

    # Start storing the subimages into the dataframe with their own index
    for i in range(4):
      df['image'+str(i+1)] = sub_images[:,i]
    # Convert the predicted and true labels to strings
    df['true_class'] = [str(label) for label in y_true]
    df['predicted_class'] = [str(label) for label in y_predicted]
    # Create the indices for the data so a user can index back and obtain them
    df['index'] = list(range(len(y_true)))

    datasource = ColumnDataSource(df)

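    # Factors must be strings ("0" to "9"); np.str was deprecated in NumPy 1.20,
    # so build them with the built-in str instead
    colorMapping = CategoricalColorMapper(factors=[str(i) for i in range(10)], palette=palettes.Spectral10)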

    # Note: Bokeh 3.0 and later use width/height instead of
    # plot_width/plot_height
    plotFigure = plotting.figure(
        title=title,
        plot_width=600,
        plot_height=600,
        tools=('pan, wheel_zoom, reset')
    )

    # Whenever the user hovers on a point with their mouse, show 
    # the 4 images of the fish in the data instance simultaneously, along with
    # The True and Predicted class, as well as the data index number
    tooltip = """
        <div>
            <div>
                <img src='@image1' style='float: left; width:{WIDTH}; height:{HEIGHT}; margin: 5px 5px 5px 5px'/>
            </div>
            <div>
                <img src='@image2' style='float: left; width:{WIDTH}; height:{HEIGHT}; margin: 5px 5px 5px 5px'/>
            </div>
            <div>
                <img src='@image3' style='float: left; width:{WIDTH}; height:{HEIGHT}; margin: 5px 5px 5px 5px'/>
            </div>
            <div>
                <img src='@image4' style='float: left; width:{WIDTH}; height:{HEIGHT}; margin: 5px 5px 5px 5px'/>
            </div>
            <div>
                <span style='font-size: 16px; color: #224499'>True Class:</span>
                <span style='font-size: 18px'>@true_class</span>
            </div>
            <div>
                <span style='font-size: 16px; color: #224499'>Predicted Class:</span>
                <span style='font-size: 18px'>@predicted_class</span>
            </div>
            <div>
                <span style='font-size: 16px; color: #224499'>Index:</span>
                <span style='font-size: 18px'>@index</span>
            </div>
        </div>
        """.format(WIDTH=PLOT_SUB_IMAGE_WIDTH_PX,
                   HEIGHT=PLOT_SUB_IMAGE_HEIGHT_PX)
    # The .format() call above injects the pixel width and height values
    # specified by the user
    plotFigure.add_tools(HoverTool(tooltips=tooltip))

    # The actual coordinates for the circles/points on the plot are taken
    # from the embedding that UMAP generates (see how the embedding variable
    # is converted into a pandas dataframe earlier)
    plotFigure.circle(
        'x', 'y',
        source=datasource,
        # The color of each point should be by whatever the PREDICTED label
        # is
        color=dict(field='predicted_class', transform=colorMapping),
        line_alpha=0.6, fill_alpha=0.6, size=8
    )
    
    # The original code would force the plot to be shown at this point,
    # this should be AVOIDED. Any attempt at rendering the plot 
    # in Colab will CAUSE IT TO CRASH
    return plotFigure
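
The loop in `umapPlot` that fills the `image1` to `image4` columns takes column `i` of an array with one row per data instance and four encoded sub-images per row. The same idea can be sketched in plain Python; the strings below are placeholders standing in for the base64 data URIs.

```python
# Placeholder rows: one row per data instance, four encoded sub-images each.
rows = [
    ["a1", "a2", "a3", "a4"],
    ["b1", "b2", "b3", "b4"],
]

# zip(*rows) transposes rows into columns, mirroring sub_images[:, i].
columns = {
    "image" + str(i + 1): list(column)
    for i, column in enumerate(zip(*rows))
}
print(columns["image1"])  # ['a1', 'b1']
print(columns["image4"])  # ['a4', 'b4']
```

Each column name matches an `@imageN` placeholder in the tooltip template, which is how Bokeh looks up the right image for the hovered point.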


In [None]:
# the reducer is an instance of a UMAP object that will do the heavy lifting,
# giving us the x,y coordinates of interest from the original data
reducerFish = umap.UMAP(n_neighbors=N_NEIGHBOURS,
                        min_dist=MIN_DIST,
                        verbose=VERBOSE)
# UMAP only accepts 2-D input, so flatten each of the 285 pieces of data into a
# single vector (4x3x130x750 = 1170000 values). Using -1 lets reshape infer that length.
embeddingFish = reducerFish.fit_transform(
    attributions_tensor.detach().reshape(attributions_tensor.shape[0], -1).numpy())

UMAP(min_dist=0.5, n_neighbors=20, verbose=True)
Sat May 28 23:10:20 2022 Construct fuzzy simplicial set




Sat May 28 23:10:37 2022 Finding Nearest Neighbors
Sat May 28 23:10:40 2022 Finished Nearest Neighbor Search
Sat May 28 23:10:42 2022 Construct embedding



Sat May 28 23:10:46 2022 Finished embedding
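
The flattened length used in the cell above follows directly from the per-instance shape. A small check of that arithmetic, using the shape and instance count quoted in the cell's comment:

```python
import math

# Per-instance shape and instance count, as quoted in the cell's comment.
per_instance_shape = (4, 3, 130, 750)
n_instances = 285

flat_length = math.prod(per_instance_shape)
print(flat_length)  # 1170000

# Shape of the 2-D matrix handed to UMAP: one row per instance.
flat_shape = (n_instances, flat_length)
print(flat_shape)  # (285, 1170000)
```

Passing `-1` as the second dimension to `reshape` asks PyTorch to infer this length, which avoids hard-coding it when the dataset changes.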


In [None]:
# Feed all relevant data to create the UMAP plot
fig = umapPlot(embeddingFish, 
               images_tensor.numpy(), 
               y_true_tensor.squeeze().numpy(),
               y_predicted_tensor.squeeze().numpy(),
               attributions_tensor.detach().numpy(),
               title=PLOT_TITLE)

285


Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations


In [None]:
from bokeh.plotting import output_file, save

In [None]:
output_file(PLOT_PATH)
save(fig)

'/content/drive/MyDrive/Fish Attribution/model1e-050.5.2022-05-22 12:13:10.pt Work/deconvolution.html'