# Gene analysis using SHAP
- This notebook explains what `SHAP` is and how to use **scaLR**'s `SHAP` implementation to obtain the weight of each gene/feature for each class of the model.
- `scaLR` supports early stopping in `SHAP` analysis.

# What is SHAP?

- `SHAP` (SHapley Additive exPlanations) is a game theoretic approach to explain the output of any machine learning model. It connects optimal credit allocation with local explanations using the `classic Shapley values` from game theory and their related extensions.

- Know more: https://shap.readthedocs.io/en/latest/
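
To make the idea of Shapley values concrete, here is a minimal, self-contained sketch that computes exact Shapley values for a toy "model" with three features. It is not how the `shap` library or `scaLR` computes scores (they use approximations suited to neural networks); the function names `shapley_values` and `toy_value` and the gene names are hypothetical and exist only for this illustration.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average each player's marginal contribution
    over every possible ordering of the players."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: totals[p] / len(orderings) for p in players}


def toy_value(coalition):
    """Hypothetical model output for a set of 'present' features:
    gene_a and gene_b each add 1.0, plus a bonus of 2.0 when both are present.
    gene_c never changes the output."""
    score = 0.0
    if "gene_a" in coalition:
        score += 1.0
    if "gene_b" in coalition:
        score += 1.0
    if {"gene_a", "gene_b"} <= coalition:
        score += 2.0
    return score


phi = shapley_values(["gene_a", "gene_b", "gene_c"], toy_value)
print(phi)  # {'gene_a': 2.0, 'gene_b': 2.0, 'gene_c': 0.0}
```

The interaction bonus is split evenly between `gene_a` and `gene_b`, the irrelevant `gene_c` receives zero, and the values add up to the output for the full feature set. This additivity is the property that `SHAP` scores build on.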

# What is early stopping in SHAP?

- `scaLR` processes `SHAP` in batches. Processing `SHAP` in batches gives results similar to processing all the data at once.
- For each batch, `scaLR` lists the `top N genes` and compares them with the top genes of the previous batch. If at least `threshold` genes are the same, the patience counter is incremented. Once the patience counter reaches the `patience` value set in the config, processing stops.
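
The early-stopping rule described above can be sketched in a few lines of plain Python. This is an illustration only, not `scaLR`'s actual code: the function name `find_stop_batch` is hypothetical, and the sketch assumes that the patience counter resets to zero whenever the overlap falls below the threshold.

```python
def find_stop_batch(top_genes_per_batch, threshold, patience):
    """Return the 0-based index of the batch at which processing would stop,
    or None if the stopping condition is never met.

    Assumption (not confirmed by scaLR): the counter resets when two
    consecutive batches share fewer than `threshold` top genes.
    """
    previous = None
    count = 0
    for index, genes in enumerate(top_genes_per_batch):
        current = set(genes)
        if previous is not None and len(current & previous) >= threshold:
            count += 1  # Top genes are stable compared with the previous batch.
        else:
            count = 0
        if count >= patience:
            return index
        previous = current
    return None


# Toy run: top-3 genes per batch, at least 3 must match, patience of 2.
batches = [
    ["A", "B", "C"],
    ["A", "B", "D"],  # Only 2 genes match the previous batch.
    ["A", "B", "D"],  # 3 match: counter = 1.
    ["A", "D", "B"],  # 3 match (order is ignored): counter = 2, so stop.
]
stop_at = find_stop_batch(batches, threshold=3, patience=2)
print(stop_at)  # 3
```

With a larger `patience`, the same toy data never triggers a stop, so every batch would be processed.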

# How to use SHAP from scaLR

## <span style="color: steelblue;">Cloning scaLR</span>

In [None]:
!git clone https://github.com/infocusp/scaLR.git

## <span style="color: steelblue;">Library Installation and Imports</span>

In [None]:
import sys
imported_packages = {pkg.split('.')[0] for pkg in sys.modules.keys()}
ignore_libraries = "|".join(imported_packages)

!pip install $(grep -ivE "$ignore_libraries" scaLR/requirements.txt)
!pip install memory-profiler==0.61.0

In [None]:
from os import path
sys.path.append('./scaLR/')

from anndata import AnnData
import pandas as pd

from scalr.feature.scoring import ShapScorer
from scalr.nn.model import build_model
from scalr.utils import read_data
from scalr.analysis import Heatmap
from scalr.feature.selector import build_selector
%reload_ext autoreload
%autoreload 2



- To perform `SHAP` analysis, we need the `best-trained model` along with the `training data`. This trained model is then used to infer `SHAP` scores on the `test data`.

- If the **`scaLR` pipeline has already been run** with the dataset [(Liu et al., 2021)](https://doi.org/10.1016/j.cell.2021.02.018) mentioned in the [tutorial](https://colab.research.google.com/github/infocusp/scaLR/blob/main/tutorials/pipeline/scalr_pipeline.ipynb), you may skip the **`Getting best trained model and datasets`** section. The `best model` and the `train/test` data can be found inside `scalr_experiments/exp_name_0`, specifically for cell type classification tasks.

- Otherwise, we will be using the **`scaLR`** pipeline to accomplish this. For more detailed information on data exploration and pipeline training, please refer to the [scaLR pipeline](https://colab.research.google.com/github/infocusp/scaLR/blob/main/tutorials/pipeline/scalr_pipeline.ipynb).


## <span style="color: steelblue;">Getting best trained model and datasets</span>
*`Can be skipped if the scaLR pipeline has already been run.`*

The dataset we are about to download contains two clinical conditions (COVID-19 and normal) and links variations in immune response to disease severity and outcomes over time [(Liu et al., 2021)](https://doi.org/10.1016/j.cell.2021.02.018).

In [None]:
# This cell takes approximately 00:00:53 (hh:mm:ss) to run.
!wget -P data https://datasets.cellxgene.cziscience.com/21ef2ea2-cbed-4b6c-a572-0ddd1d9020bc.h5ad

In [None]:
adata = read_data('./data/21ef2ea2-cbed-4b6c-a572-0ddd1d9020bc.h5ad')

In [None]:
# Rename the 'var' index using the 'feature_name' column, which contains gene symbols, and save the file.
# This cell takes approximately 00:00:47 (hh:mm:ss) to run.
adata.var.set_index('feature_name', inplace=True)
adata.obs.index = adata.obs.index.astype(str)
adata.var.index = adata.var.index.astype(str)
AnnData(X=adata.X, obs=adata.obs, var=adata.var).write('data/modified_adata.h5ad', compression='gzip')

In [None]:
# Command to run the end-to-end pipeline.
# This cell takes approximately 00:21:15 (hh:mm:ss) to run on a GPU.
!python scaLR/pipeline.py --config scaLR/tutorials/pipeline/config_celltype.yaml -l -m

## Configuration for SHAP analysis

In [None]:
config = {
    "dataloader": {
        "name": "SimpleDataLoader",
        "params": {
            "batch_size": 10,  # Number of samples processed at a time.
            "padding": 5000
        }
    },
    "top_n_genes": 100,  # Number of top genes compared between batches for early stopping.
    "background_tensor": 20,  # Number of training samples used as background data. See the official SHAP documentation for details.
    "early_stop": {
        "patience": 5,  # Stop once the top genes stay similar (>= threshold) for this many consecutive batches.
        "threshold": 95  # Minimum number of top genes that must match between consecutive batches.
    },
    "device": 'cuda',  # Device to run on: 'cpu' or 'cuda' (GPU).
    "samples_abs_mean": True,  # Take the absolute value of each sample's scores first, then average across samples.
    "logger": "FlowLogger"  # Prints the logs to the output.
}
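
The effect of `samples_abs_mean` can be illustrated with a small, self-contained sketch. It only demonstrates the "absolute value first, then mean" aggregation described in the config comment; the data and names (`per_sample_scores`, `abs_mean`, `plain_mean`) are made up, and the plain mean is shown purely for contrast rather than as a description of what `scaLR` does when the option is `False`.

```python
# Hypothetical per-sample SHAP scores for two genes (one class, four samples).
per_sample_scores = {
    "gene_a": [0.5, -0.5, 0.5, -0.5],    # Strong effect, but the sign flips between samples.
    "gene_b": [0.25, 0.25, 0.25, 0.25],  # Weaker effect with a consistent sign.
}


def mean(values):
    return sum(values) / len(values)


# Absolute value per sample first, then the mean across samples.
abs_mean = {gene: mean([abs(s) for s in scores]) for gene, scores in per_sample_scores.items()}

# Plain mean across samples, for contrast only.
plain_mean = {gene: mean(scores) for gene, scores in per_sample_scores.items()}

print(abs_mean)    # {'gene_a': 0.5, 'gene_b': 0.25}
print(plain_mean)  # {'gene_a': 0.0, 'gene_b': 0.25}
```

Taking the absolute value first keeps `gene_a` ranked as influential even though its positive and negative contributions would cancel out in a plain average.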

## Read train-test data & best model
The `train` and `test` data and the `best model` can be found in `./scalr_experiments/exp_name_0` if the pipeline was run for `cell type classification`. Otherwise, use the experiment name and path specified in the `config_celltype.yaml` or `config_clinical.yaml` file located at `./scaLR/tutorials/pipeline/`.

In [None]:
train_data = read_data("./scalr_experiments/exp_name_0/feature_extraction/feature_subset_data/train")
test_data = read_data("./scalr_experiments/exp_name_0/feature_extraction/feature_subset_data/test")

In [None]:
# Path to the best model checkpoint generated by the scaLR pipeline.
model_checkpoint = "./scalr_experiments/exp_name_0/model/best_model"

model_config = read_data(path.join(model_checkpoint, 'model_config.yaml'))
model_weights = path.join(model_checkpoint, 'model.pt')
mappings = read_data(path.join(model_checkpoint, 'mappings.json'))

model, _ = build_model(model_config)
model.to(config['device'])
model.load_weights(model_weights)

## Run SHAP


In [None]:
shap_scorer = ShapScorer(**config)

In [None]:
target = "cell_type" # Column name in anndata.obs representing all classes.
shap_values = shap_scorer.get_top_n_genes_weights(model, train_data, test_data, target, mappings)

In [None]:
shap_values

In [None]:
columns = train_data.var_names  # Feature (gene) names.
class_labels = mappings[target]['id2label']  # Class labels from the mappings.
all_scores = shap_values[:, :len(columns)]  # Keep all rows, but only the feature columns.

score_matrix = pd.DataFrame(all_scores, columns=columns, index=class_labels)

In [None]:
score_matrix

# Select top N features

In [None]:
selector_config = {
    "name": "ClasswisePromoters",  # Class-wise top genes.
    # "name": "AbsMean",  # Top genes across all classes.
    "params":{
        "k": 5000
    }
}
selector, _ = build_selector(selector_config)

In [None]:
# Get a dictionary of the top N (5000 in this experiment) features for each class.
top_features = selector.get_feature_list(score_matrix)
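
To show the difference between a class-wise selection (a dictionary) and a selection across all classes (a list), here is a toy sketch in plain Python. It is not `scaLR`'s selector code: the exact ranking rules of `ClasswisePromoters` and `AbsMean` are assumptions here (highest score per class, and highest mean absolute score across classes, respectively), and the score values, cell types, and helper names are invented.

```python
# Hypothetical score matrix: one row per class, one score per gene.
scores = {
    "T cell": {"CD3E": 0.9, "MS4A1": -0.2, "LYZ": 0.1},
    "B cell": {"CD3E": -0.3, "MS4A1": 0.8, "LYZ": 0.05},
}


def classwise_top_k(scores, k):
    """Assumed class-wise rule: the k genes with the highest score in each class."""
    return {
        label: sorted(gene_scores, key=gene_scores.get, reverse=True)[:k]
        for label, gene_scores in scores.items()
    }


def abs_mean_top_k(scores, k):
    """Assumed across-class rule: the k genes with the highest mean absolute score."""
    genes = next(iter(scores.values())).keys()
    mean_abs = {
        gene: sum(abs(row[gene]) for row in scores.values()) / len(scores)
        for gene in genes
    }
    return sorted(mean_abs, key=mean_abs.get, reverse=True)[:k]


per_class = classwise_top_k(scores, k=2)
overall = abs_mean_top_k(scores, k=2)
print(per_class)  # {'T cell': ['CD3E', 'LYZ'], 'B cell': ['MS4A1', 'LYZ']}
print(overall)    # ['CD3E', 'MS4A1']
```

The class-wise result keeps one gene list per class, while the across-class result is a single flat list of genes.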

# Generate heatmaps
Heatmap of feature weights with respect to each class.

- If `top_features` is a list, a single heatmap is plotted with the top genes from all classes.
- If `top_features` is a dict (containing class-wise top features), one heatmap is plotted per class, showing the top features of that class with respect to the other classes.

In [None]:
# With save_plot=True, the plots are saved without being displayed.
heatmap = Heatmap(top_n_genes=20, save_plot=False)

In [None]:
# Generating heatmaps for all classes with the top 20 genes.
heatmap.generate_analysis(
    score_matrix=score_matrix,
    top_features=top_features,
    dirpath=".",
)