# <span style="color: steelblue;">Validation of scaLR Models Using the Gene Recall Curve</span>

## <span style="color: steelblue;">Key points</span>

1. This notebook is a tutorial on generating the gene recall curve with the scaLR library.
   - It covers two ways to generate the curve: from a ranked genes CSV and from a score matrix.
2. The tutorial also explains why the gene recall curve is a crucial measure for evaluating model efficacy.


## <span style="color: steelblue;">What is gene recall?</span>
- The gene recall curve measures how well a model ranks the genes that are actually important.

- As the name suggests, we check the recall of literature-supported (known important) genes within the model's ranked gene list.
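
One way to write this down, assuming recall is counted cumulatively over ranks (the notation is illustrative and not taken from the scaLR source): let $L$ be the set of literature genes for a category and $g_1, g_2, \dots$ the model's ranked genes for that category. Then

$$
\mathrm{recall}(k) = \frac{\left|\{g_1, \dots, g_k\} \cap L\right|}{|L|}, \qquad k = 1, \dots, K
$$

Under this reading, the gene recall curve plots $\mathrm{recall}(k)$ against the rank $k$, so a curve that rises early means literature genes are concentrated in the top ranks.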


## <span style="color: steelblue;">What are the current metrics to understand model performance results?</span>
- There are many, some of which are listed below.

1. Classification tasks
    - accuracy
    - precision
    - recall
    - f-score
    - etc.
2. Regression tasks
    - MAE
    - MSE
    - etc.

- Once the above metrics are optimized, we can proceed with downstream analysis of the top genes.

- Key genes can be identified using SHAP (SHapley Additive exPlanations) analysis for neural networks.

- Differential gene expression analysis can be performed to further validate the important genes.

- The gene recall curve is a crucial metric for evaluating a model's performance, as it assesses literature gene recall and indicates whether the model is effectively capturing important genes in the top ranks.


## <span style="color: steelblue;">Why is gene recall an important metric and how to interpret it?</span>

- Let's say we have two models, each producing a list of 500 ranked genes related to a specific disease or trait. We want to compare these models based on the most important genes they identify for that disease or trait, which will then be used for further analysis.

- By comparing the top K genes (e.g., top 20 or 30) identified by each model, we can determine which model is more effective in associating genes with the disease or trait.

- To evaluate which model ranks genes more accurately, a gene recall curve can be particularly useful.

- We can examine the presence of literature-supported genes within the top ranks of each model's list, assessing which model includes more of these known genes.

- Example:
  - Let's say we have 100 genes from the literature relevant to our study.
  - Suppose model 1 identifies 20 of these literature genes within its top 100 ranked genes, with the remaining 80 appearing between ranks 101 and 500.

  - Meanwhile, model 2 identifies 40 literature genes within its top 100 ranked genes, with the other 60 appearing between ranks 101 and 500. At first glance, it might seem that model 2 is better than model 1. However, it is crucial to consider where these genes fall within the top 100.

  - If model 1 captures all 20 genes within the top 50 ranks, while model 2 places all 40 genes closer to rank 100 (i.e., between ranks 51 and 100), this suggests that model 1 is actually more effective than model 2. This is because, ultimately, only the top 20-30 genes are of primary interest.
  
  - This insight is derived from the gene recall curve.
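
The example above can be reproduced numerically with a small sketch. It does not use scaLR. The helper `recall_curve` and the exact rank positions are hypothetical, chosen only to match the description (model 1: 20 hits inside the top 50; model 2: 40 hits between ranks 51 and 100).

```python
import numpy as np


def recall_curve(hit_ranks, n_reference, max_rank):
    """Cumulative fraction of reference genes found at each rank 1..max_rank."""
    hits = np.zeros(max_rank, dtype=int)
    hits[np.asarray(hit_ranks) - 1] = 1  # ranks are 1-based
    return np.cumsum(hits) / n_reference


n_reference = 100  # literature genes relevant to the study
max_rank = 500     # length of each model's ranked list

# Hypothetical placements of the literature genes in each ranked list.
model_1_ranks = list(range(1, 41, 2)) + list(range(101, 181))  # 20 in top 50, 80 after rank 100
model_2_ranks = list(range(61, 101)) + list(range(101, 161))   # 40 in ranks 61-100, 60 after rank 100

curve_1 = recall_curve(model_1_ranks, n_reference, max_rank)
curve_2 = recall_curve(model_2_ranks, n_reference, max_rank)

# Compare recall at a few cut-offs K.
for k in (30, 50, 100, 500):
    print(f"K={k:>3}  model_1={curve_1[k - 1]:.2f}  model_2={curve_2[k - 1]:.2f}")
```

With these placements, model 1 has the higher recall at K = 30 and K = 50, and model 2 only overtakes it by K = 100, which is the distinction a single top-100 count would hide.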



## <span style="color: steelblue;">What are the required parameters for the gene recall curve?</span>

- First and foremost, we need literature genes for the particular disease, trait, or cell type whose recall we want to assess.
  - For example, if the user wants gene recall for cell-type-specific markers such as B cells, T cells, or dendritic cells (DCs), the user needs to compile a list of literature genes/markers for each category.

- Secondly, we need a ranked gene list from the model or a score matrix that indicates the score of each gene for every cell type.

**Note:** Please refer to the [scaLR pipeline tutorial](https://colab.research.google.com/github/infocusp/scaLR/blob/main/tutorials/pipeline/scalr_pipeline.ipynb) for more information on the `score matrix`. If the scaLR pipeline has already been run, the matrix can be found at `exp_dir/analysis/gene_analysis/score_matrix.csv`. The pipeline uses this matrix by default to generate the gene recall curve, provided that a class-specific `reference genes CSV` is supplied.
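
To illustrate how a score matrix relates to a ranked gene list, here is a minimal pandas sketch. It assumes categories are in rows and genes in columns, and it simply sorts genes by descending score within each category. The function `rank_genes_per_category` and the toy values are hypothetical; scaLR's own feature selectors may aggregate scores differently.

```python
import pandas as pd

# Toy score matrix with made-up values: categories in rows, genes in columns.
score_matrix = pd.DataFrame(
    {"GENE_A": [0.9, -0.2], "GENE_B": [0.1, 0.8], "GENE_C": [-0.5, 0.3]},
    index=["B", "DC"],
)


def rank_genes_per_category(scores: pd.DataFrame) -> pd.DataFrame:
    """Return one column per category, with genes sorted by descending score."""
    ranked = {
        category: row.sort_values(ascending=False).index.to_list()
        for category, row in scores.iterrows()
    }
    return pd.DataFrame(ranked)


ranked_genes = rank_genes_per_category(score_matrix)
print(ranked_genes)
```

The result has the same shape as a ranked genes table: one column per category, with the highest-scoring gene in the first row.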

## <span style="color: steelblue;">How to generate gene recall using the scaLR library</span>


### <span style="color: steelblue;">Cloning scaLR</span>

In [None]:
!git clone https://github.com/infocusp/scaLR.git

## <span style="color: steelblue;">Library Installation and Imports</span>

In [None]:
import sys
sys.path.append('scaLR')

In [None]:
!pip install anndata scanpy pydeseq2 shap

In [None]:
import pandas as pd
from scalr.analysis import gene_recall_curve

%reload_ext autoreload
%autoreload 2

## <span style="color: steelblue;">Getting required files</span>

## <span style="color: steelblue;">Example of reference genes list </span>

 1. The reference genes dataframe should look like the one below, with categories in columns and genes in rows.
    The category (column) names must match the column names of the ranked genes dataframe exactly.

 2. If cell types have different numbers of reference genes, pad the shorter columns with NaNs, as seen below,
    so that every column has the same number of rows.
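
If you are assembling your own reference genes table, one possible way to get the NaN padding is to build the dataframe from per-category `pd.Series`, which pandas aligns on the index and pads automatically. The category names and gene names below are placeholders, and `ranked_columns` is a hypothetical stand-in for the columns of your ranked genes dataframe.

```python
import pandas as pd

# Placeholder marker lists of unequal length, one list per category.
markers = {
    "B": ["GENE_1", "GENE_2", "GENE_3"],
    "DC": ["GENE_4", "GENE_5"],
}

# Building from Series pads the shorter columns with NaN.
my_reference_genes_df = pd.DataFrame({name: pd.Series(genes) for name, genes in markers.items()})
print(my_reference_genes_df)

# Check that every reference category also exists in the ranked genes columns.
ranked_columns = ["B", "DC"]  # hypothetical; use ranked_genes_df.columns in practice
missing = sorted(set(my_reference_genes_df.columns) - set(ranked_columns))
print("Categories missing from ranked genes:", missing)

# To save it in a form readable with pd.read_csv(path, index_col=0):
# my_reference_genes_df.to_csv("my_reference_genes.csv")
```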

In [None]:
reference_genes_path = './scaLR/tutorials/analysis/gene_recall_curve/reference_genes.csv'

reference_genes_df = pd.read_csv(reference_genes_path, index_col=0)
reference_genes_df

## <span style="color: steelblue;">Gene Recall Generation</span>

### <span style="color: steelblue;">1. Using ranked genes csv</span>
- The ranked genes dataframe should look like the one below. Pass its CSV path via `ranked_genes_path_dict` in `GeneRecallCurve` to generate gene recall.

In [None]:
ranked_genes_df1_path = './scaLR/tutorials/analysis/gene_recall_curve/ranked_genes.csv'

ranked_genes_df = pd.read_csv(ranked_genes_df1_path, index_col=0)
ranked_genes_df.head()

In [None]:
# Create object for gene recall curve.
grc = gene_recall_curve.GeneRecallCurve(
    reference_genes_path=reference_genes_path,  # Reference genes CSV path. Required.
    ranked_genes_path_dict={                    # Ranked genes CSV path per model. Required in this case.
        'model_0': ranked_genes_df1_path,
        # 'model_1': ranked_genes_df2_path,
    },
    top_K=50,          # Top K ranks in which gene recall is checked. Optional - default: 100
    plots_per_row=3,   # Number of plots per row. Optional
    save_plots=False,  # Whether to save plots. Optional - default: True
)

# save_plots is False here. To store plots, set `save_plots=True` and
# pass `dirpath` to `generate_analysis()` below.

# Generate gene recall curve
grc.generate_analysis()

#### <span style="color: steelblue;">Compare gene recall of multiple models in one plot</span>

- We can pass ranked genes CSV paths for multiple models to compare their gene recall in the same plot.
- To do so, add one entry per model to the `ranked_genes_path_dict` dictionary passed to `GeneRecallCurve()`, like the commented-out `'model_1'` entry above.

![Alt text](https://github.com/infocusp/scaLR/blob/main/tutorials/analysis/gene_recall_curve/multi_model_gene_recall_comparison.png?raw=1)

### <span style="color: steelblue;">2. Gene recall using score_matrix</span>
- To generate gene recall from the score_matrix, do not pass `ranked_genes_path_dict` to
  `GeneRecallCurve()`. Instead, create the `GeneRecallCurve()` object and then call
  `generate_analysis(score_matrix=score_matrix)`.

- If both `ranked_genes_path_dict` and a score_matrix are provided, `ranked_genes_path_dict`
  takes priority and is used to generate gene recall.

- What is required to generate gene recall using this method?
    1. The reference genes dataframe.
    2. The `score_matrix`.
    3. A `features_selector` method, which can be found inside `scalr/feature/selector` (example below).

In [None]:
score_matrix = pd.read_csv('./scaLR/tutorials/analysis/gene_recall_curve/score_matrix.csv', index_col=0)
score_matrix

# The score_matrix should look like the output below: one score per gene per category, with categories in rows.
# Category names should match the categories in the reference genes dataframe, e.g. D, DC, etc.

In [None]:
# Create object for gene recall curve
grc = gene_recall_curve.GeneRecallCurve(
    reference_genes_path,  # Reference genes CSV path. Required.
    top_K=100,             # Top K ranks in which gene recall is checked. Optional - default: 100
    plots_per_row=3,       # Number of plots per row. Optional
    save_plots=False,      # Whether to save plots. Optional - default: True
    features_selector={
        'name': 'ClasswisePromoters',  # Aggregation strategy.
        'params': {},                  # Strategy params such as `k` go here.
    },
)

# save_plots is False here. To store plots, set `save_plots=True` and
# pass `dirpath` to `generate_analysis()` below.

# Generate gene recall curve
grc.generate_analysis(score_matrix=score_matrix)

## <span style="color: steelblue;">Interpretation of gene recall curve using comparison example</span>


![Alt text](https://github.com/infocusp/scaLR/blob/main/tutorials/analysis/gene_recall_curve/multi_model_gene_recall_comparison.png?raw=1)

- As stated in section `1. Using ranked genes csv`, we can plot the gene recall curves of multiple models in a single plot by passing each model's ranked genes CSV path.
- Let's look at the `Mono_Cell` behavior in the gene recall curve above.
    - We can see a spike in the curve for `model_1` within the top 100 ranks compared to `model_0`. This indicates that `model_1` finds more of the literature (important) genes for Mono cells than `model_0`.
    Hence, we can consider `model_1` for further analysis of genes for Mono cells.
    
    
    
    Disclaimer: This is an example figure for explanation purposes.