<a href="https://colab.research.google.com/github/perrin-isir/xomx-tutorials/blob/main/tutorials/xomx_kidney_classif_2.ipynb"> <img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open in Google Colaboratory"></a>
<a id="raw-url" href="https://raw.githubusercontent.com/perrin-isir/xomx-tutorials/main/tutorials/xomx_kidney_classif_2.ipynb" download> <img align="left" src="https://img.shields.io/badge/Github-Download%20(Right%20click%20%2B%20Save%20link%20as...)-blue" alt="Download (Right click + Save link as)" title="Download Notebook"></a>

# *xomx tutorial:* **constructing diagnostic biomarker signatures**: phase 2

This is the second and main phase of the tutorial on kidney cancer classification. We recall that the objective of this tutorial is to use a recursive feature elimination method on 
RNA-seq data from the Cancer Genome Atlas (TCGA) to identify gene biomarker signatures for the differential diagnosis of three types of kidney cancer. 

The recursive feature elimination method is based on 
the [Extra-Trees algorithm](https://link.springer.com/article/10.1007/s10994-006-6226-1)
(and its implementation in 
[scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html)).

**Remark (1/2):** the first phase of the tutorial [(xomx_kidney_classif_1.ipynb)](https://github.com/perrin-isir/xomx-tutorials/blob/main/tutorials/xomx_kidney_classif_1.ipynb) imports the RNA-seq data from the Cancer Genome Atlas (TCGA) online database, and applies basic preprocessing. It results in the file `xomx_kidney_classif_small.h5ad`, which is an AnnData object containing RNA-seq data for 465 samples labelled "TCGA-KIRC" (kidney renal clear cell carcinoma), "TCGA-KIRP" (kidney renal papillary cell carcinoma), or "TCGA-KICH" (chromophobe renal cell carcinoma). The samples have been randomly assigned to a training set (75%) and a test set (25%). For each sample, the features are the expression levels of the top 8000 highly variable genes selected in phase 1. 

**Remark (2/2):** the first phase of the tutorial is optional. It takes some time to import the data from the Cancer Genome Atlas (TCGA) online database, so for convenience we stored `xomx_kidney_classif_small.h5ad` in the *xomx-tutorials* repository [(https://github.com/perrin-isir/xomx-tutorials/blob/main/tutorials/xomx_kidney_classif_small.h5ad)](https://github.com/perrin-isir/xomx-tutorials/blob/main/tutorials/xomx_kidney_classif_small.h5ad). The file is downloaded directly in this phase 2 of the tutorial, so phase 1 can be skipped.

Here is the last plot obtained at the end of this tutorial, a UMAP embedding of the RNA-seq data reduced to a small set of biomarker genes, with colors based on the expression levels of the gene NDUFA4L2 (ENSG00000185633):

In [33]:
%%html
<iframe width="99%" height=650 src="xomx_kidney_classif_figure.html">
# /!\ This cell cannot be executed in Colab

In [None]:
# imports:
import os
from IPython.display import clear_output, HTML
try:
    import xomx
except ImportError:
    !pip install git+https://github.com/perrin-isir/xomx.git
    clear_output()
    import xomx
try:
    import scanpy as sc
except ImportError:
    !pip install scanpy
    clear_output()
    import scanpy as sc
import numpy as np

We define `save_dir`, the folder in which everything will be saved.

In [None]:
save_dir = os.path.expanduser(os.path.join("~", "results", "xomx-tutorials", "kidney_classif"))  # the default directory in which results are stored
os.makedirs(save_dir, exist_ok=True)

In [None]:
# Setting the pseudo-random number generator
rng = np.random.RandomState(0)

## Step 1: loading the data

In [None]:
if not os.path.exists(os.path.join(save_dir, "xomx_kidney_classif_small.h5ad")):
    !wget -O {os.path.join(save_dir, "xomx_kidney_classif_small.h5ad")} "https://github.com/perrin-isir/xomx-tutorials/blob/main/tutorials/xomx_kidney_classif_small.h5ad?raw=true"

In [None]:
xd = sc.read(os.path.join(save_dir, "xomx_kidney_classif_small.h5ad"))

In [None]:
xd

`xd` contains the data matrix and the data annotations.
There are 465 samples and 8000 features, which were selected with the function `sc.pp.highly_variable_genes()`; see [xomx_kidney_classif_1.ipynb](https://github.com/perrin-isir/xomx-tutorials/blob/main/tutorials/xomx_kidney_classif_1.ipynb).

`xd.X[0, :]`, the first row, contains the 8000 (normalized and logarithmized) expression levels for the 
first sample.  

In [None]:
xd.X[0, :]

`xd.X[:, 0]`, the first column, contains the values of the first feature for all samples.

The feature names (gene IDs) are stored in `xd.var_names`, and the sample
identifiers are stored in `xd.obs_names`. 

In [None]:
xd.var_names
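
To make the row/column convention concrete, here is a minimal sketch in which a plain NumPy array stands in for `xd.X`. The toy values and the names `toy_X`, `toy_obs_names` and `toy_var_names` are made up for illustration and are not part of the dataset.

```python
import numpy as np

# Toy stand-in for xd.X: 3 samples (rows) x 4 features (columns).
toy_X = np.array(
    [
        [0.0, 1.5, 2.0, 0.3],
        [0.1, 1.2, 0.0, 0.9],
        [2.2, 0.0, 1.1, 0.4],
    ]
)
toy_obs_names = ["sample_a", "sample_b", "sample_c"]  # hypothetical sample IDs
toy_var_names = ["gene_1", "gene_2", "gene_3", "gene_4"]  # hypothetical gene IDs

first_sample = toy_X[0, :]   # all feature values of the first sample
first_feature = toy_X[:, 0]  # values of the first feature across all samples

# Pair each value with the name it belongs to.
print(dict(zip(toy_var_names, first_sample.tolist())))
print(dict(zip(toy_obs_names, first_feature.tolist())))
```

A row therefore has as many entries as there are names in `var_names`, and a column has as many entries as there are names in `obs_names`.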

The labels are stored in `xd.obs["labels"]`.

In [None]:
xd.obs["labels"]

Using the function `train_and_test_indices()` (see [xomx_kidney_classif_1.ipynb](https://github.com/perrin-isir/xomx-tutorials/blob/main/tutorials/xomx_kidney_classif_1.ipynb)), the data has been divided into a training set and a test set.
- `xd.uns["train_indices"]` is the array of indices of all samples that belong to the training set.
- `xd.uns["test_indices"]` is the array of indices of all samples that belong to the test set.
- `xd.uns["train_indices_per_label"]` is the dictionary of sample indices in the training set, per label. For instance, `xd.uns["train_indices_per_label"]["TCGA-KIRP"]` is the array of indices of all the samples labelled as `"TCGA-KIRP"` that belong to the training set.
- `xd.uns["test_indices_per_label"]` is the dictionary of sample indices in the test set, per label.

In [None]:
xd.uns["test_indices_per_label"]["TCGA-KICH"]
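
To illustrate how the four index structures relate to each other, here is a small pure-Python sketch. The helper `split_indices_per_label` is hypothetical and is not the code of `train_and_test_indices()`; in particular, splitting separately within each label is an assumption made for the example.

```python
import random


def split_indices_per_label(labels, test_fraction=0.25, seed=0):
    """Hypothetical helper: random train/test split done separately per label."""
    rng = random.Random(seed)
    indices_by_label = {}
    for idx, label in enumerate(labels):
        indices_by_label.setdefault(label, []).append(idx)

    train_per_label, test_per_label = {}, {}
    for label, indices in indices_by_label.items():
        shuffled = indices[:]
        rng.shuffle(shuffled)
        n_test = round(len(shuffled) * test_fraction)
        test_per_label[label] = sorted(shuffled[:n_test])
        train_per_label[label] = sorted(shuffled[n_test:])
    return train_per_label, test_per_label


# Toy label vector: 8 KIRC, 8 KIRP and 4 KICH samples.
toy_labels = ["TCGA-KIRC"] * 8 + ["TCGA-KIRP"] * 8 + ["TCGA-KICH"] * 4
train_per_label, test_per_label = split_indices_per_label(toy_labels)

# The flat index lists are the concatenation of the per-label ones.
train_indices = sorted(i for idxs in train_per_label.values() for i in idxs)
test_indices = sorted(i for idxs in test_per_label.values() for i in idxs)
print(len(train_indices), len(test_indices))
```

Every sample index appears exactly once, either in the training set or in the test set, and the per-label dictionaries simply partition those two lists by label.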

## Step 2: training binary classifiers and performing recursive feature elimination

We initialize an empty dictionary of "feature selectors":

In [None]:
feature_selectors = {}

There will be one feature selector per label.
What we call a feature selector here is a binary classifier
trained with the Extra-Trees algorithm to
distinguish samples with a given label from
all other samples. After training, features are
ranked by a measure of importance known as the Gini importance, 
and the 100 most important features are kept. 
Then, the Extra-Trees algorithm is run again on the training
data filtered to the 100 selected features, which leads to a 
new measure of importance of the features. We repeat the 
procedure to progressively select 30, then 20, 15 and finally 10
features. At each iteration, we evaluate on the test set the 
Matthews correlation coefficient (MCC score) of the 
classifier to observe how the performance changes 
when the number of features decreases.  
The progression 100-30-20-15-10 is arbitrary, but 
the most efficient strategies start by aggressively 
reducing the number of features, and then slow down
when the number of features becomes small.
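
The procedure described above can be sketched directly with scikit-learn on synthetic data. This is only an illustration of the idea, not the implementation of `xomx.fs.RFEExtraTrees`: the dataset, its size, the number of trees and the variable names are all assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the RNA-seq matrix (assumption: 300 samples, 200 features).
X, y = make_classification(
    n_samples=300, n_features=200, n_informative=10, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

current = np.arange(X.shape[1])  # indices of the features still in play
clf = ExtraTreesClassifier(n_estimators=100, random_state=0)
clf.fit(X_train[:, current], y_train)

for size in [100, 30, 20, 15, 10]:
    # Rank the current features by Gini importance and keep the top `size`.
    order = np.argsort(clf.feature_importances_)[::-1][:size]
    current = current[order]
    # Retrain on the reduced feature set, which yields new importances.
    clf = ExtraTreesClassifier(n_estimators=100, random_state=0)
    clf.fit(X_train[:, current], y_train)
    mcc = matthews_corrcoef(y_test, clf.predict(X_test[:, current]))
    print(f"{size} features, MCC on test set: {mcc:.3f}")
```

At the end of the loop, `current` holds the indices (in the original feature numbering) of the 10 features that survived all elimination rounds.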

Here is the loop that trains all the classifiers and ends up 
selecting 10 features for every label. It also creates 
`gene_dict`, a dictionary of the 10-gene signatures selected
for each label.

In [None]:
gene_dict = {}
for label in xd.uns["all_labels"]:
    print("Label: " + label)
    feature_selectors[label] = xomx.fs.RFEExtraTrees(
        xd,
        label,
        n_estimators=450,
        random_state=rng,
    )
    feature_selectors[label].init()
    for siz in [100, 30, 20, 15, 10]:
        print("Selecting", siz, "features...")
        feature_selectors[label].select_features(siz)
        print(
            "MCC score:",
            xomx.tl.matthews_coef(feature_selectors[label].confusion_matrix),
        )
    gene_dict[label] = [
        xd.var_names[idx_]
        for idx_ in feature_selectors[label].current_feature_indices
    ]
    print("Done.")
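
For reference, the Matthews correlation coefficient of a binary classifier can be computed from the four counts of its confusion matrix. The sketch below uses the standard formula; the function name `matthews_from_confusion` is hypothetical, and how `xomx.tl.matthews_coef` and the `confusion_matrix` attribute are laid out internally is not assumed here.

```python
import math


def matthews_from_confusion(tp, fp, fn, tn):
    """MCC of a binary classifier from the four confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # common convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator


perfect = matthews_from_confusion(tp=30, fp=0, fn=0, tn=70)    # every prediction right
inverted = matthews_from_confusion(tp=0, fp=70, fn=30, tn=0)   # every prediction wrong
typical = matthews_from_confusion(tp=25, fp=5, fn=5, tn=65)    # a few errors of each kind
print(perfect, inverted, round(typical, 3))
```

The score ranges from -1 (all predictions wrong) to +1 (all predictions right), with 0 corresponding to a classifier that does no better than chance. Unlike plain accuracy, it remains informative when one class is much rarer than the other, which is the case here for the "TCGA-KICH" label.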

## Step 3: visualizing results

Using the plotting function `scatter()`,
we plot the standard deviation against the mean value for every 
feature (both statistics were computed before logarithmizing the data).
`scatter()` takes two functions as input, one for 
the x-axis, and one for the y-axis. Each of these functions
must take the feature index as input.  

By changing the 
`obs_or_var` option to `"obs"` instead of `"var"`, we can use
`scatter()` to make a scatter plot over the samples
instead of over the features.

In [None]:
xomx.pl.scatter(
    xd,
    lambda idx: xd.var["mean_values"][idx],
    lambda idx: xd.var["standard_deviations"][idx],
    obs_or_var="var",
    xlog_scale=True,
    ylog_scale=True,
    xlabel="mean values",
    ylabel="standard deviations",
)

Note that the plots are interactive: hovering the cursor over a point displays information about it.  
By default, the plots are made with bokeh, but matplotlib can be used as well.
This is controlled with the function `xomx.pl.extension()`:

In [None]:
xomx.pl.extension("matplotlib")

In a notebook, interactive plots with matplotlib require using ipympl, and enabling it with the matplotlib Jupyter magic `%matplotlib widget` (**remark:** matplotlib interactive plots are typically slower in notebooks than in Python scripts).

In [None]:
try:
    import ipympl
except ImportError:
    !pip install ipympl
    clear_output()
    import ipympl
%matplotlib widget

In [None]:
xomx.pl.scatter(
    xd,
    lambda idx: xd.var["mean_values"][idx],
    lambda idx: xd.var["standard_deviations"][idx],
    obs_or_var="var",
    xlog_scale=True,
    ylog_scale=True,
    xlabel="mean values",
    ylabel="standard deviations",
)

However, interactive plots with matplotlib do not work in Colab, and `%matplotlib inline` must be used instead.

In [None]:
%matplotlib inline
xomx.pl.scatter(
    xd,
    lambda idx: xd.var["mean_values"][idx],
    lambda idx: xd.var["standard_deviations"][idx],
    obs_or_var="var",
    xlog_scale=True,
    ylog_scale=True,
    xlabel="mean values",
    ylabel="standard deviations",
)

This plot shows the 8000 highly variable genes selected in phase 1 of the tutorial ([xomx_kidney_classif_1.ipynb](https://github.com/perrin-isir/xomx-tutorials/blob/main/tutorials/xomx_kidney_classif_1.ipynb)), and we can observe the frontier defined by `sc.pp.highly_variable_genes()` to remove genes considered less variable.

For a given feature selector, for example `feature_selectors["TCGA-KIRP"]`,
`plot()` displays results on the test set. The classifier uses only the selected 
features, here the 10 features selected for the label `"TCGA-KIRP"`.
Points above the horizontal red line (score > 0.5) are classified as positives (prediction: `"TCGA-KIRP"`), and points below the horizontal line (score < 0.5)
are classified as negatives (prediction: `not "TCGA-KIRP"`).

In [None]:
xomx.pl.extension("bokeh")
feature_selectors["TCGA-KIRP"].plot()

We can construct a multiclass classifier based on the 3 binary classifiers:

In [None]:
sbm = xomx.cl.ScoreBasedMulticlass(xd, xd.uns["all_labels"], feature_selectors)

This multiclass classifier bases its predictions on 30 features (at most): the 
union of the three 10-gene signatures (one per label). It simply computes the 3 
scores of each of the binary classifiers, and returns the label that corresponds 
to the highest score.  
`plot()` displays results on the test set:

In [None]:
sbm.plot()

For each of the 3 labels, points that are 
higher in the horizontal band correspond to a 
higher confidence in the prediction (but
the very top of the band does not mean 100% 
confidence).
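
The decision rule of the multiclass classifier described above (compute one score per binary classifier, return the label with the highest score) can be sketched in a few lines of plain Python. The function `predict_label` and the score values are hypothetical; this is not the code of `xomx.cl.ScoreBasedMulticlass`.

```python
def predict_label(scores_per_label):
    """Return the label whose binary classifier gives the highest score."""
    return max(scores_per_label, key=scores_per_label.get)


# Hypothetical scores of the three binary classifiers for one sample.
toy_scores = {"TCGA-KIRC": 0.12, "TCGA-KIRP": 0.81, "TCGA-KICH": 0.33}
prediction = predict_label(toy_scores)

# A sample that every binary classifier would reject (all scores < 0.5).
ambiguous_scores = {"TCGA-KIRC": 0.40, "TCGA-KIRP": 0.20, "TCGA-KICH": 0.10}
ambiguous_prediction = predict_label(ambiguous_scores)

print(prediction, ambiguous_prediction)
```

With this rule, a sample always receives a label, even when all three binary scores are below 0.5. In that case the margin between the best score and the others is small, which is one way to read a low position in the band.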

We gather the selected genes in a single list:

In [None]:
all_selected_genes = np.asarray(list(gene_dict.values())).flatten()
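
Since the multiclass classifier relies on at most 30 features, two signatures may share a gene, in which case the flattened list contains that gene more than once. As a minimal sketch with made-up gene names (the dictionary `toy_gene_dict` is hypothetical), duplicates can be removed while preserving the order of first appearance:

```python
toy_gene_dict = {
    "TCGA-KIRC": ["gene_a", "gene_b", "gene_c"],
    "TCGA-KIRP": ["gene_c", "gene_d", "gene_e"],
    "TCGA-KICH": ["gene_f", "gene_a", "gene_g"],
}

# Concatenate the signatures in label order (duplicates are kept).
concatenated = [gene for genes in toy_gene_dict.values() for gene in genes]

# Drop repeated genes while keeping the order of first appearance.
unique_genes = list(dict.fromkeys(concatenated))

print(len(concatenated), len(unique_genes))
```

Whether deduplication is desirable depends on the use: for plotting it avoids showing the same gene twice, whereas keeping the concatenated list preserves the grouping of genes by label.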

We can visualize these marker genes with `xomx.pl.plot_var()`:

In [None]:
xomx.pl.plot_var(xd, all_selected_genes)

Interestingly, we can observe that some of the selected marker genes are downregulated (especially for `"TCGA-KIRP"`).  
Let us zoom in on the marker genes for KIRP:

In [None]:
xomx.pl.plot_var(xd, gene_dict["TCGA-KIRP"])

We observe at least 2 significantly downregulated genes for KIRP: 
PTGER3 (ENSG00000050628) and EBF2 (ENSG00000221818).

KICH markers:

In [None]:
xomx.pl.plot_var(xd, gene_dict["TCGA-KICH"])

We can also use `plot_var()` with a single gene:

In [None]:
xomx.pl.plot_var(xd, "ENSG00000168269.10")

Remark: there are small differences between the plots generated with bokeh and those generated with matplotlib. For example, here, with matplotlib, violin plots are automatically generated:

In [None]:
xomx.pl.extension("matplotlib")
xomx.pl.plot_var(xd, "ENSG00000168269.10")

The FOXI1 (ENSG00000168269) transcription factor is known to 
be drastically overexpressed in KICH. In fact, it has been argued that 
the FOXI1-driven transcriptome that defines renal intercalated cells is retained 
in KICH and implicates the intercalated cell type as the cell of origin 
for KICH; see: 
**[D. Lindgren et al., *Cell-Type-Specific Gene Programs of the Normal Human 
Nephron Define Kidney Cancer Subtypes*, Cell Reports 2017 Aug; 20(6): 1476-1489. 
doi: [10.1016/j.celrep.2017.07.043](
https://doi.org/10.1016/j.celrep.2017.07.043
)]**

KIRC markers:

In [None]:
xomx.pl.extension("bokeh")
xomx.pl.plot_var(xd, gene_dict["TCGA-KIRC"])

We can notice in particular the upregulation of NDUFA4L2 (ENSG00000185633),
a gene that has been analyzed as a biomarker for KIRC in
**[D. R. Minton et al., *Role of NADH Dehydrogenase (Ubiquinone) 1 alpha subcomplex 4-like 
2 in clear cell renal cell carcinoma*, 
Clin Cancer Res. 2016 Jun 1;22(11):2791-801. doi: [10.1158/1078-0432.CCR-15-1511](
https://doi.org/10.1158/1078-0432.CCR-15-1511
)]**.

Finally, we filter and restrict the data to the selected genes, and follow 
the Scanpy procedure to compute a 2D UMAP embedding:

In [None]:
xd_reduced = xd[:, all_selected_genes]
xd_reduced.var_names_make_unique()
sc.pp.neighbors(xd_reduced, n_neighbors=10, n_pcs=40, random_state=rng)
sc.tl.umap(xd_reduced, random_state=rng)

In AnnData objects, multi-dimensional annotations on observations are stored in `.obsm`.  

`sc.tl.umap()` stores the embedding in `xd_reduced.obsm["X_umap"]`.  

We use `xomx.pl.plot_2d_obsm()` to display an interactive plot:

In [None]:
xomx.pl.plot_2d_obsm(xd_reduced, "X_umap")

By default, the colors are defined by the labels stored in `xd.obs["labels"]`, unless `xd.obs["colors"]` exists, in which case it is used to define the colors.   
The colors can also depend on a function provided as input (but the function is ignored if `xd.obs["colors"]` exists). The function must take a sample index as input and return a numeric value.  
We give an example with a function that returns the scores computed by the "TCGA-KIRC" classifier:

In [None]:
kirc_scores = feature_selectors["TCGA-KIRC"].score(xd.X)
xomx.pl.plot_2d_obsm(xd_reduced, "X_umap", lambda i: kirc_scores[i])

A common need is to use colors that depend on the value of one particular feature. This can be done by simply passing the name of the feature as input:

In [None]:
xomx.pl.plot_2d_obsm(xd_reduced, "X_umap", "ENSG00000185633.10")
