<a href="https://colab.research.google.com/github/perrin-isir/xomx-tutorials/blob/main/tutorials/xomx_kidney_classif_2.ipynb"> <img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open in Google Colaboratory"></a>
<a id="raw-url" href="https://raw.githubusercontent.com/perrin-isir/xomx-tutorials/main/tutorials/xomx_kidney_classif_2.ipynb" download> <img align="left" src="https://img.shields.io/badge/Github-Download%20(Right%20click%20%2B%20Save%20link%20as...)-blue" alt="Download (Right click + Save link as)" title="Download Notebook"></a>

# *xomx tutorial:* **constructing diagnostic biomarker signatures**: phase 1 (optional)

This is the first phase of a tutorial on kidney cancer classification.  
The objective of the tutorial is to use a recursive feature elimination method on 
RNA-seq data from the Cancer Genome Atlas (TCGA) to identify gene biomarker signatures 
for the differential diagnosis of three types of kidney cancer: kidney renal clear cell
carcinoma (**KIRC**), kidney renal papillary cell carcinoma (**KIRP**), and chromophobe
renal cell carcinoma (**KICH**).

This first phase imports the RNA-seq data from the Cancer Genome Atlas (TCGA) online database and applies basic preprocessing.  
As importing the data takes some time, **we generally recommend skipping this phase and going directly to phase 2 of the tutorial:
[(xomx_kidney_classif_2.ipynb)](https://github.com/perrin-isir/xomx-tutorials/blob/main/tutorials/xomx_kidney_classif_2.ipynb)**.

However, some basic preprocessing tools shown in phase 1 are important, such as the functions `xomx.tl.all_labels()` (Step 4), `xomx.tl.indices_per_label()` (Step 4), 
`xomx.tl.var_mean_values()` (Step 5),
`xomx.tl.var_standard_deviations()` (Step 5),
and `xomx.tl.train_and_test_indices()` (Step 5).

In [None]:
# imports:
import os
import shutil
from IPython.display import clear_output
try:
    import xomx
except ImportError:
    !pip install git+https://github.com/perrin-isir/xomx.git
    clear_output()
    import xomx
try:
    import pandas as pd
except ImportError:
    !pip install pandas
    clear_output()
    import pandas as pd
try:
    import scanpy as sc
except ImportError:
    !pip install scanpy
    clear_output()
    import scanpy as sc
import numpy as np

We define `save_dir`, the folder in which everything will be saved.

In [None]:
save_dir = os.path.expanduser(os.path.join("~", "results", "xomx-tutorials", "kidney_classif"))  # the default directory in which results are stored
os.makedirs(save_dir, exist_ok=True)

In [None]:
# Setting the pseudo-random number generator
rng = np.random.RandomState(0)

## Step 1: Preparing the manifest

We use the 
[GDC Data Transfer Tool](
https://gdc.cancer.gov/access-data/gdc-data-transfer-tool
)
to import data from the Cancer Genome Atlas (TCGA). 
This involves creating a `manifest.txt` file that describes the files to be imported.

The `gdc_create_manifest()` function
facilitates the creation of this manifest. It is designed to import files of gene 
expression counts obtained with [HTSeq](https://github.com/simon-anders/htseq). 
You can have a look at its implementation in 
[xomx/data_importation/gdc.py](../data_importation/gdc.py) to adapt it to your own
needs if you want to import other types of data.

`gdc_create_manifest()` takes as input the disease type (in our case "Adenomas and 
Adenocarcinomas"), the list of project names ("TCGA-KIRC", "TCGA-KIRP", "TCGA-KICH"), 
and the number of samples desired for each of these projects (note that only 65 samples 
are available for "TCGA-KICH"). It returns a list of Pandas dataframes, one for 
each project.

More information on GDC data can be found on the [GDC Data Portal](
https://portal.gdc.cancer.gov/
).

In [None]:
disease_type = "Adenomas and Adenocarcinomas"
# The 3 categories of cancers studied in this tutorial correspond to the following
# TCGA projects, which are different types of adenocarcinomas
project_list = ["TCGA-KIRC", "TCGA-KIRP", "TCGA-KICH"]
# Fetch 200 cases of KIRC, 200 cases of KIRP, and 65 cases of KICH from the
# GDC database
case_numbers = [200, 200, 65]
df_list = xomx.di.gdc_create_manifest(
    disease_type,
    project_list,
    case_numbers,
)

The Pandas library (imported as `pd`) is used to write the concatenation of the
output dataframes to the file `manifest.txt`:

In [None]:
df = pd.concat(df_list)
df.to_csv(
    os.path.join(save_dir, "manifest.txt"),
    header=True,
    index=False,
    sep="\t",
    mode="w",
)
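
To illustrate the file format, the sketch below writes a small tab-separated table and reads it back with Pandas. The two-row table, the identifiers and the temporary directory are made up for illustration; only the column names `id` and `annotation` are taken from their use in Step 4, and the real manifest may contain additional columns.

```python
import os
import tempfile

import pandas as pd

# Two hypothetical per-project dataframes (placeholder values, not real GDC entries).
toy_df_list = [
    pd.DataFrame({"id": ["sample-a"], "annotation": ["TCGA-KIRC"]}),
    pd.DataFrame({"id": ["sample-b"], "annotation": ["TCGA-KICH"]}),
]
toy_manifest = pd.concat(toy_df_list)

with tempfile.TemporaryDirectory() as toy_dir:
    toy_path = os.path.join(toy_dir, "manifest.txt")
    # Same options as above: header row, no index column, tab separator.
    toy_manifest.to_csv(toy_path, header=True, index=False, sep="\t", mode="w")
    # Read the file back to check that the round trip preserves the table.
    reloaded = pd.read_table(toy_path, header=0)

print(reloaded)
```

With `index=False`, the dataframe index is not written to the file, so the row numbers that `pd.concat()` leaves duplicated across projects never end up in the manifest.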

## Step 2: Importing the data

In [None]:
if not os.path.exists(os.path.join(save_dir, "gdc-client")):
    !wget -O {os.path.join(save_dir, "gdc-client_v1.6.1_Ubuntu_x64.zip")} "https://gdc.cancer.gov/files/public/file/gdc-client_v1.6.1_Ubuntu_x64.zip"
    !unzip {os.path.join(save_dir, "gdc-client_v1.6.1_Ubuntu_x64.zip")} -d {save_dir}
    !rm {os.path.join(save_dir, "gdc-client_v1.6.1_Ubuntu_x64.zip")}

We now import the data for the 465 samples.  
**Warning**: it takes some time (approximately 25 minutes in Colab).

In [None]:
tmpdir = os.path.join(save_dir, "tmpdir_GDCsamples")
os.makedirs(tmpdir, exist_ok=True)
commandstring = (
    os.path.join(save_dir, "gdc-client") + " download -d "
    + tmpdir
    + " -m "
    + os.path.join(save_dir, "manifest.txt")
)
if not os.path.exists(os.path.join(save_dir, "xomx_kidney_classif.h5ad")):
    !{commandstring}
    clear_output()
else:
    print(f'{os.path.join(save_dir, "xomx_kidney_classif.h5ad")} already exists, no need to fetch data.')

## Step 3: Creating and saving the AnnData object

The `gdc_create_data_matrix()` function (implemented in
[gdc.py](https://github.com/perrin-isir/xomx/blob/master/xomx/data_importation/gdc.py)
) is used to create a Pandas dataframe with all the individual samples:

In [None]:
tmpdir = os.path.join(save_dir, "tmpdir_GDCsamples")
df = xomx.di.gdc_create_data_matrix(
    tmpdir,
    os.path.join(save_dir, "manifest.txt"),
)
df

In `df`, every column represents a sample (with a unique identifier), 
and the rows correspond to different genes, identified by their 
Ensembl gene ID with a version number after the dot (see
[https://www.ensembl.org/info/genome/stable_ids/index.html](https://www.ensembl.org/info/genome/stable_ids/index.html)).
The integer values are the raw gene expression level measurements for all genes 
and all samples.  
Since the last 5 rows contain special information that we will not use, we drop them
with the following command:

In [None]:
df = df.drop(index=df.index[-5:])
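
The same indexing pattern can be tried on a toy dataframe. The row and column names below are hypothetical placeholders, chosen only to show that `df.index[-5:]` selects the last five row labels, whatever they are.

```python
import pandas as pd

# Toy genes-by-samples table: two gene rows followed by five extra rows
# (all names and values are made up for illustration).
toy_df = pd.DataFrame(
    {
        "sample_1": [10, 0, 1, 2, 3, 4, 5],
        "sample_2": [3, 7, 1, 2, 3, 4, 5],
    },
    index=[
        "gene_A", "gene_B",
        "__extra_1", "__extra_2", "__extra_3", "__extra_4", "__extra_5",
    ],
)

# Drop the last five rows by label, as in the cell above.
toy_df = toy_df.drop(index=toy_df.index[-5:])
print(list(toy_df.index))
```

Because `drop()` works with labels, this pattern assumes that the row labels are unique; with duplicated labels, every row sharing one of the selected labels would be removed.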

In the convention used by Scanpy (and various other tools), samples are stored as rows of the
data matrix, therefore we transpose the dataframe when creating the AnnData object, which we name `xd`:

In [None]:
xd = sc.AnnData(df.transpose())

See this documentation for details on AnnData objects: 
[https://anndata.readthedocs.io](https://anndata.readthedocs.io).

`xd.X[0, :]`, the first row, contains the expression levels of all genes for the 
first sample.  
`xd.X[:, 0]`, the first column, contains the expression levels of
the first gene for all samples.
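
The row and column convention can be checked on a toy example that uses only Pandas and NumPy. The gene names, sample names and counts are made up, and `toy_X` simply stands in for `xd.X`.

```python
import pandas as pd

# Toy genes-by-samples table (hypothetical identifiers and counts).
toy_df = pd.DataFrame(
    {"sample_1": [10, 0, 5], "sample_2": [3, 7, 1]},
    index=["gene_A", "gene_B", "gene_C"],
)

# After transposition, rows are samples and columns are genes.
toy_X = toy_df.transpose().to_numpy()

print(toy_X.shape)   # (number of samples, number of genes)
print(toy_X[0, :])   # expression levels of all genes for sample_1
print(toy_X[:, 0])   # expression levels of gene_A for all samples
```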

The feature names (gene IDs) are stored in `xd.var_names`, and the sample
identifiers are stored in `xd.obs_names`.  
We make sure that the feature names are unique with the
following command:

In [None]:
xd.var_names_make_unique()

In order to improve cross-sample comparisons, we normalize the sequencing
depth to 1 million, with the following Scanpy command:

In [None]:
sc.pp.normalize_total(xd, target_sum=1e6)

`normalize_total()` performs a linear normalization for each sample 
so that the sum of the feature values becomes equal to `target_sum`.  
It is a very basic normalization that we use for simplicity in this tutorial, 
but for more advanced applications, a more sophisticated preprocessing may be 
required.  
`normalize_total()` is an in-place modification of the data, so after its 
application, `xd.X` contains the modified data.
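
The linear normalization described above can be sketched in a few lines of NumPy. This is a simplified illustration on made-up counts: it ignores the optional arguments of the Scanpy function and assumes that every sample has a non-zero total.

```python
import numpy as np

# Toy samples-by-genes count matrix (made-up values).
toy_counts = np.array([[10.0, 30.0, 60.0],
                       [1.0, 1.0, 2.0]])
target_sum = 1e6

# Divide each row (sample) by its total, then rescale to the target sum.
row_totals = toy_counts.sum(axis=1, keepdims=True)
toy_normalized = toy_counts / row_totals * target_sum

print(toy_normalized.sum(axis=1))  # every row now sums to target_sum
```

Within each sample, the ratios between genes are unchanged; only the overall scale is made comparable across samples.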

We save `xd` as **xomx_kidney_classif.h5ad**
in the `save_dir` directory:

In [None]:
xd.write(os.path.join(save_dir, "xomx_kidney_classif.h5ad"))

Now we can delete the individual sample files that were downloaded in
Step 2:

In [None]:
assert tmpdir == os.path.join(save_dir, "tmpdir_GDCsamples")
shutil.rmtree(tmpdir, ignore_errors=True)  # be careful with this command

## Step 4: Labelling the samples

We load the AnnData object and the manifest (useful to avoid running the previous steps if the kernel has been restarted):

In [None]:
xd = sc.read(os.path.join(save_dir, "xomx_kidney_classif.h5ad"))
manifest = pd.read_table(os.path.join(save_dir, "manifest.txt"), header=0)

The manifest contains the labels (`"TCGA-KIRC"`, `"TCGA-KIRP"` or `"TCGA-KICH"`) of 
every sample.  
We use it to create a dictionary of labels: `label_dict`.

In [None]:
label_dict = {}
for i in range(xd.n_obs):
    label_dict[manifest["id"][i]] = manifest["annotation"][i]

Example: `label_dict["80c9e71b-7f2f-48cf-b3ef-f037660a4903"]` is equal to `"TCGA-KICH"`.

Then, we create the array of labels, considering samples in the same order as 
`xd.obs_names`, and assign it to `xd.obs["labels"]`.

In [None]:
label_array = np.array([label_dict[xd.obs_names[i]] for i in range(xd.n_obs)])
xd.obs["labels"] = label_array

We compute the list of distinct labels, and assign it, as an unstructured annotation,
to `xd.uns["all_labels"]`.

In [None]:
xd.uns["all_labels"] = xomx.tl.all_labels(xd.obs["labels"])

We also compute the list of sample indices for every label:

In [None]:
xd.uns["obs_indices_per_label"] = xomx.tl.indices_per_label(xd.obs["labels"])

Example: `xd.uns["obs_indices_per_label"]["TCGA-KIRC"]` is the list of indices
of the samples that are labelled as `"TCGA-KIRC"`.

It is important to use the keys `"labels"`,
`"all_labels"` and `"obs_indices_per_label"` as they
are expected by some *xomx* functions.
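
To make these two annotations concrete, here is a minimal NumPy sketch that builds comparable structures from a toy label array. It is not the *xomx* implementation: the ordering of the labels and the container types returned by `xomx.tl.all_labels()` and `xomx.tl.indices_per_label()` may differ.

```python
import numpy as np

# Toy label array, one entry per sample (made-up ordering).
toy_labels = np.array(["TCGA-KIRC", "TCGA-KICH", "TCGA-KIRC", "TCGA-KIRP"])

# Distinct labels (sorted here; the ordering used by xomx may differ).
toy_all_labels = [str(label) for label in np.unique(toy_labels)]

# For each label, the positions of the samples that carry it.
toy_indices_per_label = {
    label: np.where(toy_labels == label)[0] for label in toy_all_labels
}

print(toy_all_labels)
print(toy_indices_per_label["TCGA-KIRC"])
```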

## Step 5: Basic preprocessing

We compute the mean and standard deviation (across samples) for all the features:

In [None]:
xd.var["mean_values"] = xomx.tl.var_mean_values(xd)
xd.var["standard_deviations"] = xomx.tl.var_standard_deviations(xd)
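
As an illustration of what "across samples" means, the following NumPy sketch computes per-feature statistics on a toy matrix by reducing along axis 0. It uses the population standard deviation (`ddof=0`), which is an assumption: we have not checked which convention `xomx.tl.var_standard_deviations()` follows.

```python
import numpy as np

# Toy matrix with 3 samples (rows) and 2 features (columns), made-up values.
toy_X = np.array([[1.0, 10.0],
                  [3.0, 10.0],
                  [5.0, 10.0]])

# Reduce along axis 0 to obtain one value per feature.
toy_means = toy_X.mean(axis=0)
toy_stds = toy_X.std(axis=0)  # population standard deviation (ddof=0), an assumption

print(toy_means)
print(toy_stds)
```

The second feature is constant across samples, so its standard deviation is zero; such features carry no information for distinguishing labels.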

We logarithmize the data with the following Scanpy function that applies
the transformation X = log(1 + X):

In [None]:
sc.pp.log1p(xd)
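
The effect of this transformation can be checked with NumPy on a toy array (made-up values). With its default settings, the Scanpy function uses the natural logarithm, like `np.log1p`.

```python
import numpy as np

# Toy normalized expression values (made-up).
toy_X = np.array([[0.0, 9.0],
                  [99.0, 999.0]])

# Natural logarithm of (1 + x), applied element-wise.
toy_log = np.log1p(toy_X)

print(toy_log)
print(np.expm1(toy_log))  # inverse transformation, recovers toy_X
```

Zero values stay at zero, and large values are compressed, which reduces the influence of a few very highly expressed genes.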

We then follow the Scanpy procedure to select the top 8000 highly variable genes:

In [None]:
sc.pp.highly_variable_genes(xd, n_top_genes=8000)

We perform the filtering to actually remove the other features:

In [None]:
xd = xd[:, xd.var.highly_variable].copy()

The reason why we reduce the number of features
is to speed up the process of feature elimination, 
which can be relatively slow if it begins 
with tens of thousands of features. Keeping 
highly variable features is one possibility,
but there are other options for the
initial selection of features, see for instance 
the [xomx_pbmc.ipynb](https://colab.research.google.com/github/perrin-isir/xomx-tutorials/blob/main/tutorials/xomx_pbmc.ipynb) tutorial (Step 2).
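
To give an idea of what such an initial selection does, here is a deliberately simplified sketch that keeps the features with the largest variance. This is a swapped-in criterion for illustration only: Scanpy's `highly_variable_genes()` ranks genes with a normalized dispersion measure rather than the raw variance, so the two selections generally differ. All names and values below are made up.

```python
import numpy as np

rng_toy = np.random.RandomState(0)

# 20 toy samples and 6 toy features with very different spreads.
toy_scales = np.array([0.1, 5.0, 0.2, 3.0, 0.3, 4.0])
toy_X = rng_toy.normal(size=(20, 6)) * toy_scales

n_top = 3
toy_var = toy_X.var(axis=0)

# Indices of the n_top most variable features, kept in original column order.
top_idx = np.sort(np.argsort(toy_var)[-n_top:])

# Filtering step: keep only the selected columns.
toy_X_small = toy_X[:, top_idx]

print(top_idx)
print(toy_X_small.shape)
```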

We compute the dictionary of feature indices,
which is required by some *xomx* functions:

In [None]:
xd.uns["var_indices"] = xomx.tl.var_indices(xd)

Example:  `xd.uns["var_indices"]["ENSG00000281918.1"]`
is equal to 7999 because ENSG00000281918.1 is now
the last of the 8000 features in `xd.var_names`.
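
Conceptually, such a dictionary maps every feature name to its column position. A minimal sketch with made-up feature names follows; the object returned by `xomx.tl.var_indices()` may be built differently.

```python
# Toy list of feature names, standing in for xd.var_names (made-up names).
toy_var_names = ["gene_A", "gene_B", "gene_C"]

# Map each feature name to its position in the list.
toy_var_indices = {name: i for i, name in enumerate(toy_var_names)}

print(toy_var_indices["gene_C"])
```

A dictionary lookup avoids scanning the whole list of names each time a feature has to be located by its identifier.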

We then randomly split the samples into training and test sets:

In [None]:
xomx.tl.train_and_test_indices(xd, "obs_indices_per_label", test_train_ratio=0.25, rng=rng)

The function `train_and_test_indices()` requires `xd.uns["obs_indices_per_label"]`, which was computed in 
the previous step. With `test_train_ratio=0.25`, for every label 
(`"TCGA-KIRC"`, `"TCGA-KIRP"` or `"TCGA-KICH"`), 25% of the samples are assigned to 
the test set, and 75% to the training set. It creates the following unstructured 
annotations:
- `xd.uns["train_indices"]`: the array of indices of all samples that belong 
to the training set.
- `xd.uns["test_indices"]`: the array of indices of all samples that belong 
to the test set.
- `xd.uns["train_indices_per_label"]`: the dictionary of sample indices in the 
training set, per label. For instance, `xd.uns["train_indices_per_label"]["TCGA-KIRP"]` is the array
of indices of all the samples labelled as `"TCGA-KIRP"` that belong to the training set.
- `xd.uns["test_indices_per_label"]`: the dictionary of sample indices in the 
test set, per label.
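
The per-label split can be sketched as follows. The helper `toy_split()` is hypothetical and written only for this illustration; the shuffling, rounding and output types of `xomx.tl.train_and_test_indices()` may differ.

```python
import numpy as np


def toy_split(indices_per_label, test_train_ratio, rng):
    """Hypothetical helper: random per-label split into train and test indices."""
    train_per_label, test_per_label = {}, {}
    for label, indices in indices_per_label.items():
        shuffled = rng.permutation(np.asarray(indices))
        # Number of test samples for this label (rounded down, an assumption).
        n_test = int(len(shuffled) * test_train_ratio)
        test_per_label[label] = np.sort(shuffled[:n_test])
        train_per_label[label] = np.sort(shuffled[n_test:])
    return train_per_label, test_per_label


# Toy input: 8 samples of one label and 4 samples of another.
toy_indices = {"TCGA-KIRC": np.arange(0, 8), "TCGA-KICH": np.arange(8, 12)}
toy_train, toy_test = toy_split(toy_indices, 0.25, np.random.RandomState(0))

# Global index arrays, obtained by concatenating the per-label arrays.
toy_train_indices = np.concatenate(list(toy_train.values()))
toy_test_indices = np.concatenate(list(toy_test.values()))

print(toy_train)
print(toy_test)
```

Splitting label by label keeps the same proportion of each cancer type in both sets, which matters here because one of the three classes has far fewer samples than the others.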

Finally, we save the logarithmized and filtered data to a new file, **xomx_kidney_classif_small.h5ad**, 
which will be used as a starting point in phase 2 of the tutorial [(xomx_kidney_classif_2.ipynb)](https://github.com/perrin-isir/xomx-tutorials/blob/main/tutorials/xomx_kidney_classif_2.ipynb):

In [None]:
xd.write(os.path.join(save_dir, "xomx_kidney_classif_small.h5ad"))  # ignore the warning