<a href="https://colab.research.google.com/github/perrin-isir/xomx-tutorials/blob/main/tutorials/xomx_kidney_classif_2.ipynb"> <img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open in Google Colaboratory"></a>
<a id="raw-url" href="https://raw.githubusercontent.com/perrin-isir/xomx-tutorials/main/tutorials/xomx_kidney_classif_2.ipynb" download> <img align="left" src="https://img.shields.io/badge/Github-Download%20(Right%20click%20%2B%20Save%20link%20as...)-blue" alt="Download (Right click + Save link as)" title="Download Notebook"></a>

# *xomx tutorial:* **constructing diagnostic biomarker signatures**: phase 1 (optional)

This is the first phase of a tutorial on kidney cancer classification.  
The objective of the tutorial is to use a recursive feature elimination method on 
RNA-seq data from the Cancer Genome Atlas (TCGA) to identify gene biomarker signatures 
for the differential diagnosis of three types of kidney cancer: kidney renal clear cell
carcinoma (**KIRC**), kidney renal papillary cell carcinoma (**KIRP**), and chromophobe
renal cell carcinoma (**KICH**).

This first phase imports the RNA-seq data from the Cancer Genome Atlas (TCGA) online database and applies basic preprocessing.  
As importing the data takes some time, **we generally recommend skipping this phase and going directly to phase 2 of the tutorial:
[(xomx_kidney_classif_2.ipynb)](https://github.com/perrin-isir/xomx-tutorials/blob/main/tutorials/xomx_kidney_classif_2.ipynb)**.

However, some basic preprocessing tools shown in phase 1 are important, such as the functions `xomx.tl.all_labels()` (Step 4), `xomx.tl.indices_per_labels()` (Step 4), 
`xomx.tl.var_mean_values()` (Step 5),
`xomx.tl.var_standard_deviations()` (Step 5),
and `xomx.tl.train_and_test_indices()` (Step 5).

In [None]:
# imports:
import os
from IPython.display import clear_output
try:
    import xomx
except ImportError:
    !pip install git+https://github.com/perrin-isir/xomx.git
    clear_output()
    import xomx
try:
    import pandas as pd
except ImportError:
    !pip install pandas
    clear_output()
    import pandas as pd
try:
    import scanpy as sc
except ImportError:
    !pip install scanpy
    clear_output()
    import scanpy as sc
import numpy as np

We define `save_dir`, the folder in which everything will be saved.

In [None]:
save_dir = os.path.expanduser(os.path.join('~', 'results', 'xomx-tutorials', 'kidney_classif'))  # the default directory in which results are stored
os.makedirs(save_dir, exist_ok=True)

In [None]:
# Seed the pseudo-random number generator for reproducibility
rng = np.random.RandomState(0)
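
Fixing the seed makes every later random draw repeatable from one run of the notebook to the next. The sketch below illustrates this with two independent generators; the names `rng_a` and `rng_b` are introduced only for this example and are not used elsewhere in the tutorial.

```python
import numpy as np

# Two generators created with the same seed produce identical draws.
rng_a = np.random.RandomState(0)
rng_b = np.random.RandomState(0)

perm_a = rng_a.permutation(10)  # a random ordering of 0..9
perm_b = rng_b.permutation(10)  # same seed, so the same ordering

print(np.array_equal(perm_a, perm_b))  # True
```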

## Step 1: preparing the manifest

We use the 
[GDC Data Transfer Tool](
https://gdc.cancer.gov/access-data/gdc-data-transfer-tool
)
to import data from the Cancer Genome Atlas (TCGA). 
This involves creating a `manifest.txt` file that describes the files to be imported.

The `gdc_create_manifest()` function
facilitates the creation of this manifest. It is designed to import files of gene 
expression counts obtained with [HTSeq](https://github.com/simon-anders/htseq). 
You can have a look at its implementation in 
[xomx/data_importation/gdc.py](../data_importation/gdc.py) to adapt it to your own
needs if you want to import other types of data.

`gdc_create_manifest()` takes as input the disease type (in our case "Adenomas and 
Adenocarcinomas"), the list of project names ("TCGA-KIRC", "TCGA-KIRP", "TCGA-KICH"), 
and the number of samples desired for each of these projects (note: for "TCGA-KICH", 
there are only 65 samples available). It returns a list of Pandas dataframes, one for 
each project.

More information on GDC data can be found on the [GDC Data Portal](
https://portal.gdc.cancer.gov/
).

In [None]:
disease_type = "Adenomas and Adenocarcinomas"
# The 3 categories of cancers studied in this tutorial correspond to the following
# TCGA projects, which are different types of adenocarcinomas
project_list = ["TCGA-KIRC", "TCGA-KIRP", "TCGA-KICH"]
# Fetch 200 cases of KIRC, 200 cases of KIRP, and 65 cases of KICH from the
# GDC database
case_numbers = [200, 200, 65]
df_list = xomx.di.gdc_create_manifest(
    disease_type,
    project_list,
    case_numbers,
)

The Pandas library (imported as `pd`) is used to write the concatenation of the
output dataframes to the file `manifest.txt`:

In [None]:
df = pd.concat(df_list)
df.to_csv(
    os.path.join(save_dir, "manifest.txt"),
    header=True,
    index=False,
    sep="\t",
    mode="w",
)
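
To see what this concatenate-and-write step does without querying the GDC database, here is a minimal sketch on toy dataframes. The column names and values are hypothetical placeholders, not the actual output of `gdc_create_manifest()`, and the file is written to a temporary directory rather than to `save_dir`.

```python
import os
import tempfile

import pandas as pd

# Toy stand-ins for the per-project dataframes. The column names and values
# are hypothetical; they are not taken from a real GDC manifest.
toy_df_list = [
    pd.DataFrame({"id": ["a1", "a2"], "filename": ["a1.counts.gz", "a2.counts.gz"]}),
    pd.DataFrame({"id": ["b1"], "filename": ["b1.counts.gz"]}),
]

# Stack the dataframes row-wise, as done above for the real manifest.
toy_df = pd.concat(toy_df_list)

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "manifest.txt")
    # Same options as above: tab-separated, header row, no index column.
    toy_df.to_csv(toy_path, header=True, index=False, sep="\t", mode="w")
    # Read the file back to check its content.
    reloaded = pd.read_csv(toy_path, sep="\t")

print(reloaded.shape)  # (3, 2)
```

Passing `index=False` matters here: after `pd.concat`, each dataframe keeps its own row index, so writing the index would add a column of repeated values to the manifest.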

## Step 2: importing the data

In [None]:
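# Download and unzip the GDC Data Transfer Tool client (v1.6.1, Ubuntu x64 build)
# unless it is already present in save_dir
if not os.path.exists(os.path.join(save_dir, 'gdc-client')):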
    !wget -O {os.path.join(save_dir, 'gdc-client_v1.6.1_Ubuntu_x64.zip')} "https://gdc.cancer.gov/files/public/file/gdc-client_v1.6.1_Ubuntu_x64.zip"
    !unzip {os.path.join(save_dir, 'gdc-client_v1.6.1_Ubuntu_x64.zip')} -d {save_dir}
    !rm {os.path.join(save_dir, 'gdc-client_v1.6.1_Ubuntu_x64.zip')}