# Tutorial of DeepAdapter
### A self-adaptive and versatile tool for eliminating multiple undesirable variations from transcriptome
In this notebook, you will learn how to reproduce the results and re-train your DeepAdapter with your own datasets.
## 1. Installation and requirements
### 1.1. Installation
To run locally, open a terminal window and install DeepAdapter with:
```sh
# Create a new conda environment
conda create -n DA python=3.9
# Activate the environment
conda activate DA
# Install dependencies
pip install deepadapter==1.0.1
# Launch Jupyter Notebook
jupyter notebook
```
To execute a cell, press Shift+Enter.

### 1.2. Formats of your own input files
Your own dataset should include $2$ files (**Note: both files must be placed in the same directory.**):
* **gene_expression.txt** for gene expression matrix;
* **unwantedVar_biologicalSig.txt** for annotations of unwanted variations and biological signals.

An example of **gene_expression.txt** is shown below (**Note: the values in every row must be separated by commas.**):
| SampleId | Gene_1 | Gene_2 | Gene_3 | ... | Gene_n-2 | Gene_n-1 | Gene_n |
|  ----  | ----  | ----  | ----  |  ----  | ----  | ----  | ----  |
| **1** | x<sub>11</sub> | x<sub>12</sub> | x<sub>13</sub> | ... | x<sub>1(n-2)</sub> | x<sub>1(n-1)</sub> | x<sub>1n</sub> |
| **2** | x<sub>21</sub> | x<sub>22</sub> | x<sub>23</sub> | ... | x<sub>2(n-2)</sub> | x<sub>2(n-1)</sub> | x<sub>2n</sub> |
| ... | ... | ... | ... | ... | ... | ... | ... |
| **m** | x<sub>m1</sub> | x<sub>m2</sub> | x<sub>m3</sub> | ... | x<sub>m(n-2)</sub> | x<sub>m(n-1)</sub> | x<sub>mn</sub> |

An example of **unwantedVar_biologicalSig.txt** is shown below (**Note: the values in every row must be separated by commas.**):
| SampleId | Unwanted_var | Biological_sig |
|  ----  | ----  | ----  |
| **1** | unwantedVar<sub>1</sub> | biologicalSig<sub>1</sub> |
| **2** | unwantedVar<sub>1</sub> | biologicalSig<sub>1</sub> |
| ... | ... | ... |
| **m** | unwantedVar<sub>p</sub> | biologicalSig<sub>q</sub> |

Examples of **unwantedVar** and **biologicalSig**:
* **unwantedVar**:
    * **batch**: batch1, batch2, ..., batch(n);
    * **platform**: RNA-seq, microarray;
    * **purity**: cell lines, tissue;
    * ...
* **biologicalSig**:
    * **cancer types**: lung cancer, kidney cancer, ..., bone cancer;
    * **lineages**: lung, kidney, ..., eye;
    * **donor sources**: donor1, donor2, ..., donor(n);
    * ...
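
As a minimal sketch of the two-file layout described above, the following snippet writes a toy dataset to a temporary directory and reads it back with pandas. The gene names, expression values, and labels are made up for illustration; only the file names and the `SampleId`, `Unwanted_var`, and `Biological_sig` column names come from this tutorial.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def write_toy_dataset(out_dir, n_samples=4, n_genes=5, seed=0):
    """Write toy gene_expression.txt and unwantedVar_biologicalSig.txt (comma-separated)."""
    rng = np.random.default_rng(seed)
    sample_ids = np.arange(1, n_samples + 1)

    # Expression matrix: one row per sample, one column per gene.
    exp = pd.DataFrame(
        rng.poisson(lam=20, size=(n_samples, n_genes)),
        columns=[f"Gene_{j}" for j in range(1, n_genes + 1)],
    )
    exp.insert(0, "SampleId", sample_ids)

    # Annotations: one unwanted-variation label and one biological label per sample.
    ann = pd.DataFrame({
        "SampleId": sample_ids,
        "Unwanted_var": [f"batch{1 + i % 2}" for i in range(n_samples)],
        "Biological_sig": [
            "lung cancer" if i < n_samples // 2 else "kidney cancer"
            for i in range(n_samples)
        ],
    })

    exp_path = os.path.join(out_dir, "gene_expression.txt")
    ann_path = os.path.join(out_dir, "unwantedVar_biologicalSig.txt")
    exp.to_csv(exp_path, index=False)
    ann.to_csv(ann_path, index=False)
    return exp_path, ann_path


with tempfile.TemporaryDirectory() as tmp_dir:
    exp_path, ann_path = write_toy_dataset(tmp_dir)
    exp_df = pd.read_csv(exp_path)
    ann_df = pd.read_csv(ann_path)

# Both files should describe the same samples in the same order.
same_samples = exp_df["SampleId"].equals(ann_df["SampleId"])
print(exp_df.shape, ann_df.shape, same_samples)
```

Checking that the `SampleId` columns of the two files agree before loading them is a cheap way to catch misaligned annotations early.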

### 1.3. Download pre-trained models
Please download the pre-trained models for fine-tuning. The models are available at this link: [click here to download](https://github.com/mjDelta/DeepAdapter/blob/main/models).

After downloading, place the models in a `models/` directory at the same level as this tutorial.
* pretrained batch_lincs: `models/batch_LINCS`
* pretrained batch_quartet: `models/batch_Quartet`
* pretrained platform: `models/platform`
* pretrained purity: `models/purity`
* pretrained batch_quartet (using the intersected gene set between LINCS and Quartet dataset): `models/pretrained_LINCS_Quartet`

**Putting models in the right directory is important for loading the pretrained models successfully.**
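
A quick way to confirm the layout before loading anything is to check that the sub-directories listed above exist. This is a minimal sketch: the helper name `find_missing_model_dirs` is hypothetical and is not part of the `deepadapter` package, and it only checks that the directories are present, not what they contain.

```python
import os

# Sub-directories named in the list above.
EXPECTED_MODEL_DIRS = [
    "batch_LINCS",
    "batch_Quartet",
    "platform",
    "purity",
    "pretrained_LINCS_Quartet",
]


def find_missing_model_dirs(models_root="models"):
    """Return the expected sub-directories that are absent under models_root."""
    return [
        name
        for name in EXPECTED_MODEL_DIRS
        if not os.path.isdir(os.path.join(models_root, name))
    ]


missing = find_missing_model_dirs("models")
if missing:
    print("Missing model directories:", ", ".join(missing))
else:
    print("All pre-trained model directories found.")
```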

## 2. Load the datasets and preprocess
### 2.1. Load the modules

In [None]:
%load_ext autoreload
%autoreload 2

import os, sys
import pandas as pd
import numpy as np

from deepadapter.utils import data_utils as DT
from deepadapter.utils import finetune_utils as FTUT ## assumes finetune_utils sits in deepadapter.utils, next to data_utils
from deepadapter.params import dl_finetune_params as DLPARAM

### 2.2. Load your own dataset
Replace **yourDataDir** with the directory where your own dataset is located.

Name the columns holding the sample IDs, unwanted variation annotations, and wanted signal annotations <u>**SampleId**</u>, <u>**Unwanted_var**</u>, and <u>**Biological_sig**</u>, respectively.

In [None]:
data_name = "yourDataName"
exp_data_path = "yourDataDir/gene_expression.txt" ## the path of your gene expression matrix
ann_data_path = "yourDataDir/unwantedVar_biologicalSig.txt" ## the path of your annotation information
sample_id = "SampleId"
unwanted_var_col = "Unwanted_var"
wanted_sig_col = "Biological_sig"

loadTransData = DT.LoadTransData(exp_data_path, ann_data_path, sample_id, unwanted_var_col, wanted_sig_col)
data, ids, unwanted_labels, wanted_labels = loadTransData.load_data()

### 2.3. Load the genes used in the pre-trained model
Before fine-tuning, restrict the loaded data to the genes used in the pre-trained model. The pre-trained models can be found in the `models` folder.

In [None]:
load_dir = "models/pretrained_LINCS_Quartet/"
pretrain_genes = pd.read_csv(os.path.join(load_dir, "gene.csv"))["gene"]
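try:
    ## keep only the pre-trained genes, in the pre-trained order
    data = data[pretrain_genes]
except KeyError as e:
    ## raise a proper exception; raising a bare string is itself a TypeError
    raise ValueError("Inconsistent gene set between this dataset and pretrained dataset") from e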
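
When the gene sets do not match, it helps to know which genes are missing. The sketch below is one way to report them, assuming the expression data is a pandas DataFrame with one column per gene. The helper name `select_pretrain_genes` and the toy gene names are hypothetical.

```python
import pandas as pd


def select_pretrain_genes(data, pretrain_genes):
    """Subset `data` to the pre-trained gene order, listing missing genes in the error."""
    missing = pd.Index(pretrain_genes).difference(data.columns)
    if len(missing) > 0:
        preview = ", ".join(map(str, missing[:10]))
        raise ValueError(
            f"{len(missing)} pre-trained genes are missing from the dataset, e.g. {preview}"
        )
    return data[list(pretrain_genes)]


# Toy example with made-up gene names.
toy = pd.DataFrame({"GeneB": [1.0, 2.0], "GeneA": [3.0, 4.0], "GeneC": [5.0, 6.0]})
subset = select_pretrain_genes(toy, pd.Series(["GeneA", "GeneB"]))
print(list(subset.columns))
```

The returned columns follow the order of the pre-trained gene list rather than the order in the input file.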

### 2.4. Preprocess the transcriptomic data
The gene expression profiles are preprocessed by sample normalization, gene ranking, and log normalization. Let $S_i = \sum_l x_{i l}$ denote the total expression of sample $i$, summed over all genes $l$. In sample normalization, we divide each expression value by its sample total $S_i$ and multiply by a constant of $10^4$ ([Xiaokang Yu et al. Nature communications, 2023](https://www.nature.com/articles/s41467-023-36635-5)):
$$x_{i l} = \frac{x_{i l}}{S_i} \times 10^4.$$
Then, we sort genes by their expression levels and apply the log transformation $x_{i l} = \log {(x_{i l} + 1)}$.

In [None]:
prepTransData = DT.PrepTransData()
raw_df = prepTransData.sample_norm(data)
input_arr = prepTransData.sample_log(raw_df)
bat2label, label2bat, unwanted_labels, unwanted_onehot = prepTransData.label2onehot(unwanted_labels)
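
The sample normalization and log transformation described in this section can be sketched in plain NumPy as follows. This is an illustration of the formulas only, not the implementation inside `PrepTransData`. The function names are hypothetical, the gene-ranking step is left out because its exact definition is not given here, and samples whose total expression is zero are not handled.

```python
import numpy as np


def sample_normalize(x, scale=1e4):
    """Divide every sample (row) by its total expression S_i and multiply by `scale`."""
    x = np.asarray(x, dtype=float)
    totals = x.sum(axis=1, keepdims=True)  # S_i for each sample
    return x / totals * scale


def log_transform(x):
    """Apply log(x + 1) element-wise."""
    return np.log1p(x)


# Two toy samples with three genes each (made-up values).
toy_counts = np.array([[10.0, 30.0, 60.0],
                       [5.0, 5.0, 0.0]])
normed = sample_normalize(toy_counts)
logged = log_transform(normed)
print(normed.sum(axis=1))
```

After sample normalization, every row sums to the scaling constant, so samples of different total expression become comparable before the log transformation.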

## 3. Fine-tune DeepAdapter
### 3.1. Adjust DeepAdapter's parameters
The parameters for DeepAdapter are as follows (**Note: you can change any parameter directly in `net_args`, e.g., `net_args.epochs = 10000`.**):
* **epochs**: the total training epochs of DeepAdapter, default = $5000$
* **ae_epochs**: the warmup epochs of autoencoder in DeepAdapter, default = $400$
* **batch_epochs**: the warmup epochs of discriminator in DeepAdapter, default = $50$
* **batch_size**: the batch size of dataloader, default = $256$
* **hidden_dim**: the hidden units of autoencoder in DeepAdapter, default = $256$
* **z_dim**: the latent units of autoencoder in DeepAdapter, default = $128$
* **drop**: the dropout rate of DeepAdapter, default = $0.3$
* **lr_lower_ae**: the lower learning rate of autoencoder in DeepAdapter, default = $1 \times 10^{-5}$
* **lr_upper_ae**: the upper learning rate of autoencoder in DeepAdapter, default = $5 \times 10^{-4}$
* **lr_lower_batch**: the lower learning rate of discriminator in DeepAdapter, default = $1 \times 10^{-5}$
* **lr_upper_batch**: the upper learning rate of discriminator in DeepAdapter, default = $5 \times 10^{-4}$

In [None]:
net_args = DLPARAM.load_dl_params()

print(net_args.epochs)
net_args.epochs = 10000
print(net_args.epochs)
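
Setting an attribute with a misspelled name (for example `net_args.epoch`) creates a new attribute without raising an error, so the intended parameter keeps its default. The sketch below guards against that. It assumes `net_args` behaves like an `argparse.Namespace`, which this tutorial does not confirm, and uses a stand-in object filled with the defaults listed above. The helper name `override_params` is hypothetical.

```python
from argparse import Namespace

# Stand-in for `net_args`, filled with the defaults listed above.
toy_args = Namespace(
    epochs=5000, ae_epochs=400, batch_epochs=50, batch_size=256,
    hidden_dim=256, z_dim=128, drop=0.3,
    lr_lower_ae=1e-5, lr_upper_ae=5e-4,
    lr_lower_batch=1e-5, lr_upper_batch=5e-4,
)


def override_params(args, **overrides):
    """Set attributes on `args`, rejecting names that are not already defined."""
    for name, value in overrides.items():
        if not hasattr(args, name):
            raise AttributeError(f"Unknown parameter: {name}")
        setattr(args, name, value)
    return args


override_params(toy_args, epochs=10000, batch_size=128)
print(toy_args.epochs, toy_args.batch_size)
```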

### 3.2. Fine-tune DeepAdapter
DeepAdapter is fine-tuned with all layers unfrozen. Because the pre-training and fine-tuning datasets may contain different numbers of unwanted variations, the last layer of the discriminator network is modified to match the number of unwanted variations in the fine-tuning dataset.
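
The idea of swapping the discriminator's output layer can be sketched in PyTorch as follows. This is an illustration under stated assumptions, not DeepAdapter's actual network: the architecture, the helper names `build_discriminator` and `replace_output_layer`, and the class counts are all hypothetical. Only the layer sizes reuse the `z_dim` and `hidden_dim` defaults listed in section 3.1.

```python
import torch
from torch import nn


def build_discriminator(z_dim=128, hidden_dim=256, n_classes=3):
    """Hypothetical stand-in for the discriminator network."""
    return nn.Sequential(
        nn.Linear(z_dim, hidden_dim),
        nn.ReLU(),
        nn.Linear(hidden_dim, n_classes),
    )


def replace_output_layer(model, n_classes):
    """Swap the final Linear layer for a freshly initialized one with `n_classes` outputs."""
    old_head = model[-1]
    model[-1] = nn.Linear(old_head.in_features, n_classes)
    return model


# Example: 3 unwanted-variation classes in pre-training, 5 in the new dataset.
pretrained = build_discriminator(n_classes=3)
finetuned = replace_output_layer(pretrained, n_classes=5)

# No layer is frozen, so every parameter remains trainable.
all_trainable = all(p.requires_grad for p in finetuned.parameters())
logits = finetuned(torch.randn(4, 128))
print(logits.shape, all_trainable)
```

Only the new output layer starts from random weights. The earlier layers keep their pre-trained weights and continue to be updated during fine-tuning.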

In [None]:
ft_num = net_args.ft_num

out_dir = f"models/finetune_{data_name}/record_{ft_num}"
os.makedirs(out_dir, exist_ok = True)

num_platform = len(bat2label)
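## NOTE: the model input is assumed to be the preprocessed expression matrix (input_arr),
## and `donors` is assumed to be the biological-signal labels (wanted_labels).
data, labels, labels_hot, donors, ids = input_arr, unwanted_labels, unwanted_onehot, wanted_labels, np.arange(len(unwanted_onehot))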

FTUT.test_finetune(data, labels, labels_hot, donors, ids, label2bat, load_dir, out_dir, net_args, num_platform, finetune_num = ft_num, n_test = 100)

avg_aligned_data, data, wnt_infs, unw_infs, ids = FTUT.finetune(data, labels, labels_hot, donors, ids, label2bat, load_dir, out_dir, net_args, num_platform, n_test = 100, test_ratio = 0.2)

save_path = os.path.join(out_dir, "DA_data.csv")
df = pd.DataFrame(avg_aligned_data, columns = pretrain_genes)
df["ID"] = ids
df["wantInfo"] = wnt_infs
df["unwantInfo"] = unw_infs
df.to_csv(save_path, index = False)
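
For downstream analysis, the saved table can be read back and split into expression values and annotations. The sketch below assumes the column layout written by the cell above (gene columns followed by `ID`, `wantInfo`, and `unwantInfo`). The helper name `split_aligned_table` is hypothetical, and a small in-memory table with made-up values stands in for `DA_data.csv`.

```python
import io

import pandas as pd

META_COLS = ["ID", "wantInfo", "unwantInfo"]


def split_aligned_table(df):
    """Separate the gene-expression columns from the three metadata columns."""
    expr = df.drop(columns=META_COLS)
    meta = df[META_COLS]
    return expr, meta


# Toy stand-in for DA_data.csv with made-up values.
toy_csv = io.StringIO(
    "GeneA,GeneB,ID,wantInfo,unwantInfo\n"
    "0.1,0.2,0,lung,batch1\n"
    "0.3,0.4,1,kidney,batch2\n"
)
expr, meta = split_aligned_table(pd.read_csv(toy_csv))
print(expr.shape, meta.shape)
```

To use it on the real output, pass `pd.read_csv(save_path)` instead of the toy table.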