# **Tutorial 1: Setting Up and Training SIDISH**
This tutorial explains how to initialize and train the SIDISH framework. SIDISH combines a Variational Autoencoder (VAE) and Deep Cox regression to uncover high-risk cell populations and predict clinical outcomes.
### In this tutorial, you will:
- Load the prepared data from Tutorial 0.
- Initialize SIDISH with appropriate model architecture, learning rates, and optimizer settings.
- Train SIDISH and save the resulting models and annotated data.
### Outcome:
By the end of this tutorial, you will have a fully trained SIDISH model capable of identifying high-risk cell subpopulations and linking them to clinical outcomes.


## **Step 1: Environment Setup**


### **1.1 Set up the SIDISH conda environment**
SIDISH requires Python 3.12. We recommend creating a dedicated conda environment to manage its dependencies:

Create a conda environment:
```bash
conda create --name sidish_env python=3.12
```
Activate the environment:
```bash
conda activate sidish_env
```

### **1.2 Install SIDISH**
On Code Ocean, SIDISH is already installed in the capsule, so no separate installation is needed. To make sure the installed version is the latest one:
- Delete SIDISH from your pip libraries.
- Add `git+https://github.com/mcgilldinglab/SIDISH.git#egg=sidish` at the bottom of the package list in the pip install section.
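
Before moving on, it can help to confirm that the interpreter and the package are what the tutorial expects. The snippet below is a minimal sketch using only the standard library; the helper names `python_matches` and `is_importable` are illustrative and not part of SIDISH.

```python
import importlib.util
import sys


def python_matches(required=(3, 12), current=None):
    """Return True when the interpreter's major.minor version equals `required`."""
    current = tuple(sys.version_info[:2]) if current is None else tuple(current)
    return current == tuple(required)


def is_importable(module_name):
    """Return True when a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


print("Python 3.12:", python_matches())
print("SIDISH importable:", is_importable("SIDISH"))
```

If either line prints `False`, revisit the environment setup above before continuing.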


## **Step 2: Import libraries**
### **2.1 Import SIDISH**
The SIDISH framework is imported directly for use:

In [10]:
from SIDISH import SIDISH as sidish

### **2.2 Import Additional Libraries**
Additional libraries for data handling, visualization, and deep learning are required:

In [2]:
import scanpy as sc
import pandas as pd
import numpy as np
import torch
import random
import os
import matplotlib.pyplot as plt

### **2.3 Set Seeds**
To ensure results are consistent across multiple runs, set seeds for all key libraries:

In [3]:
# Set seeds for reproducibility
seed = 0
ite = 0  # seed passed to SIDISH in Step 4
os.environ["PYTHONHASHSEED"] = str(seed)

def set_seed(seed):
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # if you are using multi-GPU.
    random.seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

# Call the seed setting function
set_seed(seed)
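
To see what seeding buys you, the sketch below reseeds a generator and checks that the same draws come back. It uses only Python's `random` module so that it runs anywhere; the same principle applies to the NumPy and PyTorch generators seeded above. The helper name `draw` is illustrative.

```python
import random


def draw(seed, n=5):
    """Seed Python's RNG, then draw n pseudo-random floats."""
    random.seed(seed)
    return [random.random() for _ in range(n)]


first = draw(0)
second = draw(0)   # same seed, so the same sequence
other = draw(1)    # different seed, so a different sequence

print(first == second)  # True
print(first == other)   # False
```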

## **Step 3: Reading the Datasets**
We load the processed data prepared in Tutorial 0. Loading both the scRNA-seq and bulk RNA-seq data ensures that SIDISH has access to the required datasets for training.

### **3.1 Read single-cell data**
Load the processed single-cell RNA-seq data saved as an `.h5ad` file in Tutorial 0.

In [4]:
# Read single-cell RNA-seq data
adata = sc.read_h5ad("../data/processed_adata.h5ad")

### **3.2 Read bulk and survival data**

In [5]:
# Read bulk RNA-seq
bulk = pd.read_csv("../data/processed_bulk.csv", index_col=0)
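
A quick sanity check after loading is to see how many genes the two datasets share. The sketch below is illustrative only: it assumes the bulk table stores genes as columns alongside survival columns, and the gene and column names are toy values, not taken from the lung dataset.

```python
def shared_genes(sc_genes, bulk_columns):
    """Return genes present in both inputs, keeping the single-cell order."""
    bulk_set = set(bulk_columns)
    return [gene for gene in sc_genes if gene in bulk_set]


# Toy names for illustration only (hypothetical, not from the tutorial data).
sc_genes = ["EGFR", "KRAS", "TP53", "ALK"]
bulk_columns = ["TP53", "EGFR", "duration", "vital_status"]

common = shared_genes(sc_genes, bulk_columns)
print(f"{len(common)} of {len(sc_genes)} single-cell genes found in bulk: {common}")
```

With the real objects, one could pass `adata.var_names` and `bulk.columns`, provided the bulk file really is laid out with genes as columns.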

## **Step 4: Initializing SIDISH**
Initialize the SIDISH model with the saved single-cell data and the merged bulk RNA-seq data. This call also sets the device (`"cpu"` or `"cuda"`) and the seed used for reproducibility.

In [6]:
sdh = sidish(adata, bulk, "cuda", seed=ite)

SIDISH No spatial graph used. Proceeding with dense VAE.
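
The call above hard-codes `"cuda"`, which assumes a GPU is visible to PyTorch. A minimal sketch for falling back to the CPU is shown below; the helper name `pick_device` is illustrative and not part of SIDISH.

```python
def pick_device():
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"  # no PyTorch available, so no GPU can be used
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print("Using device:", device)
# The result could then replace the literal in the constructor call:
# sdh = sidish(adata, bulk, device, seed=ite)
```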


### **4.1 Initialize Phase 1 of SIDISH**
This step sets the hyperparameters for Phase 1 of SIDISH, in which a Variational Autoencoder (VAE) compresses the single-cell data into a biologically meaningful latent space to extract key cellular patterns. The arguments are:
- `epoch`: the number of epochs used to train the VAE in iteration 1 of SIDISH.
- `i_epoch`: the number of epochs used to retrain the VAE in each iteration after iteration 1.
- `latent_size`: the size of the VAE latent space, which we set to 32.
- `layer_dims`: the layer dimensions of the VAE encoder and decoder; in this example, two layers of sizes 512 and 128.
- `batch_size`: the batch size of the single-cell data used to train the VAE. The text of this tutorial originally used 512; the call below uses 256 to avoid memory issues.
- `optimizer`: the optimizer used to train the VAE; for the lung dataset we used Adam.
- `lr` and `lr_3`: the learning rates used to train the VAE in iteration 1 and after iteration 1, respectively.

In [7]:
# Reduced batch size to avoid memory issues and ensure num_workers=0
sdh.init_Phase1(225, 20, 32, [512, 128], 256, "Adam", 1.0e-4, 1e-4, 0)
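
To make `layer_dims` and `latent_size` concrete, the sketch below lists the weight-matrix shapes these settings would imply for a plain fully connected encoder. This is an assumption for illustration: the actual SIDISH architecture may differ, and `n_genes=2000` is a hypothetical input size, not the size of the lung dataset.

```python
def encoder_shapes(n_genes, layer_dims, latent_size):
    """Return (hidden layer shapes, latent head shape) as (in, out) pairs."""
    dims = [n_genes, *layer_dims]
    hidden = list(zip(dims[:-1], dims[1:]))
    head = (dims[-1], latent_size)  # shape of a mean (or log-variance) head
    return hidden, head


hidden, head = encoder_shapes(n_genes=2000, layer_dims=[512, 128], latent_size=32)
for n_in, n_out in hidden:
    print(f"hidden layer: {n_in} -> {n_out}")
print(f"latent head:  {head[0]} -> {head[1]}")
```

Under these assumptions, the decoder would mirror the encoder, mapping from 32 back up through 128 and 512 to the gene dimension.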

### **4.2 Initialize Phase 2 of SIDISH**
This step sets the hyperparameters for Phase 2 of SIDISH, in which a deep Cox regression model predicts patient survival risk from bulk RNA-seq profiles and survival outcomes. Transfer learning reuses the encoder from the single-cell VAE, allowing the Cox model to leverage high-resolution transcriptomic features while reducing redundancy in feature discovery. The arguments are:
- `epoch`: the number of epochs used to train the deep Cox regression model.
- `hidden`: the setting for the additional fully connected layers added to the encoder of the previously trained VAE; we used 128 for the lung cancer dataset.
- `lr`: the learning rate used to train the deep Cox regression model.
- `test_size`: the fraction of the data held out as a test set to evaluate the deep Cox regression; we used 20%.
- `batch_size`: the batch size of the bulk data used to train the regressor, which we set to 256.

In [8]:
sdh.init_Phase2(500, 128, 1e-4, 0, 0.2, 256)
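
For readers unfamiliar with Cox regression, the sketch below evaluates the standard Cox negative log partial likelihood on toy data. It is a textbook formulation written in plain Python, not SIDISH's own loss code; the function name and the toy values are illustrative assumptions.

```python
import math


def cox_neg_log_partial_likelihood(risk, time, event):
    """Average negative log partial likelihood over observed events.

    risk  : predicted log-risk score per patient
    time  : follow-up time per patient
    event : 1 if the event was observed, 0 if the patient was censored
    """
    total, n_events = 0.0, 0
    for i, observed in enumerate(event):
        if not observed:
            continue  # censored patients only contribute through risk sets
        # Risk set: every patient still under observation at time[i].
        denom = sum(math.exp(r) for r, t in zip(risk, time) if t >= time[i])
        total += risk[i] - math.log(denom)
        n_events += 1
    return -total / max(n_events, 1)


time = [1.0, 2.0, 3.0]
event = [1, 1, 1]
# Scores that rank the earliest death as highest risk fit the data better.
concordant = cox_neg_log_partial_likelihood([2.0, 1.0, 0.0], time, event)
discordant = cox_neg_log_partial_likelihood([0.0, 1.0, 2.0], time, event)
print(concordant < discordant)  # True
```

The loss is lower when patients who die earlier receive higher risk scores, which is the ordering a Cox model is trained to recover.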

## **Step 5: Start Training SIDISH**
To start training SIDISH, provide the number of iterations and the percentile threshold that defines how many High-Risk cells are identified. For the lung cancer dataset we set `iterations` to 5 and `percentile` to 0.95. Note that the higher the `percentile`, the fewer cells are labeled as High-Risk. The `steepness` parameter, set to 30, controls the steepness of the sigmoid function used to generate the gene weights for the weight matrix. Finally, provide the output directory for the files produced by training. This folder will contain the annotated adata file, which records whether each cell is High-Risk or Background, the saved final deep Cox regression model and VAE model, and the gene weight matrix from each iteration.

In [9]:
train_adata = sdh.train(5, 0.95, 30, "../data/LUNG/", distribution_fit='fitted')

########################################## Using Dense VAE ##########################################
########################################## ITERATION 1 OUT OF 5 ##########################################
[epoch 000]  average training loss: 1019.8280
[epoch 001]  average training loss: 919.7375
[epoch 002]  average training loss: 883.8756
[epoch 003]  average training loss: 870.8430
[epoch 004]  average training loss: 865.9922
[epoch 005]  average training loss: 864.0272
[epoch 006]  average training loss: 863.0421
[epoch 007]  average training loss: 862.0874
[epoch 008]  average training loss: 860.5962

100%|██████████| 500/500 [00:04<00:00, 107.11it/s]



########################################## Calculating Patients Weight Vector ##########################################
Gamma
Best Distribution: 
########################################## Calculating Cells Weight Matrix ##########################################
########################################## Saving Weight Matrix at Iteration 0 ##########################################
########################################## ITERATION 2 OUT OF 5 ##########################################
[epoch 000]  average training loss: 701.0020
[epoch 001]  average training loss: 700.5368
[epoch 002]  average training loss: 700.0226

100%|██████████| 500/500 [00:04<00:00, 111.10it/s]


########################################## Calculating Patients Weight Vector ##########################################
########################################## Calculating Cells Weight Matrix ##########################################
########################################## Saving Weight Matrix at Iteration 1 ##########################################
########################################## ITERATION 3 OUT OF 5 ##########################################
[epoch 000]  average training loss: 756.8352
[epoch 001]  average training loss: 756.3136
[epoch 002]  average training loss: 755.7399
[epoch 003]  average training loss: 755.7021
[epoch 004]  average training loss: 755.4863
[epoch 005]  average training loss: 755.3154
[epoch 006]  average training loss: 755.2035
[epoch 007]  average training loss: 755.1225
[epoch 008]  average training loss: 754.8130
[epoch 009]  average training loss: 754.7450
[epoch 010]  average training loss: 754.5552
[epoch 011]  average training loss: 7

100%|██████████| 500/500 [00:04<00:00, 105.94it/s]


########################################## Calculating Patients Weight Vector ##########################################
########################################## Calculating Cells Weight Matrix ##########################################




########################################## Saving Weight Matrix at Iteration 2 ##########################################
########################################## ITERATION 4 OUT OF 5 ##########################################
[epoch 000]  average training loss: 789.0911
[epoch 001]  average training loss: 788.4635
[epoch 002]  average training loss: 787.9519
[epoch 003]  average training loss: 787.8799
[epoch 004]  average training loss: 787.7612
[epoch 005]  average training loss: 787.6008
[epoch 006]  average training loss: 787.5745
[epoch 007]  average training loss: 787.4798
[epoch 008]  average training loss: 787.1837
[epoch 009]  average training loss: 787.1570
[epoch 010]  average training loss: 787.0674
[epoch 011]  average training loss: 786.9060
[epoch 012]  average training loss: 786.9176
[epoch 013]  average training loss: 786.7096
[epoch 014]  average training loss: 786.5987
[epoch 015]  average training loss: 786.4724
[epoch 016]  average training loss: 786.3992

100%|██████████| 500/500 [00:04<00:00, 104.00it/s]


########################################## Calculating Patients Weight Vector ##########################################
########################################## Calculating Cells Weight Matrix ##########################################




########################################## Saving Weight Matrix at Iteration 3 ##########################################
########################################## ITERATION 5 OUT OF 5 ##########################################
[epoch 000]  average training loss: 791.1621
[epoch 001]  average training loss: 790.6148
[epoch 002]  average training loss: 790.1531
[epoch 003]  average training loss: 790.0888
[epoch 004]  average training loss: 789.9973
[epoch 005]  average training loss: 789.8375
[epoch 006]  average training loss: 789.8441
[epoch 007]  average training loss: 789.7551
[epoch 008]  average training loss: 789.4952
[epoch 009]  average training loss: 789.4782
[epoch 010]  average training loss: 789.4271
[epoch 011]  average training loss: 789.2819
[epoch 012]  average training loss: 789.3175
[epoch 013]  average training loss: 789.1155
[epoch 014]  average training loss: 789.0182
[epoch 015]  average training loss: 788.8968
[epoch 016]  average training loss: 788.8596

100%|██████████| 500/500 [00:04<00:00, 113.78it/s]


########################################## Calculating Patients Weight Vector ##########################################
########################################## Calculating Cells Weight Matrix ##########################################




########################################## Saving Weight Matrix at Iteration 4 ##########################################
########################################## SIDISH TRAINING DONE ##########################################
########################################## Saving Final AnnData Object ##########################################
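
The effect of the `percentile` and `steepness` arguments passed to `train` can be illustrated with a small standalone sketch. It assumes that cells scoring above the percentile cutoff are labeled High-Risk and that weights come from a logistic sigmoid centered at a hypothetical `midpoint`; both are assumptions for illustration, and the scores are toy values, not SIDISH output.

```python
import math


def percentile_threshold(scores, percentile):
    """Linear-interpolation quantile of `scores` at `percentile` in [0, 1]."""
    ordered = sorted(scores)
    pos = percentile * (len(ordered) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return ordered[lo] * (1 - frac) + ordered[hi] * frac


def label_cells(scores, percentile):
    """Label cells strictly above the percentile cutoff as High-Risk."""
    cutoff = percentile_threshold(scores, percentile)
    return ["High-Risk" if s > cutoff else "Background" for s in scores]


def sigmoid(x, steepness, midpoint=0.5):
    """Logistic curve; larger steepness gives a sharper transition."""
    return 1.0 / (1.0 + math.exp(-steepness * (x - midpoint)))


scores = list(range(101))  # toy risk scores 0..100
for p in (0.50, 0.75):
    n_high = label_cells(scores, p).count("High-Risk")
    print(f"percentile={p}: {n_high} High-Risk cells")

for k in (5, 30):
    print(f"steepness={k}: sigmoid(0.6) = {sigmoid(0.6, k):.3f}")
```

Raising the percentile shrinks the High-Risk group, and a steepness of 30 pushes values just above the midpoint close to 1, whereas a small steepness leaves them near 0.5.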
