# **Tutorial 0: Preparing Data for SIDISH**

This tutorial outlines the steps required to prepare scRNA-seq, bulk RNA-seq, and survival data for use with the SIDISH framework. Proper data preparation is critical to ensuring accurate risk prediction and biomarker discovery. SIDISH integrates multiple data types to identify high-risk cell populations, making data consistency essential.

### In this tutorial, you will:
- Load scRNA-seq data in `.h5ad` format.  
- Align bulk RNA-seq data with genes present in the scRNA-seq dataset.  
- Format and merge survival data with bulk RNA-seq data.  
- Save the processed data for subsequent SIDISH training.  

### Outputs:
By the end of this tutorial, you will have generated two key files:  
- A processed `.h5ad` file containing scRNA-seq data.  
- A `.csv` file combining bulk RNA-seq expression data with survival information.  

These outputs are required for the SIDISH initialization and training steps.


## **Step 1: Import libraries**
In this step, we import the key libraries required for data preparation.

In [1]:
import scanpy as sc
import numpy as np
import pandas as pd
import scanpy.external as sce

## **Step 2: Data Preparation**
Proper data preparation is crucial to ensure the scRNA-seq, bulk RNA-seq, and survival data are compatible and correctly formatted for SIDISH. This step ensures all datasets share common genes, consistent sample identifiers, and uniform formatting.

### **2.1 Read single-cell data**
Here, we load the single-cell RNA-seq data in AnnData (`.h5ad`) format. This file includes:
- Expression matrix: `adata.X`
- Per-cell metadata: `adata.obs`
- Per-gene metadata: `adata.var`

Inspecting the loaded object confirms the number of cells and genes and shows which metadata columns are available before the single-cell data is integrated with the bulk data.

In [2]:
# Read single-cell RNA-seq data
adata = sc.read_h5ad("data/adata.h5ad")

In [3]:
adata

AnnData object with n_obs × n_vars = 4102 × 1208
    obs: 'cells', 'n_genes', 'scissors', 'scAB', 'degas', 'SIDISH_value', 'risk_value', 'SIDISH'
    var: 'gene_ids', 'n_cells'

### **2.2 Read bulk and survival data**

In this step, we load the bulk RNA-seq data (`bulk.csv`). SIDISH can only integrate genes that are present in both the bulk and scRNA-seq datasets, so the bulk data must be restricted to those shared genes, and the gene order must match across the two datasets for downstream analysis to be correct. The `inter` file (`inter.csv`) lists the genes shared between the two datasets to ensure data alignment. The cell below only loads the bulk file; the datasets themselves are aligned by `preprocess` in Step 3.


In [4]:
# Read bulk RNA-seq
bulk = pd.read_csv("data/bulk.csv", index_col=0, delimiter=",").T

In [5]:
# Preview the bulk RNA-seq data (left commented out because the table is too large to display)
# bulk.head()
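
The gene alignment described in this section can be sketched with pandas alone. The snippet below is a minimal illustration on toy data, not SIDISH's own implementation: `sc_genes` is a hypothetical stand-in for `adata.var_names`, and `toy_bulk` is a hypothetical stand-in for a bulk table with samples as rows and genes as columns. It keeps only the genes present in both datasets and orders the bulk columns to follow the single-cell gene order.

```python
import pandas as pd

# Toy stand-ins (assumptions for illustration only):
# sc_genes plays the role of adata.var_names, toy_bulk the role of the bulk table.
sc_genes = pd.Index(["AGRN", "DVL1", "MIB2", "SKI"])
toy_bulk = pd.DataFrame(
    {"SKI": [4, 8], "TP53": [9, 12], "AGRN": [10, 16], "DVL1": [1, 2]},
    index=["S1", "S2"],
)

# Genes present in both datasets, kept in single-cell order.
common_genes = [g for g in sc_genes if g in toy_bulk.columns]

# Subset and reorder the bulk columns so both datasets share one gene order.
bulk_aligned = toy_bulk.loc[:, common_genes]

print(common_genes)
print(bulk_aligned.shape)
```

Building the list from the single-cell side is a deliberate choice here: it makes the single-cell gene order the reference, so the bulk table is the only object that needs reordering.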

Survival data links bulk RNA-seq samples with clinical outcomes. Common fields are:
- Overall survival time (e.g., in `days`)
- Event status (1 if the patient passed away before last follow-up (`dead`), 0 otherwise (`alive`))

In this step, we correct the sample identifiers by replacing periods (`.`) with hyphens (`-`). This ensures that sample identifiers are consistent across the survival and bulk datasets. Correct sample matching is crucial for accurate risk prediction in SIDISH.

In [6]:
# Read the survival data
survival = pd.read_csv("data/sample_info.csv", index_col=0)

# Replace '.' with '-' in the sample names to match the sample names in the bulk RNA-seq data
# (regex=False treats '.' as a literal character, not as a regex wildcard)
survival.index = survival.index.astype(str).str.replace(".", "-", regex=False)

# Preview survival data
survival.head()

| | Overall_survival_days | Sample_Status |
|---|---|---|
| TCGA-97-7938-01A | 18.0 | 1 |
| TCGA-55-7574-01A | 995.0 | 1 |
| TCGA-05-4250-01A | 121.0 | 1 |
| TCGA-55-6979-11A | 237.0 | 1 |
| TCGA-95-A4VK-01A | 651.0 | 0 |
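
Before merging, it can help to check that the corrected identifiers actually match the bulk samples. The following is a minimal sketch on toy data, assuming the survival table is indexed by sample ID; `toy_survival` and `toy_bulk_samples` are hypothetical stand-ins for `survival` and `bulk.index`. It normalizes the identifiers and reports which samples appear in only one of the two tables.

```python
import pandas as pd

# Toy stand-ins (assumptions for illustration only).
toy_survival = pd.DataFrame(
    {"Overall_survival_days": [18.0, 995.0, 121.0], "Sample_Status": [1, 1, 1]},
    index=["TCGA.97.7938.01A", "TCGA.55.7574.01A", "TCGA.05.4250.01A"],
)
toy_bulk_samples = pd.Index(
    ["TCGA-97-7938-01A", "TCGA-55-7574-01A", "TCGA-44-3396-01A"]
)

# Literal (non-regex) replacement of '.' with '-'.
toy_survival.index = toy_survival.index.str.replace(".", "-", regex=False)

# Samples found in both tables, and samples found in only one of them.
shared = toy_survival.index.intersection(toy_bulk_samples)
only_survival = toy_survival.index.difference(toy_bulk_samples)
only_bulk = toy_bulk_samples.difference(toy_survival.index)

print(f"{len(shared)} shared, {len(only_survival)} survival-only, {len(only_bulk)} bulk-only")
```

Samples that appear in only one table would be silently dropped by an inner merge, so listing them up front makes any loss of patients visible.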


## **Step 3: Load the SIDISH library and preprocess the data**
The `preprocess` function in SIDISH prepares the single-cell and bulk data by filtering, normalizing, and aligning them. Its behavior depends on the `processed` argument:

- `processed=False`: the function performs full QC on the AnnData object. It removes low-quality cells and genes, filters on mitochondrial percentage, normalizes counts, applies `log1p`, and selects highly variable genes (HVGs).
- `processed=True`: the function assumes that both the AnnData object and the bulk dataset have already been subsetted to HVGs. It skips preprocessing and only aligns the datasets and merges the survival data.

The resulting `subset` object contains HVG-filtered count data that can be used directly as input to a VAE. The corresponding `bulk` DataFrame merges survival information with bulk expression counts, with the survival fields standardized into `duration` and `event` columns.

Optional batch correction can also be applied: Harmony integrates the single-cell data for visualization, and ComBat corrects gene expression values to account for inter-patient diversity, which makes downstream modeling more robust.

Together, these operations harmonize the single-cell and bulk data so that they are ready to run on the SIDISH framework.
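
To make the normalization and gene-selection steps concrete, here is a minimal sketch in plain NumPy on a toy count matrix. It is a simplified stand-in, not SIDISH's or scanpy's implementation: genes are ranked by the variance of their log-normalized values, which is cruder than dedicated HVG methods, and `target_sum` and `n_top` are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy count matrix: 6 cells x 5 genes (assumption for illustration only).
counts = rng.poisson(lam=[1.0, 5.0, 20.0, 0.5, 50.0], size=(6, 5)).astype(float)

# 1) Library-size normalization: scale every cell to the same total count.
target_sum = 1e4  # illustrative choice
lib_size = counts.sum(axis=1, keepdims=True)
normalized = counts / lib_size * target_sum

# 2) Log-transform with log1p to compress the dynamic range.
logged = np.log1p(normalized)

# 3) Keep the n_top genes with the highest variance across cells,
#    preserving their original column order.
n_top = 3  # illustrative choice
gene_var = logged.var(axis=0)
top_idx = np.sort(np.argsort(gene_var)[::-1][:n_top])

hvg_matrix = logged[:, top_idx]
print(hvg_matrix.shape)
```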


In [7]:
from SIDISH import SIDISH as sidish
from SIDISH.SIDISH import preprocess


In [8]:
# processed=True: skip QC and HVG selection; only align the datasets and merge survival data
adata, bulk_merged = preprocess(
    adata,
    bulk,
    survival,
    patient_id="Sample",
    celltype_name="celltype_major",
    processed=True,
)

In [9]:
bulk_merged

| | duration | event | AGRN | DVL1 | MIB2 | SKI | PEX10 | TNFRSF14 | WRAP73 | ICMT | ... | CBR1 | ETS2 | BRWD1 | BACE2 | TFF3 | PDXK | CSTB | FAM207A | MT-ATP6 | MT-ND3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TCGA-97-7938-01A | 18.0 | 0 | 10013.0 | 1254.0 | 551.0 | 4424.0 | 413.0 | 1190.0 | 513.0 | 5030.0 | ... | 2940.0 | 10198.0 | 2741.0 | 1196.0 | 80.0 | 5439.0 | 6896.0 | 153.0 | 68295.0 | 20315.0 |
| TCGA-55-7574-01A | 995.0 | 0 | 16625.0 | 2748.0 | 1497.0 | 4463.0 | 382.0 | 2841.0 | 814.0 | 2564.0 | ... | 1472.0 | 4595.0 | 1646.0 | 1350.0 | 64.0 | 3836.0 | 6004.0 | 235.0 | 61087.0 | 23484.0 |
| TCGA-05-4250-01A | 121.0 | 0 | 20111.0 | 3721.0 | 546.0 | 4713.0 | 1193.0 | 2757.0 | 1072.0 | 6030.0 | ... | 4182.0 | 5526.0 | 2522.0 | 3789.0 | 193.0 | 9205.0 | 29827.0 | 1316.0 | 70918.0 | 26228.0 |
| TCGA-55-6979-11A | 237.0 | 0 | 7902.0 | 1231.0 | 471.0 | 2945.0 | 281.0 | 1747.0 | 400.0 | 1605.0 | ... | 1078.0 | 8474.0 | 1435.0 | 991.0 | 86.0 | 5205.0 | 7090.0 | 137.0 | 76060.0 | 37142.0 |
| TCGA-95-A4VK-01A | 651.0 | 0 | 19830.0 | 4281.0 | 2140.0 | 6593.0 | 941.0 | 3282.0 | 1950.0 | 5519.0 | ... | 1714.0 | 12088.0 | 2633.0 | 4862.0 | 285.0 | 15984.0 | 13002.0 | 523.0 | 113113.0 | 77968.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| TCGA-97-7937-01A | 564.0 | 0 | 28116.0 | 4250.0 | 1582.0 | 10486.0 | 1061.0 | 2072.0 | 1863.0 | 10609.0 | ... | 2046.0 | 3913.0 | 4453.0 | 599.0 | 98.0 | 18815.0 | 20605.0 | 531.0 | 327230.0 | 93486.0 |
| TCGA-05-4398-01A | 1431.0 | 0 | 58358.0 | 7895.0 | 4075.0 | 9791.0 | 2809.0 | 7142.0 | 3477.0 | 12634.0 | ... | 4116.0 | 12755.0 | 4308.0 | 5507.0 | 221.0 | 14698.0 | 34441.0 | 1378.0 | 190688.0 | 36879.0 |
| TCGA-50-6592-01A | 777.0 | 0 | 10793.0 | 2948.0 | 798.0 | 4142.0 | 972.0 | 3297.0 | 1933.0 | 5014.0 | ... | 5753.0 | 6592.0 | 2020.0 | 2715.0 | 28.0 | 16138.0 | 55732.0 | 430.0 | 58361.0 | 16460.0 |
| TCGA-44-3396-01A | 1130.0 | 0 | 52409.0 | 5447.0 | 2284.0 | 10442.0 | 1941.0 | 7465.0 | 3342.0 | 13050.0 | ... | 6764.0 | 13355.0 | 3650.0 | 7218.0 | 1854.0 | 10535.0 | 30239.0 | 1084.0 | 304261.0 | 90051.0 |
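
The layout of the merged table (survival fields first, then one column per gene) can be reproduced with a plain pandas join. This is a minimal sketch on toy data, assuming both tables are indexed by sample ID; it illustrates the shape of the result and is not how `preprocess` is implemented internally. The names `toy_bulk`, `toy_survival`, and `clinical` are hypothetical.

```python
import pandas as pd

# Toy stand-ins (assumptions for illustration only).
toy_bulk = pd.DataFrame(
    {"AGRN": [10.0, 20.0, 30.0], "DVL1": [1.0, 2.0, 3.0]},
    index=["S1", "S2", "S3"],
)
toy_survival = pd.DataFrame(
    {"Overall_survival_days": [18.0, 995.0], "Sample_Status": [1, 0]},
    index=["S1", "S2"],
)

# Rename the clinical columns to the standardized names.
clinical = toy_survival.rename(
    columns={"Overall_survival_days": "duration", "Sample_Status": "event"}
)

# Inner join on the sample index keeps only samples present in both tables.
merged = clinical.join(toy_bulk, how="inner")

print(merged)
```

After running `preprocess` on real data, a quick `bulk_merged["event"].value_counts()` is a useful sanity check that both outcomes (event and censored) are represented as expected.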


## **Step 4: Saving Processed Data for SIDISH**
Saving the processed data makes the cleaned and formatted data readily available for the next tutorial (Tutorial 1). The single-cell RNA-seq data is stored in `.h5ad` format, while the combined bulk and survival data is saved as a `.csv` file.

In [None]:
# Save single-cell data
adata.write_h5ad("data/processed_adata.h5ad")

# Save bulk RNA-seq and survival data
bulk_merged.to_csv("data/processed_bulk.csv")
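
A saved file is only useful if it reloads in the expected shape. The sketch below shows a round-trip check for the CSV output using a toy table and a temporary directory; `toy_merged` is a hypothetical stand-in for `bulk_merged`, and the same pattern could be applied to `data/processed_bulk.csv`.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for `bulk_merged` (assumption for illustration only).
toy_merged = pd.DataFrame(
    {"duration": [18.0, 995.0], "event": [1, 0], "AGRN": [10.0, 20.0]},
    index=["S1", "S2"],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "processed_bulk.csv")
    toy_merged.to_csv(path)

    # index_col=0 restores the sample identifiers as the index.
    reloaded = pd.read_csv(path, index_col=0)

# The survival fields come first, followed by the gene columns.
print(list(reloaded.columns[:2]))
print(reloaded.shape == toy_merged.shape)
```

Reading back with `index_col=0` matters: without it, the sample identifiers would come back as an ordinary column rather than as the index.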