# **Tutorial 0: Preparing Data for SIDISH**

This tutorial demonstrates how to preprocess single-cell and bulk RNA-seq data, along with survival information, for use in **SIDISH**.

### **Introduction**
SIDISH (Semi-supervised Iterative Deep learning for Identifying Single-cell High-risk populations) is a framework for identifying high-risk cell populations and potential biomarkers from single-cell and bulk RNA-seq data. This tutorial covers the data preparation that precedes training: loading the single-cell data, aligning the bulk data to the same genes, attaching survival information, and saving the results for Tutorial 1.


## **Step 1: Import libraries**


In [1]:
import scanpy as sc
import numpy as np
import pandas as pd

## **Step 2: Data Preparation**

SIDISH takes two inputs: single-cell RNA-seq data and bulk RNA-seq data with matching survival information. In this step, we prepare both so that they are compatible. We'll verify that the same genes are used in both datasets and that survival information is correctly matched to each bulk sample.

### **2.1 Read single-cell data**
Here, we load the single-cell RNA-seq data in AnnData (`.h5ad`) format. This file includes:
- Expression matrix: `adata.X`
- Per-cell metadata: `adata.obs`
- Per-gene metadata: `adata.var`

Inspecting these three components after loading confirms the number of cells and genes and shows which gene names the bulk data will need to match.

In [2]:
# Read single-cell RNA-seq data
adata = sc.read_h5ad("../../DATA/adata.h5ad")

In [3]:
adata

AnnData object with n_obs × n_vars = 4102 × 1208
    obs: 'cells', 'n_genes', 'scissors', 'scAB', 'degas', 'SIDISH_value', 'risk_value', 'SIDISH'
    var: 'gene_ids', 'n_cells'

In [4]:
adata.obs

| | cells | n_genes | scissors | scAB | degas | SIDISH_value | risk_value | SIDISH |
|---|---|---|---|---|---|---|---|---|
| AAACCTGAGACTACAA_15 | AAACCTGAGACTACAA_15 | 313 | b | b | b | 0 | 0.012134 | b |
| AAACCTGAGACTTGAA_20 | AAACCTGAGACTTGAA_20 | 1386 | b | b | b | 0 | -0.545835 | b |
| AAACCTGAGCGCCTTG_15 | AAACCTGAGCGCCTTG_15 | 1728 | b | b | h | 0 | 0.490999 | b |
| AAACCTGCAGCTATTG_19 | AAACCTGCAGCTATTG_19 | 823 | b | b | b | 0 | 0.168064 | b |
| AAACCTGCAGTAAGCG_15 | AAACCTGCAGTAAGCG_15 | 5359 | b | h | b | 0 | -0.047996 | b |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| TTTGTCAAGATGTTAG_15 | TTTGTCAAGATGTTAG_15 | 362 | b | b | b | 0 | 0.182129 | b |
| TTTGTCACACTCTGTC_19 | TTTGTCACACTCTGTC_19 | 335 | b | b | h | 0 | -0.094352 | b |
| TTTGTCACATTGAGCT_14 | TTTGTCACATTGAGCT_14 | 650 | b | h | b | 0 | -0.310603 | b |
| TTTGTCAGTAAGGGAA_13 | TTTGTCAGTAAGGGAA_13 | 711 | b | b | b | 0 | -0.472073 | b |


In [5]:
adata.var

| | gene_ids | n_cells |
|---|---|---|
| AGRN | AGRN | 1061 |
| DVL1 | DVL1 | 602 |
| MIB2 | MIB2 | 458 |
| SKI | SKI | 458 |
| PEX10 | PEX10 | 495 |
| ... | ... | ... |
| PDXK | PDXK | 736 |
| CSTB | CSTB | 2635 |
| FAM207A | FAM207A | 708 |
| MT-ATP6 | MT-ATP6 | 2811 |


In [6]:
adata.X

array([[0., 0., 0., ..., 0., 2., 3.],
       [0., 0., 0., ..., 0., 3., 1.],
       [2., 0., 0., ..., 0., 5., 9.],
       ...,
       [4., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 4., 3.],
       [0., 0., 0., ..., 0., 2., 0.]], dtype=float32)
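The printed `adata.X` looks like raw counts stored as `float32`. A few quick checks can confirm that before moving on. The snippet below is a minimal sketch on a toy array, not part of SIDISH or scanpy: `summarize_counts` is a hypothetical helper, and the toy values only mimic the output above. With the real data, one could pass `adata.X` (converted to a dense array first if it is sparse) and `adata.var_names`.

```python
import numpy as np
import pandas as pd


def summarize_counts(X, gene_names):
    """Return basic sanity checks for a cells x genes expression matrix.

    Hypothetical helper for illustration; not part of SIDISH or scanpy.
    """
    X = np.asarray(X)
    gene_names = pd.Index(gene_names)
    return {
        "n_cells": X.shape[0],
        "n_genes": X.shape[1],
        # One gene name per column is needed to align with the bulk data later
        "names_match_columns": len(gene_names) == X.shape[1],
        "unique_gene_names": bool(gene_names.is_unique),
        # Raw counts are non-negative whole numbers, even when stored as floats
        "non_negative": bool((X >= 0).all()),
        "integer_valued": bool(np.equal(np.mod(X, 1), 0).all()),
    }


# Toy stand-in for adata.X and adata.var_names
toy_X = np.array([[0.0, 2.0, 3.0], [2.0, 5.0, 9.0]], dtype=np.float32)
toy_genes = ["AGRN", "CSTB", "MT-ATP6"]

summary = summarize_counts(toy_X, toy_genes)
print(summary)
```

If `integer_valued` came back `False`, the matrix has probably been normalized or log-transformed already, which is worth knowing before pairing it with bulk counts.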

### **2.2 Read bulk and survival data**

In this section, we load the bulk RNA-seq dataset (`bulk.csv`), transpose it so that samples are rows and genes are columns, and filter it by the genes used in our single-cell dataset. We also read a file of common genes (`inter.csv`) to ensure both single-cell and bulk datasets share the same gene set.

By maintaining a consistent gene order and filtering out unmatched genes, we prepare the bulk data for integration with the single-cell data in SIDISH.


In [7]:
# Read bulk RNA-seq
bulk = pd.read_csv("../../DATA/bulk.csv", index_col=0, delimiter=",").T

# Read genes that are common between single-cell RNA-seq and bulk RNA-seq data
inter = pd.read_csv("../../DATA/inter.csv").iloc[:, 1:].values.reshape(1, -1)[0]

# Filter the bulk RNA-seq data to include only the common genes
bulk = bulk.filter(items=adata.to_df().columns.values)

# Arrange the genes in the same order in both the single-cell RNA-seq and bulk RNA-seq data
bulk = bulk[adata.to_df().columns.values]

In [8]:
# Preview the bulk RNA-seq data
bulk.head()

| | AGRN | DVL1 | MIB2 | SKI | PEX10 | TNFRSF14 | WRAP73 | ICMT | ACOT7 | DNAJC11 | ... | CBR1 | ETS2 | BRWD1 | BACE2 | TFF3 | PDXK | CSTB | FAM207A | MT-ATP6 | MT-ND3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TCGA-97-7938-01A | 10013.0 | 1254.0 | 551.0 | 4424.0 | 413.0 | 1190.0 | 513.0 | 5030.0 | 473.0 | 1158.0 | ... | 2940.0 | 10198.0 | 2741.0 | 1196.0 | 80.0 | 5439.0 | 6896.0 | 153.0 | 68295.0 | 20315.0 |
| TCGA-55-7574-01A | 16625.0 | 2748.0 | 1497.0 | 4463.0 | 382.0 | 2841.0 | 814.0 | 2564.0 | 928.0 | 1086.0 | ... | 1472.0 | 4595.0 | 1646.0 | 1350.0 | 64.0 | 3836.0 | 6004.0 | 235.0 | 61087.0 | 23484.0 |
| TCGA-05-4250-01A | 20111.0 | 3721.0 | 546.0 | 4713.0 | 1193.0 | 2757.0 | 1072.0 | 6030.0 | 2942.0 | 1781.0 | ... | 4182.0 | 5526.0 | 2522.0 | 3789.0 | 193.0 | 9205.0 | 29827.0 | 1316.0 | 70918.0 | 26228.0 |
| TCGA-55-6979-11A | 7902.0 | 1231.0 | 471.0 | 2945.0 | 281.0 | 1747.0 | 400.0 | 1605.0 | 971.0 | 488.0 | ... | 1078.0 | 8474.0 | 1435.0 | 991.0 | 86.0 | 5205.0 | 7090.0 | 137.0 | 76060.0 | 37142.0 |
| TCGA-95-A4VK-01A | 19830.0 | 4281.0 | 2140.0 | 6593.0 | 941.0 | 3282.0 | 1950.0 | 5519.0 | 984.0 | 1983.0 | ... | 1714.0 | 12088.0 | 2633.0 | 4862.0 | 285.0 | 15984.0 | 13002.0 | 523.0 | 113113.0 | 77968.0 |
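One detail of the filtering step is easy to miss: `DataFrame.filter(items=...)` silently skips names that are absent from the frame, whereas selecting columns with a list (`bulk[genes]`) raises a `KeyError` if any gene is missing. The sketch below, on toy data, reports the missing genes explicitly before reordering. `align_bulk_to_genes` is a hypothetical helper rather than a SIDISH function, and the toy gene lists and the `TP53` values are made up for illustration.

```python
import pandas as pd


def align_bulk_to_genes(bulk, genes):
    """Subset and reorder bulk columns to `genes`, reporting any missing genes.

    Hypothetical helper for illustration; not part of SIDISH.
    """
    genes = pd.Index(genes)
    # Genes requested by the single-cell data but absent from the bulk data
    missing = genes.difference(bulk.columns)
    # Keep only shared genes, preserving the single-cell gene order
    shared = genes[genes.isin(bulk.columns)]
    return bulk.loc[:, shared], list(missing)


# Toy bulk matrix: samples as rows, genes as columns
toy_bulk = pd.DataFrame(
    {"SKI": [4424.0, 4463.0], "AGRN": [10013.0, 16625.0], "TP53": [1.0, 2.0]},
    index=["TCGA-97-7938-01A", "TCGA-55-7574-01A"],
)
toy_sc_genes = ["AGRN", "DVL1", "SKI"]  # DVL1 is deliberately absent from toy_bulk

aligned, missing = align_bulk_to_genes(toy_bulk, toy_sc_genes)
print(aligned.columns.tolist())
print(missing)
```

With the real data, the equivalent call would pass `bulk` and `adata.var_names`. An empty `missing` list means the two datasets share the full single-cell gene set.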


We have a separate CSV file that contains survival information for the bulk RNA-seq samples. Common fields are:
- Overall survival time (e.g., in days)
- Event status (1 if the patient passed away before last follow-up, 0 otherwise)

The sample identifiers must match those in the bulk RNA-seq DataFrame exactly, so we replace `.` with `-` in the survival index.

In [9]:
# Read the survival data
survival = pd.read_csv("../../DATA/sample_info.csv", index_col=0)

# Replace '.' with '-' in the sample names to match the sample names in the bulk RNA-seq data
survival_index = np.char.replace(np.array(survival.index, dtype=str), ".", "-")
survival.index = survival_index

# Preview the survival data
survival.head()

| | Overall_survival_days | Sample_Status |
|---|---|---|
| TCGA-97-7938-01A | 18.0 | 1 |
| TCGA-55-7574-01A | 995.0 | 1 |
| TCGA-05-4250-01A | 121.0 | 1 |
| TCGA-55-6979-11A | 237.0 | 1 |
| TCGA-95-A4VK-01A | 651.0 | 0 |
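After normalizing the identifiers, it is worth confirming that the two tables actually refer to the same samples. The following sketch uses toy data: the dotted identifiers are an assumed input form, and the bulk ID list is shortened for illustration. It does the same replacement with the pandas string accessor and then lists the samples found in only one table.

```python
import pandas as pd

# Toy survival table whose IDs use '.' instead of '-'
toy_survival = pd.DataFrame(
    {"Overall_survival_days": [18.0, 995.0, 121.0], "Sample_Status": [1, 1, 1]},
    index=["TCGA.97.7938.01A", "TCGA.55.7574.01A", "TCGA.05.4250.01A"],
)
# Toy stand-in for bulk.index
toy_bulk_ids = pd.Index(["TCGA-97-7938-01A", "TCGA-55-7574-01A", "TCGA-55-6979-11A"])

# regex=False treats "." as a literal dot rather than "any character"
toy_survival.index = toy_survival.index.str.replace(".", "-", regex=False)

# Samples that appear in only one of the two tables
only_in_survival = toy_survival.index.difference(toy_bulk_ids).tolist()
only_in_bulk = toy_bulk_ids.difference(toy_survival.index).tolist()
print(only_in_survival)
print(only_in_bulk)
```

Both lists being empty indicates a one-to-one match between survival records and bulk samples.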


We now merge the survival information (duration, event) with the bulk RNA-seq expression matrix based on matching sample identifiers (row indices). After merging, each row will have the two survival columns plus a large set of gene expression columns.

In [10]:
# Rename the columns to 'duration' and 'event' to match the input format of the survival data
survival.rename(columns={"Overall_survival_days": "duration", "Sample_Status": "event"}, inplace=True)

# Merge the survival data with the bulk RNA-seq data by sample names
bulk_merged = pd.concat([survival, bulk], axis=1)

# Preview the merged data
bulk_merged.head()

| | duration | event | AGRN | DVL1 | MIB2 | SKI | PEX10 | TNFRSF14 | WRAP73 | ICMT | ... | CBR1 | ETS2 | BRWD1 | BACE2 | TFF3 | PDXK | CSTB | FAM207A | MT-ATP6 | MT-ND3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TCGA-97-7938-01A | 18.0 | 1 | 10013.0 | 1254.0 | 551.0 | 4424.0 | 413.0 | 1190.0 | 513.0 | 5030.0 | ... | 2940.0 | 10198.0 | 2741.0 | 1196.0 | 80.0 | 5439.0 | 6896.0 | 153.0 | 68295.0 | 20315.0 |
| TCGA-55-7574-01A | 995.0 | 1 | 16625.0 | 2748.0 | 1497.0 | 4463.0 | 382.0 | 2841.0 | 814.0 | 2564.0 | ... | 1472.0 | 4595.0 | 1646.0 | 1350.0 | 64.0 | 3836.0 | 6004.0 | 235.0 | 61087.0 | 23484.0 |
| TCGA-05-4250-01A | 121.0 | 1 | 20111.0 | 3721.0 | 546.0 | 4713.0 | 1193.0 | 2757.0 | 1072.0 | 6030.0 | ... | 4182.0 | 5526.0 | 2522.0 | 3789.0 | 193.0 | 9205.0 | 29827.0 | 1316.0 | 70918.0 | 26228.0 |
| TCGA-55-6979-11A | 237.0 | 1 | 7902.0 | 1231.0 | 471.0 | 2945.0 | 281.0 | 1747.0 | 400.0 | 1605.0 | ... | 1078.0 | 8474.0 | 1435.0 | 991.0 | 86.0 | 5205.0 | 7090.0 | 137.0 | 76060.0 | 37142.0 |
| TCGA-95-A4VK-01A | 651.0 | 0 | 19830.0 | 4281.0 | 2140.0 | 6593.0 | 941.0 | 3282.0 | 1950.0 | 5519.0 | ... | 1714.0 | 12088.0 | 2633.0 | 4862.0 | 285.0 | 15984.0 | 13002.0 | 523.0 | 113113.0 | 77968.0 |
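Note that `pd.concat(..., axis=1)` aligns rows by index and, by default, keeps the union of sample identifiers. Any sample present in only one table therefore appears in the result with `NaN` values instead of being dropped. The toy sketch below contrasts the default with `join="inner"`. The small frames are illustrative, and whether unmatched samples should be dropped or investigated depends on the dataset.

```python
import pandas as pd

toy_survival = pd.DataFrame(
    {"duration": [18.0, 995.0], "event": [1, 1]},
    index=["TCGA-97-7938-01A", "TCGA-55-7574-01A"],
)
toy_bulk = pd.DataFrame(
    {"AGRN": [10013.0, 7902.0], "DVL1": [1254.0, 1231.0]},
    index=["TCGA-97-7938-01A", "TCGA-55-6979-11A"],
)

# Default behaviour: union of sample IDs, with NaN where one table has no row
outer = pd.concat([toy_survival, toy_bulk], axis=1)

# join="inner" keeps only samples present in both tables
inner = pd.concat([toy_survival, toy_bulk], axis=1, join="inner")

# Count the rows that contain at least one missing value
n_incomplete = int(outer.isna().any(axis=1).sum())
print(outer.shape, n_incomplete)
print(inner.shape, inner.index.tolist())
```

On the real data, `bulk_merged[["duration", "event"]].isna().sum()` is a quick way to see whether any bulk sample lacks a survival record.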


## **Step 3: Saving Processed Data for SIDISH**
Let's save the processed single-cell and bulk data for use in Tutorial 1.

In [11]:
# Save single-cell data
adata.write_h5ad("../../DATA/processed_adata.h5ad")

# Save bulk RNA-seq and survival data
bulk_merged.to_csv("../../DATA/processed_bulk.csv")
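When the bulk file is read back later, the sample identifiers need to be restored as the row index. The sketch below is a round trip on a toy frame written to a temporary directory. The file name mirrors the one above, but the path and data are purely illustrative. It shows that `index_col=0` recovers the index and that `duration` and `event` remain the first two columns.

```python
import os
import tempfile

import pandas as pd

toy_merged = pd.DataFrame(
    {"duration": [18.0, 995.0], "event": [1, 1], "AGRN": [10013.0, 16625.0]},
    index=["TCGA-97-7938-01A", "TCGA-55-7574-01A"],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "processed_bulk.csv")
    toy_merged.to_csv(path)
    # index_col=0 restores the sample IDs as the row index
    reloaded = pd.read_csv(path, index_col=0)

# Compare labels rather than dtypes, which can differ after a CSV round trip
index_ok = reloaded.index.tolist() == toy_merged.index.tolist()
columns_ok = reloaded.columns.tolist() == toy_merged.columns.tolist()
print(index_ok, columns_ok)
print(reloaded.columns[:2].tolist())
```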