<center> Advanced Integration and Annotation of scRNA-seq Data Using scVI: Hyperparameter Tuning, Label Transfer, and Custom Reference Creation </center>

# Introduction

In this notebook, 10x Genomics data that has already been corrected with Harmony is loaded. scVI, a widely used integration method for scRNA-seq data, is tested here. In addition, scVI and scANVI are used to transfer labels from the CZI scRNA-seq atlas, which contains six RNA-seq samples.

The procedure has two parts: label transfer first, then integration.

First, the data are loaded. Readers interested in alternative label transfer methods can use a `celltypist` model; this notebook focuses on atlas-based label transfer. [GitHub reference](https://github.com/mousepixels/sanbomics_scripts/blob/main/sc2024/annotation_integration.ipynb)


### Label Transfer

In [1]:
import warnings
warnings.simplefilter("ignore", FutureWarning)
warnings.simplefilter("ignore", UserWarning)
warnings.simplefilter("ignore", RuntimeWarning)

In [2]:
import scanpy as sc
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scvi
import torch
#import celltypist
#from celltypist import models
from scvi.autotune import ModelTuner
from ray import tune
import ray

In [3]:
torch.set_float32_matmul_precision("high")

In [4]:
print("scvi-tools version:", scvi.__version__)

scvi-tools version: 1.1.6.post2


In [5]:
print("CUDA available:", torch.cuda.is_available())

CUDA available: True


In [6]:
pathout = "/data/kanferg/Sptial_Omics/SpatialOmicsToolkit/out_4"

#### Reference from CZI:

In [7]:
ref_name = "/data/kanferg/Sptial_Omics/playGround/Data/Breast_Cancer/ref/breast_cancer_sc_czi.h5ad"
adata_ref_init = sc.read_h5ad(ref_name)
# select all the breast samples
adata_ref_tisueRM = adata_ref_init[adata_ref_init.obs['tissue']=='breast',:].copy()
# select all the protein_coding
adata_ref = adata_ref_tisueRM[:,adata_ref_tisueRM.var['feature_type']=='protein_coding'].copy()
adata_ref.var.index = adata_ref.var["feature_name"].astype(str).values
adata_ref

AnnData object with n_obs × n_vars = 34164 × 18626
    obs: 'condition', 'replicate', 'nCount_RNA', 'nFeature_RNA', 'percent.mito', 'RNA_snn_res.0.8', 'seurat_clusters', 'labels_score', 'Order', 'Lane', 'Index', 'cancer', 'reference', 'flowcell', 'min_umis', 'min_genes', 'percent_mito', 'expected_cells', 'total_droplets', 'z_dim', 'z_layers', 'channel_id', 'labels_cl_unif_per_channel', 'filt_median_genes', 'filt_median_umi', 'pass', 'ccpm_id', 'htapp', 'sequenced', 'stage_at_diagnosis', 'metastatic_presentation', 'biopsy_days_after_metastasis', 'ER_primary', 'ER_biopsy', 'PR_primary', 'PR_biopsy', 'HER2_primary', 'HER2_biopsy', 'receptors_primary', 'receptors_biopsy', 'site_biopsy', 'histology_breast', 'histology_biopsy', 'sampleid', 'cnv_cors', 'cnv_cors_max', 'cnv_score', 'cnv_ref_score', 'cnv_score_norm', 'cnv_score_norm_norm', 'cnv_condition', 'cnv_score_norm_norm2', 'pam50_Basal_single', 'pam50_Her2_single', 'pam50_LumA_single', 'pam50_LumB_single', 'pam50_Normal_single', 'pam50_m

In [8]:
from anndata import AnnData
rdata = AnnData(
    adata_ref.X,
    obs={"CellType": adata_ref.obs["cell_type"].values,
         "nCount_RNA": adata_ref.obs["nCount_RNA"].values,
         "percent_mito": adata_ref.obs["percent_mito"]},
    var={"n_cells": adata_ref.var["n_cells"].values,
         "feature_name": adata_ref.var["feature_name"].astype(str).values},
)
rdata.var.index = rdata.var["feature_name"].values
rdata

AnnData object with n_obs × n_vars = 34164 × 18626
    obs: 'CellType', 'nCount_RNA', 'percent_mito'
    var: 'n_cells', 'feature_name'

In [9]:
rdata.obs.groupby('CellType').size()

CellType
fibroblast                               2797
blood vessel endothelial cell            2816
T cell                                   1274
adipocyte                                 595
chondrocyte                                 1
macrophage                                639
plasma cell                                31
mature NK T cell                           25
malignant cell                          25685
blood vessel smooth muscle cell           300
endothelial cell of hepatic sinusoid        1
dtype: int64

In [10]:
remove_cell = ['endothelial cell of hepatic sinusoid','chondrocyte']
rdata = rdata[~rdata.obs['CellType'].isin(remove_cell), :]
rdata.obs.groupby('CellType').size()

CellType
fibroblast                          2797
blood vessel endothelial cell       2816
T cell                              1274
adipocyte                            595
macrophage                           639
plasma cell                           31
mature NK T cell                      25
malignant cell                     25685
blood vessel smooth muscle cell      300
dtype: int64
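
The two labels removed above each contain a single cell. As an alternative to hard-coding their names, a count threshold could select them. The sketch below is only an illustration on a toy pandas Series: the helper name `rare_labels`, the `min_cells` cutoff, and the toy counts are assumptions and do not come from the original notebook.

```python
import pandas as pd

def rare_labels(labels: pd.Series, min_cells: int = 10) -> list:
    """Return labels observed in fewer than `min_cells` cells (hypothetical helper)."""
    counts = labels.value_counts()
    return sorted(counts[counts < min_cells].index)

# Toy stand-in for rdata.obs['CellType']; the counts are made up for illustration.
toy = pd.Series(
    ["fibroblast"] * 12 + ["T cell"] * 15 + ["chondrocyte"] * 1,
    name="CellType",
)

to_drop = rare_labels(toy, min_cells=10)   # labels below the cutoff
kept = toy[~toy.isin(to_drop)]             # same masking pattern as in the cell above
print(to_drop)
print(kept.value_counts().to_dict())
```

With real data, the list returned by such a helper could take the place of `remove_cell`; the right cutoff depends on how many cells per label the classifier needs.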

In [11]:
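# Drop cells without a label; copy so later column assignments act on a real object rather than a view
rdata = rdata[~rdata.obs.CellType.isna()].copy()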

In [12]:
# Both the reference and the query need raw counts, a cell-type column, and a batch column
rdata.obs['batch'] = 'ref'
rdata.obs['sample'] = 'refrance'


After preparing the reference data for label transfer, make sure the breast cancer (query) data also has batch and sample keys holding the relevant batch and sample identifiers. Additionally, add a CellType key and set it to ‘Unknown’. scANVI treats this value as the unlabeled category and predicts the cell types of these cells from the reference data.

> **Note:** For scVI, use raw (unnormalized) and unlogged counts.
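
One quick way to sanity-check the note above is to test whether a matrix holds non-negative, integer-valued entries. The following is a minimal sketch on toy NumPy arrays; the helper `looks_like_raw_counts` and its `n_check` parameter are hypothetical names introduced here, and the check is a heuristic rather than a guarantee.

```python
import numpy as np

def looks_like_raw_counts(x, n_check: int = 1000) -> bool:
    """Heuristic: the first `n_check` rows are non-negative and integer-valued."""
    sample = np.asarray(x[:n_check], dtype=float)
    return bool(np.all(sample >= 0) and np.allclose(sample, np.round(sample)))

raw = np.array([[0, 3, 1], [2, 0, 7]])   # toy raw counts
logged = np.log1p(raw)                   # the same values after log1p

print(looks_like_raw_counts(raw))
print(looks_like_raw_counts(logged))
```

For a SciPy sparse matrix, the same test can be applied to the stored values in its `.data` attribute instead of slicing rows.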


In [13]:
pathout = "/data/kanferg/Sptial_Omics/SpatialOmicsToolkit/out_4"
bcdata_init = sc.read_h5ad(os.path.join(pathout, "adata_concat_BreastCancer_harmony.h5ad"))
bcdata_init.X = bcdata_init.layers['counts']

In [14]:
# Rename the mitochondrial percentage column; it will be useful for integration
bcdata_init.obs['percent_mito'] = bcdata_init.obs['pct_counts_MT'].values
bcdata_init.obs['CellType'] = 'Unknown'
bcdata_init.obs['sample'] = 'ST'

andata_combined = sc.concat((bcdata_init,rdata))
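
`sc.concat` is called here with its defaults, and `anndata.concat` defaults to `join="inner"`, so only genes and `obs` columns present in both objects survive. That is presumably why the combined object shown later has far fewer variables than the reference. The sketch below is a pandas analogue on toy tables, not the AnnData call itself; all values in it are invented for illustration.

```python
import pandas as pd

# Toy per-cell tables standing in for the .obs of the query and the reference.
query_obs = pd.DataFrame(
    {"batch": ["b1", "b2"], "sample": ["ST", "ST"],
     "CellType": ["Unknown", "Unknown"], "pct_counts_MT": [1.2, 0.8]},
    index=["q1", "q2"],
)
ref_obs = pd.DataFrame(
    {"batch": ["ref"], "sample": ["ref"],
     "CellType": ["fibroblast"], "nCount_RNA": [5000]},
    index=["r1"],
)

# join="inner" keeps only the columns present in both tables.
combined = pd.concat([query_obs, ref_obs], join="inner")
print(sorted(combined.columns))
print(combined.shape)
```

Columns that exist on only one side (`pct_counts_MT`, `nCount_RNA` in this toy case) are dropped, which is why `percent_mito` is created on the query before concatenation.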

In [15]:
scvi.model.SCVI.setup_anndata(andata_combined, batch_key='batch', categorical_covariate_keys = ['sample'])
model_scvi = scvi.model.SCVI(andata_combined)
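# Heuristic epoch count; kept because the scANVI heuristic in the next cell reads it
max_epochs_scvi = np.min([round((20000 / andata_combined.n_obs) * 100), 100])
# This run overrides the heuristic and trains for a fixed 27 epochs
model_scvi.train(max_epochs=27)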

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2]


Epoch 27/27: 100%|██████████████| 27/27 [12:16<00:00, 27.27s/it, v_num=1, train_loss_step=160, train_loss_epoch=167]

`Trainer.fit` stopped: `max_epochs=27` reached.


Epoch 27/27: 100%|██████████████| 27/27 [12:16<00:00, 27.28s/it, v_num=1, train_loss_step=160, train_loss_epoch=167]
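
To see what the epoch heuristics in this notebook would suggest, they can be evaluated by hand. The sketch below restates the `max_epochs_scvi` expression and the analogous `max_epochs_scanvi` expression in plain Python; the function names are hypothetical, and `n_obs` is taken from the combined object's reported shape.

```python
def scvi_epochs(n_obs: int, cap: int = 100) -> int:
    """scVI heuristic from this notebook: fewer epochs for larger datasets."""
    return min(round((20000 / n_obs) * 100), cap)

def scanvi_epochs(scvi_max_epochs: int) -> int:
    """scANVI heuristic from this notebook: a third of the scVI epochs, clamped to [2, 10]."""
    return int(min(10, max(2, round(scvi_max_epochs / 3.0))))

n_obs = 328122  # number of cells in the combined object, as reported by its summary
e_scvi = scvi_epochs(n_obs)
e_scanvi = scanvi_epochs(e_scvi)
print(e_scvi, e_scanvi)  # 6 2
```

The notebook trains for 27 scVI epochs and 20 scANVI epochs, so both runs are longer than the heuristic values.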


In [16]:
model_scANvi = scvi.model.SCANVI.from_scvi_model(model_scvi, adata = andata_combined, unlabeled_category = 'Unknown',labels_key = 'CellType')
max_epochs_scanvi = int(np.min([10, np.max([2, round(max_epochs_scvi / 3.0)])]))
#model_scANvi.train(max_epochs=max_epochs_scanvi, n_samples_per_label=100)
model_scANvi.train(max_epochs=20, n_samples_per_label=100)

INFO     Training for 20 epochs.


GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2]


Epoch 20/20: 100%|██████████████| 20/20 [19:08<00:00, 57.21s/it, v_num=1, train_loss_step=158, train_loss_epoch=157]

`Trainer.fit` stopped: `max_epochs=20` reached.


Epoch 20/20: 100%|██████████████| 20/20 [19:08<00:00, 57.40s/it, v_num=1, train_loss_step=158, train_loss_epoch=157]


In [22]:
model_scvi.save(os.path.join(pathout,'the_model_scvi'))
model_scANvi.save(os.path.join(pathout,'the_model_scANvi'))

In [18]:
andata_combined.obs['predicted'] = model_scANvi.predict(andata_combined)
andata_combined.obs['transfer_score'] = model_scANvi.predict(soft = True).max(axis = 1)
andata_save = andata_combined.copy()
#andata_save.write_h5ad(os.path.join(pathout, "adata_concat_BreastCancer_harmony_scVI_scANVI_unintigrated.h5ad"))
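
The `transfer_score` computed above is the largest class probability per cell. The sketch below reproduces that logic on a toy table, assuming the soft predictions form a table with one row per cell and one column per label. The probabilities, the `min_score` cutoff, and the `low_confidence` tag are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

# Toy stand-in for soft predictions: rows are cells, columns are labels.
soft = pd.DataFrame(
    {"fibroblast": [0.90, 0.40, 0.05],
     "T cell": [0.05, 0.35, 0.15],
     "malignant cell": [0.05, 0.25, 0.80]},
    index=["cell_a", "cell_b", "cell_c"],
)

predicted = soft.idxmax(axis=1)      # label with the highest probability
transfer_score = soft.max(axis=1)    # that probability, as in the cell above

# Hypothetical cutoff: flag low-confidence calls instead of keeping them as-is.
min_score = 0.5
confident = predicted.where(transfer_score >= min_score, "low_confidence")
print(confident.tolist())
```

A cutoff like this is one way to use the score downstream; a suitable value would need to be chosen by inspecting the score distribution of the query cells.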

In [19]:
andata_combined

AnnData object with n_obs × n_vars = 328122 × 1066
    obs: 'batch', 'percent_mito', 'CellType', 'sample', '_scvi_batch', '_scvi_labels', 'predicted', 'transfer_score'
    uns: '_scvi_uuid', '_scvi_manager_uuid'
    obsm: '_scvi_extra_categorical_covs'

Compute the latent representation of the data.

This is denoted $z_n$ in the scVI paper. See the [scVI API documentation](https://docs.scvi-tools.org/en/stable/api/reference/scvi.model.SCVI.html#scvi.model.SCVI.get_latent_representation).

In [21]:
andata_combined.obsm['X_scVI'] = model_scvi.get_latent_representation()
andata_combined

AnnData object with n_obs × n_vars = 328122 × 1066
    obs: 'batch', 'percent_mito', 'CellType', 'sample', '_scvi_batch', '_scvi_labels', 'predicted', 'transfer_score'
    uns: '_scvi_uuid', '_scvi_manager_uuid'
    obsm: '_scvi_extra_categorical_covs', 'X_scVI'

In [25]:
andata_combined.layers['Norm_exp_scVI'] = model_scvi.get_normalized_expression()
andata_combined

AnnData object with n_obs × n_vars = 328122 × 1066
    obs: 'batch', 'percent_mito', 'CellType', 'sample', '_scvi_batch', '_scvi_labels', 'predicted', 'transfer_score'
    uns: '_scvi_uuid', '_scvi_manager_uuid'
    obsm: '_scvi_extra_categorical_covs', 'X_scVI'
    layers: 'Norm_exp_scVI'

In [27]:
andata_combined.layers['Norm_exp_scVI'].shape

(328122, 1066)

In [28]:
andata_bc = andata_combined[andata_combined.obs['sample']=='ST'].copy()

In [29]:
andata_save = andata_bc.copy()
andata_save.write_h5ad(os.path.join(pathout, "andata_bc_BreastCancer_harmony_scVI_scANVI_unintigrated_untuned.h5ad"))