# Joint analysis of paired and unpaired multiomic data with MultiVI

MultiVI is used for the joint analysis of scRNA and scATAC-seq datasets that were jointly profiled (multiomic / paired) and single-modality datasets (only scRNA or only scATAC). MultiVI uses the paired data as an anchor to align and merge the latent spaces learned from each individual modality.

This tutorial walks through how to read multiomic data, create a joint object with paired and unpaired data, set up and train a MultiVI model, visualize the resulting latent space, and run differential analyses.

## this notebook is modified directly from the scvi-tools tutorial

[MultiVI tutorial](https://docs.scvi-tools.org/en/stable/tutorials/notebooks/MultiVI_tutorial.html)

<div class="alert alert-info">
Important

MultiVI requires the datasets to use shared features. scATAC-seq datasets need to be processed to use a shared set of peaks.

</div>

In [1]:
!date

Tue Dec  6 11:22:23 EST 2022


#### import libraries and set notebook variables

In [2]:
import scvi
import numpy as np
import scanpy as sc
from pandas import read_csv, concat
import matplotlib.pyplot as plt
from matplotlib.pyplot import rc_context

import random
random.seed(42)

import warnings
warnings.filterwarnings('ignore')

scvi.settings.seed = 42

%matplotlib inline
# for white background of figures (only for docs rendering)
%config InlineBackend.print_figure_kwargs={'facecolor' : "w"}
%config InlineBackend.figure_format='retina'

  warn(f"Failed to load image Python extension: {e}")
Global seed set to 0
  new_rank_zero_deprecation(
  return new_rank_zero_deprecation(*args, **kwargs)
Global seed set to 42


In [3]:
# naming
proj_name = 'aging_phase2'

# directories
wrk_dir = '/labshare/raph/datasets/adrd_neuro/brain_aging/phase2'
quants_dir = f'{wrk_dir}/quants'
models_dir = f'{wrk_dir}/models'

# in files
arc_cnt_file = f'{wrk_dir}/src_data/aging_phase2_arc_aggr/outs/filtered_feature_bc_matrix'
arc_aggr_file = f'{wrk_dir}/src_data/aging_phase2_arc_aggr/outs/aggr.csv'

# out files
raw_anndata_file = f'{quants_dir}/{proj_name}.raw.h5ad'
results_file = f'{quants_dir}/{proj_name}.multivi.h5ad'
trained_model_path = f'{models_dir}/{proj_name}_trained_multivi'

# variables
DEBUG = True
MIN_CELL_PERCENT = 0.005
MAX_MITO_PERCENT = 10
TESTING = False
testing_cell_size = 5000
FILTER_FEATURES_HV = False
TOP_FEATURES_PERCENT = 0.3

## Data Processing
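Next, we'll read the data into an AnnData object.
Unlike the original tutorial, which splits its 12012 cells into paired, RNA-only, and ATAC-only subsets of 4004 cells each, all cells here are paired multiome profiles, so the data is used as is.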

Reading the data into an AnnData object can be done with the `read_10x_multiome` function:

Cell Ranger ARC `aggr` writes the filtered matrix files gzipped, but `read_10x_multiome` looks for uncompressed files, so they need to be unzipped first. Future versions may be able to find either.
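
The unzip step can be scripted. Below is a minimal sketch using only the standard library; the helper name `gunzip_matrix_dir` is hypothetical, and it assumes the directory holds `*.gz` files (for example `matrix.mtx.gz`) that should be written out next to the originals without deleting them.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_matrix_dir(matrix_dir):
    """Write an uncompressed copy of every *.gz file found in matrix_dir.

    The .gz originals are kept, and files that already have an
    uncompressed copy are skipped. Returns the paths that were written.
    """
    written = []
    for gz_path in sorted(Path(matrix_dir).glob("*.gz")):
        out_path = gz_path.with_suffix("")  # e.g. matrix.mtx.gz -> matrix.mtx
        if out_path.exists():
            continue
        with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        written.append(out_path)
    return written
```

In this notebook it would be called as `gunzip_matrix_dir(arc_cnt_file)` before reading the data.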

In [4]:
%%time
# read multiomic data
adata = scvi.data.read_10x_multiome(arc_cnt_file)
adata.var_names_make_unique()
adata.obs_names_make_unique()

CPU times: user 2min 36s, sys: 1.3 s, total: 2min 37s
Wall time: 2min 38s


In [5]:
print(adata)

AnnData object with n_obs × n_vars = 38671 × 181593
    obs: 'batch_id'
    var: 'ID', 'modality', 'chr', 'start', 'end'


In [6]:
adata.obs.batch_id.value_counts()

4    20000
2     7360
1     6749
3     4562
Name: batch_id, dtype: int64

#### the batch_id values from the cellranger-arc aggr command should match the 1-based row order of the aggr library file

In [7]:
aggr_lib = read_csv(arc_aggr_file)
aggr_lib = aggr_lib[['library_id']]
aggr_lib['batch_id'] = aggr_lib.index + 1
print(aggr_lib.shape)
if DEBUG:
    display(aggr_lib.head())

(4, 2)


Unnamed: 0,library_id,batch_id
0,Ag119,1
1,Ag120,2
2,Ag121,3
3,Ag122,4
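
To make the `index + 1` convention concrete, here is a small dependency-free sketch. It assumes an aggregated barcode carries a numeric suffix such as `-4` that gives the 1-based row of its library in the aggregation CSV; `library_for_barcode` is a hypothetical helper, not part of scvi-tools or Cell Ranger.

```python
def library_for_barcode(barcode, library_ids):
    """Map an aggregated barcode such as 'AAAC...-2' to its library ID.

    Assumes the numeric suffix is the 1-based row of the library in the
    aggregation CSV, the same convention as `aggr_lib.index + 1`.
    """
    _, _, suffix = barcode.rpartition("-")
    batch_id = int(suffix)
    if not 1 <= batch_id <= len(library_ids):
        raise ValueError(f"batch_id {batch_id} has no row in the library list")
    return library_ids[batch_id - 1]


# library order taken from the aggr table above
libraries = ["Ag119", "Ag120", "Ag121", "Ag122"]
print(library_for_barcode("AAACAGCCAAAGCGGC-4", libraries))  # Ag122
```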


#### merge the lib ID based on obs batch_id

In [8]:
prev_index = adata.obs.index.copy()
adata.obs['library_id'] = adata.obs.batch_id.map(aggr_lib.set_index('batch_id')['library_id'])

print(adata.obs.shape)
if DEBUG:
    display(adata.obs.head())
    print(prev_index.equals(adata.obs.index))

(38671, 2)


Unnamed: 0_level_0,batch_id,library_id
barcode,Unnamed: 1_level_1,Unnamed: 2_level_1
AAACAGCCAAAGCGGC,4,Ag122
AAACAGCCAACCCTCC,2,Ag120
AAACAGCCAAGGATTA,2,Ag120
AAACAGCCAAGTCGCT,3,Ag121
AAACAGCCAATAACCT,2,Ag120


True


In [9]:
adata.var.modality.value_counts()

Peaks              144992
Gene Expression     36601
Name: modality, dtype: int64

#### if testing notebook subset the cells

In [10]:
if TESTING:
    cells_subset = random.sample(list(adata.obs.index.values), 
                                 testing_cell_size)
    adata = adata[cells_subset]
    print(adata)
    if DEBUG:
        display(adata.obs.head())
        display(adata.var.modality.value_counts())

We can then use the `organize_multiome_anndatas` function to organize the data into a single multiome dataset. Here only the paired AnnData is passed, since there are no single-modality datasets.
This function sorts and orders the data from the multi-modal and modality-specific AnnDatas into a single AnnData (aligning the features, padding missing modalities with 0s, etc).
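
As a rough illustration of what "aligning the features and padding missing modalities with 0s" means, here is a toy sketch in plain Python. It is not the scvi-tools implementation; the helper `pad_to_shared_features`, the feature names, and the counts are all made up for the example.

```python
def pad_to_shared_features(cells, feature_order):
    """Return one row per cell over feature_order, using 0 for unmeasured features."""
    return [[counts.get(feature, 0) for feature in feature_order] for counts in cells]


genes = ["SNAP25", "GAD1"]
peaks = ["chr1:100-200"]
feature_order = genes + peaks  # genes first, then genomic regions

paired_cell = {"SNAP25": 3, "GAD1": 0, "chr1:100-200": 1}
rna_only_cell = {"SNAP25": 5, "GAD1": 2}  # no accessibility measured
atac_only_cell = {"chr1:100-200": 2}      # no expression measured

matrix = pad_to_shared_features(
    [paired_cell, rna_only_cell, atac_only_cell], feature_order
)
for row in matrix:
    print(row)
```

Each row has the same length and feature order, so cells from different assays can be stacked into one matrix; a separate per-cell annotation then records which modality each row came from.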

In [11]:
# We can now use the organizing method from scvi to concatenate these anndata
# adata_mvi = scvi.data.organize_multiome_anndatas(adata_paired, adata_rna, adata_atac)
adata_mvi = scvi.data.organize_multiome_anndatas(adata)

Note that `organize_multiome_anndatas` adds an annotation to the cells to indicate which modality they originate from:

In [12]:
display(adata_mvi.obs.modality.value_counts())
display(adata_mvi.obs.head())

paired    38671
Name: modality, dtype: int64

Unnamed: 0,batch_id,library_id,modality
AAACAGCCAAAGCGGC_paired,4,Ag122,paired
AAACAGCCAACCCTCC_paired,2,Ag120,paired
AAACAGCCAAGGATTA_paired,2,Ag120,paired
AAACAGCCAAGTCGCT_paired,3,Ag121,paired
AAACAGCCAATAACCT_paired,2,Ag120,paired


<div class="alert alert-info">
Important

MultiVI requires the features to be ordered so that genes appear before genomic regions. This must be enforced by the user.

</div>

The cell below enforces this by sorting the features on the `modality` column, which places `Gene Expression` before `Peaks`. Here the features were already grouped this way, but it's always good to make sure:

In [13]:
adata_mvi = adata_mvi[:, adata_mvi.var["modality"].argsort()].copy()
display(adata_mvi.var)

Unnamed: 0,ID,modality,chr,start,end
MIR1302-2HG,ENSG00000243485,Gene Expression,chr1,29553,30267
AL391261.2,ENSG00000258847,Gene Expression,chr14,66004522,66004523
FUT8-AS1,ENSG00000276116,Gene Expression,chr14,65412689,65412690
FUT8,ENSG00000033170,Gene Expression,chr14,65410591,65413008
AL355076.2,ENSG00000258760,Gene Expression,chr14,65302679,65318790
...,...,...,...,...,...
chr15:93129576-93130407,chr15:93129576-93130407,Peaks,chr15,93129576,93130407
chr15:93131610-93132443,chr15:93131610-93132443,Peaks,chr15,93131610,93132443
chr15:93144220-93145147,chr15:93144220-93145147,Peaks,chr15,93144220,93145147
chr15:93088872-93089737,chr15:93088872-93089737,Peaks,chr15,93088872,93089737
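
An explicit check of the ordering can complement the sort. The sketch below is a hypothetical helper that takes any iterable of modality labels; the default labels are the two values seen in `adata.var.modality` above.

```python
def genes_before_regions(modalities, gene_label="Gene Expression", region_label="Peaks"):
    """Return True if every gene feature precedes every region feature."""
    seen_region = False
    for label in modalities:
        if label == region_label:
            seen_region = True
        elif label == gene_label and seen_region:
            # a gene appears after at least one region: ordering is violated
            return False
    return True


print(genes_before_regions(["Gene Expression", "Gene Expression", "Peaks"]))  # True
print(genes_before_regions(["Gene Expression", "Peaks", "Gene Expression"]))  # False
```

In this notebook it could be used as `assert genes_before_regions(adata_mvi.var["modality"])`.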


#### save the MultiVI-organized but unprocessed AnnData object; note that the subject is in the obs

In [14]:
%%time
# adata.write(raw_anndata_file)
adata_mvi.write(raw_anndata_file)

CPU times: user 603 ms, sys: 384 ms, total: 987 ms
Wall time: 2.62 s


In [15]:
print(adata_mvi)

AnnData object with n_obs × n_vars = 38671 × 181593
    obs: 'batch_id', 'library_id', 'modality'
    var: 'ID', 'modality', 'chr', 'start', 'end'


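We also filter out cells with a mitochondrial percentage of `MAX_MITO_PERCENT` or more or with fewer than 200 detected features, and remove features that appear in fewer than `MIN_CELL_PERCENT` of the cells (a fraction, here 0.005, i.e. 0.5%).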

In [16]:
print(adata_mvi.shape)
# annotate the group of mitochondrial genes as 'mt'
adata_mvi.var['mt'] = adata_mvi.var_names.str.startswith('MT-')  
# With pp.calculate_qc_metrics, we can compute many metrics very efficiently.
sc.pp.calculate_qc_metrics(adata_mvi, qc_vars=['mt'], percent_top=None, 
                           log1p=False, inplace=True)
adata_mvi = adata_mvi[adata_mvi.obs.pct_counts_mt < MAX_MITO_PERCENT, :]
# Basic filtering:
sc.pp.filter_cells(adata_mvi, min_genes=200)
sc.pp.filter_genes(adata_mvi, min_cells=int(adata_mvi.shape[0] * MIN_CELL_PERCENT))

print(adata_mvi)

if DEBUG:
    display(adata_mvi.obs.head())

(38671, 181593)
AnnData object with n_obs × n_vars = 38618 × 139949
    obs: 'batch_id', 'library_id', 'modality', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'n_genes'
    var: 'ID', 'modality', 'chr', 'start', 'end', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells'


Unnamed: 0,batch_id,library_id,modality,n_genes_by_counts,total_counts,total_counts_mt,pct_counts_mt,n_genes
AAACAGCCAAAGCGGC_paired,4,Ag122,paired,1996,3297.0,2.0,0.060661,1996
AAACAGCCAACCCTCC_paired,2,Ag120,paired,2996,6309.0,0.0,0.0,2996
AAACAGCCAAGGATTA_paired,2,Ag120,paired,11400,29757.0,5.0,0.016803,11400
AAACAGCCAAGTCGCT_paired,3,Ag121,paired,1387,2618.0,0.0,0.0,1387
AAACAGCCAATAACCT_paired,2,Ag120,paired,12730,33734.0,3.0,0.008893,12730
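
The thresholds can be checked by hand. This short sketch only redoes the arithmetic with the numbers printed above; the variable names other than `MIN_CELL_PERCENT` are illustrative.

```python
MIN_CELL_PERCENT = 0.005   # a fraction: 0.5% of cells
n_cells_after_qc = 38618   # cells remaining in the output above
n_features_before = 181593
n_features_after = 139949

# minimum number of cells a feature must appear in to be kept
min_cells = int(n_cells_after_qc * MIN_CELL_PERCENT)
# features dropped by the filter (genes and peaks together)
n_removed = n_features_before - n_features_after
# pct_counts_mt for the first cell: 2 mitochondrial counts out of 3297
pct_mt = 100 * 2.0 / 3297.0

print(min_cells)         # 193
print(n_removed)         # 41644
print(round(pct_mt, 6))  # 0.060661
```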


#### if flag set then subset to highest variance features

The MultiVI tutorial doesn't suggest this, so the flag will typically be set to `False`.

In [17]:
if FILTER_FEATURES_HV:
    n_top_genes = int(adata_mvi.var.shape[0] * TOP_FEATURES_PERCENT)
    sc.pp.highly_variable_genes(adata_mvi, n_top_genes=n_top_genes, 
                                batch_key='library_id',flavor='seurat_v3', 
                                subset=True)
    print(adata_mvi)

## Setup and Training MultiVI
We can now set up and train the MultiVI model!

First, we need to set up the AnnData object using the `setup_anndata` function. At this point we specify any batch annotation that the model should account for.
**Importantly**, the main batch annotation, specified by `batch_key`, should correspond to the modality of the cells.

Other batch annotations (e.g., if there are multiple ATAC batches) should be provided using the `categorical_covariate_keys`.

The actual values of categorical covariates (including `batch_key`) are not important, as long as they are different for different samples.
I.e., it is not important to call the expression-only samples "expression", as long as they are called something different than the multi-modal and accessibility-only samples.

<div class="alert alert-info">
Important

MultiVI requires the main batch annotation to correspond to the modality of the samples. Other batch annotations, such as in the case of multiple RNA-only batches, can be specified using `categorical_covariate_keys`.

</div>

In [18]:
scvi.model.MULTIVI.setup_anndata(adata_mvi, batch_key='modality', 
                                 categorical_covariate_keys = ['library_id']) 
                                 # continuous_covariate_keys = ['total_counts', 'pct_counts_mt'])
                                 # categorical_covariate_keys = ['region', 'subject_id'],



When creating the object, we need to specify how many of the features are genes, and how many are genomic regions. This is so MultiVI can determine the exact architecture for each modality.

In [19]:
mvi = scvi.model.MULTIVI(
    adata_mvi, 
    n_genes=(adata_mvi.var['modality']=='Gene Expression').sum(),
    n_regions=(adata_mvi.var['modality']=='Peaks').sum(),
)
mvi.view_anndata_setup()
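
Before building the model it can help to confirm that the two counts account for every feature. The following is a minimal sketch with a hypothetical helper, `modality_counts`, that works on any iterable of modality labels.

```python
from collections import Counter


def modality_counts(modalities, gene_label="Gene Expression", region_label="Peaks"):
    """Return (n_genes, n_regions) and fail if any other label is present."""
    counts = Counter(modalities)
    unexpected = set(counts) - {gene_label, region_label}
    if unexpected:
        raise ValueError(f"unexpected modality labels: {sorted(unexpected)}")
    return counts[gene_label], counts[region_label]


labels = ["Gene Expression", "Gene Expression", "Peaks", "Peaks", "Peaks"]
n_genes, n_regions = modality_counts(labels)
print(n_genes, n_regions)  # 2 3
```

With the real data, `modality_counts(adata_mvi.var["modality"])` should return two numbers that sum to `adata_mvi.n_vars`.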

In [None]:
%%time
mvi.train()

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs


Epoch 5/500:   1%|▎                               | 4/500 [35:12<73:07:40, 530.77s/it, loss=6.76e+03, v_num=1]

## Save and Load MultiVI models

Saving and loading models works the same way as for all other scvi-tools models, and is very straightforward:

In [None]:
mvi.save(trained_model_path, overwrite=True)

In [None]:
mvi = scvi.model.MULTIVI.load(trained_model_path, adata=adata_mvi, use_gpu=True)

## Extracting and visualizing the latent space

We can now use `get_latent_representation` to get the latent space from the trained model, and visualize it using scanpy functions:

In [None]:
adata_mvi.obsm["MultiVI_latent"] = mvi.get_latent_representation()

#### embed the graph based on latent representation

In [None]:
sc.pp.neighbors(adata_mvi, use_rep="MultiVI_latent")
sc.tl.umap(adata_mvi, min_dist=0.2)

#### visualize the latent representation

In [None]:
with rc_context({'figure.figsize': (8, 8), 'figure.dpi': 100}):
    plt.style.use('seaborn-bright')
    sc.pl.umap(adata_mvi, color=['library_id'])

### Clustering on the MultiVI latent space
Our interface with scanpy makes it easy to cluster the data from MultiVI's latent space and then reinject the cluster labels into MultiVI (e.g., for differential expression).

In [None]:
# neighbors were already computed on the MultiVI latent representation
leiden_res = 0.6
sc.tl.leiden(adata_mvi, key_added="leiden_MultiVI", resolution=leiden_res)

In [None]:
with rc_context({'figure.figsize': (8, 8), 'figure.dpi': 100}):
    plt.style.use('seaborn-bright')
    sc.pl.umap(adata_mvi, color=['leiden_MultiVI'], 
               frameon=False, legend_loc='on data')

### add quantification layers as needed

In a well-mixed space, MultiVI can seamlessly impute the missing modalities for single-modality cells.
First, imputing expression and accessibility is done with `get_normalized_expression` and `get_accessibility_estimates`, respectively.

In this dataset every cell is paired, so there are no missing modalities to fill in. We'll compute the model's normalized expression and accessibility estimates for all cells and store them as a layer:

In [None]:
# preserve original counts
adata_mvi.layers['counts'] = adata_mvi.X.copy()
# get normalized expression values and accessibility estimates from the model
expression = mvi.get_normalized_expression()
accessibility = mvi.get_accessibility_estimates()
combined = concat([expression, accessibility], axis='columns')
print(combined.shape)
if DEBUG:
    display(combined.head())
    print(adata_mvi.obs.index.equals(combined.index))
    print(adata_mvi.var.index.equals(combined.columns))
adata_mvi.layers['X_mvi'] = combined
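
The two `equals` checks above matter because a layer is positional: row `i` and column `j` must refer to the same cell and feature as `X`. One conservative option, sketched here on toy data with pandas, is to reorder the combined frame explicitly and store a plain array. The helper `aligned_layer` and the toy names are hypothetical.

```python
import pandas as pd


def aligned_layer(expression, accessibility, obs_names, var_names):
    """Combine both estimates and return an array ordered like obs_names x var_names."""
    combined = pd.concat([expression, accessibility], axis="columns")
    if combined.shape != (len(obs_names), len(var_names)):
        raise ValueError("model outputs do not match the AnnData dimensions")
    # .loc raises a KeyError if any cell or feature label is missing
    return combined.loc[list(obs_names), list(var_names)].to_numpy()


expression = pd.DataFrame(
    [[0.1, 0.2], [0.3, 0.4]], index=["c1", "c2"], columns=["g1", "g2"]
)
accessibility = pd.DataFrame([[0.9], [0.8]], index=["c1", "c2"], columns=["p1"])

# the target order differs from the frame order on purpose
layer = aligned_layer(expression, accessibility, ["c2", "c1"], ["g1", "g2", "p1"])
print(layer)
```

In the notebook this would correspond to passing `adata_mvi.obs_names` and `adata_mvi.var_names` as the target order.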


In [None]:
print(adata_mvi)

#### save the modified anndata object

In [None]:
adata_mvi.write(results_file)

We can demonstrate this on some known marker genes:

Neuron, SNAP25:

In [None]:
with rc_context({'figure.figsize': (8, 8), 'figure.dpi': 100}):
    plt.style.use('seaborn-bright')
    sc.pl.umap(adata_mvi, color='SNAP25')
    sc.pl.umap(adata_mvi, color='SNAP25', layer='X_mvi')

GABAergic, GAD1:

In [None]:
with rc_context({'figure.figsize': (8, 8), 'figure.dpi': 100}):
    plt.style.use('seaborn-bright')
    sc.pl.umap(adata_mvi, color='GAD1')
    sc.pl.umap(adata_mvi, color='GAD1', layer='X_mvi')

Glutamatergic, GRIN1:

In [None]:
with rc_context({'figure.figsize': (8, 8), 'figure.dpi': 100}):
    plt.style.use('seaborn-bright')
    sc.pl.umap(adata_mvi, color='GRIN1')
    sc.pl.umap(adata_mvi, color='GRIN1', layer='X_mvi')

Microglia, CSF1R:

In [None]:
with rc_context({'figure.figsize': (8, 8), 'figure.dpi': 100}):
    plt.style.use('seaborn-bright')
    sc.pl.umap(adata_mvi, color='CSF1R')
    sc.pl.umap(adata_mvi, color='CSF1R', layer='X_mvi')

Astrocyte, GFAP:

In [None]:
with rc_context({'figure.figsize': (8, 8), 'figure.dpi': 100}):
    plt.style.use('seaborn-bright')
    sc.pl.umap(adata_mvi, color='GFAP')
    sc.pl.umap(adata_mvi, color='GFAP', layer='X_mvi')

Oligodendrocyte, PLP1:

In [None]:
with rc_context({'figure.figsize': (8, 8), 'figure.dpi': 100}):
    plt.style.use('seaborn-bright')
    sc.pl.umap(adata_mvi, color='PLP1')
    sc.pl.umap(adata_mvi, color='PLP1', layer='X_mvi')

Each of these marker genes should clearly identify its respective population, and the model-estimated expression profiles should be stable and consistent within that population. In the original tutorial this holds **even though many of the cells only had their ATAC profile measured**; in this dataset all cells are paired.

In [None]:
!date