## Notebook to identify zero-inflated genes using scvi-tools

Adapted from the scvi-tools tutorial [Identification of zero-inflated genes](https://docs.scvi-tools.org/en/stable/user_guide/notebooks/AutoZI_tutorial.html#).

The method is described in this [bioRxiv paper](https://www.biorxiv.org/content/10.1101/794875v2).

In [3]:
!date

Thu Sep 30 14:43:12 EDT 2021


### Imports, data loading and preparation

In [4]:
import numpy as np
import pandas as pd
import anndata

import scanpy as sc
import scvi

In [5]:
# parameters
cohort = 'aging'
assay = 'RNA'

# directories for initial setup
home_dir = '/labshare/raph/datasets/adrd_neuro'
wrk_dir = f'{home_dir}/{cohort}'
results_dir = f'{wrk_dir}/demux'

# in files
h5ad_file = f'{results_dir}/{cohort}.pegasus.leiden_085.Age_group_young_old.h5ad'

# out files
# output_file = f'{results_dir}/{cohort}.testing.h5ad'
# regions_out_file = f'{results_dir}/{cohort}.regions_glmm_age_diffs.csv'
# cells_out_file = f'{results_dir}/{cohort}.celltypes_glmm_age_diffs.csv'


#### Load the data

In [6]:
sc_quant = scvi.data.read_h5ad(h5ad_file)
print(sc_quant)

AnnData object with n_obs × n_vars = 167945 × 35441
    obs: 'pool_name', 'Sample_id', 'Tissue_source', 'Brain_region', 'Clinical_diagnosis', 'Age', 'Sex', 'donor_id', 'lane_num', 'Channel', 'n_genes', 'n_counts', 'percent_mito', 'scale', 'Group', 'leiden_labels', 'anno', 'leiden_labels_085', 'new_anno', 'Age_group'
    var: 'n_cells', 'percent_cells', 'robust', 'highly_variable_features', 'mean', 'var', 'hvf_loess', 'hvf_rank'
    uns: 'Channels', 'Groups', 'PCs', 'W_diffmap', 'W_pca_harmony', 'c2gid', 'df_qcplot', 'diffmap_evals', 'diffmap_knn_distances', 'diffmap_knn_indices', 'genome', 'gncells', 'leiden_resolution', 'modality', 'ncells', 'norm_count', 'pca', 'pca_features', 'pca_harmony_knn_distances', 'pca_harmony_knn_indices', 'stdzn_max_value', 'stdzn_mean', 'stdzn_std'
    obsm: 'X_diffmap', 'X_fle', 'X_pca', 'X_pca_harmony', 'X_phi', 'X_umap', 'X_umap_085'
    varm: 'de_res', 'gmeans', 'gstds', 'means', 'partial_sum'


In [None]:
# sc_quant.obs.head()

In [10]:
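# keep the raw counts in a layer; the model is set up on this layer below
sc_quant.layers["counts"] = sc_quant.X.copy()
# normalize each cell to 10e4 (= 100,000) total counts, then log-transform X
sc.pp.normalize_total(sc_quant, target_sum=10e4)
sc.pp.log1p(sc_quant)
# store the log-normalized data in .raw
sc_quant.raw = sc_quant
# keep 1000 genes chosen by scvi's Poisson gene selection, with pool_name as batch
scvi.data.poisson_gene_selection(
    sc_quant,
    n_top_genes=1000,
    batch_key="pool_name",
    subset=True,
    layer="counts",
)
# register counts, labels and batches; labels_key also defines the label axis
# used by the gene-label model further down
scvi.data.setup_anndata(
    sc_quant,
    labels_key="Sample_id",
    batch_key="pool_name",
    layer="counts",
)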


These matrices should now be stored in the .obsp attribute.
This slicing behavior will be removed in anndata 0.8.
  warn(

  res = method(*args, **kwargs)


Sampling from binomial...: 100%|██████████████████████████████████████████████████| 10000/10000 [00:01<00:00, 7757.43it/s]
Sampling from binomial...: 100%|██████████████████████████████████████████████████| 10000/10000 [00:01<00:00, 8831.64it/s]
Sampling from binomial...: 100%|██████████████████████████████████████████████████| 10000/10000 [00:01<00:00, 9898.58it/s]
Sampling from binomial...: 100%|█████████████████████████████████████████████████| 10000/10000 [00:00<00:00, 10322.72it/s]
Sampling from binomial...: 100%|██████████████████████████████████████████████████| 10000/10000 [00:01<00:00, 9749.41it/s]
Sampling from binomial...: 100%|██████████████████████████████████████████████████| 10000/10000 [00:01<00:00, 9861.21it/s]
INFO     Using batches from adata.obs["pool_name"]
INFO     Using labels from adata.obs["Sample_id"]

### Analyze gene-specific ZI

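In AutoZI, all $\delta_g$'s (where $\delta_g$ is the probability that gene $g$ is not zero-inflated) follow a common $\text{Beta}(\alpha, \beta)$ prior distribution where $\alpha, \beta \in (0, 1)$, and the zero-inflation probability in the ZINB component is bounded below by $\tau_{\text{dropout}} \in (0, 1)$. AutoZI is encoded by the AutoZIVAE class (wrapped in this notebook by scvi.model.AUTOZI), whose inputs, besides the size of the dataset, are $\alpha$ (alpha_prior), $\beta$ (beta_prior), and $\tau_{\text{dropout}}$ (minimal_dropout). By default, we set $\alpha = 0.5$, $\beta = 0.5$, $\tau_{\text{dropout}} = 0.01$.

Note: we can learn $\alpha, \beta$ in an Empirical Bayes fashion, which is possible by setting alpha_prior = None and beta_prior = None.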
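
To get a feel for what the prior implies, here is a minimal sketch that assumes the scvi-tools defaults alpha_prior = beta_prior = 0.5. A Beta(0.5, 0.5) distribution is the arcsine distribution, whose CDF has a closed form, so no model or data is needed. The function name `arcsine_cdf` and the 0.1 / 0.9 cut points are illustrative choices, not part of AutoZI.

```python
import math

def arcsine_cdf(x):
    """CDF of Beta(0.5, 0.5): F(x) = (2 / pi) * asin(sqrt(x)) for 0 <= x <= 1."""
    return (2.0 / math.pi) * math.asin(math.sqrt(x))

# Prior mass on the zero-inflation side (delta_g < 0.5).
prior_zi = arcsine_cdf(0.5)

# Prior mass close to the two extremes versus the middle of the interval.
near_zero = arcsine_cdf(0.1)                  # delta_g < 0.1
near_one = 1.0 - arcsine_cdf(0.9)             # delta_g > 0.9
middle = arcsine_cdf(0.9) - arcsine_cdf(0.1)  # 0.1 <= delta_g <= 0.9

print(f"P(delta_g < 0.5) = {prior_zi:.3f}")
print(f"P(delta_g < 0.1) = {near_zero:.3f}")
print(f"P(delta_g > 0.9) = {near_one:.3f}")
print(f"P(0.1 <= delta_g <= 0.9) = {middle:.3f}")
```

Under this default the prior is symmetric around 0.5, so it does not favor the ZI or the non-ZI hypothesis for any gene, and it places roughly a fifth of its mass within 0.1 of each endpoint.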

In [11]:
vae = scvi.model.AUTOZI(sc_quant)



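We fit, for each gene $g$, an approximate posterior distribution $q(\delta_g) = \text{Beta}(\alpha^g, \beta^g)$ (with $\alpha^g, \beta^g \in (0, 1)$) on which we rely. We retrieve $\alpha^g, \beta^g$ for all genes $g$ (and $\alpha, \beta$, if learned) as numpy arrays using the method get_alphas_betas of AutoZIVAE.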

In [12]:
vae.train(max_epochs=200, plan_kwargs = {'lr':1e-2})

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Epoch 1/200:   0%|                                                                                | 0/200 [00:00<?, ?it/s]



Epoch 200/200: 100%|██████████████████████████████████████████| 200/200 [1:15:08<00:00, 22.54s/it, loss=1.22e+03, v_num=1]


In [13]:
outputs = vae.get_alphas_betas()
alpha_posterior = outputs['alpha_posterior']
beta_posterior = outputs['beta_posterior']

Now that we have obtained the fitted $\alpha^g, \beta^g$, different metrics are possible. Bayesian decision theory suggests using the posterior probability of the zero-inflation hypothesis, $q(\delta_g < 0.5)$, but other metrics, such as the mean of $\delta_g$ with respect to $q$, are also possible. We focus on the former. We decide that gene $g$ is ZI if and only if $q(\delta_g < 0.5)$ is greater than a given threshold, say $0.5$. We may note that this is equivalent to $\alpha^g < \beta^g$. From this we can deduce the fraction of predicted ZI genes in the dataset.

In [14]:
from scipy.stats import beta

# Threshold (or Kzinb/Knb+Kzinb in paper)
threshold = 0.5

# q(delta_g < 0.5) probabilities
zi_probs = beta.cdf(0.5, alpha_posterior, beta_posterior)

# ZI genes
is_zi_pred = (zi_probs > threshold)

print('Fraction of predicted ZI genes :', is_zi_pred.mean())

Fraction of predicted ZI genes : 0.985
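
The decision rule used in the cell above can be checked without scipy. For a Beta(alpha, beta) distribution, the probability of falling below 0.5 exceeds 0.5 exactly when alpha < beta. The sketch below estimates that probability by Monte Carlo with the standard library and compares it with the shortcut. The parameter pairs in `toy_params` are hypothetical values, not fitted posteriors from this dataset, and `prob_below_half` is an illustrative helper.

```python
import random

def prob_below_half(alpha, beta, n_samples=20000, seed=0):
    """Monte Carlo estimate of q(delta < 0.5) for delta ~ Beta(alpha, beta)."""
    rng = random.Random(seed)
    hits = sum(rng.betavariate(alpha, beta) < 0.5 for _ in range(n_samples))
    return hits / n_samples

# Hypothetical posterior parameters (alpha_g, beta_g) for four genes.
toy_params = [(0.2, 0.8), (0.8, 0.2), (0.3, 0.6), (0.6, 0.3)]

results = []
for alpha_g, beta_g in toy_params:
    zi_prob = prob_below_half(alpha_g, beta_g)
    called_zi = zi_prob > 0.5      # decision rule with threshold 0.5
    shortcut = alpha_g < beta_g    # equivalent comparison of the two parameters
    results.append((called_zi, shortcut))
    print(f"alpha={alpha_g}, beta={beta_g}: "
          f"q(delta<0.5) ~ {zi_prob:.3f}, ZI={called_zi}, alpha<beta={shortcut}")
```

The shortcut only holds for a threshold of 0.5; for any other threshold the Beta CDF has to be evaluated, as the notebook does with scipy.stats.beta.cdf.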


We noted that predictions were less accurate for genes $g$ whose average expressions (or, equivalently, predicted NB means) were low. Indeed, genes assumed not to be ZI were more often predicted as ZI at such low average expressions. A threshold of 1 proved reasonable to separate genes predicted with more or less accuracy. Hence we may want to focus on predictions for genes with average expression above 1.


In [16]:
# sc_quant.X holds the log-normalized values (normalize_total + log1p above),
# so the cutoff of 1 is applied on that scale
mask_sufficient_expression = (np.array(sc_quant.X.mean(axis=0)) > 1.).reshape(-1)
print('Fraction of genes with avg expression > 1 :', mask_sufficient_expression.mean())
print('Fraction of predicted ZI genes with avg expression > 1 :', is_zi_pred[mask_sufficient_expression].mean())

Fraction of genes with avg expression > 1 : 0.78
Fraction of predicted ZI genes with avg expression > 1 : 0.9871794871794872
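
The same filter can be used to compare ZI calls on both sides of the expression cutoff, which is the contrast the paragraph above describes. This is a toy sketch in plain Python: the values in `mean_expression` and `is_zi` are made up for illustration, and `zi_fraction` is a hypothetical helper.

```python
# Hypothetical per-gene mean expression and ZI calls (toy values, not from this dataset).
mean_expression = [0.1, 0.4, 0.9, 1.2, 2.5, 3.1, 0.3, 1.8]
is_zi = [True, True, True, False, True, False, True, False]

def zi_fraction(calls):
    """Fraction of True entries; None when the group is empty."""
    return sum(calls) / len(calls) if calls else None

cutoff = 1.0
# Split the ZI calls by whether the gene passes the expression cutoff.
high = [zi for zi, m in zip(is_zi, mean_expression) if m > cutoff]
low = [zi for zi, m in zip(is_zi, mean_expression) if m <= cutoff]

print("Fraction of predicted ZI genes, avg expression > 1 :", zi_fraction(high))
print("Fraction of predicted ZI genes, avg expression <= 1 :", zi_fraction(low))
```

Reporting both groups makes it easier to see whether a high overall ZI fraction is driven mostly by lowly expressed genes.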


### Analyze gene-cell-type-specific ZI

One may argue that zero-inflation should also be treated on the cell-type (or ‘label’) level, in addition to the gene level. AutoZI can be extended by assuming a random variable $\delta_{gc}$ for each gene $g$ and cell type $c$ which denotes the probability that gene $g$ is not zero-inflated in cell-type $c$. The analysis above can be extended to this new scale.
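
At this scale there is one pair of posterior parameters per gene and per label, so the ZI calls form a genes × labels table and the fraction of ZI genes is computed column by column. The sketch below illustrates this on toy values, using the fact that for a Beta distribution and a threshold of 0.5 a gene is called ZI exactly when its alpha is smaller than its beta. It assumes that the columns follow the order of the labels registered when the data was set up (the labels_key); the label names and numbers are hypothetical.

```python
# Toy posterior parameters: rows are genes, columns are labels (hypothetical values).
label_names = ["label_A", "label_B", "label_C"]
alpha_post = [
    [0.2, 0.7, 0.4],
    [0.6, 0.3, 0.5],
    [0.1, 0.8, 0.9],
    [0.1, 0.2, 0.6],
]
beta_post = [
    [0.6, 0.2, 0.5],
    [0.3, 0.7, 0.4],
    [0.2, 0.3, 0.1],
    [0.7, 0.9, 0.2],
]

def zi_fraction_per_label(alpha, beta, names):
    """Fraction of genes called ZI (alpha < beta) for each label column."""
    n_genes = len(alpha)
    fractions = {}
    for col, name in enumerate(names):
        n_zi = sum(alpha[g][col] < beta[g][col] for g in range(n_genes))
        fractions[name] = n_zi / n_genes
    return fractions

fractions = zi_fraction_per_label(alpha_post, beta_post, label_names)
for name, frac in fractions.items():
    print(f"Fraction of predicted ZI genes for {name}: {frac:.2f}")
```

Because the names are attached to columns by position, they should come from the same annotation that was registered as labels; pairing the columns with a different annotation would mislabel the fractions.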

In [17]:
# Model definition
vae_genelabel = scvi.model.AUTOZI(
    sc_quant,
    dispersion='gene-label',
    zero_inflation='gene-label'
)

# Training
vae_genelabel.train(max_epochs=200, plan_kwargs = {'lr':1e-2})

# Retrieve posterior distribution parameters
outputs_genelabel = vae_genelabel.get_alphas_betas()
alpha_posterior_genelabel = outputs_genelabel['alpha_posterior']
beta_posterior_genelabel = outputs_genelabel['beta_posterior']

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Epoch 1/200:   0%|                                                                                | 0/200 [00:00<?, ?it/s]



Epoch 200/200: 100%|███████████████████████████████████████████| 200/200 [1:15:35<00:00, 22.68s/it, loss=1.2e+03, v_num=1]


In [18]:
pbmc = scvi.data.pbmc_dataset(run_setup_anndata=False)

INFO     Downloading file at data/gene_info_pbmc.csv
Downloading...: 909it [00:00, 25927.39it/s]
INFO     Downloading file at data/pbmc_metadata.pickle
Downloading...: 4001it [00:00, 57037.33it/s]
INFO     Downloading file at data/pbmc8k/filtered_gene_bc_matrices.tar.gz
Downloading...: 37559it [00:01, 27402.53it/s]
INFO     Extracting tar file
INFO     Removing extracted data at data/pbmc8k/filtered_gene_bc_matrices
INFO     Downloading file at data/pbmc4k/filtered_gene_bc_matrices.tar.gz
Downloading..

  res = method(*args, **kwargs)


In [21]:
# q(delta_g < 0.5) probabilities
zi_probs_genelabel = beta.cdf(0.5, alpha_posterior_genelabel, beta_posterior_genelabel)

# ZI gene-cell-types
is_zi_pred_genelabel = (zi_probs_genelabel > threshold)

In [23]:
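# NOTE: the columns of is_zi_pred_genelabel follow the labels registered in
# setup_anndata (labels_key="Sample_id"), not 'Brain_region'. The names printed
# below are therefore paired with the columns by position only. For true
# per-region fractions, register labels_key="Brain_region" before training.
ct = sc_quant.obs['Brain_region'].astype("category")
codes = np.unique(ct.cat.codes)
cats = ct.cat.categories
for ind_cell_type, cell_type in zip(codes, cats):
    is_zi_pred_genelabel_here = is_zi_pred_genelabel[:,ind_cell_type]
    print('Fraction of predicted ZI genes for cell type {} :'.format(cell_type),
          is_zi_pred_genelabel_here.mean(),'\n')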

Fraction of predicted ZI genes for cell type Entorhinal cortex : 0.735 

Fraction of predicted ZI genes for cell type Middle temporal gyrus : 0.687 

Fraction of predicted ZI genes for cell type Putamen : 0.665 

Fraction of predicted ZI genes for cell type Subventricular zone : 0.6 



In [25]:
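# NOTE: the columns of is_zi_pred_genelabel index the labels_key used in
# setup_anndata ("Sample_id"), so pairing them with 'new_anno' categories is
# positional only. Register labels_key="new_anno" for true per-cell-type values.
ct = sc_quant.obs['new_anno'].astype("category")
codes = np.unique(ct.cat.codes)
cats = ct.cat.categories
for ind_cell_type, cell_type in zip(codes, cats):
    is_zi_pred_genelabel_here = is_zi_pred_genelabel[:,ind_cell_type]
    print('Fraction of predicted ZI genes for cell type {} :'.format(cell_type),
          is_zi_pred_genelabel_here.mean(),'\n')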

Fraction of predicted ZI genes for cell type Astrocyte : 0.735 

Fraction of predicted ZI genes for cell type Astrocyte-GFAP-Hi : 0.687 

Fraction of predicted ZI genes for cell type Endothelial : 0.665 

Fraction of predicted ZI genes for cell type ExN CUX2 ADARB2 : 0.6 

Fraction of predicted ZI genes for cell type ExN CUX2 LAMP5 : 0.705 

Fraction of predicted ZI genes for cell type ExN FEZF2 : 0.76 

Fraction of predicted ZI genes for cell type ExN LAMP5 : 0.613 

Fraction of predicted ZI genes for cell type ExN RORB : 0.614 

Fraction of predicted ZI genes for cell type ExN RORB THEMIS : 0.733 

Fraction of predicted ZI genes for cell type ExN THEMIS : 0.637 

Fraction of predicted ZI genes for cell type InN ADARB2 LAMP5 : 0.647 

Fraction of predicted ZI genes for cell type InN ADARB2 VIP : 0.699 

Fraction of predicted ZI genes for cell type InN LHX6 PVALB : 0.662 

Fraction of predicted ZI genes for cell type InN LHX6 SST : 0.696 

Fraction of predicted ZI genes for cell type M

In [26]:
# With avg expressions > 1
# NOTE: ind_cell_type indexes the labels_key used in setup_anndata ("Sample_id"),
# so it matches the 'new_anno' categories by position only.
for ind_cell_type, cell_type in zip(codes, cats):
    # cells annotated with this cell type
    in_cell_type = (sc_quant.obs['new_anno'] == cell_type).values
    # genes whose mean log-normalized expression in these cells exceeds 1
    mask_sufficient_expression = (np.array(sc_quant.X[in_cell_type, :].mean(axis=0)) > 1.).reshape(-1)
    print('Fraction of genes with avg expression > 1 for cell type {} :'.format(cell_type),
          mask_sufficient_expression.mean())
    is_zi_pred_genelabel_here = is_zi_pred_genelabel[mask_sufficient_expression,ind_cell_type]
    print('Fraction of predicted ZI genes with avg expression > 1 for cell type {} :'.format(cell_type),
          is_zi_pred_genelabel_here.mean(), '\n')

Fraction of genes with avg expression > 1 for cell type Astrocyte : 0.449
Fraction of predicted ZI genes with avg expression > 1 for cell type Astrocyte : 0.6993318485523385 

Fraction of genes with avg expression > 1 for cell type Astrocyte-GFAP-Hi : 0.273
Fraction of predicted ZI genes with avg expression > 1 for cell type Astrocyte-GFAP-Hi : 0.7326007326007326 

Fraction of genes with avg expression > 1 for cell type Endothelial : 0.326
Fraction of predicted ZI genes with avg expression > 1 for cell type Endothelial : 0.6901840490797546 

Fraction of genes with avg expression > 1 for cell type ExN CUX2 ADARB2 : 0.743
Fraction of predicted ZI genes with avg expression > 1 for cell type ExN CUX2 ADARB2 : 0.5881561238223418 

Fraction of genes with avg expression > 1 for cell type ExN CUX2 LAMP5 : 0.785
Fraction of predicted ZI genes with avg expression > 1 for cell type ExN CUX2 LAMP5 : 0.70828025477707 

Fraction of genes with avg expression > 1 for cell type ExN FEZF2 : 0.772
Fracti