This notebook performs a quality-control step that identifies and removes doublets: two or more cells captured in a single droplet and therefore sequenced as one cell. Doublets can mix cells of different identities and are not biologically informative. For example, a doublet can look like an intermediate cell state that does not actually exist in the data, or it can confound differential expression (DE) and enrichment analyses [Wolock et al., 2019].
Scrublet is a Python framework that specifically models doublets (two cells in a droplet). For our data we assumed that higher-order aggregates of cells are rare enough to ignore. Doublet-detection algorithms remove the need to assess manually whether certain cells express transcripts that look like a mixture of distinct cell types. Such a manual approach would be further limited in developmental data, because less-differentiated states can transiently express markers of multiple lineages.
Scrublet simulates doublets by randomly combining (summing) pairs of single-cell transcriptomes from the data. It then scores each observed transcriptome by how many simulated doublets lie among its nearest neighbours.
The method of Wolock et al. distinguishes embedded from neotypic doublets. Embedded doublets form from similar transcriptomes (e.g. the same cell state); they are hard to identify but are also expected to have less impact on the data. Neotypic doublets introduce new, artificial cell states into the data, but they are easier to identify because their transcriptomes diverge from those of the other cells.
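The simulation step can be illustrated with a minimal NumPy sketch. It is not Scrublet's own code: the function name `simulate_doublets`, the toy Poisson count matrix and the number of simulated doublets are all assumptions made for illustration. Each simulated doublet is simply the sum of the raw counts of two randomly chosen cells.

```python
import numpy as np

def simulate_doublets(counts, n_doublets, seed=0):
    """Sum the counts of random pairs of cells (hypothetical helper, not the Scrublet API)."""
    rng = np.random.default_rng(seed)
    n_cells = counts.shape[0]
    # each row holds the indices of the two "parent" cells of one simulated doublet
    parents = rng.integers(0, n_cells, size=(n_doublets, 2))
    doublets = counts[parents[:, 0]] + counts[parents[:, 1]]
    return doublets, parents

# toy cells x genes count matrix (assumed data, for illustration only)
counts = np.random.default_rng(1).poisson(1.0, size=(100, 50))
doublets, parents = simulate_doublets(counts, n_doublets=200)
print(doublets.shape)
```

Because the parents are drawn at random, pairs from the same cell state and pairs from different cell states both occur, roughly in proportion to how common each state is.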
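A toy example may help show why neotypic doublets are the easier ones to detect. The sketch below is a simplification under stated assumptions: it works in a made-up 2-D embedding, approximates a doublet as the average of its two parents, and scores a point by the plain fraction of simulated doublets among its k nearest neighbours. Scrublet itself works on PCA-reduced counts and further adjusts the score for the expected doublet rate; the names `average_pairs` and `knn_simulated_fraction` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# two well-separated toy cell states in a 2-D embedding (assumed data)
state_a = rng.normal(loc=(0.0, 0.0), scale=0.5, size=(200, 2))
state_b = rng.normal(loc=(6.0, 0.0), scale=0.5, size=(200, 2))
singlets = np.vstack([state_a, state_b])

def average_pairs(first, second, n_pairs, rng):
    """Average random pairs of cells: a crude stand-in for a doublet in embedding space."""
    i = rng.integers(0, len(first), size=n_pairs)
    j = rng.integers(0, len(second), size=n_pairs)
    return (first[i] + second[j]) / 2.0

embedded = average_pairs(state_a, state_a, 50, rng)       # both parents from the same state
neotypic = average_pairs(state_a, state_b, 50, rng)       # parents from different states
simulated = average_pairs(singlets, singlets, 400, rng)   # reference set of simulated doublets

def knn_simulated_fraction(query, observed, simulated, k=20):
    """Fraction of simulated doublets among the k nearest neighbours of each query point."""
    reference = np.vstack([observed, simulated])
    is_simulated = np.concatenate([np.zeros(len(observed), dtype=bool),
                                   np.ones(len(simulated), dtype=bool)])
    distances = np.linalg.norm(query[:, None, :] - reference[None, :, :], axis=2)
    nearest = np.argsort(distances, axis=1)[:, :k]
    return is_simulated[nearest].mean(axis=1)

embedded_score = knn_simulated_fraction(embedded, singlets, simulated)
neotypic_score = knn_simulated_fraction(neotypic, singlets, simulated)
print(embedded_score.mean(), neotypic_score.mean())
```

In this toy setting the neotypic doublets fall between the two states, where almost all neighbours are simulated doublets, whereas the embedded doublets sit inside a state and share their neighbourhood with many real cells.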

In [1]:
import scrublet as scr
import matplotlib.pyplot as plt
import scanpy as sc
import numpy as np
import pandas as pd
import glob
from pathlib import Path


In [9]:
import matplotlib as mpl

# these parameters are hard to modify on the figures Scrublet outputs
mpl.rcParams['axes.titleweight'] = 'bold'
mpl.rcParams['axes.titlesize'] = 12

Locate the AnnData files from which low-quality cells have already been filtered

In [3]:
files = glob.glob('new_data/qc_10x/**/*.*', recursive=True)


Set the doublet-score threshold for each sample, chosen after inspecting the score histograms

In [None]:
dict_parameters = {'20_nM_RA': 0.23, 'ESLIF': 0.21, '2i': 0.23, '5000_nM_RA': 0.2, 'EB_2d': 0.23, 'EB_8d': 0.24,
                   'HD_2d': 0.14, 'HD_8d': 0.15, 'RA_0d': 0.2, 'RA_7d': 0.22}

Import the cell and gene diagnostic objects so that the numbers of cells and genes remaining after doublet filtering can be added

In [None]:
import pickle

# load the diagnostic dicts from 2_Filtering_QC to store doublet and gene numbers after doublet filtering
with open('dict_diagn', 'rb') as file:
    dict_cells, dict_genes = pickle.load(file)

for key in dict_cells:  # check that the dictionaries were saved properly
    print(key, '=>', dict_cells[key], dict_genes[key])


The functions below produce plots that help separate embedded from neotypic doublets and show the doublets in a dimensionally reduced space.
Scrublet's k-nearest-neighbour (KNN) classifier assigns each cell a doublet score that reflects how distinguishable it is from singlets in its transcriptional profile. The scores of the simulated doublets often follow a bimodal distribution in which the higher-scoring mode corresponds to neotypic doublets. After inspecting the doublet-score histograms, we therefore place a threshold between the two peaks; cells scoring above it are the neotypic doublets to be removed.
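The idea of placing a threshold between the two peaks can be sketched as follows. This is only an illustration on synthetic scores, not Scrublet output and not the procedure used in this notebook, where the thresholds were chosen by eye. The helper `valley_threshold` and the assumed peak positions are hypothetical: it returns the centre of the emptiest histogram bin lying between two user-supplied peak locations.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic bimodal doublet scores (assumed data): a large low-scoring mode and a smaller high-scoring mode
scores = np.concatenate([
    rng.normal(0.08, 0.02, size=2000),
    rng.normal(0.45, 0.05, size=500),
]).clip(0, 1)

def valley_threshold(scores, low_peak, high_peak, bins=50):
    """Return the centre of the emptiest histogram bin between two approximate peak positions."""
    hist, edges = np.histogram(scores, bins=bins, range=(0.0, 1.0))
    centres = (edges[:-1] + edges[1:]) / 2
    between = np.flatnonzero((centres > low_peak) & (centres < high_peak))
    return float(centres[between[np.argmin(hist[between])]])

threshold = valley_threshold(scores, low_peak=0.08, high_peak=0.45)
n_called = int((scores > threshold).sum())
print(threshold, n_called)
```

An automatic rule like this can be a useful starting point, but checking the histogram by eye remains advisable, since real score distributions are often less cleanly separated than this toy example.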

In [4]:
def diagnostic_plots(sample, threshold, scrub):
    
    '''
    Call doublets at the given threshold and plot the doublet-score histograms of the observed cells and the simulated doublets,
    to visualise the embedded vs neotypic doublet distributions on a sample-specific basis.

    Args:
        sample (str): sample (dataset) name
        threshold (float): doublet-score threshold; Scrublet draws it as a line separating the embedded and neotypic doublet distributions
        scrub (class): Instance of the Scrublet class
        
    Returns:
        None
    '''
    # get doublet distribution
    scrub.call_doublets(threshold=threshold)
    plt.figure(figsize=(4,3))
    scrub.plot_histogram()
    plt.title(sample, weight='bold', fontsize=12)
    plt.savefig(f'new_data/plots/doublet/db{sample}.png', dpi=300)
    plt.show()
    plt.close()

    
def umaps(sample, scrub):
    '''
    Make UMAP plots on a sample-specific basis to visualise the doublet scores across the data alongside the predicted (neotypic) doublets, which can be removed.
    Useful to see if the doublet identification went well. Every cluster in the data should have its own set of doublets. Scrublet pre-processes
    and normalises the count matrix internally for the UMAPs but does not alter it on the original AnnData objects.

    Args:
        sample (str): sample (dataset) name
        scrub (class): Instance of the Scrublet class
        
    Returns:
        None
    '''

    scrub.set_embedding('UMAP', scr.get_umap(scrub.manifold_obs_, 10, min_dist=0.3))
    plt.figure(figsize=(3,3))
    scrub.plot_embedding('UMAP', order_points=True)
    plt.title(sample,weight='bold', fontsize=12)
    plt.savefig(f'new_data/plots/doublet/{sample}_UMAP.png', dpi=300)
    plt.show()
    plt.close()

Doublet simulation and filtering are performed for each dataset. After cells are removed, some genes can end up expressed in very few cells. Single-cell data are inherently sparse, and most genes have zero rates of 70-90%, so removing even a small number of cells can push some genes below the minimum-expression cutoff.
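The effect of cell removal on gene filtering can be seen in a small NumPy sketch. The count matrix, the detection probabilities and the roughly 6% of cells flagged for removal are all made-up values, and `genes_passing` is a hypothetical helper that mimics a "detected in at least `min_cells` cells" rule.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_genes = 500, 200

# toy sparse count matrix (assumed data): each gene is detected in only a small fraction of cells
detect_prob = rng.uniform(0.01, 0.3, size=n_genes)
detected = rng.random((n_cells, n_genes)) < detect_prob
counts = rng.poisson(2.0, size=(n_cells, n_genes)) * detected

def genes_passing(counts, min_cells):
    """Boolean mask of genes detected (count > 0) in at least min_cells cells."""
    return (counts > 0).sum(axis=0) >= min_cells

keep_cells = rng.random(n_cells) > 0.06          # pretend about 6% of cells are flagged as doublets
before = genes_passing(counts, min_cells=20)
after = genes_passing(counts[keep_cells], min_cells=20)
lost = before & ~after                            # genes that pass before but fail after cell removal

print(before.sum(), after.sum(), lost.sum())
```

Removing cells can only lower the number of cells in which a gene is detected, so genes sitting just above the cutoff are the ones at risk of being dropped.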

In [5]:
for file in files:
    sample = Path(file).parts[2]
    if 'h5ad' in file:
        adata = sc.read_h5ad(file)
        counts = adata.X

        # initialise the Scrublet class. According to 10x, a doublet rate of 0.06 is suitable for around 8000 cells,
        # which is the size of most of our sample datasets. The gene filters (minimum counts and minimum number of expressing cells),
        # the percentile of genes kept as highly variable and the number of PCs kept in PCA are the values used in Scrublet's example workflow
        scrub = scr.Scrublet(counts, expected_doublet_rate=0.06)
        adata.obs['doublet_scores'], adata.obs['predicted_doublets'] = scrub.scrub_doublets(min_counts=2, 
                                                          min_cells=3, 
                                                          min_gene_variability_pctl=85, # percentile cutoff for the highly variable genes used in
                                                                                            # dimensionality reduction and doublet simulation
                                                          n_prin_comps=30) 
        # look up the sample-specific threshold, then plot the bimodal doublet-score distributions
        threshold = dict_parameters[sample]
        diagnostic_plots(sample, threshold, scrub)

        # plot doublets on UMAP
        umaps(sample, scrub)

        # neotypic doublets (the only ones readily identified)
        adata.obs['predicted_doublets'] = adata.obs['doublet_scores'] > threshold

        dict_cells[sample].append(int(adata.obs['predicted_doublets'].sum())) # add number of doublets that are removed to diagnostic obj
        keep = ~adata.obs['predicted_doublets']

        # remove doublets (copy so that the gene filtering below acts on a real object, not a view)
        adata = adata[keep, :].copy()

        # diagnostics
        initial_var = adata.n_vars  # number of genes before gene filtering
        sc.pp.filter_genes(adata, min_cells=20)  # filter out genes detected in fewer than 20 cells
        dict_cells[sample].append(adata.n_obs)  # final number of cells after filtering
        dict_genes[sample].append(initial_var - adata.n_vars)  # number of removed genes, added to the diagnostic object
        
        adata.write_h5ad(f'new_data/doublet_fil/{sample}/dbf{sample}.h5ad', compression='gzip')

Export the dictionary with filtering values

In [None]:
# export dict_cells and dict_genes to file
try:
    with open('new_data/tables/dict_diagn', 'wb') as file:
        pickle.dump((dict_cells, dict_genes), file)
except (OSError, pickle.PicklingError) as err:
    print(f"Error in saving or in the object: {err}")

References:
1. Wolock, S. L., Lopez, R. & Klein, A. M. Scrublet: Computational Identification of Cell Doublets in Single-Cell Transcriptomic Data. Cell Syst 8, 281-291.e9 (2019). https://doi.org/10.1016/j.cels.2018.11.005
2. Zheng, G., Terry, J., Belgrader, P. et al. Massively parallel digital transcriptional profiling of single cells. Nat Commun 8, 14049 (2017). https://doi.org/10.1038/ncomms14049