This notebook performs quality control (QC) on the droplets in the Anndata object that carry a valid barcode and contain a cell. QC is based on gene counts and on specific metrics related to cell viability. Even after droplet filtering, the range of UMI counts between cells is still large, and some cells may be dying or stressed, which makes them less informative about the biology in the data. Such cells can have a high percentage of mitochondrial counts (once the cell membrane is damaged, cytoplasmic transcripts leak out while transcripts enclosed in the mitochondria are retained) or of ribosomal counts: ribosomal transcripts are highly abundant, and if they are the most prevalent it is likely that the other transcripts leaked out and the cell was dying. In addition, a number of genes are associated with stress, such as heat-shock proteins (chaperones), and they help to determine the overall state even of cells considered "high-quality". Scanpy has built-in functions for calculating quality metrics such as the percentage of mitochondrial counts or the count/UMI distributions in the cells.

In [1]:
import matplotlib.pyplot as plt
import scanpy as sc
import numpy as np
import glob
from pathlib import Path
import warnings

The following functions calculate quality metrics, add them as cell-specific or gene-specific metadata (in .obs and .var respectively), and make diagnostic plots. Every cell therefore gets its own value of, for example, percentage of mitochondrial counts or number of UMIs. These metrics are usually correlated, e.g. dying cells tend to have both few total counts/genes and a high percentage of mitochondrial counts. For this reason the scatterplots place each cell by its UMI count (x) and by another quality metric (y).
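To make the per-cell metrics concrete, here is a minimal, dependency-free sketch of how a percentage metric such as `pct_counts_mito` relates to a cell's total counts. It is an illustration only, not how scanpy implements `calculate_qc_metrics`; the toy cells, their counts, and the helper `pct_counts` are made up for this example.

```python
# Toy data (made-up values): one dict per cell mapping gene symbol -> UMI count.
cells = {
    "cell_healthy": {"Actb": 900, "Gapdh": 80, "mt-Co1": 20},
    "cell_dying": {"Actb": 30, "Gapdh": 10, "mt-Co1": 60},
}


def pct_counts(counts, prefix):
    """Percentage of a cell's UMIs from genes whose lower-cased symbol starts with prefix."""
    total = sum(counts.values())
    flagged = sum(n for gene, n in counts.items() if gene.lower().startswith(prefix))
    return 100 * flagged / total if total else 0.0


# Per-cell metrics, analogous to columns stored in .obs
qc = {
    name: {
        "total_counts": sum(counts.values()),
        "pct_counts_mito": pct_counts(counts, "mt-"),
    }
    for name, counts in cells.items()
}

for name, metrics in qc.items():
    print(name, metrics)
```

In this toy case the dying cell combines few total counts with a high mitochondrial percentage, which is the pattern the scatterplots are meant to expose.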

In [2]:
def plot_pre(adata, sample, x_metric, y_metric, y_scale, thresholds, pre=True):
    '''
    Plot and save a QC scatterplot, either before filtering (with threshold lines) or after filtering.

    Args:
        adata (Anndata): Anndata object of a single sample
        sample (str): sample name, e.g. "20_nM_RA"
        x_metric (str): quality metric name (see prepare_for_qc()) for x
        y_metric (str): quality metric name (see prepare_for_qc()) for y
        y_scale (float): set limit for y axis for visibility
        thresholds (list): list of QC thresholds (floats) to plot as lines
        pre (bool): True if the data is not yet filtered (threshold lines are drawn), False if it is filtered

    Returns:
        None
    '''
    metric = y_metric[11:] # strip the "pct_counts_" prefix, leaving e.g. "mito" or "ribo"
    fig, ax = plt.subplots(figsize=(3,3))
    sc.pl.scatter(adata, x=x_metric, y=y_metric, show=False, ax=ax)
    ax.set_xscale('log') # only if plotting with n_genes_by_counts or total_counts. Not log if plotting log1p_n_genes_by_counts or log1p_total_counts
    ax.set_ylim(0, y_scale * 1.1)
    # ax.set_xlim(6,12) #when plotting log1p_ metrics. Otherwise disable
    ax.set_ylabel(f"Percent {metric} counts")
    ax.set_xlabel("Total UMI counts")
    ax.set_title(f"Sample '{sample}' \n{metric} counts", weight='bold', fontsize=12)

    # plot thresholds once those are determined (unfiltered data only)
    if pre:
        if metric == 'ribo':
            ax.axhline(thresholds[0], color='r', linestyle='dashed')
            # for filtering this value is on the log1p scale, but the x axis shows raw counts. Adjust if needed
            ax.axvline(np.expm1(float(thresholds[1])), color='r', linestyle='dashed')
        elif metric == 'mito':
            ax.axhline(thresholds[4], color='gray', linestyle='dashed')
            ax.axvline(thresholds[2], color='cyan', linestyle='dashed')
            ax.axvline(thresholds[3], color='cyan', linestyle='dashed')

    # set it for filename
    stage = 'pre' if pre else 'post'
    fig.savefig(f'new_data/plots/qc/pl_{metric}_{sample}_{stage}.png', dpi=300, bbox_inches='tight')
    plt.show()
    plt.close(fig)



def prepare_for_qc(adata, list_qc_genes):
    '''
    Prepare sample Anndata for QC by identifying quality-determining genes and 
    calculating quality metrics using these genes with scanpy 

    Args:
        adata (Anndata): Anndata object of a single sample
        list_qc_genes (str): comma-separated gene symbols (", " as separator) for gene metrics which require a group of specifically
        curated genes, e.g. stress genes, instead of a prefix (e.g. Rps- for ribosome genes) as is usually done for most gene metrics
        
    Returns:
        Anndata object: deep copy of adata with QC metrics slots added in .obs and .var
    '''
    adata.obs_names_make_unique()
    # for gene prefixes consult the gene symbols for your data's organism
    adata.var["mito"] = adata.var['gene_name'].str.lower().str.startswith("mt-", na=False)
    adata.var["ribo"] = adata.var['gene_name'].str.lower().str.startswith(("rps", "rpl"), na=False)
    adata.var["hsps"] = adata.var['gene_name'].str.lower().str.startswith(('hsp'), na=False) 
    adata.var["strs"] = adata.var['gene_name'].isin(list_qc_genes.split(', ')) 
    sc.pp.calculate_qc_metrics(adata, qc_vars=["mito", "ribo", "hsps", "strs"], inplace=True, percent_top=[20], log1p=True)
    adata.var_names_make_unique()
    return adata.copy()

    
def thresholds_n_filter(adata, dict_parameters, sample):
    '''
    Filter the sample-specific Anndata object based on the QC metrics. The thresholds themselves are contained within a dictionary. Filtering is done either
    manually or with scanpy functions. Also calls the plotting function defined above. In addition, at each QC step the number of removed cells (int) is
    added to the dictionary dict_diagn and the number of removed genes (int) to dict_genes.

    Args:
        adata (Anndata): Anndata object of a single sample
        dict_parameters (dict): dictionary of filtering thresholds for QC metrics. Keys: sample names (str), values: lists of thresholds (float). 
        sample (str): sample name, e.g. "20_nM_RA"
        
    Returns:
        Anndata object: deep copy of adata, filtered
    '''
    
    dict_diagn[sample] = [] # storing of cell and gene counts is on a sample (dataset)-specific basis
    dict_genes[sample] = [] # must exist before the number of removed genes is appended below

    thresholds=dict_parameters[sample]

    # isolate thresholds in named variables for readability
    ribo = dict_parameters[sample][0]  
    second_genes = dict_parameters[sample][1]
    min_counts= dict_parameters[sample][2]
    max_counts=dict_parameters[sample][3]
    mito=dict_parameters[sample][4]

    # plot datasets before filtering (last argument True: threshold lines are drawn)
    # alternatively plot log1p_total_counts and then set x thresholds in the function.
    plot_pre(adata, sample, "total_counts", "pct_counts_mito", 20, thresholds, True)
    plot_pre(adata, sample, "total_counts", "pct_counts_ribo", 60, thresholds, True)

    # store outliers in .obs slots 
    adata.obs['ribo_outlier'] = ((adata.obs['pct_counts_ribo'] > ribo) & \
                                 (adata.obs['log1p_total_counts'] < second_genes)) 
    adata.obs['mito_outlier'] = (adata.obs['pct_counts_mito'] > mito)

    initial_obs =adata.n_obs

    # remove cells with too few counts
    sc.pp.filter_cells(adata, min_counts=min_counts)
    min_genes_obs=initial_obs-adata.n_obs
    dict_diagn[sample].append(min_genes_obs)

    # remove cells with too many counts as they can be potential doublets (see Doublet_QC.ipynb)
    obs_before_max=adata.n_obs
    sc.pp.filter_cells(adata, max_counts=max_counts)
    max_genes_obs=obs_before_max-adata.n_obs # cells removed by the max_counts filter only
    dict_diagn[sample].append(max_genes_obs)
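
    # record the number of ribo outliers among the remaining cells, then remove them
    dict_diagn[sample].append(int(adata.obs.ribo_outlier.sum()))
    adata = adata[~adata.obs.ribo_outlier].copy() # copy so that later in-place filtering does not act on a view

    # record the number of mito outliers among the remaining cells, then remove them
    dict_diagn[sample].append(int(adata.obs.mito_outlier.sum()))
    adata = adata[~adata.obs.mito_outlier].copy()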

    # we do not put the final cell number after filtering in the diagnostic object since there are other filtering steps
    initial_var=adata.var.shape[0]

    # filter out genes underexpressed after removal of cells
    sc.pp.filter_genes(adata, min_cells=20)

    # plot datasets after filtering (last argument False: no threshold lines, files saved with the "post" suffix)
    plot_pre(adata, sample, "total_counts", "pct_counts_mito", 20, thresholds, False)
    plot_pre(adata, sample, "total_counts", "pct_counts_ribo", 60, thresholds, False)
    dict_genes[sample].append(initial_var-adata.var.shape[0]) 
    return adata.copy()
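
The cell-level rules applied in thresholds_n_filter() can be sketched without scanpy. This is a simplified illustration, not the notebook's implementation: the function `keep_cell` and the toy cells are hypothetical, the threshold order mirrors the lists in dict_parameters, and it assumes that the count limits are inclusive (cells exactly at `min_counts` or `max_counts` are kept).

```python
import math


def keep_cell(total_counts, pct_ribo, pct_mito, thresholds):
    """Return True if a cell passes all filters (hypothetical helper)."""
    ribo, log1p_cutoff, min_counts, max_counts, mito = thresholds
    if total_counts < min_counts or total_counts > max_counts:
        return False  # too few counts, or so many that the cell may be a doublet
    # ribo outlier: ribosome-rich AND low total counts (cutoff is on the log1p scale)
    if pct_ribo > ribo and math.log1p(total_counts) < log1p_cutoff:
        return False
    if pct_mito > mito:
        return False  # mito outlier
    return True


# Thresholds in the same order as for sample '20_nM_RA'
thresholds = [21, 8, math.expm1(7.8), 30000, 10]

# Made-up cells: (total_counts, pct_counts_ribo, pct_counts_mito)
toy_cells = {
    "a": (5000, 15, 3),    # passes everything
    "b": (1000, 15, 3),    # below min_counts
    "c": (2600, 30, 3),    # ribo-rich with low counts: ribo outlier
    "d": (8000, 30, 3),    # ribo-rich but enough counts: kept
    "e": (6000, 10, 15),   # mito outlier
    "f": (40000, 10, 3),   # above max_counts
}

kept = [name for name, cell in toy_cells.items() if keep_cell(*cell, thresholds)]
print(kept)
```

Cell "d" shows why the ribosomal rule has two parts: a high ribosomal percentage alone does not remove a cell that still has plenty of counts.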




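Dictionary of filtering thresholds, chosen per sample after inspecting the QC plots. Each list holds, in order: the maximum percentage of ribosomal counts, the log1p(total counts) cutoff used together with it, the minimum total counts, the maximum total counts, and the maximum percentage of mitochondrial counts. The first two values act jointly, i.e. a cell is removed as a ribosomal outlier only if its ribosomal percentage is above the first value and its log1p total counts are below the second.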

In [3]:
dict_diagn = {} # initialise dict with diagnostic values: number of cells removed at every QC step
dict_genes = {} # number of genes removed
dict_parameters = {'20_nM_RA': [21, 8, np.expm1(7.8), 30000, 10], 'ESLIF': [23, 7.9, np.expm1(7.6), 10000,10], '2i': [33, 8, np.expm1(7.4), 10000, 10],\
                   '5000_nM_RA': [22, 7.6, np.expm1(7.8), 35000, 10],'EB_2d': [15, 8.1, np.expm1(7.9), 25000, 10],'EB_8d': [23, 8.2, np.expm1(7.9), 30000, 10],\
                   'HD_2d': [10, 8.2, 1500, 30000, 10],'HD_8d': [10, 8.3, np.expm1(7.5), 40000, 10],'RA_0d': [28, 8.9, np.expm1(8.2),50000, 10], \
                   'RA_7d': [2, 8.3, 3500, 35000, 7.5]}
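
Several of the minimum-count thresholds above are written as `np.expm1(...)` because they were read off a log1p-scaled axis. The short sketch below, using only the standard library, shows how such a value maps back to raw UMI counts; the example cell with 2000 counts is made up.

```python
import math

log1p_cutoff = 7.8                      # value read from a log1p_total_counts axis
raw_cutoff = math.expm1(log1p_cutoff)   # inverse of log1p: exp(x) - 1
print(f"log1p cutoff {log1p_cutoff} corresponds to about {raw_cutoff:.0f} raw counts")

# The two scales are interchangeable: comparing on either one gives the same answer.
cell_total_counts = 2000                # made-up cell
fails_on_raw_scale = cell_total_counts < raw_cutoff
fails_on_log_scale = math.log1p(cell_total_counts) < log1p_cutoff
print(fails_on_raw_scale, fails_on_log_scale)
```

Because log1p is monotonic, a threshold can be stored on whichever scale matches the plot it was chosen from, as long as the comparison is made on the same scale.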

Custom gene lists of interest, e.g. stress genes

In [None]:
list_qc_genes ='Fos, Fosb, Fosl1, Fosl2, Atf3, Jun, Dnaja1, Gadd45g, Endog, Pclaf' # initialised as string for ease of pasting gene lists

To additionally view distributions of stress and other genes, violin plots can be made

In [5]:
def violin(adata, sample, y_metric, metric_name):
    '''
    Plot and save QC violin plots for metrics such as stress genes, to visualise distribution of these across samples. 
    
    Args:
        adata (Anndata): Anndata object of a single sample
        sample (str): sample name, e.g. "20_nM_RA"
        y_metric (str): quality metric name (see prepare_for_qc()) for y
        metric_name (str): readable metric name for the y-axis label, e.g. "Stress"

    Returns:
        None
    '''
    fig, ax = plt.subplots(figsize=(3,3))

    # the y_metric is, for example, percent stress gene counts out of all counts
    p = sc.pl.violin(adata, y_metric, show=False, ax=ax, linewidth=3, color='#069AAB', linecolor='#4E4E4E')
    plt.ylabel(f"Percent {metric_name} gene counts")
    plt.xlabel('QC metric')
    plt.title(label=f"{sample}", weight='bold')
    plt.savefig(f'new_data/plots/qc/pl_{y_metric[11:]}_{sample}.png', dpi=300, bbox_inches='tight')
    plt.show()
    plt.close()

Open the files that were filtered for empty droplets in the previous step (at this point an object should generally not contain millions of cells).

In [4]:
files = glob.glob('new_data/not_raw/**/*.*', recursive=True) 


In [None]:
for file in files:
    sample = Path(file).parts[2]
    adata = sc.read_h5ad(file)

    #estimate quality metrics
    adata=prepare_for_qc(adata, list_qc_genes)

    #plot stress gene plots (argument order: adata, sample, y_metric, metric_name)
    violin(adata, sample, 'pct_counts_strs', 'Stress')
    violin(adata, sample, 'pct_counts_hsps', 'Hsp')

    # this also outputs the scatterplots with the mito/ribo QC metrics
    adata = thresholds_n_filter(adata, dict_parameters, sample)
    adata.write_h5ad(f'new_data/qc_10x/{sample}/qc_{sample}.h5ad', compression='gzip') #export filtered files
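
The loop derives the sample name from the third component of each file path. The sketch below illustrates that step in isolation; the file name `filtered.h5ad` is a hypothetical example, and `PurePosixPath` is used here instead of `Path` only so that the result does not depend on the operating system.

```python
from pathlib import PurePosixPath


def sample_from_path(file):
    """Return the sample folder name, assuming the layout new_data/not_raw/<sample>/<file>."""
    return PurePosixPath(file).parts[2]


example = "new_data/not_raw/20_nM_RA/filtered.h5ad"  # hypothetical file name
sample = sample_from_path(example)
output = f"new_data/qc_10x/{sample}/qc_{sample}.h5ad"
print(sample)
print(output)
```

This only works if every input file sits exactly one folder below `new_data/not_raw/`; a deeper or shallower layout would need a different index.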

After all QC steps, additional diagnostic plots will be made to visualise the number of cells and genes removed at each step. The cell below saves the dictionaries for that purpose.

In [None]:
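# export the dict_diagn and dict_genes to file 
import pickle
try:
    # the context manager closes the file even if pickling fails
    with open('new_data/tables/dict_diagn', 'wb') as file:
        pickle.dump((dict_diagn, dict_genes), file)
except OSError as err:
    print(f"Could not save the diagnostic dictionaries: {err}")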
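
Since both dictionaries are pickled together as one tuple, they have to be unpacked in the same order when read back. A minimal round-trip sketch, with made-up numbers and a temporary directory instead of `new_data/tables/`:

```python
import pickle
import tempfile
from pathlib import Path

# Made-up diagnostic values for one sample
dict_diagn_demo = {"20_nM_RA": [120, 15, 40, 8]}
dict_genes_demo = {"20_nM_RA": [2500]}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "dict_diagn"

    # save both dictionaries as a single tuple
    with open(path, "wb") as fh:
        pickle.dump((dict_diagn_demo, dict_genes_demo), fh)

    # load them back, unpacking in the same order as they were saved
    with open(path, "rb") as fh:
        loaded_diagn, loaded_genes = pickle.load(fh)

print(loaded_diagn)
print(loaded_genes)
```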



In [None]:
dict_diagn

References:
1. Heumos, L., Schaar, A.C., Lance, C. et al. Best practices for single-cell analysis across modalities. Nat Rev Genet 24, 550–572 (2023). https://doi.org/10.1038/s41576-023-00586-w
2. The Galaxy Community. The Galaxy platform for accessible, reproducible, and collaborative data analyses: 2024 update. Nucleic Acids Research 52(W1), W83–W94 (2024). https://doi.org/10.1093/nar/gkae410. See also https://training.galaxyproject.org/training-material/topics/single-cell/tutorials/scrna-case-jupyter_basic-pipeline/tutorial.html for scatterplots like the ones here.