# Single cell data analysis using Scanpy

* __Notebook version__: `v0.0.1`
* __Created by:__ `Imperial BRC Genomics Facility`
* __Maintained by:__ `Imperial BRC Genomics Facility`
* __Docker image:__ `imperialgenomicsfacility/scanpy-notebook-image:release-v0.0.1`
* __Github repository:__ [imperial-genomics-facility/scanpy-notebook-image](https://github.com/imperial-genomics-facility/scanpy-notebook-image)
* __Created on:__ `2020-Jan-19 21:20`
* __Contact us:__ [Imperial BRC Genomics Facility](https://www.imperial.ac.uk/medicine/research-and-impact/facilities/genomics-facility/contact/)
* __License:__ [Apache License 2.0](https://github.com/imperial-genomics-facility/scanpy-notebook-image/blob/master/LICENSE)

## Table of contents
  * [Introduction](#Introduction)
  * [Loading required libraries](#Loading-required-libraries)
  * [Reading data from Cellranger output](#Reading-data-from-Cellranger-output)
  * [Data processing and visualization](#Data-processing-and-visualization)
    * [Checking highly variable genes](#Checking-highly-variable-genes)
    * [Quality control](#Quality-control)
      * [Computing metrics for cell QC](#Computing-metrics-for-cell-QC)
      * [Plotting MT gene fractions](#Plotting-MT-gene-fractions)
      * [Count depth distribution](#Count-depth-distribution)
      * [Gene count distribution](#Gene-count-distribution)
      * [Counting cells per gene](#Counting-cells-per-gene)
      * [Plotting count depth vs MT fraction](#Plotting-count-depth-vs-MT-fraction)
      * [Checking thresholds and filtering data](#Checking-thresholds-and-filtering-data)
    * [Normalization](#Normalization)
    * [Highly variable genes](#Highly-variable-genes)
    * [Regressing out technical effects](#Regressing-out-technical-effects)
    * [Principal component analysis](#Principal-component-analysis)
    * [Neighborhood graph](#Neighborhood-graph)
      * [Clustering the neighborhood graph](#Clustering-the-neighborhood-graph)
      * [Embed the neighborhood graph using UMAP](#Embed-the-neighborhood-graph-using-UMAP)
        * [Plotting 2D UMAP](#Plotting-2D-UMAP)
      * [Embed the neighborhood graph using tSNE](#Embed-the-neighborhood-graph-using-tSNE)
    * [Finding marker genes](#Finding-marker-genes)
      * [Stacked violin plot of ranked genes](#Stacked-violin-plot-of-ranked-genes)
      * [Dot plot of ranked genes](#Dot-plot-of-ranked-genes)
      * [Matrix plot of ranked genes](#Matrix-plot-of-ranked-genes)
      * [Heatmap plot of ranked genes](#Heatmap-plot-of-ranked-genes)
      * [Tracks plot of ranked genes](#Tracks-plot-of-ranked-genes)
  * [References](#References)
  * [Acknowledgement](#Acknowledgement)

## Introduction
This notebook runs a single cell data analysis (for a single sample) using the Scanpy package. Most of the code and documentation used in this notebook has been copied from the following sources:

* [Scanpy - Preprocessing and clustering 3k PBMCs](https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html)
* [Single-cell-tutorial](https://github.com/theislab/single-cell-tutorial)

## Loading required libraries

We need to load all the required libraries into the environment before we can run any of the analysis steps. We also print the version information for the major packages used in the analysis.

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import scanpy as sc
import seaborn as sns
import matplotlib.pyplot as plt
from copy import deepcopy
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
sc.settings.verbosity = 0
sc.logging.print_versions()

We are setting the output file path to $/tmp/scanpy\_output.h5ad$

In [None]:
results_file = '/tmp/scanpy_output.h5ad'

The following steps are only required for downloading test data from the 10X Genomics website.

In [None]:
%%bash
## DELETE ME
rm -rf cache
rm -rf /tmp/data
mkdir -p /tmp/data
wget -q -O /tmp/data/pbmc3k_filtered_gene_bc_matrices.tar.gz \
  http://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz
cd /tmp/data
tar -xzf pbmc3k_filtered_gene_bc_matrices.tar.gz

<div align="right"><a href="#Table-of-contents">Go to TOC</a></div>

## Reading data from Cellranger output

Load the Cellranger output to Scanpy

In [None]:
adata = \
  sc.read_10x_mtx(
    '/tmp/data/filtered_gene_bc_matrices/hg19/',
    var_names='gene_symbols',
    cache=True)

Converting the gene names to unique values

In [None]:
adata.var_names_make_unique()

Checking the data dimensions before checking QC

In [None]:
adata

<div align="right"><a href="#Table-of-contents">Go to TOC</a></div>

## Data processing and visualization

### Checking highly variable genes

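For each gene, we compute the fraction of counts assigned to it in each cell. The 20 genes with the highest mean fraction over all cells are plotted as boxplots. Note that this plot shows the most highly expressed genes; the selection of highly variable genes happens later in the notebook.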

In [None]:
sc.pl.highest_expr_genes(adata, n_top=20)
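
The quantity summarized by this plot can be reproduced by hand on a toy matrix. The following is a minimal sketch with made-up counts and gene names (`counts`, `gene_names` and the values are assumptions for illustration), not Scanpy's internal code.

```python
import numpy as np

# Hypothetical toy count matrix: 3 cells (rows) x 4 genes (columns).
counts = np.array([
    [10, 0, 5, 5],
    [2, 2, 4, 12],
    [0, 1, 1, 8],
], dtype=float)
gene_names = np.array(["GeneA", "GeneB", "GeneC", "GeneD"])  # made-up names

# Fraction of each cell's total counts that is assigned to each gene.
fractions = counts / counts.sum(axis=1, keepdims=True)

# Mean fraction per gene over all cells, sorted from highest to lowest.
mean_fraction = fractions.mean(axis=0)
order = np.argsort(mean_fraction)[::-1]
top_genes = gene_names[order][:2]
print(top_genes)
```

In the plot, each box shows the distribution of one column of `fractions`, and the genes are ordered by `mean_fraction`.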

<div align="right"><a href="#Table-of-contents">Go to TOC</a></div>

### Quality control

Checking $obs$ section of the AnnData object

In [None]:
adata.obs.head()

Checking the $var$ section of the AnnData object

In [None]:
adata.var.head()

<div align="right"><a href="#Table-of-contents">Go to TOC</a></div>

#### Computing metrics for cell QC

Listing the Mitochondrial genes detected in the cell population

In [None]:
# Human mitochondrial gene symbols start with 'MT-'
mt_genes = [gene for gene in adata.var_names if gene.startswith('MT-')]
mito_genes = adata.var_names.str.startswith('MT-')
if len(mt_genes)==0:
    # Fall back to the lower-case 'mt-' prefix (e.g. mouse gene symbols)
    print('Looking for mito genes with "mt-" prefix')
    mt_genes = [gene for gene in adata.var_names if gene.startswith('mt-')]
    mito_genes = adata.var_names.str.startswith('mt-')

if len(mt_genes)==0:
    print("No mitochondrial genes found")
else:
    print("Mitochondrial genes: count: {0}, list: {1}".format(len(mt_genes),mt_genes))

Typical measures for assessing the quality of a cell include the following components
* Number of molecule counts (UMIs or $n\_counts$ )
* Number of expressed genes ($n\_genes$)
* Fraction of counts that are mitochondrial ($percent\_mito$)

We calculate the above-mentioned metrics using the following code

In [None]:
# Per-cell totals; .A1 flattens the matrix returned by sparse sums into a 1D array
adata.obs['n_counts'] = adata.X.sum(axis=1).A1
adata.obs['mito_counts'] = adata[:, mito_genes].X.sum(axis=1).A1
# Despite its name, percent_mito is a fraction between 0 and 1
adata.obs['percent_mito'] = adata.obs['mito_counts'] / adata.obs['n_counts']
adata.obs['log_counts'] = np.log(adata.obs['n_counts'])
adata.obs['n_genes'] = (adata.X > 0).sum(axis=1).A1
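
To make the three metrics concrete, here is a minimal sketch on a dense toy matrix. The counts and the small gene list are assumptions chosen for illustration; the real data is a sparse matrix held in the AnnData object.

```python
import numpy as np

# Hypothetical toy data: 3 cells x 4 genes; the last two genes are mitochondrial.
counts = np.array([
    [4, 0, 1, 0],
    [0, 2, 3, 5],
    [6, 6, 0, 0],
])
gene_names = np.array(["ACTB", "CD3E", "MT-CO1", "MT-ND1"])
is_mito = np.char.startswith(gene_names, "MT-")

n_counts = counts.sum(axis=1)                 # total UMI counts per cell
n_genes = (counts > 0).sum(axis=1)            # genes with at least one count
mito_counts = counts[:, is_mito].sum(axis=1)  # counts from mitochondrial genes
percent_mito = mito_counts / n_counts         # a fraction in [0, 1]
print(n_counts, n_genes, percent_mito)
```

The second toy cell, with 8 of its 10 counts coming from mitochondrial genes, is the kind of barcode that the mitochondrial fraction filter is meant to catch.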

Checking $obs$ section of the AnnData object again

In [None]:
adata.obs.head()

Sorting barcodes based on the $percent\_mito$ column

In [None]:
adata.obs.sort_values('percent_mito',ascending=False).head()

A high fraction of mitochondrial reads can indicate cell stress, as there is a low proportion of nuclear mRNA in the cell. It should be noted that high mitochondrial RNA fractions can also be biological signals indicating elevated respiration.

Cell barcodes with low count depth, few detected genes and a high fraction of mitochondrial counts may indicate cells whose cytoplasmic mRNA has leaked out through a broken membrane, so that only the mRNA located in the mitochondria has survived.

Cells with high UMI counts and many detected genes may represent doublets (this requires further checking).

<div align="right"><a href="#Table-of-contents">Go to TOC</a></div>

#### Plotting MT gene fractions

In [None]:
sc.pl.violin(\
  adata,
  ['n_genes', 'n_counts', 'percent_mito'],
  jitter=0.4,
  log=False,
  multi_panel=True)

The violin plot above shows the computed quality measures: UMI counts, gene counts and fraction of mitochondrial counts.

In [None]:
ax = sc.pl.scatter(adata, 'n_counts', 'n_genes', color='percent_mito',show=False)
ax.set_title('Fraction mitochondrial counts', fontsize=12)
ax.set_xlabel("Count depth",fontsize=12)
ax.set_ylabel("Number of genes",fontsize=12)
ax.tick_params(labelsize=12)
ax.axhline(700, 0,1, color='red')
ax.axvline(1500, 0,1, color='red')

The above scatter plot shows the number of genes vs the number of counts, colored by the $MT$ fraction. We will be using a cutoff of 1500 counts and 700 genes (<span style="color:red">red lines</span>) to filter out dying cells.

In [None]:
ax = sc.pl.scatter(adata[adata.obs['n_counts']<10000], 'n_counts', 'n_genes', color='percent_mito',show=False)
ax.set_title('Fraction mitochondrial counts', fontsize=12)
ax.set_xlabel("Count depth",fontsize=12)
ax.set_ylabel("Number of genes",fontsize=12)
ax.tick_params(labelsize=12)
ax.axhline(700, 0,1, color='red')
ax.axvline(1500, 0,1, color='red')

A similar scatter plot, but this time restricted to cells with fewer than _10K_ counts.

<div align="right"><a href="#Table-of-contents">Go to TOC</a></div>

#### Count depth distribution

In [None]:
count_data = adata.obs['n_counts'].copy()
count_data.sort_values(inplace=True, ascending=False)
order =  range(1, len(count_data)+1)
ax = plt.semilogy(order, count_data, 'b-')
plt.gca().axhline(1500, 0,1, color='red')
plt.xlabel("Barcode rank", fontsize=12)
plt.ylabel("Count depth", fontsize=12)
plt.tick_params(labelsize=12)

The above plot is similar to the _UMI counts_ vs _Barcodes_ plot of the Cellranger report. It shows the count depth distribution from high to low count depths and can be used to decide the count depth threshold for filtering out empty droplets.

In [None]:
ax = sns.distplot(adata.obs['n_counts'], kde=False)
ax.set_xlabel("Count depth",fontsize=12)
ax.set_ylabel("Frequency",fontsize=12)
ax.axvline(1500, 0,1, color='red')

The above histogram plot shows the distribution of count depth and the <span style="color:red">red line</span> marks the count threshold 1500.

In [None]:
if (adata.obs['n_counts'].max() - 10000)> 10000:
    print('Checking counts above 10K')
    ax = sns.distplot(adata.obs['n_counts'][adata.obs['n_counts']>10000], kde=False, bins=60)
    ax.set_xlabel("Count depth",fontsize=12)
    ax.set_ylabel("Frequency",fontsize=12)
else:
    print("Skip checking counts above 10K")

In [None]:
if adata.obs['n_counts'].max() > 2000:
  print('Zooming into first 2000 counts')
  ax = sns.distplot(adata.obs['n_counts'][adata.obs['n_counts']<2000], kde=False, bins=60)
  ax.set_xlabel("Count depth",fontsize=12)
  ax.set_ylabel("Frequency",fontsize=12)
  ax.axvline(1500, 0,1, color='red')
else:
  print("Failed to zoom into the counts below 2K")

<div align="right"><a href="#Table-of-contents">Go to TOC</a></div>

#### Gene count distribution

In [None]:
ax = sns.distplot(adata.obs['n_genes'], kde=False)
ax.set_xlabel("Number of genes",fontsize=12)
ax.set_ylabel("Frequency",fontsize=12)
ax.tick_params(labelsize=12)
ax.axvline(700, 0,1, color='red')

The above histogram plot shows the distribution of gene counts and the <span style="color:red">red line</span> marks the gene count threshold 700.

In [None]:
if adata.obs['n_genes'].max() > 1000:
  print('Zooming into first 1000 gene counts')
  ax = sns.distplot(adata.obs['n_genes'][adata.obs['n_genes']<1000], kde=False,bins=60)
  ax.set_xlabel("Number of genes",fontsize=12)
  ax.set_ylabel("Frequency",fontsize=12)
  ax.tick_params(labelsize=12)
  ax.axvline(700, 0,1, color='red')
else:
  print("Failed to zoom into the gene counts below 1K")

We use permissive filtering thresholds of 1500 counts and 700 genes to filter out dying cells and empty droplets with ambient RNA.

<div align="right"><a href="#Table-of-contents">Go to TOC</a></div>

#### Counting cells per gene

In [None]:
# Number of cells in which each gene is detected; .A1 flattens the result to 1D
adata.var['cells_per_gene'] = (adata.X > 0).sum(axis=0).A1

ax = sns.distplot(adata.var['cells_per_gene'][adata.var['cells_per_gene'] < 100], kde=False, bins=60)
ax.set_xlabel("Number of cells",fontsize=12)
ax.set_ylabel("Frequency",fontsize=12)
ax.set_title('Cells per gene', fontsize=12)
ax.tick_params(labelsize=12)

<div align="right"><a href="#Table-of-contents">Go to TOC</a></div>

#### Plotting count depth vs MT fraction

In [None]:
ax = sc.pl.scatter(adata, x='n_counts', y='percent_mito',show=False)
ax.set_title('Count depth vs Fraction mitochondrial counts', fontsize=12)
ax.set_xlabel("Count depth",fontsize=12)
ax.set_ylabel("Fraction mitochondrial counts",fontsize=12)
ax.tick_params(labelsize=12)
ax.axhline(0.2, 0,1, color='red')

The scatter plot shows the count depth vs the fraction of mitochondrial counts. The <span style="color:red">red line</span> marks the default cutoff value of 0.2 for the MT fraction.

<div align="right"><a href="#Table-of-contents">Go to TOC</a></div>

#### Checking thresholds and filtering data

In [None]:
print('Total number of cells: {0}'.format(adata.n_obs))

min_counts_threshold = 1500
max_counts_threshold = 40000
min_gene_counts_threshold = 700
max_mito_pct_threshold = 0.2

if adata[adata.obs['n_counts'] > min_counts_threshold].n_obs < 1000:
    min_counts_threshold = 1000

if adata[adata.obs['n_counts'] < max_counts_threshold].n_obs < 1000:
    max_counts_threshold = 50000
    
if adata[adata.obs['n_genes'] > min_gene_counts_threshold].n_obs < 1000:
    min_gene_counts_threshold = 400
    
if adata[adata.obs['percent_mito'] < max_mito_pct_threshold].n_obs < 1000:
    max_mito_pct_threshold = 0.02

sc.pp.filter_cells(adata, min_counts = min_counts_threshold)
print('Number of cells after min count ({0}) filter: {1}'.format(min_counts_threshold,adata.n_obs))

sc.pp.filter_cells(adata, max_counts = max_counts_threshold)
print('Number of cells after max count ({0}) filter: {1}'.format(max_counts_threshold,adata.n_obs))

sc.pp.filter_cells(adata, min_genes = min_gene_counts_threshold)
print('Number of cells after gene ({0}) filter: {1}'.format(min_gene_counts_threshold,adata.n_obs))

# Take a copy so that later in-place steps work on real data, not on a view
adata = adata[adata.obs['percent_mito'] < max_mito_pct_threshold].copy()
print('Number of cells after MT fraction ({0}) filter: {1}'.format(max_mito_pct_threshold,adata.n_obs))

print('Total number of cells after filtering: {0}'.format(adata.n_obs))
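
The combined effect of the four filters can be sketched as a single boolean mask. The QC values below are hypothetical; the thresholds are the default values set in the cell above, before any fallback adjustment.

```python
import numpy as np

# Hypothetical per-cell QC values for five barcodes.
n_counts = np.array([900, 2500, 45000, 3000, 5000])
n_genes = np.array([300, 900, 4000, 650, 1500])
percent_mito = np.array([0.05, 0.03, 0.04, 0.02, 0.35])

# A cell is kept only if it passes every filter.
keep = (
    (n_counts >= 1500)        # minimum count depth
    & (n_counts <= 40000)     # maximum count depth
    & (n_genes >= 700)        # minimum number of detected genes
    & (percent_mito < 0.2)    # maximum mitochondrial fraction
)
print(keep)
```

Only the second barcode passes here: the others fail on low depth, high depth, too few genes and a high mitochondrial fraction, respectively.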

<div align="right"><a href="#Table-of-contents">Go to TOC</a></div>

### Normalization

In [None]:
sc.pp.normalize_total(adata, target_sum=1e4)

We are using a simple total-count based normalization (library-size correction) to scale the data matrix $X$ to 10,000 counts per cell, so that counts become comparable among cells.

In [None]:
sc.pp.log1p(adata)

Then we logarithmize the data matrix.
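
The two transformations above can be written out in a few lines. This is a minimal sketch on a hypothetical dense matrix (`counts` is made up), meant to illustrate the arithmetic and not to replace the Scanpy functions.

```python
import numpy as np

# Hypothetical counts: 2 cells x 3 genes with very different sequencing depth.
counts = np.array([
    [1.0, 3.0, 0.0],
    [10.0, 30.0, 60.0],
])
target_sum = 1e4

# Divide each cell by its own size factor so that every row sums to target_sum.
size_factors = counts.sum(axis=1, keepdims=True) / target_sum
normalized = counts / size_factors

# log(1 + x) keeps zeros at zero and compresses large values.
logged = np.log1p(normalized)
print(normalized)
```

After normalization both cells sum to 10,000, so a gene's value reflects its share of the cell's counts and no longer the cell's sequencing depth.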

In [None]:
adata.raw = adata

Copying the normalized and logarithmized gene expression data (all genes, before any gene filtering) to the `.raw` attribute of the AnnData object for later use.

<div align="right"><a href="#Table-of-contents">Go to TOC</a></div>

### Highly variable genes

The following code blocks are used to identify the highly variable genes (HVGs) to further reduce the dimensionality of the dataset and to include only the most informative genes. HVGs will be used for clustering, trajectory inference, and dimensionality reduction/visualization.

In [None]:
sc.pp.highly_variable_genes(adata, flavor='seurat', min_mean=0.0125, max_mean=3, min_disp=0.5)
# Count how many genes were flagged as highly variable
seurat_hvg = np.sum(adata.var['highly_variable'])
print("Counts of HVGs: {0}".format(seurat_hvg))
sc.pl.highly_variable_genes(adata)

We use the 'seurat' flavor for HVG detection. Then, we run the following code to do the actual filtering of the data. The plots show how the data was normalized to select highly variable genes irrespective of the mean expression of the genes. This is achieved by using the index of dispersion, which divides the variance by the mean expression, and subsequently binning the genes by mean expression and selecting the most variable genes within each bin.

In [None]:
# Take a copy so that later in-place steps work on real data, not on a view
adata = adata[:, adata.var.highly_variable].copy()
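
The idea of binning genes by mean expression and normalizing the dispersion within each bin can be sketched as follows. This is a simplified illustration on simulated data and is not Scanpy's exact 'seurat' implementation; the number of bins, the simulated matrix and the artificially over-dispersed gene 0 are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated expression: 200 cells x 50 genes with Poisson noise.
X = rng.poisson(rng.uniform(0.5, 5.0, size=50), size=(200, 50)).astype(float)
X[:, 0] = np.tile([0.0, 6.0], 100)  # gene 0 is made artificially over-dispersed

mean = X.mean(axis=0)
dispersion = X.var(axis=0) / mean   # index of dispersion (variance / mean)

# Assign genes to bins of similar mean expression (quantile-based edges).
n_bins = 5
edges = np.quantile(mean, np.linspace(0, 1, n_bins + 1))
bin_idx = np.digitize(mean, edges[1:-1])

# Z-score the dispersion within each bin.
norm_disp = np.empty_like(dispersion)
for b in range(n_bins):
    in_bin = bin_idx == b
    d = dispersion[in_bin]
    norm_disp[in_bin] = (d - d.mean()) / d.std()

highly_variable = norm_disp > 0.5   # same cutoff value as min_disp above
print(highly_variable.sum())
```

Because each gene is only compared with genes of similar mean expression, a gene is selected for being unusually variable for its expression level and not merely for being highly expressed.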

<div align="right"><a href="#Table-of-contents">Go to TOC</a></div>

### Regressing out technical effects

Normalization scales count data to make gene counts comparable between cells, but the data still contains unwanted variability. One of the most prominent technical covariates in single-cell data is count depth. Regressing out the effects of total counts per cell and the fraction of mitochondrial counts can improve the performance of trajectory inference algorithms.

In [None]:
sc.pp.regress_out(adata, ['n_counts', 'percent_mito'])

Scale each gene to unit variance. Clip values exceeding standard deviation 10.

In [None]:
sc.pp.scale(adata, max_value=10)
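
Conceptually, regressing out a covariate means fitting a linear model per gene and keeping the residuals; scaling then standardizes each gene. The sketch below does this for a single simulated gene with plain least squares. The data is made up, and Scanpy's own implementation may differ in its details.

```python
import numpy as np

rng = np.random.default_rng(1)
n_cells = 100
depth = rng.uniform(1.0, 3.0, size=n_cells)               # hypothetical covariate
expr = 2.0 * depth + rng.normal(0.0, 0.1, size=n_cells)   # gene driven by depth

# Least-squares fit of expression on the covariate (with an intercept).
design = np.column_stack([np.ones(n_cells), depth])
coef, *_ = np.linalg.lstsq(design, expr, rcond=None)
residual = expr - design @ coef   # what remains after removing the covariate

# Scale to zero mean and unit variance, then cap large values at 10.
scaled = (residual - residual.mean()) / residual.std()
scaled = np.clip(scaled, None, 10)
print(np.corrcoef(depth, residual)[0, 1])
```

The residuals are uncorrelated with the covariate by construction, which is why the technical effect no longer drives the downstream PCA.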

<div align="right"><a href="#Table-of-contents">Go to TOC</a></div>

### Principal component analysis

Reduce the dimensionality of the data by running principal component analysis (PCA), which reveals the main axes of variation and denoises the data.

In [None]:
sc.tl.pca(adata, svd_solver='arpack')

In [None]:
sc.pl.pca(adata,color=['CST3'])

Let us inspect the contribution of single PCs to the total variance in the data. This gives us information about how many PCs we should consider in order to compute the neighborhood relations of cells.

In [None]:
sc.pl.pca_variance_ratio(adata, log=True)
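
The variance ratio shown in this plot can be computed directly from a singular value decomposition of the centered data. This is a minimal sketch on a simulated matrix with one dominant direction of variation; the data and variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
# Simulated scaled matrix: 100 cells x 5 genes, one strong latent factor plus noise.
latent = rng.normal(size=(100, 1))
loadings = np.array([[3.0, 2.0, 1.0, 0.0, 0.0]])
X = latent @ loadings + rng.normal(scale=0.5, size=(100, 5))

# PCA via SVD of the column-centered matrix.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * s                          # cell coordinates in PC space
variance_ratio = s**2 / np.sum(s**2)    # share of variance per component
print(variance_ratio)
```

Here the first component captures most of the variance and the rest are noise; in real data, the point where the curve flattens suggests how many PCs to pass on to the neighborhood graph.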

<div align="right"><a href="#Table-of-contents">Go to TOC</a></div>

### Neighborhood graph
Computing the neighborhood graph of cells using the PCA representation of the data matrix.

In [None]:
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
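
At its core, the neighborhood graph links each cell to its nearest cells in PCA space. The following brute-force sketch shows that idea on hypothetical coordinates. Scanpy additionally turns distances into connectivity weights and can use approximate search, neither of which is reproduced here.

```python
import numpy as np

rng = np.random.default_rng(3)
pcs = rng.normal(size=(30, 5))   # hypothetical PCA coordinates: 30 cells x 5 PCs
k = 4                            # number of neighbours to keep per cell

# Pairwise Euclidean distances between all cells.
diff = pcs[:, None, :] - pcs[None, :, :]
dist = np.sqrt((diff**2).sum(axis=-1))
np.fill_diagonal(dist, np.inf)   # exclude each cell from its own neighbour list

# Indices of the k closest cells for every cell.
neighbors = np.argsort(dist, axis=1)[:, :k]
print(neighbors[:3])
```

Clustering and the UMAP embedding both operate on this graph and not on the expression matrix itself, so the choice of neighbors and PCs affects all downstream steps.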

<div align="right"><a href="#Table-of-contents">Go to TOC</a></div>

#### Clustering the neighborhood graph
Scanpy documentation recommends the Leiden graph-clustering method (community detection based on optimizing modularity) by Traag *et al.* (2018). Note that Leiden clustering directly clusters the neighborhood graph of cells, which we have already computed in the previous section.

In [None]:
sc.tl.leiden(adata)

<div align="right"><a href="#Table-of-contents">Go to TOC</a></div>

#### Embed the neighborhood graph using UMAP

Scanpy documentation suggests embedding the graph in 2 dimensions using UMAP (McInnes et al., 2018), see below. It is potentially more faithful to the global connectivity of the manifold than tSNE, i.e., it better preserves trajectories.


<div align="right"><a href="#Table-of-contents">Go to TOC</a></div>

##### Plotting 2D UMAP

In [None]:
sc.tl.umap(adata,n_components=2)

In [None]:
sc.pl.umap(adata, color=['CST3'])

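Plot the Leiden clusters on the UMAP. Setting `use_raw=False` selects the scaled and corrected gene expression instead of `.raw`, which matters when coloring by a gene.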

In [None]:
sc.pl.umap(adata, color=['leiden'],use_raw=False)

<div align="right"><a href="#Table-of-contents">Go to TOC</a></div>

#### Embed the neighborhood graph using tSNE

In [None]:
sc.tl.tsne(adata,n_pcs=40)

In [None]:
sc.pl.tsne(adata, color=['leiden'])

<div align="right"><a href="#Table-of-contents">Go to TOC</a></div>

### Finding marker genes

Let us compute a ranking of the most differentially expressed genes in each cluster. For this, by default, the `.raw` attribute of AnnData is used in case it has been initialized before. The simplest and fastest method to do so is the t-test.

In [None]:
sc.tl.rank_genes_groups(adata, 'leiden', method='t-test')
sc.pl.rank_genes_groups(adata, n_genes=20, sharey=False)

The result of a Wilcoxon rank-sum (Mann-Whitney-U) test is very similar (Soneson & Robinson (2018)).

In [None]:
sc.tl.rank_genes_groups(adata, 'leiden', method='wilcoxon')
sc.pl.rank_genes_groups(adata, n_genes=20, sharey=False)
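
The ranking logic for the t-test can be sketched as a one-vs-rest comparison: for each cluster, every gene is scored by comparing the cells inside the cluster with all other cells. The example below assumes a Welch-type t-statistic and simulated data; `welch_t` is a hypothetical helper and not a Scanpy function.

```python
import numpy as np

rng = np.random.default_rng(4)
labels = np.repeat(["0", "1"], 50)   # hypothetical cluster labels for 100 cells
expr = rng.normal(size=(100, 3))     # simulated expression: 100 cells x 3 genes
expr[labels == "0", 1] += 2.0        # gene 1 is made a marker of cluster "0"

def welch_t(in_group, rest):
    """Welch's t-statistic per gene (columns) for two groups of cells (rows)."""
    mean_diff = in_group.mean(axis=0) - rest.mean(axis=0)
    se = np.sqrt(in_group.var(axis=0, ddof=1) / len(in_group)
                 + rest.var(axis=0, ddof=1) / len(rest))
    return mean_diff / se

# Score cluster "0" against all remaining cells and rank genes by score.
scores = welch_t(expr[labels == "0"], expr[labels != "0"])
ranking = np.argsort(scores)[::-1]   # best-scoring gene first
print(ranking, scores)
```

Repeating this for every cluster yields one ranked gene list per cluster, which is the kind of table that the plots in this section summarize.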

Show the top 5 ranked genes per cluster in a dataframe.

In [None]:
pd.DataFrame(adata.uns['rank_genes_groups']['names']).head(5)

<div align="right"><a href="#Table-of-contents">Go to TOC</a></div>

#### Stacked violin plot of ranked genes
Plot marker genes per cluster using stacked violin plots

In [None]:
sc.pl.rank_genes_groups_stacked_violin(
    adata, n_genes=10,groupby='leiden',swap_axes=False,figsize=(20,10))

<div align="right"><a href="#Table-of-contents">Go to TOC</a></div>

#### Dot plot of ranked genes
The dotplot visualization provides a compact way of showing, per group, the fraction of cells expressing a gene (dot size) and the mean expression of the gene in those cells (color scale).

In [None]:
sc.pl.rank_genes_groups_dotplot(
    adata, n_genes=10,groupby='leiden', dendrogram=True,figsize=(20,10))
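
The two quantities encoded by each dot can be computed by hand for a single gene. This is a minimal sketch with made-up expression values and cluster labels; it averages over all cells of a group, which is an assumption of this example.

```python
import numpy as np

# Hypothetical expression of one gene in six cells, with their cluster labels.
expr = np.array([0.0, 2.0, 4.0, 0.0, 0.0, 3.0])
cluster = np.array(["0", "0", "0", "1", "1", "1"])

summary = {}
for c in np.unique(cluster):
    values = expr[cluster == c]
    summary[str(c)] = {
        "fraction_expressing": float((values > 0).mean()),  # dot size
        "mean_expression": float(values.mean()),            # dot color
    }
print(summary)
```

Averaging only over the expressing cells would give a different mean (3.0 in both toy clusters), so it is worth checking which convention a given plot uses.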

<div align="right"><a href="#Table-of-contents">Go to TOC</a></div>

#### Matrix plot of ranked genes
The matrixplot shows the mean expression of a gene in a group by category as a heatmap.

In [None]:
sc.pl.rank_genes_groups_matrixplot(adata, n_genes=10, groupby='leiden', figsize=(20,10))

<div align="right"><a href="#Table-of-contents">Go to TOC</a></div>

#### Heatmap plot of ranked genes
Heatmaps do not collapse cells as in matrix plots. Instead, each cell is shown in a row.

In [None]:
sc.pl.rank_genes_groups_heatmap(
    adata, n_genes=10, show_gene_labels=True, groupby='leiden', figsize=(20,10))

<div align="right"><a href="#Table-of-contents">Go to TOC</a></div>

#### Tracks plot of ranked genes
The track plot shows the same information as the heatmap, but, instead of a color scale, the gene expression is represented by height.

In [None]:
sc.pl.rank_genes_groups_tracksplot(adata, n_genes=10, cmap='bwr',figsize=(20,30))

<div align="right"><a href="#Table-of-contents">Go to TOC</a></div>

## References
* [Scanpy - Preprocessing and clustering 3k PBMCs](https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html)
* [single-cell-tutorial](https://github.com/theislab/single-cell-tutorial)

<div align="right"><a href="#Table-of-contents">Go to TOC</a></div>

## Acknowledgement
The Imperial BRC Genomics Facility is supported by NIHR funding to the Imperial Biomedical Research Centre.