In [None]:
!pip install scprep phate magic-impute mnnpy scanpy

In [None]:
import scprep
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import phate
import graphtools as gt
import magic
import os

# Batch correction and gene visualization

Here we're going to run batch correction on a two-batch dataset of peripheral blood mononuclear cells (PBMCs) from 10X Genomics. The two batches are from two healthy donors, one using the 10X version 2 chemistry, and the other using the 10X version 3 chemistry.

Note that in this case, we have no reason to believe that there would be a genuine biological difference between the two batches (both donors are healthy) and good reason to believe that there would be a genuine technical difference between them (they were run with different chemistries). You should only use batch correction if you are confident that the effect you are removing is not genuine biology.

## 1. Loading preprocessed data

We have loaded and preprocessed the PBMC data for you, though you can download the raw files from https://support.10xgenomics.com/single-cell-gene-expression/datasets

Alternatively, you may load your own data by replacing the Google Drive file ids with your own file ids.

Note that this is only useful if your data has two separate batches.
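
If you do bring your own data, a quick sanity check can save trouble later. The sketch below assumes the layout the rest of this notebook relies on: a cells-by-genes `data` DataFrame and a `metadata` DataFrame with a `sample_labels` column on the same index. The helper name `check_two_batch_inputs` and the toy values are made up for illustration.

```python
import numpy as np
import pandas as pd

def check_two_batch_inputs(data, metadata, label_col="sample_labels"):
    """Return cells per batch, or raise if the inputs don't fit this notebook."""
    # Boolean masks built from metadata are applied to data later on,
    # so both frames need the same cells in the same order
    if not data.index.equals(metadata.index):
        raise ValueError("data and metadata must share the same cell index, in the same order")
    counts = metadata[label_col].value_counts()
    if len(counts) != 2:
        raise ValueError(f"expected exactly 2 batches, found {len(counts)}")
    return counts

# Toy stand-ins for `data` and `metadata` (hypothetical values)
cells = [f"cell_{i}" for i in range(5)]
toy_data = pd.DataFrame(np.arange(10).reshape(5, 2), index=cells, columns=["GENE_A", "GENE_B"])
toy_metadata = pd.DataFrame({"sample_labels": ["Donor_1"] * 3 + ["Donor_2"] * 2}, index=cells)
toy_counts = check_two_batch_inputs(toy_data, toy_metadata)
print(toy_counts)
```

On your own files you would call `check_two_batch_inputs(data, metadata)` right after loading.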

In [None]:
scprep.io.download.download_google_drive(id='1Ufsqot_Ir43M9XQhVNC27a6yW-_vvC9H',
                                         destination='data.pickle.gz')
scprep.io.download.download_google_drive(id='1BHji8Dy_jn8sIC60YsXm4sxVbFafnYWI',
                                         destination='metadata.pickle.gz')
data = pd.read_pickle('data.pickle.gz')
metadata = pd.read_pickle('metadata.pickle.gz')

## 2. Denoising with MAGIC

As mentioned previously, scRNAseq data suffers from various forms of noise, chiefly dropout or undercounting of mRNA molecules in single cells. Since analysis of sparse, noisy and non-uniform expression data can be challenging, we impute missing data values with MAGIC. This will aid in the visualization of gene expression and later with more complex analyses.

Since PBMCs have 3 major cell types (T cells, B cells, and monocytes), we will selectively impute genes that are specific to these cell types. Selectively imputing genes helps save on memory.

In [None]:
marker_genes = scprep.select.get_gene_set(data, exact_word=['CD4', 'CD8A', 'CD19', 'ITGAX', 'CD14'])

data_magic = magic.MAGIC().fit_transform(data, genes=marker_genes)
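
To build some intuition for what imputation by data diffusion does, here is a deliberately simplified sketch on toy data. It is not MAGIC's actual algorithm (MAGIC works on a PCA-reduced graph with its own kernel and parameters); the function name `diffusion_smooth` and the choices of `k` and `t` are assumptions made for illustration. The idea shown is only this: build cell-cell affinities, normalize them into a Markov matrix, raise it to a power, and use it to average each cell's expression with that of similar cells.

```python
import numpy as np

def diffusion_smooth(X, k=5, t=3):
    """Smooth a cells-by-genes array by diffusing values over a cell-cell graph."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances between cells
    sq = np.sum(X ** 2, axis=1)
    D = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0))
    # Per-cell bandwidth: distance to the k-th nearest neighbour (column 0 is the cell itself)
    sigma = np.maximum(np.sort(D, axis=1)[:, k], 1e-12)
    # Gaussian affinities, symmetrized
    K = np.exp(-(D / sigma[:, None]) ** 2)
    K = (K + K.T) / 2
    # Row-normalize into a Markov transition matrix and take t diffusion steps
    M = K / K.sum(axis=1, keepdims=True)
    M_t = np.linalg.matrix_power(M, t)
    # Each smoothed cell is a weighted average of the original cells
    return M_t @ X

# Toy counts with artificial dropout (hypothetical data)
rng = np.random.default_rng(0)
toy_expr = rng.poisson(5, size=(40, 3)).astype(float)
toy_expr[rng.random(toy_expr.shape) < 0.3] = 0
toy_smooth = diffusion_smooth(toy_expr)
print(toy_smooth[:3])
```

Because every row of the diffusion operator sums to one, the smoothed values are weighted averages and stay within the range of the observed values for each gene.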

## 3. Characterizing the Batch Effect

Whenever you suspect there is a batch effect, you should always start by asking yourself, "How do I know this difference doesn't represent biologically relevant variation between samples?"

The best way to do this is to start by assessing which genes are most differentially expressed between samples. Here we'll use the differential expression function implemented in `scprep.stats.differential_expression`. Another good toolkit for calculating differential expression is [DiffxPy](https://github.com/theislab/diffxpy/).

In [None]:
# Calculate the differential expression by calculating the t-statistic between samples
results = scprep.stats.differential_expression(
    data.loc[metadata['sample_labels'] == 'Donor_1'],
    data.loc[metadata['sample_labels'] == 'Donor_2'],
    measure='ttest')
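
For reference, here is a minimal sketch of what a per-gene t-statistic between two groups of cells looks like, written with pandas on toy data. It uses Welch's form (unequal variances) and is meant only to illustrate the idea; it is not necessarily how `scprep` computes its `'ttest'` measure, and the function name `welch_t_per_gene` and the toy genes are invented.

```python
import numpy as np
import pandas as pd

def welch_t_per_gene(X, Y):
    """Per-gene Welch t-statistic between two cells-by-genes DataFrames."""
    mean_diff = X.mean(axis=0) - Y.mean(axis=0)
    # Standard error of the difference, allowing unequal variances
    se = np.sqrt(X.var(axis=0, ddof=1) / len(X) + Y.var(axis=0, ddof=1) / len(Y))
    t = mean_diff / se
    # Order genes by absolute t-statistic, largest first
    return t.reindex(t.abs().sort_values(ascending=False).index)

# Toy data: only GENE_A differs between the two groups (hypothetical values)
rng = np.random.default_rng(0)
genes = ["GENE_A", "GENE_B", "GENE_C"]
toy_1 = pd.DataFrame(rng.normal([5, 1, 2], 1.0, size=(200, 3)), columns=genes)
toy_2 = pd.DataFrame(rng.normal([1, 1, 2], 1.0, size=(200, 3)), columns=genes)
toy_t = welch_t_per_gene(toy_1, toy_2)
print(toy_t)
```

A gene that is constant in both groups has zero standard error, so in practice you would filter such genes out before ranking.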

#### Print out the top 20 genes differentially expressed between samples

In [None]:
# ========
# Select the first 20 or 50 rows of the results dataframe
results.iloc[ ... ,:]
# ========

#### Plot the distribution of expression for each gene between samples
This plot is rather complicated so we're just going to give you the code to generate it.

In [None]:
# Create the figure and subplot axes
fig, axes = plt.subplots(4,5, figsize=(2*5, 2*4))

# Iterate over the axes
for i, ax in enumerate(axes.flatten()):
    # Get the i'th most differentially expressed gene
    curr_gene = results.iloc[i].name
    # Split the gene name to get the symbol
    gene_symbol = curr_gene.split(' ')[0]
    
    # Get the raw expression for the current gene
    exp = np.array(data[curr_gene])
    
    # Get expression per sample
    exp_donor_1 = exp[metadata['sample_labels'] == 'Donor_1']
    exp_donor_2 = exp[metadata['sample_labels'] == 'Donor_2']
    
    # Plot the histograms
    scprep.plot.histogram(exp_donor_1, range=(exp.min(), exp.max()), bins=100, 
                          ax=ax, color='#9E0141', ylabel='')
    scprep.plot.histogram(exp_donor_2, range=(exp.min(), exp.max()), bins=100, 
                          ax=ax, color='#5E4DA2', ylabel='', title=gene_symbol)

# Fit subplots into figure neatly
fig.tight_layout()

### Discussion

1. What do you notice about the kinds of genes that are among the top 20 or 50 most differentially expressed between samples?

2. Do you think these differences are biologically relevant? What sort of technical factors could influence the detection of these genes?

## 4. Visualizing data

Here, we're going to visualize our data with PHATE. If you'd like to use other visualization techniques such as UMAP or t-SNE, please go ahead!

In [None]:
data_phate = phate.PHATE().fit_transform(data)
# alternative: umap.UMAP(), sklearn.manifold.TSNE()

In [None]:
scprep.plot.scatter2d(data_phate, c=metadata['sample_labels'], figsize=(12,8), cmap="Spectral",
                      ticks=False, label_prefix="PHATE", s = 50)

### Discussion

1. What do you notice about this visualization? 
2. What do you think is driving this effect?

## 5. Visualizing imputed gene expression on the PHATE embedding

To check our suspicions about this dataset, let's look at some cell-type-specific markers.

In [None]:
expression = scprep.select.select_cols(data_magic, exact_word='CD8A')

scprep.plot.scatter2d(data_phate, c=expression, figsize=(12,8), cmap="Reds",
                      ticks=False, label_prefix="PHATE", s = 50)

### Exercise - plotting gene expression

Visualize each of the following marker genes and describe what you find: CD4, CD8A, CD19, ITGAX, CD14. Try using both raw and imputed data.

In [None]:
# ===========
# Extract the gene expression for each of the genes listed from either `data` or `data_magic`
expression = 
# ===========
scprep.plot.scatter2d(data_phate, c=expression, figsize=(12,8), cmap="Reds",
                      ticks=False, label_prefix="PHATE", s = 50)

### Discussion

1. What do you notice about the expression of each of these markers?
2. What else might you check before deciding that the difference between the batches is a technical effect?

## 6. Correcting differences between samples

There are several algorithms that try to correct systematic sample-level differences present in single-cell datasets. Here, we will apply mutual nearest neighbors (MNN) correction to try to remove these differences. We will first create an AnnData object from our data and then run it through MNN to get corrected data. We can then use the corrected data to re-impute gene expression and re-visualize our data.
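
The core idea behind MNN is to find pairs of cells, one from each batch, that are in each other's nearest-neighbor lists; these pairs are then treated as the same cell state seen under two technical conditions. The sketch below shows only that pair-finding step on toy data. It is not the `mnnpy` implementation (which also normalizes the data, computes correction vectors from the pairs, and smooths them); the function name `mutual_nearest_neighbors` and the toy batches are assumptions for illustration.

```python
import numpy as np

def mutual_nearest_neighbors(A, B, k=3):
    """Return (i, j) pairs where A[i] and B[j] are among each other's k nearest neighbours."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    # Euclidean distance between every cell in A and every cell in B
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    # For each cell in A, its k nearest cells in B
    nn_ab = np.argsort(d, axis=1)[:, :k]
    # For each cell in B, its k nearest cells in A
    nn_ba = np.argsort(d, axis=0)[:k, :].T
    # Keep only pairs that appear in both neighbour lists
    return [(i, int(j)) for i in range(A.shape[0]) for j in nn_ab[i] if i in nn_ba[j]]

# Toy batches: batch 2 is batch 1 shifted by a constant "batch effect" (hypothetical data)
rng = np.random.default_rng(0)
toy_batch_1 = np.vstack([rng.normal(0, 0.5, size=(20, 2)), rng.normal(10, 0.5, size=(20, 2))])
toy_batch_2 = toy_batch_1 + np.array([1.0, -1.0]) + rng.normal(0, 0.1, size=toy_batch_1.shape)
toy_pairs = mutual_nearest_neighbors(toy_batch_1, toy_batch_2, k=3)
print(len(toy_pairs), toy_pairs[:5])
```

Note that this sketch builds a dense distance array, so it is only suitable for small examples.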

In [None]:
import scanpy as sc

pbmc_anndata = sc.AnnData(X=data, obs = metadata)

In [None]:
batches = ["Donor_1","Donor_2"]
alldata = {}

for batch in batches:
    alldata[batch] = pbmc_anndata[pbmc_anndata.obs['sample_labels']==batch,]


In [None]:
# mnn_correct returns a tuple; its first element, cdata[0], holds the corrected data
cdata = sc.external.pp.mnn_correct(alldata['Donor_1'], alldata['Donor_2'], svd_dim=50,
                                   batch_key='sample_labels', batch_categories=["Donor_1", "Donor_2"])

## 7. Visualizing gene expression on corrected data

Now that we have a batch corrected dataset, let's visualize imputed gene expression on the aligned manifold. Let us know what you think!

In [None]:
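# Re-run MAGIC on the batch-corrected AnnData
cdata_magic = magic.MAGIC().fit_transform(cdata[0], genes=marker_genes)

# Convert the imputed values to a DataFrame labelled by gene and cell
cdata_magic = pd.DataFrame(cdata_magic.X)
cdata_magic.columns = marker_genes
# Assumes the corrected cells are in the same order as `data`
cdata_magic.index = data.index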

In [None]:
cdata_phate = phate.PHATE().fit_transform(cdata[0])

In [None]:
scprep.plot.scatter2d(cdata_phate, c=metadata['sample_labels'], figsize=(12,8), cmap="Spectral",
                      ticks=False, label_prefix="PHATE", s = 50)
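
If you would like a number to go with the picture, one simple option is to ask how often a cell's nearest neighbors in the embedding come from the other batch. The sketch below is an assumption-laden illustration, not an established metric from any of the libraries used here; the function name `knn_batch_mixing` and the toy points are invented.

```python
import numpy as np

def knn_batch_mixing(embedding, labels, k=10):
    """Fraction of k-nearest-neighbour links that connect cells from different batches."""
    embedding = np.asarray(embedding, dtype=float)
    labels = np.asarray(labels)
    # Pairwise squared distances between cells in the embedding
    sq = np.sum(embedding ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * embedding @ embedding.T, 0)
    # A cell should not count as its own neighbour
    np.fill_diagonal(d2, np.inf)
    nn = np.argsort(d2, axis=1)[:, :k]
    # Compare each neighbour's batch label with the label of the cell itself
    return float(np.mean(labels[nn] != labels[:, None]))

# Toy embedding with two well-separated groups of points (hypothetical data)
toy_embedding = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
separated = knn_batch_mixing(toy_embedding, ["Donor_1", "Donor_1", "Donor_2", "Donor_2"], k=1)
mixed = knn_batch_mixing(toy_embedding, ["Donor_1", "Donor_2", "Donor_1", "Donor_2"], k=1)
print(separated, mixed)
```

Values near zero mean the batches sit apart in the embedding, while higher values mean they are interleaved. You could compare `knn_batch_mixing(data_phate, metadata['sample_labels'])` with the same call on `cdata_phate`. The dense distance matrix grows with the square of the number of cells, so you may want to subsample first.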

In [None]:
expression = scprep.select.select_cols(cdata_magic, exact_word='ITGAX') # Please enter each of the marker genes here

scprep.plot.scatter2d(cdata_phate, c=expression, figsize=(12,8), cmap="Reds",
                      ticks=False, label_prefix="PHATE", s = 50)

### Exercise - plotting gene expression

Visualize each of the following marker genes and describe what you find: CD4, CD8A, CD19, ITGAX, CD14. Try using both raw and imputed data.

In [None]:
# ===========
# Extract the gene expression for each of the genes listed from `cdata_magic`
expression = 
# ===========
scprep.plot.scatter2d(cdata_phate, c=expression, figsize=(12,8), cmap="Reds",
                      ticks=False, label_prefix="PHATE", s = 50)

### Discussion

1. What do you notice about the expression of each of these markers? How does it compare to the visualization before batch correction?
2. When is it a good idea to apply batch correction to a dataset?
3. Can you think of any risks of doing batch correction?