# Batch Correction Tutorial

Authors: [Hui Ma](https://github.com/huimalinda), [Yiming Yang](https://github.com/yihming), [Rimte Rocher](https://github.com/rocherr)<br />
Date: 2022-02-24<br />
Notebook Source: [batch_correction.ipynb](https://raw.githubusercontent.com/klarman-cell-observatory/pegasus/master/notebooks/batch_correction.ipynb)

In [None]:
import pegasus as pg

## Dataset

In this tutorial, we'll use a gene-count matrix dataset on human bone marrow from 8 donors, and show how to use the batch correction methods in Pegasus to tackle the batch effects in the data.

The dataset is stored at https://storage.googleapis.com/terra-featured-workspaces/Cumulus/MantonBM_nonmix_subset.zarr.zip. You can also use [gsutil](https://cloud.google.com/storage/docs/gsutil) to download it via its Google bucket URL (gs://terra-featured-workspaces/Cumulus/MantonBM_nonmix_subset.zarr.zip).
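
If you prefer to stay in Python, the download can be sketched with the standard library alone. The helper name `download_if_missing` is hypothetical (it is not part of Pegasus), and the sketch assumes the public HTTPS URL above is reachable from your machine.

```python
import urllib.request
from pathlib import Path

DATA_URL = (
    "https://storage.googleapis.com/terra-featured-workspaces/"
    "Cumulus/MantonBM_nonmix_subset.zarr.zip"
)


def download_if_missing(url: str, dest: str) -> Path:
    """Download `url` to `dest` unless the file already exists (hypothetical helper)."""
    path = Path(dest)
    if not path.exists():
        # urlretrieve writes the response body straight to disk
        urllib.request.urlretrieve(url, str(path))
    return path


# Usage (commented out so that nothing is downloaded by accident):
# download_if_missing(DATA_URL, "MantonBM_nonmix_subset.zarr.zip")
```

Skipping the download when the file already exists keeps repeated notebook runs from fetching the archive again.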

Now load the count matrix:

In [None]:
data = pg.read_input("MantonBM_nonmix_subset.zarr.zip")
data

`'Channel'` is the batch key. Each batch is associated with one donor, so there are 8 batches in total.

## Sections
-  [Preprocessing](#pre)
-  [Clustering without Correcting Batch Effects](#nobatch)
-  [Batch Correction Methods](#batch)
    - [Harmony](#harmony)
    - [Integrative-NMF](#inmf)
    - [Scanorama](#scanorama)
-  [Comparing Batch Correction Methods](#comp)

## Preprocessing
<a id='pre'></a>
First, preprocess the data. This includes filtration, selecting robust genes, and log-normalization:

In [None]:
pg.qc_metrics(data, min_genes=500, max_genes=6000, mito_prefix='MT-', percent_mito=10)
pg.filter_data(data)
pg.identify_robust_genes(data)
pg.log_norm(data)

After quality control, the distribution of cells across the 8 batches is:

In [None]:
data.obs['Channel'].value_counts()

## Clustering without Correcting Batch Effects
<a id='nobatch'></a>
We first perform the downstream steps without considering batch effects. This shows where the batch effects exist, and the result also serves as the baseline when comparing the different batch correction methods.

In [None]:
data_baseline = data.copy()
pg.highly_variable_features(data_baseline)

In this tutorial, the downstream steps consist of:
* PCA calculation: by default, 50 PCs are calculated;
* kNN graph construction, and Louvain clustering based on it;
* Calculation of a 2-dimensional UMAP embedding, shown as a UMAP plot.

In [None]:
pg.pca(data_baseline)
pg.neighbors(data_baseline)
pg.louvain(data_baseline)
pg.umap(data_baseline)
pg.scatter(data_baseline, attrs=['louvain_labels', 'Channel'], basis='umap')

Let's have a quick look at the UMAP plot above. If you check the cells of Louvain clusters 11 and 14 in the right-hand-side plot, you can see that most of them come from sample `MantonBM3_HiSeq_1`. This indicates strong batch effects.
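
The same observation can be checked numerically rather than by eye. The sketch below uses a small made-up table in place of `data_baseline.obs` (the labels and counts are invented for illustration, not taken from the dataset) and computes, for each cluster, the fraction of cells contributed by each batch. On the real data, the assumption is that `data_baseline.obs` holds both the `louvain_labels` and `Channel` columns, as the plot above suggests.

```python
import pandas as pd

# Toy stand-in for data_baseline.obs; values are invented for illustration.
obs = pd.DataFrame({
    "louvain_labels": ["1", "1", "1", "1", "2", "2", "2", "2"],
    "Channel": ["donorA", "donorB", "donorA", "donorB",
                "donorA", "donorA", "donorA", "donorB"],
})

# Rows: clusters; columns: batches; each row sums to 1.
composition = pd.crosstab(obs["louvain_labels"], obs["Channel"], normalize="index")

# Largest single-batch share per cluster; values near 1 flag batch-driven clusters.
dominant_share = composition.max(axis=1)

print(composition)
print(dominant_share)
```

With the real object, replace `obs` by `data_baseline.obs`; clusters whose dominant share is close to 1 are the ones driven by a single batch.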

## Batch Correction Methods
<a id='batch'></a>
Batch effects occur when samples are generated under different conditions, such as different dates, weather, lab settings, or equipment. Unless you know that all the samples were generated under similar conditions, you should suspect batch effects whenever a visualization shows samples that are largely isolated from each other.

In this tutorial, you'll see how to apply the batch correction methods in Pegasus to this dataset. 

As a common first step, we need to re-select the highly variable genes (HVGs), this time taking batch effects into account:

In [None]:
pg.highly_variable_features(data, batch='Channel')

### Harmony Algorithm
<a id='harmony'></a>
[Harmony](https://www.nature.com/articles/s41592-019-0619-0) is a widely used method for data integration. Pegasus uses the [harmony-pytorch](https://github.com/lilab-bcb/harmony-pytorch) package to perform Harmony batch correction.

Harmony works on the PCA matrix, so we first need to calculate the original PCA matrix:

In [None]:
data_harmony = data.copy()
pg.pca(data_harmony)

Now we are ready to run Harmony integration:

In [None]:
harmony_key = pg.run_harmony(data_harmony)

When finished, the corrected PCA matrix is stored in `data_harmony.obsm['X_pca_harmony']`, and `run_harmony` returns the representation key `'pca_harmony'`, stored here in the variable `harmony_key`. In the downstream steps, you can set the `rep` parameter to either `harmony_key` or `'pca_harmony'` in Pegasus functions whenever applicable.

For details on the parameters of `run_harmony` beyond the default settings, please [see here](https://pegasus.readthedocs.io/en/stable/api/pegasus.run_harmony.html#pegasus.run_harmony).
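
The relationship between a representation key and the place where its matrix lives can be sketched as follows. This assumes the `'X_' + key` naming shown above; a plain dictionary with random values stands in for `data_harmony.obsm`, and `embedding_for` is a hypothetical helper, not a Pegasus function.

```python
import numpy as np

# Plain dict standing in for data_harmony.obsm (random values, 5 cells x 3 PCs).
rng = np.random.default_rng(0)
obsm = {
    "X_pca": rng.normal(size=(5, 3)),
    "X_pca_harmony": rng.normal(size=(5, 3)),
}


def embedding_for(obsm, rep):
    """Return the matrix for representation key `rep` (hypothetical helper)."""
    return obsm["X_" + rep]


harmony_key = "pca_harmony"  # the key run_harmony returns, per the text above
corrected = embedding_for(obsm, harmony_key)
print(corrected.shape)
```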

With the new corrected PCA matrix, we can perform kNN-graph-based clustering and calculate UMAP embeddings as follows:

In [None]:
pg.neighbors(data_harmony, rep=harmony_key)
pg.louvain(data_harmony, rep=harmony_key)
pg.umap(data_harmony, rep=harmony_key)

Then show UMAP plot:

In [None]:
pg.scatter(data_harmony, attrs=['louvain_labels', 'Channel'], basis='umap')

### Integrative-NMF
<a id='inmf'></a>
Another popular data integration method is [integrative Non-negative Matrix Factorization](https://academic.oup.com/bioinformatics/article/32/1/1/1743821) (iNMF). Pegasus provides this method through the [nmf-torch](https://github.com/lilab-bcb/nmf-torch) package.

In [None]:
data_inmf = data.copy()
inmf_key = pg.integrative_nmf(data_inmf, batch='Channel')

As with `run_harmony`, when finished, the calculated embedding is stored in `data_inmf.obsm['X_inmf']`, and `integrative_nmf` returns the representation key `'inmf'`, stored here in the variable `inmf_key`. In the downstream steps, you can set the `rep` parameter to either `inmf_key` or `'inmf'` in Pegasus functions whenever applicable.

For details on the parameters of `integrative_nmf` beyond the default settings, please [see here](https://pegasus.readthedocs.io/en/stable/api/pegasus.integrative_nmf.html#pegasus.integrative_nmf).

Now we can perform kNN-graph-based clustering and calculate UMAP embeddings as follows:

In [None]:
pg.neighbors(data_inmf, rep=inmf_key)
pg.louvain(data_inmf, rep=inmf_key)
pg.umap(data_inmf, rep=inmf_key)

Then show UMAP plot:

In [None]:
pg.scatter(data_inmf, attrs=['louvain_labels', 'Channel'], basis='umap')

### Scanorama Algorithm
<a id='scanorama'></a>
[Scanorama](https://www.nature.com/articles/s41587-019-0113-3) is another widely used batch correction algorithm. Pegasus uses the [scanorama](https://github.com/brianhie/scanorama) package.

Like Harmony, Scanorama corrects the original PCA matrix of the dataset. But as the PCA step is already integrated into the `run_scanorama` function, we don't need to run `pca` separately:

In [None]:
data_scan = data.copy()
scan_key = pg.run_scanorama(data_scan)

You can check the details of the `run_scanorama` parameters [here](https://pegasus.readthedocs.io/en/stable/api/pegasus.run_scanorama.html).

By default, `run_scanorama` uses only the selected HVGs of the count matrix and calculates a corrected PCA matrix of $50$ PCs. When finished, this new PCA matrix is stored in `data_scan.obsm['X_scanorama']`, and the function returns its representation key `'scanorama'`, stored here in the variable `scan_key`. In the downstream steps, you can set the `rep` parameter to either `scan_key` or `'scanorama'` in Pegasus functions whenever applicable:

In [None]:
pg.neighbors(data_scan, rep=scan_key)
pg.louvain(data_scan, rep=scan_key)
pg.umap(data_scan, rep=scan_key)

Now check its UMAP plot:

In [None]:
pg.scatter(data_scan, attrs=['louvain_labels', 'Channel'], basis='umap')

## Comparing Batch Correction Methods
<a id='comp'></a>

To compare the performance of the three methods, one metric is **runtime**, which you can see from the logs in the sections above: integrative NMF is the fastest, followed by Harmony, and Scanorama is the slowest.
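
If the logs are not at hand, wall-clock time can also be measured directly. A minimal sketch using only the standard library follows; `timed` is a hypothetical helper, and the `sum(...)` call is only a placeholder for a call such as `pg.run_harmony(data_harmony)`.

```python
import time
from contextlib import contextmanager

runtimes = {}  # label -> elapsed seconds


@contextmanager
def timed(label):
    """Record the wall-clock time of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        runtimes[label] = time.perf_counter() - start


with timed("placeholder"):
    total = sum(range(100_000))  # stand-in for a batch correction call

print(runtimes)
```

For a fair comparison, time all methods on the same machine and on the same preprocessed data.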

In this section, we'll use 2 other metrics for comparison:
* **kBET acceptance rate:** kBET measures whether batches are well-mixed in the local neighborhood of each cell. Then kBET acceptance rate is the percent of cells with kBET p-values $\geq 0.05$. The higher, the better.
* **kSIM acceptance rate:** kSIM measures whether cells of the same pre-annotated cell type are still close to each other in the local neighborhoods after batch correction. The kSIM acceptance rate is the percent of cells whose fraction of same-type neighbors reaches the acceptance threshold (the `min_rate` parameter of `calc_kSIM`). We use this metric to reflect whether known biological relationships are preserved after correction. The higher, the better.

We have 4 results: no batch correction (Baseline), Harmony, iNMF, and Scanorama. For each result, the kBET and kSIM acceptance rates are calculated on its 2D UMAP coordinates for comparison, which is consistent with the [Cumulus paper](https://rdcu.be/b5R5B).

Details on these 2 metrics can also be found in the Cumulus paper.
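
To make the two ideas concrete, here is a simplified sketch on toy data. It is not the implementation behind `calc_kBET` or `calc_kSIM`. It assumes NumPy and SciPy are installed, a Pearson chi-square test of each neighborhood's batch composition against the global composition for the kBET-like score, and, for the kSIM-like score, that a cell is accepted when at least a fraction `min_rate` of its neighbors share its label. The function names, the toy embeddings, and the values of `k` and `min_rate` are all illustrative choices.

```python
import numpy as np
from scipy.stats import chi2


def knn_indices(X, k):
    """Indices of the k nearest neighbors of each row of X (self excluded), brute force."""
    sq = np.sum(X ** 2, axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * X @ X.T  # squared Euclidean distances
    np.fill_diagonal(dist, np.inf)  # never pick a cell as its own neighbor
    return np.argsort(dist, axis=1)[:, :k]


def kbet_acceptance(X, batches, k=25, alpha=0.05):
    """Fraction of cells whose neighborhood batch mix passes a chi-square test."""
    uniq, codes, counts = np.unique(np.asarray(batches), return_inverse=True,
                                    return_counts=True)
    expected = k * counts / counts.sum()  # expected neighbors per batch
    nbr_codes = codes[knn_indices(X, k)]  # shape (n_cells, k)
    observed = np.stack([(nbr_codes == b).sum(axis=1) for b in range(len(uniq))], axis=1)
    stat = ((observed - expected) ** 2 / expected).sum(axis=1)
    pvals = chi2.sf(stat, df=len(uniq) - 1)
    return float(np.mean(pvals >= alpha))


def ksim_acceptance(X, labels, k=25, min_rate=0.9):
    """Fraction of cells with at least `min_rate` of neighbors sharing their label."""
    labels = np.asarray(labels)
    nbr_labels = labels[knn_indices(X, k)]
    same_rate = (nbr_labels == labels[:, None]).mean(axis=1)
    return float(np.mean(same_rate >= min_rate))


# Toy data: 2 batches x 2 cell types, 100 cells in each combination.
rng = np.random.default_rng(0)
n = 200
batches = np.repeat(["b1", "b2"], n)
labels = np.tile(np.repeat(["T", "B"], n // 2), 2)

mixed = rng.normal(size=(2 * n, 2))
mixed[:, 0] += np.where(labels == "T", 0.0, 10.0)  # cell types well separated

separated = mixed.copy()
separated[:, 1] += np.where(batches == "b2", 10.0, 0.0)  # strong batch effect

collapsed = rng.normal(size=(2 * n, 2))  # over-corrected: cell types merged too

for name, emb in [("separated", separated), ("mixed", mixed), ("collapsed", collapsed)]:
    print(name, kbet_acceptance(emb, batches), ksim_acceptance(emb, labels))
```

The three toy embeddings mimic an uncorrected result, a well-corrected result, and an over-corrected result in which batches mix but cell types are no longer distinguishable.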

### kBET Acceptance Rate

We can use the `calc_kBET` function to calculate kBET acceptance rates:
* For the `attr` parameter, use the batch key, which is `'Channel'` in this tutorial;
* For the `rep` parameter, set it to the corresponding UMAP coordinates;
* For the returned values, we only care about the kBET acceptance rate, so just ignore the first two returned values.

In [None]:
_, _, kBET_baseline = pg.calc_kBET(data_baseline, attr='Channel', rep='umap')
_, _, kBET_harmony = pg.calc_kBET(data_harmony, attr='Channel', rep='umap')
_, _, kBET_inmf = pg.calc_kBET(data_inmf, attr='Channel', rep='umap')
_, _, kBET_scan = pg.calc_kBET(data_scan, attr='Channel', rep='umap')

### kSIM Acceptance Rate

We need pre-annotated cell type information as ground truth to calculate kSIM acceptance rate. This is achieved by:
* Starting with Louvain clusters in the baseline case;
* Merging two CD14+ Monocyte clusters, as one is donor-specific;
* For the only cluster with no strong cell type evidence, classifying its cells using a Gradient Boosting model trained from its neighbor T-cell clusters.

This ground truth is stored at https://storage.googleapis.com/terra-featured-workspaces/Cumulus/cell_types.csv, or we can get it using `gsutil` from its Google bucket URL (gs://terra-featured-workspaces/Cumulus/cell_types.csv).

Now load this file, and attach its `'cell_types'` column to each of the 4 results above:

In [None]:
import pandas as pd
import numpy as np

df_celltypes = pd.read_csv("cell_types.csv", index_col='barcodekey')

# For each result, check that the barcodes match in order, then attach the annotation
for result in (data_baseline, data_harmony, data_inmf, data_scan):
    assert np.sum(df_celltypes.index != result.obs_names) == 0
    result.obs['cell_types'] = df_celltypes['cell_types']

We can then use the `calc_kSIM` function to calculate kSIM acceptance rates:
* For the `attr` parameter, use the ground truth key `'cell_types'`;
* For the `rep` parameter, as in the kBET section, set it to the corresponding UMAP coordinates;
* For the returned values, we only care about the kSIM acceptance rate, so just ignore the first returned value.

In [None]:
_, kSIM_baseline = pg.calc_kSIM(data_baseline, attr='cell_types', rep='umap')
_, kSIM_harmony = pg.calc_kSIM(data_harmony, attr='cell_types', rep='umap')
_, kSIM_inmf = pg.calc_kSIM(data_inmf, attr='cell_types', rep='umap')
_, kSIM_scan = pg.calc_kSIM(data_scan, attr='cell_types', rep='umap')

### Performance Plot
<a id='per'></a>
Now draw a scatter plot of these two metrics for the 4 results:

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

df_plot = pd.DataFrame({'method': ['Baseline', 'Harmony', 'iNMF', 'Scanorama'],
                        'kBET': [kBET_baseline, kBET_harmony, kBET_inmf, kBET_scan],
                        'kSIM': [kSIM_baseline, kSIM_harmony, kSIM_inmf, kSIM_scan]})

plt.figure(dpi=100)
ax = sns.scatterplot(x = 'kSIM', y = 'kBET', hue = 'method', data = df_plot, legend = False)
for line in range(0, df_plot.shape[0]):
    x_pos = df_plot.kSIM[line] + 0.003
    if df_plot.method[line] == 'Baseline':
        x_pos = df_plot.kSIM[line] - 0.003
    y_pos = df_plot.kBET[line]
    if df_plot.method[line] == 'iNMF':
        y_pos -= 0.01
    alignment = 'right' if df_plot.method[line] == 'Baseline' else 'left'
    ax.text(x_pos, y_pos, df_plot.method[line], ha = alignment, size = 'medium', color = 'black')
plt.xlabel('kSIM acceptance rate')
plt.ylabel('kBET acceptance rate')

As this plot shows, a trade-off exists between mixing cells well across batches (in terms of kBET acceptance rate) and preserving the biology (in terms of kSIM acceptance rate). *Harmony* achieves the best mixing of cells, but its consistency with the ground truth biology is the lowest. *iNMF* and *Scanorama* both strike a better balance between the two measurements.

Therefore, in general, the choice of batch correction method really depends on the dataset and your analysis goal.