In [1]:
from sctoolbox.utils.jupyter import bgcolor, _compare_version

nb_name = "03_normalization_batch_correction.ipynb"

_compare_version(nb_name)

# 03 - Normalization and Batch effect correction
<hr style="border:2px solid black"> </hr>

## 1 - Description

Cells can vary in the number of fragments detected in open chromatin regions due to several factors, ranging from cell viability to technical variation. Normalization corrects for these imbalances and makes the cells comparable. This framework offers two normalization options: term frequency-inverse document frequency (TF-IDF) and log total count normalization. Since TF-IDF has been shown to be particularly effective for ATAC-seq data, it is used here as the default method.
DOI: https://doi.org/10.1038/nature25981
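
To make the TF-IDF idea concrete, the following is a minimal pure-Python sketch on a toy cells x features count matrix. It uses one common variant (term frequency as the share of a cell's counts, inverse document frequency as `log(1 + n_cells / n_cells_with_feature)`); this variant and the helper name `tfidf` are assumptions for illustration and may differ from what `sctoolbox` implements.

```python
import math


def tfidf(counts):
    """Return a TF-IDF matrix for a cells x features count matrix (list of lists).

    Illustrative variant only; not necessarily the formula used by sctoolbox.
    """
    n_cells = len(counts)
    n_features = len(counts[0])
    # Number of cells in which each feature is observed at least once
    cells_with_feature = [
        sum(1 for row in counts if row[j] > 0) for j in range(n_features)
    ]
    # Inverse document frequency: features seen in few cells get a larger weight
    idf = [math.log(1 + n_cells / n) if n > 0 else 0.0 for n in cells_with_feature]

    result = []
    for row in counts:
        total = sum(row)
        # Term frequency: share of the cell's counts that fall on each feature
        tf = [c / total if total > 0 else 0.0 for c in row]
        result.append([t * w for t, w in zip(tf, idf)])
    return result


# Toy matrix: 3 cells x 3 features
counts = [
    [2, 0, 2],
    [1, 1, 0],
    [4, 0, 0],
]
normalized = tfidf(counts)
```

In this toy example the first feature is open in every cell and therefore receives the smallest weight, while features seen in a single cell are up-weighted.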

Another important property of our data is its high dimensionality. The dimensionality needs to be reduced for subsequent steps such as embedding and clustering. To achieve this, algorithms like Principal Component Analysis (PCA) or Singular Value Decomposition (SVD) can be used. When SVD is applied in combination with TF-IDF, it is referred to as Latent Semantic Indexing (LSI). This notebook uses two combinations: total count normalization with PCA, and TF-IDF with SVD.
DOI: https://doi.org/10.1038/nmeth.4346
DOI: https://doi.org/10.1038/nature25981

Batch effects are variances in the data that are not intended by the experimental design (e.g. technical variance). They can be introduced through various sources; for example, sequencing samples at different timepoints may introduce batch effects. As batch effects can interfere with downstream analysis, they are typically removed. However, identifying and correcting batch effects can be challenging, as this depends heavily on the experimental setup of the dataset.
DOI: https://doi.org/10.1038/nrg2825

To determine the strength of a batch effect, the Local Inverse Simpson's Index (LISI) can be used, which measures how heterogeneous the batch labels are in the neighborhood of each cell.
DOI: https://doi.org/10.1038/s41592-019-0619-0
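
The core of the score is the inverse Simpson's index of the batch labels around a cell. The sketch below computes it for an unweighted list of neighbor labels; the published LISI additionally weights neighbors by distance, so this is a simplification, and the function name `inverse_simpson` is hypothetical.

```python
from collections import Counter


def inverse_simpson(labels):
    """Inverse Simpson's index of the labels in one neighborhood (unweighted)."""
    counts = Counter(labels)
    n = len(labels)
    # Sum of squared label proportions; 1 / sum is the effective number of labels
    return 1.0 / sum((c / n) ** 2 for c in counts.values())


# Neighborhood with two equally represented batches (well mixed)
mixed = inverse_simpson(["A", "B", "A", "B"])
# Neighborhood containing a single batch (not mixed)
unmixed = inverse_simpson(["A", "A", "A", "A"])
```

The value ranges from 1 (only one batch present among the neighbors) up to the number of batches (all batches equally represented).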

This notebook prepares the data for the subsequent embedding and clustering by applying normalization and batch correction. To assess the effects of normalization and batch correction, an embedding is calculated for visualization alongside the LISI score. For the embedding and the following notebooks, a PCA is performed and components are selected.

________

## 2 - Setup

In [None]:
import sctoolbox
import sctoolbox.tools as tools
import sctoolbox.plotting as pl
import sctoolbox.utils as utils
import scanpy as sc
import pandas as pd
import matplotlib.pyplot as plt

sctoolbox.settings.settings_from_config("config.yaml", key="03")

sc.set_figure_params(vector_friendly=True, dpi_save=600, scanpy=False)

________

## 3 - Load anndata
Loads the AnnData object (`anndata_2.h5ad`) saved by the previous notebook and provides a basic overview.

In [None]:
adata = utils.adata.load_h5ad("anndata_2.h5ad")

with pd.option_context("display.max_rows", 5, "display.max_columns", None):
    display(adata)
    display(adata.obs)
    display(adata.var)

_________

## 4 - General input

<h1><center>⬐ Fill in input data here ⬎</center></h1>

In [2]:
%bgcolor PowderBlue

# Choose normalization method
# TF-IDF: dimensionality is reduced by SVD (LSI)
# Total: dimensionality is reduced by PCA 
norm_method = 'tfidf'  # can be 'tfidf' or 'total'

# Choose if highly variable features should be used
use_highly_variable = True

# Set number of neighbors
n_neighbors = 15  # Default=15

# UMAP related settings 
metacol = 'sample'  # some metacol of interest
n_features = 'n_features'  # column name for the number of features

# Batch correction: if True, all methods listed in batch_methods are run
# and the best one can be chosen afterwards
batch_column = "sample"
perform_batch_correction = True
batch_methods = ["bbknn", "harmony"]  # further options: "mnn", "scanorama", "combat"
threads = 8

________________

In [None]:
# Ensure that the batch column is of type category
adata.obs[batch_column] = adata.obs[batch_column].astype(str).astype("category")

____

## 5 - Normalization
<hr style="border:2px solid black"> </hr>
Normalize the counts for each cell so that all cells have the same number of counts after normalization. This removes imbalances in sequencing depth to make the cells comparable. 
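
As an illustration of the total count option, the sketch below scales every cell to the same total and then applies a `log1p` transform. The helper name and the `target_sum` of 10,000 are assumptions chosen for the example, not values taken from `sctoolbox`.

```python
import math


def log_total_count_normalize(counts, target_sum=10000):
    """Scale each cell (row) to target_sum counts, then apply log(1 + x)."""
    normalized = []
    for row in counts:
        total = sum(row)
        # Cells without any counts stay at zero instead of dividing by zero
        scale = target_sum / total if total > 0 else 0.0
        normalized.append([math.log1p(c * scale) for c in row])
    return normalized


# Two cells with identical composition but a 10-fold difference in depth
counts = [
    [1, 3],
    [10, 30],
]
normalized = log_total_count_normalize(counts)
```

After normalization both toy cells have the same profile, because they differ only in sequencing depth.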

In [None]:
# Save raw layer before normalization
adata.layers["raw"] = adata.X.copy()

In [None]:
adata = tools.norm_correct.normalize_adata(adata, norm_method, use_highly_variable=use_highly_variable)

___________

## 6 - PCA/SVD and neighbors for uncorrected data
<hr style="border:2px solid black"> </hr>
This section covers the selection of SVD or PCA components, both referred to here as principal components (PCs) for convenience, which are used in the subsequent analysis. Because a lower number of PCs reduces the computing resources needed by many upcoming steps, PCs that explain little variance are excluded. However, some PCs that explain high variance may be driven by non-biological or otherwise unwanted factors (e.g. number of active genes, cell cycle, etc.) and should therefore be excluded from the following analysis as well.

The following PCA plot and heatmaps are intended to identify potentially unwanted PCs by showing the PCs in combination with available observations (cell-related metrics) and variables (feature-related metrics). In general, **selected PCs should avoid correlations with metrics**, but the importance of metrics and the stringency of thresholds depend on the experiment and the underlying questions, and therefore require careful consideration by the analyst.
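
To illustrate what such a correlation means, here is a small pure-Python sketch of a Spearman rank correlation between the cell coordinates on one PC and a cell-level metric. The values of `pc1` and `total_counts` are made up for the example, and the helper functions are not part of `sctoolbox`.

```python
def ranks(values):
    """Return 1-based ranks, assigning the average rank to tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Made-up example: PC coordinates that increase monotonically with sequencing depth
pc1 = [-2.0, -0.5, 0.1, 1.5, 3.0]
total_counts = [100, 400, 900, 1600, 9000]
rho = spearman(pc1, total_counts)
```

A PC whose coordinates rise monotonically with a technical metric, as in this toy case, would be a candidate for exclusion.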

<h1><center>⬐ Fill in input data here ⬎</center></h1>

In [3]:
%bgcolor PowderBlue

# number of PCs shown within the heatmap
n_pcs_heatmap = 15

____

In [None]:
# PCA correlations with obs variables
_ = pl.embedding.plot_pca_correlation(
    adata,
    n_components=n_pcs_heatmap,
    which="obs",
    title="Correlation of .obs columns with PCA loadings",
    save="PCA_correlation_obs.pdf"
)

In [None]:
# PCA correlations with var variables
_ = pl.embedding.plot_pca_correlation(
    adata,
    n_components=n_pcs_heatmap,
    which="var",
    title="Correlation of .var columns with PCA loadings",
    save="PCA_correlation_var.pdf"
)

_________

### 6.1 - Choose a subset of PCs

If the plots above show undesired correlations, this section can be used to subset the PCs. The proposed PC subset is displayed as a plot with darker bars representing the selected PCs. Based on the selected `filter_methods`, a vertical and a horizontal threshold line are displayed. PCs are filtered if they are below the horizontal threshold (`corr_thresh`) or if they are to the right of the vertical threshold line (`perc_thresh`).

| Parameter | Description | Options |
|:---:|:---|:---|
| subset_pcs | Whether the PCs should be filtered. | `True` or `False` |
| corr_thresh | Highest absolute correlation that is allowed. Will take the maximum correlation for each PC as shown in the heatmap above. | Expects a value between `0-1`. |
| perc_thresh | Top percentile of PCs that should be kept. | A value between `0-100`%. |
| filter_methods | The PCs will be filtered based on the given methods. E.g. for "variance" and "correlation" PCs are filtered on values from both methods and the intersection is used as the final subset. | Any combination of `["variance", "cumulative variance", "correlation"]` |
| basis | Compute correlation based on observations (cells) or variables (genes). | Either `obs` or `var`. |
| ignore_cols | List of column names to ignore for correlation. | `None` or a list of column names |
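
The way two filter methods combine can be sketched as a set intersection. The function below is a hypothetical illustration, not the implementation of `tools.dim_reduction.propose_pcs`: it assumes that `perc_thresh` keeps the top percentage of PCs ranked by explained variance, that `corr_thresh` is the highest allowed absolute correlation, and that PCs are numbered starting at 1.

```python
def propose_pcs_sketch(variance_ratio, max_abs_corr, perc_thresh=50, corr_thresh=0.6):
    """Return 1-based PC numbers that pass both the variance and correlation filter."""
    n = len(variance_ratio)

    # "variance" filter: keep the top perc_thresh percent of PCs by explained variance
    n_keep = int(n * perc_thresh / 100)
    ranked = sorted(range(n), key=lambda i: variance_ratio[i], reverse=True)
    by_variance = set(ranked[:n_keep])

    # "correlation" filter: keep PCs whose strongest correlation is within the limit
    by_correlation = {i for i in range(n) if max_abs_corr[i] <= corr_thresh}

    # The final proposal is the intersection of both filters
    return sorted(i + 1 for i in by_variance & by_correlation)


# Made-up values for four PCs
variance_ratio = [0.4, 0.3, 0.2, 0.1]
max_abs_corr = [0.9, 0.2, 0.1, 0.3]
selected = propose_pcs_sketch(variance_ratio, max_abs_corr)
```

In this toy case PC 1 explains the most variance but is removed because of its high correlation, so only PC 2 passes both filters.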

<h1><center>⬐ Fill in input data here ⬎</center></h1>

In [4]:
%bgcolor PowderBlue

# Whether PCs should be filtered
subset_pcs = True

corr_thresh = 0.6  # PCs with an absolute correlation above this will be filtered
perc_thresh = 50  # Top percentile of PCs that should be kept
filter_methods = ['variance', 'correlation']  # propose PCs based on the provided methods
basis = 'obs'  # base correlation on obs or var
ignore_cols = []  # List of column names to ignore for correlation

________

In [None]:
selected_pcs = tools.dim_reduction.propose_pcs(
    anndata=adata,
    how=filter_methods,
    corr_thresh=corr_thresh,
    perc_thresh=perc_thresh,
    corr_kwargs={'method': 'spearmanr', 'which': basis, 'ignore': ignore_cols}
)

# Plot and select number of PCs
_ = pl.embedding.plot_pca_variance(
    adata, 
    save='PCA_variance_proposed_selection.pdf',
    selected=selected_pcs,
    n_pcs=50,
    n_thresh=max(selected_pcs),
    corr_plot='spearmanr',
    corr_thresh=corr_thresh,
    corr_on=basis,
    ignore=ignore_cols
)

In [None]:
f"Proposed principal components: {selected_pcs}"

Create the final PC selection by editing the blue cell below:
- copy and adjust the proposed list from directly above,
- create a custom list of PCs,
- or accept the proposed list by leaving the cell unchanged.

**Note: the selection will only be applied when `subset_pcs = True`.**

<h1><center>⬐ Fill in input data here ⬎</center></h1>

In [None]:
%bgcolor PowderBlue

final_pc_selection = selected_pcs

-------

In [None]:
_ = pl.embedding.plot_pca_variance(
    adata,
    selected=final_pc_selection if subset_pcs else None,
    save='PCA_variance_final_selection.pdf',
    n_pcs=50,
    # Draw the threshold at the highest PC of the final (not the proposed) selection
    n_thresh=max(final_pc_selection) if subset_pcs else None,
    corr_plot='spearmanr',
    corr_thresh=corr_thresh if subset_pcs else None,
    corr_on=basis,
    ignore=ignore_cols
)

In [None]:
# Subset the number of pcs if chosen in the parameters
if subset_pcs:
    tools.dim_reduction.subset_PCA(adata, select=final_pc_selection)

___________

### 6.2 - Calculate neighbors

In [None]:
sc.pp.neighbors(adata, n_neighbors=n_neighbors, method='umap', metric='euclidean')

________

## 7 - Batch correction
<hr style="border:2px solid black"> </hr>
Batch correction is performed to remove technically introduced artifacts that would affect and potentially degrade the biological results of the data. There are several batch correction methods available, which may perform differently depending on the data set. Therefore, an overview is provided to compare batch correction methods and select the best performing one. To help in the decision making process, several metrics are shown that can be selected below and a score (LISI) is provided that explains whether the batches are well mixed after applying the correction.
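
One simple way to use the score when comparing methods is to rank them by their mean LISI. The sketch below does this for made-up per-cell scores; the method names `method_a` and `method_b` and all numbers are purely illustrative, and such a ranking is only one input next to the visual inspection of the embeddings.

```python
from statistics import mean


def rank_by_mean_score(scores_per_method):
    """Return method names sorted by mean score, highest (best mixed) first."""
    return sorted(
        scores_per_method,
        key=lambda method: mean(scores_per_method[method]),
        reverse=True,
    )


# Made-up per-cell LISI scores for three hypothetical results
lisi_scores = {
    "uncorrected": [1.0, 1.2, 1.1],
    "method_a": [1.8, 1.9, 1.7],
    "method_b": [1.6, 1.5, 1.7],
}
ranking = rank_by_mean_score(lisi_scores)
```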

In [None]:
if perform_batch_correction:
    batch_corrections = tools.norm_correct.wrap_corrections(
        adata, 
        batch_key=batch_column,
        methods=batch_methods
    )
else:
    batch_corrections = {"uncorrected": adata}

__________

### 7.1 - Plot overview of batch corrections

In [None]:
# Run standard UMAP for all AnnData objects
tools.embedding.wrap_umap(batch_corrections.values(), threads=threads)

<h1><center>⬐ Fill in input data here ⬎</center></h1>

In [5]:
%bgcolor PowderBlue

# Should preliminary clustering be performed?
do_clustering = True  # True or False

# select additional metrics shown in the overview plot below
# accepts adata.obs column names or genes (adata.var.index)
color_by = ['total_counts',
            'pct_counts_is_ribo',
            'pct_counts_is_mito',
            'n_genes',
            'phase']

_____________

In [None]:
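# Perform additional clustering if it was chosen
# NOTE: this resets the color_by list chosen in the input cell above;
# remove the next line to keep that selection.
color_by = []
if do_clustering:
    # Use a separate loop variable so the global `adata` is not overwritten
    for adata_corr in batch_corrections.values():
        sc.tl.leiden(adata_corr, resolution=0.1)
    color_by.append("leiden")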
    
# Calculate LISI scores for batch
tools.norm_correct.wrap_batch_evaluation(batch_corrections, batch_key=batch_column, threads=threads, inplace=True)

In [None]:
# Plot the overview of batch correction methods
# Ensure that the batch column is categorical in every AnnData object
for adata_corr in batch_corrections.values():
    adata_corr.obs[batch_column] = adata_corr.obs[batch_column].astype("category")

_ = pl.embedding.anndata_overview(
    batch_corrections,
    color_by=color_by + [batch_column],
    output=None
)

<h1><center>⬐ Fill in input data here ⬎</center></h1>

In [6]:
%bgcolor PowderBlue

# Choose an AnnData object to proceed with
selected = "bbknn"

_________

In [None]:
if not perform_batch_correction and selected != "uncorrected":
    import warnings
    warnings.warn(f"Selected batch correction '{selected}' but batch correction is disabled. Falling back to 'uncorrected'.")
    
    selected = "uncorrected"
elif selected not in batch_corrections:
    raise KeyError(f"'{selected}' is not a key in batch_corrections. Choose one of: {list(batch_corrections.keys())}")

In [None]:
adata = batch_corrections[selected]

______________

## 8 - Saving adata for the next notebook

In [None]:
adata

In [None]:
# Save the data
adata_output = "anndata_3.h5ad"
utils.adata.save_h5ad(adata, adata_output)

In [None]:
sctoolbox.settings.close_logfile()