BANKSY: Spatial Clustering Algorithm that Unifies Cell-Typing and Tissue Domain Segmentation (version 1.1.1)
Vipul Singhal*, Nigel Chou*, Joseph Lee, Yifei Yue, Jinyue Liu, Wan Kee Chock, Li Lin, YunChing Chang, Erica Teo, Hwee Kuan Lee, Kok Hao Chen# and Shyam Prabhakar#
This repository contains the code base and examples for Building Aggregates with a Neighborhood Kernel and Spatial Yardstick developed for: BANKSY: A Spatial Clustering Algorithm that Unifies Cell Typing and Tissue Domain Segmentation.
BANKSY is a method for clustering spatial transcriptomic data by augmenting the transcriptomic profile of each cell with an average of the transcriptomes of its spatial neighbors.
By incorporating neighborhood information for clustering, BANKSY is able to:
- Improve cell-type assignment in noisy data
- Distinguish subtly different cell-types stratified by microenvironment
- Identify spatial zones sharing the same microenvironment
BANKSY is applicable to a wide variety of spatial technologies (e.g. 10x Visium, Slide-seq, MERFISH) and scales well to large datasets. For more details on use-cases and methods, see the preprint.
In this Python version of BANKSY (compatible with Scanpy), we show how BANKSY can be used for task 1 (improving cell-type assignment) using Slide-seq and Slide-seq V2 mouse cerebellum datasets.
The R version of BANKSY is available here (https://github.com/prabhakarlab/Banksy).
System requirement: a machine with at least 16 GB of RAM. BANKSY is highly scalable and runs quickly on CPU, even for large datasets.
This software requires the following packages and has been tested on the following versions:
- Python >= 3.8
- Scanpy >= 1.8.1
- Anndata >= 0.7.1
- numpy >= 1.21
- scipy >= 1.6
- umap >= 0.5.1
- scikit-learn >= 0.24.2
- python-igraph >= 0.9
- leidenalg
(Optional) As alternatives to leiden clustering, we also support mclust and louvain. To use these clustering algorithms, you need these additional packages:
- R == 4.2.3
- r-mclust == 6.0.0
- rpy2 >= 3.4.0
- louvain
To use Banksy_py, we recommend setting up a conda environment and installing the prerequisite packages, then cloning this repository.
(base) $ conda create --name banksy
(base) $ conda activate banksy
(banksy) $ conda install -c conda-forge scanpy python-igraph leidenalg
(banksy) $ git clone https://github.com/prabhakarlab/Banksy_py.git
(banksy) $ cd Banksy_py
To run the examples presented in the Jupyter notebooks, install Jupyter.
(banksy) $ conda install -c conda-forge jupyter
Try out BANKSY by running the examples in the provided IPython notebooks: slideseqv1_analysis.ipynb and/or slideseqv2_analysis.ipynb. More details on running BANKSY are provided within the notebooks.
To run the slideseq_v2 dataset, please download the data from the original source and save it in the data/slide_seq/v2 folder.
Users can directly install the prerequisite packages (which replicates our Anaconda environment) from environment.yml after cloning this repository:
(base) $ git clone https://github.com/prabhakarlab/Banksy_py.git
(base) $ cd Banksy_py
(base) $ conda env create --name banksy --file=environment.yml
(base) $ conda activate banksy
Users who have python=3.8-3.11 and pip can also install our environment from requirements.txt after cloning this repository:
$ git clone https://github.com/prabhakarlab/Banksy_py.git
$ cd Banksy_py
$ pip install -r requirements.txt
We are working on publishing this package to PyPI so that future users can install it via pip install banksy.
To run BANKSY on a spatial single-cell expression dataset in anndata format:
- Preprocess the gene-cell matrix using Scanpy. This includes filtering out cells and genes by various criteria, and (for sequencing-based technologies, e.g. 10x Visium or Slide-seq) selecting the most highly variable genes.
- Call initialize_banksy to generate the spatial graph (stored in the banksy_dict object).
- Call run_banksy_multiparam to perform dimensionality reduction and clustering.
Note that individual BANKSY matrices (for given hyperparameter settings) can be accessed from the banksy_dict object. For example, to access the BANKSY matrix generated using scaled_gaussian decay and lambda = 0.2, use banksy_dict['scaled_gaussian'][0.2]["adata"].
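A minimal sketch of walking this nested structure, assuming banksy_dict is keyed first by decay type and then by lambda, with each matrix stored under "adata" as in the access example above. The dictionary below is a stand-in with placeholder strings and placeholder keys; in a real run the innermost values are AnnData objects, and the helper name iter_banksy_matrices is hypothetical.

```python
# Stand-in for the nested layout: decay type -> lambda -> {"adata": ...}.
# The strings are placeholders; a real banksy_dict holds AnnData objects.
banksy_dict = {
    "scaled_gaussian": {
        0.0: {"adata": "<non-spatial matrix>"},
        0.2: {"adata": "<BANKSY matrix, lambda = 0.2>"},
    }
}


def iter_banksy_matrices(results):
    """Yield (decay_type, lambda, matrix) for every stored hyperparameter setting."""
    for decay_type, by_lambda in results.items():
        for lam, entry in by_lambda.items():
            yield decay_type, lam, entry["adata"]


for decay_type, lam, matrix in iter_banksy_matrices(banksy_dict):
    print(decay_type, lam, matrix)
```

Looping this way makes it easy to compare clusterings across lambda values without hard-coding each key.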
(Optional) For advanced users who want to understand the entire BANKSY pipeline, you can also run the individual steps below:
- Preprocess the gene-cell matrix (as above), then z-score by gene using banksy.main.zscore or scanpy.pp.scale. Functions provided in the Scanpy package handle most of these steps. Parameters and filtering criteria may vary by spatial technology and dataset source.
- Construct the spatial graph, which defines spatial neighbour relationships, using banksy.main.generate_spatial_weights_fixed_nbrs. This outputs a sparse adjacency matrix defining the graph. Visualize the weights with banksy_utils.plotting.plot_graph_weights. Some parameters that affect this step are:
  - $k_{geom}$: the spatial graph connects each cell to its $k_{geom}$ nearest neighbours. This spatial graph is the basis on which the neighbourhood matrix $M$ and the azimuthal Gabor filter (AGF) matrix $G$ are constructed.
  - decay types: By default, we recommend scaled_gaussian, which weights neighbour expression with a Gaussian envelope. Alternatives include uniform, which weights all neighbours equally; reciprocal, which weights neighbours by $1/r$, where $r$ is the distance from the neighbouring cell to the index cell; and ranked, which ranks neighbouring cells by distance (farther cells have higher rank) and then applies Gaussian decay by rank. The neighbour weights are always normalized to sum to 1 for each cell.
  - generate_spatial_weights_fixed_radius (not used in paper) generates a spatial graph where each cell is connected to all cells within a given radius. This leads to variable numbers of neighbours per cell.
- Generate the neighbour expression matrix $N$ (ncells by ngenes) by using the spatial graph to average over spatial neighbours. The neighbourhood matrix can be computed by sparse matrix multiplication of the spatial graph's adjacency matrix with the gene-cell matrix. Similarly, the AGF matrix $G$ (ncells by ngenes), which represents the magnitude of the expression gradient, is generated from the azimuthal transform.
- Scale the original expression matrix by $\sqrt{1 - \lambda}$ and the neighbour expression matrix by $\sqrt{\lambda}$, then concatenate the matrices to obtain the neighbour-augmented expression matrix (ncells by 2ngenes) using banksy.main.weighted_concatenate with neighbourhood_contribution = $\lambda$. These operations are performed on the numerical data adata.X; use banksy.main.bansky_matrix_to_adata to recover an Anndata object with the appropriate annotations.
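The graph construction, neighbour averaging, and weighted concatenation described above can be sketched with NumPy and SciPy. This is an illustration under stated assumptions, not the code in banksy.main: the helper names knn_spatial_weights and concat_own_and_neighbour are hypothetical, the Gaussian width is taken to be the median neighbour distance (the package's scaled_gaussian decay may define its scale differently), and both the ranked decay and the AGF matrix $G$ are left out.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.spatial import cKDTree


def knn_spatial_weights(coords, k_geom, decay="scaled_gaussian"):
    """Sparse (ncells x ncells) weights to each cell's k_geom nearest neighbours."""
    n_cells = coords.shape[0]
    # Ask for k_geom + 1 hits because the closest hit is the cell itself
    # (assumes no two cells share identical coordinates).
    dists, idx = cKDTree(coords).query(coords, k=k_geom + 1)
    dists, idx = dists[:, 1:], idx[:, 1:]
    if decay == "uniform":
        weights = np.ones_like(dists)
    elif decay == "reciprocal":
        weights = 1.0 / dists
    elif decay == "scaled_gaussian":
        sigma = np.median(dists)  # assumption: one global length scale
        weights = np.exp(-(dists ** 2) / (2.0 * sigma ** 2))
    else:
        raise ValueError(f"unsupported decay type: {decay}")
    # Normalise so the neighbour weights of every cell sum to 1.
    weights = weights / weights.sum(axis=1, keepdims=True)
    rows = np.repeat(np.arange(n_cells), k_geom)
    return csr_matrix((weights.ravel(), (rows, idx.ravel())),
                      shape=(n_cells, n_cells))


def concat_own_and_neighbour(own, nbr, lam):
    """Scale own expression by sqrt(1 - lam), neighbour expression by sqrt(lam), and join."""
    return np.hstack([np.sqrt(1.0 - lam) * own, np.sqrt(lam) * nbr])


rng = np.random.default_rng(0)
coords = rng.uniform(0, 100, size=(50, 2))  # toy cell positions
expr = rng.normal(size=(50, 6))             # toy z-scored expression (cells x genes)

W = knn_spatial_weights(coords, k_geom=5)
nbr_expr = W @ expr                         # sparse product averages over neighbours
augmented = concat_own_and_neighbour(expr, nbr_expr, lam=0.2)
print(augmented.shape)
```

Setting lam to 0 recovers the original expression matrix padded with zeros, which gives a convenient non-spatial baseline for comparison.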
The following steps are identical to single-cell RNA-seq analysis:
- Dimensionality reduction, particularly PCA, to reduce the expression matrix (either neighbour-augmented or original for comparison) to (ncells by nPCA_dims). As a default, we set $PCA_{dims}$ = 20.
- Clustering cells by finding neighbours in expression space and applying graph-based clustering. Here we find expression-neighbours and perform Leiden clustering following the implementation in Giotto.
- Refinement (Optional): In the presence of noisy clusters, we offer an optional refinement step via banksy_utils.refine_clusters to smooth cluster labels, intended exclusively for domain segmentation tasks. However, we do not recommend excessive refinement, as it erodes fine-grained domains.
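The dimensionality reduction step can be illustrated with a plain NumPy PCA. This is a minimal sketch rather than the routine BANKSY calls internally; the helper name pca_reduce is hypothetical and the input is random stand-in data in place of a real neighbour-augmented matrix.

```python
import numpy as np


def pca_reduce(matrix, n_dims=20):
    """Project a (cells x features) matrix onto its top n_dims principal components."""
    centred = matrix - matrix.mean(axis=0)
    # Thin SVD: the columns of u * s are the principal component scores.
    u, s, _ = np.linalg.svd(centred, full_matrices=False)
    return u[:, :n_dims] * s[:n_dims]


rng = np.random.default_rng(1)
augmented = rng.normal(size=(200, 60))  # stand-in for a (ncells x 2*ngenes) matrix
embedding = pca_reduce(augmented, n_dims=20)
print(embedding.shape)
```

The resulting (ncells by 20) embedding is what a graph-based clustering method such as Leiden would take as input.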
- banksy.main.LeidenPartition: Finds neighbours in expression space and performs Leiden clustering. Aims to replicate the implementation from the Giotto package as of 2020 to align with the R version of the code. Note that Scanpy also has a Leiden clustering implementation, with a different procedure for defining expression neighbours, that can be used as an alternative. BANKSY is compatible with any clustering algorithm that takes a feature-cell matrix as input.
- labels.Label: Object for convenient computation with class labels. Converts labels to sparse one-hot vectors for fast computation of connectivity across clusters with the spatial graph. To obtain an array of integer labels in the usual format (e.g. [1, 1, 5, 2, ...]), use Label.dense.
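The idea behind a sparse one-hot label encoding can be sketched as follows. This is an illustration of the technique, not the labels.Label class itself: the helper name one_hot_sparse is hypothetical, and the six-cell chain graph is toy data.

```python
import numpy as np
from scipy.sparse import csr_matrix


def one_hot_sparse(labels):
    """Return the sorted class ids and a sparse (ncells x nclasses) one-hot matrix."""
    classes, codes = np.unique(labels, return_inverse=True)
    n_cells = len(labels)
    onehot = csr_matrix((np.ones(n_cells), (np.arange(n_cells), codes)),
                        shape=(n_cells, len(classes)))
    return classes, onehot


labels = np.array([1, 1, 5, 2, 5, 1])
classes, onehot = one_hot_sparse(labels)

# Toy spatial graph: six cells in a chain, each linked to its immediate neighbours.
rows = np.array([0, 1, 1, 2, 2, 3, 3, 4, 4, 5])
cols = np.array([1, 0, 2, 1, 3, 2, 4, 3, 5, 4])
graph = csr_matrix((np.ones(10), (rows, cols)), shape=(6, 6))

# Entry (a, b) sums the edge weights running from cluster a to cluster b.
connectivity = (onehot.T @ graph @ onehot).toarray()

# Recover the dense integer labels from the one-hot encoding.
dense = classes[np.asarray(onehot.argmax(axis=1)).ravel()]
print(connectivity)
print(dense)
```

Two sparse products replace an explicit loop over edges, which is why the one-hot form is convenient for large spatial graphs.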
We recommend the following examples to get started with the BANKSY package:
- Analyzing Slideseqv1 dataset with BANKSY
- Analyzing Slideseqv2 dataset with BANKSY
- Analyzing Starmap dataset
To reproduce the results from our manuscript, please use the branch BANKSY-manuscript.
Bug reports, questions, requests for enhancements, or other contributions can be raised on the issue page. Our team will do its best to resolve them.
- Nigel Chou (https://github.com/chousn)
- Yifei Yue (https://github.com/yifei-1021)
- Vipul Singhal - developed R version of BANKSY, compatible with Seurat - (https://github.com/vipulsinghal02)
- Joseph Lee - developed R version of BANKSY, compatible with Seurat - (https://github.com/jleechung)
Refer to requirements.txt for the supported versions of different packages.
If you want to use or cite BANKSY, please refer to the following paper:
BANKSY unifies cell typing and tissue domain segmentation for scalable spatial omics data analysis, which can be accessed here: https://www.nature.com/articles/s41588-024-01664-3
This project is licensed under the GPLv3 license. See the LICENSE.md file for details.