INtegration of millions of Single Cells using batch-aware Triplet networks
INSCT is a deep learning algorithm which calculates an integrated embedding for scRNA-seq data. With INSCT, you can:
- Integrate scRNA-seq datasets across batches with/without labels.
- Generate a low-dimensional representation of the scRNA-seq data.
- Integrate millions of cells on personal computers.
For more info check out our manuscript.
INSCT learns a data representation that integrates cells across batches. The goal of the network is to minimize the distance between Anchor and Positive while maximizing the distance between Anchor and Negative. Anchor and Positive pairs consist of transcriptionally similar cells from different batches. The Negative is a transcriptionally dissimilar cell sampled from the same batch as the Anchor.
Principal components of the three data points corresponding to Anchor, Positive and Negative are fed into three identical neural networks, which share weights. The triplet loss function is used to train the network weights, and the two-dimensional embedding layer activations represent the integrated embedding.
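The objective described above can be illustrated with a standard margin-based triplet loss. This is a minimal sketch in plain Python, not the INSCT implementation: the function names, the Euclidean distance, and the margin of 1.0 are assumptions made for illustration, and the exact loss variant used by the package is not stated here.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Margin-based triplet loss (illustrative; margin value is an assumption).

    The loss is zero once the Negative is at least `margin` farther
    from the Anchor than the Positive is.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(d_pos - d_neg + margin, 0.0)


# Toy 2D embedding coordinates (made up for this example)
anchor = (0.0, 0.0)
positive = (0.0, 1.0)   # distance 1 from the anchor
negative = (3.0, 4.0)   # distance 5 from the anchor

well_separated = triplet_loss(anchor, positive, negative)
# Swapping the roles puts the far point in the Positive slot, so the loss is large
violating = triplet_loss(anchor, negative, positive)
print(well_separated, violating)
```

Training pushes the network toward embeddings where most triplets look like the first case, with the Positive close to the Anchor and the Negative far away.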
To learn an integrated embedding that overcomes batch effects, INSCT samples triplets in a batch-aware manner:
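A simplified sketch of such a batch-aware sampling rule follows, assuming the Positive is the nearest cell from a different batch and the Negative is the most distant cell from the Anchor's own batch. The function `sample_triplets` is hypothetical and uses brute-force distances on toy coordinates; the package's actual neighbor search and sampling details may differ.

```python
import math


def distance(u, v):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def sample_triplets(points, batches):
    """Return (anchor, positive, negative) index triplets (illustrative only).

    Positive: closest cell from a different batch than the anchor.
    Negative: farthest cell from the same batch as the anchor.
    Anchors without candidates for either role are skipped.
    """
    triplets = []
    for a, (point_a, batch_a) in enumerate(zip(points, batches)):
        other_batch = [i for i, b in enumerate(batches) if b != batch_a]
        same_batch = [i for i, b in enumerate(batches) if b == batch_a and i != a]
        if not other_batch or not same_batch:
            continue
        positive = min(other_batch, key=lambda i: distance(point_a, points[i]))
        negative = max(same_batch, key=lambda i: distance(point_a, points[i]))
        triplets.append((a, positive, negative))
    return triplets


# Toy data: two groups of cells (near 0 and near 10), each observed in batches A and B
points = [(0.0, 0.0), (10.0, 10.0), (0.5, 0.0), (10.0, 10.5)]
batches = ["A", "A", "B", "B"]

triplets = sample_triplets(points, batches)
print(triplets)
```

Because every Positive comes from another batch and every Negative from the Anchor's own batch, minimizing the triplet loss on such triplets pulls similar cells together across batches rather than within them.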
For example, we simulated scRNA-seq data in which batch effects dominate the embedding:
However, INSCT learns an integrated embedding where cells cluster by group instead of batch:
The following notebooks can be run within your web browser and allow you to interactively explore tnn. We have prepared the following analysis examples:
Notebooks to reproduce the analyses described in our preprint can be found in the reproducibility folder.
INSCT depends on the following Python packages. These need to be installed separately:
ivis==1.7.2
scanpy
hnswlib
To install INSCT, follow these instructions:
Install directly from GitHub using pip:
pip install git+https://github.com/lkmklsmn/insct.git
Alternatively, download the package from GitHub and install it locally:
git clone http://github.com/lkmklsmn/insct
cd insct
pip install .
Triplets sampled based on transcriptional similarity
- AnnData object with PCs
- Batch vector
from insct.tnn import TNN
model = TNN()
# adata is the AnnData object with PCs; batch_name refers to the batch vector
model.fit(X=adata, batch_name='batch')
Triplets sampled based on both transcriptional similarity and known labels
- AnnData object with PCs
- Batch vector
- Celltype vector
model = TNN()
# celltype_name refers to the cell type vector holding the known labels
model.fit(X=adata, batch_name='batch', celltype_name='Celltypes')
Triplets sampled based on both transcriptional similarity and known labels
- AnnData object with PCs
- Batch vector
- Celltype vector
- Masking vector (which labels to ignore)
model = TNN()
# mask_batch takes the masking vector (which labels to ignore)
model.fit(X=adata, batch_name='batch', celltype_name='Celltypes', mask_batch=batch_name)
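To picture what ignoring labels can mean, here is a small sketch that hides the known cell type labels for all cells from selected batches. It assumes masking is applied per batch, which the `mask_batch` argument name only hints at; `mask_labels` and the `"unknown"` placeholder are hypothetical and not part of INSCT.

```python
def mask_labels(celltypes, batches, masked_batches, unknown="unknown"):
    """Replace the labels of cells from masked batches with a placeholder.

    Illustrative helper only: cells whose batch is listed in `masked_batches`
    are treated as unlabeled, all other cells keep their known label.
    """
    masked = set(masked_batches)
    return [unknown if b in masked else c for c, b in zip(celltypes, batches)]


# Toy annotation vectors (made up for this example)
celltypes = ["T cell", "B cell", "T cell", "B cell"]
batches = ["A", "A", "B", "B"]

# Treat batch B as unlabeled
visible = mask_labels(celltypes, batches, masked_batches=["B"])
print(visible)
```

Under this reading, labeled cells could contribute label-based triplets, while masked cells would fall back on transcriptional similarity alone.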
- Coordinates for the integrated embedding