mdozmorov/scATAC-seq_notes

scATAC-seq data analysis tools and papers

License: MIT PR's Welcome

Single-cell ATAC-seq related tools and genomics data analysis resources. Tools are sorted by publication date, reviews and most recent publications on top. Unpublished tools are listed at the end of each section. Please, contribute and get in touch! See MDmisc notes for other programming and genomics-related notes. See scRNA-seq_notes for scRNA-seq related resources.

Table of contents

  • Review of chromatin accessibility profiling methods (wet-lab, technologies, downstream analysis and tools, applications), both bulk and single cell. DNase-seq, ATAC-seq, MNase-seq, many more. Multi-omics technologies, integrative approaches.
    Paper Minnoye, Liesbeth, Georgi K. Marinov, Thomas Krausgruber, Lixia Pan, Alexandre P. Marand, Stefano Secchia, William J. Greenleaf, et al. “Chromatin Accessibility Profiling Methods.” Nature Reviews Methods Primers, (December 2021), https://doi.org/10.1038/s43586-020-00008-9. Supplementary Table 1 - Commonly used bioinformatics tools for data processing and analysis of bulk and single-cell chromatin accessibility data, https://static-content.springer.com/esm/art%3A10.1038%2Fs43586-020-00008-9/MediaObjects/43586_2020_8_MOESM1_ESM.pdf
  • Single-cell multiomics technologies, integration of transcriptome with genome, epigenome, and proteome. Table 1 - summary of technologies. Cell isolation and barcoding. Figure 2 - genome-transcriptome single-cell technologies, Figure 3 - epigenome-transcriptome technologies, Figure 4 - proteome-transcriptome technologies. Figure 5 - overview of computational methods (dimensionality reduction, clustering, network, pseudotime inference, CNV detection), references to reviews. Integrative analysis methods (LIGER, MOFA).
    Paper Lee, Jeongwoo, Do Young Hyeon, and Daehee Hwang. “Single-Cell Multiomics: Technologies and Data Analysis Methods.” Experimental & Molecular Medicine, September 15, 2020. https://doi.org/10.1038/s12276-020-0420-2.

Preprocessing pipelines

  • scATAC-seq analysis guidelines. Technologies, data preprocessing, peak annotation, QC, matrix building, batch correction, dimensionality reduction, visualization, clustering, cell identity annotation, chromatin accessibility dynamics, motif analysis. Table 1 - summary of 13 pipelines. Tools, methods, databases.

  • Benchmarking of 10 scATAC-seq analysis methods (brief description of each in Methods) on 10 synthetic (various depth and noise levels) and 3 real datasets. scATAC technology overview, problems. Three clustering methods (K-means, Louvain, hierarchical clustering), adjusted Rand index, adjusted mutual information, homogeneity for benchmarking against gold-standard clustering, Residual Average Gini Index for benchmarking against gene markers (silver standard). SnapATAC, Cusanovich2018, cisTopic perform best overall. R code, Jupyter notebooks

Extended Data Fig. 1: Comparison of supported features from currently available scATAC-seq software., from ArchR paper.

  • SnapATAC2 - scATAC-seq processing pipeline. Main improvement - a fast nonlinear dimensionality reduction algorithm, matrix-free spectral clustering, Lanczos algorithm to derive eigenvectors while implicitly using the Laplacian matrix. Four primary modules: preprocessing, embedding/clustering (includes batch correction), functional enrichment analysis, and multi-modal omics analysis. Starts from BAM files, scaling columns with IDF. Outperforms ArchR (LSI), Signac (LSI), cisTopic (LDA), epiScanpy (PCA), PeakVI, scBasset in speed, scalability and precision in resolving cell heterogeneity on synthetic and experimental data from different technologies, species, and tissue types. Applicable to any omics data (scHi-C, scRNA-seq, single-cell methylation). Rust, with Python interface. Benchmarking datasets, Docker image and code to reproduce the analysis.
    Paper Zhang, Kai, Nathan R. Zemke, Ethan J. Armand, and Bing Ren. "SnapATAC2: a fast, scalable and versatile tool for analysis of single-cell omics data." bioRxiv (2023). https://doi.org/10.1101/2023.09.11.557221
  • PIC-snATAC - Paired-Insertion-Counting method for snATAC-seq feature characterization. Fragment-based (number of reads in the union of peaks) and insertion-based (number of Tn5 insertions in the appropriate direction) ATAC-seq quantification. Methods and tools overview, contrasting differences. Applied to mouse kidney snATAC-seq data, 10X Genomics PBMC, a Bone Marrow Mononuclear Cells (BMMC) multiome dataset. Better resolves cell types, association with gene expression.
    Paper Miao, Zhen, and Junhyong Kim. “Is Single Nucleus ATAC-Seq Accessibility a Qualitative or Quantitative Measurement?” Preprint. Bioinformatics, April 21, 2022. https://doi.org/10.1101/2022.04.20.488960.
  • ArchR - R package for processing and analyzing single-cell ATAC-seq data. Compared to Signac and SnapATAC, has more functionality, faster, handles large (>1M cells) data. Input - BAM files. Efficient HDF5-based storage allows for large dataset processing. Quality control, doublet detection (similar performance to Scrublet), genome-wide 500bp binning and peak identification, assignment to genes using best performing model, dimensionality reduction (optimized Latent Semantic Indexing, multiple iterations of LSI), clustering, overlap enrichment with a compendium of previously published ATAC-seq datasets, trajectory analysis (Slingshot and Monocle 3), integration with scRNA-seq data (Seurat functionality). Code to reproduce the paper, GitHub. Tweet 1, Tweet 2, Tweet 3

  • SnapATAC - scATAC-seq pipeline for processing, clustering, and motif identification. Snap file format. Genome is binned into equal-size (5kb) windows, binarized with 1/0 for ATAC reads present/absent, Jaccard similarity between cells, normalized to account for sequencing depth (observed over expected method, two others), PCA on the matrix, KNN graph and Louvain clustering to detect communities, tSNE or UMAP for visualization. Motif analysis and GREAT functional enrichment for each cluster. Nystrom algorithm to reduce dimensionality, ensemble approach. Outperforms ChromVAR, LSA, Cicero, cisTopic. Very fast, can be applied to ChIP-seq, scHi-C. SnapTools to work with snap files

  • scATAC-pro - pipeline for scATAC-seq mapping, QC, peak detection, clustering, TF and GO enrichment analysis, visualization (via VisCello). Compared with Scasat, Cellranger-atac.

  • scOpen - estimating open chromatin status in scATAC-seq experiments, aka imputation/smoothing of extremely sparse matrices. Uses positive-unlabelled learning of matrices to estimate the probability that a region is open in a given cell. The probability matrix can be used as input for downstream analyses (clustering, visualization). Integrated with the footprint transcription factor activity score (scHINT). scOpen-estimated matrices, tested as input for scABC, chromVAR, cisTopic, and Cicero, improve their performance

  • scABC, single-cell Accessibility Based Clustering - scATAC-seq clustering. Weights cells by a nonlinear transformation of library sizes, then, weighted K-medoids clustering. Input - single-cell mapped reads, and the full set of called peaks. Applied to experimental and synthetic scATAC-seq data, outperforms simple K-means-based clustering, SC3

  • Cicero - connect distal regulatory elements with target genes (covariance-based, graphical Lasso to compute a regularized covariance matrix) along pseudotime-ordered (Monocle2 or 3) scATAC-seq data. Optionally, adjusts for batch covariates. Applied to the analysis of skeletal myoblast differentiation, sciATAC-seq. R package

  • ChromVAR - scATAC-seq analysis. Identifying peaks, get a matrix of counts across aggregated peaks, tSNE for clustering, identifying motifs. Integrated with Seurat.

  • scATAC-pro - A comprehensive tool for processing, analyzing and visualizing single cell chromatin accessibility sequencing data.
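
The binarization and cell-to-cell Jaccard similarity steps described for SnapATAC above can be illustrated with a minimal pure-Python sketch. This is not SnapATAC's implementation: the function names and the toy count matrix are hypothetical, and the sequencing-depth normalization, PCA, and graph clustering steps are left out.

```python
def binarize(counts):
    """Turn a cell-by-bin count matrix into 1/0 (reads present/absent)."""
    return [[1 if c > 0 else 0 for c in row] for row in counts]


def jaccard(a, b):
    """Jaccard similarity between two binary accessibility profiles."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    # Two cells with no accessible bins at all get similarity 0 by convention.
    return both / either if either else 0.0


def jaccard_matrix(binary):
    """Pairwise Jaccard similarity between all cells (rows)."""
    n = len(binary)
    return [[jaccard(binary[i], binary[j]) for j in range(n)] for i in range(n)]


# Hypothetical toy data: 3 cells x 6 genomic bins of read counts.
toy_counts = [
    [2, 0, 1, 0, 0, 3],
    [1, 0, 4, 0, 0, 0],
    [0, 5, 0, 1, 0, 0],
]
toy_binary = binarize(toy_counts)
toy_similarity = jaccard_matrix(toy_binary)
```

In this toy example, cells 0 and 1 share two of three accessible bins, while cell 2 shares none with either. Deeper-sequenced cells tend to have inflated Jaccard values, which is presumably why a depth normalization follows this step.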

Clustering, visualization

  • ChromSCape - Shiny/R application for single-cell epigenomic data visualization, clustering, differential peak analysis (Wilcoxon, edgeR), linking peaks to genes, pathway enrichment (hypergeometric on MSigDB). Wraps scater, scran, corrects for batch effect using fastMNN from batchelor, determines the optimal number of clusters with ConsensusClusterPlus (2-10 clusters). Input - BAM, BED files, or count matrix. Compared with Cusanovich2018, SnapATAC, CisTopic, EpiScanpy. Multiple datasets. Web demo, GitHub, Code for the paper

  • cisTopic - R/Bioconductor package for probabilistic modelling of cis-regulatory topics from scATAC-seq. Topic modelling for identification of cell types, enhancers, transcription regulators. Binarizing chromatin accessibility matrix, Latent Dirichlet Allocation (LDA, collapsed Gibbs sampler) and model selection, cell state identification using the topic-cell distributions, explorations of the region-topic distributions.

    Paper Bravo González-Blas, Carmen, Liesbeth Minnoye, Dafni Papasokrati, Sara Aibar, Gert Hulselmans, Valerie Christiaens, Kristofer Davie, Jasper Wouters, and Stein Aerts. “CisTopic: Cis-Regulatory Topic Modeling on Single-Cell ATAC-Seq Data.” Nature Methods 16, no. 5 (May 2019): 397–400. https://doi.org/10.1038/s41592-019-0367-1.
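
The hypergeometric pathway enrichment mentioned for ChromSCape above is a standard over-representation test. A minimal sketch follows, assuming a one-sided upper-tail test; the function name, argument names, and numbers are hypothetical and not taken from ChromSCape.

```python
from math import comb


def hypergeom_enrichment_p(n_universe, n_in_set, n_selected, n_overlap):
    """Upper-tail p-value P(X >= n_overlap) of the hypergeometric distribution.

    n_universe: all genes considered
    n_in_set:   genes annotated to the pathway
    n_selected: genes linked to differential peaks
    n_overlap:  selected genes that belong to the pathway
    """
    upper = min(n_in_set, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_in_set, k) * comb(n_universe - n_in_set, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical example: 20 genes in total, 5 in the pathway,
# 4 selected genes, 2 of which are in the pathway.
toy_p = hypergeom_enrichment_p(20, 5, 4, 2)
```

Exact integer binomials keep the sketch dependency-free. For genome-scale gene universes and many gene sets, a statistics library and multiple-testing correction would be the usual choice.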

Imputation

  • scOpen - imputation for scATAC-seq data. Regularized NMF via a coordinate descent algorithm on binarized, TF-IDF-transformed ATAC-seq matrix. Tested on simulated (Chen et al. 2019) and four public scATAC-seq datasets against MAGIC, SAVER, scImpute, DCA, scBFA, cisTopic, SCALE, and PCA. Improves recovery of true open chromatin regions, clustering (ARI, silhouette), reduces memory footprint, fast. Improves the performance of downstream state-of-the-art scATAC-seq methods (cisTopic, chromVAR, Cicero). Applied to kidney fibrosis scATAC-seq data, Runx1 discovery. Scripts to reproduce analyses
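
The TF-IDF transformation of a binarized matrix, used as input to the factorization step in scOpen above, can be sketched as follows. The exact TF-IDF variant used by scOpen is not stated here, so this sketch assumes one common form (term frequency = 1 / number of open peaks in the cell, inverse document frequency = log(1 + cells / cells with the peak open)). The function name and toy matrix are hypothetical, and the regularized NMF itself is not shown.

```python
import math


def tf_idf(binary):
    """TF-IDF transform of a binary cell-by-peak matrix (rows = cells)."""
    n_cells = len(binary)
    n_peaks = len(binary[0])
    # Number of cells in which each peak is open.
    peak_totals = [sum(row[j] for row in binary) for j in range(n_peaks)]
    # Inverse document frequency; peaks open in no cell get weight 0.
    idf = [math.log(1 + n_cells / t) if t else 0.0 for t in peak_totals]
    transformed = []
    for row in binary:
        depth = sum(row)  # number of open peaks in this cell
        transformed.append(
            [(x / depth) * w if depth else 0.0 for x, w in zip(row, idf)]
        )
    return transformed


# Hypothetical toy data: 3 cells x 4 peaks, already binarized.
toy_binary_peaks = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
toy_tfidf = tf_idf(toy_binary_peaks)
```

Peaks open in every cell receive the smallest weight and rare peaks the largest, which down-weights ubiquitous accessibility relative to cell-type-specific signal.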

Integration, Multi-omics methods

  • Review of single-cell multi-omics (scATAC-seq, scRNA-seq) integration principles, methods, and tools. Integration of matched and unmatched data, annotated group matching, matching with common features, aligning spaces. Quantitative causal modeling, statistical modeling, latent space inference, consensus of individual inferences (late integration). Integrating multimodal (jointly profiled) omics data. Brief description of technologies, tools. Visualization of multi-omics data, challenges. Table 1 - tools for matched data analysis, with links.
    Paper Miao, Zhen, Benjamin D. Humphreys, Andrew P. McMahon, and Junhyong Kim. “Multi-Omics Integration in the Age of Million Single-Cell Data.” Nature Reviews Nephrology 17, no. 11 (November 2021): 710–24. https://doi.org/10.1038/s41581-021-00463-x.
  • MUON - multimodal data structure to store and compute on multi-omics data. Meta-data can be object-specific or shared. Implemented in Python, MuData objects stored in HDF5 files. Includes Scanpy (omics data handling), MOFA+ (multi-omics factor analysis), neighbor graph analysis methods, visualization using matplotlib and seaborn. Examples on scRNA- and scATAC-seq PBMC data, others. Documentation. Tutorials web, GitHub. Interfaces for R, Julia

  • JVis, j-SNE and j-UMAP - joint visualization and clustering of multimodal omics data. Goal is to arrange points (here cells) in low-dimensional space such that similarities observed between points in high-dimensional space are preserved, but in all modalities at the same time. Python implementation. Tweet

  • MAESTRO - Model-based AnalysEs of Single-cell Transcriptome and RegulOme. Full pipeline for the integrative analysis of scRNA-seq and scATAC-seq data, wraps external tools (STARsolo/minimap2, RSeQC, MACS2, Seurat for normalization, LISA, GIGGLE). From preprocessing, alignment, QC (technology-specific), expression/accessibility quantification to clustering (graph-based and density-based), differential analysis, cell type annotation (CIBERSORT, brain cell signatures), transcription regulator inference (regulatory potential model, using CistromeDB data), integration/cell label transfer (Canonical Correlation Analysis). Handles data from various platforms (with/without barcodes). Outperforms SnapATAC, cicero, Seurat. Snakemake workflow, HDF5 data format, Conda installation. Tweet.

  • Signac is an extension of Seurat for the analysis, interpretation, and exploration of single-cell chromatin datasets, and integration with scRNA-seq. ChromatinAssay object class, Latent Semantic Indexing and the modified TF-IDF procedure for dimensionality reduction. Applied to the PBMC 10X multiomics dataset and the Brain Initiative Cell Census Network data. Also, the Sinto Python package for processing aligned single-cell data

  • UnionCom - integration of multi-omics single-cell data using unsupervised topological alignment. Based on GUMA (generalized unsupervised manifold alignment) algorithm. Three steps: 1) embedding each single-cell dataset into the geometric distance matrix; 2) Align distance matrices; 3) Project unmatched features onto common embedding space. Tested on simulated and experimental data (sc-GEM, scNMT). Neighborhood overlap metric for testing, outperforms Seurat, MMD-MA, scAlign.

  • scAI - integrative analysis of scRNA-seq and scATAC-seq or scMethylation data measured from the same cells (in contrast to different measures sampled from the same cell population). Overview of multi-omics single-cell technologies, methods for data integration in bulk samples and single-cell samples (MATCHER, Seurat, LIGER), sparsity (scATAC-seq is ~99% sparse and nearly binary). Deconvolution of both single-cell matrices into gene loading and locus loading matrices, a cell loading matrix, in which factors K correspond to loadings of gene, locus, and cell in the K-dimensional space. A strategy to reduce over-aggregation. Cell subpopulations identified by Leiden clustering of the cell loading matrix. Visualization of the low-rank matrices with the Sammon mapping. Multi-omics simulation using MOSim, eight scenarios of simulated data, AUROC and Normalized Mutual Information (NMI) assessment of matrix reconstruction quality. Compared with MOFA, Seurat, LIGER. Tested on 8837 mammalian kidney cells scRNA-seq and scATAC-seq data, 77 mouse ESCs scRNA-seq and scMethylation, interpretation.

  • Harmony - scRNA-seq integration by projecting datasets into a shared embedding where differences between cell clusters are maximized while differences between datasets of origin are minimized, i.e., the focus is on clusters driven by cell differences. Can account for technical and biological covariates. Can integrate scRNA-seq datasets obtained with different technologies, or scRNA- and scATAC-seq, scRNA-seq with spatially-resolved transcriptomics. Local inverse Simpson Index (LISI) to test for dataset- and cell-type-specific clustering. Outperforms MNN, BBKNN, MultiCCA, Scanorama. Memory-efficient, fast, scales to large datasets, included in Seurat. Python version

  • Seurat v.3 paper. Integration of multiple scRNA-seq and other single-cell omics (spatial transcriptomics, scATAC-seq, immunophenotyping), including batch correction. Anchors as reference to harmonize multiple datasets. Canonical Correlation Analysis (CCA) coupled with Mutual Nearest Neighbors (MNN) to identify shared subpopulations across datasets. CCA to reduce dimensionality, search for MNN in the low-dimensional representation. Shared Nearest Neighbor (SNN) graphs to assess similarity between two cells. Outperforms scmap. Extensive validation on multiple datasets (Human Cell Atlas, STARmap mouse visual cortex spatial transcriptomics, Tabula Muris, 10X Genomics datasets, others in STAR methods). Data normalization, variable feature selection within- and between datasets, anchor identification using CCA (methods), their scoring, batch correction, label transfer, imputation. Methods correspond to details of each Seurat function. Preprocessing of real single-cell data.

    • Stuart, Tim, Andrew Butler, Paul Hoffman, Christoph Hafemeister, Efthymia Papalexi, William M Mauck, Marlon Stoeckius, Peter Smibert, and Rahul Satija. “Comprehensive Integration of Single Cell Data.” Preprint. Genomics, November 2, 2018.
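
The mutual nearest neighbor (MNN) search that underlies anchor identification in the Seurat v.3 entry above can be sketched in a few lines. This is a simplified illustration, not Seurat's code: it assumes both datasets are already embedded in a shared low-dimensional space (CCA in the paper; hypothetical 2-D coordinates here), and it omits anchor scoring and filtering.

```python
def nearest(query, reference, k):
    """Indices of the k reference points closest to query (squared Euclidean)."""
    dists = [sum((q - r) ** 2 for q, r in zip(query, ref)) for ref in reference]
    return set(sorted(range(len(reference)), key=dists.__getitem__)[:k])


def mutual_nearest_neighbors(data_a, data_b, k=1):
    """Pairs (i, j) where b[j] is among the k-NN of a[i] and vice versa."""
    a_to_b = [nearest(a, data_b, k) for a in data_a]
    b_to_a = [nearest(b, data_a, k) for b in data_b]
    return sorted(
        (i, j)
        for i, neighbors in enumerate(a_to_b)
        for j in neighbors
        if i in b_to_a[j]
    )


# Hypothetical embeddings of cells from two datasets.
toy_a = [(0.0, 0.0), (5.0, 5.0), (9.0, 0.0)]
toy_b = [(0.5, 0.0), (5.0, 4.5), (20.0, 20.0)]
toy_anchors = mutual_nearest_neighbors(toy_a, toy_b, k=1)
```

In the toy data, the third cell of each dataset has no counterpart in the other, so only the first two cells form mutual pairs. Requiring mutuality is what keeps dataset-specific populations from being forced into a match.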

Technology

  • Single-cell ATAC + RNA co-assay methods - overview of technologies and protocols, references to the original papers

  • Multi-omics methods - Table 1 from Sierant, Michael C., and Jungmin Choi. “Single-Cell Sequencing in Cancer: Recent Applications to Immunogenomics and Multi-Omics Tools.” Genomics & Informatics 16, no. 4 (December 2018)

  • Spatial scATAC-seq technology. Integrates transposase-accessible chromatin profiling in tissue sections with barcoded solid-phase capture to perform spatially resolved epigenomics. Highly concordant with single-nucleus snATAC-seq. Applied to three stages of mouse embryonic development. Enables discovery of regulatory programs via clustering (TFIDF) integration with Visium spatial scRNA-seq. Preprocessing with 10X Genomics’ CellRanger ATAC pipeline (v.2.0.0), STutility R package, ArchR. GSE214991 - spatial ATAC data matrix, Github.

    Paper Llorens-Bobadilla, Enric, Margherita Zamboni, Maja Marklund, Nayanika Bhalla, Xinsong Chen, Johan Hartman, Jonas Frisén, and Patrik L. Ståhl. “Solid-Phase Capture and Profiling of Open Chromatin by Spatial ATAC.” Nature Biotechnology, January 5, 2023. https://doi.org/10.1038/s41587-022-01603-9.
  • mtscATAC-seq - mitochondrial scATAC-seq for mtDNA mutation calling and/using mitochondrial chromatin accessibility. Inference of mtDNA heteroplasmy (two or more variants in the same cell), clonal relationships, cell state, chromatin accessibility variation. 10X Genomics, processing whole cells without depleting mitochondria. Computational approach to map reads aligning to NUMTs in the nuclear genome to mtDNA. About 1% reads from NUMTs would be detected, unlikely to confound. Applied to GM11906. Developed the Mitochondrial Genome Analysis Toolkit mgatk to identify clonal substructure in mtscATAC-seq data, variants annotated using MITOMAP database. GSE142745 - mtscATAC-seq data for several studies. GitHub - scripts to reproduce analyses.
    Paper Lareau, Caleb A., Leif S. Ludwig, Christoph Muus, Satyen H. Gohil, Tongtong Zhao, Zachary Chiang, Karin Pelka, et al. “Massively Parallel Single-Cell Mitochondrial DNA Genotyping and Chromatin Profiling.” Nature Biotechnology 39, no. 4 (April 2021): 451–61. https://doi.org/10.1038/s41587-020-0645-6.
  • scGET-seq, single-cell genome and epigenome by transposases sequencing technology, uses a hybrid transposase treatment including the canonical Tn5 and TnH recognizing the chromodomain of the heterochromatin protein-1a (HP-1a) that maintains heterochromatin by binding to H3K9me3. Each transposase differentially barcoded. Probes both open and closed chromatin, better resolves CNVs than scATAC-seq. scGET-seq in NIH-3T3 cells before and after Kdm5c histone demethylase knockdown (impairs H3K9me3 deposition). Chromatin Velocity method that identifies the trajectories of epigenetic modifications. Data on Array Express, scGET analysis scripts and scatACC for custom scATAC analysis.
    Paper Tedesco, Martina. “Chromatin Velocity Reveals Epigenetic Dynamics by Single-Cell Profiling of Heterochromatin and Euchromatin.” Nature Biotechnology, 11 October 2021, https://doi.org/10.1038/s41587-021-01031-1
  • SHARE-seq - simultaneous profiling of scRNA-seq and sc-ATAC-seq from the same cells. Built upon SPLiT-seq, a combinatorial indexing method. Confirmed by separate scRNA-seq and scATAC-seq datasets. Chromatin opening precedes transcriptional activation.
    Paper Ma, Sai, Bing Zhang, Lindsay M. LaFave, Andrew S. Earl, Zachary Chiang, Yan Hu, Jiarui Ding, et al. “Chromatin Potential Identified by Shared Single-Cell Profiling of RNA and Chromatin.” Cell 183, no. 4 (November 2020): 1103-1116.e20. https://doi.org/10.1016/j.cell.2020.09.056.
  • dscATAC-seq (droplet single-cell assay for transposase-accessible chromatin using sequencing), with combinatorial indexing (dsciATAC-seq, cells are combinatorially barcoded, multiple cells per droplet). After Tn5 transposing (increased concentration), intact nuclei are isolated into droplets. Increased library size, complexity (chromVAR), proportion of TSS/nuclear fragments, high human/mouse specificity. tSNE clustering using the latent semantic indexing (LSI), better resolved clusters, uncorrelated with technical batches. Applied to (1) a reference map of chromatin accessibility in the mouse brain (46,653 cells) and (2) an unbiased map of human hematopoietic states in the bone marrow (60,495 cells), isolated cell populations from bone marrow and blood (52,873 cells), and bone marrow cells in response to stimulation (75,958 cells). Data at GSE123581. Analysis code, computational pipeline BAP.
    Paper Lareau, Caleb A., Fabiana M. Duarte, Jennifer G. Chew, Vinay K. Kartha, Zach D. Burkett, Andrew S. Kohlway, Dmitry Pokholok, et al. “Droplet-Based Combinatorial Indexing for Massive-Scale Single-Cell Chromatin Accessibility.” Nature Biotechnology 37, no. 8 (August 2019): 916–24. https://doi.org/10.1038/s41587-019-0147-6.
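
The per-cell heteroplasmy mentioned in the mtscATAC-seq entry above can be thought of as the fraction of reads carrying the alternate allele at an mtDNA position. A minimal sketch follows, with hypothetical read counts and a hypothetical `min_depth` coverage cutoff. It is not how mgatk calls variants, which involves additional quality filters.

```python
def heteroplasmy(alt_counts, total_counts, min_depth=10):
    """Per-cell alternate-allele fraction at one mtDNA position.

    Cells covered by fewer than min_depth reads return None.
    """
    fractions = []
    for alt, total in zip(alt_counts, total_counts):
        if total < min_depth:
            fractions.append(None)  # too few reads to estimate
        else:
            fractions.append(alt / total)
    return fractions


# Hypothetical read counts for four cells at a single position.
toy_alt = [0, 12, 50, 3]
toy_total = [40, 48, 50, 5]
toy_fractions = heteroplasmy(toy_alt, toy_total)

# Cells carrying both alleles (fraction strictly between 0 and 1).
heteroplasmic_cells = [
    i for i, f in enumerate(toy_fractions) if f is not None and 0.0 < f < 1.0
]
```

Cells sharing a heteroplasmic variant at similar fractions are candidates for a common clonal origin, which is the basis for the clonal substructure analysis described in the entry.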

Data

Human

  • CATLAS, Cis-element ATLAS - sciATAC-seq on 25 human tissue types, approx. 500,000 nuclei, over 750,000 candidate cis-regulatory elements (cCREs) in 54 distinct cell types. Cell- and tissue-specific gene regulatory programs. Analysis of noncoding variant effect on TF binding sites (deltaSVM model, 460 TFs affected, 302 likely causal GWAS variants prioritized). Downloadable data, hg38 coordinates of cCREs, chromatin accessibility matrices aggregated as cell x cCRE, cell x gene (promoter), cell metadata, ontology, UMAP embeddings, bigWig tracks, cCRE to gene linkage data predicted by the Activity-By-Contact (ABC) model. README
    Paper Zhang, Kai, James D. Hocker, Michael Miller, Xiaomeng Hou, Joshua Chiou, Olivier B. Poirion, Yunjiang Qiu, et al. “A Single-Cell Atlas of Chromatin Accessibility in the Human Genome.” Cell, November 2021, S0092867421012794. https://doi.org/10.1016/j.cell.2021.10.024.
  • Single-cell epigenomic identification of inherited risk loci in Alzheimer’s and Parkinson’s disease. scATAC-seq data integrated with published HiChIP data. GitHub. GEO GSE147672 - processed scATAC-seq data, BED, bigWig, SummarizedExperiment. WashU session drS3o1n4kJ with scATAC clusters, cell types, neuron subclusters and cell types. Supplementary data with scATAC-seq peaks, neuronal cluster definitions, differential accessibility.
    Paper Corces, M. Ryan, Anna Shcherbina, Soumya Kundu, Michael J. Gloudemans, Laure Frésard, Jeffrey M. Granja, Bryan H. Louie, et al. “Single-Cell Epigenomic Analyses Implicate Candidate Causal Variants at Inherited Risk Loci for Alzheimer’s and Parkinson’s Diseases.” Nature Genetics 52, no. 11 (November 2020): 1158–68. https://doi.org/10.1038/s41588-020-00721-x.

Mouse

  • scRNA-seq and scATAC-seq of normal mammary epithelial cells (MECs, mouse). 4 main clusters, their characteristics. Trajectory analysis, regulatory modules and TFs. Seurat/Signac, Monocle, Cicero, cisTopic, ChromVar, Homer. Processed data: GSE157890.
    Paper Pervolarakis, Nicholas, Quy H. Nguyen, Justice Williams, Yanwen Gong, Guadalupe Gutierrez, Peng Sun, Darisha Jhutty, et al. “Integrated Single-Cell Transcriptomics and Chromatin Accessibility Analysis Reveals Regulators of Mammary Epithelial Cell Identity.” Cell Reports 33, no. 3 (October 2020): 108273. https://doi.org/10.1016/j.celrep.2020.108273.

Miscellaneous
