An example using human genome with matched scRNA-seq data
An example using plant genome without matched scRNA-seq data
Documentation
Credits
Bug report/Support
Citations
Release notes
scATACpipe is a bioinformatic pipeline for single-cell ATAC-seq (scATAC-seq) data analysis.
The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker / Singularity containers making installation trivial and results highly reproducible.
The development of the pipeline is guided by nf-core TEMPLATE.
The pipeline consists of 2 relevant parts: preprocessing (from fastq to fragment file) and downstream analysis. If fragment files are directly available, you can choose to skip preprocessing and run downstream analysis only.
For preprocessing, 3 alternative strategies are available that are implemented in 3 sub-workflows respectively, namely, PREPROCESS_DEFAULT, PREPROCESS_10XGENOMICS, and PREPROCESS_CHROMAP. Each of them supports various input types that are demonstrated in further detail below (also see usage).
For downstream analysis, we implemented DOWNSTREAM_ARCHR sub-workflow that integrates ArchR and other tools (e.g. AMULET for doublet detection).
Below is a simplified diagram to illustrate the design logic and functionalities of scATACpipe.
The main functionalities of each sub-workflow are summarized below. You can also refer to the output - Result folders for more details.
- Add barcodes to reads
- Correct barcodes (optional)
- if
false
: skip barcode correction - if
pheniqs
ornaive
: also filter out non-cells using "inflection point" method
- if
- Trim off adapters
- Mapping
- download genome/annotation or use custom genome
- build genome index if not supplied
- Filter BAM
- Remove PCR duplicates
- Quality control
- Generate fragment file, etc.
- Build 10XGENOMICS index if not supplied
- download genome/annotation or use custom genome
- Execute
cellranger_atac count
command - Extract fragments from valid cells according to
filtered_peak_bc_matrix/barcodes.tsv
- Build Chromap index if not supplied
- download genome/annotation or use custom genome
- Execute
chromap --preset atac
command - Filter out non-cells
Note that no BAM file will be generated for PREPROCESS_CHROMAP option.
- Build ArchR-compatible genome/annotation files if not natively supported (ArchR supports hg19, hg38, mm9, and mm10 as of 02/2022)
- download genome/annotation if not supplied
- build ArchR genome/gene annotation files if needed
- Perform downstream analysis with ArchR and generate various analytical plots
- filter doublets (with ArchR built-in method or AMULET)
- dimension reduction
- batch effect correction
- clustering
- embedding
- pseudo-bulk clustering
- scRNAseq integration if supplied
- marker gene detection
- call peaks
- marker peak detection
- pairwise testing
- motif enrichment
- footprinting
- coaccessibility, etc.
The pipeline also splits BED and/or BAM files according to ArchR clusterings and summarizes all results into a single MultiQC report for easy view.
-
Install
nextflow
(>=21.10.0). -
For full reproducibility, install either
Docker
orSingularity (Apptainer)
and specify-profile singularity
or-profile docker
accordingly when running the pipeline so that all dependencies are satisfied. Otherwise, all of the dependencies must be available locally on your PATH, which is likely not true! -
Download the pipeline:
git clone https://github.com/hukai916/scATACpipe.git
- Download a minimal test dataset:
- The test_data1 is prepared by downsampling (5% and 10%) a dataset named "500 Peripheral blood mononuclear cells (PBMCs) from a healthy donor (Next GEM v1.1)" provided by 10xgenomics. Note that, in test_data1, I1 refers to index1, which is for sample demultiplexing and not relevant in our case; R1 refers to Read1; R2 refers to index2, which represents the cell barcode fastq; R3 refers to Read2.
cd scATACpipe
wget https://www.dropbox.com/s/uyiq18zk7dts9fx/test_data1.zip
unzip test_data1.zip
-
Edit the
replace_with_full_path
in the assets/sample_sheet_test_data1.csv to use the actual full path. -
Test the pipeline with this minimal test_data1:
- At least 8GB memory is recommended for test_data1.
- By default, the local executor will be used (
-profile local
) meaning that all jobs will be executed on your local computer. Nextflow supports many other executors including SLURM, LSF, etc.. You can create a profile file to config which executor to use. Multiple profiles can be supplied with comma, e.g.-profile docker,lsf
. - Please check nf-core/configs to see what other custom config files can be supplied.
- Example run with Docker using local executor:
nextflow run main.nf -profile docker --preprocess default --outdir res_test_data1 --input_fastq assets/sample_sheet_test_data1.csv --ref_fasta_ensembl homo_sapiens --species_latin_name 'homo sapiens'
By executing the above command:
- The `local executor` will be used.
- `PREPROCESS_DEFAULT` will be used.
- Output will be saved into `res_test_data1`.
- Ensembl genome `homo_sapiens` will be downloaded and used as reference.
- Example run with Singularity using LSF executor:
nextflow run main.nf -profile singularity,lsf --preprocess default --outdir res_test_data1 --input_fastq assets/sample_sheet_test_data1.csv --ref_fasta_ensembl homo_sapiens --species_latin_name 'homo sapiens'
- By specifying `-profile lsf`, the `lsf` executor will be used for job submission.
- By specifying `-profile singularity`, Singularity images will be downloaded and saved to `work/singularity` directory. It is recommended to config the [`NXF_SINGULARITY_CACHEDIR` or `singularity.cacheDir`](https://www.nextflow.io/docs/latest/singularity.html?#singularity-docker-hub) settings to store the images in a central location.
- Run your own analysis:
- A typical command:
nextflow run main.nf -profile <singularity/docker/lsf> --preprocess <default/10xgenomics/chromap> --outdir <path_to_result_dir> --input_fastq <path_to_samplesheet> --ref_fasta_ensembl <ENSEMBL_genome_name> --species_latin_name <e.g. 'homo sapiens'>
- For help:
nextflow run main.nf --help
See documentation usage for all of the available options.
For easy generation of configuration file, we have implemented an interactive config generator.
It was implemented with pure HTML/JavaScript/CSS codes so that it can be used locally. To use it locally, simply click here to download the web_gui.
Then, then open it with your web browser by double clicking web_gui/index.html
file or execute the following command in your terminal:
open web_gui/index.html
This section describes how the plots in the manuscript (to be added) were generated using scATACpipe. For comparison, the manuscript conducted 3 separate analyses, each using a different preprocessing strategy (default
, 10xgenomics
, chromap
). Since the commands and preprocessed results are quite similar across the three methods, only the chromap
option will be demonstrated here.
- The initial execution:
nextflow run main.nf -profile singularity,lsf --preprocess chromap --outdir ./results_chromap_initial --input_fastq ./assets/10X_human_scatac_fastq.csv --ref_fasta_ensembl homo_sapiens --species_latin_name 'homo sapiens' --archr_scrnaseq '/path/scRNA-Hematopoiesis-Granja-2019.rds' --archr_blacklist /home/hl84w/lucio_castilla/scATAC-seq/docs/hg38-blacklist.v2.bed.gz
Break down:
-profile singularity,lsf
:
This option instructs scATACpipe to use Singularity containers and LSF as the executor. Multiple parameters are separated by commas. Since profile
is pipeline-level flag, it is prefixed with a single dash (-
). Module-level flags are prefixed with double dash (--
).
--preprocess chromap
:
This instructs scATACpipe to use Chromap preprocessing strategy.
--outdir ./results_chromap_initial
:
Output will be saved into ./results_chromap_initial
folder.
--input_fastq ./assets/10X_human_scatac_fastq.csv
:
Please replace the /path/
in the 10X_human_scatac_fastq.csv
with absolute paths. Details regarding the 6 samples can be found in the supplementary section of the paper. If you detect any outlier samples, you can remove them from the downstream analyses using the --filter_sample = 'PBMC_10K_C, PBMC_10K_X'
flag.
--ref_fasta_ensembl homo_sapiens
:
This specifies that the genome Homo Sapiens from ENSEMBLE will be used as reference. To view all supported genomes, check out nextflow run main.nf --support_genome
.
--species_latin_name 'homo sapiens'
:
Simply the Latin name of the reference genome.
--archr_scrnaseq '/path/scRNA-Hematopoiesis-Granja-2019.rds'
Matching scRNA-seq data. Can ignore if not available. The example file can be downloaded here.
--archr_blacklist ./assets/hg38-blacklist.v2.bed.gz
:
Blacklist to exclude for downstream analysis. Click here for other species.
Instead of passing each flag option via the command line, you can include them all in a configuration file and supply it with the -c
option. Below is equivalent to above:
nextflow run main.nf -profile singularity,lsf -c ./conf/test_chromap_initial.config
What are inside test_chromap_initial.config:
scATACpipe/conf/test_chromap_initial.config
Lines 1 to 10 in 216f146
Again, you have to replace /path/
with full absolute paths.
- The final execution:
After examining the results from the initial execution, we decided to remove the outlier clusters (C1, C6) from downstream analyses. These two clusters are considered problematic according to the following two plots:
- The clustering heatmap plot from ./results_chromap_initial/archr_clustering/ folder: the cell proportions from
PBMC_5K_N
andPBMC_5K_V
samples are unbalanced for C1, C6.
- The marker gene heatmap plot from ./results_chromap_initial/archr_marker_gene_clusters/ folder: no distinct marker gene pattern detected in cluster C1, C6.
We used the following line to remove C1 and C6:
scATACpipe/conf/test_chromap_final.config
Line 18 in 20e0c82
Also, we would like to perform constrained integration of scRNA-seq data in addition to the unconstrained integration. The following line was used to supply the grouping information:
scATACpipe/conf/test_chromap_final.config
Line 12 in 20e0c82
To specify marker genes to plot, edit the following lines:
scATACpipe/conf/test_chromap_final.config
Line 30 in 20e0c82
scATACpipe/conf/test_chromap_final.config
Line 43 in 20e0c82
scATACpipe/conf/test_chromap_final.config
Line 53 in 20e0c82
scATACpipe/conf/test_chromap_final.config
Line 56 in 20e0c82
To specify a set of motifs for downstream analyses, edit the following lines:
scATACpipe/conf/modules.config
Line 289 in b0bed3f
scATACpipe/conf/modules.config
Line 298 in b0bed3f
To specify a set of motifs for footprinting analyses, edit the following lines:
scATACpipe/conf/modules.config
Line 305 in b0bed3f
scATACpipe/conf/modules.config
Line 314 in b0bed3f
We ended up with a final test_chromap_final.config:
scATACpipe/conf/test_chromap_final.config
Lines 1 to 66 in 20e0c82
The final execution command looks like below:
nextflow run main.nf -profile singularity,lsf -c ./conf/test_chromap_final.config -resume session_id
In order to skip already-performed analyses, you must provide the -resume session_id
option. The corresponding session ID can be found using the nextflow log
command. Please note that the --outdir
directory is set to results_chromap_final
.
When the final execution is complete, we can look at clustering heatmaps and marker gene heatmaps, and they both look good now:
- The marker gene heatmap plot from ./results_chromap_final/archr_marker_gene_clusters/ folder:
- The marker gene heatmap plot from ./results_chromap_final/archr_marker_gene_clusters/ folder:
Upon pipeline execution completion, Nextflow will produce time and resource usage reports that are stored under pipeline_info
:
- Using
chromap
option: results_chromap_final/pipeline_info - Using
default
option: results_default_final/pipeline_info - Using
10xgenomics
option: results_10xgenomics_final/pipeline_info
For this example (GSE155304), integrated analysis cannot be performed due to the lack of matched scRNA-seq data. Also note, for motifSet
, when set to 'cisbp', only human and mouse are currently supported by ArchR. Therefore, for this dataset, we need to replace all occurrences of motifSet = "cisbp"
to motifSet = "encode"
in the ./conf/motif.config
file.
- Command line used for
default
option:
nextflow run main.nf -c conf/test_default_plant.config -profile singularity,lsf
Results are stored here.
- Command used for
chromap
option:
nextflow run main.nf -c conf/test_chromap_plant.config -profile singularity,lsf
Results are stored here.
- Command used for
10xgenomics
option:
nextflow run main.nf -c conf/test_10xgenomics_plant.config -profile singularity,lsf
Results are stored here.
The scATACpipe workflow comes with documentation about the pipeline: usage and output.
scATACpipe was originally designed and written by Kai Hu, Haibo Liu, and Lihua Julie Zhu.
We thank the following people for their extensive assistance in the development of this pipeline: Nathan Lawson.
For help, bug report, or feature requests, the developers would prefer and appreciate that you create a GitHub issue by clicking here. If you would like to extend scATACpipe for your own good, feel free to fork the repo.
Please cite scATACpipe [to be added] if you use it for your research.
A Template of Method can be found here.
A complete list of references for the tools used by scATACpipe can be found here.
v0.1.0
- initial release
v0.1.1
- add web_gui for easy generation of config file
- add one more example for Arabidopsis Thaliana
- minor improvements