Streamlined Nextflow Analysis Pipeline for Immunoprecipitation-Based Epigenomic Profiling of Circulating Chromatin
Preprocessing
├── createSamplesheet
└── fastqc
Reference Download
├── downloadGenome
├── createGenomeIndex
├── fetch_chrom_sizes
├── downloadDACFile
└── downloadSNPRef
Alignment
├── trim
└── align
Bam processing
├── sort_bam
├── filter_properly_paired
├── lib_complex_preseq
├── unique_sam
├── quality_filter
├── dedup
├── dac_exclusion
├── createStatsSamtoolsfiltered
├── index_sam
└── createSMaSHFingerPrint
Fragments processing
├── sort_readname_bam
├── createMotifGCfile
├── calcFragsLengthDistribuition
├── fragle_ct_estimation
├── bam_to_bed
└── unique_frags
Signal processing
├── bam_to_bedgraph
├── bedgraph_to_bigwig
├── igv_reports
├── call_peaks
├── chromatin_count_normalization
├── peaks_report
├── peaks_annotations
├── enrichmentReport
└── report_lite
There are two ways to provide input files to the pipeline:
- A directory containing the files to be processed.
- A spreadsheet with basic metadata about the files.
The input files can be of three types:
- FASTA files
- Raw BAM files (generated immediately after alignment with a reference genome)
- Processed BAM files (expected to be sorted, deduplicated, and filtered for unique reads)
Below is a detailed explanation of how to use each method.
If your sample files are in FASTA format, use the --sample_dir_fasta parameter. The pipeline assumes that each sample is stored in a separate directory, and all files within that directory belong to the corresponding sample.
Sample_folder/
├── Sample1/
│   ├── Sample1_ABC_123_1.fasta
│   └── Sample1_ABC_123_2.fasta
├── Sample2/
│   ├── Sample2_ABC_123_1.fasta
│   └── Sample2_ABC_123_2.fasta
└── Sample3/
    └── Sample3_ABC_123.fasta
When using this method, the pipeline will generate a spreadsheet (CSV format) with the following structure:
sampleId, enrichment_mark, read1, read2, control
- The pipeline supports both paired-end and single-end FASTA files.
- You can specify which enrichment mark should be calculated using the --enrichment_mark parameter.
By default, the SNAPIE pipeline supports the following enrichment marks:
- H3K4me3
- H3K27ac
- MeDIP
The reference files required for this calculation are stored in:
ref_files/enrichment_states/enrichment_mark/
If you need to provide custom enrichment mark files, place the on-target and off-target BED files in a local directory and name it after the desired histone mark. The expected file names are:
off.target.filt.bed and on.target.filt.bed
Then, set:
--enrichment_mark <custom_mark_name>
--enrichment_states_ref <path_to_directory>
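The setup above can be sketched as follows. The mark name H3K36me3 and the custom_refs directory are hypothetical examples; only the two BED file names are the ones the pipeline expects verbatim. Whether --enrichment_states_ref should point at the mark directory itself or at its parent is an assumption to verify against your installation.

```shell
# Lay out a custom enrichment-mark directory named after the desired mark.
mkdir -p custom_refs/H3K36me3
: > custom_refs/H3K36me3/on.target.filt.bed   # placeholder for your on-target BED
: > custom_refs/H3K36me3/off.target.filt.bed  # placeholder for your off-target BED

# The corresponding run command (shown, not executed here):
echo "nextflow run main.nf --enrichment_mark H3K36me3 --enrichment_states_ref custom_refs/H3K36me3"
```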
Note: If using the directory-based input method, all samples will have the same enrichment mark applied. If you need to assign different histone marks to individual samples within the same execution, you must provide a spreadsheet instead.
To provide input using a spreadsheet, use the --samplesheetfasta parameter. This expects a CSV file with the following structure:
sampleId, enrichment_mark, read1, read2, control
- The enrichment_mark field can be left blank if no enrichment mark calculation is required.
- The read2 field can be left blank for single-end FASTA files.
- The control field indicates which sample will serve as the control reference for each experimental sample. In this column, provide the sampleId of the control sample (which must also be included in the sample sheet like any other sample). If you leave the field blank, the sample will not have an associated control. If multiple samples share the same control, you can repeat the control sampleId in their rows.
Example:
sampleId,enrichment_mark,read1,read2,control
sample1,no_enrichment_mark,/data/baca/sample1_1.fq.gz,/data/baca/sample1_2.fq.gz,sample_ctrl
sample2,no_enrichment_mark,/data/baca/sample2_1.fq.gz,/data/baca/sample2_2.fq.gz,sample_ctrl
sample3,no_enrichment_mark,/data/baca/sample3_1.fq.gz,/data/baca/sample3_2.fq.gz,sample_ctrl
sample_ctrl,no_enrichment_mark,/data/baca/sample_crtl_1.fq.gz,/data/baca/sample_crtl_2.fq.gz,
Here, sample1, sample2, and sample3 all use sample_ctrl as their control. The last line defines the control sample itself, and since it is the control, its own control field is left blank.
If your data is already aligned (BAM format), you can use one of the following options:
- Directory-based input: --sample_dir_bam
- Spreadsheet-based input: --samplesheetBams
Sample_folder/
├── Sample1/
│   └── Sample1_ABC_123.bam
├── Sample2/
│   └── Sample2_ABC_123.bam
└── Sample3/
    └── Sample3_ABC_123.bam
To provide BAM files via a spreadsheet, use the --samplesheetBams parameter. The expected CSV format is:
sampleId, enrichment_mark, bam
- The enrichment_mark field is optional.
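For illustration, a minimal --samplesheetBams CSV might look like the sketch below; the paths are hypothetical, and the empty field on sample2 shows the optional enrichment_mark left blank:

```
sampleId,enrichment_mark,bam
sample1,H3K4me3,/data/bams/sample1.bam
sample2,,/data/bams/sample2.bam
```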
Automatic Spreadsheet Generation:
Whenever a directory is provided using --sample_dir_fasta or --sample_dir_bam, the pipeline will generate a spreadsheet in the output directory. The file will be named:
snap-samplesheet-bam-date_and_time.csv
snap-samplesheet-fasta-date_and_time.csv
If your BAM files are already processed (sorted, deduplicated, and filtered for unique reads), you can skip redundant processing steps by setting:
--deduped_bam true
This should be used in combination with either --samplesheetBams or --sample_dir_bam.
Skipping Steps: When --deduped_bam true is set, the pipeline will bypass:
- BAM sorting
- Library complexity calculation
- Unique read filtering
- Quality filtering
- samtools stats calculation
- Duplicate removal
This reduces processing time if your BAM files have already undergone these steps.
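A run combining pre-processed BAM input with this flag might look like the following sketch; the samplesheet file name is hypothetical:

```shell
# Skip the redundant BAM-processing steps for already-deduplicated BAMs.
# The command is assembled and printed, not executed here.
cmd="nextflow run main.nf -profile docker_light_macos \
  --genome hg19 \
  --samplesheetBams processed_bams.csv \
  --deduped_bam true"
echo "$cmd"
```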
Specify the reference genome version using:
--genome <reference_genome>
The SNAPIE pipeline is pre-configured to support:
- hg19
- hg38
For these references, no additional downloads are required: the necessary FASTA files, blacklist regions, and SNP files for sample identification are already available in: ref_files/genome/genome_paths.csv
If using a custom reference genome, provide a local version of this file and specify:
--genomeInfoPaths <full_path_to_genome_paths.csv>
--genome <custom_genome_name> (must match the Genome field in the CSV file)
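As a sketch, a custom-genome run could look like this; the CSV path and the genome name t2t are hypothetical, and the name must match the Genome field of a row in your CSV:

```shell
# Run against a custom reference described in a local genome_paths.csv.
# The command is assembled and printed, not executed here.
cmd="nextflow run main.nf --genome t2t --genomeInfoPaths /refs/my_genome_paths.csv"
echo "$cmd"
```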
The output directory where results will be saved is specified with: --outputFolder <directory_path>
(Default: analysis_folder)
The SNAPIE pipeline consists of six main processing stages. By default, all stages are executed, but you can stop the pipeline at a specific stage using the --until parameter:
PREPROCESSING
DOWNLOAD_REFERENCES
ALIGNMENT
BAM_PROCESSING
FRAGMENTS_PROCESSING
BAM_SIGNAL_PROCESSING
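For example, to stop the run once BAM processing is complete (samplesheet path hypothetical):

```shell
# Halt the pipeline after the BAM_PROCESSING stage.
# The command is assembled and printed, not executed here.
cmd="nextflow run main.nf -profile docker_light_macos --genome hg19 \
  --samplesheetfasta samples.csv --until BAM_PROCESSING"
echo "$cmd"
```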
By default, the SNAPIE pipeline removes genomic regions known to cause artifacts, reducing false positives and biases in the analysis.
To disable this step, set:
--exclude_dac_regions false
The pipeline generates an END motif analysis file at: motifs/bp_motif.bed
By default, this analysis considers 4-mers; this value can be adjusted using the --nmer parameter.
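For instance, to compute 6-mer end motifs instead of the default 4-mers (input directory hypothetical):

```shell
# Override the default motif length via --nmer.
# The command is assembled and printed, not executed here.
cmd="nextflow run main.nf --sample_dir_fasta sample_folder --nmer 6"
echo "$cmd"
```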
SNAPIE includes an integrated module for chromatin fragment count quantification and normalization at user-defined genomic regions. This step generates fragment count matrices from BED-formatted fragment files and supports both single-sample and batch execution modes. The module is fully implemented within SNAPIE and is based on the upstream Chromatin Fragment Count Normalization library (https://github.com/chhetribsurya/chromatin-frags-normalization), which provides a standalone, more advanced implementation with detailed normalization logic, execution examples, and extended customization options.
Parameters:
--chromatin_count_mode (single/batch)
Controls how the normalization step is executed:
- single: runs the normalization independently for each sample, producing one output directory per sample.
- batch: runs the normalization jointly across all samples, generating a single site-by-sample matrix. This mode is recommended when downstream analyses require direct comparison between samples.
--target-sites
Specifies the BED file containing the target genomic regions at which fragments are counted.
--reference-sites (optional)
Specifies a BED file containing reference genomic regions used for reference-based normalization.
When provided, fragment counts at target regions are normalized by the total fragment signal observed across these reference loci.
- This helps correct for global signal-to-noise differences between samples and is particularly useful for cfDNA and other low-input assays.
- If not provided, normalization is performed using library-size-based scaling only.
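Putting these parameters together, a batch-mode run with reference-based normalization might be sketched as follows; both BED paths and the samplesheet name are hypothetical:

```shell
# Batch-mode fragment counting with reference-based normalization.
# The command is assembled and printed, not executed here.
cmd="nextflow run main.nf --samplesheetfasta samples.csv \
  --chromatin_count_mode batch \
  --target-sites /refs/targets.bed \
  --reference-sites /refs/housekeeping_loci.bed"
echo "$cmd"
```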
The SNAPIE pipeline generates pileup reports in two formats:
- MultiQC report (found in reports/multiqc/)
- IGV session file (stored in reports/igv_session/, viewable in the IGV Browser)
By default, both reports use regions defined in: ref_files/pileup_report/test_housekeeping.bed
To use custom regions, specify:
--genes_pileup_report <path_to_custom_bed_file>
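For example, to build the pileup reports over your own regions (BED path and input directory hypothetical):

```shell
# Point the pileup reports at a custom BED file of regions.
# The command is assembled and printed, not executed here.
cmd="nextflow run main.nf --sample_dir_fasta sample_folder \
  --genes_pileup_report /refs/my_regions.bed"
echo "$cmd"
```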
After running the SNAPIE pipeline, the results are organized into several directories within the output folder, which you can set via the --outputFolder parameter (default: analysis_folder). Each directory stores specific files related to a different stage of the pipeline. Below is an explanation of each directory:
output_folder/
├── align/
│   ├── Sample1/
│   └── Sample2/
├── fastqc/
│   ├── Sample1/
│   └── Sample2/
├── trim/
│   ├── Sample1/
│   └── Sample2/
├── frags/
│   ├── Sample1/
│   └── Sample2/
├── motifs/
│   ├── Sample1/
│   └── Sample2/
├── peaks/
│   ├── Sample1/
│   └── Sample2/
├── chromatin_count_normalization/
│   ├── Sample1/
│   └── Sample2/
├── reports/
│   ├── fragle/
│   ├── igv_session/
│   ├── metrics_lite/
│   ├── multiqc/
│   └── SMaSH/
├── snap-samplesheet-fasta.csv
├── software_versions/
└── stats_files/
align/
- Contains subdirectories for each sample (e.g., Sample1/, Sample2/).
- Stores alignment-related files, including BAM files and index files.
fastqc/
- Contains subdirectories for each sample, storing FastQC quality control reports.
- Each sample will have an associated .html report and .zip file summarizing sequence quality.
trim/
- Contains subdirectories for each sample with trimmed reads after adapter removal and quality filtering.
- These are the cleaned sequencing reads used for downstream processing.
frags/
- Contains subdirectories for each sample with fragment length distribution data.
- Useful for downstream analyses such as nucleosome positioning studies.
motifs/
- Contains subdirectories for each sample with motif analysis results.
- Stores END motif analysis and GC content calculations.
- The file bp_motif.bed contains motif information.
peaks/
- Contains subdirectories for each sample with peak-calling results (if applicable).
- Includes peak files (.bed, .narrowPeak, .bedgraph, .bw, .control_lambda.bdg, .treat_pileup.bdg, peaks.xls).
chromatin_count_normalization/
- Contains the analysis_summary.txt file as well as the CPM and raw count matrices used for the chromatin fragment counter.
reports/
- Stores the different types of reports generated during the analysis.
- Subdirectories:
  - fragle/ – ctDNA burden for all samples, calculated using Fragle.
  - igv_session/ – IGV session files for easy visualization of sequencing data in the Integrative Genomics Viewer (IGV).
  - multiqc/ – aggregated MultiQC report summarizing quality control metrics.
  - metrics_lite/ – aggregated text file with only the basic metrics.
  - SMaSH/ – a dendrogram clustering the samples using SMaSH (https://github.com/rbundschuh/SMaSH), also included in the final report.
snap-samplesheet-fasta.csv
- The automatically generated spreadsheet summarizing the input samples processed in the pipeline.
- Includes sample names, enrichment marks, and input file paths.
software_versions/
- Stores details about the versions of the software tools used in the pipeline execution.
- Helps with reproducibility and tracking software dependencies.
stats_files/
- Contains all statistics related to sequencing reads, alignment metrics, and quality filtering.
Below are some practical examples showing how to run the SNAPIE pipeline in different scenarios. These examples aim to help you get started quickly, whether you're working with FASTA or BAM files, using spreadsheets or directory-based input.
In this example, we assume you have sequencing files in compressed FASTA format (.fasta.gz) and a spreadsheet that specifies the location of each read file. Your reference genome is hg19.
You can run the pipeline with the following command:
nextflow run main.nf -profile docker_light_macos --genome hg19 --outputFolder result_analysis --samplesheetfasta control_sample_sheet.csv
Explanation of the parameters:
- -profile docker_light_macos – selects the environment profile (in this case, a lightweight Docker profile for macOS)
- --genome hg19 – specifies the reference genome
- --outputFolder result_analysis – sets the output directory
- --samplesheetfasta control_sample_sheet.csv – provides the path to your input spreadsheet (FASTA mode)
In this example, your FASTA files are stored in a root directory (sample_folder), where each sample has its own subdirectory named after the sample, and the corresponding .fasta.gz files are located inside these subdirectories.
If your files follow this structure, you do not need to create a sample sheet manually; the pipeline will generate it automatically.
You can run the pipeline with the following command:
nextflow run main.nf -profile docker_light_macos --genome hg19 --outputFolder result_analysis --sample_dir_fasta sample_folder
In this example, you are working with raw BAM files (i.e., generated directly after alignment) and want to provide them via a directory. The pipeline can process both paired-end (PE) and single-end (SE) BAMs.
By default, the pipeline assumes the reads are paired-end. If your data is single-end, you must explicitly set the parameter --read_method SE.
You can run the pipeline with one of the following commands, depending on your data type:
For paired-end BAM files (default):
nextflow run main.nf -profile docker_light_macos --genome hg19 --outputFolder result_analysis --sample_dir_bam sample_folder
For single-end BAM files:
nextflow run main.nf -profile docker_light_macos --genome hg19 --outputFolder result_analysis --sample_dir_bam sample_folder --read_method SE
This example shows how to run the pipeline specifically to prioritize end motif analysis. A dedicated profile, end_motif_analysis, was created for this purpose.
In this profile, -q 0 is set in Trim Galore, disabling quality-based trimming.
This avoids removing biologically meaningful bases at the ends of reads due to low quality scores, which helps preserve true fragment ends, a crucial aspect for reliable motif discovery.
The pipeline also sets --until FRAGMENTS_PROCESSING automatically, stopping the execution right after generating the fragment files, which are used for motif analysis.
You can run the pipeline with:
nextflow run main.nf -profile docker_light_macos,end_motif_analysis --genome hg19 --outputFolder result_analysis --sample_dir_fasta sample_folder
Output of interest:
After execution, motif analysis results will be found in:
motifs/<sample_name>/<sample_id>_NMER_bp_motif.bed
