plattlab/Primary-Analysis-Workflow
Primary Analysis Workflow

This repository provides a Snakemake-based primary analysis workflow for processing Record-seq sequencing data and generating count tables. It is part of the computational framework described in: "An End-To-End Computational Framework for 'Record-seq' Transcriptional Recording Data" (doi placeholder).

Table of Contents

  1. Framework Description
  2. Configuration
  3. Setup & Usage
  4. Run Results and Diagnostics
  5. Sample Data
  6. Record-seq Data Download
  7. Citation

Framework Description

Record-seq enables the recording of transcriptional activity in bacterial populations. The technology utilizes an engineered Cas1-Cas2 acquisition complex from Fusicatenibacter saccharivorans, in which Cas1 is fused to a reverse transcriptase to capture intracellular RNA fragments. These fragments are integrated as spacers into CRISPR arrays on a recording plasmid, allowing for the reconstruction of past cellular transcriptional histories. Record-seq has been successfully applied to monitor microbial responses to dietary perturbations and inflammatory states within the mouse gut microbiome.

This workflow processes raw Record-seq FASTQ files through spacer extraction, alignment, and feature counting, producing count matrices suitable for downstream secondary analysis.

Configuration

Clone this repository and navigate into it:

git clone https://github.com/plattlab/placeholder.git
cd placeholder

Alternatively, download the repository as a zip file (click "Code" → "Download ZIP") and unzip it.

Open config/config.yaml and set the following options:

  • genome — genome ID of your genome reference
  • plasmid — plasmid ID
  • genomepath — path to your genome directory
  • experimentfolder — directory where results will be written
  • datafolder — directory containing your .fastq.gz input files
  • genomefeature — feature type to count in the genome GFF3 file
  • genomefeatureid — feature ID (last GFF3 column) for genome counting
  • plasmidfeature — feature type to count in the plasmid GFF3 files
  • plasmidfeatureid — feature ID (last GFF3 column) for plasmid counting
  • min_read_length — minimum read length for a read to be considered for spacer extraction
  • max_read_length — maximum read length for a read to be considered for spacer extraction
  • dr1, dr2, drfull — sequences of the direct repeats and the full repeat
  • librarybarcode — library barcode sequence (set to auto for auto-detection)
  • umi — UMI-based counting (on or off)
  • umilength — length of the UMI sequence
  • max_expected_spacers_per_read — maximum number of spacers per read
  • thresholds: min_raw_reads — minimum reads in a FASTQ file to be considered valid
  • skip_deduplication — skip deduplication of spacers before alignment (true or false)
  • count_spacer_occurence — count occurrences of each unique spacer (on or off)
  • infofile — generate per sample info file (on or off)
  • mergelanes — merge sequencing lanes (on or off)
  • lane_pattern — pattern in file names corresponding to lane information
  • multi_spacer_require_LBC — require the library barcode (LBC) on reads with multiple spacers (on or off)
  • tu — transcription unit (TU)-based analysis (on or off)
  • tu_version — TU reference version: V1, V2
  • debug — generate extended spacer files for debugging (on or off)
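As a concrete starting point, the options above can be sketched in a minimal config. All values below are illustrative placeholders, not shipped defaults; adapt them to your data and keep any options not shown at their existing values:

```shell
# Write an illustrative (not authoritative) config sketch.
# Every value here is a placeholder and must be adapted.
mkdir -p config
cat > config/config.yaml << 'EOF'
genome: U00096.3
plasmid: pFS_0453
genomepath: /path/to/genomes
experimentfolder: /path/to/experiment_output
datafolder: /path/to/fastq_input
librarybarcode: auto
umi: off
debug: off
EOF
```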

Your genome directory must contain subdirectories for each reference (genome and plasmids), with each subdirectory and its .fasta and .gff3 files named after the reference.

For E. coli K-12 MG1655, we bundle the U00096.3 reference genome along with two of our published plasmid constructs (pFS_0453 and pFS_1113) in resources/references.
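For example, a reference directory following this convention could be created as below; the top-level name genomes/ is illustrative, and genomepath should point at it:

```shell
# Each reference lives in its own subdirectory; the directory and its
# .fasta/.gff3 files are named after the reference (empty files here
# stand in for real sequence and annotation data).
mkdir -p genomes/U00096.3 genomes/pFS_0453
touch genomes/U00096.3/U00096.3.fasta genomes/U00096.3/U00096.3.gff3
touch genomes/pFS_0453/pFS_0453.fasta genomes/pFS_0453/pFS_0453.gff3
find genomes -type f | sort
```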

TU-based counting is currently only implemented for E. coli U00096.3/NC_000913.3. The TU definitions broadly follow EcoCyc nomenclature (https://ecocyc.org/ECOLI/new-image?object=Transcription-Units):

  • V1: full TU list (a gene/operon with N promoters appears N times)
  • V2: collapsed TU list (a gene/operon with N promoters appears once in its longest form)

Note on multiple references: If you need to treat multiple elements as a single genome or plasmid, merge them into a single FASTA with distinct headers and merge the corresponding GFF3 files. Create a merged reference directory alongside individual reference directories. Ensure that feature IDs are unique across merged references to avoid summed counts.

>genome1
AATTGATAA...

>genome2
ATAAGAAGA...
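Such a merge is plain concatenation. A sketch with toy inputs (file names are illustrative; merge the corresponding GFF3 files the same way):

```shell
# Build a merged reference from two single-entry FASTA files.
mkdir -p genomes/merged
printf '>genome1\nAATTGATAA\n' > genome1.fasta
printf '>genome2\nATAAGAAGA\n' > genome2.fasta
cat genome1.fasta genome2.fasta > genomes/merged/merged.fasta
grep -c '^>' genomes/merged/merged.fasta   # 2 headers in the merged file
```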

It is good practice to use a fresh config.yaml per experiment — copy and rename the old file before editing.

Setup & Usage

Install Mambaforge for your platform if Conda/Mamba is not already available. On Windows, the dependency pywfa is not natively supported — install WSL with Ubuntu (wsl --install -d Ubuntu in PowerShell) and run all steps from the Ubuntu terminal; Windows drives are accessible at /mnt/c/.

Create and activate a Snakemake environment:

mamba create -n snakemake_env -c conda-forge -c bioconda python=3.11.6 snakemake=8.14.0
conda activate snakemake_env

Install the additional Python dependencies:

python -m pip install pywfa fuzzysearch pulp==2.7

Navigate to the repository directory and do a dry run to validate the config:

snakemake -np

Start the pipeline:

./run_recordseq.sh \
  --use-conda --conda-frontend mamba \
  --conda-prefix "/path/to/conda_envs_recordseq" \
  --cores 4 --keep-going -p

Run Results and Diagnostics

After each run finishes, the pipeline writes outputs, status information, and diagnostic logs into your experimentfolder. The most important locations are:

  • outputs/ — final result files produced by the pipeline.
    • spacers/:
      • all.fasta: one entry per extracted spacer; headers indicate read identifier and position on the read.
      • unique.fasta: spacers collapsed by sequence identity.
      • LBC.txt: all library barcode sequences found with counts.
      • multiple.fasta: reads where multiple spacers were extracted.
      • spacer_counts.tsv.gz: unique extracted spacer sequences and their counts (only if count_spacer_occurence is on).
      • aggregate/: contains spacer_counts_merged.tsv.gz merging spacer count info across files (only if count_spacer_occurence is on).
    • alignments/: BAM/FASTA files for alignments to genome (genomeBams/, genomeFastas/) and plasmids (plasmidBams/, plasmidFastas/), plus merged files (mergedBams/, mergedSams/).
    • countsMatrices/: main feature count outputs.
      • genomeCounts.txt (or genomeCounts_TU.txt with TU option): main count matrices.
      • contaminationStats.txt (if checkcontamination is on): contamination sequence occurrence counts and percentages.
      • genomeCounts.txt.summary: read assignment statistics.
      • Per-plasmid count and summary files.
      • summaryStats.txt: per-file statistics (reads, reads without LBC, extracted/unique/multiple spacers, spacers aligned to genome/plasmid).
  • outputs_umi/ (if umi is on) — equivalent to outputs/ but containing UMI-deduplicated data.
  • intermediates/ — intermediate FASTQ/FASTA files from trimming and conversion steps.
  • status/ — per-sample status TSVs and overall run summaries.

Within status/ you will always find:

  • status/diagnosis_summary.txt — human-readable run overview.
  • status/diagnosis_summary.csv — same content in tabular format.
  • status/run_success.txt — present when everything finished without blocking problems.
  • status/run_failed.txt — present when blocking problems were detected.
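A quick shell check of these marker files (assuming EXPERIMENTFOLDER holds the path from your config; the messages are illustrative):

```shell
# Report the run outcome based on the status marker files.
EXPERIMENTFOLDER="${EXPERIMENTFOLDER:-/path/to/experiment}"
if [ -f "${EXPERIMENTFOLDER}/status/run_success.txt" ]; then
    echo "Run finished without blocking problems."
elif [ -f "${EXPERIMENTFOLDER}/status/run_failed.txt" ]; then
    echo "Blocking problems found; see status/diagnosis_summary.txt"
else
    echo "No status marker found; the run may not have finished."
fi
```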

How to read the summary

  • Successful run: The top line shows "RUN STATUS: SUCCESS", with counts of samples that are OK or have only notes.
  • Hard failures (blocking): Problems that prevent results for a step or sample (e.g. unreadable input). These appear under "COMMON STEPS — HARD FAILURES" (all samples) or "PER-SAMPLE — HARD FAILURES" (specific samples). When hard failures are present, run_failed.txt is created.
  • Soft notes (non-blocking): Warnings that don't stop the run (e.g. "low reads" for negative controls). Outputs may still be produced for these samples.

Fixing issues and re-running

Update your inputs or configuration as needed, then re-run the pipeline command. Only the necessary steps will execute, and the summary files will be regenerated at the end.

Detailed logs

  • Rule logs in experimentfolder/slurm/: grouped by rule (and often by sample). Open the log for the rule mentioned in the summary.
  • Snakemake job logs in [repo-directory]/.snakemake/: per-job logs organized by rule.

Sample Data

This repository provides a small Record-seq example dataset you can use to quickly test the primary-analysis pipeline end-to-end.

Source study: Schmidt*, Zimmermann*, Tanna* et al. Science (2022), DOI: 10.1126/science.abm6038
What this example contains: a subset of the mouse dietary intervention dataset consisting of day 7 samples from three diet groups:

  • Chow (n = 5)
  • Fat (n = 5)
  • Starch (n = 5)

Recording construct: pFS_0453
The corresponding reference is already bundled with the pipeline under resources/references/ (no download needed), and the example config is set up for this construct.

Quickstart configuration (recommended)

The provided example config (config/config.yaml) is already tuned for this dataset. In most cases, you only need to change two fields:

  • experimentfolder: where the pipeline should write outputs
  • datafolder: where the input FASTQs live (set this to the fastqs/ directory you create below)

Record-seq Data Download (example dataset)

The raw reads for the study are available on NCBI SRA (study accession SRP364805, BioProject PRJNA807096).
The commands below download exactly the 15 runs used for this example.

Step 0 — Choose folders and create fastqs/

Create a folder where you want the pipeline outputs to live (experimentfolder) and a folder where you want the FASTQs to be stored (datafolder).

Important: Create a subfolder named fastqs/ for the downloads, and set datafolder to this fastqs/ path.

# EDIT THESE PATHS
export EXPERIMENTFOLDER="/path/to/your/recordseq_example_run"
export DATAFOLDER="/path/to/your/recordseq_example_data/fastqs"

mkdir -p "${EXPERIMENTFOLDER}"
mkdir -p "${DATAFOLDER}"

Now update config/config.yaml:

  • set experimentfolder: /path/to/your/recordseq_example_run
  • set datafolder: /path/to/your/recordseq_example_data/fastqs

Step 1 — Download FASTQs

Install fasterq-dump (part of the NCBI SRA Toolkit), then download and gzip the example FASTQs into your fastqs/ folder:

fasterq-dump \
  SRR18364655 SRR18364656 SRR18364657 SRR18364658 SRR18364659 \
  SRR18364660 SRR18364661 SRR18364663 SRR18364664 SRR18364665 \
  SRR18364666 SRR18364667 SRR18364668 SRR18364669 SRR18364670 \
  --outdir "${DATAFOLDER}" && gzip "${DATAFOLDER}"/*.fastq

This produces .fastq.gz files in the directory you set as datafolder.

Note: fasterq-dump writes uncompressed FASTQs before gzipping, so make sure you have enough temporary disk space.
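Once the download finishes, a quick count should confirm that all 15 runs are present:

```shell
# Count the compressed FASTQ files in the data folder
# (15 expected for the example dataset).
ls "${DATAFOLDER}"/*.fastq.gz 2>/dev/null | wc -l
```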


Step 2 — Create a metadata file

Create a simple tab-separated metadata file mapping sample IDs to diet groups:

cat > "${EXPERIMENTFOLDER}/metadata.txt" << 'EOF'
sample	group
SRR18364655	Fat
SRR18364656	Fat
SRR18364657	Fat
SRR18364658	Fat
SRR18364659	Fat
SRR18364660	Starch
SRR18364661	Starch
SRR18364663	Starch
SRR18364664	Starch
SRR18364665	Starch
SRR18364666	Chow
SRR18364667	Chow
SRR18364668	Chow
SRR18364669	Chow
SRR18364670	Chow
EOF
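As a sanity check, tallying the group column should show five samples per diet group:

```shell
# Count samples per group in the metadata file (skip the header line).
if [ -f "${EXPERIMENTFOLDER}/metadata.txt" ]; then
    cut -f2 "${EXPERIMENTFOLDER}/metadata.txt" | tail -n +2 | sort | uniq -c
fi
```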

Full dataset (optional)

If you want to download additional samples beyond this example subset, use the same approach with the SRA study accession SRP364805 and retrieve the run accessions you need.


Next step: Secondary analysis with recoRdseq

After the primary analysis finishes, you can perform downstream secondary analysis with the recoRdseq R package.

recoRdseq already ships with count matrices and metadata for this exact example dataset, and the full downstream analysis workflow is demonstrated in the package vignette. This means you can also explore the secondary analysis without rerunning the primary pipeline if you just want to reproduce/inspect the example results.

Citation

Please cite the following publication when using this workflow:

"An End-To-End Computational Framework for 'Record-seq' Transcriptional Recording Data" Journal placeholder doi placeholder
