This repository provides a Snakemake-based primary analysis workflow for processing Record-seq sequencing data and generating count tables. It is part of the computational framework described in: "An End-To-End Computational Framework for 'Record-seq' Transcriptional Recording Data" (doi placeholder).
- Authors: Florian Hugi (ORCID: 0009-0006-4840-0638), Tanmay Tanna (ORCID: 0000-0002-1150-0188)
- Framework Description
- Configuration
- Setup & Usage
- Run Results and Diagnostics
- Sample Data
- Record-seq Data Download
- Citation
Record-seq enables the recording of transcriptional activity in bacterial populations. The technology utilizes an engineered Cas1-Cas2 acquisition complex from Fusicatenibacter saccharivorans, in which Cas1 is fused to a reverse transcriptase to capture intracellular RNA fragments. These fragments are integrated as spacers into CRISPR arrays on a recording plasmid, allowing for the reconstruction of past cellular transcriptional histories. Record-seq has been successfully applied to monitor microbial responses to dietary perturbations and inflammatory states within the mouse gut microbiome.
This workflow processes raw Record-seq FASTQ files through spacer extraction, alignment, and feature counting, producing count matrices suitable for downstream secondary analysis.
Clone this repository and navigate into it:
```
git clone https://github.com/plattlab/placeholder.git
cd placeholder
```

Alternatively, download the repository as a ZIP file (click "Code" → "Download ZIP") and unzip it.
Open config/config.yaml and set the following options:
- `genome` — genome ID of your genome reference
- `plasmid` — plasmid ID
- `genomepath` — path to your genome directory
- `experimentfolder` — directory where results will be written
- `datafolder` — directory containing your `.fastq.gz` input files
- `genomefeature` — feature type to count in the genome GFF3 file
- `genomefeatureid` — feature ID (last GFF3 column) for genome counting
- `plasmidfeature` — feature type to count in the plasmid GFF3 files
- `plasmidfeatureid` — feature ID (last GFF3 column) for plasmid counting
- `min_read_length` — minimum read length for a read to be considered for spacer extraction
- `max_read_length` — maximum read length for a read to be considered for spacer extraction
- `dr1`, `dr2`, `drfull` — sequences of the direct repeats and the full repeat
- `librarybarcode` — library barcode sequence (set to `auto` for auto-detection)
- `umi` — UMI-based counting (`on` or `off`)
- `umilength` — length of the UMI sequence
- `max_expected_spacers_per_read` — maximum number of spacers per read
- `thresholds: min_raw_reads` — minimum reads in a FASTQ file to be considered valid
- `skip_deduplication` — skip deduplication of spacers before alignment (`true` or `false`)
- `count_spacer_occurence` — count occurrences of each unique spacer (`on` or `off`)
- `infofile` — generate a per-sample info file (`on` or `off`)
- `mergelanes` — merge sequencing lanes (`on` or `off`)
- `lane_pattern` — pattern in file names corresponding to lane information
- `multi_spacer_require_LBC` — reads with multiple spacers require an LBC (`on` or `off`)
- `tu` — transcription unit (TU)-based analysis (`on` or `off`)
- `tu_version` — TU reference version: `V1`, `V2`
- `debug` — generate extended spacer files for debugging (`on` or `off`)
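As an illustration only, a few of these options sketched in `config/config.yaml` syntax — all values below are placeholders (the genome and plasmid IDs are the bundled examples mentioned later in this README), not validated defaults:

```yaml
# Sketch only — adapt every value to your own setup
genome: "U00096.3"                 # genome ID of the reference (placeholder)
plasmid: "pFS_0453"                # plasmid ID (placeholder)
genomepath: "resources/references" # directory holding the reference subdirectories
experimentfolder: "/path/to/experiment"
datafolder: "/path/to/fastqs"      # directory with the .fastq.gz input files
librarybarcode: "auto"             # auto-detect the library barcode
umi: "off"                         # UMI-based counting
tu: "off"                          # TU-based analysis
debug: "off"
```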
Your genome directory must contain subdirectories for each reference (genome and plasmids), with each subdirectory and its .fasta and .gff3 files named after the reference.
For E. coli MG1655 K-12, we bundle the U00096.3 reference genome along with two of our published plasmid constructs (pFS_0453 and pFS_1113) in resources/references.
TU-based counting is currently only implemented for E. coli U00096.3/NC_000913.3. The TU definitions broadly follow EcoCyc nomenclature (https://ecocyc.org/ECOLI/new-image?object=Transcription-Units):
- `V1`: full TU list (a gene/operon with N promoters appears N times)
- `V2`: collapsed TU list (a gene/operon with N promoters appears once, in its longest form)
Note on multiple references: If you need to treat multiple elements as a single genome or plasmid, merge them into a single FASTA with distinct headers and merge the corresponding GFF3 files. Create a merged reference directory alongside individual reference directories. Ensure that feature IDs are unique across merged references to avoid summed counts.
```
>genome1
AATTGATAA...
>genome2
ATAAGAAGA...
```
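The merging step above can be sketched in the shell. The reference names `genome1` and `genome2`, the `refs/` directory, and the stand-in sequences are all hypothetical, used only to illustrate the `<name>/<name>.fasta` layout described earlier:

```shell
# Hypothetical example: build a merged reference directory alongside
# two individual reference directories.
mkdir -p refs/genome1 refs/genome2 refs/merged

# Stand-in FASTA files with distinct headers (illustration only)
printf '>genome1\nAATTGATAA\n' > refs/genome1/genome1.fasta
printf '>genome2\nATAAGAAGA\n' > refs/genome2/genome2.fasta

# Concatenate the FASTAs; the distinct headers keep the records separate
cat refs/genome1/genome1.fasta refs/genome2/genome2.fasta \
    > refs/merged/merged.fasta

# The corresponding GFF3 files would be concatenated the same way,
# keeping feature IDs unique across the merged references.

grep -c '^>' refs/merged/merged.fasta   # two records in the merged FASTA
```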
It is good practice to use a fresh config.yaml per experiment — copy and rename the old file before editing.
Install Mambaforge for your platform if Conda/Mamba is not already available. On Windows, the dependency pywfa is not natively supported — install WSL with Ubuntu (wsl --install -d Ubuntu in PowerShell) and run all steps from the Ubuntu terminal; Windows drives are accessible at /mnt/c/.
Create and activate a Snakemake environment:
```
mamba create -n snakemake_env -c conda-forge -c bioconda python=3.11.6 snakemake=8.14.0
conda activate snakemake_env
```

Install the additional Python dependencies:

```
python -m pip install pywfa fuzzysearch pulp==2.7
```

Navigate to the repository directory and do a dry run to validate the config:

```
snakemake -np
```

Start the pipeline:

```
./run_recordseq.sh \
    --use-conda --conda-frontend mamba \
    --conda-prefix "/path/to/conda_envs_recordseq" \
    --cores 4 --keep-going -p
```

After each run finishes, the pipeline writes outputs, status information, and diagnostic logs into your `experimentfolder`. The most important locations are:
- `outputs/` — final result files produced by the pipeline.
  - `spacers/`:
    - `all.fasta`: one entry per extracted spacer; headers indicate the read identifier and the position on the read.
    - `unique.fasta`: spacers collapsed by sequence identity.
    - `LBC.txt`: all library barcode sequences found, with counts.
    - `multiple.fasta`: reads from which multiple spacers were extracted.
    - `spacer_counts.tsv.gz`: unique extracted spacer sequences and their counts (only if `count_spacer_occurence` is on).
  - `aggregate/`: contains `spacer_counts_merged.tsv.gz`, which merges spacer count info across files (only if `count_spacer_occurence` is on).
  - `alignments/`: BAM/FASTA files for alignments to the genome (`genomeBams/`, `genomeFastas/`) and plasmids (`plasmidBams/`, `plasmidFastas/`), plus merged files (`mergedBams/`, `mergedSams/`).
  - `countsMatrices/`: main feature count outputs.
    - `genomeCounts.txt` (or `genomeCounts_TU.txt` with the TU option): main count matrices.
    - `contaminationStats.txt` (if `checkcontamination` is on): contamination sequence occurrence counts and percentages.
    - `genomeCounts.txt.summary`: read assignment statistics.
    - Per-plasmid count and summary files.
  - `summaryStats.txt`: per-file statistics (reads, reads without LBC, extracted/unique/multiple spacers, spacers aligned to genome/plasmid).
- `outputs_umi/` (if `umi` is on) — equivalent to `outputs/` but containing UMI-deduplicated data.
- `intermediates/` — intermediate FASTQ/FASTA files from trimming and conversion steps.
- `status/` — per-sample status TSVs and overall run summaries.
Within `status/` you will always find:

- `status/diagnosis_summary.txt` — human-readable run overview.
- `status/diagnosis_summary.csv` — same content in tabular format.
- `status/run_success.txt` — present when everything finished without blocking problems.
- `status/run_failed.txt` — present when blocking problems were detected.
- Successful run: The top line shows "RUN STATUS: SUCCESS", with counts of samples that are OK or have only notes.
- Hard failures (blocking): Problems that prevent results for a step or sample (e.g. unreadable input). These appear under "COMMON STEPS — HARD FAILURES" (all samples) or "PER-SAMPLE — HARD FAILURES" (specific samples). When hard failures are present, `run_failed.txt` is created.
- Soft notes (non-blocking): Warnings that don't stop the run (e.g. "low reads" for negative controls). Outputs may still be produced for these samples.
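A quick way to act on these status files from the shell — a sketch only, assuming `EXPERIMENTFOLDER` is set to your experiment folder (the fallback path below is a placeholder):

```shell
# Check the run outcome via the status marker files described above
EXPERIMENTFOLDER="${EXPERIMENTFOLDER:-/path/to/your/experimentfolder}"

if [ -f "${EXPERIMENTFOLDER}/status/run_success.txt" ]; then
    echo "Run finished without blocking problems."
elif [ -f "${EXPERIMENTFOLDER}/status/run_failed.txt" ]; then
    echo "Blocking problems detected - see the run overview:"
    cat "${EXPERIMENTFOLDER}/status/diagnosis_summary.txt"
else
    echo "No status files yet - the run may still be in progress."
fi
```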
Update your inputs or configuration as needed, then re-run the pipeline command. Only the necessary steps will execute, and the summary files will be regenerated at the end.
- Rule logs in `experimentfolder/slurm/`: grouped by rule (and often by sample). Open the log for the rule mentioned in the summary.
- Snakemake job logs in `[repo-directory]/.snakemake/`: per-job logs organized by rule.
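To surface log files quickly, something like the following can help — a sketch, assuming the `slurm/` layout described above (the fallback path is a placeholder):

```shell
# List a few log files under the experiment folder's slurm/ directory
EXPERIMENTFOLDER="${EXPERIMENTFOLDER:-/path/to/your/experimentfolder}"
find "${EXPERIMENTFOLDER}/slurm" -type f 2>/dev/null | head -n 5
```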
This repository provides a small Record-seq example dataset you can use to quickly test the primary-analysis pipeline end-to-end.
Source study: Schmidt*, Zimmermann*, Tanna* et al. Science (2022), DOI: 10.1126/science.abm6038
What this example contains: a subset of the mouse dietary intervention dataset consisting of day 7 samples from three diet groups:
- Chow (n = 5)
- Fat (n = 5)
- Starch (n = 5)
Recording construct: pFS_0453
The corresponding reference is already bundled with the pipeline under resources/references/ (no download needed), and the example config is set up for this construct.
The provided example config (config/config.yaml) is already tuned for this dataset. In most cases, you only need to change two fields:
- `experimentfolder`: where the pipeline should write outputs
- `datafolder`: where the input FASTQs live (set this to the `fastqs/` directory you create below)
The raw reads for the study are available on NCBI SRA (study accession SRP364805, BioProject PRJNA807096).
The commands below download exactly the 15 runs used for this example.
Create a folder where you want the pipeline outputs to live (experimentfolder) and a folder where you want the FASTQs to be stored (datafolder).
Important: Create a subfolder named `fastqs/` for the downloads, and set `datafolder` to this `fastqs/` path.
```
# EDIT THESE PATHS
export EXPERIMENTFOLDER="/path/to/your/recordseq_example_run"
export DATAFOLDER="/path/to/your/recordseq_example_data/fastqs"

mkdir -p "${EXPERIMENTFOLDER}"
mkdir -p "${DATAFOLDER}"
```

Now update `config/config.yaml`:
- set `experimentfolder: /path/to/your/recordseq_example_run`
- set `datafolder: /path/to/your/recordseq_example_data/fastqs`
Install `fasterq-dump` via the NCBI SRA Toolkit (e.g. `mamba install -c bioconda sra-tools`).
Then download and gzip the example FASTQs into your fastqs/ folder:
```
fasterq-dump \
    SRR18364655 SRR18364656 SRR18364657 SRR18364658 SRR18364659 \
    SRR18364660 SRR18364661 SRR18364663 SRR18364664 SRR18364665 \
    SRR18364666 SRR18364667 SRR18364668 SRR18364669 SRR18364670 \
    --outdir "${DATAFOLDER}" && gzip "${DATAFOLDER}"/*.fastq
```

This produces `.fastq.gz` files in the directory you set as `datafolder`.
Note: `fasterq-dump` writes uncompressed FASTQs before gzipping, so make sure you have enough temporary disk space.
Create a simple tab-separated metadata file mapping sample IDs to diet groups:
```
cat > "${EXPERIMENTFOLDER}/metadata.txt" << 'EOF'
sample	group
SRR18364655	Fat
SRR18364656	Fat
SRR18364657	Fat
SRR18364658	Fat
SRR18364659	Fat
SRR18364660	Starch
SRR18364661	Starch
SRR18364663	Starch
SRR18364664	Starch
SRR18364665	Starch
SRR18364666	Chow
SRR18364667	Chow
SRR18364668	Chow
SRR18364669	Chow
SRR18364670	Chow
EOF
```

If you want to download additional samples beyond this example subset, use the same approach with the SRA study accession SRP364805 and retrieve the run accessions you need.
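As an optional sanity check, you can tally samples per diet group from the metadata file — a sketch, assuming `EXPERIMENTFOLDER` is set as above (it falls back to the current directory otherwise):

```shell
# Count samples per diet group in metadata.txt; each group should have n = 5
METADATA="${EXPERIMENTFOLDER:-.}/metadata.txt"

if [ -f "${METADATA}" ]; then
    # Skip the header line, then tally the second (group) column
    awk 'NR > 1 { n[$2]++ } END { for (g in n) print g, n[g] }' "${METADATA}" | sort
else
    echo "metadata.txt not found at ${METADATA}"
fi
```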
After the primary analysis finishes, you can perform downstream/secondary analysis with the recoRdseq R package.
recoRdseq already ships with count matrices and metadata for this exact example dataset, and the full downstream analysis workflow is demonstrated in the package vignette. This means you can also explore the secondary analysis without rerunning the primary pipeline if you just want to reproduce/inspect the example results.
Please cite the following publication when using this workflow:
"An End-To-End Computational Framework for 'Record-seq' Transcriptional Recording Data" Journal placeholder doi placeholder