This repository contains a robust, reproducible workflow for building a custom Nextclade dataset for Enterovirus A. It enables you to generate reference and annotation files, download and process sequence data, infer an ancestral sequence, and create all files needed for Nextclade analyses and visualization.
# 1. Set up folders
mkdir -p dataset data ingest resources results scripts
# 2. Generate reference files
python3 scripts/generate_from_genbank.py --reference "NC_001612.1" --output-dir dataset/
# 3. Configure pathogen.json (edit manually)
# 4. If first time, enable inference in Snakefile:
# Set INFERRENCE_RERUN = True
# 5. Run workflow
snakemake --cores 9 all --config static_inference_confirmed=true
See detailed instructions below for each step.
Follow the Nextclade example workflow or use the structure below:
mkdir -p dataset data ingest resources results scripts
This workflow is composed of several modular steps:
- Reference Generation
  Extracts relevant reference and annotation files from GenBank.
- Dataset Ingest
  Downloads and processes sequences and metadata from NCBI Virus.
- Inferred Ancestral Root (Recommended)
  Uses outgroup rooting to infer a dataset-specific ancestral sequence. The dataset is rooted on a Static Inferred Ancestor: a phylogenetically reconstructed sequence at the MRCA (most recent common ancestor) of the ingroup, which provides a stable reference point for mutation and clade assignments. This addresses the problem that the reference sequence differs substantially from currently circulating strains.
- Augur Phylogenetics & Nextclade Preparation
  Builds trees rooted on the inferred ancestor, prepares multiple sequence alignments, and generates all files required for Nextclade and Auspice.
- Visualization & Analysis
  Enables both command-line and web-based Nextclade analyses, including local dataset hosting.
Run the script to extract the reference FASTA and genome annotation from GenBank:
python3 scripts/generate_from_genbank.py --reference "NC_001612.1" --output-dir dataset/
During the script execution, follow the prompts for CDS annotation selection.
- [0]
- [product] or [leave empty for manual choice] to select proteins.
- [2]
Outputs:
- dataset/reference.fasta
- dataset/genome_annotation.gff3
Edit pathogen.json to:
- Reference your generated files (reference.fasta, genome_annotation.gff3)
- Update metadata and QC settings as needed
Warning
If QC is not set, Nextclade will skip quality checks.
See the Nextclade pathogen config documentation for details.
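A quick sanity check on the edited pathogen.json can catch a missing file reference or an absent QC section before the workflow runs. The sketch below is illustrative only: the key names `files.reference`, `files.genomeAnnotation`, and `qc` are assumptions based on the Nextclade v3 dataset layout and should be verified against the pathogen config documentation, and `check_pathogen_config` is a hypothetical helper, not part of this repository.

```python
import json
from pathlib import Path

# Assumed key names under "files" (Nextclade v3 layout; verify against the docs).
EXPECTED_FILES = {
    "reference": "reference.fasta",
    "genomeAnnotation": "genome_annotation.gff3",
}


def check_pathogen_config(config: dict) -> list[str]:
    """Return human-readable problems found in a parsed pathogen.json."""
    problems = []
    files = config.get("files", {})
    for key, expected in EXPECTED_FILES.items():
        if files.get(key) != expected:
            problems.append(
                f"files.{key} is {files.get(key)!r}, expected {expected!r}"
            )
    # Without a QC section, Nextclade skips quality checks.
    if not config.get("qc"):
        problems.append("no 'qc' section: Nextclade will skip quality checks")
    return problems


def check_pathogen_file(path: str) -> list[str]:
    """Load a pathogen.json from disk and run the same checks."""
    return check_pathogen_config(json.loads(Path(path).read_text()))
```

An empty list from `check_pathogen_file("out-dataset/pathogen.json")` would mean the two file references and a QC section are present; it does not validate the QC rules themselves.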
Copy your GenBank file to resources/reference.gb and edit it to ensure compatibility with the workflow.
Important requirements:
- Each coding sequence (CDS) must have either a product or gene name present
- The annotation keys must match exactly between reference.gb and genome_annotation.gff3
- Use simple, consistent names (e.g., product="VP1" instead of product="VP1_protein")
- Remove any genes that are not relevant for your dataset
Warning
Mismatched or inconsistent gene names will cause augur ancestral to fail, as it cannot match features across files. Ensure your protein names match those defined in the GENES list in the Snakefile.
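Because name mismatches only surface when augur ancestral fails, it can help to compare the two files up front. The following is a minimal sketch using plain text parsing; it assumes CDS names live in the GenBank `/product` qualifier and in the GFF3 `Name` attribute (both are configurable guesses, not confirmed by this workflow), and it ignores qualifier values that wrap across lines.

```python
import re

# A feature line has the key at column 6, e.g. "     CDS             743..7327".
FEATURE_RE = re.compile(r"^ {5}(\S+)\s+\S")


def genbank_cds_names(text: str, key: str = "product") -> set[str]:
    """Collect the given qualifier (assumed: /product) from every CDS feature."""
    qualifier_re = re.compile(r'^\s+/' + re.escape(key) + r'="([^"]+)"')
    names, in_cds = set(), False
    for line in text.splitlines():
        feature = FEATURE_RE.match(line)
        if feature:
            in_cds = feature.group(1) == "CDS"
            continue
        qualifier = qualifier_re.match(line)
        if in_cds and qualifier:
            names.add(qualifier.group(1))
    return names


def gff3_cds_names(text: str, key: str = "Name") -> set[str]:
    """Collect the given attribute (assumed: Name) from every CDS row."""
    names = set()
    for line in text.splitlines():
        cols = line.split("\t")
        if line.startswith("#") or len(cols) != 9 or cols[2] != "CDS":
            continue
        attrs = dict(item.split("=", 1) for item in cols[8].split(";") if "=" in item)
        if key in attrs:
            names.add(attrs[key])
    return names


def name_mismatches(gb_text: str, gff_text: str) -> dict[str, list[str]]:
    """Report names present in one file but not the other."""
    gb, gff = genbank_cds_names(gb_text), gff3_cds_names(gff_text)
    return {"only_in_genbank": sorted(gb - gff), "only_in_gff3": sorted(gff - gb)}
```

Both lists being empty suggests the names agree; the same set can then be compared by eye against the GENES list in the Snakefile.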
- Adjust the workflow parameters and file paths as needed for your dataset.
- Ensure required files are available:
  - data/sequences.fasta
  - data/metadata.tsv
  - resources/auspice_config.json
Sequences and metadata can be downloaded automatically via the ingest process (see below).
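Before starting a run, the presence of these inputs can be confirmed with a few lines of Python. This is a hypothetical helper, not part of the workflow; the paths are the three listed above, resolved relative to the repository root.

```python
from pathlib import Path

# Input files the workflow expects, relative to the repository root.
REQUIRED_INPUTS = [
    "data/sequences.fasta",
    "data/metadata.tsv",
    "resources/auspice_config.json",
]


def missing_inputs(root: str = ".", required: list[str] = REQUIRED_INPUTS) -> list[str]:
    """Return the required inputs that do not exist under `root`."""
    base = Path(root)
    return [path for path in required if not (base / path).is_file()]
```

If `missing_inputs()` returns `data/sequences.fasta` or `data/metadata.tsv`, running the ingest step first is the likely fix.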
Automates downloading of Enterovirus A sequences and metadata from NCBI Virus.
See ingest/README.md for specifics.
Required packages:
csvtk, nextclade, tsv-utils, seqkit, zip, unzip, entrez-direct, ncbi-datasets-cli (installable via conda-forge/bioconda)
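Package names and executable names do not always coincide, so a missing tool may only show up mid-ingest. The sketch below checks one representative executable per package. The mapping is an assumption (for example `esearch` for entrez-direct, `datasets` for ncbi-datasets-cli, `tsv-select` for tsv-utils); the ingest scripts may call other binaries from the same packages.

```python
import shutil

# One representative executable per package (assumed mapping; adjust as needed).
PACKAGE_EXECUTABLES = {
    "csvtk": "csvtk",
    "nextclade": "nextclade",  # some setups expose this as nextclade3 instead
    "tsv-utils": "tsv-select",
    "seqkit": "seqkit",
    "zip": "zip",
    "unzip": "unzip",
    "entrez-direct": "esearch",
    "ncbi-datasets-cli": "datasets",
}


def missing_packages(which=shutil.which) -> list[str]:
    """Return packages whose representative executable is not on PATH."""
    return sorted(
        pkg for pkg, exe in PACKAGE_EXECUTABLES.items() if which(exe) is None
    )
```

Calling `missing_packages()` inside the activated conda environment lists anything still to install.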
The inferred-root/ directory contains a reproducible pipeline that uses outgroup rooting to infer a dataset-specific ancestral sequence for Enterovirus A. This method:
- Builds a phylogenetic tree including both Enterovirus A sequences (ingroup) and related enterovirus sequences (outgroup)
- Roots the tree on the outgroup to establish correct evolutionary directionality
- Extracts the ancestral sequence at the MRCA of all Enterovirus A sequences
- Fills gaps with reference nucleotides to ensure a complete, biologically plausible genome
This Static Inferred Ancestor serves as the root of your Nextclade dataset, providing:
- More accurate mutation calls relative to a realistic Enterovirus A ancestor
- A stable reference that better represents Enterovirus A diversity than the distant Reference sequence
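The gap-filling step in the method above can be illustrated with a short sketch. It assumes the inferred ancestor and the reference are already aligned to the same coordinates and that unresolved positions appear as `-` or `N`; the actual pipeline in inferred-root/ may define gaps differently.

```python
# Characters treated as "unresolved" in the inferred ancestor (an assumption).
GAP_CHARS = frozenset("-Nn")


def fill_gaps(ancestor: str, reference: str, gap_chars=GAP_CHARS) -> str:
    """Replace unresolved ancestor positions with the reference nucleotide."""
    if len(ancestor) != len(reference):
        raise ValueError("sequences must be aligned to the same length")
    return "".join(
        ref_base if anc_base in gap_chars else anc_base
        for anc_base, ref_base in zip(ancestor, reference)
    )


# Positions 3 and 5 are unresolved and take the reference base;
# the final position keeps the inferred base even though it differs.
print(fill_gaps("AC-TNGA", "ACGTAGT"))  # ACGTAGA
```

Only unresolved positions are touched, so every inferred substitution relative to the reference is preserved.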
The workflow has two key parameters in the main Snakefile:
- STATIC_ANCESTRAL_INFERRENCE = True: enables using the inferred root (default: True)
- INFERRENCE_RERUN = False: controls whether to regenerate the inferred root (default: False)
Use the existing inferred root:
snakemake --cores 9 all
When you need to regenerate with new data or updated outgroups:
- Set INFERRENCE_RERUN = True in the Snakefile
- Run the workflow:
snakemake --cores 9 all --config static_inference_confirmed=true
- The workflow will:
  - Clean previous results in inferred-root/results/
  - Run the full inference pipeline with your current sequences
  - Generate a new resources/inferred-root.fasta
  - Incorporate it into the dataset build
- After successful regeneration, set INFERRENCE_RERUN = False for future runs
Warning
Setting INFERRENCE_RERUN = True will overwrite your existing resources/inferred-root.fasta file and clear inferred-root/results/. Only use this when you want to regenerate the root with updated data.
Note
- First-time users: If resources/inferred-root.fasta doesn't exist, you must set INFERRENCE_RERUN = True initially.
- To disable this feature: Set STATIC_ANCESTRAL_INFERRENCE = False and change the ROOTING parameter (e.g., ROOTING="mid_point").
- Outgroup configuration: Sequences are in resources/outgroup/; update the OUTGROUP list in inferred-root/Snakefile to modify which species are used.
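The interaction between the two parameters, the confirmation flag, and the presence of the root file can be summarized as a small decision function. This is a sketch of the behavior described in this section, not the Snakefile's actual code; `plan_inferred_root` and its return values are hypothetical.

```python
def plan_inferred_root(
    static_inference: bool,  # STATIC_ANCESTRAL_INFERRENCE
    rerun: bool,             # INFERRENCE_RERUN
    confirmed: bool,         # --config static_inference_confirmed=true
    root_exists: bool,       # resources/inferred-root.fasta is present
) -> str:
    """Return 'skip', 'reuse', or 'regenerate' for the inferred root."""
    if not static_inference:
        return "skip"  # feature disabled; ROOTING must be set to something else
    if rerun:
        if not confirmed:
            # Regeneration overwrites the root and clears inferred-root/results/.
            raise RuntimeError("rerun requested without static_inference_confirmed=true")
        return "regenerate"
    if not root_exists:
        raise FileNotFoundError(
            "resources/inferred-root.fasta is missing; set INFERRENCE_RERUN = True"
        )
    return "reuse"
```

For a first-time run, `plan_inferred_root(True, True, True, False)` gives `"regenerate"`; for routine builds with an existing root, `plan_inferred_root(True, False, False, True)` gives `"reuse"`.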
See: inferred-root/README.md for technical details and the complete workflow.
To generate the Auspice JSON and Nextclade dataset:
snakemake --cores 9 all
This will use the existing inferred root (see Inferred Ancestral Root section above for regeneration instructions).
The workflow will:
- Build the reference tree rooted on the inferred ancestor
- Produce the Nextclade dataset in out-dataset/
- Run Nextclade on example sequences
- Output results to test_out/ (alignment, translations, summary TSV)
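Once the summary TSV exists, a tally of QC outcomes gives a quick read on whether the dataset behaves sensibly. The sketch below counts the `qc.overallStatus` column, which Nextclade writes to its TSV output; the file name `test_out/nextclade.tsv` used in the note afterwards is an assumption about this workflow's output naming.

```python
import csv
import io
from collections import Counter


def qc_status_counts(tsv_text: str) -> Counter:
    """Count sequences per qc.overallStatus value in a Nextclade TSV."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # Rows without a status (e.g. sequences that failed to align) count as "missing".
    return Counter(row.get("qc.overallStatus") or "missing" for row in reader)


example = (
    "seqName\tclade\tqc.overallStatus\n"
    "seq1\tA\tgood\n"
    "seq2\tB\tbad\n"
    "seq3\tA\tgood\n"
)
print(dict(qc_status_counts(example)))  # {'good': 2, 'bad': 1}
```

On real output this could be called as `qc_status_counts(open("test_out/nextclade.tsv").read())`.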
Key Snakefile parameters:
- ROOTING = "ancestral_sequence": roots tree on the inferred ancestor
- STATIC_ANCESTRAL_INFERRENCE = True: enables inferred root in the dataset (default)
- INFERRENCE_RERUN = False: set to True only when regenerating the root (default: False)
To label mutations of interest, run the mutLabels rule on its own. The labels are added to the out-dataset/pathogen.json file.
To use the dataset in Nextclade Web, serve it locally:
serve --cors out-dataset -l 3000
Then open:
https://master.clades.nextstrain.org/?dataset-url=http://localhost:3000
- Click "Load example", then "Run"
- You may want to reduce "Max. nucleotide markers" to 500 under "Settings" → "Sequence view" to optimize performance
The workflow includes a Snakemake rule test which builds a mixed test set (real sequences + generated edge cases) and runs Nextclade CLI against the newly assembled dataset. This is intended as a quick regression check when you change QC rules or alignment parameters.
snakemake --cores 9 test
- Creates synthetic test inputs (fragments + recombinants) with scripts/generate_test_sequences.py
- Combines:
  - dataset sequences (data/sequences.fasta)
  - example sequences included in the dataset
  - generated fragments + recombinants
  - negative controls in testing/ (e.g. testing/non-EV-A_sequence.fasta)
  - optional EV-A background sequences (testing/EV_A.fasta, or fetched from NCBI if missing)
- Runs nextclade3 run using the freshly built dataset.zip
- Parses the Nextclade run log and outputs summaries / failed sequences using scripts/parse_nextclade_log.py
All outputs are written to test_out/ (Nextclade outputs + aggregated test FASTAs, logs, and any derived summaries).
The Nextclade CLI log is saved to testing/test.log.
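To make the synthetic edge cases concrete, here is one way fragments and recombinants could be constructed. This is an illustrative sketch only and does not reproduce scripts/generate_test_sequences.py; the function names are hypothetical, and the recombinant assumes two parent sequences of equal length with a single breakpoint.

```python
def make_fragment(seq: str, start: int, length: int) -> str:
    """Return a partial genome: `length` bases beginning at 0-based `start`."""
    return seq[start:start + length]


def make_recombinant(parent_a: str, parent_b: str, breakpoint: int) -> str:
    """Join parent_a up to the breakpoint with parent_b from the breakpoint on."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same (aligned) length")
    return parent_a[:breakpoint] + parent_b[breakpoint:]


print(make_fragment("ACGTACGT", 2, 4))              # GTAC
print(make_recombinant("AAAAAAAA", "CCCCCCCC", 3))  # AAACCCCC
```

Fragments exercise coverage and missing-data QC rules, while recombinants tend to produce unusual mutation patterns, which is why both are useful in a regression set.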
- Maintainers: Nadia Neuner-Jehle, Alejandra González-Sánchez and Emma B. Hodcroft (eve-lab.org)
- For questions or suggestions, please open an issue or email: eve-group[at]swisstph.ch
If you use this template in your research, please cite:
Neuner-Jehle, N., González Sánchez, A., Hodcroft, E. B., & European Non-Polio Enterovirus Network (ENPEN). (2025). enterovirus-phylo/nextclade_d68: Enterovirus D68 Nextclade Dataset v1.0.0 (v1.0.0--2025-11-18). Zenodo. https://doi.org/10.5281/zenodo.17642338
- For issues, see the official Nextclade documentation or open an issue.
- For details on the inferred root workflow, see inferred-root/README.md.
This template provides a scalable, transparent workflow for building and maintaining high-quality Nextclade datasets for enteroviruses, and it can be adapted to other Enterovirus species.