Update cchf example #16

anna-parker · 2025-12-03T10:46:42Z

Add a `fastaIds` metadata column to the CCHF example data. This is backward compatible (the data can still be submitted on main), and it also means the example CCHF sequences can be submitted after the multi-segment submission changes in loculus-project/loculus#5382 are merged.
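To illustrate why the extra column is backward compatible, here is a minimal sketch of the lookup rule described in loculus-project/loculus#5382: a space-separated `fastaIds` value is split into FASTA header IDs, and a missing or empty value falls back to the submission ID. The function name `resolve_fasta_ids`, the `id_column` parameter, and the sample rows are hypothetical and are not taken from the Loculus backend or from the example file.

```python
import csv
import io


def resolve_fasta_ids(row: dict, id_column: str = "submissionId") -> list[str]:
    """Return the FASTA header IDs grouped under one metadata row.

    Hypothetical helper: mirrors the rule stated in the PR text, not the
    actual backend code.
    """
    raw = (row.get("fastaIds") or "").strip()
    if raw:
        # Space-separated list of fasta header IDs.
        return raw.split()
    # No fastaIds given: assume a one-to-one mapping to the submission ID.
    return [row[id_column]]


# Made-up rows for illustration only (not the contents of cchfv_test_metadata.tsv).
EXAMPLE_TSV = (
    "submissionId\tfastaIds\n"
    "sample1\tsample1_L sample1_M sample1_S\n"
    "sample2\t\n"
)

rows = list(csv.DictReader(io.StringIO(EXAMPLE_TSV), delimiter="\t"))
for row in rows:
    print(row["submissionId"], "->", resolve_fasta_ids(row))
```

The first row groups three segments under one metadata entry; the second row behaves like a single-segment submission.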

example_files/cchfv_test_metadata.tsv

… refactor multi segment submission in backend and edit page and have prepro assign segments (#5382)

resolves #4999, #4708, #4734, #5511
partially resolves #5392, #5185 (comment)
includes work done in #5398 and #5402

This PR additionally fixes submission, subtype assignment and search for EVs and other multi-pathogen organisms.

### BREAKING CHANGES

When users submit to multi-segmented organisms and want to group multiple segments under one metadata entry, they are now required to add an additional `fastaIds` column with a space-separated list of the `fastaId`s (fasta header IDs) of the respective sequences. If no `fastaIds` column is supplied, the `submissionId` will be used instead and the backend will assume that (as in the single-segmented case) there is a one-to-one mapping of metadata `submissionId` to `fastaId`.

This new submission structure was voted for in microbioinfo: https://microbial-bioinfo.slack.com/archives/CB0HYT53M/p1760961465729399 and discussed in https://app.nuclino.com/Loculus/Development/2025-10-20-Weekly-6d5fe89f-8ded-4286-b892-d215e0a498f6 (and in other meetings).

Nextclade sort (uses a minimizer index for fast local alignment) or nextclade align (full sequence alignment to reference) will be used to assign segments/subtypes for all multi-segmented and multi-pathogen sequences (this is also done in ingest for grouping segments):

```
segment_classification_method: "minimizer" or "align"
minimizer_url: <url_to_minimizer_index_used_by_nextclade_sort>
```

For organisms without a nextclade dataset we still allow the fasta headers to be used to determine the segment/subtype: entries must have the format `<submissionId>_<segmentName>` (as in the current setup).

As preprocessing now assigns segments, it will return a map from the segment (or subtype) to the fastaId in the processedData; the map is called `sequenceNameToFastaId`. This allows us to surface the segment assignment on the edit page.

### Nextclade Preprocessing pipeline config changes

Instead of having a dictionary for the nextclade datasets and servers we make `nucleotideSequences` a dictionary where each item includes all information required to run nextclade. I.e. we change from:

```
nextclade_dataset_name:
  L: nextstrain/cchfv/linked/L
  M: nextstrain/cchfv/linked/M
  S: nextstrain/cchfv/linked/S
nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output
genes: [RdRp, GPC, NP]
```

to:

```
nextclade_sequence_and_datasets:
  - name: L
    nextclade_dataset_name: nextstrain/cchfv/linked/L
    nextclade_dataset_tag: <optional - was previously incorrectly placed on an organism level>
    nextclade_dataset_server: <optional overwrites nextclade_dataset_server for this seq>
    accepted_sort_matches: <optional, used for classify_with_nextclade_sort and require_nextclade_sort_match, if not given nextclade_dataset_name and name are used>
    gene_prefix: <optional, prefix to add to genes produced by nextclade run, e.g. nextclade labels genes as `AV1` but we expect `EV1_AV1`, here `EV1` would be the prefix>
    genes: [RdRp]
  - name: M
    nextclade_dataset_name: nextstrain/cchfv/linked/M
    genes: [GPC]
  - name: S
    nextclade_dataset_name: nextstrain/cchfv/linked/S
    genes: [NP]
nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output
segment_classification_method: <optional, default for multi segmented viruses is align - if you assign segments in ingest for grouping use the same option here as you use there e.g. "minimizer" or "align">
minimizer_url: <optional, url_to_minimizer_index_used_by_nextclade_sort>
```

### Ingest Pipeline Config changes

`minimizer_index` is changed to `minimizer_url` for consistency (it can be used in both ingest and preprocessing, and the value should be the same in both).

### Optional additional Config changes

Limit the number of sequences the backend will accept per submission by using `maxSequencesPerEntry` (should be added for multi-segmented organisms): `submissionDataTypes: &defaultSubmissionDataTypes consensusSequences: true maxSequencesPerEntry: 1`

### Testing

You can use pathoplexus/example_data#16 and pathoplexus/dev_example_data#2 for testing.

### PR Checklist

- [x] Update values.schema.json and other READMEs
- [x] add fastaId to commonMetadata (ensure it is downloaded in templates): #5561
- [x] Fix how genes are returned (will cause a config update): #5563
- [x] Improve prepro code (less duplication and more tests): #5554
- [x] ingest EVs as single segmented to ensure search works: #5511
- [x] keep tests for alignment NONE case
- [x] Create a minimizer for tests using: https://github.com/loculus-project/nextclade-sort-minimizer-creator
- [x] Any manual testing that has been done is documented: EVs from the test folder were submitted with the same fastaHeader as the submissionId -> this succeeded; additionally, the submission of CCHF with a fastaID column in the metadata was tested (also in the folder above), and revision of a segment was tested
- [x] Have preprocessing send back a segment: fastaHeader mapping
- ~add integration testing for full EV submission user journey~ -> will be done in a later PR
- [x] improve CCHF minimizer (some segments are again not assigned)
- [x] discuss if the originalData dictionary should be migrated (persistent DB has segmentName as key, now we have fastaHeader as key) -> decided against
- [x] update PPX docs with new multi-segment submission format -> test PR here: pathoplexus/pathoplexus#759
- [x] update example data for demo

🚀 Preview: https://edit-page-anya.loculus.org

---------

Co-authored-by: Cornelius Roemer <cornelius.roemer@gmail.com>
Co-authored-by: Fabian Engelniederhammer <92720311+fengelniederhammer@users.noreply.github.com>
Co-authored-by: Theo Sanderson <theo@sndrsn.co.uk>
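The preprocessing config change described in the commit message can be sketched as a small conversion from the old per-organism layout to the new per-segment list. This is only an illustration under stated assumptions: `migrate_nextclade_config` is a hypothetical helper, not part of Loculus, and because the old config lists `genes` without saying which segment each gene belongs to, the segment-to-gene mapping has to be supplied by hand.

```python
def migrate_nextclade_config(old: dict, genes_by_segment: dict) -> dict:
    """Convert the old nextclade config layout to the per-segment layout.

    Hypothetical helper for illustration. `genes_by_segment` must be provided
    manually because the old layout does not record which gene belongs to
    which segment.
    """
    entries = [
        {
            "name": segment,
            "nextclade_dataset_name": dataset,
            "genes": list(genes_by_segment.get(segment, [])),
        }
        for segment, dataset in old["nextclade_dataset_name"].items()
    ]
    new = {"nextclade_sequence_and_datasets": entries}
    # The organism-level server remains a top-level key in the new layout.
    if "nextclade_dataset_server" in old:
        new["nextclade_dataset_server"] = old["nextclade_dataset_server"]
    return new


# Values copied from the CCHF example in the commit message.
old_config = {
    "nextclade_dataset_name": {
        "L": "nextstrain/cchfv/linked/L",
        "M": "nextstrain/cchfv/linked/M",
        "S": "nextstrain/cchfv/linked/S",
    },
    "nextclade_dataset_server": "https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output",
    "genes": ["RdRp", "GPC", "NP"],
}

new_config = migrate_nextclade_config(
    old_config, {"L": ["RdRp"], "M": ["GPC"], "S": ["NP"]}
)
for entry in new_config["nextclade_sequence_and_datasets"]:
    print(entry["name"], entry["nextclade_dataset_name"], entry["genes"])
```

Optional keys such as `nextclade_dataset_tag`, `accepted_sort_matches`, `gene_prefix`, `segment_classification_method` and `minimizer_url` are left out of the sketch; they would be added per organism as needed.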

anna-parker added 2 commits December 3, 2025 11:40

feat(cchf): add fastaIds to CCHF metadata entry

874aec2

wupps

c43ea09

anna-parker requested review from corneliusroemer, emmahodcroft and theosanderson December 3, 2025 10:50

anna-parker mentioned this pull request Dec 3, 2025

feat!(website, prepro, backend, config, integration):multi pathogen - refactor multi segment submission in backend and edit page and have prepro assign segments loculus-project/loculus#5382

Merged

13 tasks

corneliusroemer reviewed Dec 3, 2025

example_files/cchfv_test_metadata.tsv (outdated, resolved)

corneliusroemer approved these changes Dec 3, 2025

rename submissionId to id

7529575

anna-parker merged commit 21900e7 into main Dec 3, 2025

3 participants
