Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Malformed/wrong nextclade.tsv #392

Closed
chaoran-chen opened this issue Mar 1, 2023 · 12 comments
Closed

Malformed/wrong nextclade.tsv #392

chaoran-chen opened this issue Mar 1, 2023 · 12 comments
Labels
bug Something isn't working

Comments

@chaoran-chen
Copy link

chaoran-chen commented Mar 1, 2023

I think that nextclade.tsv of the open dataset is malformed (or wrong in a different way). The deletions column contains values for mutations.

One way to reproduce the error is using tsv-utils to inspect the deletions column:

➜  tsv-select -H -f deletions nextclade.tsv | grep C241T
C241T,G443A,T670G,C2790T,T2954C,C3037T,C3096T,G4184A,C4321T,C8956T,C9344T,A9424G,C9534T,C10029T,C10198T,G10447A,C10449A,C11750T,G12160A,C12880T,T14257C,C14408T,C15714T,C16733T,G16935A,A17039G,C17410T,A18163G,C19955T,A20055G,C21618T,C21714T,G21987A,T22200G,G22578A,G22599C,C22674T,T22679C,C22686T,A22688G,G22775A,G22813T,T22882G,A22893C,T22917G,G22992A,C22995A,A23013C,T23018G,A23055G,A23063T,T23075C,A23403G,C23525T,T23599G,C23604A,C23758A,C23854A,G23948T,A24424T,T24469A,C25000T,C25584T,C26060T,G26529A,C26577G,G26709A,C27807T,C27889T,A28271T,C28311T,C28312T,G28681T,G28881A,G28882A,G28883C,A29510C
C241T,T670G,C2790T,T2954C,C3037T,G4184A,C4321T,C9344T,A9424G,C9534T,C10029T,C10198T,G10447A,C10449A,C11750T,G12160A,C12880T,T14257C,C14408T,C15714T,G16935A,A17039G,C17410T,A18163G,C19955T,A20055G,A21417G,C21618T,G21987A,T22200G,G22578A,G22599C,C22674T,T22679C,C22686T,A22688G,G22775A,G22813T,T22882G,A22893C,G22992A,C22995A,A23013C,T23018G,A23055G,A23063T,T23075C,A23403G,C23525T,T23599G,C23604A,C23854A,G23948T,A24424T,T24469A,C25000T,C25584T,C26060T,G26529A,C26577G,G26709A,C27807T,C27889T,A28271T,C28311T,C28312T,G28681T,G28881A,G28882A,G28883C,A29510C
C241T,T469C,T670G,C2790T,T2954C,C3037T,G4184A,C4321T,C9344T,A9424G,C9534T,C10029T,C10198T,G10447A,C10449A,C11750T,G12160A,C12880T,T14257C,C14408T,A14424C,C15714T,G16935A,A17039G,C17410T,A18163G,G18498A,C19955T,A20055G,C21618T,G21987A,T22200G,G22578A,G22599C,C22674T,T22679C,C22686T,A22688G,G22775A,G22813T,T22882G,A22893C,T22917G,T22942A,G22992A,C22995A,A23013C,T23018G,A23055G,A23063T,T23075C,A23403G,C23525T,T23599G,C23604A,C23854A,G23948T,A24424T,T24469A,C25000T,C25584T,C26060T,G26529A,C26577G,G26709A,A27571G,C27807T,C27889T,T27910C,A28271T,C28311T,C28312T,G28681T,G28881A,G28882A,G28883C,A29510C
[truncated]

The expected output would be empty.

The same issue is also reported by the CSV/TSV parser that we use for LAPIS (org.apache.commons.csv.CSVParser) causing the preprocessing pipeline for the open LAPIS instance to crash: GenSpectrum/LAPIS#159

@chaoran-chen
Copy link
Author

Oh, and there are entries with the same index - that's not correct, is it? E.g., index 2 exists twice. The second entry has 73 tabs instead of 71.

➜  tsv-filter -H --eq index:2 nextclade.tsv
index	seqName	clade	Nextclade_pango	partiallyAliased	clade_nextstrain	clade_who	clade_legacy	qc.overallScore	qc.overallStatus	totalSubstitutions	totalDeletions	totalInsertions	totalFrameShifts	totalMissing	totalNonACGTNs	totalAminoacidSubstitutions	totalAminoacidDeletions	totalAminoacidInsertions	alignmentScore	alignmentStart	alignmentEnd	coverage	isReverseComplement	substitutions	deletions	insertions	frameShifts	aaSubstitutions	aaDeletions	aaInsertions	privateNucMutations.reversionSubstitutions	privateNucMutations.labeledSubstitutions	privateNucMutations.unlabeledSubstitutions	privateNucMutations.totalReversionSubstitutions	privateNucMutations.totalLabeledSubstitutions	privateNucMutations.totalUnlabeledSubstitutions	privateNucMutations.totalPrivateSubstitutions	missing	nonACGTNs	qc.missingData.missingDataThreshold	qc.missingData.score	qc.missingData.status	qc.missingData.totalMissing	qc.mixedSites.mixedSitesThreshold	qc.mixedSites.score	qc.mixedSites.status	qc.mixedSites.totalMixedSites	qc.privateMutations.cutoff	qc.privateMutations.excess	qc.privateMutations.score	qc.privateMutations.status	qc.privateMutations.total	qc.snpClusters.clusteredSNPs	qc.snpClusters.score	qc.snpClusters.status	qc.snpClusters.totalSNPs	qc.frameShifts.frameShifts	qc.frameShifts.totalFrameShifts	qc.frameShifts.frameShiftsIgnored	qc.frameShifts.totalFrameShiftsIgnored	qc.frameShifts.score	qc.frameShifts.status	qc.stopCodons.stopCodons	qc.stopCodons.totalStopCodons	qc.stopCodons.score	qc.stopCodons.status	totalPcrPrimerChanges	pcrPrimerChanges	failedGenes	warnings	errors
2	USA/FL-CDC-LC0174858/2021	21J	AY.25.1	B.1.617.2.25.1	21J	Delta	21J (Delta)	0	good	39	13	0	0	0	0	34	4	0	89124	39	29819	0.9958867003310704	false	G210T,C241T,C3037T,A3948G,G4181T,C6402T,C7124T,C8986T,G9053T,C10029T,A11201G,A11332G,G11562T,C14408T,G15451A,C16466T,C19220T,C21618G,G21987A,T22917G,C22995A,C23277T,A23403G,C23604G,C23921A,G24410A,C25469T,G26107C,T26767C,A27507C,T27638C,C27752T,C27874T,C28045T,A28461G,G28881T,G28916T,G29402T,G29742T	22029-22034,28248-28253,28271			M:I82T,N:D63G,N:R203M,N:G215C,N:D377Y,ORF1a:D1228G,ORF1a:A1306S,ORF1a:P2046L,ORF1a:P2287S,ORF1a:V2930L,ORF1a:T3255I,ORF1a:T3646A,ORF1a:C3766F,ORF1b:P314L,ORF1b:G662S,ORF1b:P1000L,ORF1b:A1918V,ORF3a:S26L,ORF3a:E239Q,ORF7a:V82A,ORF7a:T120I,ORF7b:T40I,ORF8:A51V,ORF9b:T60A,S:T19R,S:G142D,S:R158G,S:L452R,S:T478K,S:T572I,S:D614G,S:P681R,S:Q787K,S:D950N	ORF8:D119-,ORF8:F120-,S:E156-,S:F157-				C23277T,C23921A,C28045T	0	0	3	3			3000	0	good	0	10	0	good	0	24	-5	0	good	3		0	good	0		0		0	0	good		0	0	good	2	Charité_RdRp_F:G15451A,ChinaCDC_N_F:G28881T			
2	IMS-10594-CVDP-EB0C1053-2A0F-4426-A1FD-779CB3C43C4A	22E	BQ.1.1	BA.5.3.1.1.1.1.1.1	22E	Omicron	22E (Omicron)	48.444444	mediocre	77	59	0	0	255	2	59	14	0	69	89188	1	29903	0.9914055445941878	false	C241T,G443A,T670G,C2790T,T2954C,C3037T,C3096T,G4184A,C4321T,C8956T,C9344T,A9424G,C9534T,C10029T,C10198T,G10447A,C10449A,C11750T,G12160A,C12880T,T14257C,C14408T,C15714T,C16733T,G16935A,A17039G,C17410T,A18163G,C19955T,A20055G,C21618T,C21714T,G21987A,T22200G,G22578A,G22599C,C22674T,T22679C,C22686T,A22688G,G22775A,G22813T,T22882G,A22893C,T22917G,G22992A,C22995A,A23013C,T23018G,A23055G,A23063T,T23075C,A23403G,C23525T,T23599G,C23604A,C23758A,C23854A,G23948T,A24424T,T24469A,C25000T,C25584T,C26060T,G26529A,C26577G,G26709A,C27807T,C27889T,A28271T,C28311T,C28312T,G28681T,G28881A,G28882A,G28883C,A29510C	11288-11296,21633-21641,21765-21770,28362-28370,29734-29759			M:D3N,M:Q19E,M:A63T,N:P13L,N:E136D,N:R203K,N:G204R,N:S413R,ORF1a:V60I,ORF1a:S135R,ORF1a:T842I,ORF1a:S944L,ORF1a:G1307S,ORF1a:L3027F,ORF1a:T3090I,ORF1a:T3255I,ORF1a:P3395H,ORF1a:L3829F,ORF1b:Y264H,ORF1b:P314L,ORF1b:S1089L,ORF1b:M1156I,ORF1b:N1191S,ORF1b:R1315C,ORF1b:I1566V,ORF1b:T2163I,ORF3a:T223I,ORF9b:P10F,S:T19I,S:A27S,S:T51I,S:G142D,S:V213G,S:G339D,S:R346T,S:S371F,S:S373P,S:S375F,S:T376A,S:D405N,S:K417N,S:N440K,S:K444T,S:L452R,S:S477N,S:T478K,S:E484A,S:F486V,S:Q498R,S:N501Y,S:Y505H,S:D614G,S:H655Y,S:N679K,S:P681H,S:N764K,S:D796Y,S:Q954H,S:N969K	N:E31-,N:R32-,N:S33-,ORF1a:S3675-,ORF1a:G3676-,ORF1a:F3677-,ORF9b:E27-,ORF9b:N28-,ORF9b:A29-,S:L24-,S:P25-,S:P26-,S:H69-,S:V70-		A1931C,C22786A,T26270C		G443A,C3096T,C8956T,C16733T,C21714T,C23758A	3	0	6	9	13,22942,26965-27160,29847-29903	E:9,M:148-213,ORF1a:556,S:460	M:1931,Y:26270	3000	0	good	255	10	20	good	2	24	16	66.666667	mediocre	24		0	good	0		0		0	0	good		0	0	good	6	ChinaCDC_N_F:G28881A;G28882A;G28883C,USCDC_N1_P:C28311T;C28312T,USCDC_N3_F:G28681T			

@joverlee521
Copy link
Contributor

Thanks for reporting this issue @chaoran-chen!

This is most likely due to the release of Nextclade 2.12.0 yesterday that added new columns to the CSV/TSV output.

Our current cache system blindly concatenates new Nextclade runs to the existing cache so the mismatch in the number of column would lead to the malformed nextclade TSV.

I'm going to re-run our ingest without the cache so that the nextclade.tsv will have the same number of columns throughout.

@joverlee521
Copy link
Contributor

To prevent this issue from happening in the future, I see a couple options:

  1. Use the --output-columns-selection option for Nextclade to be explicit about which columns to include in our output.
    For our pipeline, we would only need the columns we join with the metadata. However, it's unclear to me if outside users depend on the other columns in the nextclade.tsv. We could just start with all of the columns that are currently in the nextclade.tsv (after today's no cache run).

  2. Automate the pipeline to run without cache whenever there's a new version of Nextclade. This would fit in with the ideas discussed in Add Nextclade dataset version tracking file #390.

We can start with easier option [1] then take our time to design/implement option [2].

@emmahodcroft
Copy link
Member

I'd probably agree that the tidiest thing to do here is to implement both.

Re 2: Even though the error may not be so obvious if columns aren't being added/subtracted, we could still adjust how something else is calculated in Nextclade which may lead to discordant data when we're appending files. So I think we should try to hard to set up some kind of system that whenever there's a Nextclade release we do a full Nextstrain metadata re-run.

However to avoid column addition/removal from messing up downstream operations we could also implement 1. I would naively imagine that not many are using the somewhat-intermediate nextclade.tsv file rather than the metadata file, but it's possible. I imagine the number is small and hopefully they could contact us if they ran into trouble?

@emmahodcroft
Copy link
Member

PS - Really sorry about this (again) @chaoran-chen ! Thanks for flagging it and hope it's sorted now!

@ivan-aksamentov
Copy link
Member

A few more ideas to complement Jover's list:

  • Freeze Nextclade version, and automate updates using bot-submitted, manually-approved PRs. These PRs would only be merged after reading Nextclade's changelog.
  • Automate cache invalidation by listening to Nextclade releases and creating the touchfiles

@ivan-aksamentov
Copy link
Member

Another possibility, although a downgrade, but is much simpler to implement:

  • Remove the caching and merging things and always do full runs. Nextclade v2 is so much faster than v1, that caching is not that much important. If I remember, downloading and decompressing the cache also takes quite some time. It might be that after removing the cache the difference will not be such big of a deal. Bit it would simplify the pipeline and make it a bit less fragile.

@chaoran-chen
Copy link
Author

Thanks everyone for fixing this so quickly, and don't worry @emmahodcroft!

@joverlee521
Copy link
Contributor

Remove the caching and merging things and always do full runs.

This should be doable for us internally, but not sure how much this would affect costs for external users such as Africa CDC on Terra.

@j23414 do you happen to have the latest costs of ncov-ingest runs with/without caches on Terra? I know we're pausing the development for ncov-ingest of GISAID on Terra for now, but just wanted keep that in mind when making a decision here.

@corneliusroemer
Copy link
Member

Problem is fixed in current nextclade.tsvs (both public and GISAID). Sorry for the trouble and thanks for raising this so quickly @chaoran-chen!

I should have thought about this when we added new columns to the nextclade.tsv 🤦

I think the most straightforward way to solve cache inconsistencies is to add extra columns to the nextclade produced output before merging/appending to cache:

  • nextclade version
  • dataset version

Cache invalidation is then quite easy to automate as we have all we need in the cache. Most of our trouble is that we don't include that info in the cache. Will make a PR.

The point you raise about index is a separate one. The index is not meant to be meaningful in our caching case. It's only meaningful if you run nextclade on a single file, mapping the position of the input sequence to the output result. You can just ignore it

@emmahodcroft
Copy link
Member

That sounds like a really good solution Cornelius, to put some armour between our foot and our gun 😉
Thanks!

@corneliusroemer
Copy link
Member

While we're at it - we could add these as potential output columns to Nextclade to make it more straightforward to include this optional provenance information for everyone.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
No open projects
Development

No branches or pull requests

5 participants