Find and use better source for typical mutations of lineages #4

Closed
lenaschimmel opened this issue Mar 15, 2022 · 14 comments
Comments

@lenaschimmel
Owner

lenaschimmel commented Mar 15, 2022

See this comment by @AngieHinrichs which even contains an alternative.

Thanks a lot for your detailed explanation! I'm trying to move this over here so it's easier to find for me.

(Also, if the comment thread over at pango-designation gets locked after too many "off topic" comments, I won't be able to comment there at all. That has already happened in other issues.)

@lenaschimmel
Owner Author

lenaschimmel commented Mar 15, 2022

And @SVN-PhD recommended that I take a look at outbreak.info for mutation prevalences.

I'm currently having problems with it: neither the website nor the API seems to work properly at the moment, but I will check back later.

@FedeGueli

FedeGueli commented Mar 15, 2022

I suggest looking at covspectrum too.
Maybe you could open an issue there asking them (@chaoran-chen) to add a tool for downloading mutation lists in a machine-readable format.
The advantage of Cov-Spectrum would be that you can choose a country and period, restricting the mass of mutations to the ones actually circulating in that place and period.

@chaoran-chen

chaoran-chen commented Mar 15, 2022

Hi everyone. I was just reading this issue here. Do the following APIs look useful to you?

Mutations of BA.1 globally:
https://lapis.cov-spectrum.org/open/v1/sample/nuc-mutations?pangoLineage=BA.1*&minProportion=0.01

Pango lineage with the C25708T mutation:
https://lapis.cov-spectrum.org/open/v1/sample/aggregated?nucMutations=C25708T&fields=pangoLineage

You can also further filter by location, dates (and much more). For example:
https://lapis.cov-spectrum.org/open/v1/sample/nuc-mutations?pangoLineage=BA.1*&minProportion=0.01&dateFrom=2022-01-01&region=Europe

Here is the documentation:
https://lapis.cov-spectrum.org/

It uses data from GenBank (prepared and hosted by Nextstrain).
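
For scripting, the same queries can be assembled programmatically. The following is a minimal Python sketch that assumes only the URL pattern visible in the links above; `LAPIS_BASE` and `lapis_url` are illustrative names, not part of LAPIS, and the set of accepted filter parameters should be checked against the documentation.

```python
from urllib.parse import urlencode

# Base URL taken from the example links in this comment.
LAPIS_BASE = "https://lapis.cov-spectrum.org/open/v1/sample"


def lapis_url(endpoint, **filters):
    """Build a LAPIS query URL from an endpoint name and filter parameters."""
    # safe="*" keeps the lineage wildcard (BA.1*) readable instead of %2A.
    return f"{LAPIS_BASE}/{endpoint}?{urlencode(filters, safe='*')}"


# Reproduces the filtered example: BA.1* in Europe since 2022-01-01.
url = lapis_url(
    "nuc-mutations",
    pangoLineage="BA.1*",
    minProportion=0.01,
    dateFrom="2022-01-01",
    region="Europe",
)
print(url)
```

The resulting URL can be fetched with any HTTP client. The layout of the response is not assumed here; it is described in the documentation linked above.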

@lenaschimmel
Owner Author

Thanks a lot, that looks perfect!

@lenaschimmel
Owner Author

I worked on the cov-spectrum integration yesterday. It's not finished yet, but it looks promising!

Also, I've been ignoring deletions and insertions until now, because they are not present in virus_properties.json and are also ignored by some other tools. Looks like cov-spectrum handles deletions just like any other mutation, which I might do as well. @chaoran-chen is there a way to also get the insertions of a lineage from cov-spectrum?

And @corneliusroemer, I saw your comment there. My code (not yet pushed, will do it in the evening) can currently get up-to-date mutation lists from cov-spectrum and either generate a virus_properties.json with mostly the same syntax as your file:

        "21K": [
            "G21989-",
            "T13195C",
            ...

or it can also include the prevalence for each mutation:

        "21K": [
            {
                "mutation": "G21989-",
                "proportion": 0.9401410657729306,
                "count": 764958
            },
            {
                "mutation": "T13195C",
                "proportion": 0.9829978750416327,
                "count": 799829
            },

I don't have a use for the absolute count, so I could also break it down to:

        "21K": {
           "G21989-": 0.9401410657729306,
           "T13195C": 0.9829978750416327,
           ...

Does any of this seem useful for your work on Nextclade?
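
As a rough illustration of how the three shapes relate, the detailed form can be reduced to either of the other two. This is a sketch only: the function names are made up for this example, the data is limited to the two 21K entries shown above, and the real virus_properties.json may carry more structure than the fragments here.

```python
# Detailed form, using the two example entries shown above for clade 21K.
detailed = {
    "21K": [
        {"mutation": "G21989-", "proportion": 0.9401410657729306, "count": 764958},
        {"mutation": "T13195C", "proportion": 0.9829978750416327, "count": 799829},
    ]
}


def to_plain_lists(data):
    """Clade -> list of mutation names (the plain list form)."""
    return {clade: [m["mutation"] for m in muts] for clade, muts in data.items()}


def to_proportions(data):
    """Clade -> {mutation: proportion}, dropping the absolute count."""
    return {
        clade: {m["mutation"]: m["proportion"] for m in muts}
        for clade, muts in data.items()
    }


plain = to_plain_lists(detailed)
compact = to_proportions(detailed)
```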

@chaoran-chen

> @chaoran-chen is there a way to also get the insertions of a lineage from cov-spectrum?

Unfortunately not yet. It will be possible eventually, but I don't know yet when I will be able to implement it.

@AngieHinrichs

Unfortunately, for SARS-CoV-2 sequences there are many genome assembly pipelines in use that do not do a good job with indels, so it may be just as well to skip them. I've seen cases where expected deletions are filled in with Ns, back-filled with reference sequence, or partially filled with read alignments that extend a bit into the deleted part of the reference genome, sometimes causing false "substitutions" in the deleted region. There is definitely enough information in the substitutions alone to distinguish between the Nextstrain clades. (Although if properly assembled sequences with reliable indels are available, I suppose including indels could provide a more precise estimate of the breakpoint.)

@lenaschimmel changed the title from "Check if virus_properties is really a good source for this use case" to "Find and use better source for typical mutations of lineages" on Mar 16, 2022

@lenaschimmel
Owner Author

@chaoran-chen:

> Unfortunately not yet. It will be possible eventually, but I don't know yet when I will be able to implement it.

Ok, just wanted to make sure that I'm not missing anything that's already there.

@AngieHinrichs: I agree. So I'll make it so that deletions are ignored by default, but can be enabled with a flag.
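
The opt-in behavior could look roughly like this. It is a sketch, not the code in this repo: it assumes deletions are written with a trailing "-" as in the "G21989-" example earlier in the thread, and `is_deletion` and `filter_mutations` are hypothetical names.

```python
def is_deletion(mutation):
    """True for mutations written like "G21989-" (assumed: "-" marks a deleted base)."""
    return mutation.endswith("-")


def filter_mutations(mutations, enable_deletions=False):
    """Drop deletions unless the caller explicitly opts in."""
    if enable_deletions:
        return list(mutations)
    return [m for m in mutations if not is_deletion(m)]


example = ["G21989-", "T13195C"]
default_result = filter_mutations(example)  # deletions ignored by default
opt_in_result = filter_mutations(example, enable_deletions=True)
```

Keeping the default conservative means sequences from pipelines that handle indels badly cannot distort the result unless the user asks for deletions.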

@lenaschimmel
Owner Author

Support for LAPIS / cov-spectrum is now released! The repo contains a pre-built virus_properties.json which can be updated with --rebuild-examples.

Deletions are disabled / ignored by default, but can be enabled with --enable-deletions. I'm not perfectly happy with how it works right now, but I think it's a good start:

Without deletions

screenshot-no-deletions

With deletions

screenshot-with-deletions

Thanks a lot for your input!

@corneliusroemer

@lenaschimmel

> Does any of this seem useful for your work on Nextclade?

This is pretty much how I started creating the virus_properties.json, before switching to Nextclade data because covSpectrum doesn't have our clades.

@lenaschimmel
Owner Author

lenaschimmel commented Mar 19, 2022

I just pushed an update with the new --mutation-threshold parameter. See this comment for more details.

I think this finally addresses @AngieHinrichs' original suggestion.

@AngieHinrichs

Thanks @lenaschimmel, --mutation-threshold should do the trick!

I think another tweak might be needed for --rebuild-examples, however. In the latest virus_properties.json, and after running --rebuild-examples, the lists for 21I and 21J are empty:

        "21I": [],
        "21J": [],

-- is that perhaps because all of their defining mutations are now in 21A because of the new minimum of 0.05 when rebuilding?

21J grew much larger than 21I (almost 10x as many genomes per quick stats on the UCSC/UShER tree), so the allele frequencies in 21A are heavily skewed towards 21J.
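
The skew can be checked with a little arithmetic. This is a sketch under clearly labeled assumptions: that the 21A sample simply pools 21I and 21J genomes (other 21A genomes are ignored), that 21J is exactly ten times as large as 21I, and that each mutation is present in all genomes of one clade and none of the other. The genome counts and names below are hypothetical.

```python
def pooled_frequency(groups):
    """Frequency of a mutation across several groups.

    `groups` is a list of (genome_count, frequency_within_group) pairs.
    """
    total = sum(count for count, _ in groups)
    carriers = sum(count * freq for count, freq in groups)
    return carriers / total


# Hypothetical sizes: 21J ten times as large as 21I.
N_21I, N_21J = 1_000, 10_000

# A mutation found in every 21J genome and in no 21I genome.
only_21j = pooled_frequency([(N_21I, 0.0), (N_21J, 1.0)])
# The reverse: found in every 21I genome and in no 21J genome.
only_21i = pooled_frequency([(N_21I, 1.0), (N_21J, 0.0)])

MIN_PROPORTION = 0.05  # the minimum mentioned above
print(round(only_21j, 3), only_21j >= MIN_PROPORTION)  # 0.909 True
print(round(only_21i, 3), only_21i >= MIN_PROPORTION)  # 0.091 True
```

Under these assumptions both kinds of mutation clear a 0.05 minimum in the pooled set, which would be consistent with the defining mutations of 21I and 21J all being attributed to 21A.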

When I run on the GenBank sequences from cov-lineages/pango-designation#471 (471.genbank.aligned.fa.gz), the label for Delta is "Delta (B.1.617.2 / 21A)" but the mutations are more like 21J because they include 4181T, 6402T, 7124T, 8986T, 9053G and so on. Since the proposed recombinant is from 21J (like most would be by chance since 21J was so much more common than 21I, probably especially by the time Omicron was around though I have not checked dates), the recombination picture comes out perfect except for the '21A' label:
image

I believe there are very few Delta sequences that are 21A but not 21I or 21J, so the quickest fix might be to simply skip 21A, although I'm not sure what that would mean for mutations shared by 21I and 21J.

It should be pretty straightforward to transform my file of not-masked-for-UShER Nextstrain clade mutations to the virus_properties.json format. I will give that a try.

@corneliusroemer

There are indeed not that many Deltas (lately) that are neither 21I nor 21J. They do exist, there are a few pango lineages, but for identifying current recombinants, one can drop 21A without having to worry too much.

@lenaschimmel
Owner Author

There's an update on #10 which is also relevant to this issue. See my comment here.
