Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ingest NCBI GenBank data for H5N1 outbreak #37

Closed
joverlee521 opened this issue May 21, 2024 · 7 comments
Closed

Ingest NCBI GenBank data for H5N1 outbreak #37

joverlee521 opened this issue May 21, 2024 · 7 comments
Labels
enhancement New feature or request

Comments

@joverlee521
Copy link
Contributor

Jotting down concrete steps for ingesting NCBI GenBank data for H5N1 outbreak based on internal team discussion and GDoc notes.

Original plan that was discussed:

  1. Fetch GenBank accessions from NCBI Virus Download (example URL) -- this provides additional filter options for collection date and genotype which are not available through the datasets command and shows 2,561 records on NCBI Virus.
  2. Use datasets to download dataset for accessions
    datasets download virus genome accession --inputfile accessions.txt
    
  3. Use Bio.Entrez.efetch to fetch the GenBank records to parse out additional fields that are not included in the dataset metadata: strain, serotype, segment.
  4. Merge the dataset metadata and the entrez metadata into a single NDJSON that can then go through the usual curation pipeline.

Detours

  • I waffled a little on whether we needed (3), because I realized that the GenBank Title is included in the FASTA headers of the sequences downloaded! We could potentially parse out strain, serotype, and segment from titles such as

    PP766980.1 Influenza A virus (A/Canada goose/North Carolina/W24-90A/2024(H5N1)) clone new segment 7 matrix protein 2 (M2) and matrix protein 1 (M1) genes, complete cds
    

    However, I'm not sure that all record titles will follow this format. GenBank docs for the Definition/Title does not make me confident about it either

    Brief description of sequence; includes information such as source organism, gene name/protein name, or some description of the sequence's function (if the sequence is non-coding).

    I think I'll stick with the original plan to use Entrez, but it was a nice thought.

  • (1) is currently not possible because the download function from NCBI Virus is broken due to a bug for H5N1 (Slack thread).
    I thought we could just download all of the influenza genomes then filter locally, but I'm hitting the invalid zip archive error when running

    datasets download virus genome taxon 11320
    

    I'm not hitting the error with some minimum filtering on release date and geo-location, so we can go with this:

    datasets download virus genome taxon 11320 --released-after "04/01/2024" --geo-location "North America"
    

    This returns 32,262 records. Grep for H5N in the genomic.fna brought this down to 3,275 which is closer to the number on NCBI Virus, though I cannot easily check to see they are the same sequences.

@joverlee521 joverlee521 added the enhancement New feature or request label May 21, 2024
@AngieHinrichs
Copy link

Hi from Nextstrain's biggest fan/lurker. 🙂

We could potentially parse out strain, serotype, and segment from titles such as

PP766980.1 Influenza A virus (A/Canada goose/North Carolina/W24-90A/2024(H5N1)) ...

That can work with some effort for sequences submitted to GenBank, but not for sequences submitted to ENA (and then imported to GenBank as part of the INSDC partnership). ENA forbids useful naming in the title. Unfortunately you have to go through BioSample (either efetch (ugh! takes forever) or NCBI datasets JSON, much nicer but then there's the occasional invalid zip error) in order to get isolate names for sequences that were submitted to ENA.

Until recently, I was usually able to get decent metadata for all ~million influenza A sequences, including the /strain and /isolate keywords from GenBank, using this vvsearch2 query, and then I got fasta sequence using this datasets command. But in the past week both the Virus query and datasets command started failing. [I intend to switch over to using only NCBI datasets at some point (not the vvsearch2 query), and ultimately to fetch only the newest sequences instead of all sequences, but dunno when I'll actually get around to that.]

@joverlee521
Copy link
Contributor Author

Thanks for the tip @AngieHinrichs, that's super helpful!

joverlee521 added a commit that referenced this issue May 21, 2024
Thanks to comment by @AngieHinrichs¹ which linked to an example URL that
uses the `Strain_s` field. Based on this field, I was able to guess the
fields for serotype and segment.

Keeping the `isolate` field because some records use the `isolate` for
the strain name instead of the `strain` field.

Also removes the `sequence` field since that is no longer returned by
the API.²

¹ <#37 (comment)>
² <nextstrain/ingest#18>
@joverlee521
Copy link
Contributor Author

Based on the vvsear2 query with Strain_s, I was able to guess the fields for serotype and segment in 109dcdb.

However, when I've been continuously running into an error when fetching:

curl: (18) transfer closed with outstanding read data remaining

@AngieHinrichs is this what you mean by the Virus query failing in the past week?

@AngieHinrichs
Copy link

curl: (18) transfer closed with outstanding read data remaining

@AngieHinrichs is this what you mean by the Virus query failing in the past week?

Yep. I used to get that error every once in a while, but in the past week it has been happening every time for influenza (similar queries still work for RSV, dengue, MPXV 🤞).

@joverlee521
Copy link
Contributor Author

Also tried with wget and continuously see this error:

Read error at byte 1422564 (The request is invalid.).Retrying.

@AngieHinrichs
Copy link

The vvsearch2 query worked for me just now!

joverlee521 added a commit that referenced this issue May 24, 2024
Thanks to comment by @AngieHinrichs¹ which linked to an example URL that
uses the `Strain_s` field. Based on this field, I was able to guess the
fields for serotype and segment.

Keeping the `isolate` field because some records use the `isolate` for
the strain name instead of the `strain` field.

Also removes the `sequence` field since that is no longer returned by
the API.²

¹ <#37 (comment)>
² <nextstrain/ingest#18>
joverlee521 added a commit that referenced this issue May 29, 2024
Thanks to comment by @AngieHinrichs¹ which linked to an example URL that
uses the `Strain_s` field. Based on this field, I was able to guess the
fields for serotype and segment.

Keeping the `isolate` field because some records use the `isolate` for
the strain name instead of the `strain` field.

Also removes the `sequence` field since that is no longer returned by
the API.²

¹ <#37 (comment)>
² <nextstrain/ingest#18>
@joverlee521
Copy link
Contributor Author

Closing since we have the base ingest workflow set up in #40.
I'll be creating other issues for the remaining tasks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants