Ingest NCBI GenBank data for H5N1 outbreak #37

joverlee521 · 2024-05-21T01:00:06Z

Jotting down concrete steps for ingesting NCBI GenBank data for H5N1 outbreak based on internal team discussion and GDoc notes.

Original plan that was discussed:

Fetch GenBank accessions from NCBI Virus Download (example URL) -- this provides additional filter options for collection date and genotype which are not available through the datasets command and shows 2,561 records on NCBI Virus.

Use datasets to download dataset for accessions

datasets download virus genome accession --inputfile accessions.txt

Use Bio.Entrez.efetch to fetch the GenBank records to parse out additional fields that are not included in the dataset metadata: strain, serotype, segment.
Merge the dataset metadata and the entrez metadata into a single NDJSON that can then go through the usual curation pipeline.

Detours

I waffled a little on whether we needed (3), because I realized that the GenBank Title is included in the FASTA headers of the sequences downloaded! We could potentially parse out strain, serotype, and segment from titles such as
```
PP766980.1 Influenza A virus (A/Canada goose/North Carolina/W24-90A/2024(H5N1)) clone new segment 7 matrix protein 2 (M2) and matrix protein 1 (M1) genes, complete cds
```
However, I'm not sure that all record titles will follow this format. GenBank docs for the Definition/Title does not make me confident about it either

Brief description of sequence; includes information such as source organism, gene name/protein name, or some description of the sequence's function (if the sequence is non-coding).

I think I'll stick with the original plan to use Entrez, but it was a nice thought.
(1) is currently not possible because the download function from NCBI Virus is broken due to a bug for H5N1 (Slack thread).
I thought we could just download all of the influenza genomes then filter locally, but I'm hitting the invalid zip archive error when running
```
datasets download virus genome taxon 11320
```
I'm not hitting the error with some minimum filtering on release date and geo-location, so we can go with this:
```
datasets download virus genome taxon 11320 --released-after "04/01/2024" --geo-location "North America"
```
This returns 32,262 records. Grep for H5N in the genomic.fna brought this down to 3,275 which is closer to the number on NCBI Virus, though I cannot easily check to see they are the same sequences.

The text was updated successfully, but these errors were encountered:

AngieHinrichs · 2024-05-21T20:01:48Z

Hi from Nextstrain's biggest fan/lurker. 🙂

We could potentially parse out strain, serotype, and segment from titles such as

PP766980.1 Influenza A virus (A/Canada goose/North Carolina/W24-90A/2024(H5N1)) ...

That can work with some effort for sequences submitted to GenBank, but not for sequences submitted to ENA (and then imported to GenBank as part of the INSDC partnership). ENA forbids useful naming in the title. Unfortunately you have to go through BioSample (either efetch (ugh! takes forever) or NCBI datasets JSON, much nicer but then there's the occasional invalid zip error) in order to get isolate names for sequences that were submitted to ENA.

Until recently, I was usually able to get decent metadata for all ~million influenza A sequences, including the /strain and /isolate keywords from GenBank, using this vvsearch2 query, and then I got fasta sequence using this datasets command. But in the past week both the Virus query and datasets command started failing. [I intend to switch over to using only NCBI datasets at some point (not the vvsearch2 query), and ultimately to fetch only the newest sequences instead of all sequences, but dunno when I'll actually get around to that.]

joverlee521 · 2024-05-21T20:16:24Z

Thanks for the tip @AngieHinrichs, that's super helpful!

@AngieHinrichs

Thanks to comment by @AngieHinrichs¹ which linked to an example URL that uses the `Strain_s` field. Based on this field, I was able to guess the fields for serotype and segment. Keeping the `isolate` field because some records use the `isolate` for the strain name instead of the `strain` field. Also removes the `sequence` field since that is no longer returned by the API.² ¹ <#37 (comment)> ² <nextstrain/ingest#18>

joverlee521 · 2024-05-21T23:47:31Z

Based on the vvsear2 query with Strain_s, I was able to guess the fields for serotype and segment in 109dcdb.

However, when I've been continuously running into an error when fetching:

curl: (18) transfer closed with outstanding read data remaining

@AngieHinrichs is this what you mean by the Virus query failing in the past week?

AngieHinrichs · 2024-05-21T23:53:08Z

curl: (18) transfer closed with outstanding read data remaining

@AngieHinrichs is this what you mean by the Virus query failing in the past week?

Yep. I used to get that error every once in a while, but in the past week it has been happening every time for influenza (similar queries still work for RSV, dengue, MPXV 🤞).

joverlee521 · 2024-05-22T17:29:59Z

Also tried with wget and continuously see this error:

Read error at byte 1422564 (The request is invalid.).Retrying.

AngieHinrichs · 2024-05-23T21:13:56Z

The vvsearch2 query worked for me just now!

@AngieHinrichs

Thanks to comment by @AngieHinrichs¹ which linked to an example URL that uses the `Strain_s` field. Based on this field, I was able to guess the fields for serotype and segment. Keeping the `isolate` field because some records use the `isolate` for the strain name instead of the `strain` field. Also removes the `sequence` field since that is no longer returned by the API.² ¹ <#37 (comment)> ² <nextstrain/ingest#18>

@AngieHinrichs

Thanks to comment by @AngieHinrichs¹ which linked to an example URL that uses the `Strain_s` field. Based on this field, I was able to guess the fields for serotype and segment. Keeping the `isolate` field because some records use the `isolate` for the strain name instead of the `strain` field. Also removes the `sequence` field since that is no longer returned by the API.² ¹ <#37 (comment)> ² <nextstrain/ingest#18>

joverlee521 · 2024-05-30T23:06:19Z

Closing since we have the base ingest workflow set up in #40.
I'll be creating other issues for the remaining tasks.

joverlee521 added the enhancement New feature or request label May 21, 2024

This was referenced May 24, 2024

WIP: Ingest from NCBI Datasets + Entrez #39

Closed

Ingest from NCBI Virus + NCBI Datasets #40

Merged

joverlee521 closed this as completed May 30, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ingest NCBI GenBank data for H5N1 outbreak #37

Ingest NCBI GenBank data for H5N1 outbreak #37

joverlee521 commented May 21, 2024

AngieHinrichs commented May 21, 2024

joverlee521 commented May 21, 2024

joverlee521 commented May 21, 2024

AngieHinrichs commented May 21, 2024

joverlee521 commented May 22, 2024

AngieHinrichs commented May 23, 2024

joverlee521 commented May 30, 2024

Ingest NCBI GenBank data for H5N1 outbreak #37

Ingest NCBI GenBank data for H5N1 outbreak #37

Comments

joverlee521 commented May 21, 2024

Original plan that was discussed:

Detours

AngieHinrichs commented May 21, 2024

joverlee521 commented May 21, 2024

joverlee521 commented May 21, 2024

AngieHinrichs commented May 21, 2024

joverlee521 commented May 22, 2024

AngieHinrichs commented May 23, 2024

joverlee521 commented May 30, 2024