Date ranges (e.g., 2014/2019) are incorrectly parsed as a single year in metadata #557

@nneune

Description

Before opening an issue, please:

  • Make sure you are using the latest version (check with datasets --version)
    datasets version: 18.15.0
  • Review our documentation

Describe the bug

The datasets CLI does not preserve collection date ranges. For sequences whose collection date is a range in the format YYYY/YYYY (e.g., 2014/2019), which is a valid format for the INSDC collection_date qualifier used by ENA and GenBank, the downloaded metadata keeps only the end year.

For example, sequence MW179421.1 was collected between 2014 and 2019 (shown as 2014/2019 on GenBank), but datasets reports only "collectionDate": "2019" in the metadata file.

This likely affects all NCBI entries with this date format.

Impact: This is problematic for temporal analyses and phylodynamic inference, as the date uncertainty/range is lost, potentially biasing tip date calibrations and evolutionary rate estimates.

Indicate what operating system you're using

Linux (WSL)

To Reproduce

Steps to reproduce the behavior:

  1. Run:
    datasets download virus genome taxon "39054" --no-progressbar --filename data/ncbi_dataset.zip
  2. Unzip and open data/ncbi_dataset/data/data_report.jsonl
  3. Search for accession MW179421.1 and check collectionDate
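
Step 3 can also be scripted. The following is a minimal sketch, not part of the datasets CLI. It assumes each line of data_report.jsonl is one JSON record with a top-level "accession" field and the isolate.collectionDate field shown in this report; the helper name find_collection_date is hypothetical.

```python
import json
from typing import Iterable, Optional


def find_collection_date(lines: Iterable[str], accession: str) -> Optional[str]:
    """Return isolate.collectionDate for the given accession, or None if absent."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Assumption: the accession is stored in a top-level "accession" field.
        if record.get("accession") == accession:
            return record.get("isolate", {}).get("collectionDate")
    return None


# Usage, with the path from step 2:
# with open("data/ncbi_dataset/data/data_report.jsonl", encoding="utf-8") as fh:
#     print(find_collection_date(fh, "MW179421.1"))
```

With the behavior described here, the call would return "2019" rather than "2014/2019".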

Actual output:

"isolate": {
  "collectionDate": "2019",
  "name": "ZY2017-12-EV71"
}

Expected behavior

The date range should be preserved, either as:

  • Original format: "collectionDate": "2014/2019"
  • Structured fields: "collectionDateStart": "2014", "collectionDateEnd": "2019"

Currently, only the end year (2019) is retained, losing the 5-year uncertainty window.
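
To illustrate the structured-fields option, here is a minimal sketch of how a range string could be split into the two proposed fields. It is only an illustration of the expected result, not how datasets parses dates; the function name split_collection_date is hypothetical, and the output keys are the field names suggested above.

```python
from typing import Dict


def split_collection_date(value: str) -> Dict[str, str]:
    """Split a collection date into start and end parts.

    A single date ("2019") yields identical start and end; a range
    ("2014/2019") yields both endpoints unchanged.
    """
    start, sep, end = value.strip().partition("/")
    if not sep:
        end = start  # not a range: start and end coincide
    if not start or not end or "/" in end:
        raise ValueError(f"unrecognized collection date: {value!r}")
    return {"collectionDateStart": start, "collectionDateEnd": end}


print(split_collection_date("2014/2019"))
# {'collectionDateStart': '2014', 'collectionDateEnd': '2019'}
```

Keeping both endpoints as strings avoids imposing a precision the record does not have, so downstream tools can decide how to model the uncertainty.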

Note: This issue was observed with the / format for date ranges. Other range formats have not been tested.


Labels: bug
