Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider ISO8601 for datetime which would be much easier to parse and validate across systems #5

Closed
jeffbaumes opened this issue Apr 1, 2021 · 20 comments
Assignees

Comments

@jeffbaumes
Copy link
Collaborator

Basically the portal team would like to review with the metadata team the choice of this more difficult to parse datetime. If there are good reasons for this format we can keep it. If there are standard libraries for parsing these somewhat odd timestamps we should utilize them.

@jeffbaumes
Copy link
Collaborator Author

@cmungall Since this is such a far-reaching question on what we have as our consistent datetime format, I'm assigning to you to chime in. Should we stick with the time format we have? Is it inherited from something else like MIxS?

@StantonMartin
Copy link

I fully support doing this. There are a number of ISO 8601 formats which can be used. We will also need to be explicit about which version. I believe the latest is ISO 8601-2:2019

@jbeezley
Copy link

jbeezley commented Apr 7, 2021

After looking at some of the logic of biolinkml and the generated schemas, this might be a more broad issue. The schemas for attributes generally contain a has_raw_value which is meant to contain the original (non-normalized) value. This is what I'm getting in lat_lon for example: "has_raw_value": "10.1 120.33". The GeolocationValue however, contains the normalized attributes of latitude and longitude that remain null. The comment in the TimestampValue does indicate the range attribute should be set to an ISO 8601 formatted string, but there is no validation of this requirement.

I seems to me the intention is that the ETL pipeline should be normalizing terms like this so that consumers (such as me) don't have to, but perhaps I'm missing something.

@dehays
Copy link
Contributor

dehays commented Apr 9, 2021

We had consensus that ISO 8601 should be used for all datetime.

Further - the nmdc-schema should have a single ISO 8601 conformant string value and no raw value, and the nmdc-schema should regex validate the value. The responsibility for transforming sources to ISO 8601 conformant values lies with the ETL (at least presently in the GOLD as source case). There is agreement that transforming date-time should NOT be the responsibility of the search application ingest but should happen upstream. (Ideally in a post-ETL world, the source would provide conformant values.)

@jeffbaumes jeffbaumes assigned dehays and unassigned cmungall Apr 22, 2021
@jeffbaumes
Copy link
Collaborator Author

@dehays @wdduncan What do you think about this issue? I think it's great that we have consensus here but implementing the changes in the portal might take a bit of effort. Should we push to the next sprint? I reassigned to you @dehays since I don't think this needs @cmungall input anymore.

@wdduncan
Copy link
Contributor

The current fields with datetime values (e.g. add_date, mode_date) are pulled of the data source "as is". This what the has raw value slot holds. I can do some clean up / normalization during ETL. I think it would help if we prioritize which fields this is needed for.

As for pushing to the next sprint, I vote "yes".

@dehays
Copy link
Contributor

dehays commented Apr 24, 2021

@jeffbaumes @wdduncan I feel like there is agreement on what is intended, but implementing in the ETL and then the portal is work to be done. Moved this to the May sprint.

@wdduncan
Copy link
Contributor

wdduncan commented May 5, 2021

@jeffbaumes I can do for the GOLD fields that I process. Do you happen to have a list of the fields you need to be converted?

@jbeezley
Copy link

jbeezley commented May 5, 2021

I think it should be validated for any date/datetime field in the schema.

@wdduncan
Copy link
Contributor

wdduncan commented May 5, 2021

@jbeezley it would help if you had a list of the fields of the fields that you were especially concerned with.

@jbeezley
Copy link

jbeezley commented May 5, 2021

This is primarily about validating that the data is correct. According to the schema, datetimes should be provided as xsd:dateTime which is defined as ISO 8601. If that validation is not occurring for all datetime fields, I would classify it as a bug in the schema code.

@wdduncan
Copy link
Contributor

wdduncan commented May 5, 2021

@jbeezley I understand that. I have to write a script to transform the values. The only data types I have to work with are strings (i.e., all the data in the GOLD dump are strings). Moreover, many fields are empty (the data can be very sparse). So, I was just wondering if there were certain fields that were causing you problems so that I could prioritize tasks to suit your needs.

We already identified add_date and mod_date above. If you know of others w/o my having to search that would be helpful. If you don't that's okay. I'll have a look (hopefully soon).

@jbeezley
Copy link

jbeezley commented May 5, 2021

My ingest goes to great lengths to recognize a wide array of datetime formats, so I don't have the information myself. I would think if the validation were performed correctly, then getting that information from the errors emitted would be straightforward.

@dwinston
Copy link
Collaborator

dwinston commented May 5, 2021

@wdduncan the only other one I notice at a glance is the database object's date_created slot.

biosample collection_date, extreme_event, fire, and flooding are already annotated to be timestamp values. I'm not sure why the latter three are timestamp values, but I guess that's MIxS?

@jbeezley FYI the issue is upstream of validation because add_date and mod_date currently have assigned ranges of xsd:string, not xsd:dateTime (or "timestamp value" like the biosample fields). Thank you for debugging these, @wdduncan.

@wdduncan wdduncan transferred this issue from microbiomedata/nmdc-metadata May 13, 2021
@wdduncan wdduncan moved this from To do to In progress in NMDC May 2021 Sprint May 13, 2021
@wdduncan
Copy link
Contributor

I went through the schema looking for timestamp fields. In MIXS, I found the following fields took dates as values:

- fire
- last_clean
- date_last_rain
- iw_bt_date_well
- flooding
- extreme_event
- collection_date

Of these terms, only collection_date has data values associated with it, and these values are already form in YYYY-MM-DD.

As for the data that I am pulling in from GOLD, the only fields I found with datetime values were the add_date and mod_date fields in the biosample and omics_processing datasets.

I also exact data for the omics_processing.completion_date field, but (currently) there is no data for this field.

@wdduncan
Copy link
Contributor

@dwinston I've made the changes to my code, but on the repo you moved the code that was in metadata-translation/src/bin/lib. I remember you mentioning something about this ... but where did you move it to?

Once we figure this out we can merge PR #6.

@dwinston
Copy link
Collaborator

It is now in metadata-translation/src/bin/lib in the nmdc-runtime repo as well.

I had it there in my local repository, but it wasn't under version control because of a conflicting lib/ entry for Python packaging in the .gitignore. I added !metadata-translation/src/bin/lib to the .gitignore to override this for our directory of interest.

You should be all set now. Let me know if anything else is blocking for you.

@dwinston
Copy link
Collaborator

I merged in @wdduncan's PR #6. Will close this once the change is propagated for nmdc-server ingest, i.e. will co-close with #2.

@wdduncan
Copy link
Contributor

@ssarrafan this is pretty much finished, but waiting on @dwinston to close the ticket. So, I'll move this to the June sprint.

@wdduncan wdduncan removed this from In progress in NMDC May 2021 Sprint May 29, 2021
@wdduncan wdduncan added this to To do in NMDC June 2021 Sprint via automation May 29, 2021
@dwinston
Copy link
Collaborator

dwinston commented Jun 1, 2021

As the code for this is written, and what remains is for the metadata to be re-processed, I think this issue can be closed for now. It should be re-opened if any datetimes are discovered to be improperly encoded.

@dwinston dwinston closed this as completed Jun 1, 2021
NMDC June 2021 Sprint automation moved this from To do to Done Jun 1, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

No branches or pull requests

7 participants