revisit storage and validation of temporal data #384
Comments
@turbomam what is the to do on this?
@turbomam moving this to Sept but please let me know if you're not actively working on it for the next 2 weeks
Checked in with @turbomam and moving this out of the sprint and adding the backlog label.
@ssarrafan This will start on the sprint from Dec26-6th
I don't think the next sprint will start till January since LBL is closed for the holidays Dec 23-Jan 2. I can add it to that sprint. Are you planning to work the week between December 26 and January @mslarae13?
I am working that week! PNNL doesn't close :(
@turbomam I think working on this today would be helpful. In relation to the updates I've made to the soil package relevant slots. Does the validation still hold, do we need to add additional validation rules anywhere?
Due date is Jan 20th so moving to next sprint
Looks like this is in the backlog now so I'll remove from the sprint. @mslarae13 if you plan to work on this next sprint let me know. Thanks.
This is a component of microbiomedata/sample-annotator#90
Background
nmdc-schema has a TimestampValue class, based on the AttributeValue class.
In fact the only real data slot for TimestampValue is the very generic, inherited has_raw_value, whose range is string.
TimestampValue's description does say
But that's not enforced anywhere in the schema
Objective
In my understanding, NMDC submitters should be able to enter partial datetimes for things like collection_date. I.e., 2022-08 should be accepted as meaning that the sample was collected some time in August of 2022. The day-of-month is not known, and should not be fudged as 2022-08-01.
Current solution
So we have configured our DH templates to validate values like 2022-08 from slots like collection_date with heavyweight regular expressions like
^[12]\d{3}(?:(?:-(?:0[1-9]|1[0-2]))(?:-(?:0[1-9]|[12]\d|3[01]))?)?$
And providing examples like "2021-04-15; 2021-04; 2021"
(BTW: GenomicsStandardsConsortium/mixs#446)
You can check those examples at regexr
(BTW see #385)
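The same check can be reproduced locally. Below is a minimal Python sketch that applies the pattern quoted above to the documented examples plus two extra candidates. The helper name matches_pattern and the extra candidates are illustrative only and are not part of the DH templates.

```python
import re

# Pattern copied verbatim from the issue text above.
PARTIAL_DATE_RE = re.compile(
    r"^[12]\d{3}(?:(?:-(?:0[1-9]|1[0-2]))(?:-(?:0[1-9]|[12]\d|3[01]))?)?$"
)


def matches_pattern(value: str) -> bool:
    """Return True if value looks like YYYY, YYYY-MM or YYYY-MM-DD.

    Hypothetical helper for illustration; it only checks the shape of
    the string, not whether the date exists on the calendar.
    """
    return PARTIAL_DATE_RE.match(value) is not None


# The first three candidates are the documented examples;
# the last two are extra probes added for this sketch.
for candidate in ["2021-04-15", "2021-04", "2021", "2022-13", "2022-02-31"]:
    print(candidate, matches_pattern(candidate))
```

The pattern rejects an impossible month such as 2022-13, but it accepts 2022-02-31, because it constrains the day to 01-31 without looking at the month.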
Proposed solution
It should be possible to at least validate these has_raw_values of TimestampValues against a proper datetime parser. Most Python datetime parsers will silently add 1s to the missing datetime parts. We don't have to use that parsed value, but it should at least parse. I think that will rule out dates that match the regular expression but don't exist, like 2022-02-31.
iso8601 seems to require pretty strict templates, but I think some of the other parsers don't.
I'll consult with LinkML colleagues and most likely try arrow and pendulum. Will post conclusions here.
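As a rough illustration of the "it should at least parse" idea, here is a sketch that uses only the standard library's datetime.strptime as a stand-in for arrow or pendulum, which have not been evaluated here. The function name parses_as_partial_date and the list of accepted formats are assumptions made for this example.

```python
from datetime import datetime

# Assumed set of accepted shapes: full date, year-month, year only.
FORMATS = ("%Y-%m-%d", "%Y-%m", "%Y")


def parses_as_partial_date(value: str) -> bool:
    """Return True if value parses under any of the assumed formats.

    Hypothetical helper: the parsed datetime is discarded, since the
    goal is only to confirm that the raw value is a real date.
    """
    for fmt in FORMATS:
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            # Wrong shape for this format, or a date that does not exist.
            continue
    return False


# strptime fills the missing day with 1, as described above.
print(datetime.strptime("2022-08", "%Y-%m"))

for candidate in ["2022-08", "2021", "2022-02-31"]:
    print(candidate, parses_as_partial_date(candidate))
```

In this sketch 2022-02-31 fails to parse even though it matches the regular expression. One caveat is that strptime is more lenient about zero padding than the pattern is, so a parser check like this probably works best alongside the regular expression instead of replacing it.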