-
Notifications
You must be signed in to change notification settings - Fork 22
Robust transform-date-fields #45
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -9,14 +9,20 @@ from datetime import datetime | |
| from sys import stderr, stdin, stdout | ||
|
|
||
|
|
||
| def format_date(date_string: str, expected_formats: set) -> str: | ||
| def format_date(date_string: str, expected_formats: list) -> str: | ||
| """ | ||
| Originally from nextstrain/ncov-ingest | ||
|
|
||
| Format *date_string* to ISO 8601 date (YYYY-MM-DD). | ||
| If *date_string* does not match *expected_formats*, return *date_string*. | ||
| If *date_string* is missing the year, return masked date 'XXXX-XX-XX'. | ||
| If *date_string* is an incomplete date (i.e. missing month or day), then | ||
| missing values are masked with 'XX'. | ||
|
|
||
| >>> expected_formats = {'%Y-%m-%d', '%Y-%m-%dT%H:%M:%SZ'} | ||
| >>> expected_formats = ['%Y-%m-%d', '%Y-%m-%dT%H:%M:%SZ', '%m-%d'] | ||
|
|
||
| >>> format_date("01-01", expected_formats) | ||
| 'XXXX-XX-XX' | ||
|
|
||
| >>> format_date("2020", expected_formats) | ||
| '2020-XX-XX' | ||
|
|
@@ -36,14 +42,79 @@ def format_date(date_string: str, expected_formats: set) -> str: | |
| >>> format_date("2020-01-15T00:00:00Z", expected_formats) | ||
| '2020-01-15' | ||
| """ | ||
| # Potential directives that datetime accepts that can return the correct year, month, day fields | ||
| # see https://docs.python.org/3.9/library/datetime.html#strftime-and-strptime-format-codes | ||
| # | ||
| # Allows us to check if year/month/day are included in the date format so we | ||
| # know when to mask incomplete dates with 'XX' | ||
| all_field_directives = {'%c', '%x', | ||
| ('%G', '%V', '%A'), ('%G', '%V', '%a'), ('%G', '%V', '%w'), ('%G', '%V', '%u') | ||
| } | ||
| month_and_day_directives = {'%j', | ||
| ('%U', '%A'), ('%U', '%a'), ('%U', '%w'), ('%U', '%u'), | ||
| ('%W', '%A'), ('%W', '%a'), ('%W', '%w'), ('%W', '%u') | ||
| } | ||
| year_directives = {'%y', '%Y'} | ||
| month_directives = {'%b', '%B', '%m'} | ||
| day_directives = {'%d'} | ||
|
|
||
| def directive_is_included(potential_directives: set, date_format: str) -> bool: | ||
| """ | ||
| Checks if any of the directives in *potential_directives* is included | ||
| in *date_format* string. | ||
|
|
||
| If an element within *potential_directives* is a tuple, then all directives | ||
| within the tuple must be included in *date_format*. | ||
| """ | ||
| return any( | ||
| ( | ||
| (isinstance(directive, str) and directive in date_format) or | ||
| (isinstance(directive, tuple) and all(sub_directive in date_format for sub_directive in directive)) | ||
|
Comment on lines
+71
to
+72
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. While the Regexps provide a nice way to handle this with negative lookbehind assertions, e.g. def included(directive, format):
return bool(re.search(f"(?<!%){re.escape(directive)}", format)) |
||
| ) | ||
| for directive in potential_directives | ||
|
Comment on lines
+70
to
+74
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. With polymorphic values like this (str or tuple), it can often be simplifying to just pick one or the other. In this case, since we need tuples, we could either do that dynamically with something like: but I'd suggest defining them all as tuples instead, e.g. and leaving this function's code to only deal with tuples. |
||
| ) | ||
|
|
||
| for date_format in expected_formats: | ||
| try: | ||
| parsed_date = datetime.strptime(date_string, date_format) | ||
| except ValueError: | ||
| continue | ||
| year_string = str(parsed_date.year) | ||
| month_string = str(parsed_date.month).zfill(2) if date_string.count('-') >= 1 else 'XX' | ||
| day_string = str(parsed_date.day).zfill(2) if date_string.count('-') == 2 else 'XX' | ||
|
|
||
| # Default to date masked as 'XXXX-XX-XX' so we don't return incorrect dates | ||
| year_string = 'XXXX' | ||
| month_string = day_string = 'XX' | ||
|
|
||
| parsed_year_string = str(parsed_date.year) | ||
| parsed_month_string = str(parsed_date.month).zfill(2) | ||
| parsed_day_string = str(parsed_date.day).zfill(2) | ||
|
|
||
| # If a directive for ALL fields is included in date format, | ||
| # then use all of the parsed field strings | ||
| if (directive_is_included(all_field_directives, date_format)): | ||
| year_string = parsed_year_string | ||
| month_string = parsed_month_string | ||
| day_string = parsed_day_string | ||
|
|
||
| # If not all fields directives are included, then check year | ||
| # directive was included in date format | ||
| elif(directive_is_included(year_directives, date_format)): | ||
| year_string = parsed_year_string | ||
|
|
||
| # Only check for month and day directives if year is included | ||
| # Check if directive for BOTH month and year is included in date format | ||
| if (directive_is_included(month_and_day_directives, date_format)): | ||
| month_string = parsed_month_string | ||
| day_string = parsed_day_string | ||
|
|
||
| # If not directives for BOTH month and day are included, then check | ||
| # month directive was included in date format | ||
| elif(directive_is_included(month_directives, date_format)): | ||
| month_string = parsed_month_string | ||
|
|
||
| # Only check for day directives if month is included | ||
| if(directive_is_included(day_directives, date_format)): | ||
| day_string = parsed_day_string | ||
|
Comment on lines
+98
to
+116
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This cascade of checks to greater specificity makes sense as an implementation approach. It also serves an important functional role because some directives are dependent on others to parse correctly, e.g. However, I wonder if the sets of directives above and this code would be easier to reason about if we explicitly listed the combinations for each level instead of cascading. It'd be more verbose, but I think not overly so. For example, what if instead of cascading down these sets we used a flat if/elif/elif chain against these sets:
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think the current way makes it easier to add new directives. However, that may not be a worry since the directives rarely get updated 🤔
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, the directives are likely to change approximately never.¹ ;-) And in general, I think the default should be to optimize the code for reading, not writing, as code will be read many many more times by many more people than it is written. (But this doesn't mean you have to take my suggestion here!) ¹ These in particular been the same my entire programming career, because they were defined as part of C89, as in the year 1989.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, I might not put in the effort here in monkeypox since I expect it to be replaced by the new augur command, but will definitely go for readability in the augur version of date transformations. |
||
|
|
||
| return f"{year_string}-{month_string}-{day_string}" | ||
|
|
||
| if date_string: | ||
|
|
@@ -64,11 +135,12 @@ if __name__ == '__main__': | |
| parser.add_argument("--date-fields", nargs="+", | ||
| help="List of date field names in the NDJSON record that need to be standardized.") | ||
| parser.add_argument("--expected-date-formats", nargs="+", | ||
| help="Expected date formats that are currently in the provided date fields") | ||
| help="Expected date formats that are currently in the provided date fields." + | ||
| "If a date string matches multiple formats, it will be parsed as the first format in the list.") | ||
|
|
||
| args = parser.parse_args() | ||
|
|
||
| expected_formats = set(args.expected_date_formats) | ||
| expected_formats = args.expected_date_formats | ||
|
|
||
| for record in stdin: | ||
| record = json.loads(record) | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I'm not sure it's valid to combine ISO week-based numbering (
%G,%u,%V) with non-ISO week-based numbering (%U,%W,%a,%A,%w), as the two systems are "indexed" differently. So we may want to drop('%U', '%u')and('%W', '%u')frommonth_and_day_directivesand('%G', '%V', '%A'), ('%G', '%V', '%a'), ('%G', '%V', '%w')fromall_field_directives.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I included them since they were valid combinations for
datetime.strptimebased on the error message when trying to parse an incomplete ISO directive:I did see that the Year/week directives cannot be interchanged between ISO/non-ISO, I think that matches what you mean by "indexed" differently? I would think the weekday directives can be used across ISO/non-ISO directives.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, hmm. I guess they're technically valid, in the sense that datetimes can be formatted with them and re-parsed back successfully. Ok to keep them then.
I guess I was more thinking about if they're valid for reasonable use, as you can't tell unambiguously from just the formatted datetime how it should be parsed, so it's a bit "dangerous" to mix them. For example,
2022 01 Sun(year, week, weekday) means different days depending on how it was produced:By "indexed" differently, I mean the starting points for counting weeks in a year and the weekday of weeks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I guess we would be leaving it up to the user to understand the datetime directives and provide the correct combinations. The same problem exists even within non-ISO:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep! Ok to leave as is.