Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
86 changes: 79 additions & 7 deletions ingest/bin/transform-date-fields
Original file line number Diff line number Diff line change
Expand Up @@ -9,14 +9,20 @@ from datetime import datetime
from sys import stderr, stdin, stdout


def format_date(date_string: str, expected_formats: set) -> str:
def format_date(date_string: str, expected_formats: list) -> str:
"""
Originally from nextstrain/ncov-ingest

Format *date_string* to ISO 8601 date (YYYY-MM-DD).
If *date_string* does not match *expected_formats*, return *date_string*.
If *date_string* is missing the year, return masked date 'XXXX-XX-XX'.
If *date_string* is an incomplete date (i.e. missing month or day), then
missing values are masked with 'XX'.

>>> expected_formats = {'%Y-%m-%d', '%Y-%m-%dT%H:%M:%SZ'}
>>> expected_formats = ['%Y-%m-%d', '%Y-%m-%dT%H:%M:%SZ', '%m-%d']

>>> format_date("01-01", expected_formats)
'XXXX-XX-XX'

>>> format_date("2020", expected_formats)
'2020-XX-XX'
Expand All @@ -36,14 +42,79 @@ def format_date(date_string: str, expected_formats: set) -> str:
>>> format_date("2020-01-15T00:00:00Z", expected_formats)
'2020-01-15'
"""
# Potential directives that datetime accepts that can return the correct year, month, day fields
# see https://docs.python.org/3.9/library/datetime.html#strftime-and-strptime-format-codes
#
# Allows us to check if year/month/day are included in the date format so we
# know when to mask incomplete dates with 'XX'
all_field_directives = {'%c', '%x',
('%G', '%V', '%A'), ('%G', '%V', '%a'), ('%G', '%V', '%w'), ('%G', '%V', '%u')
}
month_and_day_directives = {'%j',
('%U', '%A'), ('%U', '%a'), ('%U', '%w'), ('%U', '%u'),
('%W', '%A'), ('%W', '%a'), ('%W', '%w'), ('%W', '%u')
}
Comment on lines +50 to +56
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I'm not sure it's valid to combine ISO week-based numbering (%G, %u, %V) with non-ISO week-based numbering (%U, %W, %a, %A, %w), as the two systems are "indexed" differently. So we may want to drop ('%U', '%u') and ('%W', '%u') from month_and_day_directives and ('%G', '%V', '%A'), ('%G', '%V', '%a'), ('%G', '%V', '%w') from all_field_directives.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I included them since they were valid combinations for datetime.strptime based on the error message when trying to parse an incomplete ISO directive:

ValueError: ISO year directive '%G' must be used with the ISO week directive '%V' and a weekday directive ('%A', '%a', '%w', or '%u').

I did see that the Year/week directives cannot be interchanged between ISO/non-ISO, I think that matches what you mean by "indexed" differently? I would think the weekday directives can be used across ISO/non-ISO directives.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, hmm. I guess they're technically valid, in the sense that datetimes can be formatted with them and re-parsed back successfully. Ok to keep them then.

I guess I was more thinking about if they're valid for reasonable use, as you can't tell unambiguously from just the formatted datetime how it should be parsed, so it's a bit "dangerous" to mix them. For example, 2022 01 Sun (year, week, weekday) means different days depending on how it was produced:

>>> datetime.strptime("2022 01 Sun", "%Y %U %a")
datetime.datetime(2022, 1, 2, 0, 0)
>>> datetime.strptime("2022 01 Sun", "%G %V %a")
datetime.datetime(2022, 1, 9, 0, 0)

By "indexed" differently, I mean the starting points for counting weeks in a year and the weekday of weeks.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I guess we would be leaving it up to the user to understand the datetime directives and provide the correct combinations. The same problem exists even within non-ISO:

>>> datetime.strptime("2022 01 Sun", "%Y %U %a")
datetime.datetime(2022, 1, 2, 0, 0)
>>> datetime.strptime("2022 01 Sun", "%Y %W %a")
datetime.datetime(2022, 1, 9, 0, 0)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep! Ok to leave as is.

year_directives = {'%y', '%Y'}
month_directives = {'%b', '%B', '%m'}
day_directives = {'%d'}

def directive_is_included(potential_directives: set, date_format: str) -> bool:
"""
Checks if any of the directives in *potential_directives* is included
in *date_format* string.

If an element within *potential_directives* is a tuple, then all directives
within the tuple must be included in *date_format*.
"""
return any(
(
(isinstance(directive, str) and directive in date_format) or
(isinstance(directive, tuple) and all(sub_directive in date_format for sub_directive in directive))
Comment on lines +71 to +72
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While the in operator is super nice for readability here, it doesn't handle the edge case of a format string containing an escaped directive (e.g. %%Y means literal %Y not "a 4+ digit year").

Regexps provide a nice way to handle this with negative lookbehind assertions, e.g. (?<!%)%Y, generalized to something like:

def included(directive, format):
    return bool(re.search(f"(?<!%){re.escape(directive)}", format))

)
for directive in potential_directives
Comment on lines +70 to +74
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With polymorphic values like this (str or tuple), it can often be simplifying to just pick one or the other. In this case, since we need tuples, we could either do that dynamically with something like:

any(
    all(
        sub_directive in date_format
        for sub_directive in (directive if isinstance(directive, tuple) else (directive,))
    )
    for directive in potential_directives
)

but I'd suggest defining them all as tuples instead, e.g.

year_directives = {('%y',), ('%Y',)}

and leaving this function's code to only deal with tuples.

)

for date_format in expected_formats:
try:
parsed_date = datetime.strptime(date_string, date_format)
except ValueError:
continue
year_string = str(parsed_date.year)
month_string = str(parsed_date.month).zfill(2) if date_string.count('-') >= 1 else 'XX'
day_string = str(parsed_date.day).zfill(2) if date_string.count('-') == 2 else 'XX'

# Default to date masked as 'XXXX-XX-XX' so we don't return incorrect dates
year_string = 'XXXX'
month_string = day_string = 'XX'

parsed_year_string = str(parsed_date.year)
parsed_month_string = str(parsed_date.month).zfill(2)
parsed_day_string = str(parsed_date.day).zfill(2)

# If a directive for ALL fields is included in date format,
# then use all of the parsed field strings
if (directive_is_included(all_field_directives, date_format)):
year_string = parsed_year_string
month_string = parsed_month_string
day_string = parsed_day_string

# If not all fields directives are included, then check year
# directive was included in date format
elif(directive_is_included(year_directives, date_format)):
year_string = parsed_year_string

# Only check for month and day directives if year is included
# Check if directive for BOTH month and year is included in date format
if (directive_is_included(month_and_day_directives, date_format)):
month_string = parsed_month_string
day_string = parsed_day_string

# If not directives for BOTH month and day are included, then check
# month directive was included in date format
elif(directive_is_included(month_directives, date_format)):
month_string = parsed_month_string

# Only check for day directives if month is included
if(directive_is_included(day_directives, date_format)):
day_string = parsed_day_string
Comment on lines +98 to +116
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This cascade of checks to greater specificity makes sense as an implementation approach. It also serves an important functional role because some directives are dependent on others to parse correctly, e.g. %U requires %Y (or %y), and we enforce that thru this cascade.

However, I wonder if the sets of directives above and this code would be easier to reason about if we explicitly listed the combinations for each level instead of cascading. It'd be more verbose, but I think not overly so.

For example, what if instead of cascading down these sets

all_field_directives
month_and_day_directives
year_directives
month_directives
day_directives

we used a flat if/elif/elif chain against these sets:

year_month_day_directives
year_month_directives
year_directives

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the current way makes it easier to add new directives. However, that may not be a worry since the directives rarely get updated 🤔

Copy link
Copy Markdown
Contributor

@tsibley tsibley Jun 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the directives are likely to change approximately never.¹ ;-) And in general, I think the default should be to optimize the code for reading, not writing, as code will be read many many more times by many more people than it is written.

(But this doesn't mean you have to take my suggestion here!)

¹ These in particular been the same my entire programming career, because they were defined as part of C89, as in the year 1989.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I might not put in the effort here in monkeypox since I expect it to be replaced by the new augur command, but will definitely go for readability in the augur version of date transformations.


return f"{year_string}-{month_string}-{day_string}"

if date_string:
Expand All @@ -64,11 +135,12 @@ if __name__ == '__main__':
parser.add_argument("--date-fields", nargs="+",
help="List of date field names in the NDJSON record that need to be standardized.")
parser.add_argument("--expected-date-formats", nargs="+",
help="Expected date formats that are currently in the provided date fields")
help="Expected date formats that are currently in the provided date fields." +
"If a date string matches multiple formats, it will be parsed as the first format in the list.")

args = parser.parse_args()

expected_formats = set(args.expected_date_formats)
expected_formats = args.expected_date_formats

for record in stdin:
record = json.loads(record)
Expand Down
2 changes: 2 additions & 0 deletions ingest/config/config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,8 @@ transform:
# List of date fields to standardize
date_fields: ['date', 'date_submitted']
# Expected date formats present in date fields
# These date formats should use directives expected by datetime
# See https://docs.python.org/3.9/library/datetime.html#strftime-and-strptime-format-codes
expected_date_formats: ['%Y', '%Y-%m', '%Y-%m-%d', '%Y-%m-%dT%H:%M:%SZ']
# Titlecase rules
titlecase:
Expand Down