Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"Error parsing date..." #2279

Closed
noahlh opened this issue Jul 14, 2022 · 5 comments
Closed

"Error parsing date..." #2279

noahlh opened this issue Jul 14, 2022 · 5 comments
Assignees

Comments

@noahlh
Copy link
Contributor

noahlh commented Jul 14, 2022

Describe the bug

I'm attempting to train a model from data in a large JSON array. An example of the relevant slice looks like:

[
  {
    "date": "1941-03-01",
    "foo": "foo"
    "bar": "bar"
  },
  {
    "date": "2001-05-15",
    "foo": "foo"
    "bar": "bar"
  }
]

When I run the following command:

ludwig train --dataset input.json --data_format json -c config.yaml

After a few moments, I get the following output for every element of the array:

Error parsing date: 1941-03-01 00:00:00 with error strptime() argument 1 must be str, not Timestamp Please provide a datetime format that parses it in the preprocessing section of the date feature in the config. The preprocessing fill in value will be used.For more details: https://ludwig-ai.github.io/ludwig-docs/0.5/configuration/features/date_features/

Error parsing date: 2001-05-15 00:00:00 with error strptime() argument 1 must be str, not Timestamp Please provide a datetime format that parses it in the preprocessing section of the date feature in the config. The preprocessing fill in value will be used.For more details: https://ludwig-ai.github.io/ludwig-docs/0.5/configuration/features/date_features/

I've tried this with several config options in config.yaml, including each of the following:

input_features:
  - name: date
    type: date
    preprocessing:
      datetime_format: "%Y-%m-%d %H:%M:%S"
input_features:
  - name: date
    type: date
    preprocessing:
      datetime_format: "%Y-%m-%d"
input_features:
  - name: date
    type: date

All 3 result in the same error message.

It seems to me what's happening is that the date string is being parsed too soon in the process -- it's being turned into a Timestamp object BEFORE being fed into strptime(), hence the error.

I'm not familiar enough with the codebase to know if this is a bug, an issue with JSON as the input source, or something else, but some guidance would be appreciated!

Expected behavior
The date should parse according to the provided format.

Environment (please complete the following information):

  • OS: MacOS
  • Version 12.3.1
  • Python version 3.10.5
  • Ludwig version 0.5.4
@noahlh
Copy link
Contributor Author

noahlh commented Jul 14, 2022

Ok I did some exploring and as a temporary workaound I've successfully patched ludwig/features/date_feature.py as follows:

Changed:

def date_to_list(date_str, datetime_format, preprocessing_parameters):
    try:
        if datetime_format is not None:
            datetime_obj = datetime.strptime(date_str, datetime_format)
        else:
            datetime_obj = parse(date_str)

To:

def date_to_list(date_str, datetime_format, preprocessing_parameters):
    try:
        if isinstance(date_str, datetime): 
            datetime_obj = date_str
        elif datetime_format is not None:
            datetime_obj = datetime.strptime(date_str, datetime_format)
        else:
            datetime_obj = parse(date_str)

I'm not going to submit this as a patch because this doesn't fix the root cause - but it does get things working. I'd be happy to help submit a proper patch if someone can guide me in the right direction to where this might be happening upstream.

@justinxzhao
Copy link
Contributor

Hi @noahlh!

What you've proposed is actually quite reasonable -- could you create a PR with this change, with an additional test case in test_date_to_list in test_date_feature.py?

Looking into this a bit, it looks like there's a discrepancy between pandas.read_csv and pandas.read_json. The former parses date-like strings as an object while the latter parses date-like strings as datetime64[ns] type. This confirms your hypothesis that the data is being parsed as a datetime string "too early".

There's a more principled solution to cast all datetime features to datetime64[ns] always in preprocessing, and handle pure datetime objects during feature data building. However, this would require some additional refactoring since the way missing value handling is currently implemented for date features happens within the feature data building stage.

I think your patch is a reasonable stopgap -- happy to help you land that.

@noahlh
Copy link
Contributor Author

noahlh commented Jul 14, 2022

@justinxzhao Ah ha! You got it -- happy to help. I'll put that together ASAP.

@noahlh
Copy link
Contributor Author

noahlh commented Jul 15, 2022

@justinxzhao Just reading through the Pandas docs on read_json and another possible upstream fix might be to set the convert_dates option on pandas.read_json to be False (it defaults to True).

From the docs, it looks like it automatically converts for datelike columns, which includes columns with the name "date" (which is exactly my situation).

We might be able to avoid a refactor that way, since you'll consistently be getting the same input. Just an idea.

@justinxzhao
Copy link
Contributor

@noahlh Thanks for looking more into the read_json docs.

Setting convert_dates=False seems like a reasonable suggestion. However, there might be other inconsistencies with other data loading libraries, so making sure that date_feature can handle pre-datetime-typed data seems like a good change to make.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants