Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Parsing week-freq Periods #50803

Merged
merged 5 commits into from
Jan 26, 2023
Merged

Conversation

jbrockmendel
Copy link
Member

@lithomas1 IIRC youve spent some time recently in optimizing this portion of the constructor. I'm open to ideas on optimize this check or do it as the fallback rather than up-front.

The reason it is currently up-front is bc ATM parse_datetime_string_with_reso gets this wrong, bc dateutil gets it wrong. Following #50790, we could move the if dt.tzinfo is not None: check from parse_datetime_string into parse_dateutil, at which point parse_datetime_string_with_reso would raise instead of return something incorrect (at least for some cases, I think others might get through).

@lithomas1
Copy link
Member

lithomas1 commented Jan 17, 2023

Not too familiar with datetime stuff in general, but from reading the code it seems like week-freq Periods have a fixed length?

Then maybe we could check if the middle is a "/", and then try to send it down string_to_dts twice(by inserting a null character to terminate the c-str where the slash is, and then calling again passing in the char pointer after)?

@jbrockmendel
Copy link
Member Author

I'm not too concerned about perf of this particular case (though it can certainly be optimized), more about the fact that pre-checking for it with re.search will mean a perf hit for all other cases.

Current thought is to wait for #50790, then re-arrange this to do this check as a fallback rather than upfront.

@lithomas1
Copy link
Member

Does this conflict with #50149? If not, we might also want to get that one in first, since it touches the same lines of code.

@jbrockmendel
Copy link
Member Author

This can definitely wait until after #50149 and ill rebase then.

@jbrockmendel
Copy link
Member Author

Updated to sit on top of #50790 and do the week-checking case as a fallback instead of up-front

@jbrockmendel
Copy link
Member Author

@MarcoGorelli gentle ping on this and #50914, not time-sensitive but blockery for further work in this area

@MarcoGorelli
Copy link
Member

sure, taking a look today, thanks for the ping (in general, please do feel free to ping / request reviews from me when necessary, with all the activity in this repo it's easy to miss things otherwise)

"""
# GH#50803
start, end = value.split("/")
start = Timestamp(start)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason to use Timestamp here?

IIUC from the last time I was in Period code, Timestamp sends strings down parse_datetime_string_with_reso.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could use datetime.strptime. either way is fine. Timestamp doesn't yet use parse_datetime_string_with_reso, but getting there is one of the goals this is aiming at. even when it does, that won't affect behavior here.

Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Looks good to me

@MarcoGorelli MarcoGorelli added this to the 2.0 milestone Jan 26, 2023
@mroeschke mroeschke merged commit a504276 into pandas-dev:main Jan 26, 2023
@mroeschke
Copy link
Member

Thanks @jbrockmendel

@jbrockmendel jbrockmendel deleted the bug-parse-period branch January 26, 2023 19:30
@phofl
Copy link
Member

phofl commented Feb 12, 2023

Could this have caused regressions in read_json?

https://asv-runner.github.io/asv-collection/pandas/#io.json.ReadJSONLines.peakmem_read_json_lines_concat ant others

@jbrockmendel
Copy link
Member Author

It doesn't look like that benchmark does any period parsing, so it seems unlikely.

@rhshadrach
Copy link
Member

I'm seeing this PR as being flagged for over 400 regressions. This sometimes happens when dependencies change, but as far as I can tell that didn't happen here. Every regression is memory use, not performance.

Regressions

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Period Period data type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants