Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: raise ValueError if invalid period freq pass to asfreq when the index of df is a PeriodIndex #56945

Conversation

natmokval
Copy link
Contributor

@natmokval natmokval commented Jan 18, 2024

xref #55785, #52064

PeriodIndex.asfreq silently converts for offsets such as offsets.MonthBegin(), offsets.BusinessMonthEnd(), etc. (with no attribute '_period_dtype_code') frequency to period frequency (in this case 'M').

Reproducible Example:

>>> import pandas as pd
>>> from pandas.tseries import offsets
>>>
>>> index = pd.PeriodIndex(["2020-01-01", "2021-01-01"], freq="Q")
>>> index.asfreq(freq=offsets.MonthBegin(1))
PeriodIndex(['2020-03', '2021-03'], dtype='period[M]')
>>>
>>> index.asfreq(freq=offsets.BusinessMonthEnd(1)) 
PeriodIndex(['2020-03', '2021-03'], dtype='period[M]')

the correct behaviour would be raising an Error:

>>> index.asfreq(freq=offsets.MonthBegin(1))
ValueError: <MonthBegin> is not supported as period frequency

Another problem: so far in the example below

>>> index = pd.PeriodIndex(["2020-01-01", "2021-01-01"], freq="M")
>>> index.asfreq(freq='BMS')

we get on main

AttributeError: 'pandas._libs.tslibs.offsets.BusinessMonthBegin' object has no attribute '_period_dtype_code'

the correct behaviour would be raising an Error:

ValueError: BMS is not supported as period frequency

added to the definition of asfreq a check if string denoting frequency is supported as period frequency. If the index of a DataFrame is a PeriodIndex and the frequency is invalid period frequency a ValueError is raised.

@natmokval natmokval assigned natmokval and unassigned natmokval Jan 18, 2024
@natmokval natmokval added the Frequency DateOffsets label Jan 18, 2024
@natmokval natmokval marked this pull request as ready for review January 18, 2024 23:02
@natmokval
Copy link
Contributor Author

@MarcoGorelli, could you please take a look at this PR?

@@ -2827,6 +2827,16 @@ def asfreq(
if how is None:
how = "E"

if isinstance(freq, str):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a few lines down, there's obj.index.asfreq(freq, how=how) - should these validation go in there?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, you are right. I moved the check if isinstance(freq, str) to the definition of asfreq in pandas/core/indexes/period.py

Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for working on this

a lot of these validation paths go through to_offset - could we just check that if you run to_offset(..., is_period=True) then it raises an informative error message if the return value doesn't have a _period_dtype_code attribute?

So here

if isinstance(freq, BaseOffset):
return freq
if isinstance(freq, tuple):
raise TypeError(
f"to_offset does not support tuples {freq}, pass as a string instead"
)
elif PyDelta_Check(freq):
return delta_to_tick(freq)

instead of returning, assign the return value to a variable (say, delta, as that's what's used below)

then, just after

if delta is None:
raise ValueError(INVALID_FREQ_ERR_MSG.format(freq))

raise if is_period and delta doesn't have a _period_dtype_code attribute?

Comment on lines 217 to 221
elif offset.name == freq.replace(f"{offset.n}", ""):
raise ValueError(
f"Invalid offset: '{offset.name}' for converting time series "
f"with PeriodIndex."
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really understand this, but I think you don't need it - just raise ValueError(INVALID_FREQ_ERR_MSG.format(f"{freq}")) in the else branch below?

Copy link
Contributor Author

@natmokval natmokval Feb 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I use two different error messages here because we raise an error of invalid freq such as “ABC” and also for freq=“BMS” which is valid for offsets, but invalid for period. I agree, that it can be confusing, so better to use the standard error message f"Invalid frequency: {freq}"

)
else:
raise ValueError(INVALID_FREQ_ERR_MSG.format(f"{freq}"))

arr = self._data.asfreq(freq, how)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you follow this asfreq and put this validation even deeper?

EDIT: as mentioned in the review, it might be better to go all the way down and do this validation within to_offset itself?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, I did as you suggested and moved this validation to to_offset. It works very well.

return delta_to_tick(freq)
delta = delta_to_tick(freq)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe on line 4841, we can do

    if isinstance(freq, BaseOffset):
        delta = freq

so then that gets validated too?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible to do this assignment on line 4841 as well? just in case an offset which isn't valid for periods is passed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am unsure if it's possible. I tried to do this assignment on line 4841

if isinstance(freq, BaseOffset):
    delta = freq

but then I got failures.
The reason: if we replace return freq with the delta = freq, we go to the line 4962 and assign delta to None and then on line 4965 we raise a ValueError.
Which is why instead of the assignment delta = freq I added the check

if is_period and not hasattr(freq, "_period_dtype_code"):
    raise ValueError(f"{freq.base} is not supported as period frequency")

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can just move this down to before elif PyDelta_Check(freq):

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, I saw you already made commit for it

Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice one - does this mean it's also possible to get rid of the except in

pandas/pandas/core/resample.py

Lines 2814 to 2820 in acd914d

if hasattr(freq, "_period_dtype_code"):
freq = freq_to_period_freqstr(freq.n, freq.name)
else:
raise ValueError(
f"Invalid offset: '{freq.base}' for converting time series "
f"with PeriodIndex."
)

?

@natmokval
Copy link
Contributor Author

nice one - does this mean it's also possible to get rid of the except in

pandas/pandas/core/resample.py

Lines 2814 to 2820 in acd914d

if hasattr(freq, "_period_dtype_code"):
freq = freq_to_period_freqstr(freq.n, freq.name)
else:
raise ValueError(
f"Invalid offset: '{freq.base}' for converting time series "
f"with PeriodIndex."
)

?

I am afraid we can't rid of this block yet. When I removed this check the test_asfreq_invalid_period_freq in pandas/tests/resample/test_period_index.py failed.

Without this block we silently convert index, for example:

index = pd.PeriodIndex(["2020-01-01", "2021-01-01"], freq="Q")
df = pd.DataFrame({"a": pd.Series([0, 1], index=index)})
res = df.asfreq(freq=offsets.BusinessMonthEnd(1))
print(res.index)

PeriodIndex(['2020-03', '2021-03'], dtype='period[M]')

Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be good thanks, just one minor comment

does this change anything since 2.2? if so, I'd say that a whatsnew note in 3.0 is warranted (or in 2.2.1 if it should be backported, for example if anything introduced in 2.2 wasn't correct and needs fixing here)

return delta_to_tick(freq)
delta = delta_to_tick(freq)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible to do this assignment on line 4841 as well? just in case an offset which isn't valid for periods is passed

Comment on lines -56 to 59
msg = "WeekOfMonth is not supported as period frequency"
with pytest.raises(TypeError, match=msg):
msg = "WOM-1MON is not supported as period frequency"
with pytest.raises(ValueError, match=msg):
Period("2012-01-02", freq="WOM-1MON")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice! yeah better to match what the user passed if possible, good one

Copy link
Contributor Author

@natmokval natmokval Feb 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am unsure if it's possible to do this assignment on line 4841, if I do it

if isinstance(freq, BaseOffset):
    delta = freq

I get failures. The reason: if we replace return freq with the delta = freq, we go to the line 4962 and assign delta to None and then on line 4965 we raise a ValueError.

Instead of the assignment delta = freq I added the check

if is_period and not hasattr(freq, "_period_dtype_code"):
    raise ValueError(f"{freq.base} is not supported as period frequency")

@natmokval
Copy link
Contributor Author

should be good thanks, just one minor comment

does this change anything since 2.2? if so, I'd say that a whatsnew note in 3.0 is warranted (or in 2.2.1 if it should be backported, for example if anything introduced in 2.2 wasn't correct and needs fixing here)

yes, I think we have an incorrect conversion in 2.2

>>> index = pd.PeriodIndex(["2020-01-01", "2021-01-01"], freq="M")
>>> index.asfreq(freq=offsets.MonthBegin(1))
PeriodIndex(['2020-03', '2021-03'], dtype='period[M]')
>>>
>>> index.asfreq(freq=offsets.YearBegin(1))
PeriodIndex(['2020', '2021'], dtype='period[Y-DEC]')
>>>
>>> index.asfreq(freq=offsets.BusinessMonthEnd(1))
PeriodIndex(['2020-03', '2021-03'], dtype='period[M]')

I fixed it in this PR and added an entry the 2.2.1. Now we raise a ValueError, e.g.

>>> index.asfreq(freq=offsets.MonthBegin(1))
ValueError: <MonthBegin> is not supported as period frequency

@MarcoGorelli MarcoGorelli added this to the 2.2.1 milestone Feb 7, 2024
Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wonderful, thanks a tonne @natmokval !

@MarcoGorelli MarcoGorelli merged commit cb97ce6 into pandas-dev:main Feb 7, 2024
47 checks passed
Copy link

lumberbot-app bot commented Feb 7, 2024

Owee, I'm MrMeeseeks, Look at me.

There seem to be a conflict, please backport manually. Here are approximate instructions:

  1. Checkout backport branch and update it.
git checkout 2.2.x
git pull
  1. Cherry pick the first parent branch of the this PR on top of the older branch:
git cherry-pick -x -m1 cb97ce6e496e7d1df76f06550c0fe8c4590ff3ce
  1. You will likely have some merge/cherry-pick conflict here, fix them and commit:
git commit -am 'Backport PR #56945: ENH: raise ValueError if invalid period freq pass to asfreq when the index of df is a PeriodIndex'
  1. Push to a named branch:
git push YOURFORK 2.2.x:auto-backport-of-pr-56945-on-2.2.x
  1. Create a PR against branch 2.2.x, I would have named this PR:

"Backport PR #56945 on branch 2.2.x (ENH: raise ValueError if invalid period freq pass to asfreq when the index of df is a PeriodIndex)"

And apply the correct labels and milestones.

Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

Remember to remove the Still Needs Manual Backport label once the PR gets merged.

If these instructions are inaccurate, feel free to suggest an improvement.

@natmokval
Copy link
Contributor Author

thanks @MarcoGorelli, for helping me with this PR!

MarcoGorelli pushed a commit to MarcoGorelli/pandas that referenced this pull request Feb 7, 2024
…d freq pass to asfreq when the index of df is a PeriodIndex'

(cherry picked from commit cb97ce6)
MarcoGorelli added a commit that referenced this pull request Feb 7, 2024
…period freq pass to asfreq when the index of df is a PeriodIndex) (#57292)

'Backport PR #56945: ENH: raise ValueError if invalid period freq pass to asfreq when the index of df is a PeriodIndex'

(cherry picked from commit cb97ce6)

Co-authored-by: Natalia Mokeeva <91160475+natmokval@users.noreply.github.com>
pmhatre1 pushed a commit to pmhatre1/pandas-pmhatre1 that referenced this pull request May 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Frequency DateOffsets
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants