Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resampling uses inconsistent labeling for sub-daily and super-daily frequencies #9586

Open
shoyer opened this issue Mar 4, 2015 · 16 comments
Assignees
Labels
Bug Frequency DateOffsets Resample resample method

Comments

@shoyer
Copy link
Member

shoyer commented Mar 4, 2015

xref #2665
xref #5440

Resample appears to be use an inconsistent label convention depending on whether the target frequency is sub-daily/daily or super-daily:

  • For sub-daily/daily frequencies, label='left' makes labels at the timestamp corresponding to the start of each frequency bin, and label='right' that makes labels at that timestamp plus the frequency (at the timestamp dividing exactly dividing bins).
  • For super-daily frequencies, both labels appears to shifted minus one day to the left, so the timestamps no longer cleanly divide the frequencies. Moreover, the default label shifts from 'left' to 'right'! My guess is that the default was changed here because users were confused by label='left' no longer falling inside the expected interval. (I guess I could check git blame for the details.)

I found this behavior quite surprising and confusing. Is it intentional? I would like to rationalize this if possible, because this strikes me as very poor design. The behavior also couples in a weird way with the closed argument (see the linked issues).

From my perspective (as someone who uses monthly and yearly data), the sub-daily/daily behavior makes sense and the super-daily behavior is a bug: there's no particular reason why it makes sense to use 1 day as an offset for frequencies with super-daily resolution.

CC @Cd48 @kdebrab


Here's my test script:

for orig_freq, target_freq in [('20s', '1min'), ('20min', '1H'), ('10H', '1D'),
                               ('3D', '10D'), ('10D', '1M'), ('1M', 'Q'), ('3M', 'A')]:
    print '%s -> %s:' % (orig_freq, target_freq)
    ind = pd.date_range('2000-01-01', freq=orig_freq, periods=10)
    s = pd.Series(np.arange(10), ind)
    print 'default', s.resample(target_freq, how='first').index[0]
    print 'left', s.resample(target_freq, label='left', how='first').index[0]
    print 'right', s.resample(target_freq, label='right', how='first').index[0]
20s -> 1min:
default 2000-01-01 00:00:00
left 2000-01-01 00:00:00
right 2000-01-01 00:01:00
20min -> 1H:
default 2000-01-01 00:00:00
left 2000-01-01 00:00:00
right 2000-01-01 01:00:00
10H -> 1D:
default 2000-01-01 00:00:00
left 2000-01-01 00:00:00
right 2000-01-02 00:00:00
3D -> 10D:
default 2000-01-01 00:00:00
left 2000-01-01 00:00:00
right 2000-01-11 00:00:00
10D -> 1M:
default 2000-01-31 00:00:00
left 1999-12-31 00:00:00
right 2000-01-31 00:00:00
1M -> Q:
default 2000-03-31 00:00:00
left 1999-12-31 00:00:00
right 2000-03-31 00:00:00
3M -> A:
default 2000-12-31 00:00:00
left 1999-12-31 00:00:00
right 2000-12-31 00:00:00
@shoyer shoyer added Datetime Datetime data dtype API Design Resample resample method labels Mar 4, 2015
@shoyer
Copy link
Member Author

shoyer commented Mar 4, 2015

OK, after digging more deeply into it.... I discover that M corresponds to the offset "month end", which apparently means the start of the last day of the month. To get "month start", I need to use MS (or likewise, QS or AS).

This is... deeply non-intuitive.

I wish there was a way to change this without breaking a bunch of existing code.

I suppose we could add ME, etc., for month end, but changing the offset M from month-end to month-start seems like a non-starter. Ugh. So I guess we're left with a documentation issue (#5023), unless we want to add a hack for resample.

@jreback
Copy link
Contributor

jreback commented Mar 4, 2015

IIRC M matches what scikit-timeseries established, a long established pattern. You could change it but that would prob require a long deprecation cycle (IMHO not worth it).

also xref to #9528 as the fill args to .resample are different from up/down resampling. Further I think the docs to resample need much more. my2c.

@MarcoGorelli
Copy link
Member

IIRC M matches what scikit-timeseries established, a long established pattern. You could change it but that would prob require a long deprecation cycle (IMHO not worth it).

Given how confusing 'M' is, I think it might actually be worth it

@MarcoGorelli
Copy link
Member

MarcoGorelli commented Mar 2, 2023

Right, time for another @pandas-dev/pandas-core @pandas-dev/pandas-triage tag

Would anyone object to a deprecation cycle to remove 'M' and rename it to 'ME'? Likewise for 'Y' -> 'YE'

It's a really common source of confusion, nobody expects 'M' to mean "month end" (or 'Y' to mean "year end") the first time they see it

Concretely, I'd suggest:

  • pandas 2.1: passing freq='M' warns users that 'M' is deprecated and to use 'MS' (month start) or 'ME' (month end) instead
  • pandas 3.0: passing freq='M' errors, advising users to use 'MS' (month start) or 'ME' (month end) instead

@attack68
Copy link
Contributor

attack68 commented Mar 2, 2023

FWIW in finance the APIs typically use ME for "month end" and M for "monthly" when dealing with scheduling. They might have different contexts but M is never used for month end. I would support the change even if its a long way to go.

@datapythonista
Copy link
Member

Not sure if there is any convention followed by other apps, or anything, but seems like an improvement. The current API seems quite confusing indeed. And changing the behavior seems easy if M won't be used by anything else after the change, which would make the change way trickier.

Just to be clear, we're talking about the rule (not freq) parameter of DataFrame.resample (and I guess Series.resample), right? Or am I missing something?

@WillAyd
Copy link
Member

WillAyd commented Mar 2, 2023

I'd be OK with this deprecation. I get bit by this all of the time

@MarcoGorelli
Copy link
Member

Just to be clear, we're talking about the rule (not freq)

Yes, that's right, thanks - in resample it's rule and in date_range it's freq, but that's what I meant

@jorisvandenbossche
Copy link
Member

Would anyone object to a deprecation cycle to remove 'M' and rename it to 'ME'? Likewise for 'Y' -> 'YE'

You only mention monthly and yearly, but the other frequencies that default to right might have the same issue? (although less used): quarterly (Q), weekly (W), and then the business version of monthly/quarterly/yearly.

@MarcoGorelli
Copy link
Member

Thanks - yes, those too

@jbrockmendel
Copy link
Member

if M won't be used by anything else after the change

"M" and "Y" would presumably still be used for Period/PeriodDtype.

There are occasional issues with people being surprised that freqs which are valid for date_range are not valid for Period, so it might be helpful to make a cleaner separation between the two concepts. e.g. #38859, #13871, #5091

@natmokval
Copy link
Contributor

natmokval commented Mar 17, 2023

I would like to work on this issue.
Considering what @MarcoGorelli suggested:

Concretely, I'd suggest:

* pandas 2.1: passing `freq='M'` warns users that `'M'` is deprecated and to use `'MS'` (month start) or `'ME'` (month end) instead

* pandas 3.0: passing `freq='M'` errors, advising users to use `'MS'` (month start) or `'ME'` (month end) instead

I’ll start with warnings while passing freq=‘M’.

@AlexKirko
Copy link
Member

When I worked in finance, we would do present value valuations to calculate interest rate risk. These kind of inconsistencies were present in some of the software, and they were a nightmare to detect and circumvent. Securities are quite standartized, so if bonds in a particular country mostly pay the coupon at the start of a period, they do so no matter what the period is.

Fixing this would save a lot of people somewhere a lot of hours.

@MarcoGorelli
Copy link
Member

"M" and "Y" would presumably still be used for Period/PeriodDtype.

Are we sure we want this?

There's an example in the docs which shows:

In[366]: p = pd.Period("2014-07", freq="M")

In[367]: p + pd.offsets.MonthEnd(3)
Out[367]: Period('2014-10', 'M')

In[368]: p + pd.offsets.MonthBegin(3)
Traceback
   ...
ValueError: Input has different freq from Period(freq=M)

@jbrockmendel If the prefix were to stay as "M" for Period but "ME" for the offsets, then wouldn't that cause confusion when people try to sum an offset to a Period? Would people be expected to know that MonthEnd is the offset corresponding to the Period 'M'?

I think it might be simpler (and easier to teach) to just use 'MS' and 'ME', without special-casing Period

@MarcoGorelli
Copy link
Member

From today's call: seeing as long-term the idea is to decouple Period from Offsets, both p + pd.offsets.MonthEnd(3) and p + pd.offsets.MonthStart(3) would raise, and so it would probably be best to keep 'M' for Period

@jbrockmendel
Copy link
Member

From today's call: seeing as long-term the idea is to decouple Period from Offsets

To be clear, this is something id like to see, but have no concrete plans to actually implement. There hasn't been a targeted discussion of the idea, except for the mention of it yesterday and lack of objection.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Frequency DateOffsets Resample resample method
Projects
None yet
Development

No branches or pull requests