Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: non-standard frequency Period arithmetic #23878

Closed
aa1371 opened this issue Nov 23, 2018 · 7 comments · Fixed by #23915
Closed

BUG: non-standard frequency Period arithmetic #23878

aa1371 opened this issue Nov 23, 2018 · 7 comments · Fixed by #23915
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug Period Period data type
Milestone

Comments

@aa1371
Copy link
Contributor

aa1371 commented Nov 23, 2018

Code Sample

starting_date = pd.Period('20180101', freq='24H') 
base_date = pd.Period('19700101', freq='24H')
offset = starting_date - base_date
base_date + offset == starting_date  # -> False   should be True
base_date + offset == (pd.Period(str(base_date), freq='H') + (24*offset)).asfreq('24H')  # -> True  This is how it's actually being processed now
(pd.Period(str(base_date), freq='H') + offset).asfreq('24H') == starting_date  # -> True  This is what needs to be done now to get the desired result

Problem description

In short: (A - B) + B != A

The result of diffing two periods of a scaled frequency is represented in the number of periods of the base frequency, not the scaled frequency. Thus when adding back to the scaled frequency the resulting Period will be N (where N is the scaling factor) times more periods than should be added.

The 2 options that don't involve some sort of custom delta object don't really work well:

  1. offset divided by the scaling factor
  • This would not work for some scaling factors (e.g. '5H').
  1. adding an int to a Period should not scale the int by the scaling factor, but rather just add that number of base periods.
  • This breaks the logic that, for example, pd.Period('19700101', freq='3H') + 1 == pd.Period('19700101 03:00', freq='3H'), which I don't think should be compromised

Suggestion:

A simple PeriodDelta object that just contains (int, freq), where the int is the current resulting int. When added to a period, if the frequency matches the PeriodDelta freq, then the Period will be cast to it's base frequency the int added and then cast back. Ints will still be valid input, but the difference of Periods could return this object instead.

Output of pd.show_versions()

v0.23.4

@jreback
Copy link
Contributor

jreback commented Nov 23, 2018

we already rejected PeriodDelta as pretty niche and not worth the supports costs

that said some of the above might be bugs

@aa1371
Copy link
Contributor Author

aa1371 commented Nov 24, 2018

I think that ints are fine to represent period differences except it needs some modifications/restrictions (of course, I'm not aware of any of the original decisions on how this should work, so please let me know if there is any reason these wouldn't work):

  1. The int should represent the number of scaled periods (since this is how addition works), not base periods
  2. Periods must be in phase - required for (1) to work.
  • In phase: pd.Period('20180101 09:00', freq='5H') - pd.Period('20180101 04:00', freq='5H')
  • Out of phase: pd.Period('20180101 04:00', freq='5H') - pd.Period('20180101 00:00', freq='5H')

I would also consider allowing something like a 2-item tuple of ints, that add would also understand, to allow for out of phase operations. Which would just be (, ).

These would be easy to implement, add almost no support cost and also make the Period operations much more reliable.

Example Implementation (allowing for out of phase arithmetic):

def __sub__(self, other):
    ret_val = super().__sub__(other)  # not actually super, just placeholder for existing logic
    return divmod(ret_val, self.freq_scaling_factor)  # not sure how to actually access this right now

def __add__(self, other):
    if is_2_item_tuple_of_ints(other):
        other_val = other[0] * self. self.freq_scaling_factor + other[1]
        return (self.astype(self.base_freq) + other_val).astype(self.freq)  # not perfect, astype doesn't always behave how I would want for this to work off the bat
    else:
        return self + other

Example Implementation (for no out of phase operations):

def __sub__(self, other):
    ret_val = super().__sub__(other)
    periods, out_of_phase = divmod(ret_val, self.freq_scaling_factor)
    if out_of_phase:
        raise ValueError('out of phase')
    return periods

@jbrockmendel
Copy link
Member

Is this in master? We fixed a very similar bug recently (though that may have been for PeriodIndex and not Period)

@aa1371
Copy link
Contributor Author

aa1371 commented Nov 24, 2018

@jbrockmendel - I was on 23.4, but looks like Master is still wrong.

So from what I can tell the pandas.tseries.offsets.Hour object when added to a Period object, only factors in the base frequency. So you'll get this:

>>> str(pd.Period('20180101', freq='H') + Hour(10)) == str(pd.Period('20180101', freq='5H') + Hour(10))  # note the freqs

This would not be a problem (in fact it behaves the way I proposed in option 2 in the original post), however, it seems when diffing scaled freqs the scaling factor is multiplied by the base hours, this is unnecessary (and wrong) because of what was deomstrated above. So all that would need to be done is remove the multiplication of the resulting offset.

>>> pd.Period('20180101 00:00', freq='12H') - pd.Period('19700101 00:00', freq='12H')
<5049216 * Hours>  # should be 5049216/12

@aa1371
Copy link
Contributor Author

aa1371 commented Nov 25, 2018

@jbrockmendel - I believe the fix would just be to change the following line:

pandas/_libs/tslibs/period.pyx:1688

return (self.ordinal - other.ordinal) * self.freq

To

return (self.ordinal - other.ordinal) * type(self.freq)()  # couldn’t find a better way to get the base freq

@jbrockmendel
Copy link
Member

Great, a PR would be welcome.

@aa1371
Copy link
Contributor Author

aa1371 commented Nov 26, 2018

@jbrockmendel - PR created #23915

@gfyoung gfyoung added Bug Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Period Period data type labels Nov 26, 2018
@jreback jreback modified the milestones: 0.24.0, Contributions Welcome Dec 2, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug Period Period data type
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants