BUG: Fix some PeriodIndex resampling issues #16153

Merged
20 commits
ca7b6f2
CLN: move PeriodIndex binning code to TimeGrouper
winklerand Apr 26, 2017
c27f430
TST/CLN: raise error when resampling with on= or level= selection
winklerand Apr 26, 2017
390e16e
BUG: resampling PeriodIndex now returns PeriodIndex (GH 12884, 15944)
winklerand Apr 26, 2017
23566c2
BUG: OHLC-upsampling of PeriodIndex now returns DataFrame (GH 13083)
winklerand Apr 26, 2017
a82879d
BUG: enable resampling with NaT in PeriodIndex (GH 13224)
winklerand Apr 26, 2017
4b1c740
CLN: remove warning on falling back to tstamp resampling with loffset
winklerand Apr 30, 2017
73c0990
CLN: use memb._isnan for NaT masking
winklerand May 1, 2017
fa6c1d3
DOC: added issue reference for OHLC resampling
winklerand May 1, 2017
7ea04e9
STYLE: added blank lines
winklerand May 1, 2017
82a8275
TST: convert to parametrized tests / pytest idiom
winklerand May 6, 2017
432c623
CLN/TST: call assert_almost_equal() when comparing Series/DataFrames
winklerand May 6, 2017
c8814fb
STYLE: added blank lines, removed odd whitespace, fixed typo
winklerand May 13, 2017
486ad67
TST: add test case for multiple consecutive NaTs in PeriodIndex
winklerand May 13, 2017
ad8519f
TST/DOC: added issue number to test case
winklerand May 13, 2017
39fc7e2
TST: consolidate test_asfreq_downsample, test_asfreq_upsample -> test…
winklerand May 13, 2017
efcad5b
TST: set fixtures to default function scoping
winklerand May 13, 2017
41401d4
TST: convert constant 'setup-like' values/objects to pytest fixtures
winklerand May 13, 2017
398a684
DOC: whatsnew v0.21.0 entry (in API changes section)
winklerand May 21, 2017
8358c41
fixups
jreback Sep 28, 2017
6084e0c
moar whatsnew
jreback Sep 29, 2017
76 changes: 76 additions & 0 deletions doc/source/whatsnew/v0.21.0.txt
@@ -171,6 +171,82 @@ Other Enhancements
Backwards incompatible API changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. _whatsnew_0210.api_breaking.period_index_resampling:

``PeriodIndex`` resampling
^^^^^^^^^^^^^^^^^^^^^^^^^^

In previous versions of pandas, resampling a ``Series``/``DataFrame`` indexed by a ``PeriodIndex`` returned a ``DatetimeIndex`` in some cases (:issue:`12884`). Resampling to a multiplied frequency now returns a ``PeriodIndex`` (:issue:`15944`). As a minor enhancement, resampling a ``PeriodIndex`` can now handle ``NaT`` values (:issue:`13224`).

Previous Behavior:

.. code-block:: ipython

In [1]: pi = pd.period_range('2017-01', periods=12, freq='M')
Contributor

Put the setup statements that do not change (e.g. [1], [2]) in a separate ipython block above, to avoid repeating them under both the previous and the new behavior.


In [2]: s = pd.Series(np.arange(12), index=pi)

In [3]: resampled = s.resample('2Q').mean()

In [4]: resampled
Out[4]:
2017-03-31 1.0
2017-09-30 5.5
2018-03-31 10.0
Freq: 2Q-DEC, dtype: float64

In [5]: resampled.index
Out[5]: DatetimeIndex(['2017-03-31', '2017-09-30', '2018-03-31'], dtype='datetime64[ns]', freq='2Q-DEC')

New Behavior:

.. ipython:: python

pi = pd.period_range('2017-01', periods=12, freq='M')

s = pd.Series(np.arange(12), index=pi)

resampled = s.resample('2Q').mean()

resampled

resampled.index
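To make the new labelling concrete, here is a minimal pure-Python sketch of the grouping for the example above. It is not pandas code: it assumes twelve monthly values starting in January and two-quarter bins that start at the first quarter present, and the function name `mean_by_two_quarters` is hypothetical.

```python
def mean_by_two_quarters(values, year=2017):
    """Group monthly values (January first) into two-quarter bins.

    Hypothetical helper: each bin is labelled by the quarter that
    opens it, mimicking period-style labels such as '2017Q1'.
    """
    groups = {}
    for month_idx, value in enumerate(values):
        quarter = month_idx // 3                  # 0..3 within the year
        bin_start_quarter = (quarter // 2) * 2    # 0 or 2
        label = "%dQ%d" % (year, bin_start_quarter + 1)
        groups.setdefault(label, []).append(value)
    # Mean of each bin, in order of first appearance
    return {label: sum(vs) / len(vs) for label, vs in groups.items()}


result = mean_by_two_quarters(range(12))
print(result)  # {'2017Q1': 2.5, '2017Q3': 8.5}
```

Each label names the period that opens the bin, in contrast to the month-end timestamps shown under the previous behavior.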


Upsampling and calling ``.ohlc()`` previously returned a ``Series``, essentially identical to calling ``.asfreq()``. OHLC upsampling now returns a ``DataFrame`` with columns ``open``, ``high``, ``low`` and ``close`` (:issue:`13083`). This is consistent with downsampling and ``DatetimeIndex`` behavior.

Previous Behavior:

.. code-block:: ipython

In [1]: pi = pd.PeriodIndex(start='2000-01-01', freq='D', periods=10)

In [2]: s = pd.Series(np.arange(10), index=pi)

In [3]: s.resample('H').ohlc()
Out[3]:
2000-01-01 00:00 0.0
...
2000-01-10 23:00 NaN
Freq: H, Length: 240, dtype: float64

In [4]: s.resample('M').ohlc()
Out[4]:
open high low close
2000-01 0 9 0 9

New Behavior:

.. ipython:: python

pi = pd.PeriodIndex(start='2000-01-01', freq='D', periods=10)

s = pd.Series(np.arange(10), index=pi)

s.resample('H').ohlc()

s.resample('M').ohlc()
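The difference between the two calls can be sketched in pure Python. This illustrates the semantics only and is not the pandas implementation: the helpers `ohlc` and `ohlc_by_bins` are hypothetical, `None` stands in for NaN, and the hourly case assumes the default start convention, so each daily value lands in the first hour of its day.

```python
def ohlc(values):
    """Return (open, high, low, close) for one bin, or None if it is empty."""
    if not values:
        return None
    return (values[0], max(values), min(values), values[-1])


def ohlc_by_bins(values, bins):
    """Apply ohlc() to consecutive slices; bins holds each slice's right edge."""
    rows, prev = [], 0
    for edge in bins:
        rows.append(ohlc(values[prev:edge]))
        prev = edge
    return rows


daily = list(range(10))

# Downsampling: all ten days fall into a single monthly bin.
monthly = ohlc_by_bins(daily, [10])
print(monthly)  # [(0, 9, 0, 9)]

# Upsampling (first two days only): 24 hourly bins per day, and only
# the first hour of each day receives a value.
hourly_bins = [1] * 24 + [2] * 24
hourly = ohlc_by_bins(daily[:2], hourly_bins)
print(hourly[0], hourly[1], hourly[24])  # (0, 0, 0, 0) None (1, 1, 1, 1)
```

In both directions every bin yields four values, which is why a four-column result is the consistent shape even when most upsampled bins are empty.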


.. _whatsnew_0210.api_breaking.deps:

132 changes: 74 additions & 58 deletions pandas/core/resample.py
@@ -14,7 +14,7 @@
from pandas.core.indexes.datetimes import DatetimeIndex, date_range
from pandas.core.indexes.timedeltas import TimedeltaIndex
from pandas.tseries.offsets import DateOffset, Tick, Day, _delta_to_nanoseconds
from pandas.core.indexes.period import PeriodIndex, period_range
from pandas.core.indexes.period import PeriodIndex
import pandas.core.common as com
import pandas.core.algorithms as algos
from pandas.core.dtypes.generic import ABCDataFrame, ABCSeries
@@ -834,53 +834,32 @@ class PeriodIndexResampler(DatetimeIndexResampler):
def _resampler_for_grouping(self):
return PeriodIndexResamplerGroupby

def _get_binner_for_time(self):
if self.kind == 'timestamp':
return super(PeriodIndexResampler, self)._get_binner_for_time()
return self.groupby._get_period_bins(self.ax)

def _convert_obj(self, obj):
obj = super(PeriodIndexResampler, self)._convert_obj(obj)

offset = to_offset(self.freq)
if offset.n > 1:
if self.kind == 'period': # pragma: no cover
print('Warning: multiple of frequency -> timestamps')

# Cannot have multiple of periods, convert to timestamp
if self._from_selection:
# see GH 14008, GH 12871
msg = ("Resampling from level= or on= selection"
" with a PeriodIndex is not currently supported,"
" use .set_index(...) to explicitly set index")
raise NotImplementedError(msg)

if self.loffset is not None:
# Cannot apply loffset/timedelta to PeriodIndex -> convert to
# timestamps
Contributor

Should we show a warning for this?

Member

I think we should, given that we more or less did that before (it was a print statement with a warning!?).

self.kind = 'timestamp'

# convert to timestamp
if not (self.kind is None or self.kind == 'period'):
if self._from_selection:
# see GH 14008, GH 12871
msg = ("Resampling from level= or on= selection"
" with a PeriodIndex is not currently supported,"
" use .set_index(...) to explicitly set index")
raise NotImplementedError(msg)
else:
obj = obj.to_timestamp(how=self.convention)
if self.kind == 'timestamp':
obj = obj.to_timestamp(how=self.convention)

return obj

def aggregate(self, arg, *args, **kwargs):
result, how = self._aggregate(arg, *args, **kwargs)
if result is None:
result = self._downsample(arg, *args, **kwargs)

result = self._apply_loffset(result)
return result

agg = aggregate

def _get_new_index(self):
""" return our new index """
ax = self.ax

if len(ax) == 0:
values = []
else:
start = ax[0].asfreq(self.freq, how=self.convention)
end = ax[-1].asfreq(self.freq, how='end')
values = period_range(start, end, freq=self.freq).asi8

return ax._shallow_copy(values, freq=self.freq)

def _downsample(self, how, **kwargs):
"""
Downsample the cython defined function
@@ -898,22 +877,17 @@ def _downsample(self, how, **kwargs):
how = self._is_cython_func(how) or how
ax = self.ax

new_index = self._get_new_index()

# Start vs. end of period
memb = ax.asfreq(self.freq, how=self.convention)

if is_subperiod(ax.freq, self.freq):
# Downsampling
if len(new_index) == 0:
bins = []
else:
i8 = memb.asi8
rng = np.arange(i8[0], i8[-1] + 1)
bins = memb.searchsorted(rng, side='right')
grouper = BinGrouper(bins, new_index)
return self._groupby_and_aggregate(how, grouper=grouper)
return self._groupby_and_aggregate(how, grouper=self.grouper)
elif is_superperiod(ax.freq, self.freq):
if how == 'ohlc':
# GH #13083
# upsampling to subperiods is handled as an asfreq, which works
# for pure aggregating/reducing methods
# OHLC reduces along the time dimension, but creates multiple
# values for each period -> handle by _groupby_and_aggregate()
return self._groupby_and_aggregate(how, grouper=self.grouper)
return self.asfreq()
elif ax.freq == self.freq:
return self.asfreq()
@@ -936,19 +910,16 @@ def _upsample(self, method, limit=None, fill_value=None):
.fillna

"""
if self._from_selection:
raise ValueError("Upsampling from level= or on= selection"
" is not supported, use .set_index(...)"
" to explicitly set index to"
" datetime-like")

# we may need to actually resample as if we are timestamps
if self.kind == 'timestamp':
return super(PeriodIndexResampler, self)._upsample(
method, limit=limit, fill_value=fill_value)

self._set_binner()
ax = self.ax
obj = self.obj
new_index = self._get_new_index()
new_index = self.binner

# Start vs. end of period
memb = ax.asfreq(self.freq, how=self.convention)
@@ -1293,6 +1264,51 @@ def _get_time_period_bins(self, ax):

return binner, bins, labels

def _get_period_bins(self, ax):
if not isinstance(ax, PeriodIndex):
raise TypeError('axis must be a PeriodIndex, but got '
'an instance of %r' % type(ax).__name__)

memb = ax.asfreq(self.freq, how=self.convention)

# NaT handling as in pandas._libs.lib.generate_bins_dt64()
nat_count = 0
if memb.hasnans:
nat_count = np.sum(memb._isnan)
memb = memb[~memb._isnan]

# if index contains no valid (non-NaT) values, return empty index
if not len(memb):
binner = labels = PeriodIndex(
Contributor

You could use _shallow_copy here, but this is OK.

Contributor @jreback, Sep 29, 2017

I left this; OK for now.

data=[], freq=self.freq, name=ax.name)
return binner, [], labels

start = ax.min().asfreq(self.freq, how=self.convention)
end = ax.max().asfreq(self.freq, how='end')

labels = binner = PeriodIndex(start=start, end=end,
freq=self.freq, name=ax.name)

i8 = memb.asi8
freq_mult = self.freq.n

# when upsampling to subperiods, we need to generate enough bins
expected_bins_count = len(binner) * freq_mult
i8_extend = expected_bins_count - (i8[-1] - i8[0])
rng = np.arange(i8[0], i8[-1] + i8_extend, freq_mult)
rng += freq_mult
bins = memb.searchsorted(rng, side='left')

if nat_count > 0:
# NaT handling as in pandas._libs.lib.generate_bins_dt64()
Contributor

Is this path tested sufficiently, e.g. with 0, 1, 2 NaT?

Contributor Author

Added a test case for consecutive NaTs in the index (1cad7fa).

This should be sufficiently tested. Cases covered:

  • 0 NaT: basically all other resampling tests
  • multiple single NaTs (at the beginning, inside, and at the end of the index)
  • consecutive NaTs (at the beginning, inside, and at the end of the index)

Any ideas for more exhaustive test cases?

# shift bins by the number of NaT
bins += nat_count
bins = np.insert(bins, 0, nat_count)
binner = binner.insert(0, tslib.NaT)
labels = labels.insert(0, tslib.NaT)

return binner, bins, labels
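The bin arithmetic in `_get_period_bins` can be tried out without pandas. What follows is a rough pure-Python sketch under stated assumptions: `period_bin_edges` is a hypothetical name, `bisect_left` stands in for `searchsorted(..., side='left')`, `None` stands in for NaT, and the ordinals are assumed to be sorted and already converted to the target frequency.

```python
from bisect import bisect_left


def period_bin_edges(ordinals, n_labels, freq_mult=1):
    """Hypothetical helper mirroring the bin arithmetic in the diff.

    ordinals:  sorted integer period ordinals, None standing in for NaT
    n_labels:  number of labels in the target index (len(binner))
    freq_mult: multiple of the target frequency (self.freq.n)
    """
    valid = [o for o in ordinals if o is not None]
    nat_count = len(ordinals) - len(valid)
    if not valid:
        # No valid (non-NaT) values: no bins at all
        return []

    first, last = valid[0], valid[-1]
    # When upsampling, extend the range so every label gets a bin
    expected_bins_count = n_labels * freq_mult
    extend = expected_bins_count - (last - first)
    # Right edge of each bin, in ordinal units
    edges = [e + freq_mult for e in range(first, last + extend, freq_mult)]
    bins = [bisect_left(valid, e) for e in edges]

    if nat_count:
        # Shift by the number of NaTs and prepend a bin that holds them
        bins = [nat_count] + [b + nat_count for b in bins]
    return bins


# Twelve months expressed as quarter ordinals, resampled to '2Q'
quarters = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]
print(period_bin_edges(quarters, n_labels=2, freq_mult=2))  # [6, 12]

# Two leading NaTs: an extra first bin covers them
print(period_bin_edges([None, None, 0, 0, 1, 1], n_labels=2))  # [2, 4, 6]
```

The returned numbers are positions in the original index at which each bin ends, so `[6, 12]` means the first label covers the first six months and the second label covers the rest.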


def _take_new_index(obj, indexer, new_index, axis=0):
from pandas.core.api import Series, DataFrame