resample() should allow a "null" fill_method #11217

Open · jd opened this issue Oct 2, 2015 · 24 comments

@jd (Contributor) commented Oct 2, 2015

I'm using resample() to aggregate data over a time frame.

>>> s = pandas.Series((3, 4), (pandas.Timestamp("2014-1-1"), pandas.Timestamp("2015-1-1")))
>>> x = s.resample("1s")
>>> len(x)
31536001

With such a call, resample fills all (31536001 - 2) non-existent values with NaN, which ends up creating millions of points and makes Python use 500 MB+ of RAM. The thing is, I don't care about the NaN points, so I would like to avoid filling them into the Series and using so much memory. AFAICS resample does not offer such a fill_method.

@jreback (Contributor) commented Oct 2, 2015

this is trivially done with

s.resample('1s').dropna()

@jd (Contributor, Author) commented Oct 2, 2015

@jreback Sure, but that still makes Pandas use 500 MB+ of RAM.

Things are actually worse for something like this:

>>> s = pandas.Series(range(1000), pandas.date_range('2014-1-1', periods=1000))
>>> s.resample('1s', how='median')

The result is pretty obvious and should take less than one second to compute. But since Pandas resamples with millions of points filled with NaN, it takes 10+ minutes to compute and uses 1.5 GB+ of RAM.
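(A quick way to see the blow-up, sketched with the modern deferred resample API; memory_usage is standard pandas, and the figures in the comments are rough estimates:)

out = s.resample('1s').median()      # ~86 million 1-second buckets for 1000 daily points
print(out.memory_usage(deep=True))   # ~1.4 GB for values plus index, nearly all NaN buckets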

@jreback (Contributor) commented Oct 2, 2015

you need to show what you are going to do with this

why are you resampling such a large period? what's the point

@jd (Contributor, Author) commented Oct 2, 2015

I don't understand why the reason is important here.
I compute statistics over a year with fine-grained resolution, down to the second.

@jreback (Contributor) commented Oct 2, 2015

a usecase/example of what you are doing would be helpful to understand the goal. e.g. are you merely snapping things to a nearby freq? or are you doing an actual resample but with sparse data?

the point of resample IS to give you a full-rank set for every point in time.

@TomAugspurger (Contributor) commented Oct 2, 2015

@jd does snap cover your needs?
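(For reference, a minimal sketch of what snap does, assuming a recent pandas; it moves each existing timestamp to the nearest occurrence of the frequency without inserting any new rows:)

import pandas as pd

idx = pd.DatetimeIndex(['2014-01-01 00:00:00.4', '2014-01-01 00:00:02.6'])
idx.snap(freq='s')  # DatetimeIndex(['2014-01-01 00:00:00', '2014-01-01 00:00:03'])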

@jd (Contributor, Author) commented Oct 2, 2015

@jreback Yeah, I'm doing a resampling (aggregation) using sparse data, but I'm not only snapping – there might be several values in a time bucket.

@TomAugspurger It does not, though I was looking for such a function anyway so thanks for the pointer :)

@jreback (Contributor) commented Oct 2, 2015

@jd

so the API for this could be one of the following:

  • Series.resample(...., fill_method='sparse') (which could return a sparse Series)
  • Series.resample(...., fill_method='drop') (which could return a Series, but with no NaN)
  • SparseSeries(....).resample(....) (which would by definition return a sparse Series)

none of these are implemented, but potentially useful.
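(To make the second option concrete, a hypothetical sketch; fill_method='drop' does not exist, only the dropna line below is real API:)

# Hypothetical, not implemented:
#   s.resample('1s', fill_method='drop')
# Same result with the existing API, but it materializes every NaN bucket first:
s.resample('1s').dropna()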

want to take a stab at this?

@jd (Contributor, Author) commented Oct 3, 2015

@jreback Hum, I have trouble figuring out the difference between sparse and drop, but I'm sure I would need one of those.

I don't think I know enough of Pandas' internals to implement that right now, but I'd be happy to test it and report back if you write some code. :)

Thanks a lot!

@jreback (Contributor) commented Oct 3, 2015

Here's a way to do a sparse resample. A resample is just a groupby. So we are grouping by a combination of the date and the second (e.g. to give us a freq of 's'). This only groups by the representative elements in the input index (e.g. your sparse points), so it generates at most n groups, where n is the length of the input.

In [24]: pd.options.display.max_rows=12

In [25]: np.random.seed(1234)

In [26]: i = date_range('20140101',periods=100000000,freq='s')
In [27]: s = Series(range(1000),index=i.take(np.random.randint(0,len(i),size=1000)))

In [28]: s
Out[28]: 
2014-07-19 19:23:27      0
2014-09-23 09:00:37      1
2014-01-12 07:42:04      2
2016-07-01 18:35:32      3
2014-07-24 05:43:52      4
2016-11-27 19:22:33      5
                      ... 
2015-01-04 08:11:55    994
2016-09-26 02:24:18    995
2016-03-01 18:48:41    996
2014-03-15 01:18:10    997
2015-07-14 06:45:46    998
2016-03-09 08:47:23    999
dtype: int64

In [29]: s.groupby([s.index.date,s.index.second]).sum()
Out[29]: 
2014-01-03  9      52
            50    414
2014-01-05  31    535
2014-01-07  16    937
2014-01-08  28    164
2014-01-12  4       2
                 ... 
2017-02-27  39    174
2017-03-01  5     536
            26    760
            31    208
            51    488
2017-03-03  7     810
dtype: int64
@jd (Contributor, Author) commented Oct 3, 2015

That really looks like a good way of approaching the solution. I probably lack the Pandas knowledge to understand how to map the groupby result to something closer to the output of resample, but it does look like that indeed. I see the result has an index and 2 columns; I'm not sure what the first column is for. Also, the index holds only dates, not timestamps. I guess it's just a matter of accessing and presenting the data the same way resample does afterwards.

Also, the by argument for groupby is really not obvious to me; does it make it possible to group by e.g. 3 minutes?

@jreback (Contributor) commented Oct 3, 2015

yes, this is a multi-groupby. It's a bit tricky, as the default way of doing this will create all of the groups, but here you only want some. So this could be built into resample as I proposed above. The idea is that you map your index (even using a function is enough).

You are essentially rounding the value to whatever interval you want, so here's sort of a trivial way to do this:

xref #4314 (e.g. we should simply define this on a Timestamp)

In [25]: def round(t, freq):
   ....:     # floor a Timestamp to a multiple of freq (integer division on the ns value)
   ....:     return pd.Timestamp((t.value // freq.delta.value) * freq.delta.value)
   ....: 

In [26]: from functools import partial

In [27]: s.groupby(partial(round,freq=pd.offsets.Minute(3))).sum()
Out[27]: 
2014-01-03 02:12:00     52
2014-01-03 11:00:00    414
2014-01-05 20:39:00    535
2014-01-07 07:03:00    937
2014-01-08 03:09:00    164
2014-01-12 07:42:00      2
                      ... 
2017-02-27 20:27:00    214
2017-03-01 04:51:00    488
2017-03-01 12:06:00    536
2017-03-01 15:45:00    208
2017-03-01 19:06:00    760
2017-03-03 00:06:00    810
dtype: int64

In [28]: s.groupby(partial(round,freq=pd.offsets.Minute(5))).sum()
Out[28]: 
2014-01-03 02:10:00     52
2014-01-03 11:00:00    414
2014-01-05 20:40:00    535
2014-01-07 07:00:00    937
2014-01-08 03:10:00    164
2014-01-12 07:40:00      2
                      ... 
2017-02-27 20:25:00    214
2017-03-01 04:50:00    488
2017-03-01 12:05:00    536
2017-03-01 15:45:00    208
2017-03-01 19:05:00    760
2017-03-03 00:05:00    810
dtype: int64
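(pandas has since grown floor/round methods on DatetimeIndex, used later in this thread, which collapse the helper above into a one-liner; a minimal sketch:)

# Same sparse resample once DatetimeIndex.floor() is available (xref #4314):
s.groupby(s.index.floor('3min')).sum()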
@jreback (Contributor) commented Oct 3, 2015

actually if you would like to add this to the timeseries.rst docs under the resample section that would be great

can add as an actual method at some point later

@shoyer (Member) commented Oct 4, 2015

+1 for fill_method='drop'

jreback added a commit that referenced this issue Oct 8, 2015
As discussed in #11217, there's another way of doing resampling that is not yet
covered by `resample` itself. Let's document that.
openstack-gerrit pushed a commit to openstack/gnocchi that referenced this issue Oct 8, 2015
This changes the resampling method we used to have by not doing any real
resampling like Pandas used to. The `resample` method from Pandas inserts a
lot of empty points filled with NaN if your time series is sparse –
which is a typical case in Carbonara/Gnocchi. This ends up creating time series
with millions of empty points, consuming hundreds of MB of memory for nothing.

This method, inspired by Jeff on pandas-dev/pandas#11217,
implements a simpler version of what `resample` does: it groups the samples by
timestamp, and then computes an aggregation method on them. This avoids creating
thousands of useless points and ends up being much faster while consuming a *LOT*
less memory.

Benchmarked: for a new time series with 10k measures and 10-80k points per
archive, this reduces the memory usage of metricd from 2 GB to 100 MB, and the
compute time of the most complicated aggregations like percentile from 15 min to
20 s (a 45× speed improvement).

Change-Id: I1b8718508bdd4633e7324949b76184efc3718ede
@joddm commented May 13, 2018

Any updates here?

@TomAugspurger (Contributor) commented May 16, 2018

Still open if you want to take a shot at fixing it!

@xpl commented Sep 29, 2018

You can also try this:

def resample(df, column, resolution, how='last'):
    # Use `column` as a DatetimeIndex if the frame is not already time-indexed.
    if not isinstance(df.index, pd.DatetimeIndex):
        df = df.set_index(column)
    # Floor each timestamp to its bucket start, then aggregate each bucket.
    return getattr(df.groupby(df.index.floor(resolution)), how)()

Examples:

resample (balances, 'timestamp', '1s', 'last')
resample (trades[['timestamp', 'amount']], 'timestamp', '1s', 'sum')
@eromoe commented May 13, 2019

It would be better to have fill_method='drop', because missing data should not be filled with zero by default. This caused a huge problem in my project.

@Peque (Contributor) commented Jun 10, 2019

@jreback You suggested (a long time ago in this thread):

s.resample('1s').dropna()

How would you recommend doing this nowadays? The way you suggested seems not to work with the latest Pandas version.

I would like to have a series resampled/grouped by '2D' and then .sum() the results, ending up with only the dates on which there was some data in the original series (i.e. I would rather have NaN where there is no data to sum, so that I can drop those values later).

An example:

>>> from pandas import date_range
>>> from pandas import Series
>>> dates = date_range('2019-01-01', '2019-01-10', freq='D')[[0, 4, 5, 8]]
>>> dates
DatetimeIndex(['2019-01-01', '2019-01-05', '2019-01-06', '2019-01-09'], dtype='datetime64[ns]', freq=None)
>>> series = Series(index=dates, data=[0, 1, 2, 3])
>>> series
2019-01-01    0
2019-01-05    1
2019-01-06    2
2019-01-09    3
dtype: int64
>>> series.resample('2D').sum()
2019-01-01    0
2019-01-03    0
2019-01-05    3
2019-01-07    0
2019-01-09    3
Freq: 2D, dtype: int64

Instead of the result I got, I was looking for:

2019-01-01    0
2019-01-05    3
2019-01-09    3
Freq: 2D, dtype: int64

Or at least (I don't mind if the full resampled index gets filled in memory):

2019-01-01    0
2019-01-03    NaN
2019-01-05    3
2019-01-07    NaN
2019-01-09    3
Freq: 2D, dtype: int64
@Peque (Contributor) commented Jun 11, 2019

So I came up with this, using min_count:

>>> series.resample('2D').sum(min_count=1).dropna()
2019-01-01    0.0
2019-01-05    3.0
2019-01-09    3.0
dtype: float64

Still fills the empty spaces with NaNs, but gets the job done for short series which can easily fit in memory.

@maowerner commented Jul 25, 2019

@TomAugspurger @jreback

"Still open if you want to take a shot at fixing it!"

Could you give me a few pointers on how to start? I would like to try implementing

Series.resample(...., fill_method='drop') (which could return a Series, but with no NaN)
@maowerner commented Jul 26, 2019

Hmm, right now I don't really see how to do this cleanly. The Resampler is created in

r = resample(

However, the fill_method parameter only enters in the following function call:

r, how=how, fill_method=fill_method, limit=limit

Furthermore, it is deprecated in favor of

.resample().method().fillmethod()

Making the Resampler depend on fill_method would only reintroduce the coupling into the code. And fill_method='drop' sounds like a contradiction anyway.

I think it might be cleaner to bypass resample and call round() and groupby as proposed above. It would, however, be nice to have a reference to this in resample.fillna().

Do you share this opinion? Is there somewhere else where this might be implemented? Wouldn't some way to group by approximate values, i.e. .groupby(round='1s'), be useful in other cases as well?

@TomAugspurger (Contributor) commented Jul 29, 2019

@maowerner commented Jul 30, 2019

So I tried this out by simply grouping by the rounded value. This works perfectly fine without intermediate objects. Below is a comparison with the workaround of @Peque. I would like to mention this (or a similar) example in the documentation of resample. I think I am not the only one who was misled into thinking that it is the correct function to group sparse data.

import numpy as np
import pandas as pd
from pandas import Series

np.random.seed(41)

i = pd.date_range('20140101',periods=100000000,freq='s')
s = Series(range(1000),index=i.take(np.random.randint(0,len(i),size=1000)))
%timeit s.resample('1H').sum(min_count=1).dropna()
2.4 ms ± 33.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit s.groupby(s.index.floor('1H')).sum()
1.67 ms ± 18.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit s.groupby(s.index.round('1H')).sum()
1.72 ms ± 9.61 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)