
PERF: groupby-fillna perf, implement in cython #11296

Closed
squeniart opened this Issue Oct 12, 2015 · 9 comments


squeniart commented Oct 12, 2015

Hello,

I work with a dataframe that has a MultiIndex (Date, InputTime) and may contain NaN values in its columns (Value, Id). I want to fill values forward, but only within each Date, and I can't find an efficient way to do this. I'm using pandas 0.16.2 and numpy 1.9.2.

Here is the type of dataframe I have:

[image: example dataframe]

And here is the result I want:

[image: expected result]

So to properly fill forward by date I can use groupby(level=0). The groupby call itself is fast, but the fill forward applied to each group is really too slow.

Here is the code I use to compare a simple fill forward (which doesn't give the expected result but runs very quickly) with the fill forward by date (which gives the expected result but is really too slow).

import numpy as np
import pandas as pd
import datetime as dt

# Show pandas & numpy versions
print('pandas '+pd.__version__)
print('numpy '+np.__version__)

# Build a big list of (Date,InputTime,Value,Id)
listdata = []
d = dt.datetime(2001,10,6,5)
for i in range(0,100000):
    listdata.append((d.date(), d, 2*i if i%3==1 else np.NaN, i if i%3==1 else np.NaN))
    d = d + dt.timedelta(hours=8)

# Create the dataframe with Date and InputTime as index
df = pd.DataFrame.from_records(listdata, index=['Date','InputTime'], columns=['Date', 'InputTime', 'Value', 'Id'])

# Simple Fill forward on index
start = dt.datetime.now()
for col in df.columns:
    df[col] = df[col].ffill()
end = dt.datetime.now()
print "Time to fill forward on index = " + str((end-start).total_seconds()) + " s"

# Fill forward on Date (first level of index)
start = dt.datetime.now()
for col in df.columns:
    df[col] = df[col].groupby(level=0).ffill()
end = dt.datetime.now()
print "Time to fill forward on Date only = " + str((end-start).total_seconds()) + " s"

Here are the time results I have:

[image: timing results]

So the fill forward on the grouped dataframe is 10000 times slower than the simple fill forward. I cannot understand why pandas is so slow here. I need performance comparable to the simple fill forward, i.e. just a couple of milliseconds.

Could somebody address the performance issue? Or give me a solution to do this kind of action in a very efficient way?

Thanks


Contributor

jreback commented Oct 12, 2015

this is a dupe of #7895. .ffill is not implemented in cython on a groupby operation (though it certainly could be), and instead calls python space on each group.
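To make "calls python space on each group" concrete: with many tiny groups, the total time is dominated by per-group Python overhead rather than the fill itself. A minimal pure-Python sketch of that slow path (not pandas internals; `groupby_ffill_slow` is a hypothetical name):

```python
import math

def groupby_ffill_slow(groups):
    """Forward-fill each group with a Python-level loop (sketch).

    `groups` is a list of lists of floats, NaN marking missing values.
    With tens of thousands of tiny groups, the per-group Python
    overhead dominates the total runtime.
    """
    out = []
    for g in groups:
        prev = math.nan
        filled = []
        for v in g:
            if math.isnan(v):
                filled.append(prev)  # propagate the last seen value
            else:
                prev = v
                filled.append(v)
        out.append(filled)
    return out
```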

here's an easy way to do this.

In [45]: df
Out[45]: 
                                 Value     Id
Date       InputTime                         
2001-10-06 2001-10-06 05:00:00     NaN    NaN
           2001-10-06 13:00:00       2      1
           2001-10-06 21:00:00     NaN    NaN
2001-10-07 2001-10-07 05:00:00     NaN    NaN
           2001-10-07 13:00:00       8      4
           2001-10-07 21:00:00     NaN    NaN
...                                ...    ...
2093-01-07 2093-01-07 13:00:00  199988  99994
           2093-01-07 21:00:00     NaN    NaN
2093-01-08 2093-01-08 05:00:00     NaN    NaN
           2093-01-08 13:00:00  199994  99997
           2093-01-08 21:00:00     NaN    NaN
2093-01-09 2093-01-09 05:00:00     NaN    NaN

[100000 rows x 2 columns]

# you MUST be sorted
In [46]: df.index.is_lexsorted()
Out[46]: True

In [47]: df3 = df.reset_index()

we take the indexer of the first element of each group. note you cannot use .first here as that would skip the nans.

In [48]: indexer = df3.groupby('Date',as_index=False).nth(0).index

In [49]: indexer
Out[49]: 
Int64Index([    0,     3,     6,     9,    12,    15,    18,    21,    24,    27,
            ...
            99972, 99975, 99978, 99981, 99984, 99987, 99990, 99993, 99996, 99999], dtype='int64', length=33334)

In [50]: df3.loc[indexer].head()
Out[50]: 
         Date           InputTime  Value  Id
0  2001-10-06 2001-10-06 05:00:00    NaN NaN
3  2001-10-07 2001-10-07 05:00:00    NaN NaN
6  2001-10-08 2001-10-08 05:00:00    NaN NaN
9  2001-10-09 2001-10-09 05:00:00    NaN NaN
12 2001-10-10 2001-10-10 05:00:00    NaN NaN

Make these a non-NaN value that you don't have in your frame.

In [51]: df3.loc[indexer,'Value'] = -1

pad the NaNs; since the frame is sorted and -1 sits in each group's first row, values will not propagate beyond groups. Then replace the -1 with nan again.

In [52]: df3.ffill().replace(-1,np.nan)
Out[52]: 
            Date           InputTime   Value     Id
0     2001-10-06 2001-10-06 05:00:00     NaN    NaN
1     2001-10-06 2001-10-06 13:00:00       2      1
2     2001-10-06 2001-10-06 21:00:00       2      1
3     2001-10-07 2001-10-07 05:00:00     NaN      1
4     2001-10-07 2001-10-07 13:00:00       8      4
5     2001-10-07 2001-10-07 21:00:00       8      4
...          ...                 ...     ...    ...
99994 2093-01-07 2093-01-07 13:00:00  199988  99994
99995 2093-01-07 2093-01-07 21:00:00  199988  99994
99996 2093-01-08 2093-01-08 05:00:00     NaN  99994
99997 2093-01-08 2093-01-08 13:00:00  199994  99997
99998 2093-01-08 2093-01-08 21:00:00  199994  99997
99999 2093-01-09 2093-01-09 05:00:00     NaN  99997

[100000 rows x 4 columns]

I did this on both columns to show the difference.

This is completely vectorized and will be quite efficient.
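For comparison, the same sorted-groups trick can be expressed directly in NumPy. This is a sketch, not a pandas API: `group_ffill` is a hypothetical helper that forward-fills values within runs of equal keys, assuming the rows are sorted by the group key as above:

```python
import numpy as np

def group_ffill(keys, values):
    """Forward-fill `values` within runs of equal `keys` (sorted input).

    Hypothetical helper: a vectorized NumPy sketch of the grouped
    ffill, assuming rows are already sorted by the group key.
    """
    n = len(values)
    out = values.astype(float)
    valid = ~np.isnan(out)
    # index of the most recent non-NaN value at or before each row
    last_valid = np.where(valid, np.arange(n), -1)
    np.maximum.accumulate(last_valid, out=last_valid)
    # index of the first row of the current group
    new_group = np.r_[True, keys[1:] != keys[:-1]]
    group_start = np.where(new_group, np.arange(n), 0)
    np.maximum.accumulate(group_start, out=group_start)
    # fill only when the donor row belongs to the same group
    fill = (last_valid >= group_start) & (last_valid >= 0)
    out[fill] = out[last_valid[fill]]
    return out
```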


Contributor

jreback commented Oct 12, 2015

actually we don't have an existing fillna-groupby perf issue...so will leave this one open

@jreback jreback reopened this Oct 12, 2015

@jreback jreback added this to the Next Major Release milestone Oct 12, 2015

@jreback jreback changed the title from Pandas Fill forward Performance Issue to PERF: groupby-fillna perf Oct 12, 2015

@jreback jreback changed the title from PERF: groupby-fillna perf to PERF: groupby-fillna perf, implement in cython Oct 12, 2015


squeniart commented Oct 12, 2015

Hello,

I've tested your workaround, but it only works on the specific example I gave because it makes assumptions about where the NaN values are. My example was built to produce a big dataframe for the issue, but it doesn't exactly reflect my real data.

Suppose this very simple dataframe:

[image: simple example dataframe]

If I apply your code to this example, here is what I get:

[image: workaround output]

As you can see, some non-NaN values have been replaced by NaN, which is not expected. And since the first value for a given date is not always NaN, the code doesn't behave as I would like.

Here is what I want to get after fill forward:

[image: expected output]

Furthermore, when I tested your code on the example, I measured times over a second, which is really too slow for my use case :-( I need to perform this operation many times in my code.

Thanks for your help


Contributor

jreback commented Oct 12, 2015

pls don't post pictures of frames, they are simply not helpful; instead show code to repro

your frame is likely not sorted


Contributor

jreback commented Oct 12, 2015

Error in the work-around

In [60]: df
Out[60]: 
                                Value
Date       Input Date                
2015-01-31 2015-02-01 09:00:00    NaN
           2015-02-01 10:00:00   5.00
           2015-03-01 09:00:00   5.25
           2015-03-01 10:00:00    NaN
           2015-04-01 08:00:00   5.50
2015-02-28 2015-03-01 09:00:00   6.00
           2015-03-01 10:00:00    NaN
           2015-04-01 07:00:00    NaN
           2015-04-01 08:00:00   6.30
2015-03-01 2015-04-01 07:00:00    NaN
           2015-04-01 08:00:00   7.00

In [61]: filler(df)
Out[61]: 
                                Value
Date       Input Date                
2015-01-31 2015-02-01 09:00:00    NaN
           2015-02-01 10:00:00   5.00
           2015-03-01 09:00:00   5.25
           2015-03-01 10:00:00   5.25
           2015-04-01 08:00:00   5.50
2015-02-28 2015-03-01 09:00:00   6.00
           2015-03-01 10:00:00   6.00
           2015-04-01 07:00:00   6.00
           2015-04-01 08:00:00   6.30
2015-03-01 2015-04-01 07:00:00    NaN
           2015-04-01 08:00:00   7.00

This is what I mean by copy-pastable

import pandas as pd
import numpy as np

Timestamp = pd.Timestamp
df = pd.DataFrame({'Value' : [np.nan,5,5.25,np.nan,5.5,6,np.nan,np.nan,6.3,np.nan,7]})
df.index = pd.MultiIndex.from_tuples([ (Timestamp('2015-01-31 00:00:00', offset='M'), Timestamp('2015-02-01 09:00:00')),
                                       (Timestamp('2015-01-31 00:00:00', offset='M'), Timestamp('2015-02-01 10:00:00')),
                                       (Timestamp('2015-01-31 00:00:00', offset='M'), Timestamp('2015-03-01 09:00:00')),
                                       (Timestamp('2015-01-31 00:00:00', offset='M'), Timestamp('2015-03-01 10:00:00')),
                                       (Timestamp('2015-01-31 00:00:00', offset='M'), Timestamp('2015-04-01 08:00:00')),
                                       (Timestamp('2015-02-28 00:00:00', offset='M'), Timestamp('2015-03-01 09:00:00')),
                                       (Timestamp('2015-02-28 00:00:00', offset='M'), Timestamp('2015-03-01 10:00:00')),
                                       (Timestamp('2015-02-28 00:00:00', offset='M'), Timestamp('2015-04-01 07:00:00')),
                                       (Timestamp('2015-02-28 00:00:00', offset='M'), Timestamp('2015-04-01 08:00:00')),
                                       (Timestamp('2015-03-01 00:00:00', offset='M'), Timestamp('2015-04-01 07:00:00')),
                                       (Timestamp('2015-03-01 00:00:00', offset='M'), Timestamp('2015-04-01 08:00:00'))],
                                     names=['Date','Input Date'])


def filler(x):
    names = x.index.names
    x = x.reset_index()
    indexer = x.groupby('Date', as_index=False).nth(0).index
    indexer = indexer[x.loc[indexer, 'Value'].isnull()]
    x.loc[indexer, 'Value'] = -1
    return x.ffill().replace(-1, np.nan).set_index(names)

And on the original frame

In [75]: %timeit filler(df)
1 loops, best of 3: 634 ms per loop

In [77]: df.info()
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 100000 entries, (2001-10-06 00:00:00, 2001-10-06 05:00:00) to (2093-01-09 00:00:00, 2093-01-09 05:00:00)
Data columns (total 2 columns):
Value    33333 non-null float64
Id       33333 non-null float64
dtypes: float64(2)
memory usage: 3.3+ MB

squeniart commented Oct 13, 2015

Sorry for the pics. I understand your point and I will provide the code to let you easily create the dataframes next time.

I confirm your last workaround works well, since I get the expected results. And the performance is better compared to:

  • a simple row-iteration algorithm that fills NA values except when the date changes
  • applying ffill on the groupby result

But performance is still 100 times slower than a simple fill forward, and unfortunately this is preventing me from using pandas in my project (~20-30 ms would be an acceptable time). Do you think this performance issue could be addressed in the near future?

Thank you a lot for your help!


Contributor

jreback commented Oct 13, 2015

@squeniart well, pull-requests are accepted.

If you are constantly calling this in a time-sensitive way then you are simply doing it wrong. Use caching or other techniques, or roll your own.

This is very easy to do in numba/cython.


vitteloil commented Nov 17, 2015

Hi, I'm from the same team as @squeniart.

I tried to see where the culprit is in pandas. It appears that when using groupby().fillna(method='ffill'), the cythonized code pad_2d (generated by generate_code.py) IS applied.

The problem is that, in the example from the first message of this issue, the pad_2d_XXX code is called 66668 times, on series with only 3 elements each. All the code around it is in python, and is also very slow.

  1. grouping operations = 30% of the total time
  2. fillna operations = 30% of the total time
  3. merge&concat operations = 40% of the total time

Figures taken from running cProfile on dataframe.ffill(); profile output available here: https://drive.google.com/file/d/0B3pyL0DQV74ic1p4dW5FaS12QVE/view?usp=sharing

I wonder if it would be acceptable, in a pull request, to add a method which does what we want. For example, dataframe.ffill_reset(column) would take a column name as a parameter and fill forward all the other columns according to that column. Every time the column's value changes, the fill forward stops and resets.

For example, this cython function would be the core of the new ffill_reset() function, and would be broken out for the different data types (from bool to floats, etc.):

def xpropagate_float64(ndarray[uint64_t, ndim=1] vdate,
                       ndarray[float64_t, ndim=1] vdata):

    cdef Py_ssize_t i
    cdef uint64_t date
    cdef uint64_t dateprev = 0   # sentinel: no previous date yet
    cdef float64_t val, valprev
    cdef Py_ssize_t vsize = vdata.shape[0]

    # Go through the date axis and fill NA values forward
    for i in range(vsize):
        date = vdate[i]
        val = vdata[i]

        # new date: reset the previous value, never fill across a boundary
        if date != dateprev:
            valprev = val
            dateprev = date
            continue

        # same date as the row before and the value is NA
        # => fill with the previous value (val != val tests for NaN)
        if val != val:
            vdata[i] = valprev
            continue

        # same date, value is not NA => just remember this value
        valprev = val
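A pure-Python reference implementation of the same reset-on-date-change logic may help pin down the intended semantics (hypothetical name, floats only, NaN marking missing values):

```python
import math

def ffill_reset(dates, values):
    """Forward-fill `values`, resetting whenever `dates` changes.

    Pure-Python reference for the proposed kernel above: a NaN is
    replaced by the last non-NaN value seen for the same date; the
    fill never crosses a date boundary.
    """
    out = list(values)
    prev_date = object()          # sentinel unequal to any real date
    prev_val = math.nan
    for i, (date, val) in enumerate(zip(dates, values)):
        if date != prev_date:     # new date: reset, never fill across
            prev_date = date
            prev_val = val
            continue
        if math.isnan(val):       # same date, missing value: fill
            out[i] = prev_val
        else:                     # same date, valid value: remember it
            prev_val = val
    return out
```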

Thanks for your suggestions and help on the matter.
Henri

@jreback jreback modified the milestones: Next Major Release, Next Minor Release Mar 29, 2017

@jreback jreback modified the milestones: Interesting Issues, Next Major Release Nov 26, 2017

@jreback jreback added this to Perf in Interesting Things Nov 26, 2017

@mroeschke mroeschke referenced this issue Jan 10, 2018: PERF: Discrepancy in groupby methods #19165 (Open, 4 of 7 tasks complete)

Member

WillAyd commented Feb 10, 2018

@jreback while I'm working on Cython optimizations I can take a look at this one. Just curious if we view the fact that ffill and bfill retain the grouping in its own column as a feature or something up for discussion. To illustrate:

In []: df = pd.DataFrame({'key': ['a']*5, 'val': range(5)})
In []: df.groupby('key').rank()
Out []:
   val
0  1.0
1  2.0
2  3.0
3  4.0
4  5.0

In []: df.groupby('key').ffill()  # retains key in output; same for bfill
Out []:
  key  val
0   a    0
1   a    1
2   a    2
3   a    3
4   a    4

If bfill and ffill didn't return the grouping in a fashion similar to rank, then we could leverage the same call signatures as the rest of the transformations.

@WillAyd WillAyd referenced this issue Feb 13, 2018: Cythonized GroupBy Fill #19673 (Merged, 4 of 4 tasks complete)

@jreback jreback modified the milestones: Next Major Release, 0.23.0 Feb 23, 2018

@jreback jreback removed this from Perf in Interesting Things Mar 31, 2018
