BUG: groupby().ffill() adds group labels as extra column #21521

adbull · 2018-06-18T11:40:31Z

Code Sample, a copy-pastable example if possible

Input:

pd.DataFrame(0, [1,2], [3,4]).groupby([5, 6]).ffill()

Output:

   NaN  3  4
1    5  0  0
2    6  0  0

Problem description

groupby().ffill() adds an additional column to the dataframe, containing a copy of the group labels. This is a regression in pandas v0.23.0 (#19673).

Expected Output

   3  4
1  0  0
2  0  0
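
The expected output above can also be written as a small check. This is only an illustrative sketch, not the test that was eventually merged; the helper name `ffill_keeps_original_columns` is hypothetical.

```python
import pandas as pd


def ffill_keeps_original_columns():
    # Hypothetical helper: rebuild the frame from the report and compare
    # the column labels before and after groupby().ffill().
    df = pd.DataFrame(0, index=[1, 2], columns=[3, 4])
    # [5, 6] has one entry per row, so it is used as row-wise group labels.
    result = df.groupby([5, 6]).ffill()
    return list(result.columns) == list(df.columns), result


ok, result = ffill_keeps_original_columns()
print(ok)
print(result)
```

On a pandas version without the bug, `ok` should be `True`; on an affected version the extra label column makes it `False`.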

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 4.14.14-200.fc26.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C
LANG: C
LOCALE: None.None

pandas: 0.23.1
pytest: 3.6.0
pip: 10.0.1
setuptools: 39.2.0
Cython: 0.28.3
numpy: 1.13.3
scipy: 0.19.1
pyarrow: None
xarray: None
IPython: 4.2.1
sphinx: None
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.3
numexpr: 2.6.2
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.8
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: 0.1.5
pandas_gbq: None
pandas_datareader: None


uds5501 · 2018-06-18T14:50:59Z

@adbull I think you should try building from master. The problem no longer occurs in the current version of pandas:

>>> pd.DataFrame(0, [1,2], [3,4]).groupby([5, 6]).ffill()
   3  4
1  0  0
2  0  0

gfyoung · 2018-06-18T23:46:43Z

@uds5501 : Thanks for looking into this! Would you like to add a test to close?

gfyoung · 2018-06-18T23:48:34Z

~~Marking for 0.23.2, as this is a relatively trivial thing that we should be able to get in (I can bring this to the finish line if others are unable to do so).~~

uds5501 · 2018-06-19T06:46:08Z

@gfyoung Actually, no. I just fetched the latest version of master and can now reproduce this error. I am sorry for the confusion.

>>> import pandas as pd
>>> pd.__version__
'0.24.0.dev0+122.g6131a59'
>>> pd.DataFrame(0, [1,2], [3,4]).groupby([5, 6]).ffill()
   NaN  3  4
1    5  0  0
2    6  0  0

I honestly have no idea how I got the output I reported before, but the error is reproducible.

gfyoung · 2018-06-19T06:59:46Z

@uds5501 : Sorry, I wasn't able to check until now, and yes, I can reproduce it too. This could potentially be a regression, though.

@adbull : Investigation and PR are welcome!

aggarwalvinayak · 2018-06-19T16:42:49Z

Hey, can I work on this issue? This is my first time contributing to an open source project. I have some experience using pandas DataFrames.

gfyoung · 2018-06-19T16:47:38Z

@aggarwalvinayak : By all means! Go for it!

aggarwalvinayak · 2018-06-21T07:42:33Z

Can I get some help getting started with this? For example, where is the implementation of groupby and ffill located, and how should I proceed to make this work?
I found that this wasn't a bug in my previous version of pandas (0.20.1), but it is a problem in current master.

gfyoung · 2018-06-21T08:10:41Z

@aggarwalvinayak : Oh, that is very useful information! You can proceed as follows:

Can you figure out the most recent version / release of pandas where this example works?
Are you familiar with git bisect? If not, have a look here: https://git-scm.com/docs/git-bisect

The goal is to figure out the commit where this example starts to fail. Then we can debug from there.
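
A check like the following could drive the bisection. It is a minimal sketch under assumptions: the function name `bisect_exit_code` is hypothetical, and it assumes pandas has been rebuilt for each commit being tested, since the compiled extensions change between commits. `git bisect run` treats exit code 0 as "good" and 1 as "bad".

```python
import pandas as pd


def bisect_exit_code():
    """Return 0 if the checked-out pandas is 'good', 1 if it is 'bad'."""
    df = pd.DataFrame(0, index=[1, 2], columns=[3, 4])
    result = df.groupby([5, 6]).ffill()
    # 'bad' means ffill() returned columns other than the original ones.
    return 0 if list(result.columns) == list(df.columns) else 1


code = bisect_exit_code()
print("good" if code == 0 else "bad")
# To use this with `git bisect run`, end the script with: raise SystemExit(code)
```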

adbull · 2018-06-21T11:43:05Z

As in the original report, this is a regression between 0.22.0 and 0.23.0, presumably due to the reimplementation of groupby().ffill() in cython (#19673).

aggarwalvinayak · 2018-06-21T13:43:11Z

@gfyoung d87ca1c723154b09a005f865a06a38d4bb82917c was the first commit in which this problem appeared. Its commit message was "Cythonized GroupBy Fill", and it aimed to improve the performance of GroupBy.ffill and GroupBy.bfill (issues #11296 and #19673).
@WillAyd Could you help me with this, since you made these changes? It would help me a lot in looking into this problem further, as this is my first issue and I don't have much experience.

WillAyd · 2018-06-21T14:13:53Z

@aggarwalvinayak Sure. As you look at it, feel free to ask if you have questions.

WillAyd · 2018-06-21T14:17:06Z

Just as a heads up: the "good first issue" label was added when we thought this was just going to be a test case. I've removed it, as it appears to be a little more complex than that. You are absolutely welcome to diagnose and debug, but I just want to be clear that it may not be as simple as originally thought.

HarryVolek · 2018-06-21T15:20:51Z

The error occurs when the number of columns specified in the groupby equals the number of rows in the dataframe while one of the columns is not contained in the dataframe. If the number of columns in the groupby is not equal to the number of rows in the dataframe, a KeyError is raised. Why is it desired for nothing to happen if the number of columns specified equals the number of rows in the dataframe (and one of the columns specified is not contained in the dataframe), as opposed to raising a KeyError? I couldn't find the reason in the groupby docs.

gfyoung · 2018-06-21T16:52:16Z

Marking this for 0.23.2, as the regression was introduced in 0.23.0.

WillAyd · 2018-06-21T18:13:05Z

IMO this is not a regression. If you do:

pd.DataFrame(0, [1,2], [3,4]).groupby([3, 4]).ffill()

Then you do not get the additional column. 5 and 6 are not valid labels, so if anything this should be raising a KeyError.

gfyoung · 2018-06-21T18:21:23Z

@WillAyd: Feel free to relabel if that is the case. @jreback: Thoughts?

adbull · 2018-06-21T19:21:03Z

This is a regression. groupby arguments do not have to be column keys, they can also be group labels, e.g. the following is a valid groupby call:

>>> pd.DataFrame([[1,np.nan,np.nan,np.nan]]).T.groupby([1,1,2,2]).ffill()
   NaN    0
0    1  1.0
1    1  1.0
2    2  NaN
3    2  NaN

In 0.22.0, the correct result was returned, with no extra column; in 0.23.0, the extra column gets added.

@HarryVolek The special logic for an argument that has the same length as the df and contains an entry that is not a column key exists to support this use case. Possibly this could be better documented?
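
The distinction between row-wise group labels and column keys can be illustrated with a short sketch. It reuses the frame from the comment above; the variable names are illustrative, and the behavior shown is what a pandas version without the bug is expected to do.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, np.nan, np.nan]]).T

# One label per row: rows 0-1 form group 1, rows 2-3 form group 2,
# so the forward fill never crosses from group 1 into group 2.
filled = df.groupby([1, 1, 2, 2]).ffill()
print(filled)

# A list whose length differs from the number of rows is read as
# column keys instead, and 1 and 2 are not columns of this frame.
try:
    df.groupby([1, 1, 2])
    raised = False
except KeyError:
    raised = True
print(raised)
```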

adbull · 2018-07-03T13:20:44Z

Looks like the issue also occurs when grouping by an index level:

>>> pd.DataFrame([[1,np.nan,np.nan,np.nan]],[0],[1,1,2,2]).T.groupby(level=0).ffill()
   NaN    0
1    1  1.0
1    1  1.0
2    2  NaN
2    2  NaN
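
For comparison, grouping by the index level and grouping by the same values passed as an explicit label array would be expected to give identical fills. A minimal sketch, assuming a pandas version without the bug (variable names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, np.nan, np.nan, np.nan]], index=[0], columns=[1, 1, 2, 2]
).T

# Group by the (duplicated) index values via the level ...
by_level = df.groupby(level=0).ffill()
# ... and by the same values passed as an explicit array of labels.
by_labels = df.groupby(df.index.to_numpy()).ffill()

print(by_level)
print(by_level.equals(by_labels))
```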

gfyoung added the Groupby and Algos (non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff) labels Jun 18, 2018

gfyoung added the Testing (pandas testing functions or related to the test suite) label Jun 18, 2018

gfyoung added this to the 0.23.2 milestone Jun 18, 2018

gfyoung added the good first issue label Jun 18, 2018

gfyoung added the Bug label and removed the Testing (pandas testing functions or related to the test suite) label Jun 19, 2018

gfyoung removed this from the 0.23.2 milestone Jun 19, 2018

gfyoung added the Regression (functionality that used to work in a prior pandas version) label and removed the Bug label Jun 21, 2018

WillAyd removed the good first issue label Jun 21, 2018

gfyoung added this to the 0.23.2 milestone Jun 21, 2018

jreback added this to the 0.23.3 milestone Jun 26, 2018

jreback modified the milestones: 0.23.4, 0.23.5 Aug 2, 2018

jreback modified the milestones: 0.23.5, 0.24.0 Oct 23, 2018

jreback modified the milestones: 0.24.0, Contributions Welcome Dec 2, 2018

adbull pushed a commit to adbull/pandas that referenced this issue Apr 18, 2019

BUG: groupby ffill adds labels as extra column (pandas-dev#21521)

d13d18f

adbull pushed a commit to adbull/pandas that referenced this issue Apr 19, 2019

BUG: groupby ffill adds labels as extra column (pandas-dev#21521)

2942818

adbull pushed a commit to adbull/pandas that referenced this issue Apr 19, 2019

BUG: groupby ffill adds labels as extra column (pandas-dev#21521)

b48439a

adbull pushed a commit to adbull/pandas that referenced this issue Apr 19, 2019

BUG: groupby ffill adds labels as extra column (pandas-dev#21521)

96e75af

adbull pushed a commit to adbull/pandas that referenced this issue Apr 19, 2019

BUG: groupby ffill adds labels as extra column (pandas-dev#21521)

100fa27

adbull pushed a commit to adbull/pandas that referenced this issue Apr 20, 2019

BUG: groupby ffill adds labels as extra column (pandas-dev#21521)

0624dee

adbull pushed a commit to adbull/pandas that referenced this issue Apr 20, 2019

BUG: groupby ffill adds labels as extra column (pandas-dev#21521)

f31ff65

adbull mentioned this issue Apr 20, 2019

BUG: groupby ffill adds labels as extra column (#21521) #26162

Merged

4 tasks

adbull pushed a commit to adbull/pandas that referenced this issue Apr 20, 2019

BUG: groupby ffill adds labels as extra column (pandas-dev#21521)

e010ffe

adbull pushed a commit to adbull/pandas that referenced this issue Apr 20, 2019

BUG: groupby ffill adds labels as extra column (pandas-dev#21521)

a6b638c

jreback modified the milestones: Contributions Welcome, 0.25.0 Apr 21, 2019

adbull pushed a commit to adbull/pandas that referenced this issue Apr 22, 2019

BUG: groupby ffill adds labels as extra column (pandas-dev#21521)

d969168

adbull pushed a commit to adbull/pandas that referenced this issue Apr 23, 2019

BUG: groupby ffill adds labels as extra column (pandas-dev#21521)

0252697

adbull pushed a commit to adbull/pandas that referenced this issue Apr 25, 2019

BUG: groupby ffill adds labels as extra column (pandas-dev#21521)

137e1fd

adbull pushed a commit to adbull/pandas that referenced this issue Apr 27, 2019

BUG: groupby ffill adds labels as extra column (pandas-dev#21521)

56b75ae

adbull pushed a commit to adbull/pandas that referenced this issue Apr 27, 2019

API: groupby ffill adds labels as extra column (pandas-dev#21521)

dd793c5

adbull pushed a commit to adbull/pandas that referenced this issue Apr 28, 2019

API: groupby ffill adds labels as extra column (pandas-dev#21521)

d3972c4

adbull pushed a commit to adbull/pandas that referenced this issue May 12, 2019

API: groupby ffill adds labels as extra column (pandas-dev#21521)

3514b45

adbull pushed a commit to adbull/pandas that referenced this issue May 13, 2019

API: groupby ffill adds labels as extra column (pandas-dev#21521)

01057a4

WillAyd closed this as completed in #26162 May 15, 2019

WillAyd pushed a commit that referenced this issue May 15, 2019

API: groupby ffill adds labels as extra column (#21521) (#26162)

3b24fb6
