Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: groupby().ffill() adds group labels as extra column #21521

Closed
adbull opened this issue Jun 18, 2018 · 22 comments · Fixed by #26162
Closed

BUG: groupby().ffill() adds group labels as extra column #21521

adbull opened this issue Jun 18, 2018 · 22 comments · Fixed by #26162
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Groupby Regression Functionality that used to work in a prior pandas version
Milestone

Comments

@adbull
Copy link
Contributor

adbull commented Jun 18, 2018

Code Sample, a copy-pastable example if possible

Input:

pd.DataFrame(0, [1,2], [3,4]).groupby([5, 6]).ffill()

Output:

   NaN  3  4
1    5  0  0
2    6  0  0

Problem description

groupby().ffill() adds an additional column to the dataframe, containing a copy of the group labels. This is a regression in pandas v0.23.0 (#19673).

Expected Output

   3  4
1  0  0
2  0  0

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 4.14.14-200.fc26.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C
LANG: C
LOCALE: None.None

pandas: 0.23.1
pytest: 3.6.0
pip: 10.0.1
setuptools: 39.2.0
Cython: 0.28.3
numpy: 1.13.3
scipy: 0.19.1
pyarrow: None
xarray: None
IPython: 4.2.1
sphinx: None
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.3
numexpr: 2.6.2
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.8
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: 0.1.5
pandas_gbq: None
pandas_datareader: None

@uds5501
Copy link
Contributor

uds5501 commented Jun 18, 2018

@adbull I think you should try and build from master. The given problem no longer persists in the current version of pandas

>>> pd.DataFrame(0, [1,2], [3,4]).groupby([5, 6]).ffill()
   3  4
1  0  0
2  0  0

@gfyoung gfyoung added Groupby Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels Jun 18, 2018
@gfyoung
Copy link
Member

gfyoung commented Jun 18, 2018

@uds5501 : Thanks for looking into this! Would you like to add a test to close?

@gfyoung gfyoung added the Testing pandas testing functions or related to the test suite label Jun 18, 2018
@gfyoung gfyoung added this to the 0.23.2 milestone Jun 18, 2018
@gfyoung
Copy link
Member

gfyoung commented Jun 18, 2018

Marking for 0.23.2, as this is a relatively trivial thing that we should be able to get in (I can bring this to the finish line if others are unable to do so).

@uds5501
Copy link
Contributor

uds5501 commented Jun 19, 2018

@gfyoung actually no. I just fetched recent version of master and can now actually reproduce this error. I am sorry for the confusion.

>>> import pandas as pd
>>> pd.__version__
'0.24.0.dev0+122.g6131a59'
>>> pd.DataFrame(0, [1,2], [3,4]).groupby([5, 6]).ffill()
   NaN  3  4
1    5  0  0
2    6  0  0

I have honestly no idea how did I got the one which i reported before but the error is reproducable

@gfyoung gfyoung added Bug and removed Testing pandas testing functions or related to the test suite labels Jun 19, 2018
@gfyoung gfyoung removed this from the 0.23.2 milestone Jun 19, 2018
@gfyoung
Copy link
Member

gfyoung commented Jun 19, 2018

@uds5501 : Sorry, I wasn't able to check until now, and yes, I can reproduce too. This could be a regression though potentially.

@adbull : Investigation and PR are welcome!

@aggarwalvinayak
Copy link

Hey, can i work on this issue? This is my first time i am contributing to an open source project. I have some experience using pandas dataframe

@gfyoung
Copy link
Member

gfyoung commented Jun 19, 2018

@aggarwalvinayak : By all means! Go for it!

@aggarwalvinayak
Copy link

aggarwalvinayak commented Jun 21, 2018

Can i get some help getting started with this. Like where to found the implementation of groupby and ffill be located and how should i proceed in making this work.
I did found this wasn't a bug in my previous version of pandas 0.20.1 but is a problem in current master

@gfyoung gfyoung added Regression Functionality that used to work in a prior pandas version and removed Bug labels Jun 21, 2018
@gfyoung
Copy link
Member

gfyoung commented Jun 21, 2018

@aggarwalvinayak : Oh, that is very useful information! You can proceed as follows:

  1. Can you figure out the most recent version / release of pandas where this example works?
  2. Are you familiar with git bisect ? If not, have a look here: https://git-scm.com/docs/git-bisect

The goal is to figure out the commit where this example starts to fail. Then we can debug from there.

@adbull
Copy link
Contributor Author

adbull commented Jun 21, 2018

As in the original report, this is a regression between 0.22.0 and 0.23.0, presumably due to the reimplementation of groupby().ffill() in cython (#19673).

@aggarwalvinayak
Copy link

aggarwalvinayak commented Jun 21, 2018

@gfyoung d87ca1c723154b09a005f865a06a38d4bb82917c was the first commit in which this problem existed and the commit message was Cythonized GroupBy Fill and was aimed at improving performance of GroupBy.ffill & GroupBy.bfill on issue #11296 and #19673
@WillAyd Could you help me with this as you made these changes. It would help me alot to look into this problem further as it my first issue and i dont have much experience

@WillAyd
Copy link
Member

WillAyd commented Jun 21, 2018

@aggarwalvinayak sure as you look at it and have questions feel free to ask

@WillAyd
Copy link
Member

WillAyd commented Jun 21, 2018

Just as a heads up - this "good first issue" label was added when we thought this was just going to be a test case. I've removed it as it appears to be a little more complex than that. Absolutely welcome to diagnose and debug but just want to be clear that it may not be as simple as originally thought

@HarryVolek
Copy link

HarryVolek commented Jun 21, 2018

The error occurs when the number of columns specified in the groupby equals number of rows in the dataframe while one of the columns is not contained in the dataframe. If the number of columns in the groupby is not equivalent to the number of rows in the dataframe a keyerror is raised. Why is it desired for nothing to happen if the number of columns specified equals the number of rows in the dataframe (and one of the columns specified is not contained in the dataframe) as opposed to a keyerror? I couldn't find the reason in the groupby docs.

@gfyoung gfyoung added this to the 0.23.2 milestone Jun 21, 2018
@gfyoung
Copy link
Member

gfyoung commented Jun 21, 2018

Marking this for 0.23.2, as the regression was introduced in 0.23.0.

@WillAyd
Copy link
Member

WillAyd commented Jun 21, 2018

IMO this is not a regression. If you do:

pd.DataFrame(0, [1,2], [3,4]).groupby([3, 4]).ffill()

Then you do not get the additional column. 5 and 6 are not valid labels, so if anything this should be raising a KeyError.

@gfyoung
Copy link
Member

gfyoung commented Jun 21, 2018

@WillAyd: Feel free to relabel if that is the case. @jreback: Thoughts?

@adbull
Copy link
Contributor Author

adbull commented Jun 21, 2018

This is a regression. groupby arguments do not have to be column keys, they can also be group labels, e.g. the following is a valid groupby call:

>>> pd.DataFrame([[1,np.nan,np.nan,np.nan]]).T.groupby([1,1,2,2]).ffill()
   NaN    0
0    1  1.0
1    1  1.0
2    2  NaN
3    2  NaN

In 0.22.0, the correct result was returned, with no extra column; in 0.23.0, the extra column gets added.

@HarryVolek The special logic when the argument is the same length as the df, and contains an entry that is not a column key, is to support this use case. Possible this could be better documented?

@jreback jreback added this to the 0.23.3 milestone Jun 26, 2018
@adbull
Copy link
Contributor Author

adbull commented Jul 3, 2018

Looks like the issue also occurs when grouping by an index level:

>>> pd.DataFrame([[1,np.nan,np.nan,np.nan]],[0],[1,1,2,2]).T.groupby(level=0).ffill()
   NaN    0
1    1  1.0
1    1  1.0
2    2  NaN
2    2  NaN

@jreback jreback modified the milestones: 0.23.4, 0.23.5 Aug 2, 2018
@jreback jreback modified the milestones: 0.23.5, 0.24.0 Oct 23, 2018
@jreback jreback modified the milestones: 0.24.0, Contributions Welcome Dec 2, 2018
adbull pushed a commit to adbull/pandas that referenced this issue Apr 18, 2019
adbull pushed a commit to adbull/pandas that referenced this issue Apr 19, 2019
adbull pushed a commit to adbull/pandas that referenced this issue Apr 19, 2019
adbull pushed a commit to adbull/pandas that referenced this issue Apr 19, 2019
adbull pushed a commit to adbull/pandas that referenced this issue Apr 19, 2019
adbull pushed a commit to adbull/pandas that referenced this issue Apr 20, 2019
adbull pushed a commit to adbull/pandas that referenced this issue Apr 20, 2019
adbull pushed a commit to adbull/pandas that referenced this issue Apr 20, 2019
adbull pushed a commit to adbull/pandas that referenced this issue Apr 20, 2019
@jreback jreback modified the milestones: Contributions Welcome, 0.25.0 Apr 21, 2019
adbull pushed a commit to adbull/pandas that referenced this issue Apr 22, 2019
adbull pushed a commit to adbull/pandas that referenced this issue Apr 23, 2019
adbull pushed a commit to adbull/pandas that referenced this issue Apr 25, 2019
adbull pushed a commit to adbull/pandas that referenced this issue Apr 27, 2019
adbull pushed a commit to adbull/pandas that referenced this issue Apr 27, 2019
adbull pushed a commit to adbull/pandas that referenced this issue Apr 28, 2019
adbull pushed a commit to adbull/pandas that referenced this issue May 12, 2019
adbull pushed a commit to adbull/pandas that referenced this issue May 13, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Groupby Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants