BUG: Different initialization methods lead to different dtypes (DataFrame) #42971

Open
2 of 3 tasks
RileyLazarou opened this issue Aug 10, 2021 · 8 comments
Labels
API - Consistency (Internal Consistency of API/Behavior), Bug, Constructors (Series/DataFrame/Index/pd.array Constructors), DataFrame (DataFrame data structure), Dtype Conversions (Unexpected or buggy dtype conversions)

Comments

@RileyLazarou

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd

df1 = pd.DataFrame(columns=["a", "b", "c"])
print(df1.groupby("a").sum().columns)
# => Index([], dtype='object')
df2 = pd.DataFrame({"a": [], "b": [], "c": []})
print(df2.groupby("a").sum().columns)
# => Index(['b', 'c'], dtype='object')

Problem description

Grouping and summing an empty DataFrame drops columns (df1 above); this doesn't occur with non-empty DataFrames. Changing a DataFrame's columns based on its content is counter-intuitive and leads to KeyErrors. The expected behaviour is shown above with df2, and the fact that two empty DataFrames behave differently when grouped and summed suggests that this isn't intended behaviour.

Expected Output

Output of the above snippet:

Index([], dtype='object')
Index(['b', 'c'], dtype='object')

RileyLazarou added the Bug and Needs Triage labels Aug 10, 2021
@phofl
Member

phofl commented Aug 10, 2021

Hi, thanks for your report.

This is actually quite straightforward and has nothing to do with groupby itself. df1 has dtype object while df2 has dtype float. If you set numeric_only=False, you will get your columns as expected.
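
A minimal sketch of the suggestion above, assuming the `numeric_only` keyword of the groupby `sum` method and the object-dtype frame from the original report. The default value of `numeric_only` may differ between pandas versions, so it is passed explicitly here.

```python
import pandas as pd

# Empty frame built from column names only; its columns have object dtype.
df1 = pd.DataFrame(columns=["a", "b", "c"])

# Ask sum() to keep non-numeric columns instead of dropping them.
result = df1.groupby("a").sum(numeric_only=False)
print(result.columns)
```

With the flag set explicitly, the result should no longer depend on whether the value columns happen to be numeric.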

phofl added the Constructors, Groupby, and Usage Question labels and removed the Bug and Needs Triage labels Aug 10, 2021
@sergiykhan

The inconsistency appears to be related to the 'float' data type that you have in df2. Here is a different example:

df1 = pd.DataFrame(columns=["a", "b", "c"], dtype='float')
print(df1.groupby("a").sum().columns)
# Index(['b', 'c'], dtype='object')

df2 = pd.DataFrame(columns=["a", "b", "c"], dtype='object')
print(df2.groupby("a").sum().columns)
# Index([], dtype='object')

@phofl
Member

phofl commented Aug 10, 2021

The dtype inference might be wrong here, but I don't know the history and whether this is intended.

@RileyLazarou
Author

@phofl thanks for the quick reply! I had no idea that these two methods of instantiating empty dataframes led to different dtypes:

>>> pd.DataFrame(columns=["a"]).dtypes
a    object
dtype: object
>>> pd.DataFrame({"a": []}).dtypes
a    float64
dtype: object
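
One way to sidestep the inferred dtypes shown above is to state the dtype explicitly in both constructors. This is only a sketch under that assumption, not a fix for the inconsistency itself; the variable names `cols`, `from_columns`, and `from_dict` are illustrative.

```python
import pandas as pd

cols = ["a", "b", "c"]

# Construction from column names, with the dtype stated up front.
from_columns = pd.DataFrame(columns=cols, dtype="float64")

# Construction from a dict of empty Series that carry the same dtype.
from_dict = pd.DataFrame({name: pd.Series([], dtype="float64") for name in cols})

print(from_columns.dtypes)
print(from_dict.dtypes)

# Both frames now keep their value columns after groupby + sum.
print(from_columns.groupby("a").sum().columns)
print(from_dict.groupby("a").sum().columns)
```

Because neither frame relies on dtype inference for empty data, the two construction paths should agree regardless of which default the installed pandas version picks.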

@phofl phofl reopened this Aug 10, 2021
@phofl
Member

phofl commented Aug 10, 2021

Reopening since the different dtypes look strange.

phofl changed the title from "BUG: groupby and sum drops columns and has undefined behaviour" to "BUG: Different initialization methods lead to different dtypes (DataFrame)" Aug 10, 2021
phofl added the Bug, DataFrame, and Dtype Conversions labels and removed the Groupby label Aug 10, 2021
@debnathshoham
Member

take

@debnathshoham
Member

On further investigation, there is a mismatch in the .index as well:

>>> pd.DataFrame(columns=["a"]).index
Index([], dtype='object')
>>> pd.DataFrame({"a": []}).index
RangeIndex(start=0, stop=0, step=1)
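
The index mismatch above can likewise be avoided by passing an index explicitly. A minimal sketch, assuming an empty `pd.RangeIndex` is the index you want in both cases; the names `empty_index`, `from_columns`, and `from_dict` are illustrative.

```python
import pandas as pd

# An explicit, empty RangeIndex shared by both construction paths.
empty_index = pd.RangeIndex(0)

from_columns = pd.DataFrame(columns=["a"], index=empty_index)
from_dict = pd.DataFrame({"a": []}, index=empty_index)

print(type(from_columns.index).__name__)
print(type(from_dict.index).__name__)
```

Supplying the index removes the dependence on whichever default index type the constructor would otherwise choose for empty input.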

simonjayhawkins added the API - Consistency label Aug 16, 2021
@debnathshoham
Member

This has too many moving parts and seems to have a lot of dependent tests.
I will unassign myself.

debnathshoham removed their assignment Aug 19, 2021
5 participants