BUG: Different initialization methods lead to different dtypes (DataFrame) #42971

RileyLazarou · 2021-08-10T16:56:40Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.

Code Sample, a copy-pastable example

import pandas as pd

df1 = pd.DataFrame(columns=["a", "b", "c"])
print(df1.groupby("a").sum().columns)
# => Index([], dtype='object')
df2 = pd.DataFrame({"a": [], "b": [], "c": []})
print(df2.groupby("a").sum().columns)
# => Index(['b', 'c'], dtype='object')

Problem description

Grouping and summing an empty DataFrame drops columns (df1 above); this doesn't occur with non-empty DataFrames. Changing a DataFrame's columns based on its contents is counter-intuitive and leads to key errors. The expected behaviour is the one shown above for df2, and the fact that two empty DataFrames behave differently when grouped and summed suggests that this isn't intended.

Expected Output

Output of the above snippet:

Index([], dtype='object')
Index(['b', 'c'], dtype='object')

phofl · 2021-08-10T18:57:05Z

Hi, thanks for your report.

This is actually quite straightforward and has nothing to do with groupby itself. df1 has dtype object while df2 has dtype float. If you set numeric_only=False, you will get the columns you expect.
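As a minimal sketch of the suggestion above (the variable name `summed` is only for illustration), passing `numeric_only=False` explicitly asks `sum()` to keep the object-dtype columns of the empty frame. The default for `numeric_only` has varied between pandas releases, so being explicit is the safer choice:

```python
import pandas as pd

# Empty frame built from column labels only; its columns are object dtype.
df1 = pd.DataFrame(columns=["a", "b", "c"])

# Explicitly keep non-numeric columns in the aggregation.
summed = df1.groupby("a").sum(numeric_only=False)

# The non-key columns should survive the aggregation.
print(summed.columns)
```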

sergiykhan · 2021-08-10T18:57:44Z

The inconsistency appears to be related to the 'float' data type that you have in df2. Here is a different example:

df1 = pd.DataFrame(columns=["a", "b", "c"], dtype='float')
print( df1.groupby("a").sum().columns )
# Index(['b', 'c'], dtype='object')

df2 = pd.DataFrame(columns=["a", "b", "c"], dtype='object')
print( df2.groupby("a").sum().columns )
# Index([], dtype='object')

phofl · 2021-08-10T19:04:05Z

The dtype inference might be wrong here, but I don't know the history and whether this is intended.

RileyLazarou · 2021-08-10T19:30:50Z

@phofl thanks for the quick reply! I had no idea that these two methods of instantiating empty dataframes led to different dtypes:

>>> pd.DataFrame(columns=["a"]).dtypes
a    object
dtype: object
>>> pd.DataFrame({"a": []}).dtypes
a    float64
dtype: object
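One way to sidestep the inferred dtype, sketched here under the assumption that float columns are what you want, is to pass `dtype` explicitly to both constructors. The names `cols`, `from_columns` and `from_dict` are illustrative only:

```python
import pandas as pd

cols = ["a", "b", "c"]

# Same empty frame built two ways, with the dtype stated rather than inferred.
from_columns = pd.DataFrame(columns=cols, dtype="float64")
from_dict = pd.DataFrame({c: [] for c in cols}, dtype="float64")

# Both constructions should now report the same per-column dtypes.
print(from_columns.dtypes)
print(from_dict.dtypes)
```

With the dtype pinned, later operations such as groupby and sum no longer depend on which constructor form was used.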

phofl · 2021-08-10T19:31:50Z

Reopening since the different dtypes look strange.

debnathshoham · 2021-08-12T18:35:14Z

take

debnathshoham · 2021-08-12T19:36:04Z

On further investigation, there is a mismatch in the .index as well

>>> pd.DataFrame(columns=["a"]).index
Index([], dtype='object')
>>> pd.DataFrame({"a": []}).index
RangeIndex(start=0, stop=0, step=1)
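The index mismatch can be avoided in the same way as the dtype mismatch. This is a sketch that assumes an empty `RangeIndex` is the index you want; it passes that index explicitly instead of relying on the constructor default, and `empty_index` is a name introduced here for illustration:

```python
import pandas as pd

# An explicit, empty RangeIndex shared by both constructions.
empty_index = pd.RangeIndex(0)

from_columns = pd.DataFrame(columns=["a"], index=empty_index)
from_dict = pd.DataFrame({"a": []}, index=empty_index)

# Both frames should carry the index type that was passed in.
print(type(from_columns.index).__name__)
print(type(from_dict.index).__name__)
```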

debnathshoham · 2021-08-19T13:37:59Z

This has too many moving parts and seems to have a lot of dependent tests.
I will unassign myself.

RileyLazarou added the Bug and Needs Triage labels Aug 10, 2021

phofl added the Constructors, Groupby, and Usage Question labels and removed the Bug and Needs Triage labels Aug 10, 2021

RileyLazarou closed this as completed Aug 10, 2021

phofl reopened this Aug 10, 2021

phofl removed the Usage Question label Aug 10, 2021

phofl changed the title ~~BUG: groupby and sum drops columns and has undefined behaviour~~ BUG: Different initialization methods lead to different dtypes (DataFrame) Aug 10, 2021

phofl added the Bug, DataFrame, and Dtype Conversions labels and removed the Groupby label Aug 10, 2021

github-actions bot assigned debnathshoham Aug 12, 2021

debnathshoham mentioned this issue Aug 13, 2021 in BUG: seperate df dtypes from different initialization #43019 (closed)

simonjayhawkins added the API - Consistency label Aug 16, 2021

debnathshoham removed their assignment Aug 19, 2021
