BUG: Group-by on an empty data object dtype loses the index name (cython aggregation is ok) #8093

carterk · 2014-08-22T13:32:06Z

If pd.read_sql is used to load a data frame with the results of an SQL query that returns no results, the columns in the data frame will be of type 'object'. That type cannot be aggregated, so a subsequent group-by operation on that empty data frame will drop all the columns. So instead of 'profit' in the below example being an empty series, an attribute error is thrown because the columns 'revenue' and 'expenses' cannot be found in the data frame.

Two things I can think of that could fix this:

1) Have pd.read_sql populate the data frame with empty columns of the correct type even if the SQL query returns no results. Then the group-by would not drop the columns because they are of a type that can be aggregated.
2) Have an option in groupby to not drop columns of types that cannot be aggregated: maybe a drop_non_agg flag. I think not dropping columns of types that cannot be aggregated should be the default behaviour. Columns with data that cannot be aggregated can just be populated with null after a group-by.

I think 1) probably should be implemented, and 2) is kind of a design decision.

You can run this code to reproduce the issue.

import pandas as pd
import sqlite3 as lite
import sys

finance = (
    (2, 132, 65),
    (6, 142, 86),
    (3, 183, 34),
    (3, 147, 46)
)

con = lite.connect('test.db')
cur = con.cursor()

cur.execute("DROP TABLE IF EXISTS finance")
cur.execute("CREATE TABLE finance(day_of_week INT, revenue FLOAT, expenses FLOAT)")
cur.executemany("INSERT INTO finance VALUES(?, ?, ?)", finance)

# remove the 'WHERE' clause, and the error won't be thrown
my_query = '''
    SELECT *
    FROM finance
    WHERE day_of_week = 5
    '''

df = pd.read_sql(my_query, con)

df_gb = df[df.day_of_week == 5].groupby('day_of_week').sum().reset_index()

profit = df_gb.revenue - df_gb.expenses # AttributeError thrown here

jorisvandenbossche · 2014-08-22T14:49:20Z

The problem is that pandas, when reading a query, does not know anything about the table structure itself. It constructs the resulting frame only from the returned values from the query. And if there are no values, it cannot determine the dtype.
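As a possible workaround on the caller's side, the empty frame can be cast to the expected dtypes before grouping. This is only a minimal sketch: the `schema` mapping is a hypothetical, hand-written dictionary (pandas does not derive it from the table), and an in-memory SQLite database stands in for the `test.db` file from the report.

```python
import sqlite3

import pandas as pd

# In-memory database standing in for the 'test.db' file from the report.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE finance(day_of_week INT, revenue FLOAT, expenses FLOAT)")
con.executemany(
    "INSERT INTO finance VALUES(?, ?, ?)",
    [(2, 132, 65), (6, 142, 86), (3, 183, 34), (3, 147, 46)],
)

# No row matches day_of_week = 5, so the query result is empty.
df = pd.read_sql("SELECT * FROM finance WHERE day_of_week = 5", con)

# Hypothetical, hand-maintained mapping of column names to dtypes.
schema = {"day_of_week": "int64", "revenue": "float64", "expenses": "float64"}
df = df.astype(schema)

# With numeric dtypes in place, the aggregated columns survive the group-by.
df_gb = df.groupby("day_of_week").sum().reset_index()
profit = df_gb.revenue - df_gb.expenses  # an empty Series
```

The cast has to be maintained by hand alongside the table definition, which is the main cost of this approach.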

jorisvandenbossche · 2014-08-22T14:55:02Z

By the way, if I run this with pandas 0.14.1, I don't get an AttributeError. df has 'object' dtypes, but after the groupby, df_gb has columns of 'float' dtype.

carterk · 2014-08-22T15:08:49Z

I'm running this on pandas 0.13.1

df.info() gives:

<class 'pandas.core.frame.DataFrame'>
Index: 0 entries
Data columns (total 3 columns):
day_of_week    0 non-null object
revenue        0 non-null object
expenses       0 non-null object
dtypes: object(3)None

df_gb.info() gives:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 0 entries
Data columns (total 1 columns):
index    0 non-null object
dtypes: object(1)None

Is it different for you?

jorisvandenbossche · 2014-08-22T15:10:54Z

yes:

In [23]: df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 0 entries
Data columns (total 3 columns):
day_of_week    0 non-null object
revenue        0 non-null object
expenses       0 non-null object
dtypes: object(3)

In [24]: df_gb.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 0 entries
Data columns (total 3 columns):
index       0 non-null int64
revenue     0 non-null float64
expenses    0 non-null float64
dtypes: float64(2), int64(1)

But I don't know which of the two is correct, since converting the object columns to float also seems strange to me.

carterk · 2014-08-22T15:26:31Z

Any idea why those 'object' columns are 'float64' after the group-by? Where does the 'day_of_week' column go? This still isn't what I would expect/want from the group-by operation.

jorisvandenbossche · 2014-08-22T15:29:45Z

@jreback object columns that get converted to float, is that OK?

You can keep the 'day_of_week' column by providing as_index=False: df.groupby('day_of_week', as_index=False).sum() (although I would think it should also work with reset_index, but for some reason it loses the index name, this seems a bug to me)
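To illustrate the two spellings, here is a minimal sketch on a non-empty frame. The frame is built directly from the finance rows in the report rather than loaded through read_sql, which is my assumption to keep the example self-contained; it does not exercise the empty-frame case discussed here.

```python
import pandas as pd

# The finance rows from the report, built directly as a DataFrame.
df = pd.DataFrame(
    [(2, 132, 65), (6, 142, 86), (3, 183, 34), (3, 147, 46)],
    columns=["day_of_week", "revenue", "expenses"],
)

# Keep the grouping key as a regular column.
by_flag = df.groupby("day_of_week", as_index=False).sum()

# Group with the key as the index, then move it back into a column.
by_reset = df.groupby("day_of_week").sum().reset_index()

print(by_flag)
print(by_reset)
```

On a non-empty frame both results carry the 'day_of_week' column, so the two forms are interchangeable there.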

jreback · 2014-08-22T15:31:47Z

@carterk yes, that is by definition, the index of a returned groupby uses the grouper.

carterk · 2014-08-22T15:45:56Z

@jreback Yeah, I thought reset_index handled that. Maybe not. Either way I'm still interested to know why the 'object' columns turned into 'float' columns. Maybe in 0.14.1 groupby now treats the 'object' type as aggregatable and converts it to 'float'? Which is kind of unexpected behaviour, but would also be a decent solution to the issue.

jorisvandenbossche · 2014-08-22T15:58:53Z

@jreback there are two things I don't understand/are a bit strange:

(empty) object columns are converted to float -> expected? (changed from 0.13.1 to 0.14.1)
groupby loses the index name (the name of the column it sets as index)

jreback · 2014-08-22T16:02:10Z

neither of those are true

the result of the input array determines the dtype - they are not coerced

name of the groupby column is preserved

I suspect the input to the frame creation is not exactly right - save that and you will see

jorisvandenbossche · 2014-08-22T16:14:25Z

What exactly do you mean by 'input to frame creation is not exactly right'?

In [66]: df = pd.DataFrame(columns=list('ABC'))

In [67]: df
Out[67]: 
Empty DataFrame
Columns: [A, B, C]
Index: []

In [68]: df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 0 entries
Data columns (total 3 columns):
A    0 non-null object
B    0 non-null object
C    0 non-null object
dtypes: object(3)

In [69]: grouped = df.groupby('A').sum()

In [70]: grouped
Out[70]: 
Empty DataFrame
Columns: [B, C]
Index: []

In [71]: grouped.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 0 entries
Data columns (total 2 columns):
B    0 non-null float64
C    0 non-null float64
dtypes: float64(2)

In [72]: grouped.index.name

jreback · 2014-08-22T16:58:11Z

In [119]: df = pd.DataFrame(columns=list('ABC'),dtype='float64')

In [120]: df.groupby('A').sum().index.name
Out[120]: 'A'

In [121]: df = pd.DataFrame(columns=list('ABC'))

In [122]: df.groupby('A').sum().index.name

I think that the object dtype causes a python aggregation (while the float is a cython aggregation). somewhere the name is getting lost. call this a bug.

Licht-T · 2017-10-24T16:08:54Z

@TomAugspurger This seems already fixed.

In [1]: import pandas as pd

In [2]: print(pd.__version__)
0.21.0.dev+627.ge001500cb.dirty

In [3]: df = pd.DataFrame(columns=list('ABC'))
   ...: df.groupby('A').sum().index.name
   ...:
Out[3]: 'A'

TomAugspurger · 2017-11-01T15:32:57Z

Thanks, it'd be nice to ensure we have a regression test in place.
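A minimal sketch of what such a check might look like, assuming a pandas version in which the behaviour reported above as fixed still holds. The function name is hypothetical, and this is not the test that was eventually merged.

```python
import pandas as pd


def check_empty_groupby_keeps_index_name():
    # Cover both the float64 case and the object-dtype case from this thread.
    for dtype in ("float64", object):
        df = pd.DataFrame(columns=list("ABC"), dtype=dtype)
        result = df.groupby("A").sum()
        # The grouping column should become the name of the result index.
        assert result.index.name == "A"
        assert len(result) == 0


check_empty_groupby_keeps_index_name()
```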

Licht-T · 2017-11-03T15:10:10Z

@TomAugspurger Okay. I'll do that.

jreback added Bug labels Aug 22, 2014

jreback added this to the 0.15.0 milestone Aug 22, 2014

jreback changed the title ~~Group-by on an empty data frame populated by an SQL query that returns no results.~~ BUG: Group-by on an empty data object dtype loses the index name (cython aggregation is ok) Aug 22, 2014

jreback modified the milestones: 0.15.1, 0.15.0 Sep 9, 2014

jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015

jreback added Difficulty Novice labels Apr 12, 2015

jreback mentioned this issue Apr 12, 2015

BUG: losing Index/Series names master issue #9862

Closed

12 tasks

jreback mentioned this issue Feb 17, 2016

BUG: inconsistent name for returned Series in groupby #12363

Closed

TomAugspurger added the good first issue label Oct 11, 2017

Licht-T mentioned this issue Nov 3, 2017

TST: Add regression test for empty DataFrame groupby #18097

Merged

4 tasks

jreback modified the milestones: Next Major Release, 0.22.0 Nov 4, 2017

jreback closed this as completed in #18097 Nov 4, 2017

thomashopf mentioned this issue Mar 30, 2018

KeyError: 'alignment_id' debbiemarkslab/EVcouplings#151

Closed
