Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Group-by on an empty data object dtype loses the index name (cython aggregation is ok) #8093

Closed
carterk opened this issue Aug 22, 2014 · 15 comments · Fixed by #18097
Closed

Comments

@carterk
Copy link

carterk commented Aug 22, 2014

If pd.read_sql is used to load a data frame with the results of an SQL query that returns no results, the columns in the data frame will be of type 'object'. That type cannot be aggregated, so a subsequent group-by operation on that empty data frame will drop all the columns. So instead of 'profit' in the below example being an empty series, an attribute error is thrown because the columns 'revenue' and 'expenses' cannot be found in the data frame.

Two things I can think of that could fix this:

  1. Have pd.read_sql populate the data frame with empty columns of the correct type even if the SQL query returns no results. Then the group-by would not drop the columns because they are of a type that can be aggregated.
  2. Have an option in groupby to not drop columns of types that cannot be aggregated: maybe a drop_non_agg flag. I think not dropping columns of types that cannot be aggregated should be the default behaviour. Columns with data that cannot be aggregated can just be populated with null after a group-by.

I think 1) probably should be implemented, and 2) is kind of a design decision.

You can run this code to reproduce the issue.

import pandas as pd
import sqlite3 as lite
import sys

finance = (
    (2, 132, 65),
    (6, 142, 86),
    (3, 183, 34),
    (3, 147, 46)
)

con = lite.connect('test.db')
cur = con.cursor()

cur.execute("DROP TABLE IF EXISTS finance")
cur.execute("CREATE TABLE finance(day_of_week INT, revenue FLOAT, expenses FLOAT)")
cur.executemany("INSERT INTO finance VALUES(?, ?, ?)", finance)

# remove the 'WHERE' clause, and the error won't be thrown
my_query = '''
    SELECT *
    FROM finance
    WHERE day_of_week = 5
    '''

df = pd.read_sql(my_query, con)

df_gb = df[df.day_of_week == 5].groupby('day_of_week').sum().reset_index()

profit = df_gb.revenue - df_gb.expenses # AttributeError thrown here
@jorisvandenbossche
Copy link
Member

The problem is that pandas, when reading a query, does not know anything about the table structure itself. It constructs the resulting frame only from the returned values from the query. And if there are no values, it cannot determine the dtype.

@jorisvandenbossche
Copy link
Member

By the way, if I run this with pandas 0.14.1, I don't get a AttributeError. df has 'object' dtypes, but after the groupby, df_gb has columns of 'float' dtype.

@carterk
Copy link
Author

carterk commented Aug 22, 2014

I'm running this on pandas 0.13.1

df.info() gives:

<class 'pandas.core.frame.DataFrame'>
Index: 0 entries
Data columns (total 3 columns):
day_of_week    0 non-null object
revenue        0 non-null object
expenses       0 non-null object
dtypes: object(3)None

df_gb.info() gives:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 0 entries
Data columns (total 1 columns):
index    0 non-null object
dtypes: object(1)None

Is it different for you?

@jorisvandenbossche
Copy link
Member

yes:

In [23]: df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 0 entries
Data columns (total 3 columns):
day_of_week    0 non-null object
revenue        0 non-null object
expenses       0 non-null object
dtypes: object(3)

In [24]: df_gb.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 0 entries
Data columns (total 3 columns):
index       0 non-null int64
revenue     0 non-null float64
expenses    0 non-null float64
dtypes: float64(2), int64(1)

But I don't know which of both is correct. As converting the object column to float is also strange I think.

@carterk
Copy link
Author

carterk commented Aug 22, 2014

Any idea why those 'object' columns are 'float64' after the group-by? Where does the 'day_of_week' column go? This still isn't what I would expect/want from the group-by operation.

@jorisvandenbossche
Copy link
Member

@jreback object columns that get converted to float, is that OK?

You can keep the 'day_of_week' column by providing as_index=False: df.groupby('day_of_week', as_index=False).sum() (although I would think it should also work with reset_index, but for some reason it loses the index name, this seems a bug to me)

@jreback
Copy link
Contributor

jreback commented Aug 22, 2014

@carterk yes, that is by definition, the index of a returned groupby uses the grouper.

@carterk
Copy link
Author

carterk commented Aug 22, 2014

@jreback Yeah, I thought reset_index handled that. Maybe not. Either way I'm still interested to know why the 'object' columns turned into 'float' columns. Maybe in 0.14.1 groupby now treats the 'object' type as aggregatable and converts it to 'float'? Which is kind of unexpected behaviour, but would also be a decent solution to the issue.

@jorisvandenbossche
Copy link
Member

@jreback there are two things I don't understand/are a bit strange:

  • (empty) object columns are converted to float -> expected? (changed from 0.13.1 to 0.14.1)
  • groupby looses the index name (the name of the column it sets as idnex)

@jreback
Copy link
Contributor

jreback commented Aug 22, 2014

neither of those are true

the result of the input array deteomes the dtype - they r not coerced

name of the groupby column is preserved

I suspect the input to the frame creation is not exactly right - save that and u will see

@jorisvandenbossche
Copy link
Member

What do you exactly mean with 'input to frame creation is not exactly right'?

In [66]: df = pd.DataFrame(columns=list('ABC'))

In [67]: df
Out[67]: 
Empty DataFrame
Columns: [A, B, C]
Index: []

In [68]: df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 0 entries
Data columns (total 3 columns):
A    0 non-null object
B    0 non-null object
C    0 non-null object
dtypes: object(3)

In [69]: grouped = df.groupby('A').sum()

In [70]: grouped
Out[70]: 
Empty DataFrame
Columns: [B, C]
Index: []

In [71]: grouped.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 0 entries
Data columns (total 2 columns):
B    0 non-null float64
C    0 non-null float64
dtypes: float64(2)

In [72]: grouped.index.name

@jreback
Copy link
Contributor

jreback commented Aug 22, 2014

In [119]: df = pd.DataFrame(columns=list('ABC'),dtype='float64')

In [120]: df.groupby('A').sum().index.name
Out[120]: 'A'

In [121]: df = pd.DataFrame(columns=list('ABC'))

In [122]: df.groupby('A').sum().index.name

I think that the object dtype causes a python aggregation (while the float is a cython aggregation). somewhere the name is getting lost. call this a bug.

@jreback jreback added this to the 0.15.0 milestone Aug 22, 2014
@jreback jreback changed the title Group-by on an empty data frame populated by an SQL query that returns no results. BUG: Group-by on an empty data object dtype loses the index name (cython aggregation is ok) Aug 22, 2014
@jreback jreback modified the milestones: 0.15.1, 0.15.0 Sep 9, 2014
@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@Licht-T
Copy link
Contributor

Licht-T commented Oct 24, 2017

@TomAugspurger This seems already fixed.

In [1]: import pandas as pd

In [2]: print(pd.__version__)
0.21.0.dev+627.ge001500cb.dirty

In [3]: df = pd.DataFrame(columns=list('ABC'))
   ...: df.groupby('A').sum().index.name
   ...:
Out[3]: 'A'

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Nov 1, 2017 via email

@Licht-T
Copy link
Contributor

Licht-T commented Nov 3, 2017

@TomAugspurger Okay. I'll do that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants