Appending Pandas dataframes in for loop results in ValueError #13524

Closed
lvphj opened this Issue Jun 28, 2016 · 13 comments

lvphj commented Jun 28, 2016

I recently posted this on StackOverflow. It seems to be a bug so I am posting here as well.

I want to generate a dataframe by appending several separate dataframes generated in a for loop. Each individual dataframe consists of a name column, a range of integers and a column identifying a category to which the integer belongs (e.g. quintile 1 to 5). If I generate each dataframe individually and then append one to the other to create a 'master' dataframe, there are no problems. However, when I use a loop to create each individual dataframe, trying to append a dataframe to the master dataframe results in:

ValueError: incompatible categories in categorical concat

A work-around (suggested by jezrael) involved appending each dataframe to a list of dataframes and concatenating them using pd.concat.

I've written a simplified loop to illustrate:

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

# Define column names
colNames = ('a','b','c')

# Define a dataframe with the required column names
masterDF = pd.DataFrame(columns = colNames)

# A list of the group names
names = ['Group1','Group2','Group3']

# Create a dataframe for each group
for i in names:
    tempDF = pd.DataFrame(columns = colNames)
    tempDF['a'] = np.arange(1,11,1)
    tempDF['b'] = i
    tempDF['c'] = pd.cut(np.arange(1,11,1),
                        bins = np.linspace(0,10,6),
                        labels = [1,2,3,4,5])
    print(tempDF)
    print('\n')

    # Try to append temporary DF to master DF
    masterDF = masterDF.append(tempDF,ignore_index=True)

print(masterDF)
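
For comparison, the work-around mentioned above (collecting the per-group frames in a list and calling pd.concat once) might look like the sketch below. The names build_group_frame, frames and master_df are illustrative and not from the original post. Note that DataFrame.append was removed in pandas 2.0, so on recent versions only the pd.concat form still runs.

```python
import numpy as np
import pandas as pd


def build_group_frame(name):
    """Build one per-group frame (hypothetical helper mirroring the loop body)."""
    values = np.arange(1, 11)
    return pd.DataFrame({
        "a": values,
        "b": name,
        # pd.cut returns a categorical with the labels 1..5
        "c": pd.cut(values, bins=np.linspace(0, 10, 6), labels=[1, 2, 3, 4, 5]),
    })


names = ["Group1", "Group2", "Group3"]

# Collect the frames first, then concatenate once at the end
frames = [build_group_frame(name) for name in names]
master_df = pd.concat(frames, ignore_index=True)

print(master_df)
```

Concatenating once at the end also avoids copying the growing master frame on every loop iteration.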

Expected Output

     a       b  c
 0   1  Group1  1
 1   2  Group1  1
 2   3  Group1  2
 3   4  Group1  2
 4   5  Group1  3
 5   6  Group1  3
 6   7  Group1  4
 7   8  Group1  4
 8   9  Group1  5
 9  10  Group1  5
10  11  Group2  1
11  12  Group2  1
12  13  Group2  2
13  14  Group2  2
...
28  29  Group3  5
29  30  Group3  5

output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.4.1.final.0
python-bits: 64
OS: Darwin
OS-release: 15.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8

pandas: 0.18.1
nose: None
pip: 1.5.6
setuptools: 20.1.1
Cython: None
numpy: 1.11.0
scipy: 0.16.1
statsmodels: None
xarray: None
IPython: 4.1.1
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.0
openpyxl: 2.3.2
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: 0.7.4.None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

Contributor

TomAugspurger commented Jun 28, 2016

cut returns a categorical. By design, you can't append new categories:

In [6]: s1 = pd.Series(['a', 'b']).astype('category')

In [7]: s1.append(pd.Series(['b', 'c']).astype('category'))
...
ValueError: incompatible categories in categorical concat

I believe your code would work if you change the pd.cut(...) to pd.cut(...).categories.
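
If a plain (non-categorical) column is acceptable, one option is to ask pd.cut for integer bin indices via labels=False, or to convert the categorical result to a regular array. This is only a sketch of that idea, not necessarily the exact change suggested above; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

values = np.arange(1, 11)
bins = np.linspace(0, 10, 6)

# labels=False returns the 0-based integer bin index instead of a categorical
bin_index = pd.cut(values, bins=bins, labels=False)

# Alternatively, keep the labels but drop the categorical wrapper
labelled = np.asarray(pd.cut(values, bins=bins, labels=[1, 2, 3, 4, 5]))

print(bin_index)
print(labelled)
```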

lvphj commented Jun 28, 2016

If you change your example code slightly so there are no NEW categories being added:

In [6]: s1 = pd.Series(['a', 'b']).astype('category')
In [7]: s1.append(pd.Series(['b', 'a']).astype('category'))

then it runs OK. In the original problem, the pd.cut() function generates the same categories in each dataframe, namely 1 to 5, so no new categories are being added.
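
A quick way to confirm whether two series really share the same categories is to compare their dtypes before combining them. The sketch below uses pd.concat rather than Series.append (which was removed in pandas 2.0); s3 and combined are illustrative names.

```python
import pandas as pd

s1 = pd.Series(["a", "b"]).astype("category")
s2 = pd.Series(["b", "a"]).astype("category")
s3 = pd.Series(["b", "c"]).astype("category")

# astype("category") sorts the categories, so s1 and s2 share the same dtype
print(s1.dtype == s2.dtype)

# s3 introduces a new category "c", so its dtype differs from s1's
print(s1.dtype == s3.dtype)

# Combining series with identical categories keeps the category dtype
combined = pd.concat([s1, s2], ignore_index=True)
print(combined.dtype)
```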

Contributor

TomAugspurger commented Jun 28, 2016

Gotcha, here's a simpler example.

In [44]: df = pd.DataFrame(columns=['a'])

In [45]: a = pd.DataFrame({"a": pd.Categorical([1, 2, 3], ordered=True)})

In [46]: a.a
Out[46]:
0    1
1    2
2    3
Name: a, dtype: category
Categories (3, int64): [1 < 2 < 3]

In [47]: df.append(a).a
Out[47]:
0    1
1    2
2    3
Name: a, dtype: category
Categories (3, int64): [1, 2, 3]

So the orderedness of a is lost in the append. @lvphj any interest in digging through the traceback to see where it's lost?

TomAugspurger added this to the 0.19.0 milestone Jun 28, 2016

lvphj commented Jun 28, 2016

Certainly interested – but may not have the skill set.

Contributor

TomAugspurger commented Jun 28, 2016

👍 just post here if you have any questions. Either way, thanks for the report.

Just a hunch, but I would start looking in https://github.com/pydata/pandas/blob/1a9abc44bbfd65675fd99701fe33aad8805ab147/pandas/types/concat.py#L147

Contributor

jreback commented Jun 28, 2016

this is by definition. you need union_categorical
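
In current pandas this helper is exposed as pandas.api.types.union_categoricals (the import path comes from the present-day API, not from this thread). A small sketch of combining two categoricals whose categories differ:

```python
import pandas as pd
from pandas.api.types import union_categoricals

c1 = pd.Categorical(["a", "b"])
c2 = pd.Categorical(["b", "c"])

# The result's categories are the union of both inputs' categories
combined = union_categoricals([c1, c2])

print(list(combined))
print(list(combined.categories))
```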

Contributor

TomAugspurger commented Jun 28, 2016

@jreback I think my last example should work, no? E.g. df.append(a) should have an ordered categorical if a was ordered?

Member

jorisvandenbossche commented Jun 29, 2016

Yes, I think that is correct.

It only seems to happen when you start with an empty frame, or append an empty frame:

In [29]: a.append(df).a  # <-- append empty frame
Out[29]:
0    1
1    2
2    3
Name: a, dtype: category
Categories (3, int64): [1, 2, 3]

In [30]: a.append(a).a
Out[30]:
0    1
1    2
2    3
0    1
1    2
2    3
Name: a, dtype: category
Categories (3, int64): [1 < 2 < 3]
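
The ordered flag shown in the Categories line can also be checked programmatically through the .cat accessor. A minimal sketch, using pd.concat in place of DataFrame.append (removed in pandas 2.0). It covers only the non-empty case, because the result for an empty frame may differ between pandas versions.

```python
import pandas as pd

a = pd.DataFrame({"a": pd.Categorical([1, 2, 3], ordered=True)})

# The column starts out as an ordered categorical
print(a["a"].cat.ordered)

# Concatenating two frames with identical ordered categoricals
doubled = pd.concat([a, a])
print(doubled["a"].cat.ordered)
```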

Contributor

TomAugspurger commented Jun 29, 2016

Hmm, is the empty set of categories ordered or not? 😄 pd.Categorical([]).ordered is False (by default).
Is this worth special casing so that empty_categorical.append(ordered_categorical) becomes ordered? I think so, but maybe not.

cc @janschulz, thoughts?

Member

jorisvandenbossche commented Jun 29, 2016

Well, if we say that an empty series is ordered=False, then it should actually raise an error instead of changing the order of the result :-)
But actually, in this case, you don't have an empty categorical, but just an empty frame without dtype info, so in this case it should ignore the fact that that part is ordered or not.
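
The distinction between "an empty frame without dtype info" and "an empty categorical" can be seen by inspecting the column dtypes. A short sketch, with illustrative variable names:

```python
import pandas as pd

# An empty frame created from column names only carries no categorical dtype
untyped_empty = pd.DataFrame(columns=["a"])

# An empty frame whose column is explicitly an (unordered) empty categorical
categorical_empty = pd.DataFrame({"a": pd.Categorical([])})

print(untyped_empty["a"].dtype)
print(categorical_empty["a"].dtype)
print(categorical_empty["a"].cat.ordered)
```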

Member

jorisvandenbossche commented Jun 29, 2016

The problem is here: https://github.com/pydata/pandas/blob/1a9abc44bbfd65675fd99701fe33aad8805ab147/pandas/types/concat.py#L201

When concat is not dealing with only categoricals, but with a mixture of categoricals and object arrays, it takes the categories from the first categorical to concat, but not the other properties like ordered or not. Should be an easy fix to also pass ordered there.

Contributor

jankatins commented Jul 2, 2016

In [44]: df = pd.DataFrame(columns=['a'])
In [45]: a = pd.DataFrame({"a": pd.Categorical([1, 2, 3], ordered=True)})
In [47]: df.append(a).a

IMO there can be two interpretations:

  • If the df in the above append is seen as already having a category column (i.e. Categorical([])), then this should IMO error, because Categorical([]) is ordered=False (the default) and has different categories (= empty), and therefore these two shouldn't be appendable (as with two different categoricals in the column).
  • If the dataframe has no type information at all, the append is basically seen as setting the column to the new type, and it should retain the information in the appended categorical.
pandas.DataFrame({"a": pandas.Categorical([1,2,3])}).append(pandas.DataFrame({"a": pandas.Categorical([1,2])})) # -> Error as it should be...
pandas.DataFrame({"a": pandas.Categorical([])}).append(pandas.DataFrame({"a": pandas.Categorical([1,2])})) # -> Also errors...

The question is whether an empty column is the same as a categorical column without any values.

IMO that's the difference between these two dataframes:

# 1. defined column type, but no value
pandas.DataFrame({"a": [1]})[[False]].dtypes # -> dtype of a: int64
print(pandas.DataFrame(columns=["a"]).dtypes) # -> dtype of a: object
as_float = pandas.DataFrame({"a": [1.]})[[False]] # -> float64
as_float.append(pandas.DataFrame({"a": [1]})).dtypes # -> kept as float64 as that takes both int and float

and

# 2. no defined value
empty_object = pandas.DataFrame(columns=["a"])
print(empty_object.dtypes) # -> object
empty_object.append(pandas.DataFrame({"a": [1]})).dtypes # -> Float64 ??? why not int64 or object?

The first is just the usual "cast to something which can take both", which is the rule for everything but categorical. The second seems to be the upcast rule for int + object? So if the second follows the "normal rules", then IMO appending a categorical should also follow the usual categorical rules, i.e. raise an error.

Member

sinhrks commented Jul 19, 2016

I met the same problem in #13626 and wrote a short summary of the Series / Index differences.

How about the following spec:

  • concat 2 categories -> use the rule of union_categorical
  • concat category and other dtype (whose values are all in the category, including empty) -> category
    • this rule is applied regardless of order (if there is at least one category in the concatenated values)
    • properties like ordered should be preserved.
  • concat category and other dtype (whose values are not in the category) -> not category (dtype is inferred)
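
To make the second and third bullets concrete, here is a rough sketch of the proposed rule as a standalone function. concat_category_with_values is a hypothetical helper written for illustration; it is not a pandas API and does not describe what pandas actually does. It only handles the case where the categorical comes first.

```python
import pandas as pd


def concat_category_with_values(cat, other_values):
    """Hypothetical helper (not a pandas API) sketching the proposed rule."""
    other_values = list(other_values)
    combined = list(cat) + other_values
    if all(value in cat.categories for value in other_values):
        # All values (including none at all) fit: keep categories and ordered flag
        return pd.Series(
            pd.Categorical(combined, categories=cat.categories, ordered=cat.ordered)
        )
    # Some value is not a known category: let pandas infer a plain dtype
    return pd.Series(combined)


ordered = pd.Categorical([1, 2, 3], ordered=True)
print(concat_category_with_values(ordered, [2, 3]).dtype)
print(concat_category_with_values(ordered, [7]).dtype)
```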

sinhrks modified the milestones: 0.19.0, 0.20.0 Jul 19, 2016

jreback added a commit that referenced this issue Jul 29, 2016

ENH: union_categorical supports identical categories with ordered
xref #13410, #13524

Author: sinhrks <sinhrks@gmail.com>

Closes #13763 from sinhrks/union_categoricals_ordered and squashes the following commits:

9cadc4e [sinhrks] ENH: union_categorical supports identical categories with ordered
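
The behaviour named in that commit title can be tried out roughly as follows, assuming a pandas version that includes it and the present-day import path pandas.api.types.union_categoricals. The variable names are illustrative.

```python
import pandas as pd
from pandas.api.types import union_categoricals

first = pd.Categorical([1, 2, 3], ordered=True)
second = pd.Categorical([3, 1], categories=[1, 2, 3], ordered=True)

# Identical categories and both ordered: the ordering is kept
combined = union_categoricals([first, second])

print(combined.ordered)
print(list(combined))
```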