Appending categorical data should be more flexible #12699

wtadler · 2016-03-23T06:17:32Z

I ran into this issue today, and it seems like it should be a fairly common situation. I have imported two dataframes (using pandas.read_stata) of categorical data that I want to concatenate. One of them might not have an instance of every category that the other one has, so pandas won't concatenate. It seems like it would be more flexible if it could add all missing categories.

I know that this inflexibility is in the documentation, but I wonder why it exists. Is there a good reason why pandas shouldn't automatically append new categories as they are encountered?

In

import pandas as pd

s = pd.Series(['a', 'b'], dtype="category")
s2 = pd.Series(['a', 'c'], dtype="category")
s.append(s2)

Expected Output

0    a
1    b
0    a
1    c
dtype: category
Categories (3, object): [a, b, c]

Actual Output

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-40-e1bb85501be1> in <module>()
      4 s = pd.Series(['a', 'b'], dtype="category")
      5 s2 = pd.Series(['a', 'c'], dtype="category")
----> 6 s.append(s2)

/Users/will/anaconda/lib/python2.7/site-packages/pandas/core/series.pyc in append(self, to_append, verify_integrity)
   1575             to_concat = [self, to_append]
   1576         return concat(to_concat, ignore_index=False,
-> 1577                       verify_integrity=verify_integrity)
   1578 
   1579     def _binop(self, other, func, level=None, fill_value=None):

/Users/will/anaconda/lib/python2.7/site-packages/pandas/tools/merge.pyc in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, copy)
    833                        verify_integrity=verify_integrity,
    834                        copy=copy)
--> 835     return op.get_result()
    836 
    837 

/Users/will/anaconda/lib/python2.7/site-packages/pandas/tools/merge.pyc in get_result(self)
    979             # stack blocks
    980             if self.axis == 0:
--> 981                 new_data = com._concat_compat([x._values for x in self.objs])
    982                 name = com._consensus_name_attr(self.objs)
    983                 return (Series(new_data, index=self.new_axes[0], name=name)

/Users/will/anaconda/lib/python2.7/site-packages/pandas/core/common.pyc in _concat_compat(to_concat, axis)
   2722     elif 'category' in typs:
   2723         from pandas.core.categorical import _concat_compat
-> 2724         return _concat_compat(to_concat, axis=axis)
   2725 
   2726     if not nonempty:

/Users/will/anaconda/lib/python2.7/site-packages/pandas/core/categorical.pyc in _concat_compat(to_concat, axis)
   1948     for x in categoricals[1:]:
   1949         if not categories.is_dtype_equal(x):
-> 1950             raise ValueError("incompatible categories in categorical concat")
   1951 
   1952     # we've already checked that all categoricals are the same, so if their

ValueError: incompatible categories in categorical concat

output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 15.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.0
nose: 1.3.4
pip: 8.1.0
setuptools: 20.2.2
Cython: 0.21
numpy: 1.10.2
scipy: 0.16.1
statsmodels: 0.5.0
xarray: None
IPython: 4.0.1
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 2.5.0
pytz: 2016.1
blosc: None
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.5.0
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: 0.5.7
lxml: 3.4.0
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 0.9.7
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.32.1


jreback · 2016-03-23T11:43:21Z

xref #10409, where on a merge we are NOT retaining category types (as it's possible).

For other dtypes we are a bit friendlier in concat/merge ops, in that we will upcast to object if the types are incompatible (e.g. a concat of a datetime64[ns] and a datetime64[ns, US/Eastern]).

I think that if the concat/merge dtypes match then we should preserve, otherwise cast to object.

The current rationale is that category serves 2 masters: 1) it's a memory saver for duplicate entries, 2) it's an actual factor/categorical type, where the set membership is important.

So I think it would be ok to keep dtypes as appropriate and upcast if needed (only a small change is needed in core/categorical.py/_concat_compat).
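
For readers who want to see the "preserve if the dtypes match, otherwise upcast to object" rule in action, here is a minimal sketch. It assumes a recent pandas release, where `pd.concat` behaves this way as far as I know; it says nothing about how 0.18.0 behaved, and the variable names are only illustrative.

```python
import pandas as pd

# One timezone-naive and one timezone-aware datetime Series.
naive = pd.Series(pd.to_datetime(["2016-03-23"]))
aware = pd.Series(pd.to_datetime(["2016-03-23"]).tz_localize("US/Eastern"))

# Incompatible datetime dtypes are upcast to object instead of raising.
mixed = pd.concat([naive, aware], ignore_index=True)
print(mixed.dtype)

# Categoricals that share exactly the same dtype keep it through concat.
c1 = pd.Series(["a", "b"], dtype="category")
c2 = pd.Series(["b", "a"]).astype(c1.dtype)
same = pd.concat([c1, c2], ignore_index=True)
print(same.dtype)
```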

cc @JanSchulz
@sinhrks
@jorisvandenbossche


chengguangnan · 2016-03-24T04:21:45Z

I'd love to see this handled by pandas. Another thing about categorical data: calling fillna('') throws an exception, so I have to call cat.add_categories(['']).fillna(''). I feel that this should be handled automatically as well.
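
A small sketch of the workaround described above, assuming a recent pandas release. The exact exception type raised by the direct `fillna('')` call has differed between versions, so the sketch catches both `TypeError` and `ValueError` rather than relying on one.

```python
import pandas as pd

s = pd.Series(["a", None, "b"], dtype="category")

# Filling with a value that is not one of the categories is rejected.
try:
    s.fillna("")
    rejected = False
except (TypeError, ValueError):
    rejected = True
print("direct fillna rejected:", rejected)

# Workaround: register the fill value as a category first, then fill.
filled = s.cat.add_categories([""]).fillna("")
print(filled.tolist())
```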

jankatins · 2016-03-24T10:11:34Z

> I have to cat.add_categories(['']).fillna('') I feel that it should be handled automatically as well.

No, please no: this is only correct for your use of a categorical, and is wrong for things like objects, integers and so on. It is also wrong when the categorical is ordered: NA is "outside the order", but "" would be included, and where would it go? Is it "A" < "B" < "" or "" < "A" < "B"? It's also arbitrary: what if I want my fill value to be "--"?

Individual categoricals are akin to int, str or specific objects (or specific dtypes), in the sense that a categorical has a specific range of possible values (like int, which has a min and a max) and maybe an order (implicit for int, explicit for your own objects if you implement the right methods). You can't add an "A" to an array of ints without changing the array to object. The same should happen for categoricals: if you have a categorical A < B < C and add a D, it should either fail or change to an object array.
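
To make the analogy concrete, here is a rough sketch comparing a NumPy int array with an ordered `pd.Categorical`. It assumes a recent NumPy and pandas; the exception raised by the categorical assignment has varied across pandas versions, so both `TypeError` and `ValueError` are caught.

```python
import numpy as np
import pandas as pd

# An int array cannot hold "A" without changing its dtype.
ints = np.array([1, 2, 3])
try:
    ints[0] = "A"
    int_accepted = True
except ValueError:
    int_accepted = False

# An ordered categorical A < B < C likewise refuses a value outside its categories.
cat = pd.Categorical(["A", "B", "C"], categories=["A", "B", "C"], ordered=True)
try:
    cat[0] = "D"
    cat_accepted = True
except (TypeError, ValueError):
    cat_accepted = False

print(int_accepted, cat_accepted)
```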

See #8640 for a memory-saving string type, which would not have these constraints...

> So I think would be ok to keep dtypes as appropriate and upcast if needed (only small change is needed in core/categorical.py/_concat_compat.

My vote would go to "fail" because "explicit is better than implicit": I've seen too many (dict) lookups fail because of int/float conversions when an NA is included in an int column, and in this case it even has the memory penalty. On the other hand, it would be inconsistent with the rest of pandas, which does upcasting...

tdhopper · 2016-04-20T21:35:29Z

For those who need to concatenate now, here's a quick example of how it could be done (in Python 3): https://gist.github.com/tdhopper/91f03250892c12c6e0d35ca6d2ade1ca

jreback · 2016-04-20T22:56:36Z

@tdhopper nice example. In reality the best way to do this (before I fix this bug here) is to coerce to object, then recategorize the result. It's slightly complicated if you want to maintain the same factorization afterwards, though; this is where you would do something like this
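
The snippet that followed this comment is not preserved here. As a stand-in, the following is a minimal sketch of the "coerce to object, then recategorize" idea using the two Series from the original report. It is an illustration under that reading, not the code originally posted, and it uses `pd.concat` because `Series.append` no longer exists in pandas 2.x.

```python
import pandas as pd

s = pd.Series(["a", "b"], dtype="category")
s2 = pd.Series(["a", "c"], dtype="category")

# Step 1: coerce each piece to object so concat has no categories to compare.
as_object = pd.concat([s.astype(object), s2.astype(object)])

# Step 2: recategorize; the categories are inferred from the combined values.
combined = as_object.astype("category")

print(combined.cat.categories.tolist())
print(combined.tolist())
```

Note that the intermediate object Series holds every value as a Python object, which is where the memory cost of this approach comes from.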

tdhopper · 2016-04-21T12:36:01Z

@jreback I should have explained that my complete dataframe is so big that I can't convert it to objects without running out of RAM.

jreback · 2016-04-21T12:38:18Z

Then you can do it iteratively, e.g. chunk it, concat, convert back to categoricals. It's kind of like merging an in-memory frame with an on-disk one. dask can do these kinds of things pretty easily, FYI (doing 2 on-disk merges is quite a bit harder, though).
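
One possible way to combine chunks without ever materializing object arrays is to give every chunk the same category set before concatenating. The sketch below assumes a recent pandas release, where `pd.concat` keeps the category dtype when all inputs share it. The helper name `concat_categorical_chunks` is hypothetical, and this is not necessarily the procedure meant in the comment above; it also works purely in memory.

```python
import pandas as pd

def concat_categorical_chunks(chunks):
    """Hypothetical helper: concat categorical Series after aligning categories."""
    # Collect the union of categories, preserving first-seen order.
    seen = {}
    for chunk in chunks:
        seen.update(dict.fromkeys(chunk.cat.categories))
    all_categories = list(seen)

    # Re-encode each chunk against the shared category list (codes only, no objects).
    aligned = [chunk.cat.set_categories(all_categories) for chunk in chunks]

    # All inputs now share one dtype, so the result stays categorical.
    return pd.concat(aligned, ignore_index=True)

chunk_a = pd.Series(["a", "b"], dtype="category")
chunk_b = pd.Series(["a", "c"], dtype="category")
result = concat_categorical_chunks([chunk_a, chunk_b])
print(result.dtype)
```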

benitocm · 2016-06-11T12:17:58Z

Please, Jeff, could you elaborate a bit more on your last comment?

I need to concat on disk without losing the categories.

Thx

jreback · 2016-06-11T12:19:42Z

actually this is closed by #13361
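
For anyone landing here later: recent pandas releases ship `pandas.api.types.union_categoricals`, which combines categoricals with differing categories and produces the output requested in the original report. Whether that function is exactly what #13361 introduced is an assumption on my part; the sketch only relies on the function as it exists in current pandas.

```python
import pandas as pd
from pandas.api.types import union_categoricals

s = pd.Series(["a", "b"], dtype="category")
s2 = pd.Series(["a", "c"], dtype="category")

# union_categoricals returns a Categorical whose categories are the union of the inputs.
merged = pd.Series(union_categoricals([s, s2]))

print(merged.tolist())
print(merged.cat.categories.tolist())
```

The result gets a fresh 0..n-1 index, since `union_categoricals` works on the values only and does not carry the original indexes over.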

jreback added the Reshaping, API Design, Categorical, and Difficulty Advanced labels Mar 23, 2016

jreback added this to the 0.18.1 milestone Mar 23, 2016

wtadler referenced this issue in wtadler/attitudes-and-the-court Mar 23, 2016

preventing concatenation error by turning categorical data to strings

5ef5116

not the most elegant solution, but works for now. Pandas has already annoyed me this evening...

jreback modified the milestones: 0.18.1, 0.18.2 Apr 26, 2016

jreback closed this as completed Jun 11, 2016

adbull mentioned this issue Jun 23, 2016

API/ENH: unprotected Categorical #13506

Closed
