Appending categorical data should be more flexible #12699

Closed
wtadler opened this issue Mar 23, 2016 · 9 comments
Labels
API Design Categorical Categorical Data Type Reshaping Concat, Merge/Join, Stack/Unstack, Explode
@wtadler

wtadler commented Mar 23, 2016

I ran into this issue today, and it seems like it should be a fairly common situation. I have imported two dataframes (using pandas.read_stata) of categorical data that I want to concatenate. One of them might not have an instance of every category that the other one has, so pandas won't concatenate. It seems like it would be more flexible if it could add all missing categories.

I know that this inflexibility is in the documentation, but I wonder why it exists. Is there a good reason why pandas shouldn't automatically append new categories as they are encountered?

In

s = pd.Series(['a', 'b'], dtype="category")
s2 = pd.Series(['a', 'c'], dtype="category")
s.append(s2)

Expected Output

0    a
1    b
0    a
1    c
dtype: category
Categories (3, object): [a, b, c]

Actual Output

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-40-e1bb85501be1> in <module>()
      4 s = pd.Series(['a', 'b'], dtype="category")
      5 s2 = pd.Series(['a', 'c'], dtype="category")
----> 6 s.append(s2)

/Users/will/anaconda/lib/python2.7/site-packages/pandas/core/series.pyc in append(self, to_append, verify_integrity)
   1575             to_concat = [self, to_append]
   1576         return concat(to_concat, ignore_index=False,
-> 1577                       verify_integrity=verify_integrity)
   1578 
   1579     def _binop(self, other, func, level=None, fill_value=None):

/Users/will/anaconda/lib/python2.7/site-packages/pandas/tools/merge.pyc in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, copy)
    833                        verify_integrity=verify_integrity,
    834                        copy=copy)
--> 835     return op.get_result()
    836 
    837 

/Users/will/anaconda/lib/python2.7/site-packages/pandas/tools/merge.pyc in get_result(self)
    979             # stack blocks
    980             if self.axis == 0:
--> 981                 new_data = com._concat_compat([x._values for x in self.objs])
    982                 name = com._consensus_name_attr(self.objs)
    983                 return (Series(new_data, index=self.new_axes[0], name=name)

/Users/will/anaconda/lib/python2.7/site-packages/pandas/core/common.pyc in _concat_compat(to_concat, axis)
   2722     elif 'category' in typs:
   2723         from pandas.core.categorical import _concat_compat
-> 2724         return _concat_compat(to_concat, axis=axis)
   2725 
   2726     if not nonempty:

/Users/will/anaconda/lib/python2.7/site-packages/pandas/core/categorical.pyc in _concat_compat(to_concat, axis)
   1948     for x in categoricals[1:]:
   1949         if not categories.is_dtype_equal(x):
-> 1950             raise ValueError("incompatible categories in categorical concat")
   1951 
   1952     # we've already checked that all categoricals are the same, so if their

ValueError: incompatible categories in categorical concat

output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 15.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.0
nose: 1.3.4
pip: 8.1.0
setuptools: 20.2.2
Cython: 0.21
numpy: 1.10.2
scipy: 0.16.1
statsmodels: 0.5.0
xarray: None
IPython: 4.0.1
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 2.5.0
pytz: 2016.1
blosc: None
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.5.0
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: 0.5.7
lxml: 3.4.0
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 0.9.7
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.32.1

@jreback
Contributor

jreback commented Mar 23, 2016

xref #10409, where on a merge we are NOT retaining category types (as it's possible).

For other dtypes we are a bit friendlier in concat/merge ops, in that we will upcast to object if the types are incompatible (e.g. a concat of a datetime64[ns] and a datetime64[ns, US/Eastern]).

I think that if the concat/merge dtypes match then we should preserve, otherwise cast to object.

The current rationale is that category serves 2 masters: 1) it's a memory saver for duplicate entries, and 2) it's an actual factor/categorical type, where the set membership is important.

So I think it would be ok to keep dtypes as appropriate and upcast if needed (only a small change is needed in core/categorical.py/_concat_compat).
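
The rule proposed here (preserve the dtype when the categoricals match, otherwise cast to object) can be sketched as follows. This is only an illustration written against `pd.concat`; the helper name `concat_keep_or_upcast` is hypothetical, and this is not the change made inside `_concat_compat`.

```python
import pandas as pd


def concat_keep_or_upcast(left, right):
    """Hypothetical helper: concatenate two categorical Series under the proposed rule."""
    if left.dtype == right.dtype:
        # Matching categorical dtypes: the category dtype is preserved.
        return pd.concat([left, right], ignore_index=True)
    # Mismatched categories: fall back to plain object values.
    return pd.concat(
        [left.astype(object), right.astype(object)], ignore_index=True
    )


s = pd.Series(["a", "b"], dtype="category")
same = pd.Series(["b", "a"], dtype="category")   # categories: a, b
other = pd.Series(["a", "c"], dtype="category")  # categories: a, c

kept = concat_keep_or_upcast(s, same)     # stays categorical
upcast = concat_keep_or_upcast(s, other)  # becomes object
```

The caller pays the memory cost of object values only when the categories actually differ.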

cc @JanSchulz
@sinhrks
@jorisvandenbossche

@jreback jreback added Reshaping Concat, Merge/Join, Stack/Unstack, Explode API Design Categorical Categorical Data Type Difficulty Advanced labels Mar 23, 2016
@jreback jreback added this to the 0.18.1 milestone Mar 23, 2016
wtadler referenced this issue in wtadler/attitudes-and-the-court Mar 23, 2016
not the most elegant solution, but works for now. Pandas has already annoyed me this evening...
@chengguangnan

I'd love to see this handled by pandas. Another thing about categorical data is that calling fillna('') will throw an exception, so I have to call cat.add_categories(['']).fillna(''). I feel that it should be handled automatically as well.
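
A small sketch of the `fillna` workaround mentioned above, assuming a Series of strings with one missing value. The exception type raised by the direct call differs between pandas versions, so both likely candidates are caught here.

```python
import pandas as pd

s = pd.Series(["a", None, "b"], dtype="category")

try:
    # "" is not an existing category, so the direct call is rejected.
    s.fillna("")
    direct_fill_worked = True
except (TypeError, ValueError):
    direct_fill_worked = False

# Workaround: register "" as a category first, then fill.
filled = s.cat.add_categories([""]).fillna("")
```

The result is still categorical, with `""` appended to the end of its categories.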

@jankatins
Contributor

> I have to cat.add_categories(['']).fillna('') I feel that it should be handled automatically as well.

No, please no: this is only correct for your use of a categorical, and is wrong for things like objects, integers and so on. It is also wrong when the categorical is ordered (NA is "outside the order", but "" would be included, but where? Is it "A" < "B" < "" or "" < "A" < "B"). It's also arbitrary because what if I want my fill value to be "--"?

Individual categoricals are akin to int, str or specific objects (or specific dtypes), in the sense that a categorical has a specific range of possible values (like int, which has a min and max int) and maybe an order (implicit for int, explicit for your own objects if you implement the right methods). You can't add an "A" to an array of ints without changing the array to object. The same should happen for categoricals: if you have a categorical A < B < C and add a D, it should either fail or change to an object array.

See #8640 for a memory-saving string type, which would not have these constraints...

> So I think would be ok to keep dtypes as appropriate and upcast if needed (only small change is needed in core/categorical.py/_concat_compat).

My vote would go to "fail" because "explicit is better than implicit": I've seen too many (dict) lookups fail because of int/float conversions when an NA is included in an int column, and in this case it even has the memory penalty. On the other hand, it would be inconsistent with the rest of pandas, which does upcasting...

@tdhopper
Contributor

For those who need to concatenate now, here's a quick example of how it could be done (in Python 3): https://gist.github.com/tdhopper/91f03250892c12c6e0d35ca6d2ade1ca

@jreback
Contributor

jreback commented Apr 20, 2016

@tdhopper nice example. In reality the best way to do this (before I fix this bug here) is to coerce to object, then recategorize the result. It's slightly complicated if you want to maintain the same factorization afterwards, though; this is where you would do something like this
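
A minimal sketch of the coerce-then-recategorize approach described above, using the two Series from the original report. It is an illustration under stated assumptions, not the code from the original comment: it assumes a pandas version that provides `pandas.api.types.CategoricalDtype`, and it keeps the factorization stable by appending unseen categories after the existing ones.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

s = pd.Series(["a", "b"], dtype="category")
s2 = pd.Series(["a", "c"], dtype="category")

# Keep the first Series' categories in place and append the unseen ones,
# so the codes already assigned to "a" and "b" do not change.
categories = list(s.cat.categories) + [
    c for c in s2.cat.categories if c not in s.cat.categories
]

# Coerce to object, concatenate, then recategorize with the explicit categories.
combined = pd.concat([s.astype(object), s2.astype(object)], ignore_index=True)
combined = combined.astype(CategoricalDtype(categories=categories))
```

Passing the categories explicitly, rather than calling `astype("category")` alone, is what fixes the code assigned to each value.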

@tdhopper
Contributor

tdhopper commented Apr 21, 2016

@jreback Should've explained that my complete dataframe is so big that I can't convert to objects without running out of ram.

@jreback
Contributor

jreback commented Apr 21, 2016

Then you can do it iteratively, e.g. chunk it, concat, and convert back to categoricals, kind of like merging an in-memory frame with an on-disk one. dask can do these kinds of things pretty easily, FYI (doing 2 on-disk merges is quite a bit harder though).
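
For the memory-constrained case discussed above, one alternative skips the object round-trip altogether. This is a different technique from the chunked approach just described: it assumes a pandas release that ships `pandas.api.types.union_categoricals`, and the helper name `concat_categorical` is hypothetical.

```python
import pandas as pd
from pandas.api.types import union_categoricals


def concat_categorical(pieces):
    """Hypothetical helper: combine categorical Series without casting to object."""
    # union_categoricals merges the category sets and recodes the values;
    # it returns a Categorical, so the original indexes are dropped here.
    return pd.Series(union_categoricals(list(pieces)))


s = pd.Series(["a", "b"], dtype="category")
s2 = pd.Series(["a", "c"], dtype="category")

result = concat_categorical([s, s2])
```

Because only the integer codes and the small category arrays are touched, this avoids materializing every value as a Python object.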

@jreback jreback modified the milestones: 0.18.1, 0.18.2 Apr 26, 2016
@benitocm

Please Jeff, could you elaborate a bit more on your last comment?

I would need to concat on disk without losing the categories.

Thx

@jreback
Contributor

jreback commented Jun 11, 2016

Actually, this is closed by #13361.
