ENH: concat and append now can handle unordered categories (#13767)
Concatenating categoricals with non-matching categories will now return object dtype instead of raising an error.

* ENH: concat and append now can handle unordered categories

* remove union_categoricals kw from concat
sinhrks authored and jorisvandenbossche committed Sep 7, 2016
1 parent 3f3839b commit ab4bd36
Showing 9 changed files with 473 additions and 184 deletions.
58 changes: 53 additions & 5 deletions doc/source/categorical.rst
@@ -675,12 +675,60 @@ be lexsorted, use ``sort_categories=True`` argument.
union_categoricals([a, b], sort_categories=True)
.. note::
``union_categoricals`` also works with the "easy" case of combining two
categoricals having the same categories and order information
(e.g. those you could also ``append``).

.. ipython:: python
a = pd.Categorical(["a", "b"], ordered=True)
b = pd.Categorical(["a", "b", "a"], ordered=True)
union_categoricals([a, b])
The example below raises ``TypeError`` because the categories are ordered and not identical.

.. code-block:: ipython
In [1]: a = pd.Categorical(["a", "b"], ordered=True)
In [2]: b = pd.Categorical(["a", "b", "c"], ordered=True)
In [3]: union_categoricals([a, b])
Out[3]:
TypeError: to union ordered Categoricals, all categories must be the same
.. _categorical.concat:

Concatenation
~~~~~~~~~~~~~

This section describes concatenations specific to ``category`` dtype. See :ref:`Concatenating objects<merging.concat>` for a general description.

By default, concatenating ``Series`` or ``DataFrame`` which contain the same categories
results in ``category`` dtype; otherwise it results in ``object`` dtype.
Use ``.astype`` or ``union_categoricals`` to get a ``category`` result.

.. ipython:: python
# same categories
s1 = pd.Series(['a', 'b'], dtype='category')
s2 = pd.Series(['a', 'b', 'a'], dtype='category')
pd.concat([s1, s2])
# different categories
s3 = pd.Series(['b', 'c'], dtype='category')
pd.concat([s1, s3])
pd.concat([s1, s3]).astype('category')
union_categoricals([s1.values, s3.values])
The following table summarizes the results of concatenations involving ``Categoricals``.

| arg1     | arg2                                                    | result                     |
|----------|---------------------------------------------------------|----------------------------|
| category | category (identical categories)                         | category                   |
| category | category (different categories, both not ordered)       | object (dtype is inferred) |
| category | category (different categories, either one is ordered)  | object (dtype is inferred) |
| category | not category                                            | object (dtype is inferred) |
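A short sketch of the last two table rows (assuming pandas 0.19+ semantics; the series names are invented for illustration):

```python
import pandas as pd

s1 = pd.Series(['a', 'b'], dtype='category')

# Different categories and one side ordered: the union of categories
# fails, so concat falls back to object dtype.
s_ordered = pd.Series(pd.Categorical(['b', 'c'], ordered=True))
print(pd.concat([s1, s_ordered]).dtype)  # object

# A category operand combined with a plain (non-category) operand
# also produces object dtype.
s_plain = pd.Series(['b', 'c'])
print(pd.concat([s1, s_plain]).dtype)  # object
```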
Getting Data In/Out
-------------------
27 changes: 14 additions & 13 deletions doc/source/merging.rst
@@ -78,34 +78,35 @@ some configurable handling of "what to do with the other axes":
::

pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
keys=None, levels=None, names=None, verify_integrity=False,
copy=True)

- ``objs`` : a sequence or mapping of Series, DataFrame, or Panel objects. If a
dict is passed, the sorted keys will be used as the `keys` argument, unless
it is passed, in which case the values will be selected (see below). Any None
objects will be dropped silently unless they are all None in which case a
ValueError will be raised.
- ``axis`` : {0, 1, ...}, default 0. The axis to concatenate along.
- ``join`` : {'inner', 'outer'}, default 'outer'. How to handle indexes on
other axis(es). Outer for union and inner for intersection.
- ``ignore_index`` : boolean, default False. If True, do not use the index
values on the concatenation axis. The resulting axis will be labeled 0, ...,
n - 1. This is useful if you are concatenating objects where the
concatenation axis does not have meaningful indexing information. Note
the index values on the other axes are still respected in the join.
- ``join_axes`` : list of Index objects. Specific indexes to use for the other
n - 1 axes instead of performing inner/outer set logic.
- ``keys`` : sequence, default None. Construct hierarchical index using the
passed keys as the outermost level. If multiple levels passed, should
contain tuples.
- ``levels`` : list of sequences, default None. Specific levels (unique values)
to use for constructing a MultiIndex. Otherwise they will be inferred from the
keys.
- ``names`` : list, default None. Names for the levels in the resulting
hierarchical index.
- ``verify_integrity`` : boolean, default False. Check whether the new
concatenated axis contains duplicates. This can be very expensive relative
to the actual data concatenation.
- ``copy`` : boolean, default True. If False, do not copy data unnecessarily.
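A small sketch exercising a few of these parameters (the frames are invented for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'C': [7, 8]})

# join='outer' (the default) keeps the union of the other axis,
# filling missing entries with NaN.
outer = pd.concat([df1, df2], join='outer', ignore_index=True)
print(list(outer.columns))  # ['A', 'B', 'C']

# join='inner' keeps only the columns shared by all inputs.
inner = pd.concat([df1, df2], join='inner', ignore_index=True)
print(list(inner.columns))  # ['A']

# keys builds a hierarchical index on the concatenation axis.
keyed = pd.concat([df1, df2], keys=['one', 'two'])
print(keyed.index.get_level_values(0).unique().tolist())  # ['one', 'two']
```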

Without a little bit of context and example many of these arguments don't make
65 changes: 49 additions & 16 deletions doc/source/whatsnew/v0.19.0.txt
@@ -15,6 +15,8 @@ Highlights include:

- :func:`merge_asof` for asof-style time-series joining, see :ref:`here <whatsnew_0190.enhancements.asof_merge>`
- ``.rolling()`` are now time-series aware, see :ref:`here <whatsnew_0190.enhancements.rolling_ts>`
- :func:`read_csv` now supports parsing ``Categorical`` data, see :ref:`here <whatsnew_0190.enhancements.read_csv_categorical>`
- A function :func:`union_categoricals` has been added for combining categoricals, see :ref:`here <whatsnew_0190.enhancements.union_categoricals>`
- pandas development api, see :ref:`here <whatsnew_0190.dev_api>`
- ``PeriodIndex`` now has its own ``period`` dtype, and changed to be more consistent with other ``Index`` classes. See :ref:`here <whatsnew_0190.api.period>`
- Sparse data structures now gained enhanced support of ``int`` and ``bool`` dtypes, see :ref:`here <whatsnew_0190.sparse>`
@@ -218,7 +220,7 @@ they are in the file or passed in as the ``names`` parameter (:issue:`7160`, :is
data = '0,1,2\n3,4,5'
names = ['a', 'b', 'a']

Previous Behavior:

.. code-block:: ipython

@@ -231,7 +233,7 @@
The first ``a`` column contains the same data as the second ``a`` column, when it should have
contained the values ``[0, 3]``.

New Behavior:

.. ipython:: python

@@ -277,6 +279,38 @@ Individual columns can be parsed as a ``Categorical`` using a dict specification
df['col3'].cat.categories = pd.to_numeric(df['col3'].cat.categories)
df['col3']

.. _whatsnew_0190.enhancements.union_categoricals:

Categorical Concatenation
^^^^^^^^^^^^^^^^^^^^^^^^^

- A function :func:`union_categoricals` has been added for combining categoricals, see :ref:`Unioning Categoricals<categorical.union>` (:issue:`13361`, :issue:`13763`, :issue:`13846`)

.. ipython:: python

from pandas.types.concat import union_categoricals
a = pd.Categorical(["b", "c"])
b = pd.Categorical(["a", "b"])
union_categoricals([a, b])

- ``concat`` and ``append`` can now concatenate ``category`` dtypes with different
``categories`` as ``object`` dtype (:issue:`13524`)

Previous Behavior:

.. code-block:: ipython

In [1]: s1 = pd.Series(['a', 'b'], dtype='category')
In [2]: s2 = pd.Series(['b', 'c'], dtype='category')
In [3]: pd.concat([s1, s2])
ValueError: incompatible categories in categorical concat

New Behavior:

.. ipython:: python

pd.concat([s1, s2])

.. _whatsnew_0190.enhancements.semi_month_offsets:

Semi-Month Offsets
@@ -378,11 +412,11 @@ get_dummies dtypes

The ``pd.get_dummies`` function now returns dummy-encoded columns as small integers, rather than floats (:issue:`8725`). This should provide an improved memory footprint.

Previous Behavior:

.. code-block:: ipython

In [1]: pd.get_dummies(['a', 'b', 'a', 'c']).dtypes

Out[1]:
a float64
@@ -404,7 +438,7 @@ Other enhancements

- The ``.get_credentials()`` method of ``GbqConnector`` can now first try to fetch `the application default credentials <https://developers.google.com/identity/protocols/application-default-credentials>`__. See the :ref:`docs <io.bigquery_authentication>` for more details (:issue:`13577`).

- The ``.tz_localize()`` method of ``DatetimeIndex`` and ``Timestamp`` has gained the ``errors`` keyword, so you can potentially coerce nonexistent timestamps to ``NaT``. The default behavior remains to raise a ``NonExistentTimeError`` (:issue:`13057`)
- ``pd.to_numeric()`` now accepts a ``downcast`` parameter, which will downcast the data if possible to smallest specified numerical dtype (:issue:`13352`)
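As a sketch of the ``downcast`` keyword described above (the sample values are invented for illustration):

```python
import pandas as pd

s = pd.Series([1, 2, 3])

# Without downcast, the result keeps the default 64-bit dtype.
print(pd.to_numeric(s).dtype)  # int64

# downcast chooses the smallest dtype of the requested kind that
# can hold the data without loss.
print(pd.to_numeric(s, downcast='unsigned').dtype)  # uint8
print(pd.to_numeric(s, downcast='integer').dtype)   # int8
print(pd.to_numeric(pd.Series([1.0, 2.0]), downcast='float').dtype)  # float32
```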

.. ipython:: python
@@ -448,7 +482,6 @@ Other enhancements
- ``DataFrame`` has gained the ``.asof()`` method to return the last non-NaN values according to the selected subset (:issue:`13358`)
- The ``DataFrame`` constructor will now respect key ordering if a list of ``OrderedDict`` objects are passed in (:issue:`13304`)
- ``pd.read_html()`` has gained support for the ``decimal`` option (:issue:`12907`)
- ``Series`` has gained the properties ``.is_monotonic``, ``.is_monotonic_increasing``, ``.is_monotonic_decreasing``, similar to ``Index`` (:issue:`13336`)
- ``DataFrame.to_sql()`` now allows a single value as the SQL type for all columns (:issue:`11886`).
- ``Series.append`` now supports the ``ignore_index`` option (:issue:`13677`)
@@ -512,7 +545,7 @@ API changes
``Series.tolist()`` will now return Python types
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

``Series.tolist()`` will now return Python types in the output, mimicking NumPy ``.tolist()`` behavior (:issue:`10904`)


.. ipython:: python
@@ -547,7 +580,7 @@ including ``DataFrame`` (:issue:`1134`, :issue:`4581`, :issue:`13538`)

.. warning::
Until 0.18.1, comparing ``Series`` with the same length would succeed even if
the ``.index`` are different (the result ignores ``.index``). As of 0.19.0, this will raise ``ValueError`` to be more strict. This section also describes how to keep the previous behavior or align different indexes, using the flexible comparison methods like ``.eq``.


As a result, ``Series`` and ``DataFrame`` operators behave as below:
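For instance, a minimal sketch of the stricter ``Series`` comparison (the labels and values are invented for illustration):

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([2, 2, 2], index=['a', 'b', 'd'])

# Identically-labeled Series still compare element-wise.
print((s1 == s1).tolist())  # [True, True, True]

# Series of equal length but different labels now raise instead of
# silently ignoring the index.
try:
    s1 == s2
except ValueError as exc:
    print('ValueError:', exc)

# To compare purely by position, drop down to the underlying values.
print((s1.values == s2.values).tolist())  # [False, True, False]
```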
@@ -615,7 +648,7 @@ Logical operators

Logical operators align both ``.index``.

Previous behavior (``Series``), only the left hand side ``index`` is kept:

.. code-block:: ipython

@@ -935,7 +968,7 @@ Index ``+`` / ``-`` no longer used for set operations
Addition and subtraction of the base Index type and of DatetimeIndex
(not the numeric index types)
previously performed set operations (set union and difference). This
behavior was already deprecated since 0.15.0 (in favor of using the specific
``.union()`` and ``.difference()`` methods), and is now disabled. When
possible, ``+`` and ``-`` are now used for element-wise operations, for
example for concatenating strings or subtracting datetimes
@@ -956,13 +989,13 @@
pd.Index(['a', 'b']) + pd.Index(['a', 'c'])

Note that numeric Index objects already performed element-wise operations.
For example, the behavior of adding two integer Indexes:

.. ipython:: python

pd.Index([1, 2, 3]) + pd.Index([2, 3, 4])

is unchanged. The base ``Index`` is now made consistent with this behavior.

Further, because of this change, it is now possible to subtract two
DatetimeIndex objects resulting in a TimedeltaIndex:
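A minimal sketch of such a subtraction (the dates are invented for illustration):

```python
import pandas as pd

idx1 = pd.DatetimeIndex(['2016-01-02', '2016-01-03'])
idx2 = pd.DatetimeIndex(['2016-01-01', '2016-01-01'])

# Element-wise subtraction of two DatetimeIndexes yields a TimedeltaIndex.
delta = idx1 - idx2
print(type(delta).__name__)  # TimedeltaIndex
print(delta)
```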
@@ -1130,7 +1163,7 @@ the result of calling :func:`read_csv` without the ``chunksize=`` argument.

data = 'A,B\n0,1\n2,3\n4,5\n6,7'

Previous Behavior:

.. code-block:: ipython

@@ -1142,7 +1175,7 @@
0 4 5
1 6 7

New Behavior:

.. ipython:: python

@@ -1268,7 +1301,7 @@ These types are the same on many platforms, but for 64-bit python on Windows,
``np.int_`` is 32 bits, and ``np.intp`` is 64 bits. Changing this behavior improves performance for many
operations on that platform.

Previous Behavior:

.. code-block:: ipython

@@ -1277,7 +1310,7 @@
In [2]: i.get_indexer(['b', 'b', 'c']).dtype
Out[2]: dtype('int32')

New Behavior:

.. code-block:: ipython

7 changes: 3 additions & 4 deletions pandas/core/internals.py
@@ -4787,10 +4787,9 @@ def concatenate_block_managers(mgrs_indexers, axes, concat_axis, copy):
[get_mgr_concatenation_plan(mgr, indexers)
for mgr, indexers in mgrs_indexers], concat_axis)

blocks = [make_block(
concatenate_join_units(join_units, concat_axis, copy=copy),
placement=placement) for placement, join_units in concat_plan]

return BlockManager(blocks, axes)

4 changes: 2 additions & 2 deletions pandas/tests/series/test_combine_concat.py
Expand Up @@ -185,9 +185,9 @@ def test_concat_empty_series_dtypes(self):
'category')
self.assertEqual(pd.concat([Series(dtype='category'),
Series(dtype='float64')]).dtype,
'float64')
self.assertEqual(pd.concat([Series(dtype='category'),
Series(dtype='object')]).dtype, 'object')

# sparse
result = pd.concat([Series(dtype='float64').to_sparse(), Series(