
BUG: concat on axis with both different and duplicate labels raising error #6963

Status: Closed
jorisvandenbossche opened this issue Apr 25, 2014 · 16 comments · Fixed by #38654
Labels: Bug; Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)

@jorisvandenbossche (Member):

When concatenating two dataframes where (a) one of the dataframes has duplicate columns, and (b) the two have non-overlapping column names, you get an IndexError:

In [9]: df1 = pd.DataFrame(np.random.randn(3,3), columns=['A', 'A', 'B1'])
   ...: df2 = pd.DataFrame(np.random.randn(3,3), columns=['A', 'A', 'B2'])

In [10]: pd.concat([df1, df2])

Traceback (most recent call last):
  File "<ipython-input-10-f61a1ab4009e>", line 1, in <module>
    pd.concat([df1, df2])
...
  File "c:\users\vdbosscj\scipy\pandas-joris\pandas\core\index.py", line 765, in take
    taken = self.view(np.ndarray).take(indexer)
IndexError: index 3 is out of bounds for axis 0 with size 3

I don't know if it should work (although I suppose it should, since it does work with only the duplicate columns), but at least the error message is not really helpful.
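Until this is fixed, one workaround sketch (a hypothetical `dedup_columns` helper, not a pandas API) is to make the duplicate labels temporarily unique before concatenating, then restore them:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randn(3, 3), columns=['A', 'A', 'B1'])
df2 = pd.DataFrame(np.random.randn(3, 3), columns=['A', 'A', 'B2'])

def dedup_columns(df):
    # Append a positional suffix to repeated labels: A, A -> A__0, A__1
    counts = {}
    renamed = []
    for col in df.columns:
        n = counts.get(col, 0)
        counts[col] = n + 1
        renamed.append(f"{col}__{n}")
    out = df.copy()
    out.columns = renamed
    return out

# With unique column labels the concat succeeds; strip the suffixes afterwards.
result = pd.concat([dedup_columns(df1), dedup_columns(df2)])
result.columns = [c.rsplit("__", 1)[0] for c in result.columns]
```

This relies on duplicate labels aligning positionally between the two frames, which is exactly the ordering assumption debated later in this thread.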

@jorisvandenbossche changed the title to "BUG: concat dataframes with both different and duplicate column names causing IndexError" (Apr 25, 2014)
@jreback (Contributor) commented Apr 25, 2014:

cc @immerrr

I think after #6745 this will be straightforward to fix

@jreback added this to the 0.15.0 milestone (Apr 25, 2014)
@immerrr (Contributor) commented Apr 25, 2014:

The main issue is how to align indices that both have duplicate items; as of now, indexing with dupes does strange things:

In [1]: pd.Index([1,1,2])
Out[1]: Int64Index([1, 1, 2], dtype='int64')

In [2]: _1.get_indexer_for(_1)
Out[2]: Int64Index([0, 1, 0, 1, 2], dtype='int64')

Apparently, for each non-unique element found in the destination, get_indexer tries to insert all locations of this element. I can hardly think of a use case where I'd want to do a reindex(['x', 'y']) and get ['x', 'x', 'x', 'y'] instead.

@jreback (Contributor) commented Apr 25, 2014:

The duplicated result came out of using a duplicated indexer on a unique index, in which case you have to duplicate:

In [1]: df = DataFrame(np.arange(10).reshape(5,2))

In [2]: df
Out[2]: 
   0  1
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9

[5 rows x 2 columns]

In [5]: df.loc[:,[0,1,1,0]]
Out[5]: 
   0  1  1  0
0  0  1  1  0
1  2  3  3  2
2  4  5  5  4
3  6  7  7  6
4  8  9  9  8

[5 rows x 4 columns]

Now maybe we should outlaw that, I suppose.

@immerrr (Contributor) commented Apr 25, 2014:

Maybe it would make more sense to require the destination index to have the same count of duplicate entries for each element present in the source, i.e.:

# e.g. these should be ok
pd.Index([1,1,2]).get_indexer_for([1,1,2]) # should be ok and return [0, 1, 2]
pd.Index([1,1,2]).get_indexer_for([2]) # return [2]
pd.Index([1,1,2]).get_indexer_for([1,2,1]) # return [0, 2, 1]

# but these should be forbidden
pd.Index([1,1,2]).get_indexer_for([1,2]) # which one of `1` did you want?
pd.Index([1,1,2]).get_indexer_for([1,1,1,2]) # which `1` should be duplicated?

UPD: or maybe cycle over duplicate elements like np.putmask does...

x = np.arange(5)
np.putmask(x, x > 1, [-33, -44])
x
array([  0,   1, -33, -44, -33])

# so that
pd.Index([1,1,2]).get_indexer_for([1,1,2,1,1]) # return [0,1,2,0,1] ?

@jreback (Contributor) commented Apr 25, 2014:

right, so that's a duplicate of a duplicate; yes, that would need to be handled differently

@jreback modified the milestones: 0.16.0, Next Major Release (Mar 3, 2015)
@jorisvandenbossche changed the title from "BUG: concat dataframes with both different and duplicate column names causing IndexError" to "BUG: concat on axis with both different and duplicate labels raising error" (Sep 16, 2017)
@jorisvandenbossche (Member, Author) commented:

Some more examples from #17552 (with axis=1 and different error messages, but the idea is the same: you only get the error when there are duplicate + different labels):

In [113]: df1 = pd.DataFrame(np.zeros((2,2)), index=[0, 1], columns=['a', 'b'])
     ...: df2 = pd.DataFrame(np.ones((2,2)), index=[2, 2], columns=['c', 'd'])

In [114]: pd.concat([df1, df2], axis=1)
...
ValueError: Shape of passed values is (4, 6), indices imply (4, 4)

In [115]: df1 = pd.DataFrame(np.zeros((2,2)), index=[0, 1], columns=['a', 'b'])
     ...: df2 = pd.DataFrame(np.ones((2,2)), index=pd.DatetimeIndex([2, 2]), columns=['c', 'd'])

In [116]: pd.concat([df1, df2], axis=1)
...
TypeError: 'NoneType' object is not iterable

@xuancong84 commented Jan 13, 2020:

In general, for maximum compatibility, we should treat a table as a symmetric rank-2 tensor (except for axis labels: the 1st direction is called 'column', the 2nd 'row'). Thus, no matter whether there are duplicate row indices, duplicate column names, or both, we should always be able to perform concatenation along either axis without throwing any errors.

Below is a very elegant design that can grant maximum compatibility. Suppose there are duplicate names in both the row indices and the column names:
[image: example dataframes df1 and df2 with duplicate row and column labels]

  1. For concatenation along axis 0 (pd.concat([df1, df2], axis=0), i.e., vertical concatenation of rows), the first two column A's in df1 and df2 should align to each other, leaving empty values (or NaN) for the 3rd column A in df1, because df1 does not have a 3rd column A.

[image: result of axis-0 concatenation]

  2. For concatenation along axis 1 (pd.concat([df1, df2], axis=1), i.e., horizontal concatenation of columns), the first two row 0's in df1 and df2 should align to each other, leaving empty values (or NaN) for the 3rd row 0 in df1, because df1 does not have a 3rd row 0.

[image: result of axis-1 concatenation]

  3. So in the most generic case, for concatenation along any axis, duplicate labels along that axis are kept and appended (e.g., 2 'A' plus 3 'A' = 5 'A'), while duplicate labels along the other axis are merged with multiplicity-count consistency (e.g., when 2 'A' merge with 3 'A', the first 2 'A' correspond to each other and the last 'A' gets empty/NaN values).

These should solve many DataFrame concatenation bugs and crashes in Pandas!

@jreback (Contributor) commented Jan 13, 2020:

@xuancong84 this is anti-pandas philosophy

having order dependencies is extremely fragile and would work unexpectedly if you happened to reorder columns or not

we must align columns on the labels; aligning some of them is just really odd

if you want to treat this as a tensor then drop the labels

-1 on this proposal

@xuancong84 commented Jan 14, 2020:

@jreback Sorry, I could not get what you mean. Could you be more specific and illustrate with examples?

Regardless of how you feel, while developing the omnipotent data plotter for pandas (https://github.com/xuancong84/beiwe-visualizer), I am currently experiencing lots of frustrating limitations in pandas. A lot of situations that could be handled easily are not handled very well. You might want to take a look at my code to see how much trouble it takes to work around them and get things working.

Regarding order dependencies, I know it is not ideal because a Python dictionary does not preserve order (unless you use OrderedDict). But in cases where there are duplicate names in both the row indices and the column names, that is the only way to make things work. Otherwise, pandas just ends up with crashes such as #28479 and #30772 where it should in principle work correctly.

@gitPrinz commented:

I understand it is not easy to fix, because we don't know how to order the columns? I don't really get it, but I can imagine it is a problem.

Could there maybe at least be an exception with a helpful message?

I stumbled on it today, and it took some time to understand the problem. And others seem to have the same problem.

@immerrr (Contributor) commented Mar 17, 2020:

Notifications keep bringing me here :) I haven't touched the pandas codebase for a while, so take my 2¢ with a grain of salt.

having order dependencies is extremely fragile

I agree with that: they are fragile and unreliable. And as a maintainer of other projects I get the sentiment of not adding stuff unless really necessary, even if it is conceptual stuff, like "in case of indexing non-unique indexes with non-unique indexer (re-reading this sentence hurts), matching is performed according to the order of the labels".

But from a pandas end-user perspective I do see this as a UX papercut. Yes, most of the time you shouldn't care about the ordering of columns or rows, but there are cases where there is no way around it. At that point you either verify the existing ordering or sort by a given criterion, and then for a short period of time you can rely on a specific order to perform a specific operation.

A good example of this would be forward-/backward-filling of NAs: ordering along the filling axis will directly influence the outcome, so before applying that I would need to make sure the data is ordered as I want it to be. The same approach could be applicable here: if you need to concatenate dataframes with non-unique labels and they are not in the order you want them to be, it's up to you to sort them first.

@ivirshup (Contributor) commented:

I have a few more cases of the error messages being much less helpful than they could be:

pd.concat([  # One dataframe has repeated column names
    pd.DataFrame(np.ones((4, 4)), columns=list("aabc")),
    pd.DataFrame(np.ones((4, 3)), columns=list("abc")),
])
ValueError: Plan shapes are not aligned
pd.concat([  # Repeated columns (same amount) different column ordering
    pd.DataFrame(np.ones((2, 4)), columns=list("aabc")),
    pd.DataFrame(np.ones((2, 4)), columns=list("abca")),
])
AssertionError: Number of manager items must equal union of block items
Full tracebacks:
>>> import pandas as pd, numpy as np
>>> pd.concat([  # One dataframe has repeated column names
...     pd.DataFrame(np.ones((4, 4)), columns=list("aabc")),
...     pd.DataFrame(np.ones((4, 3)), columns=list("abc")),
... ])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.8/site-packages/pandas/core/reshape/concat.py", line 287, in concat
    return op.get_result()
  File "/usr/local/lib/python3.8/site-packages/pandas/core/reshape/concat.py", line 502, in get_result
    new_data = concatenate_block_managers(
  File "/usr/local/lib/python3.8/site-packages/pandas/core/internals/concat.py", line 54, in concatenate_block_managers
    for placement, join_units in concat_plan:
  File "/usr/local/lib/python3.8/site-packages/pandas/core/internals/concat.py", line 561, in _combine_concat_plans
    raise ValueError("Plan shapes are not aligned")
ValueError: Plan shapes are not aligned
>>> pd.concat([  # Repeated columns (same amount) different column ordering
...     pd.DataFrame(np.ones((4, 4)), columns=list("aabc")),
...     pd.DataFrame(np.ones((4, 4)), columns=list("abca")),
... ])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.8/site-packages/pandas/core/reshape/concat.py", line 287, in concat
    return op.get_result()
  File "/usr/local/lib/python3.8/site-packages/pandas/core/reshape/concat.py", line 502, in get_result
    new_data = concatenate_block_managers(
  File "/usr/local/lib/python3.8/site-packages/pandas/core/internals/concat.py", line 84, in concatenate_block_managers
    return BlockManager(blocks, axes)
  File "/usr/local/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 149, in __init__
    self._verify_integrity()
  File "/usr/local/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 331, in _verify_integrity
    raise AssertionError(
AssertionError: Number of manager items must equal union of block items
# manager items: 3, # tot_items: 4

I think it's a little strange that the following works, but the previous examples don't:

>>> pd.concat([  # Repeated columns, same ordering
...     pd.DataFrame(np.ones((2, 4)), columns=list("aabc")),
...     pd.DataFrame(np.ones((2, 4)), columns=list("aabc")),
... ])
     a    a    b    c
0  1.0  1.0  1.0  1.0
1  1.0  1.0  1.0  1.0
0  1.0  1.0  1.0  1.0
1  1.0  1.0  1.0  1.0

Could there be a check for this in concatenation which throws a better error?

If non-unique column names are to be disallowed, it could be something simple, like this other error pandas throws:

InvalidIndexError: Reindexing only valid with uniquely valued Index objects

It could be even more specific and name some of the repeated elements, if you wanted to get fancy.

If the case where ordering is preserved is to be kept, it could be something like:

InvalidIndexError: Repeated column names {non-unique-columns} could not be uniquely aligned between DataFrames
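A pre-check along these lines could be sketched as follows (a hypothetical `concat_with_check` wrapper, not part of pandas, raising a plain ValueError for illustration):

```python
import numpy as np
import pandas as pd

def concat_with_check(dfs, **kwargs):
    # Fail early with a readable message when any frame has repeated
    # column labels and the frames' columns are not all identical --
    # the combination that currently surfaces as opaque internal errors.
    all_cols = [tuple(df.columns) for df in dfs]
    has_dupes = any(len(set(cols)) != len(cols) for cols in all_cols)
    if has_dupes and len(set(all_cols)) > 1:
        repeated = sorted({c for cols in all_cols for c in cols
                           if cols.count(c) > 1})
        raise ValueError(
            f"Repeated column names {repeated} could not be uniquely "
            "aligned between DataFrames")
    return pd.concat(dfs, **kwargs)
```

With identical duplicated columns (`aabc` / `aabc`) the call falls through to plain `pd.concat`, which already works; the mixed cases raise the clearer message instead of "Plan shapes are not aligned".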

@jreback (Contributor) commented Dec 22, 2020:

@ivirshup happy to take a PR to add a better error message

yeah, duplicates along the same axis of concatenation are almost always an error

@ivirshup (Contributor) commented Dec 22, 2020:

I'd be happy to make a PR. I feel like there might be code in pandas that does checks like these already. Any chance you could point me to the places these might be, for reference?

Also, I'm assuming you want to keep the existing behaviour of the following working for now?

pd.concat([  # Repeated columns, same order
    pd.DataFrame(np.ones((2, 4)), columns=list("aabc")),
    pd.DataFrame(np.ones((2, 4)), columns=list("aabc")),
])

Is that right?

@ivirshup (Contributor) commented:

@jreback, I think this conflicts pretty directly with #36290, which allows duplicate items.

Also, I think there are some bugs in that implementation. Using the current release candidate:

import pandas as pd
import numpy as np

from string import ascii_lowercase
letters =  np.array(list(ascii_lowercase))

a_int = pd.DataFrame(np.arange(5), index=[0,1,2,3,3], columns=['a'])
b_int = pd.DataFrame(np.arange(5), index=[0,1,2,2,4], columns=['b'])

a_str = a_int.set_index(letters[a_int.index])
b_str = b_int.set_index(letters[b_int.index])

This works (the purpose of the PR, and the example in its linked issue):

pd.concat([a_int, b_int], axis=1)
     a    b
0  0.0  0.0
1  1.0  1.0
2  2.0  2.0
2  2.0  3.0
3  3.0  NaN
3  4.0  NaN
4  NaN  4.0

This does not work, though I believe it's pretty equivalent to the previous example:

pd.concat([a_str, b_str], axis=1)
----> 1 pd.concat([a_str, b_str], axis=1)

~/miniconda3/envs/pandas-1.2/lib/python3.8/site-packages/pandas/core/reshape/concat.py in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    297     )
    298 
--> 299     return op.get_result()
    300 
    301 

~/miniconda3/envs/pandas-1.2/lib/python3.8/site-packages/pandas/core/reshape/concat.py in get_result(self)
    526                 mgrs_indexers.append((obj._mgr, indexers))
    527 
--> 528             new_data = concatenate_block_managers(
    529                 mgrs_indexers, self.new_axes, concat_axis=self.bm_axis, copy=self.copy
    530             )

~/miniconda3/envs/pandas-1.2/lib/python3.8/site-packages/pandas/core/internals/concat.py in concatenate_block_managers(mgrs_indexers, axes, concat_axis, copy)
     87         blocks.append(b)
     88 
---> 89     return BlockManager(blocks, axes)
     90 
     91 

~/miniconda3/envs/pandas-1.2/lib/python3.8/site-packages/pandas/core/internals/managers.py in __init__(self, blocks, axes, do_integrity_check)
    141 
    142         if do_integrity_check:
--> 143             self._verify_integrity()
    144 
    145         # Populate known_consolidate, blknos, and blklocs lazily

~/miniconda3/envs/pandas-1.2/lib/python3.8/site-packages/pandas/core/internals/managers.py in _verify_integrity(self)
    321         for block in self.blocks:
    322             if block.shape[1:] != mgr_shape[1:]:
--> 323                 raise construction_error(tot_items, block.shape[1:], self.axes)
    324         if len(self.items) != tot_items:
    325             raise AssertionError(

ValueError: Shape of passed values is (6, 2), indices imply (5, 2)

As an overall point, I think the target behaviour of that PR is wrong. Here's an example of why:

# Using pandas 1.2.0rc0
df1 = pd.DataFrame(np.arange(3), index=[0,1,1], columns=['a'])
df2 = pd.DataFrame(np.arange(3), index=[1,0,1], columns=['b'])

pd.concat([df1, df2], axis=1)
   a  b
0  0  1
1  1  0
1  2  2

The results here rely on the ordering of the labels (#6963 (comment)), which I agree is brittle.

I think there are two more reasonable options for the behaviour:

  • Union the indices; duplicates cause errors (my suggestion)
  • Mimic merge, i.e., actually take the outer product of indices

I'd note that the current behaviour of concat interprets "inner"/"outer" much more like "intersection"/"union" compared to merge's "inner"/"outer" operations. Mimicking merge could be a larger behaviour change.
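The outer-product semantics of merge can be seen directly (a self-contained sketch using frames like the `[0, 1, 1]` / `[1, 0, 1]` example above):

```python
import pandas as pd

# merge on index takes the outer product of repeated labels: label 1
# appears twice in each frame, so it contributes 2 x 2 = 4 result rows,
# plus 1 row for label 0, giving 5 rows total.
df1 = pd.DataFrame({'a': [0, 1, 2]}, index=[0, 1, 1])
df2 = pd.DataFrame({'b': [0, 1, 2]}, index=[1, 0, 1])
merged = pd.merge(df1, df2, left_index=True, right_index=True, how="outer")
```

Compare this with concat on 1.2.0rc0 above, which keeps only one row per pairing and is therefore sensitive to label ordering.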

1.2.0rc0 is doing something else; here is its merge behaviour compared with concat.

merge "inner" and "outer" are equivalent for common repeated indices:

In [11]: pd.merge(df1, df2, left_index=True, right_index=True, how="inner")                         
Out[11]: 
   a  b
0  0  1
1  1  0
1  1  2
1  2  0
1  2  2

In [12]: pd.merge(df1, df2, left_index=True, right_index=True, how="outer")                         
Out[12]: 
   a  b
0  0  1
1  1  0
1  1  2
1  2  0
1  2  2

and do not match the current behaviour of concat.

The current implementation otherwise basically works for outer joins if indices are only repeated in one DataFrame:

Using definitions from above, e.g.:

a_int = pd.DataFrame(np.random.randn(5), index=[0,1,2,3,3], columns=['a'])
b_int = pd.DataFrame(np.random.randn(5), index=[0,1,2,2,4], columns=['b'])
In [4]: pd.merge(a_int, b_int, left_index=True, right_index=True, how="outer")                      
Out[4]: 
     a    b
0  0.0  0.0
1  1.0  1.0
2  2.0  2.0
2  2.0  3.0
3  3.0  NaN
3  4.0  NaN
4  NaN  4.0

In [89]: pd.concat([a_int, b_int], axis=1, join="outer")
Out[89]: 
     a    b
0  0.0  0.0
1  1.0  1.0
2  2.0  2.0
2  2.0  3.0
3  3.0  NaN
3  4.0  NaN
4  NaN  4.0

But not for inner joins

In [8]: pd.merge(a_int, b_int, left_index=True, right_index=True, how="inner")                      
Out[8]: 
   a  b
0  0  0
1  1  1
2  2  2
2  2  3

In [9]:  pd.concat([a_int, b_int], axis=1, join="inner")                                            
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-9-03adb33c977d> in <module>
----> 1 pd.concat([a_int, b_int], axis=1, join="inner")

~/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/pandas/core/reshape/concat.py in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    297     )
    298 
--> 299     return op.get_result()
    300 
    301 

~/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/pandas/core/reshape/concat.py in get_result(self)
    526                 mgrs_indexers.append((obj._mgr, indexers))
    527 
--> 528             new_data = concatenate_block_managers(
    529                 mgrs_indexers, self.new_axes, concat_axis=self.bm_axis, copy=self.copy
    530             )

~/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/pandas/core/internals/concat.py in concatenate_block_managers(mgrs_indexers, axes, concat_axis, copy)
     87         blocks.append(b)
     88 
---> 89     return BlockManager(blocks, axes)
     90 
     91 

~/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/pandas/core/internals/managers.py in __init__(self, blocks, axes, do_integrity_check)
    141 
    142         if do_integrity_check:
--> 143             self._verify_integrity()
    144 
    145         # Populate known_consolidate, blknos, and blklocs lazily

~/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/pandas/core/internals/managers.py in _verify_integrity(self)
    321         for block in self.blocks:
    322             if block.shape[1:] != mgr_shape[1:]:
--> 323                 raise construction_error(tot_items, block.shape[1:], self.axes)
    324         if len(self.items) != tot_items:
    325             raise AssertionError(

ValueError: Shape of passed values is (4, 2), indices imply (3, 2)

@jreback (Contributor) commented Dec 22, 2020:

@ivirshup ahh, I remember now. yeah, handling duplicates is hard, so we can handle only some of them. I am actually OK with raising on duplicates in either axis, but would have to see how much that would break.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants