
BUG: concat on axis with both different and duplicate labels raising error #6963

Status: Closed
jorisvandenbossche opened this issue Apr 25, 2014 · 16 comments · Fixed by #38654
Labels: Bug; Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)

@jorisvandenbossche (Member):

When concatenating two dataframes where (a) one of the dataframes has duplicate columns, and (b) the two have non-overlapping column names, you get an IndexError:

In [9]: df1 = pd.DataFrame(np.random.randn(3,3), columns=['A', 'A', 'B1'])
   ...: df2 = pd.DataFrame(np.random.randn(3,3), columns=['A', 'A', 'B2'])

In [10]: pd.concat([df1, df2])

Traceback (most recent call last):
  File "<ipython-input-10-f61a1ab4009e>", line 1, in <module>
    pd.concat([df1, df2])
...
  File "c:\users\vdbosscj\scipy\pandas-joris\pandas\core\index.py", line 765, in take
    taken = self.view(np.ndarray).take(indexer)
IndexError: index 3 is out of bounds for axis 0 with size 3

I don't know if it should work (although I suppose it should, since it does work with only the duplicate columns), but at least the error message is not really helpful.
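Until this is fixed, one workaround sketch (a hypothetical `dedup_columns` helper, not a pandas API) is to make the duplicate labels temporarily unique before concatenating, then restore them:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randn(3, 3), columns=['A', 'A', 'B1'])
df2 = pd.DataFrame(np.random.randn(3, 3), columns=['A', 'A', 'B2'])

def dedup_columns(df):
    # Append a positional suffix to repeated labels: A, A -> A__0, A__1
    counts = {}
    renamed = []
    for col in df.columns:
        n = counts.get(col, 0)
        counts[col] = n + 1
        renamed.append(f"{col}__{n}")
    out = df.copy()
    out.columns = renamed
    return out

# With unique column labels the concat succeeds; strip the suffixes afterwards.
result = pd.concat([dedup_columns(df1), dedup_columns(df2)])
result.columns = [c.rsplit("__", 1)[0] for c in result.columns]
```

This relies on duplicate labels aligning positionally between the two frames, which is exactly the ordering assumption debated later in this thread.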

@jorisvandenbossche changed the title to "BUG: concat dataframes with both different and duplicate column names causing IndexError" (Apr 25, 2014)
@jreback (Contributor) commented Apr 25, 2014:

cc @immerrr

I think after #6745 this will be straightforward to fix

@jreback added this to the 0.15.0 milestone (Apr 25, 2014)
@immerrr (Contributor) commented Apr 25, 2014:

The main issue is how to align indices that both have duplicate items; as of now, indexing with dupes does strange things:

In [1]: pd.Index([1,1,2])
Out[1]: Int64Index([1, 1, 2], dtype='int64')

In [2]: _1.get_indexer_for(_1)
Out[2]: Int64Index([0, 1, 0, 1, 2], dtype='int64')

Apparently, for each non-unique element found in the destination, get_indexer tries to insert all locations of this element. I can hardly think of a use case where I'd want to do a reindex(['x', 'y']) and get ['x', 'x', 'x', 'y'] instead.

@jreback (Contributor) commented Apr 25, 2014:

The duplicated result came out of using a duplicated indexer on a unique index, in which case you have to duplicate:

In [1]: df = DataFrame(np.arange(10).reshape(5,2))

In [2]: df
Out[2]: 
   0  1
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9

[5 rows x 2 columns]

In [5]: df.loc[:,[0,1,1,0]]
Out[5]: 
   0  1  1  0
0  0  1  1  0
1  2  3  3  2
2  4  5  5  4
3  6  7  7  6
4  8  9  9  8

[5 rows x 4 columns]

Now maybe we should outlaw that, I suppose.

@immerrr (Contributor) commented Apr 25, 2014:

Maybe it would make more sense to require the destination index to have the same count of duplicate entries for each element present in the source, i.e.:

# e.g. these should be ok
pd.Index([1,1,2]).get_indexer_for([1,1,2]) # should be ok and return [0, 1, 2]
pd.Index([1,1,2]).get_indexer_for([2]) # return [2]
pd.Index([1,1,2]).get_indexer_for([1,2,1]) # return [0, 2, 1]

# but these should be forbidden
pd.Index([1,1,2]).get_indexer_for([1,2]) # which one of `1` did you want?
pd.Index([1,1,2]).get_indexer_for([1,1,1,2]) # which `1` should be duplicated?

UPD: or maybe cycle over duplicate elements like np.putmask does...

x = np.arange(5)
np.putmask(x, x > 1, [-33, -44])
x
array([  0,   1, -33, -44, -33])

# so that
pd.Index([1,1,2]).get_indexer_for([1,1,2,1,1]) # return [0,1,2,0,1] ?

@jreback (Contributor) commented Apr 25, 2014:

right, so that's a duplicate of a duplicate; yes, that would need to be handled differently

@jreback modified the milestones: 0.16.0, Next Major Release (Mar 3, 2015)
@jorisvandenbossche changed the title from "BUG: concat dataframes with both different and duplicate column names causing IndexError" to "BUG: concat on axis with both different and duplicate labels raising error" (Sep 16, 2017)
@jorisvandenbossche (Member, Author) commented:

Some more examples from #17552 (with axis=1 and different error messages, but the idea is the same: you only get the error when there are duplicate + different labels):

In [113]: df1 = pd.DataFrame(np.zeros((2,2)), index=[0, 1], columns=['a', 'b'])
     ...: df2 = pd.DataFrame(np.ones((2,2)), index=[2, 2], columns=['c', 'd'])

In [114]: pd.concat([df1, df2], axis=1)
...
ValueError: Shape of passed values is (4, 6), indices imply (4, 4)

In [115]: df1 = pd.DataFrame(np.zeros((2,2)), index=[0, 1], columns=['a', 'b'])
     ...: df2 = pd.DataFrame(np.ones((2,2)), index=pd.DatetimeIndex([2, 2]), columns=['c', 'd'])

In [116]: pd.concat([df1, df2], axis=1)
...
TypeError: 'NoneType' object is not iterable

@xuancong84 commented Jan 13, 2020:

In general, for maximum compatibility, we should treat a table as a symmetric rank-2 tensor (except for axis labels: the 1st direction is called 'column', the 2nd 'row'). Thus, no matter whether there are duplicate row indices, duplicate column names, or both, we should always be able to perform concatenation along either axis without throwing any errors.

Below is a very elegant design that can grant maximum compatibility. Suppose there are duplicate names in both the row indices and the column names:
[image: example dataframes df1 and df2 with duplicate row and column labels]

  1. For concatenation along axis 0 (pd.concat([df1, df2], axis=0), i.e., vertical concatenation of rows), the first two column A's in df1 and df2 should align to each other, leaving empty values (or NaN) for the 3rd column A in df1, because df1 does not have a 3rd column A.

[image: result of axis-0 concatenation]

  2. For concatenation along axis 1 (pd.concat([df1, df2], axis=1), i.e., horizontal concatenation of columns), the first two row 0's in df1 and df2 should align to each other, leaving empty values (or NaN) for the 3rd row 0 in df1, because df1 does not have a 3rd row 0.

[image: result of axis-1 concatenation]

  3. So in the most generic case, for concatenation along any axis, duplicate labels along that axis are kept and appended (e.g., 2 'A' plus 3 'A' = 5 'A'), while duplicate labels along the other axis are merged with multiplicity-count consistency (e.g., when 2 'A' merge with 3 'A', the first 2 'A' correspond to each other and the last 'A' gets empty/NaN values).

These should solve many DataFrame concatenation bugs and crashes in Pandas!

@jreback (Contributor) commented Jan 13, 2020:

@xuancong84 this is anti-pandas philosophy

having order dependencies is extremely fragile and would work unexpectedly if you happened to reorder columns or not

we must align columns on the labels; aligning some of them is just really odd

if you want to treat this as a tensor then drop the labels

-1 on this proposal

@xuancong84 commented Jan 14, 2020:

@jreback Sorry, I could not get what you mean. Could you be more specific and illustrate with examples?

Regardless of how you feel, while developing the omnipotent data plotter for pandas (https://github.com/xuancong84/beiwe-visualizer), I am currently experiencing lots of frustrating limitations in pandas. A lot of situations that could be handled easily are not handled very well. You might want to take a look at my code to see how much trouble it takes to work around them and get things working.

Regarding order dependencies, I know it is not ideal because a Python dictionary does not preserve order (unless you use OrderedDict). But in cases where there are duplicate names in both the row indices and the column names, that is the only way to make things work. Otherwise, pandas just ends up with crashes such as #28479 and #30772 where it should in principle work correctly.

@gitPrinz commented:

I understand it is not easy to fix, because we don't know how to order the columns? I don't really get it, but I can imagine it is a problem.

Could there maybe at least be an exception with a helpful message?

I stumbled on it today, and it took some time to understand the problem. And others seem to have the same problem.

@immerrr (Contributor) commented Mar 17, 2020:

Notifications keep bringing me here :) I haven't touched the pandas codebase for a while, so take my 2¢ with a grain of salt.

having order dependencies is extremely fragile

I agree with that: they are fragile and unreliable. And as a maintainer of other projects I get the sentiment of not adding stuff unless really necessary, even if it is conceptual stuff, like "in case of indexing non-unique indexes with non-unique indexer (re-reading this sentence hurts), matching is performed according to the order of the labels".

But from a pandas end-user perspective I do see this as a UX papercut. Yes, most of the time you shouldn't care about the ordering of columns or rows, but there are cases where there is no way around it. At that point you either verify the existing ordering or sort by a given criterion, and then for a short period of time you can rely on a specific order to perform a specific operation.

A good example of this would be forward-/backward-filling of NAs: ordering along the filling axis will directly influence the outcome, so before applying that I would need to make sure the data is ordered as I want it to be. The same approach could be applicable here: if you need to concatenate dataframes with non-unique labels and they are not in the order you want them to be, it's up to you to sort them first.

@ivirshup (Contributor) commented:

I have a few more cases of the error messages being much less helpful than they could be:

pd.concat([  # One dataframe has repeated column names
    pd.DataFrame(np.ones((4, 4)), columns=list("aabc")),
    pd.DataFrame(np.ones((4, 3)), columns=list("abc")),
])
ValueError: Plan shapes are not aligned
pd.concat([  # Repeated columns (same amount) different column ordering
    pd.DataFrame(np.ones((2, 4)), columns=list("aabc")),
    pd.DataFrame(np.ones((2, 4)), columns=list("abca")),
])
AssertionError: Number of manager items must equal union of block items
Full tracebacks:
>>> import pandas as pd, numpy as np
>>> pd.concat([  # One dataframe has repeated column names
...     pd.DataFrame(np.ones((4, 4)), columns=list("aabc")),
...     pd.DataFrame(np.ones((4, 3)), columns=list("abc")),
... ])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.8/site-packages/pandas/core/reshape/concat.py", line 287, in concat
    return op.get_result()
  File "/usr/local/lib/python3.8/site-packages/pandas/core/reshape/concat.py", line 502, in get_result
    new_data = concatenate_block_managers(
  File "/usr/local/lib/python3.8/site-packages/pandas/core/internals/concat.py", line 54, in concatenate_block_managers
    for placement, join_units in concat_plan:
  File "/usr/local/lib/python3.8/site-packages/pandas/core/internals/concat.py", line 561, in _combine_concat_plans
    raise ValueError("Plan shapes are not aligned")
ValueError: Plan shapes are not aligned
>>> pd.concat([  # Repeated columns (same amount) different column ordering
...     pd.DataFrame(np.ones((4, 4)), columns=list("aabc")),
...     pd.DataFrame(np.ones((4, 4)), columns=list("abca")),
... ])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.8/site-packages/pandas/core/reshape/concat.py", line 287, in concat
    return op.get_result()
  File "/usr/local/lib/python3.8/site-packages/pandas/core/reshape/concat.py", line 502, in get_result
    new_data = concatenate_block_managers(
  File "/usr/local/lib/python3.8/site-packages/pandas/core/internals/concat.py", line 84, in concatenate_block_managers
    return BlockManager(blocks, axes)
  File "/usr/local/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 149, in __init__
    self._verify_integrity()
  File "/usr/local/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 331, in _verify_integrity
    raise AssertionError(
AssertionError: Number of manager items must equal union of block items
# manager items: 3, # tot_items: 4

I think it's a little strange that the following works, but the previous examples don't:

>>> pd.concat([  # Repeated columns, same ordering
...     pd.DataFrame(np.ones((2, 4)), columns=list("aabc")),
...     pd.DataFrame(np.ones((2, 4)), columns=list("aabc")),
... ])
     a    a    b    c
0  1.0  1.0  1.0  1.0
1  1.0  1.0  1.0  1.0
0  1.0  1.0  1.0  1.0
1  1.0  1.0  1.0  1.0

Could there be a check for this in concatenation which throws a better error?

If non-unique column names are to be disallowed, it could be something simple, like this other error pandas throws:

InvalidIndexError: Reindexing only valid with uniquely valued Index objects

It could be even more specific and name some of the repeated elements, if you wanted to get fancy.

If the case where ordering is preserved is to be kept, it could be something like:

InvalidIndexError: Repeated column names {non-unique-columns} could not be uniquely aligned between DataFrames
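A pre-check along these lines could be sketched as follows (a hypothetical `concat_with_check` wrapper, not part of pandas, raising a plain ValueError for illustration):

```python
import numpy as np
import pandas as pd

def concat_with_check(dfs, **kwargs):
    # Fail early with a readable message when any frame has repeated
    # column labels and the frames' columns are not all identical --
    # the combination that currently surfaces as opaque internal errors.
    all_cols = [tuple(df.columns) for df in dfs]
    has_dupes = any(len(set(cols)) != len(cols) for cols in all_cols)
    if has_dupes and len(set(all_cols)) > 1:
        repeated = sorted({c for cols in all_cols for c in cols
                           if cols.count(c) > 1})
        raise ValueError(
            f"Repeated column names {repeated} could not be uniquely "
            "aligned between DataFrames")
    return pd.concat(dfs, **kwargs)
```

With identical duplicated columns (`aabc` / `aabc`) the call falls through to plain `pd.concat`, which already works; the mixed cases raise the clearer message instead of "Plan shapes are not aligned".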

@jreback (Contributor) commented Dec 22, 2020:

@ivirshup happy to take a PR to add a better error message

yeah, duplicates along the same axis of concatenation are almost always an error

@ivirshup (Contributor) commented Dec 22, 2020:

I'd be happy to make a PR. I feel like there might be code in pandas that does checks like these already. Any chance you could point me to the places these might be, for reference?

Also, I'm assuming you want to keep the existing behaviour of the following working for now?

pd.concat([  # Repeated columns, same order
    pd.DataFrame(np.ones((2, 4)), columns=list("aabc")),
    pd.DataFrame(np.ones((2, 4)), columns=list("aabc")),
])

Is that right?

@ivirshup (Contributor) commented:

@jreback, I think this conflicts pretty directly with #36290, which allows duplicate items.

Also, I think there are some bugs in that implementation. Using the current release candidate:

import pandas as pd
import numpy as np

from string import ascii_lowercase
letters =  np.array(list(ascii_lowercase))

a_int = pd.DataFrame(np.arange(5), index=[0,1,2,3,3], columns=['a'])
b_int = pd.DataFrame(np.arange(5), index=[0,1,2,2,4], columns=['b'])

a_str = a_int.set_index(letters[a_int.index])
b_str = b_int.set_index(letters[b_int.index])

This works (the purpose of the PR, and the example in its linked issue):

pd.concat([a_int, b_int], axis=1)
     a    b
0  0.0  0.0
1  1.0  1.0
2  2.0  2.0
2  2.0  3.0
3  3.0  NaN
3  4.0  NaN
4  NaN  4.0

This does not work, though I believe it's pretty equivalent to the previous example:

pd.concat([a_str, b_str], axis=1)
----> 1 pd.concat([a_str, b_str], axis=1)

~/miniconda3/envs/pandas-1.2/lib/python3.8/site-packages/pandas/core/reshape/concat.py in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    297     )
    298 
--> 299     return op.get_result()
    300 
    301 

~/miniconda3/envs/pandas-1.2/lib/python3.8/site-packages/pandas/core/reshape/concat.py in get_result(self)
    526                 mgrs_indexers.append((obj._mgr, indexers))
    527 
--> 528             new_data = concatenate_block_managers(
    529                 mgrs_indexers, self.new_axes, concat_axis=self.bm_axis, copy=self.copy
    530             )

~/miniconda3/envs/pandas-1.2/lib/python3.8/site-packages/pandas/core/internals/concat.py in concatenate_block_managers(mgrs_indexers, axes, concat_axis, copy)
     87         blocks.append(b)
     88 
---> 89     return BlockManager(blocks, axes)
     90 
     91 

~/miniconda3/envs/pandas-1.2/lib/python3.8/site-packages/pandas/core/internals/managers.py in __init__(self, blocks, axes, do_integrity_check)
    141 
    142         if do_integrity_check:
--> 143             self._verify_integrity()
    144 
    145         # Populate known_consolidate, blknos, and blklocs lazily

~/miniconda3/envs/pandas-1.2/lib/python3.8/site-packages/pandas/core/internals/managers.py in _verify_integrity(self)
    321         for block in self.blocks:
    322             if block.shape[1:] != mgr_shape[1:]:
--> 323                 raise construction_error(tot_items, block.shape[1:], self.axes)
    324         if len(self.items) != tot_items:
    325             raise AssertionError(

ValueError: Shape of passed values is (6, 2), indices imply (5, 2)

As an overall point, I think the target behaviour of that PR is wrong. Here's an example of why:

# Using pandas 1.2.0rc0
df1 = pd.DataFrame(np.arange(3), index=[0,1,1], columns=['a'])
df2 = pd.DataFrame(np.arange(3), index=[1,0,1], columns=['b'])

pd.concat([df1, df2], axis=1)
   a  b
0  0  1
1  1  0
1  2  2

The results here rely on the ordering of the labels (#6963 (comment)), which I agree is brittle.

I think there are two more reasonable options for the behaviour:

  • Union the indices; duplicates cause errors (my suggestion)
  • Mimic merge, i.e., actually take the outer product of indices

I'd note that the current behaviour of concat interprets "inner"/"outer" much more like "intersection"/"union" compared to merge's "inner"/"outer" operations. Mimicking merge could be a larger behaviour change.
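The outer-product semantics of merge can be seen directly (a self-contained sketch using frames like the `[0, 1, 1]` / `[1, 0, 1]` example above):

```python
import pandas as pd

# merge on index takes the outer product of repeated labels: label 1
# appears twice in each frame, so it contributes 2 x 2 = 4 result rows,
# plus 1 row for label 0, giving 5 rows total.
df1 = pd.DataFrame({'a': [0, 1, 2]}, index=[0, 1, 1])
df2 = pd.DataFrame({'b': [0, 1, 2]}, index=[1, 0, 1])
merged = pd.merge(df1, df2, left_index=True, right_index=True, how="outer")
```

Compare this with concat on 1.2.0rc0 above, which keeps only one row per pairing and is therefore sensitive to label ordering.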

1.2.0rc0 is doing something else; here is its merge behaviour compared with concat.

merge "inner" and "outer" are equivalent for common repeated indices:

In [11]: pd.merge(df1, df2, left_index=True, right_index=True, how="inner")                         
Out[11]: 
   a  b
0  0  1
1  1  0
1  1  2
1  2  0
1  2  2

In [12]: pd.merge(df1, df2, left_index=True, right_index=True, how="outer")                         
Out[12]: 
   a  b
0  0  1
1  1  0
1  1  2
1  2  0
1  2  2

and do not match the current behaviour of concat.

The current implementation otherwise basically works for outer joins if indices are only repeated in one DataFrame:

Using definitions from above, e.g.:

a_int = pd.DataFrame(np.random.randn(5), index=[0,1,2,3,3], columns=['a'])
b_int = pd.DataFrame(np.random.randn(5), index=[0,1,2,2,4], columns=['b'])
In [4]: pd.merge(a_int, b_int, left_index=True, right_index=True, how="outer")                      
Out[4]: 
     a    b
0  0.0  0.0
1  1.0  1.0
2  2.0  2.0
2  2.0  3.0
3  3.0  NaN
3  4.0  NaN
4  NaN  4.0

In [89]: pd.concat([a_int, b_int], axis=1, join="outer")
Out[89]: 
     a    b
0  0.0  0.0
1  1.0  1.0
2  2.0  2.0
2  2.0  3.0
3  3.0  NaN
3  4.0  NaN
4  NaN  4.0

But not for inner joins

In [8]: pd.merge(a_int, b_int, left_index=True, right_index=True, how="inner")                      
Out[8]: 
   a  b
0  0  0
1  1  1
2  2  2
2  2  3

In [9]:  pd.concat([a_int, b_int], axis=1, join="inner")                                            
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-9-03adb33c977d> in <module>
----> 1 pd.concat([a_int, b_int], axis=1, join="inner")

~/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/pandas/core/reshape/concat.py in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    297     )
    298 
--> 299     return op.get_result()
    300 
    301 

~/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/pandas/core/reshape/concat.py in get_result(self)
    526                 mgrs_indexers.append((obj._mgr, indexers))
    527 
--> 528             new_data = concatenate_block_managers(
    529                 mgrs_indexers, self.new_axes, concat_axis=self.bm_axis, copy=self.copy
    530             )

~/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/pandas/core/internals/concat.py in concatenate_block_managers(mgrs_indexers, axes, concat_axis, copy)
     87         blocks.append(b)
     88 
---> 89     return BlockManager(blocks, axes)
     90 
     91 

~/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/pandas/core/internals/managers.py in __init__(self, blocks, axes, do_integrity_check)
    141 
    142         if do_integrity_check:
--> 143             self._verify_integrity()
    144 
    145         # Populate known_consolidate, blknos, and blklocs lazily

~/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/pandas/core/internals/managers.py in _verify_integrity(self)
    321         for block in self.blocks:
    322             if block.shape[1:] != mgr_shape[1:]:
--> 323                 raise construction_error(tot_items, block.shape[1:], self.axes)
    324         if len(self.items) != tot_items:
    325             raise AssertionError(

ValueError: Shape of passed values is (4, 2), indices imply (3, 2)

@jreback (Contributor) commented Dec 22, 2020:

@ivirshup ahh, I remember now. yeah, handling duplicates is hard, so we can handle only some of them. I am actually OK with raising on duplicates in either axis, but would have to see how much that would break.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants