
[BUG] Concat duplicates errors (or lack there of) #38654

Merged
merged 2 commits into from
Dec 24, 2020

Conversation

ivirshup
Contributor

@ivirshup ivirshup commented Dec 23, 2020

@jreback

So here is a basic proposal for making concat throw a more useful error when there are duplicate values on the index not being concatenated along. Below are some examples of how this error message is improved. I'm including a comparison to 1.2.0rc1, since it includes a PR that was supposed to change some of these errors into working cases, but those cases are extremely limited. That's why the first commit of this PR reverts that change.

I haven't gone too far with this since it (a) reverts a merged PR and (b) affects the release candidate, so it should be conservative. I would like to get feedback on whether I should proceed with this approach.

My thinking right now is that a better error and revert should go into 1.2 – no behaviour change from 1.1, just better errors. A question of whether pd.concat(..., join="inner") means an intersection of indices (the current and documented behaviour) or an actual inner join seems like something that could benefit from more discussion.

If they are set operations, it should probably error on duplicates. If they are relational algebra operations, they should work with duplicates. It would probably also be useful to then add more arguments to concat, like merge's validate.
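For reference, a minimal sketch of how merge's existing validate argument behaves with duplicated index labels (the data here is made up for illustration):

```python
import pandas as pd
import pandas.errors

left = pd.DataFrame({"a": [1, 2]}, index=["x", "y"])
right = pd.DataFrame({"b": [3, 4, 5]}, index=["x", "x", "y"])

# "one_to_many" passes: the left index is unique, duplicates on the
# right are allowed, and "x" fans out into two result rows.
merged = pd.merge(left, right, left_index=True, right_index=True,
                  validate="one_to_many")
print(merged)

# "one_to_one" raises MergeError because "x" appears twice on the right.
try:
    pd.merge(left, right, left_index=True, right_index=True,
             validate="one_to_one")
except pandas.errors.MergeError as e:
    print("MergeError:", e)
```

A similar argument on concat would let users opt in to (or assert the absence of) this fan-out behaviour explicitly.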

One dataframe has repeated column names (str)

pd.concat([
    pd.DataFrame(np.ones((4, 4)), columns=list("aabc")),
    pd.DataFrame(np.ones((4, 3)), columns=list("abc")),
])

This PR

InvalidIndexError: Reindexing only valid with uniquely valued Index objects

Pandas v1.1.5

ValueError: Plan shapes are not aligned

pandas 1.2.0rc1

ValueError: Plan shapes are not aligned

One dataframe has repeated column names (int)

pd.concat([  # One dataframe has repeated column names
    pd.DataFrame(np.ones((4, 4)), columns=[1, 1, 2, 3]),
    pd.DataFrame(np.ones((4, 3)), columns=[1, 2, 3]),
])

This PR

InvalidIndexError: Reindexing only valid with uniquely valued Index objects

Pandas v1.1.5

ValueError: Plan shapes are not aligned

pandas 1.2.0rc1

ValueError: Plan shapes are not aligned

Repeated columns (same amount) different column ordering

pd.concat([
    pd.DataFrame(np.ones((2, 4)), columns=list("aabc")),
    pd.DataFrame(np.ones((2, 4)), columns=list("abca")),
])

This PR:

InvalidIndexError: Reindexing only valid with uniquely valued Index objects

In pandas v1.1.5

AssertionError: Number of manager items must equal union of block items
# manager items: 3, # tot_items: 4

In pandas v1.2.0rc1

AssertionError: Number of manager items must equal union of block items
# manager items: 3, # tot_items: 4

Repeated columns of different values in each

This case is what the reverted PR was trying to fix. The examples below demonstrate that it didn't fix what it set out to, and that the intended fix introduced strange behaviour anyway. To show this, I'll use the reverted test case.

# Test case from previous PR
a_int = pd.DataFrame([1, 2, 3, 4], index=[0, 1, 1, 4], columns=["a"])
b_int = pd.DataFrame([6, 7, 8, 9], index=[0, 0, 1, 3], columns=["b"])

# Same, but with string indexes (assuming `letters` maps integer
# positions to letters, e.g. letters = np.array(list("abcde")))
a_str = a_int.set_index(letters[a_int.index])
b_str = b_int.set_index(letters[b_int.index])

Int index

pd.concat([a_int, b_int], axis=1)

This PR

InvalidIndexError: Reindexing only valid with uniquely valued Index objects

Pandas v1.1.5

ValueError: Shape of passed values is (8, 2), indices imply (6, 2)

pandas 1.2.0rc1

     a    b
0  1.0  6.0
0  1.0  7.0
1  2.0  8.0
1  3.0  8.0
3  NaN  9.0
4  4.0  NaN

This behaviour is a bit strange.

First, the concat documentation defines the options here as set operations, like intersect and union. Is this a union, or is this an outer join?

If these are joins, then inner joins only seem to work when they are equivalent to intersections:

>>> pd.concat([a_int, b_int], axis=1, join="inner")
...
ValueError: Shape of passed values is (3, 2), indices imply (2, 2)

>>> pd.merge(a_int, b_int, how="inner", left_index=True, right_index=True)
   a  b
0  1  6
0  1  7
1  2  8
1  3  8

str index

This is mostly just to show that the behaviour in 1.2.0rc1 differs for string and int indices.

pd.concat([a_str, b_str], axis=1)

This PR

InvalidIndexError: Reindexing only valid with uniquely valued Index objects

Pandas v1.1.5

ValueError: Shape of passed values is (5, 2), indices imply (4, 2)

pandas 1.2.0rc1

ValueError: Shape of passed values is (5, 2), indices imply (4, 2)

@jreback
Contributor

jreback commented Dec 23, 2020

wow let me have a closer look

cc @jbrockmendel

Contributor

@jreback jreback left a comment


i like this - pretty consistent

can u add test cases for each of the scenarios?

+1 on putting this in 1.2

@jreback
Contributor

jreback commented Dec 23, 2020

cc @jorisvandenbossche

@jreback jreback added this to the 1.2 milestone Dec 23, 2020
@jreback jreback added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Error Reporting Incorrect or improved errors from pandas labels Dec 23, 2020
@jbrockmendel
Member

cc @phofl

@jreback
Contributor

jreback commented Dec 23, 2020

I think I want to merge this for 1.2. @ivirshup can you add some test cases here (we could merge w/o those).

@jreback
Contributor

jreback commented Dec 23, 2020

cc @simonjayhawkins

@phofl
Member

phofl commented Dec 23, 2020

We should document that concat does not support duplicates and close all related issues.
there is at least one more open

@jreback
Contributor

jreback commented Dec 23, 2020

We should document that concat does not support duplicates and close all related issues.
there is at least one more open

Yes, absolutely.

@jreback
Contributor

jreback commented Dec 24, 2020

@ivirshup can you update this with some tests that assert this.

@jreback
Contributor

jreback commented Dec 24, 2020

OK, I am going to merge this. @ivirshup pls open a new one and we can add tests.

@jreback jreback merged commit fa478d3 into pandas-dev:master Dec 24, 2020
@jreback
Contributor

jreback commented Dec 24, 2020

@meeseeksdev backport 1.2.x

@ivirshup
Contributor Author

Ah, sorry, took the weekend off for the holiday! I'll get on those tests.

Just a point to make on this: I left the case where dataframes have equal indexes working, even when those indexes contain duplicates. E.g.:

pd.concat([
    pd.DataFrame(np.ones((3, 3)), columns=list("aab")),
    pd.DataFrame(np.ones((3, 3)), columns=list("aab"))
])

I believe this has worked for a while. It's more obvious what a user intended in this case, even if it's not quite a set or join operation. If this were to be disallowed, I think it would need a deprecation first.
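A minimal sketch checking that this equal-index case still concatenates (run against a recent pandas; the shapes here mirror the example above):

```python
import numpy as np
import pandas as pd

# Both frames share the exact same (duplicated) column index, so no
# reindexing is needed and concat stacks them row-wise as expected.
df = pd.DataFrame(np.ones((3, 3)), columns=list("aab"))
result = pd.concat([df, df])
print(result.shape)
print(list(result.columns))
```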

@jorisvandenbossche
Member

Indeed, that case has always worked (I assume because the indexes are equal, we don't attempt any reindexing). If we wanted to change that, it would indeed need a deprecation cycle, so let's keep that as a separate issue/PR (but personally I think I am fine with keeping it working as is).

@alindman

I would like to ask a question about this behavior in connection with the effect (or lack thereof) of verify_integrity. I ran into this issue while combining two dataframes of timestamped values, where some timestamps are repeated within the initial frames.

Here is a test case:

def test_concat(temp1,temp2):
    [display(frame) for frame in [temp1, temp2]]
    frames = [temp1, temp2]
    temp3 = pd.concat(frames,
                      axis = 1,
                      sort = True,
                      verify_integrity = False)
    display(temp3)
    
print('Case 1')
# Two frames with unique timestamps
test_concat(temp1 = pd.DataFrame(data = [[1,2],[3,4]], columns = ['A', 'B'], index = pd.to_datetime(['2020-01-01 10:00','2020-01-01 10:01'])),
            temp2 = pd.DataFrame(data = [[5,6],[7,8]], columns = ['C', 'D'], index = pd.to_datetime(['2020-01-01 10:00','2020-01-01 10:01'])))

print('Case 2')
# Same as case 1, except the first frame has identical timestamps on both rows
test_concat(temp1 = pd.DataFrame(data = [[1,2],[3,4]], columns = ['A', 'B'], index = pd.to_datetime(['2020-01-01 10:00','2020-01-01 10:00'])),
            temp2 = pd.DataFrame(data = [[5,6],[7,8]], columns = ['C', 'D'], index = pd.to_datetime(['2020-01-01 10:00','2020-01-01 10:01'])))

This returns the following output:

Case 1

  | A | B
-- | -- | --
2020-01-01 10:00:00 | 1 | 2
2020-01-01 10:01:00 | 3 | 4


  | C | D
-- | -- | --
2020-01-01 10:00:00 | 5 | 6
2020-01-01 10:01:00 | 7 | 8


  | A | B | C | D
-- | -- | -- | -- | --
2020-01-01 10:00:00 | 1 | 2 | 5 | 6
2020-01-01 10:01:00 | 3 | 4 | 7 | 8

Case 2

  | A | B
-- | -- | --
2020-01-01 10:00:00 | 1 | 2
2020-01-01 10:00:00 | 3 | 4


  | C | D
-- | -- | --
2020-01-01 10:00:00 | 5 | 6
2020-01-01 10:01:00 | 7 | 8

ValueError                                Traceback (most recent call last)
<ipython-input-6-8812ae5e1d27> in <module>
     13 print('Case 2')
     14 test_concat(temp1 = pd.DataFrame(data = [[1,2],[3,4]], columns = ['A', 'B'], index = pd.to_datetime(['2020-01-01 10:00','2020-01-01 10:00'])),
---> 15             temp2 = pd.DataFrame(data = [[5,6],[7,8]], columns = ['C', 'D'], index = pd.to_datetime(['2020-01-01 10:00','2020-01-01 10:01'])))

<ipython-input-6-8812ae5e1d27> in test_concat(temp1, temp2)
      5                       axis = 1,
      6                       sort = True,
----> 7                       verify_integrity = False)
      8     display(temp3)
      9 

~/<path>/lib/python3.6/site-packages/pandas/core/reshape/concat.py in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    285     )
    286 
--> 287     return op.get_result()
    288 
    289 

~/<path>/lib/python3.6/site-packages/pandas/core/reshape/concat.py in get_result(self)
    501 
    502             new_data = concatenate_block_managers(
--> 503                 mgrs_indexers, self.new_axes, concat_axis=self.bm_axis, copy=self.copy,
    504             )
    505             if not self.copy:

~/<path>/lib/python3.6/site-packages/pandas/core/internals/concat.py in concatenate_block_managers(mgrs_indexers, axes, concat_axis, copy)
     82         blocks.append(b)
     83 
---> 84     return BlockManager(blocks, axes)
     85 
     86 

~/<path>/lib/python3.6/site-packages/pandas/core/internals/managers.py in __init__(self, blocks, axes, do_integrity_check)
    147 
    148         if do_integrity_check:
--> 149             self._verify_integrity()
    150 
    151         # Populate known_consolidate, blknos, and blklocs lazily

~/<path>/lib/python3.6/site-packages/pandas/core/internals/managers.py in _verify_integrity(self)
    327         for block in self.blocks:
    328             if block.shape[1:] != mgr_shape[1:]:
--> 329                 raise construction_error(tot_items, block.shape[1:], self.axes)
    330         if len(self.items) != tot_items:
    331             raise AssertionError(

ValueError: Shape of passed values is (5, 4), indices imply (3, 4)

It appears to me that concat is not respecting verify_integrity=False. The User Guide states: "verify_integrity : boolean, default False. Check whether the new concatenated axis contains duplicates." My interpretation of that is: when verify_integrity=False, concat should return all the supplied index values, even when there are duplicates. In pandas 1.1.5, however, the code above behaves identically with verify_integrity=False and verify_integrity=True. Is that the intended behavior? If so, I submit that the User Guide is misleading.
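One possible source of the confusion: verify_integrity only inspects the axis being created by the concatenation, not the pre-existing other axis. A minimal sketch of what it actually guards (data made up for illustration):

```python
import pandas as pd

a = pd.DataFrame({"A": [1, 2]}, index=[0, 0])

# With axis=1, the "new concatenated axis" is the columns, so a
# duplicated row index is outside verify_integrity's scope entirely;
# the ValueError above comes from a different code path.

# With axis=0, the new axis is the rows, and verify_integrity does
# fire on the duplicated labels:
raised = False
try:
    pd.concat([a, a], axis=0, verify_integrity=True)
except ValueError as e:
    raised = True
    print(e)  # overlapping index values are reported
```

So the flag never promises to tolerate duplicates on the off-axis; it only toggles a check on the axis concat is building.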

luckyvs1 pushed a commit to luckyvs1/pandas that referenced this pull request Jan 20, 2021
* Revert "[BUG]: Fix ValueError in concat() when at least one Index has duplicates (pandas-dev#36290)"

This reverts commit b32febd.
Labels
Error Reporting Incorrect or improved errors from pandas Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: concat on axis with both different and duplicate labels raising error
6 participants