
[BUG] Concat duplicates errors (or lack there of) #38654

Merged
merged 2 commits into from
Dec 24, 2020

Conversation

ivirshup
Contributor

@ivirshup ivirshup commented Dec 23, 2020

@jreback

So here is a basic proposal for making concat throw a more useful error when there are duplicate values on the index not being concatenated along. Below are some examples of how this error message is improved. I'm including a comparison to 1.2.0rc1, since it includes a PR that was supposed to change some of these errors into working cases, but those cases are extremely limited. That's why the first commit of this PR reverts that change.

I haven't gone too far with this since it (a) reverts a merged PR and (b) affects the release candidate, so it should be conservative. I would like to get feedback on whether I should proceed with this approach.

My thinking right now is that a better error and revert should go into 1.2 – no behaviour change from 1.1, just better errors. A question of whether pd.concat(..., join="inner") means an intersection of indices (the current and documented behaviour) or an actual inner join seems like something that could benefit from more discussion.

If they are set operations, it should probably error on duplicates. If they are relational algebra operations, they should work with duplicates. It would probably also be useful to then add more arguments to concat, like merge's validate.
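For reference, a minimal sketch of how merge's existing validate argument behaves with duplicated index labels (the data here is made up for illustration):

```python
import pandas as pd
import pandas.errors

left = pd.DataFrame({"a": [1, 2]}, index=["x", "y"])
right = pd.DataFrame({"b": [3, 4, 5]}, index=["x", "x", "y"])

# "one_to_many" passes: the left index is unique, duplicates on the
# right are allowed, and "x" fans out into two result rows.
merged = pd.merge(left, right, left_index=True, right_index=True,
                  validate="one_to_many")
print(merged)

# "one_to_one" raises MergeError because "x" appears twice on the right.
try:
    pd.merge(left, right, left_index=True, right_index=True,
             validate="one_to_one")
except pandas.errors.MergeError as e:
    print("MergeError:", e)
```

A similar argument on concat would let users opt in to (or assert the absence of) this fan-out behaviour explicitly.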

One dataframe has repeated column names (str)

pd.concat([
    pd.DataFrame(np.ones((4, 4)), columns=list("aabc")),
    pd.DataFrame(np.ones((4, 3)), columns=list("abc")),
])

This PR

InvalidIndexError: Reindexing only valid with uniquely valued Index objects

Pandas v1.1.5

ValueError: Plan shapes are not aligned

pandas 1.2.0rc1

ValueError: Plan shapes are not aligned

One dataframe has repeated column names (int)

pd.concat([  # One dataframe has repeated column names
    pd.DataFrame(np.ones((4, 4)), columns=[1, 1, 2, 3]),
    pd.DataFrame(np.ones((4, 3)), columns=[1, 2, 3]),
])

This PR

InvalidIndexError: Reindexing only valid with uniquely valued Index objects

Pandas v1.1.5

ValueError: Plan shapes are not aligned

pandas 1.2.0rc1

ValueError: Plan shapes are not aligned

Repeated columns (same amount) different column ordering

pd.concat([
    pd.DataFrame(np.ones((2, 4)), columns=list("aabc")),
    pd.DataFrame(np.ones((2, 4)), columns=list("abca")),
])

This PR:

InvalidIndexError: Reindexing only valid with uniquely valued Index objects

In pandas v1.1.5

AssertionError: Number of manager items must equal union of block items
# manager items: 3, # tot_items: 4

In pandas v1.2.0rc1

AssertionError: Number of manager items must equal union of block items
# manager items: 3, # tot_items: 4

Repeated columns of different values in each

This case is what the reverted PR was trying to fix. The examples below demonstrate that it didn't fix what it set out to, and that the intended fix introduced strange behaviour anyway. To show this, I'll use the reverted test case.

# Test case from previous PR
a_int = pd.DataFrame([1, 2, 3, 4], index=[0, 1, 1, 4], columns=["a"])
b_int = pd.DataFrame([6, 7, 8, 9], index=[0, 0, 1, 3], columns=["b"])

# Same, but with string indexes (assuming `letters` maps integer
# positions to letters, e.g. letters = np.array(list("abcde")))
a_str = a_int.set_index(letters[a_int.index])
b_str = b_int.set_index(letters[b_int.index])

Int index

pd.concat([a_int, b_int], axis=1)

This PR

InvalidIndexError: Reindexing only valid with uniquely valued Index objects

Pandas v1.1.5

ValueError: Shape of passed values is (8, 2), indices imply (6, 2)

pandas 1.2.0rc1

     a    b
0  1.0  6.0
0  1.0  7.0
1  2.0  8.0
1  3.0  8.0
3  NaN  9.0
4  4.0  NaN

This behaviour is a bit strange.

First, the concat documentation defines the options here as set operations, like intersect and union. Is this a union, or is this an outer join?

If these are joins, then inner joins only seem to work when they are equivalent to intersections:

>>> pd.concat([a_int, b_int], axis=1, join="inner")
...
ValueError: Shape of passed values is (3, 2), indices imply (2, 2)

>>> pd.merge(a_int, b_int, how="inner", left_index=True, right_index=True)
   a  b
0  1  6
0  1  7
1  2  8
1  3  8

str index

This is mostly just to show that the behaviour in 1.2.0rc1 differs for string and int indices.

pd.concat([a_str, b_str], axis=1)

This PR

InvalidIndexError: Reindexing only valid with uniquely valued Index objects

Pandas v1.1.5

ValueError: Shape of passed values is (5, 2), indices imply (4, 2)

pandas 1.2.0rc1

ValueError: Shape of passed values is (5, 2), indices imply (4, 2)

@jreback
Contributor

jreback commented Dec 23, 2020

wow let me have a closer look

cc @jbrockmendel

Contributor

@jreback jreback left a comment


i like this - pretty consistent

can u add test cases for each of the scenarios?

+1 on putting this in 1.2

@jreback
Contributor

jreback commented Dec 23, 2020

cc @jorisvandenbossche

@jreback jreback added this to the 1.2 milestone Dec 23, 2020
@jreback jreback added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Error Reporting Incorrect or improved errors from pandas labels Dec 23, 2020
@jbrockmendel
Member

cc @phofl

@jreback
Contributor

jreback commented Dec 23, 2020

I think I want to merge this for 1.2. @ivirshup can you add some test cases here (we could merge w/o those).

@jreback
Contributor

jreback commented Dec 23, 2020

cc @simonjayhawkins

@phofl
Member

phofl commented Dec 23, 2020

We should document that concat does not support duplicates and close all related issues.
there is at least one more open

@jreback
Contributor

jreback commented Dec 23, 2020

We should document that concat does not support duplicates and close all related issues.
there is at least one more open

Yes, absolutely.

@jreback
Contributor

jreback commented Dec 24, 2020

@ivirshup can you update this with some tests that assert this.

@jreback
Contributor

jreback commented Dec 24, 2020

OK, I am going to merge this. @ivirshup pls open a new one and we can add tests.

@jreback jreback merged commit fa478d3 into pandas-dev:master Dec 24, 2020
@jreback
Contributor

jreback commented Dec 24, 2020

@meeseeksdev backport 1.2.x

@ivirshup
Contributor Author

Ah, sorry, took the weekend off for the holiday! I'll get on those tests.

Just a point to make on this: I left the case where dataframes have equal indexes working, even when those indexes contain duplicates. E.g.:

pd.concat([
    pd.DataFrame(np.ones((3, 3)), columns=list("aab")),
    pd.DataFrame(np.ones((3, 3)), columns=list("aab"))
])

I believe this has worked for a while. It's more obvious what a user intended in this case, even if it's not quite a set or join operation. If this were to be disallowed, I think it would need a deprecation first.
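A minimal sketch checking that this equal-index case still concatenates (run against a recent pandas; the shapes here mirror the example above):

```python
import numpy as np
import pandas as pd

# Both frames share the exact same (duplicated) column index, so no
# reindexing is needed and concat stacks them row-wise as expected.
df = pd.DataFrame(np.ones((3, 3)), columns=list("aab"))
result = pd.concat([df, df])
print(result.shape)
print(list(result.columns))
```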

@jorisvandenbossche
Member

Indeed, that case has always worked (I assume because the indexes are equal, we don't attempt any reindexing). If we wanted to change that, it would indeed need a deprecation cycle, so let's keep that as a separate issue/PR (but personally I think I am fine with keeping it working as is).

@alindman

I would like to ask a question about this behavior in connection with the effect (or lack thereof) of verify_integrity. I ran into this issue while combining two dataframes of timestamped values, where some timestamps are repeated within the initial frames.

Here is a test case:

def test_concat(temp1,temp2):
    [display(frame) for frame in [temp1, temp2]]
    frames = [temp1, temp2]
    temp3 = pd.concat(frames,
                      axis = 1,
                      sort = True,
                      verify_integrity = False)
    display(temp3)
    
print('Case 1')
# Two frames with unique timestamps
test_concat(temp1 = pd.DataFrame(data = [[1,2],[3,4]], columns = ['A', 'B'], index = pd.to_datetime(['2020-01-01 10:00','2020-01-01 10:01'])),
            temp2 = pd.DataFrame(data = [[5,6],[7,8]], columns = ['C', 'D'], index = pd.to_datetime(['2020-01-01 10:00','2020-01-01 10:01'])))

print('Case 2')
# Same as case 1, except the first frame has identical timestamps on both rows
test_concat(temp1 = pd.DataFrame(data = [[1,2],[3,4]], columns = ['A', 'B'], index = pd.to_datetime(['2020-01-01 10:00','2020-01-01 10:00'])),
            temp2 = pd.DataFrame(data = [[5,6],[7,8]], columns = ['C', 'D'], index = pd.to_datetime(['2020-01-01 10:00','2020-01-01 10:01'])))

This returns the following output:

Case 1

  | A | B
-- | -- | --
2020-01-01 10:00:00 | 1 | 2
2020-01-01 10:01:00 | 3 | 4


  | C | D
-- | -- | --
2020-01-01 10:00:00 | 5 | 6
2020-01-01 10:01:00 | 7 | 8


  | A | B | C | D
-- | -- | -- | -- | --
2020-01-01 10:00:00 | 1 | 2 | 5 | 6
2020-01-01 10:01:00 | 3 | 4 | 7 | 8

Case 2

  | A | B
-- | -- | --
2020-01-01 10:00:00 | 1 | 2
2020-01-01 10:00:00 | 3 | 4


  | C | D
-- | -- | --
2020-01-01 10:00:00 | 5 | 6
2020-01-01 10:01:00 | 7 | 8

ValueError                                Traceback (most recent call last)
<ipython-input-6-8812ae5e1d27> in <module>
     13 print('Case 2')
     14 test_concat(temp1 = pd.DataFrame(data = [[1,2],[3,4]], columns = ['A', 'B'], index = pd.to_datetime(['2020-01-01 10:00','2020-01-01 10:00'])),
---> 15             temp2 = pd.DataFrame(data = [[5,6],[7,8]], columns = ['C', 'D'], index = pd.to_datetime(['2020-01-01 10:00','2020-01-01 10:01'])))

<ipython-input-6-8812ae5e1d27> in test_concat(temp1, temp2)
      5                       axis = 1,
      6                       sort = True,
----> 7                       verify_integrity = False)
      8     display(temp3)
      9 

~/<path>/lib/python3.6/site-packages/pandas/core/reshape/concat.py in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    285     )
    286 
--> 287     return op.get_result()
    288 
    289 

~/<path>/lib/python3.6/site-packages/pandas/core/reshape/concat.py in get_result(self)
    501 
    502             new_data = concatenate_block_managers(
--> 503                 mgrs_indexers, self.new_axes, concat_axis=self.bm_axis, copy=self.copy,
    504             )
    505             if not self.copy:

~/<path>/lib/python3.6/site-packages/pandas/core/internals/concat.py in concatenate_block_managers(mgrs_indexers, axes, concat_axis, copy)
     82         blocks.append(b)
     83 
---> 84     return BlockManager(blocks, axes)
     85 
     86 

~/<path>/lib/python3.6/site-packages/pandas/core/internals/managers.py in __init__(self, blocks, axes, do_integrity_check)
    147 
    148         if do_integrity_check:
--> 149             self._verify_integrity()
    150 
    151         # Populate known_consolidate, blknos, and blklocs lazily

~/<path>/lib/python3.6/site-packages/pandas/core/internals/managers.py in _verify_integrity(self)
    327         for block in self.blocks:
    328             if block.shape[1:] != mgr_shape[1:]:
--> 329                 raise construction_error(tot_items, block.shape[1:], self.axes)
    330         if len(self.items) != tot_items:
    331             raise AssertionError(

ValueError: Shape of passed values is (5, 4), indices imply (3, 4)

It appears to me that concat is not respecting verify_integrity=False. The User Guide states: "verify_integrity : boolean, default False. Check whether the new concatenated axis contains duplicates." My interpretation of that is: when verify_integrity=False, concat should return all the supplied index values, even when there are duplicates. In pandas 1.1.5, however, the code above behaves identically with verify_integrity=False and verify_integrity=True. Is that the intended behavior? If so, I submit that the User Guide is misleading.
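One possible source of the confusion: verify_integrity only inspects the axis being created by the concatenation, not the pre-existing other axis. A minimal sketch of what it actually guards (data made up for illustration):

```python
import pandas as pd

a = pd.DataFrame({"A": [1, 2]}, index=[0, 0])

# With axis=1, the "new concatenated axis" is the columns, so a
# duplicated row index is outside verify_integrity's scope entirely;
# the ValueError above comes from a different code path.

# With axis=0, the new axis is the rows, and verify_integrity does
# fire on the duplicated labels:
raised = False
try:
    pd.concat([a, a], axis=0, verify_integrity=True)
except ValueError as e:
    raised = True
    print(e)  # overlapping index values are reported
```

So the flag never promises to tolerate duplicates on the off-axis; it only toggles a check on the axis concat is building.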

luckyvs1 pushed a commit to luckyvs1/pandas that referenced this pull request Jan 20, 2021
* Revert "[BUG]: Fix ValueError in concat() when at least one Index has duplicates (pandas-dev#36290)"

This reverts commit b32febd.
Labels
Error Reporting Incorrect or improved errors from pandas Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: concat on axis with both different and duplicate labels raising error
6 participants