BUG: Join with a list of a single element should behave as a join with a single element (#57676) #57890

Dacops · 2024-03-18T17:08:32Z

closes BUG: join with list does not behave like singleton #57676
Added a test to pandas/tests/frame/methods/test_join.py

Even though it does not make much sense, joining a dataframe with a list of a single element should behave as joining the dataframe with that element.

Dacops · 2024-03-20T22:37:26Z

The same test keeps failing, but the failure does not make sense to me. The test is:

def test_suffix_on_list_join():
        first = DataFrame({"key": [1, 2, 3, 4, 5]})
        second = DataFrame({"key": [1, 8, 3, 2, 5], "v1": [1, 2, 3, 4, 5]})
        third = DataFrame({"keys": [5, 2, 3, 4, 1], "v2": [1, 2, 3, 4, 5]})
    
        # check proper errors are raised
        msg = "Suffixes not supported when joining multiple DataFrames"
        with pytest.raises(ValueError, match=msg):
>           first.join([second], lsuffix="y")
E           Failed: DID NOT RAISE <class 'ValueError'>

In this example multiple DataFrames are not passed, just a list with a single DataFrame.
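For context, a minimal sketch of the check this test exercises, using a list of two DataFrames so that it does not depend on how a single-element list is handled. The frames and the expected message are the ones quoted in the test above; the `raised` variable is only there for illustration.

```python
import pandas as pd

first = pd.DataFrame({"key": [1, 2, 3, 4, 5]})
second = pd.DataFrame({"key": [1, 8, 3, 2, 5], "v1": [1, 2, 3, 4, 5]})
third = pd.DataFrame({"keys": [5, 2, 3, 4, 1], "v2": [1, 2, 3, 4, 5]})

# Passing a suffix together with a list of several frames is rejected.
try:
    first.join([second, third], lsuffix="y")
    raised = None
except ValueError as err:
    raised = str(err)

print(raised)
```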

Dacops · 2024-03-23T01:48:41Z

I've updated the test mentioned above. Previously, any list passed to DataFrame.join was assumed to have 2+ elements. With this edge case added, where a list can have a single element (behaving as a DataFrame.join with a single element), that condition in the test "test_suffix_on_list_join" no longer makes sense.

pandas/tests/frame/methods/test_join.py

Aloqeely · 2024-04-09T10:47:00Z

LGTM. @mroeschke would you mind taking a look?

Dacops · 2024-04-22T14:12:23Z

Thanks for the review @Aloqeely, in the meantime I've rebased my PR to address conflicts in the "whatsnew" file. I've noticed that the PR started failing the Numpy Dev tests because it was rebased from the main Pandas branch which is failing the same tests. No other changes (except the whatsnew file) were made since the last review. If any developer familiar with this area of the code could take a look, I'd appreciate it (cc @mroeschke, @WillAyd, @jbrockmendel ) I apologize for the broader ping, but I wasn't sure who would be best suited to review this change.

WillAyd · 2024-05-01T17:14:35Z

Hmm unfortunately I am -1 on making this change. I'm not sure I understand why we need to special case a single element list.

Generally our requirement for joins / merges is that the key be hashable

WillAyd · 2024-05-01T17:16:51Z

Sorry ignore my hashability comment - I see this is in regards to the join argument not the key elements themselves. Still, I'm not sure why we need to special case this

Dacops · 2024-05-01T17:34:23Z

There's a problem, in some rare cases, where joining a dataframe x with a list containing a single dataframe y loses information compared to just joining dataframe x with dataframe y (one example of this is in the original issue: #57676).

WillAyd · 2024-05-01T17:38:36Z

Ah OK thanks @Dacops . This only solves that then though with a single element list right? What happens when there are multiple list elements? I think there is a fix somewhere else in the code that would work generally - ideally we avoid special-casing solutions like this as they don't add up well over time

Dacops · 2024-05-01T17:48:19Z

With multiple list elements the behaviour remains the same. Basically, what the code did was: if the joining element is an object, deal with it in way A; if it's a list, deal with it in way B. I've added an if-statement that just checks whether the list has a single element. If it does, use way A; otherwise keep using way B.

Aloqeely · 2024-05-01T21:26:53Z

I think what he's saying is that a better fix would be to not treat lists with 1 item as a special case that uses different behavior, and instead fix the merge logic such that it does not fail even if the list has 1 item only

Dacops · 2024-05-02T00:07:03Z

Hmmm, yeah I did think of that. However, the line that handles that is on frame.py/10685: can_concat = all(df.index.is_unique for df in frames) and it has been there for 12 years. I therefore assumed the logic was correct, since it hasn't changed in that long and it only fails in very rare cases. I thought it would be more prudent to add the if-statement for the edge cases instead of changing code that has been unaltered for so long.
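To make the quoted line concrete, here is a small sketch of what that expression computes. The wrapper function `can_concat` and the two sample frames are made up for illustration; only the expression inside is taken from the comment above.

```python
import pandas as pd


def can_concat(frames):
    # Same expression as the quoted line: True only when every frame's
    # own index has no duplicate labels.
    return all(df.index.is_unique for df in frames)


unique = pd.DataFrame({"a": [1, 2]}, index=["K0", "K1"])
duplicated = pd.DataFrame({"b": [3, 4]}, index=["K0", "K0"])

print(can_concat([unique]))
print(can_concat([unique, duplicated]))
```

Note that each index is checked on its own; the expression does not compare labels across frames.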

Aloqeely · 2024-05-02T11:11:02Z

It might very well be broken logic even if it was there for 12 years, funnily enough I just saw another comment saying that some broken logic was introduced 13 years ago

Dacops · 2024-05-02T21:13:23Z

I think I've identified the problem. That expression evaluates whether the indexes of all objects in the list are unique, so that they can be concatenated (stacked on top of each other). However, if a MultiIndex is used, the entire MultiIndex is compared, not the separate indexes inside it. A problem can arise here: for example, in the original issue an Indexed dataframe is joined to a MultiIndexed dataframe, and the index pairs of the MultiIndexed dataframe are compared to the indexes of the Indexed dataframe. There's no repetition, so they get concatenated, and that is the error. One of the MultiIndexed dataframe's indexes is 'y', the same as the index of the Indexed dataframe, and its values are no longer unique, which means the frames need to be merged. The solution I'm thinking of: after the expression that checks whether all indexes are unique, if it evaluates to True, add an extra check for repeated values in the shared indexes, and if any of these are not unique, change the value to False and go to the merge instead of the concat.

WillAyd · 2024-05-02T21:39:33Z

Hmm I'm not sure I follow - I feel like the indexes should be the same when trying to join. Assuming some kind of matching behavior between an Index and MultiIndex might have a lot of logic pitfalls

Dacops · 2024-05-02T22:45:34Z

The problem is with the MultiIndexes. For example, in the original issue the Index is index=pd.Index(['a', 'b', 'c'], name='y'), while the MultiIndex is index=pd.MultiIndex.from_tuples([(0, 'a'), (0, 'b'), (0, 'c'), (1, 'a'), (1, 'b'), (1, 'c')], names=('x', 'y')). When all(df.index.is_unique for df in frames) is run over both, it sees the values 'a', 'b', 'c', (0, 'a'), (0, 'b'), (0, 'c'), (1, 'a'), (1, 'b'), (1, 'c'), which returns True (there are no repetitions, the index values are unique, concatenate them). However, this cannot happen since the MultiIndex has a 'y' index. If we extract it and repeat all(df.index.is_unique for df in frames) over both 'y' indexes we get 'a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c', which returns False (there are repeated values, we can't concatenate them). Tl;dr: when a MultiIndex is passed, it's evaluated as a whole and not as the individual indexes inside it where the merge may occur.
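The difference described above can be checked directly on the two indexes from the original issue. This is only a sketch of the uniqueness checks, not of the join itself; the variable names are illustrative.

```python
import pandas as pd

idx = pd.Index(["a", "b", "c"], name="y")
mi = pd.MultiIndex.from_tuples(
    [(0, "a"), (0, "b"), (0, "c"), (1, "a"), (1, "b"), (1, "c")],
    names=("x", "y"),
)

# Each index taken as a whole has no duplicate entries.
whole_unique = bool(idx.is_unique and mi.is_unique)

# The shared level 'y' of the MultiIndex, taken alone, repeats its values.
level_y_unique = bool(mi.get_level_values("y").is_unique)

print(whole_unique, level_y_unique)
```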

Dacops · 2024-05-02T22:50:28Z

I believe I've got the solution for this, I'll push it so you can take a look at it

Aloqeely · 2024-05-02T23:01:25Z

So is this a problem with is_unique on MultiIndex? Do you think its current behavior makes sense? I think it should return True only if each level has unique values, and if I understood you correctly that seems to be failing right now

If you agree with my suggested behavior then I will open an issue to fix it, if not we should at least document how it operates on MultiIndex (don't see any examples for that)

Dacops · 2024-05-02T23:14:59Z

@Aloqeely I believe the current behaviour of is_unique for MultiIndex is correct, however for this case this behaviour would not work. This is because we're joining an Indexed dataframe with a MultiIndexed dataframe: the join will only occur on a single index (the one of the Indexed dataframe), so the remaining indexes of the MultiIndex are irrelevant to this operation.

Dacops · 2024-05-04T18:00:59Z

Is the failing Doc build and upload check something I did wrong, or should I just re-run the checks? I haven't touched /home/runner/work/pandas/pandas/doc/source/user_guide/merging.rst

Aloqeely · 2024-05-07T03:24:58Z

It does seem related to your changes, I've summarized the example in the doc that fails:

import pandas as pd

left = pd.DataFrame(
    {
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
        "key2": ["K0", "K1", "K0", "K1"],
    },
    index=pd.Index(["K0", "K0", "K1", "K2"], name="key1"),
)

right = pd.DataFrame({"v": [7, 8, 9]}, index=["K1", "K1", "K2"])

result = left.join([right])

Raises

UnboundLocalError: cannot access local variable 'can_concat' where it is not associated with a value

It is surprising that only the doc build failed but not any tests, so once you fix it for this case please add tests.

WillAyd · 2024-05-08T00:40:33Z

We are fixing the wrong thing from the original issue. I think specifying on=... is reasonable to allow an Index and a MultiIndex to be joined together, but we should not be doing anything to automatically try and determine this. We should actually be raising when the indexes have a different number of levels in join

Dacops · 2024-05-08T00:54:42Z

Oh thanks @Aloqeely, I figured out what it was. On this line, if "common_indexes" was empty, can_concat would never be initialized, and since it is used in the following if statement, it threw an error there. I now initialize it to a default value before the for loop so that doesn't occur anymore. Everything's good now, @WillAyd if you could re-review the PR I would appreciate it, thanks.
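The failure mode described here can be reproduced in plain Python. The sketch below is not the PR's code: the function names, the loop body and the default value `False` are all assumptions, chosen only to show why a variable assigned solely inside a loop is unbound when the loop never runs.

```python
def decide_buggy(common_indexes):
    for _name in common_indexes:
        can_concat = True  # only assigned when the loop body runs
    if can_concat:  # unbound when common_indexes is empty
        return "concat"
    return "merge"


def decide_fixed(common_indexes):
    can_concat = False  # hypothetical default, set before the loop
    for _name in common_indexes:
        can_concat = True
    return "concat" if can_concat else "merge"


try:
    decide_buggy([])
    outcome = "no error"
except UnboundLocalError:
    outcome = "UnboundLocalError"

print(outcome)
print(decide_fixed([]))
```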

Dacops · 2024-05-08T00:55:08Z

Oh just saw your comment sorry

Dacops · 2024-05-08T00:57:06Z

Hmmm, so if on=... is not defined it should just raise an error?

WillAyd · 2024-05-08T01:10:06Z

Right if you are trying to join two index objects with non-equal levels
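One way to picture the suggested behaviour is a guard that compares the number of index levels before joining. This is a sketch under assumptions: `check_same_nlevels` and its error message are hypothetical and not part of pandas; only `Index.nlevels` is an existing attribute.

```python
import pandas as pd


def check_same_nlevels(left, right):
    # Hypothetical helper: refuse to join frames whose indexes have a
    # different number of levels.
    if left.index.nlevels != right.index.nlevels:
        raise ValueError(
            f"Cannot join indexes with {left.index.nlevels} and "
            f"{right.index.nlevels} levels without specifying 'on'"
        )


single = pd.DataFrame({"a": [1, 2, 3]}, index=pd.Index(["a", "b", "c"], name="y"))
multi = pd.DataFrame(
    {"b": range(6)},
    index=pd.MultiIndex.from_product([[0, 1], ["a", "b", "c"]], names=("x", "y")),
)

try:
    check_same_nlevels(single, multi)
    message = None
except ValueError as err:
    message = str(err)

print(message)
```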

Dacops · 2024-05-08T08:26:20Z

Ok, so I've changed it from using a default value to raising an error if "common_indexes" is empty

Dacops · 2024-05-08T11:11:48Z

@Aloqeely how can I get better errors on the Doc build and upload test like you sent instead of just RuntimeError: Unexpected exception in /home/runner/work/pandas/pandas/doc/source/user_guide/merging.rst line 930 ?

Aloqeely · 2024-05-08T16:40:51Z

You can open the doc file which has the errors and go to the error line, run the examples near to that line manually until you see which example causes the error.

Dacops · 2024-05-08T17:55:43Z

Oh thanks @Aloqeely, so if my error says RuntimeError: Unexpected exception in /home/runner/work/pandas/pandas/doc/source/user_guide/merging.rst line 930, here. Then the test that fails is the one above? The one that starts on line 906, here?

Aloqeely · 2024-05-08T17:57:03Z

I think so
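As a rough aid for the lookup discussed above, the start of the example block that precedes a reported line can be found by scanning backwards for the directive. This sketch assumes the examples are written with the `.. ipython:: python` directive and that the reported line falls inside or just after the failing block; `find_directive_start` and the sample text are made up for illustration.

```python
def find_directive_start(rst_text, error_line, directive=".. ipython:: python"):
    """Return the 1-based line of the last directive at or before error_line."""
    lines = rst_text.splitlines()
    for lineno in range(min(error_line, len(lines)), 0, -1):
        if lines[lineno - 1].strip() == directive:
            return lineno
    return None


sample = "\n".join(
    [
        "Some prose.",
        "",
        ".. ipython:: python",
        "",
        "   result = left.join([right])",
        "",
        "More prose.",
    ]
)

print(find_directive_start(sample, 5))
```

The block found this way can then be copied into a Python session and run by hand, as suggested above.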

Dacops · 2024-05-09T09:11:35Z

I know what's happening: in this example left and right both have an index named 'key1', while right2 has an index that is not named. This unnamed index is inferred to be 'key1'. Is this expected behaviour, taking into consideration your input @WillAyd?

We are fixing the wrong thing from the original issue. I think specifying on=... is reasonable to allow an Index and a MultiIndex to be joined together, but we should not be doing anything to automatically try and determine this. We should actually be raising when the indexes have a different number of levels in join

Dacops · 2024-05-26T03:02:03Z

I've changed the example in the documentation that was failing. Instead of joining indexes named "key1", "key1" and None, all of them are now named "key1".

Dacops · 2024-05-26T20:03:21Z

@WillAyd the issue should be completed now

Dacops marked this pull request as draft March 20, 2024 15:54

Dacops force-pushed the my-branch branch from f36b353 to 322153a Compare March 20, 2024 17:54

Dacops changed the title ~~BUG: Join with list of dataframes that have MultiIndex now behaves as expected (#57676)~~ BUG: Join with a list of a single element behaves as a join with a single element (#57676) Mar 20, 2024

Dacops marked this pull request as ready for review March 20, 2024 22:35

Dacops changed the title ~~BUG: Join with a list of a single element behaves as a join with a single element (#57676)~~ BUG: Join with a list of a single element should behave as a join with a single element (#57676) Mar 20, 2024

Aloqeely reviewed Apr 3, 2024

pandas/tests/frame/methods/test_join.py (outdated, resolved)

Dacops force-pushed the my-branch branch from 4d13fbd to 7a44e42 Compare April 4, 2024 13:40

Dacops force-pushed the my-branch branch from 7a44e42 to 704f2da Compare April 21, 2024 20:45

Dacops force-pushed the my-branch branch from 704f2da to ac65c13 Compare May 1, 2024 16:35

Dacops force-pushed the my-branch branch 2 times, most recently from 351f139 to fe7336b Compare May 2, 2024 23:05

Dacops force-pushed the my-branch branch from fe7336b to 04ac369 Compare May 7, 2024 23:48

Dacops force-pushed the my-branch branch from 04ac369 to e67b043 Compare May 8, 2024 08:25

Dacops force-pushed the my-branch branch from e67b043 to 611cc94 Compare May 8, 2024 09:30

Dacops force-pushed the my-branch branch from 611cc94 to 02d7aa9 Compare May 26, 2024 01:51

BUG: On Join, with a list containing MultiIndexes check uniqueness of…

2b0a3ae

… index to join (pandas-dev#57676)

Dacops force-pushed the my-branch branch from 02d7aa9 to 2b0a3ae Compare May 26, 2024 02:10
