Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Join with a list of a single element should behave as a join with a single element (#57676) #57890

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

Dacops
Copy link

@Dacops Dacops commented Mar 18, 2024

Even though it does not make much sense, joining a dataframe with a list of a single element should behave as joining the dataframe with that element.

@Dacops Dacops marked this pull request as draft March 20, 2024 15:54
@Dacops Dacops changed the title BUG: Join with list of dataframes that have MultiIndex now behaves as expected (#57676) BUG: Join with a list of a single element behaves as a join with a single element (#57676) Mar 20, 2024
@Dacops Dacops marked this pull request as ready for review March 20, 2024 22:35
@Dacops
Copy link
Author

Dacops commented Mar 20, 2024

Seems like the same test is failing over and over again, however it does not make sense to me, the test is:

def  test_suffix_on_list_join():
        first = DataFrame({"key": [1, 2, 3, 4, 5]})
        second = DataFrame({"key": [1, 8, 3, 2, 5], "v1": [1, 2, 3, 4, 5]})
        third = DataFrame({"keys": [5, 2, 3, 4, 1], "v2": [1, 2, 3, 4, 5]})
    
        # check proper errors are raised
        msg = "Suffixes not supported when joining multiple DataFrames"
        with pytest.raises(ValueError, match=msg):
>           first.join([second], lsuffix="y")
E           Failed: DID NOT RAISE <class 'ValueError'>

In this example multiple DataFrames are not passed, just a list with a single DataFrame

@Dacops Dacops changed the title BUG: Join with a list of a single element behaves as a join with a single element (#57676) BUG: Join with a list of a single element should behave as a join with a single element (#57676) Mar 20, 2024
@Dacops
Copy link
Author

Dacops commented Mar 23, 2024

I've updated the test mentioned above, before any list passed with the Dataframe.join method was assumed to have 2+ elements, by adding this edge case where a list can have a single element (behaving has a Dataframe.join with a single element) that condition in the test "test_suffix_on_list_join" no longer makes sense.

@Aloqeely
Copy link
Member

Aloqeely commented Apr 9, 2024

LGTM. @mroeschke would you mind taking a look?

@Dacops
Copy link
Author

Dacops commented Apr 22, 2024

Thanks for the review @Aloqeely, in the meantime I've rebased my PR to address conflicts in the "whatsnew" file. I've noticed that the PR started failing the Numpy Dev tests because it was rebased from the main Pandas branch which is failing the same tests. No other changes (except the whatsnew file) were made since the last review. If any developer familiar with this area of the code could take a look, I'd appreciate it (cc @mroeschke, @WillAyd, @jbrockmendel ) I apologize for the broader ping, but I wasn't sure who would be best suited to review this change.

@WillAyd
Copy link
Member

WillAyd commented May 1, 2024

Hmm unfortunately I am -1 on making this change. I'm not sure I understand why we need to special case a single element list.

Generally our requirement for joins / merges is that the key be hashable

@WillAyd
Copy link
Member

WillAyd commented May 1, 2024

Sorry ignore my hashability comment - I see this is in regards to the join argument not the key elements themselves. Still, I'm not sure why we need to special case this

@Dacops
Copy link
Author

Dacops commented May 1, 2024

There's a problem, in some rare cases, that joining a dataframe x with a list containing a single dataframe y would lose information in relation to just joining dataframe x with dataframe y (one example of this on the original issue: #57676)

@WillAyd
Copy link
Member

WillAyd commented May 1, 2024

Ah OK thanks @Dacops . This only solves that then though with a single element list right? What happens when there are multiple list elements? I think there is a fix somewhere else in the code that would work generally - ideally we avoid special-casing solutions like this as they don't add up well over time

@Dacops
Copy link
Author

Dacops commented May 1, 2024

With multiple list elements the behaviour remains the same. Basically, what the code did was if the joining element is an Object deal with it in the way A, if it's a List deal with it in the way B. I've added that if-statemen that just checks if the list has a single element. If it does use way A else keep using way B.

@Aloqeely
Copy link
Member

Aloqeely commented May 1, 2024

I think what he's saying is that a better fix would be to not treat lists with 1 item as a special case that uses different behavior, and instead fix the merge logic such that it does not fail even if the list has 1 item only

@Dacops
Copy link
Author

Dacops commented May 2, 2024

Hmmm, yeah I did thought of that, however the line that handles that is on frame.py/10685: can_concat = all(df.index.is_unique for df in frames) and it is been there for 12 years. Therefore I assumed the logic was correct since it wasn't changed for that long and it only fails in very rare cases. I thought it would be more prudent to add the if-statement for the edge cases instead of changing code that has been unaltered for so long.

@Aloqeely
Copy link
Member

Aloqeely commented May 2, 2024

It might very well be broken logic even if it was there for 12 years, funnily enough I just saw another comment saying that some broken logic was introduced 13 years ago

@Dacops
Copy link
Author

Dacops commented May 2, 2024

I think I've identified the problem, that expression evaluates if all inserted indexes from all objects in the list are unique so that they can be concatenated (stacked on top of each other). However if a MultiIndex is used the entire MultiIndex is compared not separated indexes inside it. A problem can arise here, for example on the original issue a Indexed dataframe is getting joined to a MultiIndexed dataframe, the pairs of indexes of the MultiIndexed dataframe are compared to the indexes of the Indexed dataframe. There's no repetition, so they get concatenated, but here's the error. One of the MultiIndexed dataframes index is 'y' same as the index of the Indexed dataframe, and those no longer are unique, which means they need to be merged. Solution I'm thinking of, after the lambda function to check if all indexes are unique, if this evaluates to True add an extra check to look for repeated indexes and if any of these are not unique change the value to False and skip to the Merge instead of a Concat.

@WillAyd
Copy link
Member

WillAyd commented May 2, 2024

Hmm I'm not sure I follow - I feel like the indexes should be the same when trying to join. Assuming some kind of matching behavior between an Index and MultiIndex might have a lot of logic pitfalls

@Dacops
Copy link
Author

Dacops commented May 2, 2024

The problem is with the MultiIndexes, for example in the original issue, the Index is index=pd.Index(['a', 'b', 'c'], name='y')), while the MultiIndex is index=pd.MultiIndex.from_tuples([(0, 'a'), (0, 'b'), (0, 'c'), (1, 'a'), (1, 'b'), (1, 'c')], names=('x', 'y')). When all(df.index.is_unique for df in frames) is run over both it gives out the values: 'a', 'b', 'c', (0, 'a'), (0, 'b'), (0, 'c'), (1, 'a'), (1, 'b'), (1, 'c') which returns True (there are no repetitions, value of indexes are unique, concatenate them). However this cannot happen since the MultiIndex has a 'y' index, if we extract it and repeat all(df.index.is_unique for df in frames) over both 'y' we get 'a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c' which returns False (there are repeated values, we can't concatenate them). Tl;dr when a MultiIndex is passed it's getting evaluated as a whole and not the individual indexs inside it where the merge may occur.

@Dacops
Copy link
Author

Dacops commented May 2, 2024

I believe I've got the solution for this, I'll push it so you can take a look at it

@Aloqeely
Copy link
Member

Aloqeely commented May 2, 2024

So is this a problem with is_unique on MultiIndex? Do you think it's current behavior makes sense? I think it should return True only if each level has unique values, and if I understood you correctly that seems to be failing right now

If you agree with my suggested behavior then I will open an issue to fix it, if not we should at least document how it operates on MultiIndex (don't see any examples for that)

@Dacops Dacops force-pushed the my-branch branch 2 times, most recently from 351f139 to fe7336b Compare May 2, 2024 23:05
@Dacops
Copy link
Author

Dacops commented May 2, 2024

@Aloqeely I believe the current behaviour of is_unique for MultiIndex is correct, however for this case this behaviour would not work. This because we're joining an Indexed dataframe with a MultiIndexed dataframe, the join will only occur in a single index (the one of the Indexed dataframe), so the remaining indexs of the MultiIndex are irrelevant to this operation.

@Dacops
Copy link
Author

Dacops commented May 4, 2024

this doc build and upload check is something I did wrong or do I just re-run the checks, I haven't touched that /home/runner/work/pandas/pandas/doc/source/user_guide/merging.rst

@Aloqeely
Copy link
Member

Aloqeely commented May 7, 2024

It does seem related to your changes, I've summarized the example in the doc that fails:

import pandas as pd

left = pd.DataFrame(
       {
           "A": ["A0", "A1", "A2", "A3"],
           "B": ["B0", "B1", "B2", "B3"],
           "key2": ["K0", "K1", "K0", "K1"],
       },
       index=pd.Index(["K0", "K0", "K1", "K2"], name="key1"),
   )

right = pd.DataFrame({"v": [7, 8, 9]}, index=["K1", "K1", "K2"])

result = left.join([right])

Raises

UnboundLocalError: cannot access local variable 'can_concat' where it is not associated with a value

It is surprising that only the doc build failed but not any tests, so once you fix it for this case please add tests.

@WillAyd
Copy link
Member

WillAyd commented May 8, 2024

We are fixing the wrong thing from the original issue. I think specifying on=... is reasonable to allow an Index and a MultiIndex to be joined together, but we should not be doing anything to automatically try and determine this. We should actually be raising when the indexes have a different number of levels in join

@Dacops
Copy link
Author

Dacops commented May 8, 2024

Oh thanks @Aloqeely , I figured out what it was, on this line, in case "common_indexes" was empty can_concat would not be initialized and it is called in the following if statement, so it threw an error there. I initialized it to a default value before the for loop so that doesn't occur anymore. Everything's good now, @WillAyd if you could re-review the PR I would appreciate it, thanks.

@Dacops
Copy link
Author

Dacops commented May 8, 2024

Oh just saw your comment sorry

@Dacops
Copy link
Author

Dacops commented May 8, 2024

Hmmm, so if on=... is not defined it should just raise an error?

@WillAyd
Copy link
Member

WillAyd commented May 8, 2024

Right if you are trying to join two index objects with non-equal levels

@Dacops
Copy link
Author

Dacops commented May 8, 2024

Ok, so I've changed the default value of "common_indexes" to a raise if "common_indexes" is empty

@Dacops
Copy link
Author

Dacops commented May 8, 2024

@Aloqeely how can I get better errors on the Doc build and upload test like you sent instead of just RuntimeError: Unexpected exception in /home/runner/work/pandas/pandas/doc/source/user_guide/merging.rst line 930 ?

@Aloqeely
Copy link
Member

Aloqeely commented May 8, 2024

You can open the doc file which has the errors and go to the error line, run the examples near to that line manually until you see which example causes the error.

@Dacops
Copy link
Author

Dacops commented May 8, 2024

Oh thanks @Aloqeely, so if my error says RuntimeError: Unexpected exception in /home/runner/work/pandas/pandas/doc/source/user_guide/merging.rst line 930, here. Then the test that fails is the one above? The one that starts on line 906, here?

@Aloqeely
Copy link
Member

Aloqeely commented May 8, 2024

I think so

@Dacops
Copy link
Author

Dacops commented May 9, 2024

I know what's happening, in this example left and right both have an index named 'key1' while right2 has an index but it is not named. This not named index is inferred to be 'key1', is this expected behaviour? Taking in consideration your input @WillAyd ?

We are fixing the wrong thing from the original issue. I think specifying on=... is reasonable to allow an Index and a MultiIndex to be joined together, but we should not be doing anything to automatically try and determine this. We should actually be raising when the indexes have a different number of levels in join

@Dacops
Copy link
Author

Dacops commented May 26, 2024

I've changed the test in the documentation that was failing, instead of joining index "key 1" on "key 1" on None now they're all "key 1"

@Dacops
Copy link
Author

Dacops commented May 26, 2024

@WillAyd the issue should be completed now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: join with list does not behave like singleton
3 participants