Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: allow EAs to override MergeOperation._get_join_indexers #53696

Open
jbrockmendel opened this issue Jun 15, 2023 · 3 comments
Open

ENH: allow EAs to override MergeOperation._get_join_indexers #53696

jbrockmendel opened this issue Jun 15, 2023 · 3 comments
Labels
API Design ExtensionArray Extending pandas with custom dtypes or arrays. Performance Memory or execution speed performance Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@jbrockmendel
Copy link
Member

jbrockmendel commented Jun 15, 2023

AsOfMerge._get_join_indexers calls to_numpy() on EAs, which can be costly. _MergeOperation._get_join_indexers is a bit more forgiving, using _values_for_factorize in _factorize_keys. The latter still requires a cast to numpy, which makes this a non-starter for (hypothetical) distributed/GPU EAs.

Can we push this up into an EA method that can be potentially overridden? This might be a use case for @mroeschke 's "ExtensionManager" idea.

@jbrockmendel jbrockmendel added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 15, 2023
@lithomas1 lithomas1 added Performance Memory or execution speed performance Reshaping Concat, Merge/Join, Stack/Unstack, Explode API Design ExtensionArray Extending pandas with custom dtypes or arrays. and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 18, 2023
@MichaelTiemannOSC
Copy link
Contributor

As an FYI, my EA (which extends PintArray to support uncertain magnitudes, hgrecco/pint-pandas#140) is getting hung up on a mismatch between what I understood to be the interface to _values_from_factorize and what rizer._value_from_factorize is returning. My grief comes from here (pandas/tests/extension/base/reshaping.py):

    def test_merge_on_extension_array_duplicates(self, data):
	# GH 23020                                                                                                                                                                                                                                                                                                      
        a, b = data[:2]
	key = type(data)._from_sequence([a, b, a], dtype=data.dtype)
	df1 = pd.DataFrame({"key": key, "val": [1, 2, 3]})
        df2 = pd.DataFrame({"key": key, "val": [1, 2, 3]})

        result = pd.merge(df1, df2, on="key")
        expected = pd.DataFrame(
            {
             	"key": key.take([0, 0, 0, 0, 1]),
                "val_x": [1, 1, 3, 3, 2],
		"val_y": [1, 3, 1, 3, 2],
            }
        )
        self.assert_frame_equal(result, expected)

data is [1.0, 2.0, 1.0] for both df1 and df2 we get

   key  val
0  1.0    1
1  2.0    2
2  1.0    3

Down in the merge we have:

> /Users/michael/Documents/GitHub/MichaelTiemannOSC/pandas-dev/pandas/core/reshape/merge.py(2396)_factorize_keys()
-> isinstance(lk.dtype, DatetimeTZDtype) and isinstance(rk.dtype, DatetimeTZDtype)
(Pdb) lk  # rk is the same
<PintArray>
[1.0, 2.0, 1.0]
Length: 3, dtype: pint[nanometer]
(Pdb) 

and the obviously NONUNIQUE values here:

(Pdb) lk._values_for_factorize()  # rk._values_for_factorize() is the same
(array([1.0, 2.0, 1.0], dtype=object), nan)

My understanding was that values returned by _values_for_factorize should be unique (and should not contain the na_value if use_na_sentinel is true). When I allow my _values_for_factorize to return duplicated values, other tests fail.

Clearly this code (from pandas/core/arrays/base.py) is doing nothing to unique-ify the values of the data, though the documentation snippet does refer to uniques (which are not present in the code):

    def _values_for_factorize(self) -> tuple[np.ndarray, Any]:
        """                                                                                                                                                                                                                                                                                                             
        Return an array and missing value suitable for factorization.                                                                                                                                                                                                                                                   
                                                                                                                                                                                                                                                                                                                        
        Returns                                                                                                                                                                                                                                                                                                         
        -------                                                                                                                                                                                                                                                                                                         
        values : ndarray                                                                                                                                                                                                                                                                                                
            An array suitable for factorization. This should maintain order                                                                                                                                                                                                                                             
            and be a supported dtype (Float64, Int64, UInt64, String, Object).                                                                                                                                                                                                                                          
            By default, the extension array is cast to object dtype.                                                                                                                                                                                                                                                    
        na_value : object                                                                                                                                                                                                                                                                                               
            The value in `values` to consider missing. This will be treated                                                                                                                                                                                                                                             
            as NA in the factorization routines, so it will be coded as                                                                                                                                                                                                                                                 
            `-1` and not included in `uniques`. By default,                                                                                                                                                                                                                                                             
            ``np.nan`` is used.                                                                                                                                                                                                                                                                                         
                                                                                                                                                                                                                                                                                                                        
        Notes                                                                                                                                                                                                                                                                                                           
        -----                                                                                                                                                                                                                                                                                                           
        The values returned by this method are also used in                                                                                                                                                                                                                                                             
        :func:`pandas.util.hash_pandas_object`. If needed, this can be                                                                                                                                                                                                                                                  
        overridden in the ``self._hash_pandas_object()`` method.                                                                                                                                                                                                                                                        
        """
        return self.astype(object), np.nan

Help?

@jbrockmendel
Copy link
Member Author

My understanding was that values returned by _values_for_factorize should be unique (and should not contain the na_value if use_na_sentinel is true). When I allow my _values_for_factorize to return duplicated values, other tests fail.

An EA's _values_for_factorize does not need to be unique. The reference to "uniques" in the _values_for_factorize docstring is about what is returned by factorize. I agree this is confusing, and in fact am of the opinion that _values_for_factorize is a bad pattern that should go in general (xref #53501). As for things failing when you make values_for_factorize return something non-unique, let's find a dedicated place to discuss that. Is pint-pandas#140 appropriate for that?

@MichaelTiemannOSC
Copy link
Contributor

I got it sorted by aligning the non-unique _values_for_factorize with my EA factorize code that does the unique-ification itself. All good. I'm now passing. Please close.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design ExtensionArray Extending pandas with custom dtypes or arrays. Performance Memory or execution speed performance Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

No branches or pull requests

3 participants