
ENH: re-implement DataFrame.lookup. #40140

Open
erfannariman opened this issue Mar 1, 2021 · 25 comments
Labels
Enhancement Needs Discussion Requires discussion from core team before further action Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Milestone

Comments

@erfannariman
Member

erfannariman commented Mar 1, 2021

DataFrame.lookup was deprecated in #35224 in 1.2. After some feedback (#39171) I opened this ticket to discuss re-implementation of lookup in a performant way. As mentioned in the discussion on #35224: "but it would have to be performant and not be yet another indexing api".

This ticket can be a starting point for proposed methods, although the old implementation was actually quite performant; see the tests given in the discussion of #35224:

pandas/pandas/core/frame.py

Lines 3848 to 3861 in b5958ee

if not self._is_mixed_type or n > thresh:
    values = self.values
    ridx = self.index.get_indexer(row_labels)
    cidx = self.columns.get_indexer(col_labels)
    if (ridx == -1).any():
        raise KeyError("One or more row labels was not found")
    if (cidx == -1).any():
        raise KeyError("One or more column labels was not found")
    flat_index = ridx * len(self.columns) + cidx
    result = values.flat[flat_index]
else:
    result = np.empty(n, dtype="O")
    for i, (r, c) in enumerate(zip(row_labels, col_labels)):
        result[i] = self._get_value(r, c)
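For context, the fast path above can be exercised as a standalone function. This is an illustrative sketch (the name `lookup` here is a stand-in, not the pandas method itself):

```python
import numpy as np
import pandas as pd

def lookup(df, row_labels, col_labels):
    # Mirror of the fast path above: translate labels to positions,
    # then index the flattened values array.
    ridx = df.index.get_indexer(row_labels)
    cidx = df.columns.get_indexer(col_labels)
    if (ridx == -1).any():
        raise KeyError("One or more row labels was not found")
    if (cidx == -1).any():
        raise KeyError("One or more column labels was not found")
    flat_index = ridx * len(df.columns) + cidx
    return df.values.flat[flat_index]

df = pd.DataFrame({"x": [12, 1, 6, 64], "y": [3, 4, 2, 3]},
                  index=["a", "b", "c", "d"])
print(lookup(df, ["b", "d"], ["y", "x"]))  # [ 4 64]
```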

@erfannariman erfannariman added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Mar 1, 2021
@erfannariman erfannariman changed the title ENH: re-implement DataFrame.lookup in a performant way. ENH: re-implement DataFrame.lookup. Mar 1, 2021
@MarcoGorelli MarcoGorelli added Needs Discussion Requires discussion from core team before further action and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Mar 1, 2021
@challisd

challisd commented Aug 4, 2021

I think there should definitely be a lookup function. Since the old one seems to work well, is un-deprecating it an option? I find the proposed alternative using melt unreadable, and based on the (sadly heated) discussion here, the old lookup function is faster than the suggested melt alternative. Pandas is used by many thousands of programmers and scientists who often have only a vague (or no) idea of what the melt function does. Running a quick series of lookups using lists of row and column coordinates is a fairly ordinary task, but without this lookup function most users will likely fall back on a slow for loop; and if that is too slow for them, they will give up and just use NumPy, where you can write the_data[row_index_list, column_index_list].
Can we please keep this function?

@berkgercek

berkgercek commented May 9, 2022

Going to throw my voice in here and say that this is a pretty important feature for dataframes that allows for numpy-like behavior with labeled complex indexes and columns.

My personal use case for Pandas is often reliant on using it to keep labels and data together, and working with a method like lookup is a part of how I use it. It also fits in the scope of the package description provided in the documentation:

Intelligent label-based slicing, fancy indexing, and subsetting of large data sets

It's not something I use often, but the proposed solution linked in the deprecation notice is very inelegant.

One use case I have today is to do something similar to the following (obviously with meaningful data), which I have done with the above solution:

df1 = pd.DataFrame(np.random.normal(size=(1000, 3)), columns=['A', 'B', 'C'])
df2 = pd.DataFrame(np.random.normal(size=(1000, 3)), columns=['A', 'B', 'C'])
maxidx = df1.idxmax(axis=1)
# The solution suggested by the deprecation notice is below.
idx, cols = pd.factorize(maxidx)
df2_lookup = pd.Series(df2.reindex(cols, axis=1).to_numpy()[np.arange(len(df2)), idx], index=df2.index)

This is not a readable solution for me, and when others need to maintain this code, it will not be at all obvious at a glance what I'm doing here.

If there is another solution that would work equally well but be more intuitive I am happy to use that instead, but I see no alternative to the .lookup method for this use case.

@rhshadrach
Member

+1 on re-implementing unless there is a more understandable alternative; I too find it hard to discern what the current alternative is doing.

@jorisvandenbossche jorisvandenbossche added this to the 2.0 milestone Nov 28, 2022
@mroeschke mroeschke modified the milestones: 2.0, 3.0 Feb 8, 2023
@challisd

Any word on if or when this feature will be added back in, or has anyone figured out a viable alternative?

@erfannariman
Member Author

Just to check: is there agreement that this will be added back in if there's a viable PR, before someone (or I myself) starts to work on it? @jorisvandenbossche @mroeschke @rhshadrach

@MarcoGorelli
Member

MarcoGorelli commented Feb 22, 2023

Agree that the melt alternative wasn't great, but is there any case where the currently-suggested workaround

idx, cols = pd.factorize(df['col'])
df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]

isn't sufficient?

I mean, if it was re-implemented with this under the hood. From the timings, it looks like it's more performant than the old solution ever was?

I think I'd also be OK with reimplementing

@challisd

challisd commented Feb 22, 2023

@MarcoGorelli I'm actually confused by that solution. The lookup function takes as input a list of index values and a list of column values of equal length. Where are those two lists in the example? The example seems limited to the case where you have a specific column value you want for every row, but the lookup function should allow us to look up specific column values for a specific list of indices.
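To make the distinction concrete, here is a small illustrative sketch (the data and labels are made up) contrasting the per-row factorize workaround with the general case of arbitrary (row, column) label pairs:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [12, 1, 6, 64], "y": [3, 4, 2, 3]},
                  index=["a", "b", "c", "d"])

# The factorize workaround: exactly one column label *per row* of df.
col_per_row = pd.Series(["y", "x", "y", "x"], index=df.index)
idx, cols = pd.factorize(col_per_row)
per_row = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]

# The general lookup case: arbitrary (row, col) label pairs,
# possibly a subset of rows, possibly with repeats.
row_labels = ["b", "d", "b"]
col_labels = ["y", "x", "x"]
pairs = df.to_numpy()[df.index.get_indexer(row_labels),
                      df.columns.get_indexer(col_labels)]
print(per_row)  # one value for every row of df
print(pairs)    # one value per requested (row, col) pair
```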

@challisd

challisd commented Mar 9, 2023

Okay, after spending more time on this than I probably should have, I came up with a solution that seems to fully match the original functionality, with performance about the same as (slightly faster than) the df.lookup function.

df = pd.DataFrame([[12, 3], [1, 4], [6, 2], [64, 3]], index=['a', 'b', 'c', 'd'], columns=['x', 'y'])
row_labels = ['b', 'd']
col_labels = ['y', 'x']

df.to_numpy()[df.index.get_indexer(row_labels), df.columns.get_indexer(col_labels)]

I tested both the old lookup function and this new approach 100k times, on a very small and on a moderately large (100k x 4) DataFrame, and in both cases the alternate approach ran marginally faster (39 seconds compared to 41.5 seconds).

I suggest the example in the deprecation notice be switched to this, as the current one doesn't really match the full functionality of df.lookup.

@erfannariman
Member Author

> Okay, after spending more time on this than I probably should have, I came up with a solution that seems to fully match the original functionality [...] I suggest the example in the deprecation notice be switched to this, as the current one doesn't really match the full functionality of df.lookup.

Isn't this the same implementation as the original? See OP where I posted a code snippet of the original function.

@challisd

Based on performance metrics, I would guess that yes, it is the original implementation. If so, that really should be the suggested replacement in the deprecation notice, as it gives the same functionality. My own opinion is that it never should have been deprecated, as it is quite useful and easy to understand as-is. This functionality is a basic part of NumPy and R data.frames; it doesn't make sense to me to remove it. If there is an alternative that is actually faster AND includes the full functionality, then by all means let's do that, but I haven't been able to find anything like that yet.

@mnajaria

Hi @MarcoGorelli, is there any chance that lookup can be un-deprecated? Since it is self-explanatory and familiar to many programmers, isn't it better to have it? The following code is confusing:
df.to_numpy()[df.index.get_indexer(row_labels), df.columns.get_indexer(col_labels)]

Moreover, it broke my pipelines :( I will add the lookup function to the pandas code myself for now (I hope it won't break the whole package), but it would be great to have it back in pandas :)

@MarcoGorelli
Member

sounds good to me - anyone interested in making a PR?

@mnajaria

I would like to do it.

@lohraspco

I am trying to find where I can assign it to myself (according to the instructions), but I can't find the button for it. How can I do that?

@erfannariman
Member Author

@lohraspco I assigned it to you. You can also comment "take" and it will trigger the assignment as well. Good luck with the implementation!

@mnajaria

I have added the lookup functionality with a few modifications from the previous version. I borrowed @challisd's method. If interested, you can look at the code at the following link:
https://github.com/mnajaria/pandas/blob/bringing_back_lookup_functionality/pandas/core/frame.py

@MarcoGorelli
Member

thanks - could you open a PR please, and include tests?

@MarcoGorelli
Member

Wait, sorry, thinking about this again: to_numpy on a DataFrame is very likely to result in object dtype, which I'm not sure we should be encouraging as part of the public API.

Is there a way to do this without converting to an object numpy array?

If not, I'm tempted to suggest not bringing it back, just documenting

df.to_numpy()[df.index.get_indexer(row_labels), df.columns.get_indexer(col_labels)]

as the suggested workaround, and noting the performance and memory drawbacks of object dtype

@rhshadrach
Member

This is a much more straightforward solution than I was aware of when I made #40140 (comment). Agreed with @MarcoGorelli - I'd suggest something like:

def pd_lookup(df, row_labels, col_labels):
    rows = df.index.get_indexer(row_labels)
    cols = df.columns.get_indexer(col_labels)
    result = df.to_numpy()[rows, cols]
    return result

is best kept in users' code rather than in pandas.
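For completeness, the helper above works as-is once imports are added; a quick usage sketch with made-up data:

```python
import numpy as np
import pandas as pd

def pd_lookup(df, row_labels, col_labels):
    # Translate labels to positions, then one fancy-index into the array.
    rows = df.index.get_indexer(row_labels)
    cols = df.columns.get_indexer(col_labels)
    result = df.to_numpy()[rows, cols]
    return result

df = pd.DataFrame({"x": [12, 1, 6, 64], "y": [3, 4, 2, 3]},
                  index=["a", "b", "c", "d"])
print(pd_lookup(df, ["b", "d"], ["y", "x"]))  # [ 4 64]
```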

@challisd

challisd commented Nov 6, 2023

@MarcoGorelli, I just tried out my suggested code again and confirmed it does not change the dtype to object; it is still int64. What situation are you concerned about specifically?

@rhshadrach, why would that be better in user code? As I mentioned above, this type of lookup operation is a basic feature of most other table/dataframe solutions, such as NumPy and R data.frames. Leaving it out of pandas is a major oversight in my opinion and will drive less experienced programmers away from pandas. While the solution is fairly simple, most beginner and intermediate pandas users are not familiar with the get_indexer function or how to apply it this way. They are not likely to just figure it out; instead, they will implement something truly slow, like nested for loops. This just feeds the idea that Python and pandas are too slow for big data science.

I don't really see it at this point, but if there is real potential for major performance issues when it is used correctly, I think a simple warning message would be sufficient. If the slowdown is minor, it can be ignored; if it's a major issue, the warning can point users in the right direction.
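The homogeneous-dtype observation is easy to check directly; a small illustrative sketch:

```python
import pandas as pd

# With a single (homogeneous) dtype, to_numpy() keeps the frame's
# native dtype rather than coercing to object.
df = pd.DataFrame([[12, 3], [1, 4], [6, 2], [64, 3]],
                  index=["a", "b", "c", "d"], columns=["x", "y"])
out = df.to_numpy()[df.index.get_indexer(["b", "d"]),
                    df.columns.get_indexer(["y", "x"])]
print(out, out.dtype)  # values keep the frame's int64 dtype
```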

@rhshadrach
Member

What situation are you concerned about specifically?

When the DataFrame has multiple dtypes:

size = 100_000
df = pd.DataFrame({'a': np.random.randint(0, 100, size), 'b': np.random.random(size), 'c': 'x'})

row_labels = np.repeat(np.arange(size), 2)
col_labels = np.tile(['a', 'b'], size)
rows = df.index.get_indexer(row_labels)
cols = df.columns.get_indexer(col_labels)
looked_up = df.to_numpy()[rows, cols]

%timeit df[['a', 'b']].sum().sum()
# 505 µs ± 903 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit looked_up.sum()
# 1.88 ms ± 18 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

why would that be better in user code?

In general, it seems to me that adding methods that are short combinations of pandas methods is not sustainable. I think our goal should be to provide users the fundamental tools they need to combine in various ways to accomplish their tasks, rather than trying to make every use case a single method call.

It seems to me that if the "straightforward" implementation involves unnecessarily coercing to object dtype, and there is a more advanced implementation that does not, then the case for inclusion in pandas would be far more compelling.
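For what it's worth, one possible shape for such a dtype-preserving implementation is to gather values column by column, widening only to the common dtype of the columns actually requested. This is purely a sketch (`lookup_no_object` is a made-up name, not a pandas API):

```python
import numpy as np
import pandas as pd

def lookup_no_object(df, row_labels, col_labels):
    # Fetch values column by column so the mixed-dtype frame is never
    # materialised as a single object array.
    rows = df.index.get_indexer(row_labels)
    cols = df.columns.get_indexer(col_labels)
    used = np.unique(cols)
    # Common dtype of only the columns actually requested.
    out_dtype = np.result_type(*(df.dtypes.iloc[i] for i in used))
    result = np.empty(len(rows), dtype=out_dtype)
    for i in used:
        mask = cols == i
        result[mask] = df.iloc[:, i].to_numpy()[rows[mask]]
    return result

# Mixed frame: int, float, and object columns.
df = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, 1.5, 2.5], "c": list("xyz")})
vals = lookup_no_object(df, [0, 2], ["a", "b"])
print(vals, vals.dtype)  # numeric result, no object dtype
```

If an object column is among those requested, the result still widens to object, but purely numeric requests stay numeric, which addresses the coercion concern above.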

@rhshadrach rhshadrach added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Nov 6, 2023
@challisd

challisd commented Nov 7, 2023

I agree that we can't provide a ton of short combinations of pandas methods, but such a common and basic use case certainly should be included, in my opinion. What about something like the following:

def lookup(df, row_labels, col_labels, dtype=None):
    if len(df.dtypes.unique()) > 1:
        warnings.warn("DataFrame contains mixed data types which may lead to unexpected type coercion and/or decreased performance")
    if np.dtype('O') in df.dtypes.values and dtype is not None and dtype != 'object':
        warnings.warn("DataFrames with columns of the 'object' data type may fail to be coerced to other data types")
    rows = df.index.get_indexer(row_labels)
    cols = df.columns.get_indexer(col_labels)
    return df.to_numpy(dtype=dtype)[rows, cols]

It allows greater control over the output data type and gives warnings when dealing with mixed data types. It gives an additional warning when attempting to coerce an object column to a non-object data type, as this can easily lead to an exception.

@MarcoGorelli
Member

such a common and basic use case certainly should be included

just checking, are there any other dataframe libraries which include this?

@challisd

challisd commented Nov 7, 2023

Not sure; I've mostly only used pandas in Python. I'm claiming it's a basic feature based on the evidence that both R and NumPy support this functionality as part of the built-in [] indexing.

@challisd

challisd commented Nov 8, 2023

Actually, it seems I remembered incorrectly and it is not a basic feature of R. Sorry for the mistake!
