
ENH: re-implement DataFrame.lookup. #40140

Open
erfannariman opened this issue Mar 1, 2021 · 25 comments
Labels
Enhancement Needs Discussion Requires discussion from core team before further action Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Milestone

Comments

@erfannariman
Member

erfannariman commented Mar 1, 2021

DataFrame.lookup was deprecated in #35224 in 1.2. After some feedback (#39171) I opened this ticket to discuss re-implementation of lookup in a performant way. As mentioned in the discussion on #35224: "but it would have to be performant and not be yet another indexing api".

This ticket can be a starting point for proposed methods, although the old implementation was actually quite performant; see the tests given in the discussion of #35224:

pandas/pandas/core/frame.py

Lines 3848 to 3861 in b5958ee

if not self._is_mixed_type or n > thresh:
    values = self.values
    ridx = self.index.get_indexer(row_labels)
    cidx = self.columns.get_indexer(col_labels)
    if (ridx == -1).any():
        raise KeyError("One or more row labels was not found")
    if (cidx == -1).any():
        raise KeyError("One or more column labels was not found")
    flat_index = ridx * len(self.columns) + cidx
    result = values.flat[flat_index]
else:
    result = np.empty(n, dtype="O")
    for i, (r, c) in enumerate(zip(row_labels, col_labels)):
        result[i] = self._get_value(r, c)
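For context, the fast path above can be exercised as a standalone function. This is an illustrative sketch (the name `lookup` here is a stand-in, not the pandas method itself):

```python
import numpy as np
import pandas as pd

def lookup(df, row_labels, col_labels):
    # Mirror of the fast path above: translate labels to positions,
    # then index the flattened values array.
    ridx = df.index.get_indexer(row_labels)
    cidx = df.columns.get_indexer(col_labels)
    if (ridx == -1).any():
        raise KeyError("One or more row labels was not found")
    if (cidx == -1).any():
        raise KeyError("One or more column labels was not found")
    flat_index = ridx * len(df.columns) + cidx
    return df.values.flat[flat_index]

df = pd.DataFrame({"x": [12, 1, 6, 64], "y": [3, 4, 2, 3]},
                  index=["a", "b", "c", "d"])
print(lookup(df, ["b", "d"], ["y", "x"]))  # [ 4 64]
```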

@erfannariman erfannariman added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Mar 1, 2021
@erfannariman erfannariman changed the title ENH: re-implement DataFrame.lookup in a performant way. ENH: re-implement DataFrame.lookup. Mar 1, 2021
@MarcoGorelli MarcoGorelli added Needs Discussion Requires discussion from core team before further action and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Mar 1, 2021
@challisd

challisd commented Aug 4, 2021

I think there should definitely be a lookup function. Since the old one seems to work well, is un-deprecating it an option? I find the proposed alternative using melt unreadable, and based on the (sadly heated) discussion here, the old lookup function is faster than the suggested melt alternative. Pandas is used by many thousands of programmers and scientists who often have only a vague (or no) idea of what the melt function does. Running a quick series of lookups using lists of row and column coordinates is a fairly ordinary task, but without this lookup function most users will likely fall back on a slow for loop; and if that is too slow for them, they will give up and just use NumPy, where you can write the_data[row_index_list, column_index_list].
Can we please keep this function?

@berkgercek

berkgercek commented May 9, 2022

Going to throw my voice in here and say that this is a pretty important feature for dataframes that allows for numpy-like behavior with labeled complex indexes and columns.

My personal use case for Pandas is often reliant on using it to keep labels and data together, and working with a method like lookup is a part of how I use it. It also fits in the scope of the package description provided in the documentation:

Intelligent label-based slicing, fancy indexing, and subsetting of large data sets

It's not something I use often, but the proposed solution linked in the deprecation notice is very inelegant.

One use case I have today is to do something similar to the following (obviously with meaningful data), which I have done with the above solution:

df1 = pd.DataFrame(np.random.normal(size=(1000, 3)), columns=['A', 'B', 'C'])
df2 = pd.DataFrame(np.random.normal(size=(1000, 3)), columns=['A', 'B', 'C'])
maxidx = df1.idxmax(axis=1)
# The solution suggested by the deprecation notice is below.
idx, cols = pd.factorize(maxidx)
df2_lookup = pd.Series(df2.reindex(cols, axis=1).to_numpy()[np.arange(len(df2)), idx], index=df2.index)

This is not a readable solution for me, and when others need to maintain this code, it will not be at all obvious at a glance what I'm doing here.

If there is another solution that would work equally well but be more intuitive I am happy to use that instead, but I see no alternative to the .lookup method for this use case.

@rhshadrach
Member

+1 on re-implementing unless there is a more understandable alternative; I too find it hard to discern what the current alternative is doing.

@jorisvandenbossche jorisvandenbossche added this to the 2.0 milestone Nov 28, 2022
@mroeschke mroeschke modified the milestones: 2.0, 3.0 Feb 8, 2023
@challisd

Any word on if or when this feature will be added back in, or has anyone figured out a viable alternative?

@erfannariman
Member Author

Just to check: is there agreement that this will be added back in if there's a viable PR, before someone (or I myself) starts to work on it? @jorisvandenbossche @mroeschke @rhshadrach

@MarcoGorelli
Member

MarcoGorelli commented Feb 22, 2023

Agree that the melt alternative wasn't great, but is there any case where the currently-suggested workaround

idx, cols = pd.factorize(df['col'])
df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]

isn't sufficient?

I mean, if it was re-implemented with this under the hood. From the timings, it looks like it's more performant than the old solution ever was?

I think I'd also be OK with reimplementing

@challisd

challisd commented Feb 22, 2023

@MarcoGorelli I'm actually confused by that solution. The lookup function takes as input a list of index values and a list of column values of equal length. Where are those two lists in the example? The example seems limited to the case where you have a specific column value you want for every row, but the lookup function should allow us to look up specific column values for a specific list of indices.
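To make the distinction concrete, here is a small illustrative sketch (the data and labels are made up) contrasting the per-row factorize workaround with the general case of arbitrary (row, column) label pairs:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [12, 1, 6, 64], "y": [3, 4, 2, 3]},
                  index=["a", "b", "c", "d"])

# The factorize workaround: exactly one column label *per row* of df.
col_per_row = pd.Series(["y", "x", "y", "x"], index=df.index)
idx, cols = pd.factorize(col_per_row)
per_row = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]

# The general lookup case: arbitrary (row, col) label pairs,
# possibly a subset of rows, possibly with repeats.
row_labels = ["b", "d", "b"]
col_labels = ["y", "x", "x"]
pairs = df.to_numpy()[df.index.get_indexer(row_labels),
                      df.columns.get_indexer(col_labels)]
print(per_row)  # one value for every row of df
print(pairs)    # one value per requested (row, col) pair
```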

@challisd

challisd commented Mar 9, 2023

Okay, after spending more time on this than I probably should have, I came up with a solution that seems to fully match the original functionality, with performance about the same as (slightly faster than) the df.lookup function.

df = pd.DataFrame([[12, 3], [1, 4], [6, 2], [64, 3]], index=['a', 'b', 'c', 'd'], columns=['x', 'y'])
row_labels = ['b', 'd']
col_labels = ['y', 'x']

df.to_numpy()[df.index.get_indexer(row_labels), df.columns.get_indexer(col_labels)]

I tested both the old lookup function and this new approach 100k times, on a very small and on a moderately large (100k x 4) DataFrame, and in both cases the alternate approach ran marginally faster (39 seconds compared to 41.5 seconds).

I suggest the example in the deprecation notice be switched to this, as the current one doesn't really match the full functionality of df.lookup.

@erfannariman
Member Author

> Okay, after spending more time on this than I probably should have, I came up with a solution that seems to fully match the original functionality [...] I suggest the example in the deprecation notice be switched to this, as the current one doesn't really match the full functionality of df.lookup.

Isn't this the same implementation as the original? See OP where I posted a code snippet of the original function.

@challisd

Based on performance metrics, I would guess that yes, it is the original implementation. If so, that really should be the suggested replacement in the deprecation notice, as it gives the same functionality. My own opinion is that it never should have been deprecated, as it is quite useful and easy to understand as-is. This functionality is a basic part of NumPy and R data.frames; it doesn't make sense to me to remove it. If there is an alternative that is actually faster AND includes the full functionality, then by all means let's do that, but I haven't been able to find anything like that yet.

@mnajaria

Hi @MarcoGorelli, is there any chance that lookup can be un-deprecated? Since it is self-explanatory and familiar to many programmers, isn't it better to have it? The following code is confusing:
df.to_numpy()[df.index.get_indexer(row_labels), df.columns.get_indexer(col_labels)]

Moreover, it broke my pipelines :( I will add the lookup function to the pandas code myself for now (I hope it won't break the whole package), but it would be great to have it back in pandas :)

@MarcoGorelli
Member

sounds good to me - anyone interested in making a PR?

@mnajaria

I would like to do it.

@lohraspco

I am trying to find where I can assign it to myself (according to the instructions), but I can't find the button for it. How can I do that?

@erfannariman
Member Author

@lohraspco I assigned it to you. You can also comment "take" and it will trigger the assignment as well. Good luck with the implementation!

@mnajaria

I have added the lookup functionality with a few modifications from the previous version. I borrowed @challisd's method. If interested, you can look at the code at the following link:
https://github.com/mnajaria/pandas/blob/bringing_back_lookup_functionality/pandas/core/frame.py

@MarcoGorelli
Member

thanks - could you open a PR please, and include tests?

@MarcoGorelli
Member

Wait, sorry, thinking about this again: to_numpy on a DataFrame is very likely to result in object dtype, which I'm not sure we should be encouraging as part of the public API.

Is there a way to do this without converting to an object numpy array?

If not, I'm tempted to suggest not bringing it back, just documenting

df.to_numpy()[df.index.get_indexer(row_labels), df.columns.get_indexer(col_labels)]

as the suggested workaround, and noting the performance and memory drawbacks of object dtype

@rhshadrach
Member

This is a much more straightforward solution than I was aware of when I made #40140 (comment). Agreed with @MarcoGorelli - I'd suggest something like:

def pd_lookup(df, row_labels, col_labels):
    rows = df.index.get_indexer(row_labels)
    cols = df.columns.get_indexer(col_labels)
    result = df.to_numpy()[rows, cols]
    return result

is best kept in users' code rather than in pandas.
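For completeness, the helper above works as-is once imports are added; a quick usage sketch with made-up data:

```python
import numpy as np
import pandas as pd

def pd_lookup(df, row_labels, col_labels):
    # Translate labels to positions, then one fancy-index into the array.
    rows = df.index.get_indexer(row_labels)
    cols = df.columns.get_indexer(col_labels)
    result = df.to_numpy()[rows, cols]
    return result

df = pd.DataFrame({"x": [12, 1, 6, 64], "y": [3, 4, 2, 3]},
                  index=["a", "b", "c", "d"])
print(pd_lookup(df, ["b", "d"], ["y", "x"]))  # [ 4 64]
```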

@challisd

challisd commented Nov 6, 2023

@MarcoGorelli, I just tried out my suggested code again and confirmed it does not change the dtype to object; it is still int64. What situation are you concerned about specifically?

@rhshadrach, why would that be better in user code? As I mentioned above, this type of lookup operation is a basic feature of most other table/dataframe solutions, such as NumPy and R data.frames. Leaving it out of pandas is a major oversight in my opinion and will drive less experienced programmers away from pandas. While the solution is fairly simple, most beginner and intermediate pandas users are not familiar with the get_indexer function or how to apply it this way. They are not likely to just figure it out; instead, they will implement something truly slow, like nested for loops. This just feeds the idea that Python and pandas are too slow for big data science.

I don't really see it at this point, but if there is real potential for major performance issues when it is used correctly, I think a simple warning message would be sufficient. If the slowdown is minor, it can be ignored; if it's a major issue, the warning can point users in the right direction.
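The homogeneous-dtype observation is easy to check directly; a small illustrative sketch:

```python
import pandas as pd

# With a single (homogeneous) dtype, to_numpy() keeps the frame's
# native dtype rather than coercing to object.
df = pd.DataFrame([[12, 3], [1, 4], [6, 2], [64, 3]],
                  index=["a", "b", "c", "d"], columns=["x", "y"])
out = df.to_numpy()[df.index.get_indexer(["b", "d"]),
                    df.columns.get_indexer(["y", "x"])]
print(out, out.dtype)  # values keep the frame's int64 dtype
```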

@rhshadrach
Member

What situation are you concerned about specifically?

When the DataFrame has multiple dtypes:

size = 100_000
df = pd.DataFrame({'a': np.random.randint(0, 100, size), 'b': np.random.random(size), 'c': 'x'})

row_labels = np.repeat(np.arange(size), 2)
col_labels = np.tile(['a', 'b'], size)
rows = df.index.get_indexer(row_labels)
cols = df.columns.get_indexer(col_labels)
looked_up = df.to_numpy()[rows, cols]

%timeit df[['a', 'b']].sum().sum()
# 505 µs ± 903 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit looked_up.sum()
# 1.88 ms ± 18 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

why would that be better in user code?

In general, it seems to me that adding methods that are short combinations of pandas methods is not sustainable. I think our goal should be to provide users the fundamental tools they need to combine in various ways to accomplish their tasks, rather than trying to make every use case a single method call.

It seems to me that if the "straightforward" implementation involves unnecessarily coercing to object dtype, and there is a more advanced implementation that does not, then the case for inclusion in pandas would be far more compelling.
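For what it's worth, one possible shape for such a dtype-preserving implementation is to gather values column by column, widening only to the common dtype of the columns actually requested. This is purely a sketch (`lookup_no_object` is a made-up name, not a pandas API):

```python
import numpy as np
import pandas as pd

def lookup_no_object(df, row_labels, col_labels):
    # Fetch values column by column so the mixed-dtype frame is never
    # materialised as a single object array.
    rows = df.index.get_indexer(row_labels)
    cols = df.columns.get_indexer(col_labels)
    used = np.unique(cols)
    # Common dtype of only the columns actually requested.
    out_dtype = np.result_type(*(df.dtypes.iloc[i] for i in used))
    result = np.empty(len(rows), dtype=out_dtype)
    for i in used:
        mask = cols == i
        result[mask] = df.iloc[:, i].to_numpy()[rows[mask]]
    return result

# Mixed frame: int, float, and object columns.
df = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, 1.5, 2.5], "c": list("xyz")})
vals = lookup_no_object(df, [0, 2], ["a", "b"])
print(vals, vals.dtype)  # numeric result, no object dtype
```

If an object column is among those requested, the result still widens to object, but purely numeric requests stay numeric, which addresses the coercion concern above.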

@rhshadrach rhshadrach added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Nov 6, 2023
@challisd

challisd commented Nov 7, 2023

I agree that we can't provide a ton of short combinations of pandas methods, but such a common and basic use case certainly should be included, in my opinion. What about something like the following:

def lookup(df, row_labels, col_labels, dtype=None):
    if len(df.dtypes.unique()) > 1:
        warnings.warn("DataFrame contains mixed data types which may lead to unexpected type coercion and/or decreased performance")
    if np.dtype('O') in df.dtypes.values and dtype is not None and dtype != 'object':
        warnings.warn("DataFrames with columns of the 'object' data type may fail to be coerced to other data types")
    rows = df.index.get_indexer(row_labels)
    cols = df.columns.get_indexer(col_labels)
    return df.to_numpy(dtype=dtype)[rows, cols]

It allows greater control over the output data type and gives warnings when dealing with mixed data types. It gives an additional warning when attempting to coerce an object column to a non-object data type, as this can easily lead to an exception.

@MarcoGorelli
Member

such a common and basic use case certainly should be included

just checking, are there any other dataframe libraries which include this?

@challisd

challisd commented Nov 7, 2023

Not sure; I've mostly only used pandas in Python. I'm claiming it's a basic feature based on the evidence that both R and NumPy support this functionality as part of the built-in [] indexing.

@challisd

challisd commented Nov 8, 2023

Actually, it seems I remembered incorrectly and it is not a basic feature of R. Sorry for the mistake!
