
ENH: Dedicated method for creating conditional columns #39154

Closed · erfannariman opened this issue Jan 13, 2021 · 46 comments · Fixed by #56059

Labels: Enhancement · Needs Discussion (Requires discussion from core team before further action)

@erfannariman (Member) commented Jan 13, 2021

Creating conditional columns is an operation you perform a lot when wrangling data, yet pandas still has no dedicated, explicit way of creating them. This shows on StackOverflow: this question is my top hit when I google "pandas conditional column" (the ranking will obviously vary per person).

On that StackOverflow link we see the result of not being explicit: at least six different methods are mentioned, along with a lot of anti-patterns and methods we do not want users to use because they are inefficient (apply with if/else, or map with a lambda, for example).

This ticket is meant as a starting point for discussing whether we should have a more explicit way of creating conditional columns in pandas, and also because we need to look ahead to the switch to ExtensionArrays (EAs): the most used methods (and the top accepted answer on SO) are NumPy methods that return NumPy arrays.
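
To make the EA concern concrete, a small example (assuming the nullable Int64 dtype; np.where always returns a plain NumPy array):

import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3], dtype="Int64")  # nullable extension dtype
out = np.where(s < 2, 0, s)              # numpy.ndarray: the Int64 dtype is lost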


What do we have now?

For single conditional columns:

  • np.where
  • .loc with assignment
  • Series.where
  • Series.mask, see comment

For multi conditional columns:

  • np.select
  • complex list comprehensions
  • .apply with if else (not an option imo)

I don't have a suggestion yet, but I know I really like the clean code you can write with np.select, and also with SQL's CASE WHEN.
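
For concreteness, a sketch of the np.select idiom for the multi-condition case (the frame and conditions are invented for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
# np.select evaluates every condition and takes the first match;
# like np.where, it returns a NumPy array rather than a Series.
df["d"] = np.select(
    condlist=[df["a"] < 2, df["b"] < 6],
    choicelist=[0, 1],
    default=df["a"],
)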

Other languages:

SQL has CASE WHEN:

CASE
    WHEN condition1 THEN result1
    WHEN condition2 THEN result2
    WHEN conditionN THEN resultN
    ELSE result
END;

R (dplyr) has case_when:

df %>% mutate(g = case_when(a == 2 | a == 5 | a == 7 | (a == 1 & b == 4) ~ 2,
                            a == 0 | a == 1 | a == 4 | a == 3 |  c == 4 ~ 3,
                            TRUE ~ NA_real_))

Related tickets / questions:

@erfannariman added the Enhancement and Needs Triage labels on Jan 13, 2021
@erfannariman (Member, Author)

cc @jorisvandenbossche @WillAyd

@rhshadrach (Member)

One to add to the list is Series.mask, e.g.

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

mask = df['a'] < 2
df['b'] = df['b'].mask(mask, 0)

mask = (df['a'] + df['b'] > 2) & (df['a'] + df['b'] < 8)
df['b'] = df['b'].mask(mask, 0)

@mroeschke added the Needs Discussion label and removed the Needs Triage label on Aug 15, 2021
@samukweku (Contributor)

One option, using pd.Series.mask under the hood, is to use a comma as a separator, where the condition comes before the result. I feel it is easier to understand what is going on, and it makes the code a bit cleaner (my opinion). It also allows pandas data types to be retained, since under the hood the implementation uses Series.mask, not np.select. Using @rhshadrach's examples above:

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df.assign(c=case_when(df.a < 2, 0, df.b))
# multiple conditions:
df.assign(d=case_when(df.a < 2, 0,  # first condition, result
                      df.b < 2, 1,  # second condition, result
                      df.a))        # default

The idea is inspired by if_else in pydatatable and fcase in R's data.table; a chainable method is implemented in pyjanitor (currently in dev) as case_when.

@erfannariman (Member, Author) commented Oct 1, 2021

> One option, using pd.Series.mask under the hood, is to use a comma as a separator, where the condition comes before the result. I feel it is easier to understand what is going on, and it makes the code a bit cleaner (my opinion). It also allows pandas data types to be retained, since under the hood the implementation uses Series.mask, not np.select. Using @rhshadrach's examples above:
>
> df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
> df.assign(c=case_when(df.a < 2, 0, df.b))
> # multiple conditions:
> df.assign(d=case_when(df.a < 2, 0,  # first condition, result
>                       df.b < 2, 1,  # second condition, result
>                       df.a))        # default
>
> The idea is inspired by if_else in pydatatable and fcase in R's data.table; a chainable method is implemented in pyjanitor (currently in dev) as case_when.

This looks like a clean method for the API; that means this would be a general function, e.g. pd.case_when, which makes sense to me, since we have mask and where for DataFrame and Series.

Instead of a trailing positional value, though, I think a default argument for the default value would make it clearer.

@pwwang commented Mar 17, 2022

datar supports if_else and case_when (like dplyr's):

>>> import pandas as pd
>>> from datar.all import f, mutate, case_when, if_else
>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
>>> df >> mutate(c=if_else(f.a < 2, 0, f.b))
        a       b         c
  <int64> <int64> <float64>
0       1       4       0.0
1       2       5       5.0
2       3       6       6.0
>>> df >> mutate(d=case_when(f.a < 2, 0, f.b < 2, 1, True, f.a))
        a       b         d
  <int64> <int64> <float64>
0       1       4       0.0
1       2       5       2.0
2       3       6       3.0

@jreback (Contributor) commented Mar 19, 2022

would be averse to adding case_when as an api, but would require a champion and substantial testing.

@erfannariman (Member, Author)

> would be averse to adding case_when as an api, but would require a champion and substantial testing.

I agree on the latter; I made an attempt to implement this and quickly ran into quite some complexity. But given the original post and problem statement, wouldn't you agree that a "data wrangling" module needs a dedicated method for creating conditional columns? It is an action you perform quite often when transforming data.

@jreback (Contributor) commented Mar 19, 2022

That was a typo; I meant "wouldn't be averse". That said, this is a substantial effort and would have to be designed properly.

@samukweku (Contributor)

@erfannariman what complexity did you encounter?

@erfannariman (Member, Author)

> @erfannariman what complexity did you encounter?

The first implementation of creating a conditional column was relatively easy, but then came the design choices around NumPy dtypes vs pandas extension dtypes, plus passing all the tests and edge cases and the mypy type checks.

And np.select is still the fastest method for multi-conditional cases as far as I tested, so using it under the hood made sense to me.

But to be fair, the attempt was relatively short compared to what is needed to implement this.

@samukweku (Contributor) commented Mar 20, 2022

If you do not mind, @erfannariman, kindly share the edge cases, as well as the dtype issues, so I can test on my end.

@ELHoussineT

Has any decision been made here on which method to proceed with?

@ELHoussineT commented Oct 10, 2022

I am still a bit biased towards my suggestion in (issue: #46285, PR: #46286) because it's based on lambdas. Any major counterarguments?

@rhshadrach (Member)

> Any major counterarguments?

If I'm understanding the code correctly, those lambdas will be called row-by-row. In that case, they would be highly inefficient when compared to vectorized operations.

@ELHoussineT commented Oct 19, 2022

> If I'm understanding the code correctly, those lambdas will be called row-by-row. In that case, they would be highly inefficient when compared to vectorized operations.

@rhshadrach

Isn't it the same row-by-row assignment when we do:

df.assign(
    column_1 = lambda x: ...,
    column_2 = lambda x: ...,
    column_3 = lambda x: ...,
)

The columns above get assigned one by one, but within the assignment of each column we have vectorisation. I believe this is similar to what was proposed in (issue: #46285, PR: #46286).

If you agree, I can re-open the PR and we can drive this feature forward together.

Thanks.

@samukweku (Contributor)

@erfannariman just checking if you are still working on this. If you are not working on it, I can take over and raise a PR

@rhshadrach (Member)

@ELHoussineT I see, I missed that you're allowing the values to be Series as well as callable. It seems to me this API would not be able to handle the example in #39154 (comment), is that correct?

@samukweku (Contributor)

@rhshadrach @ELHoussineT, one option would be to create a standalone if_else function that can be used within assign or on its own. That way assign stays the same, with no changes, and can be adapted to whatever assignment operation is required. Have a look at the example in #39154 (comment) for a possible implementation.

@ELHoussineT

@rhshadrach

> It seems to me this API would not be able to handle the example in #39154 (comment), is that correct?

I believe the only missing feature is the "default" value. I can tweak this API to support that and add some tests.

If you agree, I will open a PR.

@rhshadrach (Member)

@ELHoussineT - can you show what #39154 (comment) would look like with your suggested signature? My understanding is that the keys of the dictionary are to become the values in the result. I don't see how you can use a Series in a straightforward manner as the values in the result.


@ELHoussineT commented Oct 24, 2022

@rhshadrach

Alright, let's start by explicitly describing the requirements for the functionality we want to build. Please feel free to suggest otherwise.

Requirements

The functionality should satisfy the following:

  1. Can assign multiple values to a column based on multiple conditions
  2. Vectorization is maintained
  3. The conditions can be lambdas/methods or Series. If lambdas/methods, the DataFrame shall be passed to the lambda/method
  4. Ability to provide a "default" value (the value to use if all conditions fail). If the default value is not defined, use the current value, otherwise None
  5. All conditions are evaluated, and values are assigned based on the first match. This is an important point to agree on, because in the example below, items that satisfy both condition 1 and condition 3 will get value 1.

{
    "value 1": lambda x: <condition 1>,
    "value 2": lambda x: <condition 2>,
    "value 3": lambda x: <condition 3>,
}

Options

Now that we have agreed, at a high level, on the requirements, here are (IMHO) our options:

Option 1:

What has been suggested by @samukweku in their comment.

Example:

df.assign(d = pd.case_when(lambda x: <some conditions>, 0,   # first condition, result   
                           lambda x: <some conditions>, 1,   # second condition, result 
                           df.a))                            # default (optional)

Option 2:

We follow what Spark did with their when(...).otherwise(...) (docs) functionality.

Example:

df.assign(d = pd.when(lambda x: <some conditions>, 0)  # first condition, result
                .when(lambda x: <some conditions>, 1)  # second condition, result
                .otherwise(df.a))                      # default (optional)

Please note that if we choose this option, we could, upon implementation, rename .otherwise(df.a) to .default(df.a) for convenience.
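
Purely to illustrate the surface API of option 2 (the When class below is hypothetical, not an existing pandas API, and whether such a builder is workable in eager pandas is debated just below), one could imagine the builder accumulating (condition, value) pairs and deferring evaluation by returning a callable that assign resolves against the DataFrame:

import pandas as pd

class When:
    """Hypothetical builder: accumulates (condition, value) pairs."""

    def __init__(self, condition, value):
        self._cases = [(condition, value)]

    def when(self, condition, value):
        self._cases.append((condition, value))
        return self

    def otherwise(self, default):
        # Return a callable so evaluation is deferred until .assign
        # passes in the DataFrame (mimicking lazy evaluation).
        def resolve(df):
            if isinstance(default, pd.Series):
                result = default.copy()
            else:
                result = pd.Series(default, index=df.index)
            # Apply the cases in reverse so the first match wins.
            for condition, value in reversed(self._cases):
                mask = condition(df) if callable(condition) else condition
                result = result.mask(mask, value)
            return result

        return resolve

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df.assign(d=When(lambda x: x.a < 2, 0).when(lambda x: x.b < 6, 1).otherwise(df.a))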

My opinion

I would go with option 2, as IMHO it is more intuitive than option 1. I would also be happy to hear what you think.

Let me know your thoughts so we can choose and implement a solution.

@rhshadrach (Member) commented Oct 25, 2022

How do these differ from Series.mask?

(
    df.a
    .mask(lambda x: <some conditions>, 0) # first condition, result   
    .mask(lambda x: <some conditions>, 1) # second condition, result
)

Also, I don't see how we can implement otherwise. pyspark can do this because of lazy evaluation, but pandas is eager.

Edit: Ah, option 1 allows the default. Am I to assume that condition 2 will only be utilized where condition 1 was false? If this is the case, is it still evaluated where condition 1 is true? This would have implications if you're doing some op that may raise on rows where condition 1 is true.

@samukweku (Contributor) commented Oct 25, 2022

I can speak to option 1. It uses the same idea as np.select, where all conditions are evaluated and values are assigned based on the first match, using Series.mask: if condition 1 is true and condition 2 is true, condition 1 takes precedence. If all fail, the default kicks in. Basically a for loop, just like in np.select, but using Series.mask. Hopefully I explained it well enough. An implementation is here, and an example from StackOverflow.
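
A rough sketch of that loop, illustrative only (the signature loosely follows the comma-separated proposal above, not any actual pandas API):

import pandas as pd

def case_when(*args):
    # args are alternating condition/value pairs, with a default
    # Series as the final argument; earlier conditions win, as in
    # np.select, but everything stays a pandas Series.
    *pairs, default = args
    conditions, values = pairs[::2], pairs[1::2]
    result = default.copy()
    # Walk the pairs in reverse so the first matching condition
    # ends up taking precedence.
    for condition, value in zip(conditions[::-1], values[::-1]):
        result = result.mask(condition, value)
    return result

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df.assign(d=case_when(df.a < 2, 0, df.b < 2, 1, df.a))
# d -> [0, 2, 3]: row 0 hits the first condition, the rest
# fall through to the default column a.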

@ELHoussineT commented Oct 25, 2022

@rhshadrach

> How do these differ from Series.mask?
>
> (
>     df.a
>     .mask(lambda x: <some conditions>, 0) # first condition, result
>     .mask(lambda x: <some conditions>, 1) # second condition, result
> )

I can see at least one major difference. Please correct me if I have misunderstood, but with the method you mentioned you cannot reference other columns.

For example, you cannot do the following:

df.assign(d = pd.when(lambda x: (x.a < 3) & (x.r.isna()), 0)             # first condition, result
                .when(lambda x: (x.c % 3 == 0) & (x.t == "some str"), 1) # second condition, result
                .otherwise(df.s))                                        # default (optional)

> If this is the case, is it still evaluated where condition 1 is true? This would have implications if you're doing some op that may raise on rows where condition 1 is true.

Good point, you are right. But as @samukweku noted, this is not handled in np.select, and I don't believe it is handled in PySpark either. Maybe it is too much of a corner case, and should be handled by the method/lambda written by the user. What do you think?

@ELHoussineT

@samukweku

> where all conditions are evaluated, and values assigned based on first match

Good point, I missed that np.select works that way. I think pandas is closer to NumPy than it is to PySpark, hence I adjusted the requirements to "values assigned based on first match" instead of "values assigned based on last match".

@rhshadrach (Member) commented Oct 28, 2022

Given that I don't think Option 2 is viable (happy to be wrong here if there is a working implementation), Option 1 as linked to here looks good to me. However, I think we should drop the column_name argument and have the method return a Series.

Another question about the API is what to do with default. np.select uses 0, which makes sense with NumPy being much more heavily focused on numeric dtypes, but 0 doesn't seem like an appropriate default for pandas. A few other options:

  • Use lib.no_default and insert the appropriate null value for the dtype
  • Use lib.no_default to detect and raise when the user doesn't pass a value
  • Use lib.no_default to detect when the user doesn't pass a value, and only raise when default needs to be used.

These all seem like viable options to me; it's not clear which is to be preferred. Any thoughts?
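
For illustration, the sentinel pattern behind the first option might look like this (na_value_for_dtype is a pandas-internal helper, used here only as an assumption of how the null value could be derived):

import pandas as pd
from pandas._libs import lib
from pandas.core.dtypes.missing import na_value_for_dtype  # internal helper

def demo(series, condition, value, default=lib.no_default):
    # lib.no_default distinguishes "argument omitted" from an
    # explicit None, so users can still pass default=None.
    if default is lib.no_default:
        # First option above: fall back to the dtype-appropriate
        # null value (NaN here, NaT for datetimes, etc.).
        default = na_value_for_dtype(series.dtype, compat=False)
    return series.where(condition, default).mask(condition, value)

s = pd.Series([1, 2, 3])
demo(s, s > 1, 10)             # -> [NaN, 10.0, 10.0]
demo(s, s > 1, 10, default=0)  # -> [0, 10, 10]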

@samukweku (Contributor)

Yes, returning a pd.Series makes sense; the user can do whatever they want with it.

I'd just set the default to None; however, the first option, using lib.no_default and inserting the appropriate null value, seems more plausible to me. I don't know the full details of lib.no_default though.

@ELHoussineT

Nice.

@samukweku Will you pick this one up, or shall I give it a shot? Thanks.

@samukweku (Contributor)

@ELHoussineT I'll give it a shot, thanks. @erfannariman was already working on it, and I do not know the steps involved in taking over. @rhshadrach @erfannariman, OK if I make a PR based on our conversations?

@rhshadrach (Member)

@samukweku - most certainly!

@samukweku (Contributor)

take

@ELHoussineT

@samukweku Thanks! Let me know if you need any help.

@samukweku (Contributor)

@ELHoussineT sure, I'll start on it.

@samukweku (Contributor)

@ELHoussineT please kindly take over this PR. Thanks.

@erfannariman (Member, Author)

I've been busy with work, but I would still like to give this one more serious try, if that's okay with the others.

@ELHoussineT

@erfannariman Sure thing; if you decide to stop working on it, please ping me to start.

@samukweku (Contributor) commented Dec 4, 2022

@erfannariman any progress so far? I've got some time for this now ☺️

@ELHoussineT commented Dec 19, 2022

This is my first contribution to pandas core; I would kindly ask for your guidance and patience to make it happen, as successfully contributing to pandas core would make me quite happy.

Can you please confirm your agreement with the initial approach in #50343?

Usage example:

df.assign(d = pd.case_when(lambda x: <some conditions>, 0,            # first condition, result   
                           (other_df.z > 4) | (other_df.x.isna()), 1, # second condition, result 
                           "some_value"))                             # default (optional)

Approach advantages:

  1. .assign(...) logic is not touched. (Changing the core .assign(...) was discouraged by @jreback in my previous attempt here)
  2. Conditions can be passed as a series or a callable (see "Usage example" in this comment)
  3. Maintains the organic behaviour of np.select where values are assigned based on the first match
  4. I tried to introduce minimal logic and rather rely on np.select
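
For illustration only, a minimal sketch of that shape (the signature and keyword default below are assumptions, not the PR's actual code): resolve callables against the DataFrame, then delegate to np.select:

import numpy as np
import pandas as pd

def case_when(df, *args, default=None):
    # Alternating condition/value arguments; conditions may be boolean
    # Series or callables that receive the DataFrame.
    conditions = [cond(df) if callable(cond) else cond for cond in args[::2]]
    values = list(args[1::2])
    # np.select evaluates every condition and takes the first match;
    # it returns a NumPy array, hence the dtype caveats above.
    return pd.Series(np.select(conditions, values, default=default), index=df.index)

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df.assign(d=case_when(df, lambda x: x.a < 2, 0, df.b < 6, 1, default="fallback"))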

Open questions:

  1. How to expose the function so that it is accessible by simply calling pd.case_when(...)? I tried adding it here, but it's not effective yet.
  2. For the first iteration of this functionality, I would suggest keeping the default simply None for now. Any disagreements?

Next steps:

  • If you agree to the approach in the mentioned PR, I will proceed with what is left to do in the PR which is mainly (1) docs and (2) tests

Credit:

@ELHoussineT

ping @erfannariman or @rhshadrach

@rhshadrach (Member)

> 1. How to expose the function so that it is accessible by simply calling pd.case_when(...)? I tried adding it here, but it's not effective yet.

You need to also modify pandas.__init__, in addition to the modification linked above.

> 1. For the first iteration of this functionality, I would suggest keeping the default simply None for now. Any disagreements?

I think you're saying the default value of the default argument in the function's signature should be None? In that case, you would not be able to tell whether the user passed None (so explicitly wants None) or left the argument unspecified. For this reason, I recommend using lib.no_default.

@ELHoussineT

@rhshadrach Thanks for your input

I've added case_when to the API and implemented lib.no_default.

Docs and tests are remaining, will add those and move the PR from draft to review.

@ELHoussineT commented Jan 6, 2023

I took a shot at the docs and tests in #50343; can you please take a final look, @rhshadrach?

@erfannariman (Member, Author)

Thanks for picking this up! Do you mind keeping the discussion in your PR, so there's no need to comment here on every update? The core devs will respond when they have time; this way we keep this issue clean. @ELHoussineT

@samukweku (Contributor) commented Sep 16, 2023

Is it ok if I pick this up and work on it?

@samukweku (Contributor)

take
