Boolean "~" operator ignored after "|" #60

Open
EricPrideaux opened this issue Jul 26, 2018 · 8 comments

Comments

@EricPrideaux

Hi kieferk,

I am an R user learning how to use dfply. I may have spotted an issue: it appears that Boolean ~ isn't evaluated after Boolean | if applied in the syntax below.

My code:

# Import
import pandas as pd
import numpy as np
from dfply import *

# Create data frame and mask it
df  = pd.DataFrame({'a':[np.nan,2,3,4,5],'b':[6,7,8,9,np.nan],'c':[5,4,3,2,1]})
df2 = (df >>
        mask((X.a.isnull()) | ~(X.b.isnull())))
print(df)
print(df2)

Here is the original data frame, df:

       a    b    c
    0  NaN  6.0  5
    1  2.0  7.0  4
    2  3.0  8.0  3
    3  4.0  9.0  2
    4  5.0  NaN  1

And here is the result of the piped mask, df2:

         a    b    c
      0  NaN  6.0  5
      4  5.0  NaN  1

However, I expect this instead:

         a    b    c
      0  NaN  6.0  5
      1  2.0  7.0  4
      2  3.0  8.0  3
      3  4.0  9.0  2

I don't understand why the | and ~ operators return rows in which column "a" is NaN or column "b" is NaN, instead of rows in which column "a" is NaN or column "b" is not NaN.
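
For comparison, here is a minimal sketch of the same filter written eagerly in plain pandas, without dfply's `X` placeholder. It selects rows 0 through 3, matching the expected result above. The names `keep` and `expected` are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, 2, 3, 4, 5],
                   'b': [6, 7, 8, 9, np.nan],
                   'c': [5, 4, 3, 2, 1]})

# Keep rows where "a" is NaN OR "b" is not NaN.
# Both operands are concrete boolean Series, so "~" is applied immediately.
keep = df['a'].isnull() | ~df['b'].isnull()
expected = df[keep]
print(expected)
```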

By the way, I also tried np.logical_or():

df  = pd.DataFrame({'a':[np.nan,2,3,4,5],'b':[6,7,8,9,np.nan],'c':[5,4,3,2,1]})
df2 = (df >>
        mask(np.logical_or(X.a.isnull(),~X.b.isnull())))
print(df)
print(df2)

But this resulted in error:

mask(np.logical_or(X.a.isnull(),~X.b.isnull())))
ValueError: invalid __array_struct__
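
For comparison, a sketch of `np.logical_or` applied to concrete pandas Series rather than to dfply's `X` expressions. It runs without error, which suggests (as an assumption, not something confirmed in this thread) that the `ValueError` arises because NumPy receives a deferred symbolic object instead of an array. The name `keep_np` is illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, 2, 3, 4, 5],
                   'b': [6, 7, 8, 9, np.nan],
                   'c': [5, 4, 3, 2, 1]})

# np.logical_or works element-wise on two boolean Series of equal length.
keep_np = np.logical_or(df['a'].isnull(), ~df['b'].isnull())
print(df[keep_np])
```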
EricPrideaux changed the title from Boolean ~ operators ignored after | to Boolean "~" operator ignored after "|" on Jul 26, 2018
@kieferk
Owner

kieferk commented Aug 28, 2018

This is a tricky one. I'll have to dive into it a little bit to see what's going on. The ~ usage on the symbolic is one of the more complicated parts of the code, and it's been a while since I wrote that.

@kieferk
Owner

kieferk commented Aug 28, 2018

Ok so this is definitely a bug, but I'm gonna need to think about how I'll fix it. Essentially the problem is that the inversion is not propagating through properly in the chain of operations, and unfortunately it's not a trivial fix as far as I can tell right now. I'll let you know when I come up with a solution.
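
To illustrate how an inversion can fail to propagate through a chain of deferred operations, here is a toy sketch. The `Lazy` class, its `inverted` flag, and `or_fixed` are hypothetical and are not dfply's actual code; the sketch only assumes a design in which `~` records a flag and the combining operator ignores it. Plain lists of booleans stand in for the `isnull()` results from the data frame in this issue.

```python
class Lazy:
    """Toy deferred expression. Hypothetical; NOT dfply's actual code."""

    def __init__(self, fn, inverted=False):
        self.fn = fn              # callable: context -> list of bools
        self.inverted = inverted  # a pending "~", recorded as a flag only

    def __invert__(self):
        # "~" leaves fn untouched and just flips the flag.
        return Lazy(self.fn, inverted=not self.inverted)

    def __or__(self, other):
        # Buggy combination: calls each operand's raw fn, so a pending
        # inversion flag on either side is silently dropped.
        return Lazy(lambda ctx: [x or y
                                 for x, y in zip(self.fn(ctx), other.fn(ctx))])

    def or_fixed(self, other):
        # Propagating combination: fully evaluates each operand first,
        # so pending inversions are applied before the OR.
        return Lazy(lambda ctx: [x or y
                                 for x, y in zip(self.evaluate(ctx),
                                                 other.evaluate(ctx))])

    def evaluate(self, ctx):
        result = self.fn(ctx)
        return [not v for v in result] if self.inverted else result


# Stand-ins for X.a.isnull() and X.b.isnull() on the data frame above.
ctx = {'a_null': [True, False, False, False, False],
       'b_null': [False, False, False, False, True]}
a_null = Lazy(lambda c: c['a_null'])
b_null = Lazy(lambda c: c['b_null'])

buggy = (a_null | ~b_null).evaluate(ctx)
fixed = a_null.or_fixed(~b_null).evaluate(ctx)
print(buggy)  # inversion lost: same as a_null OR b_null
print(fixed)  # inversion applied: a_null OR (NOT b_null)
```

In this toy, the buggy result selects rows 0 and 4, the same rows reported in the issue, while the propagating version selects rows 0 through 3.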

@EricPrideaux
Author

Hi Kieferk,
Many thanks for your update. I look forward to your solution and will keep an eye out!

@andrewkho
Contributor

Just wanted to chime in that I have also come across this bug, same scenario when using mask except my case was e.g. mask(X.bool_col1 & (~X.bool_col2))

@andrewkho
Contributor

Also wanted to add that in the case of &, you can use mask(condA, ~condB), and alternatively, the - sign for inversion also works, e.g. mask(condA & -condB)
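
The reasoning behind passing the conditions as separate arguments can be sketched in plain pandas. Assuming `mask` combines its arguments with a logical AND (the comment above implies this, but the sketch does not check it against dfply), applying the two conditions one after the other selects the same rows as a single `&` expression. The data frame is made up for illustration; the column names follow the earlier comment.

```python
import pandas as pd

df = pd.DataFrame({'bool_col1': [True, True, False, False],
                   'bool_col2': [True, False, True, False]})

# Single combined condition: col1 AND NOT col2.
combined = df[df['bool_col1'] & ~df['bool_col2']]

# Same filter applied as two successive steps.
step1 = df[df['bool_col1']]
sequential = step1[~step1['bool_col2']]

print(combined)
print(sequential)
```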

@kieferk
Owner

kieferk commented Jan 18, 2019

Sorry I've been inactive for a while; work has been very busy. I am going to dive back in and try to tackle this over the weekend.

I am hoping I can resolve this "elegantly" but from what I can see it may require some substantial code re-writing. I'll keep you posted.

@jstrong-tios

interestingly, passing the invert operator to make_symbolic results in correct behavior (fwiw):

from operator import inv # inv(x) == ~x

df['a'].isnull() | (~df['b'].isnull())
#        m
# 0   True
# 1   True
# 2   True
# 3   True
# 4  False

df >> transmute(m = X.a.isnull() | inv(X.b.isnull()))
#        m
# 0   True
# 1  False
# 2  False
# 3  False
# 4   True

df >> transmute(m = X.a.isnull() | make_symbolic(inv)(X.b.isnull()))
#        m
# 0   True
# 1   True
# 2   True
# 3   True
# 4  False

@antonio-yu

Hi kieferk,

My friends and I were very excited and thankful to come across this dplyr-style package.
We use filter_by a lot for filtering Chinese text by Boolean values.
We look forward to your solution for this Boolean bug.
