Can not resolve column names that are also functions in the environment #65

Closed
holgerbrandl opened this issue Aug 28, 2018 · 12 comments

Comments

@holgerbrandl

Consider the following example:

diamonds >> mutate(rank=min_rank(X.carat)) >> filter_by(X.rank <10)

This fails with

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/brandl/anaconda3/envs/scikit_playground/lib/python3.6/site-packages/dfply/base.py", line 142, in __rrshift__
    result = self.function(other_copy)
  File "/Users/brandl/anaconda3/envs/scikit_playground/lib/python3.6/site-packages/dfply/base.py", line 149, in <lambda>
    return pipe(lambda x: self.function(x, *args, **kwargs))
  File "/Users/brandl/anaconda3/envs/scikit_playground/lib/python3.6/site-packages/dfply/base.py", line 329, in __call__
    return self.function(*args, **kwargs)
  File "/Users/brandl/anaconda3/envs/scikit_playground/lib/python3.6/site-packages/dfply/base.py", line 282, in __call__
    return self.function(df, *args, **kwargs)
  File "/Users/brandl/anaconda3/envs/scikit_playground/lib/python3.6/site-packages/dfply/subset.py", line 62, in mask
    if arg.dtype != bool:
AttributeError: 'NotImplementedType' object has no attribute 'dtype'

but the code looks legitimate to me.

@sharpe5

sharpe5 commented Aug 28, 2018 via email

@holgerbrandl
Author

Isn't that what I did? The only thing I skipped is the from dfply import * preamble, which I took for granted here.

@sharpe5

sharpe5 commented Aug 29, 2018 via email

@holgerbrandl
Author

I think it's rather a member function of pandas.DataFrame. But when symbols are being resolved internally by dfply, I'd expect variables to have precedence.

I'll try to submit the next ticket in a more reproducible way.
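To illustrate the name clash in plain pandas (a minimal sketch with made-up data, not the diamonds dataset): attribute access on a DataFrame returns the built-in method when a column shares its name, while bracket access returns the column.

```python
import pandas as pd

# Small illustrative frame; the values are invented for this example.
df = pd.DataFrame({"carat": [0.23, 0.21, 0.29], "rank": [2, 1, 3]})

by_attribute = df.rank    # the bound DataFrame.rank method, not the column
by_item = df["rank"]      # the "rank" column as a Series
no_clash = df.carat       # "carat" is not a DataFrame method, so this is the column

print(callable(by_attribute))  # True
print(by_item.tolist())        # [2, 1, 3]
```

Comparing the bound method with a number is what goes wrong in the filter step; the column is never reached.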

@sharpe5

sharpe5 commented Aug 29, 2018 via email

@sharpe5

sharpe5 commented Sep 1, 2018

Could you please close this issue? Thanks!

@holgerbrandl
Author

But the problem is not solved at all?! It also affects dozens of other names which happen to be used by pandas. rank was just an example. See https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html for a complete listing.

Of course, @kieferk, if you think it's not worth fixing or too hard, feel free to close it.

@sharpe5

sharpe5 commented Sep 3, 2018 via email

@holgerbrandl
Author

This, or giving column names priority over pandas functions when resolving X.foo. The latter seems more correct to me, but I haven't used dfply much yet.
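A rough sketch of what "column names first" could look like, written against plain pandas. The `resolve` helper is hypothetical and is not part of dfply; it only shows the lookup order being proposed.

```python
import pandas as pd


def resolve(df, name):
    """Hypothetical lookup: prefer a column called `name`, else fall back to the attribute."""
    if name in df.columns:
        return df[name]
    return getattr(df, name)


# Invented data for illustration.
df = pd.DataFrame({"carat": [0.23, 0.21, 0.29], "rank": [2, 1, 3]})

as_column = resolve(df, "rank")  # a column exists, so the Series is returned
as_method = resolve(df, "head")  # no such column, so the DataFrame method is returned
```

The trade-off is that a column named like a method would then hide that method from attribute-style access.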

@kieferk
Owner

kieferk commented Sep 4, 2018

I'm open to fixing this if possible, but it's tricky. The X symbol is just a generic instance of the Intention class, and as such is at some point evaluated against a "context" object. If the context passed is a pandas DataFrame, which is typically the case, it will apply the function to that DataFrame. The function in this case would be the __getattr__ call for foo (or rank, or whatever it may be).

The ugly way to deal with this would be to do a check on the context object before it's sent to the function and have special logic in place to "override" the pandas behavior. To be honest I'm not really keen on doing that. Pandas would expect you to access your variable by string name in the case that it duplicates a built-in function, and so I'd advise you to do the same. For example:

from dfply import *

diamonds >> head()
   carat      cut color clarity  depth  table  price     x     y     z
0   0.23    Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43
1   0.21  Premium     E     SI1   59.8   61.0    326  3.89  3.84  2.31
2   0.23     Good     E     VS1   56.9   65.0    327  4.05  4.07  2.31
3   0.29  Premium     I     VS2   62.4   58.0    334  4.20  4.23  2.63
4   0.31     Good     J     SI2   63.3   58.0    335  4.34  4.35  2.75

diamonds >> select(X['cut']) >> head()
       cut
0    Ideal
1  Premium
2     Good
3  Premium
4     Good

In your case of course you would have 'rank' instead of 'cut'.
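The mechanism described above can be sketched without dfply or pandas. Everything below is an assumption made for illustration: `Deferred` is a simplified stand-in for the idea of a symbol evaluated later against a context, and `ToyFrame` mimics how a DataFrame falls back to columns only when normal attribute lookup fails. Neither is dfply's actual code.

```python
class ToyFrame:
    """Stand-in for a DataFrame (hypothetical): columns in a dict, methods as normal attributes."""

    def __init__(self, columns):
        self.columns = dict(columns)

    def rank(self):
        # Plays the role of a built-in method such as pandas.DataFrame.rank.
        return "result of the rank() method"

    def __getattr__(self, name):
        # Python calls __getattr__ only after normal lookup fails, so a real
        # method named `rank` always wins over a column named "rank".
        columns = self.__dict__.get("columns", {})
        if name in columns:
            return columns[name]
        raise AttributeError(name)

    def __getitem__(self, name):
        return self.columns[name]


class Deferred:
    """Hypothetical deferred expression, evaluated later against a context object."""

    def __init__(self, function=lambda context: context):
        self._function = function

    def evaluate(self, context):
        return self._function(context)

    def __getattr__(self, name):
        if name.startswith("_"):
            raise AttributeError(name)
        # Record "get attribute `name`" to be applied once a context is known.
        return Deferred(lambda context: getattr(self.evaluate(context), name))

    def __getitem__(self, key):
        # Record "get item `key`" to be applied once a context is known.
        return Deferred(lambda context: self.evaluate(context)[key])


X = Deferred()
frame = ToyFrame({"carat": [0.23, 0.21, 0.29], "rank": [2, 1, 3]})

by_attribute = X.rank.evaluate(frame)  # the bound method, because the method shadows the column
by_item = X["rank"].evaluate(frame)    # the column values
no_clash = X.carat.evaluate(frame)     # the column values, since nothing shadows "carat"
```

In this toy model, `X.rank` ends up as the method while `X['rank']` reaches the column, which is why the string-based form works as a workaround.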

@sharpe5

sharpe5 commented Sep 5, 2018

Perhaps just give a meaningful error in this case?
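One possible shape for such a check, sketched in plain pandas. The helper names and the message wording are hypothetical, and nothing here is existing dfply behavior.

```python
import pandas as pd


def shadowed_columns(df):
    """Return the column names that are hidden by a DataFrame attribute of the same name."""
    return [c for c in df.columns if isinstance(c, str) and hasattr(pd.DataFrame, c)]


def clash_message(df):
    """Build a hint for the user, or return None when no column is shadowed."""
    clashes = shadowed_columns(df)
    if not clashes:
        return None
    names = ", ".join(clashes)
    return f"Columns shadowed by DataFrame attributes: {names}. Use X['name'] instead of X.name."


# Invented data for illustration.
df = pd.DataFrame({"carat": [0.23, 0.21], "rank": [2, 1], "count": [5, 7]})

print(shadowed_columns(df))  # ['rank', 'count']
print(clash_message(df))
```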

@holgerbrandl
Author

@kieferk thanks for the details. I did not know about the X['rank'] way of accessing columns, which is a reasonable and readable way of doing it. I initially thought that it would not be possible to use names such as rank for columns at all.

Thanks both of you for your help.

3 participants