API: Implement pipe protocol, a method for extensible method chaining #10129
I really like how simple the implementation is. No PEPs, no hacking on CPython, others should be able to implement the same protocol. One (maybe) drawback is that we must rely on people to be responsible and not modify the |
I like the idea of our community establishing and using protocols more heavily. Will ponder this one for a while. |
@jreback I've used that in the past, and this would be very similar. |
We might also extend

```python
# in pandas
def pipe(self, func, *args, **kwargs):
    if hasattr(func, 'pipe_arg'):
        kwargs[func.pipe_arg] = self
        return func(*args, **kwargs)
    return func(self, *args, **kwargs)

# in seaborn (personally, I would probably implement this with a decorator)
sns.violinplot.pipe_arg = 'data'

# in user code
df.pipe(sns.violinplot, x='species', y='sepal_ratio')
```

Alternatively, we could just ask @mwaskom to break Seaborn's API again ;). |
I agree that we can probably make something more pandas-y than |
Is it possible to get infix syntax using a decorator? https://github.com/JulienPalard/Pipe |
That's done by overriding |
@datnamer Yes, but not satisfactorily without macros. The problem is that to get So turning functions into pipe objects means that their non-piped usability is compromised. There are ways around this, e.g., using tuples like dask: |
This isn't a viable candidate for a Pandas solution, but I thought I'd point out |
Following is a list of pipes which R's magrittr supports: http://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html

If any of them is useful, preparing a separate kw (or method) might be an option. |
```python
df.tee(sns.violinplot, x='x', y='y')
  .pipe(sm.OLS.from_formula('y ~ x'))
```

I do consider this experimental though. My initial thought was to start slow and see how things go, so I didn't include that in the initial proposal. |
👍 I like this in general as an alternative to the current
Instead of pushing the decorator out to other libraries, I think it would be better to encourage end users to decorate on the fly, e.g.

```python
pd_violin = pd.pipeify(sns.violin)
artists = df.pipe(pd_violin, x='foo', y='bar')
```

or to give `pipe` an explicit target keyword:

```python
def pipe(self, func, _pipe_target=None, *args, **kwargs):
    if _pipe_target is not None:
        kwargs[_pipe_target] = self
        return func(*args, **kwargs)
    else:
        return func(self, *args, **kwargs)
```
|
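As a rough usage sketch of that `_pipe_target` variant (all names here are hypothetical stand-ins: `pipe` is a free function standing in for the proposed DataFrame method, and `violinplot`/`head` are not real seaborn/pandas functions):

```python
def pipe(obj, func, _pipe_target=None, *args, **kwargs):
    if _pipe_target is not None:
        # route the piped object into the named keyword argument
        kwargs[_pipe_target] = obj
        return func(*args, **kwargs)
    # default: the piped object is the first positional argument
    return func(obj, *args, **kwargs)

def violinplot(x=None, y=None, data=None):
    # stand-in for a plotting function that takes its frame via `data=`
    return ('violin', x, y, data)

def head(df, n):
    # stand-in for a frame-first function
    return (df, n)

kw_result = pipe('the-frame', violinplot, 'data', x='foo', y='bar')
pos_result = pipe('the-frame', head, None, 2)
```

The keyword form routes the frame into `data=`, while the default routes it as the first positional argument.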
Another drawback is that libraries wishing to support pandas <0.17 will need to implement their own

```python
from functools import wraps

def pipeable(pipe_arg):
    def decorate(func):
        setattr(func, 'pipe_arg', pipe_arg)
        @wraps(func)
        def wrapper(*args, **kwargs):
            return func(*args, **kwargs)
        return wrapper
    return decorate
```

It's not difficult, but it's essential that everyone agrees on the name for |
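A self-contained sketch of how the pieces would fit end to end (the free-function `pipe` and `violinplot` are hypothetical stand-ins; only the `pipe_arg` attribute name comes from the proposal above):

```python
from functools import wraps

def pipeable(pipe_arg):
    # library-side decorator: tag the function with its data keyword
    def decorate(func):
        setattr(func, 'pipe_arg', pipe_arg)
        @wraps(func)
        def wrapper(*args, **kwargs):
            return func(*args, **kwargs)
        return wrapper
    return decorate

def pipe(obj, func, *args, **kwargs):
    # stand-in for DataFrame.pipe consulting the agreed attribute name
    if hasattr(func, 'pipe_arg'):
        kwargs[func.pipe_arg] = obj
        return func(*args, **kwargs)
    return func(obj, *args, **kwargs)

@pipeable('data')
def violinplot(x=None, y=None, data=None):
    # stand-in for a seaborn-style function
    return (x, y, data)

result = pipe('DF', violinplot, x='day', y='tip')
```

Note that `functools.wraps` copies the decorated function's `__dict__`, so `pipe_arg` survives the wrapping.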
@josef-pkt would statsmodels be interested in adding the decorators to support this? I see it working quite nicely with the formula API. I'll send over a PR if you want. I believe you're hoping to have a release in the not too distant future? It'd be nice to lay the groundwork in statsmodels 0.7. Even if that comes out before the next version of pandas, adding the decorators should be entirely transparent to pre-0.17 pandas and just work for pandas >= 0.17. |
Hm, tricky. Though, |
What's the difference between

Would this be worth the increased complexity and fragility and possibly reduced flexibility? It might be more useful in standalone functions like some plots, or hypothesis tests or data transformations.
The main problem I see is that statsmodels is largely model centric and not data centric like pandas and seaborn. The other design issue is that statsmodels has little functional code; I just spent most of two weeks chaining models and results through multiple inheritance and similar. (We need to modify or chain 3 to 7 methods of a Maximum Likelihood Model.) Personally, I find

Nevertheless, the best user interfaces, pandas wrapper and formulas, were included without my involvement. And we do have quite a bit of non-model functions. |
@josef-pkt No functional difference between those first two. And readability-wise they're pretty much the same. The (potential) benefit comes in chains, but that case is admittedly weaker for statsmodels since it's typically the last step in the chain, and you'll probably want to have a reference to the DataFrame that went into the model. |
@TomAugspurger We are holding on to the dataframe right now because we use it for some optional methods. I also see more benefits in other functions like in my examples where statsmodels is not necessarily the last step in the chain. In some cases the main advantage in interactive work of having a method available is that I don't need an explicit import. That's one advantage of a plot method, since most of the time I don't import matplotlib automatically. Just an idea: instead of a pipe we could add (monkey patch) a method on pandas.DataFrame, like |
@TomAugspurger Hmm -- I'm not seeing the issue you've described with seaborn. Are you sure you're running the latest dev version? Here's my example notebook, in which each of

I suppose we can include a utility function for decorating functions as pipeable, but really it's even easier than you think. For example, you can just add the attribute directly:

```python
def pipeable(pipe_arg):
    def decorate(func):
        func.pipe_arg = pipe_arg
        return func
    return decorate
```

@josef-pkt To clarify, your first example should actually be:

Coming back to the API design discussion, one of my favorite things about the original, simple version of pipe is that it is entirely explicit and there is absolutely no magic:

```python
def pipe(self, func, *args, **kwargs):
    return func(self, *args, **kwargs)
```

So no, I don't think it's a good idea to encourage monkey patching DataFrame from external libraries. It's not explicit and very much goes against the norms of idiomatic Python. Similarly, as much as I want

So something like @tacaswell's suggestion might be the better way to go. A few other options:

Of these, I think my favorite is @tacaswell's original suggestion

@mwaskom can you remind us why you don't want to keep |
I just realized that methods or functions that take a dataframe as a first argument will work without any changes (even if the data argument is called exog) with the original proposed
or with a lambda function (edited to fix typos)
and
|
@shoyer I must have had a bug in mine. It works both with named and positional arguments. |
Right, one of the motivations for the change was to allow things to work better with |
@mwaskom OK, fair enough. I agree that for Seaborn, for which DataFrames are optional, it makes sense for I suppose another option would be to support an alternative API in Seaborn itself, e.g., |
This would be reasonably easy to implement on my end, especially for the categorical plots, so that would be fine. Though, for the user, that approach probably ends up involving almost as much extra typing as just making an intermediate dataframe :) |
So, attempting a summary.
FWIW, I'm coming around to 1 (maybe 2). Taking things slowly and seeing how the community responds could be good. The nice thing about 1 is it gives us the most flexibility in the future if we see something that needs to change, without breaking anyone's code. With 2 and especially 3 we only get one shot. Side note: I'm moving this week so I won't have much time for this other than today. I've written a section in the release notes and |
Another idea I have been noodling with but have not written out anywhere yet is for pandas to provide an 'unpack' decorator for throwing data at non-pandas-aware code (note: a really silly name, given because I don't want to think about a good name right now)

```python
def unpackify_decorator(func):
    def wrapped(df, column_map, *args, **kwargs):
        for k, v in column_map.items():
            if k in kwargs:
                raise ValueError()
            kwargs[k] = df[v].values
        return func(*args, **kwargs)
    return wrapped
```

or something similar, so the usage with pipe would look like

```python
awarified = unpackify_decorator(ax.scatter)
df.pipe(awarified, {'x': 'age', 'y': 'total_commit', 'c': 'number_of_projects', 's': 'net_LoC'}, cmap='gray')
```

Having written this out, it might also make sense to have this logic be part of the

```python
def pipe(self, func, _pipe_target=None, *args, **kwargs):
    if is_string(_pipe_target):
        kwargs[_pipe_target] = self
        return func(*args, **kwargs)
    elif is_dict(_pipe_target):
        for k, v in _pipe_target.items():
            if k in kwargs:
                raise ValueError()
            kwargs[k] = self[v].values
        return func(*args, **kwargs)
    elif is_tuple(_pipe_target):
        args = tuple(self[k] for k in _pipe_target) + args
        return func(*args, **kwargs)
    else:
        return func(self, *args, **kwargs)
```

so the usage would look like

```python
df.pipe(ax.scatter, {'x': 'age', 'y': 'total_commit', 'c': 'net_LoC', 's': 'number_of_projects'}, cmap='gray')
```

or

```python
df.pipe(ax.scatter, ('age', 'total_commit', 'net_LoC', 'number_of_projects'), cmap='gray')
```

It is a bit verbose (but I picked a complicated example mapping 4 columns and used really descriptive column names), but I think it captures all of the simple cases for functions. If you want to mix the order of user supplied args and data extracted from |
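A runnable sketch of that type-dispatching `pipe` (stand-ins throughout: a plain dict stands in for the frame, `scatter` for `ax.scatter`, the hypothetical `is_string`/`is_dict`/`is_tuple` helpers are replaced with `isinstance` checks, and the `.values` extraction is dropped for simplicity):

```python
def pipe(df, func, _pipe_target=None, *args, **kwargs):
    # dispatch on the type of _pipe_target, as sketched above
    if isinstance(_pipe_target, str):
        kwargs[_pipe_target] = df
        return func(*args, **kwargs)
    elif isinstance(_pipe_target, dict):
        # map keyword names to columns pulled out of the frame
        for k, v in _pipe_target.items():
            if k in kwargs:
                raise ValueError('duplicate argument: %s' % k)
            kwargs[k] = df[v]
        return func(*args, **kwargs)
    elif isinstance(_pipe_target, tuple):
        # prepend the named columns as positional arguments
        args = tuple(df[k] for k in _pipe_target) + args
        return func(*args, **kwargs)
    else:
        return func(df, *args, **kwargs)

def scatter(x=None, y=None, cmap=None):
    # stand-in for ax.scatter
    return ('scatter', x, y, cmap)

df = {'age': [1, 2], 'total_commit': [3, 4]}  # dict standing in for a frame
dict_result = pipe(df, scatter, {'x': 'age', 'y': 'total_commit'}, cmap='gray')
tuple_result = pipe(df, scatter, ('age', 'total_commit'), cmap='gray')
```

Both spellings extract the same columns; the dict form names the keyword targets explicitly, while the tuple form relies on positional order.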
And thinking about this a bit more, having |
Yes, exactly. For Seaborn, your example would work, though to handle optional arguments properly you'd want something like this:
In practice, I would probably write a decorator to consolidate the logic:

```python
def pipeable(func):
    def pipe_func(data, *args, **kwargs):
        return func(*args, data=data, **kwargs)
    func.__pipe_func__ = pipe_func
    return func

# now we decorate all the Seaborn plotting functions as pipeable
@pipeable
def violinplot(x=None, y=None, hue=None, data=None, ...):
    # ...
```

@josef-pkt I misread your comment -- indeed, it would also make sense to define |
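For completeness, a sketch of the pandas-side counterpart that would consult this attribute (a free-function stand-in for the method; only the `__pipe_func__` name comes from the discussion above):

```python
def pipe(obj, func, *args, **kwargs):
    # prefer the function's declared piping behaviour, if it advertises one
    pipe_func = getattr(func, '__pipe_func__', None)
    if pipe_func is not None:
        return pipe_func(obj, *args, **kwargs)
    # fall back to passing the object as the first positional argument
    return func(obj, *args, **kwargs)

def pipeable(func):
    # library-side decorator, as sketched in the comment above
    def pipe_func(data, *args, **kwargs):
        return func(*args, data=data, **kwargs)
    func.__pipe_func__ = pipe_func
    return func

@pipeable
def violinplot(x=None, y=None, data=None):
    # stand-in for a seaborn-style function
    return (x, y, data)

result = pipe('DF', violinplot, x='day', y='tip')
```

The two libraries never import each other; they only agree on the attribute name.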
Yep, your way is simpler. I'll add it to the original. I posted on our ML. Feel free to post to r/python and / or ideas |
Something related but not much practical value:
and
|
Please note that this looks much like the stream library. (maybe stream is a better name?) Also, how are pipes going to play with the recent

Not being picky or making requests, but just trying to bring these discussions up in order to avoid future problems. |
@FelipeLema It's best to think of everything we can ahead of time. I don't think this proposal really interacts with async / |
FWIW, The lower-level glyph API doesn't flow quite as smoothly, but I think that's ok. |
FYI, I finally got around to posting on Python-Ideas: https://mail.python.org/pipermail/python-ideas/2015-May/033673.html @TomAugspurger Nice to see this already works with bokeh! |
Hi @TomAugspurger and others. The timing here is auspicious. A few of us are convened in Austin this week to hash out a final interface for the charts interface, so it would be good to have any input from this issue. What I got from my brief reading was "data as the first argument" is good. I don't expect that that would change, but it's good to know. If there are any other things that you would like to pass on to make sure things integrate as well as possible, please let me know here (or in an issue on the Bokeh GH). Ping @fpliger @rothnic |
I definitely don't see |
FYI, there is a new piping discussion here : https://mail.python.org/pipermail/python-ideas/2015-May/033491.html |
@josef-pkt @mwaskom @tacaswell How important to you is the ability of the pipe argument to control what it does (via

@bryevdv @fpliger Seems like |
I wonder if that's just a fundamental difference. I spend way more time at the interpreter as a user of pandas than as a developer of it. And that's when the

Anyway, I'm fine with excluding |
I am 👎 on

I am (predictably) in favor of my suggestion above to specify how to un-pack data frames into non-data-frame-aware functions as input to

Speaking for myself, I would be against mpl adding |
@tacaswell I feel pretty strongly that

The helper function factory for unpacking dataframes seems like a decent idea, though I'm not sure it's necessary to put it in pandas. Indeed, we may want to simply encourage authors that would use |
Perhaps the concern is that one could consider "piping" in contexts other
|
Where else would it make sense for it to live? |
This strikes me as a recipe for confusing users. How many people will first encounter a particular function in a "piped" context, then import the wrong thing and not understand why other examples don't work? The duplication of functions only makes sense if you have a pretty thorough understanding of pandas implementation details. What if instead it were possible to do

```python
DataFrame(...).pipe("data", sns.violinplot, "day", "tip")
```

where, if the first argument is a string and the second is a callable, the first is interpreted as the keyword arg to pipe the dataframe into. I'm not sure if this would require manual inspection of an |
Could also maybe be more straightforward to do

```python
DataFrame(...).pipe((sns.violinplot, "data"), "day", "tip")
```

then the code on the pandas side is simpler, and it's clear that the "data arg" is more tightly associated with what to call than what to plot (if that makes sense). |
@mrocklin I like that. Instead of

Here's what @mwaskom's proposals look like (in code). Both of these seem pretty reasonable to me:

```python
def pipe1(self, target, *args, **kwargs):
    # DataFrame(...).pipe("data", sns.violinplot, "day", "tip")
    if callable(target):
        return target(self, *args, **kwargs)
    else:
        kwargs[target] = self
        func = args[0]
        args = args[1:]
        return func(*args, **kwargs)

def pipe2(self, target, *args, **kwargs):
    # DataFrame(...).pipe((sns.violinplot, "data"), "day", "tip")
    if isinstance(target, tuple):
        func, data_arg = target
        kwargs[data_arg] = self
        return func(*args, **kwargs)
    else:
        return target(self, *args, **kwargs)
```
|
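A quick runnable check that the two proposals behave identically on the same call (free-function stand-ins for the methods, and a hypothetical `violinplot`):

```python
def pipe1(df, target, *args, **kwargs):
    # string-first variant: pipe("data", func, ...)
    if callable(target):
        return target(df, *args, **kwargs)
    kwargs[target] = df
    func, args = args[0], args[1:]
    return func(*args, **kwargs)

def pipe2(df, target, *args, **kwargs):
    # tuple variant: pipe((func, "data"), ...)
    if isinstance(target, tuple):
        func, data_arg = target
        kwargs[data_arg] = df
        return func(*args, **kwargs)
    return target(df, *args, **kwargs)

def violinplot(x=None, y=None, data=None):
    # stand-in for sns.violinplot
    return ('violin', x, y, data)

r1 = pipe1('DF', 'data', violinplot, 'day', 'tip')
r2 = pipe2('DF', (violinplot, 'data'), 'day', 'tip')
```

The only difference is whether the data keyword travels alongside the function (tuple form) or ahead of it (string form).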
Summary

There wasn't too much feedback from either python-ideas or r/python. The most common response was more or less "why are you using dunder methods?" followed by the

We're also attached to keeping

There's some worry that the

Tasks

These are all intertwined
|
As @TomAugspurger describes in the PR (#10253), in the dev meeting we settled on @mwaskom's tuple proposal:

```python
def pipe(self, target, *args, **kwargs):
    # DataFrame(...).pipe((sns.violinplot, "data"), "day", "tip")
    if isinstance(target, tuple):
        func, data_arg = target
        kwargs[data_arg] = self
        return func(*args, **kwargs)
    else:
        return target(self, *args, **kwargs)
```

There's no magic, and it leaves the door open to future extensions. Please speak up if you have any further concerns. |
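For illustration, a standalone sketch of both branches of the adopted API, including a small chain (all functions here are hypothetical stand-ins, not real pandas/seaborn code):

```python
def pipe(obj, target, *args, **kwargs):
    # free-function version of the adopted DataFrame.pipe
    if isinstance(target, tuple):
        func, data_arg = target
        kwargs[data_arg] = obj
        return func(*args, **kwargs)
    return target(obj, *args, **kwargs)

def add_column(df, name):
    # stand-in for a frame-first transformation step
    return df + [name]

def violinplot(x=None, y=None, data=None):
    # stand-in for a function that wants the frame as a keyword
    return ('violin', x, y, data)

step1 = pipe([], add_column, 'sepal_ratio')                # positional branch
result = pipe(step1, (violinplot, 'data'), 'day', 'tip')   # tuple branch
```

Frame-first functions need no ceremony at all; only keyword-target functions need the tuple spelling.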
I think this is a missed opportunity to make it easy to export data in a |
@tacaswell I also think that the simplicity of the implementation in #10253 is crucial. And the boat hasn't entirely sailed on something like what you proposed with a

Thanks everyone for the feedback and ideas (and civil discourse! <3 PyData) |
@TomAugspurger Fair enough, I do see the case for keeping it as simple as possible. As linked above, I have versions of the more verbose code in a PR against mpl which can also serve as a guinea pig (and any feedback from the pandas folks on what I did wrong would be helpful). |
Today @shoyer and I were talking about a new "protocol" that will let us sidestep the whole macro / method chaining issue. The basic idea is that pandas objects define a `pipe` method (ideally other libraries will implement this too, assuming this is useful).

Based on the discussions below, we're leaning towards a method like
That's it. This lets you write code like:
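A sketch of the kind of call this enables, using a lambda to route the frame into a keyword argument (minimal stand-ins for the frame and for a seaborn-style function; not the original snippet):

```python
class Frame:
    # minimal stand-in for a pandas DataFrame with the proposed method
    def __init__(self, name):
        self.name = name

    def pipe(self, func, *args, **kwargs):
        return func(self, *args, **kwargs)

def violinplot(x=None, y=None, data=None):
    # stand-in for sns.violinplot, which takes its frame via `data=`
    return ('violin', x, y, data.name)

df = Frame('tips')
result = df.pipe(lambda d: violinplot(x='day', y='tip', data=d))
```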
seaborn didn't have to do anything! If the DataFrame is the first argument to the function, things are even simpler:
Users or libraries can work around the need for the (somewhat ugly) `lambda _:` by using the `__pipe_func__` attribute of the function being `pipe`d in. This is where a protocol (de facto or official) would be useful, since libraries that know nothing else about each other can rely on it. As an example, consider seaborn's violin plot, which expects a DataFrame as its fourth argument, `data`. Seaborn can define a simple decorator to attach a `__pipe_func__` attribute, allowing it to define how it expects to be `pipe`d to.

And users write
Why?
Heavily nested function calls are bad. They're hard to read, and can easily introduce bugs. Consider:
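An illustrative nested pipeline (stand-in implementations; the names `f`, `g`, `h` match the discussion that follows):

```python
def h(df):
    return df + ['h']

def g(df, arg1=None):
    return df + ['g', arg1]

def f(df, arg2=None, arg3=None):
    return df + ['f', arg2, arg3]

# evaluation runs inside-out, and each function's
# arguments sit far away from its name
result = f(g(h([]), arg1=1), arg2=2, arg3=3)
```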
For pandas, the approach has been to add `f`, `g`, and `h` as methods to (say) `DataFrame`. The code is certainly cleaner. It reads and flows top to bottom instead of inside-out. The function arguments are next to the function calls. But there's a hidden cost. DataFrame has something like 200+ methods, which is crazy. It's less flexible for users since it's hard to get their own functions into pipelines (short of monkey-patching). With `.pipe`, we can

The other way around the nested calls is using temporary variables:
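The temporary-variable version, with hypothetical stand-in functions `f`, `g`, `h`:

```python
def h(df):
    return df + ['h']

def g(df, arg1=None):
    return df + ['g', arg1]

def f(df, arg2=None, arg3=None):
    return df + ['f', arg2, arg3]

# each step gets a name; reads top to bottom,
# but litters the namespace with throwaway names
df1 = h([])
df2 = g(df1, arg1=1)
result = f(df2, arg2=2, arg3=3)
```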
Which is better, but not as good as the `.pipe` solution.

A relevant thread on python-ideas, started by @mrocklin: https://mail.python.org/pipermail/python-ideas/2015-March/032745.html
This doesn't achieve everything macros could. We still can't do things like `df.plot(x=x_col, y=y_col)` where `x_col` and `y_col` are captured by `df`'s namespace. But it may be good enough.

Going to cc a bunch of people here, who've had interest in the past.
@shoyer
@mrocklin
@datnamer
@dalejung