
API: Implement pipe protocol, a method for extensible method chaining #10129

Closed · TomAugspurger opened this issue May 13, 2015 · 59 comments · Fixed by #10253

@TomAugspurger (Contributor)

Today @shoyer and I were talking about a new "protocol" that will let us sidestep the whole macro / method chaining issue. The basic idea is that pandas objects define a pipe method (ideally other libraries will implement this too, assuming it proves useful).

Based on the discussions below, we're leaning towards a method like

def pipe(self, func, *args, **kwargs):
    pipe_func = getattr(func, '__pipe_func__', func)
    return pipe_func(self, *args, **kwargs)

That's it. This lets you write code like:

import seaborn as sns

(sns.load_dataset('iris')
    .query("sepal_length > 5")
    .assign(sepal_ratio = lambda df: df.sepal_length / df.sepal_width)
    .pipe(lambda df: sns.violinplot(x='species', y='sepal_ratio', data=df))
)

seaborn didn't have to do anything! If the DataFrame is the first argument to the function, things are even simpler:

def f(x):
    return x / 2

df['sepal_length'].pipe(f)

Users or libraries can work around the need for the (somewhat ugly) lambda _: by using the __pipe_func__ attribute of the function being piped in. This is where a protocol (de facto or official) would be useful, since libraries that know nothing else about each other can rely on it. As an example, consider seaborn's violin plot, which expects a DataFrame as its fourth argument, data. Seaborn can define a simple decorator to attach a __pipe_func__ attribute, allowing it to define how it expects to be piped to.

def pipeable(func):
    def pipe_func(data, *args, **kwargs):
        return func(*args, data=data, **kwargs)
    func.__pipe_func__ = pipe_func
    return func

# now we decorate all the Seaborn methods as pipeable

@pipeable
def violinplot(x=None, y=None, hue=None, data=None, ...):
    # ...   

And users write

(sns.load_dataset('iris')
    .query("sepal_length > 5")
    .assign(sepal_ratio = lambda x: x.sepal_length / x.sepal_width)
    .pipe(sns.violinplot, x='species', y='sepal_ratio')
)

Why?

Heavily nested function calls are bad. They're hard to read, and can easily introduce bugs. Consider:

# f, g, and h are functions that take and return a DataFrame
result = f(g(h(df), arg1=1), arg2=2, arg3=3)

For pandas, the approach has been to add f, g, and h as methods to (say) DataFrame

(df.h()
   .g(arg1=1)
   .f(arg2=2, arg3=3)
)

The code is certainly cleaner. It reads and flows top to bottom instead of inside-out. The function arguments are next to the function calls. But there's a hidden cost. DataFrame has something like 200+ methods, which is crazy. It's less flexible for users since it's hard to get their own functions into pipelines (short of monkey-patching). With .pipe, we can

(df.pipe(h)
   .pipe(g, arg1=1)
   .pipe(f, arg2=2, arg3=3)
)

The other way around the nested calls is using temporary variables:

r1 = h(df)
r2 = g(r1, arg1=1)
r3 = f(r2, arg2=2, arg3=3)

Which is better, but not as good as the .pipe solution.


A relevant thread on python-ideas, started by @mrocklin: https://mail.python.org/pipermail/python-ideas/2015-March/032745.html

This doesn't achieve everything macros could. We still can't do things like df.plot(x=x_col, y=y_col) where x_col and y_col are captured by df's namespace. But it may be good enough.

Going to cc a bunch of people here, who've had interest in the past.

@shoyer
@mrocklin
@datnamer
@dalejung

@TomAugspurger (Contributor, Author)

I really like how simple the implementation is. No PEPs, no hacking on CPython, and others should be able to implement the same protocol.

One (maybe) drawback is that we must rely on people to be responsible and not modify the self being passed in. But that's a very pythonic sort of "problem". Anyway, I look forward to the discussion :)

@jreback (Contributor) commented May 13, 2015

@mrocklin (Contributor)

I like the idea of our community establishing and using protocols more heavily. Will ponder this one for a while.

@TomAugspurger (Contributor, Author)

@jreback I've used that in the past, and this would be very similar. toolz.pipe is less convenient for interactive use though, I think. You need to define each function ahead of time, including simple functions to get the data to pipe into the correct position for functions that don't expect data as the first argument. We'll still have to do that with df.pipe, but I think it's easier to do as you work through the chain.
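For contrast, a minimal sketch of the same kind of chain written with toolz.pipe (real toolz API; the small df and the stand-in functions here are illustrative only):

import pandas as pd
from toolz import pipe

df = pd.DataFrame({'a': [1.0, 2.0, 3.0]})
h = lambda d: d.dropna()
g = lambda d, arg1: d * arg1
f = lambda d, arg2, arg3: d + arg2 + arg3

# pipe(data, f1, f2, ...) threads data through each one-argument callable in turn;
# steps that need extra arguments require a lambda or functools.partial
result = pipe(df,
              h,
              lambda d: g(d, arg1=1),
              lambda d: f(d, arg2=2, arg3=3))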

@shoyer (Member) commented May 14, 2015

We might also extend pipe to allow functions to customize which argument is used for chaining. This would be useful for seaborn, which currently takes dataframes in the data keyword argument:

# in pandas
def pipe(self, func, *args, **kwargs):
    if hasattr(func, 'pipe_arg'):
        kwargs[func.pipe_arg] = self
        return func(*args, **kwargs)
    return func(self, *args, **kwargs)

# in seaborn (personally, I would probably implement this with a decorator)
sns.violinplot.pipe_arg = 'data'

# in user code
df.pipe(sns.violinplot, x='species', y='sepal_ratio')

Alternatively, we could just ask @mwaskom to break Seaborn's API again ;).

@mrocklin (Contributor)

I agree that we can probably make something more pandas-y than toolz.pipe which, while convenient in its own way, doesn't jibe particularly well with the pandas flow.

@datnamer

Is it possible to get infix syntax using a decorator? https://github.com/JulienPalard/Pipe

@TomAugspurger (Contributor, Author)

That's done by overriding __or__ though, right? I wouldn't feel comfortable doing that in library code.

@shoyer (Member) commented May 14, 2015

@datnamer Yes, but not satisfactorily without macros. The problem is that to get df >> sns.violinplot('x', 'y') to work properly, you need to allow sns.violinplot('x', 'y'), even though that would usually raise an error because the required data argument is not provided.

So turning functions into pipe objects means that their non-piped usability is compromised. There are ways around this, e.g., using tuples like dask: df >> (sns.violinplot, 'x', 'y'), but >> (with the spaces) is only one less character than .pipe (which is much more explicit).
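For illustration, a minimal sketch of the infix approach under discussion (a hypothetical Piped wrapper, not proposed for pandas; note that with a DataFrame on the left, pandas' own element-wise __rshift__ could interfere with the fallback to __rrshift__, which is one more mark against the infix style):

class Piped:
    """Bundle a function with arguments so that `data >> piped_obj` applies it."""
    def __init__(self, func, *args, **kwargs):
        self.func, self.args, self.kwargs = func, args, kwargs

    def __rrshift__(self, data):
        # invoked for `data >> self` when the left operand has no matching __rshift__
        return self.func(data, *self.args, **self.kwargs)

# usage: {'x': 1} >> Piped(lambda d, key: d[key], 'x')  returns 1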

@mrocklin (Contributor)

This isn't a viable candidate for a Pandas solution, but I thought I'd point out toolz.curry which lives in the same space.
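A minimal sketch of toolz.curry in this space (real toolz API; add_ratio is an illustrative function, not from any library):

import pandas as pd
from toolz import curry

@curry
def add_ratio(df, num, denom, out='ratio'):
    return df.assign(**{out: df[num] / df[denom]})

df = pd.DataFrame({'sepal_length': [5.1, 4.9], 'sepal_width': [3.5, 3.0]})

# supplying everything except the DataFrame returns a new callable still waiting
# for it, which then slots into any pipeline expecting data-first functions
step = add_ratio(num='sepal_length', denom='sepal_width')
result = step(df)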

@sinhrks (Member) commented May 14, 2015

Following is a list of the pipes which R's magrittr package provides (the package is used by dplyr).

http://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html

  • pipe: current example (and main use case). %>% in magrittr.
  • tee: apply the function to the left value, and return the left value as-is. %T>% in magrittr.
  • pipe and assignment: corresponding to pandas' inplace option. %<>% in magrittr.

If any of these are useful, a separate keyword (or method) might be an option.

@TomAugspurger (Contributor, Author)

tee should be quite easy as a separate method. Just apply func and return self.

(df.tee(sns.violinplot, x='x', y='y')
   .pipe(lambda d: sm.OLS.from_formula('y ~ x', data=d)))

I do consider this experimental though. My initial thought was to start slow and see how things go so I didn't include that in the initial proposal.
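A minimal sketch of such a tee method, assuming the "apply func and return self" semantics described above:

def tee(self, func, *args, **kwargs):
    func(self, *args, **kwargs)   # run purely for side effects (plotting, logging, ...)
    return self                   # pass the original object down the chain unchanged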

@tacaswell (Contributor)

👍 I like this in general as an alternative to the current df.plot. It also fits pretty well with where I (speaking only for me) want to push the mpl API.

tee worries me a bit, as you lose the ability to get the Artists back, which is important if you want to do anything interactive.

Instead of pushing the decorator out to other libraries, I think it would be better to encourage end users to decorate on the fly, e.g.

pd_violin = pd.pipeify(sns.violinplot)
artists = df.pipe(pd_violin, x='foo', y='bar')

or to give pipe a reserved kwarg like _pipe_target so

def pipe(self, func, _pipe_target=None, *args, **kwargs):
    if _pipe_target is not None:
        kwargs[_pipe_target] = self
        return func(*args, **kwargs)
    else:
        return func(self, *args, **kwargs)

@TomAugspurger (Contributor, Author)

Another drawback is that libraries wishing to support pandas <0.17 will need to implement their own pipeable decorator. I've got one in pandas/util/decorators.py

from functools import wraps

def pipeable(pipe_arg):
    def decorate(func):
        setattr(func, 'pipe_arg', pipe_arg)

        @wraps(func)
        def wrapper(*args, **kwargs):
            return func(*args, **kwargs)
        return wrapper
    return decorate

It's not difficult, but it's essential that everyone agrees on the name for pipe_arg. This is a point in favor of @tacaswell's suggestion, but I'm pretty attached to simply being able to write df.pipe(sns.violinplot, 'x', 'y') :)

@TomAugspurger (Contributor, Author)

@josef-pkt would statsmodels be interested in adding the decorators to support this? I see it working quite nicely with the formula API. I'll send over a PR if you want. I believe you're hoping to have a release in the not too distant future? It'd be nice to lay the groundwork in statsmodels 0.7. Even if that comes out before the next version of pandas, adding the decorators should be entirely transparent to pre-0.17 pandas and just work for pandas >= 0.17.

@mwaskom (Contributor) commented May 16, 2015

> Another drawback with this API is that the user must be a bit more disciplined about positional arguments vs kwargs. Assuming df.pipe and that seaborn's violin plot has been decorated.

Hm, tricky. Though, FacetGrid.map works the same way, so there's precedent.

@josef-pkt

What's the difference between
model = df.pipe(sm.OLS.from_formula('y ~ x')) and
model = sm.OLS.from_formula('y ~ x', df) ?

Would this be worth the increased complexity and fragility and possibly reduced flexibility?

It might be more useful in standalone functions like some plots, or hypothesis tests or data transformations.
Also, I guess more useful than getting a model could be a wrapper for something like prediction (assuming we are not interested in other result statistics):

predicted = smf.ols('y ~ x', df).fit().predict(df2) currently returns a numpy array
or
fitted = sm.OLS.from_formula('y ~ x', df).fit().fittedvalues returns a pandas.Series

The main problem I see is that statsmodels is largely model-centric and not data-centric like pandas and seaborn. The other design issue is that statsmodels has little functional code; I just spent the better part of two weeks chaining models and results through multiple inheritance and similar. (We need to modify or chain 3 to 7 methods of a Maximum Likelihood Model.)

Personally, I find magrittr code pretty ugly, since my tradition is in repeated assignments to temporaries a la matlab/numpy.

Nevertheless, the best user interfaces, pandas wrapper and formulas, were included without my involvement. And we do have quite a bit of non-model functions.

@TomAugspurger (Contributor, Author)

@josef-pkt No functional difference between those first two. And readability-wise they're pretty much the same.

The (potential) benefit comes in chains, but that case is admittedly weaker for statsmodels since it's typically the last step in the chain, and you'll probably want to have a reference to the DataFrame that went into the model.

@josef-pkt

@TomAugspurger We are holding on to the dataframe right now because we use it for some optional methods. I also see more benefits in other functions like in my examples where statsmodels is not necessarily the last step in the chain.

In some cases the main advantage in interactive work of having a method available is that I don't need an explicit import. That's one advantage of a plot method, since most of the time I don't import matplotlib automatically.

Just an idea: instead of a pipe we could add (monkey patch) a method on pandas.DataFrame, like df.sm(...) which automatically uses the function or class in the sm namespace and replaces the data argument with df. I guess we don't have enough consistent usage of a data argument yet to make that work consistently. df.sns('violinplot', ...) ?
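A minimal sketch of that monkey-patched accessor idea (all names here are hypothetical, and later comments in this thread argue against libraries monkey patching DataFrame):

import pandas as pd
import statsmodels.formula.api as smf

def _sm(self, func_name, *args, **kwargs):
    # look up e.g. smf.ols and inject this frame as its `data` keyword
    return getattr(smf, func_name)(*args, data=self, **kwargs)

pd.DataFrame.sm = _sm   # df.sm('ols', 'y ~ x').fit()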

@shoyer (Member) commented May 17, 2015

@TomAugspurger Hmm -- I'm not seeing the issue you've described with seaborn. Are you sure you're running the latest dev version? Here's my example notebook, in which each of iris.pipe(sns.violinplot, 'species', 'sepal_width') and iris.pipe(sns.violinplot, x='species', y='sepal_width') works: http://nbviewer.ipython.org/gist/shoyer/3f73dc7ffa03883d3a43

I suppose we can include a utility function for decorating functions as pipeable, but really it's even easier than you think. For example, you can just add the attribute directly: sns.violinplot.pipe_arg = 'data'. Or if you prefer a decorator, you can use something even simpler:

def pipeable(pipe_arg):
    def decorate(func):
        func.pipe_arg = pipe_arg
        return func
    return decorate

@josef-pkt To clarify, your first example should actually be: model = df.pipe(sm.OLS.from_formula, 'y ~ x')

Coming back to the API design discussion, one of my favorite things about the original, simple version of pipe is that it is entirely explicit and there is absolutely no magic:

def pipe(self, func, *args, **kwargs):
    return func(self, *args, **kwargs)

So no, I don't think it's a good idea to encourage monkey patching DataFrame from external libraries. It's not explicit and very much goes against the norms of idiomatic Python.

Similarly, as much as I want sns.violinplot to be pipeable, I'm not convinced it's actually a good idea to encourage library authors to decorate their functions to denote pipe targets. This adds a whole additional level of complexity which effectively needs to be documented as part of the function signature, but which isn't actually part of the function signature and thus will be far less discoverable. We could do it, but I would only recommend that library authors use it with extreme reservations -- it's far better to just accept data as the first argument to any functions that you want to be pipeable.

So something like @tacaswell's suggestion might be the better way to go. A few other options:

  1. Pipe automatically uses the data argument if present, as determined by inspecting the function signature (with inspect.getargspec). Unfortunately, the inspect module is quite fragile.
  2. df.pipe.data(sns.violinplot, 'x', 'y') or df.pipe['data'](sns.violinplot, 'x', 'y'). These aren't very pretty and would have a much more complex implementation.
  3. We allow the first argument of pipe to be a string, in which case it's used to specify the target argument: df.pipe('data', sns.violinplot, 'x', 'y'). This is also messy to implement.
  4. We use some sort of sentinel object to signal the pipe argument: df.pipe(sns.violinplot, 'x', 'y', data=pipe.HERE). But now pipe needs to import a shared module instead of simply being a protocol.

Of these, I think my favorite is @tacaswell's original suggestion df.pipe(sns.violinplot, 'x', 'y', _target='data'). Or sticking with the simple implementation, and encouraging use of lambda: df.pipe(lambda df: sns.violinplot('x', 'y', data=df))

@mwaskom can you remind us why you don't want to keep data as the first argument to sns.violinplot? I guess it's because it's the only way to make sns.FacetGrid.map work reliably without any magic? This is basically the same problem I'm trying to avoid here (to avoid implementing pipe_arg). On the plus side, at least pipe will already pair well with the explicit sns.FacetGrid API.

@josef-pkt

I just realized that methods or functions that take a dataframe as the first argument will work without any changes (even if the data argument is called exog)

with the original proposed pipe

import pandas as pd
import numpy as np
import statsmodels.formula.api as smf

df = pd.DataFrame(np.random.randn(10, 3))
df.columns = ["a", "b", "c"]
result = smf.ols('a ~ b', df).fit()

def pipe(self, func, *args, **kwargs):
    return func(self, *args, **kwargs)

# monkey patch to get the example to work
pd.DataFrame.pipe = pipe

predicted = df.pipe(result.predict)
print(predicted)
print(df[df['a'] < 0].pipe(result.predict))
>>> df.pipe(result.predict)
array([ 0.77901375,  0.95606063,  0.72349671,  0.96823813,  1.33929812,
        0.42730606,  0.57528495,  0.47756362,  0.52891164,  0.40572964])
>>> df[df['a'] < 0].pipe(result.predict)
array([ 0.77901375,  0.42730606,  0.57528495])

or with a lambda function

(edited to fix typos)

>>> pt = lambda data, formula: smf.ols(formula, data).fit().summary2().tables[1]

>>> df.pipe(pt, 'a ~ b')
              Coef.  Std.Err.         t     P>|t|    [0.025    0.975]
Intercept  0.560831  0.430557  1.302571  0.228965 -0.432035  1.553698
b          0.369255  0.490102  0.753426  0.472782 -0.760921  1.499431

>>> df.pipe(pt, 'a ~ c')
              Coef.  Std.Err.         t     P>|t|    [0.025    0.975]
Intercept  0.780714  0.392976  1.986672  0.082195 -0.125490  1.686917
c          0.375825  0.593915  0.632793  0.544528 -0.993745  1.745395

>>> df.pipe(pt, 'a ~ b + c')
              Coef.  Std.Err.         t     P>|t|    [0.025    0.975]
Intercept  0.561965  0.422812  1.329111  0.225494 -0.437828  1.561757
b          0.657157  0.543692  1.208693  0.266020 -0.628471  1.942784
c          0.742642  0.652395  1.138331  0.292437 -0.800028  2.285312

and

>>> df[df['a'] < 0].pipe(pt, 'a ~ b')
              Coef.  Std.Err.          t     P>|t|    [0.025    0.975]
Intercept -0.653352  0.048054 -13.596207  0.046739 -1.263935 -0.042768
b          0.088062  0.119958   0.734112  0.596857 -1.436143  1.612267

@TomAugspurger (Contributor, Author)

@shoyer I must have had a bug in mine. It works both with named and positional arguments.

@mwaskom (Contributor) commented May 17, 2015

> @mwaskom can you remind us why you don't want to keep data as the first argument to sns.violinplot? I guess it's because it's the only way to make sns.FacetGrid.map work reliably without any magic? This is basically the same problem I'm trying to avoid here (to avoid implementing pipe_arg). On the plus side, at least pipe will already pair well with the explicit sns.FacetGrid API.

Right, one of the motivations for the change was to allow things to work better with FacetGrid, but also to bring boxplot and violinplot in line with most of the rest of the package. I guess there could be some argument that all seaborn functions should take data first (like ggplot), but that would be an even more dramatic change. Also, for almost all the functions, data is optional and the positional arguments are happy to take vectors directly; I think originally I had the mindset of extending the basic matplotlib API and then adding the ability to name variables in a data object later, rather than orienting everything around a common dataframe representation...

@shoyer (Member) commented May 17, 2015

@mwaskom OK, fair enough. I agree that for Seaborn, for which DataFrames are optional, it makes sense for data not to be the first argument.

I suppose another option would be to support an alternative API in Seaborn itself, e.g., sns.pipeable.violinplot. This seems nicer than the other explicit options we've seen. Again I'm still torn about whether or not pipe_arg is a good idea.

@mwaskom (Contributor) commented May 17, 2015

> I suppose another option would be to support an alternative API in Seaborn itself, e.g., sns.pipeable.violinplot. This seems nicer than the other explicit options we've seen. Again I'm still torn about whether or not pipe_arg is a good idea.

This would be reasonably easy to implement on my end, especially for the categorical plots, so that would be fine. Though, for the user, that approach probably ends up involving almost as much extra typing as just making an intermediate dataframe :)

@TomAugspurger (Contributor, Author)

So, attempting a summary.

  • People are +1 to the idea of a .pipe method in general (modulo details)
  • There's been no discussion of the name, which I take to mean .pipe is the correct name(?)
  • The contention is on how to pipe data to the correct position in func. Do we:
    1. Have users pass in a lambda: lambda _: func(a, b, _)
    2. Include a special _pipe_target keyword in .pipe indicating where the data should go
    3. Have library authors / users decorate their functions (and rewrite docs / examples)

FWIW, I'm coming around to 1 (maybe 2). Taking things slowly and seeing how the community responds could be good. The nice thing about 1 is it gives us the most flexibility in the future if we see something that needs to change, without breaking anyone's code. With 2 and especially 3 we only get one shot.
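In code, the three candidate call styles look roughly like this (a sketch; func, a, and b are placeholders, and func is assumed to take the frame via a data keyword):

df.pipe(lambda d: func(a, b, data=d))       # 1: explicit lambda
df.pipe(func, a, b, _pipe_target='data')    # 2: reserved keyword on .pipe
df.pipe(func, a, b)                         # 3: func carries pipe metadata
                                            #    (pipe_arg / __pipe_func__)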

side note: I'm moving this week so I won't have much time for this other than today. I've written a section in the release notes and basics.rst. I plan to go through the rest of the docs today looking for places where .pipe would be appropriate (there may not be many / any). So I'll open up a PR this afternoon, or if someone beats me to it I can do a PR against their branch for the docs.

@tacaswell (Contributor)

Another idea I have been noodling with but have not written out anywhere yet is for pandas to provide an 'unpack' decorator for throwing data at non-pandas-aware code (note: the really silly name is because I don't want to think about a good name right now)

def unpackify_decorator(func):
    def wrapped(df, column_map, *args, **kwargs):
        for k, v in column_map.items():
            if k in kwargs:
                raise ValueError()
            kwargs[k] = df[v].values
        return func(*args, **kwargs)
    return wrapped

or something similar so the usage with pipe would look like

awarified = unpackify_decorator(ax.scatter)
df.pipe(awarified, {'x': 'age', 'y': 'total_commit', 'c': 'number_of_projects', 's': 'net_LoC'}, cmap='gray')

Having written this out it might also make sense to have this logic be part of the pipe method.

def pipe(self, func, _pipe_target=None, *args, **kwargs):
    # is_string / is_dict / is_tuple stand in for the appropriate isinstance checks
    if is_string(_pipe_target):
        kwargs[_pipe_target] = self
        return func(*args, **kwargs)
    elif is_dict(_pipe_target):
        for k, v in _pipe_target.items():
            if k in kwargs:
                raise ValueError()
            kwargs[k] = self[v].values
        return func(*args, **kwargs)
    elif is_tuple(_pipe_target):
        args = tuple(self[k] for k in _pipe_target) + args
        return func(*args, **kwargs)
    else:
        return func(self, *args, **kwargs)

so the usage would look like

df.pipe(ax.scatter, {'x': 'age', 'y': 'total_commit', 'c': 'net_LoC', 's': 'number_of_projects'}, cmap='gray')

or

df.pipe(ax.scatter, ('age', 'total_commit', 'net_LoC', 'number_of_projects'), cmap='gray')

It is a bit verbose (but I picked a complicated example mapping 4 columns and used really descriptive column names), but I think it captures all of the simple cases for functions. If you want to mix the order of user-supplied args and data extracted from df you still have to write a lambda, but doing that in a general way would be ugly to write and very difficult to use.

@tacaswell (Contributor)

And thinking about this a bit more, having this as a stand-alone top-level function, broadcast_labeled_data(dict_like, function, _target, *args, **kwargs), is probably a useful thing to have, and I want to have a version of it available in mpl.
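A minimal sketch of such a stand-alone helper, using the signature suggested above (nothing here is an actual mpl or pandas API):

def broadcast_labeled_data(dict_like, function, _target, *args, **kwargs):
    # _target maps keyword-argument names of `function` to keys/columns of dict_like
    for kwarg_name, key in _target.items():
        kwargs[kwarg_name] = dict_like[key]
    return function(*args, **kwargs)

# e.g. broadcast_labeled_data(df, ax.scatter, {'x': 'age', 'y': 'total_commit'})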

@shoyer (Member) commented May 18, 2015

@TomAugspurger

> To make sure I understand the intent of pipe_func = getattr(func, '__pipe_func__', func), the idea is to leave it up to the function author to decide how it is called, rather than the author of .pipe?

Yes, exactly.

For Seaborn, your example would work, though to handle optional arguments properly you'd want something like this:

violinplot.__pipe_func__ = lambda data, *args, **kwargs: violinplot(*args, data=data, **kwargs)

In practice, I would probably write a decorator to consolidate the logic:

def pipeable(func):
    def pipe_func(data, *args, **kwargs):
        return func(*args, data=data, **kwargs)
    func.__pipe_func__ = pipe_func
    return func

# now we decorate all the Seaborn plotting functions as pipeable

@pipeable
def violinplot(x=None, y=None, hue=None, data=None, ...):
    # ...   

@josef-pkt I misread your comment -- indeed, it would also make sense to define .pipe on statsmodels objects. self is just the object on which you are calling the pipe method. I agree, dispatching on the type of self in __pipe_func__ would be entirely appropriate.
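A minimal sketch of what dispatching on the type of the piped object inside __pipe_func__ could look like (the handler names are hypothetical):

import pandas as pd

def _pipe_dispatch(obj, *args, **kwargs):
    # choose an entry point based on what is being piped in
    if isinstance(obj, pd.DataFrame):
        return handle_dataframe(obj, *args, **kwargs)
    return handle_model(obj, *args, **kwargs)

some_function.__pipe_func__ = _pipe_dispatch   # some_function: any pipeable callable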

@TomAugspurger (Contributor, Author)

Yep, your way is simpler. I'll add it to the original. I posted on our mailing list. Feel free to post to r/python and/or python-ideas.

@douglas-larocca

Something related, but of not much practical value:

def recursive(df, fn):
    """
    recursive function application
    on a dataframe, use like

    >>> df.recursive(
    ...     lambda _df, fn:
    ...         _df if _df.max().max() > 1 else
    ...         fn(_df * 1.01, fn))
    """
    return fn(df, fn)

pd.DataFrame.recursive = recursive

and

def self_merge(df, self_fn, other_fn, *args, **kwargs):
    """
    self_merge
    ----------

    # pragma: rename `join_many`

    apply two functions to a dataframe
    then merge the result;

    implements multiple joins, e.g.
    if we have four dataframes,
    df1, df2, df3, df4,
    chaining `pd.merge` sequentially
    gets ugly, so instead do

    >>> df1.self_merge(
    ...     lambda _: pd.merge(_, df2, on='id', how='left'),
    ...     lambda _: _.self_merge(
    ...         lambda __: pd.merge(__, df3, on='id', how='left'),
    ...         lambda __: pd.merge(__, df4, on='id', how='left'),
    ...         on='id',
    ...         how='left',
    ...     ),
    ...     on='id',
    ...     how='left',
    ... )

    """
    return pd.merge(self_fn(df), other_fn(df), *args, **kwargs)

@datnamer

CCing @pzwang and @bryevdv, who might be interested in working this into blaze and/or bokeh.

@FelipeLema

Please note that this looks a lot like the stream library (maybe stream is a better name?).

Also, how are pipes going to play with the recent await and yield from tools? This coroutine syntax sugar plays well with composability ("piping"), and it seems natural to me that pipes should be implemented with it.

Not being picky or making requests, just trying to bring these discussions up in order to avoid future problems.

@TomAugspurger (Contributor, Author)

@FelipeLema It's best to think of everything we can ahead of time.

I don't think this proposal really interacts with async / await. It's entirely up to the functions being called in pipe whether they want to use parallel execution / asyncio.

@TomAugspurger (Contributor, Author)

FWIW, bokeh's charting functions all take their data as the first argument, so this will work seamlessly.

[screenshot: bokeh.charts function signatures, each taking data as the first argument]

The lower-level glyph API doesn't flow quite as smoothly, but I think that's ok.

@shoyer (Member) commented May 26, 2015

FYI, I finally got around to posting on Python-Ideas: https://mail.python.org/pipermail/python-ideas/2015-May/033673.html

@TomAugspurger Nice to see this already works with bokeh!

@bryevdv commented May 26, 2015

Hi @TomAugspurger and others. The timing here is auspicious. A few of us are convened in Austin this week to hash out a final design for the charts interface, so it would be good to have any input from this issue. What I got from my brief reading was that "data as the first argument" is good. I don't expect that to change, but it's good to know. If there are any other things that you would like to pass on to make sure things integrate as well as possible, please let me know here (or in an issue on the Bokeh GH). Ping @fpliger @rothnic

@fpliger commented May 27, 2015

I definitely don't see bokeh.charts' "data as the first argument" interface changing either. Also, I think it's perfect timing for suggestions. 😉

@datnamer

FYI, there is a new piping discussion here: https://mail.python.org/pipermail/python-ideas/2015-May/033491.html

@shoyer (Member) commented May 31, 2015

@josef-pkt @mwaskom @tacaswell How important to you is the ability of the pipe argument to control what it does (via __pipe_func__)? As much as I love that flexibility, it's a little un-Pythonic, and I've gotten pushback every time I've introduced it to non-PyData programmers.

@bryevdv @fpliger Seems like bokeh.charts has the right API to use this immediately :). "Data as the first argument" is indeed helpful :).

@TomAugspurger (Contributor, Author)

I wonder if that's just a fundamental difference. I spend way more time at the interpreter as a user of pandas than as a developer of it. And that's where __pipe_func__ is useful.

Anyway, I'm fine with excluding __pipe_func__ for 0.17. We'd be able to add it later without breaking backwards compatibility if people find the lambdas too annoying.

@tacaswell (Contributor)

I am 👎 on pipe looking for attributes on the input function to sort out what to do. The mess that will result if more than one library adopts using __pipe_func__ seems worse than the benefits. I am also not convinced that __pipe_func__ makes it more flexible, I think it makes everything far more brittle.

I am (predictably) in favor of my suggestion above to specify how to unpack data frames into non-DataFrame-aware functions as input to pipe, providing a helper function factory which does the aforementioned remapping, or just insisting that users write explicitly DataFrame-aware functions (either natively or by wrapping whatever they have) to use pipe.

Speaking for myself, I would be against mpl adding __pipe_func__ to our functions if this does get adopted.

@shoyer (Member) commented Jun 1, 2015

@tacaswell I feel pretty strongly that pipe itself needs to be kept to a bare minimum number of lines of code if we are going to encourage its adoption by other libraries. Hence __pipe_func__ -- a way for library authors to control that behavior without needing to put that complexity in pipe itself.

The helper function factory for unpacking dataframes seems like a decent idea, though I'm not sure it's necessary to put it in pandas.

Indeed, we may want to simply encourage authors that would use __pipe_func__ (e.g., Seaborn) to instead consider providing a separate namespace with "pipeable" functions, e.g., seaborn.pipeable.violinplot, perhaps imported by convention with import seaborn.pipeable as snsp.
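A minimal sketch of what such a seaborn.pipeable module could contain (hypothetical; seaborn has no such module) -- thin wrappers that take data first:

import functools
import seaborn as sns

def _data_first(func):
    @functools.wraps(func)
    def wrapper(data, *args, **kwargs):
        return func(*args, data=data, **kwargs)
    return wrapper

violinplot = _data_first(sns.violinplot)
boxplot = _data_first(sns.boxplot)

# import seaborn.pipeable as snsp; df.pipe(snsp.violinplot, 'day', 'tip')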

@mrocklin (Contributor) commented Jun 1, 2015

Perhaps the concern is that one could consider "piping" in contexts other than pandas. Perhaps the protocol should have "DataFrame" in the name.

@tacaswell (Contributor)

> The helper function factory for unpacking dataframes seems like a decent idea, though I'm not sure it's necessary to put it in pandas.

Where else would it make sense for it to live?

@mwaskom (Contributor) commented Jun 1, 2015

> Indeed, we may want to simply encourage authors that would use __pipe_func__ (e.g., Seaborn) to instead consider providing a separate namespace with "pipeable" functions, e.g., seaborn.pipeable.violinplot, perhaps imported by convention with import seaborn.pipeable as snsp.

This strikes me as a recipe for confusing users. How many people will first encounter a particular function in a "piped" context, then import the wrong thing and not understand why other examples don't work? The duplication of functions only makes sense if you have a pretty thorough understanding of pandas implementation details.

What if instead it were possible to do

DataFrame(...).pipe("data", sns.violinplot, "day", "tip")

Where if the first argument is a string and the second is a callable, the first is interpreted as the keyword arg to pipe the dataframe into. I'm not sure if this would require manual inspection of an *args list or if it could be handled in a nice way with multiple dispatch based on argument type.

@mwaskom (Contributor) commented Jun 1, 2015

Could also maybe be more straightforward to do

DataFrame(...).pipe((sns.violinplot, "data"), "day", "tip")

then the code on the pandas side is simpler, and it's clear that the "data arg" is more tightly associated with what to call than what to plot (if that makes sense).

@shoyer (Member) commented Jun 1, 2015

@mrocklin I like that. Instead of __pipe_func__ we would have __pipe_dataframe__. If anyone else wants their own pipe override, it's obvious they should use their own name for the special attribute (e.g., __pipe_formula__ or whatever).

Here's what @mwaskom's proposals look like (in code). Both of these seem pretty reasonable to me:

def pipe1(self, target, *args, **kwargs):
    # DataFrame(...).pipe("data", sns.violinplot, "day", "tip")
    if callable(target):
        return target(self, *args, **kwargs)
    else:
        kwargs[target] = self
        func = args[0]
        args = args[1:]
        return func(*args, **kwargs)

def pipe2(self, target, *args, **kwargs):
    # DataFrame(...).pipe((sns.violinplot, "data"), "day", "tip")
    if isinstance(target, tuple):
        func, data_arg = target
        kwargs[data_arg] = self
        return func(*args, **kwargs)
    else:
        return target(self, *args, **kwargs)

@TomAugspurger (Contributor, Author)

Summary

There wasn't too much feedback from either python-ideas or r/python. The most common response was more or less "why are you using dunder methods?", followed by __pipe_func__ not really being pythonic / necessary.

We're also attached to keeping .pipe itself clean and simple. This is our concern with the more complex implementations, e.g. here.
The end goal is to have other libraries adopt this where appropriate, and complexity hinders that.

There's some worry that the data being piped may be too broadly defined to be really useful as a protocol. Perhaps we should make it DataFrame specific. (@mrocklin was this your intended meaning?)

Tasks

These are all intertwined

  • Decide whether we really are implementing and promoting an ad-hoc protocol, or just implementing a useful method on DataFrame
  • Decide whether the benefits of __pipe_arg__ outweigh its un-pythonicness
  • If we drop __pipe_arg__ do we have users write lambdas to shuffle args around, or accept tuples of (function, target)

@shoyer (Member) commented Jun 3, 2015

As @TomAugspurger describes in the PR (#10253), in the dev meeting we settled on @mwaskom's tuple proposal:

def pipe(self, target, *args, **kwargs):
    # DataFrame(...).pipe((sns.violinplot, "data"), "day", "tip")
    if isinstance(target, tuple):
        func, data_arg = target
        kwargs[data_arg] = self
        return func(*args, **kwargs)
    else:
        return target(self, *args, **kwargs)

There's no magic, and it leaves the door open to future extensions. Please speak up if you have any further concerns.
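Usage under the adopted form, as a sketch (df, f, g, and h as in the earlier examples; tips-style column names assumed):

import seaborn as sns

# tuple form: the frame is injected into the named keyword argument
df.pipe((sns.violinplot, 'data'), x='day', y='tip')

# plain-callable form: the frame is passed as the first positional argument
df.pipe(h).pipe(g, arg1=1).pipe(f, arg2=2, arg3=3)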

@tacaswell (Contributor)

I think this is a missed opportunity to make it easy to export data in a DataFrame to non-pandas aware libraries.

@TomAugspurger (Contributor, Author)

@tacaswell I also think that the simplicity of the implementation in #10253 is crucial.

And the boat hasn't entirely sailed on something like what you proposed with a _pipe_target specifying how self should be unpacked. I'd like to see what people do with the simpler implementation first.

Thanks everyone for the feedback and ideas (and civil discourse! <3 PyData)

@tacaswell (Contributor)

@TomAugspurger Fair enough, I do see the case for keeping it as simple as possible.

As linked above, I have versions of the more verbose code in a PR against mpl, which can also serve as a guinea pig (and any feedback from the pandas folks on what I did wrong would be helpful).
