Add "tee-like" function for easy introspection of intermediate steps in method chaining #25072

AllanLRH · 2019-02-01T12:06:04Z

Enhancement proposal: Add method to ease introspection of intermediate variables in chained methods

Method chaining allows us to compose easily readable "stories" of data processing steps without having to come up with names for what are often temporary variables, but it comes at the expense of introspection of these intermediate variables... much like the Unix tee command, which prints what's piped into it while also piping its input to the next function.

The simplest way to achieve this is AFAIK to do:

def tee(df):
    print(df)
    return df

# Apply method_1 to df, print the result, and apply method_2 to the output of df.method_1().
result = (df.method_1()
            .pipe(tee)
            .method_2()
         )

The line with the tee-function could simply be commented out using a single character, instead of having to comment out the method_2()-line and move the closing parenthesis (and likely also commenting out a method_3(), method_4(), ..., method_n() ).

But being able to omit the definition of tee and replace .pipe(tee) with .tee() would be preferable, and something which I think will help users adopt method chaining.
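
As a concrete, runnable version of the `.pipe(tee)` pattern above, here is a minimal sketch on a toy DataFrame. The column names and the `assign`/`query` steps are placeholders chosen for illustration; they only stand in for `method_1` and `method_2`.

```python
import pandas as pd


def tee(df):
    """Print the intermediate DataFrame and pass it through unchanged."""
    print(df)
    return df


# Toy data; the columns are illustrative assumptions, not part of the proposal.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

result = (
    df.assign(c=lambda d: d["a"] + d["b"])  # stands in for method_1
    .pipe(tee)  # comment out this line to silence the output
    .query("c > 11")  # stands in for method_2
)
```

Because `tee` returns the same object it receives, removing the `.pipe(tee)` line does not change `result`.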

tee could be more advanced, perhaps something like

import pandas as pd

def tee(df, head=0, tail=0, sample=0, title=None, display_function=print):
    """
    Display a subset of df, and return the df input unmodified.
    If head, tail and sample are all 0, the default is head=10.
    The title keyword sets a string to be displayed before the df
    is displayed.
    display_function is the function to use for displaying df,
    and defaults to the print function.
    """
    concat_list = list()
    if head == tail == sample == 0:
        head = 10
    if head:
        concat_list.append(df.head(head))
    if sample:
        concat_list.append(df.sample(sample))
    if tail:
        concat_list.append(df.tail(tail))
    to_display = pd.concat(concat_list)
    if title:
        display_function(title)
    display_function(to_display)
    return df

I suspect that the method-chaining approach is well suited for parallelization, whether being done using Dask, a future version of Pandas or an interface to the Arrow-project.
I also suspect that asking for intermediate values in a distributed system might slow it down considerably, and that ordering of the data could be hard to maintain, so compatibility with a future distributed system might be something to consider if a tee-like function is implemented as part of Pandas.


WillAyd · 2019-02-03T01:27:47Z

Not having used the Unix command before, are there practical applications to this outside of printing?

gfyoung · 2019-02-07T01:51:37Z

I'm +0.5 on this. I understand the appeal, but print statements can indeed hamper performance.

cc @jreback

jreback · 2019-02-07T02:54:44Z

This does look useful, but the thing is i think you would need some sort of option to turn this on / off.

maybe better to just implement this as a context manager around a block of code

gfyoung · 2019-02-07T05:15:23Z

> This does look useful, but the thing is i think you would need some sort of option to turn this on / off.

I'm not sure I follow. Doesn't choosing to pipe it through tee or not count as "turning on / off"?

jreback · 2019-02-07T11:56:39Z

once you write a tee that is turning it on
but if i take the code and go to production i would want to control this, similar to how logging works
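
One way to read the "similar to how logging works" idea is sketched below: a tee that writes through the standard `logging` module, so that the usual level configuration switches it on or off. The names `log_tee` and `pipeline_debug` are hypothetical and not part of pandas.

```python
import logging

import pandas as pd

logger = logging.getLogger("pipeline_debug")  # hypothetical logger name


def log_tee(df, level=logging.DEBUG, n=5):
    """Log the first n rows of df at the given level, then return df unchanged."""
    # isEnabledFor skips the head() call entirely when the level is disabled.
    if logger.isEnabledFor(level):
        logger.log(level, "intermediate frame, shape=%s\n%s", df.shape, df.head(n))
    return df


df = pd.DataFrame({"a": [1, 2, 3]})

logging.basicConfig()  # attach a stderr handler to the root logger
logger.setLevel(logging.DEBUG)  # switch tee output on; use WARNING to switch it off

result = df.pipe(log_tee).assign(b=lambda d: d["a"] * 2)
```

With the logger left at its default effective level (WARNING), `log_tee` becomes a cheap pass-through, so the chain would not need to be edited before going to production.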

AllanLRH · 2019-02-10T16:05:06Z

> I'm +0.5 on this. I understand the appeal, but print statements can indeed hamper performance.

It was not intended for production code, but as a debugging tool. And when it comes to debugging code, a simple print can sometimes be just what you need, since it doesn't require you to set up a logging system to do a quick bit of debugging.

> This does look useful, but the thing is i think you would need some sort of option to turn this on / off.
>
> maybe better to just implement this as a context manager around a block of code

I do see the value in being able to turn it on or off, but writing it as a context manager requires reformatting of the code, rather than just adding a single line, so I personally prefer the simpler approach of commenting out the tee-line.
On the other hand, integrating with logging seems like a great idea, though I think that the option to use print should still exist, and be easy and fast to choose.

I suspect that Pandas is used for a lot of exploratory data analysis, and for that use case I think flexibility and ease of debugging trump performance and scalability.
There is also the possibility that the (default) behaviour is set using pd.set_option, which already controls some printing and layout options, but if a tee-like function should also be able to interface with the logging module, that might change the behaviour of Pandas too drastically... what do you think?

Also for the sake of discussion, do you agree that the pipelines could use some better debugging tools?
It is not like you can't debug them, but it could be easier, and the current difficulty might discourage users from adopting pipelines.

And just a new thought, could pandas have a breakpoint()-like function for making introspection easier with a debugger on Python 3.7+ (see PEP 553)?
Something simple like

def pandas_breakpoint(df):
    breakpoint()
    return df

That would make it easier to step through parts of a pipeline, without dealing with the guts of Pandas.

AllanLRH · 2019-02-26T12:52:07Z

Any additional thoughts on this (I closed the issue accidentally at some point)?

It is something that I would like to make a pull request for, and since it is new functionality, I suspect that it wouldn't break existing code which depends on Pandas. I am also aware that it adds features to Pandas, which will increase the burden of maintaining the project, and as such, the maintainers might choose to reject the proposal.

@jreback
@wesm (hope it's all right to ping you here; see the bottom paragraph of the initial post)

TomAugspurger · 2019-02-26T13:00:36Z

It's basically unmaintained by me right now, but I wrote a package engarde that tried to explore this area. I think it's an important issue with method chaining.

FWIW, I didn't find that I needed a new top-level method on the dataframe; pipe was sufficient for inspecting DataFrame methods, and I preferred decorating my own functions that took and returned data frames (search for "costs" here).
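
A rough sketch of the decorator style described above, for functions that take and return a DataFrame. The decorator `log_shape` and the two step functions are hypothetical and are not taken from engarde.

```python
import functools

import pandas as pd


def log_shape(func):
    """Hypothetical decorator: report how a step changes the frame's shape."""

    @functools.wraps(func)
    def wrapper(df, *args, **kwargs):
        result = func(df, *args, **kwargs)
        print(f"{func.__name__}: {df.shape} -> {result.shape}")
        return result

    return wrapper


@log_shape
def drop_missing(df):
    return df.dropna()


@log_shape
def add_total(df):
    return df.assign(total=df["x"] + df["y"])


df = pd.DataFrame({"x": [1.0, None, 3.0], "y": [4.0, 5.0, 6.0]})
result = df.pipe(drop_missing).pipe(add_total)
```

Here the inspection lives with the step functions rather than in the chain, so the chain itself stays free of debugging lines.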

ianliu · 2022-08-11T15:12:33Z

I was searching for a tee command in pandas and stumbled on this issue, and I think I can add more use cases than printing.

Suppose I want to write a pipe that saves several slices of a dataframe, like so:

df = ...

df_a = df[["a"]]
df_a.to_parquet(...)

df_bsum = df.groupby("b").sum()
print(df_bsum) # inspects intermediate value
df_bsum.to_parquet(...)

This could be written into a pipeline:

(df.tee(lambda df: df[["a"]].to_parquet(...))
   .tee(lambda df: (
            df.groupby("b").sum()
              .tee(print) # inspects intermediate value
              .to_parquet(...)))
)

Right now, we can emulate this with .pipe(tee(fn)), where tee is

def tee(fn):
    def g(df, *args, **kwargs):
        fn(df, *args, **kwargs)
        return df
    return g
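
A small runnable sketch of that emulation follows. To keep it self-contained, the side effects store slices in a dict named `saved` (an assumption for illustration) instead of calling `to_parquet`.

```python
import pandas as pd


def tee(fn):
    """Wrap fn so it runs for its side effect and the input df is passed on."""
    def g(df, *args, **kwargs):
        fn(df, *args, **kwargs)
        return df
    return g


saved = {}  # stands in for files on disk; a real pipeline might write parquet


def save_a(d):
    saved["a"] = d[["a"]]


def save_bsum(d):
    # Nested tee: print the grouped frame, then keep using it.
    saved["bsum"] = d.groupby("b").sum().pipe(tee(print))


df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "x"]})

result = df.pipe(tee(save_a)).pipe(tee(save_bsum))
```

Each `tee(...)` step discards the return value of the wrapped function, so `result` is still the original `df` after both side effects have run.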

gfyoung added the Enhancement and Needs Discussion (requires discussion from core team before further action) labels Feb 7, 2019

AllanLRH closed this as completed Feb 11, 2019

AllanLRH reopened this Feb 12, 2019
