Add "tee-like" function for easy introspection of intermediate steps in method chaining #25072

Open
AllanLRH opened this issue Feb 1, 2019 · 9 comments
Labels
Enhancement Needs Discussion Requires discussion from core team before further action

Comments

@AllanLRH

AllanLRH commented Feb 1, 2019

Enhancement proposal: Add method to ease introspection of intermediate variables in chained methods

Method chaining allows us to compose easily readable "stories" of data processing steps without having to come up with names for what are often temporary variables, but it comes at the expense of introspection of these intermediate variables... much like the Unix tee command, which prints what's piped into it while also piping its input to the next function.

pandas_tee

As far as I know, the simplest way to achieve this is:

def tee(df):
    print(df)
    return df

# Apply method_1 to df, print the result, and apply method_2 to the output of df.method_1().
result = (df.method_1()
            .pipe(tee)
            .method_2()
         )

The line with the tee function can be commented out with a single character, instead of having to comment out the method_2() line and move the closing parenthesis (and likely also comment out method_3(), method_4(), ..., method_n()).

But being able to omit the definition of tee and replace .pipe(tee) with .tee() would be preferable, and something which I think will help users adopt method chaining.
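Until something like `.tee()` exists, a custom accessor gets fairly close to the method syntax without patching pandas. The following is only a sketch: it uses `pd.api.extensions.register_dataframe_accessor`, and the accessor name `dbg` and its `tee` method are made-up names for illustration, not part of pandas.

```python
import pandas as pd


@pd.api.extensions.register_dataframe_accessor("dbg")  # "dbg" is a hypothetical name
class DebugAccessor:
    def __init__(self, pandas_obj):
        self._obj = pandas_obj

    def tee(self, display_function=print):
        """Show the frame, then hand it back unchanged so the chain continues."""
        display_function(self._obj)
        return self._obj


df = pd.DataFrame({"a": [3, 1, 2]})
seen = []  # collects what was displayed, instead of printing
result = (df.sort_values("a")
            .dbg.tee(display_function=seen.append)
            .reset_index(drop=True)
         )
```

With the default `display_function=print`, the chain would simply read `.dbg.tee()`.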

tee could be more advanced, perhaps something like

import pandas as pd

def tee(df, head=0, tail=0, sample=0, title=None, display_function=print):
    """
    Display a subset of df and return the input df unmodified.
    If head, tail and sample are all 0, the default is head=10.
    title is an optional string displayed before the subset.
    display_function is the function used for displaying,
    and defaults to print.
    """
    concat_list = []
    if head == tail == sample == 0:
        head = 10
    if head:
        concat_list.append(df.head(head))
    if sample:
        concat_list.append(df.sample(sample))
    if tail:
        concat_list.append(df.tail(tail))
    to_display = pd.concat(concat_list)
    if title:
        display_function(title)
    display_function(to_display)
    return df

I suspect that the method-chaining approach is well suited for parallelization, whether being done using Dask, a future version of Pandas or an interface to the Arrow-project.
I also suspect that asking for intermediate values in a distributed system might slow it down considerably, and that ordering of the data could be hard to maintain, so compatibility with a future distributed system might be something to consider if a tee-like function is implemented as part of Pandas.

@WillAyd
Member

WillAyd commented Feb 3, 2019

Not having used the Unix command before: are there practical applications of this beyond printing?

@gfyoung gfyoung added Enhancement Needs Discussion Requires discussion from core team before further action labels Feb 7, 2019
@gfyoung
Member

gfyoung commented Feb 7, 2019

I'm +0.5 on this. I understand the appeal, but print statements can indeed hamper performance.

cc @jreback

@jreback
Contributor

jreback commented Feb 7, 2019

This does look useful, but the thing is, I think you would need some sort of option to turn this on / off.

Maybe it would be better to just implement this as a context manager around a block of code.

@gfyoung
Member

gfyoung commented Feb 7, 2019

> This does look useful, but the thing is, I think you would need some sort of option to turn this on / off.

I'm not sure I follow. Doesn't choosing to pipe it through tee or not count as "turning on / off"?

@jreback
Contributor

jreback commented Feb 7, 2019

Once you write a tee, that is turning it on.
But if I take the code and go to production, I would want to control this, similar to how logging works.

@AllanLRH
Author

AllanLRH commented Feb 10, 2019

> I'm +0.5 on this. I understand the appeal, but print statements can indeed hamper performance.

It was not intended for production code, but as a debugging tool. And when it comes to debugging code, a simple print can sometimes be just what you need, since it doesn't require you to set up a logging system to do a quick bit of debugging.

> This does look useful, but the thing is, I think you would need some sort of option to turn this on / off.
>
> Maybe it would be better to just implement this as a context manager around a block of code.

I do see the value in being able to turn it on or off, but writing it as a context manager requires reformatting of the code, rather than just adding a single line, so I personally prefer the simpler approach of commenting out the tee-line.
On the other hand, integrating with logging seems like a great idea, though I think that the option to use print should still exist, and be easy and fast to choose.
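A logging-backed variant might look roughly like this. It is a sketch under assumptions: `log_tee`, the logger name `"pipeline"`, and the `ListHandler` helper are hypothetical names chosen for the example.

```python
import logging

import pandas as pd

logger = logging.getLogger("pipeline")  # hypothetical logger name


def log_tee(df, level=logging.DEBUG, title=None, n=5):
    """Log the first n rows at the given level and return df unchanged."""
    if logger.isEnabledFor(level):  # skip formatting entirely when the level is off
        prefix = f"{title}\n" if title else ""
        logger.log(level, "%s%s", prefix, df.head(n).to_string())
    return df


class ListHandler(logging.Handler):
    """Collects log messages so the example can be checked without a console."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


handler = ListHandler()
logger.addHandler(handler)
df = pd.DataFrame({"a": [1, 2, 3]})

logger.setLevel(logging.WARNING)   # "production": DEBUG output is suppressed
df.pipe(log_tee, title="after load")
logger.setLevel(logging.DEBUG)     # debugging session: output is enabled
out = df.pipe(log_tee, title="after load")
```

Passing `display_function=print` to a plain tee would remain the quick option; this version just moves the on/off decision into the logging configuration.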

I suspect that Pandas is used for a lot of exploratory data analysis, and for that use case I think flexibility and ease of debugging trump performance and scalability.
There is also the possibility that the (default) behaviour is set using pd.set_option, which already controls some printing and layout options, but if a tee-like function should also be able to interface with the logging module, that might change the behaviour of Pandas too drastically... what do you think?

Also for the sake of discussion, do you agree that the pipelines could use some better debugging tools?
It is not that you can't debug them, but it could be easier, and the current difficulty might discourage users from adopting pipelines.

And just a new thought: could pandas have a breakpoint()-like function for making introspection easier in a debugger with Python 3.7+ (see PEP 553)?
Something simple like

def pandas_breakpoint(df):
    breakpoint()
    return df

That would make it easier to step through parts of a pipeline, without dealing with the guts of Pandas.
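Per PEP 553, `breakpoint()` calls `sys.breakpointhook()`, and setting the environment variable `PYTHONBREAKPOINT=0` turns the default hook into a no-op, so such a helper can be switched off without editing the chain. The sketch below swaps in a recording hook in place of a real debugger so that it runs unattended; `fake_debugger` is a made-up name.

```python
import sys

import pandas as pd


def pandas_breakpoint(df):
    breakpoint()  # dispatches to sys.breakpointhook(); pdb by default
    return df


calls = []


def fake_debugger(*args, **kwargs):
    """Stand-in hook: records that the breakpoint was hit instead of opening pdb."""
    calls.append("hit")


original_hook = sys.breakpointhook
sys.breakpointhook = fake_debugger
try:
    df = pd.DataFrame({"a": [2, 1]})
    result = df.sort_values("a").pipe(pandas_breakpoint).reset_index(drop=True)
finally:
    sys.breakpointhook = original_hook  # restore the real hook
```

In an interactive session with the default hook, the debugger would stop inside `pandas_breakpoint`, where the intermediate frame is available as the local variable `df`.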

@AllanLRH AllanLRH reopened this Feb 12, 2019
@AllanLRH
Author

Any additional thoughts on this (I closed the issue accidentally at some point)?

It is something that I would like to make a pull request for, and since it is new functionality, I suspect that it wouldn't break existing code that depends on Pandas. I am also aware that it adds features to Pandas, which will increase the burden of maintaining the project, and as such, the maintainers might choose to reject the proposal.

@jreback
@wesm (hope it's all right to ping you here; see the bottom paragraph of the initial post)

@TomAugspurger
Contributor

It's basically unmaintained by me right now, but I wrote a package, engarde, that tried to explore this area. I think it's an important issue with method chaining.

FWIW, I didn't find that I needed a new top-level method on the dataframe; pipe was sufficient for inspecting DataFrame methods, and I preferred decorating my own functions that took and returned data frames (search for "costs" here).
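The decorator style mentioned here could look roughly like the following. To be clear, this is not engarde's API; `report_shape` is a hypothetical decorator written only to illustrate wrapping a function that takes and returns a data frame.

```python
import functools

import pandas as pd


def report_shape(display_function=print):
    """Hypothetical decorator: report the frame's shape before and after a step."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(df, *args, **kwargs):
            result = func(df, *args, **kwargs)
            display_function(f"{func.__name__}: {df.shape} -> {result.shape}")
            return result
        return wrapper
    return decorator


messages = []  # collects the reports instead of printing them


@report_shape(display_function=messages.append)
def drop_negative(df):
    return df[df["a"] >= 0]


df = pd.DataFrame({"a": [1, -1, 2]})
result = df.pipe(drop_negative)
```

The inspection lives on the step itself rather than in the chain, so the pipeline needs no extra `.pipe(tee)` lines.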

@ianliu

ianliu commented Aug 11, 2022

I was searching for a tee command in pandas and stumbled on this issue, and I think I can add more use cases than printing.

Suppose I want to write a pipe that saves several slices of a dataframe, like so:

df = ...

df_a = df[["a"]]
df_a.to_parquet(...)

df_bsum = df.groupby("b").sum()
print(df_bsum) # inspects intermediate value
df_bsum.to_parquet(...)

This could be written into a pipeline:

(df.tee(lambda df: df[["a"]].to_parquet(...))
   .tee(lambda df: (
            df.groupby("b").sum()
              .tee(print) # inspects intermediate value
              .to_parquet(...)))
)

Right now, we can emulate this with .pipe(tee(fn)), where tee is

def tee(fn):
    def g(df, *args, **kwargs):
        fn(df, *args, **kwargs)
        return df
    return g
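As a quick check that this wrapper behaves as described, here is a small runnable sketch. It stores the slices in a dictionary (`saved`, a stand-in introduced for the example) instead of calling `to_parquet(...)`, so it needs no files.

```python
import pandas as pd


def tee(fn):
    # Same wrapper as above: run fn for its side effect, pass df through.
    def g(df, *args, **kwargs):
        fn(df, *args, **kwargs)
        return df
    return g


saved = {}  # stands in for the to_parquet(...) calls

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "x"]})
result = (df.pipe(tee(lambda d: saved.update(a=d[["a"]])))
            .pipe(tee(lambda d: saved.update(bsum=d.groupby("b").sum())))
         )
```

Both side effects run, and the chain still ends with the original frame.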

6 participants