Add "tee-like" function for easy introspection of intermediate steps in method chaining #25072
Comments
Not having used the Unix command before, are there practical applications for this outside of printing?

I'm +0.5 on this. I understand the appeal, but cc @jreback

This does look useful, but I think you would need some sort of option to turn this on and off. Maybe better to just implement this as a context manager around a block of code.

I'm not sure I follow. Doesn't choosing to pipe it through

Once you write a tee, that is turning it on.

It was not intended for production code, but as a debugging tool. And when it comes to debugging code, a simple
I do see the value in being able to turn it on or off, but writing it as a context manager requires reformatting the code rather than just adding a single line, so I personally prefer the simpler approach of commenting out the tee line.
I suspect that Pandas is used for a lot of exploratory data analysis, and for that use case I think flexibility and ease of debugging trump performance and scalability.
Also, for the sake of discussion, do you agree that the pipelines could use some better debugging tools?
And just a new thought: could pandas have a breakpoint helper, something like the following?

def pandas_breakpoint(df):
    breakpoint()
    return df

That would make it easier to step through parts of a pipeline without dealing with the guts of Pandas.
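
A minimal sketch of how the breakpoint helper above could be switched off without editing the chain, which touches on the on/off concern raised earlier in the thread. `disable_breakpoints` is a hypothetical name introduced here for illustration; it relies only on the standard `sys.breakpointhook` mechanism that the built-in `breakpoint()` consults.

```python
import sys


def pandas_breakpoint(df):
    """Pause in the debugger, then hand the object on unchanged."""
    breakpoint()  # dispatches to sys.breakpointhook
    return df


def disable_breakpoints():
    """Hypothetical helper: make every breakpoint() call a no-op in this process."""
    sys.breakpointhook = lambda *args, **kwargs: None


# Demo on a plain list; any object works because the helper only passes it through.
disable_breakpoints()
data = [1, 2, 3]
assert pandas_breakpoint(data) is data  # no debugger prompt, same object returned
```

In a chain the helper would be used as `df.pipe(pandas_breakpoint)`. With the default hook left in place, setting the environment variable `PYTHONBREAKPOINT=0` should silence the call in the same way, with no code change at all.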
Any additional thoughts on this (I closed the issue accidentally at some point)? It is something that I would like to make a pull request for, and since it is new functionality, I suspect that it wouldn't break existing code that depends on Pandas, but I am also aware that it adds features to Pandas, which will increase the burden of maintaining the project, and as such, the maintainers might choose to reject the proposal. @jreback
It's basically unmaintained by me right now, but I wrote a package engarde that tried to explore this area. I think it's an important issue with method chaining. FWIW, I didn't find that I needed a new top-level method on the dataframe;
I was searching for a
Suppose I want to write a pipe that saves several slices of a dataframe, like so:

df = ...
df_a = df[["a"]]
df_a.to_parquet(...)
df_bsum = df.groupby("b").sum()
print(df_bsum)  # inspects intermediate value
df_bsum.to_parquet(...)

This could be written into a pipeline:

(df.tee(lambda df: df[["a"]].to_parquet(...))
   .tee(lambda df: (
       df.groupby("b").sum()
         .tee(print)  # inspects intermediate value
         .to_parquet(...))))

Right now, we can emulate this with

def tee(fn):
    def g(df, *args, **kwargs):
        fn(df, *args, **kwargs)
        return df
    return g
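
As a rough illustration of how the `tee(fn)` wrapper above plugs into `.pipe`, here is a runnable sketch. `Chainable` is a hypothetical stand-in for a DataFrame so the example runs without pandas installed; it only mimics the calling convention of `DataFrame.pipe`, which invokes `func(df, *args, **kwargs)`.

```python
class Chainable:
    """Hypothetical stand-in for a DataFrame; only `.pipe` is mimicked."""

    def __init__(self, rows):
        self.rows = rows

    def pipe(self, func, *args, **kwargs):
        # Same calling convention as DataFrame.pipe
        return func(self, *args, **kwargs)


def tee(fn):
    """Wrap fn so it runs for its side effect and the input is passed on unchanged."""
    def g(df, *args, **kwargs):
        fn(df, *args, **kwargs)
        return df
    return g


seen = []  # collects what each tee step observed
result = (
    Chainable([1, 2, 3])
    .pipe(tee(lambda d: seen.append(len(d.rows))))       # side effect only
    .pipe(lambda d: Chainable([r * 2 for r in d.rows]))  # real transformation
    .pipe(tee(lambda d: seen.append(sum(d.rows))))       # side effect only
)
```

The return value of the wrapped function is discarded, so a step such as `to_parquet(...)` or `print` can sit in the middle of a chain without ending it.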
Enhancement proposal: Add method to ease introspection of intermediate variables in chained methods
Method chaining allows us to compose easily readable "stories" of data processing steps without having to come up with names for what are often temporary variables, but it comes at the expense of introspection of these intermediate variables... much like the Unix tee command, which prints what's piped into it while also piping its input to the next function.
The simplest way to achieve this is AFAIK to do:
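
The snippet that originally followed is not preserved in this copy of the issue. Below is a minimal sketch of what such a helper might look like, assuming a `tee` that is passed directly to `.pipe`. `Steps` and its `method_1`/`method_2` are hypothetical placeholders standing in for a DataFrame and its methods, so the example runs without pandas.

```python
def tee(df):
    """Print the intermediate value, then return it unchanged for the next step."""
    print(df)
    return df


class Steps:
    """Hypothetical stand-in for a DataFrame, so the sketch runs without pandas."""

    def __init__(self, value):
        self.value = value

    def __repr__(self):
        return f"Steps({self.value!r})"

    def pipe(self, func, *args, **kwargs):
        # Same calling convention as DataFrame.pipe
        return func(self, *args, **kwargs)

    def method_1(self):
        return Steps(self.value + 1)

    def method_2(self):
        return Steps(self.value * 10)


result = (
    Steps(1)
    .method_1()
    .pipe(tee)  # prints Steps(2); comment out this single line to silence it
    .method_2()
)
```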
The line with the tee function could simply be commented out using a single character, instead of having to comment out the method_2() line and move the closing parenthesis (and likely also comment out method_3(), method_4(), ..., method_n()).
But being able to omit the definition of tee and replace .pipe(tee) with .tee() would be preferable, and something which I think will help users adopt method chaining.
tee could be more advanced, perhaps something like
I suspect that the method-chaining approach is well suited for parallelization, whether done using Dask, a future version of Pandas or an interface to the Arrow project.
I also suspect that asking for intermediate values in a distributed system might slow it down considerably, and that ordering of the data could be hard to maintain, so compatibility with a future distributed system might be something to consider if a tee-like function is implemented as part of Pandas.