ENH: DataFrame.to_expr() method #41837

Open
kerrickstaley opened this issue Jun 6, 2021 · 4 comments
Labels: DataFrame, Enhancement, IO Data, Needs Discussion


@kerrickstaley

Is your feature request related to a problem?

I would like to write some unit tests for my Pandas code. I want to test that some DataFrame is equal to an expected value. The expected value is complicated and I would like an easy way to get the Python code to construct it. My program already computes the expected DataFrame value but I need a way to serialize/deserialize it for use in my test code.

Here is a StackOverflow question with more detail.

Describe the solution you'd like

I would like there to be a DataFrame.to_expr() method. It should return a str containing valid Python code that can be used to re-construct the DataFrame.

To the greatest extent possible, pd.testing.assert_frame_equal(df1, eval(df2.to_expr())) should raise an AssertionError if and only if pd.testing.assert_frame_equal(df1, df2) raises an AssertionError. I am using assert_frame_equal because it checks column dtypes, whereas DataFrame.equals() does not.

Concretely, I think the return value of .to_expr() should be something like

pandas.DataFrame({'column_1': pandas.Series([1, 2, 3], dtype='int64'), 'column_2': pandas.Series([1.0, 2.0, 3.0], dtype='float64')})
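A minimal sketch of how such a string could be produced today with a free function. The name to_expr and its body are hypothetical, not an existing pandas API, and the sketch assumes a default RangeIndex, string column labels, and column values whose repr() can be passed to eval() (ints, non-NaN floats, strings, bools):

```python
import pandas  # the generated expression refers to the name `pandas`


def to_expr(df: pandas.DataFrame) -> str:
    """Hypothetical helper (not part of pandas): build an expression for `df`.

    Assumes a default RangeIndex and values with an eval()-able repr.
    """
    columns = []
    for name, series in df.items():
        values = series.tolist()  # NumPy scalars become plain Python objects
        columns.append(
            f"{name!r}: pandas.Series({values!r}, dtype={str(series.dtype)!r})"
        )
    return "pandas.DataFrame({" + ", ".join(columns) + "})"


df = pandas.DataFrame(
    {
        "column_1": pandas.Series([1, 2, 3], dtype="int64"),
        "column_2": pandas.Series([1.0, 2.0, 3.0], dtype="float64"),
    }
)

expr = to_expr(df)
print(expr)

# Round-trip check: rebuilding from the expression should compare equal,
# including column dtypes.
rebuilt = eval(expr)
pandas.testing.assert_frame_equal(df, rebuilt)
```

Writing each column as a Series with an explicit dtype is what lets an empty column keep its type. Custom indexes, NaN values, and extension dtypes would need extra handling that this sketch does not attempt.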

Note that for many Python objects, __repr__() already plays the role of this .to_expr() method. The Python docs state:

For many types, [__repr__] makes an attempt to return a string that would yield an object with the same value when passed to eval()...

However, DataFrame.__repr__ is already defined to print a different representation (which is arguably more useful in an interactive environment).

API breaking implications

This is a backwards-compatible change.

Describe alternatives you've considered

I've used DataFrame.to_dict() and DataFrame.from_dict() for this purpose in the past. However, this doesn't preserve the type, and so it doesn't work if you're working with an empty DataFrame. I also worry that from_dict will sometimes fail to infer the original type even for non-empty DataFrames.
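The empty-DataFrame case can be reproduced with a short script. This is an illustrative sketch; the column name "a" and the variable names are made up for the example:

```python
import pandas as pd

# An empty frame whose single column is explicitly int64.
empty = pd.DataFrame({"a": pd.Series([], dtype="int64")})

# Round-trip through a plain dict, which carries no dtype information.
round_tripped = pd.DataFrame.from_dict(empty.to_dict())

print(empty.dtypes["a"])          # int64
print(round_tripped.dtypes["a"])  # inferred from no data, so not int64

# assert_frame_equal notices the difference.
try:
    pd.testing.assert_frame_equal(empty, round_tripped)
    round_trip_ok = True
except AssertionError:
    round_trip_ok = False

print(round_trip_ok)
```

With no values to inspect, the constructor has nothing from which to infer the original dtype, so the comparison fails.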

kerrickstaley added the Enhancement and Needs Triage labels Jun 6, 2021
@mzeitlin11
Member

Thanks for the request @kerrickstaley! For an existing round-tripping option, does to_pickle work for your purposes? (or functions like to_csv if you want something more readable)

@kerrickstaley
Author

to_pickle works, but its downside is that it doesn't produce a human-readable test.
to_csv mostly works, but it doesn't work in the case where you're testing an empty dataframe, and maybe there are other cases where round-tripping through CSV doesn't preserve column types; I'm not sure.
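
A small sketch of the empty-DataFrame case with CSV (the column name "a" is made up for the example):

```python
import io

import pandas as pd

# An empty frame whose single column is explicitly int64.
empty = pd.DataFrame({"a": pd.Series([], dtype="int64")})

# With no rows, the CSV text is just the header line.
csv_text = empty.to_csv(index=False)

# Reading it back leaves read_csv nothing to infer the dtype from.
from_csv = pd.read_csv(io.StringIO(csv_text))

print(repr(csv_text))
print(empty.dtypes["a"], from_csv.dtypes["a"])
```

Passing dtype= to read_csv would restore the type, but then the test has to spell out the dtypes separately from the data.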

So I think there is still a use-case for a to_expr method.

@mzeitlin11
Member

Thanks for explaining your reasoning @kerrickstaley. I personally don't find this use case compelling enough to add a new to_* method (the API is huge already :)). It sounds better suited to its own package, where a bunch of different formats could be supported. But I'm curious whether others have thoughts or are interested in this feature.

mzeitlin11 added the DataFrame, IO Data, and Needs Discussion labels and removed the Needs Triage label Jul 1, 2021
@timlod
Contributor

timlod commented Jul 21, 2021

+1

I would find this very useful as well, for generating small test cases!
Our test cases usually use very small representable dataframes that we write manually. Right now I mainly print df.to_numpy(), and copy this into the pd.DataFrame constructor including the column names, but the process is a little tedious. to_expr() would greatly speed up this process - we can now craft the test cases interactively, and persist the input/expected dataframes directly into the test suite.
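
For reference, the manual workflow described above looks roughly like this. The frame and its column names are invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({"n": [1, 2, 3], "x": [0.5, 1.5, 2.5]})

# Step 1: print the values and the column names, then copy them by hand.
print(df.to_numpy().tolist())
print(list(df.columns))

# Step 2: the pasted literal, as it would appear in a test module.
expected = pd.DataFrame(
    [[1.0, 0.5], [2.0, 1.5], [3.0, 2.5]],
    columns=["n", "x"],
)

# Caveat: to_numpy() on mixed int/float columns yields one float64 array,
# so the integer column comes back as float unless the dtype is fixed by hand.
print(df.dtypes["n"], expected.dtypes["n"])
```

This shows one reason the copy-and-paste route is tedious: the per-column dtypes have to be restored manually after the values are pasted.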

If anyone else has a better solution, I'd also be interested to hear it.


3 participants