Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REFACTOR-#6939: Make 'modin.pandas.DataFrame._to_pandas' a public method #6940

Merged
merged 5 commits into from
Feb 16, 2024

Conversation

dchigarev
Copy link
Collaborator

@dchigarev dchigarev commented Feb 16, 2024

What do these changes do?

  • first commit message and PR title follow format outlined here

    NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.

  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves Make modin.pandas.DataFrame._to_pandas a public method #6939
  • tests added and passing
  • module layout described at docs/development/architecture.rst is up-to-date

… a public method

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
@dchigarev dchigarev marked this pull request as ready for review February 16, 2024 15:45
anmyachev
anmyachev previously approved these changes Feb 16, 2024
Copy link
Collaborator

@anmyachev anmyachev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

modin/utils.py Outdated
if (
isinstance(getattr(result, "name", None), str)
isinstance(obj, BaseQueryCompiler)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this additional condition?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously, this logic was only applied to objects having public .to_pandas() method (query compilers, low-level modin frames), so I thought that putting this condition is necessary. But on the second though, it seems that this logic works just fine for high-level dataframes as well, so removed the condition in the last commit

YarShev
YarShev previously approved these changes Feb 16, 2024
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
@dchigarev dchigarev dismissed stale reviews from YarShev and anmyachev via f36c49e February 16, 2024 17:00
@YarShev YarShev merged commit d020aac into modin-project:master Feb 16, 2024
37 checks passed
@anmyachev
Copy link
Collaborator

@dchigarev @YarShev it seems to me that we forgot that there is a special namespace for functionality that is not in Pandas - DataFrame.modin. I am in favor of moving to_pandas to this namespace.

@YarShev
Copy link
Collaborator

YarShev commented Mar 5, 2024

@anmyachev, that was done intentionally. We wanted to simplify conversion of a Modin DataFrame to the other data format by adding methods to the DataFrame object itself. We considered that would not break the statement: Modin is a drop-in replacement for pandas, but at the same time Modin may have additional API that doesn't exist in pandas. DataFrame.modin in turn is for experimental functionality.

cc @dchigarev

@anmyachev
Copy link
Collaborator

anmyachev commented Mar 5, 2024

DataFrame.modin in turn is for experimental functionality.

@YarShev by expanding the scope of modin namespace (including also methods to_pandas and to_ray_dataset, that is, not only for experimental functions), we will make it easier for the user to identify methods in the code that do not exist in pandas. Using methods this way: dataframe.modin.to_pandas is still easy.

@YarShev
Copy link
Collaborator

YarShev commented Mar 5, 2024

We should probably have more people to achieve an agreement on this matter before releasing 0.28.0.

cc @sfc-gh-dpetersohn, @mvashishtha

@dchigarev
Copy link
Collaborator Author

dchigarev commented Mar 5, 2024

I don't mind copying the .to_pandas() method into the df.modin accessor, but I'd like to keep the IO methods in the DataFrame namespace as well:

# I want both of these calls to work:
df.to_pandas()
df.modin.to_pandas()

Why:
Converting to objects of other types via df.to_something() is a familiar way for all users, in almost any dataframe library a user can call df.to_pandas() without looking at the documentation and get what they expect (pyarrow, polars, vaex, ray dataset, cudf, etc). By hiding .to_pandas() in the .modin accessor, we create unnecessary complexity and force the users to go into our not-so-convenient documentation to find the answer to how to convert modin_df to pandas_df (most likely they'll google the question and end-up using the private ._to_pandas() method combining disadvantages of both approaches: they won't use .modin accessor & we forced them to google for a solution)

Why I'm not convinced with:

we will make it easier for the user to identify methods in the code that do not exist in pandas

In a scenario where users may need to_pandas()/to_ray() methods, the user is already using a mix of libraries in their workload and most likely keeps track of which object they use where, so they don't care so much about this df.modin accessor.

Once again, I'm not against hiding the rest of the pandas API extensions in the .modin accessor, because they can really create confusion about which API is being used where when changing modin imports to pandas and vice versa. But for IO methods, I don't see how this would create a problem.

@anmyachev
Copy link
Collaborator

force the users to go into our not-so-convenient documentation to find the answer to how to convert modin_df to pandas_df (most likely they'll google the question and end-up using the private ._to_pandas() method combining disadvantages of both approaches: they won't use .modin accessor & we forced them to google for a solution)

We can add a link to an additional interface almost at the very beginning of the documentation; it is difficult to imagine a situation where even the readme is not read when working with the new library.

What is Modin?
Modin is a drop-in replacement for [pandas](https://github.com/pandas-dev/pandas) [and extended API for for convenience - URL].

We can also add a new method reference in to _to_pandas method.

@sfc-gh-dpetersohn
Copy link
Contributor

Originally, we wanted to be able to go from and to pandas with the import change. If someone writes a workflow in Modin, it can also work in pandas by design.

If we support APIs like this at the top level, it is not easy for users to find these types of APIs and change them (especially if we add more). Putting this under the df.modin.to_pandas namespace should help, since users will know where to make the changes.

@YarShev
Copy link
Collaborator

YarShev commented Mar 5, 2024

I suppose we will proceed with df.modin.to_pandas. @sfc-gh-dpetersohn, thanks for commenting this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make modin.pandas.DataFrame._to_pandas a public method
4 participants