Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DOCS-#6871: Update Modin on Ray cluster tutorial #6872

Merged
merged 12 commits into from Feb 19, 2024

Conversation

Retribution98
Copy link
Collaborator

What do these changes do?

  • first commit message and PR title follow format outlined here

    NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.

  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves Update Modin on Ray cluster tutorial #6871
  • tests added and passing
  • module layout described at docs/development/architecture.rst is up-to-date


**This exercise instructs the user on how to start a 700+ core cluster, and it is not shut down until the end of Exercise 6. Read instructions carefully.**

Often in practice we have a need to exceed the capabilities of a single machine. Modin works and performs well in both local mode and in a cluster environment. The key advantage of Modin is that your notebook does not change between local development and cluster execution. Users are not required to think about how many workers exist or how to distribute and partition their data; Modin handles all of this seamlessly and transparently.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Often in practice we have a need to exceed the capabilities of a single machine. Modin works and performs well in both local mode and in a cluster environment. The key advantage of Modin is that your notebook does not change between local development and cluster execution. Users are not required to think about how many workers exist or how to distribute and partition their data; Modin handles all of this seamlessly and transparently.
Often in practice we have a need to exceed the capabilities of a single machine.
Modin works and performs well in both local mode and in a cluster environment.
The key advantage of Modin is that your notebook does not change between
local development and cluster execution. Users are not required to think about
how many workers exist or how to distribute and partition their data;
Modin handles all of this seamlessly and transparently.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The key advantage of Modin is that your notebook does not change between
local development and cluster execution.

To live up to this statement, it would be better to use a notebook for the tutorial rather than a python script. @Retribution98 could you describe in more detail what the problem was when using the notebook to launch?

@RehanSD maybe you can give us some advice, since you are one of the last authors of the notebook.

@Retribution98 what effort will it require from you to run benchmarking on a 120 GB dataset as it was originally?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@anmyachev Unfortunately, we cannot use Modin on Ray cluster from Jupyter Notebook without using Ray Client. I also thought about running a Jupyter server on the cluster head node, but I think this is not the best idea. Do you agree with me?
You can suggest using large files, but this will take a lot of time and, I think, it will not be convenient for the user.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, we cannot use Modin on Ray cluster from Jupyter Notebook without using Ray Client.

ok

I also thought about running a Jupyter server on the cluster head node, but I think this is not the best idea.

@Retribution98 why?

You can suggest using large files, but this will take a lot of time and, I think, it will not be convenient for the user.

We can add a note that this example takes a long time to complete, and indicate a place where the amount of data can be reduced (preferably a one-line change) by user.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@anmyachev Deploying a Jupyter notebook is quite a complex task and requires a separate tutorial. It can be added later, but for now I just added a note about this feature.

@@ -0,0 +1,41 @@
import os
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should rename this file to cluster_exercise.py?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the exercise number can help users understand in what order the exercises should be done because some of the tips from the tutorial for a single node can be useful for a cluster as well.

docs/getting_started/using_modin/using_modin_cluster.rst Outdated Show resolved Hide resolved
.. _`Ray's cluster docs`: https://docs.ray.io/en/latest/cluster/getting-started.html
.. _`NYC Taxi dataset`: https://modin-datasets.intel.com/testing/yellow_tripdata_2015-01.csv
.. _`Modin's cluster setup config`: https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/execution/pandas_on_ray/cluster/modin-cluster.yaml
Using Modin
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should make the following structure to avoid duplication in the folder and file names?

./using_modin
    index.rst
    ./local
        index.rst
        modin_on_*.rst
    ./cluster
        index.rst
        modin_on_*.rst

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can, but I'm not sure we should because the URL paths will be changed and some references to Modin docs may be broken.

docs/getting_started/using_modin/using_modin_cluster.rst Outdated Show resolved Hide resolved
@@ -0,0 +1,124 @@
![LOGO](../../../img/MODIN_ver2_hrz.png)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we avoid text duplication we have here and in rst file? Can we just refer to rst doc here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid not. Before this PR, we had a Jupyter notebook that had the same content as the RST documentation. We can leave only the RST file, but then we should delete the tutorial folder altogether.

Co-authored-by: Iaroslav Igoshev <Poolliver868@mail.ru>
Retribution98 and others added 2 commits February 14, 2024 17:19
Co-authored-by: Iaroslav Igoshev <Poolliver868@mail.ru>
docs/getting_started/using_modin/using_modin_cluster.rst Outdated Show resolved Hide resolved
docs/getting_started/using_modin/using_modin_cluster.rst Outdated Show resolved Hide resolved
docs/getting_started/using_modin/using_modin_cluster.rst Outdated Show resolved Hide resolved

%%time
apply_result = df.map(str)
In this tutorial, we provide the `exercise_5.py`_ script, which reads the data from the
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The link to exercise_5.py doesn't work in the docs. Recheck.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is because this PR hasn't yet been merged in the master, but there is the link to the actual version in the master.

docs/getting_started/using_modin/using_modin_cluster.rst Outdated Show resolved Hide resolved
docs/getting_started/using_modin/using_modin_cluster.rst Outdated Show resolved Hide resolved
docs/getting_started/using_modin/using_modin_cluster.rst Outdated Show resolved Hide resolved
Copy link
Collaborator

@YarShev YarShev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, LGTM! @anmyachev, any other comments?

docs/getting_started/using_modin/using_modin_cluster.rst Outdated Show resolved Hide resolved
%%time
groupby_result = df.groupby("passenger_count").count()
.. note::
Some Dataframe functions are executed asynchronously, so to correctly measure execution time
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not relevant anymore.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, this note has removed

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still see the note.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still see the note.

Please check it again, I repushed my changes.

Co-authored-by: Iaroslav Igoshev <Poolliver868@mail.ru>
YarShev
YarShev previously approved these changes Feb 19, 2024
anmyachev
anmyachev previously approved these changes Feb 19, 2024
Copy link
Collaborator

@anmyachev anmyachev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@YarShev YarShev merged commit 3615811 into modin-project:master Feb 19, 2024
10 checks passed
@Retribution98 Retribution98 deleted the jupyter_ray_cluster branch February 19, 2024 14:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Update Modin on Ray cluster tutorial
3 participants