Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DOCS-#6871: Update Modin on Ray cluster tutorial #6872

Merged
merged 12 commits into from
Feb 19, 2024
168 changes: 86 additions & 82 deletions docs/getting_started/using_modin/using_modin_cluster.rst
Original file line number Diff line number Diff line change
@@ -1,132 +1,136 @@
========================
Using Modin in a Cluster
========================

.. note::
| *Estimated Reading Time: 15 minutes*
| You can follow along in a Jupyter notebook in this two-part tutorial: `Part 1`_, `Part 2`_.

Often in practice we have a need to exceed the capabilities of a single machine. Modin
works and performs well in both local mode and in a cluster environment. The key
advantage of Modin is that your notebook does not change between local development and
cluster execution. Users are not required to think about how many workers exist or how
to distribute and partition their data; Modin handles all of this seamlessly and
transparently.
Often in practice we have a need to exceed the capabilities of a single machine.
Modin works and performs well in both local mode and in a cluster environment.
The key advantage of Modin is that your python code does not change between
local development and cluster execution. Users are not required to think about
how many workers exist or how to distribute and partition their data;
Modin handles all of this seamlessly and transparently.

.. note::
It is possible to use a Jupyter notebook, but you will have to deploy a Jupyter server
on the remote cluster head node and connect to it.

Starting up a Ray Cluster
-------------------------
Modin is able to utilize Ray's built-in autoscaled cluster. To launch a Ray cluster
using Amazon Web Service (AWS), you can use `Modin's cluster setup config`_
(`Ray's autoscaler options`_).
.. image:: ../../img/modin_cluster.png
:alt: Modin cluster
:align: center

Extra requirements for AWS authentication
-----------------------------------------

First of all, install the necessary dependencies in your environment:

.. code-block:: bash

pip install boto3
aws configure

To start up the Ray cluster, run the following command in your terminal:
The next step is to setup your AWS credentials. One can set ``AWS_ACCESS_KEY_ID``,
``AWS_SECRET_ACCESS_KEY`` and ``AWS_SESSION_TOKEN`` (Optional)
(refer to `AWS CLI environment variables`_ to get more insight on this) or
just run the following command:

.. code-block:: bash

ray up modin-cluster.yaml
aws configure

This configuration script starts 1 head node (m5.24xlarge) and 7 workers (m5.24xlarge),
768 total CPUs. For more information on how to launch a Ray cluster across different
cloud providers or on-premise, you can also refer to the `Ray's cluster docs`_.
Starting and connecting to the cluster
--------------------------------------

.. note::
By default, Modin on Ray uses 60% of the system memory. It is recommended to use the same
amount, when using your own cluster (for each node).
This example starts 1 head node (m5.24xlarge) and 5 worker nodes (m5.24xlarge), 576 total CPUs.
You can check the `Amazon EC2 pricing`_ page.

It is possble to manually create AWS EC2 instances and configure them or just use the `Ray CLI`_ to
create and initialize a Ray cluster on AWS using `Modin's Ray cluster setup config`_,
which we are going to utilize in this example.
Refer to `Ray's autoscaler options`_ page on how to modify the file.

More details on how to launch a Ray cluster can be found on `Ray's cluster docs`_.

Connecting to a Ray Cluster
---------------------------
To start up the Ray cluster, run the following command in your terminal:

.. code-block:: bash

To connect to the Ray cluster, run the following command in your terminal:
ray up modin-cluster.yaml

Once the head node has completed initialization, you can optionally connect to it by running the following command.

.. code-block:: bash

ray attach modin-cluster.yaml

The following code checks that the Ray cluster is properly configured and attached to
Modin:
To exit the ssh session and return back into your local shell session, type:

.. code-block:: python
.. code-block:: bash

import ray
ray.init(address="auto")
from modin.config import NPartitions
assert NPartitions.get() == 768, "Not all Ray nodes are started up yet"
ray.shutdown()
exit

Congratualions! You have successfully connected to the Ray cluster.
Executing in a cluster environment
----------------------------------

.. note::
Be careful when using the Ray client to connect to a remote cluster.
This connection mode may not work. Known bugs:
Be careful when using the `Ray client`_ to connect to a remote cluster.
We don't recommend this connection mode, beacuse it may not work. Known bugs:
- https://github.com/ray-project/ray/issues/38713,
- https://github.com/modin-project/modin/issues/6641.

Using Modin on a Ray Cluster
----------------------------

Now that we have a Ray cluster up and running, we can use Modin to perform pandas
operation as if we were working with pandas on a single machine. We test Modin's
performance on the 200MB `NYC Taxi dataset`_ that was provided as part of our
`Modin's cluster setup config`_. We can time the following operation in a Jupyter
notebook:
Modin lets you instantly speed up your workflows with a large data by scaling pandas
on a cluster. In this tutorial, we will use a 12.5 GB ``big_yellow.csv`` file that was
created by concatenating a 200MB `NYC Taxi dataset`_ file 64 times. Preparing this
file was provided as part of our `Modin's Ray cluster setup config`_.

.. code-block:: python
If you want to use the other dataset, you should provide it to each of
the cluster nodes with the same path. We recomnend doing this by customizing the
``setup_commands`` section of the `Modin's Ray cluster setup config`_.

%%time
df = pd.read_csv("big_yellow.csv", quoting=3)
To run any script in a remote cluster, you need to submit it to the Ray. In this way,
the script file is sent to the the remote cluster head node and executed there.

%%time
count_result = df.count()
In this tutorial, we provide the `exercise_5.py`_ script, which reads the data from the
CSV file and executes such pandas operations as count, groupby and map.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The link to exercise_5.py doesn't work in the docs. Recheck.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is because this PR hasn't yet been merged in the master, but there is the link to the actual version in the master.

As a result of the script, you will see the size of the file being read and the execution
time of each function.

Retribution98 marked this conversation as resolved.
Show resolved Hide resolved
%%time
groupby_result = df.groupby("passenger_count").count()
.. note::
Some Dataframe functions are executed asynchronously, so to correctly measure execution time
of each function we need to wait for the execution result. We use the special ``execute`` function for this,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not relevant anymore.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, this note has removed

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still see the note.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still see the note.

Please check it again, I repushed my changes.

but you shouldn't use it in a real case scenario.

%%time
apply_result = df.map(str)
You can submit this script to the existing remote cluster by running the following command.

.. note::
When using local paths, make sure that they are available on all nodes in the
cluster, for example using distributed file system like NFS.
.. code-block:: bash

Modin performance scales as the number of nodes and cores increases. The following
chart shows the performance of the above operations with 2, 4, and 8 nodes, with
improvements in performance as we increase the number of resources Modin can use.
ray submit modin-cluster.yaml exercise_5.py

YarShev marked this conversation as resolved.
Show resolved Hide resolved
.. image:: ../../../examples/tutorial/jupyter/img/modin_cluster_perf.png
:alt: Cluster Performance
:align: center
:scale: 90%
To download or upload files to the cluster head node, use ``ray rsync_down`` or ``ray rsync_up``.
It may help if you want to use some other Python modules that should be available to
execute your own script or download a result file after executing the script.

Advanced: Configuring your Ray Environment
------------------------------------------
.. code-block:: bash

In some cases, it may be useful to customize your Ray environment. Below, we have listed
a few ways you can solve common problems in data management with Modin by customizing
your Ray environment. It is possible to use any of Ray's initialization parameters,
which are all found in `Ray's API docs`_.
# download a file from the cluster to the local machine:
ray rsync_down modin-cluster.yaml '/path/on/cluster' '/local/path'
# upload a file from the local machine to the cluster:
ray rsync_up modin-cluster.yaml '/local/path' '/path/on/cluster'

.. code-block:: python
Shutting down the cluster
--------------------------

import ray
ray.init()
import modin.pandas as pd
Now that we have finished the computation, we need to shut down the cluster with `ray down` command.

Modin will automatically connect to the Ray instance that is already running. This way,
you can customize your Ray environment for use in Modin!
.. code-block:: bash

ray down modin-cluster.yaml

.. _`DataFrame`: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html
.. _`pandas`: https://pandas.pydata.org/pandas-docs/stable/
.. _`open an issue`: https://github.com/modin-project/modin/issues
.. _`Ray's API docs`: https://ray.readthedocs.io/en/latest/api.html
.. _`Part 1`: https://github.com/modin-project/modin/tree/master/examples/tutorial/jupyter/execution/pandas_on_ray/cluster/exercise_5.ipynb
.. _`Part 2`: https://github.com/modin-project/modin/tree/master/examples/tutorial/jupyter/execution/pandas_on_ray/cluster/exercise_6.ipynb
.. _`Ray's autoscaler options`: https://docs.ray.io/en/latest/cluster/vms/references/ray-cluster-configuration.html#cluster-config
.. _`Ray's cluster docs`: https://docs.ray.io/en/latest/cluster/getting-started.html
.. _`NYC Taxi dataset`: https://modin-datasets.intel.com/testing/yellow_tripdata_2015-01.csv
.. _`Modin's cluster setup config`: https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/execution/pandas_on_ray/cluster/modin-cluster.yaml
.. _`Modin's Ray cluster setup config`: https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/execution/pandas_on_ray/cluster/modin-cluster.yaml
.. _`Amazon EC2 pricing`: https://aws.amazon.com/ec2/pricing/on-demand/
.. _`exercise_5.py`: https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/execution/pandas_on_ray/cluster/exercise_5.py
.. _`Ray client`: https://docs.ray.io/en/latest/cluster/running-applications/job-submission/ray-client.html
.. _`Ray CLI`: https://docs.ray.io/en/latest/cluster/vms/getting-started.html#running-applications-on-a-ray-cluster
.. _`AWS CLI environment variables`: https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-envvars.html
Original file line number Diff line number Diff line change
@@ -0,0 +1,39 @@
![LOGO](../../../img/MODIN_ver2_hrz.png)

<center>
<h1>Scale your pandas workflows on a Ray cluster</h2>
</center>

**NOTE**: Before starting the exercise, please read the full instructions in the
[Modin documenation](https://modin--6872.org.readthedocs.build/en/6872/getting_started/using_modin/using_modin_cluster.html).
Retribution98 marked this conversation as resolved.
Show resolved Hide resolved
YarShev marked this conversation as resolved.
Show resolved Hide resolved

The basic steps to run the script on a remote Ray cluster are:

Step 1. Install the necessary dependencies

```bash
pip install boto3
```

Step 2. Setup your AWS credentials.

```bash
aws configure
```

Step 3. Modify configuration file and start up the Ray cluster.

```bash
ray up modin-cluster.yaml
```

Step 4. Submit your script to the remote cluster.

```bash
ray submit modin-cluster.yaml exercise_5.py
```

Step 5. Shut down the Ray remote cluster.

```bash
ray down