Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DOCS-#6871: Update Modin on Ray cluster tutorial #6872

Merged
merged 12 commits into from
Feb 19, 2024
136 changes: 8 additions & 128 deletions docs/getting_started/using_modin/using_modin_cluster.rst
Original file line number Diff line number Diff line change
@@ -1,132 +1,12 @@
========================
Using Modin in a Cluster
========================

.. note::
| *Estimated Reading Time: 15 minutes*
| You can follow along in a Jupyter notebook in this two-part tutorial: `Part 1`_, `Part 2`_.

Often in practice we have a need to exceed the capabilities of a single machine. Modin
works and performs well in both local mode and in a cluster environment. The key
advantage of Modin is that your notebook does not change between local development and
cluster execution. Users are not required to think about how many workers exist or how
to distribute and partition their data; Modin handles all of this seamlessly and
transparently.

Starting up a Ray Cluster
-------------------------
Modin is able to utilize Ray's built-in autoscaled cluster. To launch a Ray cluster
using Amazon Web Service (AWS), you can use `Modin's cluster setup config`_
(`Ray's autoscaler options`_).

.. code-block:: bash

pip install boto3
aws configure

To start up the Ray cluster, run the following command in your terminal:

.. code-block:: bash

ray up modin-cluster.yaml

This configuration script starts 1 head node (m5.24xlarge) and 7 workers (m5.24xlarge),
768 total CPUs. For more information on how to launch a Ray cluster across different
cloud providers or on-premise, you can also refer to the `Ray's cluster docs`_.

.. note::
By default, Modin on Ray uses 60% of the system memory. It is recommended to use the same
amount, when using your own cluster (for each node).

Connecting to a Ray Cluster
---------------------------

To connect to the Ray cluster, run the following command in your terminal:

.. code-block:: bash

ray attach modin-cluster.yaml

The following code checks that the Ray cluster is properly configured and attached to
Modin:

.. code-block:: python

import ray
ray.init(address="auto")
from modin.config import NPartitions
assert NPartitions.get() == 768, "Not all Ray nodes are started up yet"
ray.shutdown()

Congratualions! You have successfully connected to the Ray cluster.

.. note::
Be careful when using the Ray client to connect to a remote cluster.
This connection mode may not work. Known bugs:
- https://github.com/ray-project/ray/issues/38713,
- https://github.com/modin-project/modin/issues/6641.

Using Modin on a Ray Cluster
----------------------------

Now that we have a Ray cluster up and running, we can use Modin to perform pandas
operation as if we were working with pandas on a single machine. We test Modin's
performance on the 200MB `NYC Taxi dataset`_ that was provided as part of our
`Modin's cluster setup config`_. We can time the following operation in a Jupyter
notebook:

.. code-block:: python

%%time
df = pd.read_csv("big_yellow.csv", quoting=3)

%%time
count_result = df.count()

%%time
groupby_result = df.groupby("passenger_count").count()

%%time
apply_result = df.map(str)

.. note::
When using local paths, make sure that they are available on all nodes in the
cluster, for example using distributed file system like NFS.

Modin performance scales as the number of nodes and cores increases. The following
chart shows the performance of the above operations with 2, 4, and 8 nodes, with
improvements in performance as we increase the number of resources Modin can use.

.. image:: ../../../examples/tutorial/jupyter/img/modin_cluster_perf.png
:alt: Cluster Performance
:align: center
:scale: 90%

Advanced: Configuring your Ray Environment
------------------------------------------

In some cases, it may be useful to customize your Ray environment. Below, we have listed
a few ways you can solve common problems in data management with Modin by customizing
your Ray environment. It is possible to use any of Ray's initialization parameters,
which are all found in `Ray's API docs`_.

.. code-block:: python

import ray
ray.init()
import modin.pandas as pd

Modin will automatically connect to the Ray instance that is already running. This way,
you can customize your Ray environment for use in Modin!

In this section, we show how Modin can be used to accelerate your pandas workflows in a cluster.
Each Modin distributed engine has its own specifics regarding running and using a cluster so
you can choose one of the following instructions to suit the engine you are using.

.. _`DataFrame`: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html
.. _`pandas`: https://pandas.pydata.org/pandas-docs/stable/
.. _`open an issue`: https://github.com/modin-project/modin/issues
.. _`Ray's API docs`: https://ray.readthedocs.io/en/latest/api.html
.. _`Part 1`: https://github.com/modin-project/modin/tree/master/examples/tutorial/jupyter/execution/pandas_on_ray/cluster/exercise_5.ipynb
.. _`Part 2`: https://github.com/modin-project/modin/tree/master/examples/tutorial/jupyter/execution/pandas_on_ray/cluster/exercise_6.ipynb
.. _`Ray's autoscaler options`: https://docs.ray.io/en/latest/cluster/vms/references/ray-cluster-configuration.html#cluster-config
.. _`Ray's cluster docs`: https://docs.ray.io/en/latest/cluster/getting-started.html
.. _`NYC Taxi dataset`: https://modin-datasets.intel.com/testing/yellow_tripdata_2015-01.csv
.. _`Modin's cluster setup config`: https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/execution/pandas_on_ray/cluster/modin-cluster.yaml
.. toctree::
:maxdepth: 4

using_modin_cluster/using_modin_ray_cluster

Original file line number Diff line number Diff line change
@@ -0,0 +1,132 @@
Using Modin on Ray in a Cluster
===============================

.. note::
| *Estimated Reading Time: 15 minutes*

Often in practice we have a need to exceed the capabilities of a single machine.
Modin works and performs well in both local mode and in a cluster environment.
The key advantage of Modin is that your python code does not change between
local development and cluster execution. Users are not required to think about
how many workers exist or how to distribute and partition their data;
Modin handles all of this seamlessly and transparently.

.. note::
You can also use a Jupyter notebook, but you need to deploy a Jupyter server
Retribution98 marked this conversation as resolved.
Show resolved Hide resolved
on the remote cluster head node and connect to it.

.. image:: ../../../img/modin_cluster.png
YarShev marked this conversation as resolved.
Show resolved Hide resolved
:alt: Modin cluster
:align: center

Extra requirements for AWS authentication
-----------------------------------------

First of all, install the necessary dependencies in your environment:

.. code-block:: bash

pip install boto3

The next step is to setup your AWS credentials. One can set ``AWS_ACCESS_KEY_ID``,
``AWS_SECRET_ACCESS_KEY`` and ``AWS_SESSION_TOKEN`` environment variables or
Retribution98 marked this conversation as resolved.
Show resolved Hide resolved
just run the following command:

.. code-block:: bash

aws configure

Starting and connecting to the cluster
--------------------------------------

This example starts 1 head node (m5.24xlarge) and 5 worker nodes (m5.24xlarge), 576 total CPUs.
You can check the `Amazon EC2 pricing`_ .
YarShev marked this conversation as resolved.
Show resolved Hide resolved
Retribution98 marked this conversation as resolved.
Show resolved Hide resolved

You can manually create AWS EC2 instances and configure them or just use the `Ray CLI`_ to
Retribution98 marked this conversation as resolved.
Show resolved Hide resolved
create and initialize a Ray cluster on AWS using `Modin's Ray cluster setup config`_ .
Retribution98 marked this conversation as resolved.
Show resolved Hide resolved
You can read more about how to modify the file on `Ray's autoscaler options`_ .
Retribution98 marked this conversation as resolved.
Show resolved Hide resolved

More details on how to launch a Ray cluster can be found on `Ray's cluster docs`_.

To start up the Ray cluster, run the following command in your terminal:

.. code-block:: bash

ray up modin-cluster.yaml

Once the head node has completed initialization, you can optionally connect to it by running the following command.

.. code-block:: bash

ray attach modin-cluster.yaml

To exit the ssh session and return back into your local shell session, type:

.. code-block:: bash

exit

Executing in a cluster environment
----------------------------------

.. note::
Be careful when using the `Ray client`_ to connect to a remote cluster.
We don't recommend this connection mode, beacuse it may not work. Known bugs:
- https://github.com/ray-project/ray/issues/38713,
- https://github.com/modin-project/modin/issues/6641.

Modin lets you instantly speed up your workflows with a large data by scaling pandas
on a cluster. In this tutorial, we will use a 12.5 GB `big_yellow.csv` file that was
created by concatenating a 200MB `NYC Taxi dataset`_ file 64 times. Preparing this
file was provided as part of our `Modin's Ray cluster setup config`_.

If you want to use the other dataset, you should provide it to each of
the cluster nodes with the same path. We recomnend doing this by customizing the
`setup_commands` section of the [configuration file](https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/execution/pandas_on_ray/cluster/modin-cluster.yaml).
Retribution98 marked this conversation as resolved.
Show resolved Hide resolved

To run any script in a remote cluster, you need to submit it to the ray. In this way,
Retribution98 marked this conversation as resolved.
Show resolved Hide resolved
the script file is sent to the the remote cluster head node and executed there.

In this tutorial, we provide the `exercise_5.py`_ script, which reads the data from the
CSV file and executes such pandas operations as count, groupby and applymap.
As a result of the script, you will see the size of the file being read and the execution
time of each function.

.. note::
Some Dataframe functions are executed asynchronously, so to correctly measure execution time
we need to wait for the execution result. We use the special `execute` function for this,
but you should not use this function as it will slow down your script.
Retribution98 marked this conversation as resolved.
Show resolved Hide resolved

You can submit this script to the existing remote cluster by running the following command.

.. code-block:: bash

ray submit modin-cluster.yaml exercise_5.py

To download or upload files to the cluster head node, use `ray rsync_down` or `ray rsync_up`.
It may help you if you want to use some other Python modules that should be available to
execute your own script or download a result file after executing the script.

.. code-block:: bash

# download a file from the cluster to the local computer:
Retribution98 marked this conversation as resolved.
Show resolved Hide resolved
ray rsync_down modin-cluster.yaml '/path/on/cluster' '/local/path'
# upload a file from the local computer to the cluster:
Retribution98 marked this conversation as resolved.
Show resolved Hide resolved
ray rsync_up modin-cluster.yaml '/local/path' '/path/on/cluster'

Modin performance scales as the number of nodes and cores increases. The following
chart shows the performance of the read_csv operation with different number of nodes,
Retribution98 marked this conversation as resolved.
Show resolved Hide resolved
with improvements in performance as we increase the number of resources Modin can use.

.. image:: ../../../../examples/tutorial/jupyter/img/modin_cluster_perf.png
:alt: Cluster Performance
:align: center

.. _`Ray's autoscaler options`: https://docs.ray.io/en/latest/cluster/vms/references/ray-cluster-configuration.html#cluster-config
.. _`Ray's cluster docs`: https://docs.ray.io/en/latest/cluster/getting-started.html
.. _`NYC Taxi dataset`: https://modin-datasets.intel.com/testing/yellow_tripdata_2015-01.csv
.. _`Modin's Ray cluster setup config`: https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/execution/pandas_on_ray/cluster/modin-cluster.yaml
.. _`Amazon EC2 pricing`: https://aws.amazon.com/ec2/pricing/on-demand/
.. _`exercise_5.py`: https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/execution/pandas_on_ray/cluster/exercise_5.py
.. _`Ray client`: https://docs.ray.io/en/latest/cluster/running-applications/job-submission/ray-client.html
.. _`Ray CLI`: https://docs.ray.io/en/latest/cluster/vms/getting-started.html#running-applications-on-a-ray-cluster