Added minor adjustments
Retribution98 committed Feb 13, 2024
1 parent 2911a37 commit e8bb146
Showing 4 changed files with 42 additions and 18 deletions.
@@ -6,15 +6,18 @@ Using Modin on Ray in a Cluster
Often in practice we have a need to exceed the capabilities of a single machine.
Modin works and performs well in both local mode and in a cluster environment.
The key advantage of Modin is that your notebook does not change between
The key advantage of Modin is that your Python code does not change between
local development and cluster execution. Users are not required to think about
how many workers exist or how to distribute and partition their data;
Modin handles all of this seamlessly and transparently.

.. note::
   You can also use a Jupyter notebook, but you need to deploy a Jupyter server
   on the remote cluster head node and connect to it.

.. image:: ../../../img/modin_cluster.png
   :alt: Modin cluster
   :align: center
   :scale: 90%

Extra requirements for AWS authentication
-----------------------------------------
@@ -39,7 +42,7 @@ Starting and connecting to the cluster
This example starts 1 head node (m5.24xlarge) and 5 worker nodes (m5.24xlarge), 576 total CPUs.
You can check the cost of these instances on the `Amazon EC2 pricing`_ page.

You can manually create AWS EC2 instances and configure them or just use the `Ray autoscaler` to
You can create and configure AWS EC2 instances manually, or use the `Ray CLI`_ to
create and initialize a Ray cluster on AWS from `Modin's Ray cluster setup config`_.
To learn how to modify this file, see `Ray's autoscaler options`_.

@@ -89,11 +92,16 @@ CSV file and executes such pandas operations as count, groupby and applymap.
The script prints the size of the file being read and the execution
time of each function.

.. note::
   Some DataFrame operations are executed asynchronously, so to measure execution time
   correctly the script must wait for each result. It uses Modin's `execute` function for this.
   Avoid calling `execute` in your own code unless you need such measurements,
   because forcing each operation to finish slows down your script.

You can submit this script to the existing remote cluster by running the following command.

.. code-block:: bash
ray modin-cluster.yaml exercise_5.py
ray submit modin-cluster.yaml exercise_5.py
To download or upload files to the cluster head node, use `ray rsync_down` or `ray rsync_up`.
It may help you if you want to use some other Python modules that should be available to
@@ -113,12 +121,12 @@ with improvements in performance as we increase the number of resources Modin ca
.. image:: ../../../../examples/tutorial/jupyter/img/modin_cluster_perf.png
   :alt: Cluster Performance
   :align: center
   :scale: 90%

.. _`Ray's autoscaler options`: https://docs.ray.io/en/latest/cluster/vms/references/ray-cluster-configuration.html#cluster-config
.. _`Ray's cluster docs`: https://docs.ray.io/en/latest/cluster/getting-started.html
.. _`NYC Taxi dataset`: https://modin-datasets.intel.com/testing/yellow_tripdata_2015-01.csv
.. _`Modin's Ray cluster setup config`: https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/execution/pandas_on_ray/cluster/modin-cluster.yaml
.. _`Amazon EC2 pricing`: https://aws.amazon.com/ec2/pricing/on-demand/
.. _`exercise_5.py`: https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/execution/pandas_on_ray/cluster/exercise_5.py
.. _`Ray client`: https://docs.ray.io/en/latest/cluster/running-applications/job-submission/ray-client.html
.. _`Ray client`: https://docs.ray.io/en/latest/cluster/running-applications/job-submission/ray-client.html
.. _`Ray CLI`: https://docs.ray.io/en/latest/cluster/vms/getting-started.html#running-applications-on-a-ray-cluster
@@ -15,11 +15,14 @@ and it is not shut down until the end of exercise. Read instructions carefully.*

Often in practice we have a need to exceed the capabilities of a single machine.
Modin works and performs well in both local mode and in a cluster environment.
The key advantage of Modin is that your notebook does not change between
The key advantage of Modin is that your Python code does not change between
local development and cluster execution. Users are not required to think about
how many workers exist or how to distribute and partition their data;
Modin handles all of this seamlessly and transparently.

**Note**: You can also use a Jupyter notebook, but you need to deploy a Jupyter server
on the remote cluster head node and connect to it.

![Cluster](../../../img/modin_cluster.png)

**Extra requirements for AWS authentication**
@@ -44,8 +47,9 @@ This example starts 1 head node (m5.24xlarge) and 5 worker nodes (m5.24xlarge),

Cost of this cluster can be found here: https://aws.amazon.com/ec2/pricing/on-demand/.
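
The 576-CPU total for this cluster follows from the instance type: an m5.24xlarge provides 96 vCPUs, and the cluster has six such nodes (1 head plus 5 workers). A quick check of that arithmetic, with illustrative variable names that are not part of the tutorial files:

```python
# vCPU count of one m5.24xlarge instance, per the AWS instance specification.
VCPUS_PER_M5_24XLARGE = 96

head_nodes = 1
worker_nodes = 5

# Total CPUs Ray should report once every node has joined the cluster.
expected_cpus = (head_nodes + worker_nodes) * VCPUS_PER_M5_24XLARGE
print(f"Expected CPUs: {expected_cpus}")
```

This is the same number that `exercise_5.py` asserts against `ray.cluster_resources()["CPU"]`.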

You can manually create AWS EC2 instances and configure them or just use the `Ray autoscaler` to create and initialize
a Ray cluster using the configuration file. This file is included in this directory and is called
You can manually create AWS EC2 instances and configure them or just use the
[`Ray CLI`](https://docs.ray.io/en/latest/cluster/vms/getting-started.html#running-applications-on-a-ray-cluster)
to create and initialize a Ray cluster using the configuration file. This file is included in this directory and is called
[`modin-cluster.yaml`](https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/execution/pandas_on_ray/cluster/modin-cluster.yaml).
You can read more about how to modify `Ray cluster YAML Configuration file` here:
https://docs.ray.io/en/latest/cluster/vms/references/ray-cluster-configuration.html#cluster-yaml-configuration-options
@@ -91,6 +95,10 @@ In this exercise, we provide the `exercise_5.py` script, which read the data fro
some pandas DataFrame functions such as count, groupby and applymap. The script prints
the size of the file being read and the execution time of each function.

**Note**: Some DataFrame operations are executed asynchronously, so to measure execution time correctly
the script must wait for each result. It uses the `execute` function from `modin.utils` for this.
Avoid calling `execute` in your own code unless you need such measurements, because forcing each operation to finish slows down your script.
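
To illustrate why waiting matters when timing asynchronous work, here is a minimal sketch that uses only the Python standard library as an analogy. It does not use Modin or Ray, and `slow_square` is a hypothetical stand-in for an asynchronous DataFrame operation. Stopping the timer right after submission captures only the dispatch cost, while stopping it after the result is available captures the real work.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_square(x):
    """Hypothetical stand-in for an operation that runs in the background."""
    time.sleep(0.2)
    return x * x


with ThreadPoolExecutor(max_workers=1) as pool:
    t0 = time.perf_counter()
    future = pool.submit(slow_square, 7)  # returns immediately
    submit_time = time.perf_counter() - t0

    value = future.result()  # blocks until the work has finished
    total_time = time.perf_counter() - t0

print(f"time without waiting: {submit_time:.3f}")
print(f"time with waiting:    {total_time:.3f}")
```

Only the second measurement reflects how long the operation took, which is the role `execute` plays in `exercise_5.py`.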

You can submit this script to the existing remote cluster by running the following command.

```bash
@@ -2,12 +2,14 @@
import time
import ray
import modin.pandas as pd
from modin.utils import execute

ray.init(address="auto")
cpu_count = ray.cluster_resources()["CPU"]
assert cpu_count == 576, f"Expected 576 CPUs, but found {cpu_count}"

file_size = os.path.getsize("big_yellow.csv")
file_path = "big_yellow.csv"
file_size = os.path.getsize(file_path)


# get human readable file size
@@ -22,21 +24,28 @@ def sizeof_fmt(num, suffix="B"):
print(f"File size is {sizeof_fmt(file_size)}") # noqa: T201

t0 = time.perf_counter()
df = pd.read_csv("big_yellow.csv", quoting=3)
df = pd.read_csv(file_path, quoting=3)
t1 = time.perf_counter()
print(f"read_csv time is {(t1 - t0):.3f}") # noqa: T201

"""
IMPORTANT:
Some DataFrame operations are executed asynchronously, so to measure execution time
correctly we must wait for each result. This script uses the `execute` function for that.
Avoid calling `execute` in your own code unless you need such measurements,
because forcing each operation to finish slows down your script.
"""

t0 = time.perf_counter()
count_result = df.count()
count_result = execute(df.count())

# CodeQL warning (Use of the return value of a procedure): the result of `execute` is used even though it is always None.
t1 = time.perf_counter()
print(f"count time is {(t1 - t0):.3f}") # noqa: T201

t0 = time.perf_counter()
groupby_result = df.groupby("passenger_count").count()
groupby_result = execute(df.groupby("passenger_count").count())

# CodeQL warning (Use of the return value of a procedure): the result of `execute` is used even though it is always None.
t1 = time.perf_counter()
print(f"groupby time is {(t1 - t0):.3f}") # noqa: T201

t0 = time.perf_counter()
apply_result = df.applymap(str)
apply_result = execute(df.applymap(str))

# CodeQL warning (Use of the return value of a procedure): the result of `execute` is used even though it is always None.
t1 = time.perf_counter()
print(f"applymap time is {(t1 - t0):.3f}") # noqa: T201
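
The CodeQL warnings in this file note that `execute` always returns None, so assigning its return value discards the computed result. Below is a hedged sketch of a timing helper that keeps the result and waits on it separately. The `execute` defined here is a local stand-in that only mimics the None-returning behavior, and `timed` is a hypothetical helper; neither is Modin's API.

```python
import time


def execute(*objs):
    """Local stand-in: a real helper would block until work on objs finishes."""
    return None


def timed(label, compute):
    """Run compute(), wait for its result, and return (result, seconds)."""
    t0 = time.perf_counter()
    result = compute()  # keep the actual result
    execute(result)     # wait for completion; the return value is ignored
    elapsed = time.perf_counter() - t0
    print(f"{label} time is {elapsed:.3f}")
    return result, elapsed


# Illustrative call with a trivial computation in place of a DataFrame operation.
count_result, seconds = timed("count", lambda: len([1, 2, 3]))
```

Applied to the script, the same shape would be `count_result = df.count()` followed by `execute(count_result)`, so that `count_result` still holds the data after timing.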
@@ -166,10 +166,9 @@ setup_commands:
# To run the nightly version of ray (as opposed to the latest), either use a rayproject docker image
# that has the "nightly" (e.g. "rayproject/ray-ml:nightly-gpu") or uncomment the following line:
# - pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl"
- pip install modin
- pip install ray[default]>=1.13.0,!=2.5.0
- pip install pyarrow>=7.0.0
- pip install -U fsspec>=2022.05.0
- conda create -n "modin" -c conda-forge modin "ray-default>=1.13.0,!=2.5.0" -y
- conda activate modin && pip install -U "fsspec>=2022.11.0" boto3
- echo "conda activate modin" >> ~/.bashrc
- wget https://modin-datasets.intel.com/testing/yellow_tripdata_2015-01.csv
- printf "VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee\n" > big_yellow.csv
- tail -n +2 yellow_tripdata_2015-01.csv{,}{,}{,}{,}{,}{,} >> big_yellow.csv