Added minor adjustments

modin-project · Feb 13, 2024 · e8bb146 · e8bb146
1 parent 2911a37
commit e8bb146
Show file tree

Hide file tree

Showing 4 changed files with 42 additions and 18 deletions.
diff --git a/docs/getting_started/using_modin/using_modin_cluster/using_modin_ray_cluster.rst b/docs/getting_started/using_modin/using_modin_cluster/using_modin_ray_cluster.rst
@@ -6,15 +6,18 @@ Using Modin on Ray in a Cluster
 
 Often in practice we have a need to exceed the capabilities of a single machine.
 Modin works and performs well in both local mode and in a cluster environment.
-The key advantage of Modin is that your notebook does not change between
+The key advantage of Modin is that your python code does not change between
 local development and cluster execution. Users are not required to think about
 how many workers exist or how to distribute and partition their data;
 Modin handles all of this seamlessly and transparently.
 
+.. note::
+   You can also use a Jupyter notebook, but you need to deploy a Jupyter server 
+   on the remote cluster head node and connect to it.
+
 .. image:: ../../../img/modin_cluster.png
    :alt: Modin cluster
    :align: center
-   :scale: 90%
 
 Extra requirements for AWS authentication
 -----------------------------------------
@@ -39,7 +42,7 @@ Starting and connecting to the cluster
 This example starts 1 head node (m5.24xlarge) and 5 worker nodes (m5.24xlarge), 576 total CPUs.
 You can check the `Amazon EC2 pricing`_ .
 
-You can manually create AWS EC2 instances and configure them or just use the `Ray autoscaler` to 
+You can manually create AWS EC2 instances and configure them or just use the `Ray CLI`_ to 
 create and initialize a Ray cluster on AWS using `Modin's Ray cluster setup config`_ .
 You can read more about how to modify the file on `Ray's autoscaler options`_ .
 
@@ -89,11 +92,16 @@ CSV file and executes such pandas operations as count, groupby and applymap.
 As a result of the script, you will see the size of the file being read and the execution
 time of each function.
 
+.. note::
+   Some Dataframe functions are executed asynchronously, so to correctly measure execution time 
+   we need to wait for the execution result. We use the special `execute` function for this, 
+   but you should not use this function as it will slow down your script.
+
 You can submit this script to the existing remote cluster by running the following command.
 
 .. code-block:: bash
 
-   ray modin-cluster.yaml exercise_5.py
+   ray submit modin-cluster.yaml exercise_5.py
 
 To download or upload files to the cluster head node, use `ray rsync_down` or `ray rsync_up`.
 It may help you if you want to use some other Python modules that should be available to
@@ -113,12 +121,12 @@ with improvements in performance as we increase the number of resources Modin ca
 .. image:: ../../../../examples/tutorial/jupyter/img/modin_cluster_perf.png
    :alt: Cluster Performance
    :align: center
-   :scale: 90%
 
 .. _`Ray's autoscaler options`: https://docs.ray.io/en/latest/cluster/vms/references/ray-cluster-configuration.html#cluster-config
 .. _`Ray's cluster docs`: https://docs.ray.io/en/latest/cluster/getting-started.html
 .. _`NYC Taxi dataset`: https://modin-datasets.intel.com/testing/yellow_tripdata_2015-01.csv
 .. _`Modin's Ray cluster setup config`: https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/execution/pandas_on_ray/cluster/modin-cluster.yaml
 .. _`Amazon EC2 pricing`: https://aws.amazon.com/ec2/pricing/on-demand/
 .. _`exercise_5.py`: https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/execution/pandas_on_ray/cluster/exercise_5.py
-.. _`Ray client`: https://docs.ray.io/en/latest/cluster/running-applications/job-submission/ray-client.html
+.. _`Ray client`: https://docs.ray.io/en/latest/cluster/running-applications/job-submission/ray-client.html
+.. _`Ray CLI`: https://docs.ray.io/en/latest/cluster/vms/getting-started.html#running-applications-on-a-ray-cluster
diff --git a/examples/tutorial/jupyter/execution/pandas_on_ray/cluster/exercise_5.md b/examples/tutorial/jupyter/execution/pandas_on_ray/cluster/exercise_5.md
@@ -15,11 +15,14 @@ and it is not shut down until the end of exercise. Read instructions carefully.*
 
 Often in practice we have a need to exceed the capabilities of a single machine.
 Modin works and performs well in both local mode and in a cluster environment.
-The key advantage of Modin is that your notebook does not change between
+The key advantage of Modin is that your python code not change between
 local development and cluster execution. Users are not required to think about
 how many workers exist or how to distribute and partition their data;
 Modin handles all of this seamlessly and transparently.
 
+**Note**: You can also use a Jupyter notebook, but you need to deploy a Jupyter server 
+on the remote cluster head node and connect to it.
+
 ![Cluster](../../../img/modin_cluster.png)
 
 **Extra requirements for AWS authentication**
@@ -44,8 +47,9 @@ This example starts 1 head node (m5.24xlarge) and 5 worker nodes (m5.24xlarge),
 
 Cost of this cluster can be found here: https://aws.amazon.com/ec2/pricing/on-demand/.
 
-You can manually create AWS EC2 instances and configure them or just use the `Ray autoscaler` to create and initialize
-a Ray cluster using the configuration file. This file is included in this directory and is called
+You can manually create AWS EC2 instances and configure them or just use the 
+[`Ray CLI`](https://docs.ray.io/en/latest/cluster/vms/getting-started.html#running-applications-on-a-ray-cluster)
+to create and initialize a Ray cluster using the configuration file. This file is included in this directory and is called
 [`modin-cluster.yaml`](https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/execution/pandas_on_ray/cluster/modin-cluster.yaml).
 You can read more about how to modify `Ray cluster YAML Configuration file` here:
 https://docs.ray.io/en/latest/cluster/vms/references/ray-cluster-configuration.html#cluster-yaml-configuration-options
@@ -91,6 +95,10 @@ In this exercise, we provide the `exercise_5.py` script, which read the data fro
 some pandas Dataframe function such as count, groupby and applymap. As a result of the script, you will see 
 the size of the file being read and the execution time of each function.
 
+**Note**: Some Dataframe functions are executed asynchronously, so to correctly measure execution time 
+we need to wait for the execution result. We use the special `execute` function for this, 
+but you should not use this function as it will slow down your script.
+
 You can submit this script to the existing remote cluster by running the following command.
 
 ```bash

diff --git a/examples/tutorial/jupyter/execution/pandas_on_ray/cluster/exercise_5.py b/examples/tutorial/jupyter/execution/pandas_on_ray/cluster/exercise_5.py
@@ -2,12 +2,14 @@
 import time
 import ray
 import modin.pandas as pd
+from modin.utils import execute
 
 ray.init(address="auto")
 cpu_count = ray.cluster_resources()["CPU"]
 assert cpu_count == 576, f"Expected 576 CPUs, but found {cpu_count}"
 
-file_size = os.path.getsize("big_yellow.csv")
+file_path = "big_yellow.csv"
+file_size = os.path.getsize(file_path)
 
 
 # get human readable file size
@@ -22,21 +24,28 @@ def sizeof_fmt(num, suffix="B"):
 print(f"File size is {sizeof_fmt(file_size)}")  # noqa: T201
 
 t0 = time.perf_counter()
-df = pd.read_csv("big_yellow.csv", quoting=3)
+df = pd.read_csv(file_path, quoting=3)
 t1 = time.perf_counter()
 print(f"read_csv time is {(t1 - t0):.3f}")  # noqa: T201
 
+"""
+IMPORTANT:
+Some Dataframe functions are executed asynchronously, so to correctly measure execution time 
+we need to wait for the execution result. We use the special `execute` function for this, 
+but you should not use this function as it will slow down your script.
+"""
+
 t0 = time.perf_counter()
-count_result = df.count()
+count_result = execute(df.count())
 t1 = time.perf_counter()
 print(f"count time is {(t1 - t0):.3f}")  # noqa: T201
 
 t0 = time.perf_counter()
-groupby_result = df.groupby("passenger_count").count()
+groupby_result = execute(df.groupby("passenger_count").count())
 t1 = time.perf_counter()
 print(f"groupby time is {(t1 - t0):.3f}")  # noqa: T201
 
 t0 = time.perf_counter()
-apply_result = df.applymap(str)
+apply_result = execute(df.applymap(str))
 t1 = time.perf_counter()
 print(f"applymap time is {(t1 - t0):.3f}")  # noqa: T201
diff --git a/examples/tutorial/jupyter/execution/pandas_on_ray/cluster/modin-cluster.yaml b/examples/tutorial/jupyter/execution/pandas_on_ray/cluster/modin-cluster.yaml
@@ -166,10 +166,9 @@ setup_commands:
     # To run the nightly version of ray (as opposed to the latest), either use a rayproject docker image
     # that has the "nightly" (e.g. "rayproject/ray-ml:nightly-gpu") or uncomment the following line:
     # - pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl"
-    - pip install modin
-    - pip install ray[default]>=1.13.0,!=2.5.0
-    - pip install pyarrow>=7.0.0
-    - pip install -U fsspec>=2022.05.0
+    - conda create -n "modin" -c conda-forge modin "ray-default">=1.13.0,!=2.5.0 -y
+    - conda activate modin && pip install -U fsspec>=2022.11.0 boto3
+    - echo "conda activate modin" >> ~/.bashrc
     - wget https://modin-datasets.intel.com/testing/yellow_tripdata_2015-01.csv
     - printf "VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee\n" > big_yellow.csv
     - tail -n +2 yellow_tripdata_2015-01.csv{,}{,}{,}{,}{,}{,} >> big_yellow.csv