DOCS-#6819: Update Modin on cluster documentation (#6678)

Co-authored-by: Iaroslav Igoshev <Poolliver868@mail.ru>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
modin-project · Dec 13, 2023 · 7f2dc36
1 parent acfcf34
Showing 3 changed files with 46 additions and 15 deletions.
diff --git a/docs/getting_started/using_modin/using_modin_cluster.rst b/docs/getting_started/using_modin/using_modin_cluster.rst
@@ -4,7 +4,7 @@ Using Modin in a Cluster
 
 .. note::
   | *Estimated Reading Time: 15 minutes*
-  | You can follow along in a Jupyter notebook in this two-part tutorial:  [`Part 1 <https://github.com/modin-project/modin/tree/master/examples/tutorial/jupyter/execution/pandas_on_ray/cluster/exercise_5.ipynb>`_], [`Part 2 <https://github.com/modin-project/modin/tree/master/examples/tutorial/jupyter/execution/pandas_on_ray/cluster/exercise_6.ipynb>`_].
+  | You can follow along in a Jupyter notebook in this two-part tutorial: `Part 1`_, `Part 2`_.
 
 Often in practice we have a need to exceed the capabilities of a single machine. Modin
 works and performs well in both local mode and in a cluster environment. The key
@@ -15,8 +15,9 @@ transparently.
 
 Starting up a Ray Cluster
 -------------------------
-Modin is able to utilize Ray's built-in autoscaled cluster. To launch a Ray cluster using Amazon Web Service (AWS), you can use `this file <https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/execution/pandas_on_ray/cluster/modin-cluster.yaml>`_
-as the config file.
+Modin is able to utilize Ray's built-in autoscaled cluster. To launch a Ray cluster
+using Amazon Web Services (AWS), you can use `Modin's cluster setup config`_
+(`Ray's autoscaler options`_).
 
 .. code-block:: bash
 
@@ -31,8 +32,11 @@ To start up the Ray cluster, run the following command in your terminal:
 
 This configuration script starts 1 head node (m5.24xlarge) and 7 workers (m5.24xlarge),
 768 total CPUs. For more information on how to launch a Ray cluster across different
-cloud providers or on-premise, you can also refer to the Ray documentation `here <https://docs.ray.io/en/latest/cluster/cloud.html>`_.
+cloud providers or on-premises, you can also refer to `Ray's cluster docs`_.
 
+.. note::
+   By default, Modin on Ray uses 60% of the system memory. We recommend reserving the
+   same fraction on each node when you use your own cluster.
 
 Connecting to a Ray Cluster
 ---------------------------
@@ -56,13 +60,20 @@ Modin:
 
 Congratulations! You have successfully connected to the Ray cluster.
 
+.. note::
+   Be careful when using the Ray client to connect to a remote cluster.
+   This connection mode may not work. Known bugs:
+   - https://github.com/ray-project/ray/issues/38713,
+   - https://github.com/modin-project/modin/issues/6641.
+
 Using Modin on a Ray Cluster
 ----------------------------
 
 Now that we have a Ray cluster up and running, we can use Modin to perform pandas
 operations as if we were working with pandas on a single machine. We test Modin's
-performance on the 200MB `NYC Taxi dataset <https://modin-datasets.s3.amazonaws.com/testing/yellow_tripdata_2015-01.csv>`_ that was provided as part of our `cluster setup script <https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/execution/pandas_on_ray/cluster/modin-cluster.yaml>`_. We can time the following operation
-in a Jupyter notebook:
+performance on the 200MB `NYC Taxi dataset`_ that is provided as part of
+`Modin's cluster setup config`_. We can time the following operations in a Jupyter
+notebook:
 
 .. code-block:: python
 
@@ -78,6 +89,10 @@ in a Jupyter notebook:
    %%time
    apply_result = df.map(str)
 
+.. note::
+   When using local paths, make sure that they are available on all nodes in the
+   cluster, for example by using a distributed file system such as NFS.
+
 Modin performance scales as the number of nodes and cores increases. The following
 chart shows the performance of the above operations with 2, 4, and 8 nodes, with
 improvements in performance as we increase the number of resources Modin can use.
@@ -93,7 +108,7 @@ Advanced: Configuring your Ray Environment
 In some cases, it may be useful to customize your Ray environment. Below, we have listed
 a few ways you can solve common problems in data management with Modin by customizing
 your Ray environment. It is possible to use any of Ray's initialization parameters,
-which are all found in `Ray's documentation`_.
+which are all found in `Ray's API docs`_.
 
 .. code-block:: python
 
@@ -108,4 +123,10 @@ you can customize your Ray environment for use in Modin!
 .. _`DataFrame`: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html
 .. _`pandas`: https://pandas.pydata.org/pandas-docs/stable/
 .. _`open an issue`: https://github.com/modin-project/modin/issues
-.. _`Ray's documentation`: https://ray.readthedocs.io/en/latest/api.html
+.. _`Ray's API docs`: https://ray.readthedocs.io/en/latest/api.html
+.. _`Part 1`: https://github.com/modin-project/modin/tree/master/examples/tutorial/jupyter/execution/pandas_on_ray/cluster/exercise_5.ipynb
+.. _`Part 2`: https://github.com/modin-project/modin/tree/master/examples/tutorial/jupyter/execution/pandas_on_ray/cluster/exercise_6.ipynb
+.. _`Ray's autoscaler options`: https://docs.ray.io/en/latest/cluster/vms/references/ray-cluster-configuration.html#cluster-config
+.. _`Ray's cluster docs`: https://docs.ray.io/en/latest/cluster/getting-started.html
+.. _`NYC Taxi dataset`: https://modin-datasets.s3.amazonaws.com/testing/yellow_tripdata_2015-01.csv
+.. _`Modin's cluster setup config`: https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/execution/pandas_on_ray/cluster/modin-cluster.yaml
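The memory note added in this file can be turned into a quick per-node calculation. The following is a minimal sketch, not Modin's own code: `per_node_memory_budget` is a hypothetical helper, and the 384 GiB node size is only an illustrative value. How the result is handed to Ray (for example through `object_store_memory` in `ray.init`) is an assumption that depends on your cluster setup.

```python
GIB = 1024 ** 3


def per_node_memory_budget(total_bytes: int, percent: int = 60) -> int:
    """Return `percent` of a node's memory in bytes (hypothetical helper).

    Integer arithmetic is used so the result does not depend on
    floating-point rounding.
    """
    if total_bytes < 0:
        raise ValueError("total_bytes must be non-negative")
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return total_bytes * percent // 100


# Illustrative node size; replace it with the memory of your own nodes.
node_memory = 384 * GIB
budget = per_node_memory_budget(node_memory)
print(f"Budget per node: {budget / GIB:.1f} GiB")
```

The same value should be applied to every node, since the note asks for the same fraction on each of them.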
diff --git a/docs/getting_started/using_modin/using_modin_locally.rst b/docs/getting_started/using_modin/using_modin_locally.rst
@@ -4,14 +4,16 @@ Using Modin Locally
 
 .. note::
   | *Estimated Reading Time: 5 minutes*
-  | You can follow along this tutorial in a Jupyter notebook `here <hhttps://github.com/modin-project/modin/tree/master/examples/quickstart.ipynb>`.
+  | You can follow along with this tutorial in the `Jupyter notebook`_.
 
 In our quickstart example, we have already seen how you can achieve considerable
-speedup from Modin, even on a single machine. Users do not need to know how many cores their system has, nor do they need to specify how to distribute the data. In fact,
+speedup from Modin, even on a single machine. Users do not need to know how many
+cores their system has, nor do they need to specify how to distribute the data. In fact,
 users can **continue using their existing pandas code** while experiencing a
 considerable speedup from Modin, even on a single machine.
 
-To use Modin on a single machine, only a modification of the import statement is needed. Once you've changed your import statement, you're ready to use Modin
+To use Modin on a single machine, only a modification of the import statement is needed.
+Once you've changed your import statement, you're ready to use Modin
 just like you would pandas, since the API is identical to pandas.
 
 .. code-block:: python
@@ -66,7 +68,8 @@ cluster for you:
 
 Finally, if you already have a Ray or Dask engine initialized, Modin will
 automatically attach to whichever engine is available. If you are interested in using
-Modin with HDK engine, please refer to :doc:`these instructions </development/using_hdk>`. For additional information on other settings you can configure, see
+Modin with HDK engine, please refer to :doc:`these instructions </development/using_hdk>`.
+For additional information on other settings you can configure, see the
 :doc:`Modin's config </flow/modin/config>` page.
 
 Advanced: Configuring the resources Modin uses
@@ -81,8 +84,8 @@ the following code:
    import modin
    print(modin.config.NPartitions.get()) #prints 16 on a laptop with 16 physical cores
 
-Modin fully utilizes the resources on your machine. To read more about how this works, see :doc:`Why Modin? </getting_started/why_modin/pandas/>`
-page for more details.
+Modin fully utilizes the resources on your machine. To read more about how this works,
+see the :doc:`Why Modin? </getting_started/why_modin/pandas/>` page.
 
 Since Modin will use all of the resources available on your machine by default, at
 times, it is possible that you may like to limit the amount of resources Modin uses to
@@ -116,4 +119,9 @@ specify more processors than you have available on your machine; however this wi
 improve the performance (and might end up hurting the performance of the system).
 
 .. note::
-   Make sure to update the ``MODIN_CPUS`` configuration and initialize your preferred engine before you start working with the first operation using Modin! Otherwise, Modin will opt for the default setting.
+   Make sure to update the ``MODIN_CPUS`` configuration and initialize your preferred
+   engine before you start working with the first operation using Modin! Otherwise,
+   Modin will opt for the default setting.
+
+
+.. _`Jupyter notebook`: https://github.com/modin-project/modin/tree/master/examples/quickstart.ipynb
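The ordering requirement in the ``MODIN_CPUS`` note can be sketched as follows. This is a hedged illustration that touches only the environment variable named in the note: `clamp_cpus` is a hypothetical helper, and the requested value of 4 is arbitrary. The Modin import is left commented out so the snippet runs without Modin installed.

```python
import os


def clamp_cpus(requested: int, available: int) -> int:
    """Cap the requested CPU count at what the machine has (hypothetical helper)."""
    if requested < 1:
        raise ValueError("requested must be at least 1")
    return min(requested, available)


available = os.cpu_count() or 1  # os.cpu_count() may return None

# Set the variable BEFORE Modin runs its first operation.
os.environ["MODIN_CPUS"] = str(clamp_cpus(4, available))

# import modin.pandas as pd  # import and use Modin only after this point
print(os.environ["MODIN_CPUS"])
```

Clamping reflects the text above the note: asking for more processors than the machine has does not improve performance.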
diff --git a/modin/core/execution/ray/common/utils.py b/modin/core/execution/ray/common/utils.py
@@ -50,6 +50,8 @@
 _RAY_IGNORE_UNHANDLED_ERRORS_VAR = "RAY_IGNORE_UNHANDLED_ERRORS"
 
 ObjectIDType = ray.ObjectRef
+# TODO: The minimum supported Ray version is 1.13, so this check is always true
+# and the `if` statement can be removed (keeping its body).
 if version.parse(ray.__version__) >= version.parse("1.2.0"):
     from ray.util.client.common import ClientObjectRef
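
The reasoning behind the TODO in this hunk can be checked without Ray installed. The sketch below assumes that `version` in the hunk is `packaging.version`; `gate_is_redundant` is a hypothetical name used only for illustration. It also shows why parsed versions, not raw strings, must be compared.

```python
from packaging import version

MIN_RAY_VERSION = "1.13"  # minimum Ray version stated in the TODO
GATE_VERSION = "1.2.0"    # version tested by the `if` in the hunk


def gate_is_redundant(minimum_supported: str, gate: str) -> bool:
    """Return True if every supported version already satisfies the gate."""
    return version.parse(minimum_supported) >= version.parse(gate)


# Parsed comparison treats 13 and 2 as numbers.
print(gate_is_redundant(MIN_RAY_VERSION, GATE_VERSION))

# Plain string comparison is lexicographic and gives the opposite answer.
print(MIN_RAY_VERSION >= GATE_VERSION)
```

With a minimum of 1.13 the gate holds for every supported Ray release, which is why the TODO marks the check as removable.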