DOCS-#3811: improve documentation for experimental 'pandas on Ray' (#3819)

Co-authored-by: Yaroslav Igoshev <Poolliver868@mail.ru>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
anmyachev and YarShev committed Dec 16, 2021
1 parent 6ca66db commit 20abddd
Showing 6 changed files with 52 additions and 23 deletions.
10 changes: 10 additions & 0 deletions docs/advanced_usage/index.rst
@@ -20,6 +20,16 @@ integrated toolkit for data scientists. We are actively developing data science
such as DataFrame - spreadsheet integration, DataFrame algebra, progress bars, SQL queries
on DataFrames, and more. Join the `Discourse`_ for the latest updates!

Experimental APIs
-----------------

Modin also supports the following experimental APIs on top of pandas, which are under active development; a short usage sketch follows the list.

- :py:func:`~modin.experimental.pandas.read_csv_glob` -- read multiple files in a directory
- :py:func:`~modin.experimental.pandas.read_sql` -- add optional parameters for the database connection
- :py:func:`~modin.experimental.pandas.read_pickle_distributed` -- read multiple files in a directory
- :py:meth:`~modin.experimental.pandas.DataFrame.to_pickle_distributed` -- write to multiple files in a directory
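
A minimal usage sketch of these APIs, assuming a supported engine such as Ray is
installed; the file paths are hypothetical:

.. code-block:: python

   # Importing the experimental API enables Modin's experimental mode.
   import modin.experimental.pandas as pd

   # Read every CSV file matching the glob pattern into a single DataFrame.
   df = pd.read_csv_glob("data/records_*.csv")

   # Write one pickle file per partition, then read the parts back.
   df.to_pickle_distributed("output/records_*.pkl")
   df = pd.read_pickle_distributed("output/records_*.pkl")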

Modin Spreadsheet API: Render Dataframes as Spreadsheets
--------------------------------------------------------
The Spreadsheet API for Modin allows you to render the dataframe as a spreadsheet to easily explore
6 changes: 5 additions & 1 deletion docs/developer/architecture.rst
@@ -220,6 +220,10 @@ documentation page on :doc:`contributing </developer/contributing>`.
- Uses native python execution - mainly used for debugging.
- The storage format is `pandas` and the in-memory partition type is a pandas DataFrame.
- For more information on the execution path, see the :doc:`pandas on Python </flow/modin/core/execution/python/implementations/pandas_on_python/index>` page.
- :doc:`pandas on Ray` (experimental)
- Uses the Ray_ execution framework.
- The storage format is `pandas` and the in-memory partition type is a pandas DataFrame.
- For more information on the execution path, see the :doc:`experimental pandas on Ray </flow/modin/experimental/core/execution/ray/implementations/pandas_on_ray/index>` page.
- :doc:`OmniSci on Native </developer/using_omnisci>` (experimental)
- Uses OmniSciDB as an engine.
- The storage format is `omnisci` and the in-memory partition type is a pyarrow Table. When defaulting to pandas, the pandas DataFrame is used.
@@ -326,7 +330,7 @@ details. The documentation covers most modules, with more docs being added every
│ │ │ │ │ └─── :doc:`omnisci_on_native </flow/modin/experimental/core/execution/native/implementations/omnisci_on_native/index>`
│ │ │ │ └───ray
│ │ │ │ └───implementations
│ │ │ │ ├─── :doc:`pandas_on_ray </flow/modin/experimental/core/execution/ray/implementations/pandas_on_ray>`
│ │ │ │ ├─── :doc:`pandas_on_ray </flow/modin/experimental/core/execution/ray/implementations/pandas_on_ray/index>`
│ │ │ │ └─── :doc:`pyarrow_on_ray </flow/modin/experimental/core/execution/ray/implementations/pyarrow_on_ray>`
│ │ │ └─── :doc:`storage_formats </flow/modin/experimental/core/storage_formats/index>`
│ │ │ └─── :doc:`omnisci </flow/modin/experimental/core/storage_formats/omnisci/index>`
@@ -0,0 +1,22 @@
:orphan:

ExperimentalPandasOnRay Execution
=================================

`ExperimentalPandasOnRay` execution keeps the underlying mechanisms of :doc:`PandasOnRay </flow/modin/core/execution/ray/implementations/pandas_on_ray/index>`
execution architecturally unchanged and adds experimental ``Data Transformation``, ``Data Ingress`` and ``Data Egress`` features (e.g. :py:func:`~modin.experimental.pandas.read_pickle_distributed`).

PandasOnRay and ExperimentalPandasOnRay differences
---------------------------------------------------

- a different factory: ``PandasOnRayFactory`` -> ``ExperimentalPandasOnRayFactory``
- a different IO class: ``PandasOnRayIO`` -> ``ExperimentalPandasOnRayIO``

ExperimentalPandasOnRayIO classes and modules
---------------------------------------------

- :py:class:`~modin.experimental.core.execution.ray.implementations.pandas_on_ray.io.io.ExperimentalPandasOnRayIO`
- :py:class:`~modin.core.execution.dispatching.factories.factories.ExperimentalPandasOnRayFactory`
- :py:class:`~modin.core.io.text.csv_glob_dispatcher.CSVGlobDispatcher`
- :py:class:`~modin.core.storage_formats.pandas.parsers.PandasCSVGlobParser`
- :doc:`ExperimentalPandasOnRay IO module </flow/modin/experimental/core/execution/ray/implementations/pandas_on_ray/io/index>`
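
The following sketch illustrates how this selection surfaces to the user: importing the
experimental API routes execution through the ``Experimental*`` factory and IO classes
listed above, assuming Ray is installed and selected as the engine.

.. code-block:: python

   import modin.config as cfg
   import modin.experimental.pandas as pd  # enables experimental mode

   assert cfg.IsExperimental.get()  # experimental execution is now active
   print(cfg.Engine.get())          # e.g. "Ray" when Ray is the selected engine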
@@ -1,7 +1,7 @@
:orphan:

Pandas-on-Ray Module Description
""""""""""""""""""""""""""""""""
IO Module Description for Pandas-on-Ray Execution
""""""""""""""""""""""""""""""""""""""""""""""""""

High-Level Module Overview
''''''''''''''''''''''''''
@@ -22,20 +22,6 @@ statement as follows:
# import modin.pandas as pd
import modin.experimental.pandas as pd
Implemented Operations
''''''''''''''''''''''

For now, :py:class:`~modin.experimental.core.execution.ray.implementations.pandas_on_ray.io.io.ExperimentalPandasOnRayIO`
implements two methods - :meth:`~modin.experimental.core.execution.ray.implementations.pandas_on_ray.io.io.ExperimentalPandasOnRayIO.read_sql` and
:meth:`~modin.experimental.core.execution.ray.implementations.pandas_on_ray.io.io.ExperimentalPandasOnRayIO.read_csv_glob`.
The first method lets the user call the familiar ``pandas.read_sql`` function extended
with `Spark-like parameters <https://spark.apache.org/docs/2.0.0/api/R/read.jdbc.html>`_
such as ``partition_column``, ``lower_bound`` and ``upper_bound``. With these
parameters, the user can specify how to partition the imported data.
The second method allows reading multiple CSV files simultaneously when a
`Python glob <https://docs.python.org/3/library/glob.html>`_ pattern is
provided as a parameter.
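
For illustration, a call to the extended ``read_sql`` might look as follows; the
connection string, table and column names are hypothetical, and the ``max_sessions``
keyword (number of parallel connections) is an assumed addition to the parameters
named above:

.. code-block:: python

   import modin.experimental.pandas as pd

   df = pd.read_sql(
       "SELECT * FROM orders",        # query or table name
       "sqlite:///orders.db",         # connection string
       partition_column="order_id",   # numeric column used to split the data
       lower_bound=0,                 # minimum value of partition_column
       upper_bound=1_000_000,         # maximum value of partition_column
       max_sessions=10,               # assumed: number of parallel connections
   )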

Submodules Description
''''''''''''''''''''''

@@ -123,6 +123,10 @@ def read_sql(
"""
Read SQL query or database table into a DataFrame.
The function is extended with `Spark-like parameters <https://spark.apache.org/docs/2.0.0/api/R/read.jdbc.html>`_
such as ``partition_column``, ``lower_bound`` and ``upper_bound``. With these
parameters, the user can specify how to partition the imported data.
Parameters
----------
sql : str or SQLAlchemy Selectable (select or text object)
15 changes: 9 additions & 6 deletions modin/experimental/pandas/io.py
@@ -231,10 +231,9 @@ def read_pickle_distributed(
"""
Load pickled pandas object from files.
In experimental mode, we can use `*` in the filename. The files must contain
parts of one dataframe, which can be obtained, for example, by the
`to_pickle_distributed` function.
Note: the number of partitions is equal to the number of input files.
This experimental feature provides parallel reading from multiple pickle files that are
defined by a glob pattern. The files must contain parts of one dataframe, which can be
obtained, for example, by the `to_pickle_distributed` function.
Parameters
----------
@@ -256,6 +255,10 @@
Returns
-------
unpickled : same type as object stored in file
Notes
-----
The number of partitions is equal to the number of input files.
"""
Engine.subscribe(_update_engine)
assert IsExperimental.get(), "This only works in experimental mode"
@@ -273,8 +276,8 @@
"""
Pickle (serialize) object to file.
If `*` is in the filename, all partitions are written to their own separate files;
otherwise the default pandas implementation is used.
This experimental feature provides parallel writing into multiple pickle files that are
defined by a glob pattern; otherwise (without a glob pattern) the default pandas implementation is used.
Parameters
----------
