DOCS-#3747: Improve documentation of IO part (#3798)

Co-authored-by: Devin Petersohn <devin-petersohn@users.noreply.github.com>
Co-authored-by: Yaroslav Igoshev <Poolliver868@mail.ru>
Signed-off-by: Alexander Myskov <alexander.myskov@intel.com>
modin-project · Dec 14, 2021 · ffa67c7
1 parent 689faee

Showing 4 changed files with 29 additions and 34 deletions.
diff --git a/docs/developer/architecture.rst b/docs/developer/architecture.rst
@@ -168,23 +168,39 @@ The base storage format in Modin is pandas. In the default case, the Modin Dataf
 Data Ingress
 ''''''''''''
 
+.. note::
+   Data ingress operations (e.g. ``read_csv``) in Modin load data from the source into
+   partitions; data egress operations (e.g. ``to_csv``) write data from partitions to the target.
+   Performance improves because partitions are read/written in parallel.
+
+Data ingress starts with a function in the pandas API layer (e.g. ``read_csv``). Then the user's 
+query is passed to the :doc:`Factory Dispatcher </flow/modin/core/execution/dispatching>`,
+which defines a factory specific for the execution. The factory for execution contains an IO class
+(e.g. ``PandasOnRayIO``) whose responsibility is to perform a parallel read/write from/to a file.
+This IO class contains class methods with interfaces and names that are similar to pandas IO functions
+(e.g. ``PandasOnRayIO.read_csv``). The IO class declares the Modin Dataframe and Query Compiler
+classes specific for the execution engine and storage format to ensure the correct object is constructed.
+It also declares IO methods that are mix-ins containing a combination of the engine-specific class for
+deploying remote tasks, the class for parsing the given file format and the class handling the chunking
+of the format-specific file on the head node (see dispatcher classes implementation
+:doc:`details </flow/modin/core/io/index>`). The output from the IO class data ingress function is
+a :doc:`Modin Dataframe </flow/modin/core/dataframe/pandas/dataframe>`.
+
 .. image:: /img/generic_data_ingress.svg
    :align: center
 
 Data Egress
 '''''''''''
 
+Data egress operations (e.g. ``to_csv``) follow the same path as data ingress operations up to
+the construction of the execution-specific IO class functions. Data egress functions of the IO class
+are defined slightly differently from data ingress functions: they are created
+only for the engine, since partitions already carry information about their storage
+format. Using the IO class, data is exported from partitions to the target file.
+
 .. image:: /img/generic_data_egress.svg
    :align: center
 
-Factory Dispatcher
-""""""""""""""""""
-
-The :doc:`Factory Dispatcher </flow/modin/core/execution/dispatching>` provides IO methods with interfaces that correspond to pandas IO functions,
-and is in charge of routing IO calls to a factory corresponding to the selected execution.
-The factory, in turn, contains the specific IO class, which it calls with the required arguments.
-The IO class is responsible for performing a parallel read/write from/to a file.
-
 Supported Execution Engines and Storage Formats
 '''''''''''''''''''''''''''''''''''''''''''''''
 

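Note on the ``architecture.rst`` hunk above: the new text says that each IO method is a mix-in combining an engine-specific class for deploying remote tasks, a parser class and a file-format class that chunks the input on the head node. The following is a minimal sketch of that composition pattern only. All class names (``ThreadTask``, ``ToyCSVParser``, ``ToyCSVDispatcher``, ``ToyIO``) are hypothetical stand-ins, a thread pool replaces a real engine, and the input is an in-memory string rather than a file; it is not Modin's implementation.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor


class ThreadTask:
    """Hypothetical engine class: deploys a function as a 'remote' task."""

    _pool = ThreadPoolExecutor(max_workers=4)

    @classmethod
    def deploy(cls, func, *args):
        return cls._pool.submit(func, *args)

    @classmethod
    def materialize(cls, future):
        return future.result()


class ToyCSVParser:
    """Hypothetical parser class: turns one text chunk into rows on a worker."""

    @staticmethod
    def parse(chunk, header):
        reader = csv.reader(io.StringIO(chunk))
        return [dict(zip(header, row)) for row in reader]


class ToyCSVDispatcher:
    """Hypothetical file-format class: chunks the input on the head node."""

    @classmethod
    def read(cls, text, num_partitions=2):
        header_line, *lines = text.strip().splitlines()
        header = header_line.split(",")
        size = max(1, -(-len(lines) // num_partitions))  # ceiling division
        chunks = ["\n".join(lines[i:i + size]) for i in range(0, len(lines), size)]
        # One task per chunk; each task parses its chunk independently.
        futures = [cls.deploy(cls.parse, chunk, header) for chunk in chunks]
        return [cls.materialize(f) for f in futures]


class ToyIO:
    """Hypothetical execution-specific IO class built from the three mix-ins."""

    read_csv = type("", (ThreadTask, ToyCSVParser, ToyCSVDispatcher), {}).read


partitions = ToyIO.read_csv("a,b\n1,2\n3,4\n5,6\n", num_partitions=2)
```

Here ``partitions`` is a list of row lists, one per chunk. Swapping ``ThreadTask`` for another task class would change how the work is deployed without touching the parsing or chunking code, which is the design point the paragraph describes.
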
diff --git a/docs/flow/modin/core/execution/dispatching.rst b/docs/flow/modin/core/execution/dispatching.rst
@@ -35,7 +35,7 @@ underlying IO module. For more information about IO module visit :doc:`related d
 
 Factory Dispatcher
 ''''''''''''''''''
-The ``modin.core.execution.dispatching.factories.dispatcher.FactoryDispatcher`` class provides 
+The :py:class:`~modin.core.execution.dispatching.factories.dispatcher.FactoryDispatcher` class provides 
 public methods whose interface corresponds to pandas IO functions; the only difference is that they return the ``QueryCompiler`` of the
 selected storage format instead of high-level :py:class:`~modin.pandas.dataframe.DataFrame`. ``FactoryDispatcher`` is responsible for routing
 these IO calls to the factory which represents the selected execution.
@@ -46,7 +46,7 @@ trace would be the following:
 .. figure:: /img/factory_dispatching.svg
     :align: center
 
-``modin.pandas.read_csv`` calls ``FactoryDispatcher.read_csv``, which calls ``.read_csv``
+``modin.pandas.read_csv`` calls ``FactoryDispatcher.read_csv``, which calls ``._read_csv``
 function of the factory of the selected execution, in our case it's ``PandasOnRayFactory._read_csv``,
 which in turn forwards this call to the actual implementation of ``read_csv`` — to the
 ``PandasOnRayIO.read_csv``. The result of ``modin.pandas.read_csv`` will return a high-level Modin

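Note on the ``dispatching.rst`` hunk above: the call trace it describes (API-layer ``read_csv``, then ``FactoryDispatcher.read_csv``, then the factory's ``_read_csv``, then the IO class's ``read_csv``) can be sketched with plain classes. Every name prefixed with ``Sketch`` is hypothetical, the "query compiler" is only a string, and the factory lookup is a simple dictionary assumed for illustration.

```python
class SketchIO:
    """Stands in for an execution-specific IO class such as ``PandasOnRayIO``."""

    @classmethod
    def read_csv(cls, path):
        # A real IO class would read the file in parallel; here we only tag the call.
        return f"QueryCompiler<{path}>"


class SketchFactory:
    """Stands in for a factory such as ``PandasOnRayFactory``."""

    io_cls = SketchIO

    @classmethod
    def _read_csv(cls, path):
        # The factory forwards the call to its IO class.
        return cls.io_cls.read_csv(path)


class SketchDispatcher:
    """Stands in for ``FactoryDispatcher``: routes IO calls to the selected factory."""

    _factories = {"PandasOnRay": SketchFactory}
    _selected = "PandasOnRay"

    @classmethod
    def read_csv(cls, path):
        return cls._factories[cls._selected]._read_csv(path)


class SketchDataFrame:
    """Stands in for the high-level DataFrame that wraps a query compiler."""

    def __init__(self, query_compiler):
        self._query_compiler = query_compiler


def read_csv(path):
    """API-layer function: wraps the dispatcher's result in a high-level object."""
    return SketchDataFrame(query_compiler=SketchDispatcher.read_csv(path))


df = read_csv("data.csv")
```

The dispatcher returns the low-level object and only the API layer builds the high-level one, mirroring the distinction the hunk draws between ``QueryCompiler`` and ``DataFrame``.
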
diff --git a/docs/flow/modin/core/io/index.rst b/docs/flow/modin/core/io/index.rst
@@ -3,33 +3,12 @@
 IO Module Description
 """""""""""""""""""""
 
-High-Level Data Import Operation Workflow
-'''''''''''''''''''''''''''''''''''''''''
-
-.. note:: 
-    ``read_csv`` on PandasOnRay execution was taken as an example
-    in this chapter for reader convenience. For other import functions workflow and
-    classes/functions naming convention will be the same.
-
-Data import operation workflow diagram is shown below. After user calls high-level
-``modin.pandas.read_csv`` function, call is forwarded to the ``FactoryDispatcher``,
-which defines which factory from ``modin.core.execution.dispatching.factories.factories`` and
-execution specific IO class should be used (for Ray engine and pandas in-memory format
-IO class will be named ``PandasOnRayIO``). This class defines Modin frame and query
-compiler classes and ``read_*`` functions, which could be based on the following
-classes: ``RayTask`` - class for managing remote tasks by concrete distribution
-engine, ``PandasCSVParser`` - class for data parsing on the workers by specific
-execution engine and ``CSVDispatcher`` - class for files handling of concrete file format
-including chunking that is executed on the head node.
-
-.. image:: /img/data_import_workflow.png
-   :align: center
-
 Dispatcher Classes Workflow Overview
 ''''''''''''''''''''''''''''''''''''
 
-Call from ``read_csv`` function of ``PandasOnRayIO`` class is forwarded to the
-``_read`` function of ``CSVDispatcher`` class, where function parameters are
+A call to a ``read_*`` function of the execution-specific IO class (for example, ``PandasOnRayIO`` for
+the Ray engine and pandas storage format) is forwarded to the ``_read`` function of the file
+format-specific class (for example, ``CSVDispatcher`` for CSV files), where function parameters are
 preprocessed to check if they are supported (otherwise the default pandas implementation
 is used) and some metadata common to all partitions is computed. Then the file is split
 into chunks (the splitting mechanism is described below) and, using this data, tasks

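Note on the ``io/index.rst`` hunk above: the text mentions that the file is split into chunks on the head node, but the splitting mechanism itself lies outside this diff. One common approach, assumed here purely for illustration, is to pick byte offsets near evenly spaced positions and then advance each boundary to the next line end so that no record is cut in half. The function name ``split_into_chunks`` is hypothetical, and header handling and quoted newlines are deliberately ignored.

```python
import io


def split_into_chunks(data: bytes, num_chunks: int):
    """Return (start, end) byte offsets whose boundaries fall on line ends.

    May return fewer than ``num_chunks`` ranges when lines are long
    relative to the target chunk size.
    """
    target = max(1, len(data) // num_chunks)
    stream = io.BytesIO(data)
    offsets = []
    start = 0
    while start < len(data):
        # Jump roughly one chunk ahead, then finish the line we landed in.
        stream.seek(min(start + target, len(data)))
        stream.readline()
        end = stream.tell()
        offsets.append((start, end))
        start = end
    return offsets


data = b"a,b\n1,2\n3,4\n5,6\n"
offsets = split_into_chunks(data, num_chunks=2)
chunks = [data[s:e] for s, e in offsets]
```

Each ``(start, end)`` pair could then be handed to a separate parsing task, so workers read only their own byte range.
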
diff --git a/docs/img/data_import_workflow.png b/docs/img/data_import_workflow.png