Skip to content

Commit

Permalink
DOCS-#3747: Improve documentation of IO part (#3798)
Browse files Browse the repository at this point in the history
Co-authored-by: Devin Petersohn <devin-petersohn@users.noreply.github.com>
Co-authored-by: Yaroslav Igoshev <Poolliver868@mail.ru>
Signed-off-by: Alexander Myskov <alexander.myskov@intel.com>
  • Loading branch information
3 people committed Dec 14, 2021
1 parent 689faee commit ffa67c7
Show file tree
Hide file tree
Showing 4 changed files with 29 additions and 34 deletions.
32 changes: 24 additions & 8 deletions docs/developer/architecture.rst
Original file line number Diff line number Diff line change
Expand Up @@ -168,23 +168,39 @@ The base storage format in Modin is pandas. In the default case, the Modin Dataf
Data Ingress
''''''''''''

.. note::
Data ingress operations (e.g. ``read_csv``) in Modin load data from the source into
partitions and vice versa for data egress (e.g. ``to_csv``) operation.
Improved performance is achieved by reading/writing in partitions in parallel.

Data ingress starts with a function in the pandas API layer (e.g. ``read_csv``). Then the user's
query is passed to the :doc:`Factory Dispatcher </flow/modin/core/execution/dispatching>`,
which defines a factory specific for the execution. The factory for execution contains an IO class
(e.g. ``PandasOnRayIO``) whose responsibility is to perform a parallel read/write from/to a file.
This IO class contains class methods with interfaces and names that are similar to pandas IO functions
(e.g. ``PandasOnRayIO.read_csv``). The IO class declares the Modin Dataframe and Query Compiler
classes specific for the execution engine and storage format to ensure the correct object is constructed.
It also declares IO methods that are mix-ins containing a combination of the engine-specific class for
deploying remote tasks, the class for parsing the given file format and the class handling the chunking
of the format-specific file on the head node (see dispatcher classes implementation
:doc:`details </flow/modin/core/io/index>`). The output from the IO class data ingress function is
a :doc:`Modin Dataframe </flow/modin/core/dataframe/pandas/dataframe>`.

.. image:: /img/generic_data_ingress.svg
:align: center

Data Egress
'''''''''''

Data egress operations (e.g. ``to_csv``) are similar to data ingress operations up to
execution-specific IO class functions construction. Data egress functions of the IO class
are defined slightly different from data ingress functions and created only
specifically for the engine since partitions already have information about its storage
format. Using the IO class, data is exported from partitions to the target file.

.. image:: /img/generic_data_egress.svg
:align: center

Factory Dispatcher
""""""""""""""""""

The :doc:`Factory Dispatcher </flow/modin/core/execution/dispatching>` provides IO methods with interfaces that correspond to pandas IO functions,
and is in charge of routing IO calls to a factory corresponding to the selected execution.
The factory, in turn, contains the specific IO class, which it calls with the required arguments.
The IO class is responsible for performing a parallel read/write from/to a file.

Supported Execution Engines and Storage Formats
'''''''''''''''''''''''''''''''''''''''''''''''

Expand Down
4 changes: 2 additions & 2 deletions docs/flow/modin/core/execution/dispatching.rst
Original file line number Diff line number Diff line change
Expand Up @@ -35,7 +35,7 @@ underlying IO module. For more information about IO module visit :doc:`related d

Factory Dispatcher
''''''''''''''''''
The ``modin.core.execution.dispatching.factories.dispatcher.FactoryDispatcher`` class provides
The :py:class:`~modin.core.execution.dispatching.factories.dispatcher.FactoryDispatcher` class provides
public methods whose interface corresponds to pandas IO functions, the only difference is that they return `QueryCompiler` of the
selected storage format instead of high-level :py:class:`~modin.pandas.dataframe.DataFrame`. ``FactoryDispatcher`` is responsible for routing
these IO calls to the factory which represents the selected execution.
Expand All @@ -46,7 +46,7 @@ trace would be the following:
.. figure:: /img/factory_dispatching.svg
:align: center

``modin.pandas.read_csv`` calls ``FactoryDispatcher.read_csv``, which calls ``.read_csv``
``modin.pandas.read_csv`` calls ``FactoryDispatcher.read_csv``, which calls ``._read_csv``
function of the factory of the selected execution, in our case it's ``PandasOnRayFactory._read_csv``,
which in turn forwards this call to the actual implementation of ``read_csv`` — to the
``PandasOnRayIO.read_csv``. The result of ``modin.pandas.read_csv`` will return a high-level Modin
Expand Down
27 changes: 3 additions & 24 deletions docs/flow/modin/core/io/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -3,33 +3,12 @@
IO Module Description
"""""""""""""""""""""

High-Level Data Import Operation Workflow
'''''''''''''''''''''''''''''''''''''''''

.. note::
``read_csv`` on PandasOnRay execution was taken as an example
in this chapter for reader convenience. For other import functions workflow and
classes/functions naming convension will be the same.

Data import operation workflow diagram is shown below. After user calls high-level
``modin.pandas.read_csv`` function, call is forwarded to the ``FactoryDispatcher``,
which defines which factory from ``modin.core.execution.dispatching.factories.factories`` and
execution specific IO class should be used (for Ray engine and pandas in-memory format
IO class will be named ``PandasOnRayIO``). This class defines Modin frame and query
compiler classes and ``read_*`` functions, which could be based on the following
classes: ``RayTask`` - class for managing remote tasks by concrete distribution
engine, ``PandasCSVParser`` - class for data parsing on the workers by specific
execution engine and ``CSVDispatcher`` - class for files handling of concrete file format
including chunking that is executed on the head node.

.. image:: /img/data_import_workflow.png
:align: center

Dispatcher Classes Workflow Overview
''''''''''''''''''''''''''''''''''''

Call from ``read_csv`` function of ``PandasOnRayIO`` class is forwarded to the
``_read`` function of ``CSVDispatcher`` class, where function parameters are
Call from ``read_*`` function of execution-specific IO class (for example, ``PandasOnRayIO`` for
Ray engine and pandas storage format) is forwarded to the ``_read`` function of file
format-specific class (for example ``CSVDispatcher`` for CSV files), where function parameters are
preprocessed to check if they are supported (otherwise default pandas implementation
is used) and compute some metadata common for all partitions. Then file is splitted
into chunks (mechanism of splitting is described below) and using this data, tasks
Expand Down
Binary file removed docs/img/data_import_workflow.png
Binary file not shown.

0 comments on commit ffa67c7

Please sign in to comment.