DOCS-#3811: improve documentation for experimental 'pandas on Ray' (#3819)

Co-authored-by: Yaroslav Igoshev <Poolliver868@mail.ru>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
anmyachev and YarShev committed Dec 16, 2021
1 parent 6ca66db commit 20abddd
Showing 6 changed files with 52 additions and 23 deletions.
10 changes: 10 additions & 0 deletions docs/advanced_usage/index.rst
@@ -20,6 +20,16 @@ integrated toolkit for data scientists. We are actively developing data science
such as DataFrame - spreadsheet integration, DataFrame algebra, progress bars, SQL queries
on DataFrames, and more. Join the `Discourse`_ for the latest updates!

Experimental APIs
-----------------

Modin also supports the following experimental APIs on top of pandas, which are under active development; a short usage sketch follows the list.

- :py:func:`~modin.experimental.pandas.read_csv_glob` -- read multiple files in a directory
- :py:func:`~modin.experimental.pandas.read_sql` -- add optional parameters for the database connection
- :py:func:`~modin.experimental.pandas.read_pickle_distributed` -- read multiple files in a directory
- :py:meth:`~modin.experimental.pandas.DataFrame.to_pickle_distributed` -- write to multiple files in a directory
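
A minimal usage sketch of these APIs, assuming a supported engine such as Ray is
installed; the file paths are hypothetical:

.. code-block:: python

   # Importing the experimental API enables Modin's experimental mode.
   import modin.experimental.pandas as pd

   # Read every CSV file matching the glob pattern into a single DataFrame.
   df = pd.read_csv_glob("data/records_*.csv")

   # Write one pickle file per partition, then read the parts back.
   df.to_pickle_distributed("output/records_*.pkl")
   df = pd.read_pickle_distributed("output/records_*.pkl")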

Modin Spreadsheet API: Render Dataframes as Spreadsheets
--------------------------------------------------------
The Spreadsheet API for Modin allows you to render the dataframe as a spreadsheet to easily explore
6 changes: 5 additions & 1 deletion docs/developer/architecture.rst
@@ -220,6 +220,10 @@ documentation page on :doc:`contributing </developer/contributing>`.
- Uses native python execution - mainly used for debugging.
- The storage format is `pandas` and the in-memory partition type is a pandas DataFrame.
- For more information on the execution path, see the :doc:`pandas on Python </flow/modin/core/execution/python/implementations/pandas_on_python/index>` page.
- :doc:`pandas on Ray` (experimental)
- Uses the Ray_ execution framework.
- The storage format is `pandas` and the in-memory partition type is a pandas DataFrame.
- For more information on the execution path, see the :doc:`experimental pandas on Ray </flow/modin/experimental/core/execution/ray/implementations/pandas_on_ray/index>` page.
- :doc:`OmniSci on Native </developer/using_omnisci>` (experimental)
- Uses OmniSciDB as an engine.
- The storage format is `omnisci` and the in-memory partition type is a pyarrow Table. When defaulting to pandas, the pandas DataFrame is used.
@@ -326,7 +330,7 @@ details. The documentation covers most modules, with more docs being added every
│ │ │ │ │ └─── :doc:`omnisci_on_native </flow/modin/experimental/core/execution/native/implementations/omnisci_on_native/index>`
│ │ │ │ └───ray
│ │ │ │ └───implementations
│ │ │ │ ├─── :doc:`pandas_on_ray </flow/modin/experimental/core/execution/ray/implementations/pandas_on_ray>`
│ │ │ │ ├─── :doc:`pandas_on_ray </flow/modin/experimental/core/execution/ray/implementations/pandas_on_ray/index>`
│ │ │ │ └─── :doc:`pyarrow_on_ray </flow/modin/experimental/core/execution/ray/implementations/pyarrow_on_ray>`
│ │ │ └─── :doc:`storage_formats </flow/modin/experimental/core/storage_formats/index>`
│ │ │ └─── :doc:`omnisci </flow/modin/experimental/core/storage_formats/omnisci/index>`
@@ -0,0 +1,22 @@
:orphan:

ExperimentalPandasOnRay Execution
=================================

`ExperimentalPandasOnRay` execution keeps the underlying mechanisms of :doc:`PandasOnRay </flow/modin/core/execution/ray/implementations/pandas_on_ray/index>`
execution architecturally unchanged and adds experimental ``Data Transformation``, ``Data Ingress`` and ``Data Egress`` features (e.g. :py:func:`~modin.experimental.pandas.read_pickle_distributed`).

PandasOnRay and ExperimentalPandasOnRay differences
---------------------------------------------------

- a different factory: ``PandasOnRayFactory`` -> ``ExperimentalPandasOnRayFactory``
- a different IO class: ``PandasOnRayIO`` -> ``ExperimentalPandasOnRayIO``

ExperimentalPandasOnRayIO classes and modules
---------------------------------------------

- :py:class:`~modin.experimental.core.execution.ray.implementations.pandas_on_ray.io.io.ExperimentalPandasOnRayIO`
- :py:class:`~modin.core.execution.dispatching.factories.factories.ExperimentalPandasOnRayFactory`
- :py:class:`~modin.core.io.text.csv_glob_dispatcher.CSVGlobDispatcher`
- :py:class:`~modin.core.storage_formats.pandas.parsers.PandasCSVGlobParser`
- :doc:`ExperimentalPandasOnRay IO module </flow/modin/experimental/core/execution/ray/implementations/pandas_on_ray/io/index>`
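
The following sketch illustrates how this selection surfaces to the user: importing the
experimental API routes execution through the ``Experimental*`` factory and IO classes
listed above, assuming Ray is installed and selected as the engine.

.. code-block:: python

   import modin.config as cfg
   import modin.experimental.pandas as pd  # enables experimental mode

   assert cfg.IsExperimental.get()  # experimental execution is now active
   print(cfg.Engine.get())          # e.g. "Ray" when Ray is the selected engine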
@@ -1,7 +1,7 @@
:orphan:

Pandas-on-Ray Module Description
""""""""""""""""""""""""""""""""
IO Module Description for Pandas-on-Ray Execution
""""""""""""""""""""""""""""""""""""""""""""""""""

High-Level Module Overview
''''''''''''''''''''''''''
@@ -22,20 +22,6 @@ statement as follows:
# import modin.pandas as pd
import modin.experimental.pandas as pd
Implemented Operations
''''''''''''''''''''''

For now, :py:class:`~modin.experimental.core.execution.ray.implementations.pandas_on_ray.io.io.ExperimentalPandasOnRayIO`
implements two methods - :meth:`~modin.experimental.core.execution.ray.implementations.pandas_on_ray.io.io.ExperimentalPandasOnRayIO.read_sql` and
:meth:`~modin.experimental.core.execution.ray.implementations.pandas_on_ray.io.io.ExperimentalPandasOnRayIO.read_csv_glob`.
The first method lets the user call the familiar ``pandas.read_sql`` function extended
with `Spark-like parameters <https://spark.apache.org/docs/2.0.0/api/R/read.jdbc.html>`_
such as ``partition_column``, ``lower_bound`` and ``upper_bound``. With these
parameters, the user can specify how to partition the imported data.
The second method allows reading multiple CSV files simultaneously when a
`Python glob <https://docs.python.org/3/library/glob.html>`_ pattern is
provided as a parameter.
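
For illustration, a call to the extended ``read_sql`` might look as follows; the
connection string, table and column names are hypothetical, and the ``max_sessions``
keyword (number of parallel connections) is an assumed addition to the parameters
named above:

.. code-block:: python

   import modin.experimental.pandas as pd

   df = pd.read_sql(
       "SELECT * FROM orders",        # query or table name
       "sqlite:///orders.db",         # connection string
       partition_column="order_id",   # numeric column used to split the data
       lower_bound=0,                 # minimum value of partition_column
       upper_bound=1_000_000,         # maximum value of partition_column
       max_sessions=10,               # assumed: number of parallel connections
   )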

Submodules Description
''''''''''''''''''''''

@@ -123,6 +123,10 @@ def read_sql(
"""
Read SQL query or database table into a DataFrame.
The function is extended with `Spark-like parameters <https://spark.apache.org/docs/2.0.0/api/R/read.jdbc.html>`_
such as ``partition_column``, ``lower_bound`` and ``upper_bound``. With these
parameters, the user can specify how to partition the imported data.
Parameters
----------
sql : str or SQLAlchemy Selectable (select or text object)
15 changes: 9 additions & 6 deletions modin/experimental/pandas/io.py
@@ -231,10 +231,9 @@ def read_pickle_distributed(
"""
Load pickled pandas object from files.
In experimental mode, we can use `*` in the filename. The files must contain
parts of one dataframe, which can be obtained, for example, by the
`to_pickle_distributed` function.
Note: the number of partitions is equal to the number of input files.
This experimental feature provides parallel reading from multiple pickle files that are
defined by a glob pattern. The files must contain parts of one dataframe, which can be
obtained, for example, by the `to_pickle_distributed` function.
Parameters
----------
@@ -256,6 +255,10 @@
Returns
-------
unpickled : same type as object stored in file
Notes
-----
The number of partitions is equal to the number of input files.
"""
Engine.subscribe(_update_engine)
assert IsExperimental.get(), "This only works in experimental mode"
@@ -273,8 +276,8 @@
"""
Pickle (serialize) object to file.
If `*` is in the filename, all partitions are written to their own separate files;
otherwise the default pandas implementation is used.
This experimental feature provides parallel writing into multiple pickle files that are
defined by a glob pattern; otherwise (without a glob pattern) the default pandas implementation is used.
Parameters
----------
