FEAT-#6832: Implement read_xml_glob, to_xml_glob #6930

Merged Feb 14, 2024 (3 commits). Showing changes from 1 commit.

2 changes: 2 additions & 0 deletions docs/flow/modin/experimental/pandas.rst
@@ -15,6 +15,8 @@ Experimental API Reference
.. autofunction:: read_pickle_distributed
.. autofunction:: read_parquet_glob
.. autofunction:: read_json_glob
.. autofunction:: read_xml_glob
.. automethod:: modin.pandas.DataFrame.modin::to_pickle_distributed
.. automethod:: modin.pandas.DataFrame.modin::to_parquet_glob
.. automethod:: modin.pandas.DataFrame.modin::to_json_glob
.. automethod:: modin.pandas.DataFrame.modin::to_xml_glob
5 changes: 5 additions & 0 deletions docs/supported_apis/dataframe_supported.rst
@@ -414,6 +414,10 @@ default to pandas.
| | | | Experimental implementation: |
| | | | DataFrame.modin.to_json_glob |
+----------------------------+---------------------------+------------------------+----------------------------------------------------+
| ``to_xml`` | `to_xml`_ | D | |
| | | | Experimental implementation: |
| | | | DataFrame.modin.to_xml_glob |
+----------------------------+---------------------------+------------------------+----------------------------------------------------+
| ``to_latex`` | `to_latex`_ | D | |
+----------------------------+---------------------------+------------------------+----------------------------------------------------+
| ``to_orc`` | `to_orc`_ | D | |
@@ -651,6 +655,7 @@ default to pandas.
.. _`to_hdf`: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_hdf.html#pandas.DataFrame.to_hdf
.. _`to_html`: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_html.html#pandas.DataFrame.to_html
.. _`to_json`: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_json.html#pandas.DataFrame.to_json
.. _`to_xml`: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_xml.html#pandas.DataFrame.to_xml
.. _`to_latex`: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_latex.html#pandas.DataFrame.to_latex
.. _`to_orc`: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_orc.html#pandas.DataFrame.to_orc
.. _`to_parquet`: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_parquet.html#pandas.DataFrame.to_parquet
2 changes: 2 additions & 0 deletions docs/supported_apis/io_supported.rst
@@ -51,6 +51,8 @@ default to pandas.
| `read_json`_ | P | Implemented for ``lines=True`` |
| | | Experimental implementation: read_json_glob |
+-------------------+---------------------------------+--------------------------------------------------------+
| ``read_xml`` | D | Experimental implementation: read_xml_glob |
+-------------------+---------------------------------+--------------------------------------------------------+
| `read_html`_ | D | |
+-------------------+---------------------------------+--------------------------------------------------------+
| `read_clipboard`_ | D | |
2 changes: 2 additions & 0 deletions docs/usage_guide/advanced_usage/index.rst
@@ -44,9 +44,11 @@ Modin also supports these experimental APIs on top of pandas that are under acti
- :py:func:`~modin.experimental.pandas.read_pickle_distributed` -- read multiple pickle files in a directory
- :py:func:`~modin.experimental.pandas.read_parquet_glob` -- read multiple parquet files in a directory
- :py:func:`~modin.experimental.pandas.read_json_glob` -- read multiple json files in a directory
- :py:func:`~modin.experimental.pandas.read_xml_glob` -- read multiple xml files in a directory
- :py:meth:`~modin.pandas.DataFrame.modin.to_pickle_distributed` -- write to multiple pickle files in a directory
- :py:meth:`~modin.pandas.DataFrame.modin.to_parquet_glob` -- write to multiple parquet files in a directory
- :py:meth:`~modin.pandas.DataFrame.modin.to_json_glob` -- write to multiple json files in a directory
- :py:meth:`~modin.pandas.DataFrame.modin.to_xml_glob` -- write to multiple xml files in a directory

DataFrame partitioning API
--------------------------
@@ -52,6 +52,7 @@
    ExperimentalPandasJsonParser,
    ExperimentalPandasParquetParser,
    ExperimentalPandasPickleParser,
    ExperimentalPandasXmlParser,
)


@@ -105,6 +106,11 @@ def __make_write(*classes, build_args=build_args):
        ExperimentalGlobDispatcher,
        build_args={**build_args, "base_write": BaseIO.to_json},
    )
    read_xml_glob = __make_read(ExperimentalPandasXmlParser, ExperimentalGlobDispatcher)
    to_xml_glob = __make_write(
        ExperimentalGlobDispatcher,
        build_args={**build_args, "base_write": BaseIO.to_xml},
    )
    read_pickle_distributed = __make_read(
        ExperimentalPandasPickleParser, ExperimentalGlobDispatcher
    )
15 changes: 15 additions & 0 deletions modin/core/execution/dispatching/factories/dispatcher.py
@@ -316,6 +316,16 @@
    def to_json_glob(cls, *args, **kwargs):
        return cls.get_factory()._to_json_glob(*args, **kwargs)

    @classmethod
    @_inherit_docstrings(factories.PandasOnRayFactory._read_xml_glob)
    def read_xml_glob(cls, *args, **kwargs):
        return cls.get_factory()._read_xml_glob(*args, **kwargs)

    @classmethod
    @_inherit_docstrings(factories.PandasOnRayFactory._to_xml_glob)
    def to_xml_glob(cls, *args, **kwargs):
        return cls.get_factory()._to_xml_glob(*args, **kwargs)

    @classmethod
    @_inherit_docstrings(factories.PandasOnRayFactory._read_custom_text)
    def read_custom_text(cls, **kwargs):
@@ -331,6 +341,11 @@
    def to_json(cls, *args, **kwargs):
        return cls.get_factory()._to_json(*args, **kwargs)

    @classmethod
    @_inherit_docstrings(factories.BaseFactory._to_xml)
    def to_xml(cls, *args, **kwargs):
        return cls.get_factory()._to_xml(*args, **kwargs)

Codecov / codecov/patch: added line #L347 of modin/core/execution/dispatching/factories/dispatcher.py was not covered by tests.

    @classmethod
    @_inherit_docstrings(factories.BaseFactory._to_parquet)
    def to_parquet(cls, *args, **kwargs):
47 changes: 47 additions & 0 deletions modin/core/execution/dispatching/factories/factories.py
@@ -427,6 +427,20 @@
        """
        return cls.io_cls.to_json(*args, **kwargs)

    @classmethod
    def _to_xml(cls, *args, **kwargs):
        """
        Write query compiler content to an XML file.

        Parameters
        ----------
        *args : args
            Arguments to pass to the writer method.
        **kwargs : kwargs
            Arguments to pass to the writer method.
        """
        return cls.io_cls.to_xml(*args, **kwargs)

Codecov / codecov/patch: added line #L442 of modin/core/execution/dispatching/factories/factories.py was not covered by tests.

    @classmethod
    def _to_parquet(cls, *args, **kwargs):
        """
@@ -596,6 +610,39 @@
            )
        return cls.io_cls.to_json_glob(*args, **kwargs)

    @classmethod
    @doc(
        _doc_io_method_raw_template,
        source="XML files",
        params=_doc_io_method_kwargs_params,
    )
    def _read_xml_glob(cls, **kwargs):
        current_execution = get_current_execution()
        if current_execution not in supported_executions:
            raise NotImplementedError(
                f"`_read_xml_glob()` is not implemented for {current_execution} execution."
            )
        return cls.io_cls.read_xml_glob(**kwargs)

    @classmethod
    def _to_xml_glob(cls, *args, **kwargs):
        """
        Write query compiler content to several XML files.

        Parameters
        ----------
        *args : args
            Arguments to pass to the writer method.
        **kwargs : kwargs
            Arguments to pass to the writer method.
        """
        current_execution = get_current_execution()
        if current_execution not in supported_executions:
            raise NotImplementedError(
                f"`_to_xml_glob()` is not implemented for {current_execution} execution."
            )
        return cls.io_cls.to_xml_glob(*args, **kwargs)


@doc(_doc_factory_class, execution_name="PandasOnRay")
class PandasOnRayFactory(BaseFactory):
@@ -51,6 +51,7 @@
    ExperimentalPandasJsonParser,
    ExperimentalPandasParquetParser,
    ExperimentalPandasPickleParser,
    ExperimentalPandasXmlParser,
)

from ..dataframe import PandasOnRayDataframe
@@ -107,6 +108,11 @@ def __make_write(*classes, build_args=build_args):
        ExperimentalGlobDispatcher,
        build_args={**build_args, "base_write": RayIO.to_json},
    )
    read_xml_glob = __make_read(ExperimentalPandasXmlParser, ExperimentalGlobDispatcher)
    to_xml_glob = __make_write(
        ExperimentalGlobDispatcher,
        build_args={**build_args, "base_write": RayIO.to_xml},
    )
    read_pickle_distributed = __make_read(
        ExperimentalPandasPickleParser, ExperimentalGlobDispatcher
    )
@@ -51,6 +51,7 @@
    ExperimentalPandasJsonParser,
    ExperimentalPandasParquetParser,
    ExperimentalPandasPickleParser,
    ExperimentalPandasXmlParser,
)

from ..dataframe import PandasOnUnidistDataframe
@@ -107,6 +108,11 @@ def __make_write(*classes, build_args=build_args):
        ExperimentalGlobDispatcher,
        build_args={**build_args, "base_write": UnidistIO.to_json},
    )
    read_xml_glob = __make_read(ExperimentalPandasXmlParser, ExperimentalGlobDispatcher)
    to_xml_glob = __make_write(
        ExperimentalGlobDispatcher,
        build_args={**build_args, "base_write": UnidistIO.to_xml},
    )
    read_pickle_distributed = __make_read(
        ExperimentalPandasPickleParser, ExperimentalGlobDispatcher
    )
14 changes: 14 additions & 0 deletions modin/core/io/io.py
@@ -665,6 +665,20 @@ def to_json(cls, obj, path, **kwargs): # noqa: PR01

        return obj.to_json(path, **kwargs)

    @classmethod
    @_inherit_docstrings(pandas.DataFrame.to_xml, apilink="pandas.DataFrame.to_xml")
    def to_xml(cls, obj, path, **kwargs):  # noqa: PR01
        """
        Convert the object to an XML string.

        For a description of the parameters, refer to the pandas API.
        """
        ErrorMessage.default_to_pandas("`to_xml`")
        if isinstance(obj, BaseQueryCompiler):
            obj = obj.to_pandas()

        return obj.to_xml(path, **kwargs)

    @classmethod
    @_inherit_docstrings(
        pandas.DataFrame.to_parquet, apilink="pandas.DataFrame.to_parquet"
4 changes: 4 additions & 0 deletions modin/experimental/core/io/glob/glob_dispatcher.py
@@ -54,6 +54,8 @@ def _read(cls, **kwargs):
            path_key = "path"
        elif "path_or_buf" in kwargs:
            path_key = "path_or_buf"
        elif "path_or_buffer" in kwargs:
            path_key = "path_or_buffer"
        filepath_or_buffer = kwargs.pop(path_key)
        filepath_or_buffer = stringify_path(filepath_or_buffer)
        if not (isinstance(filepath_or_buffer, str) and "*" in filepath_or_buffer):
@@ -123,6 +125,8 @@ def write(cls, qc, **kwargs):
            path_key = "path"
        elif "path_or_buf" in kwargs:
            path_key = "path_or_buf"
        elif "path_or_buffer" in kwargs:
            path_key = "path_or_buffer"
        filepath_or_buffer = kwargs.pop(path_key)
        filepath_or_buffer = stringify_path(filepath_or_buffer)
        if not (
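
The two hunks in glob_dispatcher.py add `path_or_buffer` to the chain of keyword names under which the dispatcher looks for the path, because `pandas.read_xml` uses that name. A minimal standalone sketch of the lookup follows. It covers only the keys visible in the hunks (earlier branches of the chain are outside the shown context), and `VISIBLE_PATH_KEYS`, `pop_path` and `is_glob` are hypothetical names for illustration, not Modin functions.

```python
from typing import Any, Dict, Tuple

# Keys visible in the hunks, in the order they are checked there.
# Assumption: any earlier branches of the real chain are not reproduced here.
VISIBLE_PATH_KEYS = ("path", "path_or_buf", "path_or_buffer")


def pop_path(kwargs: Dict[str, Any]) -> Tuple[str, Any]:
    """Return (key, value) for the first known path keyword and remove it from kwargs."""
    for key in VISIBLE_PATH_KEYS:
        if key in kwargs:
            return key, kwargs.pop(key)
    raise KeyError(f"none of {VISIBLE_PATH_KEYS} found in kwargs")


def is_glob(path: Any) -> bool:
    """Same test as in the read hunk: a string that contains '*'."""
    return isinstance(path, str) and "*" in path


reader_kwargs = {"path_or_buffer": "data/part_*.xml", "parser": "etree"}
key, path = pop_path(reader_kwargs)
print(key, path, is_glob(path))  # prints: path_or_buffer data/part_*.xml True
```

Without the new `elif` branch, a call that passes only `path_or_buffer` would reach `kwargs.pop(path_key)` with no key selected, which is why both `_read` and `write` gain the same two lines.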
18 changes: 18 additions & 0 deletions modin/experimental/core/storage_formats/pandas/parsers.py
@@ -150,6 +150,24 @@ def parse(fname, **kwargs):
        return _split_result_for_readers(1, num_splits, df) + [length, width]


@doc(_doc_pandas_parser_class, data_type="XML files")
class ExperimentalPandasXmlParser(PandasParser):
    @staticmethod
    @doc(_doc_parse_func, parameters=_doc_parse_parameters_common)
    def parse(fname, **kwargs):
        warnings.filterwarnings("ignore")
        num_splits = 1
        single_worker_read = kwargs.pop("single_worker_read", None)
        df = pandas.read_xml(fname, **kwargs)
        if single_worker_read:
            return df

        length = len(df)
        width = len(df.columns)

        return _split_result_for_readers(1, num_splits, df) + [length, width]


@doc(_doc_pandas_parser_class, data_type="custom text")
class ExperimentalCustomTextParser(PandasParser):
    @staticmethod
1 change: 1 addition & 0 deletions modin/experimental/pandas/__init__.py
@@ -44,6 +44,7 @@
    read_parquet_glob,
    read_pickle_distributed,
    read_sql,
    read_xml_glob,
    to_pickle_distributed,
Collaborator:
Do we have an issue to remove the deprecated to_pickle_distributed?

Collaborator Author:
It would be great to deprecate it already in this release.

)

114 changes: 114 additions & 0 deletions modin/experimental/pandas/io.py
@@ -600,3 +600,117 @@ def to_json_glob(
        storage_options=storage_options,
        mode=mode,
    )


@expanduser_path_arg("path_or_buffer")
def read_xml_glob(
    path_or_buffer,
    *,
    xpath="./*",
    namespaces=None,
    elems_only=False,
    attrs_only=False,
    names=None,
    dtype=None,
    converters=None,
    parse_dates=None,
    encoding="utf-8",
    parser="lxml",
    stylesheet=None,
    iterparse=None,
    compression="infer",
    storage_options: StorageOptions = None,
    dtype_backend=lib.no_default,
) -> DataFrame:  # noqa: PR01
    """
    Read XML documents into a DataFrame object.

    This experimental feature provides parallel reading from multiple XML files which are
    defined by a glob pattern. The files must contain parts of one dataframe, which can be
    obtained, for example, with the `DataFrame.modin.to_xml_glob` function.

    Returns
    -------
    DataFrame

    Notes
    -----
    * Only string type is supported for the `path_or_buffer` argument.
    * The rest of the arguments are the same as for `pandas.read_xml`.
    """
    from modin.core.execution.dispatching.factories.dispatcher import FactoryDispatcher

    return DataFrame(
        query_compiler=FactoryDispatcher.read_xml_glob(
            path_or_buffer=path_or_buffer,
            xpath=xpath,
            namespaces=namespaces,
            elems_only=elems_only,
            attrs_only=attrs_only,
            names=names,
            dtype=dtype,
            converters=converters,
            parse_dates=parse_dates,
            encoding=encoding,
            parser=parser,
            stylesheet=stylesheet,
            iterparse=iterparse,
            compression=compression,
            storage_options=storage_options,
            dtype_backend=dtype_backend,
        )
    )


@expanduser_path_arg("path_or_buffer")
def to_xml_glob(
    self,
    path_or_buffer=None,
    index=True,
    root_name="data",
    row_name="row",
    na_rep=None,
    attr_cols=None,
    elem_cols=None,
    namespaces=None,
    prefix=None,
    encoding="utf-8",
    xml_declaration=True,
    pretty_print=True,
    parser="lxml",
    stylesheet=None,
    compression="infer",
    storage_options=None,
) -> None:  # noqa: PR01
    """
    Render a DataFrame to an XML document.

    Notes
    -----
    * Only string type is supported for the `path_or_buffer` argument.
    * The rest of the arguments are the same as for `pandas.DataFrame.to_xml`.
    """
    obj = self
    from modin.core.execution.dispatching.factories.dispatcher import FactoryDispatcher

    if isinstance(self, DataFrame):
        obj = self._query_compiler
    FactoryDispatcher.to_xml_glob(
        obj,
        path_or_buffer=path_or_buffer,
        index=index,
        root_name=root_name,
        row_name=row_name,
        na_rep=na_rep,
        attr_cols=attr_cols,
        elem_cols=elem_cols,
        namespaces=namespaces,
        prefix=prefix,
        encoding=encoding,
        xml_declaration=xml_declaration,
        pretty_print=pretty_print,
        parser=parser,
        stylesheet=stylesheet,
        compression=compression,
        storage_options=storage_options,
    )
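
For reviewers who want to try the two new entry points together, here is a minimal round-trip sketch. It assumes a Modin build that contains this PR and one of the executions accepted by the factory guard. The helper name `xml_glob_roundtrip`, the file pattern `part_*.xml` and the sample data are illustrative only, and `parser="etree"` is chosen so that the sketch does not depend on lxml being installed.

```python
import os


def xml_glob_roundtrip(directory: str):
    """Write a small frame as XML parts, then read them back through one glob pattern."""
    # Imported lazily so the sketch can be loaded without a configured Modin engine.
    import modin.pandas as pd
    import modin.experimental.pandas as pd_exp

    # The "*" is what makes the dispatcher treat the path as a glob.
    pattern = os.path.join(directory, "part_*.xml")
    df = pd.DataFrame({"id": [1, 2, 3, 4], "name": ["a", "b", "c", "d"]})

    # Experimental writer added in this PR, reached through the `modin` accessor.
    df.modin.to_xml_glob(pattern, index=False, parser="etree")

    # Experimental reader added in this PR; other arguments follow pandas.read_xml.
    return pd_exp.read_xml_glob(pattern, parser="etree")
```

Calling `xml_glob_roundtrip(tempfile.mkdtemp())` on a supported execution is expected to return a frame with the `id` and `name` columns. Nothing in this diff states how many files are written or what index the re-read frame carries, so the sketch makes no claim about either.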