
Add documentation about reading local files in a cluster environment #3509

Closed

jmakov opened this issue Oct 3, 2021 · 7 comments
Labels
documentation 📜 Updates and issues with the documentation

Comments

@jmakov

jmakov commented Oct 3, 2021

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 21.04
  • Modin version (modin.__version__): 0.10.2
  • Python version: 3.7
  • Code we can use to reproduce:
import modin.pandas as pd  # produces the traceback below
# import pandas as pd  # plain pandas works

import ray

# Connect to the existing Ray cluster (password redacted).
ray.init(address='auto', _redis_password='xxx')

pd.read_parquet("abs_path_to_parquet_df")

Describe the problem

---------------------------------------------------------------------------
RayTaskError(FileNotFoundError)           Traceback (most recent call last)
/tmp/ipykernel_192025/3678036683.py in <module>
      8     pairs = pickle.load(f)
      9 score_matrix = pd.read_parquet(path + "score_matrix_" + SAMPLE_RESOLUTION_RULE + ".parquet")
---> 10 pvalue_matrix = pd.read_parquet(path + "pvalue_matrix_" + SAMPLE_RESOLUTION_RULE + ".parquet")

~/miniconda3/envs/test/lib/python3.7/site-packages/modin/pandas/io.py in read_parquet(path, engine, columns, storage_options, use_nullable_dtypes, **kwargs)
    216             storage_options=storage_options,
    217             use_nullable_dtypes=use_nullable_dtypes,
--> 218             **kwargs,
    219         )
    220     )

~/miniconda3/envs/test/lib/python3.7/site-packages/modin/data_management/factories/dispatcher.py in read_parquet(cls, **kwargs)
    165     @_inherit_docstrings(factories.BaseFactory._read_parquet)
    166     def read_parquet(cls, **kwargs):
--> 167         return cls.__factory._read_parquet(**kwargs)
    168 
    169     @classmethod

~/miniconda3/envs/test/lib/python3.7/site-packages/modin/data_management/factories/factories.py in _read_parquet(cls, **kwargs)
    194     )
    195     def _read_parquet(cls, **kwargs):
--> 196         return cls.io_cls.read_parquet(**kwargs)
    197 
    198     @classmethod

~/miniconda3/envs/test/lib/python3.7/site-packages/modin/engines/base/io/file_dispatcher.py in read(cls, *args, **kwargs)
     65         postprocessing work on the resulting query_compiler object.
     66         """
---> 67         query_compiler = cls._read(*args, **kwargs)
     68         # TODO (devin-petersohn): Make this section more general for non-pandas kernel
     69         # implementations.

~/miniconda3/envs/test/lib/python3.7/site-packages/modin/engines/base/io/column_stores/parquet_dispatcher.py in _read(cls, path, engine, columns, **kwargs)
    134                     column_names = [c for c in column_names if c not in index_columns]
    135             columns = [name for name in column_names if not PQ_INDEX_REGEX.match(name)]
--> 136         return cls.build_query_compiler(path, columns, **kwargs)

~/miniconda3/envs/test/lib/python3.7/site-packages/modin/engines/base/io/column_stores/column_store_dispatcher.py in build_query_compiler(cls, path, columns, **kwargs)
    218         index, row_lens = cls.build_index(partition_ids)
    219         remote_parts = cls.build_partition(partition_ids[:-2], row_lens, column_widths)
--> 220         dtypes = cls.build_dtypes(partition_ids[-1], columns)
    221         new_query_compiler = cls.query_compiler_cls(
    222             cls.frame_cls(

~/miniconda3/envs/test/lib/python3.7/site-packages/modin/engines/base/io/column_stores/column_store_dispatcher.py in build_dtypes(cls, partition_ids, columns)
    191             Series with dtypes for columns.
    192         """
--> 193         dtypes = pandas.concat(cls.materialize(list(partition_ids)), axis=0)
    194         dtypes.index = columns
    195         return dtypes

~/miniconda3/envs/test/lib/python3.7/site-packages/modin/engines/ray/task_wrapper.py in materialize(cls, obj_id)
     80             Whatever was identified by `obj_id`.
     81         """
---> 82         return ray.get(obj_id)

~/miniconda3/envs/test/lib/python3.7/site-packages/ray/_private/client_mode_hook.py in wrapper(*args, **kwargs)
     80         if client_mode_should_convert():
     81             return getattr(ray, func.__name__)(*args, **kwargs)
---> 82         return func(*args, **kwargs)
     83 
     84     return wrapper

~/miniconda3/envs/test/lib/python3.7/site-packages/ray/worker.py in get(object_refs, timeout)
   1619                     worker.core_worker.dump_object_store_memory_usage()
   1620                 if isinstance(value, RayTaskError):
-> 1621                     raise value.as_instanceof_cause()
   1622                 else:
   1623                     raise value

RayTaskError(FileNotFoundError): ray::deploy_ray_func() (pid=1831567, ip=192.168.0.101)
  File "/home/test/miniconda3/envs/test/lib/python3.7/site-packages/modin/engines/ray/task_wrapper.py", line 40, in deploy_ray_func
    return func(**args)
  File "/home/test/miniconda3/envs/test/lib/python3.7/site-packages/modin/backends/pandas/parsers.py", line 595, in parse
    df = pandas.read_parquet(fname, **kwargs)
  File "/home/test/miniconda3/envs/test/lib/python3.7/site-packages/pandas/io/parquet.py", line 500, in read_parquet
    **kwargs,
  File "/home/test/miniconda3/envs/test/lib/python3.7/site-packages/pandas/io/parquet.py", line 236, in read
    mode="rb",
  File "/home/test/miniconda3/envs/test/lib/python3.7/site-packages/pandas/io/parquet.py", line 102, in _get_path_or_handle
    path_or_handle, mode, is_text=False, storage_options=storage_options
  File "/home/test/miniconda3/envs/test/lib/python3.7/site-packages/pandas/io/common.py", line 710, in get_handle
    handle = open(handle, ioargs.mode)
FileNotFoundError: [Errno 2] No such file or directory: parquet_df.parquet
@devin-petersohn
Collaborator

Hi @jmakov, thanks for posting! It looks like some of the workers cannot find the file at that path.

Do all workers have access to the file? Is this file on a shared filesystem or locally on the driver?
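One quick way to check is to ask each node whether it can see the file. A minimal diagnostic sketch using only standard Ray APIs, assuming a running cluster; the path and the task count are placeholders, not from this issue:

import os

import ray

ray.init(address="auto")  # connect to the existing cluster

@ray.remote
def can_see(path):
    # Report this worker's node IP and whether the path exists there.
    return ray.util.get_node_ip_address(), os.path.exists(path)

path = "/abs/path/to/parquet_df.parquet"  # placeholder path
# Launch more tasks than nodes so each node is likely sampled at least once.
results = set(ray.get([can_see.remote(path) for _ in range(32)]))
for ip, visible in sorted(results):
    print(ip, "sees the file" if visible else "does NOT see the file")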

@jmakov
Author

jmakov commented Oct 4, 2021

Thanks for the quick response. The file is only available locally on the single node it's being read from. Does Modin require that all the nodes have access?

@devin-petersohn
Collaborator

Does Modin require that all the nodes have access?

That's the current assumption in Modin's parser logic: the file is expected to live on storage every node can reach (S3, NFS, etc.). We could hypothetically add a way to do this; is it something that's important to your workflow?

Happy to discuss more about your workflow if you'd like to email me (devin.petersohn@gmail.com)
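For files that exist only on the driver, one workaround is to read them with plain pandas and let Modin distribute the result, or to stage them on storage every node can reach. A minimal sketch; the paths below are placeholders, not from this issue:

import pandas
import modin.pandas as pd

# Option 1: read on the driver with plain pandas, then hand the
# already-loaded frame to Modin to distribute across the cluster.
local_df = pandas.read_parquet("/abs/path/only_on_driver.parquet")
df = pd.DataFrame(local_df)

# Option 2: stage the file on shared storage (S3, NFS, ...) that every
# node can reach, then read it with Modin directly.
# df = pd.read_parquet("s3://my-bucket/parquet_df.parquet")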

@jmakov
Author

jmakov commented Oct 4, 2021

No, it's not. I just didn't expect this and assumed it was a bug. It would be great to have a note in the docs.

@jmakov jmakov closed this as completed Oct 4, 2021
@devin-petersohn devin-petersohn changed the title Cannot read local dataframe saved as parquet Add documentation about reading local files in a cluster environment Oct 4, 2021
@devin-petersohn
Collaborator

Would be great to have a note in the docs.

Agreed. I am going to reopen this to track that task, if you don't mind.

@anmyachev anmyachev added the documentation 📜 Updates and issues with the documentation label Apr 21, 2022
@RehanSD
Collaborator

RehanSD commented Oct 12, 2022

Related to #4479!

@mvashishtha
Collaborator

Duplicate of #4479

@mvashishtha mvashishtha marked this as a duplicate of #4479 Oct 12, 2022