
Add documentation about reading local files in a cluster environment #3509

Closed

jmakov opened this issue Oct 3, 2021 · 7 comments
Labels
documentation 📜 Updates and issues with the documentation

Comments

@jmakov

jmakov commented Oct 3, 2021

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 21.04
  • Modin version (modin.__version__): 0.10.2
  • Python version: 3.7
  • Code we can use to reproduce:
import modin.pandas as pd  # produces the traceback below
# import pandas as pd  # plain pandas works

import ray

# Connect to the existing Ray cluster (password redacted).
ray.init(address='auto', _redis_password='xxx')

pd.read_parquet("abs_path_to_parquet_df")

Describe the problem

---------------------------------------------------------------------------
RayTaskError(FileNotFoundError)           Traceback (most recent call last)
/tmp/ipykernel_192025/3678036683.py in <module>
      8     pairs = pickle.load(f)
      9 score_matrix = pd.read_parquet(path + "score_matrix_" + SAMPLE_RESOLUTION_RULE + ".parquet")
---> 10 pvalue_matrix = pd.read_parquet(path + "pvalue_matrix_" + SAMPLE_RESOLUTION_RULE + ".parquet")

~/miniconda3/envs/test/lib/python3.7/site-packages/modin/pandas/io.py in read_parquet(path, engine, columns, storage_options, use_nullable_dtypes, **kwargs)
    216             storage_options=storage_options,
    217             use_nullable_dtypes=use_nullable_dtypes,
--> 218             **kwargs,
    219         )
    220     )

~/miniconda3/envs/test/lib/python3.7/site-packages/modin/data_management/factories/dispatcher.py in read_parquet(cls, **kwargs)
    165     @_inherit_docstrings(factories.BaseFactory._read_parquet)
    166     def read_parquet(cls, **kwargs):
--> 167         return cls.__factory._read_parquet(**kwargs)
    168 
    169     @classmethod

~/miniconda3/envs/test/lib/python3.7/site-packages/modin/data_management/factories/factories.py in _read_parquet(cls, **kwargs)
    194     )
    195     def _read_parquet(cls, **kwargs):
--> 196         return cls.io_cls.read_parquet(**kwargs)
    197 
    198     @classmethod

~/miniconda3/envs/test/lib/python3.7/site-packages/modin/engines/base/io/file_dispatcher.py in read(cls, *args, **kwargs)
     65         postprocessing work on the resulting query_compiler object.
     66         """
---> 67         query_compiler = cls._read(*args, **kwargs)
     68         # TODO (devin-petersohn): Make this section more general for non-pandas kernel
     69         # implementations.

~/miniconda3/envs/test/lib/python3.7/site-packages/modin/engines/base/io/column_stores/parquet_dispatcher.py in _read(cls, path, engine, columns, **kwargs)
    134                     column_names = [c for c in column_names if c not in index_columns]
    135             columns = [name for name in column_names if not PQ_INDEX_REGEX.match(name)]
--> 136         return cls.build_query_compiler(path, columns, **kwargs)

~/miniconda3/envs/test/lib/python3.7/site-packages/modin/engines/base/io/column_stores/column_store_dispatcher.py in build_query_compiler(cls, path, columns, **kwargs)
    218         index, row_lens = cls.build_index(partition_ids)
    219         remote_parts = cls.build_partition(partition_ids[:-2], row_lens, column_widths)
--> 220         dtypes = cls.build_dtypes(partition_ids[-1], columns)
    221         new_query_compiler = cls.query_compiler_cls(
    222             cls.frame_cls(

~/miniconda3/envs/test/lib/python3.7/site-packages/modin/engines/base/io/column_stores/column_store_dispatcher.py in build_dtypes(cls, partition_ids, columns)
    191             Series with dtypes for columns.
    192         """
--> 193         dtypes = pandas.concat(cls.materialize(list(partition_ids)), axis=0)
    194         dtypes.index = columns
    195         return dtypes

~/miniconda3/envs/test/lib/python3.7/site-packages/modin/engines/ray/task_wrapper.py in materialize(cls, obj_id)
     80             Whatever was identified by `obj_id`.
     81         """
---> 82         return ray.get(obj_id)

~/miniconda3/envs/test/lib/python3.7/site-packages/ray/_private/client_mode_hook.py in wrapper(*args, **kwargs)
     80         if client_mode_should_convert():
     81             return getattr(ray, func.__name__)(*args, **kwargs)
---> 82         return func(*args, **kwargs)
     83 
     84     return wrapper

~/miniconda3/envs/test/lib/python3.7/site-packages/ray/worker.py in get(object_refs, timeout)
   1619                     worker.core_worker.dump_object_store_memory_usage()
   1620                 if isinstance(value, RayTaskError):
-> 1621                     raise value.as_instanceof_cause()
   1622                 else:
   1623                     raise value

RayTaskError(FileNotFoundError): ray::deploy_ray_func() (pid=1831567, ip=192.168.0.101)
  File "/home/test/miniconda3/envs/test/lib/python3.7/site-packages/modin/engines/ray/task_wrapper.py", line 40, in deploy_ray_func
    return func(**args)
  File "/home/test/miniconda3/envs/test/lib/python3.7/site-packages/modin/backends/pandas/parsers.py", line 595, in parse
    df = pandas.read_parquet(fname, **kwargs)
  File "/home/test/miniconda3/envs/test/lib/python3.7/site-packages/pandas/io/parquet.py", line 500, in read_parquet
    **kwargs,
  File "/home/test/miniconda3/envs/test/lib/python3.7/site-packages/pandas/io/parquet.py", line 236, in read
    mode="rb",
  File "/home/test/miniconda3/envs/test/lib/python3.7/site-packages/pandas/io/parquet.py", line 102, in _get_path_or_handle
    path_or_handle, mode, is_text=False, storage_options=storage_options
  File "/home/test/miniconda3/envs/test/lib/python3.7/site-packages/pandas/io/common.py", line 710, in get_handle
    handle = open(handle, ioargs.mode)
FileNotFoundError: [Errno 2] No such file or directory: parquet_df.parquet
@devin-petersohn
Collaborator

Hi @jmakov, thanks for posting! It looks like some of the workers cannot find the file at that path.

Do all workers have access to the file? Is this file on a shared filesystem or locally on the driver?
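One quick way to check is to ask each node whether it can see the file. A minimal diagnostic sketch using only standard Ray APIs, assuming a running cluster; the path and the task count are placeholders, not from this issue:

import os

import ray

ray.init(address="auto")  # connect to the existing cluster

@ray.remote
def can_see(path):
    # Report this worker's node IP and whether the path exists there.
    return ray.util.get_node_ip_address(), os.path.exists(path)

path = "/abs/path/to/parquet_df.parquet"  # placeholder path
# Launch more tasks than nodes so each node is likely sampled at least once.
results = set(ray.get([can_see.remote(path) for _ in range(32)]))
for ip, visible in sorted(results):
    print(ip, "sees the file" if visible else "does NOT see the file")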

@jmakov
Author

jmakov commented Oct 4, 2021

Thanks for the quick response. The file is only available locally on the single node it's being read from. Does Modin require that all the nodes have access?

@devin-petersohn
Collaborator

Does Modin require that all the nodes have access?

That's the current assumption in Modin's parser logic: the file is expected to live on storage every node can reach (S3, NFS, etc.). We could hypothetically add a way to do this; is it something that's important to your workflow?

Happy to discuss more about your workflow if you'd like to email me (devin.petersohn@gmail.com)
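For files that exist only on the driver, one workaround is to read them with plain pandas and let Modin distribute the result, or to stage them on storage every node can reach. A minimal sketch; the paths below are placeholders, not from this issue:

import pandas
import modin.pandas as pd

# Option 1: read on the driver with plain pandas, then hand the
# already-loaded frame to Modin to distribute across the cluster.
local_df = pandas.read_parquet("/abs/path/only_on_driver.parquet")
df = pd.DataFrame(local_df)

# Option 2: stage the file on shared storage (S3, NFS, ...) that every
# node can reach, then read it with Modin directly.
# df = pd.read_parquet("s3://my-bucket/parquet_df.parquet")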

@jmakov
Author

jmakov commented Oct 4, 2021

No, it's not. I just didn't expect this and assumed it was a bug. It would be great to have a note in the docs.

@jmakov jmakov closed this as completed Oct 4, 2021
@devin-petersohn devin-petersohn changed the title Cannot read local dataframe saved as parquet Add documentation about reading local files in a cluster environment Oct 4, 2021
@devin-petersohn
Collaborator

Would be great to have a note in the docs.

Agreed. I am going to reopen this to track that task, if you don't mind.

@anmyachev anmyachev added the documentation 📜 Updates and issues with the documentation label Apr 21, 2022
@RehanSD
Collaborator

RehanSD commented Oct 12, 2022

Related to #4479!

@mvashishtha
Collaborator

Duplicate of #4479

@mvashishtha mvashishtha marked this as a duplicate of #4479 Oct 12, 2022