In some cases wait_partitions does not work as expected #5944

Open
anmyachev opened this issue Apr 5, 2023 · 1 comment
Labels
bug 🦗 Something isn't working P1 Important tasks that we should complete soon

Comments

@anmyachev (Collaborator)

Initially found in #5713, while trying to use this function to wait for the completion of remote computation in read functions.

  # at the moment it is not possible to use the `wait_partitions` function;
  # when a read function is called repeatedly with the same parameters,
  # `wait_partitions` considers that the remote computation has finished,
  # but trying to materialize the received data shows that it has not.
  # for example, `test_io_exp.py::test_read_evaluated_dict` fails because of that

For Dask, this can be solved by making the `pure` parameter `False` by default. The problem is also observed for Ray and unidist.

It seems that if an engine caches the result of a function call with the same parameters, it should return futures pointing to the same object (or to a copy of it). But in practice the futures report that the computation has finished while it is actually still running (which looks very much like a bug). Further investigation is needed.
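The deduplication behavior described above can be sketched in plain Python. This is only an illustration of the mechanism, not Modin's or Dask's actual code; the `submit`/`pure` names merely mirror the shape of Dask's `Client.submit` API:

```python
import time
from concurrent.futures import ThreadPoolExecutor

_cache = {}  # maps (func, args) -> future, emulating pure-call deduplication
_pool = ThreadPoolExecutor(max_workers=2)

def submit(func, *args, pure=True):
    """Sketch of an engine's submit(): with pure=True, identical calls
    share one future, so a later caller can receive a future that is
    already "done" even though no new computation was launched for it."""
    key = (func, args)
    if pure and key in _cache:
        return _cache[key]
    future = _pool.submit(func, *args)
    if pure:
        _cache[key] = future
    return future

def slow_read(path):
    time.sleep(0.2)  # stand-in for a remote read task
    return path.upper()

f1 = submit(slow_read, "data.csv")
f2 = submit(slow_read, "data.csv")              # deduplicated: same future
g = submit(slow_read, "data.csv", pure=False)   # fresh computation, fresh future
print(f1 is f2, f1 is g)  # True False
```

With `pure=False` every call gets its own future, so waiting on it tracks the actual new computation rather than a previously cached one.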

@zmbc (Contributor) commented Oct 11, 2023

The current workaround of materializing dtypes can be problematic: for example, when you load a dataset with a very large `pd.Categorical` that can't fit into the memory of a single worker. This works fine in AsyncReadMode but not in the default, synchronous mode, because `_ = query_compiler.dtypes` will crash the worker.

This is obviously quite an edge case. However, I am a bit surprised that synchronous reading is the default; I see why it is necessary in the test suite but I can't imagine it is common to delete data files as soon as they have been loaded.
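The heaviness of a categorical's dtype is a property of pandas itself, not of Modin: a `CategoricalDtype` object carries the full category index, so "just the dtypes" of a high-cardinality categorical column can be arbitrarily large, and materializing them pulls every distinct category value into one place. A minimal standalone illustration:

```python
import pandas as pd

# The dtype of a categorical Series is not a small scalar tag:
# the CategoricalDtype object embeds the entire category index.
s = pd.Series(pd.Categorical([f"item-{i}" for i in range(100_000)]))
dtype = s.dtype
print(type(dtype).__name__)   # CategoricalDtype
print(len(dtype.categories))  # 100000 -- all distinct values travel with the dtype
```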
