In some cases wait_partitions does not work as expected #5944

Open
anmyachev opened this issue Apr 5, 2023 · 1 comment
Labels
bug 🦗 Something isn't working P1 Important tasks that we should complete soon

Comments

@anmyachev (Collaborator)

Initially found in #5713, while trying to use this function to wait for the completion of remote computation in read functions.

  # at the moment it is not possible to use the `wait_partitions` function;
  # when a read function is called repeatedly with the same parameters,
  # `wait_partitions` considers that the remote computation has finished,
  # but trying to materialize the received data shows that it has not.
  # for example, `test_io_exp.py::test_read_evaluated_dict` fails because of that

For Dask, this can be solved by making the `pure` parameter `False` by default. The problem is also observed for Ray and unidist.

It seems that if an engine caches the result of a function call with the same parameters, it should return futures pointing to the same object (or to a copy of it). But in practice the futures report that the computation has finished while it is actually still running (which looks very much like a bug). Further investigation is needed.
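The deduplication behavior described above can be sketched in plain Python. This is only an illustration of the mechanism, not Modin's or Dask's actual code; the `submit`/`pure` names merely mirror the shape of Dask's `Client.submit` API:

```python
import time
from concurrent.futures import ThreadPoolExecutor

_cache = {}  # maps (func, args) -> future, emulating pure-call deduplication
_pool = ThreadPoolExecutor(max_workers=2)

def submit(func, *args, pure=True):
    """Sketch of an engine's submit(): with pure=True, identical calls
    share one future, so a later caller can receive a future that is
    already "done" even though no new computation was launched for it."""
    key = (func, args)
    if pure and key in _cache:
        return _cache[key]
    future = _pool.submit(func, *args)
    if pure:
        _cache[key] = future
    return future

def slow_read(path):
    time.sleep(0.2)  # stand-in for a remote read task
    return path.upper()

f1 = submit(slow_read, "data.csv")
f2 = submit(slow_read, "data.csv")              # deduplicated: same future
g = submit(slow_read, "data.csv", pure=False)   # fresh computation, fresh future
print(f1 is f2, f1 is g)  # True False
```

With `pure=False` every call gets its own future, so waiting on it tracks the actual new computation rather than a previously cached one.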

@zmbc (Contributor) commented Oct 11, 2023

The current workaround of materializing dtypes can be problematic: for example, when you load a dataset with a very large `pd.Categorical` that can't fit into the memory of a single worker. This works fine in AsyncReadMode but not in the default, synchronous mode, because `_ = query_compiler.dtypes` will crash the worker.

This is obviously quite an edge case. However, I am a bit surprised that synchronous reading is the default; I see why it is necessary in the test suite but I can't imagine it is common to delete data files as soon as they have been loaded.
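The heaviness of a categorical's dtype is a property of pandas itself, not of Modin: a `CategoricalDtype` object carries the full category index, so "just the dtypes" of a high-cardinality categorical column can be arbitrarily large, and materializing them pulls every distinct category value into one place. A minimal standalone illustration:

```python
import pandas as pd

# The dtype of a categorical Series is not a small scalar tag:
# the CategoricalDtype object embeds the entire category index.
s = pd.Series(pd.Categorical([f"item-{i}" for i in range(100_000)]))
dtype = s.dtype
print(type(dtype).__name__)   # CategoricalDtype
print(len(dtype.categories))  # 100000 -- all distinct values travel with the dtype
```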
