
Avoid stuck map operation when a subprocess crashes #5976

Merged (7 commits) on Jul 10, 2023

Conversation

pappacena (Contributor) commented on Jun 21, 2023

I've been using Dataset.map() with num_proc=os.cpu_count() to leverage multicore processing for my datasets, but from time to time I get stuck processes waiting forever. Apparently, when one of the subprocesses is abruptly killed (OOM killer, segfault, SIGKILL, etc.), the main process keeps waiting for the async task sent to that child process to finish.

It seems to be easy to reproduce the issue with the following script:

import os
from datasets import Dataset, Features, Value


def do_stuck(item):
    os.kill(os.getpid(), 9)

data = {
    "col1": list(range(5)),
    "col2": list(range(5)),
}

ds = Dataset.from_dict(
    data,
    features=Features({
        "col1": Value("int64"),
        "col2": Value("int64"),
    }),
)

print(ds.map(do_stuck, num_proc=4))

This is an old behavior in Python, which apparently was fixed a few years ago in concurrent.futures.ProcessPoolExecutor (ref), but not in multiprocessing.pool.Pool / multiprocess.pool.Pool, which is used by Dataset.map (ref).

This PR is an attempt to detect when a child process gets killed and raise a RuntimeError warning the Dataset.map() caller.

EDIT: Related proposal for future improvement: #5977

lhoestq (Member) commented on Jun 22, 2023

Hi! Do you think this can be fixed at the Pool level? Ideally it should be the Pool's responsibility to handle this, not the map code. We could even subclass Pool if needed (at least the one from multiprocess).

pappacena (Contributor, Author)

@lhoestq it makes sense to me. Just pushed a refactoring that creates a ProcessPool(multiprocess.pool.Pool) subclass to keep track of the PID changes.

HuggingFaceDocBuilderDev commented on Jun 26, 2023

The documentation is not available anymore as the PR was closed or merged.

lhoestq (Member) left a review comment:

Nice! Some additional suggestions:

  def iflatmap_unordered(
-     pool: Union[multiprocessing.pool.Pool, multiprocess.pool.Pool],
+     pool: ProcessPool,
lhoestq (Member) replied:

I think we still need to keep Union[multiprocessing.pool.Pool, multiprocess.pool.Pool] support in iflatmap_unordered

src/datasets/utils/py_utils.py (outdated review thread, resolved)
lhoestq (Member) commented on Jun 26, 2023

I managed to raise an error without subclassing Pool with two additions to iflatmap_unordered:

1. at the beginning:

original_pool = list(pool._pool)

2. in the loop:

if any(async_result._pool != original_pool for async_result in async_results) and queue.empty():
    raise RuntimeError(
        "One of the subprocesses has abruptly died during map operation. "
        "To debug the error, disable multiprocessing."
    )

It's still a fix that only works for iflatmap_unordered (so not for map, imap, etc.) but is maybe simpler than subclassing. It also works for both multiprocessing.Pool and multiprocess.Pool.

pappacena force-pushed the tfp/avoid-stuck-map-operation branch from e6a19f3 to 45a5ee0 on July 5, 2023 at 22:09
pappacena force-pushed the tfp/avoid-stuck-map-operation branch from 45a5ee0 to 2ad4253 on July 5, 2023 at 22:14
pappacena force-pushed the tfp/avoid-stuck-map-operation branch from 2ad4253 to 095d5ea on July 5, 2023 at 22:14
pappacena (Contributor, Author)

@lhoestq sorry for the delay. Busy weeks here.

I just pushed the change you requested. It looks closer to the original proposal, actually.

It seems that map actually uses iflatmap_unordered (here). I think this solution works fine for the map method (which is the one being tested by the new tests/test_arrow_dataset.py::BaseDatasetTest::test_map_crash_subprocess, right?).

lhoestq (Member) commented on Jul 6, 2023

Yes, fixing iflatmap_unordered does fix Dataset.map, but it won't fix any Pool.map that we may use elsewhere, so we'll have to keep this in mind.

lhoestq (Member) commented on Jul 6, 2023

It all looks good to me. Feel free to fix the code formatting by running make style, and we can merge :)

pappacena (Contributor, Author)

> Yes, fixing iflatmap_unordered does fix Dataset.map, but it won't fix any Pool.map that we may use elsewhere, so we'll have to keep this in mind.

Right, I agree. The best way forward is probably to stop using the buggy multiprocess.Pool and replace it with concurrent.futures.ProcessPoolExecutor as much as possible.

Anyway, I've run make style now. Thanks for the support!

lhoestq (Member) commented on Jul 6, 2023

It looks like checking async_result._pool doesn't always work, sorry about that. We might just go back to your original solution then. It would also be cool to open an issue in multiprocess to ask if they have a solution or if they plan to fix this.

pappacena (Contributor, Author)

@lhoestq no problem! Reverted to the previous version.

TBH, given the discussions in this Python issue, I don't think a fix in multiprocess will be merged upstream any time soon...

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
lhoestq (Member) left a review comment:

Thanks!

@lhoestq lhoestq merged commit aca4cdc into huggingface:main Jul 10, 2023
12 checks passed
github-actions (bot)
Show benchmarks

[CI benchmark report: new-vs-old timing comparisons for PyArrow==8.0.0 and PyArrow==latest across benchmark_array_xd.json, benchmark_getitem_100B.json, benchmark_indices_mapping.json, benchmark_iterating.json, and benchmark_map_filter.json; the full per-metric tables are omitted here.]
