Loky backend doesn't cleanup worker processes #945
Hi, thanks for reporting. This is actually the expected behavior for the loky backend: the worker processes are kept alive so they can be reused by subsequent `Parallel` calls. If they are not reused, the processes are cleaned up once they time out (the default is 300s, and apparently there is no way to modify that yet). If you need to force the clean-up of such processes, you could call:
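For example (importing `get_reusable_executor` from joblib's vendored copy of loky):

```python
from joblib.externals.loky import get_reusable_executor

# Force the reusable loky executor to shut down its worker processes now,
# instead of waiting for the idle-worker timeout.
get_reusable_executor().shutdown(wait=True)
```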
Let me know if this solves your problem. As for the API, would it be better if you could control the timeout delay for the workers, or do you need to clean up the processes directly with an imperative instruction?
Maybe we should improve the joblib documentation.
Thank you for this note; it helped me solve an issue with using joblib as part of an Apache Airflow Python task, where I had many daemonic processes left over after DAG execution.
What if I use …?
I want to read all the sheets in an xlsx file using joblib in parallel mode. Polars is the library for processing the xlsx file.

```python
import os
from io import StringIO
from multiprocessing import cpu_count

import polars as pl
import xlsx2csv
from joblib import Parallel, delayed, parallel_config
from joblib.externals.loky import get_reusable_executor


def read_custom_csv(sheet_list, xlsx2csv_options, source="test.xlsx"):
    # Force clean-up of any loky workers left over from a previous run.
    get_reusable_executor().shutdown(wait=True)
    parser = xlsx2csv.Xlsx2csv(source, **xlsx2csv_options)
    read_csv_options = {"infer_schema_length": 0, "truncate_ragged_lines": True}
    read_csv_options_wo_header = {
        "infer_schema_length": 0,
        "truncate_ragged_lines": True,
        "skip_rows": 1,
        "has_header": False,
    }
    excluded_sheets = ["A", "B", "C", "D"]
    core_count = cpu_count()
    n_jobs = int(os.environ.get("THREAD_COUNT", core_count // 2))
    print(f"Using {n_jobs} processes for loading sheets in parallel.")
    args = []
    for sheet in sheet_list:
        args.append(
            {
                "parser": parser,
                "sheet_name": sheet,
                "read_csv_options": read_csv_options_wo_header
                if sheet in excluded_sheets
                else read_csv_options,
            }
        )
    with parallel_config(backend="loky", n_jobs=n_jobs):
        results = Parallel(return_as="generator")(
            delayed(_read_excel_sheet)(**a) for a in args
        )
    return {sheet[0]: sheet[1] for sheet in results}


def _read_excel_sheet(parser, sheet_name, read_csv_options) -> tuple[str, pl.DataFrame]:
    # Convert one sheet to CSV in memory, then parse it with Polars.
    csv_buffer = StringIO()
    parser.convert(outfile=csv_buffer, sheetname=sheet_name)
    if csv_buffer.tell() != 0:  # only parse non-empty sheets
        csv_buffer.seek(0)
        return sheet_name, pl.read_csv(csv_buffer, **read_csv_options)
```
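A hypothetical call might look like this (sheet names and options are illustrative):

```python
frames = read_custom_csv(sheet_list=["Summary", "A", "B"], xlsx2csv_options={})
print({name: df.shape for name, df in frames.items()})
```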
Python 3.7.2, macOS 10.13.3 and Ubuntu 18.04
I notice that when using the loky backend, joblib doesn't clean up its worker processes even when explicitly calling `_terminate_backend()`. Here's a minimal example:
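A minimal sketch of the pattern described (the task function `slow_fn` and the job counts are illustrative; `_terminate_backend` is the private method named above):

```python
import time
from joblib import Parallel, delayed

def slow_fn(i):
    time.sleep(1)
    return i

pool = Parallel(n_jobs=4, backend="loky")
results = pool(delayed(slow_fn)(i) for i in range(8))
pool._terminate_backend()
# The loky worker processes are still visible in `ps` after this call.
```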
Same effect if I use the context manager to construct the pool:
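A sketch of the context-manager variant (reusing the illustrative `slow_fn` from above):

```python
with Parallel(n_jobs=4, backend="loky") as pool:
    results = pool(delayed(slow_fn)(i) for i in range(8))
# The worker processes outlive the `with` block.
```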
However, with the `multiprocessing` backend, it works as expected:
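A sketch of the same code on the `multiprocessing` backend (illustrative, as above):

```python
with Parallel(n_jobs=4, backend="multiprocessing") as pool:
    results = pool(delayed(slow_fn)(i) for i in range(8))
# Here the worker processes are gone once the block exits.
```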