AttributeError: Can't pickle local object 'main.<locals>.train_transforms' #1327

Closed
prathikr opened this issue Aug 31, 2023 · 7 comments · Fixed by #1377
Labels
bug Something isn't working

Comments

prathikr (Contributor) commented Aug 31, 2023

System Info

python==3.8

accelerate==0.22.0
evaluate==0.4.0
optimum==1.12.0
scikit-learn==1.3.0
transformers==4.32.1

onnx==1.14.0
onnxruntime-training==1.15.0+cu118

Who can help?

@JingyaHuang @echarlai

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction (minimal, reproducible, runnable)

Environment:

FROM mcr.microsoft.com/aifx/acpt/stable-ubuntu2004-cu118-py38-torch200
RUN pip install accelerate evaluate optimum transformers
RUN pip install scikit-learn

RUN pip list

Run Command:

torchrun run_image_classification.py \
                        --model_name_or_path google/vit-base-patch16-224 \
                        --do_train --do_eval \
                        --dataset_name beans \
                        --fp16 True --num_train_epochs 1 \
                        --per_device_train_batch_size 8 --per_device_eval_batch_size 8 \
                        --remove_unused_columns False --ignore_mismatched_sizes True \
                        --output_dir output_dir --overwrite_output_dir \
                        --dataloader_num_workers 1

Error:

Traceback (most recent call last):
  File "run_image_classification_ort.py", line 423, in <module>
    main()
  File "run_image_classification_ort.py", line 397, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/optimum/onnxruntime/trainer.py", line 462, in train
    return inner_training_loop(
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/optimum/onnxruntime/trainer.py", line 739, in _inner_training_loop
    for step, inputs in enumerate(train_dataloader):
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/accelerate/data_loader.py", line 381, in __iter__
    dataloader_iter = super().__iter__()
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 442, in __iter__
    return self._get_iterator()
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 388, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1043, in __init__
    w.start()
  File "/opt/conda/envs/ptca/lib/python3.8/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/opt/conda/envs/ptca/lib/python3.8/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/opt/conda/envs/ptca/lib/python3.8/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/opt/conda/envs/ptca/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/opt/conda/envs/ptca/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/opt/conda/envs/ptca/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/opt/conda/envs/ptca/lib/python3.8/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'main.<locals>.train_transforms'

Expected behavior

The script is expected to train google/vit-base-patch16-224 for the image-classification task.

The error occurs only when the --dataloader_num_workers parameter is used. The error callstack never passes through any onnxruntime code and seems to be isolated to the dataloader. I believe some update to the accelerate package is causing this issue with optimum. There is no error when I use ORTModule directly from the onnxruntime.training.ortmodule library; the issue only appears when I use ORTTrainer from the optimum.onnxruntime library.
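For reference, the same pickling failure can be reproduced without transformers or optimum at all, as long as the multiprocessing start method is "spawn" and the transform is a function defined inside main(). A minimal standalone sketch (ToyDataset and the tensor shapes are purely illustrative, not taken from the example script):

import multiprocessing as mp

import torch
from torch.utils.data import DataLoader, Dataset


class ToyDataset(Dataset):
    """Tiny dataset that applies a transform passed in at construction time."""

    def __init__(self, transform):
        self.transform = transform

    def __len__(self):
        return 8

    def __getitem__(self, idx):
        return self.transform(torch.zeros(3, 224, 224))


def main():
    # Force the "spawn" start method; the default on Linux is "fork",
    # which does not need to pickle the dataset for the workers.
    mp.set_start_method("spawn", force=True)

    def train_transforms(x):  # local function: it cannot be pickled
        return x + 1

    loader = DataLoader(ToyDataset(train_transforms), batch_size=2, num_workers=1)

    # Spawned workers must pickle the dataset, including `train_transforms`,
    # which raises:
    # AttributeError: Can't pickle local object 'main.<locals>.train_transforms'
    for _ in loader:
        pass


if __name__ == "__main__":
    main()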


This is URGENT as the AzureML Vision Team is in the final steps of releasing finetuning components for image-classification to production. Thank you in advance for any help!

JingyaHuang (Collaborator) commented:

Gently pinging @pacman100 for details on the issue.

muellerzr commented:

@prathikr can you try with transformers v4.33.0? And is run_image_classification.py any different from the one in examples/pytorch/run_image_classification.py? Running on two GPUs, I saw no issue here.

prathikr (Contributor, Author) commented Sep 6, 2023

@muellerzr it is meant to be the same, though I don't think the example script in optimum has been updated in a while. For this particular issue, I copied the code from transformers/examples/pytorch/image-classification/run_image_classification.py and swapped Trainer/TrainingArguments with ORTTrainer/ORTTrainingArguments manually.

I see the same issue when using transformers==4.33.0.
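For context, the swap described above only touches the two trainer classes; a rough sketch of the change (the aliasing is just for illustration; ORTTrainer/ORTTrainingArguments are drop-in subclasses of their transformers counterparts):

# In the copied run_image_classification.py, instead of:
#   from transformers import Trainer, TrainingArguments
# the ORT counterparts from optimum.onnxruntime are used:
from optimum.onnxruntime import ORTTrainer as Trainer, ORTTrainingArguments as TrainingArguments

# The rest of the script (model, datasets, train/val transforms, collator) is kept
# as in transformers/examples/pytorch/image-classification/run_image_classification.py.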

shubhamiit commented Sep 12, 2023

Having the same issue without ORT as well, with num_workers>1:
torch==1.13.1
datasets==2.14.5
transformers==4.33.0
optimum==1.13.1
accelerate==0.22.0
diffusers==0.20.2

Can this issue be prioritized?

prathikr (Contributor, Author) commented:

@muellerzr @JingyaHuang any updates?

fxmarty (Collaborator) commented Sep 13, 2023

Although the traceback is slightly different, the same issue exists with

torch==2.0.1
transformers==4.28.1
optimum==1.10.0
accelerate uninstalled

and using the Optimum ORT training example:

  0%|                                                                                                                                | 0/130 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/home/fxmarty/hf_internship/optimum/examples/onnxruntime/training/image-classification/run_image_classification.py", line 406, in <module>
    main()
  File "/home/fxmarty/hf_internship/optimum/examples/onnxruntime/training/image-classification/run_image_classification.py", line 380, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/fxmarty/anaconda3/envs/hf-inf/lib/python3.9/site-packages/optimum/onnxruntime/trainer.py", line 454, in train
    return inner_training_loop(
  File "/home/fxmarty/anaconda3/envs/hf-inf/lib/python3.9/site-packages/optimum/onnxruntime/trainer.py", line 717, in _inner_training_loop
    for step, inputs in enumerate(train_dataloader):
  File "/home/fxmarty/anaconda3/envs/hf-inf/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 441, in __iter__
    return self._get_iterator()
  File "/home/fxmarty/anaconda3/envs/hf-inf/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 388, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/home/fxmarty/anaconda3/envs/hf-inf/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1042, in __init__
    w.start()
  File "/home/fxmarty/anaconda3/envs/hf-inf/lib/python3.9/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/home/fxmarty/anaconda3/envs/hf-inf/lib/python3.9/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/home/fxmarty/anaconda3/envs/hf-inf/lib/python3.9/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/home/fxmarty/anaconda3/envs/hf-inf/lib/python3.9/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/home/fxmarty/anaconda3/envs/hf-inf/lib/python3.9/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/home/fxmarty/anaconda3/envs/hf-inf/lib/python3.9/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/home/fxmarty/anaconda3/envs/hf-inf/lib/python3.9/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'main.<locals>.train_transforms'
  0%|                                                                                                                                | 0/130 [00:00<?, ?it/s]

So I doubt the issue is related to accelerate, and it is likely a longstanding one.

fxmarty (Collaborator) commented Sep 13, 2023

This issue was introduced in #1115, specifically:

mp.set_start_method("spawn", force=True)

It appears that onnxruntime-training uses multiprocessing at some point, and spawn is not working, while fork is.

The above PR launches validation of the exported ONNX models in subprocesses to avoid reported memory leaks when destroying an ORT InferenceSession on the CUDA EP (which does not support fork). I'll put some guards in place and publish a patch; thank you for the notice.
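For anyone hitting this in the meantime, one way to scope the workaround (hypothetical helper names; not necessarily what the eventual patch does) is to use a dedicated spawn context only for the export-validation subprocess instead of forcing the global start method, so DataLoader workers keep the platform default (fork on Linux) and locally defined transforms never need to be pickled:

import multiprocessing as mp


def _validate_exported_model():
    # Placeholder for the ONNX export validation that must not run in a
    # forked process (the CUDA EP does not support fork).
    pass


def validate_in_subprocess():
    # Use an explicit spawn context for this one call only, leaving the
    # global/default start method untouched.
    ctx = mp.get_context("spawn")
    p = ctx.Process(target=_validate_exported_model)
    p.start()
    p.join()


if __name__ == "__main__":
    validate_in_subprocess()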
