AttributeError: Can't pickle local object 'main.<locals>.train_transforms' #1327

Closed
prathikr opened this issue Aug 31, 2023 · 7 comments · Fixed by #1377
Labels
bug Something isn't working

Comments

prathikr (Contributor) commented Aug 31, 2023

System Info

python==3.8

accelerate==0.22.0
evaluate==0.4.0
optimum==1.12.0
scikit-learn==1.3.0
transformers==4.32.1

onnx==1.14.0
onnxruntime-training==1.15.0+cu118

Who can help?

@JingyaHuang @echarlai

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction (minimal, reproducible, runnable)

Environment:

FROM mcr.microsoft.com/aifx/acpt/stable-ubuntu2004-cu118-py38-torch200
RUN pip install accelerate evaluate optimum transformers
RUN pip install scikit-learn

RUN pip list

Run Command:

torchrun run_image_classification.py \
                        --model_name_or_path google/vit-base-patch16-224 \
                        --do_train --do_eval \
                        --dataset_name beans \
                        --fp16 True --num_train_epochs 1 \
                        --per_device_train_batch_size 8 --per_device_eval_batch_size 8 \
                        --remove_unused_columns False --ignore_mismatched_sizes True \
                        --output_dir output_dir --overwrite_output_dir \
                        --dataloader_num_workers 1

Error:

Traceback (most recent call last):
  File "run_image_classification_ort.py", line 423, in <module>
    main()
  File "run_image_classification_ort.py", line 397, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/optimum/onnxruntime/trainer.py", line 462, in train
    return inner_training_loop(
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/optimum/onnxruntime/trainer.py", line 739, in _inner_training_loop
    for step, inputs in enumerate(train_dataloader):
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/accelerate/data_loader.py", line 381, in __iter__
    dataloader_iter = super().__iter__()
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 442, in __iter__
    return self._get_iterator()
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 388, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1043, in __init__
    w.start()
  File "/opt/conda/envs/ptca/lib/python3.8/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/opt/conda/envs/ptca/lib/python3.8/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/opt/conda/envs/ptca/lib/python3.8/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/opt/conda/envs/ptca/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/opt/conda/envs/ptca/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/opt/conda/envs/ptca/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/opt/conda/envs/ptca/lib/python3.8/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'main.<locals>.train_transforms'

Expected behavior

The script is expected to train google/vit-base-patch16-224 for the image-classification task.

The error occurs only when the --dataloader_num_workers parameter is used. The error callstack never passes through any onnxruntime code and seems to be isolated to the dataloader. I believe some update to the accelerate package is causing this issue with optimum. There is no error when I use ORTModule directly from the onnxruntime.training.ortmodule library; the issue only appears when I use ORTTrainer from the optimum.onnxruntime library.
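For reference, the same pickling failure can be reproduced without transformers or optimum at all, as long as the multiprocessing start method is "spawn" and the transform is a function defined inside main(). A minimal standalone sketch (ToyDataset and the tensor shapes are purely illustrative, not taken from the example script):

import multiprocessing as mp

import torch
from torch.utils.data import DataLoader, Dataset


class ToyDataset(Dataset):
    """Tiny dataset that applies a transform passed in at construction time."""

    def __init__(self, transform):
        self.transform = transform

    def __len__(self):
        return 8

    def __getitem__(self, idx):
        return self.transform(torch.zeros(3, 224, 224))


def main():
    # Force the "spawn" start method; the default on Linux is "fork",
    # which does not need to pickle the dataset for the workers.
    mp.set_start_method("spawn", force=True)

    def train_transforms(x):  # local function: it cannot be pickled
        return x + 1

    loader = DataLoader(ToyDataset(train_transforms), batch_size=2, num_workers=1)

    # Spawned workers must pickle the dataset, including `train_transforms`,
    # which raises:
    # AttributeError: Can't pickle local object 'main.<locals>.train_transforms'
    for _ in loader:
        pass


if __name__ == "__main__":
    main()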


This is URGENT as the AzureML Vision Team is in the final steps of releasing finetuning components for image-classification to production. Thank you in advance for any help!

JingyaHuang (Collaborator) commented:

Gently pinging @pacman100 for details on the issue.

muellerzr commented:

@prathikr can you try with transformers v4.33.0? And is run_image_classification.py any different from the one in examples/pytorch/run_image_classification.py? Running on two GPUs, I saw no issue here.

prathikr (Contributor, Author) commented Sep 6, 2023

@muellerzr it is meant to be the same, though I don't think the example script in optimum has been updated in a while. For this particular issue, I copied the code from transformers/examples/pytorch/image-classification/run_image_classification.py and swapped Trainer/TrainingArguments with ORTTrainer/ORTTrainingArguments manually.

I see the same issue when using transformers==4.33.0.
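For context, the swap described above only touches the two trainer classes; a rough sketch of the change (the aliasing is just for illustration; ORTTrainer/ORTTrainingArguments are drop-in subclasses of their transformers counterparts):

# In the copied run_image_classification.py, instead of:
#   from transformers import Trainer, TrainingArguments
# the ORT counterparts from optimum.onnxruntime are used:
from optimum.onnxruntime import ORTTrainer as Trainer, ORTTrainingArguments as TrainingArguments

# The rest of the script (model, datasets, train/val transforms, collator) is kept
# as in transformers/examples/pytorch/image-classification/run_image_classification.py.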

shubhamiit commented Sep 12, 2023

Having the same issue without ORT as well, with num_workers>1:
torch==1.13.1
datasets==2.14.5
transformers==4.33.0
optimum==1.13.1
accelerate==0.22.0
diffusers==0.20.2

Can this issue be prioritized?

prathikr (Contributor, Author) commented:

@muellerzr @JingyaHuang any updates?

fxmarty (Collaborator) commented Sep 13, 2023

Although the traceback is slightly different, the same issue exists with

torch==2.0.1
transformers==4.28.1
optimum==1.10.0
accelerate uninstalled

and using the Optimum ORT training example:

  0%|                                                                                                                                | 0/130 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/home/fxmarty/hf_internship/optimum/examples/onnxruntime/training/image-classification/run_image_classification.py", line 406, in <module>
    main()
  File "/home/fxmarty/hf_internship/optimum/examples/onnxruntime/training/image-classification/run_image_classification.py", line 380, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/fxmarty/anaconda3/envs/hf-inf/lib/python3.9/site-packages/optimum/onnxruntime/trainer.py", line 454, in train
    return inner_training_loop(
  File "/home/fxmarty/anaconda3/envs/hf-inf/lib/python3.9/site-packages/optimum/onnxruntime/trainer.py", line 717, in _inner_training_loop
    for step, inputs in enumerate(train_dataloader):
  File "/home/fxmarty/anaconda3/envs/hf-inf/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 441, in __iter__
    return self._get_iterator()
  File "/home/fxmarty/anaconda3/envs/hf-inf/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 388, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/home/fxmarty/anaconda3/envs/hf-inf/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1042, in __init__
    w.start()
  File "/home/fxmarty/anaconda3/envs/hf-inf/lib/python3.9/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/home/fxmarty/anaconda3/envs/hf-inf/lib/python3.9/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/home/fxmarty/anaconda3/envs/hf-inf/lib/python3.9/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/home/fxmarty/anaconda3/envs/hf-inf/lib/python3.9/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/home/fxmarty/anaconda3/envs/hf-inf/lib/python3.9/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/home/fxmarty/anaconda3/envs/hf-inf/lib/python3.9/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/home/fxmarty/anaconda3/envs/hf-inf/lib/python3.9/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'main.<locals>.train_transforms'
  0%|                                                                                                                                | 0/130 [00:00<?, ?it/s]

So I doubt the issue is related to accelerate, and it is likely a longstanding one.

fxmarty (Collaborator) commented Sep 13, 2023

This issue was introduced in #1115, specifically:

mp.set_start_method("spawn", force=True)

It appears that onnxruntime-training uses multiprocessing at some point, and spawn is not working, while fork is.

The above PR launches validation of the exported ONNX models in subprocesses to avoid reported memory leaks when destroying an ORT InferenceSession on the CUDA EP (which does not support fork). I'll put some guards in place and publish a patch; thank you for the notice.
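For anyone hitting this in the meantime, one way to scope the workaround (hypothetical helper names; not necessarily what the eventual patch does) is to use a dedicated spawn context only for the export-validation subprocess instead of forcing the global start method, so DataLoader workers keep the platform default (fork on Linux) and locally defined transforms never need to be pickled:

import multiprocessing as mp


def _validate_exported_model():
    # Placeholder for the ONNX export validation that must not run in a
    # forked process (the CUDA EP does not support fork).
    pass


def validate_in_subprocess():
    # Use an explicit spawn context for this one call only, leaving the
    # global/default start method untouched.
    ctx = mp.get_context("spawn")
    p = ctx.Process(target=_validate_exported_model)
    p.start()
    p.join()


if __name__ == "__main__":
    validate_in_subprocess()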
