AttributeError: Can't pickle local object 'main.<locals>.train_transforms' #1327
Comments
Gently pinging @pacman100 for details on the issue.
@prathikr can you try with transformers v4.33.0? And is …
@muellerzr it is meant to be the same, though I don't think the example script in optimum has been updated in a while. For this particular issue, I copied the code from … I see the same issue when using transformers==4.33.0.
I'm having the same issue without ORT, also with num_workers > 1. Can this issue be prioritized?
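The report above suggests the failure is independent of ORT. That matches what the standard library alone can show: with the "spawn" start method, DataLoader workers must pickle whatever the dataset references, and a function defined inside `main()` cannot be pickled. A minimal sketch (the `train_transforms` name here is a stand-in for the one in the example script):

```python
import pickle

def main():
    # Stand-in for a transform defined inside main(), as in the example
    # script; any function created in a local scope pickles the same way.
    def train_transforms(examples):
        return examples

    # With num_workers > 0 and the "spawn" start method, the DataLoader must
    # pickle whatever the dataset references, including this function.
    try:
        pickle.dumps(train_transforms)
    except (AttributeError, pickle.PicklingError) as err:
        print(err)

main()
```

On CPython this prints the same complaint as the issue title, with no torch, onnxruntime, or accelerate involved.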
@muellerzr @JingyaHuang any updates?
Although the traceback is slightly different, the same issue exists with …
and using the Optimum ORT training example: …
So I doubt the issue is related to accelerate; it is likely a longstanding one.
This issue was introduced in #1115, specifically `optimum/optimum/exporters/onnx/convert.py`, line 57 at c631387.
It appears that onnxruntime-training uses multiprocessing at some point. The above PR launches validation of the exported ONNX models in subprocesses to avoid reported memory leaks in ORT InferenceSession destruction on the CUDA EP (which does not support fork). I'll put some guards in place and publish a patch, thank you for the notice.
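The kind of guard the maintainer describes can be sketched with the standard library (this is an assumed illustration, not optimum's actual code; `validation_context` is a hypothetical helper):

```python
import multiprocessing as mp

def validation_context():
    # Hypothetical helper, not optimum's actual code: the CUDA execution
    # provider does not support fork(), so subprocesses that create an ORT
    # InferenceSession should be started with the "spawn" method instead.
    return mp.get_context("spawn")

def main():
    ctx = validation_context()
    print(type(ctx).__name__)

# The __main__ guard is the kind of protection the comment mentions: with
# "spawn", child processes re-import this module, and without the guard they
# would recursively relaunch the validation subprocess on import.
if __name__ == "__main__":
    main()
```

Starting validation children via an explicit "spawn" context also avoids interfering with whatever start method the user's own training script has configured globally.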
System Info
Who can help?
@JingyaHuang @echarlai
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction (minimal, reproducible, runnable)
Environment:
Run Command:
Error:
Expected behavior
The script is expected to train `google/vit-base-patch16-224` for the image-classification task. The error occurs only when using the `--dataloader_num_workers` parameter. The error call stack never passes through any onnxruntime code and seems to be isolated to the dataloader. I believe some update to the accelerate package is causing this issue with optimum. There is no error when I use `ORTModule` directly from the `onnxruntime.training.ortmodule` library, only when I use `ORTTrainer` from the `optimum.onnxruntime` library.

Related links from the web:
This is URGENT as the AzureML Vision Team is in the final steps of releasing finetuning components for image-classification to production. Thank you in advance for any help!
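As an interim workaround (an assumption on my part, not something proposed in this thread): moving the transform to module scope and binding its configuration with `functools.partial` makes it picklable, so it can be shipped to DataLoader worker processes regardless of the start method. The names below are hypothetical stand-ins for those in the example script:

```python
import pickle
from functools import partial

# Workaround sketch: define the transform at module scope and bind its
# configuration with functools.partial instead of closing over locals.
def train_transforms(examples, image_size=224):
    # Hypothetical preprocessing; the real script resizes/augments images here.
    return examples

configured_transforms = partial(train_transforms, image_size=224)

# A module-level function (and a partial over one) round-trips through pickle,
# which is exactly what spawned DataLoader workers require:
restored = pickle.loads(pickle.dumps(configured_transforms))
print(restored(["example"]))
```

This sidesteps the `Can't pickle local object` error without touching optimum, accelerate, or onnxruntime.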