Closed
Description
System Info
Transformers 4.28.1, torch 1.13.1
When using the Seq2SeqTrainer with the MLflow integration enabled, if I lose my connection to the MLflow server after training has begun (for example, the server crashes or there is a network error), MLflow raises an exception and training stops:
raise MlflowException(f"API request to {url} failed with exception {e}")
I don't know whether this is a bug or whether there is a recommended way to continue. I would like an option for the Seq2SeqTrainer to log the MLflow connection error and then continue training with the MLflow integration disabled.
I imagine the same behavior would apply to any integration, not just MLflow.
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
- Start training with Seq2SeqTrainer while connected to an MLflow server
- During training, stop the MLflow server
- Seq2SeqTrainer raises an MlflowException
Expected behavior
I would expect an option that, when MLflow raises an exception, disables the integration and lets training continue.
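To make the request concrete, here is a minimal sketch of the behavior I have in mind. It is not an existing Transformers feature: `FailSafeCallback` is a hypothetical wrapper name, and it assumes the Trainer invokes callbacks by calling their `on_*` methods by name. It forwards each hook to the wrapped callback, and on the first exception it logs the error and turns every later hook into a no-op.

```python
import logging

logger = logging.getLogger(__name__)


class FailSafeCallback:
    """Hypothetical wrapper (not part of Transformers): forwards on_* hooks
    to `inner` until one of them raises, then disables itself."""

    def __init__(self, inner):
        self.inner = inner
        self.disabled = False

    def __getattr__(self, name):
        # __getattr__ only runs for attributes not found the normal way.
        # Refuse dunder/own names to avoid recursion during copy or pickle.
        if name.startswith("__") or name in ("inner", "disabled"):
            raise AttributeError(name)
        target = getattr(self.inner, name)
        if not name.startswith("on_") or not callable(target):
            return target

        def guarded(*args, **kwargs):
            if self.disabled:
                return None  # integration switched off, do nothing
            try:
                return target(*args, **kwargs)
            except Exception as exc:
                # Log once, disable the integration, and let training go on.
                self.disabled = True
                logger.warning(
                    "%s failed in %s (%s); disabling it and continuing.",
                    type(self.inner).__name__, name, exc,
                )
                return None

        return guarded


class FlakyLogger:
    """Stand-in for a logging integration whose server goes away."""

    def __init__(self):
        self.calls = 0

    def on_log(self, args, state, control, **kwargs):
        self.calls += 1
        if self.calls > 1:
            raise RuntimeError("API request failed")


wrapped = FailSafeCallback(FlakyLogger())
wrapped.on_log(None, None, None)  # succeeds
wrapped.on_log(None, None, None)  # raises inside, gets logged and swallowed
wrapped.on_log(None, None, None)  # skipped: wrapper is now disabled
```

With the real Trainer, the idea would be to turn off the built-in integration (`report_to="none"`) and pass something like `FailSafeCallback(MLflowCallback())` through `callbacks=`. I have not tested that against 4.28.1, so treat it as an assumption rather than a confirmed workaround.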