Description
Environment info
- transformers version: 4.16.2
- Platform: Linux-5.11.0-40-generic-x86_64-with-debian-10.9
- Python version: 3.7.10
- PyTorch version (GPU?): 1.11.0.dev20220112+cu111 (True)
- Tensorflow version (GPU?): 2.5.0 (False)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: parallel
Who can help
Information
Model I am using: bert-base-cased, to replicate the bug with the 🤗 Transformers Trainer API, taken from the official example.
The problem arises when using:
- the official example scripts:
- my own modified scripts: the bug arises when I use the 🤗 Transformers Trainer API inside an already started MLflow run.
The tasks I am working on is:
- an official GLUE/SQuAD task: GLUE on the IMDB dataset
- my own task or dataset:
To reproduce
Steps to reproduce the behavior:
- Initialise an MLflow run.
- Start a training run with the 🤗 Transformers Trainer API inside that MLflow run.
- An exception is raised when the 🤗 Transformers Trainer API tries to start another MLflow run while one is already active.
Exception:
Exception: Run with UUID fad5d86248564973ababb1627466c0cb is already active. To start a new run, first end the current run with mlflow.end_run(). To start a nested run, call start_run with nested=True

Code to replicate Exception:
from datasets import load_dataset
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
from transformers import TrainingArguments
from transformers import Trainer
import mlflow
ML_FLOW_URI = '<put mlflow uri here>'
# Set up MLflow tracking
mlflow.set_tracking_uri(ML_FLOW_URI)
def get_data():
    # Init data, tokenizer, and model
    raw_datasets = load_dataset("imdb")
    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

    def tokenize_function(examples):
        return tokenizer(examples["text"], padding="max_length", truncation=True)

    # Tokenize data
    tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
    small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
    small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
    return small_train_dataset, small_eval_dataset
small_train_dataset, small_eval_dataset = get_data()
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)
# Init Training
training_args = TrainingArguments("test_trainer")
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset
)
with mlflow.start_run(run_name='my_main_run') as root_run:
    trainer.train()  # This line causes the Exception

Line causing the exception:
with mlflow.start_run(run_name='my_main_run') as root_run:
    trainer.train()  # This line causes the Exception

Traceback:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/local/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/scripts/trainer_bug_replication.py", line 43, in <module>
trainer.train() # This line causes the Exception
File "/usr/local/lib/python3.7/site-packages/transformers/trainer.py", line 1308, in train
self.control = self.callback_handler.on_train_begin(args, self.state, self.control)
File "/usr/local/lib/python3.7/site-packages/transformers/trainer_callback.py", line 348, in on_train_begin
return self.call_event("on_train_begin", args, state, control)
File "/usr/local/lib/python3.7/site-packages/transformers/trainer_callback.py", line 399, in call_event
**kwargs,
File "/usr/local/lib/python3.7/site-packages/transformers/integrations.py", line 742, in on_train_begin
self.setup(args, state, model)
File "/usr/local/lib/python3.7/site-packages/transformers/integrations.py", line 718, in setup
self._ml_flow.start_run(run_name=args.run_name)
File "/usr/local/lib/python3.7/site-packages/mlflow/tracking/fluent.py", line 232, in start_run
).format(_active_run_stack[0].info.run_id)
Exception: Run with UUID cb409c683c154f78bdcd37001894ae7b is already active. To start a new run, first end the current run with mlflow.end_run(). To start a nested run, call start_run with nested=TruePossible solution
When MLflow is set up by default during the initialisation of the MLflowCallback (given mlflow is installed), the setup should check for an already running MLflow run and start a nested run in that case. Starting a nested run would avoid interfering with the logs of the parent run already started by the user.
This can be fixed by replacing line 718 in integrations.py:
self._ml_flow.start_run(run_name=args.run_name)
with
nested = self._ml_flow.active_run() is not None
self._ml_flow.start_run(run_name=args.run_name, nested=nested)
I can raise a PR if needed :)
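As a possible user-side workaround until the callback handles this (a sketch under my own assumptions, not part of the proposed fix), one can disable the automatic MLflow integration via report_to and/or remove the MLflowCallback, then log to the already-active run manually; model and datasets below come from the reproduction script above:
# Workaround sketch (assumption): stop the Trainer from starting its own MLflow run.
from transformers import Trainer, TrainingArguments
from transformers.integrations import MLflowCallback
import mlflow

training_args = TrainingArguments("test_trainer", report_to=[])  # no automatic experiment logging
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
)
trainer.remove_callback(MLflowCallback)  # redundant with report_to=[], shown for completeness

with mlflow.start_run(run_name='my_main_run') as root_run:
    train_output = trainer.train()  # the Trainer no longer tries to start a second run
    mlflow.log_metric("train_loss", train_output.training_loss)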
Expected behavior
The 🤗 Transformers Trainer API should not raise an exception if the Trainer is started inside an MLflow run that the user has already started.
Rather, as a user, I would expect the 🤗 Transformers Trainer API to log to a nested MLflow run when I have already started a parent run, without interfering with the parent run's logs.
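For reference, this is roughly the nested-run behaviour that MLflow itself already supports and that the callback could mirror (run names and logged values below are illustrative only):
import mlflow

with mlflow.start_run(run_name='my_main_run') as parent_run:
    mlflow.log_param("stage", "parent")
    # A child run started with nested=True attaches to the parent instead of
    # raising "Run ... is already active".
    with mlflow.start_run(run_name='hf_trainer_run', nested=True) as child_run:
        mlflow.log_metric("train_loss", 0.42)
    # Back in the parent run; its params and metrics are untouched by the child run.
    mlflow.log_metric("parent_metric", 1.0)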