
🤗 Transformers **Trainer** API raises exception on train if triggered from an already started ML Flow run. #15663

@Ataago

Description


Environment info

  • transformers version: 4.16.2
  • Platform: Linux-5.11.0-40-generic-x86_64-with-debian-10.9
  • Python version: 3.7.10
  • PyTorch version (GPU?): 1.11.0.dev20220112+cu111 (True)
  • Tensorflow version (GPU?): 2.5.0 (False)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: parallel

Who can help

@sgugger

Information

Model I am using: bert-base-cased, used to replicate the bug with the 🤗 Transformers Trainer API, following the official example.

The problem arises when using:

  • the official example scripts:
  • my own modified scripts: the bug arises when I use the 🤗 Transformers Trainer API inside an already started MLflow run.

The task I am working on is:

  • an official GLUE/SQuAD task: GLUE on the IMDB dataset
  • my own task or dataset:

To reproduce

Steps to reproduce the behavior:

  1. Initialise an MLflow run.
  2. Start training with the 🤗 Transformers Trainer API inside the MLflow run.
  3. An exception is raised when the 🤗 Transformers Trainer API tries to create another MLflow run while one is already active.

Exception :

Exception: Run with UUID fad5d86248564973ababb1627466c0cb is already active. To start a new run, first end the current run with mlflow.end_run(). To start a nested run, call start_run with nested=True

Code to replicate the exception:

from datasets import load_dataset
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
from transformers import TrainingArguments
from transformers import Trainer
import mlflow


ML_FLOW_URI = '<put mlflow uri here>'
# Set up the MLflow tracking URI
mlflow.set_tracking_uri(ML_FLOW_URI)

def get_data():
    
    # Load the raw dataset and tokenizer
    raw_datasets = load_dataset("imdb")
    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

    def tokenize_function(examples):
        return tokenizer(examples["text"], padding="max_length", truncation=True)

    # Tokenize data
    tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
    small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
    small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
    
    return small_train_dataset, small_eval_dataset


small_train_dataset, small_eval_dataset = get_data()
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

# Init Training
training_args = TrainingArguments("test_trainer")
trainer = Trainer(
    model=model, 
    args=training_args, 
    train_dataset=small_train_dataset, 
    eval_dataset=small_eval_dataset
)

with mlflow.start_run(run_name='my_main_run') as root_run:
    trainer.train() # This line causes the Exception

Line causing the exception:

with mlflow.start_run(run_name='my_main_run') as root_run:
    trainer.train() # This line causes the Exception

Traceback:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/scripts/trainer_bug_replication.py", line 43, in <module>
    trainer.train() # This line causes the Exception
  File "/usr/local/lib/python3.7/site-packages/transformers/trainer.py", line 1308, in train
    self.control = self.callback_handler.on_train_begin(args, self.state, self.control)
  File "/usr/local/lib/python3.7/site-packages/transformers/trainer_callback.py", line 348, in on_train_begin
    return self.call_event("on_train_begin", args, state, control)
  File "/usr/local/lib/python3.7/site-packages/transformers/trainer_callback.py", line 399, in call_event
    **kwargs,
  File "/usr/local/lib/python3.7/site-packages/transformers/integrations.py", line 742, in on_train_begin
    self.setup(args, state, model)
  File "/usr/local/lib/python3.7/site-packages/transformers/integrations.py", line 718, in setup
    self._ml_flow.start_run(run_name=args.run_name)
  File "/usr/local/lib/python3.7/site-packages/mlflow/tracking/fluent.py", line 232, in start_run
    ).format(_active_run_stack[0].info.run_id)
Exception: Run with UUID cb409c683c154f78bdcd37001894ae7b is already active. To start a new run, first end the current run with mlflow.end_run(). To start a nested run, call start_run with nested=True

Possible solution

When the MLflowCallback is set up by default (given mlflow is installed), the setup should check for an already active MLflow run and, if one exists, start a nested run instead. Starting a nested run would avoid interfering with the logs of the parent run the user has already started.

This can be fixed by replacing LINE 718 in integrations.py

            self._ml_flow.start_run(run_name=args.run_name)

with

            nested = self._ml_flow.active_run() is not None
            self._ml_flow.start_run(run_name=args.run_name, nested=nested)
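
For context, a minimal standalone sketch of the nested-run behaviour this patch relies on (the run names here are just placeholders, independent of the Trainer): MLflow only allows a second run while another is active if nested=True is passed.

import mlflow

with mlflow.start_run(run_name="parent") as parent_run:
    # A second start_run() while "parent" is active is only allowed as a
    # nested (child) run; without nested=True, MLflow raises the
    # "Run with UUID ... is already active" exception shown above.
    with mlflow.start_run(run_name="child", nested=True) as child_run:
        mlflow.log_metric("loss", 0.5)  # logged under the child run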

I can raise a PR if needed :)

Expected behavior

The 🤗 Transformers Trainer API should not raise an exception when training is started inside an already running MLflow run opened by the user.
Rather, as a user I would expect the 🤗 Transformers Trainer API to log to a nested MLflow run when I have already started a parent run, without interfering with the parent run's logs.
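
In the meantime, a possible workaround (a sketch only, reusing model, small_train_dataset, and small_eval_dataset from the reproduction script above) is to keep the Trainer from starting its own MLflow run, either by turning off reporting integrations or by removing just the MLflow callback; either option alone is enough.

from transformers import Trainer, TrainingArguments
from transformers.integrations import MLflowCallback
import mlflow

# Option 1: disable all reporting integrations so the Trainer never calls MLflow.
training_args = TrainingArguments("test_trainer", report_to=[])

# Option 2: keep the default integrations but drop only the MLflow callback.
trainer = Trainer(
    model=model,                        # from the reproduction script above
    args=training_args,
    train_dataset=small_train_dataset,  # from the reproduction script above
    eval_dataset=small_eval_dataset,
)
trainer.remove_callback(MLflowCallback)

# Training inside the user's run no longer tries to start a second MLflow run.
with mlflow.start_run(run_name='my_main_run') as root_run:
    trainer.train()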

Similar/Related Issues

#11115
