
🤗 Transformers **Trainer** API raises exception on train if triggered from an already started ML Flow run. #15663

@Ataago

Description


Environment info

  • transformers version: 4.16.2
  • Platform: Linux-5.11.0-40-generic-x86_64-with-debian-10.9
  • Python version: 3.7.10
  • PyTorch version (GPU?): 1.11.0.dev20220112+cu111 (True)
  • Tensorflow version (GPU?): 2.5.0 (False)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: parallel

Who can help

@sgugger

Information

Model I am using: bert-base-cased, used to replicate the bug with the 🤗 Transformers Trainer API, following the official example.

The problem arises when using:

  • the official example scripts:
  • my own modified scripts: the bug arises when I use the 🤗 Transformers Trainer API inside an already started MLflow run.

The task I am working on is:

  • an official GLUE/SQuAD task: GLUE on the IMDB dataset
  • my own task or dataset:

To reproduce

Steps to reproduce the behavior:

  1. Initialise an MLflow run.
  2. Start training with the 🤗 Transformers Trainer API inside the MLflow run.
  3. An exception is raised when the 🤗 Transformers Trainer API tries to create another MLflow run while one is already active.

Exception :

Exception: Run with UUID fad5d86248564973ababb1627466c0cb is already active. To start a new run, first end the current run with mlflow.end_run(). To start a nested run, call start_run with nested=True

Code to replicate the exception:

from datasets import load_dataset
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
from transformers import TrainingArguments
from transformers import Trainer
import mlflow


ML_FLOW_URI = '<put mlflow uri here>'
# Set up the MLflow tracking URI
mlflow.set_tracking_uri(ML_FLOW_URI)

def get_data():
    
    # Load the raw dataset and tokenizer
    raw_datasets = load_dataset("imdb")
    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

    def tokenize_function(examples):
        return tokenizer(examples["text"], padding="max_length", truncation=True)

    # Tokenize data
    tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
    small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
    small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
    
    return small_train_dataset, small_eval_dataset


small_train_dataset, small_eval_dataset = get_data()
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

# Init Training
training_args = TrainingArguments("test_trainer")
trainer = Trainer(
    model=model, 
    args=training_args, 
    train_dataset=small_train_dataset, 
    eval_dataset=small_eval_dataset
)

with mlflow.start_run(run_name='my_main_run') as root_run:
    trainer.train() # This line causes the Exception

Line causing the exception:

with mlflow.start_run(run_name='my_main_run') as root_run:
    trainer.train() # This line causes the Exception

Traceback:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/scripts/trainer_bug_replication.py", line 43, in <module>
    trainer.train() # This line causes the Exception
  File "/usr/local/lib/python3.7/site-packages/transformers/trainer.py", line 1308, in train
    self.control = self.callback_handler.on_train_begin(args, self.state, self.control)
  File "/usr/local/lib/python3.7/site-packages/transformers/trainer_callback.py", line 348, in on_train_begin
    return self.call_event("on_train_begin", args, state, control)
  File "/usr/local/lib/python3.7/site-packages/transformers/trainer_callback.py", line 399, in call_event
    **kwargs,
  File "/usr/local/lib/python3.7/site-packages/transformers/integrations.py", line 742, in on_train_begin
    self.setup(args, state, model)
  File "/usr/local/lib/python3.7/site-packages/transformers/integrations.py", line 718, in setup
    self._ml_flow.start_run(run_name=args.run_name)
  File "/usr/local/lib/python3.7/site-packages/mlflow/tracking/fluent.py", line 232, in start_run
    ).format(_active_run_stack[0].info.run_id)
Exception: Run with UUID cb409c683c154f78bdcd37001894ae7b is already active. To start a new run, first end the current run with mlflow.end_run(). To start a nested run, call start_run with nested=True

Possible solution

When the MLflowCallback is set up by default (given mlflow is installed), the setup should check for an already active MLflow run and, if one exists, start a nested run instead. Starting a nested run would avoid interfering with the logs of the parent run the user has already started.

This can be fixed by replacing LINE 718 in integrations.py

            self._ml_flow.start_run(run_name=args.run_name)

with

            nested = self._ml_flow.active_run() is not None
            self._ml_flow.start_run(run_name=args.run_name, nested=nested)
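
For context, a minimal standalone sketch of the nested-run behaviour this patch relies on (the run names here are just placeholders, independent of the Trainer): MLflow only allows a second run while another is active if nested=True is passed.

import mlflow

with mlflow.start_run(run_name="parent") as parent_run:
    # A second start_run() while "parent" is active is only allowed as a
    # nested (child) run; without nested=True, MLflow raises the
    # "Run with UUID ... is already active" exception shown above.
    with mlflow.start_run(run_name="child", nested=True) as child_run:
        mlflow.log_metric("loss", 0.5)  # logged under the child run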

I can raise a PR if needed :)

Expected behavior

The 🤗 Transformers Trainer API should not raise an exception when training is started inside an already running MLflow run opened by the user.
Rather, as a user I would expect the 🤗 Transformers Trainer API to log to a nested MLflow run when I have already started a parent run, without interfering with the parent run's logs.
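
In the meantime, a possible workaround (a sketch only, reusing model, small_train_dataset, and small_eval_dataset from the reproduction script above) is to keep the Trainer from starting its own MLflow run, either by turning off reporting integrations or by removing just the MLflow callback; either option alone is enough.

from transformers import Trainer, TrainingArguments
from transformers.integrations import MLflowCallback
import mlflow

# Option 1: disable all reporting integrations so the Trainer never calls MLflow.
training_args = TrainingArguments("test_trainer", report_to=[])

# Option 2: keep the default integrations but drop only the MLflow callback.
trainer = Trainer(
    model=model,                        # from the reproduction script above
    args=training_args,
    train_dataset=small_train_dataset,  # from the reproduction script above
    eval_dataset=small_eval_dataset,
)
trainer.remove_callback(MLflowCallback)

# Training inside the user's run no longer tries to start a second MLflow run.
with mlflow.start_run(run_name='my_main_run') as root_run:
    trainer.train()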

Similar/Related Issues

#11115
