# Multi-task Training with Hugging Face Transformers and NLP

### Or: A recipe for multi-task training with Transformers' Trainer and NLP datasets



Hugging Face has been building a lot of exciting new NLP functionality lately. The newly released [NLP](https://github.com/huggingface/nlp) library (since renamed to Datasets, which is the package installed below as `datasets`) covers a wide range of task datasets and metrics, and provides a simple interface for processing and caching inputs efficiently. Hugging Face has also recently introduced a [Trainer](https://github.com/huggingface/transformers/blob/master/src/transformers/trainer.py) class in the Transformers library that handles the training and validation logic.

However, one feature that Hugging Face's current offerings do not support is *multi-task training*. While there has been some discussion about the best way to support multi-task training ([1](https://github.com/huggingface/transformers/issues/4340), [2](https://github.com/huggingface/nlp/issues/217)), the community has not yet settled on a convention for doing so. Multi-task training has been shown to improve task performance ([1](https://www.aclweb.org/anthology/P19-1441/), [2](https://arxiv.org/abs/1910.10683)) and is a common experimental setting for NLP researchers.

In this Colab notebook, we show how to use both the NLP library and the Trainer for a **multi-task** training scheme.

So let's get started!

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


## Library setup

First up, we will install libraries.

In [None]:
!pip install -q transformers==4.28.1
!pip install -q datasets==2.12.0
!pip install -q import-ipynb


In [None]:
# Standard library
import dataclasses
import gc
import json
import logging
import math
import os
import pickle
import random
import re
import shutil
from contextlib import contextmanager
from pathlib import Path
from typing import Callable, Dict, List, Optional, Tuple, Union

# Third-party
import import_ipynb  # lets us import other notebooks as modules
import numpy as np
import torch
import torch.nn as nn
import transformers
from datasets import load_dataset
from numpy import savetxt, loadtxt
from packaging import version
from torch.utils.data.dataloader import DataLoader
from torch.utils.data.dataset import Dataset
from torch.utils.data.distributed import DistributedSampler
from torch.utils.data.sampler import RandomSampler, Sampler, SequentialSampler
from tqdm.auto import tqdm, trange

# Transformers internals used by the custom Trainer below
from transformers.data.data_collator import DataCollator, DefaultDataCollator, InputDataClass
from transformers.modeling_utils import PreTrainedModel
from transformers.optimization import AdamW, get_linear_schedule_with_warmup
from transformers.trainer_utils import (
    PREFIX_CHECKPOINT_DIR,
    EvalPrediction,
    PredictionOutput,
    TrainOutput,
    get_last_checkpoint,
)
from transformers.training_args import TrainingArguments

logging.basicConfig(level=logging.INFO)

**`dataset_code` identifies one of the available tasks (datasets):**

*  dataset_code = 0: the corresponding French *task/dataset*
*  dataset_code = 1: the corresponding English *task/dataset*
*  dataset_code = 2: the corresponding Farsi *task/dataset*



**Combination** is an integer from 0 to 2 that selects one of the possible language pairs in multi-task mode:

*   Combination = 0: *English & French*
*   Combination = 1: *English & Farsi*
*   Combination = 2: *French & Farsi*
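
The `Combination` setting can be read as a small lookup table. The sketch below is only an illustration: `COMBINATIONS` and `tasks_for_combination` are hypothetical names that do not exist in the helper notebook, and the task keys (`English2`, `French2`, `Farsi`) and their order are taken from the preprocessing code later in this notebook.

```python
# Hypothetical lookup: Combination value -> (first task, second task)
COMBINATIONS = {
    0: ("English2", "French2"),
    1: ("English2", "Farsi"),
    2: ("French2", "Farsi"),
}


def tasks_for_combination(combination):
    """Return the ordered pair of task names for a Combination value in {0, 1, 2}."""
    if combination not in COMBINATIONS:
        raise ValueError(f"Combination must be 0, 1 or 2, got {combination!r}")
    return COMBINATIONS[combination]


print(tasks_for_combination(1))  # ('English2', 'Farsi')
```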






# Set the required configuration

This is done by importing another notebook, **"Fetching_required_status"**, from a different location in Google Drive.

That notebook contains several helper functions, each of which sets part of our overall configuration (e.g. fine-tuning mode, the dataset type, etc.).

In [None]:
%cd "/content/drive/MyDrive/NLP Bachelors' Project/Notebooks/Forgettable"
import Fetching_required_status_multilingual2 as Fetching_required_status_multilingual

/content/drive/.shortcut-targets-by-id/17KSuF77xEb0EV_xu89tbzatr7bcePtCI/NLP Bachelors' Project/Notebooks/Forgettable
importing Jupyter notebook from Fetching_required_status_multilingual2.ipynb


In [None]:
Fine_tune, Combination, dataset_code = Fetching_required_status_multilingual.set_status()

In [None]:
path_to_checkpoint, path_to_pretrained_weight = Fetching_required_status_multilingual.set_path(Fine_tune, Combination, dataset_code, from_gdrive = True)

In [None]:
len(os.listdir(path_to_checkpoint))

In [None]:
Last_Actual_checkpoint = Fetching_required_status_multilingual.return_last_checkpoint(path_to_checkpoint)
Last_PreTrained_checkpoint = Fetching_required_status_multilingual.return_last_checkpoint(path_to_pretrained_weight)

In [None]:
Last_Actual_checkpoint

In [None]:
Last_PreTrained_checkpoint

In [None]:
model_name = Fetching_required_status_multilingual.return_model_name(Fine_tune, path_to_pretrained_weight, Last_PreTrained_checkpoint)

In [None]:
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)

In [None]:
dataset_dict = Fetching_required_status_multilingual.return_dataset(dataset_code, Combination, Fine_tune, drop_unlearned = True)

In [None]:
dataset_dict

In [None]:
# from datasets import Dataset
# dataset_dict['English']['train'] = Dataset.from_dict(dataset_dict['English']['train'][:50])
# dataset_dict['French']['train'] = Dataset.from_dict(dataset_dict['French']['train'][:50])
# dataset_dict


## Selecting Hyperparameters (Global parameters)

In [None]:
num_labels = {'French2': 31, 'English2': 151, 'Farsi': 60} # Number of labels in each dataset
num_epochs = 3 # Number of training epochs
B_size = 16 # Batch size
EP_saved = 3 # Save a checkpoint every EP_saved epochs
max_length = 256 # Maximum sequence length used in the pre-processing step
lr = (2.4e-5)/24 # Learning rate (2.4e-5 scaled down by a factor of 24)
# Logging interval in steps: the number of batches in one epoch, summed over both tasks
lg_step = math.ceil(len(dataset_dict[list(dataset_dict.keys())[0]]['train'])/B_size) + math.ceil(len(dataset_dict[list(dataset_dict.keys())[1]]['train'])/B_size)
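
To make the `lg_step` arithmetic concrete, here is a minimal sketch with made-up dataset sizes. `batches_per_epoch` is a hypothetical helper, not part of the notebook; it only reproduces the "ceil of size over batch size, summed over tasks" computation.

```python
import math


def batches_per_epoch(dataset_sizes, batch_size):
    """Total number of batches in one pass over every task's training split."""
    return sum(math.ceil(n / batch_size) for n in dataset_sizes)


# Made-up sizes, for illustration only: 1,000 and 250 training examples
steps = batches_per_epoch([1000, 250], batch_size=16)
print(steps)  # 63 + 16 = 79
```

With these numbers, the training information would be shown once every 79 steps, i.e. once per epoch over both tasks.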

## Creating a Multi-task Model using a shared encoder

In [None]:
class MultitaskModel(transformers.PreTrainedModel):
    def __init__(self, encoder, taskmodels_dict):
        """
        Setting MultitaskModel up as a PretrainedModel allows us
        to take better advantage of Trainer features
        """
        super().__init__(transformers.PretrainedConfig())

        self.encoder = encoder
        self.taskmodels_dict = nn.ModuleDict(taskmodels_dict)

    @classmethod
    def create(cls, model_name, model_type_dict, model_config_dict):
        """
        This creates a MultitaskModel using the model class and config objects
        from single-task models.

        We do this by creating each single-task model, and having them share
        the same encoder transformer.
        """
        shared_encoder = None
        taskmodels_dict = {}
        for task_name, model_type in model_type_dict.items():
            model = model_type.from_pretrained(
                model_name,
                config=model_config_dict[task_name],
            )

            if shared_encoder is None:
                shared_encoder = getattr(model, cls.get_encoder_attr_name(model))
            else:
                setattr(model, cls.get_encoder_attr_name(model), shared_encoder)

            taskmodels_dict[task_name] = model

        return cls(encoder=shared_encoder, taskmodels_dict=taskmodels_dict)

    @classmethod
    def get_encoder_attr_name(cls, model):
        """
        The encoder transformer is named differently in each model "architecture".
        This method lets us get the name of the encoder attribute
        """
        model_class_name = model.__class__.__name__
        if model_class_name.startswith("Bert"):
            return "bert"
        elif model_class_name.startswith("Roberta"):
            return "roberta"
        elif model_class_name.startswith("Albert"):
            return "albert"
        elif model_class_name.startswith("XLMRoberta"):
            return "roberta"
        else:
            raise KeyError(f"Add support for new model {model_class_name}")

    def forward(self, task_name, **kwargs):
        return self.taskmodels_dict[task_name](**kwargs)

As described above, the `MultitaskModel` class consists of only two components: the shared encoder and a dictionary of the individual task models. We can now create the task models by supplying the individual model classes and model configs. We use Transformers' AutoModel classes to automate the choice of model class for a given architecture (the text of this recipe assumes `roberta-base`; the code below uses the `model_name` resolved in the configuration step).

# Create the model or load pre-trained weights

In [None]:
multitask_model = MultitaskModel.create(
    model_name=model_name,
    model_type_dict={
        list(dataset_dict.keys())[0]: transformers.AutoModelForSequenceClassification,
        list(dataset_dict.keys())[1]: transformers.AutoModelForSequenceClassification,
    },
    model_config_dict={
        list(dataset_dict.keys())[0]: transformers.AutoConfig.from_pretrained(model_name, num_labels=num_labels[list(dataset_dict.keys())[0]]),
        list(dataset_dict.keys())[1]: transformers.AutoConfig.from_pretrained(model_name, num_labels=num_labels[list(dataset_dict.keys())[1]]),
    },
)

To confirm that both task models use the same encoder, we can check the data pointers of the respective encoders: the word embeddings in each model should point to the same memory location.
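
The check can be sketched on toy modules, without downloading any weights. `ToyTaskModel` is a hypothetical stand-in for a single-task model and is not part of this notebook; only the idea (compare `data_ptr()` of a shared parameter) carries over.

```python
import torch.nn as nn


class ToyTaskModel(nn.Module):
    """Stand-in for a single-task model: an encoder plus a task-specific head."""

    def __init__(self, num_labels):
        super().__init__()
        self.encoder = nn.Embedding(100, 8)  # plays the role of the transformer
        self.classifier = nn.Linear(8, num_labels)


task_a = ToyTaskModel(num_labels=31)
task_b = ToyTaskModel(num_labels=151)

# Freshly built models own separate encoder weights
before = task_a.encoder.weight.data_ptr() == task_b.encoder.weight.data_ptr()

# Same trick as MultitaskModel.create: point the second model at the first encoder
task_b.encoder = task_a.encoder

after = task_a.encoder.weight.data_ptr() == task_b.encoder.weight.data_ptr()
print(before, after)  # False True
```

For the real model, the same comparison would be made on actual weights, for example (assuming a RoBERTa-style encoder) between `multitask_model.encoder.embeddings.word_embeddings.weight.data_ptr()` and the matching tensor under each entry of `multitask_model.taskmodels_dict`.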

## Processing our task data

We have created a dictionary of NLP datasets above, but we need to do a little more work to convert the respective task data into model inputs.

The tokenizer corresponding to our model was already loaded above, so we start by defining one conversion function per task.

In [None]:
def set_feature_input_mode(example_batch, task_order):

  '''
  Select the input texts and the label column for one of the two tasks.
  `task_order` is 1 for the first task of the current Combination and 2 for the second.
  '''

  Input = {} # Input initialization
  Label = {"French2": "intent", "English2": "intent","Farsi": "intent"} # label encoding
  #---------------------------------------------------------------------------------------#

  if Combination == 0:
    if task_order == 1:
      Input["English2"] = list(example_batch['utt'])
    else:
      Input["French2"] = list(example_batch['utt'])
  elif Combination == 1:
    if task_order == 1:
      Input["English2"] = list(example_batch['utt'])
    else:
      Input["Farsi"] = list(example_batch['utt'])
  else:
    if task_order == 1:
      Input["French2"] = list(example_batch['utt'])
    else:
      Input["Farsi"] = list(example_batch['utt'])

  return Input[list(dataset_dict.keys())[task_order - 1]], Label[list(dataset_dict.keys())[task_order - 1]]

#---------------------------------------------------------------------------------------------------------------#
def Convert_To_Features_Task1(example_batch):
  Input, Label = set_feature_input_mode(example_batch, task_order = 1)
  # Pad and truncate every example to max_length (replaces the deprecated pad_to_max_length flag)
  Feature = tokenizer(Input, max_length=max_length, padding="max_length", truncation=True)
  Feature["labels"] = example_batch[Label]
  return Feature

#---------------------------------------------------------------------------------------------------------------#
def Convert_To_Features_Task2(example_batch):
  Input, Label = set_feature_input_mode(example_batch, task_order = 2)
  # Pad and truncate every example to max_length (replaces the deprecated pad_to_max_length flag)
  Feature = tokenizer(Input, max_length=max_length, padding="max_length", truncation=True)
  Feature["labels"] = example_batch[Label]
  return Feature

#---------------------------------------------------------------------------------------------------------------#
convert_func_dict = {
    list(dataset_dict.keys())[0]: Convert_To_Features_Task1,
    list(dataset_dict.keys())[1]: Convert_To_Features_Task2,
}

Now that we have defined the above functions, we can use the `dataset.map` method available in the NLP library to apply them over our entire datasets. The NLP library handles the mapping efficiently and caches the features.

In [None]:
columns_dict = {
    list(dataset_dict.keys())[0]: ['input_ids', 'attention_mask', 'labels'],
    list(dataset_dict.keys())[1]: ['input_ids', 'attention_mask', 'labels'],
}

features_dict = {}
for task_name, dataset in dataset_dict.items():
    features_dict[task_name] = {}
    for phase, phase_dataset in dataset.items():
        features_dict[task_name][phase] = phase_dataset.map(
            convert_func_dict[task_name],
            batched=True,
            load_from_cache_file=False,
        )
        print(task_name, phase, len(phase_dataset), len(features_dict[task_name][phase]))
        features_dict[task_name][phase].set_format(
            type="torch",
            columns=columns_dict[task_name],
        )
    print("\n")
        #print(task_name, phase, len(phase_dataset), len(features_dict[task_name][phase]))

As a recap:

* We have created our multi-task model by fusing several single-task Transformer models
* We have created a (cached) dictionary of featurized inputs for each of our tasks, using NLP dataset

Next up, we need to

1. Set up our data loading
2. Set up our Trainer
3. Start training!

## Preparing a multi-task data loader and Trainer

Setting up a multi-task data loader should be simple in principle - we simply need to sample from multiple single-task data loaders with some probability, and feed each batch to the multi-task model above. Of course, along with each batch, we also need to tell the model what task it is for, so `MultitaskModel` knows to use the right corresponding task-model.

However, because we want to use the built-in `Trainer` class in Transformers, this gets a little tricky, since the `Trainer` expects a single data loader, and expects a very specific format of per-batch data. This slice of code is somewhat of a hack around that constraint. (This can become a lot more streamlined with some tweaks to the Trainer code from the Hugging Face folks =))

We need to define a `MultitaskDataloader` that combines several data loaders into a single "data loader" - not so different from our multi-task model above! This `MultitaskDataloader` should do what we described: sample from different single-task data loaders, and yield a task batch and the corresponding task name (we're going to add the `task_name` to the batch data).
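
The sampling idea described above can be sketched in plain Python, independently of any data loader. `task_schedule` is a hypothetical helper written for this illustration: each task contributes as many entries as it has batches, and the combined list is shuffled, so tasks are drawn in proportion to their size.

```python
import random


def task_schedule(num_batches_per_task, seed=None):
    """Return a shuffled list of task names, one entry per batch in an epoch."""
    schedule = []
    for task_name, num_batches in num_batches_per_task.items():
        schedule += [task_name] * num_batches
    random.Random(seed).shuffle(schedule)
    return schedule


schedule = task_schedule({"English2": 3, "Farsi": 1}, seed=0)
print(len(schedule), schedule.count("English2"), schedule.count("Farsi"))  # 4 3 1
```

Because every batch of every task appears exactly once in the schedule, one pass over it is one full epoch over all tasks.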

We will also need to override the `get_train_dataloader` method of the `Trainer` to play well with our `MultitaskDataloader`. We do this with a `MultitaskTrainer`.
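
One detail of this hack is easier to see in isolation. According to the docstring in the cell below, the task name travels inside the batch as a string, and the string is given a no-op `.to(device)` method so that code calling `.to(device)` on every batch value does not fail. The sketch below imitates that situation; `TaskName` and `move_batch` are hypothetical names, and `move_batch` is a simplified imitation, not the Trainer's actual code.

```python
class TaskName(str):
    """Hypothetical stand-in for the notebook's StrIgnoreDevice."""

    def to(self, device):
        return self  # nothing to move: a string lives on no device


def move_batch(batch, device):
    # Simplified imitation of a loop that calls .to(device) on every value
    return {key: value.to(device) for key, value in batch.items()}


batch = {"task_name": TaskName("Farsi")}
moved = move_batch(batch, "cpu")
print(moved["task_name"], isinstance(moved["task_name"], str))  # Farsi True
```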

In [None]:
# Integrations must be imported before ML frameworks:
from transformers.integrations import hp_params
import dataclasses
import collections
import math
import sys
import time
from transformers.data.data_collator import DataCollator, InputDataClass, DefaultDataCollator
from transformers.debug_utils import DebugOption, DebugUnderflowOverflow
from transformers.trainer_callback import TrainerState
from transformers.trainer_pt_utils import IterableDatasetShard
from transformers.file_utils import WEIGHTS_NAME, CONFIG_NAME
from transformers.utils import logging
from transformers.configuration_utils import PretrainedConfig
from transformers import __version__
# set_seed is needed by the model_init branch of train() below
from transformers.trainer_utils import speed_metrics, get_last_checkpoint, ShardedDDPOption, TrainOutput, set_seed
from torch.utils.data.dataloader import DataLoader
from torch.utils.data.distributed import DistributedSampler
import torch.distributed as dist
from torch.utils.data.sampler import RandomSampler, Sampler, SequentialSampler
from typing import TYPE_CHECKING, Any, Callable, Dict, List, Optional, Tuple, Union
import numpy as np
import torch
import transformers
import warnings
if TYPE_CHECKING:
    import optuna
logger = logging.get_logger(__name__)

class NLPDataCollator:
    """
    Extending the existing DataCollator to work with NLP dataset batches
    """

    def __call__(
        self, features: List[Union[InputDataClass, Dict]]
    ) -> Dict[str, torch.Tensor]:
        first = features[0]
        if isinstance(first, dict):
            # NLP datasets currently present features as a list of dictionaries
            # (one per example), so we adapt the collate_batch logic for that
            if "labels" in first and first["labels"] is not None:
                if first["labels"].dtype == torch.int64:
                    labels = torch.tensor(
                        [f["labels"] for f in features], dtype=torch.long
                    )
                else:
                    labels = torch.tensor(
                        [f["labels"] for f in features], dtype=torch.float
                    )
                batch = {"labels": labels}
            for k, v in first.items():
                if k != "labels" and v is not None and not isinstance(v, str):
                    batch[k] = torch.stack([f[k] for f in features])
            return batch
        else:
            # otherwise, fall back to the library's default collator
            # (DefaultDataCollator has no collate_batch method in the pinned Transformers version)
            return transformers.default_data_collator(features)


class StrIgnoreDevice(str):
    """
    This is a hack. The Trainer is going to call .to(device) on every input
    value, but we need to pass in an additional `task_name` string.
    This prevents it from throwing an error.
    """

    def to(self, device):
        return self


class DataLoaderWithTaskname:
    """
    Wrapper around a DataLoader to also yield a task name
    """

    def __init__(self, task_name, data_loader):
        self.task_name = task_name
        self.data_loader = data_loader

        self.batch_size = data_loader.batch_size
        self.dataset = data_loader.dataset

    def __len__(self):
        return len(self.data_loader)

    def __iter__(self):
        for batch in self.data_loader:
            batch["task_name"] = StrIgnoreDevice(self.task_name)
            yield batch


class MultitaskDataloader:
    """
    Data loader that combines and samples from multiple single-task
    data loaders.
    """

    def __init__(self, dataloader_dict):
        self.dataloader_dict = dataloader_dict
        self.num_batches_dict = {
            task_name: len(dataloader)
            for task_name, dataloader in self.dataloader_dict.items()
        }
        self.task_name_list = list(self.dataloader_dict)
        self.dataset = [None] * sum(
            len(dataloader.dataset) for dataloader in self.dataloader_dict.values()
        )

    def __len__(self):
        return sum(self.num_batches_dict.values())

    def __iter__(self):
        """
        For each batch, sample a task, and yield a batch from the respective
        task Dataloader.

        We use size-proportional sampling, but you could easily modify this
        to sample from some-other distribution.
        """
        task_choice_list = []
        for i, task_name in enumerate(self.task_name_list):
            task_choice_list += [i] * self.num_batches_dict[task_name]
        task_choice_list = np.array(task_choice_list)
        np.random.shuffle(task_choice_list)
        dataloader_iter_dict = {
            task_name: iter(dataloader)
            for task_name, dataloader in self.dataloader_dict.items()
        }
        for task_choice in task_choice_list:
            task_name = self.task_name_list[task_choice]
            yield next(dataloader_iter_dict[task_name])


class MultitaskTrainer(transformers.Trainer):

    def training_step(self, model: nn.Module, inputs: Dict[str, Union[torch.Tensor, Any]]) -> torch.Tensor:
        """
        Perform a training step on a batch of inputs.
        Subclass and override to inject custom behavior.
        Args:
            model (:obj:`nn.Module`):
                The model to train.
            inputs (:obj:`Dict[str, Union[torch.Tensor, Any]]`):
                The inputs and targets of the model.
                The dictionary will be unpacked before being fed to the model. Most models expect the targets under the
                argument :obj:`labels`. Check your model's documentation for all accepted arguments.
        Return:
            :obj:`torch.Tensor`: The tensor with training loss on this batch.
        """
        model.train()
        inputs = self._prepare_inputs(inputs)

        # if is_sagemaker_mp_enabled():
        #     scaler = self.scaler if self.use_amp else None
        #     loss_mb = smp_forward_backward(model, inputs, self.args.gradient_accumulation_steps, scaler=scaler)
        #     return loss_mb.reduce_mean().detach().to(self.args.device)

        # if self.use_amp:
        #     with autocast():
        #         (loss, outputs) = self.compute_loss(model, inputs, return_outputs=True)
        # else:
        #     (loss, outputs) = self.compute_loss(model, inputs, return_outputs=True)
        (loss, outputs) = self.compute_loss(model, inputs, return_outputs=True)

        if self.args.n_gpu > 1:
            loss = loss.mean()  # mean() to average on multi-gpu parallel training

        if self.args.gradient_accumulation_steps > 1 and not self.deepspeed:
            # deepspeed handles loss scaling by gradient_accumulation_steps in its `backward`
            loss = loss / self.args.gradient_accumulation_steps

        if self.use_amp:
            self.scaler.scale(loss).backward()
        # elif self.use_apex:
        #     with amp.scale_loss(loss, self.optimizer) as scaled_loss:
                # scaled_loss.backward()
        elif self.deepspeed:
            # loss gets scaled under gradient_accumulation_steps in deepspeed
            loss = self.deepspeed.backward(loss)
        else:
            loss.backward()

        logits = outputs["logits"] if isinstance(outputs, dict) else outputs[1]

        return loss.detach(), logits.detach() #detach()?

    def train(
        self,
        resume_from_checkpoint: Optional[Union[str, bool]] = None,
        resume_from_pretrained: Optional[Union[str, bool]] = None,
        evaluate_during_training: Optional[Union[str, bool]] = None,
        trial: Union["optuna.Trial", Dict[str, Any]] = None,
        ignore_keys_for_eval: Optional[List[str]] = None,
        **kwargs,
    ):
        """
        Main training entry point.
        Args:
            resume_from_checkpoint (:obj:`str` or :obj:`bool`, `optional`):
                If a :obj:`str`, local path to a saved checkpoint as saved by a previous instance of
                :class:`~transformers.Trainer`. If a :obj:`bool` and equals `True`, load the last checkpoint in
                `args.output_dir` as saved by a previous instance of :class:`~transformers.Trainer`. If present,
                training will resume from the model/optimizer/scheduler states loaded here.
            trial (:obj:`optuna.Trial` or :obj:`Dict[str, Any]`, `optional`):
                The trial run or the hyperparameter dictionary for hyperparameter search.
            ignore_keys_for_eval (:obj:`List[str]`, `optional`)
                A list of keys in the output of your model (if it is a dictionary) that should be ignored when
                gathering predictions for evaluation during the training.
            kwargs:
                Additional keyword arguments used to hide deprecated arguments
        """


        self.use_amp = False  # workaround: training_step checks self.use_amp, so define it here (mixed precision off)

        ########################################################## Loading probabilities from array ###########################################################

        Prob_per_epoch_first_task, Prob_per_epoch_second_task = Fetching_required_status_multilingual.check_directory_status(dataset_dict, resume_from_checkpoint, path_to_checkpoint, Combination, dataset_code, num_epochs, None, None, Mode = 1)
        # print(Prob_per_epoch_first_task)

        ################################################################ END #################################################################################

        resume_from_checkpoint = None if not resume_from_checkpoint else resume_from_checkpoint

        # memory metrics - must set up as early as possible
        self._memory_tracker.start()

        args = self.args

        self.is_in_train = True

        # do_train is not a reliable argument, as it might not be set and .train() still called, so
        # the following is a workaround:
        if args.fp16_full_eval and not args.do_train:
            self._move_model_to_device(self.model, args.device)

        if "model_path" in kwargs:
            resume_from_checkpoint = kwargs.pop("model_path")
            warnings.warn(
                "`model_path` is deprecated and will be removed in a future version. Use `resume_from_checkpoint` "
                "instead.",
                FutureWarning,
            )
        if len(kwargs) > 0:
            raise TypeError(f"train() received unexpected keyword arguments: {', '.join(list(kwargs.keys()))}.")
        # This might change the seed so needs to run first.
        self._hp_search_setup(trial)

        # Model re-init
        model_reloaded = False
        if self.model_init is not None:
            # Seed must be set before instantiating the model when using model_init.
            set_seed(args.seed)
            self.model = self.call_model_init(trial)
            model_reloaded = True
            # Reinitializes optimizer and scheduler
            self.optimizer, self.lr_scheduler = None, None

        # Load potential model checkpoint
        if isinstance(resume_from_checkpoint, bool) and resume_from_checkpoint:
            resume_from_checkpoint = get_last_checkpoint(args.output_dir)
            if resume_from_checkpoint is None:
                raise ValueError(f"No valid checkpoint found in output directory ({args.output_dir})")

        ################################# Resume from pretrained for fine-tuning ###################################################

        if resume_from_pretrained is not None and resume_from_checkpoint is None:
            print("loading from the pretrained weights ... \n")
            if not os.path.isfile(os.path.join(resume_from_pretrained, WEIGHTS_NAME)):
                raise ValueError(f"Can't find a valid checkpoint at {resume_from_pretrained}")

            logger.info(f"Loading model from {resume_from_pretrained}.")

            if os.path.isfile(os.path.join(resume_from_pretrained, CONFIG_NAME)):
                config = PretrainedConfig.from_json_file(os.path.join(resume_from_pretrained, CONFIG_NAME))
                checkpoint_version = config.transformers_version
                if checkpoint_version is not None and checkpoint_version != __version__:
                    logger.warning(
                        f"You are resuming training from a checkpoint trained with {checkpoint_version} of "
                        f"Transformers but your current version is {__version__}. This is not recommended and could "
                        "yield to errors or unwanted behaviors."
                    )

            state_dict = torch.load(os.path.join(resume_from_pretrained, WEIGHTS_NAME), map_location="cpu")
            # If the model is on the GPU, it still works!
            load_result = self.model.load_state_dict(state_dict, strict=True)
            # release memory
            del state_dict

        ############################################### END ###############################################

        if resume_from_checkpoint is not None:
            # print(WEIGHTS_NAME)
            print(resume_from_checkpoint)
            if not os.path.isfile(os.path.join(resume_from_checkpoint, WEIGHTS_NAME)):
                raise ValueError(f"Can't find a valid checkpoint at {resume_from_checkpoint}")

            logger.info(f"Loading model from {resume_from_checkpoint}.")

            if os.path.isfile(os.path.join(resume_from_checkpoint, CONFIG_NAME)):
                config = PretrainedConfig.from_json_file(os.path.join(resume_from_checkpoint, CONFIG_NAME))
                checkpoint_version = config.transformers_version
                if checkpoint_version is not None and checkpoint_version != __version__:
                    logger.warning(
                        f"You are resuming training from a checkpoint trained with {checkpoint_version} of "
                        f"Transformers but your current version is {__version__}. This is not recommended and could "
                        "yield to errors or unwanted behaviors."
                    )

            if args.deepspeed:
                # will be resumed in deepspeed_init
                pass
            else:
                # We load the model state dict on the CPU to avoid an OOM error.
                state_dict = torch.load(os.path.join(resume_from_checkpoint, WEIGHTS_NAME), map_location="cpu")
                # If the model is on the GPU, it still works!
                # self._load_state_dict_in_model(state_dict) #?????????
                load_result = self.model.load_state_dict(state_dict, strict=True)

                # release memory
                del state_dict

        # If model was re-initialized, put it on the right device and update self.model_wrapped
        if model_reloaded:
            if self.place_model_on_device:
                self._move_model_to_device(self.model, args.device)
            self.model_wrapped = self.model

        # Keep track of whether we can call len() on the dataset or not
        train_dataset_is_sized = isinstance(self.train_dataset, collections.abc.Sized)

        # Data loader and number of training steps
        train_dataloader = self.get_train_dataloader()

        # Setting up training control variables:
        # number of training epochs: num_train_epochs
        # number of training steps per epoch: num_update_steps_per_epoch
        # total number of training steps to execute: max_steps
        total_train_batch_size = args.train_batch_size * args.gradient_accumulation_steps * args.world_size
        if train_dataset_is_sized:
            num_update_steps_per_epoch = len(train_dataloader) // args.gradient_accumulation_steps
            num_update_steps_per_epoch = max(num_update_steps_per_epoch, 1)
            if args.max_steps > 0:
                max_steps = args.max_steps
                num_train_epochs = args.max_steps // num_update_steps_per_epoch + int(
                    args.max_steps % num_update_steps_per_epoch > 0
                )
                # May be slightly incorrect if the last batch in the training dataloader has a smaller size but it's
                # the best we can do.
                num_train_samples = args.max_steps * total_train_batch_size
            else:
                max_steps = math.ceil(args.num_train_epochs * num_update_steps_per_epoch)
                num_train_epochs = math.ceil(args.num_train_epochs)
                num_train_samples = len(self.train_dataset) * args.num_train_epochs
        else:
            # see __init__. max_steps is set when the dataset has no __len__
            max_steps = args.max_steps
            # Setting a very large number of epochs so we go as many times as necessary over the iterator.
            num_train_epochs = sys.maxsize
            num_update_steps_per_epoch = max_steps
            num_train_samples = args.max_steps * total_train_batch_size

        if DebugOption.UNDERFLOW_OVERFLOW in self.args.debug:
            if self.args.n_gpu > 1:
                # nn.DataParallel(model) replicates the model, creating new variables and module
                # references registered here no longer work on other gpus, breaking the module
                raise ValueError(
                    "Currently --debug underflow_overflow is not supported under DP. Please use DDP (torch.distributed.launch)."
                )
            else:
                debug_overflow = DebugUnderflowOverflow(self.model)  # noqa

        delay_optimizer_creation = self.sharded_ddp is not None and self.sharded_ddp != ShardedDDPOption.SIMPLE
        # if args.deepspeed:
        #     deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
        #         self, num_training_steps=max_steps, resume_from_checkpoint=resume_from_checkpoint
        #     )
        #     self.model = deepspeed_engine.module
        #     self.model_wrapped = deepspeed_engine
        #     self.deepspeed = deepspeed_engine
        #     self.optimizer = optimizer
        #     self.lr_scheduler = lr_scheduler
        # elif not delay_optimizer_creation:
        if not delay_optimizer_creation:
            self.create_optimizer_and_scheduler(num_training_steps=max_steps)

        self.state = TrainerState()
        self.state.is_hyper_param_search = trial is not None

        model = self._wrap_model(self.model_wrapped)

        # for the rest of this function `model` is the outside model, whether it was wrapped or not
        if model is not self.model:
            self.model_wrapped = model

        if delay_optimizer_creation:
            self.create_optimizer_and_scheduler(num_training_steps=max_steps)

        # Check if saved optimizer or scheduler states exist
        self._load_optimizer_and_scheduler(resume_from_checkpoint)

        # important: at this point:
        # self.model         is the Transformers Model
        # self.model_wrapped is DDP(Transformers Model), Deepspeed(Transformers Model), etc.

        # Train!
        num_examples = (
            self.num_examples(train_dataloader) if train_dataset_is_sized else total_train_batch_size * args.max_steps
        )

        logger.info("***** Running training *****")
        logger.info(f"  Num examples = {num_examples}")
        logger.info(f"  Num Epochs = {num_train_epochs}")
        logger.info(f"  Instantaneous batch size per device = {args.per_device_train_batch_size}")
        logger.info(f"  Total train batch size (w. parallel, distributed & accumulation) = {total_train_batch_size}")
        logger.info(f"  Gradient Accumulation steps = {args.gradient_accumulation_steps}")
        logger.info(f"  Total optimization steps = {max_steps}")

        self.state.epoch = 0
        start_time = time.time()
        epochs_trained = 0
        steps_trained_in_current_epoch = 0
        steps_trained_progress_bar = None

        # Check if continuing training from a checkpoint
        if resume_from_checkpoint is not None and os.path.isfile(
            os.path.join(resume_from_checkpoint, "trainer_state.json")
        ):
            self.state = TrainerState.load_from_json(os.path.join(resume_from_checkpoint, "trainer_state.json"))
            epochs_trained = self.state.global_step // num_update_steps_per_epoch
            if not args.ignore_data_skip:
                steps_trained_in_current_epoch = self.state.global_step % (num_update_steps_per_epoch)
                steps_trained_in_current_epoch *= args.gradient_accumulation_steps
            else:
                steps_trained_in_current_epoch = 0

            logger.info("  Continuing training from checkpoint, will skip to saved global_step")
            logger.info(f"  Continuing training from epoch {epochs_trained}")
            logger.info(f"  Continuing training from global step {self.state.global_step}")
            if not args.ignore_data_skip:
                logger.info(
                    f"  Will skip the first {epochs_trained} epochs then the first {steps_trained_in_current_epoch} "
                    "batches in the first epoch. If this takes a lot of time, you can add the `--ignore_data_skip` "
                    "flag to your launch command, but you will resume the training on data already seen by your model."
                )
                if self.is_local_process_zero() and not args.disable_tqdm:
                    steps_trained_progress_bar = tqdm(total=steps_trained_in_current_epoch)
                    steps_trained_progress_bar.set_description("Skipping the first batches")

        # Update the references
        self.callback_handler.model = self.model
        self.callback_handler.optimizer = self.optimizer
        self.callback_handler.lr_scheduler = self.lr_scheduler
        self.callback_handler.train_dataloader = train_dataloader
        self.state.trial_name = self.hp_name(trial) if self.hp_name is not None else None
        self.state.trial_params = hp_params(trial) if trial is not None else None
        # This should be the same if the state has been saved but in case the training arguments changed, it's safer
        # to set this after the load.
        self.state.max_steps = max_steps
        self.state.num_train_epochs = num_train_epochs
        self.state.is_local_process_zero = self.is_local_process_zero()
        self.state.is_world_process_zero = self.is_world_process_zero()

        # tr_loss is a tensor to avoid synchronization of TPUs through .item()
        tr_loss = torch.tensor(0.0).to(args.device)
        # _total_loss_scalar is updated every time .item() has to be called on tr_loss and stores the sum of all losses
        self._total_loss_scalar = 0.0
        self._globalstep_last_logged = self.state.global_step
        model.zero_grad()

        self.control = self.callback_handler.on_train_begin(args, self.state, self.control)

        # Skip the first epochs_trained epochs to get the random state of the dataloader at the right point.
        if not args.ignore_data_skip:
            for epoch in range(epochs_trained):
                # We just need to begin an iteration to create the randomization of the sampler.
                for _ in train_dataloader:
                    break

        for epoch in range(epochs_trained, num_train_epochs):
            if isinstance(train_dataloader, DataLoader) and isinstance(train_dataloader.sampler, DistributedSampler):
                train_dataloader.sampler.set_epoch(epoch)
            elif isinstance(train_dataloader.dataset, IterableDatasetShard):
                train_dataloader.dataset.set_epoch(epoch)

            # if is_torch_tpu_available():
            #     parallel_loader = pl.ParallelLoader(train_dataloader, [args.device]).per_device_loader(args.device)
            #     epoch_iterator = parallel_loader
            # else:
            #     epoch_iterator = train_dataloader
            epoch_iterator = train_dataloader

            train_preds_dict = {task_name: None for task_name in dataset_dict}  # one slot per task, filled batch by batch

            # Reset the past mems state at the beginning of each epoch if necessary.
            if args.past_index >= 0:
                self._past = None

            steps_in_epoch = (
                len(epoch_iterator) if train_dataset_is_sized else args.max_steps * args.gradient_accumulation_steps
            )
            self.control = self.callback_handler.on_epoch_begin(args, self.state, self.control)

            for step, inputs in enumerate(epoch_iterator):

                # Skip past any already trained steps if resuming training
                if steps_trained_in_current_epoch > 0:
                    steps_trained_in_current_epoch -= 1
                    if steps_trained_progress_bar is not None:
                        steps_trained_progress_bar.update(1)
                    if steps_trained_in_current_epoch == 0:
                        self._load_rng_state(resume_from_checkpoint)
                    continue
                elif steps_trained_progress_bar is not None:
                    steps_trained_progress_bar.close()
                    steps_trained_progress_bar = None

                if step % args.gradient_accumulation_steps == 0:
                    self.control = self.callback_handler.on_step_begin(args, self.state, self.control)

                if (
                    ((step + 1) % args.gradient_accumulation_steps != 0)
                    and args.local_rank != -1
                    and args._no_sync_in_gradient_accumulation
                ):
                    # Avoid unnecessary DDP synchronization since there will be no backward pass on this example.
                    with model.no_sync():
                        (temp_loss, label_logits) = self.training_step(model, inputs)
                        tr_loss += temp_loss
                        # tr_loss += self.training_step(model, inputs)
                else:
                    (temp_loss, label_logits) = self.training_step(model, inputs)
                    tr_loss += temp_loss
                    # tr_loss += self.training_step(model, inputs)

                ################################## Extracting samples' probabilities ###############################################

                preds = nn.Softmax(dim=1)(label_logits)  # convert logits to class probabilities

                # Move preds to the CPU as a NumPy array
                train_preds = preds.detach().cpu().numpy()

                # Stack this batch under its task, so rows follow the order in which examples were seen
                task_name = inputs["task_name"]
                if train_preds_dict[task_name] is None:  # first batch for this task
                    train_preds_dict[task_name] = train_preds
                else:
                    train_preds_dict[task_name] = np.vstack((train_preds_dict[task_name], train_preds))

                ######################################################## END ##################################################

                self.current_flos += float(self.floating_point_ops(inputs))

                # Optimizer step for deepspeed must be called on every step regardless of the value of gradient_accumulation_steps
                if self.deepspeed:
                    self.deepspeed.step()

                if (step + 1) % args.gradient_accumulation_steps == 0 or (
                    # last step in epoch but step is always smaller than gradient_accumulation_steps
                    steps_in_epoch <= args.gradient_accumulation_steps
                    and (step + 1) == steps_in_epoch
                ):
                    # Gradient clipping
                    if args.max_grad_norm is not None and args.max_grad_norm > 0 and not self.deepspeed:
                        # deepspeed does its own clipping

                        if self.use_amp:
                            # AMP: gradients need unscaling
                            self.scaler.unscale_(self.optimizer)

                        if hasattr(self.optimizer, "clip_grad_norm"):
                            # Some optimizers (like the sharded optimizer) have a specific way to do gradient clipping
                            self.optimizer.clip_grad_norm(args.max_grad_norm)
                        elif hasattr(model, "clip_grad_norm_"):
                            # Some models (like FullyShardedDDP) have a specific way to do gradient clipping
                            model.clip_grad_norm_(args.max_grad_norm)
                        else:
                            # Revert to normal clipping otherwise, handling Apex or full precision
                            nn.utils.clip_grad_norm_(
                                # amp.master_params(self.optimizer) if self.use_apex else model.parameters(),
                                model.parameters(),
                                args.max_grad_norm,
                            )

                    # Optimizer step
                    optimizer_was_run = True
                    if self.deepspeed:
                        pass  # called outside the loop
                    # elif is_torch_tpu_available():
                    #     xm.optimizer_step(self.optimizer)
                    elif self.use_amp:
                        scale_before = self.scaler.get_scale()
                        self.scaler.step(self.optimizer)
                        self.scaler.update()
                        scale_after = self.scaler.get_scale()
                        optimizer_was_run = scale_before <= scale_after
                    else:
                        self.optimizer.step()

                    if optimizer_was_run and not self.deepspeed:
                        self.lr_scheduler.step()

                    model.zero_grad()
                    self.state.global_step += 1
                    self.state.epoch = epoch + (step + 1) / steps_in_epoch
                    self.control = self.callback_handler.on_step_end(args, self.state, self.control)

                    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
                else:
                    self.control = self.callback_handler.on_substep_end(args, self.state, self.control)

                if self.control.should_epoch_stop or self.control.should_training_stop:
                    break


            #################################################### Saving tensors of probabilities per epoch and evaluating #####################################################

            Prob_per_epoch_first_task[epoch][:,:] = train_preds_dict[list(dataset_dict.keys())[0]]
            Prob_per_epoch_second_task[epoch][:,:] = train_preds_dict[list(dataset_dict.keys())[1]]

            Fetching_required_status_multilingual.check_directory_status(dataset_dict, resume_from_checkpoint, path_to_checkpoint, Combination, dataset_code, num_epochs, Prob_per_epoch_first_task, Prob_per_epoch_second_task, Mode = 0)

            # savetxt(os.path.join(path_to_checkpoint, 'Prob_per_epoch_Sentiment.csv'), Prob_per_epoch_Sentiment.reshape(Prob_per_epoch_Sentiment.shape[0], -1), delimiter=',')
            # savetxt(os.path.join(path_to_checkpoint, 'Prob_per_epoch_Paraphrase.csv'), Prob_per_epoch_Paraphrase.reshape(Prob_per_epoch_Paraphrase.shape[0],-1), delimiter=',')

            if (evaluate_during_training):
                print("\nEvaluating on test dataset: (epoch %d)" %(epoch + 1))
                Fetching_required_status_multilingual.multitask_eval_fn(model, features_dict, model_name, dataset_dict, self, tokenizer, max_length, batch_size = B_size)


            ######################################################################### END ########################################################################

            self.control = self.callback_handler.on_epoch_end(args, self.state, self.control)
            self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)

            if DebugOption.TPU_METRICS_DEBUG in self.args.debug:
                # if is_torch_tpu_available():
                #     # tpu-comment: Logging debug metrics for PyTorch/XLA (compile, execute times, ops, etc.)
                #     xm.master_print(met.metrics_report())
                # else:
                logger.warning(
                        "You enabled PyTorch/XLA debug metrics but you don't have a TPU "
                        "configured. Check your training configuration if this is unexpected."
                    )
            if self.control.should_training_stop:
                break



        if args.past_index and hasattr(self, "_past"):
            # Clean the state at the end of training
            delattr(self, "_past")

        logger.info("\n\nTraining completed. Do not forget to share your model on huggingface.co/models =)\n\n")
        if args.load_best_model_at_end and self.state.best_model_checkpoint is not None:
            # Wait for everyone to get here so we are sure the model has been saved by process 0.
            # if is_torch_tpu_available():
            #     xm.rendezvous("load_best_model_at_end")
            # elif args.local_rank != -1:
            if args.local_rank != -1:
                dist.barrier()

            logger.info(
                f"Loading best model from {self.state.best_model_checkpoint} (score: {self.state.best_metric})."
            )

            best_model_path = os.path.join(self.state.best_model_checkpoint, WEIGHTS_NAME)
            if os.path.exists(best_model_path):
                # We load the model state dict on the CPU to avoid an OOM error.
                state_dict = torch.load(best_model_path, map_location="cpu")
                # load_state_dict copies the CPU tensors into the model's parameters, so this also works on GPU.
                # Original call, kept for reference: self._load_state_dict_in_model(state_dict)
                load_result = self.model.load_state_dict(state_dict, strict=True)

            else:
                logger.warning(
                    f"Could not locate the best model at {best_model_path}, if you are running a distributed training "
                    "on multiple nodes, you should activate `--save_on_each_node`."
                )

            if self.deepspeed:
                self.deepspeed.load_checkpoint(
                    self.state.best_model_checkpoint, load_optimizer_states=False, load_lr_scheduler_states=False
                )

        # add remaining tr_loss
        self._total_loss_scalar += tr_loss.item()
        train_loss = self._total_loss_scalar / self.state.global_step

        metrics = speed_metrics("train", start_time, num_samples=num_train_samples, num_steps=self.state.max_steps)
        self.store_flos()
        metrics["total_flos"] = self.state.total_flos
        metrics["train_loss"] = train_loss

        self.is_in_train = False

        self._memory_tracker.stop_and_update_metrics(metrics)

        self.log(metrics)

        self.control = self.callback_handler.on_train_end(args, self.state, self.control)

        # The stock Trainer returns only TrainOutput(self.state.global_step, train_loss, metrics);
        # this version also returns the per-epoch probability arrays of both tasks.
        return TrainOutput(self.state.global_step, train_loss, metrics), Prob_per_epoch_first_task, Prob_per_epoch_second_task

    def get_single_train_dataloader(self, task_name, train_dataset):
        """
        Create a single-task data loader that also yields task names
        """
        if self.train_dataset is None:
            raise ValueError("Trainer: training requires a train_dataset.")

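        # Changed from the stock Trainer: the random/distributed sampler (commented out below) is replaced
        # by a SequentialSampler, presumably so every epoch visits each task's examples in the same order
        # and the rows of the per-epoch probability arrays stay aligned.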
        # train_sampler = (
        #     RandomSampler(train_dataset)
        #     if self.args.local_rank == -1
        #     else DistributedSampler(train_dataset)
        # )
        train_sampler = SequentialSampler(train_dataset)


        data_loader = DataLoaderWithTaskname(
            task_name=task_name,
            data_loader=DataLoader(
                train_dataset,
                batch_size=self.args.train_batch_size,
                sampler=train_sampler,
                collate_fn=self.data_collator,
            ),
        )
        return data_loader

    def get_train_dataloader(self):
        """
        Returns a MultitaskDataloader, which is not actually a Dataloader
        but an iterable that returns a generator that samples from each
        task Dataloader
        """
        return MultitaskDataloader(
            {
                task_name: self.get_single_train_dataloader(task_name, task_dataset)
                for task_name, task_dataset in self.train_dataset.items()
            }
        )
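
The part of `train` that differs most from the stock loop is the per-task probability bookkeeping. As a rough standalone illustration (NumPy only, toy logits instead of a real model; `softmax`, `collect_epoch_probs`, and the task names `task_a` and `task_b` are hypothetical names chosen for this sketch), each batch's logits are turned into probabilities and stacked under the task the batch came from:

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def collect_epoch_probs(batches, task_names):
    """Stack per-batch probabilities under their task name (hypothetical helper).

    `batches` is an iterable of (task_name, logits) pairs in training order.
    """
    probs = {name: None for name in task_names}
    for task_name, logits in batches:
        batch_probs = softmax(logits)
        if probs[task_name] is None:  # first batch for this task
            probs[task_name] = batch_probs
        else:
            probs[task_name] = np.vstack((probs[task_name], batch_probs))
    return probs


# Toy data: two tasks with different numbers of classes, batches interleaved.
rng = np.random.default_rng(0)
toy_batches = [
    ("task_a", rng.normal(size=(4, 2))),
    ("task_b", rng.normal(size=(4, 3))),
    ("task_a", rng.normal(size=(2, 2))),
]

epoch_probs = collect_epoch_probs(toy_batches, ["task_a", "task_b"])
print(epoch_probs["task_a"].shape)  # one row per task_a example seen in the epoch
print(epoch_probs["task_b"].shape)
```

At the end of an epoch, the stacked array for a task has one row per training example seen, and that array is what gets copied into the epoch's slot of `Prob_per_epoch_first_task` or `Prob_per_epoch_second_task`.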

## Time to train!

Okay, we have done all the hard work; now it is time for it to pay off. We can simply create our `MultitaskTrainer` and start training!

(This takes about 45 minutes for me on Colab, but it will depend on the GPU you are allocated.)
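
How long the run takes is driven largely by the number of optimizer steps. The sketch below mirrors the step arithmetic that `train` uses for a dataset with a known length; `plan_training_steps` and the numbers passed to it are made up for illustration and are not part of the Trainer API.

```python
import math


def plan_training_steps(num_batches_per_epoch, gradient_accumulation_steps,
                        num_train_epochs, max_steps=-1):
    """Return (updates_per_epoch, total_updates, epochs) for a sized dataset.

    Hypothetical helper that follows the same arithmetic as the training loop.
    """
    # One optimizer update happens every `gradient_accumulation_steps` batches.
    updates_per_epoch = max(num_batches_per_epoch // gradient_accumulation_steps, 1)
    if max_steps > 0:
        # A fixed step budget overrides the epoch count; round epochs up.
        total_updates = max_steps
        epochs = max_steps // updates_per_epoch + int(max_steps % updates_per_epoch > 0)
    else:
        total_updates = math.ceil(num_train_epochs * updates_per_epoch)
        epochs = math.ceil(num_train_epochs)
    return updates_per_epoch, total_updates, epochs


# 1000 batches per epoch, accumulate over 4 batches, train for 3 epochs.
print(plan_training_steps(1000, 4, 3))                 # (250, 750, 3)
# Same setup, but capped at 600 optimizer steps.
print(plan_training_steps(1000, 4, 3, max_steps=600))  # (250, 600, 3)
```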

In [None]:
# !pip install -q wandb
os.environ["WANDB_DISABLED"] = "true"

In [None]:
# Drop unreachable Python objects first, then release cached GPU memory
gc.collect()
torch.cuda.empty_cache()

In [None]:
# memory_summary returns a string; print it so the table renders with line breaks
print(torch.cuda.memory_summary(device=None, abbreviated=False))

In [None]:
train_dataset = {
    task_name: dataset["train"]
    for task_name, dataset in features_dict.items()
}

eval_dataset = features_dict


trainer = MultitaskTrainer(
    model=multitask_model,
    args=transformers.TrainingArguments(
        output_dir = path_to_checkpoint,
        overwrite_output_dir = True,
        learning_rate = lr,
        do_train = True,
        num_train_epochs = num_epochs,
        per_device_train_batch_size = B_size,
        save_steps = lg_step * EP_saved,
        logging_steps = lg_step
    ),
    data_collator = NLPDataCollator(),
    train_dataset = train_dataset,
    eval_dataset = eval_dataset
)
# `train` returns (TrainOutput, first-task probabilities, second-task probabilities)
an = trainer.train(
    resume_from_checkpoint = (
        None if Last_Actual_checkpoint is None else os.path.join(path_to_checkpoint, Last_Actual_checkpoint)
    ),
    resume_from_pretrained = (
        None if Last_PreTrained_checkpoint is None else os.path.join(path_to_pretrained_weight, Last_PreTrained_checkpoint)
    ),
    evaluate_during_training = True,
)
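
When `resume_from_checkpoint` points at a checkpoint that contains `trainer_state.json`, `train` works out how many whole epochs and how many batches of the current epoch were already processed, and skips them (assuming `ignore_data_skip` is left off). A minimal sketch of that arithmetic, with a hypothetical helper name and made-up numbers:

```python
def resume_position(global_step, updates_per_epoch, gradient_accumulation_steps):
    """Return (epochs_trained, batches_to_skip) for a saved global step.

    Hypothetical helper that follows the resume logic in the training loop.
    """
    # Whole epochs already completed.
    epochs_trained = global_step // updates_per_epoch
    # Optimizer updates done inside the current epoch, converted back to batches.
    batches_to_skip = (global_step % updates_per_epoch) * gradient_accumulation_steps
    return epochs_trained, batches_to_skip


# Checkpoint saved at step 600, with 250 updates per epoch and 4 batches per update.
print(resume_position(600, 250, 4))  # (2, 400)
```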

# Saving the probabilities extracted during training

In [None]:
[Prob_per_epoch_first_task, Prob_per_epoch_second_task] = an[1:3]

In [None]:
print(Prob_per_epoch_first_task.shape)
print(Prob_per_epoch_second_task.shape)

In [None]:
Fetching_required_status_multilingual.save_final_prop(dataset_code, Fine_tune, Combination, Prob_per_epoch_first_task, Prob_per_epoch_second_task)
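
`save_final_prop` is defined in the helper module and is not shown here. If a plain CSV is all you need, one option, following the commented-out `savetxt` lines in the training loop, is to flatten every axis after the first before writing and to restore the shape after reading. The array below is random toy data standing in for a `(num_epochs, num_examples, num_classes)` probability array, and the file name is a placeholder:

```python
import os
import tempfile

import numpy as np

# Toy stand-in for a per-epoch probability array: 3 epochs, 5 examples, 2 classes.
probs = np.random.default_rng(0).random((3, 5, 2))

# CSV files are two-dimensional, so write one row per epoch.
csv_path = os.path.join(tempfile.mkdtemp(), "prob_per_epoch_toy.csv")
np.savetxt(csv_path, probs.reshape(probs.shape[0], -1), delimiter=",")

# Reading it back requires knowing the original shape.
restored = np.loadtxt(csv_path, delimiter=",").reshape(probs.shape)
print(np.allclose(probs, restored))
```

The original shape is not stored in the CSV, so it has to be recorded separately (or the array saved with `np.save` instead).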

All done! Now we can evaluate our multi-task model on each of its tasks. Since each task is evaluated individually, single-task data loaders are enough.

We will use the (private) `_prediction_loop` method from the Trainer.

# **Evaluation** on test data

In [None]:
Fetching_required_status_multilingual.multitask_eval_fn(multitask_model, features_dict, model_name, dataset_dict, trainer, tokenizer, max_length, batch_size = B_size)

### Copying from unlimited drive to limited drive as a backup

In [None]:
# !cp -r "/content/drive/Shareddrives/Gdrive/NLP Bachelors' Project/checkpoint/MLL(French,Farsi)/checkpoint-14400"  "/content/drive/MyDrive/NLP Bachelors' Project/checkpoint/MLL(French,Farsi)/"