# T5 on TPU 💥🚀

In this notebook we will see how to train a T5 model on TPU with Huggingface's awesome new [trainer](https://github.com/huggingface/transformers/blob/master/src/transformers/trainer.py). We will train the T5 base model on the SQuAD dataset for the QA task. We will use the recently released [nlp](https://github.com/huggingface/nlp) package to load and process the dataset in just a few lines.

First, make sure you are connected to a high-RAM instance. This will not work on the 12 GB Colab instance.

In [None]:
# Crash on purpose to get more ram :
import torch
torch.tensor([10.]*10000000000)

Let's install [PyTorch/XLA](https://github.com/pytorch/xla) which enables PyTorch on TPU. Make sure you install the nightly version, as the trainer breaks on other versions.

In [None]:
VERSION = "nightly"  #@param ["1.5" , "20200325", "nightly"]
!curl https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py
!python pytorch-xla-env-setup.py --version $VERSION

Updating TPU and VM. This may take around 2 minutes.
Updating TPU runtime to pytorch-nightly ...
Uninstalling torch-1.5.0+cu101:
Done updating TPU runtime: <Response [200]>
  Successfully uninstalled torch-1.5.0+cu101
Uninstalling torchvision-0.6.0+cu101:
  Successfully uninstalled torchvision-0.6.0+cu101
Copying gs://tpu-pytorch/wheels/torch-nightly-cp36-cp36m-linux_x86_64.whl...
- [1 files][ 91.1 MiB/ 91.1 MiB]                                                
Operation completed over 1 objects/91.1 MiB.                                     
Copying gs://tpu-pytorch/wheels/torch_xla-nightly-cp36-cp36m-linux_x86_64.whl...
- [1 files][119.8 MiB/119.8 MiB]                       

Install transformers and the nlp package. Restart the Colab runtime after this.

In [None]:
!git clone https://github.com/huggingface/transformers.git
!pip install ./transformers
!pip install -U nlp

Cloning into 'transformers'...
remote: Enumerating objects: 94, done.

--2020-05-16 16:02:40--  https://raw.githubusercontent.com/huggingface/transformers/2d184cb553ee20943b03b253f44300e466357871/examples/xla_spawn.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1913 (1.9K) [text/plain]
Saving to: ‘xla_spawn.py’


2020-05-16 16:02:41 (40.3 MB/s) - ‘xla_spawn.py’ saved [1913/1913]



## Load and process data

Let's load and process the dataset using the nlp library. We will process the examples in the following way to cast the QA task in a text-to-text setting:

**input**
question: question_text  context: context 

**target**
answer_text
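
As a quick illustration of the format above, here is a minimal sketch that builds the input and target strings for one hand-written record. The record and the helper name `to_text_pair` are made up for this example; only the template comes from this notebook, and the EOS token is appended separately by `add_eos_to_examples`.

```python
def to_text_pair(question, context, answer):
    """Cast one QA record into (input_text, target_text) strings."""
    input_text = 'question: %s  context: %s' % (question, context)
    target_text = answer
    return input_text, target_text


# Hypothetical record, not taken from SQuAD.
source, target = to_text_pair(
    question='Which package loads the dataset?',
    context='The nlp package loads the dataset in a few lines.',
    answer='nlp',
)
print(source)
print(target)
```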

In [None]:
import torch
import nlp
from transformers import T5Tokenizer

In [None]:
tokenizer = T5Tokenizer.from_pretrained('t5-base')





In [None]:
# format each example as input and target text, and add the eos token at the end
def add_eos_to_examples(example):
    example['input_text'] = 'question: %s  context: %s </s>' % (example['question'], example['context'])
    example['target_text'] = '%s </s>' % example['answers']['text'][0]
    return example

# tokenize the examples
def convert_to_features(example_batch):
    input_encodings = tokenizer.batch_encode_plus(example_batch['input_text'], pad_to_max_length=True, max_length=512)
    target_encodings = tokenizer.batch_encode_plus(example_batch['target_text'], pad_to_max_length=True, max_length=16)

    encodings = {
        'input_ids': input_encodings['input_ids'], 
        'attention_mask': input_encodings['attention_mask'],
        'target_ids': target_encodings['input_ids'],
        'target_attention_mask': target_encodings['attention_mask']
    }

    return encodings

In [None]:
# load train and validation split of squad
train_dataset  = nlp.load_dataset('squad', split=nlp.Split.TRAIN)
valid_dataset = nlp.load_dataset('squad', split=nlp.Split.VALIDATION)

# map add_eos_to_examples function to the dataset example wise 
train_dataset = train_dataset.map(add_eos_to_examples)
# map convert_to_features batch wise
train_dataset = train_dataset.map(convert_to_features, batched=True)

valid_dataset = valid_dataset.map(add_eos_to_examples, load_from_cache_file=False)
valid_dataset = valid_dataset.map(convert_to_features, batched=True, load_from_cache_file=False)


# set the tensor type and the columns which the dataset should return
columns = ['input_ids', 'target_ids', 'attention_mask', 'target_attention_mask']
train_dataset.set_format(type='torch', columns=columns)
valid_dataset.set_format(type='torch', columns=columns)



Downloading and preparing dataset squad/plain_text (download: 33.51 MiB, generated: 85.75 MiB, total: 119.27 MiB) to /root/.cache/huggingface/datasets/squad/plain_text/1.0.0...



Dataset squad downloaded and prepared to /root/.cache/huggingface/datasets/squad/plain_text/1.0.0. Subsequent calls will reuse this data.


87599it [00:04, 19002.06it/s]
100%|██████████| 88/88 [00:50<00:00,  1.75it/s]
10570it [00:00, 18815.95it/s]
100%|██████████| 11/11 [00:06<00:00,  1.77it/s]


In [None]:
len(train_dataset), len(valid_dataset)

(87599, 10570)

In [None]:
# cache the dataset, so we can load it directly for training

torch.save(train_dataset, 'train_data.pt')
torch.save(valid_dataset, 'valid_data.pt')

For more details on how to use the nlp library check out this [notebook](https://colab.research.google.com/github/huggingface/nlp/blob/master/notebooks/Overview.ipynb).

## Write training script

Using the `Trainer` is pretty straightforward. Here are the 4 basic steps needed to use it.

1. **Parse the arguments needed**. These are divided into 3 parts for clarity and separation (TrainingArguments, ModelArguments and DataTrainingArguments).

  1. **TrainingArguments**: These are basically the training hyperparameters such as learning rate, batch size, weight decay, gradient accumulation steps etc. See all possible arguments [here](https://github.com/huggingface/transformers/blob/master/src/transformers/training_args.py). These are used by the Trainer.

  2. **ModelArguments**: These are the arguments for the model that you want to use such as the model_name_or_path, tokenizer_name etc. You'll need these to load the model and tokenizer.

  3. **DataTrainingArguments**: As the name suggests, these are the arguments needed for the dataset, such as the directory where your files are stored. You'll need these to load/process the dataset.

  TrainingArguments are already defined in the `TrainingArguments` class; you'll need to define the `ModelArguments` and `DataTrainingArguments` classes for your task.




2. Load train and eval datasets
3. Initialize the `Trainer`

    These are the minimum parameters which you'll need for initializing `Trainer`. For the full list check [here](https://github.com/huggingface/transformers/blob/master/src/transformers/trainer.py#L107)

    ```
      model: PreTrainedModel
      args: TrainingArguments
      train_dataset: Optional[Dataset]
      eval_dataset: Optional[Dataset]
    ```
4. Start training with  `trainer.train`

    Call `trainer.train` and let the magic begin!


There are lots of things which the trainer handles for you out of the box, such as gradient accumulation, fp16 training, setting up the optimizer and scheduler, and logging with wandb. I didn't set up wandb for this experiment, but will explore it for sure in a future experiment.
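
To make step 1 more concrete, here is a rough standard-library sketch of the idea behind splitting one flat set of arguments across several dataclasses. It is not how `HfArgumentParser` is implemented; `split_args`, `ModelArgs` and `DataArgs` are hypothetical names used only for this illustration.

```python
import dataclasses
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelArgs:  # hypothetical, stands in for ModelArguments
    model_name_or_path: str
    tokenizer_name: Optional[str] = None


@dataclass
class DataArgs:  # hypothetical, stands in for DataTrainingArguments
    train_file_path: str = 'train_data.pt'
    max_len: int = 512


def split_args(flat, *arg_classes):
    """Give each dataclass the keys from `flat` that match its field names."""
    parsed = []
    for cls in arg_classes:
        names = {f.name for f in dataclasses.fields(cls)}
        parsed.append(cls(**{k: v for k, v in flat.items() if k in names}))
    return tuple(parsed)


# Keys that match no field (here 'unused_key') are simply skipped in this sketch.
flat = {'model_name_or_path': 't5-base', 'max_len': 512, 'unused_key': 1}
model_args, data_args = split_args(flat, ModelArgs, DataArgs)
print(model_args)
print(data_args)
```

Keeping the groups in separate dataclasses means each part of the script only sees the settings it needs, while everything can still live in one file.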

In [None]:
import dataclasses
import logging
import os
import sys
from dataclasses import dataclass, field
from typing import Dict, List, Optional

import numpy as np
import torch

from transformers import T5ForConditionalGeneration, T5Tokenizer, EvalPrediction
from transformers import (
    HfArgumentParser,
    DataCollator,
    Trainer,
    TrainingArguments,
    set_seed,
)


logger = logging.getLogger(__name__)

# prepares lm_labels from target_ids, returns examples with keys as expected by the forward method
# this is necessary because the trainer directly passes this dict as arguments to the model
# so make sure the keys match the parameter names of the forward method
@dataclass
class T2TDataCollator(DataCollator):
    def collate_batch(self, batch: List) -> Dict[str, torch.Tensor]:
        """
        Take a list of samples from a Dataset and collate them into a batch.
        Returns:
            A dictionary of tensors
        """
        input_ids = torch.stack([example['input_ids'] for example in batch])
        lm_labels = torch.stack([example['target_ids'] for example in batch])
        lm_labels[lm_labels[:, :] == 0] = -100
        attention_mask = torch.stack([example['attention_mask'] for example in batch])
        decoder_attention_mask = torch.stack([example['target_attention_mask'] for example in batch])
        

        return {
            'input_ids': input_ids, 
            'attention_mask': attention_mask,
            'lm_labels': lm_labels, 
            'decoder_attention_mask': decoder_attention_mask
        }


@dataclass
class ModelArguments:
    """
    Arguments pertaining to which model/config/tokenizer we are going to fine-tune from.
    """

    model_name_or_path: str = field(
        metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"}
    )
    tokenizer_name: Optional[str] = field(
        default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
    )
    cache_dir: Optional[str] = field(
        default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from s3"}
    )

@dataclass
class DataTrainingArguments:
    """
    Arguments pertaining to what data we are going to input our model for training and eval.
    """
    train_file_path: Optional[str] = field(
        default='train_data.pt',
        metadata={"help": "Path for cached train dataset"},
    )
    valid_file_path: Optional[str] = field(
        default='valid_data.pt',
        metadata={"help": "Path for cached valid dataset"},
    )
    max_len: Optional[int] = field(
        default=512,
        metadata={"help": "Max input length for the source text"},
    )
    target_max_len: Optional[int] = field(
        default=32,
        metadata={"help": "Max input length for the target text"},
    )


def main():
    # See all possible arguments in src/transformers/training_args.py
    # or by passing the --help flag to this script.
    # We now keep distinct sets of args, for a cleaner separation of concerns.

    parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments))

    # we will load the arguments from a json file,
    # make sure you save the arguments at ./args.json
    model_args, data_args, training_args = parser.parse_json_file(json_file=os.path.abspath('args.json'))

    if (
        os.path.exists(training_args.output_dir)
        and os.listdir(training_args.output_dir)
        and training_args.do_train
        and not training_args.overwrite_output_dir
    ):
        raise ValueError(
            f"Output directory ({training_args.output_dir}) already exists and is not empty. Use --overwrite_output_dir to overcome."
        )

    # Setup logging
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO if training_args.local_rank in [-1, 0] else logging.WARN,
    )
    logger.warning(
        "Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s",
        training_args.local_rank,
        training_args.device,
        training_args.n_gpu,
        bool(training_args.local_rank != -1),
        training_args.fp16,
    )
    logger.info("Training/evaluation parameters %s", training_args)

    # Set seed
    set_seed(training_args.seed)

    # Load pretrained model and tokenizer
    #
    # Distributed training:
    # The .from_pretrained methods guarantee that only one local process can concurrently
    # download model & vocab.

    tokenizer = T5Tokenizer.from_pretrained(
        model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path,
        cache_dir=model_args.cache_dir,
    )
    model = T5ForConditionalGeneration.from_pretrained(
        model_args.model_name_or_path,
        cache_dir=model_args.cache_dir,
    )

    # Get datasets
    print('loading data')
    train_dataset  = torch.load(data_args.train_file_path)
    valid_dataset = torch.load(data_args.valid_file_path)
    print('loading done')

    # Initialize our Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=valid_dataset,
        data_collator=T2TDataCollator(),
        prediction_loss_only=True
    )

    # Training
    if training_args.do_train:
        trainer.train(
            model_path=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
        )
        trainer.save_model()
        # For convenience, we also re-save the tokenizer to the same directory,
        # so that you can share your model easily on huggingface.co/models =)
        if trainer.is_world_master():
            tokenizer.save_pretrained(training_args.output_dir)

    # Evaluation
    results = {}
    if training_args.do_eval and training_args.local_rank in [-1, 0]:
        logger.info("*** Evaluate ***")

        eval_output = trainer.evaluate()

        output_eval_file = os.path.join(training_args.output_dir, "eval_results.txt")
        with open(output_eval_file, "w") as writer:
            logger.info("***** Eval results *****")
            for key in sorted(eval_output.keys()):
                logger.info("  %s = %s", key, str(eval_output[key]))
                writer.write("%s = %s\n" % (key, str(eval_output[key])))
    
        results.update(eval_output)
    
    return results


def _mp_fn(index):
    # For xla_spawn (TPUs)
    main()
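
The collator above swaps padding ids in the labels for -100. The plain-Python sketch below shows the effect on a toy batch, assuming (as for T5) that the pad token id is 0 and that the loss skips positions labelled -100, which is the default `ignore_index` of PyTorch's cross-entropy loss. The helper names and the toy numbers are made up for illustration.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss
PAD_ID = 0           # assumed pad token id (T5 uses 0)


def mask_pad_labels(target_ids):
    """Replace pad ids with IGNORE_INDEX, row by row."""
    return [[IGNORE_INDEX if t == PAD_ID else t for t in row] for row in target_ids]


def mean_loss(per_token_loss, labels):
    """Average per-token losses over the positions that are not ignored."""
    kept = [
        loss
        for loss_row, label_row in zip(per_token_loss, labels)
        for loss, label in zip(loss_row, label_row)
        if label != IGNORE_INDEX
    ]
    return sum(kept) / len(kept)


target_ids = [[37, 1, 0, 0], [8, 42, 7, 1]]               # toy ids, 0 is padding
labels = mask_pad_labels(target_ids)
toy_loss = [[1.0, 1.0, 9.0, 9.0], [1.0, 1.0, 1.0, 1.0]]   # made-up per-token losses
print(labels)
print(mean_loss(toy_loss, labels))
```

Without the masking, the large made-up losses on the padded positions would dominate the average, which is the behaviour the collator is avoiding.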

## Train

In [None]:
import json

Let's write the arguments in a dict and store them in a JSON file. The above code will load this file and parse the arguments.

In [None]:
args_dict = {
  "num_cores": 8,
  'training_script': 'train_t5_squad.py',
  "model_name_or_path": 't5-base',
  "max_len": 512 ,
  "target_max_len": 16,
  "output_dir": './models/tpu',
  "overwrite_output_dir": True,
  "per_gpu_train_batch_size": 8,
  "per_gpu_eval_batch_size": 8,
  "gradient_accumulation_steps": 4,
  "learning_rate": 1e-4,
  "tpu_num_cores": 8,
  "num_train_epochs": 4,
  "do_train": True
}

In [None]:
with open('args.json', 'w') as f:
  json.dump(args_dict, f)
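
It can help to work out what these settings imply for the number of steps. The sketch below is back-of-the-envelope arithmetic, assuming the 87599 training examples are split evenly across the 8 cores and that one optimizer update happens every `gradient_accumulation_steps` batches. The exact rounding is an assumption about the trainer rather than something stated in this notebook.

```python
import math

num_examples = 87599      # len(train_dataset) from earlier
num_cores = 8             # tpu_num_cores
per_core_batch_size = 8   # per_gpu_train_batch_size
grad_accum_steps = 4      # gradient_accumulation_steps
num_epochs = 4            # num_train_epochs

# Batches each core iterates over in one epoch.
batches_per_core = math.ceil(num_examples / (num_cores * per_core_batch_size))

# Optimizer updates over the whole run (assumed rounding: floor per epoch).
total_updates = batches_per_core // grad_accum_steps * num_epochs

# Examples contributing to one optimizer update across all cores.
examples_per_update = num_cores * per_core_batch_size * grad_accum_steps

print(batches_per_core, total_updates, examples_per_update)
```

Under these assumptions the run has 1369 batches per core per epoch and 1368 updates in total, which agrees with the `Total optimization steps = 1368` line in the training log.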

Start training!

In [None]:
import torch_xla.distributed.xla_multiprocessing as xmp

In [None]:
xmp.spawn(_mp_fn, args=(), nprocs=8, start_method='fork')

05/16/2020 09:42:27 - INFO - transformers.training_args -   PyTorch: setting up devices
05/16/2020 09:42:27 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir='./models/tpu', overwrite_output_dir=True, do_train=True, do_eval=False, do_predict=False, evaluate_during_training=False, per_gpu_train_batch_size=8, per_gpu_eval_batch_size=8, gradient_accumulation_steps=4, learning_rate=0.0001, weight_decay=0.0, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=4, max_steps=-1, warmup_steps=0, logging_dir=None, logging_first_step=False, logging_steps=500, save_steps=500, save_total_limit=None, no_cuda=False, seed=42, fp16=False, fp16_opt_level='O1', local_rank=-1, tpu_num_cores=8, tpu_metrics_debug=False)
05/16/2020 09:42:27 - INFO - transformers.tokenization_utils -   loading file https://s3.amazonaws.com/models.huggingface.co/bert/t5-spiece.model from cache at /root/.cache/torch/transformers/68f1b8dbca4350743bb54b8c4169fd38cbabaad564f85a9239337a8d0342af9f

loading data


05/16/2020 09:42:34 - INFO - nlp.utils.file_utils -   PyTorch version 1.6.0a0+bf2bbd9 available.
05/16/2020 09:42:34 - INFO - nlp.utils.file_utils -   TensorFlow version 2.2.0 available.
05/16/2020 09:42:37 - INFO - transformers.modeling_utils -   Weights of T5ForConditionalGeneration not initialized from pretrained model: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight', 'lm_head.weight']


loading data


05/16/2020 09:42:37 - INFO - nlp.utils.file_utils -   PyTorch version 1.6.0a0+bf2bbd9 available.
05/16/2020 09:42:37 - INFO - nlp.utils.file_utils -   TensorFlow version 2.2.0 available.
05/16/2020 09:42:37 - INFO - transformers.modeling_utils -   Weights of T5ForConditionalGeneration not initialized from pretrained model: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight', 'lm_head.weight']


loading data


05/16/2020 09:42:37 - INFO - nlp.utils.file_utils -   PyTorch version 1.6.0a0+bf2bbd9 available.
05/16/2020 09:42:37 - INFO - nlp.utils.file_utils -   TensorFlow version 2.2.0 available.
05/16/2020 09:42:37 - INFO - transformers.modeling_utils -   Weights of T5ForConditionalGeneration not initialized from pretrained model: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight', 'lm_head.weight']


loading data


05/16/2020 09:42:37 - INFO - nlp.utils.file_utils -   PyTorch version 1.6.0a0+bf2bbd9 available.
05/16/2020 09:42:37 - INFO - nlp.utils.file_utils -   TensorFlow version 2.2.0 available.
05/16/2020 09:42:37 - INFO - transformers.modeling_utils -   Weights of T5ForConditionalGeneration not initialized from pretrained model: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight', 'lm_head.weight']


loading data


05/16/2020 09:42:37 - INFO - nlp.utils.file_utils -   PyTorch version 1.6.0a0+bf2bbd9 available.
05/16/2020 09:42:37 - INFO - nlp.utils.file_utils -   TensorFlow version 2.2.0 available.
05/16/2020 09:42:37 - INFO - transformers.modeling_utils -   Weights of T5ForConditionalGeneration not initialized from pretrained model: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight', 'lm_head.weight']


loading data


05/16/2020 09:42:38 - INFO - nlp.utils.file_utils -   PyTorch version 1.6.0a0+bf2bbd9 available.
05/16/2020 09:42:38 - INFO - nlp.utils.file_utils -   TensorFlow version 2.2.0 available.


loading done


05/16/2020 09:42:38 - INFO - transformers.trainer -   You are instantiating a Trainer but W&B is not installed. To use wandb logging, run `pip install wandb; wandb login` see https://docs.wandb.com/huggingface.
05/16/2020 09:42:38 - INFO - transformers.modeling_utils -   Weights of T5ForConditionalGeneration not initialized from pretrained model: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight', 'lm_head.weight']


loading data


05/16/2020 09:42:38 - INFO - nlp.utils.file_utils -   PyTorch version 1.6.0a0+bf2bbd9 available.
05/16/2020 09:42:38 - INFO - nlp.utils.file_utils -   TensorFlow version 2.2.0 available.
05/16/2020 09:42:38 - INFO - transformers.modeling_utils -   Weights of T5ForConditionalGeneration not initialized from pretrained model: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight', 'lm_head.weight']


loading data


05/16/2020 09:42:38 - INFO - nlp.utils.file_utils -   PyTorch version 1.6.0a0+bf2bbd9 available.
05/16/2020 09:42:38 - INFO - nlp.utils.file_utils -   TensorFlow version 2.2.0 available.


loading done


05/16/2020 09:42:40 - INFO - transformers.trainer -   You are instantiating a Trainer but W&B is not installed. To use wandb logging, run `pip install wandb; wandb login` see https://docs.wandb.com/huggingface.


loading done


05/16/2020 09:42:40 - INFO - transformers.trainer -   You are instantiating a Trainer but W&B is not installed. To use wandb logging, run `pip install wandb; wandb login` see https://docs.wandb.com/huggingface.


loading done


05/16/2020 09:42:40 - INFO - transformers.trainer -   You are instantiating a Trainer but W&B is not installed. To use wandb logging, run `pip install wandb; wandb login` see https://docs.wandb.com/huggingface.


loading done


05/16/2020 09:42:41 - INFO - transformers.trainer -   You are instantiating a Trainer but W&B is not installed. To use wandb logging, run `pip install wandb; wandb login` see https://docs.wandb.com/huggingface.


loading done


05/16/2020 09:42:41 - INFO - transformers.trainer -   You are instantiating a Trainer but W&B is not installed. To use wandb logging, run `pip install wandb; wandb login` see https://docs.wandb.com/huggingface.


loading done


05/16/2020 09:42:41 - INFO - transformers.trainer -   You are instantiating a Trainer but W&B is not installed. To use wandb logging, run `pip install wandb; wandb login` see https://docs.wandb.com/huggingface.


loading done


05/16/2020 09:42:41 - INFO - transformers.trainer -   You are instantiating a Trainer but W&B is not installed. To use wandb logging, run `pip install wandb; wandb login` see https://docs.wandb.com/huggingface.
05/16/2020 09:43:20 - INFO - transformers.trainer -   ***** Running training *****
05/16/2020 09:43:20 - INFO - transformers.trainer -     Num examples = 87599
05/16/2020 09:43:20 - INFO - transformers.trainer -     Num Epochs = 4
05/16/2020 09:43:20 - INFO - transformers.trainer -     Instantaneous batch size per device = 8
05/16/2020 09:43:20 - INFO - transformers.trainer -     Total train batch size (w. parallel, distributed & accumulation) = 64
05/16/2020 09:43:20 - INFO - transformers.trainer -     Gradient Accumulation steps = 4
05/16/2020 09:43:20 - INFO - transformers.trainer -     Total optimization steps = 1368



05/16/2020 09:43:33 - INFO - transformers.trainer -   ***** Running training *****
05/16/2020 09:43:33 - INFO - transformers.trainer -     Num examples = 87599
05/16/2020 09:43:33 - INFO - transformers.trainer -     Num Epochs = 4
05/16/2020 09:43:33 - INFO - transformers.trainer -     Instantaneous batch size per device = 8
05/16/2020 09:43:33 - INFO - transformers.trainer -     Total train batch size (w. parallel, distributed & accumulation) = 64
05/16/2020 09:43:33 - INFO - transformers.trainer -     Gradient Accumulation steps = 4
05/16/2020 09:43:33 - INFO - transformers.trainer -     Total optimization steps = 1368
05/16/2020 09:43:36 - INFO - transformers.trainer -   ***** Running training *****
05/16/2020 09:43:36 - INFO - transformers.trainer -     Num examples = 87599
05/16/2020 09:43:36 - INFO - transformers.trainer -     Num Epochs = 4
05/16/2020 09:43:36 - INFO - transformers.trainer -     Instantaneous batch size per device = 8
05/16/2020 09:43:36 - INFO - transformers.tr





05/16/2020 10:03:44 - INFO - transformers.trainer -   

Training completed. Do not forget to share your model on huggingface.co/models =)


05/16/2020 10:03:44 - INFO - transformers.trainer -   

Training completed. Do not forget to share your model on huggingface.co/models =)


05/16/2020 10:03:44 - INFO - transformers.trainer -   

Training completed. Do not forget to share your model on huggingface.co/models =)


05/16/2020 10:03:44 - INFO - transformers.trainer -   

Training completed. Do not forget to share your model on huggingface.co/models =)







05/16/2020 10:03:44 - INFO - transformers.trainer -   

Training completed. Do not forget to share your model on huggingface.co/models =)









05/16/2020 10:03:44 - INFO - transformers.trainer -   

Training completed. Do not forget to share your model on huggingface.co/models =)


05/16/2020 10:03:44 - INFO - transformers.trainer -   

Training completed. Do not forget to share your model on huggingface.co/models =)


05/16/2020 10:03:44 - INFO - transformers.trainer -   Saving model checkpoint to ./models/tpu
05/16/2020 10:03:45 - INFO - transformers.trainer -   Saving model checkpoint to ./models/tpu
05/16/2020 10:03:45 - INFO - transformers.trainer -   Saving model checkpoint to ./models/tpu
05/16/2020 10:03:45 - INFO - transformers.trainer -   Saving model checkpoint to ./models/tpu
05/16/2020 10:03:44 - INFO - transformers.trainer -   

Training completed. Do not forget to share your model on huggingface.co/models =)


05/16/2020 10:03:45 - INFO - transformers.trainer -   Saving model checkpoint to ./models/tpu
05/16/2020 10:03:45 - INFO - transformers.trainer -   Saving model checkpoint to ./models/tpu
05/16/2020 10:03

## Eval

There are two gotchas here. First, the metrics functionality in the nlp package is still a work in progress, so we will use the official SQuAD evaluation script. Second, for some reason which I couldn't figure out, the `.generate` method is not working on TPU, so we will need to do prediction on CPU. Predicting on the validation set takes almost 40 minutes.

In [None]:
## SQuAD evaluation script. Modified slightly for this notebook

from __future__ import print_function
from collections import Counter
import string
import re
import argparse
import json
import sys


def normalize_answer(s):
    """Lower text and remove punctuation, articles and extra whitespace."""
    def remove_articles(text):
        return re.sub(r'\b(a|an|the)\b', ' ', text)

    def white_space_fix(text):
        return ' '.join(text.split())

    def remove_punc(text):
        exclude = set(string.punctuation)
        return ''.join(ch for ch in text if ch not in exclude)

    def lower(text):
        return text.lower()

    return white_space_fix(remove_articles(remove_punc(lower(s))))


def f1_score(prediction, ground_truth):
    prediction_tokens = normalize_answer(prediction).split()
    ground_truth_tokens = normalize_answer(ground_truth).split()
    common = Counter(prediction_tokens) & Counter(ground_truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0
    precision = 1.0 * num_same / len(prediction_tokens)
    recall = 1.0 * num_same / len(ground_truth_tokens)
    f1 = (2 * precision * recall) / (precision + recall)
    return f1


def exact_match_score(prediction, ground_truth):
    return (normalize_answer(prediction) == normalize_answer(ground_truth))


def metric_max_over_ground_truths(metric_fn, prediction, ground_truths):
    scores_for_ground_truths = []
    for ground_truth in ground_truths:
        score = metric_fn(prediction, ground_truth)
        scores_for_ground_truths.append(score)
    return max(scores_for_ground_truths)


def evaluate(gold_answers, predictions):
    f1 = exact_match = total = 0

    for ground_truths, prediction in zip(gold_answers, predictions):
      total += 1
      exact_match += metric_max_over_ground_truths(
                    exact_match_score, prediction, ground_truths)
      f1 += metric_max_over_ground_truths(
          f1_score, prediction, ground_truths)
    
    exact_match = 100.0 * exact_match / total
    f1 = 100.0 * f1 / total

    return {'exact_match': exact_match, 'f1': f1}
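
To see what these metrics reward, here is a small self-contained sketch that mirrors the normalization and token-level F1 logic on a couple of hypothetical prediction/reference pairs. `normalize` and `token_f1` are compact re-statements written for this example, not replacements for the functions in the script.

```python
from collections import Counter
import re
import string


def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = ''.join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r'\b(a|an|the)\b', ' ', text)
    return ' '.join(text.split())


def token_f1(prediction, ground_truth):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(ground_truth).split()
    num_same = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical pairs: the article and punctuation do not count against a match.
print(normalize('The Denver Broncos!'))
print(token_f1('the Denver Broncos', 'Denver Broncos'))
# A partial answer gets partial credit from F1 but would fail exact match.
print(token_f1('Denver', 'Denver Broncos'))
```

The partial answer has precision 1 and recall 0.5, so its F1 is 2/3, while exact match would score it 0. This is why the F1 number is usually higher than the exact-match number.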

In [None]:
import torch
import torch_xla
import torch_xla.core.xla_model as xm

import nlp
from transformers import T5ForConditionalGeneration, T5Tokenizer

from tqdm.auto import tqdm

In [None]:
model = T5ForConditionalGeneration.from_pretrained('models/tpu').to('cpu') # because it's loaded on xla by default
tokenizer = T5Tokenizer.from_pretrained('models/tpu')

In [None]:
valid_dataset = torch.load('valid_data.pt')
dataloader = torch.utils.data.DataLoader(valid_dataset, batch_size=32)

In [None]:
answers = []
for batch in tqdm(dataloader):
  outs = model.generate(input_ids=batch['input_ids'], 
                        attention_mask=batch['attention_mask'],
                        max_length=16,
                        early_stopping=True)
  outs = [tokenizer.decode(ids) for ids in outs]
  answers.extend(outs)





In [None]:
predictions = []
references = []
for ref, pred in zip(valid_dataset, answers):
  predictions.append(pred)
  references.append(ref['answers']['text'])

In [None]:
predictions[0], references[0]

('Denver Broncos', ['Denver Broncos', 'Denver Broncos', 'Denver Broncos'])

In [None]:
evaluate(references, predictions)

{'exact_match': 81.56102175969725, 'f1': 89.96016967193422}