<a href="https://colab.research.google.com/github/jspisak/workshops/blob/master/HuggingFace_on_PyTorch_XLA_TPUs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# How to train PyTorch HuggingFace Transformers on TPUs

Over the past several months, the HuggingFace and Google [`pytorch/xla`](https://github.com/pytorch/xla) teams have been collaborating to bring first-class support for training HuggingFace transformers on TPUs, with significant speedups.

In this Colab, we walk you through fine-tuning [RoBERTa](https://arxiv.org/abs/1907.11692) with masked language modeling (MLM) on the [WikiText-2 dataset](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/) using the free TPUs provided by Colab.

Last Updated: September 25, 2020

### Install and clone dependencies

In [1]:
!pip install transformers==3.2.0 \
  torch==1.6.0 \
  cloud-tpu-client==0.10 \
  https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.6-cp36-cp36m-linux_x86_64.whl
!git clone -b v3.2.0 https://github.com/huggingface/transformers

Collecting transformers==3.2.0
  Downloading https://files.pythonhosted.org/packages/d8/f4/9f93f06dd2c57c7cd7aa515ffbf9fcfd8a084b92285732289f4a5696dd91/transformers-3.2.0-py3-none-any.whl (1.0MB)
     |████████████████████████████████| 1.0MB 3.4MB/s 
Collecting cloud-tpu-client==0.10
  Downloading https://files.pythonhosted.org/packages/56/9f/7b1958c2886db06feb5de5b2c191096f9e619914b6c31fdf93999fdbbd8b/cloud_tpu_client-0.10-py3-none-any.whl
Collecting torch-xla==1.6
  Downloading https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.6-cp36-cp36m-linux_x86_64.whl (133.2MB)
     |████████████████████████████████| 133.2MB 30kB/s 
Collecting sentencepiece!=0.1.92
  Downloading https://files.pythonhosted.org/packages/d4/a4/d0a884c4300004a78cca907a6ff9a5e9fe4f090f5d95ab341c53d28cbc58/sentencepiece-0.1.91-cp36-cp36m-manylinux1_x86_64.whl (1.1MB)
     |████████████████████████████████| 1.1MB 13.6MB/s 
Collecting sacremoses
  Downloading https://files.py

### Download the WikiText-2 dataset

In [2]:
!wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-raw-v1.zip
!unzip wikitext-2-raw-v1.zip

--2020-10-08 23:24:36--  https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-raw-v1.zip
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.205.5
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.205.5|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4721645 (4.5M) [application/zip]
Saving to: ‘wikitext-2-raw-v1.zip’


2020-10-08 23:24:36 (19.0 MB/s) - ‘wikitext-2-raw-v1.zip’ saved [4721645/4721645]

Archive:  wikitext-2-raw-v1.zip
   creating: wikitext-2-raw/
  inflating: wikitext-2-raw/wiki.test.raw  
  inflating: wikitext-2-raw/wiki.valid.raw  
  inflating: wikitext-2-raw/wiki.train.raw  
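
Before starting a long training run, it can be worth confirming that the archive extracted the files listed above. The following is a minimal sketch; the helper name `missing_splits` is hypothetical, and the default directory simply mirrors the `wikitext-2-raw/` folder shown in the unzip output.

```python
from pathlib import Path
from typing import List

# File names taken from the unzip output above.
EXPECTED_FILES = ("wiki.train.raw", "wiki.valid.raw", "wiki.test.raw")


def missing_splits(data_dir: str = "wikitext-2-raw") -> List[str]:
    """Return the expected WikiText-2 files that are not present in data_dir."""
    root = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

In the notebook, `missing_splits()` returning an empty list suggests the download and extraction step completed.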


### Train the model

All TPU training functionality has been built into [`trainer.py`](https://github.com/huggingface/transformers/blob/master/src/transformers/trainer.py) and so we'll use the [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_language_modeling.py) script under `examples/language-modeling` to finetune our RoBERTa model on the WikiText-2 dataset.

Note that in the following command we use [`xla_spawn.py`](https://github.com/huggingface/transformers/blob/master/examples/xla_spawn.py) to spawn 8 processes, one for each of the 8 cores on a single v2-8/v3-8 TPU (TPU pods can scale all the way up to 2048 cores). All `xla_spawn.py` does is call [`xmp.spawn`](https://github.com/pytorch/xla/blob/master/torch_xla/distributed/xla_multiprocessing.py#L350), which sets up the required environment metadata and calls `torch.multiprocessing.start_processes`.

The command below spawns 8 processes, each driving one TPU core. We've set `per_device_train_batch_size=4` and `per_device_eval_batch_size=4`, so the global batch size will be `32` (`4 examples/device * 8 devices/Colab TPU = 32 examples/Colab TPU`). You can also append the `--tpu_metrics_debug` flag for additional debug metrics (e.g. how long it took to compile, execute one step, etc.).

The following cell should take around 5 to 10 minutes to run.
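
The batch size arithmetic above can be written as a tiny helper, which is handy when experimenting with other per-device values or core counts. The function name `global_batch_size` is illustrative and not part of any library.

```python
def global_batch_size(per_device_batch_size: int, num_devices: int) -> int:
    """Examples processed per optimizer step across all devices (no gradient accumulation)."""
    if per_device_batch_size <= 0 or num_devices <= 0:
        raise ValueError("batch size and device count must be positive")
    return per_device_batch_size * num_devices


# The configuration used in this Colab: 4 examples per core on 8 cores.
print(global_batch_size(4, 8))  # 32
```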

In [None]:
!python transformers/examples/xla_spawn.py \
    --num_cores=8 \
    transformers/examples/language-modeling/run_language_modeling.py \
    --output_dir=./output \
    --model_type=roberta \
    --model_name_or_path=roberta-base \
    --train_data_file=./wikitext-2-raw/wiki.train.raw \
    --num_train_epochs=5 \
    --logging_steps=50 \
    --save_steps=740 \
    --do_train \
    --do_eval \
    --eval_data_file=./wikitext-2-raw/wiki.valid.raw \
    --mlm \
    --per_device_train_batch_size=4 \
    --per_device_eval_batch_size=4 \
    --overwrite_output_dir \
    --logging_dir=./tensorboard/ 

10/08/2020 23:25:46 - INFO - run_language_modeling -   Training/evaluation parameters TrainingArguments(output_dir='./output', overwrite_output_dir=True, do_train=True, do_eval=True, do_predict=False, evaluate_during_training=False, evaluation_strategy=<EvaluationStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=4, per_device_eval_batch_size=4, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=5.0, max_steps=-1, warmup_steps=0, logging_dir='./tensorboard/', logging_first_step=False, logging_steps=50, save_steps=740, save_total_limit=None, no_cuda=False, seed=42, fp16=False, fp16_opt_level='O1', local_rank=-1, tpu_num_cores=8, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=50, past_index=-1, run_name=None, disable_tqdm=False, remove_unused_columns=True, label_name

### Visualize TensorBoard metrics

In [None]:
%load_ext tensorboard
%tensorboard --logdir tensorboard

## 🎉🎉🎉 **Done Training!** 🎉🎉🎉


## Run inference on finetuned model

In [None]:
import torch_xla.core.xla_model as xm
from transformers import AutoModelForMaskedLM, AutoTokenizer, FillMaskPipeline

# Load the fine-tuned checkpoint written by the training run and move it to the TPU.
tpu_device = xm.xla_device()
model = AutoModelForMaskedLM.from_pretrained('./output').to(tpu_device)
tokenizer = AutoTokenizer.from_pretrained('./output')

# Build the fill-mask pipeline, then point it at the TPU so inputs land on the same device as the model.
fill_mask = FillMaskPipeline(model, tokenizer)
fill_mask.device = tpu_device

In [None]:
fill_mask('TPUs are much faster than <mask>!')
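
The raw pipeline result can be hard to scan. Below is a small sketch for printing the best candidates, assuming the call above returns a list of dicts that each carry a `score` and a `sequence` key (check the output of your installed `transformers` version). The helper name `format_predictions` is hypothetical.

```python
from typing import Dict, List


def format_predictions(predictions: List[Dict], top_k: int = 3) -> List[str]:
    """Return one 'score  sequence' line for each of the top_k highest-scoring candidates."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [f"{p['score']:.3f}  {p['sequence']}" for p in ranked[:top_k]]


# Usage in the notebook (assumes the `fill_mask` pipeline built above):
# for line in format_predictions(fill_mask('TPUs are much faster than <mask>!')):
#     print(line)
```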

And just like that, you've used your TPU fine-tuned model to run predictions on a TPU as well! 🎉