# Retraining RoBERTa for MLM

Retraining roberta-base with the masked language modeling (MLM) objective, following the RoBERTa pre-training procedure.

In [1]:
# It is often desirable to retrain the LM so that it better captures the language characteristics of a downstream task.

# A recently published work, BERTweet (Nguyen et al., 2020), provides a BERT model pre-trained
# (using the RoBERTa procedure) on a vast corpus of English tweets.
# The authors argue that BERTweet better models the characteristics of the language used on Twitter,
# outperforming previous SOTA models on Tweet NLP tasks.

# That is, performance on downstream tasks can be greatly influenced by what our LM captures!

## Create Virtual Environment

https://janakiev.com/blog/jupyter-virtual-envs/

In [2]:
# ! pip install --user virtualenv

In [3]:
# ! python -m venv venv_robarta

In [4]:
# ! source venv_robarta/bin/activate

In [5]:
# ! pip install --user ipykernel
# ! jupyter kernelspec uninstall myenv

In [6]:
# ! python -m ipykernel install --user --name=venv_robarta

## 0. Get Data

Get the hate speech detection dataset (Basile et al., 2019), made available through TweetEval (Barbieri et al., 2020).

In [8]:
# !git clone https://github.com/cardiffnlp/tweeteval /tmp/tweeteval

## 1. Include required libraries

In [10]:
# ! pip install torch==1.4.0 torchvision==0.5.0
# ! pip install transformers==3.5.1
from transformers import RobertaTokenizer, RobertaForMaskedLM
from transformers import LineByLineTextDataset
from transformers import DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments

In [11]:
# import transformers
# transformers.__version__

In [12]:
# import torch
# torch.__version__

In [13]:
# ! pip -V

## 2. Prepare Data

### 2.1 Create tokenizer and model object

In [14]:
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaForMaskedLM.from_pretrained('roberta-base')

Some weights of RobertaForMaskedLM were not initialized from the model checkpoint at roberta-base and are newly initialized: ['lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### 2.2 LineByLineTextDataset class

Since our data is already stored in a single text file with one tweet per line, we can use the LineByLineTextDataset class, which treats each line as one training example.

In [15]:
# The block_size argument sets the maximum number of tokens per example; longer lines are truncated.
# "roberta-base" supports sequences of up to 512 tokens, including special tokens such as <s> (start of sequence) and </s> (end of sequence).

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="/tmp/tweeteval/datasets/hate/train_text.txt",
    block_size=512,
)
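To make the role of block_size concrete, here is a minimal sketch of the line-by-line idea: blank lines are skipped, every remaining line becomes one example, and each example is truncated so that it fits in block_size tokens including the special tokens. The `toy_encode` and `line_by_line_examples` helpers and the whitespace "tokenizer" are hypothetical stand-ins for illustration only; the real class uses the RobertaTokenizer passed to it.

```python
import io

BOS_ID, EOS_ID = 0, 2  # ids of <s> and </s> in the roberta-base vocabulary


def toy_encode(line, vocab, block_size):
    """Whitespace 'tokenizer' standing in for RobertaTokenizer (illustration only)."""
    # Assign each new word the next free id, starting at 3 to avoid the special ids.
    ids = [vocab.setdefault(word, len(vocab) + 3) for word in line.split()]
    # Truncate so that the two special tokens still fit inside block_size.
    ids = ids[: block_size - 2]
    return [BOS_ID] + ids + [EOS_ID]


def line_by_line_examples(text_file, block_size):
    """Turn every non-blank line of the file into one training example."""
    vocab = {}
    lines = [ln for ln in text_file.read().splitlines() if ln and not ln.isspace()]
    return [toy_encode(ln, vocab, block_size) for ln in lines]


# An in-memory file with two tweets separated by blank lines.
sample = io.StringIO(
    "first tweet here\n\n   \na much longer second tweet that will be truncated\n"
)
examples = line_by_line_examples(sample, block_size=6)
for ex in examples:
    print(len(ex), ex)
```

With block_size=512 and short tweets, truncation should rarely trigger; the small block_size above is chosen only to make the effect visible.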



### 2.3 Data collator

The data collator assembles examples into batches that the LM can be trained on. For example, it pads all examples in a batch to the same length and, with mlm=True, randomly selects a fraction of the tokens (mlm_probability=0.15) for the model to predict.

In [16]:
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
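The masking step can be sketched in plain Python. This is a simplified, per-token illustration of the BERT-style scheme (of the selected tokens, about 80% become `<mask>`, 10% become a random token, and 10% stay unchanged), not the library's tensor-based implementation. The `mask_tokens` helper is hypothetical; the constants are the roberta-base mask id and vocabulary size.

```python
import random

MASK_ID = 50264      # id of <mask> in the roberta-base vocabulary
VOCAB_SIZE = 50265   # size of the roberta-base vocabulary
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default


def mask_tokens(input_ids, special_ids, mlm_probability=0.15, rng=random):
    """Return (masked_inputs, labels) for one tokenized example."""
    inputs = list(input_ids)
    labels = [IGNORE_INDEX] * len(inputs)
    for i, tok in enumerate(input_ids):
        # Special tokens are never selected; other tokens are selected with p = 0.15.
        if tok in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = tok  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK_ID                    # 80%: replace with <mask>
        elif roll < 0.9:
            inputs[i] = rng.randrange(VOCAB_SIZE)  # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return inputs, labels


# Toy example: <s> (id 0), forty made-up token ids, </s> (id 2).
original = [0] + list(range(100, 140)) + [2]
masked, labels = mask_tokens(original, special_ids={0, 2}, rng=random.Random(0))
print(sum(label != IGNORE_INDEX for label in labels), "tokens selected for prediction")
```

Positions whose label is -100 contribute nothing to the loss, so the model is trained only on the selected tokens.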

## 3. Training Model

### 3.1 Training object

The TrainingArguments object holds the settings that define the training process. The Trainer then brings together all of the objects we have created so far and runs the training.

seed=1: seeds the RNG for the Trainer so that the results can be replicated when needed.

In [17]:
training_args = TrainingArguments(
    output_dir="./roberta-retrained",
    overwrite_output_dir=True,
    num_train_epochs=25,
    per_device_train_batch_size=48,
    save_steps=500,
    save_total_limit=2,
    seed=1
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset
)
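It can help to estimate how long a run will be before starting it. The sketch below computes the number of optimizer steps and checkpoint saves implied by these arguments, assuming a single device, no gradient accumulation, and that the last partial batch is kept. The `training_schedule` helper is hypothetical, and the example count of 9000 is an assumed placeholder; replace it with the number of non-blank lines in your training file.

```python
import math


def training_schedule(num_examples, batch_size, num_epochs, save_steps):
    """Return (steps_per_epoch, total_steps, checkpoint_saves) for a simple run."""
    # One optimizer step per batch; the last, smaller batch still counts as a step.
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * num_epochs
    # A checkpoint is written every save_steps optimizer steps.
    checkpoint_saves = total_steps // save_steps
    return steps_per_epoch, total_steps, checkpoint_saves


# 9000 training lines is an assumption made for illustration.
schedule = training_schedule(num_examples=9000, batch_size=48, num_epochs=25, save_steps=500)
print(schedule)  # (188, 4700, 9)
```

Because save_total_limit=2, only the two most recent checkpoints would remain in output_dir at the end of such a run.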

### 3.2 Run training

trainer.save_model(output_dir) saves the model to output_dir so that we can load it later using from_pretrained.

In [18]:
# import torch
# torch.__version__

In [19]:
trainer.train()

trainer.save_model("./roberta-retrained")

| Step | Training Loss |
|------|---------------|
| 500  | 2.32702       |
| 1000 | 0.883423      |
| 1500 | 0.810989      |
| 2000 | 0.761219      |
| 2500 | 0.722335      |
| 3000 | 0.684483      |
| 3500 | 0.656304      |
| 4000 | 0.634578      |
| 4500 | 0.620676      |
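Assuming the logged training loss is the mean cross-entropy (in nats) over the masked tokens, its exponential gives a perplexity-like figure that is sometimes easier to interpret. The sketch below converts a few of the logged values; the `loss_to_perplexity` helper is hypothetical. These numbers are computed on the training data, so they describe how well the model fits the corpus, not how well it generalizes.

```python
import math

# A few (step, loss) pairs copied from the training log.
logged_losses = {500: 2.32702, 1000: 0.883423, 4500: 0.620676}


def loss_to_perplexity(loss):
    """Convert a mean cross-entropy loss in nats into a perplexity."""
    return math.exp(loss)


for step, loss in logged_losses.items():
    print(f"step {step}: loss {loss:.3f} -> perplexity {loss_to_perplexity(loss):.2f}")
```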


Source: https://towardsdatascience.com/transformers-retraining-roberta-base-using-the-roberta-mlm-procedure-7422160d5764