# PT Word Embeddings Training

In this notebook, I'll pre-train the Word Embeddings of BERT Base Cased, while keeping all other parameters frozen.

This approach is the first one suggested by Artetxe, Ruder and Yogatama (2020). The rationale is to test the hypothesis that BERT can learn more abstract concepts, beyond language-specific information. If so, changing only its Word Embeddings (the lexical part) could still yield strong performance on downstream tasks in languages it was not fine-tuned on.


## Parameters

Artetxe, Ruder and Yogatama (2020) pre-train a BERT from scratch on an English corpus and then freeze all parameters other than the Word Embeddings, so they can train those embeddings on a language-specific corpus (Portuguese, in my case).

This work takes advantage of the BERT model already pre-trained in English from Huggingface's transformers library. So, in order to better reproduce the experiment made by Artetxe, Ruder and Yogatama, it is very important that the parameters are relatively aligned.

For this experiment, I'll use the following configuration, in comparison to their work:


| Parameter  |      Artetxe, Ruder and Yogatama (2020)       |  Mine (pt Embeddings) |
|------------|:-------------:|:------:|
| model architecture   |  BERT Base | BERT Base |
| model implementation | Their own, based on BERT's paper | **Huggingface transformer's library** |
| Corpus   |    Wikipedia dumps (WikiExtractor)   |   Wikipedia dumps (WikiExtractor) |
| data preprocessing   | no normalization, no lowercase |    no normalization, no lowercase |
| vocabulary   | disjoint, training on language-specific corpus |    disjoint, training on language-specific corpus |
| vocab size | 32k | **28,996** |
| special token ids (compared to English model) | all aligned | all aligned |
| pre-training objective | MLM and NSP | **MLM only** |
| trainable part | Word Embeddings | Word Embeddings |
| optimizer | LAMB | **AdamW, with LR decay** |
| sequence length | 512 | **256** |
| transfer training steps | 250k | **31,250** |

The highlighted differences are explained as follows:

1.   **model implementation**. I use Huggingface's transformers library, which already contains a pre-trained BERT Base Cased, similar to the English model trained by Artetxe, Ruder and Yogatama (2020). This choice changes some parameters, because I want my Word Embeddings training to be aligned with the library's pre-training, just as the authors align their English and language-specific models.

2.   **vocab size**. This parameter is aligned with the vocab size of the pre-trained English model from Huggingface's library.

3. **pre-training objective**. I could not find the objective used in Huggingface's pre-training, but when downloading the model (**bert-base-cased**) and checking its configuration, one sees **BertForMaskedLM**, the class for pre-training BERT with the MLM objective (there are also **BertForNextSentencePrediction** and **BertForPreTraining** classes, but neither appears in the configuration of the pre-trained model). So I assume MLM is the only objective to use in my experiment.

4. **optimizer**. I use AdamW with learning rate decay, the default optimizer of Huggingface's **Trainer** API. The original experiment uses an optimizer called LAMB, proposed by You et al. (2019).

5. **sequence length**. I truncate the sequences to 256 tokens, instead of 512 as in the original experiment. This is done to fit in CUDA memory on Google's Colab platform. A better way to handle this is yet to be investigated.

6. **transfer training steps**. I pre-train the embeddings on a small subset of the entire pt Wiki Corpus (which contains about 6 million sentences). I use only 1M sentences from that corpus with a batch size of 32, which gives 31,250 training steps. This keeps the pre-training phase short (~5h on 1x NVIDIA Tesla P100 in Google Colab).
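The step count in the last item is simple arithmetic. A quick check, assuming a single epoch over the 1M-sentence subset on one GPU with no gradient accumulation (the variable names are illustrative only):

```python
import math

num_sentences = 1_000_000   # subset of the pt Wiki Corpus used in this notebook
batch_size = 32             # batch size on a single GPU
num_epochs = 1              # assumption: one pass over the subset

# Each optimizer step consumes one batch; a final partial batch would still count.
steps_per_epoch = math.ceil(num_sentences / batch_size)
total_steps = steps_per_epoch * num_epochs

print(total_steps)  # 31250
```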

In the future, I aim to run more pre-training on the Portuguese corpus, with parameters more closely aligned with the original experiment.

Another important note is that this notebook is based on the one provided by [Huggingface on how to pre-train a language model from scratch](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb#scrollTo=QDNgPls7_l13). I follow this approach so that other parameters and implementation details are better aligned with theirs, since the **Trainer** API is an attempt to make the training of their models reproducible by others.


> For more information on WikiExtractor tool and how I prepared the data, refer to [the tool's Github](https://github.com/attardi/wikiextractor) and [my Colab Notebook](https://colab.research.google.com/drive/1aQHa7Hp5-HZFwrsBqxUDnQXbqxbgNC0r).





## Preparation

We install libraries, mount drive and copy the preprocessed sentences to local disk.

In [0]:
!pip install transformers --quiet
!pip install pytorch_lightning --quiet

In [2]:
from google.colab import drive

drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
%%time

import os
import shutil

if not os.path.exists('/content/dataset.pkl'):
    shutil.copy('/content/drive/My Drive/PF13/text/preprocessed.pkl',
                '/content/dataset.pkl')

CPU times: user 30 µs, sys: 7 µs, total: 37 µs
Wall time: 59.1 µs


## Tokenizer

I'll first load my pre-trained tokenizer, `bert-base-cased-pt` (see [this notebook for reference](https://colab.research.google.com/drive/1aQHa7Hp5-HZFwrsBqxUDnQXbqxbgNC0r)).

In [0]:
from transformers import BertTokenizer


TOKENIZER_VOCAB = '/content/drive/My Drive/PF13/bert-base-cased-pt-vocab.txt'

tokenizer = BertTokenizer(TOKENIZER_VOCAB,
                          do_lower_case=False,
                          model_max_length=128)
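Since the Portuguese vocabulary must keep its special tokens at the same ids as the English model, a quick sanity check on the vocabulary can be useful. The sketch below assumes the usual `bert-base-cased` positions (`[PAD]`=0, `[UNK]`=100, `[CLS]`=101, `[SEP]`=102, `[MASK]`=103) and runs on a toy in-memory vocabulary; `load_vocab` and `check_special_tokens` are hypothetical helpers, not part of transformers.

```python
# Assumed special-token ids of the English bert-base-cased vocabulary.
EXPECTED_SPECIAL_IDS = {"[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102, "[MASK]": 103}


def load_vocab(lines):
    """Map each token to its line index, as in a BERT vocab.txt file."""
    return {line.rstrip("\n"): idx for idx, line in enumerate(lines)}


def check_special_tokens(vocab, expected=EXPECTED_SPECIAL_IDS):
    """Return the special tokens whose id differs from the expected one."""
    return {tok: vocab.get(tok) for tok, idx in expected.items() if vocab.get(tok) != idx}


# Toy vocabulary laid out like a BERT vocab file: [PAD], 99 unused slots,
# then [UNK], [CLS], [SEP], [MASK], followed by ordinary tokens.
toy_lines = (["[PAD]"]
             + [f"[unused{i}]" for i in range(1, 100)]
             + ["[UNK]", "[CLS]", "[SEP]", "[MASK]"]
             + ["de", "a", "o"])
toy_vocab = load_vocab(toy_lines)

mismatches = check_special_tokens(toy_vocab)
print(mismatches)  # an empty dict means every special token sits at the expected id
```

With the real file, the same helpers could be applied to the lines read from `TOKENIZER_VOCAB` (opened with `encoding="utf-8"`).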

## Dataset

I create my own Dataset (instead of using the transformers `LineByLineTextDataset`) so that sentences are tokenized lazily, only when they are accessed, rather than all at once up front.

In [0]:
import pickle


with open('/content/dataset.pkl', 'rb') as f:
    all_sentences = pickle.load(f)

In [0]:
import torch

from torch.utils.data import Dataset


class WikiPtDataset(Dataset):
    def __init__(self, sentences, tokenizer, max_length=128):
        self.examples = sentences
        self.tokenizer = tokenizer
        self.max_length = max_length
    
    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        encoded = self.tokenizer.batch_encode_plus(
            [self.examples[idx]],
            max_length=self.max_length)
        
        return torch.tensor(encoded['input_ids'][0],
                            dtype=torch.long)

In [7]:
%%time

dataset = WikiPtDataset(all_sentences, tokenizer)

CPU times: user 7 µs, sys: 0 ns, total: 7 µs
Wall time: 10.3 µs


In [8]:
import random

# randrange excludes the upper bound, so the index is always valid
random_sample = random.randrange(len(dataset))

print('Dataset size=', len(dataset))
print('Raw sample=', len(dataset.examples[random_sample]), dataset.examples[random_sample])
print('Encoded sample=', dataset[random_sample][:10])
print('Decoded sample=', tokenizer.decode(dataset[random_sample]))


Dataset size= 6180082
Raw sample= 387 Ao final da investigação, a FIFA decretou a vitória da seleção brasileira por 2 x 0, baniu Roberto Rojas, o técnico Orlando Aravena, o médico da seleção chilena Daniel Rodríguez e o dirigente Sergio Stoppel e suspendeu por 4 anos o capitão Fernando Astengo e a Federação Chilena de Futebol, o que acarretou na impossibilidade desta disputar as eliminatórias para a Copa do Mundo de 1994.
Encoded sample= tensor([  101,  3493,  2552,  1926,  7893,   117,   170,  7733, 27051,   170])
Decoded sample= [CLS] Ao final da investigação, a FIFA decretou a vitória da seleção brasileira por 2 x 0, baniu Roberto Rojas, o técnico Orlando Aravena, o médico da seleção chilena Daniel Rodríguez e o dirigente Sergio Stoppel e suspendeu por 4 anos o capitão Fernando Astengo e a Federação Chilena de Futebol, o que acarretou na impossibilidade desta disputar as eliminatórias para a Copa do Mundo de 1994. [SEP]


## Training

### Training Environment

For training, I'll use the GPU provided by Google Colab.

In [9]:
!nvidia-smi

Wed May 27 14:18:13 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.82       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   38C    P0    25W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|  No ru

I also make sure that the GPU is available to PyTorch.

In [10]:
torch.cuda.is_available()

True

### Loading the BERT English Model

As stated in the [Parameters section](#parameters), I'll use Huggingface's pre-trained **bert-base-cased** model.

In [0]:
from transformers import BertForMaskedLM


bert_base_cased = BertForMaskedLM.from_pretrained('bert-base-cased')

### Freezing Parameters

As in the original experiment, I'll freeze all parameters except the Word Embeddings. This enables the pre-training of the lexical part of BERT on the Portuguese corpus.

In [0]:
for param in bert_base_cased.parameters():
    param.requires_grad = False

for param in bert_base_cased.get_input_embeddings().parameters():
    param.requires_grad = True

To test that the correct parameters are frozen, I'll count the parameters that require gradients and compare that count with the size of the Word Embeddings matrix (hidden size times vocab size). The two should match.

In [13]:
sum([torch.tensor(x.size()).prod() for x in bert_base_cased.parameters() if x.requires_grad])

tensor(22268928)

In [14]:
print(768*bert_base_cased.config.vocab_size)

22268928
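The same freeze-then-count pattern can be tried on a toy model. The sketch below is not BERT: `TinyLM` is a hypothetical module with an embedding, a body layer and an output layer whose weight is tied to the embedding (an assumption that mirrors how masked-LM heads commonly share the input embedding matrix). It shows that only the embedding matrix is counted as trainable and that only it receives gradients.

```python
import torch
from torch import nn


class TinyLM(nn.Module):
    """Toy stand-in for a masked LM; names and sizes are illustrative."""

    def __init__(self, vocab_size=50, hidden=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.body = nn.Linear(hidden, hidden)
        self.decoder = nn.Linear(hidden, vocab_size)
        # Tie the output projection to the input embeddings (shared Parameter).
        self.decoder.weight = self.embed.weight

    def forward(self, ids):
        return self.decoder(torch.tanh(self.body(self.embed(ids))))


model = TinyLM()

# Freeze everything, then re-enable gradients for the embeddings only.
for p in model.parameters():
    p.requires_grad = False
for p in model.embed.parameters():
    p.requires_grad = True

# parameters() yields the shared matrix once, so the count is vocab_size * hidden.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(trainable)  # 400

# A backward pass only populates gradients for the unfrozen matrix.
model(torch.tensor([[1, 2, 3]])).sum().backward()
print(model.embed.weight.grad is not None, model.body.weight.grad is None)
```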


### Data Preparation

I'll use a small subset of the entire Portuguese corpus. This decision saves time during pre-training. It would be worth evaluating later how much this simplification affects the results.

In [0]:
sentences = all_sentences[:1000000]
train_dataset = WikiPtDataset(sentences, tokenizer, 256)
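To picture what the 256-token limit does to a long sentence, here is a minimal sketch of truncation that keeps the `[CLS]` and `[SEP]` markers. It assumes the special-token ids 101 and 102 and a simple "cut the tail" strategy; `truncate_with_specials` is a hypothetical helper, and the tokenizer's own truncation logic may differ in detail.

```python
CLS_ID, SEP_ID = 101, 102  # assumed ids of [CLS] and [SEP]


def truncate_with_specials(token_ids, max_length):
    """Return [CLS] + tokens + [SEP], with at most max_length ids in total."""
    if max_length < 2:
        raise ValueError("max_length must leave room for [CLS] and [SEP]")
    body = token_ids[: max_length - 2]  # reserve two slots for the special tokens
    return [CLS_ID] + body + [SEP_ID]


long_sentence = list(range(1000, 1400))  # 400 fake word-piece ids
encoded = truncate_with_specials(long_sentence, 256)

print(len(encoded))             # 256
print(encoded[0], encoded[-1])  # 101 102
```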

I'll also use the transformers library's `DataCollatorForLanguageModeling` component to mask the sentences for training. This data collator is used during batch collation in PyTorch's `DataLoader`.

For more information, check out its source code [here](https://github.com/huggingface/transformers/blob/master/src/transformers/data/data_collator.py).

In [0]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
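For intuition about what the collator does to each sequence, below is a pure-Python sketch of BERT-style masking. It assumes the usual recipe: each non-special token is selected with probability `mlm_probability`, and a selected token is replaced by `[MASK]` 80% of the time, by a random token 10% of the time, and left unchanged otherwise. Labels are `-100` at positions that should not contribute to the loss. `mask_tokens` here is a hypothetical stand-in, not the library's implementation.

```python
import random

IGNORE_INDEX = -100  # label for positions excluded from the loss


def mask_tokens(input_ids, mask_id, vocab_size, special_ids,
                mlm_probability=0.15, rng=None):
    """Return (masked_inputs, labels) for one sequence of token ids."""
    rng = rng or random.Random()
    inputs = list(input_ids)
    labels = [IGNORE_INDEX] * len(inputs)
    for i, tok in enumerate(input_ids):
        # Special tokens are never selected; others are selected at random.
        if tok in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = tok  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # replace with a random token
        # otherwise keep the original token in the input
    return inputs, labels


# Toy sequence: [CLS], a few arbitrary word-piece ids, [SEP].
ids = [101, 3493, 2552, 1926, 7893, 117, 170, 7733, 27051, 170, 102]
masked, labels = mask_tokens(ids, mask_id=103, vocab_size=28996,
                             special_ids={0, 101, 102}, rng=random.Random(0))
print(masked)
print(labels)
```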

### Execute training

For training, instead of using PyTorch Lightning (as in previous experiments), I'll use transformers' **Trainer** API, as in [this colab](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb#scrollTo=QDNgPls7_l13). It is a very straightforward API for training that makes the training parameters explicit and the run reproducible.

In [0]:
from transformers import Trainer, TrainingArguments

In [0]:
training_args = TrainingArguments(
    output_dir="./bert-base-cased-ptemb-2",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_gpu_train_batch_size=32,
    save_steps=10_000,
    save_total_limit=2,
)

trainer = Trainer(
    model=bert_base_cased,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    prediction_loss_only=True,
)
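No learning rate is set above, so the schedule comes from the **Trainer** defaults. The sketch below assumes an initial learning rate of 5e-5, no warmup, and a linear decay to zero over the 31,250 steps; under those assumptions it yields values that can be compared with the `learning_rate` entries in the training log. `linear_decay_lr` is a hypothetical helper written for illustration.

```python
def linear_decay_lr(step, total_steps, initial_lr=5e-5, warmup_steps=0):
    """Learning rate at `step` under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return initial_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return initial_lr * remaining / max(1, total_steps - warmup_steps)


TOTAL_STEPS = 31_250

# Steps chosen to line up with entries in the training log.
for step in (500, 1_000, 12_500):
    print(step, linear_decay_lr(step, TOTAL_STEPS))
```

Under these assumptions the rate at step 500 is about 4.92e-05 and at step 12,500 about 3e-05.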

In [19]:
%%time
trainer.train()

HBox(children=(FloatProgress(value=0.0, description='Epoch', max=1.0, style=ProgressStyle(description_width='i…

HBox(children=(FloatProgress(value=0.0, description='Iteration', max=31250.0, style=ProgressStyle(description_…

{"loss": 7.833941082000733, "learning_rate": 4.92e-05, "epoch": 0.016, "step": 500}
{"loss": 7.1357386054992675, "learning_rate": 4.8400000000000004e-05, "epoch": 0.032, "step": 1000}
{"loss": 6.961196041107177, "learning_rate": 4.76e-05, "epoch": 0.048, "step": 1500}
{"loss": 6.844485320091247, "learning_rate": 4.6800000000000006e-05, "epoch": 0.064, "step": 2000}
{"loss": 6.747228501319885, "learning_rate": 4.600000000000001e-05, "epoch": 0.08, "step": 2500}
{"loss": 6.682852043151856, "learning_rate": 4.52e-05, "epoch": 0.096, "step": 3000}
{"loss": 6.588532162666321, "learning_rate": 4.44e-05, "epoch": 0.112, "step": 3500}
{"loss": 6.539473986625671, "learning_rate": 4.36e-05, "epoch": 0.128, "step": 4000}
{"loss": 6.462820938110352, "learning_rate": 4.2800000000000004e-05, "epoch": 0.144, "step": 4500}
{"loss": 6.406926966667175, "learning_rate": 4.2e-05, "epoch": 0.16, "step": 5000}
{"loss": 6.345725014686584, "learning_rate": 4.12e-05, "epoch": 0.176, "step": 5500}
{"loss": 6.30



{"loss": 5.897396026611328, "learning_rate": 3.32e-05, "epoch": 0.336, "step": 10500}
{"loss": 5.841297324180603, "learning_rate": 3.24e-05, "epoch": 0.352, "step": 11000}
{"loss": 5.827618744850159, "learning_rate": 3.16e-05, "epoch": 0.368, "step": 11500}
{"loss": 5.79259323310852, "learning_rate": 3.08e-05, "epoch": 0.384, "step": 12000}
{"loss": 5.753235119819641, "learning_rate": 3e-05, "epoch": 0.4, "step": 12500}
{"loss": 5.734581573486328, "learning_rate": 2.9199999999999998e-05, "epoch": 0.416, "step": 13000}
{"loss": 5.6961206426620485, "learning_rate": 2.84e-05, "epoch": 0.432, "step": 13500}
{"loss": 5.669636674880982, "learning_rate": 2.7600000000000003e-05, "epoch": 0.448, "step": 14000}
{"loss": 5.650984806060791, "learning_rate": 2.6800000000000004e-05, "epoch": 0.464, "step": 14500}
{"loss": 5.624766125679016, "learning_rate": 2.6000000000000002e-05, "epoch": 0.48, "step": 15000}
{"loss": 5.595482871055603, "learning_rate": 2.5200000000000003e-05, "epoch": 0.496, "step

TrainOutput(global_step=31250, training_loss=5.785344849456787)
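To put the final loss in context, it can be converted to a perplexity and compared with the loss of a uniform guess over the vocabulary. This is a rough sketch that assumes the reported loss is the mean cross-entropy (in nats) over masked tokens; it is only a reading aid, not an evaluation.

```python
import math

training_loss = 5.785344849456787  # average loss reported by TrainOutput above
vocab_size = 28_996                # vocabulary size used in this notebook

perplexity = math.exp(training_loss)   # effective number of equally likely choices
uniform_loss = math.log(vocab_size)    # loss of guessing uniformly over the vocabulary

print(f"perplexity: {perplexity:.1f}")
print(f"uniform-guess loss: {uniform_loss:.2f}")
```

The average training loss sits well below the uniform-guess value, which is consistent with the embeddings having learned something useful, although the average includes the high-loss early steps.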

### Saving the Model and Embeddings

I save the entire model using **Trainer**'s save utility. I also save the word embeddings on their own, through the `state_dict` of the PyTorch **Module**.

In [0]:
trainer.save_model("/content/drive/My Drive/PF13/bert-base-cased-ptemb-v2")

In [0]:
emb = bert_base_cased.get_input_embeddings()
torch.save(emb.state_dict(), '/content/drive/My Drive/PF13/pt-embeddings-v2')

In [22]:
emb

Embedding(28996, 768, padding_idx=0)

## Checking the Model

As a quick check, I test whether the model can now perform MLM in Portuguese. Remember that only the word embeddings were trained on Portuguese; the rest of the Transformer still has its English-trained weights.

In [0]:
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="/content/drive/My Drive/PF13/bert-base-cased-ptemb-v2",
    tokenizer=tokenizer
)

In [26]:
fill_mask('Passagem é um [MASK] do estado da Paraíba')

[{'score': 0.21943828463554382,
  'sequence': '[CLS] Passagem é um estado do estado da Paraíba [SEP]',
  'token': 2574},
 {'score': 0.08214069902896881,
  'sequence': '[CLS] Passagem é um município do estado da Paraíba [SEP]',
  'token': 2835},
 {'score': 0.031121809035539627,
  'sequence': '[CLS] Passagem é um jogador do estado da Paraíba [SEP]',
  'token': 3803},
 {'score': 0.030938852578401566,
  'sequence': '[CLS] Passagem é um centro do estado da Paraíba [SEP]',
  'token': 3635},
 {'score': 0.016195986419916153,
  'sequence': '[CLS] Passagem é um campo do estado da Paraíba [SEP]',
  'token': 3931}]
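The scores above are probabilities over the vocabulary at the masked position. As a rough sketch of how such a ranking can be produced, assuming a softmax over the logits followed by a top-k selection, the snippet below uses a toy six-word vocabulary and made-up logits; `softmax` and `top_k` are illustrative helpers, not the pipeline's code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, id_to_token, k=5):
    """Return the k most probable tokens with their probabilities."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    return [{"token": i, "token_str": id_to_token[i], "score": probs[i]} for i in ranked]


# Toy vocabulary and made-up logits for the [MASK] position.
toy_vocab = ["estado", "município", "jogador", "centro", "campo", "rio"]
toy_logits = [2.0, 1.0, 0.1, 0.0, -0.5, -2.0]

candidates = top_k(toy_logits, toy_vocab, k=3)
for cand in candidates:
    print(cand)
```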

## References

Artetxe, Mikel, Sebastian Ruder, and Dani Yogatama. "On the cross-lingual transferability of monolingual representations." arXiv preprint arXiv:1910.11856 (2020).

You, Yang, et al. "Large batch optimization for deep learning: Training bert in 76 minutes." International Conference on Learning Representations. 2019.