<a href="https://colab.research.google.com/github/isacmoura/bert-from-scratch/blob/master/BERT_from_scratch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install datasets
!pip install tokenizers
!pip install transformers==4.1.0


In [None]:
import datasets
from datasets import load_dataset
from tqdm.auto import tqdm
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer
from transformers import RobertaTokenizer, RobertaConfig, RobertaForMaskedLM, AdamW, pipeline
import os
import torch

# Portuguese corpus

In [None]:
dataset = load_dataset("nthngdy/oscar-mini", "unshuffled_deduplicated_pt")


In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'text'],
        num_rows: 195520
    })
})

In [None]:
dataset["train"][0]

{'id': 0,
 'text': 'Você pode estar lendo este texto no sofá, levantar pra pegar uma breja na geladeira, dar uma cagada e sentar novamente, sem interromper a leitura.'}

Loop through the samples and write them to plain-text files of 10,000 samples each, one sample per line.

In [None]:
text_data = []
file_count = 0

for sample in tqdm(dataset["train"]):
  sample = sample["text"].replace("\n", " ")
  text_data.append(sample)

  if len(text_data) == 10_000:
    with open(f"pt_{file_count}.txt", "w", encoding="utf-8") as fp:
      fp.write("\n".join(text_data))
    text_data = []
    file_count += 1

# write the remaining samples (fewer than 10,000) to a final file
if text_data:
  with open(f"pt_{file_count}.txt", "w", encoding="utf-8") as fp:
    fp.write("\n".join(text_data))


# Building tokenizer

Getting the paths of our subsets

In [None]:
paths = [str(x) for x in Path("./").glob("*.txt")]

paths[:5]

['pt_12.txt', 'pt_4.txt', 'pt_11.txt', 'pt_1.txt', 'pt_7.txt']

Training the tokenizer.

We use a byte-level byte-pair encoding (BPE) tokenizer. Its base alphabet is the set of single bytes, so every word can be decomposed into known tokens and nothing has to fall back to an unknown token.
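To see why a byte-level alphabet covers any input, here is a minimal sketch that uses only Python's built-in UTF-8 encoding (it does not call the `tokenizers` library). Any string, including accented Portuguese characters, reduces to a sequence of values between 0 and 255, which is the alphabet the BPE merges are built on.

```python
# Every string can be written as UTF-8 bytes, each in the range 0..255.
text = "Olá"
byte_values = list(text.encode("utf-8"))
print(byte_values)  # [79, 108, 195, 161]: the accented "á" takes two bytes

# The round trip back to text is lossless.
assert bytes(byte_values).decode("utf-8") == text
assert all(0 <= b <= 255 for b in byte_values)
```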

In [None]:
tokenizer = ByteLevelBPETokenizer()

In [None]:
tokenizer.train(files=paths, vocab_size=30_522, min_frequency=2,
                special_tokens=[
                    "<s>", "<pad>", "</s>", "<unk>", "<mask>"
                ])

Save the tokenizer

In [None]:
root = "/content/drive/MyDrive/Colab Notebooks/exercises/bert_from_scratch"

In [None]:
# create the output directory (and any missing parents); safe to re-run
os.makedirs(f"{root}/alfredo", exist_ok=True)

tokenizer.save_model(f"{root}/alfredo")

['/content/drive/MyDrive/Colab Notebooks/exercises/bert_from_scratch/alfredo/vocab.json',
 '/content/drive/MyDrive/Colab Notebooks/exercises/bert_from_scratch/alfredo/merges.txt']

- merges.txt: the learned BPE merge rules, used to split text into tokens
- vocab.json: maps each token to its token ID
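How the two files work together can be sketched on a toy example. The merge rules and vocabulary below are hypothetical stand-ins, not the contents of the real files, and applying the rules once each in priority order is a simplification of what a real BPE implementation does.

```python
# Hypothetical miniature versions of merges.txt and vocab.json.
toy_merges = [("t", "u"), ("tu", "d"), ("tud", "o")]
toy_vocab = {"t": 0, "u": 1, "d": 2, "o": 3, "tu": 4, "tud": 5, "tudo": 6}

def apply_merges(word, merges):
    """Split a word into characters, then apply each merge rule in priority order."""
    symbols = list(word)
    for left, right in merges:
        merged = []
        i = 0
        while i < len(symbols):
            # join an adjacent (left, right) pair into a single symbol
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

tokens_toy = apply_merges("tudo", toy_merges)   # step 1: text -> tokens (merges)
ids_toy = [toy_vocab[t] for t in tokens_toy]    # step 2: tokens -> IDs (vocab)
print(tokens_toy, ids_toy)
```

A word that matches no merge rule, such as "duto" here, simply stays split into its base symbols, each of which still has an ID in the vocabulary.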


Initializing the Tokenizer

In [None]:
# model_max_length is the current name for the older max_len argument
tokenizer = RobertaTokenizer.from_pretrained(f"{root}/alfredo", model_max_length=512)

In [None]:
# test our tokenizer on a simple sentence
tokens = tokenizer('Olá, tudo bem?')

tokens

{'input_ids': [0, 5026, 16, 917, 706, 35, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

In [None]:
tokens.input_ids

[0, 5026, 16, 917, 706, 35, 2]

# Creating the Input Pipeline

In [None]:
# Preparing data
with open("pt_0.txt", "r", encoding="utf-8") as fp:
  lines = fp.read().split("\n")

batch = tokenizer(lines, max_length=512, padding="max_length", truncation=True)
len(batch)  # number of fields (input_ids, attention_mask), not the number of samples

2

Creating our tensors. We'll need three tensors:

- input_ids: our token IDs with ~15% of tokens replaced by the mask token <mask>.
- attention_mask: a tensor of 1s and 0s marking the positions of real tokens (1) and padding tokens (0), used in the attention calculations.
- labels: our token IDs with no masking.
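The relationship between the three tensors can be illustrated on a single toy sequence in plain Python. This is only a sketch: the special-token IDs follow the tokenizer trained above (`<s>`=0, `<pad>`=1, `</s>`=2, `<mask>`=4), while the other IDs and the `toy_` names are made up for illustration.

```python
import random

MASK_ID, PAD_ID = 4, 1

# labels: the original token IDs, padded to a fixed length
toy_labels = [0, 3035, 644, 1405, 5684, 2, 1, 1]

# attention_mask: 1 for real tokens, 0 for padding
toy_attention_mask = [0 if tok == PAD_ID else 1 for tok in toy_labels]

# input_ids: a copy of labels where each non-special token (ID > 2)
# is replaced by <mask> with probability 0.15
rng = random.Random(0)  # fixed seed so the example is repeatable
toy_input_ids = [
    MASK_ID if tok > 2 and rng.random() < 0.15 else tok
    for tok in toy_labels
]

print(toy_input_ids)
print(toy_attention_mask)
print(toy_labels)
```

With a sequence this short, a given run may well mask nothing; the 15% rate only becomes visible over many tokens.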

In [None]:
labels = torch.tensor(batch.input_ids)
mask = torch.tensor(batch.attention_mask)

In [None]:
# make copy of labels tensor, this will be input_ids
input_ids = labels.detach().clone()

rand = torch.rand(input_ids.shape)

# mask tokens whose randomly generated value is below 0.15 (about 15% of positions),
# skipping the special tokens <s> (0), <pad> (1), and </s> (2)
mask_arr = (rand < 0.15) & (input_ids > 2)

# replace every selected position with the <mask> token ID (4)
input_ids[mask_arr] = 4

In [None]:
input_ids.shape

torch.Size([10000, 512])

In [None]:
input_ids[0][:10]

tensor([    0,  3035,     4,  1405,  5684,   711,  2208,     4, 13005,    16])

In [None]:
labels[0][:10]

tensor([    0,  3035,   644,  1405,  5684,   711,  2208,   331, 13005,    16])

## Defining Dataset

In [None]:
encodings = {
    "input_ids": input_ids,
    "attention_mask": mask,
    "labels": labels
}

In [None]:
class Dataset(torch.utils.data.Dataset):
  def __init__(self, encodings):
    # store encodings internally
    self.encodings = encodings
  
  def __len__(self):
    # return the number of samples
    return self.encodings["input_ids"].shape[0]
  
  def __getitem__(self, i):
    # return dictionary of input_ids, attention_mask, and labels for index i
    return {key: tensor[i] for key, tensor in self.encodings.items()}

In [None]:
dataset = Dataset(encodings)

In [None]:
dataloader = torch.utils.data.DataLoader(dataset, batch_size=10, shuffle=True)

# Training the model

In [None]:
tokenizer.vocab_size

30522

Create the configuration for RoBERTa

In [None]:
config = RobertaConfig(
    vocab_size=tokenizer.vocab_size,
    max_position_embeddings=514,
    hidden_size=768,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1
)
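
A note on `max_position_embeddings=514` when the sequences are only 512 tokens long: in the Hugging Face RoBERTa implementation, position IDs for real tokens start at `padding_idx + 1` and padding positions keep `padding_idx`. The sketch below reproduces that numbering in plain Python, assuming `padding_idx=1` as in the model printed further down; the function name is ours, not part of the library.

```python
def roberta_style_position_ids(input_ids, padding_idx=1):
    """Number non-padding tokens from padding_idx + 1; padding keeps padding_idx."""
    position_ids = []
    count = 0
    for tok in input_ids:
        if tok == padding_idx:
            position_ids.append(padding_idx)
        else:
            count += 1
            position_ids.append(count + padding_idx)
    return position_ids

# a full-length sequence: 512 non-padding tokens (the inner IDs are made up)
full_sequence = [0] + [10] * 510 + [2]
positions = roberta_style_position_ids(full_sequence)
print(max(positions))  # 513, so the embedding table needs 514 rows (indices 0..513)
```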

In [None]:
model = RobertaForMaskedLM(config)

Start training

In [None]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

In [None]:
model.to(device)

RobertaForMaskedLM(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0): RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNor

In [None]:
# activate training mode
model.train()

optim = AdamW(model.parameters(), lr=1e-4)

In [None]:
epochs = 7

for epoch in range(epochs):
  loop = tqdm(dataloader, leave=True)
  for batch in loop:
    optim.zero_grad()

    # pull all tensor batches required for training
    input_ids = batch['input_ids'].to(device)
    attention_mask = batch['attention_mask'].to(device)
    labels = batch['labels'].to(device)

    # forward pass
    outputs = model(input_ids, attention_mask=attention_mask, labels=labels)

    # extract loss
    loss = outputs.loss

    # backpropagate: compute the gradient of the loss for every parameter that requires grad
    loss.backward()

    # update parameters
    optim.step()

    # print relevant info to progress bar
    loop.set_description(f'Epoch {epoch}')
    loop.set_postfix(loss=loss.item())


In [None]:
model.save_pretrained(f"{root}/alfredo")

# Testing

In [None]:
fill = pipeline("fill-mask", model=f"{root}/alfredo", tokenizer=f"{root}/alfredo")

Some weights of RobertaModel were not initialized from the model checkpoint at /content/drive/MyDrive/Colab Notebooks/exercises/bert_from_scratch/alfredo and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
fill(f"Você pode ler isso enquanto {fill.tokenizer.mask_token}")

[{'sequence': '<s>Você pode ler isso enquanto.</s>',
  'score': 0.0554555244743824,
  'token': 18,
  'token_str': '.'},
 {'sequence': '<s>Você pode ler isso enquanto você</s>',
  'score': 0.03763459622859955,
  'token': 562,
  'token_str': 'ĠvocÃª'},
 {'sequence': '<s>Você pode ler isso enquanto a</s>',
  'score': 0.0338815376162529,
  'token': 263,
  'token_str': 'Ġa'},
 {'sequence': '<s>Você pode ler isso enquanto e</s>',
  'score': 0.029280278831720352,
  'token': 262,
  'token_str': 'Ġe'},
 {'sequence': '<s>Você pode ler isso enquanto para</s>',
  'score': 0.024098969995975494,
  'token': 326,
  'token_str': 'Ġpara'}]