## Training a BERT Model From Scratch
- The process has four key parts, which we summarize first:
    - Getting the data
    - Building a tokenizer
    - Creating an input pipeline
    - Training the model

In [40]:
# Importing libraries
import os
from pathlib import Path

import numpy as np
import torch
from tokenizers import ByteLevelBPETokenizer
from torch.optim import AdamW  # used instead of the deprecated transformers.AdamW
from tqdm.auto import tqdm
from transformers import RobertaConfig, RobertaForMaskedLM, RobertaTokenizer, pipeline

### Getting The Data
- The data provided by the stakeholder is a Shakespeare Sonnet which we will use as our training data.

In [1]:
# Read the raw training text, one sample per line (the file is ISO-8859-1 encoded)
with open('bert_unbiased.txt', encoding="ISO-8859-1") as t:
    lines = t.readlines()


#### Text File
- We have to store the data in a format that we can use when building the tokenizer.
- We need to create a set of plaintext files containing just the text from our dataset, with one sample per line (samples are separated by a newline \n).

In [3]:

text_data = []
file_count = 0

for sample in tqdm(lines):
    sample = sample.replace('\n', '')
    text_data.append(sample)
    if len(text_data) == 10_000:
        # once we get the 10K mark, save to file
        with open(f'/Users/gurneetbedi/Desktop/Capstone/Bert/Bert_Unbiased/Data/text_{file_count}.txt', 'w', encoding='utf-8') as fp:
            fp.write('\n'.join(text_data))
        text_data = []
        file_count += 1
# after saving in 10K chunks, save the leftover samples (fewer than 10K) too
with open(f'/Users/gurneetbedi/Desktop/Capstone/Bert/Bert_Unbiased/Data/text_{file_count}.txt', 'w', encoding='utf-8') as fp:
    fp.write('\n'.join(text_data))

  0%|          | 0/2434 [00:00<?, ?it/s]
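The same chunking logic can be wrapped in a small reusable helper. The sketch below is an illustration only: the function name `write_chunks`, the demo data, and the use of a temporary directory are assumptions, and unlike the cell above it writes no trailing file when there are no leftover samples.

```python
import tempfile
from pathlib import Path


def write_chunks(samples, out_dir, chunk_size=10_000):
    """Write samples to text_{i}.txt files, at most chunk_size lines per file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for file_count, start in enumerate(range(0, len(samples), chunk_size)):
        chunk = samples[start:start + chunk_size]
        target = out_dir / f"text_{file_count}.txt"
        target.write_text("\n".join(chunk), encoding="utf-8")
        written.append(target)
    return written


# Usage with made-up data: 25 samples at 10 per file gives 3 files
with tempfile.TemporaryDirectory() as tmp:
    demo_samples = [f"line {i}" for i in range(25)]
    files = write_chunks(demo_samples, tmp, chunk_size=10)
    file_names = [p.name for p in files]
    last_file_lines = files[-1].read_text(encoding="utf-8").split("\n")

print(file_names, len(last_file_lines))
```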

### Building a tokenizer
- We’ll start with a single sample and work through the preparation logic.

- First, we locate the training data, the same .txt files that we saved earlier. Each line in these files is one sample, because we separated the samples with newline characters \n.

- Next, we build a tokenizer. With transformers, a tokenizer is normally loaded alongside its matching model, so training our own model means training our own tokenizer as well.


In [11]:
path = [str(x) for x in Path('./Data').glob('*.txt')]

path

['Data/text_0.txt']

In [17]:
# Initializing the tokenizer
tokenizer = ByteLevelBPETokenizer()

In [19]:
tokenizer.train(files=path, vocab_size = 30_522, min_frequency = 2,
               special_tokens= ['<s>', '<pad>', '</s>', "<unk>", '<mask>'])

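# Special tokens receive IDs 0 to 4, in the order listed above:
# <s> = start-of-sequence token
# <pad> = padding token
# </s> = end-of-sequence token
# <unk> = unknown token
# <mask> = mask token, used for masked-language modeling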






In [14]:
# Making a directory to store the JSON and text files.
# exist_ok=True avoids an error if the cell is run more than once.
os.makedirs('shakespeare_unbiased', exist_ok=True)

In [None]:
# Saving the trained tokenizer files to the directory
tokenizer.save_model('shakespeare_unbiased')

# This writes two files, which are used in two steps during tokenization:
# merges.txt: the learned merge rules that turn raw text into tokens
# vocab.json: the mapping from tokens to token IDs
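To make the two-step idea concrete, here is a minimal plain-Python sketch of how ranked merge rules followed by a vocabulary lookup turn a word into token IDs. The merge rules and vocabulary below are hypothetical and were made up for illustration; they are not the contents of the real `merges.txt` and `vocab.json`. The sketch also works on characters, whereas the byte-level tokenizer works on bytes.

```python
def bpe_encode(word, merges, vocab):
    """Apply ranked merge rules to a word, then look up the token IDs."""
    # Earlier rules in the list have higher priority (lower rank).
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no applicable merge rule is left
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols, [vocab[s] for s in symbols]


# Hypothetical merge rules and vocabulary (assumptions for this example only).
toy_merges = [("e", "s"), ("es", "t"), ("a", "i"), ("ai", "r")]
toy_vocab = {"f": 5, "air": 6, "est": 7, "ai": 8, "es": 9}

toy_tokens, toy_ids = bpe_encode("fairest", toy_merges, toy_vocab)
print(toy_tokens, toy_ids)
```

The first step (merging) depends only on the merge rules, and the second step (ID lookup) depends only on the vocabulary, which is why the tokenizer is saved as two separate files.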

In [23]:
# Load the tokenizer we trained and saved to file
# (model_max_length is the current name of the older max_len argument)
tokenizer = RobertaTokenizer.from_pretrained('shakespeare_unbiased', model_max_length=512)

In [32]:
# Sample output for a single word.
# Once the tokenizer is ready, we can try encoding some text by calling it directly.
# Here we pad the output to 12 tokens and return PyTorch tensors.
tokens = tokenizer('fairest', padding = 'max_length', max_length = 12,return_tensors ='pt')

In [31]:
# From the encodings object tokens we will be extracting the input_ids and
# attention_mask tensors for use with shakespeare_unbiased.
print(tokens.input_ids)

tensor([[  0,  74,  69, 473, 295,   2,   1,   1,   1,   1,   1,   1]])
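The output above shows the padding at work: ID 0 is `<s>`, ID 2 is `</s>`, and the trailing 1s are `<pad>` tokens that fill the sequence up to `max_length`. A minimal plain-Python sketch of that padding step, together with the matching attention mask, could look like this. The helper name `pad_to_max_length` is hypothetical and is not part of the tokenizer API.

```python
def pad_to_max_length(ids, max_length, pad_id=1):
    """Right-pad a list of token IDs and build the matching attention mask."""
    if len(ids) > max_length:
        raise ValueError("sequence is longer than max_length")
    n_pad = max_length - len(ids)
    input_ids = ids + [pad_id] * n_pad
    # 1 marks a real token, 0 marks padding
    attention_mask = [1] * len(ids) + [0] * n_pad
    return input_ids, attention_mask


# The six non-padding IDs from the output above: <s>, four subword tokens, </s>
padded_ids, padded_mask = pad_to_max_length([0, 74, 69, 473, 295, 2], max_length=12)
print(padded_ids)
print(padded_mask)
```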


### Creating the Input Pipeline
- Now we will move on to creating our tensors. We will be training our model through masked-language modeling (MLM), so we need three tensors:
    - input_ids: our token IDs with ~15% of tokens replaced by the mask token <mask>.
    - attention_mask: a tensor of 1s and 0s marking the positions of real tokens (1) and padding tokens (0), used in the attention calculations.
    - labels: our token IDs with no masking.

- Our attention_mask and labels tensors are simply extracted from our batch. The input_ids tensor requires more work: we mask ~15% of its tokens by assigning them token ID 4, the ID of <mask>. Special tokens (IDs 0, 1 and 2) are never masked.

In [19]:
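def mlm(tensor):
    # one random value in [0, 1) for every token position
    rand = torch.rand(tensor.shape)
    # select ~15% of positions, skipping the special tokens <s>, <pad>, </s> (IDs 0, 1, 2)
    mask_arr = (rand < 0.15) * (tensor > 2)
    for i in range(tensor.shape[0]):
        # indices of the selected positions in row i, e.g. [2, 5, 18]
        selection = torch.flatten(mask_arr[i].nonzero()).tolist()
        tensor[i, selection] = 4  # 4 is the ID of <mask>
    return tensor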
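To see the masking rule in isolation, here is a plain-Python sketch that works on a single list of IDs. It is a variant, not what this notebook does: besides the masked IDs it also returns labels in which every unmasked position is set to -100, the value that Hugging Face models ignore when computing the loss, so that only masked tokens would contribute. The function name, the seed, and the 50% rate used in the demo are assumptions for illustration.

```python
import random


def mask_tokens(ids, mask_id=4, max_special_id=2, prob=0.15, seed=None):
    """Return (masked_ids, labels) for one sequence of token IDs."""
    rng = random.Random(seed)
    masked, labels = [], []
    for token in ids:
        # special tokens (IDs 0..max_special_id) are never masked
        if token > max_special_id and rng.random() < prob:
            masked.append(mask_id)
            labels.append(token)  # the model has to recover this token
        else:
            masked.append(token)
            labels.append(-100)   # position is ignored by the loss
    return masked, labels


example_ids = [0, 74, 69, 473, 295, 2, 1, 1]
# prob is raised to 0.5 only so that the short demo is likely to mask something
masked_ids, mlm_labels = mask_tokens(example_ids, prob=0.5, seed=0)
print(masked_ids)
print(mlm_labels)
```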

In [33]:
# Named paths (plural) because the loop in the next cell iterates over `paths`
paths = [str(x) for x in Path('./Data').glob('*.txt')]

paths

['Data/text_0.txt']

In [27]:
# Tokenizing each file and collecting the resulting tensors in three lists.
input_ids = []
mask = []
labels = []

for path in tqdm(paths):
    with open(path, 'r', encoding= 'utf-8') as f:
        lines = f.read().split('\n')
    sample = tokenizer(lines, max_length = 512, padding = 'max_length', truncation = True,return_tensors ='pt')
    labels.append(sample.input_ids)
    mask.append(sample.attention_mask)
    input_ids.append(mlm(sample.input_ids.detach().clone()))

  0%|          | 0/1 [00:00<?, ?it/s]

### Building the DataLoader
- Next, we concatenate our tensors and define a Dataset class that wraps the three encoded tensors as a PyTorch torch.utils.data.Dataset object.
- Finally, our dataset is loaded into a PyTorch DataLoader object, which we use to feed batches of data to our model during training.

In [None]:
# Concatenating each list of tensors into a single tensor
input_ids = torch.cat(input_ids)
mask = torch.cat(mask)
labels = torch.cat(labels)

In [None]:
# Creating a dictionary with the input_ids, attention_mask and labels
encodings = {
    'input_ids': input_ids,
    'attention_mask' : mask,
    'labels' : labels
}

In [36]:
# Creating a dataset object.
class Dataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings
    def __len__(self):
        return self.encodings['input_ids'].shape[0]
    def __getitem__(self, i):
        return {key: tensor[i] for key, tensor in self.encodings.items()}

In [31]:
# Initializing the dataset class
dataset = Dataset(encodings)

In [32]:
# Initializing the data loader
dataloader = torch.utils.data.DataLoader(dataset, batch_size = 16, shuffle = True)
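As a quick sanity check on the loader size, the number of batches per epoch can be worked out by hand. The sketch below assumes the 2434 samples reported by the progress bar when the text file was written, and the DataLoader default of keeping the final partial batch.

```python
import math

n_samples = 2434  # sample count from the progress bar shown earlier
batch_size = 16   # as passed to the DataLoader

# With the final partial batch kept, the batch count is rounded up
n_batches = math.ceil(n_samples / batch_size)
last_batch_size = n_samples - (n_batches - 1) * batch_size

print(n_batches, last_batch_size)
```

This agrees with the 153 steps per epoch shown by the training progress bar.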

### Training the Model
- We need two things for training: our DataLoader and a model. We have the DataLoader, but no model yet.
    - Initializing the Model
        - For training, we need a raw (not pre-trained) RobertaForMaskedLM model. To create that, we first need to create a RoBERTa config object that describes the parameters we’d like to initialize our model with.
        - Then, we initialize our RoBERTa model with a language modeling (LM) head.
    - Training Preparation
        - Before moving on to our training loop we need to set up a few things. First, we set up GPU/CPU usage. Then we activate the training mode of our model and, finally, initialize our optimizer.
    - Training
        - After this, we train the model with a standard PyTorch training loop.

In [37]:
## Get a list of paths to each file in the directory

paths_new = [str(x) for x in Path('./').glob('*.txt')]

paths_new

['bert_unbiased.txt']

In [38]:
# Config File
config = RobertaConfig(
    vocab_size=tokenizer.vocab_size,  # we align this to the tokenizer vocab_size
    max_position_embeddings=514,
    hidden_size=768,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1
)
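The value `max_position_embeddings=514` may look odd next to a maximum sequence length of 512. As far as I understand the Hugging Face RoBERTa implementation, position IDs for real tokens start at `padding_idx + 1` (that is, 2), while padding tokens keep position `padding_idx` (1). A plain-Python sketch of that numbering, written from that understanding rather than copied from the library, shows where the two extra slots come from.

```python
def roberta_position_ids(input_ids, padding_idx=1):
    """Position IDs as RoBERTa is understood to assign them (sketch)."""
    positions, count = [], 0
    for token in input_ids:
        if token == padding_idx:
            positions.append(padding_idx)          # padding keeps the padding position
        else:
            count += 1
            positions.append(count + padding_idx)  # first real token gets position 2
    return positions


example_positions = roberta_position_ids([0, 74, 69, 473, 295, 2, 1, 1])

# A full-length sequence of 512 non-padding tokens (token ID 5 is arbitrary)
max_position = max(roberta_position_ids([5] * 512))
required_embeddings = max_position + 1  # embedding rows are indexed from 0

print(example_positions)
print(max_position, required_embeddings)
```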

In [39]:
# Initializing the model
model = RobertaForMaskedLM(config)

In [38]:
# Setup GPU/CPU usage.
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
# and move our model over to the selected device
model.to(device)

RobertaForMaskedLM(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(3040, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0): RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm

In [39]:
# Activate training mode
model.train()
# Initialize optimizer
optim = AdamW(model.parameters(), lr=1e-4)

In [40]:
# We train just as we usually would when training via PyTorch.
epochs = 2

for epoch in range(epochs):
    # setup loop with TQDM and dataloader
    loop = tqdm(dataloader, leave=True)
    for batch in loop:
        # reset the gradients left over from the previous step
        optim.zero_grad()
        # pull all tensor batches required for training
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        # forward pass
        outputs = model(input_ids, attention_mask=attention_mask,
                        labels=labels)
        # extract loss
        loss = outputs.loss
        # backpropagate: compute gradients for every parameter that needs a grad update
        loss.backward()
        # update parameters
        optim.step()
        # print relevant info to progress bar
        loop.set_description(f'Epoch {epoch}')
        loop.set_postfix(loss=loss.item())

  0%|          | 0/153 [00:00<?, ?it/s]

  0%|          | 0/153 [00:00<?, ?it/s]

In [41]:
# Saving the model.
model.save_pretrained('shakespeare_unbiased')

### Testing the Model
- Testing the output given by our model.

In [3]:
fill = pipeline('fill-mask', model = 'shakespeare_unbiased', tokenizer = 'shakespeare_unbiased')

In [4]:
fill(f'{fill.tokenizer.mask_token} the world, or else this glutton be')

[{'sequence': '? the world, or else this glutton be',
  'score': 0.166957288980484,
  'token': 35,
  'token_str': '?'},
 {'sequence': 'and the world, or else this glutton be',
  'score': 0.16617439687252045,
  'token': 304,
  'token_str': 'and'},
 {'sequence': 'the the world, or else this glutton be',
  'score': 0.01064683310687542,
  'token': 377,
  'token_str': 'the'},
 {'sequence': 'to the world, or else this glutton be',
  'score': 0.008266780525445938,
  'token': 385,
  'token_str': 'to'},
 {'sequence': 'but the world, or else this glutton be',
  'score': 0.006061549764126539,
  'token': 389,
  'token_str': 'but'}]

In [5]:
fill(f'Learning {fill.tokenizer.mask_token} the teacher')

[{'sequence': 'Learning, the teacher',
  'score': 0.07089152932167053,
  'token': 16,
  'token_str': ','},
 {'sequence': 'Learning my the teacher',
  'score': 0.023096293210983276,
  'token': 298,
  'token_str': ' my'},
 {'sequence': 'Learning I the teacher',
  'score': 0.017750972881913185,
  'token': 303,
  'token_str': ' I'},
 {'sequence': 'Learning and the teacher',
  'score': 0.017463266849517822,
  'token': 318,
  'token_str': ' and'},
 {'sequence': 'Learning of the teacher',
  'score': 0.016983861103653908,
  'token': 296,
  'token_str': ' of'}]