### Train a small sized ROBERaMLM (84M parameters, 6 layers, 12 attn heads) from scratch

To showcase training Huggingface Trassformer model from scratch. Only 10k lines are used from oscar.eo.txt file - original file has approximately 1-million lines.

Decreased number of EPOCHs to train model from to `10` - since this more of an exercise to train a Model from scratch, did not shard or execute data parallel.

Adding compute_metrics to `training` blows up memory - had to turn-it off on **Colab** since session crashes

Notebook uses ByteLevelBPETokenizer from OpenAI to generate tokens for Esperanto

REF: https://github.com/huggingface/blog/blob/main/notebooks/01_how_to_train.ipynb

### Preface
Notebook walks through the various steps on how to train a Transformer model from scratch.
The dataset is split into three - train, evaluate, and test. The test set is used for inference and predictions/completions and verfied with ROGUE, BLUE, and BERT scores.

There is a lot of literature on tokenization, datasets creation, and data loaders. If you change the model to train, please refer to model documentation on tokenization, and setup steps to train etc...

**Note**: could not execute this notebook on local m/c `GeForce GTX 1660 Ti` using complete dataset.  
To verify `code` you can split `oscar.eo.txt` file to 1000 lines and train for `1` epoch

In [1]:
PYDEVD_DISABLE_FILE_VALIDATION = 1

In [2]:
## Requirements
!pip install transformers
!pip install tokenizers
!pip install kaggle
!pip install datasets
!pip install transformers[torch]
!pip install accelerate -U
!pip install torchinfo
!pip install evaluate
# !pip install torchviz



In [3]:
import os
%load_ext autoreload
%autoreload 2
import gc
gc.collect()

26

In [4]:
## mount Google Drive is using Google Colab
# from google.colab import drive
# drive.mount('/content/gdrive', force_remount=True)

In [5]:
## Set the path to the data folder, datafile and output folder and files
root_folder = '/content'
model_name = 'RobertaMLM'
dataname = 'oscar_eo'
if os.path.exists('/content/drive/My Drive/'):
  root_folder = '/content/drive/My Drive/'

data_folder = os.path.abspath(os.path.join(root_folder, 'datasets', model_name, dataname))
model_folder = os.path.abspath(os.path.join(root_folder, 'models', model_name, dataname))
tokenizer_dir = os.path.abspath(os.path.join(root_folder, 'tokenizer',  model_name, dataname))

print(f'data folder:{data_folder}')
print(f'model folder:{model_folder}')
print(f'tokenize folder:{tokenizer_dir}')

data folder:/content/datasets/RobertaMLM/oscar_eo
model folder:/content/models/RobertaMLM/oscar_eo
tokenize folder:/content/tokenizer/RobertaMLM/oscar_eo


create directories

In [6]:
os.makedirs(data_folder, exist_ok=True)
os.makedirs(model_folder, exist_ok=True)
os.makedirs(tokenizer_dir, exist_ok=True)

### Dataset
The Notebook is more focussed on training a language model from scratch and less on Exploratory Data Analysis techniques and Data gathering techniques.

`oscar.eo.txt` is a one million line corpus

In [7]:
## Fetch dataset
## in this Notebook, we will use a text file containing Esperanto sentences
!wget -c https://cdn-datasets.huggingface.co/EsperBERTo/data/oscar.eo.txt

--2024-04-21 00:51:41--  https://cdn-datasets.huggingface.co/EsperBERTo/data/oscar.eo.txt
Resolving cdn-datasets.huggingface.co (cdn-datasets.huggingface.co)... 13.226.225.93, 13.226.225.98, 13.226.225.32, ...
Connecting to cdn-datasets.huggingface.co (cdn-datasets.huggingface.co)|13.226.225.93|:443... connected.
HTTP request sent, awaiting response... 416 Requested Range Not Satisfiable

    The file is already fully retrieved; nothing to do.



Use device CUDA - **Compute Unified Device Architecture**

In [8]:
import torch
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device, torch.cuda.is_available())

cuda:0 True


### Why use a specific Tokenizer?
REF: https://huggingface.co/docs/transformers/tokenizer_summary  
Depending on the rules we apply for tokenizing a text, a different tokenized output is generated for the same text. A pretrained model only performs properly if you feed it an input that was tokenized with the same rules that were used to tokenize its training data.  

In general, transformers models rarely have a vocabulary size greater than 50,000, especially if they are pretrained only on a single language.  

Character tokenization is very simple and would greatly reduce memory and time complexity but it makes it much harder for the model to learn meaningful input representations, and often accompanied by a loss of performance.

To get the best of both worlds, `transformers models use a hybrid` between word-level and character-level tokenization called subword tokenization.

In [9]:
# corpus file to process - total line count - 974291 ~1mil
file_to_process ='/content/oscar.eo.txt'
# file_to_process ='oscar.eo.txt'

In [10]:
# ### Only RUN this cell if you plan to decrease size of file
# ### - decreased since Colab is crashing on this dataset
!head -10000 /content/oscar.eo.txt > /content/oscar.eo_small.txt
file_to_process ='/content/oscar.eo_small.txt'

# local machine
# !head -1000 oscar.eo.txt > oscar.eo_small.txt
# file_to_process ='oscar.eo_small.txt'

In [11]:
## preprocess text
# Represents a Byte-level BPE as introduced by OpenAI with their GPT-2 model
from tokenizers import ByteLevelBPETokenizer
tokenizer = ByteLevelBPETokenizer()

generaly special_tokens list is  "&lt;s&gt;  &lt;pad&gt; &lt;unk&gt;   &lt;/s&gt;   &lt;mask&gt;"  
since we are training from scratch, special tokens is [&nbsp;]  

### Train Tokenizer
REF: https://discuss.huggingface.co/t/tokenizer-progress-bar/1147/2



In [12]:
from tqdm import tqdm
special_tokens = ["", "", "", "", "", ""]

def tokenizer_with_progress(large_batch):
    for text in tqdm(large_batch, desc="Tokenizing", unit="text"):
        tokenizer.train(files=large_batch, vocab_size=52_000, min_frequency=2, special_tokens=special_tokens)

tokenizer_with_progress(file_to_process)

Tokenizing: 100%|██████████| 27/27 [01:19<00:00,  2.96s/text]


In [13]:
## save and check output
# did not work - tokenizer(text, vocab_size, min_frequency, special_tokens).to(device)
# this might work - encoding = tokenizer(text, return_tensors="pt").to(device)  # REF: https://github.com/huggingface/transformers/issues/16359
tokenizer.save_model(tokenizer_dir)

['/content/tokenizer/RobertaMLM/oscar_eo/vocab.json',
 '/content/tokenizer/RobertaMLM/oscar_eo/merges.txt']

In [14]:
vocab_file = os.path.join(tokenizer_dir,"vocab.json")
merges_file = os.path.join(tokenizer_dir,"merges.txt")
print(vocab_file,'\n',merges_file)

/content/tokenizer/RobertaMLM/oscar_eo/vocab.json 
 /content/tokenizer/RobertaMLM/oscar_eo/merges.txt


The tokenizer is optimized for Esperanto - native words are represented by unsplit tokens.  
To use in tokenizers combine the generated `vocab.json` and `merges.txt` files to create BBPE tokenizer and post-process

In [15]:
from tokenizers.implementations import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing

tokenizer = ByteLevelBPETokenizer(
  vocab_file,
  merges_file,
)

tokenizer._tokenizer.post_processor = BertProcessing(
  ("", tokenizer.token_to_id("")),
  ("", tokenizer.token_to_id("")),
)

tokenizer.enable_truncation(max_length=512)

test the tokenizer

In [16]:
text_line = "li estis:Karŝena, Ŝetar, Admata, Tarŝiŝ, Meres, Marsena, kaj Memuĥan"
print(tokenizer.encode(text_line))
print(tokenizer.encode(text_line).tokens)

Encoding(num_tokens=20, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])
['', 'li', 'Ġestis', ':', 'Kar', 'ÅĿena', ',', 'ĠÅľetar', ',', 'ĠAdmata', ',', 'ĠTarÅĿiÅĿ', ',', 'ĠMeres', ',', 'ĠMarsena', ',', 'Ġkaj', 'ĠMemuÄ¥an', '']


### Train a RoBERTa-like (BERT) model from scratch

In [17]:
train_batch_size = 16    # input batch size for training (check current default)
eval_batch_size = 8      # input batch size for testing (check current default)
epochs = 10               # number of epochs to train (check current default)
learning_rate = 1e-4     # learning rate (check current default)
weight_decay = 0.01
maxlength = 128
save_steps = 4096
save_total_limit = 1

Setup configuration for ROBERTa to train

In [18]:
from transformers import RobertaTokenizerFast
tokenizer = RobertaTokenizerFast.from_pretrained(tokenizer_dir, max_length=512)

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'BertTokenizer'. 
The class this function is called from is 'RobertaTokenizer'.
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'BertTokenizer'. 
The class this function is called from is 'RobertaTokenizerFast'.


In [19]:
from transformers import RobertaConfig
config = RobertaConfig(
  vocab_size=52_000,
  max_position_embeddings=514,
  num_attention_heads=12,
  num_hidden_layers=6,
  type_vocab_size=1,
)

# initialize from config (above cell)
from transformers import RobertaForMaskedLM
model = RobertaForMaskedLM(config=config)

create tokenizer, and intiate model

since we are setting up training from scratch  
initialize from config (cell above) and not from a pre-trained model

### Build training set
create a `Custom` Dataset class to read the `text` file line-by-line  
- since this is `Esperanto` language file use encoding="utf8"
- maxlength = 128
- create a `examples` dataset key to store tokenized `input_ids`

In [20]:
import datasets
from torch.utils.data import Dataset, DataLoader

class CustomDataset(Dataset):
    def __init__(self, df, tokenizer):
      self.examples = []
      with open(df, 'r', encoding="utf8") as f:
        lines = f.readlines()
        num_lines = len(lines)
        # for i, line in enumerate(tqdm(f))
        for example in tqdm(lines, total=num_lines):
          # for example in lines:
          x=tokenizer.encode_plus(example, max_length = maxlength, truncation=True, padding=True)
          self.examples += [x.input_ids]

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, i):
        # We’ll pad at the batch level.
        return torch.tensor(self.examples[i])

`tokenize` Esperanto text file

In [21]:
dataset = CustomDataset(file_to_process, tokenizer)

100%|██████████| 10000/10000 [00:03<00:00, 3200.09it/s]


### create a `DataCollator` for `LanguageModeling`  
REF: https://huggingface.co/docs/transformers/main_classes/data_collator  

- Data collators are objects that will form a batch by using a list of dataset elements as input.
- Apply some random data augmentation (like random masking) on the formed batch (`Language Modeling`)
- [Example Scripts](https://huggingface.co/docs/transformers/examples)
- [Example Notebooks](https://huggingface.co/docs/transformers/notebooks)

In [22]:
from transformers import DataCollatorForLanguageModeling
# from transformers import DataCollatorWithPadding

# Define the Data Collator - use Masking with 15% probability of <mask> usage
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

gc.collect()

4

### split dataset -to- train and test sets - use `random_split`

In [23]:
## split into train and test sets
train_size = int(0.8 * len(dataset))
test_size = len(dataset) - train_size
train_dataset, test_dataset = torch.utils.data.random_split(dataset, [train_size, test_size])
eval_size = int(0.5 * len(test_dataset))
eval_dataset, test_dataset = torch.utils.data.random_split(test_dataset, [eval_size, eval_size])

In [24]:
## check one to verify correctness
sample = train_dataset.dataset.examples[0]
len(sample), sample

(24,
 [46038,
  2462,
  2274,
  12522,
  1004,
  13156,
  1004,
  2508,
  12627,
  5712,
  10405,
  1004,
  9365,
  5712,
  8523,
  1004,
  2166,
  1004,
  5880,
  5712,
  4978,
  12844,
  199,
  46039])

### Visualize Model Summary and Graph

In [25]:
# print config
config

RobertaConfig {
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.38.2",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 52000
}

In [26]:
## number of parameters
model.num_parameters()

83504416

In [27]:
gc.collect()

from torchinfo import summary
summary(model)

Layer (type:depth-idx)                                       Param #
RobertaForMaskedLM                                           --
├─RobertaModel: 1-1                                          --
│    └─RobertaEmbeddings: 2-1                                --
│    │    └─Embedding: 3-1                                   39,936,000
│    │    └─Embedding: 3-2                                   394,752
│    │    └─Embedding: 3-3                                   768
│    │    └─LayerNorm: 3-4                                   1,536
│    │    └─Dropout: 3-5                                     --
│    └─RobertaEncoder: 2-2                                   --
│    │    └─ModuleList: 3-6                                  42,527,232
├─RobertaLMHead: 1-2                                         --
│    └─Linear: 2-3                                           590,592
│    └─LayerNorm: 2-4                                        1,536
│    └─Linear: 2-5                                           39,98

### Dataloaders
REF: https://pytorch.org/tutorials/beginner/basics/data_tutorial.html?highlight=dataloader
DataLoader and Dataset allow you to use pre-loaded datasets as well as your own data  

- Dataset stores the samples and their corresponding labels
- DataLoader wraps an iterable around the Dataset to enable easy access to the samples.


In [28]:
test_dataloader = DataLoader(
    test_dataset.dataset.dataset.examples,
    batch_size=eval_batch_size,
    collate_fn=data_collator,
    shuffle=True,
)

eval_dataloader = DataLoader(
    eval_dataset.dataset.dataset.examples,
    batch_size=eval_batch_size,
    collate_fn=data_collator,
    shuffle=True,
)

In [29]:
train_dataloader = DataLoader(
    train_dataset.dataset.examples,
    batch_size=train_batch_size,
    collate_fn=data_collator,
    shuffle=True,
)

### Define TrainingArguments to Trainer
Customize how the `model` is going to be trained.  
- hyperparameters you can tune - example:
  - `learning_rate`, `weight_decay`, `batch_size` (train/eval), `epochs`
- flags for activating different training options

In [30]:
from transformers import Trainer, TrainingArguments
training_args = TrainingArguments(
    output_dir=model_folder,
    overwrite_output_dir=True,
    evaluation_strategy = 'epoch',
    num_train_epochs=epochs,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    per_device_train_batch_size=train_batch_size, # use these instead (preferred)
    per_device_eval_batch_size=eval_batch_size,
    save_steps=save_steps,
    # eval_steps=eval_steps,
    # remove_unused_columns=False,
    save_total_limit=1,
)

### Create Trainer for the model  
REF: [Huggingface](https://github.com/huggingface/transformers/blob/main/docs/source/en/main_classes/trainer.md)
[Google Colab](https://colab.research.google.com/github/huggingface/notebooks/blob/main/transformers_doc/en/training.ipynb)
Supports `distributed training` on multiple GPUs/TPUs  
Goes hand-in-hand with `TrainingArguments` class  
`Trainer` class is optimized for `Huggingface` Transformer models makes it easier to start training instead of writing your own training loop

### Evaluation Metric
REF: [Google Colab Training](https://colab.research.google.com/github/huggingface/notebooks/blob/main/transformers_doc/en/training.ipynb#scrollTo=I7_tzailp4SP)  

Trainer does not automatically evaluate model performance. pass a function to compute and report metrics  
Before passing your predictions to compute, convert predictions to logits - `Transformer` models return logits

In [31]:
# import numpy as np
# import evaluate
# metric = evaluate.load("accuracy")

# ## compute_metrics function
# def compute_metrics(eval_pred):
#   logits, labels = eval_pred
#   predictions = np.argmax(logits, axis=-1)

#   return metric.compute(predictions=predictions, references=labels)

In [32]:
gc.collect()

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataloader.dataset,
    eval_dataset=eval_dataloader.dataset,
    ## train_dataset=train_dataset.dataset.examples,
    ## eval_dataset=eval_dataset.dataset.examples,
    # compute_metrics=compute_metrics,                 ## This requires a lot of memory
    # prediction_loss_only=True,
)

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)


In [33]:
training_args.device

device(type='cuda', index=0)

In [34]:
# Train the model
trainer.train()

Epoch,Training Loss,Validation Loss
1,8.0928,7.522412
2,7.4918,7.171335
3,7.2679,6.969954
4,6.9856,6.790073
5,6.8645,6.684762
6,6.7558,6.573242
7,6.6915,6.47845
8,6.5607,6.385944
9,6.5437,6.366913
10,6.4971,6.355762


TrainOutput(global_step=6250, training_loss=6.93523205078125, metrics={'train_runtime': 1338.8347, 'train_samples_per_second': 74.692, 'train_steps_per_second': 4.668, 'total_flos': 3201789003918336.0, 'train_loss': 6.93523205078125, 'epoch': 10.0})

### Perplexity
REF: https://huggingface.co/docs/transformers/perplexity   
Perplexity (PPL) is one of the most common metrics for evaluating language models.   

This metric applies specifically to classical language models (sometimes called autoregressive or causal language models) and is not well defined for masked language models like BERT  

Perplexity is defined as the exponentiated average negative log-likelihood of a sequence.

In [35]:
import math
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
print(eval_results)

Perplexity: 575.65
{'eval_loss': 6.355495929718018, 'eval_runtime': 30.7466, 'eval_samples_per_second': 325.239, 'eval_steps_per_second': 40.655, 'epoch': 10.0}


The run is currently for `10` epoch. If you train for more epochs, the loss will decrease. Also we are training on only 10k lines of the available data.

### save model and tokenizer to disk  
save model and tokenizer to disk for future tasks

In [36]:
trainer.save_model(model_folder)
configuration = model.config             # not saving to model_folder

In [37]:
# save config.json to both model_folder and tokenizer_dir
model.save_pretrained(model_folder)
model.save_pretrained(tokenizer_dir)

### Checking the trained model using a Pipeline
Looking at the training and eval losses going down is not enough, we would like to apply our model to check if our language model is learning anything interesting. An easy way is via the FillMaskPipeline.

Pipelines are simple wrappers around tokenizers and models. We can use the `fill-mask` pipeline where we input a sequence containing a masked token

In [39]:
gc.collect()
from transformers import pipeline
# Create a Fill mask pipeline
fill_mask = pipeline(
    "fill-mask",
    model=model_folder,
    tokenizer=tokenizer_dir,
)

In [40]:
### Inference using Test Set
# Test some examples
## The test text: "Lerni Esperanton per telefono, novaĵoj Poŝtkarto 120 jaroj de fervojo Svitavy-Polička"
fill_mask("Lerni Esperanton per telefono, novaĵoj <mask> 120 jaroj de <mask> Svitavy-Polička")

[[{'score': 0.08599499613046646,
   'token': 259,
   'token_str': ' la',
   'sequence': '<s>Lerni Esperanton per telefono, novaĵoj la 120 jaroj de<mask> Svitavy-Polička</s>'},
  {'score': 0.0667266994714737,
   'token': 13,
   'token_str': '-',
   'sequence': '<s>Lerni Esperanton per telefono, novaĵoj- 120 jaroj de<mask> Svitavy-Polička</s>'},
  {'score': 0.053006891161203384,
   'token': 12,
   'token_str': ',',
   'sequence': '<s>Lerni Esperanton per telefono, novaĵoj, 120 jaroj de<mask> Svitavy-Polička</s>'},
  {'score': 0.03821343928575516,
   'token': 285,
   'token_str': ' kaj',
   'sequence': '<s>Lerni Esperanton per telefono, novaĵoj kaj 120 jaroj de<mask> Svitavy-Polička</s>'},
  {'score': 0.025875721126794815,
   'token': 270,
   'token_str': ' de',
   'sequence': '<s>Lerni Esperanton per telefono, novaĵoj de 120 jaroj de<mask> Svitavy-Polička</s>'}],
 [{'score': 0.09585653990507126,
   'token': 259,
   'token_str': ' la',
   'sequence': '<s>Lerni Esperanton per telefono, nov

In [41]:
## The test text: "La teksto disponeblas laŭ la permesilo Krea Komunaĵo Atribuite-Samkondiĉe"
fill_mask("La teksto <mask> laŭ la permesilo Krea <mask> Atribuite-Samkondiĉe")

[[{'score': 0.973590075969696,
   'token': 1055,
   'token_str': ' disponeblas',
   'sequence': '<s>La teksto disponeblas laŭ la permesilo Krea<mask> Atribuite-Samkondiĉe</s>'},
  {'score': 0.0020416462793946266,
   'token': 1068,
   'token_str': ' Neadaptita',
   'sequence': '<s>La teksto Neadaptita laŭ la permesilo Krea<mask> Atribuite-Samkondiĉe</s>'},
  {'score': 0.0009146895026788116,
   'token': 1432,
   'token_str': ' sistemo',
   'sequence': '<s>La teksto sistemo laŭ la permesilo Krea<mask> Atribuite-Samkondiĉe</s>'},
  {'score': 0.0008044862770475447,
   'token': 984,
   'token_str': ' Komunaĵo',
   'sequence': '<s>La teksto Komunaĵo laŭ la permesilo Krea<mask> Atribuite-Samkondiĉe</s>'},
  {'score': 0.0007535951444879174,
   'token': 1965,
   'token_str': ' informojn',
   'sequence': '<s>La teksto informojn laŭ la permesilo Krea<mask> Atribuite-Samkondiĉe</s>'}],
 [{'score': 0.9900994896888733,
   'token': 984,
   'token_str': ' Komunaĵo',
   'sequence': '<s>La teksto<mask> l

### check scores

### ROUGE – Recall-Oriented Understudy for Gisting Evaluation
REF: https://mlexplained.blog/2023/07/08/large-language-model-llm-evaluation-metrics-bleu-and-rouge/  
`Usually used for SUMMARIZATION` tasks  

Evaluation metric for assessing the quality of automatic summaries generated by text summarization systems. It measures the similarity between the generated summary and one or more reference summaries.  

Calculates the precision and recall scores by comparing the n-gram units (such as words or sequences of words) in the generated summary with those in the reference summaries. It focuses on the recall score, which measures how much of the important information from the reference summaries is captured by the generated summary.

### BLEU – Bilingual Evaluation Understudy
REF: https://mlexplained.blog/2023/07/08/large-language-model-llm-evaluation-metrics-bleu-and-rouge/  
https://huggingface.co/spaces/evaluate-metric/bleu  
`Usually used for MACHINE TRANSLATION` tasks  

Evaluates the quality of machine-generated translations against one or more reference translations. It measures the similarity between the machine-generated translation and the reference translations based on the n-grams.  

BLEU score ranges from 0 to 1, with a higher score indicating a better match between the generated translation and the references. A score of 1 means a perfect match, while a score of 0 means no overlap between the generated and reference translations.

### BERT Score
REF: https://huggingface.co/spaces/evaluate-metric/bertscore  
`computes a similarity score for each token in the candidate sentence with each token in the reference sentence`  

BERTScore leverages the pre-trained contextual embeddings from BERT and matches words in candidate and reference sentences by cosine similarity. It has been shown to correlate with human judgment on sentence-level and system-level evaluation. Moreover, BERTScore computes precision, recall, and F1 measure, which can be useful for evaluating different language generation tasks.

BERTScore is an automatic evaluation metric which leverages the pre-trained contextual embeddings from [BERT](https://huggingface.co/bert-base-uncased) models and matches words in candidate and reference sentences by cosine similarity.


### References:
**Huggingface Blog** [How to train a new language model from scratch using Transformers and Tokenizers](https://huggingface.co/blog/how-to-train)   
**Huggingface official documentation** [Encoder-Decoder models](https://huggingface.co/transformers/model_doc/encoderdecoder.html)  
**Huggingface Model** [RoBERTa documentation](https://huggingface.co/transformers/model_doc/roberta.html)  
**Tokenizer** [Huggingface Tokenizer Documentation](https://huggingface.co/transformers/tokenizer_summary.html)  
**Wikipedia definition** [Byte pair encoding](https://en.wikipedia.org/wiki/Byte_pair_encoding)  
**Medium Blog** [Create a Tokenizer and train a Huggingface Model](https://medium.com/analytics-vidhya/create-a-tokenizer-and-train-a-huggingface-roberta-model-from-scratch-f3ed1138180c)