## Bigram Model

The idea was to take the improved BPE algorithm from the last activity and build a bigram model on top of it. The changes to the BPE tokenizer were:
1. The main tokenizer parameter is now vocab_size instead of the number of merges;
2. The tokenizer is now trained on the given corpus: last time I only added the training code cell and did not execute it.

In [1]:
import os, sys

LIB_PATH = os.path.join(os.getcwd(), '../')
sys.path.append(LIB_PATH)

In [2]:
from atividade_1.bpe_tokenization import BPETokenization
from atividade_2.bigram import BigramLanguageModel
import torch
import json

In [3]:
# Used my local path for the corpus folder
corpus_folder = "/home/user/unb/unb_mestrado/2_semestre/topicos_nlp/nlp/atividade_1/corpus"
full_dataset = os.listdir(corpus_folder)

# torch generator for the sake of reproducibility
generator1 = torch.Generator().manual_seed(42)
train_dataset, test_dataset = torch.utils.data.random_split(full_dataset,
                                                            [0.8, 0.2],
                                                            generator=generator1)

Helper function for loading the corpus

In [4]:
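def corpus_generator(corpus_path: str, dataset):
    """Lazily yield the "text" field of each JSON file listed in `dataset`."""
    for file in dataset:
        # One document is read at a time, so the corpus never sits fully in memory
        with open(f"{corpus_path}/{file}", "r", encoding="utf-8") as f:
            json_file = json.load(f)
            yield json_file["text"]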

### Main process

1. Initialize the BPE tokenizer
2. Train the tokenizer on the training corpus
3. Initialize the bigram model: vocab_size is set to 2000 to limit memory consumption
4. Build the bigram matrix using the same training corpus
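
To make step 4 concrete, here is a minimal sketch of what building a bigram matrix can look like. It is not the code inside `BigramLanguageModel`: the function names `build_bigram_counts` and `next_token_probs`, the add-alpha smoothing, and the toy token ids are all assumptions made for illustration.

```python
from typing import Iterable, List


def build_bigram_counts(token_streams: Iterable[List[int]],
                        vocab_size: int) -> List[List[int]]:
    """Count how often token `nxt` directly follows token `prev`."""
    # Dense vocab_size x vocab_size table; memory grows quadratically with vocab_size
    counts = [[0] * vocab_size for _ in range(vocab_size)]
    for tokens in token_streams:
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def next_token_probs(counts: List[List[int]], prev: int,
                     alpha: float = 1.0) -> List[float]:
    """Add-alpha smoothed estimate of P(next | prev) from one row of counts."""
    row = counts[prev]
    total = sum(row) + alpha * len(row)
    return [(c + alpha) / total for c in row]


# Toy example with hypothetical token ids and a vocabulary of 4 tokens
docs = [[0, 1, 2, 1, 2], [0, 1, 3]]
counts = build_bigram_counts(docs, vocab_size=4)
probs = next_token_probs(counts, prev=1)
```

The dense table has vocab_size squared entries, which is one reason a small vocab_size keeps memory use down.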

In [5]:
vocab_size = 2000
bpe_tokenizer = BPETokenization(vocab_size=vocab_size)

bpe_tokenizer.train(corpus_generator(
    corpus_folder,
    train_dataset
    )
)

model = BigramLanguageModel(bpe_tokenizer, vocab_size)

model.build_bigram_matrix(corpus_generator(
    corpus_folder,
    train_dataset
))

8000it [00:02, 3226.33it/s]
Training bigram matrix: 8000it [11:24, 11.69it/s]


Then we can:
1. Generate text;
2. Calculate the perplexity score for the test corpus.

In [6]:
initial_text = "Filho de"
generated_text = model.generate_text(initial_text, num_tokens=20)
print(f"Generated text: {generated_text}")

perplexity = model.calculate_perplexity(corpus_generator(
    corpus_folder,
    test_dataset
))
print(f"Perplexity: {perplexity}")

???????stutesm 2ONCho de
Perplexity: 12.734767886062848
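
For reference, perplexity is usually defined as the exponential of the average negative log-probability the model assigns to each token given the previous one. The sketch below shows that computation for a bigram count table. It assumes add-alpha smoothing and a dense list-of-lists table, and `bigram_perplexity` is a hypothetical name, so it may differ from what `calculate_perplexity` does internally.

```python
import math
from typing import Iterable, List


def bigram_perplexity(counts: List[List[int]],
                      token_streams: Iterable[List[int]],
                      alpha: float = 1.0) -> float:
    """exp of the mean negative log P(next | prev) over all bigrams seen."""
    vocab_size = len(counts)
    row_totals = [sum(row) for row in counts]
    log_prob_sum = 0.0
    n_pairs = 0
    for tokens in token_streams:
        for prev, nxt in zip(tokens, tokens[1:]):
            # Smoothing keeps unseen bigrams from producing log(0)
            p = (counts[prev][nxt] + alpha) / (row_totals[prev] + alpha * vocab_size)
            log_prob_sum += math.log(p)
            n_pairs += 1
    if n_pairs == 0:
        raise ValueError("need at least one bigram to compute perplexity")
    return math.exp(-log_prob_sum / n_pairs)


# Sanity check: with all-zero counts the smoothed model is uniform over
# 4 tokens, so the perplexity should equal the vocabulary size.
uniform_counts = [[0] * 4 for _ in range(4)]
pp = bigram_perplexity(uniform_counts, [[0, 1, 2, 3]])
```

Because the average is taken per token, the score depends on the tokenizer: values obtained with different vocab_size settings are not directly comparable.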


There's still a lot of work to do!

For improvements:
- Increase vocab_size: a larger vocabulary may improve the text generation;
- Speed up the training time for the bigram model.
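
One possible direction for both points, sketched under assumptions: store the bigram counts sparsely and let `collections.Counter` do the pair counting for each document. I have not profiled `build_bigram_matrix`, so the real bottleneck may be elsewhere (for example in encoding the text with the tokenizer). `build_sparse_bigram_counts` is a hypothetical helper, not part of the existing code.

```python
from collections import Counter, defaultdict
from typing import DefaultDict, Iterable, List


def build_sparse_bigram_counts(
        token_streams: Iterable[List[int]]) -> DefaultDict[int, Counter]:
    """Map each previous token to a Counter of the tokens that follow it."""
    counts: DefaultDict[int, Counter] = defaultdict(Counter)
    for tokens in token_streams:
        # Counter tallies all adjacent pairs of one document in a single pass
        pair_counts = Counter(zip(tokens, tokens[1:]))
        for (prev, nxt), c in pair_counts.items():
            counts[prev][nxt] += c
    return counts


# Toy example with hypothetical token ids
docs = [[0, 1, 2, 1, 2], [0, 1, 3]]
sparse = build_sparse_bigram_counts(docs)
```

Only bigrams that actually occur are stored, so memory grows with the number of distinct pairs in the corpus rather than with vocab_size squared.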