# Word-Level Text Generation with GPT-2

GPT-2 is a large transformer-based language model trained on a dataset of 8 million web pages. Its objective is to predict the next word given all the previous words in a text.

We'll use the Hugging Face Transformers library, which provides more than 32 pretrained models for NLG and NLU (ready to use in PyTorch and TensorFlow 2.0).

In [1]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)
%cd 'drive/MyDrive/Colab Notebooks/nlg_tales_generation'

Mounted at /content/drive
/content/drive/MyDrive/Colab Notebooks/nlg_tales_generation


In [2]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/cd/40/866cbfac4601e0f74c7303d533a9c5d4a53858bd402e08e3e294dd271f25/transformers-4.2.1-py3-none-any.whl (1.8MB)
Collecting tokenizers==0.9.4
  Downloading https://files.pythonhosted.org/packages/0f/1c/e789a8b12e28be5bc1ce2156cf87cb522b379be9cadc7ad8091a4cc107c4/tokenizers-0.9.4-cp36-cp36m-manylinux2010_x86_64.whl (2.9MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ... done
  Created wheel for sacremoses: filename=sacremoses-0.0.43-cp36-none-any.whl size=893261 sha256=f60205ea113882432b8

In [3]:
from transformers import (
    GPT2Tokenizer,
    DataCollatorForLanguageModeling,
    TextDataset,
    GPT2LMHeadModel,
    TrainingArguments,
    Trainer,
    pipeline)

In [None]:
train_path = 'data/train.txt'
test_path = 'data/test.txt'

## Text tokenization

The tokenizer splits text into tokens (words or subwords, punctuation, etc.) and then converts them into numbers (ids) that can be fed to the model.

When using a pretrained transformers model, the associated pretrained tokenizer should be used so that text is split into tokens in the same way as during pretraining.
We can use either the tokenizer class associated with the model (e.g. GPT2Tokenizer) or the AutoTokenizer class.

Transformer models are trained on very large text corpora, so a word-level vocabulary would be huge, which increases memory use and computation time. To avoid this, transformers models use subword tokenization (a hybrid between word-level and character-level tokenization).

GPT-2 uses byte-level Byte-Pair Encoding (BPE) with space tokenization as pretokenization. Its vocabulary has 50,257 tokens: 256 byte-level base tokens, 50,000 learned merges, and a special end-of-text token.

Learn more:
* [GPT2Tokenizer Docs](https://huggingface.co/transformers/model_doc/gpt2.html#gpt2tokenizer)
* [Preprocessing data](https://huggingface.co/transformers/preprocessing.html)
* [Summary of tokenizers](https://huggingface.co/transformers/tokenizer_summary.html)
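
To make the merge idea behind BPE more concrete, here is a minimal sketch that learns a few merges on a toy corpus. It is only an illustration under simplifying assumptions, not GPT-2's actual tokenizer: it works on characters rather than bytes, and the helper names (`merge_pair`, `learn_bpe_merges`) and the toy word counts are made up for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe_merges(word_counts, num_merges):
    """Learn up to `num_merges` merges from a {word: count} dict (toy, character-level)."""
    vocab = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        # Merge the most frequent pair everywhere it occurs.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        vocab = {merge_pair(symbols, best): count for symbols, count in vocab.items()}
    return merges, vocab


toy_counts = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}  # hypothetical corpus
merges, vocab = learn_bpe_merges(toy_counts, num_merges=3)
print(merges)
print(vocab)
```

Each merge adds one new token to the vocabulary, so the number of merges directly controls the final vocabulary size.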

In [None]:
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

When a sentence is fed to the tokenizer, it returns a dictionary with a list of input_ids (indices corresponding to each token). The dictionary also contains an attention_mask, which tells the model which tokens to attend to and which to ignore (e.g. padding tokens).

In [None]:
print('vocabulary size: %d, max sequence length: %d' % (tokenizer.vocab_size, tokenizer.model_max_length))
print('tokenize sequence "Once upon a time in a little village":', tokenizer('Once upon a time in a little village'))

vocabulary size: 50257, max sequence length: 1024
tokenize sequence "Once upon a time in a little village": {'input_ids': [7454, 2402, 257, 640, 287, 257, 1310, 15425], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}
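
The attention mask above contains only ones because a single sentence needs no padding. The sketch below shows, in plain Python, how the mask could look once sequences of different lengths are padded to a common length. The helper `pad_batch` is hypothetical (it is not part of the library), and using 50256 as the padding id is an assumption for illustration.

```python
def pad_batch(sequences, pad_id):
    """Right-pad lists of token ids to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding that the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


long_ids = [7454, 2402, 257, 640, 287, 257, 1310, 15425]  # ids printed above
short_ids = long_ids[:4]  # a shorter sequence, for illustration only
batch = pad_batch([long_ids, short_ids], pad_id=50256)  # pad id is an assumption
print(batch["input_ids"])
print(batch["attention_mask"])
```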


A DataCollator is a function used to form a batch from the elements of the train or test dataset. DataCollatorForLanguageModeling dynamically pads inputs to the maximum length of a batch if they are not all of the same length. GPT-2 uses causal language modeling (not masked language modeling): its goal is to predict the token following a sequence of tokens, so the model only attends to the left context. That is why mlm should be set to False.

The tokenizer and data collator are not applied to the data yet: the tokenizer is used when the datasets are loaded, and the data collator is passed to the Trainer object later.

Learn more:
* [Causal Language Modeling](https://huggingface.co/transformers/task_summary.html#causal-language-modeling)
* [Data Collator](https://github.com/huggingface/transformers/blob/master/src/transformers/data/data_collator.py)
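
As a rough illustration of causal language modeling, the sketch below builds labels as a copy of the inputs and then lists the (left context, next token) pairs that the loss is computed on. It is a simplification in plain Python: the helpers `make_labels` and `next_token_pairs` are hypothetical, padding is detected through the attention mask, and 50256 is assumed as the padding id.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default


def make_labels(input_ids, attention_mask):
    """Copy the inputs as labels, masking padded positions so they do not count in the loss."""
    return [tok if keep else IGNORE_INDEX for tok, keep in zip(input_ids, attention_mask)]


def next_token_pairs(input_ids, labels):
    """Pair each left context with the token to predict next (labels shifted by one)."""
    pairs = []
    for t in range(len(input_ids) - 1):
        target = labels[t + 1]
        if target != IGNORE_INDEX:
            pairs.append((input_ids[: t + 1], target))
    return pairs


input_ids = [7454, 2402, 257, 640, 50256, 50256]  # last two positions are padding (assumed)
attention_mask = [1, 1, 1, 1, 0, 0]
labels = make_labels(input_ids, attention_mask)
for context, target in next_token_pairs(input_ids, labels):
    print(context, "->", target)
```

With masked language modeling (mlm=True) the labels would instead mark randomly masked positions, which is not how GPT-2 was pretrained.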

In [None]:
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

## Load dataset

In order to use text data in the model, we should load it as a PyTorch Dataset object. A Dataset class needs to define \__init\__, \__getitem\__ and \__len\__. [This tutorial](https://huggingface.co/transformers/custom_datasets.html) provides examples of custom dataset objects.

We'll use the Hugging Face implementation of TextDataset. It splits the tokenized text into consecutive blocks of a fixed length (block_size), e.g. with block_size=128 it cuts the text every 128 tokens.

Learn more:
* [Hugging Face implementation](https://github.com/huggingface/transformers/blob/master/src/transformers/data/datasets/language_modeling.py)
* [Stack Overflow explanation](https://stackoverflow.com/questions/60001698/how-exactly-should-the-input-file-be-formatted-for-the-language-model-finetuning)
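
The block-splitting idea can be sketched as a tiny map-style dataset. This is a toy in plain Python, not the library class: `BlockDataset` is a hypothetical name, it takes already tokenized ids instead of a file path, and in real use it would subclass the PyTorch Dataset class and return tensors. Dropping the trailing remainder follows our reading of the linked implementation.

```python
class BlockDataset:
    """Toy map-style dataset that cuts a list of token ids into fixed-size blocks."""

    def __init__(self, token_ids, block_size):
        # Keep only full blocks; a trailing remainder shorter than block_size is dropped.
        self.examples = [
            token_ids[i : i + block_size]
            for i in range(0, len(token_ids) - block_size + 1, block_size)
        ]

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, index):
        return self.examples[index]


token_ids = list(range(10))  # stand-in for a tokenized text file
dataset = BlockDataset(token_ids, block_size=4)
print(len(dataset), dataset[0], dataset[1])
```

Because blocks are cut purely by token count, an example can start or end in the middle of a sentence.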

In [None]:
train_dataset = TextDataset(
    tokenizer=tokenizer,
    file_path=train_path,
    block_size=128)
     
test_dataset = TextDataset(
    tokenizer=tokenizer,
    file_path=test_path,
    block_size=128)



In [None]:
print(tokenizer.decode(train_dataset[5]))

 none to bring! Shall I tell him that Bedouins drove me away, and when I returned there were no dates? And he will answer, "You had slaves, did they not fight with the Bedouins?" It is the truth that will be best, and that will I tell him.' Then he went straight to his father, and found him sitting in his verandah with his five sons round him; and the lad bowed his head. 'Give me the news from the garden,' said the sultan. And the youth answered, 'The dates have all been eaten by some bird: there is not one left.'


## Fine-tune model

The Transformers library lets us fine-tune an existing (pretrained) model or train a model from scratch (with a custom configuration).

We'll use the pretrained GPT-2 model by loading it with the .from_pretrained() method. Just like the tokenizer, the model can be loaded with the class associated with the model (e.g. GPT2LMHeadModel) or with an auto class (AutoModelForCausalLM for language modeling; the plain AutoModel class loads the model without the language modeling head).

GPT2LMHeadModel is the GPT-2 model dedicated to language modeling tasks.

Learn more:
* [GPT2LMHeadModel](https://huggingface.co/transformers/model_doc/gpt2.html#gpt2lmheadmodel)
* [Fine-tuning a model](https://huggingface.co/transformers/training.html)



In [None]:
model = GPT2LMHeadModel.from_pretrained('gpt2')

The Trainer class provides an interface for feature-complete training: it enables training, fine-tuning, and evaluating any transformers model. It takes as input the model, training arguments, datasets, data collator, tokenizer, etc.

TrainingArguments collects the arguments that relate to the training loop, e.g. batch size, learning rate, and number of epochs.

Learn more:
* [Trainer](https://huggingface.co/transformers/main_classes/trainer.html#transformers.Trainer)
* [Training Arguments](https://huggingface.co/transformers/main_classes/trainer.html#trainingarguments)

In [None]:
training_args = TrainingArguments(
    output_dir = 'data/out', # the output directory for the model predictions and checkpoints
    overwrite_output_dir = True, # overwrite the content of the output directory
    per_device_train_batch_size = 32, # the batch size for training
    per_device_eval_batch_size = 32, # the batch size for evaluation
    learning_rate = 5e-5, # defaults to 5e-5
    num_train_epochs = 3, # total number of training epochs to perform
)

trainer = Trainer(
    model = model,
    args = training_args,
    data_collator=data_collator,
    train_dataset = train_dataset,
    eval_dataset = test_dataset
)

In [None]:
trainer.train()

Step,Training Loss
500,3.429335


TrainOutput(global_step=702, training_loss=3.392490919498976)
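
The numbers in this output can be related back to the training arguments with some quick arithmetic. The sketch below assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept; under those assumptions it estimates the size of the training set and converts the average loss into a rough training perplexity (the exponential of the mean cross-entropy).

```python
import math

global_step = 702      # from the TrainOutput above
num_train_epochs = 3   # from TrainingArguments
batch_size = 32        # per_device_train_batch_size, assuming a single device
training_loss = 3.392490919498976  # from the TrainOutput above

steps_per_epoch = global_step // num_train_epochs
# With the last partial batch kept, the number of training blocks lies in this range.
min_blocks = (steps_per_epoch - 1) * batch_size + 1
max_blocks = steps_per_epoch * batch_size
# Perplexity is the exponential of the mean cross-entropy loss.
perplexity = math.exp(training_loss)

print(f"steps per epoch: {steps_per_epoch}")
print(f"training blocks: between {min_blocks} and {max_blocks}")
print(f"training perplexity: {perplexity:.2f}")
```

This is a perplexity on the training data only; a fairer estimate would use the loss measured on the test dataset.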

In [None]:
trainer.save_model()

## Text generation

To use a model for inference, we can use a pipeline. The pipeline function is a wrapper around all the task-specific pipelines, e.g. calling pipeline with the task parameter set to "text-generation" returns the task-specific TextGenerationPipeline. TextGenerationPipeline uses any ModelWithLMHead to predict the next words following a specified prefix.

The pipeline object (named generator in this case) accepts the generation arguments described in PretrainedConfig (section: _Parameters for sequence generation_).

Learn more:
* [Pipeline](https://huggingface.co/transformers/main_classes/pipelines.html)
* [TextGenerationPipeline](https://huggingface.co/transformers/main_classes/pipelines.html#transformers.TextGenerationPipeline)
* [PretrainedConfig](https://huggingface.co/transformers/main_classes/configuration.html#transformers.PretrainedConfig)

In [4]:
generator = pipeline('text-generation', tokenizer='gpt2', model='data/out')

In [10]:
print(generator('Once upon a time', max_length=40)[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Once upon a time a certain king came to visit me, and told me that my wife was dead. I found it impossible to believe him, although I must admit that he had some great wealth and


## Text generation with different decoding methods

Decoding methods play an important role in the quality of the text produced by language models. Hugging Face Transformers makes it easy to use decoding methods such as greedy search, beam search, top-k sampling and top-p sampling.

Learn more:
* [Different decoding methods for language generation](https://github.com/huggingface/blog/blob/master/notebooks/02_how_to_generate.ipynb)

_Greedy search_ simply selects the word with the highest probability as the next word (the default text generation mode). This method can result in the model repeating itself and in missing high-probability words hidden behind low-probability words.

_Beam search_ keeps the num_beams most likely partial sequences (hypotheses) at each step and finally selects the one with the highest overall probability. It reduces the risk of missing hidden high-probability words.

In [19]:
text_beam = generator('Once upon a time',
                      max_length=40,
                      num_beams=5)
print(text_beam[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Once upon a time there was an old woman who lived in a house in the village. She was very poor, and she had no money to buy clothes for the day. She had no money to
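
The difference between greedy search and beam search can be seen on a toy example. The sketch below is plain Python with a hypothetical table of next-word probabilities that depend only on the previous word (a real language model conditions on the whole prefix). The function names and the probabilities are made up for illustration.

```python
import math

# Hypothetical next-word probabilities, conditioned only on the previous word.
NEXT_WORD_PROBS = {
    "The": {"nice": 0.5, "dog": 0.4, "car": 0.1},
    "nice": {"woman": 0.4, "house": 0.3, "guy": 0.3},
    "dog": {"has": 0.9, "and": 0.05, "runs": 0.05},
    "car": {"is": 0.6, "drives": 0.3, "turns": 0.1},
}


def greedy_search(start, steps):
    """Pick the single most probable next word at every step."""
    words, log_prob = [start], 0.0
    for _ in range(steps):
        candidates = NEXT_WORD_PROBS[words[-1]]
        best = max(candidates, key=candidates.get)
        words.append(best)
        log_prob += math.log(candidates[best])
    return words, math.exp(log_prob)


def beam_search(start, steps, num_beams):
    """Keep the `num_beams` most probable partial sequences at every step."""
    beams = [([start], 0.0)]
    for _ in range(steps):
        expanded = [
            (words + [word], log_prob + math.log(prob))
            for words, log_prob in beams
            for word, prob in NEXT_WORD_PROBS[words[-1]].items()
        ]
        beams = sorted(expanded, key=lambda beam: beam[1], reverse=True)[:num_beams]
    words, log_prob = beams[0]
    return words, math.exp(log_prob)


print(greedy_search("The", steps=2))
print(beam_search("The", steps=2, num_beams=2))
```

Greedy search commits to "nice" because it is the most probable first word, while beam search also keeps "dog" alive and finds that "dog has" is more probable overall.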


On the other hand, text written by humans does not always follow the highest-probability next words, which is why it's worth introducing some randomness while decoding the model output.

We can introduce _random sampling_ of the next word, that is, picking the next word according to its conditional probability distribution. What's more, by lowering the _softmax temperature_ (setting it below 1) we can make the distribution sharper (increasing the likelihood of high-probability words and decreasing that of low-probability words).

In [20]:
text_random_sampling = generator('Once upon a time',
                                 max_length=40,
                                 top_k=0,
                                 do_sample=True,
                                 temperature=0.7)
print(text_random_sampling[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Once upon a time a certain young lady came to the bank, and asked which of her friends was there, and the first friend said she was the farmer, and the second friend said she was the
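
The effect of the temperature can be checked with a few lines of plain Python. This is a minimal sketch: the logits are hypothetical scores for three candidate words, and `softmax_with_temperature` is a helper written for this example rather than a library function.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities after dividing them by `temperature`."""
    scaled = [logit / temperature for logit in logits]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - max_scaled) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate words
for temperature in (1.0, 0.7):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

With a temperature below 1 the most likely word gains probability mass and the least likely word loses it; a temperature above 1 would flatten the distribution instead.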


_Top-k sampling_ limits the sampling pool to the k words with the highest probability, which eliminates the most unlikely words.

On the other hand, _Top-p sampling_ chooses from the smallest possible set of words whose cumulative probability exceeds the probability p.
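
Both filters can be sketched in plain Python on a hypothetical next-word distribution. The helpers `top_k_filter` and `top_p_filter` are written for this example and only approximate what the library does internally (it works on logits over the whole vocabulary); after filtering, the remaining probabilities are renormalised and the next word is sampled from them.

```python
def top_k_filter(probs, k):
    """Keep the k most probable words and renormalise their probabilities."""
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(prob for _, prob in kept)
    return {word: prob / total for word, prob in kept}


def top_p_filter(probs, p):
    """Keep the smallest set of most probable words whose cumulative probability exceeds p."""
    kept, cumulative = [], 0.0
    for word, prob in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept.append((word, prob))
        cumulative += prob
        if cumulative > p:
            break
    total = sum(prob for _, prob in kept)
    return {word: prob / total for word, prob in kept}


# Hypothetical next-word distribution.
probs = {"king": 0.5, "queen": 0.25, "farmer": 0.15, "bird": 0.07, "date": 0.03}
print(top_k_filter(probs, k=2))
print(top_p_filter(probs, p=0.92))
```

Top-k always keeps the same number of words, whereas the size of the top-p pool adapts to how peaked or flat the distribution is.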

In [27]:
text_k_sampling = generator('Once upon a time',
                            max_length=40,
                            top_k=40,
                            do_sample=True)
print(text_k_sampling[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Once upon a time he said to his son, "I will give it as a present to those who go to church; for they have gone so wickedly on earth that nothing more will gain them


In [30]:
text_p_sampling = generator('Once upon a time',
                            max_length=40,
                            top_k=0,
                            top_p=0.92,
                            do_sample=True)
print(text_p_sampling[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Once upon a time my friend was born. The princess had spoken of me to someone, and said, "Is it now or never that I will be able to take your daughter to your castle?"
