# Code generation

Up until now, we've mostly been using pretrained models and fine-tuning them for new use cases by reusing the weights from pretraining. This is commonly referred to as transfer learning, and it's a very successful strategy for applying Transformer models to most real-world use cases where labeled data is sparse. In this chapter, we'll take a different approach and train a completely new model from scratch. This is a good approach to take if you have a lot of data and it is very different from the pretraining data used for the available models. However, pretraining a language model also requires considerably more compute resources than fine-tuning an existing one. Examples where it can make sense to train a new model include datasets consisting of musical notes, molecular sequences such as DNA, or programming languages. The latter have recently gained traction thanks to tools such as TabNine and GitHub's Copilot, powered by OpenAI's Codex model, that can generate long sequences of code. This task of text generation is best addressed with auto-regressive or causal language models such as GPT-2.

In this section we will build a scaled-down version of a code generation model: we'll focus on one-line completions instead of full functions or classes, using a subset of Python code. When working with data in Python you are in frequent contact with the Python data science stack, consisting of the `matplotlib`, `seaborn`, `pandas`, and `scikit-learn` libraries. When using those frameworks it's common to need to look up specific commands, so it would be nice if we could use a model to complete these calls for us.

## 1. Load the data

Python code is abundantly available from code repositories such as GitHub, which we can use to create a dataset by scraping for every Python repository. This was the approach taken in the Transformers textbook to pretrain a large GPT-2 model. Using a GitHub dump of about 180 GB containing roughly 20 million Python files called `codeparrot`, the authors built a dataset that they then shared on the Hugging Face Hub.

### The Codeparrot dataset

However, training on the full corpus is time- and compute-consuming, and we only need the subset of the dataset concerned with the Python data science stack. So, let's start by filtering the `codeparrot` dataset for all files that include any of the libraries in this stack. Because of the dataset's size, we want to avoid downloading it; instead, we'll use the streaming feature to filter it on the fly. To help us filter the code samples using the libraries we mentioned earlier, we'll use the following function:

In [1]:
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

# Fix a seed so results are comparable after a kernel restart (seeding is left commented out below)
SEED = 1234
# torch.manual_seed(SEED)
# torch.backends.cudnn.deterministic = True

torch.cuda.get_device_name(0)

cuda


'Tesla T4'

In [2]:
def any_keyword_in_string(string, keywords):
    for keyword in keywords:
        if keyword in string:
            return True
    return False

In [3]:
filters = ["pandas", "sklearn", "matplotlib", "seaborn"]
example_1 = "import numpy as np"
example_2 = "import pandas as pd"

print(
    any_keyword_in_string(example_1, filters), any_keyword_in_string(example_2, filters)
)

False True


We can use this to create a function that will stream the dataset and filter the elements we want:

In [4]:
!pip install datasets



In [5]:
from collections import defaultdict
from tqdm import tqdm
from datasets import Dataset

def filter_streaming_dataset(dataset, filters):
    filtered_dict = defaultdict(list)
    total = 0
    for sample in tqdm(iter(dataset)):
        total += 1
        if any_keyword_in_string(sample["content"], filters):
            for k, v in sample.items():
                filtered_dict[k].append(v)
    print(f"{len(filtered_dict['content'])/total:.2%} of data after filtering.")
    return Dataset.from_dict(filtered_dict)

Then we can simply apply this function to the streaming dataset:

In [6]:
# # This cell will take a very long time to execute, so you should skip it and go to the next one!
# from datasets import load_dataset

# split = "train"  # "valid"
# filters = ["pandas", "sklearn", "matplotlib", "seaborn"]

# data = load_dataset(f"transformersbook/codeparrot-{split}", split=split, streaming=True)
# filtered_data = filter_streaming_dataset(data, filters)

This leaves us with about 3% of the original dataset, which is still quite sizable: the resulting dataset is 6 GB and consists of 600,000 Python scripts!

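Filtering the full dataset can take 2-3 hours depending on your machine and bandwidth. If you don't want to go through this lengthy process yourself, Hugging Face provides the filtered dataset on the Hub for you to download. In this notebook, however, we load the much smaller WikiText-2 dataset instead and keep the CodeParrot lines commented out: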

In [7]:
#comment this if you are not using AIT proxy...
# import os

# os.environ['http_proxy']  = 'http://192.41.170.23:3128'
# os.environ['https_proxy'] = 'http://192.41.170.23:3128'

In [8]:
from datasets import load_dataset, DatasetDict

raw_datasets = load_dataset('wikitext', 'wikitext-2-raw-v1')
ds_train = raw_datasets["train"]
ds_valid = raw_datasets["validation"]

# ds_train = load_dataset("huggingface-course/codeparrot-ds-train", split="train")
# ds_valid = load_dataset("huggingface-course/codeparrot-ds-valid", split="validation")

# # Remove .shuffle().select(...) if you want to train on the whole dataset

# raw_datasets = DatasetDict(
#     {
#         "train": ds_train.shuffle().select(range(50000)),
#         "valid": ds_valid.shuffle().select(range(500))
#     }
# )

raw_datasets




DatasetDict({
    test: Dataset({
        features: ['text'],
        num_rows: 4358
    })
    train: Dataset({
        features: ['text'],
        num_rows: 36718
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 3760
    })
})

In [9]:
from datasets import load_dataset, DatasetDict
data_column = "text"

raw_datasets = DatasetDict(
    {
        "train": ds_train.shuffle().select(range(35000)), 
        "validation": ds_valid.shuffle().select(range(3500)),  
    }
)

raw_datasets

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 35000
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 3500
    })
})

Let's look at an example from the dataset. We'll just show the first 200 characters of each field:

In [11]:
for key in raw_datasets["train"][10]:
    print(f"{key.upper()}: {raw_datasets['train'][10][key][:200]}")

TEXT:  JACK : Gwendolen , it is a terrible thing for a man to find out suddenly that all his life he has been speaking nothing but the truth . Can you forgive me ? 



We can see that the `text` field contains the raw text that we want our model to train on. Now that we have a dataset, we need to prepare the texts so they're in a format suitable for pretraining.

## 2. Preprocessing

The first step will be to tokenize the data, so we can use it for training. Since our goal is mainly to autocomplete short function calls, we can keep the context size relatively small. This has the benefit that we can train the model much faster and it requires significantly less memory. If it is important for your application to have more context (for example, if you want the model to write unit tests based on a file with the function definition), make sure you increase that number, but also keep in mind that this comes with a greater GPU memory footprint. For now, let's fix the context size at 128 tokens, as opposed to the 1,024 or 2,048 used in GPT-2 or GPT-3, respectively.

Documents can contain many more than 128 tokens, so simply truncating the inputs to the maximum length would discard part of our dataset. Instead, we'll use the `return_overflowing_tokens` option to tokenize the whole input and split it into several chunks. We'll also use the `return_length` option to return the length of each created chunk automatically. Often the last chunk will be smaller than the context size, and we'll get rid of these pieces to avoid padding issues; we don't really need them as we have plenty of data anyway.

Let's see exactly how this works by looking at the first ten examples:

In [11]:
!pip install transformers



In [12]:
from transformers import AutoTokenizer

context_length = 128
tokenizer = AutoTokenizer.from_pretrained("huggingface-course/code-search-net-tokenizer")

outputs = tokenizer(
    raw_datasets["train"][:10]["text"],
    truncation=True,
    max_length=context_length,
    return_overflowing_tokens=True,
    return_length=True,
)

print(f"Input IDs length: {len(outputs['input_ids'])}")
print(f"Input chunk lengths: {(outputs['length'])}")
print(f"Chunk mapping: {outputs['overflow_to_sample_mapping']}")

Input IDs length: 11
Input chunk lengths: [7, 0, 128, 46, 54, 6, 0, 113, 45, 102, 0]
Chunk mapping: [0, 1, 2, 2, 3, 4, 5, 6, 7, 8, 9]


We can see that we get 11 segments in total from those ten examples. Looking at the chunk lengths, only sample 2 exceeds the context size: it is split into a full 128-token chunk and a 46-token remainder. Every other chunk is shorter than 128 tokens, and some are empty. We throw away all chunks that do not fill the context window, which keeps the inputs free of padding at the cost of discarding the shorter samples. With the `overflow_to_sample_mapping` field, we can also reconstruct which chunks belonged to which input samples.
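To make the chunking behaviour concrete, here is a minimal, dependency-free sketch of what splitting with overflow amounts to. It is not the tokenizer's implementation: `chunk_with_overflow` and the toy token IDs are hypothetical, and the sketch assumes consecutive chunks do not overlap. It only mirrors the `length` and `overflow_to_sample_mapping` fields printed above.

```python
def chunk_with_overflow(token_ids_per_doc, max_length):
    """Split each tokenized document into consecutive chunks of at most max_length tokens."""
    chunks, lengths, sample_mapping = [], [], []
    for doc_index, token_ids in enumerate(token_ids_per_doc):
        # An empty document still yields one (empty) chunk, like the 0-length entries above.
        for start in range(0, max(len(token_ids), 1), max_length):
            chunk = token_ids[start:start + max_length]
            chunks.append(chunk)
            lengths.append(len(chunk))
            sample_mapping.append(doc_index)
    return chunks, lengths, sample_mapping


# Hypothetical toy input: three "documents" of 7, 0 and 10 token IDs, context size 4.
docs = [list(range(7)), [], list(range(10))]
chunks, lengths, mapping = chunk_with_overflow(docs, max_length=4)
print(lengths)  # [4, 3, 0, 4, 4, 2]
print(mapping)  # [0, 0, 1, 2, 2, 2]

# Keep only the chunks that fill the whole context window.
full_chunks = [c for c, n in zip(chunks, lengths) if n == 4]
print(len(full_chunks))  # 3
```

Only three of the six toy chunks survive the length filter, which is the same trade-off the real preprocessing makes.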

With this operation we're using a handy feature of the `Dataset.map()` function in 🤗 Datasets, which is that it does not require one-to-one maps; we can create batches with more or fewer elements than the input batch. This is useful when doing operations like data augmentation or data filtering that change the number of elements. In our case, when tokenizing each element into chunks of the specified context size, we create many samples from each document. We just need to make sure to delete the existing columns, since they have a conflicting size. If we wanted to keep them, we could repeat them appropriately and return them within the `Dataset.map()` call:

In [13]:
def tokenize(element):
    outputs = tokenizer(
        element["text"],
        truncation=True,
        max_length=context_length,
        return_overflowing_tokens=True,
        return_length=True,
    )
    input_batch = []
    for length, input_ids in zip(outputs["length"], outputs["input_ids"]):
        if length == context_length:
            input_batch.append(input_ids)
    return {"input_ids": input_batch}


tokenized_datasets = raw_datasets.map(
    tokenize, batched=True, remove_columns=raw_datasets["train"].column_names
)
tokenized_datasets


DatasetDict({
    train: Dataset({
        features: ['input_ids'],
        num_rows: 13578
    })
    validation: Dataset({
        features: ['input_ids'],
        num_rows: 1401
    })
})
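As mentioned before the tokenization cell, the original columns could be kept by repeating each value once per chunk. A minimal sketch of that idea, assuming a hypothetical `titles` column and a mapping shaped like `overflow_to_sample_mapping` (neither exists in this notebook's dataset):

```python
def repeat_for_chunks(column_values, overflow_to_sample_mapping):
    """Repeat each original value once for every chunk created from its sample."""
    return [column_values[i] for i in overflow_to_sample_mapping]


# Hypothetical metadata column for three source documents.
titles = ["doc_a", "doc_b", "doc_c"]
# Mapping in the style of `overflow_to_sample_mapping`: doc_c produced two chunks.
mapping = [0, 1, 2, 2]
repeated = repeat_for_chunks(titles, mapping)
print(repeated)  # ['doc_a', 'doc_b', 'doc_c', 'doc_c']
```

If chunks are also filtered by length, the mapping would need to be filtered with the same condition so that both lists stay aligned.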

We now have 13,578 training examples with 128 tokens each, which corresponds to about 1.7 million tokens in total. For reference, OpenAI's GPT-3 and Codex models are trained on 300 and 100 billion tokens, respectively, where the Codex models are initialized from the GPT-3 checkpoints. Our goal in this section is not to compete with these models, which can generate long, coherent texts, but to create a scaled-down version providing a quick autocomplete function for data scientists.

Now that we have the dataset ready, let's set up the model!

## 3. Model

Our first step is to freshly initialize a GPT-2 model. We'll use the same configuration for our model as for the small GPT-2 model, so we load the pretrained configuration, make sure that the tokenizer size matches the model vocabulary size, and pass the `bos` and `eos` (beginning and end of sequence) token IDs:

In [14]:
from transformers import AutoTokenizer, GPT2LMHeadModel, AutoConfig

config = AutoConfig.from_pretrained(
    "gpt2",
    vocab_size=len(tokenizer),
    n_ctx=context_length,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)

With that configuration, we can load a new model. Note that this is the first time we don't use the `from_pretrained()` function, since we're actually initializing a model ourselves:

In [15]:
model = GPT2LMHeadModel(config)
model_size = sum(t.numel() for t in model.parameters())
print(f"GPT-2 size: {model_size/1000**2:.1f}M parameters")

GPT-2 size: 124.2M parameters
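The reported size can be checked by hand. The sketch below adds up the parameters of a GPT-2 small architecture (12 layers, hidden size 768, 1,024 positions), assuming the output layer shares its weights with the token embedding. The vocabulary size of 50,000 is an assumption made for illustration; the real value is `len(tokenizer)`. `gpt2_parameter_count` is a hypothetical helper, not a library function.

```python
def gpt2_parameter_count(vocab_size, n_positions=1024, n_embd=768, n_layer=12):
    """Count GPT-2 parameters, assuming the output layer reuses the token embedding."""
    token_embedding = vocab_size * n_embd
    position_embedding = n_positions * n_embd
    layer_norm = 2 * n_embd  # scale and bias
    # Attention: fused query/key/value projection plus the output projection.
    attention = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # Feed-forward network: expand to 4 * n_embd, then project back.
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    block = 2 * layer_norm + attention + mlp
    final_layer_norm = layer_norm
    return token_embedding + position_embedding + n_layer * block + final_layer_norm


total = gpt2_parameter_count(vocab_size=50_000)  # assumed vocabulary size
print(f"{total / 1000**2:.1f}M parameters")  # 124.2M parameters
```

Under these assumptions roughly 38 million of the parameters sit in the token embedding alone, so the vocabulary size has a noticeable effect on the total.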


### Keywords

Since we are mainly interested in sensible autocompletion for the data science libraries, it makes sense to give more weight to training samples that make more use of these libraries. We can easily identify these examples through the use of keywords such as `plt`, `pd`, `sk`, `fit`, and `predict`, which are the most frequent import names for `matplotlib.pyplot`, `pandas`, and `sklearn` as well as the fit/predict pattern of the latter. If these are each represented as a single token, we can easily check if they occur in the input sequence. Tokens can have a whitespace prefix, so we'll also check for those versions in the tokenizer vocabulary. To verify that it works, we'll add one test token which should be split into multiple tokens:

In [16]:
keytoken_ids = []
for keyword in [
    "plt",
    "@-@",
    "pd",
    "sk",
    "fit",
    "predict",
    " plt",
    " pd",
    " sk",
    " fit",
    " predict",
    "testtest",
]:
    ids = tokenizer([keyword]).input_ids[0]
    if len(ids) == 1:
        keytoken_ids.append(ids[0])
    else:
        print(f"Keyword has not single token: {keyword}")

Keyword has not single token: @-@
Keyword has not single token: testtest


### Loss

Great, that seems to work nicely! We can now write a custom loss function that takes the input sequence, the logits, and the key tokens we just selected as inputs. First we need to align the logits and inputs: the input sequence shifted by one to the right forms the labels, since the next token is the label for the current token. We can achieve this by starting the labels from the second token of the input sequence, since the model does not make a prediction for the first token anyway. Then we cut off the last logit, as we don't have a label for the token that follows the full input sequence. With that we can compute the loss per sample and count the occurrences of all keywords in each sample. Finally, we calculate the weighted average over all samples using the occurrences as weights. Since we don't want to throw away all the samples that have no keywords, we add 1 to the weights:

In [17]:
from torch.nn import CrossEntropyLoss
import torch

def keytoken_weighted_loss(inputs, logits, keytoken_ids, alpha=1.0):
    # Shift so that tokens < n predict n
    shift_labels = inputs[..., 1:].contiguous()
    shift_logits = logits[..., :-1, :].contiguous()
    # Calculate per-token loss
    loss_fct = CrossEntropyLoss(reduction="none")  # keep one loss value per token
    loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
    # Resize and average loss per sample
    loss_per_sample = loss.view(shift_logits.size(0), shift_logits.size(1)).mean(axis=1)
    # Calculate and scale weighting
    weights = torch.stack([(inputs == kt).float() for kt in keytoken_ids]).sum(
        axis=[0, 2]
    )
    weights = alpha * (1.0 + weights)
    # Calculate weighted average
    weighted_loss = (loss_per_sample * weights).mean()
    return weighted_loss
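The weighting step is easier to follow with small numbers. The sketch below reproduces only the final arithmetic of the function above in plain Python, using made-up per-sample losses and key-token counts; `keytoken_weighted_mean` is a hypothetical helper for illustration.

```python
def keytoken_weighted_mean(loss_per_sample, keytoken_counts, alpha=1.0):
    """Weight each sample's loss by alpha * (1 + number of key tokens), then average."""
    weights = [alpha * (1.0 + count) for count in keytoken_counts]
    weighted = [loss * w for loss, w in zip(loss_per_sample, weights)]
    # Like `.mean()` above, this divides by the number of samples, not by the sum of weights.
    return sum(weighted) / len(weighted)


# Hypothetical batch of three samples.
losses = [2.0, 3.0, 4.0]
counts = [0, 1, 3]  # key-token occurrences in each sample
result = keytoken_weighted_mean(losses, counts)
print(result)  # (2*1 + 3*2 + 4*4) / 3 = 8.0
```

Because the average is not normalized by the total weight, the result can be larger than any individual per-sample loss. Values from this loss are therefore not directly comparable with a plain cross-entropy loss.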

### Dataloaders

In [18]:
from torch.utils.data.dataloader import DataLoader

tokenized_datasets.set_format("torch")
train_dataloader = DataLoader(tokenized_datasets["train"], batch_size=32, shuffle=True)
eval_dataloader  = DataLoader(tokenized_datasets["validation"], batch_size=32)

### Optimizer

Next, we group the parameters so that the optimizer knows which ones will get an additional weight decay. Usually, all bias and LayerNorm weight terms are exempt from this; here's how we can do this:

In [19]:
weight_decay = 0.1


def get_grouped_params(model, no_decay=["bias", "LayerNorm.weight"]):
    params_with_wd, params_without_wd = [], []
    for n, p in model.named_parameters():
        if any(nd in n for nd in no_decay):
            params_without_wd.append(p)
        else:
            params_with_wd.append(p)
    return [
        {"params": params_with_wd, "weight_decay": weight_decay},
        {"params": params_without_wd, "weight_decay": 0.0},
    ]
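The grouping relies on substring matching against parameter names, so it is worth checking which names actually match. The sketch below replays the same logic on a few names written in the style of GPT-2's `named_parameters()`. The list is an illustrative subset, and `split_by_weight_decay` is a hypothetical helper. In this naming style the layer-norm weights are called `ln_1.weight`, `ln_2.weight`, and `ln_f.weight`, so the `"LayerNorm.weight"` marker would not catch them.

```python
def split_by_weight_decay(param_names, no_decay=("bias", "LayerNorm.weight")):
    """Return (names_with_decay, names_without_decay) using substring matching."""
    with_decay, without_decay = [], []
    for name in param_names:
        if any(marker in name for marker in no_decay):
            without_decay.append(name)
        else:
            with_decay.append(name)
    return with_decay, without_decay


# Illustrative subset of GPT-2 style parameter names.
names = [
    "transformer.wte.weight",
    "transformer.h.0.ln_1.weight",
    "transformer.h.0.ln_1.bias",
    "transformer.h.0.attn.c_attn.weight",
    "transformer.h.0.attn.c_attn.bias",
    "transformer.ln_f.weight",
]

decayed, exempt = split_by_weight_decay(names)
# With the default markers, the layer-norm weights still land in the decayed group.
print([n for n in decayed if ".ln_" in n])

# Hypothetical alternative markers that also catch the `ln_` prefix.
decayed_alt, exempt_alt = split_by_weight_decay(names, no_decay=("bias", "ln_"))
print(decayed_alt)
```

Printing the two groups for the real model is a cheap way to confirm that the intended parameters are exempt.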

Since we want to evaluate the model regularly on the validation set during training, let's write a function for that as well. It just runs through the evaluation dataloader and gathers all the losses across processes:

In [20]:
def evaluate():
    model.eval()
    losses = []
    for step, batch in enumerate(eval_dataloader):
        with torch.no_grad():
            outputs = model(batch["input_ids"], labels=batch["input_ids"])
            outputs.loss = outputs.loss.reshape(1)
        losses.append(accelerator.gather(outputs.loss))        
    loss = torch.mean(torch.cat(losses))
    # torch.exp returns inf on overflow instead of raising, so no try/except is needed
    perplexity = torch.exp(loss)
    return loss.item(), perplexity.item()
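The relationship between the two returned values is simple: perplexity is the exponential of the mean cross-entropy loss. A small stdlib-only sketch with made-up batch losses, where `perplexity_from_losses` is a hypothetical helper and the vocabulary size is an assumed value used only as a reference point:

```python
import math


def perplexity_from_losses(batch_losses):
    """Perplexity is the exponential of the mean cross-entropy loss (in nats)."""
    mean_loss = sum(batch_losses) / len(batch_losses)
    return mean_loss, math.exp(mean_loss)


# Hypothetical per-batch evaluation losses.
mean_loss, ppl = perplexity_from_losses([3.2, 3.0, 3.4])
print(f"loss={mean_loss:.3f} perplexity={ppl:.1f}")

# Reference point: guessing uniformly over V tokens gives loss ln(V), so perplexity V.
vocab_size = 50_000  # assumed vocabulary size, for illustration only
_, uniform_ppl = perplexity_from_losses([math.log(vocab_size)])
print(round(uniform_ppl))  # 50000
```

An untrained model is expected to score somewhere near the uniform reference, which is a useful sanity check before training starts.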

With the `evaluate()` function we can report loss and perplexity at regular intervals. Next, we redefine our model to make sure we train from scratch again:

In [21]:
model = GPT2LMHeadModel(config)

We can then define our optimizer, using the function from before to split the parameters for weight decay:

In [22]:
from torch.optim import AdamW

optimizer = AdamW(get_grouped_params(model), lr=5e-4)

### Accelerator

Now let's prepare the model, optimizer, and dataloaders so we can start training:

In [23]:
!pip install accelerate



In [23]:
from accelerate import Accelerator

accelerator = Accelerator(mixed_precision='fp16')

model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

Now that we have sent our `train_dataloader` to `accelerator.prepare()`, we can use its length to compute the number of training steps. Remember that we should always do this after preparing the dataloader, as that method will change its length. We use a classic linear schedule from the learning rate to 0:

In [24]:
from transformers import get_scheduler

num_train_epochs = 10
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    name="linear",
    optimizer=optimizer,
    num_warmup_steps=1_000,
    num_training_steps=num_training_steps,
)
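For intuition, a linear schedule with warm-up multiplies the base learning rate by a factor that rises linearly from 0 to 1 during warm-up and then falls linearly back to 0. The sketch below is a plain-Python rendering of that shape, not the library code; `linear_schedule_factor` is a hypothetical name. The value 4,250 assumes 10 epochs of 425 batches, which follows from the 13,578 training rows reported earlier and a batch size of 32.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given scheduler step."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 5e-4
warmup, total = 1_000, 4_250  # assumed: 10 epochs x 425 batches
for step in (0, 500, 1_000, 2_625, 4_250):
    print(step, base_lr * linear_schedule_factor(step, warmup, total))
```

One thing to keep in mind is that the scheduler only advances when `lr_scheduler.step()` is called. If it is stepped once per optimizer update while gradients are accumulated over several batches, it sees fewer steps than `num_training_steps`, so the learning rate may never leave the warm-up phase.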

### Repository

Lastly, to push our model to the Hub, we will need to create a `Repository` object in a working folder. First log in to the Hugging Face Hub, if you aren't logged in already. We'll determine the repository name from the model ID we want to give our model (feel free to replace `repo_name` with your own choice; it just needs to contain your username, which is what the function `get_full_repo_name()` adds):

In [25]:
from huggingface_hub import notebook_login

notebook_login()

Token is valid.
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [27]:
!pip3 install ipywidgets



In [26]:
from huggingface_hub import Repository, get_full_repo_name

model_name = "wikitext-lab-accelerate"
repo_name = get_full_repo_name(model_name)
repo_name

'briands/wikitext-lab-accelerate'

Then we can clone that repository in a local folder. If it already exists, this local folder should be an existing clone of the repository we are working with:

In [27]:
!git version

git version 2.25.1


In [28]:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "true"

output_dir = "wikitext-lab-accelerate"
repo = Repository(output_dir, clone_from=repo_name)

/content/wikitext-lab-accelerate is already a clone of https://huggingface.co/briands/wikitext-lab-accelerate. Make sure you pull the latest changes with `repo.git_pull()`.


We can now upload anything we save in `output_dir` by calling the `repo.push_to_hub()` method. This will help us upload the intermediate models at the end of each epoch.

## 5. Training

Before we train, let's run a quick test to see if the evaluation function works properly:

In [29]:
evaluate()

(10.975671768188477, 58435.08984375)

Those are very high values for loss and perplexity, but that's not surprising as we haven't trained the model yet. With that, we have everything prepared to write the core part of the training script: the training loop. In the training loop we iterate over the dataloader and pass the batches to the model. With the logits, we can then evaluate our custom loss function. We scale the loss by the number of gradient accumulation steps so as not to create larger losses when aggregating more steps. Before we optimize, we also clip the gradients for better convergence. Finally, every few steps we evaluate the model on the evaluation set with our new `evaluate()` function:

In [30]:
from tqdm.notebook import tqdm

gradient_accumulation_steps = 8
eval_steps = 5_000

model.train()
completed_steps = 0
for epoch in range(num_train_epochs):
    for step, batch in tqdm(
        enumerate(train_dataloader, start=1), total=len(train_dataloader)
    ):
        logits = model(batch["input_ids"]).logits
        loss = keytoken_weighted_loss(batch["input_ids"], logits, keytoken_ids)
        if step % 100 == 0:
            accelerator.print(
                {
                    "steps": completed_steps,
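                    # Note: `loss` has not been divided by the accumulation steps yet,
                    # so this logs the per-batch loss multiplied by that factor.
                    "loss/train": loss.item() * gradient_accumulation_steps,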
                }
            )
        loss = loss / gradient_accumulation_steps
        accelerator.backward(loss)
        if step % gradient_accumulation_steps == 0:
            accelerator.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()
            completed_steps += 1
        if (step % (eval_steps * gradient_accumulation_steps)) == 0:
            eval_loss, perplexity = evaluate()
            accelerator.print({"loss/eval": eval_loss, "perplexity": perplexity})
            model.train()
            accelerator.wait_for_everyone()
            unwrapped_model = accelerator.unwrap_model(model)
            unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
            if accelerator.is_main_process:
                tokenizer.save_pretrained(output_dir)
                repo.push_to_hub(
                    commit_message=f"Training in progress step {step}", blocking=False
                )




{'steps': 12, 'loss/train': 83.81831359863281}
{'steps': 24, 'loss/train': 77.86198425292969}
{'steps': 37, 'loss/train': 80.0496597290039}
{'steps': 49, 'loss/train': 74.07071685791016}



{'steps': 65, 'loss/train': 71.31040954589844}
{'steps': 77, 'loss/train': 66.23365020751953}
{'steps': 90, 'loss/train': 60.584495544433594}
{'steps': 102, 'loss/train': 60.019248962402344}



{'steps': 118, 'loss/train': 55.540504455566406}
{'steps': 130, 'loss/train': 54.31704330444336}
{'steps': 143, 'loss/train': 51.91812515258789}
{'steps': 155, 'loss/train': 52.05693817138672}



{'steps': 171, 'loss/train': 50.725738525390625}
{'steps': 183, 'loss/train': 52.33911895751953}
{'steps': 196, 'loss/train': 46.41450119018555}
{'steps': 208, 'loss/train': 46.84027099609375}



{'steps': 224, 'loss/train': 48.047454833984375}
{'steps': 236, 'loss/train': 50.17557907104492}
{'steps': 249, 'loss/train': 47.079437255859375}
{'steps': 261, 'loss/train': 48.679100036621094}



{'steps': 277, 'loss/train': 44.26673126220703}
{'steps': 289, 'loss/train': 42.83210754394531}
{'steps': 302, 'loss/train': 39.988067626953125}
{'steps': 314, 'loss/train': 44.098876953125}



{'steps': 330, 'loss/train': 41.926876068115234}
{'steps': 342, 'loss/train': 39.613712310791016}
{'steps': 355, 'loss/train': 40.79082489013672}
{'steps': 367, 'loss/train': 38.699424743652344}



{'steps': 383, 'loss/train': 39.34809112548828}
{'steps': 395, 'loss/train': 38.483131408691406}
{'steps': 408, 'loss/train': 36.67390441894531}
{'steps': 420, 'loss/train': 37.103267669677734}



{'steps': 436, 'loss/train': 35.62119674682617}
{'steps': 448, 'loss/train': 40.124000549316406}
{'steps': 461, 'loss/train': 36.13454818725586}
{'steps': 473, 'loss/train': 37.20380783081055}



{'steps': 489, 'loss/train': 34.34217834472656}
{'steps': 501, 'loss/train': 34.298309326171875}
{'steps': 514, 'loss/train': 33.47306823730469}
{'steps': 526, 'loss/train': 34.819580078125}
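The bookkeeping in the loop can be replayed without a model to see how often each branch fires. The sketch below only mirrors the counters (`step` restarting at 1 every epoch, an update every `gradient_accumulation_steps` batches, an evaluation every `eval_steps * gradient_accumulation_steps` batches). The figure of 425 batches per epoch is inferred from 13,578 samples at batch size 32, and `simulate_counters` is a hypothetical helper.

```python
def simulate_counters(batches_per_epoch, num_epochs, accumulation_steps, eval_steps):
    """Replay the loop's bookkeeping: count optimizer updates and evaluation triggers."""
    optimizer_updates, evaluations = 0, 0
    for _ in range(num_epochs):
        # `step` restarts at 1 every epoch, mirroring enumerate(..., start=1).
        for step in range(1, batches_per_epoch + 1):
            if step % accumulation_steps == 0:
                optimizer_updates += 1
            if step % (eval_steps * accumulation_steps) == 0:
                evaluations += 1
    return optimizer_updates, evaluations


updates, evals = simulate_counters(425, 10, accumulation_steps=8, eval_steps=5_000)
print(updates, evals)  # 530 0

# A hypothetical smaller interval, shown only for comparison.
print(simulate_counters(425, 10, accumulation_steps=8, eval_steps=50))  # (530, 10)
```

Under these assumptions the evaluation and checkpoint branch is never reached during training, so on a dataset of this size a smaller `eval_steps` may be worth considering.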


In [31]:
tokenizer.save_pretrained(output_dir)

('wikitext-lab-accelerate/tokenizer_config.json',
 'wikitext-lab-accelerate/special_tokens_map.json',
 'wikitext-lab-accelerate/vocab.json',
 'wikitext-lab-accelerate/merges.txt',
 'wikitext-lab-accelerate/added_tokens.json',
 'wikitext-lab-accelerate/tokenizer.json')

In [32]:
model.train()
accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)

In [33]:
repo.push_to_hub(
    commit_message=f"Training in progress step {step}", blocking=False
)

('https://huggingface.co/briands/wikitext-lab-accelerate/commit/9529d7ae59a4e825cfa7429ede8de324c73bd091',
 [push command, status code: running, in progress. PID: 19616])

By the way, I installed both Git and Git LFS, but there was an issue with Git LFS, so at first I could not push my trained model to huggingface.co. My initial attempts to find a solution were unsuccessful, but I have since figured out how to deal with it.

## 6. Try Beam Search

Now is the moment of truth: let's see how well the trained model actually works! We can see in the logs that the loss went down steadily, but to put the model to the test let's take a look at how well it works on some prompts. To do that we'll wrap the model in a text generation pipeline, and we'll put it on the GPU for fast generations if there is one available.

In [34]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer


tokenizer = GPT2Tokenizer.from_pretrained("briands/wikitext-lab-accelerate")

# add the EOS token as PAD token to avoid warnings
model = GPT2LMHeadModel.from_pretrained("briands/wikitext-lab-accelerate", pad_token_id=tokenizer.eos_token_id)


### Greedy Search

In [35]:
# encode context the generation is conditioned on
input_ids = tokenizer.encode('He had a recurring role in 2003', return_tensors='pt')

# generate text until the output length (which includes the context length) reaches 50
greedy_output = model.generate(input_ids, max_length=50)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
He had a recurring role in 2003, and the film was released in the United States. In the United States, the United States, the United States, the United States, the United States, was released in the Un


### Beam Search

In [36]:
# activate beam search and early_stopping
beam_output = model.generate(
    input_ids,  
    max_length=50, 
    num_beams=5, 
    early_stopping=True
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
He had a recurring role in 2003, and the film was released in the United Kingdom in the United Kingdom, and the United Kingdom. In the United Kingdom, Kingdom Hearts II


In [37]:
# set no_repeat_ngram_size to 2
beam_output = model.generate(
    input_ids, 
    max_length=50, 
    num_beams=5, 
    no_repeat_ngram_size=2, 
    early_stopping=True
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
He had a recurring role in 2003, and the film was released in the United Kingdom. The episode was written by Jackson, who had been released as the first episode of the second season. In July 2009, the episode


This seems to be better than the first two versions.

In [38]:
# set num_return_sequences > 1
beam_outputs = model.generate(
    input_ids, 
    max_length=50, 
    num_beams=5, 
    no_repeat_ngram_size=2, 
    num_return_sequences=5, 
    early_stopping=True
)

# now we have 5 output sequences
print("Output:\n" + 100 * '-')
for i, beam_output in enumerate(beam_outputs):
  print("{}: {}".format(i, tokenizer.decode(beam_output, skip_special_tokens=True)))

Output:
----------------------------------------------------------------------------------------------------
0: He had a recurring role in 2003, and the film was released in the United Kingdom. The episode was written by Jackson, who had been released as the first episode of the second season. In July 2009, the episode
1: He had a recurring role in 2003, and the film was released in the United Kingdom. The episode was written by Jackson, who had been released as the first episode of the second season. In July 2013, the episode
2: He had a recurring role in 2003, and the film was released in the United Kingdom. The episode was written by Jackson, who had been released as the first episode of the second season. In July 2008, the episode
3: He had a recurring role in 2003, and the film was released in the United Kingdom. The episode was written by Jackson, who had been released as the first episode of the second season. In June 2013, the episode
4: He had a recurring role in 2003, and the

### Sampling

In [39]:
# set seed to reproduce results. Feel free to change the seed though to get different results
import torch

torch.manual_seed(0)

# activate sampling and deactivate top_k by setting top_k sampling to 0
sample_output = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    top_k=0
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
He had a recurring role in 2003, while Cin var. Olivier had aired from Cooper in 1991. He claimed that " Seger's perfect view she would be done. " Bartin,dependent, in every


In [40]:
# set seed to reproduce results. Feel free to change the seed though to get different results
torch.manual_seed(0)

# use temperature to decrease the sensitivity to low probability candidates
sample_output = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    top_k=0, 
    temperature=0.7
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
He had a recurring role in 2003, and that is the case of the British British championship was played by the name of the Film. The previous film was struggled from the 2009 – 1 –
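The effect of `temperature=0.7` can be illustrated on a toy distribution. Temperature divides the logits before the softmax, so values below 1 sharpen the distribution and make low-probability tokens less likely to be sampled. The logits below are made up, and `softmax_with_temperature` is a hypothetical helper rather than the code path used by `generate()`.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical next-token scores
probs_default = softmax_with_temperature(logits, 1.0)
probs_cooler = softmax_with_temperature(logits, 0.7)
print([round(p, 3) for p in probs_default])
print([round(p, 3) for p in probs_cooler])
```

With the lower temperature, more probability mass moves to the highest-scoring token and less to the tail.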


### Top-K Sampling

In [41]:
# set seed to reproduce results. Feel free to change the seed though to get different results
torch.manual_seed(0)

# set top_k to 50
sample_output = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    top_k=50
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
He had a recurring role in 2003, while Cinth and Olivier had aired off as the best of the album. He was " Alice in his first career with the first role, but he was going to use the final


### Top-p (nucleus) sampling

In [42]:
# set seed to reproduce results. Feel free to change the seed though to get different results
torch.manual_seed(0)

# deactivate top_k sampling and sample only from 92% most likely words
sample_output = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    top_p=0.92, 
    top_k=0
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
He had a recurring role in 2003, while Cinth and Olivier had aired off as Angelou ( 2004 ), who, Seger's Orien City college, received a team, but led the chapter


In [43]:
# set seed to reproduce results. Feel free to change the seed though to get different results
torch.manual_seed(0)

# set top_k = 50 and set top_p = 0.95 and num_return_sequences = 3
sample_outputs = model.generate(
    input_ids,
    do_sample=True, 
    max_length=50, 
    top_k=50, 
    top_p=0.95, 
    num_return_sequences=3
)

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
  print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

Output:
----------------------------------------------------------------------------------------------------
0: He had a recurring role in 2003, while Olivier stated " You've not know he's best to continue it as you just you ’ s going it've to be very much to be one, but... she is no
1: He had a recurring role in 2003, with her debut of the " Films " from which she made several seasons and her father ". She was a resembling that she was a good time with her father, but also
2: He had a recurring role in 2003 that he took place in the first season to be unreleased, and the first goal in the last week. As Ceres scored four, Sarnia on 10 March 2003 in four years old, C


Top-k and Top-p seem to be better than Beam Search
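Both strategies restrict sampling to a subset of the vocabulary and then renormalize. The sketch below shows that filtering step on a made-up distribution. It is a simplified illustration, not the implementation inside `generate()`, and `top_k_filter` and `top_p_filter` are hypothetical helpers.

```python
def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize their probabilities."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


probs = [0.5, 0.25, 0.125, 0.0625, 0.0625]  # hypothetical next-token distribution
print(top_k_filter(probs, k=2))    # keeps tokens 0 and 1
print(top_p_filter(probs, p=0.8))  # keeps tokens 0, 1 and 2
```

Top-k always keeps a fixed number of candidates, whereas top-p keeps fewer candidates when the model is confident and more when the distribution is flat.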

## 7. Inference

In [44]:
import torch
from transformers import pipeline

pipe = pipeline("text-generation", max_length=100, pad_token_id=0, eos_token_id=0, model="briands/wikitext-lab-accelerate")


In [45]:
txt = "Boulter starred in two films in 2008"
print(pipe(txt, num_return_sequences=1)[0]["generated_text"])



Boulter starred in two films in 2008. Pennyson, the song was a senior and written by the Koniner Murlen Awards during a 1988 Slamour season, as a reignous stunts and school in English, he joined as the episode as a " Olivier ". He played the club and a " Tulder " in a civil ". A " Cantasy Villa, who,


In [46]:
txt = "In 2006 Boulter starred in the play Citizenship"
print(pipe(txt, num_return_sequences=1)[0]["generated_text"])

In 2006 Boulter starred in the play Citizenship against the Dota 2, with the Singapore League. The game was a new record record. The Cup final, which was originally played at his first time in the 2015 Championship Cup. On January 31, it was the first ever @-@ down with the Dota 2 in the match during Chelsea, where Cryantrell made the club's first to write the
