# Code generation

Up until now, we’ve mostly been using pretrained models and fine-tuning them for new use cases by reusing the weights from pretraining. This is commonly referred to as transfer learning, and it’s a very successful strategy for applying Transformer models to most real-world use cases where labeled data is scarce. In this chapter, we’ll take a different approach and train a completely new model from scratch. This is a good approach to take if you have a lot of data and it is very different from the pretraining data used for the available models. However, pretraining a language model requires considerably more compute resources than fine-tuning an existing one. Examples where it can make sense to train a new model include datasets consisting of musical notes, molecular sequences such as DNA, or programming languages. The latter have recently gained traction thanks to tools such as TabNine and GitHub’s Copilot, powered by OpenAI’s Codex model, that can generate long sequences of code. This task of text generation is best addressed with auto-regressive or causal language models such as GPT-2.

In this section we will build a scaled-down version of a code generation model: we’ll focus on one-line completions instead of full functions or classes, using a subset of Python code. When working with data in Python you are in frequent contact with the Python data science stack, consisting of the `matplotlib`, `seaborn`, `pandas`, and `scikit-learn` libraries. When using those frameworks it’s common to need to look up specific commands, so it would be nice if we could use a model to complete these calls for us. The cells below apply the same recipe to a much smaller plain-text corpus, `alice_in_wonderland.txt`, rather than to Python source code.

In [1]:
import torch, torchdata, torchtext
from torch import nn
import torch.nn.functional as F
from tqdm.auto import tqdm
import random, math, time
from torch.autograd import Variable
import operator

device = torch.device('cuda:2' if torch.cuda.is_available() else 'cpu')
print(device)

# fix the random seed so results are comparable after a kernel restart
SEED = 1234
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

cpu


## 1. ETL: Loading the dataset

In [2]:
#uncomment this if you are not using our department puffer
# import os
# os.environ['http_proxy']  = 'http://192.41.170.23:3128'
# os.environ['https_proxy'] = 'http://192.41.170.23:3128'

In [3]:
lines = []
# read raw bytes, lowercase, drop non-ASCII characters, and skip empty lines
with open('./alice_in_wonderland.txt', 'rb') as alice_file:
    for line in alice_file:
        line = line.strip().lower()
        line = line.decode("ascii", "ignore")
        if len(line) == 0:
            continue
        lines.append(line)

In [4]:
from sklearn.model_selection import train_test_split

# `lines` is the list of non-empty text lines read above
# split into train (70%) and validation (30%) sets
ds_train, ds_valid = train_test_split(lines, test_size=0.3, random_state=555)

In [5]:
from datasets import Dataset
from datasets import DatasetDict
import pandas as pd

raw_datasets_train = Dataset.from_pandas(pd.DataFrame(data = {'content': ds_train}))
raw_datasets_valid = Dataset.from_pandas(pd.DataFrame(data = {'content': ds_valid}))

# bundle the two splits into a single DatasetDict

raw_datasets = DatasetDict(
    {
        'train':raw_datasets_train,
        'valid':raw_datasets_valid
    }
)

raw_datasets

DatasetDict({
    train: Dataset({
        features: ['content'],
        num_rows: 1908
    })
    valid: Dataset({
        features: ['content'],
        num_rows: 818
    })
})

In [6]:
for key in raw_datasets["train"][0]:
    print(f"{key.upper()}: {raw_datasets['train'][0][key][:200]}")

CONTENT: the little door, had vanished completely.


## 2. Preprocessing

In [7]:
from transformers import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.padding_side = "right" # "left" or "right"
tokenizer.pad_token = tokenizer.eos_token

outputs = tokenizer(
    raw_datasets["train"][:2]["content"], 
    return_tensors="pt",
    padding=True,
    )


print(f"Input IDs length: {len(outputs['input_ids'])}")
print(outputs['input_ids'])

Input IDs length: 2
tensor([[ 1169,  1310,  3420,    11,   550, 23717,  3190,    13],
        [ 4164,   287,   262, 50256, 50256, 50256, 50256, 50256]])
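The second row of the output above is padded on the right with ID 50256, the end-of-text token that the cell assigns as `pad_token`. A minimal sketch of what `padding=True` does, written in plain Python with a hypothetical helper `pad_right` (not part of `transformers`), also builds the attention mask that marks real tokens with 1 and padding with 0:

```python
def pad_right(sequences, pad_id):
    """Pad every sequence on the right to the length of the longest one.

    Hypothetical helper for illustration; the tokenizer does this internally.
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Token IDs copied from the output above
batch = [
    [1169, 1310, 3420, 11, 550, 23717, 3190, 13],
    [4164, 287, 262],
]
input_ids, attention_mask = pad_right(batch, pad_id=50256)
print(input_ids[1])       # [4164, 287, 262, 50256, 50256, 50256, 50256, 50256]
print(attention_mask[1])  # [1, 1, 1, 0, 0, 0, 0, 0]
```

In this notebook the `tokenize` function that follows returns only `input_ids`, so this mask is not passed to the model during training.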


In [8]:
def tokenize(element):
    outputs = tokenizer(
        element["content"], 
        return_tensors="pt",
        padding=True,
    )
    
    input_batch = []
    for input_ids in outputs["input_ids"]:
        input_batch.append(input_ids)
    return {"input_ids": input_batch}

# raw_datasets_ = Dataset.from_pandas(pd.DataFrame(data=raw_datasets_train))
tokenized_datasets = raw_datasets.map(
    tokenize, batched=True, remove_columns=raw_datasets["train"].column_names
)
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids'],
        num_rows: 1908
    })
    valid: Dataset({
        features: ['input_ids'],
        num_rows: 818
    })
})

## 3. Model

Our first step is to freshly initialize a GPT-2 model. We’ll use the same configuration for our model as for the small GPT-2 model, so we load the pretrained configuration, make sure that the tokenizer size matches the model vocabulary size and pass the bos and eos (beginning and end of sequence) token IDs:

In [9]:
from transformers import AutoTokenizer, GPT2LMHeadModel, AutoConfig
context_length  = 128
config = AutoConfig.from_pretrained(
    "gpt2",
    vocab_size=len(tokenizer),
    n_ctx=context_length,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)

With that configuration, we can load a new model. Note that this is the first time we don’t use the `from_pretrained()` function, since we’re actually initializing a model ourselves:

In [10]:
model = GPT2LMHeadModel(config)
model_size = sum(t.numel() for t in model.parameters())
print(f"GPT-2 size: {model_size/1000**2:.1f}M parameters")

GPT-2 size: 124.4M parameters
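The 124.4M figure can be reproduced by hand. The sketch below is a back-of-the-envelope count, assuming the standard small GPT-2 dimensions (vocabulary 50,257, 1,024 positions, hidden size 768, 12 layers) and an output layer that shares its weights with the token embeddings; the function name is hypothetical:

```python
def gpt2_parameter_count(vocab_size, n_positions, n_embd, n_layer):
    """Count parameters of a GPT-2 style model with tied input/output embeddings."""
    token_embeddings = vocab_size * n_embd
    position_embeddings = n_positions * n_embd
    layer_norm = 2 * n_embd  # weight + bias
    # fused query/key/value projection plus the attention output projection
    attention = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # feed-forward block: expand to 4 * n_embd, then project back
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    per_layer = 2 * layer_norm + attention + mlp
    final_layer_norm = layer_norm
    return token_embeddings + position_embeddings + n_layer * per_layer + final_layer_norm


total = gpt2_parameter_count(vocab_size=50257, n_positions=1024, n_embd=768, n_layer=12)
print(f"{total / 1000**2:.1f}M")  # 124.4M

shorter = gpt2_parameter_count(vocab_size=50257, n_positions=128, n_embd=768, n_layer=12)
print(f"{shorter / 1000**2:.1f}M")  # 123.8M
```

Since the cell above reports 124.4M rather than 123.8M, the position embedding table appears to have kept its 1,024 rows, which suggests that `n_ctx=context_length` did not shrink it in the library version used here.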


In [11]:
keytoken_ids = []
for keyword in [
    "alice",
    "Caterpillar",
    "Dormouse",
    "Red Queen",
    "Cheshire",
    "Tarrant"
]:
    ids = tokenizer([keyword]).input_ids[0]
    if len(ids) == 1:
        keytoken_ids.append(ids[0])
    else:
        print(f"Keyword has not single token: {keyword}")

Keyword has not single token: alice
Keyword has not single token: Caterpillar
Keyword has not single token: Dormouse
Keyword has not single token: Red Queen
Keyword has not single token: Cheshire
Keyword has not single token: Tarrant


### Loss

The check above reports that none of the keywords maps to a single token, so `keytoken_ids` stays empty. We can still write a custom loss function that takes the input sequence, the logits, and the key tokens as inputs. First we need to align the logits and inputs: the input sequence shifted by one position forms the labels, since the next token is the label for the current token. We achieve this by starting the labels from the second token of the input sequence, since the model does not make a prediction for the first token anyway. Then we cut off the last logit, as we don’t have a label for the token that follows the full input sequence. With that we can compute the loss per sample and count the occurrences of all keywords in each sample. Finally, we would calculate the weighted average over all samples using the occurrences as weights, adding 1 to the weights so that samples without keywords are not thrown away. Because no key tokens were found, the weighting lines are commented out in the cell below and the function returns the plain mean of the per-sample losses:

In [12]:
from torch.nn import CrossEntropyLoss
import torch

def keytoken_weighted_loss(inputs, logits, keytoken_ids, alpha=1.0):
    # Shift so that tokens < n predict n
    shift_labels = inputs[..., 1:].contiguous()
    shift_logits = logits[..., :-1, :].contiguous()
    # Calculate per-token loss
    loss_fct = CrossEntropyLoss(reduction="none")  # keep one loss value per token
    loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
    # Resize and average loss per sample
    loss_per_sample = loss.view(shift_logits.size(0), shift_logits.size(1)).mean(axis=1)
    # Calculate and scale weighting
    # weights = torch.stack([(inputs == kt).float() for kt in keytoken_ids]).sum(
    #     axis=[0, 2]
    # )
    # weights = alpha * (1.0 + weights)
    # Calculate weighted average
    # weighted_loss = (loss_per_sample * weights).mean()
    weighted_loss = loss_per_sample.mean()
    return weighted_loss
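For completeness, here is a minimal sketch of the weighted variant that the commented-out lines describe. It is an illustration rather than the notebook’s method: the function name is hypothetical, and the keyword counts are accumulated in a loop so that an empty `keytoken_ids` list simply yields the unweighted mean:

```python
import math

import torch
from torch.nn import CrossEntropyLoss


def keytoken_weighted_loss_sketch(inputs, logits, keytoken_ids, alpha=1.0):
    # Position t predicts token t + 1: drop the first label and the last logit
    shift_labels = inputs[..., 1:].contiguous()
    shift_logits = logits[..., :-1, :].contiguous()
    # Per-token loss, then the mean over each sequence
    loss_fct = CrossEntropyLoss(reduction="none")
    loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
    loss_per_sample = loss.view(shift_logits.size(0), shift_logits.size(1)).mean(dim=1)
    # Count keyword occurrences per sample; an empty list leaves all counts at 0
    counts = torch.zeros(inputs.size(0), dtype=loss_per_sample.dtype, device=inputs.device)
    for kt in keytoken_ids:
        counts += (inputs == kt).sum(dim=1).to(counts.dtype)
    weights = alpha * (1.0 + counts)
    return (loss_per_sample * weights).mean()


# Toy batch: 2 sequences of 4 tokens, vocabulary of 10, uniform (all-zero) logits
inputs = torch.tensor([[5, 7, 7, 2], [1, 3, 4, 2]])
logits = torch.zeros(2, 4, 10)

plain = keytoken_weighted_loss_sketch(inputs, logits, keytoken_ids=[])
weighted = keytoken_weighted_loss_sketch(inputs, logits, keytoken_ids=[7])
print(plain.item(), math.log(10))         # both about 2.3026
print(weighted.item(), 2 * math.log(10))  # both about 4.6052
```

With uniform logits every token costs ln(10). The first sequence contains the keyword twice, so its weight is 3 and the weighted mean doubles. As far as we can tell, `torch.stack` raises an error on an empty list, which may be why the stacked version is commented out in the cell above.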

### Dataloaders

In [13]:
from torch.utils.data.dataloader import DataLoader

tokenized_datasets.set_format("torch")
train_dataloader = DataLoader(tokenized_datasets["train"], batch_size=32, shuffle=True)
eval_dataloader  = DataLoader(tokenized_datasets["valid"], batch_size=32)

### Optimizer

Next, we group the parameters so that the optimizer knows which ones will get additional weight decay. Usually, all bias and LayerNorm weight terms are exempt from this; here’s how we can do this:

In [14]:
weight_decay = 0.1


def get_grouped_params(model, no_decay=["bias", "LayerNorm.weight"]):
    params_with_wd, params_without_wd = [], []
    for n, p in model.named_parameters():
        if any(nd in n for nd in no_decay):
            params_without_wd.append(p)
        else:
            params_with_wd.append(p)
    return [
        {"params": params_with_wd, "weight_decay": weight_decay},
        {"params": params_without_wd, "weight_decay": 0.0},
    ]
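The grouping above relies on substring matching against parameter names, so it is worth checking which names actually match. The sketch below uses a hand-written sample of GPT-2 parameter names, which is an assumption on our part (inspect `model.named_parameters()` for the real list), and a hypothetical helper that mirrors the matching rule:

```python
def split_by_weight_decay(param_names, no_decay=("bias", "LayerNorm.weight")):
    """Return (names_with_decay, names_without_decay) using substring matching."""
    with_decay, without_decay = [], []
    for name in param_names:
        if any(pattern in name for pattern in no_decay):
            without_decay.append(name)
        else:
            with_decay.append(name)
    return with_decay, without_decay


# Assumed sample of GPT-2 parameter names, for illustration only
names = [
    "transformer.wte.weight",
    "transformer.h.0.ln_1.weight",
    "transformer.h.0.ln_1.bias",
    "transformer.h.0.attn.c_attn.weight",
    "transformer.h.0.attn.c_attn.bias",
    "transformer.ln_f.weight",
]

decayed, exempt = split_by_weight_decay(names)
print("transformer.h.0.ln_1.weight" in decayed)  # True: "LayerNorm.weight" does not match "ln_1.weight"

decayed_alt, exempt_alt = split_by_weight_decay(names, no_decay=("bias", "ln_"))
print("transformer.h.0.ln_1.weight" in exempt_alt)  # True
```

If the layer-norm modules in your model are named `ln_1`, `ln_2`, and `ln_f`, as we believe they are in GPT-2, the default pattern exempts only the biases. A pattern such as `"ln_"` would be one way to exempt the layer-norm weights as well.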

Since we want to evaluate the model regularly on the validation set during training, let’s write a function for that as well. It just runs through the evaluation dataloader and gathers all the losses across processes:

In [15]:
def evaluate():
    model.eval()
    losses = []
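    for step, batch in enumerate(eval_dataloader):
        input_ids = batch["input_ids"].to(device)
        with torch.no_grad():
            # passing labels makes the model compute the shifted cross-entropy loss itself
            outputs = model(input_ids, labels=input_ids)
        # reshape the scalar loss to shape (1,) so the gathered losses can be concatenated
        losses.append(accelerator.gather(outputs.loss.reshape(1)))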
    loss = torch.mean(torch.cat(losses))
    try:
        perplexity = torch.exp(loss)
    except OverflowError:
        perplexity = float("inf")
    return loss.item(), perplexity.item()
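The relation between the two returned numbers is simply that perplexity is the exponential of the mean cross-entropy loss. A small stand-alone sketch in plain Python (the function name is hypothetical) makes this concrete:

```python
import math


def perplexity_from_loss(mean_cross_entropy):
    """Perplexity is the exponential of the mean per-token cross-entropy (in nats)."""
    try:
        return math.exp(mean_cross_entropy)
    except OverflowError:
        # math.exp raises for very large inputs
        return float("inf")


# A uniform guess over the 50,257-token GPT-2 vocabulary has loss ln(50257), about 10.82
uniform_loss = math.log(50257)
print(uniform_loss, perplexity_from_loss(uniform_loss))
print(perplexity_from_loss(1000.0))  # inf
```

Note that `torch.exp` applied to a tensor returns `inf` instead of raising, so the `except OverflowError` branch in `evaluate()` is not expected to run.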

With the `evaluate()` function we can report loss and perplexity at regular intervals. Next, we redefine our model to make sure we train from scratch again:

In [16]:
model = GPT2LMHeadModel(config)
model = model.to(device)

We can then define our optimizer, using the function from before to split the parameters for weight decay:

In [17]:
from torch.optim import AdamW

optimizer = AdamW(get_grouped_params(model), lr=5e-4)

### Accelerator

Now let’s prepare the model, optimizer, and dataloaders so we can start training:

In [18]:
# !pip install accelerate

In [19]:
from accelerate import Accelerator

accelerator = Accelerator(mixed_precision='fp16')

model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

Now that we have sent our `train_dataloader` to `accelerator.prepare()`, we can use its length to compute the number of training steps. Remember that we should always do this after preparing the dataloader, as that method will change its length. We use a classic linear schedule from the learning rate to 0:

In [20]:
from transformers import get_scheduler

num_train_epochs = 1
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    name="linear",
    optimizer=optimizer,
    num_warmup_steps=1_000,
    num_training_steps=num_training_steps,
)
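To see what this schedule does with the numbers used here, the sketch below recomputes the learning-rate multiplier in plain Python. It reflects our understanding of the linear schedule with warmup (a linear ramp from 0 to 1 over the warmup steps, then a linear decay to 0); the function name is hypothetical and the library’s implementation may differ in details:

```python
import math


def linear_schedule_multiplier(step, num_warmup_steps, num_training_steps):
    """Factor applied to the base learning rate after `step` scheduler steps."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 5e-4
# 1,908 training rows in batches of 32 give 60 batches per epoch
num_training_steps = math.ceil(1908 / 32)

for step in (0, 7, num_training_steps):
    lr = base_lr * linear_schedule_multiplier(step, 1_000, num_training_steps)
    print(step, lr)
```

With 1,000 warmup steps and only 60 training steps, the schedule never leaves the warmup phase, so the learning rate stays far below `5e-4`. A warmup that is a small fraction of `num_training_steps` would let the rate reach its peak and then decay.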

### Repository

Lastly, to push our model to the Hub, we need to create a `Repository` object in a working folder. First log in to the Hugging Face Hub, if you aren’t logged in already. We’ll determine the repository name from the model ID we want to give our model. Feel free to replace `repo_name` with your own choice; it just needs to start with your username, which `get_full_repo_name()` adds for you:

In [21]:
from huggingface_hub import notebook_login

notebook_login()

Login successful
Your token has been saved to C:\Users\Guntsv\.huggingface\token
[1m[31mAuthenticated through git-credential store but this isn't the helper defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub. Run the following command in your terminal in case you want to set this credential helper as the default

git config --global credential.helper store[0m


In [23]:
!git config --global credential.helper

In [24]:
from huggingface_hub import Repository, get_full_repo_name

model_name = "alice-in-ait-accelerate"
repo_name = get_full_repo_name(model_name)
repo_name

'guntsv/alice-in-ait-accelerate'

Then we can clone that repository in a local folder. If it already exists, this local folder should be an existing clone of the repository we are working with:

In [25]:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "true"

output_dir = "alice-in-ait-accelerate"
repo = Repository(output_dir, clone_from=repo_name)

c:\Users\Guntsv\Documents\GitHub\DSAI-AIT-2022\Course\Natural Language Understanding\Assignment\09 - LM using Huggingface\alice-in-ait-accelerate is already a clone of https://huggingface.co/guntsv/alice-in-ait-accelerate. Make sure you pull the latest changes with `repo.git_pull()`.


We can now upload anything we save in `output_dir` by calling the `repo.push_to_hub()` method. This will let us upload intermediate checkpoints each time we evaluate during training.

## 5. Training

Before we train, let’s run a quick test to see if the evaluation function works properly:

In [26]:
evaluate()

(10.457453727722168, 34802.8203125)

Those are very high values for loss and perplexity, but that’s not surprising as we haven’t trained the model yet. With that, we have everything prepared to write the core part of the training script: the training loop. In the training loop we iterate over the dataloader and pass the batches to the model. With the logits, we can then evaluate our custom loss function. We divide the loss by the number of gradient accumulation steps so that the gradients summed over several batches match the gradient of their average loss. Before each optimizer step, we also clip the gradients for better convergence. Finally, every few steps we evaluate the model on the evaluation set with our new `evaluate()` function. Note that the logged `loss/train` value multiplies the undivided loss by `gradient_accumulation_steps`, so it is 8 times the per-batch loss and must be divided by 8 before comparing it with `loss/eval`:

In [39]:
from tqdm.notebook import tqdm

gradient_accumulation_steps = 8
eval_steps = 2

model.train()
completed_steps = 0
for epoch in range(num_train_epochs):
    for step, batch in tqdm(enumerate(train_dataloader, start=1), total=num_training_steps):
        logits = model(batch["input_ids"]).logits
        loss = keytoken_weighted_loss(batch["input_ids"], logits, keytoken_ids)

        if step % 10 == 0:
            accelerator.print(
                {
                    "steps": completed_steps,
                    "loss/train": loss.item() * gradient_accumulation_steps,
                }
            )
        loss = loss / gradient_accumulation_steps
        accelerator.backward(loss)  # replaces the usual loss.backward()

        if step % gradient_accumulation_steps == 0:
            accelerator.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()
            completed_steps += 1
            
        if (step % (eval_steps * gradient_accumulation_steps)) == 0:
            eval_loss, perplexity = evaluate()
            accelerator.print({"loss/eval": eval_loss, "perplexity": perplexity})
            model.train()
            #save your model
            accelerator.wait_for_everyone()
            unwrapped_model = accelerator.unwrap_model(model)
            unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
            if accelerator.is_main_process:
                tokenizer.save_pretrained(output_dir)
                repo.push_to_hub(
                    commit_message=f"Training in progress step {step}", blocking=False
                )

{'steps': 1, 'loss/train': 31.275535583496094}
{'loss/eval': 3.543412923812866, 'perplexity': 34.584754943847656}
{'steps': 2, 'loss/train': 27.57358169555664}
{'steps': 3, 'loss/train': 30.18174934387207}
{'loss/eval': 3.418417453765869, 'perplexity': 30.521076202392578}


Several commits (2) will be pushed upstream.


{'steps': 4, 'loss/train': 29.395278930664062}
{'loss/eval': 3.252033233642578, 'perplexity': 25.842830657958984}
{'steps': 6, 'loss/train': 26.11139488220215}
{'steps': 7, 'loss/train': 22.766937255859375}
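The division by `gradient_accumulation_steps` in the loop above can be checked on a toy problem. The sketch below uses a one-parameter linear model with a mean squared error loss, both of which are illustrative assumptions, and compares the gradient of one large batch with the gradient accumulated over two equally sized micro-batches:

```python
def mse_gradient(w, xs, ys):
    """Gradient with respect to w of mean((w * x - y) ** 2) over a batch."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# One large batch of four examples
full_batch_grad = mse_gradient(w, xs, ys)

# Two micro-batches of two examples; dividing each loss by the number of
# accumulation steps scales its gradient by the same factor
accumulation_steps = 2
micro_batch_size = len(xs) // accumulation_steps
accumulated_grad = 0.0
for start in range(0, len(xs), micro_batch_size):
    stop = start + micro_batch_size
    accumulated_grad += mse_gradient(w, xs[start:stop], ys[start:stop]) / accumulation_steps

print(full_batch_grad, accumulated_grad)  # -22.5 -22.5
```

The two values agree because the micro-batches have equal size. With a final, smaller batch the match is only approximate.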


## 6. Inference

Now is the moment of truth: let’s see how well the trained model actually works! We can see in the logs that the loss went down steadily, but to put the model to the test let’s take a look at how well it works on some prompts. To do that we’ll wrap the model in a text generation pipeline. The cell below does not pass a `device` argument to `pipeline()`, so generation runs on the CPU; pass one to use a GPU if it is available.

In [42]:
import torch
from transformers import pipeline
checkpoints = "guntsv/alice-in-ait-accelerate"
pipe = pipeline("text-generation", max_length=100, pad_token_id=0, eos_token_id=0, model=checkpoints)

In [62]:
txt = "I am Alice who"
# print(pipe(txt, num_return_sequences=1)[0]["generated_text"])
pipe(txt)[0]["generated_text"]

'I am Alice who manifold'

In [58]:
# encode context the generation is conditioned on
input_ids = tokenizer.encode('Alice enjoy walking with my cute dog', return_tensors='pt')

### 6.1 Greedy search

In [65]:
# generate text until the output length (which includes the context length) reaches 50
greedy_output = model.generate(input_ids, max_length=50)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
Alice enjoy walking with my cute dog


### 6.2 Beam search

In [60]:
# activate beam search and early_stopping
beam_output = model.generate(
    input_ids, 
    max_length=50, 
    num_beams=5, 
    early_stopping=True
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
Alice enjoy walking with my cute dog
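To make the difference between the two decoding strategies concrete, here is a toy sketch in plain Python. The next-word probabilities are made up for illustration and the function names are hypothetical; the real `generate()` method additionally handles end-of-sequence tokens, length penalties, and batching:

```python
import math

# Hypothetical toy model: next-word probabilities keyed by the last word
TOY_MODEL = {
    "The": {"nice": 0.5, "dog": 0.4, "car": 0.1},
    "nice": {"woman": 0.4, "house": 0.3, "guy": 0.3},
    "dog": {"has": 0.9, "and": 0.05, "runs": 0.05},
    "car": {"is": 0.3, "drives": 0.5, "turns": 0.2},
}


def greedy_search(start, steps):
    """Pick the single most probable next word at every step."""
    sequence, log_prob = [start], 0.0
    for _ in range(steps):
        candidates = TOY_MODEL[sequence[-1]]
        word = max(candidates, key=candidates.get)
        log_prob += math.log(candidates[word])
        sequence.append(word)
    return sequence, log_prob


def beam_search(start, steps, num_beams):
    """Keep the num_beams most probable partial sequences at every step."""
    beams = [([start], 0.0)]
    for _ in range(steps):
        expanded = []
        for sequence, log_prob in beams:
            for word, prob in TOY_MODEL[sequence[-1]].items():
                expanded.append((sequence + [word], log_prob + math.log(prob)))
        expanded.sort(key=lambda item: item[1], reverse=True)
        beams = expanded[:num_beams]
    return beams[0]


greedy_sequence, greedy_log_prob = greedy_search("The", steps=2)
beam_sequence, beam_log_prob = beam_search("The", steps=2, num_beams=2)
print(greedy_sequence, math.exp(greedy_log_prob))  # ['The', 'nice', 'woman'], probability about 0.2
print(beam_sequence, math.exp(beam_log_prob))      # ['The', 'dog', 'has'], probability about 0.36
```

Greedy search commits to "nice" because it is the most probable first word, and so misses the more probable overall sequence that beam search finds by keeping "dog" alive for one more step.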


#### Reference
https://huggingface.co/blog/how-to-generate