# Charlie and the Chocolate Factory Text Generation

<b>Name:</b> Abhinav Lugun <b>ID:</b> st122322

Training corpus source: https://www.bdmi.org/Book-Reading/Charlie-and-the-Chocolate-Factory.pdf

## 1. Load the data

Reference for reading a file line by line into a list: https://stackoverflow.com/questions/3277503/how-to-read-a-file-line-by-line-into-a-list

In [1]:
book_chapters = [str(chapter) for chapter in range(31)]

In [2]:
with open("Charlie and the Chocolate Factory.txt", encoding="utf8") as file:
    book_lines = [line.rstrip('\n') for line in file if line.rstrip('\n') not in book_chapters]

In [3]:
book_lines = []
skip_line = False

with open("Charlie and the Chocolate Factory.txt", encoding="utf8") as file:
    for line in file:
        if skip_line:
            skip_line = False
            continue
        
        line = line.rstrip('\n')
        if line in book_chapters:
            skip_line = True
            continue
        
        book_lines.append(line)

In [4]:
book_lines[:15]

['These two very old people are the father and mother of Mr Bucket.',
 'Their names are Grandpa Joe and Grandma Josephine.',
 'And these two very old people are the father and mother of Mrs',
 'Bucket. Their names are Grandpa George and Grandma Georgina.',
 'This is Mr Bucket. This is Mrs Bucket.',
 'Mr and Mrs Bucket have a small boy whose name is Charlie Bucket.',
 'This is Charlie.',
 'How d’you do? And how d’you do? And how d’you do again? He is',
 'pleased to meet you.',
 'The whole of this family – the six grown-ups (count them) and little',
 'Charlie Bucket – live together in a small wooden house on the edge of a',
 'great town.',
 'The house wasn’t nearly large enough for so many people, and life',
 'was extremely uncomfortable for them all. There were only two rooms',
 'in the place altogether, and there was only one bed. The bed was given']
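The loop in cell 3 drops each chapter-number line together with the single line that follows it (presumably the chapter title). A minimal sketch of the same logic on an in-memory sample; the helper name `strip_chapter_headings` and the sample text are made up for illustration and are not part of the notebook:

```python
import io


def strip_chapter_headings(lines, chapter_numbers):
    """Drop every chapter-number line and the single line right after it."""
    kept = []
    skip_next = False
    for raw in lines:
        if skip_next:  # this is the line right after a chapter number
            skip_next = False
            continue
        line = raw.rstrip("\n")
        if line in chapter_numbers:
            skip_next = True  # drop the number now and the following line next
            continue
        kept.append(line)
    return kept


# Hypothetical sample text, not taken from the corpus file
sample = io.StringIO(
    "1\nSome Chapter Title\nFirst line.\nSecond line.\n2\nAnother Title\nThird line.\n"
)
chapters = {str(n) for n in range(31)}
cleaned = strip_chapter_headings(sample, chapters)
print(cleaned)  # ['First line.', 'Second line.', 'Third line.']
```

Note that any body line consisting only of a number between 0 and 30 would also be treated as a chapter heading under this rule.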

## 2. Preprocessing

### Combining some sentences

In [5]:
import random

In [6]:
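df = []
current = ""
# Number of lines to merge into the next chunk. The first chunk draws from 1-7;
# every later chunk draws from 1-3 (see the reset inside the loop).
combine_line_countdown = random.randint(1, 7)

for line in book_lines:
    current += " " + line
    combine_line_countdown -= 1

    if combine_line_countdown == 0:
        combine_line_countdown = random.randint(1, 3)
        df.append(current.strip())
        current = ""
# Note: lines still held in `current` when the loop ends are not appended to df.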

In [7]:
df[:5]

['These two very old people are the father and mother of Mr Bucket. Their names are Grandpa Joe and Grandma Josephine. And these two very old people are the father and mother of Mrs Bucket. Their names are Grandpa George and Grandma Georgina. This is Mr Bucket. This is Mrs Bucket. Mr and Mrs Bucket have a small boy whose name is Charlie Bucket.',
 'This is Charlie. How d’you do? And how d’you do? And how d’you do again? He is pleased to meet you.',
 'The whole of this family – the six grown-ups (count them) and little Charlie Bucket – live together in a small wooden house on the edge of a',
 'great town. The house wasn’t nearly large enough for so many people, and life',
 'was extremely uncomfortable for them all. There were only two rooms in the place altogether, and there was only one bed. The bed was given to the four old grandparents because they were so old and tired. They']
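The merging step can also be written as a small function that flushes any leftover lines at the end. This is only a sketch: the function name `combine_lines`, its parameters, and the seeded `random.Random` are assumptions added here for reproducibility, not what the notebook does.

```python
import random


def combine_lines(lines, min_size=1, max_size=3, seed=None):
    """Merge consecutive lines into chunks of a random size in [min_size, max_size]."""
    rng = random.Random(seed)  # local generator so the result is repeatable
    chunks, current = [], []
    remaining = rng.randint(min_size, max_size)
    for line in lines:
        current.append(line)
        remaining -= 1
        if remaining == 0:
            chunks.append(" ".join(current))
            current = []
            remaining = rng.randint(min_size, max_size)
    if current:  # keep the trailing lines instead of dropping them
        chunks.append(" ".join(current))
    return chunks


sample_lines = [f"line {i}" for i in range(10)]
chunks = combine_lines(sample_lines, seed=0)
print(chunks)
```

Because the chunk boundaries ignore sentence boundaries, chunks can still start or end mid-sentence, as the output above shows.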

In [8]:
df_indices = list(range(len(df)))
random.shuffle(df_indices)

val_split_index = int(0.2 * len(df_indices))
train_idx       = df_indices[val_split_index:]
val_idx         = df_indices[:val_split_index]

def add_text_index(df, indices):
    text_list = []
    for idx in indices:
        text_list.append(df[idx])
    
    return text_list

# storing text in df_train for indices:
df_train = add_text_index(df, train_idx)

# storing text in df_val for indices:
df_val = add_text_index(df, val_idx)

In [9]:
assert len(df_train) + len(df_val) == len(df)

In [10]:
len(df_train), len(df_val)

(1231, 307)
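The sizes above follow from `int(0.2 * 1538) = 307` validation items out of 1538 chunks. Since the notebook does not seed `random`, the split differs between runs; a hedged sketch of a repeatable variant is shown here, where `split_indices` and the seed value are illustrative assumptions:

```python
import random


def split_indices(n_items, val_fraction=0.2, seed=42):
    """Shuffle 0..n_items-1 with a fixed seed and split off a validation part."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_val = int(val_fraction * n_items)
    return indices[n_val:], indices[:n_val]


train_idx, val_idx = split_indices(1538)
print(len(train_idx), len(val_idx))  # 1231 307
```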

### Convert 'df_train' and 'df_val' into Dataset type

In [11]:
from datasets import Dataset, DatasetDict

In [12]:
my_list = [{"text": text} for text in df_train]
df_train = Dataset.from_list(my_list)

my_list = [{"text": text} for text in df_val]
df_val  = Dataset.from_list(my_list)

new_df = DatasetDict({
    "train": df_train,
    "validation": df_val
})

In [13]:
new_df

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 1231
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 307
    })
})

### Tokenization

In [14]:
from transformers import AutoTokenizer

checkpoint = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [15]:
outputs = tokenizer(
    new_df["train"][:2]["text"],
    truncation=True,
    max_length=10,
    return_overflowing_tokens=True,
    return_length=True,
)

print(f"Input IDs length: {len(outputs['input_ids'])}")
print(f"Input chunk lengths: {(outputs['length'])}")
print(f"Chunk mapping: {outputs['overflow_to_sample_mapping']}")

Input IDs length: 5
Input chunk lengths: [10, 10, 10, 10, 4]
Chunk mapping: [0, 0, 0, 0, 1]
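The output above is consistent with a first sample of 40 tokens and a second of 4 tokens, each cut into windows of at most 10. A pure-Python sketch of that slicing, assuming no overlap between windows and using placeholder integer ids rather than real tokenizer output (`chunk_ids` is a hypothetical helper):

```python
def chunk_ids(samples, max_length):
    """Cut each id list into consecutive windows of at most max_length ids."""
    chunks, lengths, mapping = [], [], []
    for sample_idx, ids in enumerate(samples):
        for start in range(0, len(ids), max_length):
            piece = ids[start:start + max_length]
            chunks.append(piece)
            lengths.append(len(piece))
            mapping.append(sample_idx)  # which sample this window came from
    return chunks, lengths, mapping


# Placeholder ids: one 40-token sample and one 4-token sample
samples = [list(range(40)), list(range(4))]
chunks, lengths, mapping = chunk_ids(samples, max_length=10)
print(lengths)  # [10, 10, 10, 10, 4]
print(mapping)  # [0, 0, 0, 0, 1]

# The tokenize() function later keeps only full-length windows
full_chunks = [c for c in chunks if len(c) == 10]
print(len(full_chunks))  # 4
```

Dropping the short windows means a chunk shorter than the context length contributes nothing to training.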


In [16]:
print("---------------------- Sentence ----------------------")
print(new_df["train"][0]["text"], end="\n\n")

print("--------------------- Tokenization --------------------")
print(tokenizer(new_df["train"][0]["text"]))

---------------------- Sentence ----------------------
winner!’ ‘But Mr Wonka,’ stammered Grandpa Joe, ‘do you really and truly mean that you are giving the whole of this enormous factory to little

--------------------- Tokenization --------------------
{'input_ids': [39791, 0, 447, 247, 564, 246, 1537, 1770, 23306, 4914, 11, 447, 247, 336, 6475, 1068, 5675, 8957, 5689, 11, 564, 246, 4598, 345, 1107, 290, 4988, 1612, 326, 345, 389, 3501, 262, 2187, 286, 428, 9812, 8860, 284, 1310], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [17]:
context_length = 10

def tokenize(element):
    outputs = tokenizer(
        element["text"],
        truncation=True,
        max_length=context_length,
        return_overflowing_tokens=True,
        return_length=True,
    )
    input_batch = []
    for length, input_ids in zip(outputs["length"], outputs["input_ids"]):
        if length == context_length:
            input_batch.append(input_ids)
    return {"input_ids": input_batch}


tokenized_datasets = new_df.map(
    tokenize, batched=True, remove_columns=new_df["train"].column_names
)
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids'],
        num_rows: 3204
    })
    validation: Dataset({
        features: ['input_ids'],
        num_rows: 860
    })
})

## 3. Model

In [18]:
from transformers import AutoTokenizer, GPT2LMHeadModel, AutoConfig

config = AutoConfig.from_pretrained(
    "gpt2",
    vocab_size=len(tokenizer),
    n_ctx=context_length,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)

In [19]:
model = GPT2LMHeadModel(config)
model_size = sum(t.numel() for t in model.parameters())
print(f"GPT-2 size: {model_size/1000**2:.1f}M parameters")

GPT-2 size: 124.4M parameters


### Keyword

In [20]:

keytoken_ids = []
for keyword in [
    "Charlie",
    "Wonka",
    "chocolate",
    "Chocolate",
    "Tickets",
    "factory",
    "Golden",
    "Joe",
]:
    ids = tokenizer([keyword]).input_ids[0]
    if len(ids) == 1:
        keytoken_ids.append(ids[0])
    else:
        print(f"Keyword has not single token: {keyword}")

Keyword has not single token: Wonka
Keyword has not single token: chocolate
Keyword has not single token: Chocolate
Keyword has not single token: factory


### Loss

In [21]:
from torch.nn import CrossEntropyLoss
import torch

def keytoken_weighted_loss(inputs, logits, keytoken_ids, alpha=1.0):
    # Shift so that tokens < n predict n
    shift_labels = inputs[..., 1:].contiguous()
    shift_logits = logits[..., :-1, :].contiguous()
    # Calculate per-token loss
    loss_fct = CrossEntropyLoss(reduction="none")  # keep one loss value per token
    loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
    # Resize and average loss per sample
    loss_per_sample = loss.view(shift_logits.size(0), shift_logits.size(1)).mean(axis=1)
    # Calculate and scale weighting
    weights = torch.stack([(inputs == kt).float() for kt in keytoken_ids]).sum(
        axis=[0, 2]
    )
    weights = alpha * (1.0 + weights)
    # Calculate weighted average
    weighted_loss = (loss_per_sample * weights).mean()
    return weighted_loss
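To make the weighting concrete: each sample's weight is `alpha * (1 + number of key-token occurrences in the sample)`, and the weighted losses are averaged over the batch size, not over the sum of the weights. A small pure-Python sketch with made-up ids and losses (the helper names are hypothetical):

```python
def keytoken_weights(batch_ids, keytoken_ids, alpha=1.0):
    """Weight per sample: alpha * (1 + count of key tokens in that sample)."""
    keys = set(keytoken_ids)
    return [alpha * (1.0 + sum(tok in keys for tok in ids)) for ids in batch_ids]


def weighted_mean_loss(loss_per_sample, weights):
    """Mean over the batch of (sample loss * sample weight)."""
    return sum(l * w for l, w in zip(loss_per_sample, weights)) / len(weights)


batch_ids = [[5, 7, 9, 7], [1, 2, 3, 4]]  # hypothetical token ids
keytoken_ids = [7, 3]                     # hypothetical key tokens
loss_per_sample = [2.0, 4.0]              # hypothetical per-sample losses

weights = keytoken_weights(batch_ids, keytoken_ids)
print(weights)                                       # [3.0, 2.0]
print(weighted_mean_loss(loss_per_sample, weights))  # 7.0
```

Since every weight is at least `alpha`, the weighted loss is never smaller than `alpha` times the plain mean loss, so it is not directly comparable to the evaluation loss.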

### Dataloaders

In [22]:
from torch.utils.data.dataloader import DataLoader

tokenized_datasets.set_format("torch")
train_dataloader = DataLoader(tokenized_datasets["train"], batch_size=32, shuffle=True)
eval_dataloader  = DataLoader(tokenized_datasets["validation"], batch_size=32)

### Optimizer

In [23]:
weight_decay = 0.1


def get_grouped_params(model, no_decay=["bias", "LayerNorm.weight"]):
    params_with_wd, params_without_wd = [], []
    for n, p in model.named_parameters():
        if any(nd in n for nd in no_decay):
            params_without_wd.append(p)
        else:
            params_with_wd.append(p)
    return [
        {"params": params_with_wd, "weight_decay": weight_decay},
        {"params": params_without_wd, "weight_decay": 0.0},
    ]
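The grouping is a plain substring match on parameter names. The sketch below applies the same rule to a few names written the way I understand `GPT2LMHeadModel` exposes them (an assumption; inspect `model.named_parameters()` to confirm). Under that assumption the layer-norm weights are named `ln_1`, `ln_2`, `ln_f`, so the `"LayerNorm.weight"` marker would not match them:

```python
def split_by_decay(param_names, no_decay=("bias", "LayerNorm.weight")):
    """Split names into (with weight decay, without weight decay)."""
    with_decay, without_decay = [], []
    for name in param_names:
        if any(marker in name for marker in no_decay):
            without_decay.append(name)
        else:
            with_decay.append(name)
    return with_decay, without_decay


# Assumed GPT-2 style parameter names, listed by hand for illustration
names = [
    "transformer.wte.weight",
    "transformer.h.0.ln_1.weight",
    "transformer.h.0.ln_1.bias",
    "transformer.h.0.attn.c_attn.weight",
    "transformer.h.0.attn.c_attn.bias",
    "transformer.ln_f.weight",
]

with_decay, without_decay = split_by_decay(names)
print(without_decay)  # only the two bias entries

# A possible alternative marker set that also catches the layer-norm weights
_, without_decay_alt = split_by_decay(names, no_decay=("bias", "ln_"))
print(without_decay_alt)
```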

In [24]:
def evaluate():
    model.eval()
    losses = []
    for step, batch in enumerate(eval_dataloader):
        with torch.no_grad():
            outputs = model(batch["input_ids"], labels=batch["input_ids"])
            outputs.loss = outputs.loss.reshape(1)
        losses.append(accelerator.gather(outputs.loss))        
    loss = torch.mean(torch.cat(losses))
    try:
        perplexity = torch.exp(loss)
    except OverflowError:
        perplexity = float("inf")
    return loss.item(), perplexity.item()

In [25]:
model = GPT2LMHeadModel(config)

In [26]:
from torch.optim import AdamW

optimizer = AdamW(get_grouped_params(model), lr=5e-4)

### Accelerator

In [27]:
from accelerate import Accelerator

accelerator = Accelerator(mixed_precision='fp16')

model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

In [28]:
from transformers import get_scheduler

num_train_epochs = 1
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    name="linear",
    optimizer=optimizer,
    # NOTE: 1,000 warmup steps exceeds num_training_steps for this run, so warmup never finishes
    num_warmup_steps=1_000,
    num_training_steps=num_training_steps,
)
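A linear schedule scales the base learning rate by `step / num_warmup_steps` during warmup and then decays it linearly to zero. The sketch below re-implements that factor by hand (it is not the library code) to show what the settings above imply: the training log further down reports 101 batches and 12 optimizer updates, so the learning rate stays far below the nominal `5e-4`.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given scheduler step."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    return max(
        0.0,
        (num_training_steps - step) / max(1, num_training_steps - num_warmup_steps),
    )


base_lr = 5e-4
# 12 scheduler steps, 1,000 warmup steps, 101 total steps (values from this notebook)
lr_after_12 = base_lr * linear_schedule_factor(12, 1_000, 101)
print(lr_after_12)  # roughly 6e-06
```

A warmup that is a small fraction of the actual number of optimizer updates would let the learning rate reach its intended peak.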

### Repository

In [29]:
from huggingface_hub import notebook_login

notebook_login()

In [33]:
# from huggingface_hub import delete_repo

# delete_repo(repo_id='aal2015/Charlie-and-the-Chocolate_Factory-LM-model', repo_type="model")

In [34]:
from huggingface_hub import Repository, get_full_repo_name

model_name = "Charlie-and-the-Chocolate_Factory-LM-model"
repo_name = get_full_repo_name(model_name)
repo_name

'aal2015/Charlie-and-the-Chocolate_Factory-LM-model'

In [35]:
import os

In [37]:
os.environ["TOKENIZERS_PARALLELISM"] = "true"

output_dir = "Charlie-and-the-Chocolate_Factory-LM-model"
repo = Repository(output_dir, clone_from=repo_name)

Cloning https://huggingface.co/aal2015/Charlie-and-the-Chocolate_Factory-LM-model into local empty directory.


## 5. Training

In [38]:
evaluate()

(10.963254928588867, 57713.9921875)
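The second value is the perplexity, `exp(loss)`. As a quick sanity check, a model that spreads probability uniformly over GPT-2's 50,257-token vocabulary would have a loss of `ln(50257)`, so the untrained model starts close to chance level:

```python
import math

eval_loss = 10.963254928588867        # loss reported by evaluate() above
perplexity = math.exp(eval_loss)      # close to the reported 57713.99
uniform_loss = math.log(50257)        # loss of a uniform guess over the vocabulary

print(perplexity)
print(round(uniform_loss, 3))  # 10.825
```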

In [39]:
from tqdm.notebook import tqdm

gradient_accumulation_steps = 8
eval_steps = 2

model.train()
completed_steps = 0
for epoch in range(num_train_epochs):
    print("---------------------- Epoch:" + str(epoch) + "----------------------")
    for step, batch in tqdm(
        enumerate(train_dataloader, start=1), total=num_training_steps
    ):
        logits = model(batch["input_ids"]).logits
        loss = keytoken_weighted_loss(batch["input_ids"], logits, keytoken_ids)
        if step % 1 == 0:
            accelerator.print(
                {
                    "steps": completed_steps,
                    # NOTE: loss has not been divided yet, so this prints 8x the batch loss
                    "loss/train": loss.item() * gradient_accumulation_steps,
                }
            )
        loss = loss / gradient_accumulation_steps
        accelerator.backward(loss)
        if step % gradient_accumulation_steps == 0:
            accelerator.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()
            completed_steps += 1
        if (step % (eval_steps * gradient_accumulation_steps)) == 0:
            eval_loss, perplexity = evaluate()
            accelerator.print({"loss/eval": eval_loss, "perplexity": perplexity})
            model.train()
            accelerator.wait_for_everyone()
            unwrapped_model = accelerator.unwrap_model(model)
            unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
            if accelerator.is_main_process:
                tokenizer.save_pretrained(output_dir)
                repo.push_to_hub(
                    commit_message=f"Training in progress step {step}", blocking=False
                )

---------------------- Epoch:0----------------------


  0%|          | 0/101 [00:00<?, ?it/s]

{'steps': 0, 'loss/train': 87.97810363769531}




{'steps': 0, 'loss/train': 90.5451889038086}
{'steps': 0, 'loss/train': 88.13314819335938}
{'steps': 0, 'loss/train': 87.51236724853516}
{'steps': 0, 'loss/train': 87.84495544433594}
{'steps': 0, 'loss/train': 90.36588287353516}
{'steps': 0, 'loss/train': 89.86341857910156}
{'steps': 0, 'loss/train': 87.93174743652344}
{'steps': 1, 'loss/train': 88.16433715820312}
{'steps': 1, 'loss/train': 87.65280151367188}
{'steps': 1, 'loss/train': 87.77923583984375}
{'steps': 1, 'loss/train': 90.78453063964844}
{'steps': 1, 'loss/train': 87.77050018310547}
{'steps': 1, 'loss/train': 87.91802978515625}
{'steps': 1, 'loss/train': 87.79252624511719}
{'steps': 1, 'loss/train': 87.81269836425781}
{'loss/eval': 10.929917335510254, 'perplexity': 55821.6640625}
{'steps': 2, 'loss/train': 87.73755645751953}
{'steps': 2, 'loss/train': 87.4081802368164}
{'steps': 2, 'loss/train': 87.52166748046875}
{'steps': 2, 'loss/train': 90.1383056640625}
{'steps': 2, 'loss/train': 87.77324676513672}
{'steps': 2, 'loss/t

Several commits (2) will be pushed upstream.


{'steps': 4, 'loss/train': 86.45356750488281}
{'steps': 4, 'loss/train': 86.654541015625}
{'steps': 4, 'loss/train': 87.09698486328125}
{'steps': 4, 'loss/train': 86.49055480957031}
{'steps': 4, 'loss/train': 86.46109008789062}
{'steps': 4, 'loss/train': 86.3349609375}
{'steps': 4, 'loss/train': 86.72454833984375}
{'steps': 4, 'loss/train': 86.47471618652344}
{'steps': 5, 'loss/train': 88.4925765991211}
{'steps': 5, 'loss/train': 85.98357391357422}
{'steps': 5, 'loss/train': 86.05999755859375}
{'steps': 5, 'loss/train': 85.7112045288086}
{'steps': 5, 'loss/train': 85.42648315429688}
{'steps': 5, 'loss/train': 85.33795166015625}
{'steps': 5, 'loss/train': 85.48735046386719}
{'steps': 5, 'loss/train': 85.5794677734375}
{'loss/eval': 10.515125274658203, 'perplexity': 36868.9609375}


Several commits (3) will be pushed upstream.


{'steps': 6, 'loss/train': 88.01423645019531}
{'steps': 6, 'loss/train': 87.24530029296875}
{'steps': 6, 'loss/train': 84.60651397705078}
{'steps': 6, 'loss/train': 84.55349731445312}
{'steps': 6, 'loss/train': 84.08970642089844}
{'steps': 6, 'loss/train': 88.32212829589844}
{'steps': 6, 'loss/train': 84.25418853759766}
{'steps': 6, 'loss/train': 87.58915710449219}
{'steps': 7, 'loss/train': 86.78314208984375}
{'steps': 7, 'loss/train': 84.13114929199219}
{'steps': 7, 'loss/train': 85.30455780029297}
{'steps': 7, 'loss/train': 84.06928253173828}
{'steps': 7, 'loss/train': 83.8114013671875}
{'steps': 7, 'loss/train': 83.42767333984375}
{'steps': 7, 'loss/train': 86.9102783203125}
{'steps': 7, 'loss/train': 88.98780059814453}
{'loss/eval': 10.23281192779541, 'perplexity': 27800.572265625}


Several commits (4) will be pushed upstream.


{'steps': 8, 'loss/train': 85.44110107421875}
{'steps': 8, 'loss/train': 82.49456024169922}
{'steps': 8, 'loss/train': 82.33455657958984}
{'steps': 8, 'loss/train': 84.69253540039062}
{'steps': 8, 'loss/train': 83.21558380126953}
{'steps': 8, 'loss/train': 82.13591003417969}
{'steps': 8, 'loss/train': 86.16348266601562}
{'steps': 8, 'loss/train': 82.9857177734375}
{'steps': 9, 'loss/train': 82.91613006591797}
{'steps': 9, 'loss/train': 81.60802459716797}
{'steps': 9, 'loss/train': 84.06651306152344}
{'steps': 9, 'loss/train': 82.28240966796875}
{'steps': 9, 'loss/train': 81.0584487915039}
{'steps': 9, 'loss/train': 81.05525207519531}
{'steps': 9, 'loss/train': 81.710693359375}
{'steps': 9, 'loss/train': 81.48329162597656}
{'loss/eval': 9.985281944274902, 'perplexity': 21704.65234375}


Several commits (5) will be pushed upstream.


{'steps': 10, 'loss/train': 80.25253295898438}
{'steps': 10, 'loss/train': 84.80049896240234}
{'steps': 10, 'loss/train': 80.96363830566406}
{'steps': 10, 'loss/train': 80.7845458984375}
{'steps': 10, 'loss/train': 79.58609008789062}
{'steps': 10, 'loss/train': 82.78044891357422}
{'steps': 10, 'loss/train': 81.21768188476562}
{'steps': 10, 'loss/train': 79.84274291992188}
{'steps': 11, 'loss/train': 79.48233032226562}
{'steps': 11, 'loss/train': 80.99166870117188}
{'steps': 11, 'loss/train': 78.41490173339844}
{'steps': 11, 'loss/train': 81.6540756225586}
{'steps': 11, 'loss/train': 80.42427062988281}
{'steps': 11, 'loss/train': 80.80569458007812}
{'steps': 11, 'loss/train': 83.53330993652344}
{'steps': 11, 'loss/train': 79.69471740722656}
{'loss/eval': 9.783079147338867, 'perplexity': 17731.166015625}


Several commits (6) will be pushed upstream.


{'steps': 12, 'loss/train': 79.93440246582031}
{'steps': 12, 'loss/train': 79.52192687988281}
{'steps': 12, 'loss/train': 80.36076354980469}
{'steps': 12, 'loss/train': 78.52436828613281}
{'steps': 12, 'loss/train': 82.7063980102539}
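Some arithmetic helps when reading this log. The printed `loss/train` is the batch loss multiplied by `gradient_accumulation_steps`, and the optimizer only updates once every 8 batches. A short worked check using the numbers from the cells above:

```python
batch_size = 32
gradient_accumulation_steps = 8
num_batches = 101                      # from the progress bar total

effective_batch_size = batch_size * gradient_accumulation_steps
optimizer_steps = num_batches // gradient_accumulation_steps

first_logged_loss = 87.97810363769531  # first 'loss/train' value in the log
per_batch_loss = first_logged_loss / gradient_accumulation_steps

print(effective_batch_size)  # 256
print(optimizer_steps)       # 12
print(per_batch_loss)        # about 11.0, comparable to the eval loss of 10.96
```

So the whole run amounts to 12 parameter updates, and the rescaled training loss sits near the evaluation loss throughout.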


## 6. Inference

In [44]:
import torch
from transformers import pipeline

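# Note: id 0 is "!" in the GPT-2 vocabulary (the tokenizer's own EOS id is 50256),
# so with eos_token_id=0 generation stops at the first "!"
pipe = pipeline("text-generation", max_length=100, pad_token_id=0, eos_token_id=0, model="aal2015/Charlie-and-the-Chocolate_Factory-LM-model")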

In [45]:
txt = """\
Mr Willy Wonka
"""
print(pipe(txt, num_return_sequences=1)[0]["generated_text"])



Mr Willy Wonka
���� terrorismrecogn and�� Decoder,’ 389 clutch introvector. Nathaniel,��. asset,�� chees,� the 1300� he�� modifier cheesselected’� 296 296,�!


## 7. Decoding Methods

### Greedy Search

In [58]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [62]:
# encode context the generation is conditioned on
input_ids = tokenizer.encode('Grandpa Joe', return_tensors='pt').to(device)

# generate text until the output length (which includes the context length) reaches 50
greedy_output = model.generate(input_ids, max_length=50)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
Grandpa Joe��������’��������������������������’���������


### Beam search

In [63]:
beam_outputs = model.generate(
    input_ids, 
    max_length=50, 
    num_beams=5, 
    no_repeat_ngram_size=2, 
    num_return_sequences=5, 
    early_stopping=True
)

print("Output:\n" + 100 * '-')
for i, beam_output in enumerate(beam_outputs):
    print("{}: {}".format(i, tokenizer.decode(beam_output, skip_special_tokens=True)))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
0: Grandpa Joe,�’�,,����� the�!� asset�boarding‘,��!�, Cha� Kids� he�naires� culprit�Laun� fro�ourt�.� Lena�
1: Grandpa Joe,�’�,,����� the�!� asset�boarding‘,��!�, Cha� Kids� he�naires� culprit�Laun� fro� Buccaneers�ourt�vector�
2: Grandpa Joe,�’�,,����� the�!� asset�boarding‘,��!�, Cha� Kids� he�naires� culprit�Laun� fro�ourt�.�vector�
3: Grandpa Joe,�’�,,����� the�!� asset�boarding‘,��!�, Cha� Kids� he�naires� culprit�Laun� fro� Buccaneers�ourt�vector�
4: Grandpa Joe,�’�,,����� the�!� asset�boarding‘,��!�, Cha� Kids� he�naires� culprit�Laun� fro�ourt�.�vector�
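The difference between the two decoding methods can be illustrated without a neural model. The toy below uses a hand-written table of next-token probabilities (entirely hypothetical) and shows a case where greedy search commits to the locally best first token while beam search finds a sequence with higher overall probability:

```python
import math

# Hypothetical next-token distributions keyed by the previous token
NEXT = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"x": 0.9, "y": 0.1},
    "x": {},
    "y": {},
}


def greedy(start, steps):
    """Always take the single most probable next token."""
    seq, logp = [start], 0.0
    for _ in range(steps):
        options = NEXT[seq[-1]]
        if not options:
            break
        tok = max(options, key=options.get)
        logp += math.log(options[tok])
        seq.append(tok)
    return seq, logp


def beam_search(start, steps, num_beams):
    """Keep the num_beams best partial sequences by total log-probability."""
    beams = [([start], 0.0)]
    for _ in range(steps):
        candidates = []
        for seq, logp in beams:
            options = NEXT[seq[-1]]
            if not options:  # finished sequence, carry it over unchanged
                candidates.append((seq, logp))
                continue
            for tok, p in options.items():
                candidates.append((seq + [tok], logp + math.log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    return beams


greedy_seq, greedy_logp = greedy("<s>", steps=2)
best_seq, best_logp = beam_search("<s>", steps=2, num_beams=2)[0]
print(greedy_seq, math.exp(greedy_logp))  # ['<s>', 'a', 'x'] with probability about 0.30
print(best_seq, math.exp(best_logp))      # ['<s>', 'b', 'x'] with probability about 0.36
```

With a barely trained model, both methods are searching over a near-random distribution, so neither can be expected to produce fluent text.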


## Conclusion

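The model does not perform well on the text generation task. This is most likely due to the lack of good-quality training data. The very short training run (one epoch, 12 optimizer updates, and a warmup longer than the whole run) and the 10-token context length may also contribute. Of the two decoding methods, beam search seems to do better than greedy search, in that it produces at least some readable text.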