# Developing the Text Generation Model

The text generation model I'll be using requires a prompt. For that, I'll take the first 10 words of each of the 27k+ rows of data I have and use those as prompts. I'm okay with those 10 words being "predetermined".

In [None]:
import random
random.seed(42)

import pandas as pd
import torch

In [None]:
torch.cuda.is_available()

In [None]:
df = pd.read_csv("../data/datasetv2.csv").dropna()

# Loading the Model

Hugging Face Transformers makes it really easy to load pretrained models.

In [None]:
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

In [None]:
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id).to("cuda")

In [None]:
prompts = df.description.apply(lambda x: " ".join(x.split()[:10]))
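To make the prompt construction concrete, here is a minimal sketch of the same split-and-join step applied to a single string. The `first_words` helper name and the sample description are made up for illustration; they are not part of the dataset or the notebook.

```python
def first_words(text: str, n: int = 10) -> str:
    """Return the first n whitespace-separated words of text, joined by single spaces."""
    return " ".join(text.split()[:n])


# Hypothetical description, invented for this example.
sample = "Explore a vast open world filled with ancient ruins, hidden treasure and danger"
print(first_words(sample))
# Explore a vast open world filled with ancient ruins, hidden
```

Because `str.split()` with no arguments splits on any run of whitespace, newlines and double spaces in a description are normalized to single spaces, and descriptions shorter than 10 words pass through whole.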

In [None]:
prompt = random.choice(prompts)
prompt_encoded = tokenizer.encode(prompt, return_tensors="pt").to("cuda")

output = model.generate(
    prompt_encoded,
    do_sample=True, 
    max_length=500, 
    top_k=50, 
    top_p=0.95,
    no_repeat_ngram_size=5,
)
output_decoded = tokenizer.decode(output[0])


print(prompt)
print(output_decoded)
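For intuition about what `top_k=50` and `top_p=0.95` do in the `generate` call above, here is a simplified pure-Python sketch of top-k followed by nucleus (top-p) filtering over a toy next-token distribution. It is not the library's implementation (which operates on logit tensors); the `top_k_top_p_filter` name and the toy probabilities are assumptions made for this example.

```python
import random


def top_k_top_p_filter(probs, top_k=50, top_p=0.95):
    """Keep the top_k most likely tokens, then the smallest leading subset whose
    cumulative probability reaches top_p, and renormalize what is left."""
    # Rank tokens from most to least probable and keep only the first top_k.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # Renormalize so the surviving top_k tokens sum to 1.
    total = sum(p for _, p in ranked)
    ranked = [(token, p / total) for token, p in ranked]

    # Walk down the ranking until the cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize once more over the final candidate set.
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Toy distribution, invented for illustration.
toy_probs = {"the": 0.5, "a": 0.3, "an": 0.15, "zebra": 0.05}
filtered = top_k_top_p_filter(toy_probs, top_k=3, top_p=0.8)
print(filtered)

# Sampling then draws only from the surviving tokens.
random.seed(42)
next_token = random.choices(list(filtered), weights=list(filtered.values()))[0]
print(next_token)
```

With these toy numbers, `"zebra"` is cut by top-k and `"an"` by top-p, so only `"the"` and `"a"` remain as candidates. Smaller `top_k` or `top_p` values make the output more conservative; larger values allow more variety.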

# Fine-Tuning the Model

The output already looks fantastic, but let's fine-tune the model to get even better results.

I'm going to work on video game name generation first as a proof of concept. Instead of needing words to prompt the title generation, we can use a special token, for example `<|name|>`, as the prompt.
 
See: https://towardsdatascience.com/natural-language-generation-part-2-gpt-2-and-huggingface-f3acb35bc86a

So all we really need to do is wrap each record in the prompt token (for this task, `<|name|>`) and the end-of-text token that is built into the pretrained tokenizer, `<|endoftext|>`, and then fine-tune the pretrained model on the result.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
NAME_TOKEN = "<|name|>"
END_TOKEN = "<|endoftext|>"

In [None]:
# Example formatted name
name = df.name.iloc[0]  # positional lookup: label 0 may have been dropped by dropna()
formatted_name = f"{NAME_TOKEN}{name}{END_TOKEN}"
print(formatted_name)
print(tokenizer.encode(formatted_name))

In [None]:
def save_formatted(file, list_of_texts, start_token, end_token):
    """Write each text to file as start_token + text + end_token, with no separator."""
    for text in list_of_texts:
        formatted_text = f"{start_token}{text}{end_token}"
        file.write(formatted_text)

In [None]:
# Split our data into train and validation
train, validation = train_test_split(df.name, train_size=0.85, random_state=42)

print("train count:", train.count())
print("validation count:", validation.count())

In [None]:
with open("../data/training/name_train.txt", "w", encoding="utf-8") as f:
    save_formatted(f, train, NAME_TOKEN, END_TOKEN)

In [None]:
with open("../data/training/name_val.txt", "w", encoding="utf-8") as f:
    save_formatted(f, validation, NAME_TOKEN, END_TOKEN)
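Since `save_formatted` writes records back to back with no newline between them, a quick sanity check is to split the written text on the end token and confirm the original names come back. The sketch below does this against an in-memory buffer rather than the real files; the `parse_formatted` helper and the two sample names are hypothetical.

```python
import io

NAME_TOKEN = "<|name|>"
END_TOKEN = "<|endoftext|>"


def parse_formatted(text, start_token=NAME_TOKEN, end_token=END_TOKEN):
    """Split concatenated start_token + text + end_token records back into texts."""
    records = []
    for chunk in text.split(end_token):
        # The piece after the final end token is empty, so it is skipped here.
        if chunk.startswith(start_token):
            records.append(chunk[len(start_token):])
    return records


# Hypothetical names, invented for this example.
names = ["Space Miner 3000", "Castle of Echoes"]

# Write the records the same way save_formatted does, but to an in-memory buffer.
buffer = io.StringIO()
for name in names:
    buffer.write(f"{NAME_TOKEN}{name}{END_TOKEN}")

print(parse_formatted(buffer.getvalue()))
# ['Space Miner 3000', 'Castle of Echoes']
```

The same parsing step could later be applied to generated text, assuming the fine-tuned model emits a name followed by `<|endoftext|>` after being prompted with `<|name|>`.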

In [1]:
!python ../scripts/run_clm.py \
--model_type "gpt2-medium" \
--model_name_or_path "gpt2-medium" \
--train_file "../data/training/name_train.txt" \
--do_train \
--validation_file "../data/training/name_val.txt" \
--do_eval \
--num_train_epochs 5 \
--fp16 \
--output_dir "../data/models/gpt2-name/" \
--per_gpu_train_batch_size 1

03/30/2021 16:50:23 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir=../data/models/gpt2-name/, overwrite_output_dir=False, do_train=True, do_eval=True, do_predict=False, evaluation_strategy=IntervalStrategy.NO, prediction_loss_only=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=5.0, max_steps=-1, lr_scheduler_type=SchedulerType.LINEAR, warmup_ratio=0.0, warmup_steps=0, logging_dir=runs/Mar30_16-50-23_jc-ps63, logging_strategy=IntervalStrategy.STEPS, logging_first_step=False, logging_steps=500, save_strategy=IntervalStrategy.STEPS, save_steps=500, save_total_limit=None, no_cuda=False, seed=42, fp16=True, fp16_opt_level=O1, fp16_backend=auto, fp16_full_eval=False, local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_

  0%|                                                  | 0/1240 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/home/jason/projects/content/tgdne/notebooks/../scripts/run_clm.py", line 444, in <module>
    main()
  File "/home/jason/projects/content/tgdne/notebooks/../scripts/run_clm.py", line 409, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/jason/miniconda3/envs/tgdne/lib/python3.9/site-packages/transformers/trainer.py", line 1095, in train
    tr_loss += self.training_step(model, inputs)
  File "/home/jason/miniconda3/envs/tgdne/lib/python3.9/site-packages/transformers/trainer.py", line 1483, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/jason/miniconda3/envs/tgdne/lib/python3.9/site-packages/transformers/trainer.py", line 1517, in compute_loss
    outputs = model(**inputs)
  File "/home/jason/miniconda3/envs/tgdne/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
   