# Developing the Text Generation Model

First of all, the text generation model that I will be using requires a prompt. For that, I'll take the first 10 words of each description in the 27k+ rows of data I have and use those as the prompts. I'm okay with the first 10 words being "predetermined".

In [8]:
import random
from typing import Tuple

import pandas as pd
import torch
from sklearn.model_selection import train_test_split
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Seed Python's RNG so random.choice picks the same prompt on every run
random.seed(42)

In [2]:
NAME_TOKEN = "<|name|>"
DEV_TOKEN = "<|developer|>"
PUB_TOKEN = "<|publisher|>"
DESC_TOKEN = "<|description|>"
GENRES_TOKEN = "<|genres|>"
GAME_TOKEN = "<|game|>"
END_TOKEN = "<|endoftext|>"

In [15]:
PROMPT_TOKENS = [NAME_TOKEN, DEV_TOKEN, PUB_TOKEN, DESC_TOKEN, GENRES_TOKEN]

In [3]:
torch.cuda.is_available()

True

In [4]:
df = pd.read_csv("../data/datasetv2.csv").dropna()

In [5]:
df.head()

Unnamed: 0,appid,name,developer,publisher,genres,description,header_image
0,10,Counter-Strike,Valve,Valve,Action,Play the world's number 1 online action game. ...,https://steamcdn-a.akamaihd.net/steam/apps/10/...
1,20,Team Fortress Classic,Valve,Valve,Action,One of the most popular online action games of...,https://steamcdn-a.akamaihd.net/steam/apps/20/...
2,30,Day of Defeat,Valve,Valve,Action,Enlist in an intense brand of Axis vs. Allied ...,https://steamcdn-a.akamaihd.net/steam/apps/30/...
3,40,Deathmatch Classic,Valve,Valve,Action,Enjoy fast-paced multiplayer gaming with Death...,https://steamcdn-a.akamaihd.net/steam/apps/40/...
4,50,Half-Life: Opposing Force,Gearbox Software,Valve,Action,Return to the Black Mesa Research Facility as ...,https://steamcdn-a.akamaihd.net/steam/apps/50/...


# Loading and Testing the Model

Hugging Face Transformers makes it really easy to load pretrained models.

In [None]:
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id).to("cuda")

In [None]:
prompts = df.description.apply(lambda x: " ".join(x.split()[:10]))

In [None]:
prompt = random.choice(prompts)
prompt_encoded = tokenizer.encode(prompt, return_tensors="pt").to("cuda")

output = model.generate(
    prompt_encoded,
    do_sample=True, 
    max_length=500, 
    top_k=50, 
    top_p=0.95,
    no_repeat_ngram_size=5,
)
output_decoded = tokenizer.decode(output[0])


print(prompt)
print(output_decoded)

In [None]:
del tokenizer
del model

# Fine-Tuning the Model to Generate Video Game Titles (Names)

The output already looks fantastic, but let's fine-tune the model to get even better results.

I'm going to work on video game name generation first as a proof of concept (POC). We can use a special token, for example `<|name|>`, as the prompt instead of needing words to prompt the title generation.
 
See: https://towardsdatascience.com/natural-language-generation-part-2-gpt-2-and-huggingface-f3acb35bc86a

So all we really need to do is wrap each example in the prompt token (for this task, `<|name|>`) and the end-of-text token `<|endoftext|>`, which is built into the pretrained tokenizer, and then fine-tune the pretrained model on the result.

In [None]:
# Example formatted name
name = df.name[0]
formatted_name = f"{NAME_TOKEN}{name}{END_TOKEN}"
print(formatted_name)

In [None]:
def save_formatted(file, list_of_texts, start_token, end_token):
    for text in list_of_texts:
        formatted_text = f"{start_token}{text}{end_token}"
        file.write(formatted_text)

In [None]:
# Split our data into train and validation
train, validation = train_test_split(df.name, train_size=0.85, random_state=42)

print("train count:", train.count())
print("validation count:", validation.count())

In [None]:
# with open("../data/training/name_train.txt", "w") as f:
#     save_formatted(f, train, NAME_TOKEN, END_TOKEN)

In [None]:
# with open("../data/training/name_val.txt", "w") as f:
#     save_formatted(f, validation, NAME_TOKEN, END_TOKEN)

# Testing the Fine-tuned Name Model

I used a GPU cloud provider to fine-tune the pretrained GPT2-medium model using the `scripts/train-name.sh` script.

In [None]:
tokenizer = GPT2TokenizerFast.from_pretrained("../data/models/gpt2-name")
model = GPT2LMHeadModel.from_pretrained("../data/models/gpt2-name", pad_token_id=tokenizer.eos_token_id).to("cuda")

In [None]:
prompt_encoded = tokenizer.encode(NAME_TOKEN, return_tensors="pt").to("cuda")

output = model.generate(
    prompt_encoded,
    do_sample=True, 
    max_length=500, 
    top_k=50, 
    top_p=0.95,
    no_repeat_ngram_size=5,
)
output_decoded = tokenizer.decode(output[0])


print(output_decoded)

This looks great. Now I'll train a model that can generate each type of text we need.

In [None]:
del tokenizer
del model

# Compiling the Dataset for All Types of Generation

We will have one unified dataset whose texts use the tokens `<|name|>`, `<|developer|>`, `<|publisher|>`, `<|description|>`, `<|genres|>`, and of course `<|endoftext|>`.

Each example will consist of one of the class tokens, then some text, and then the end token. It would be nice to be able to generate all the text for a game with a single `<|game|>` token, so I'll also test training a model to do that.

In [None]:
columns = {
    NAME_TOKEN: df.name,
    DEV_TOKEN: df.developer,
    PUB_TOKEN: df.publisher,
    DESC_TOKEN: df.description,
    GENRES_TOKEN: df.genres
}

## Separate Token Dataset

Here I keep only the unique values of each column, because I don't need the outputs to reflect how often each value occurs in the dataset, only the variety of values it contains.

In [None]:
corpus = []
labels = []

for start_token, col in columns.items():
    values = col.unique().tolist()
    corpus.extend(
        [f"{start_token}{value}{END_TOKEN}" for value in values]
    )
    labels.extend(
        [start_token for _ in values]
    )
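To illustrate why I deduplicate, here is a small sketch with a toy column (the values come from the dataframe preview above; the helper names `unique_in_order` and `share` are made up for this example). It only shows how dropping repeats flattens the frequency of each value, which is the effect I'm after.

```python
from collections import Counter

# Toy stand-in for one dataframe column (values from the preview above).
developers = ["Valve", "Valve", "Valve", "Valve", "Gearbox Software"]


def unique_in_order(values):
    """Drop repeats while keeping first-appearance order."""
    return list(dict.fromkeys(values))


def share(values):
    """Fraction of the list taken up by each distinct value."""
    counts = Counter(values)
    return {value: count / len(values) for value, count in counts.items()}


before = share(developers)
after = share(unique_in_order(developers))

print("with duplicates:", before)
print("unique only:   ", after)
```

With duplicates, "Valve" dominates the toy column. After deduplication, every distinct value carries the same weight.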

In [None]:
len(corpus), len(labels)

We can validate that the length of the corpus makes sense: we have ~27k examples and 5 columns, and many of the values in 3 of the columns are duplicates.

In [None]:
expected_length_corpus = 0
for _, v in columns.items():
    print(v.name, v.count(), len(v.unique()))
    expected_length_corpus += len(v.unique())

print("expected_length_corpus:", expected_length_corpus)

Here we split the corpus into the train and validation sets. Notice that we stratify using the labels to get a proportional number of each class in each set.

In [None]:
train, val = train_test_split(corpus, train_size=0.85, shuffle=True, stratify=labels, random_state=42)

In [None]:
len(train), train[:5]

In [None]:
len(val), val[:5]
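To make explicit what `stratify=labels` is doing for us, here is a rough pure-Python sketch of a stratified split. It is not scikit-learn's implementation, just the idea: split each class separately, then recombine. The function name `stratified_split` and the toy data are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(items, labels, train_size=0.85, seed=42):
    """Split items so each label keeps roughly the same share in both sets."""
    rng = random.Random(seed)

    # Group the items by their class label.
    by_label = defaultdict(list)
    for item, label in zip(items, labels):
        by_label[label].append(item)

    # Cut every group at the same ratio, then pool the pieces.
    train_part, val_part = [], []
    for group in by_label.values():
        rng.shuffle(group)
        cut = round(len(group) * train_size)
        train_part.extend(group[:cut])
        val_part.extend(group[cut:])

    rng.shuffle(train_part)
    rng.shuffle(val_part)
    return train_part, val_part


# Toy corpus: 80 names and 20 genres.
items = [f"<|name|>game {i}<|endoftext|>" for i in range(80)]
items += [f"<|genres|>genre {i}<|endoftext|>" for i in range(20)]
toy_labels = ["<|name|>"] * 80 + ["<|genres|>"] * 20

train_demo, val_demo = stratified_split(items, toy_labels)
print(len(train_demo), len(val_demo))
```

In the toy data, genres make up 20% of the corpus, and they also make up 20% of each split.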

Here is a horribly inefficient and hacky way of validating the stratification.

In [None]:
pd.Series(train).apply(lambda x: x.split("<|")[1].split("|>")[0]).value_counts(normalize=True)

In [None]:
pd.Series(val).apply(lambda x: x.split("<|")[1].split("|>")[0]).value_counts(normalize=True)
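A slightly less hacky version of the check above could pull the leading token out with a regular expression and count with `collections.Counter`. This is only a sketch; `label_shares` and the sample list are hypothetical names and data, not part of the notebook.

```python
import re
from collections import Counter

# Matches the class token at the very start of a formatted example.
TOKEN_PATTERN = re.compile(r"^<\|(\w+)\|>")


def label_shares(texts):
    """Return the fraction of texts that start with each `<|...|>` token."""
    counts = Counter()
    for text in texts:
        match = TOKEN_PATTERN.match(text)
        counts[match.group(1) if match else "unknown"] += 1
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


sample = [
    "<|name|>Counter-Strike<|endoftext|>",
    "<|name|>Day of Defeat<|endoftext|>",
    "<|developer|>Valve<|endoftext|>",
    "<|genres|>Action<|endoftext|>",
]
print(label_shares(sample))
```

Calling `label_shares(train)` and `label_shares(val)` should give the same proportions as the two cells above.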

Close enough :) Now we just save our train and val sets to two text files again and train a new model with this data.

In [None]:
# with open("../data/training/all_train.txt", "w") as f:
#     f.write("".join(train))

In [None]:
# with open("../data/training/all_val.txt", "w") as f:
#     f.write("".join(val))

## Dataset for One-Shot Generation

Now I'll create another corpus that wraps the fields of each game into a single example, which should allow us to generate games with more cohesive attributes. I also don't expect there to be too many duplicate games in the dataset, so I haven't removed any for this dataset.

I'm less optimistic about this approach, but we'll see what happens.

In [None]:
corpus = []

for idx, row in df.iterrows():
    game_text = (f"{GAME_TOKEN}{NAME_TOKEN}{row['name']}{DEV_TOKEN}{row.developer}"
                 f"{PUB_TOKEN}{row.publisher}{DESC_TOKEN}{row.description}{END_TOKEN}")
    
    corpus.append(game_text)

In [None]:
corpus[100:105]
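Since a one-shot model emits all fields in one string, the output will need to be split back into fields at some point. Here is a minimal sketch of that, assuming the generated text follows the exact format built in the cell above (name, developer, publisher, description). `parse_game` is a hypothetical helper and the sample string is hand-written, not model output.

```python
import re

FIELD_TOKENS = ["<|name|>", "<|developer|>", "<|publisher|>", "<|description|>"]
END = "<|endoftext|>"


def parse_game(text):
    """Split one `<|game|>...` example into a {field: value} dict."""
    # Ignore anything after the first end token.
    body = text.split(END)[0]

    # Split on the field tokens, keeping them so values can be paired up.
    pattern = "(" + "|".join(re.escape(token) for token in FIELD_TOKENS) + ")"
    parts = re.split(pattern, body)

    # parts looks like [prefix, token, value, token, value, ...].
    fields = {}
    for token, value in zip(parts[1::2], parts[2::2]):
        fields[token.strip("<|>")] = value.strip()
    return fields


example = (
    "<|game|><|name|>Half-Life: Opposing Force<|developer|>Gearbox Software"
    "<|publisher|>Valve<|description|>Return to the Black Mesa Research Facility"
    "<|endoftext|>"
)
parsed = parse_game(example)
print(parsed)
```

If the model skips or repeats a token, the dict will simply be missing a key or keep the last value, so real outputs would need some validation on top of this.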

In [None]:
train, val = train_test_split(corpus, train_size=0.85, shuffle=True, random_state=42)

In [None]:
len(train), len(val)

In [None]:
# with open("../data/training/game_train.txt", "w") as f:
#     f.write("".join(train))

In [None]:
# with open("../data/training/game_val.txt", "w") as f:
#     f.write("".join(val))

# Testing the Multi-Class Model

In [26]:
df.description.apply(lambda x: len(x.split())).describe()

count    27048.000000
mean       215.238317
std        173.034320
min          1.000000
25%        114.000000
50%        174.000000
75%        266.000000
max       8376.000000
Name: description, dtype: float64

Next, I implement a few helper functions to make loading the model and generating text easier going forward.

In [9]:
def load_tokenizer_and_model(model_path: str) -> Tuple[GPT2TokenizerFast, GPT2LMHeadModel]:
    """Load the fine-tuned tokenizer and model stored at model_path."""
    tokenizer = GPT2TokenizerFast.from_pretrained(model_path)
    # Reuse the EOS token for padding, since GPT-2 has no dedicated pad token
    model = GPT2LMHeadModel.from_pretrained(model_path, pad_token_id=tokenizer.eos_token_id)
    return tokenizer, model

In [44]:
DEFAULT_LENGTHS = {
    NAME_TOKEN: (1, None),
    DEV_TOKEN: (1, None),
    PUB_TOKEN: (1, None),
    DESC_TOKEN: (200, None),
    GENRES_TOKEN: (1, None)
}

def generate_text(tokenizer: GPT2TokenizerFast, model: GPT2LMHeadModel, start_token: str, **gen_kwargs) -> str:
    """Generate a single output of text. A different function would be needed to batch generation."""
    
    prompt_encoded = tokenizer.encode(start_token, return_tensors="pt").to(model.device)
    prompt_length = prompt_encoded.shape[1]
    
    # DEFAULT_LENGTHS holds (min, max) counts of generated tokens; None means "no default".
    # Explicit min_length / max_length passed in gen_kwargs take precedence.
    min_new, max_new = DEFAULT_LENGTHS.get(start_token, (None, None))
    if min_new is not None:
        gen_kwargs.setdefault("min_length", min_new + prompt_length)
    if max_new is not None:
        gen_kwargs.setdefault("max_length", max_new + prompt_length)
    else:
        # Assumption: with no explicit cap, allow generation up to the model's context size.
        gen_kwargs.setdefault("max_length", model.config.n_positions)

    output = model.generate(
        prompt_encoded,
        do_sample=True, 
        top_k=50, 
        top_p=0.95,
        no_repeat_ngram_size=5,
        **gen_kwargs
    )
    output_decoded = tokenizer.decode(output[0])

    return output_decoded

In [45]:
tokenizer, model = load_tokenizer_and_model("../data/models/gpt2-all-15000/")

model.to("cuda")
model.eval()

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 1024)
    (wpe): Embedding(1024, 1024)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0): Block(
        (ln_1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (attn): Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (1): Block(
        (ln_1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (attn): Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2):

In [46]:
game = {}

for token in PROMPT_TOKENS:
    game[token] = generate_text(tokenizer, model, token)

In [None]:
game
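The decoded strings still contain the start token and the end token, so a small cleanup step is useful before displaying them. Here is a minimal sketch; `clean_generated` is a hypothetical helper, and the `raw_game` values are hand-written placeholders, not real model output.

```python
END_TOKEN = "<|endoftext|>"


def clean_generated(text, start_token, end_token=END_TOKEN):
    """Drop the leading prompt token and everything from the first end token on."""
    if text.startswith(start_token):
        text = text[len(start_token):]
    return text.split(end_token)[0].strip()


# Placeholder strings shaped like the decoded output (NOT real generations).
raw_game = {
    "<|name|>": "<|name|>Some Game Title<|endoftext|>",
    "<|genres|>": "<|genres|>Action<|endoftext|><|endoftext|>",
}

clean_game = {
    token.strip("<|>"): clean_generated(text, token)
    for token, text in raw_game.items()
}
print(clean_game)
```

Cutting at the first end token also discards any padding the model appended after it.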

In [7]:
del tokenizer
del model