# Developing the Text Generation Model

First of all, the text generation approach I'll be using requires a prompt. For that I'll take the first 10 words of each description in the 27k+ rows of data I have and use those as prompts. I'm okay with 10 words being "predetermined".

In [1]:
import random
from typing import Tuple

import pandas as pd
import torch
from sklearn.model_selection import train_test_split
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Seed Python's RNG so random.choice picks the same prompt on every run
random.seed(42)

In [2]:
NAME_TOKEN = "<|name|>"
DEV_TOKEN = "<|developer|>"
PUB_TOKEN = "<|publisher|>"
DESC_TOKEN = "<|description|>"
GENRES_TOKEN = "<|genres|>"
GAME_TOKEN = "<|game|>"
END_TOKEN = "<|endoftext|>"

In [3]:
PROMPT_TOKENS = [NAME_TOKEN, DEV_TOKEN, PUB_TOKEN, DESC_TOKEN, GENRES_TOKEN]

In [4]:
torch.cuda.is_available()

True

In [5]:
df = pd.read_csv("../data/datasetv2.csv").dropna()

In [6]:
df.head()

Unnamed: 0,appid,name,developer,publisher,genres,description,header_image
0,10,Counter-Strike,Valve,Valve,Action,Play the world's number 1 online action game. ...,https://steamcdn-a.akamaihd.net/steam/apps/10/...
1,20,Team Fortress Classic,Valve,Valve,Action,One of the most popular online action games of...,https://steamcdn-a.akamaihd.net/steam/apps/20/...
2,30,Day of Defeat,Valve,Valve,Action,Enlist in an intense brand of Axis vs. Allied ...,https://steamcdn-a.akamaihd.net/steam/apps/30/...
3,40,Deathmatch Classic,Valve,Valve,Action,Enjoy fast-paced multiplayer gaming with Death...,https://steamcdn-a.akamaihd.net/steam/apps/40/...
4,50,Half-Life: Opposing Force,Gearbox Software,Valve,Action,Return to the Black Mesa Research Facility as ...,https://steamcdn-a.akamaihd.net/steam/apps/50/...


In [7]:
def load_tokenizer_and_model(model_path: str) -> Tuple[GPT2TokenizerFast, GPT2LMHeadModel]:
    """Load the tokenizer and model saved under `model_path`."""
    # Use the argument rather than a hardcoded directory
    tokenizer = GPT2TokenizerFast.from_pretrained(model_path)
    # GPT-2 has no pad token, so reuse the end-of-text token for padding
    model = GPT2LMHeadModel.from_pretrained(model_path, pad_token_id=tokenizer.eos_token_id)
    return tokenizer, model

In [43]:
DEFAULT_LENGTHS = {
    NAME_TOKEN: (1, 10),
    DEV_TOKEN: (1, 10),
    PUB_TOKEN: (1, 10),
    DESC_TOKEN: (200, 400),
    GENRES_TOKEN: (1, 20)
}

def generate_text(tokenizer: GPT2TokenizerFast, model: GPT2LMHeadModel, start_token: str, prompt=None, **gen_kwargs) -> str:
    """Generate a single output of text. A different function would be needed to batch generation."""
    
    if prompt is None:
        prompt_encoded = tokenizer.encode(start_token, return_tensors="pt").to(model.device)
    else:
        prompt_encoded = tokenizer.encode(start_token + prompt, return_tensors="pt").to(model.device)
        
    # min_length/max_length count the prompt tokens too, so offset both by the prompt length
    default_length = DEFAULT_LENGTHS.get(start_token)
    if default_length is not None:
        prompt_length = prompt_encoded.shape[1]
        gen_kwargs["min_length"] = default_length[0] + prompt_length
        gen_kwargs["max_length"] = default_length[1] + prompt_length

    output = model.generate(
        prompt_encoded,
        do_sample=True, 
        top_k=50, 
        top_p=0.95,
        no_repeat_ngram_size=5,
        **gen_kwargs
    )
    # Decode the single generated sequence (special tokens are kept in the text)
    return tokenizer.decode(output[0])
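The length handling in `generate_text` can be illustrated without loading a model. This is a minimal sketch, assuming (as is the case for decoder-only models in Transformers) that `min_length` and `max_length` include the prompt tokens; `length_bounds` is a hypothetical helper, not part of the notebook.

```python
# Mirrors two entries of DEFAULT_LENGTHS: (min, max) newly generated tokens per field.
DEFAULT_LENGTHS = {
    "<|name|>": (1, 10),
    "<|description|>": (200, 400),
}


def length_bounds(start_token, prompt_token_count):
    """Return (min_length, max_length) including the prompt tokens, or None."""
    bounds = DEFAULT_LENGTHS.get(start_token)
    if bounds is None:
        # Unknown start token: leave the generation defaults untouched
        return None
    low, high = bounds
    return low + prompt_token_count, high + prompt_token_count


# A description prompt that encodes to 12 tokens
print(length_bounds("<|description|>", 12))
```

Without the offset, a long prompt could use up most of the length budget before any new text is generated.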

# Loading and Testing the Model

Hugging Face Transformers makes it really easy to load pretrained models.

In [7]:
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id).to("cuda")

In [8]:
prompts = df.description.apply(lambda x: " ".join(x.split()[:10]))
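As a small illustration of what the lambda above does to one description, here is a sketch using plain strings; `first_words` is a hypothetical name used only for this example.

```python
def first_words(text, n=10):
    """Keep the first n whitespace-separated words of text."""
    # str.split() with no arguments also collapses newlines and repeated spaces
    return " ".join(text.split()[:n])


description = "Play the   world's number 1 online\naction game. Engage in an incredibly realistic brand of warfare."
prompt = first_words(description)
print(prompt)
```

Descriptions shorter than 10 words are returned whole, so the prompt can occasionally be the entire description.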

In [9]:
prompt = random.choice(prompts)
prompt_encoded = tokenizer.encode(prompt, return_tensors="pt").to("cuda")

output = model.generate(
    prompt_encoded,
    do_sample=True, 
    max_length=500, 
    top_k=50, 
    top_p=0.95,
    no_repeat_ngram_size=5,
)
output_decoded = tokenizer.decode(output[0])


print(prompt)
print(output_decoded)

《AngelShooter》 is a high difficulty bitmap horizontal-scroll shooting game. There
《AngelShooter》 is a high difficulty bitmap horizontal-scroll shooting game. There are 3 levels, Each level has three enemy types. The first game is the easy, middle and difficult levels. This game involves some unique mechanics which are difficult as to not kill the enemy. The easiest level is the "Challenging" level which consists of only three enemies (each with four HP) and one monster. The hardest level has more than just enemies, it's also about the death or death of a person.

The first enemy you encounter is Angel Shooter's son. He is a small boy with a purple mustache and an arrow protruding from his mouth. The only possible way to defeat him is to make use of a fire attack and shoot the monster in front of Angel Shooter's face. The second game has Angel Shooter's daughter, who has a purple smile on her face. The last game has Angel Shilla, a red haired girl with a blue hair. The third game has An

In [10]:
del tokenizer
del model

# Fine-Tuning the Model to Generate Video Game Titles (Names)

The output already looks fantastic, but let's fine-tune the model to get even better results.

I'm going to work on the video game name generation first as a proof of concept. We can use a special token, for example `<|name|>`, as the prompt, so no words are needed to prompt the title generation.
 
See: https://towardsdatascience.com/natural-language-generation-part-2-gpt-2-and-huggingface-f3acb35bc86a

So all we really need to do is wrap each example in the prompt token (for this task, `<|name|>`) and the end-of-text token, `<|endoftext|>`, which is built into the pretrained tokenizer, and then fine-tune the pretrained model.

In [11]:
# Example formatted name
name = df.name[0]
formatted_name = f"{NAME_TOKEN}{name}{END_TOKEN}"
print(formatted_name)

<|name|>Counter-Strike<|endoftext|>


In [12]:
def save_formatted(file, list_of_texts, start_token, end_token):
    for text in list_of_texts:
        formatted_text = f"{start_token}{text}{end_token}"
        file.write(formatted_text)
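A quick way to see what `save_formatted` writes, without touching the disk, is to point it at an in-memory buffer. This is only a sketch; the function is repeated here so the example runs on its own, and the two names are sample values.

```python
import io


def save_formatted(file, list_of_texts, start_token, end_token):
    for text in list_of_texts:
        file.write(f"{start_token}{text}{end_token}")


buffer = io.StringIO()
save_formatted(buffer, ["Counter-Strike", "Day of Defeat"], "<|name|>", "<|endoftext|>")
written = buffer.getvalue()
print(written)
```

The examples are written back to back with no newline between them, so the end token is the only separator in the training file.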

In [13]:
# Split our data into train and validation
train, validation = train_test_split(df.name, train_size=0.85, random_state=42)

print("train count:", train.count())
print("validation count:", validation.count())

train count: 22990
validation count: 4058
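The counts above are consistent with the training share being rounded down and the remainder going to validation. A minimal sketch of that arithmetic, assuming this is how `train_test_split` resolves a fractional `train_size` (`split_sizes` is a hypothetical helper):

```python
import math


def split_sizes(n_samples, train_fraction=0.85):
    """Return (n_train, n_validation) for a fractional train size."""
    # 0.85 * 27048 = 22990.8, which is rounded down to 22990
    n_train = math.floor(train_fraction * n_samples)
    return n_train, n_samples - n_train


print(split_sizes(27048))
```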


In [14]:
# with open("../data/training/name_train.txt", "w") as f:
#     save_formatted(f, train, NAME_TOKEN, END_TOKEN)

In [15]:
# with open("../data/training/name_val.txt", "w") as f:
#     save_formatted(f, validation, NAME_TOKEN, END_TOKEN)

# Testing the Fine-tuned Name Model

I used a GPU cloud provider to fine-tune the pretrained GPT2-medium model using the `scripts/train-name.sh` script.

In [16]:
tokenizer = GPT2TokenizerFast.from_pretrained("../data/models/gpt2-name")
model = GPT2LMHeadModel.from_pretrained("../data/models/gpt2-name", pad_token_id=tokenizer.eos_token_id).to("cuda")

In [17]:
prompt_encoded = tokenizer.encode(NAME_TOKEN, return_tensors="pt").to("cuda")

output = model.generate(
    prompt_encoded,
    do_sample=True, 
    max_length=500, 
    top_k=50, 
    top_p=0.95,
    no_repeat_ngram_size=5,
)
output_decoded = tokenizer.decode(output[0])


print(output_decoded)

<|name|>Nova Jump<|endoftext|>


This looks great. Next, I'll train a model that can generate each type of text we need.

In [18]:
del tokenizer
del model

# Compiling the Dataset for All Types of Generation

We will have one unified dataset whose text uses the tokens `<|name|>`, `<|developer|>`, `<|publisher|>`, `<|description|>`, `<|genres|>`, and of course `<|endoftext|>`.

Each example will consist of one of the class tokens, then some text, and then the end token. It would be nice to be able to generate all the text for a game with a single `<|game|>` token, so I'll also test training a model to do that.

In [19]:
columns = {
    NAME_TOKEN: df.name,
    DEV_TOKEN: df.developer,
    PUB_TOKEN: df.publisher,
    DESC_TOKEN: df.description,
    GENRES_TOKEN: df.genres
}

## Separate Token Dataset

Here I keep only the unique values of each column. I don't need the outputs to mirror how often each value occurs in the dataset; I'd rather the training data represent the range of values that appear.

In [20]:
corpus = []
labels = []

for start_token, col in columns.items():
    values = col.unique().tolist()
    corpus.extend(
        [f"{start_token}{value}{END_TOKEN}" for value in values]
    )
    labels.extend(
        [start_token for _ in values]
    )
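To illustrate what deduplicating changes, here is a toy sketch with made-up publisher values: a value that dominates the raw column appears only once in the corpus.

```python
from collections import Counter

# Toy column: one publisher appears far more often than the other
publishers = ["Valve", "Valve", "Valve", "Gearbox Software"]

raw_counts = Counter(publishers)
# dict.fromkeys keeps the first occurrence of each value, in order of appearance
unique_values = list(dict.fromkeys(publishers))

print(raw_counts)
print(unique_values)
```

With the raw column the model would see "Valve" three times as often as "Gearbox Software"; after deduplicating, each value carries equal weight.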

In [21]:
len(corpus), len(labels)

(87027, 87027)

The length of the corpus makes sense: we have ~27k examples and 5 columns, and three of those columns (developer, publisher, and genres) contain many duplicate values.

In [22]:
expected_length_corpus = 0
for _, v in columns.items():
    print(v.name, v.count(), len(v.unique()))
    expected_length_corpus += len(v.unique())

print("expected_length_corpus:", expected_length_corpus)

name 27048 27006
developer 27048 17095
publisher 27048 14344
description 27048 27030
genres 27048 1552
expected_length_corpus: 87027
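The unique counts printed above also determine the share of each class in the corpus, which is what a stratified split should preserve. A small sketch of that arithmetic, with the counts copied from the output above:

```python
# Unique value counts per column, copied from the cell output above
unique_counts = {
    "name": 27006,
    "developer": 17095,
    "publisher": 14344,
    "description": 27030,
    "genres": 1552,
}

total = sum(unique_counts.values())
expected_share = {field: count / total for field, count in unique_counts.items()}

print("total:", total)
for field, share in sorted(expected_share.items(), key=lambda item: -item[1]):
    print(f"{field:<12}{share:.6f}")
```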


Here we split the corpus into the train and validation sets. Notice that we stratify using the labels to get a proportional number of each class in each set.

In [23]:
train, val = train_test_split(corpus, train_size=0.85, shuffle=True, stratify=labels, random_state=42)

In [24]:
len(train), train[:5]

(73972,
 ['<|developer|>Seattletek<|endoftext|>',
  '<|name|>Pen Island VR<|endoftext|>',
  "<|description|>What is Deadly Escape?Deadly Escape is a small episodic Survival Horror game where you have to beat each chapter with one life, inspired by survival horror games of the '90s.After the complex was raided by undead monsters, our wounded protagonist is abandoned inside the infirmary. He wakes up only to find that everyone has been massacred and the place is locked down. Now he has to trace the steps of the deceased that tried to escape before him, hoping to find a way out of this nightmare.FEATURES Old school survival horror action. Ration your ammo, every bullet counts ! Collect and use key items to advance deeper inside the complex. Explore and find new weapons to increase your odds of survival. Find documents, files, and notes from the deceased personal. Uncover what happened! Compete against other players via online leaderboards! Who will be the first survivor? Listen to an all 

In [25]:
len(val), val[:5]

(13055,
  '<|name|>Galaxy of Pen & Paper +1<|endoftext|>',
  "<|description|>In Bow to Blood you will compete to become Champion of the Arena, as its inscrutable Overseers test you and your fellow challengers in a winner-takes-all reality show.TAKE COMMANDStand at the bridge of your mighty airship, pilot your vessel and fire your cannons. A wide variety of opponents will keep you busy as you employ split-second tactics to stay on top of each new situation. Your ship also comes equipped with an array of powerful weapons, each with their own specialty.Order your crew between various stations to strategize and customize your ship's strengths for every situation. When assigned to a station, your crew can use their expertise to push it beyond its normal operating capabilities. Shields provide an extra layer of protection, the turret gives you another gunner, drone control unleashes powerful automatons that menace your opponents, engines provide a significant speed boost, and sensors highlig

Here is a horribly inefficient and hacky way of validating the stratification.

In [26]:
pd.Series(train).apply(lambda x: x.split("<|")[1].split("|>")[0]).value_counts(normalize=True)

description    0.310590
name           0.310320
developer      0.196439
publisher      0.164819
genres         0.017831
dtype: float64

In [27]:
pd.Series(val).apply(lambda x: x.split("<|")[1].split("|>")[0]).value_counts(normalize=True)

description    0.310609
name           0.310303
developer      0.196400
publisher      0.164841
genres         0.017848
dtype: float64
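A slightly less hacky version of the same check could pull the leading token out with a regular expression and count the labels. This is a sketch on a few sample strings, assuming every text starts with a `<|...|>` token; `label_shares` is a hypothetical helper.

```python
import re
from collections import Counter

# Captures the label inside the leading <|...|> token
TOKEN_PATTERN = re.compile(r"^<\|(\w+)\|>")


def label_shares(texts):
    """Return the fraction of texts that start with each class token."""
    # Assumes every text starts with a token; a missing token would raise AttributeError
    counts = Counter(TOKEN_PATTERN.match(text).group(1) for text in texts)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


sample = [
    "<|name|>Pen Island VR<|endoftext|>",
    "<|developer|>Seattletek<|endoftext|>",
    "<|name|>Nova Jump<|endoftext|>",
    "<|genres|>Action<|endoftext|>",
]
shares = label_shares(sample)
print(shares)
```

Another option would be to pass `labels` as a second array to `train_test_split`, which returns the split labels alongside the split texts, so no parsing is needed at all.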

Close enough :) Now we just save our train and val sets to two text files again and train a new model with this data.

In [28]:
# with open("../data/training/all_train.txt", "w") as f:
#     f.write("".join(train))

In [29]:
# with open("../data/training/all_val.txt", "w") as f:
#     f.write("".join(val))

## Dataset for One-Shot Generation

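Now I'll create another corpus that wraps the fields of each game into a single example, which should allow us to generate games with more cohesive attributes. The cell below includes the name, developer, publisher, and description, but not the genres. I also don't expect there to be too many duplicate games in the dataset, so I haven't removed any for this dataset.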

I'm less optimistic about this approach, but we'll see what happens.

In [30]:
corpus = []

for _, row in df.iterrows():
    # row.name would return the row's index label, so the name column needs bracket access
    game_text = (f"{GAME_TOKEN}{NAME_TOKEN}{row['name']}{DEV_TOKEN}{row.developer}"
                 f"{PUB_TOKEN}{row.publisher}{DESC_TOKEN}{row.description}{END_TOKEN}")
    corpus.append(game_text)

In [31]:
corpus[100:105]

["<|game|><|name|>Bejeweled Twist<|developer|>PopCap Games, Inc.<|publisher|>PopCap Games, Inc.<|description|>Spin, match, explode... WOW! It's a brilliant new way to play Bejeweled! Get set for a vivid sensory rush as you spin and match explosive gems for shockwaves of fun. Rotate jewels freely to set up electrifying combos, outwit surprising obstacles like Locks and Bombs, and create high-voltage Flame and Lightning gems. When you need to dial up the intensity or fine-tune your skills, turn to Challenge mode or five-minute Blitz. And if relaxing is more your style, kick up your feet with stress-free Zen. No matter the mode, you'll discover new strategies, improve your moves, and find endless ways to win! Use the revolutionary Gem Rotator to move gems around and make matches anywhere on the board Relax, rev up, refocus or recharge in four game modes — Classic, Zen, Challenge and Blitz Clear away Locks and Bombs for magnificent bonuses Create Flame and Lightning power gems, then blast 

In [32]:
train, val = train_test_split(corpus, train_size=0.85, shuffle=True, random_state=42)

In [33]:
len(train), len(val)

(22990, 4058)

In [34]:
# with open("../data/training/game_train.txt", "w") as f:
#     f.write("".join(train))

In [35]:
# with open("../data/training/game_val.txt", "w") as f:
#     f.write("".join(val))
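Text generated in this one-shot format has to be split back into fields before it can be used. The following is a minimal sketch of one way to do that with a regular expression, assuming the output follows the same token order as the corpus; `parse_game` and `FIELD_TOKENS` are hypothetical names.

```python
import re

# Field tokens used in the one-shot format, mapped to plain field names
FIELD_TOKENS = {
    "<|name|>": "name",
    "<|developer|>": "developer",
    "<|publisher|>": "publisher",
    "<|description|>": "description",
}
FIELD_PATTERN = re.compile("(" + "|".join(re.escape(token) for token in FIELD_TOKENS) + ")")


def parse_game(text):
    """Split one '<|game|>...' string into a dict of fields."""
    # Drop the end token and anything generated after it
    text = text.split("<|endoftext|>")[0]
    # With a capturing group, re.split returns [prefix, token, value, token, value, ...]
    parts = FIELD_PATTERN.split(text)
    return {FIELD_TOKENS[token]: value.strip() for token, value in zip(parts[1::2], parts[2::2])}


example = ("<|game|><|name|>Bejeweled Twist<|developer|>PopCap Games, Inc."
           "<|publisher|>PopCap Games, Inc.<|description|>Spin, match, explode...<|endoftext|>")
parsed = parse_game(example)
print(parsed)
```

If the model skips a field, that key is simply absent from the result, which makes incomplete generations easy to detect.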

# Testing the Multi-Class Model

To make this easier going forward, I wrote the helper functions `load_tokenizer_and_model` and `generate_text`, defined near the top of the notebook.

In [9]:
tokenizer, model = load_tokenizer_and_model("../data/models/gpt2-all-15000/")

model.to("cuda")
model.eval()

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 1024)
    (wpe): Embedding(1024, 1024)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0): Block(
        (ln_1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (attn): Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (1): Block(
        (ln_1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (attn): Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2):

In [46]:
game = {}

for token in PROMPT_TOKENS:
    if token == DESC_TOKEN:
        # Seed the description with the generated name, followed by "is a " or "is an "
        name = game[NAME_TOKEN].split(NAME_TOKEN)[-1].split(END_TOKEN)[0]
        article = random.choice(("a", "an"))
        game[token] = generate_text(tokenizer, model, DESC_TOKEN, f"{name} is {article} ")
    else:
        game[token] = generate_text(tokenizer, model, token)

In [47]:
game

{'<|name|>': '<|name|>Hue and Sorrow<|endoftext|>',
 '<|developer|>': '<|developer|>Furiously Independent Pty Ltd<|endoftext|>',
 '<|publisher|>': '<|publisher|>Netherwind Studios;Digital Extremes<|endoftext|>',
 '<|description|>': '<|description|>Hue and Sorrow is a vernacular adventure game set in a post-nuclear, post-societal future. Players take on the role of a young girl called Riley, who must learn to navigate her strange and magical world. Riley’s family was murdered, and the world outside of her family’s home has changed dramatically in the past three years. However, Riley’s father is able to repair many of his damaged relationships, and even begins to rebuild his home. Riley must navigate this post-nuclear, nuclear world to find her family, save the entire nation, and find the root of her evil. She must also discover who or what is trying to kill her.Key FeaturesPlay as Riley: a young and inquisitive soul who must learn to explore the world and to decipher its many secrets. N
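The generated strings still carry their start and end tokens. A minimal sketch of stripping them, using two of the values from the output above; `clean_field` is a hypothetical helper.

```python
import re

END_TOKEN = "<|endoftext|>"
# Matches tokens such as <|name|> or <|developer|>
SPECIAL_TOKEN = re.compile(r"<\|[a-z]+\|>")


def clean_field(raw):
    """Remove the start token, the end token, and anything after the end token."""
    text = raw.split(END_TOKEN)[0]
    return SPECIAL_TOKEN.sub("", text).strip()


game = {
    "<|name|>": "<|name|>Hue and Sorrow<|endoftext|>",
    "<|developer|>": "<|developer|>Furiously Independent Pty Ltd<|endoftext|>",
}
# "<|name|>".strip("<|>") leaves just "name"
clean = {token.strip("<|>"): clean_field(text) for token, text in game.items()}
print(clean)
```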

In [41]:
del tokenizer
del model