This notebook contains code to train an LSTM RNN on a corpus of long descriptions from the Steam game store and to generate new descriptions.

Based on this blog post: http://machinelearningmastery.com/text-generation-lstm-recurrent-neural-networks-python-keras/

In [1]:
import dataset
import keras
import numpy as np
import os
from tqdm import tqdm
from pathlib import Path

Using Theano backend.
Using cuDNN version 6021 on context None
Mapped name None to device cuda: GeForce GTX 970 (0000:01:00.0)


In [2]:
db = dataset.connect(os.environ['POSTGRES_URI'])

Pull texts from our database. Limit the number of descriptions pulled so we don't run out of memory, and order randomly so the limited set is a random sample.

In [34]:
description_query = '''
SELECT long_description
FROM game_crawl
WHERE is_dlc = FALSE
  AND game_name IS NOT NULL
ORDER BY random()
LIMIT 1000
'''

corpus = [r['long_description'] for r in db.query(description_query)]
print(corpus[:1])

['ABOUT THIS GAME\n\nOnly One Hope - is a game in the genre of survival horror, where the main character gets into a shipwreck. Regaining consciousness he was on the island. His only hope is to learn how to survive and return to his home!\n\n\nForget everything about your simple past life!\nNow you will have to overcome a lot of different problems: hunger, thirst, cold and darkness...\nThe player is given the opportunity to travel over the vast territories of using a raft. It can be built on an island in its own "residence" after which he would 100% return to his island.\n\nThe game has an elaborate system of crafting, also you have to improve your skills to create more complex things.\nBuild your house, plant a tree, raise a son! In Only One Hope you have only wild islands, and everyone of them can become you new home!\n\nFeatures\n->Unique crafting system\n->Big open world\n->Smart NPC\n->Building system\nAnd more!']


Apply cleaning to help the model out. For now this only lowercases the text, which shrinks the character vocabulary.

In [35]:
def clean_description(description):
    return (description.lower())

cleaned_corpus = [clean_description(d) for d in corpus]
del corpus

Create a mapping of unique chars to integers

In [36]:
joined_corpus = ' '.join(cleaned_corpus)
del cleaned_corpus
chars = sorted(list(set(joined_corpus)))
print(chars)
char_to_int = dict((c, i) for i, c in enumerate(chars))

['\n', ' ', '!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '{', '|', '}', '~', '\x80', '\x81', '\x8d', '\x92', '\x93', '\x94', '\x97', '\x99', '\x9d', '\x9e', '¢', '¥', '©', '«', '¬', '\xad', '®', '°', '²', '´', '·', '¹', '»', '×', 'á', 'â', 'ä', 'å', 'ç', 'è', 'é', 'ê', 'í', 'ñ', 'ó', 'ô', 'ö', 'ü', 'ğ', 'ı', '́', '̕', '̖', '̗', '̙', '̛', '̜', '̝', '̞', '̟', '̠', '̡', '̢', '̣', '̤', '̥', '̦', '̧', '̩', '̪', '̫', '̬', '̭', '̮', '̯', '̰', '̱', '̲', '̳', '̴', '̵', '̷', '̸', '̹', '̺', '̻', '̼', '̀', '́', 'ͅ', '͇', '͈', '͉', '͍', '͎', '͏', '͓', '͔', '͕', '͖', '͘', '͙', '͚', '͜', '͝', '͞', '͟', '͠', '͡', '͢', ';', 'α', 'η', 'ι', 'λ', 'υ', 'а', 'б', 'в', 'г', 'д', 'е', 'ж', 'з', 'и', 'й', 'к', 'л', 'м', 'н', 'о', 'п', 'р', 'с',
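As a small illustration of the mapping above, the sketch below builds the same kind of lookup on a toy string and checks that encoding followed by decoding returns the original text. The names `sample_text`, `sample_chars`, `sample_char_to_int` and `sample_int_to_char` are made up for this example and are not part of the notebook.

```python
# Toy text standing in for joined_corpus (hypothetical sample, not real data).
sample_text = "big open world"

# Same construction as the notebook: sorted unique characters -> integer ids.
sample_chars = sorted(set(sample_text))
sample_char_to_int = {c: i for i, c in enumerate(sample_chars)}
# Reverse lookup, used later to turn predicted ids back into characters.
sample_int_to_char = {i: c for i, c in enumerate(sample_chars)}

encoded = [sample_char_to_int[c] for c in sample_text]
decoded = ''.join(sample_int_to_char[i] for i in encoded)

print(sample_chars)
print(encoded)
print(decoded)
```

Because the characters are sorted before being numbered, the ids are stable for a given corpus, but they change whenever the random sample changes. Saved weights are therefore only meaningful together with the `chars` list they were trained with.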

Total number of characters in the corpus.

In [37]:
n_chars = len(joined_corpus)
print(n_chars)

1290167


Total number of distinct characters in the vocabulary.

In [38]:
n_vocab = len(chars)
print(n_vocab)

455


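Prepare the dataset of input/output pairs encoded as integers. Each input is a window of seq_length consecutive characters, and the output is the single character that follows it.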

In [40]:
seq_length = 100
data_x = []
data_y = []
for i in tqdm(range(0, n_chars - seq_length)):
    start = i
    end = i + seq_length
    seq_in = joined_corpus[start:end]
    seq_out = joined_corpus[end]
    data_x.append([char_to_int[char] for char in seq_in])
    data_y.append(char_to_int[seq_out])
n_patterns = len(data_x)
print(n_patterns)
del joined_corpus

100%|██████████| 1290067/1290067 [00:11<00:00, 109104.80it/s]

1290067
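To make the windowing concrete, here is a minimal sketch of the same loop on a ten-character string with a window of 4 instead of 100. The `text`, `windows` and `targets` names are illustrative only. It keeps the raw characters rather than integer ids so the pairs are easy to read.

```python
text = "open world"   # toy stand-in for joined_corpus
seq_length = 4        # the notebook uses 100

windows = []
targets = []
# One pair per start position; the last window must leave room for a target.
for i in range(len(text) - seq_length):
    windows.append(text[i:i + seq_length])
    targets.append(text[i + seq_length])

for window, target in zip(windows, targets):
    print(repr(window), '->', repr(target))
```

A text of length n yields n - seq_length pairs, which matches the 1290067 patterns reported for the 1290167-character corpus above.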





Reshape the X array to be [samples, time steps, features], normalize, and one-hot encode the output

In [41]:
def transform_text_samples(text_samples, n_patterns, seq_length):
    return np.reshape(text_samples, (n_patterns, seq_length, 1)) / float(n_vocab)

X = transform_text_samples(data_x, n_patterns, seq_length)
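# Pass num_classes explicitly so the output width always equals the vocabulary size,
# even if the highest character id never occurs as a target.
y = keras.utils.np_utils.to_categorical(data_y, num_classes=n_vocab)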
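The shapes involved are easier to see on toy data. The sketch below repeats the reshape and scaling on two tiny windows and builds the one-hot targets by indexing an identity matrix. That is a plain NumPy stand-in for `to_categorical`, not the Keras implementation, and all values here are made up for illustration.

```python
import numpy as np

n_vocab = 5                        # toy vocabulary size (the notebook's is 455)
seq_length = 3                     # toy window length (the notebook's is 100)
data_x = [[0, 1, 2], [1, 2, 3]]    # two encoded input windows
data_y = [3, 4]                    # id of the character following each window

# [samples, time steps, features], with ids scaled into the range [0, 1).
X = np.reshape(data_x, (len(data_x), seq_length, 1)) / float(n_vocab)

# One-hot targets: row k of the identity matrix encodes class k.
y = np.eye(n_vocab)[data_y]

print(X.shape, y.shape)
print(X[0].ravel())
print(y)
```

Note that dividing ids by the vocabulary size feeds the network a single scalar per character, which imposes an arbitrary ordering on the characters. It follows the blog post this notebook is based on.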

Define the model

In [42]:
model = keras.models.Sequential()
model.add(keras.layers.LSTM(256, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
model.add(keras.layers.Dropout(0.2))
model.add(keras.layers.LSTM(256))
model.add(keras.layers.Dropout(0.2))
model.add(keras.layers.Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

checkpoint_path = Path('models', 'weights-improvement-{epoch:02d}-{loss:.4f}.hdf5')
checkpoint = keras.callbacks.ModelCheckpoint(str(checkpoint_path), monitor='loss',
                                             verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]
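The checkpoint path is a format template whose placeholders Keras fills in from the epoch number and the logged metrics each time it saves. As a rough illustration of what the resulting file names look like, the sketch below formats the same template by hand. The epoch and loss values are just sample inputs chosen to match a file name used later in this notebook.

```python
from pathlib import Path

# Same template as above; {epoch:02d} zero-pads to two digits,
# {loss:.4f} keeps four decimal places.
template = str(Path('models', 'weights-improvement-{epoch:02d}-{loss:.4f}.hdf5'))

# Sample values only; during training Keras supplies these itself.
example_name = template.format(epoch=4, loss=2.8328)
print(example_name)
```

With `save_best_only=True` and `monitor='loss'`, a file is written only when the training loss improves on the best value seen so far.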

In [43]:
model.fit(X, y, epochs=20, batch_size=128, callbacks=callbacks_list)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
  86016/1290067 [=>............................] - ETA: 562s - loss: 2.7994

KeyboardInterrupt: 

Load the weights from a saved checkpoint (training loss 2.8328) and generate some text

In [48]:
filename = Path('models', 'weights-improvement-04-2.8328.hdf5')
model.load_weights(str(filename))
model.compile(loss='categorical_crossentropy', optimizer='adam')

Generate a reverse mapping for ints to chars

In [49]:
int_to_char = dict((i, c) for i, c in enumerate(chars))

Generate predictions from a seed sequence

In [50]:
start = np.random.randint(0, len(data_x))  # upper bound is exclusive
pattern = list(data_x[start])  # copy so generation does not mutate data_x
print("Seed:\n{}".format(''.join([int_to_char[value] for value in pattern])))
num_generated_chars = 1000

generated_str = ''

for i in range(num_generated_chars):
    x = transform_text_samples(pattern, 1, len(pattern))
    prediction = model.predict(x, verbose=0)
    # Greedy decoding: always take the most probable next character.
    index = int(np.argmax(prediction))
    generated_str += int_to_char[index]
    # Slide the window: drop the oldest character and append the new one.
    pattern = pattern[1:] + [index]
    
print("Result:\n{}".format(generated_str))

Seed:
ere are over 26 classes from which to choose, each with its own unique active ability and passive bo
Result:
 toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe toe 
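The repeated output is typical of greedy decoding with `np.argmax`: once the window contains only generated text, the same most-likely character sequence keeps being chosen. One common alternative is to sample the next character from the predicted distribution, optionally sharpened or flattened by a temperature. The sketch below is a minimal version of that idea. `sample_index` and its `temperature` and `rng` parameters are hypothetical names introduced here, and nothing in this notebook shows that sampling improves the text for this particular model.

```python
import numpy as np

def sample_index(probabilities, temperature=1.0, rng=None):
    """Draw a character id from a softmax output instead of taking argmax."""
    if rng is None:
        rng = np.random.default_rng()
    probs = np.asarray(probabilities, dtype=np.float64).ravel()
    # Rescale in log space; the small epsilon avoids log(0).
    logits = np.log(probs + 1e-12) / temperature
    logits -= logits.max()          # subtract the max for numerical stability
    scaled = np.exp(logits)
    scaled /= scaled.sum()          # renormalize to a valid distribution
    return int(rng.choice(len(scaled), p=scaled))

# Example with a made-up three-character distribution.
example_probs = [0.1, 0.7, 0.2]
example_rng = np.random.default_rng(0)
print([sample_index(example_probs, 1.0, example_rng) for _ in range(10)])
```

In the generation loop, `index = sample_index(prediction, temperature=0.8)` would replace the `np.argmax` call. Temperatures near 0 approach greedy decoding, and values above 1 make the output more random.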