# Model

This notebook focuses on building and training an embedding model with Keras. 

**The idea and parts of the code were taken from [this](https://towardsdatascience.com/building-a-recommendation-system-using-neural-network-embeddings-1ef92e5c80c9) Medium post!**

In [1]:
import json
import os
import random

import numpy as np

In [2]:
os.chdir('..')

In [3]:
random.seed(42)
np.random.seed(42)

Load in the filtered data

In [4]:
with open('games_with_tags_double_filter.json', 'r') as in_json:
    games_with_tags = json.load(in_json)

Generate mappings from games to indices, and likewise for tags:

In [7]:
game_index = {game: idx for idx, game in enumerate(games_with_tags)}
index_game = {idx: game for game, idx in game_index.items()}

In [8]:
tag_count = 0
tag_index = {}
for game, tags in games_with_tags.items():
    for tag in tags:
        if tag not in tag_index:
            tag_index[tag] = tag_count
            tag_count += 1
index_tag = {idx: tag for tag, idx in tag_index.items()}

In [9]:
print(f'n game: {len(game_index)}')
print(f'n tags: {len(tag_index)}')

n game: 3027
n tags: 151


Create a list of true pairs, where a game actually has a given tag. Note that the `pairs` list stores the game index and the tag index, not the names.

In [10]:
pairs = []
for game, tags in games_with_tags.items():
    for tag in tags:
        pairs.append((game_index[game], tag_index[tag]))

Taking a look at the first pair:

In [11]:
pairs[0]

(0, 0)

Now, relate those indices back to a game name and tag name

In [12]:
index_game[pairs[0][0]], index_tag[pairs[0][1]]

('Counter-Strike', 'Action')

In [13]:
index_game[pairs[5000][0]], index_tag[pairs[5000][1]]

('GunZ 2: The Second Duel', 'Free to Play')

Create a set from the pairs. This set is used to check whether a game-tag relationship exists when building the negative training examples. Membership tests on a set use hashing, so `pair in pairs_set` should be faster than `pair in pairs`, which is a linear search over the list.

In [15]:
pairs_set = set(pairs)

In [16]:
(0, 0) in pairs_set

True
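To make the set-versus-list point concrete, here is a rough timing sketch. It uses hypothetical toy pairs (`toy_pairs`, `toy_pairs_set` and `missing` are names made up for this example, not part of the notebook's data), and the actual timings will vary by machine.

```python
import timeit

# Hypothetical toy data: 3,000 (game_idx, tag_idx) pairs
toy_pairs = [(g, t) for g in range(300) for t in range(10)]
toy_pairs_set = set(toy_pairs)

# A pair that is absent forces the list lookup to scan every element
missing = (-1, -1)

list_time = timeit.timeit(lambda: missing in toy_pairs, number=200)
set_time = timeit.timeit(lambda: missing in toy_pairs_set, number=200)

print(f'list lookup: {list_time:.5f}s  set lookup: {set_time:.5f}s')
```

Both lookups give the same answer; only the cost differs. This matters because the batch generator performs this check once per candidate negative example.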

Create a generator function to yield training batches as required:

In [17]:
random.seed(100)

def generate_batch(pairs, n_positive = 50, negative_ratio = 1.0, classification = False):
    """Generate batches of samples for training"""
    # Cast to int so a float ratio (such as the default 1.0) still gives a valid array shape
    batch_size = int(n_positive * (1 + negative_ratio))
    batch = np.zeros((batch_size, 3))
    
    # Adjust label based on task
    if classification:
        neg_label = 0
    else:
        neg_label = -1
    
    # This creates a generator
    while True:
        # randomly choose positive examples
        for idx, (game_id, tag_id) in enumerate(random.sample(pairs, n_positive)):
            batch[idx, :] = (game_id, tag_id, 1)

        # Increment idx by 1
        idx += 1
        
        # Add negative examples until reach batch size
        while idx < batch_size:
            
            # random selection
            random_game = random.randrange(len(game_index))
            random_tag = random.randrange(len(tag_index))
            
            # Check to make sure this is not a positive example
            if (random_game, random_tag) not in pairs_set:
                
                # Add to batch and increment index
                batch[idx, :] = (random_game, random_tag, neg_label)
                idx += 1
                
        # Make sure to shuffle order
        np.random.shuffle(batch)
        yield {'game': batch[:, 0], 'tag': batch[:, 1]}, batch[:, 2]

In [18]:
next(generate_batch(pairs, n_positive = 2, negative_ratio = 2))

({'game': array([ 637., 2151., 1775., 1863., 2998., 2889.]),
  'tag': array([ 72.,  41., 129.,  44.,  89., 100.])},
 array([ 1.,  1., -1., -1., -1., -1.]))

In [19]:
index_game[637]

'Counter-Strike: Source'

In [20]:
index_tag[72]

'Sandbox'

In [21]:
index_game[2151]

'Dungeon Fighter Online'

In [22]:
index_tag[41]

'RPG'

Have a look at a training batch:

In [23]:
x, y = next(generate_batch(pairs, n_positive = 5, negative_ratio = 1))

# Show a few example training pairs
for label, game_idx, tag_idx in zip(y, x['game'], x['tag']):
    print(f'Game: {index_game[game_idx]:30} Tag: {index_tag[tag_idx]:40} Label: {label}')

Game: Risen 3 - Titan Lords          Tag: Parkour                                  Label: -1.0
Game: The Technomancer               Tag: Early Access                             Label: -1.0
Game: Heroes & Generals              Tag: Simulation                               Label: 1.0
Game: PlanetSide 2                   Tag: First-Person                             Label: 1.0
Game: Armello                        Tag: Masterpiece                              Label: -1.0
Game: Icewind Dale: Enhanced Edition Tag: Comedy                                   Label: -1.0
Game: Prison Architect               Tag: Resource Management                      Label: 1.0
Game: Blast Zone! Tournament         Tag: Action                                   Label: 1.0
Game: Battle for the Galaxy          Tag: Comedy                                   Label: -1.0
Game: BATTALION 1944                 Tag: Action                                   Label: 1.0
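The negative sampling step inside `generate_batch` is a form of rejection sampling: draw a random game and a random tag, and keep the pair only if it is not a known positive. A minimal, self-contained sketch of just that step follows. The helper `sample_negatives` and the toy data are hypothetical and only illustrate the idea.

```python
import random

def sample_negatives(positive_set, n_games, n_tags, n_negative, rng):
    """Draw (game_idx, tag_idx) pairs that are not in positive_set."""
    negatives = []
    while len(negatives) < n_negative:
        candidate = (rng.randrange(n_games), rng.randrange(n_tags))
        # Reject candidates that are true game-tag pairs
        if candidate not in positive_set:
            negatives.append(candidate)
    return negatives

# Hypothetical toy data: 4 games, 3 tags, 5 true pairs
toy_positive_set = {(0, 0), (0, 1), (1, 2), (2, 0), (3, 1)}
rng = random.Random(0)
toy_negatives = sample_negatives(toy_positive_set, n_games=4, n_tags=3,
                                 n_negative=6, rng=rng)
print(toy_negatives)
```

As in the notebook's generator, the same negative pair can be drawn more than once. With roughly 3,000 games and 151 tags most random pairs are negatives, so few candidates should be rejected.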


In [24]:
from keras.layers import (
    Input,
    Embedding,
    Dot,
    Reshape,
    Dense
)
from keras.models import Model
import tensorflow as tf

Using TensorFlow backend.


In [25]:
tf.random.set_seed(42)

A function to build the embedding model

In [26]:
def game_embedding_model(embedding_size = 100, classification = False):
    """Model to embed game and tags using the functional API.
       Trained to discern if a tag is present for a game"""
    
    # Both inputs are 1-dimensional
    game = Input(name = 'game', shape = [1])
    tag = Input(name = 'tag', shape = [1])
    
    # Embedding the game (shape will be (None, 1, embedding_size))
    game_embedding = Embedding(name = 'game_embedding',
                               input_dim = len(game_index),
                               output_dim = embedding_size)(game)
    
    # Embedding the tag (shape will be (None, 1, embedding_size))
    tag_embedding = Embedding(name = 'tag_embedding',
                               input_dim = len(tag_index),
                               output_dim = embedding_size)(tag)
    
    # Merge the layers with a dot product along the second axis (shape will be (None, 1, 1))
    merged = Dot(name = 'dot_product', normalize = True, axes = 2)([game_embedding, tag_embedding])
    
    # Reshape to be a single number (shape will be (None, 1))
    merged = Reshape(target_shape = [1])(merged)
    
    # If classification, add an extra layer; the loss function is binary cross entropy
    if classification:
        merged = Dense(1, activation = 'sigmoid')(merged)
        model = Model(inputs = [game, tag], outputs = merged)
        model.compile(optimizer = 'Adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
    
    # Otherwise loss function is mean squared error
    else:
        model = Model(inputs = [game, tag], outputs = merged)
        model.compile(optimizer = 'Adam', loss = 'mse')
    
    return model

In [107]:
model = game_embedding_model(embedding_size=200)
model.summary()

Model: "model_6"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
game (InputLayer)               (None, 1)            0                                            
__________________________________________________________________________________________________
tag (InputLayer)                (None, 1)            0                                            
__________________________________________________________________________________________________
game_embedding (Embedding)      (None, 1, 200)       605400      game[0][0]                       
__________________________________________________________________________________________________
tag_embedding (Embedding)       (None, 1, 200)       30200       tag[0][0]                        
____________________________________________________________________________________________

In [108]:
n_positive = 1024

gen = generate_batch(pairs, n_positive, negative_ratio = 2)

# Train
h = model.fit_generator(
    gen,
    epochs = 20, 
    steps_per_epoch = len(pairs) // n_positive,
    verbose = 2
)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Epoch 1/20
 - 1s - loss: 0.9937
Epoch 2/20
 - 0s - loss: 0.9639
Epoch 3/20
 - 0s - loss: 0.9253
Epoch 4/20
 - 0s - loss: 0.8609
Epoch 5/20
 - 0s - loss: 0.7672
Epoch 6/20
 - 0s - loss: 0.6686
Epoch 7/20
 - 0s - loss: 0.5873
Epoch 8/20
 - 0s - loss: 0.5427
Epoch 9/20
 - 0s - loss: 0.5138
Epoch 10/20
 - 0s - loss: 0.4902
Epoch 11/20
 - 0s - loss: 0.4799
Epoch 12/20
 - 0s - loss: 0.4695
Epoch 13/20
 - 0s - loss: 0.4706
Epoch 14/20
 - 0s - loss: 0.4545
Epoch 15/20
 - 0s - loss: 0.4488
Epoch 16/20
 - 0s - loss: 0.4430
Epoch 17/20
 - 0s - loss: 0.4432
Epoch 18/20
 - 0s - loss: 0.4385
Epoch 19/20
 - 0s - loss: 0.4378
Epoch 20/20
 - 0s - loss: 0.4316


The loss fell from 0.99 in the first epoch to 0.43 in the last, which suggests the network was learning something!
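For context on these loss values, it helps to know what a model that learned nothing would score. The sketch below assumes every batch holds exactly one positive (label 1) for every two negatives (label -1), as in the `negative_ratio = 2` call above, and computes the mean squared error of two constant predictions. The variable names are made up for this example.

```python
import numpy as np

negative_ratio = 2

# One positive label for every `negative_ratio` negative labels
labels = np.array([1.0] + [-1.0] * negative_ratio)

# Baseline 1: always predict 0 (roughly what untrained embeddings give)
mse_predict_zero = np.mean((labels - 0.0) ** 2)

# Baseline 2: always predict the mean label, the best constant guess
mse_predict_mean = np.mean((labels - labels.mean()) ** 2)

print(f'always 0:    {mse_predict_zero:.3f}')
print(f'always mean: {mse_predict_mean:.3f}')
```

Under that assumption the constant baselines come out at 1.0 and 8/9 (about 0.889), so a final loss of 0.43 is well below what any constant prediction could reach.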

In [109]:
model.save('./models/double_filter_emb_200.h5')

In [110]:
# Extract embeddings
game_layer = model.get_layer('game_embedding')
game_weights = game_layer.get_weights()[0]
game_weights.shape

(3027, 200)

Each game is now represented as a 200-dimensional vector.

We need to normalize the embeddings so that the dot product between two embeddings becomes the cosine similarity.

In [111]:
game_weights = game_weights / np.linalg.norm(game_weights, axis = 1).reshape((-1, 1))
game_weights[0][:10]
np.sum(np.square(game_weights[0]))

1.0
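As a quick check of why normalisation matters, the sketch below uses random toy vectors (not the trained embeddings; `toy_weights` and `unit_weights` are hypothetical names) to show that the dot product of two unit-length rows equals the cosine similarity of the original rows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy "embeddings": 4 items, 8 dimensions
toy_weights = rng.normal(size=(4, 8))

# Scale every row to unit length
unit_weights = toy_weights / np.linalg.norm(toy_weights, axis=1, keepdims=True)

# Cosine similarity computed directly from the unnormalised rows
a, b = toy_weights[0], toy_weights[1]
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Dot product of the corresponding unit rows
dot_of_units = unit_weights[0] @ unit_weights[1]

print(np.isclose(cosine, dot_of_units))
```

Here `keepdims=True` has the same effect as the `.reshape((-1, 1))` used in the cell above.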

In [113]:
# Method to find the closest matches to the game in question
def find_closest(game_embedding: np.array) -> None:
    dists = np.dot(game_weights, game_embedding)
    sorted_dists = np.argsort(dists)
    closest = sorted_dists[-6:]
    for c in reversed(closest):
        print(f'GAME: {index_game[c]:{40}} Similarity: {dists[c]:.{2}}')

In [114]:
find_closest(game_weights[game_index['Age of Empires II: Definitive Edition']])

GAME: Age of Empires II: Definitive Edition    Similarity: 1.0
GAME: Act of Aggression - Reboot Edition       Similarity: 0.97
GAME: Total Annihilation                       Similarity: 0.96
GAME: Tooth and Tail                           Similarity: 0.95
GAME: Warhammer 40,000: Dawn of War II Chaos Rising Similarity: 0.95
GAME: Sudden Strike 4                          Similarity: 0.95


In [115]:
find_closest(game_weights[game_index['Age of Mythology: Extended Edition']])

GAME: Age of Mythology: Extended Edition       Similarity: 1.0
GAME: Heroes of Might & Magic III - HD Edition Similarity: 0.86
GAME: Total War: WARHAMMER                     Similarity: 0.85
GAME: Rise of Nations: Extended Edition        Similarity: 0.85
GAME: Total War: ROME II - Emperor Edition     Similarity: 0.85
GAME: Total War: THREE KINGDOMS                Similarity: 0.85


In [116]:
find_closest(game_weights[game_index['Total War: WARHAMMER']])

GAME: Total War: WARHAMMER                     Similarity: 1.0
GAME: Total War: WARHAMMER II                  Similarity: 0.92
GAME: Total War: THREE KINGDOMS                Similarity: 0.89
GAME: Total War: ROME II - Emperor Edition     Similarity: 0.86
GAME: Age of Mythology: Extended Edition       Similarity: 0.85
GAME: Heroes of Might & Magic III - HD Edition Similarity: 0.83


Not bad.

In [117]:
def extract_weights(name, model):
    """Extract weights from a neural network model"""
    
    # Extract weights
    weight_layer = model.get_layer(name)
    weights = weight_layer.get_weights()[0]
    
    # Normalize
    weights = weights / np.linalg.norm(weights, axis = 1).reshape((-1, 1))
    return weights

In [118]:
tag_weights = extract_weights('tag_embedding', model)

Find closest tags to `RTS`

In [119]:
dists = np.dot(tag_weights, tag_weights[tag_index['RTS']])
sorted_dists = np.argsort(dists)
closest = sorted_dists[-6:]
for c in reversed(closest):
    print(f'GAME: {index_tag[c]:{40}} Similarity: {dists[c]:.{2}}')

GAME: RTS                                      Similarity: 1.0
GAME: Grand Strategy                           Similarity: 0.87
GAME: Historical                               Similarity: 0.85
GAME: War                                      Similarity: 0.83
GAME: Tactical                                 Similarity: 0.82
GAME: Turn-Based Strategy                      Similarity: 0.82


One idea I had was to perform arithmetic on the embeddings. Say we like `Age of Mythology` but are less keen on its `Fantasy` element. We could potentially remove what the model has "learned" as fantasy and make new recommendations based on the result. In practice, take the `Age of Mythology` embedding, subtract the `Fantasy` tag embedding, re-normalise, and find the closest games to the new vector.

In [122]:
def subtract_tag(tag: str, game: str) -> np.array:
    """
    Subtracts a tag embedding from a game embedding and normalises
    
    :type tag: str
    :param tag: Tag to subtract from game embedding
    :type game: str
    :param game: Game which tag embedding is subtracted from
    :rtype: np.array
    :return: New game array with the tag embedding removed
    """
    new_game_weight = game_weights[game_index[game]] - tag_weights[tag_index[tag]]
    return new_game_weight / np.linalg.norm(new_game_weight).reshape((-1, 1))[0]

def add_tag(tag: str, game: str) -> np.array:
    """
    Adds a tag embedding to a game embedding and normalises
    
    :type tag: str
    :param tag: Tag to add to game embedding
    :type game: str
    :param game: Game which tag embedding is added to
    :rtype: np.array
    :return: New game array with the tag embedding added
    """
    new_game_weight = game_weights[game_index[game]] + tag_weights[tag_index[tag]]
    return new_game_weight / np.linalg.norm(new_game_weight).reshape((-1, 1))[0]
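To build some intuition for what subtracting a tag does, here is a toy two-dimensional sketch. The vectors are invented for illustration (axis 0 loosely stands for "strategy", axis 1 for "fantasy") and are not taken from the trained model.

```python
import numpy as np

def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

# Hypothetical embeddings: the game mixes both traits, the tag is pure "fantasy"
toy_game = unit(np.array([1.0, 1.0]))
toy_fantasy_tag = unit(np.array([0.0, 1.0]))
toy_strategy_only = unit(np.array([1.0, 0.0]))

# Subtract the tag embedding and re-normalise, as subtract_tag does
shifted = unit(toy_game - toy_fantasy_tag)

print(f'similarity to fantasy tag:   before {toy_game @ toy_fantasy_tag:.2f}, '
      f'after {shifted @ toy_fantasy_tag:.2f}')
print(f'similarity to strategy game: before {toy_game @ toy_strategy_only:.2f}, '
      f'after {shifted @ toy_strategy_only:.2f}')
```

In this toy case the shifted vector moves closer to the strategy-only game, but it also ends up negatively aligned with the tag rather than neutral to it. Plain subtraction can therefore push recommendations towards games that look like the opposite of the tag, which is worth keeping in mind when reading the results below.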

In [123]:
find_closest(subtract_tag('Classic', 'Age of Mythology: Extended Edition'))

GAME: Insect Simulator                         Similarity: 0.86
GAME: CryoFall                                 Similarity: 0.85
GAME: INTERSTELLAR PRIME                       Similarity: 0.84
GAME: All Guns On Deck                         Similarity: 0.83
GAME: War for the Overworld                    Similarity: 0.83
GAME: Sandbox Anything                         Similarity: 0.82


In [124]:
find_closest(subtract_tag('Fantasy', 'Age of Mythology: Extended Edition'))

GAME: ARMA: Cold War Assault                   Similarity: 0.84
GAME: Ashes of the Singularity: Escalation     Similarity: 0.83
GAME: Takedown: Red Sabre                      Similarity: 0.82
GAME: Wargame: Red Dragon                      Similarity: 0.79
GAME: Rise of Nations: Extended Edition        Similarity: 0.79
GAME: Call to Arms                             Similarity: 0.78


In [125]:
find_closest(game_weights[game_index['Rocket League']])

GAME: Rocket League                            Similarity: 1.0
GAME: Stick Fight: The Game                    Similarity: 0.78
GAME: SpeedRunners                             Similarity: 0.78
GAME: Overcooked                               Similarity: 0.77
GAME: The Expendabros                          Similarity: 0.76
GAME: Hot Shot Burn                            Similarity: 0.75


In [126]:
find_closest(subtract_tag('Sports', 'Rocket League'))

GAME: FORCED: Slightly Better Edition          Similarity: 0.85
GAME: Witch It                                 Similarity: 0.84
GAME: Mirador                                  Similarity: 0.84
GAME: CS2D                                     Similarity: 0.84
GAME: Trine 2: Complete Story                  Similarity: 0.83
GAME: Overcooked! 2                            Similarity: 0.82


In [127]:
find_closest(subtract_tag('Sports', 'GRID 2'))

GAME: theHunter: Primal                        Similarity: 0.92
GAME: Far Cry 2: Fortune's Edition             Similarity: 0.88
GAME: Mirador                                  Similarity: 0.86
GAME: Far Cry New Dawn                         Similarity: 0.86
GAME: WT3                                      Similarity: 0.86
GAME: Generation Zero                          Similarity: 0.86


In [128]:
find_closest(game_weights[game_index['GRID 2']])

GAME: GRID 2                                   Similarity: 1.0
GAME: RaceRoom Racing Experience               Similarity: 0.86
GAME: Wreckfest                                Similarity: 0.83
GAME: Assetto Corsa                            Similarity: 0.82
GAME: Project CARS                             Similarity: 0.81
GAME: Need For Speed: Hot Pursuit              Similarity: 0.79


In [129]:
find_closest(add_tag('Fantasy', 'Surviving Mars'))

GAME: XCOM: Enemy Unknown                      Similarity: 0.72
GAME: Endless Space 2                          Similarity: 0.69
GAME: Stellaris                                Similarity: 0.69
GAME: The Elder Scrolls III: Morrowind Game of the Year Edition Similarity: 0.68
GAME: Wasteland 2: Director's Cut              Similarity: 0.68
GAME: Frostpunk                                Similarity: 0.67


In [130]:
find_closest(subtract_tag('Colony Sim', 'Surviving Mars'))

GAME: Starpoint Gemini 2                       Similarity: 0.92
GAME: X Rebirth                                Similarity: 0.92
GAME: X3: Terran Conflict                      Similarity: 0.92
GAME: Endless Sky                              Similarity: 0.91
GAME: Ashes of the Singularity: Escalation     Similarity: 0.9
GAME: Avorion                                  Similarity: 0.88


These are looked at further in the next notebook.