## !!Before Running This Notebook!!

Go to the `/data/` section of this repo and review the [README](../../data/readme.md), as well as the [notebook](../../data/data_creation.ipynb) located there.

## Necessary Imports

In [16]:
import pandas as pd
import numpy as np
import gzip
import json
import re
import os
import pickle
import nltk
import multiprocessing

from nltk.corpus import stopwords

nltk.download('words')
nltk.download('punkt')

[nltk_data] Downloading package words to
[nltk_data]     /Users/kyledecember1/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/kyledecember1/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [2]:
from tqdm import tqdm
from gensim.models import Doc2Vec
from sklearn import utils
from sklearn.model_selection import train_test_split
import gensim
from sklearn.linear_model import LogisticRegression
from gensim.models.doc2vec import TaggedDocument
from sklearn.metrics import accuracy_score, f1_score

import seaborn as sns
import matplotlib.pyplot as plt

## Custom Functions

In [32]:
def tokenize_text(text):
    """
    tokenizes text to be used in Doc2Vec modeling
    """
    
    tokens = []
    for sent in nltk.sent_tokenize(text):
        for word in nltk.word_tokenize(sent):
            if len(word) < 2:
                continue
            tokens.append(word.lower())
    return tokens

def get_recs(model, user_input, n=5):
    """
    returns the n most similar games for a user's description of their ideal game,
    as (product_id, similarity) pairs; n defaults to 5
    """
    
    # split the description into words and infer a vector for it
    description = user_input.split(' ')
    desc_vec = model.infer_vector(description)
    # request exactly n neighbours (most_similar returns 10 by default)
    recs = model.docvecs.most_similar([desc_vec], topn=n)
    return recs

def show_game_desc(recs, df):
    """
    provides the titles of the games recommended by get_recs
    """
    
    games = {}
    for rec in recs:
        games[rec[0]] = df[df['product_id'] == rec[0]]['title']
    return games

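def score(evals, n=25):
    """
    returns the Relevance Score: the number of relevant recommendations in evals
    divided by n, the total number of recommendations evaluated (default 25)
    """
    
    return np.sum(evals) / n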

## Import Data

In [4]:
#assign data filepaths to variables
games_path = os.path.join(os.pardir, os.pardir, 'data/games.csv')
reviews_path = os.path.join(os.pardir, os.pardir, 'data/subsample_agg_reviews.p')

In [5]:
#import df_games 
df_games = pd.read_csv(games_path)

#drop rows with a null id (dropna on the column alone would leave df_games unchanged)
df_games = df_games.dropna(subset=['id'])

# create product_id column in df_games, based on str-formatted id column
df_games['product_id'] = df_games['id'].astype(int).astype(str)

In [6]:
# initiate the aggregated dataframe
agg_df = pd.DataFrame()

# import the aggregated data from the pickled file
with open(reviews_path, 'rb') as fp:
    loaded_file = pickle.load(fp)
    
# create columns based on the keys and values of the dictionary
agg_df['product_id'] = list(loaded_file.keys())
agg_df['reviews'] = list(loaded_file.values())

The next cell will take several minutes to run.

In [8]:
# creating tagged documents for Doc2Vec
tagged_docs = agg_df.apply(
    lambda r: TaggedDocument(words=tokenize_text(r['reviews']), tags=[r['product_id']]), axis=1)

In [None]:
# count the available CPU cores so Doc2Vec can use all of them as workers and train faster
cores = multiprocessing.cpu_count()

In [11]:
# create 5 test descriptions

test_sent1 = """
an intense and fast paced rpg that allows me to customize my character and defeat my enemies 
using sorcery and weapons"""

test_sent2 = """
fluffy animals and shiny colors, game that helps to educate children in basic math and reasoning skills
"""

test_sent3 = """
dragons and monsters fight to the death, extremely unforgiving combat, difficult as dark souls
"""

test_sent4 = """
similar to age of empires, where i build my kingdom from the ground up, form alliances and engage in intrigue,
research technological advancements and stand the test of time
"""

test_sent5 = """
turn based strategy game with the complexity of Civilization 5, but where I can also control an individual 
unit within combat
"""


all_sents = [test_sent1, test_sent2, test_sent3, test_sent4, test_sent5]

## Modeling & Evaluation

Please note that this evaluation is subjective: without a focus group, there was no way to learn whether users found the recommendations relevant. Instead, 5 unique game descriptions were written, and each model's recommendations for them were researched and judged by hand.

A recommendation was judged relevant when the description of the recommended game matched key words in the test description. Each model was then given a Relevance Score, which is simply the fraction of its 25 recommendations (5 descriptions, 5 recommendations each) that were judged relevant.
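As a rough sketch of how the Relevance Score can be read, the cell below computes both the overall score and a per-description breakdown from a 5x5 matrix of 0/1 judgements. The matrix values and the names `example_evals`, `per_description`, and `overall` are illustrative assumptions, not results from any model in this notebook.

```python
import numpy as np

# Illustrative judgements only (not results from this notebook).
# Rows are test descriptions, columns are the 5 recommendations for each one;
# 1 marks a recommendation judged relevant, 0 marks one judged irrelevant.
example_evals = np.array([[1, 0, 1, 0, 0],
                          [0, 0, 0, 0, 0],
                          [1, 1, 0, 0, 0],
                          [0, 0, 0, 0, 1],
                          [0, 1, 0, 0, 0]])

# fraction of relevant recommendations for each description (row-wise mean)
per_description = example_evals.mean(axis=1)

# overall Relevance Score: relevant recommendations / all recommendations
overall = example_evals.sum() / example_evals.size

print(per_description)
print(overall)
```

The per-description view can show whether a score is driven by one or two descriptions that the model handles well, which a single overall number hides.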

## Model 1

In [12]:
model_1 = Doc2Vec(dm=0, vector_size=300, negative=5, hs=0, min_count=2, sample = 0, workers=cores)
model_1.build_vocab([x for x in tqdm(tagged_docs.values)])

100%|██████████| 1996/1996 [00:00<00:00, 521154.80it/s]


In [13]:
model_1.train(tagged_docs, total_examples=len(tagged_docs), epochs=30)

In [14]:
for sent in all_sents:
    print(sent, show_game_desc(get_recs(model_1, sent), df_games))


an intense and fast paced rpg that allows me to customize my character and defeat my enemies 
using sorcery and weapons {'354500': 27178    PAYDAY: The Web Series
Name: title, dtype: object, '413850': 4622    CS:GO Player Profiles
Name: title, dtype: object, '250600': 1909    The Plan
Name: title, dtype: object, '283640': 6184    Salt and Sanctuary
Name: title, dtype: object, '546390': 20467    Brief Karate Foolish
Name: title, dtype: object}

fluffy animals and shiny colors, game that helps to educate children in basic math and reasoning skills
 {'354500': 27178    PAYDAY: The Web Series
Name: title, dtype: object, '413850': 4622    CS:GO Player Profiles
Name: title, dtype: object, '346250': 27536    The Old Tree
Name: title, dtype: object, '359390': 25763    Amnesia™: Memories
Name: title, dtype: object, '353560': 27222    Plug &amp; Play
Name: title, dtype: object}

dragons and monsters fight to the death, extremely unforgiving combat, difficult as dark souls
 {'354500': 27178    P

### Evaluation

In [25]:
evals_1 = np.array([[0,0,0,1,0],
                    [0,0,0,0,0], 
                    [0,0,0,0,0], 
                    [0,0,1,0,1],
                    [0,0,0,0,1]])

In [34]:
score_model_1 = score(evals_1)
score_model_1

0.16

## Model 2

In [40]:
model_2 = Doc2Vec(dm=0, vector_size=300, negative=5, hs=0, min_count=2, sample = 0, workers=cores)
model_2.build_vocab([x for x in tqdm(tagged_docs.values)])

100%|██████████| 1996/1996 [00:00<00:00, 1198887.41it/s]


In [41]:
model_2.train(tagged_docs, total_examples=len(tagged_docs), epochs=50)

In [31]:
for sent in all_sents:
    print(sent, show_game_desc(get_recs(model_2, sent), df_games))


an intense and fast paced rpg that allows me to customize my character and defeat my enemies 
using sorcery and weapons {'283640': 6184    Salt and Sanctuary
Name: title, dtype: object, '413850': 4622    CS:GO Player Profiles
Name: title, dtype: object, '386360': 4339    SMITE®
Name: title, dtype: object, '250600': 1909    The Plan
Name: title, dtype: object, '354500': 27178    PAYDAY: The Web Series
Name: title, dtype: object}

fluffy animals and shiny colors, game that helps to educate children in basic math and reasoning skills
 {'354500': 27178    PAYDAY: The Web Series
Name: title, dtype: object, '203770': 855    Crusader Kings II
Name: title, dtype: object, '270170': 28508    Depression Quest
Name: title, dtype: object, '413850': 4622    CS:GO Player Profiles
Name: title, dtype: object, '346250': 27536    The Old Tree
Name: title, dtype: object}

dragons and monsters fight to the death, extremely unforgiving combat, difficult as dark souls
 {'413850': 4622    CS:GO Player Profil

### Evaluation

In [36]:
evals_2 = np.array([[1,0,1,0,0],
                    [0,0,0,0,0], 
                    [0,0,0,0,0], 
                    [0,1,0,0,0],
                    [0,0,0,0,1]])

In [37]:
# provides the fraction of recommendations that were judged relevant

score_model_2 = score(evals_2)
score_model_2

0.16

## Model 3

In [42]:
model_3 = Doc2Vec(dm=0, vector_size=300, negative=5, hs=0, min_count=2, sample = 0, workers=cores)
model_3.build_vocab([x for x in tqdm(tagged_docs.values)])

100%|██████████| 1996/1996 [00:00<00:00, 710802.41it/s]


In [43]:
model_3.train(tagged_docs, total_examples=len(tagged_docs), epochs=100)

In [44]:
for sent in all_sents:
    print(sent, show_game_desc(get_recs(model_3, sent), df_games))


an intense and fast paced rpg that allows me to customize my character and defeat my enemies 
using sorcery and weapons {'283640': 6184    Salt and Sanctuary
Name: title, dtype: object, '386360': 4339    SMITE®
Name: title, dtype: object, '205100': 30712    Dishonored
Name: title, dtype: object, '207140': 2983    SpeedRunners
Name: title, dtype: object, '242680': 24732    Nuclear Throne
Name: title, dtype: object}

fluffy animals and shiny colors, game that helps to educate children in basic math and reasoning skills
 {'270170': 28508    Depression Quest
Name: title, dtype: object, '203770': 855    Crusader Kings II
Name: title, dtype: object, '529180': 16694    Dark and Light
Name: title, dtype: object, '249130': 29889    LEGO® Marvel™ Super Heroes
Name: title, dtype: object, '362680': 4275    Fran Bow
Name: title, dtype: object}

dragons and monsters fight to the death, extremely unforgiving combat, difficult as dark souls
 {'268910': 15273    Cuphead
Name: title, dtype: object, '20

### Evaluation

In [45]:
evals_3 = np.array([[1,1,1,0,1],
                    [0,0,0,0,0], 
                    [0,1,0,1,0], 
                    [1,0,0,1,1],
                    [0,0,0,1,0]])

In [47]:
# provides the fraction of recommendations that were judged relevant

score_model_3 = score(evals_3)
score_model_3

0.4

## Future Improvements 

Much of the inaccuracy comes from recommendations of non-game material: PAYDAY: The Web Series and CS:GO Player Profiles are both documentary series and should never have been included in the data. The Games dataframe has no indicator that identifies all such entries, so they cannot be filtered out of the data automatically as it exists.
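One possible stopgap, sketched here rather than used anywhere in this notebook, is a hand-maintained blocklist of product IDs applied to the output of `get_recs`. The names `NON_GAME_IDS` and `filter_recs` are hypothetical, and the only IDs listed are the two non-game entries that appear in the outputs above; the similarity values in the usage example are made up.

```python
# Hypothetical hand-maintained blocklist of non-game product IDs.
# Only the two entries seen in this notebook's outputs are listed.
NON_GAME_IDS = {
    '354500',  # PAYDAY: The Web Series
    '413850',  # CS:GO Player Profiles
}

def filter_recs(recs, blocklist=NON_GAME_IDS, n=5):
    """
    drops blocklisted product IDs from a list of (product_id, similarity) pairs
    and returns at most n of the remaining recommendations, in their original order
    """
    kept = [(pid, sim) for pid, sim in recs if pid not in blocklist]
    return kept[:n]

# usage with made-up similarity values (illustrative only)
example_recs = [('354500', 0.41), ('283640', 0.39), ('413850', 0.37), ('386360', 0.35)]
print(filter_recs(example_recs))
```

For the filtered list to still contain n items, more than n candidates would need to be requested from `most_similar` through its `topn` argument before filtering. A blocklist like this only catches entries someone has already spotted, so it does not replace a proper indicator in the data.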

The most notable way to improve performance would be to scrape Steam directly and gather more information, such as the "About This Game" section. However, due to a lack of technical proficiency and time, web scraping took a back seat in this iteration of the project.

I would also like to try training models for a thousand or more epochs, so that Doc2Vec has more time to learn the relationships between words, documents, and their contexts.

Collaborative filtering would also be interesting. However, I fear that recommendations based on whether a game was played or liked by other users would amount to little more than a popular-games recommendation system. Regardless, it should be tested.
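To make the idea concrete, here is a minimal sketch of item-item collaborative filtering using cosine similarity over a user-by-game matrix. Everything in it is an assumption for illustration: the toy `interactions` matrix is invented, the helper names `item_similarity` and `similar_games` are hypothetical, and the project's data has not been shaped this way.

```python
import numpy as np

# Toy user-by-game matrix (rows: users, columns: games).
# 1 means the user played/liked the game. Values are invented for illustration.
interactions = np.array([[1, 1, 0, 0],
                         [1, 1, 1, 0],
                         [0, 0, 1, 1],
                         [1, 0, 0, 0]], dtype=float)

def item_similarity(matrix):
    """cosine similarity between the columns (games) of a user-by-game matrix"""
    norms = np.linalg.norm(matrix, axis=0)
    norms[norms == 0] = 1.0  # avoid dividing by zero for games nobody played
    normalized = matrix / norms
    return normalized.T @ normalized

def similar_games(matrix, game_idx, n=2):
    """column indices of the n games most similar to game_idx, excluding itself"""
    sims = item_similarity(matrix)[game_idx].copy()
    sims[game_idx] = -np.inf  # never recommend the query game itself
    return np.argsort(-sims)[:n].tolist()

print(similar_games(interactions, 0))
```

Cosine similarity divides by each game's vector norm, which may dampen the pull of very popular games somewhat, though whether that is enough to avoid the popularity problem described above would have to be tested on the real data.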