# MyGamePass #
## Content Based Filtering ##

As a kid, I loved finding new favorites at the store. I would see which game caught my attention with its name and description, whether I recognized the developer, or whether it belonged to a genre I already loved.

With natural language processing and machine learning, we can build a recommender system that uses these descriptive features to find similar games.

We began by performing extensive EDA and preprocessing on the dataframes that will be used in this modeling.  The data cleanup folder contains each of the cleanup notebooks for the dataset.

In [11]:
# Begin by importing the standard packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# NLP Packages
import string
import nltk

# import TfidfVectorizer 
from sklearn.feature_extraction.text import TfidfVectorizer

# Import cosine similarity
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
# Import the prepared games dataset, with the combined description and non-numeric features
games_df = pd.read_csv('data/games_combo.csv')
games_df.head(3)

Unnamed: 0,appid,name,average_playtime,total_ratings,percent_positive_ratings,description,descr_combo
0,10,Counter-Strike,300,127873,0.973888,Play the world's number 1 online action game. ...,Play the world's number 1 online action game. ...
1,20,Team Fortress Classic,277,3951,0.839787,One of the most popular online action games of...,One of the most popular online action games of...
2,30,Day of Defeat,187,3814,0.895648,Enlist in an intense brand of Axis vs. Allied ...,Enlist in an intense brand of Axis vs. Allied ...


column: description

- appid: steam appid for joining
- name: video game name
- average_playtime: average playtime based on master steam store dataset
- total_ratings: total number of ratings
- percent_positive_ratings: percentage of ratings that are positive
- description: text description of the game
- descr_combo: combined non-numeric features with the description for each game

In [3]:
# Let's look at the sample output for one of my favorite game franchises
games_df[games_df['name'].str.contains('Halo',na=False)]

Unnamed: 0,appid,name,average_playtime,total_ratings,percent_positive_ratings,description,descr_combo
2374,277430,Halo: Spartan Assault,251,3488,0.801892,Halo: Spartan Assault brings the excitement of...,Halo: Spartan Assault brings the excitement of...
3548,324570,Halo: Spartan Strike,6,634,0.832808,Halo: Spartan Strike makes you a Spartan super...,Halo: Spartan Strike makes you a Spartan super...
7622,459220,Halo Wars: Definitive Edition,300,2845,0.868541,<i>Halo Wars: Definitive Edition</i> is an enh...,<i>Halo Wars: Definitive Edition</i> is an enh...


- Great.  This is excellent data for the NLP to build the similarity matrix.
- Note: there appear to be some HTML tags that will need to be addressed

## User Independent System ##

Let's start just by exploring the ratings.  A common approach for a "default" recommendation or for a new user would be to simply recommend the highest rated video game in the dataset.

In [4]:
# Top 10 highest rated games, no filtering
top_rated = games_df.sort_values(by=['percent_positive_ratings'],ascending=False)
top_rated.head(10)

Unnamed: 0,appid,name,average_playtime,total_ratings,percent_positive_ratings,description,descr_combo
18206,1064580,CaptainMarlene,0,13,1.0,"In the game you control the spacecraft, and in...","In the game you control the spacecraft, and in..."
12375,651360,Eskimo Bob: Starring Alfonzo,0,11,1.0,Eskimo Bob is an 8-bit arcade-style puzzle-pla...,Eskimo Bob is an 8-bit arcade-style puzzle-pla...
12738,667800,Loco Dojo,0,25,1.0,Enter the whimsical wooden world of <strong>Lo...,Enter the whimsical wooden world of <strong>Lo...
12736,667780,Rosebaker's Icy Treats - The VR Iceman Sim,0,17,1.0,Rosebaker's Icy Treats is a virtual reality ga...,Rosebaker's Icy Treats is a virtual reality ga...
12734,667760,Super Lumi Live,0,17,1.0,<strong>A pure classical platformer. </strong>...,<strong>A pure classical platformer. </strong>...
12716,666810,Luna,0,10,1.0,Luna is the story of a young creature cast off...,Luna is the story of a young creature cast off...
12713,666630,The Captives: Plot of the Demiurge,0,13,1.0,"<h2 class=""bb_tag"">Characters</h2>The Captives...","<h2 class=""bb_tag"">Characters</h2>The Captives..."
12675,664850,8-bit Adventure Anthology: Volume I,0,22,1.0,8-bit Adventure Anthology is a compilation fea...,8-bit Adventure Anthology is a compilation fea...
12623,662410,Circuit Dude,0,13,1.0,Help Circuit Dude build his ultimate secret in...,Help Circuit Dude build his ultimate secret in...
12611,661820,I.F.O,0,18,1.0,I.F.O is an old school LCD-style shoot'em up g...,I.F.O is an old school LCD-style shoot'em up g...


Huh...I've never heard of any of these games...

- Right off the bat, it is a red flag for any game to have 100% positive ratings.  Humans can be very tough critics and rarely agree uniformly, especially on entertainment preferences.

- We will combat this with a simple filter, to only include games above a minimum number of votes
    - This does have the potential to ignore any brand new games, which is a feature I would like to explore to add robustness in a future update
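
One possible alternative to a hard cutoff, which this notebook does not use, is a Bayesian-style weighted rating that pulls games with few votes toward the dataset-wide average instead of dropping them. The sketch below is only an illustration: the function name `weighted_rating`, the prior weight `m`, and the `global_mean` value of 0.75 are all assumptions, not values from the dataset.

```python
def weighted_rating(percent_positive, total_ratings, global_mean, m=1000):
    """Shrink a game's rating toward the global mean when it has few votes.

    m is an assumed prior weight: a game needs roughly m votes before its
    own rating counts as much as the global mean.
    """
    v = total_ratings
    return (v * percent_positive + m * global_mean) / (v + m)


# Ratings and vote counts come from the tables in this notebook;
# global_mean=0.75 is an assumed placeholder, not a computed value
captain_marlene = weighted_rating(1.0, 13, global_mean=0.75)
portal_2 = weighted_rating(0.986504, 140111, global_mean=0.75)

print(f"CaptainMarlene: {captain_marlene:.3f}")
print(f"Portal 2:       {portal_2:.3f}")
```

With this kind of score, a 13-vote game with a perfect rating lands close to the global mean, while a heavily rated game keeps nearly all of its own rating.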

In [5]:
# Add a vote threshold for a more realistic Top 10
# Let's be pretty tough, filtering for very popular games
threshold = 10000
# Temporary dataframe for the games with a lot of ratings
many_ratings_df = games_df[games_df['total_ratings'] >= threshold]
# Sort by most positive
top_rated_v2 = many_ratings_df.sort_values(by=['percent_positive_ratings'],ascending=False)
# Rank the top 10
top_rated_v2.head(10)

Unnamed: 0,appid,name,average_playtime,total_ratings,percent_positive_ratings,description,descr_combo
23,620,Portal 2,300,140111,0.986504,Portal 2 draws from the award-winning formula ...,Portal 2 draws from the award-winning formula ...
6648,427520,Factorio,300,48641,0.985136,<strong>Factorio</strong> is a game in which y...,<strong>Factorio</strong> is a game in which y...
2124,264200,One Finger Death Punch,149,14437,0.982268,Experience cinematic kung-fu battles in the fa...,Experience cinematic kung-fu battles in the fa...
6591,424280,Iron Snout,300,14166,0.980446,"<strong>Iron Snout</strong> is fast, colorful ...","<strong>Iron Snout</strong> is fast, colorful ..."
6495,420530,OneShot,300,11205,0.980366,"<u><strong><a href=""https://steamcommunity.com...","<u><strong><a href=""https://steamcommunity.com..."
1888,253230,A Hat in Time,300,14311,0.979596,"<img src=""https://steamcdn-a.akamaihd.net/stea...","<img src=""https://steamcdn-a.akamaihd.net/stea..."
17,400,Portal,288,52881,0.979577,<p>Portal&trade; is a new single player game f...,<p>Portal&trade; is a new single player game f...
2782,294100,RimWorld,300,39413,0.978256,<strong>RimWorld is a sci-fi colony sim driven...,<strong>RimWorld is a sci-fi colony sim driven...
4394,351640,Eternal Senia,300,10333,0.978032,I believe that gaming should not be anything c...,I believe that gaming should not be anything c...
2719,292030,The Witcher® 3: Wild Hunt,300,207728,0.976902,"<h1>Special Offer</h1><p><img src=""https://med...","<h1>Special Offer</h1><p><img src=""https://med..."


- Much better.  And I agree, #1 Portal 2 and #10 The Witcher 3 are two of my favorite games of all time.

- These ratings filters can be used to refine the content recommendations in a similar fashion

Let's take a look at one of the most popular entertainment franchises of all time, Star Wars.  Star Wars games span a wide variety of genres, types, and game modes, and the franchise has a dedicated fanbase.

There are so many Star Wars games it can be hard to find the ones you like, and too many options to play them all.  A recommender system is the solution!

In [6]:
# Star Wars games
games_df[games_df['name'].str.contains('STAR WARS',na=False)]

Unnamed: 0,appid,name,average_playtime,total_ratings,percent_positive_ratings,description,descr_combo
143,6000,STAR WARS™ Republic Commando™,300,6771,0.944026,Chaos has erupted throughout the galaxy. As le...,Chaos has erupted throughout the galaxy. As le...
145,6020,STAR WARS™ Jedi Knight - Jedi Academy™,300,6014,0.945128,Forge your weapon and follow the path of the J...,Forge your weapon and follow the path of the J...
146,6030,STAR WARS™ Jedi Knight II - Jedi Outcast™,127,2239,0.902188,The Legacy of Star Wars Dark Forces™ and Star ...,The Legacy of Star Wars Dark Forces™ and Star ...
548,32350,STAR WARS™ Starfighter™,6,456,0.484649,Join three heroic starfighter pilots in harrow...,Join three heroic starfighter pilots in harrow...
550,32370,STAR WARS™ - Knights of the Old Republic™,300,13159,0.87917,Note: Mac version only supports English langua...,Note: Mac version only supports English langua...
551,32380,STAR WARS™ Jedi Knight: Dark Forces II,141,1092,0.711538,Dark Forces™ set the industry standard for fir...,Dark Forces™ set the industry standard for fir...
552,32390,STAR WARS™ Jedi Knight - Mysteries of the Sith™,30,558,0.655914,"&quot;I have chosen my destiny, and it lies he...","&quot;I have chosen my destiny, and it lies he..."
553,32400,STAR WARS™ - Dark Forces,99,1672,0.897727,Behind a veil of secrecy the evil Empire is cr...,Behind a veil of secrecy the evil Empire is cr...
555,32420,STAR WARS™: The Clone Wars - Republic Heroes™,2,445,0.373034,Star Wars The Clone Wars: Republic Heroes lets...,Star Wars The Clone Wars: Republic Heroes lets...
556,32430,STAR WARS™ - The Force Unleashed™ Ultimate Sit...,195,3676,0.622144,The story and action of Star Wars®: The Force ...,The story and action of Star Wars®: The Force ...


## Natural Language Processing ##

Using the Natural Language Toolkit (NLTK) package we will process the descriptive data into numeric values for our cosine similarity matrix.

descr_combo: the combined description and non-numeric features (developer, publisher, categories, tags)

- TF-IDF vectorization of descr_combo
    - TF-IDF was chosen because it gives more weight to words that are distinctive to a description: frequent within it but rare across the rest of the dataset
    - These distinctive words are the most "descriptive", unlike raw counts, which favor whichever words occur most often
- Compared to count vectorization, this method puts less emphasis on the words found throughout all descriptions
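
To make the weighting concrete, here is a minimal from-scratch sketch on three made-up descriptions. It uses the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is my understanding of scikit-learn's default `TfidfVectorizer` behavior; the toy descriptions and the helper name `tfidf` are illustrative assumptions, and this is not the code used to build the matrix in this notebook.

```python
import math
from collections import Counter

# Three toy "descriptions"; every one of them contains the word "game"
docs = [
    "space shooter game",
    "puzzle platformer game",
    "racing game",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many descriptions does each word appear?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))


def tfidf(tokens):
    """Return L2-normalized TF-IDF weights for one tokenized description."""
    counts = Counter(tokens)
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    raw = {
        word: count * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
        for word, count in counts.items()
    }
    norm = math.sqrt(sum(weight ** 2 for weight in raw.values()))
    return {word: weight / norm for word, weight in raw.items()}


weights = tfidf(tokenized[0])
for word, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{word:10s} {weight:.3f}")
```

Each word occurs once in the first description, so count vectorization would treat all three equally. TF-IDF instead ranks "space" and "shooter" above "game", because "game" appears in every description.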

In [7]:
# Create a custom function to remove the HTML tags from the descriptions
# This will be called by the tokenizer first
import re

# Compile the tag pattern once so it is reused on every call
HTML_TAG_PATTERN = re.compile('<.*?>')

def remove_html_tags(text):
    """Remove html tags from a string"""
    return HTML_TAG_PATTERN.sub('', text)
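
The regex above strips tags, but HTML entities such as `&trade;` and `&quot;` (visible in the Portal and Mysteries of the Sith rows earlier) are not tags and pass through untouched. As a hedged sketch of one way to handle them, the standard library's `html.unescape` can decode entities after the tags are removed. The name `clean_description` is hypothetical, and this variant was not used to produce the results in this notebook.

```python
import html
import re

TAG_PATTERN = re.compile(r'<.*?>')


def clean_description(text):
    """Strip HTML tags, then decode HTML entities such as &trade; or &quot;."""
    without_tags = TAG_PATTERN.sub('', text)
    return html.unescape(without_tags)


sample = '<p>Portal&trade; is a new single player game</p>'
print(clean_description(sample))
```

Decoding after stripping means an escaped tag such as `&lt;b&gt;` is kept as literal text rather than being removed.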

In [8]:
# We will use NLTK for the stemming and stopwords in the custom tokenizer function
# initialize the porter stemmer from nltk
# A stemmer will find the root word for a more robust approach
stemmer = nltk.stem.PorterStemmer()
# Download our stopwords 
nltk.download('stopwords')
from nltk.corpus import stopwords
# Use english stopwords - "throwaway" words that add no value for our matrix
stopwords = stopwords.words('english')

# Custom tokenizer to remove html tags, punctuation, set to lowercase, and remove stopwords
def my_tokenizer(sentence):
    # Remove HTML tags with custom function
    sentence = remove_html_tags(sentence)
    
    # set to lower case once, then remove punctuation using string.punctuation
    sentence = sentence.lower()
    for punct in string.punctuation:
        sentence = sentence.replace(punct,'')

    # split into words
    words = sentence.split(' ')
    stemmed_list = []
    
    # remove stopwords and any tokens that are just empty strings
    for word in words:
        if (word not in stopwords) and (word!=''):
            # Stem words
            stemmed = stemmer.stem(word)
            stemmed_list.append(stemmed)

    return stemmed_list

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/bpolzin/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [9]:
%%time
# Set up our TF IDF vectorizer
# Initial thresholds
minimum_descr_count = 5 # do not count words unless they occur in at least this many descriptions
maximum_descr_perc = 0.90 # drop words that occur in more than 90% of the descriptions

# Initialize vectorizer with our stop words, thresholds, and custom tokenizer
vectorizer = TfidfVectorizer(stop_words = stopwords, min_df=minimum_descr_count, 
                             max_df=maximum_descr_perc, tokenizer=my_tokenizer)

# Prepare the descr_combo column for processing
games_df['descr_combo'] = games_df['descr_combo'].fillna('')

# Create the TF_IDF matrix of the prepared descr_combo column
TF_IDF_matrix = vectorizer.fit_transform(games_df['descr_combo'])

# Check the shape (This will be a matrix of every game and every descriptive word, may be quite large)
TF_IDF_matrix.shape



CPU times: user 1min 19s, sys: 371 ms, total: 1min 19s
Wall time: 1min 20s


(18207, 15950)

- Great.  We now have our TF_IDF matrix of the descriptive features.  18,207 rows (games) and 15,950 features

In [10]:
# Let's look again at the Halo games in the dataset from before
games_df[games_df['name'].str.contains('Halo',na=False)]

Unnamed: 0,appid,name,average_playtime,total_ratings,percent_positive_ratings,description,descr_combo
2374,277430,Halo: Spartan Assault,251,3488,0.801892,Halo: Spartan Assault brings the excitement of...,Halo: Spartan Assault brings the excitement of...
3548,324570,Halo: Spartan Strike,6,634,0.832808,Halo: Spartan Strike makes you a Spartan super...,Halo: Spartan Strike makes you a Spartan super...
7622,459220,Halo Wars: Definitive Edition,300,2845,0.868541,<i>Halo Wars: Definitive Edition</i> is an enh...,<i>Halo Wars: Definitive Edition</i> is an enh...


- Now that the descriptive information for each game is numeric, stored in a TF-IDF matrix, we can easily compare games using the cosine_similarity function
- Cosine similarity compares two games and, because TF-IDF weights are never negative, assigns a score from 0 (no terms in common) to 1.0 (identical term weighting)
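
For intuition, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures direction rather than magnitude. Below is a minimal pure-Python sketch; the helper name `cosine` and the three toy vectors are made up for illustration and are not rows from `TF_IDF_matrix`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy TF-IDF-style vectors over a three-word vocabulary (values are made up)
game_a = [0.8, 0.6, 0.0]
game_b = [0.4, 0.3, 0.0]   # same direction as game_a, half the length
game_c = [0.0, 0.0, 1.0]   # shares no words with game_a

print(cosine(game_a, game_b))
print(cosine(game_a, game_c))
```

The first pair scores essentially 1 even though the vectors differ in length, and the second pair scores 0 because the two games have no words in common.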

In [13]:
# For example, let's compare two of the halo games listed above
# Compare the values from the TF_IDF_matrix for each game
game_1 = TF_IDF_matrix[(games_df['name']=='Halo: Spartan Assault').values,]
game_2 = TF_IDF_matrix[(games_df['name']=='Halo: Spartan Strike').values,]

print("Similarity:", cosine_similarity(game_1,game_2))

Similarity: [[0.68640148]]


- A similarity of about 0.69 is a very high score for a straightforward content-based filtering approach
- Having played both of these games, I can confirm they are extremely similar as one is nearly a copy and paste sequel to the other

To create a recommender system from this data, we now need to compare every individual game to every other game in the dataset.  This is where machine learning shines, as doing this by hand would be extraordinarily time consuming (and imagine applying that approach to all 1.2 million games available today!)

cosine_similarity is an extremely powerful function: it can run this analysis on our entire TF_IDF_matrix and produce a similarity matrix from which we pull the recommendations that meet our criteria.

In [14]:
%%time
# We compare the similarities of every game to every other game.
similarities = cosine_similarity(TF_IDF_matrix, dense_output=False)

CPU times: user 12.4 s, sys: 1.35 s, total: 13.8 s
Wall time: 14 s


- Wow!  All of that in 14 seconds.  That's amazing!

In [15]:
# Let's test it out with a very popular and influential game, Grand Theft Auto III
games_df[games_df['name']=='Grand Theft Auto III']

Unnamed: 0,appid,name,average_playtime,total_ratings,percent_positive_ratings,description,descr_combo
272,12100,Grand Theft Auto III,98,5468,0.862838,The sprawling crime epic that changed open-wor...,The sprawling crime epic that changed open-wor...


In [16]:
# Grab the index of the game to compare to
game_index = games_df[games_df['name']=='Grand Theft Auto III'].index

# Create a dataframe with game appid, name, and similarity
sim_df = pd.DataFrame({'appid':games_df['appid'],'game':games_df['name'],'similarity':np.array(similarities[game_index,:].todense()).squeeze()})

# Sort the dataframe and limit the results to the top 10 most similar games from the matrix
sim_df.sort_values(by='similarity',ascending=False).head(10)

Unnamed: 0,appid,game,similarity
272,12100,Grand Theft Auto III,1.0
2288,271590,Grand Theft Auto V,0.454594
282,12220,Grand Theft Auto: Episodes from Liberty City,0.29512
279,12180,Grand Theft Auto 2,0.288992
8325,490450,Tokyo 42,0.26422
281,12210,Grand Theft Auto IV,0.264072
2852,297000,Heroes® of Might & Magic® III - HD Edition,0.222714
12683,665270,弹幕音乐绘 ～风雷幻奏曲～ / Barrage Musical ~A Fantasy of...,0.196604
13334,696370,BROKE PROTOCOL: Online City RPG,0.194469
13122,685310,Transport Defender,0.164838


- Somewhat interesting results.  As expected, the other entries in the franchise appear on the list.
- However, I would expect more similar games such as Saints Row, or possibly another Rockstar open-world game such as Red Dead Redemption, to appear
- In the future this model could be improved by putting more weight on certain features.
    - As you may recall, we simply included the Developer, Publisher, and other information such as Genre and Categories with the description text.  
    - This likely does not put enough emphasis on these features
    - That could explain why another popular Rockstar game with similar features such as Red Dead Redemption does not appear on this list.
    - Although the game itself is similar (open world, action/adventure, made by the same Developer/Publisher, etc) the descriptions are much different (Wild West vs Urban Crime)
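
One way this weighting might be approached, sketched here under assumptions, is to vectorize the metadata (developer, publisher, genre, categories) separately from the description text and blend the two similarity scores. The function name `blend_similarities`, the default weights, and the example scores are all hypothetical and chosen only to illustrate the idea.

```python
def blend_similarities(text_sim, meta_sim, text_weight=0.6, meta_weight=0.4):
    """Weighted average of two similarity scores for the same pair of games."""
    total = text_weight + meta_weight
    return (text_weight * text_sim + meta_weight * meta_sim) / total


# Made-up scores for illustration only
# Pair 1: same developer and genre, very different description text
same_studio = blend_similarities(text_sim=0.10, meta_sim=0.90)
# Pair 2: unrelated game that happens to share some description words
word_overlap = blend_similarities(text_sim=0.25, meta_sim=0.05)

print(f"same studio:  {same_studio:.2f}")
print(f"word overlap: {word_overlap:.2f}")
```

With the metadata scored on its own, a game from the same studio and genre can outrank a game that merely shares description vocabulary, which is the behavior the Red Dead Redemption example calls for.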
    
This is a great starting point, with many opportunities to improve.  To refine the recommendations we will add thresholds for the rating information.

As we saw before, it is important to set a minimum vote threshold.  In addition, a minimum rating will be considered.  

These settings can be tuned for each user, as some gamers may prefer only the most popular and highest rated games, whereas others may be more likely to explore to find a lesser-known hit.

In [17]:
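# Recommend similar games, keeping only those above a vote-count and a positive-rating threshold
def content_recommender(name, games, similarities, vote_threshold=1000, rating_threshold=0.7):
    """Return up to 10 games most similar to `name` that exceed both thresholds.

    `name` must exactly match a title in `games`.  The queried game itself
    (similarity 1.0) is included if it passes the filters.
    """
    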
    # Get the game by the title
    game_index = games[games['name']==name].index
    
    # Create a dataframe with the game id, name, and rating information with similarity
    sim_df = pd.DataFrame(
        {'appid': games['appid'],
         'game': games['name'], 
         'similarity': np.array(similarities[game_index, :].todense()).squeeze(),
         'vote_count': games['total_ratings'],
         'percent_positive_ratings': games['percent_positive_ratings']
        })
    
    # Get the top 10 games that satisfy our thresholds
    top_games = sim_df[(sim_df['vote_count']>vote_threshold) & 
                       (sim_df['percent_positive_ratings']>rating_threshold)].sort_values(by='similarity', ascending=False).head(10)
    
    return top_games

Excellent!  We now have a working content-based recommender system with adjustable thresholds to customize for each user.

Let's test it out below with some popular games and see how it does now.

In [24]:
# Test the recommender
# Show the 5 most similar games above the thresholds, sorted by rating
# Increase the rating threshold to 80% positive reviews
similar_games = content_recommender('Grand Theft Auto III', games_df, similarities, 
                                    rating_threshold=0.80)
similar_games.head(5).sort_values(by='percent_positive_ratings',ascending=False)

Unnamed: 0,appid,game,similarity,vote_count,percent_positive_ratings
273,12110,Grand Theft Auto: Vice City,0.135675,10636,0.922997
5148,374320,DARK SOULS™ III,0.132796,110365,0.909817
902,67370,The Darkness II,0.128959,13226,0.906699
115,3910,Sid Meier's Civilization® III Complete,0.147702,2983,0.864231
272,12100,Grand Theft Auto III,1.0,5468,0.862838


- As we can see, the results are much different from before.  Now, more emphasis is being placed on finding "good" games that are also similar as opposed to descriptive similarity alone

In [31]:
# Mass Effect Recommendation
similar_games2 = content_recommender('Mass Effect', games_df, similarities, 
                                     vote_threshold=1000, rating_threshold=0.80)
similar_games2.head(5).sort_values(by='percent_positive_ratings',ascending=False)

Unnamed: 0,appid,game,similarity,vote_count,percent_positive_ratings
488,24980,Mass Effect 2,0.215164,11217,0.95168
143,6000,STAR WARS™ Republic Commando™,0.237015,6771,0.944026
382,17460,Mass Effect,1.0,10773,0.938179
415,20540,Company of Heroes: Tales of Valor,0.165758,1438,0.929764
945,91200,Anomaly: Warzone Earth,0.18034,3781,0.872785


- As expected, Mass Effect 2, the sequel to the great Mass Effect, tops the list once sorted by rating.
- STAR WARS Republic Commando looks to be another great option to check out, and it actually has the highest similarity score (0.237) of the recommendations.
    - It is another sci-fi action game with futuristic guns and vehicles, so that is a pretty solid recommendation
    
That reminds me of our Star Wars example from before: with so many different games in the franchise, it can be hard to pick which one to try next.  One of my all-time favorites is Star Wars: Knights of the Old Republic, so let's try that.

In [32]:
# Star Wars: Knights of the Old Republic Recommendation
similar_games = content_recommender('STAR WARS™ - Knights of the Old Republic™', games_df, similarities, 
                                     vote_threshold=1000, rating_threshold=0.80)
similar_games.head(5).sort_values(by='percent_positive_ratings',ascending=False)

Unnamed: 0,appid,game,similarity,vote_count,percent_positive_ratings
145,6020,STAR WARS™ Jedi Knight - Jedi Academy™,0.336279,6014,0.945128
148,6060,"Star Wars: Battlefront 2 (Classic, 2005)",0.327936,31574,0.923196
146,6030,STAR WARS™ Jedi Knight II - Jedi Outcast™,0.397158,2239,0.902188
1204,208580,STAR WARS™ Knights of the Old Republic™ II - T...,0.434218,9025,0.891634
550,32370,STAR WARS™ - Knights of the Old Republic™,1.0,13159,0.87917


- Great!  Not surprised to see only Star Wars games on the list, but this is a great way to find a new favorite.  
- I do love Jedi games with lightsabers and Force powers, so I will have to check out Jedi Knight!

The next notebook will utilize user data for collaborative filtering.

Additionally, further work could be done to properly emphasize the features with custom weights as mentioned previously.

Topic clustering can be performed as well in an effort to enhance the "cold user" approach where someone has never played any games before, but would like a more personalized recommendation on where to start.

A hybrid model utilizing a blend of user-item recommendations and content filtering would be the best approach.  You could find the top games based on similar profiles, and additionally look for more games like those to expand the recommendations further.

# MyGamePass #
## Ben Polzin ##