# **Content-based videogames recommendation system**

We are going to work with a dataset of video games from the [Steam](https://store.steampowered.com) store.

The dataset is available on [Kaggle](https://www.kaggle.com). It can be downloaded from [this page](https://www.kaggle.com/trolukovich/steam-games-complete-dataset), which also provides a simple and useful description of the data.

For example, the image below shows the following information for the `types` column:
- its unique values,
- the percentage for each value,
- its most common value,
- its missing values count.

![alt text](images/kaggle_data_descr.png "Kaggle data description")

So, our first decision is to work only with the rows whose value in the `types` column is `app`. We will also keep only a subset of the columns, which is enough to start providing recommendations.

We will develop two content-based recommendation systems:
- the first system receives the name of a game as an argument and outputs a list of the games that are most similar based on their descriptions;
- the second takes a host of features into consideration, such as the developer company, genre, tags, and game details, and provides the recommendations that are most similar with respect to these features.

## **Data profiling**

In [3]:
import pandas as pd
# Note: newer releases of this package are distributed as `ydata_profiling`
from pandas_profiling import ProfileReport

In [137]:
cols_to_use = ['name', 'release_date', 'types', 'desc_snippet', 'game_description', 'developer', 'publisher', 'popular_tags', 'game_details', 'genre']

df = pd.read_csv("data/steam_games.csv", header=0, usecols=cols_to_use)
df = df[df['types'] == 'app'].copy()
df.head()

| | types | name | desc_snippet | release_date | developer | publisher | popular_tags | game_details | genre | game_description |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | app | DOOM | Now includes all three premium DLC packs (Unto... | May 12, 2016 | id Software | Bethesda Softworks,Bethesda Softworks | FPS,Gore,Action,Demons,Shooter,First-Person,Gr... | Single-player,Multi-player,Co-op,Steam Achieve... | Action | About This Game Developed by id software, the... |
| 1 | app | PLAYERUNKNOWN'S BATTLEGROUNDS | PLAYERUNKNOWN'S BATTLEGROUNDS is a battle roya... | Dec 21, 2017 | PUBG Corporation | PUBG Corporation,PUBG Corporation | Survival,Shooter,Multiplayer,Battle Royale,PvP... | Multi-player,Online Multi-Player,Stats | Action,Adventure,Massively Multiplayer | About This Game PLAYERUNKNOWN'S BATTLEGROUND... |
| 2 | app | BATTLETECH | Take command of your own mercenary outfit of '... | Apr 24, 2018 | Harebrained Schemes | Paradox Interactive,Paradox Interactive | Mechs,Strategy,Turn-Based,Turn-Based Tactics,S... | Single-player,Multi-player,Online Multi-Player... | Action,Adventure,Strategy | About This Game From original BATTLETECH/Mec... |
| 3 | app | DayZ | The post-soviet country of Chernarus is struck... | Dec 13, 2018 | Bohemia Interactive | Bohemia Interactive,Bohemia Interactive | Survival,Zombies,Open World,Multiplayer,PvP,Ma... | Multi-player,Online Multi-Player,Steam Worksho... | Action,Adventure,Massively Multiplayer | About This Game The post-soviet country of Ch... |
| 4 | app | EVE Online | EVE Online is a community-driven spaceship MMO... | May 6, 2003 | CCP | CCP,CCP | Space,Massively Multiplayer,Sci-fi,Sandbox,MMO... | Multi-player,Online Multi-Player,MMO,Co-op,Onl... | Action,Free to Play,Massively Multiplayer,RPG,... | About This Game |


In [138]:
# Number of rows and columns of the dataset
df.shape

(38021, 10)

In [7]:
# Instantiate a ProfileReport object from the dataframe
report = ProfileReport(df)

# Visualize the report on this notebook
report.to_notebook_iframe()

In [13]:
# Run this cell if you need to export the report to an HTML file
report.to_file('videogames_data_report.html')

From the profiling report generated above, we will focus only on the missing values present in the dataset.

## **Data pre-processing**

The profiling report shows that some columns have missing values, so we are going to fill them with an empty string.

In [139]:
df.fillna('', inplace=True)

Next, we check that no null values remain in any column.

In [140]:
df.isnull().sum()

types               0
name                0
desc_snippet        0
release_date        0
developer           0
publisher           0
popular_tags        0
game_details        0
genre               0
game_description    0
dtype: int64

We are going to work with the columns that describe each game and are useful for generating recommendations: `genre`, `popular_tags`, `game_details`, `developer`, `publisher`, `game_description`, and `desc_snippet`. The values in the first three columns are either a single item or several items separated by commas, so we are going to transform them into lists.

In [142]:
col_features = ['popular_tags', 'genre', 'game_details']

In [143]:
## Viewing example values of these features
for c in col_features:
    print('Column:', c)
    print('Value: ', df.iloc[0][c])
    print('-')

Column: popular_tags
Value:  FPS,Gore,Action,Demons,Shooter,First-Person,Great Soundtrack,Multiplayer,Singleplayer,Fast-Paced,Sci-fi,Horror,Classic,Atmospheric,Difficult,Blood,Remake,Zombies,Co-op,Memes
-
Column: genre
Value:  Action
-
Column: game_details
Value:  Single-player,Multi-player,Co-op,Steam Achievements,Steam Trading Cards,Partial Controller Support,Steam Cloud
-


In [144]:
# transform each of the values in these columns into a list
for c in col_features:
    df[c] = df[c].apply(lambda x: x.split(','))

In [145]:
# viewing the transformation's result.
df.loc[0, col_features]

popular_tags    [FPS, Gore, Action, Demons, Shooter, First-Per...
genre                                                    [Action]
game_details    [Single-player, Multi-player, Co-op, Steam Ach...
Name: 0, dtype: object

For the `developer` and `publisher` columns, we will remove the spaces inside the company names, so that the vectorizer treats each company name as a single token and can tell companies apart.

In [146]:
col_companies = ['developer', 'publisher']

for c in col_companies:
    df[c] = df[c].apply(lambda x: x.replace(' ', ''))
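
To make the effect of removing spaces concrete, here is a minimal sketch using two made-up company names (`Alpha Studio` and `Beta Studio` are hypothetical, not taken from the dataset). It relies on the default tokenization of `CountVectorizer`, which lowercases the text and splits it on whitespace and punctuation.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical company names, used only for illustration
raw_names = ["Alpha Studio", "Beta Studio"]
joined_names = [name.replace(" ", "") for name in raw_names]

# With spaces, both companies share the token "studio"
raw_vocab = sorted(CountVectorizer().fit(raw_names).vocabulary_)

# Without spaces, each company becomes one distinct token
joined_vocab = sorted(CountVectorizer().fit(joined_names).vocabulary_)

print(raw_vocab)     # ['alpha', 'beta', 'studio']
print(joined_vocab)  # ['alphastudio', 'betastudio']
```

With the spaces kept, the shared word would make the two companies look partially similar even though they are unrelated.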

For each game, we are going to group all these features into a single string, stored in a new column.

In [147]:
# Function that groups the features of a game into a single string
def group_features(x):
    return f"{' '.join(x['popular_tags'])} {' '.join(x['genre'])} {' '.join(x['game_details'])} {x['developer']} {x['publisher']}"

In [148]:
# create a column with all those features grouped in a string
df['game_features'] = df.apply(group_features, axis=1)

In [149]:
# display an example of this new column
df.iloc[0]['game_features']

'FPS Gore Action Demons Shooter First-Person Great Soundtrack Multiplayer Singleplayer Fast-Paced Sci-fi Horror Classic Atmospheric Difficult Blood Remake Zombies Co-op Memes Action Single-player Multi-player Co-op Steam Achievements Steam Trading Cards Partial Controller Support Steam Cloud idSoftware BethesdaSoftworks,BethesdaSoftworks'

Finally, the data are now ready to build the recommendation models.

## **Recommendations**

### **Recommendations based on game features**
We will build a recommendation system based on game metadata: its genre, its associated tags, its details, and its developer and publisher.
Each game has all these features grouped into a single string in the `game_features` column.

We will use a [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to represent each feature string as an n-dimensional vector, where n is the number of unique words present across the entire column, and each dimension represents the number of times a particular word occurs in the features string.
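
As a small illustration of this representation, the sketch below vectorizes two made-up feature strings (they are not rows from the dataset). Each column of the resulting matrix corresponds to one word of the vocabulary, and each cell holds the number of times that word occurs in the string.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy feature strings, made up for illustration
toy_features = [
    "Action Shooter Action",
    "Shooter Horror",
]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_features)

# Words ordered by their column position in the count matrix
vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
dense = toy_counts.toarray().tolist()

print(vocab)  # ['action', 'horror', 'shooter']
print(dense)  # [[2, 0, 1], [0, 1, 1]]
```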

In [150]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [151]:
# we will only work with a subset of the original dataset due to processing capacity limitations
df_filtered = df.iloc[:10000]
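
The capacity limitation is plausibly related to the size of the pairwise similarity matrix, which grows with the square of the number of games. The rough estimate below assumes a dense matrix with 8-byte floating point entries; both the helper name `dense_similarity_bytes` and that entry size are assumptions made for illustration.

```python
def dense_similarity_bytes(n_rows, bytes_per_value=8):
    """Approximate memory of a dense n_rows x n_rows similarity matrix."""
    return n_rows * n_rows * bytes_per_value

# Subset of 10,000 games versus the full 38,021 rows of the filtered dataset
subset_gb = dense_similarity_bytes(10_000) / 1e9
full_gb = dense_similarity_bytes(38_021) / 1e9

print(f"subset: {subset_gb:.2f} GB, full dataset: {full_gb:.2f} GB")
```

Under these assumptions the subset needs about 0.8 GB, while the full dataset would need more than 11 GB.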

In [152]:
# Define a new CountVectorizer object and remove all English stop words
count = CountVectorizer(stop_words='english')

# Create the count matrix from game_features dataframe column
count_matrix = count.fit_transform(df_filtered['game_features'])
count_matrix.shape

(10000, 7715)

In [153]:
# Compute the cosine similarity score
cosine_sim = cosine_similarity(count_matrix, count_matrix)
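
For reference, the cosine similarity of two vectors is their dot product divided by the product of their norms. The sketch below checks this by hand on two toy count vectors (illustrative values only) against the scikit-learn function used above.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy count vectors, made up for illustration
a = np.array([[2, 0, 1]])
b = np.array([[0, 1, 1]])

# cosine(a, b) = (a . b) / (||a|| * ||b||)
manual = (a @ b.T).item() / (np.linalg.norm(a) * np.linalg.norm(b))
from_sklearn = cosine_similarity(a, b)[0, 0]

print(round(manual, 4), round(from_sklearn, 4))  # 0.3162 0.3162
```

A score of 1 means the two vectors point in the same direction, and a score of 0 means the games share no words at all.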

In [168]:

def content_recommender(top, name, cosine_sim=cosine_sim, df=df_filtered):
    '''
    Receive the name of a game and return the names of the `top` most similar games,
    ranked by their similarity scores.
    The rows of `df` must be the same rows used to build `cosine_sim`.
    '''

    # Map each game name to its row position in df (not its index label),
    # because the rows of cosine_sim are positional
    indices = pd.Series(range(len(df)), index=df['name'])

    # Keep only the first occurrence of each name, so the lookup returns a single position
    indices = indices[~indices.index.duplicated(keep='first')]

    # Obtain the position of the game that matches the name
    idx = indices[name]

    # Get the pairwise similarity scores of all games with that game
    # and convert them into a list of (position, score) tuples
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the games by similarity score, from highest to lowest
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Keep the scores of the `top` most similar games.
    # Skip the first entry because it is the game itself.
    sim_scores = sim_scores[1:top+1]

    # Get the positions of these games
    game_indices = [i[0] for i in sim_scores]

    # Return the names of the `top` most similar games
    return df.iloc[game_indices]['name']
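
The name lookup only works if each name resolves to a single row position that lines up with the rows of the similarity matrix. The sketch below, on a made-up frame with non-contiguous index labels and one repeated name, shows one way to build such a lookup; it is an illustration under these assumptions rather than a required implementation.

```python
import pandas as pd

# Made-up frame: non-contiguous index labels and one repeated name
toy = pd.DataFrame({"name": ["A", "B", "A"]}, index=[10, 42, 77])

# Map each name to its row position (0..n-1), which is how a similarity matrix is indexed
positions = pd.Series(range(len(toy)), index=toy["name"])

# Keep the first occurrence of each name so a lookup returns a scalar, not a Series
positions = positions[~positions.index.duplicated(keep="first")]

print(positions["A"], positions["B"])  # 0 1
```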

We are going to see examples of recommendations.
The name of the game passed as an argument to the function must match the name in the filtered dataset.

In [174]:
# Get recommendations for DOOM 
content_recommender(10, 'DOOM', df=df_filtered)

839                     Doom 3: BFG Edition
1294            Painkiller Hell & Damnation
1303                            Quake Live™
2348    Serious Sam HD: The First Encounter
788                                DOOM VFR
1823                                   RAGE
183             Call of Duty®: Black Ops II
1343                               Quake IV
1458       The Typing of The Dead: Overkill
139                                    DUSK
Name: name, dtype: object

In [175]:
# Get recommendations for Grand Theft Auto: San Andreas
content_recommender(5, 'Grand Theft Auto: San Andreas', df=df_filtered)

1437    Burnout Paradise: The Ultimate Box
2323                 Need for Speed: Shift
4138             Need for Speed Undercover
2092                             The Crew™
3089                     Shift 2 Unleashed
Name: name, dtype: object

### **Recommendations based on game descriptions**

In this case, we will build a recommender based on the game's description, which suggests the most similar games based on the similarity between their descriptions.

We will use the [TF-IDF Vectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) because some words occur much more frequently in the descriptions than others. This vectorizer weights each word by its term frequency multiplied by its inverse document frequency: the weight of a word in a document is greater when the word occurs more often in that document and appears in fewer documents overall.

In [165]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Create a TF-IDF Vectorizer object. Remove all English stop words
vectorizer = TfidfVectorizer(min_df=1, stop_words='english')

# Create the TF-IDF matrix from the game_description dataframe column
bag_of_words = vectorizer.fit_transform(df_filtered['game_description'])

In [166]:
print('Distinct words: ', bag_of_words.shape[1])

Distinct words:  60065


In [167]:
from sklearn.metrics.pairwise import linear_kernel

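# Compute the cosine similarity matrix.
# TfidfVectorizer L2-normalizes each row by default, so the dot product
# computed by linear_kernel is equal to the cosine similarity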
cosine_sim2 = linear_kernel(bag_of_words, bag_of_words)
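
Using `linear_kernel` here relies on the rows of the TF-IDF matrix having unit length, which is the default behavior of `TfidfVectorizer` (`norm='l2'`). The sketch below verifies on a few made-up documents that, under that default, the plain dot product and the cosine similarity coincide.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Toy documents, made up for illustration
toy_docs = [
    "demons invade mars",
    "racing cars on mars",
    "farming simulator",
]

# Default settings: each row is scaled to unit L2 norm
toy_matrix = TfidfVectorizer().fit_transform(toy_docs)
row_norms = np.linalg.norm(toy_matrix.toarray(), axis=1)

dot_products = linear_kernel(toy_matrix, toy_matrix)
cosines = cosine_similarity(toy_matrix, toy_matrix)

print(np.allclose(row_norms, 1.0))         # True
print(np.allclose(dot_products, cosines))  # True
```

If the vectorizer were created with `norm=None`, the two results would generally differ and `cosine_similarity` would be the function to use.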

Finally, we can reuse the `content_recommender` function defined previously by passing in `cosine_sim2` as an argument.

In [176]:
content_recommender(10, 'DOOM', cosine_sim2, df_filtered)

839             Doom 3: BFG Edition
788                        DOOM VFR
366                    DOOM Eternal
2105    DOOM 3 Resurrection of Evil
1652           Hell is Other Demons
96                    Ultimate Doom
7687       The Haunted: Hells Reach
7780                         HordeZ
8548       Hellbound: Survival Mode
8648                   Reflex Arena
Name: name, dtype: object

In [181]:
content_recommender(10, 'Grand Theft Auto: San Andreas', cosine_sim2, df_filtered)

3089                            Shift 2 Unleashed
2323                        Need for Speed: Shift
2474                                     F1™ 2017
4138                    Need for Speed Undercover
1947                               GRID Autosport
5890                              Race for Tuning
5135                                Battle Riders
8311    Rocket League® - Hot Wheels® Bone Shaker™
4973                              Late Night 1320
6041               Idle Racing GO: Clicker Tycoon
Name: name, dtype: object