# Simple Video Game Recommendation
Jonathan Truong (GitHub: jontruong05)

This is a basic video game recommender system built on the Kaggle dataset "Popular Video Games 1980 - 2023" (https://www.kaggle.com/datasets/arnabchaki/popular-video-games-1980-2023). Given a query from the user, the recommender returns the titles of the five games whose descriptions best match it.

Before running any code, make sure that all required packages are installed by uncommenting and running the cell below.

In [365]:
# !pip install numpy pandas nltk

## Step 1: Load and Preprocess the Dataset

First, let's load the dataset and make it suitable for recommendation-making. 

In [324]:
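import numpy as np
import pandas as pd
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

# The stop-word list is a separate NLTK download; it is needed by stopwords.words() later on
nltk.download('stopwords', quiet=True)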

In [325]:
# Load data from .csv file
game_data = pd.read_csv('games.csv')
game_data.head()

Unnamed: 0.1,Unnamed: 0,Title,Release Date,Team,Rating,Times Listed,Number of Reviews,Genres,Summary,Reviews,Plays,Playing,Backlogs,Wishlist
0,0,Elden Ring,"Feb 25, 2022","['Bandai Namco Entertainment', 'FromSoftware']",4.5,3.9K,3.9K,"['Adventure', 'RPG']","Elden Ring is a fantasy, action and open world...","[""The first playthrough of elden ring is one o...",17K,3.8K,4.6K,4.8K
1,1,Hades,"Dec 10, 2019",['Supergiant Games'],4.3,2.9K,2.9K,"['Adventure', 'Brawler', 'Indie', 'RPG']",A rogue-lite hack and slash dungeon crawler in...,['convinced this is a roguelike for people who...,21K,3.2K,6.3K,3.6K
2,2,The Legend of Zelda: Breath of the Wild,"Mar 03, 2017","['Nintendo', 'Nintendo EPD Production Group No...",4.4,4.3K,4.3K,"['Adventure', 'RPG']",The Legend of Zelda: Breath of the Wild is the...,['This game is the game (that is not CS:GO) th...,30K,2.5K,5K,2.6K
3,3,Undertale,"Sep 15, 2015","['tobyfox', '8-4']",4.2,3.5K,3.5K,"['Adventure', 'Indie', 'RPG', 'Turn Based Stra...","A small child falls into the Underground, wher...",['soundtrack is tied for #1 with nier automata...,28K,679,4.9K,1.8K
4,4,Hollow Knight,"Feb 24, 2017",['Team Cherry'],4.4,3K,3K,"['Adventure', 'Indie', 'Platform']",A 2D metroidvania with an emphasis on close co...,"[""this games worldbuilding is incredible, with...",21K,2.4K,8.3K,2.3K


There are some things to note about this dataset. First, there are duplicate entries for some games:

In [326]:
game_data['Title'].value_counts()[:5]

Title
Doom                      7
Dead Space                5
Shadow of the Colossus    5
Resident Evil 2           5
God of War                4
Name: count, dtype: int64

Next, some columns have missing values.

In [327]:
game_data.isna().mean()[:-4]

Unnamed: 0           0.000000
Title                0.000000
Release Date         0.000000
Team                 0.000661
Rating               0.008598
Times Listed         0.000000
Number of Reviews    0.000000
Genres               0.000000
Summary              0.000661
Reviews              0.000000
dtype: float64

To fix this, let's drop the duplicate entries so that the recommender never suggests the same title more than once. We'll also fill the missing values with an empty string so that the text operations later on do not fail on missing entries.

In [328]:
game_data = game_data.drop_duplicates(subset=['Title'])
game_data = game_data.fillna('')
game_data.head()

Unnamed: 0.1,Unnamed: 0,Title,Release Date,Team,Rating,Times Listed,Number of Reviews,Genres,Summary,Reviews,Plays,Playing,Backlogs,Wishlist
0,0,Elden Ring,"Feb 25, 2022","['Bandai Namco Entertainment', 'FromSoftware']",4.5,3.9K,3.9K,"['Adventure', 'RPG']","Elden Ring is a fantasy, action and open world...","[""The first playthrough of elden ring is one o...",17K,3.8K,4.6K,4.8K
1,1,Hades,"Dec 10, 2019",['Supergiant Games'],4.3,2.9K,2.9K,"['Adventure', 'Brawler', 'Indie', 'RPG']",A rogue-lite hack and slash dungeon crawler in...,['convinced this is a roguelike for people who...,21K,3.2K,6.3K,3.6K
2,2,The Legend of Zelda: Breath of the Wild,"Mar 03, 2017","['Nintendo', 'Nintendo EPD Production Group No...",4.4,4.3K,4.3K,"['Adventure', 'RPG']",The Legend of Zelda: Breath of the Wild is the...,['This game is the game (that is not CS:GO) th...,30K,2.5K,5K,2.6K
3,3,Undertale,"Sep 15, 2015","['tobyfox', '8-4']",4.2,3.5K,3.5K,"['Adventure', 'Indie', 'RPG', 'Turn Based Stra...","A small child falls into the Underground, wher...",['soundtrack is tied for #1 with nier automata...,28K,679,4.9K,1.8K
4,4,Hollow Knight,"Feb 24, 2017",['Team Cherry'],4.4,3K,3K,"['Adventure', 'Indie', 'Platform']",A 2D metroidvania with an emphasis on close co...,"[""this games worldbuilding is incredible, with...",21K,2.4K,8.3K,2.3K
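
As a small illustration of these two calls, the sketch below applies them to a made-up three-row frame (the `toy` data is hypothetical, not taken from the dataset). It also shows a side effect worth knowing about: a blanket `fillna('')` puts an empty string into the numeric `Rating` column as well, whereas passing a dict fills only the columns named in it.

```python
import pandas as pd

# Hypothetical toy data: one duplicated title, one missing rating, one missing summary
toy = pd.DataFrame({
    'Title': ['Doom', 'Doom', 'Hades'],
    'Rating': [4.0, 3.5, None],
    'Summary': ['A shooter.', 'Duplicate row.', None],
})

# Keep only the first row for each title (keep='first' is the default)
deduped = toy.drop_duplicates(subset=['Title'])

# Blanket fill: every missing value, numeric or not, becomes ''
blanket = deduped.fillna('')

# Targeted fill: only Summary is filled; Rating keeps its missing value
targeted = deduped.fillna({'Summary': ''})

print(deduped['Title'].tolist())            # ['Doom', 'Hades']
print(blanket['Rating'].tolist())           # [4.0, '']
print(targeted['Rating'].isna().tolist())   # [False, True]
```

Neither variant affects the recommender in this notebook, since `Rating` is not used in the corpus; the targeted form only matters if the numeric columns are needed later.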


Now, duplicate entries for games have been removed and missing entries have been filled with a blank string. To ensure that the data is easy to handle, let's make recommendations from just the first 500 games in the dataset. 

In [329]:
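# .copy() makes an independent frame, so adding columns later does not trigger a SettingWithCopyWarning
game_data = game_data.iloc[:500].copy()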

When looking at the game data, it seems like the `Genres`, `Summary`, and `Reviews` columns have information that will help provide good recommendations.  

The recommendations will be made by first forming a text corpus for each game by concatenating the text in the three aforementioned columns. Next, each corpus and the user-submitted query will be converted into vectors of word counts. The query vector will be compared to each game's vector via cosine similarity, and the five games with the highest cosine similarity scores will be returned as the recommendations.

We now begin the text gathering. First, let's take a look at the `Genres` column. Although the items in the `Genres` column look like they are lists, they are actually strings. Let's fix that by removing the single quotes and the square brackets.

In [330]:
game_data['Genres'] = game_data['Genres'].apply(lambda x: x.replace('[', '').replace(']', '').replace('\'', ''))
game_data.head()

Unnamed: 0.1,Unnamed: 0,Title,Release Date,Team,Rating,Times Listed,Number of Reviews,Genres,Summary,Reviews,Plays,Playing,Backlogs,Wishlist
0,0,Elden Ring,"Feb 25, 2022","['Bandai Namco Entertainment', 'FromSoftware']",4.5,3.9K,3.9K,"Adventure, RPG","Elden Ring is a fantasy, action and open world...","[""The first playthrough of elden ring is one o...",17K,3.8K,4.6K,4.8K
1,1,Hades,"Dec 10, 2019",['Supergiant Games'],4.3,2.9K,2.9K,"Adventure, Brawler, Indie, RPG",A rogue-lite hack and slash dungeon crawler in...,['convinced this is a roguelike for people who...,21K,3.2K,6.3K,3.6K
2,2,The Legend of Zelda: Breath of the Wild,"Mar 03, 2017","['Nintendo', 'Nintendo EPD Production Group No...",4.4,4.3K,4.3K,"Adventure, RPG",The Legend of Zelda: Breath of the Wild is the...,['This game is the game (that is not CS:GO) th...,30K,2.5K,5K,2.6K
3,3,Undertale,"Sep 15, 2015","['tobyfox', '8-4']",4.2,3.5K,3.5K,"Adventure, Indie, RPG, Turn Based Strategy","A small child falls into the Underground, wher...",['soundtrack is tied for #1 with nier automata...,28K,679,4.9K,1.8K
4,4,Hollow Knight,"Feb 24, 2017",['Team Cherry'],4.4,3K,3K,"Adventure, Indie, Platform",A 2D metroidvania with an emphasis on close co...,"[""this games worldbuilding is incredible, with...",21K,2.4K,8.3K,2.3K
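
Stripping brackets and quotes works here because genre names contain neither. An alternative sketch, assuming every `Genres` cell is a valid Python list literal, parses the string with the standard library's `ast.literal_eval` and joins the items. The helper name `genres_to_text` is hypothetical.

```python
import ast

def genres_to_text(raw):
    """Turn a string such as "['Adventure', 'RPG']" into 'Adventure, RPG'."""
    if not raw:
        return ''                   # blank cells (e.g. after fillna('')) stay blank
    items = ast.literal_eval(raw)   # safely parse the list literal, no code execution
    return ', '.join(items)

print(genres_to_text("['Adventure', 'RPG']"))  # Adventure, RPG
```

In the notebook this could be applied with `game_data['Genres'].apply(genres_to_text)`. Whether the `Reviews` cells all parse as valid literals has not been checked here, so the replace-based cleanup is the safer choice for that column.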


The same idea applies to the `Reviews` column, so we will remove any single quotes, double quotes, and square brackets.

In [331]:
game_data['Reviews'] = game_data['Reviews'].apply(lambda x: x.replace('\'', '').replace('"', '').replace('[', '').replace(']', ''))
game_data.head()

Unnamed: 0.1,Unnamed: 0,Title,Release Date,Team,Rating,Times Listed,Number of Reviews,Genres,Summary,Reviews,Plays,Playing,Backlogs,Wishlist
0,0,Elden Ring,"Feb 25, 2022","['Bandai Namco Entertainment', 'FromSoftware']",4.5,3.9K,3.9K,"Adventure, RPG","Elden Ring is a fantasy, action and open world...",The first playthrough of elden ring is one of ...,17K,3.8K,4.6K,4.8K
1,1,Hades,"Dec 10, 2019",['Supergiant Games'],4.3,2.9K,2.9K,"Adventure, Brawler, Indie, RPG",A rogue-lite hack and slash dungeon crawler in...,convinced this is a roguelike for people who d...,21K,3.2K,6.3K,3.6K
2,2,The Legend of Zelda: Breath of the Wild,"Mar 03, 2017","['Nintendo', 'Nintendo EPD Production Group No...",4.4,4.3K,4.3K,"Adventure, RPG",The Legend of Zelda: Breath of the Wild is the...,This game is the game (that is not CS:GO) that...,30K,2.5K,5K,2.6K
3,3,Undertale,"Sep 15, 2015","['tobyfox', '8-4']",4.2,3.5K,3.5K,"Adventure, Indie, RPG, Turn Based Strategy","A small child falls into the Underground, wher...",soundtrack is tied for #1 with nier automata. ...,28K,679,4.9K,1.8K
4,4,Hollow Knight,"Feb 24, 2017",['Team Cherry'],4.4,3K,3K,"Adventure, Indie, Platform",A 2D metroidvania with an emphasis on close co...,"this games worldbuilding is incredible, with i...",21K,2.4K,8.3K,2.3K


Now, we can make the corpus by simply adding the text from the `Genres`, `Summary`, and `Reviews` columns together for each game. A new column named `corpus` will be added to the dataset to store the corpuses.

In [332]:
game_data['corpus'] = (game_data['Genres'] + ' ' + game_data['Summary'] + ' ' + game_data['Reviews']).apply(str.lower)
game_data.head()

Unnamed: 0.1,Unnamed: 0,Title,Release Date,Team,Rating,Times Listed,Number of Reviews,Genres,Summary,Reviews,Plays,Playing,Backlogs,Wishlist,corpus
0,0,Elden Ring,"Feb 25, 2022","['Bandai Namco Entertainment', 'FromSoftware']",4.5,3.9K,3.9K,"Adventure, RPG","Elden Ring is a fantasy, action and open world...",The first playthrough of elden ring is one of ...,17K,3.8K,4.6K,4.8K,"adventure, rpg elden ring is a fantasy, action..."
1,1,Hades,"Dec 10, 2019",['Supergiant Games'],4.3,2.9K,2.9K,"Adventure, Brawler, Indie, RPG",A rogue-lite hack and slash dungeon crawler in...,convinced this is a roguelike for people who d...,21K,3.2K,6.3K,3.6K,"adventure, brawler, indie, rpg a rogue-lite ha..."
2,2,The Legend of Zelda: Breath of the Wild,"Mar 03, 2017","['Nintendo', 'Nintendo EPD Production Group No...",4.4,4.3K,4.3K,"Adventure, RPG",The Legend of Zelda: Breath of the Wild is the...,This game is the game (that is not CS:GO) that...,30K,2.5K,5K,2.6K,"adventure, rpg the legend of zelda: breath of ..."
3,3,Undertale,"Sep 15, 2015","['tobyfox', '8-4']",4.2,3.5K,3.5K,"Adventure, Indie, RPG, Turn Based Strategy","A small child falls into the Underground, wher...",soundtrack is tied for #1 with nier automata. ...,28K,679,4.9K,1.8K,"adventure, indie, rpg, turn based strategy a s..."
4,4,Hollow Knight,"Feb 24, 2017",['Team Cherry'],4.4,3K,3K,"Adventure, Indie, Platform",A 2D metroidvania with an emphasis on close co...,"this games worldbuilding is incredible, with i...",21K,2.4K,8.3K,2.3K,"adventure, indie, platform a 2d metroidvania w..."


It's important to note that not every single word in a summary or a review of a game is important. Therefore, the next step will involve removing any stop words that appear in the corpuses.

In [333]:
# Load a tokenizer to tokenize the corpuses and import the list of stop words from the nltk package
tkn = RegexpTokenizer(r'\w+')
stop_words = stopwords.words('english')

In [334]:
# Tokenize each game's corpus, drop the stop words, and join the remaining tokens back into one string.
stop_word_set = set(stop_words)  # set membership tests are faster than scanning a list
fixed_corpuses = []
for corpus in game_data['corpus']:
    tokens = tkn.tokenize(corpus)
    kept_tokens = [token for token in tokens if token not in stop_word_set]
    fixed_corpuses.append(' '.join(kept_tokens))

In [335]:
# Make the list of fixed corpuses be a column in the dataset called `fixed_corpus`
game_data['fixed_corpus'] = fixed_corpuses
game_data.head()

Unnamed: 0.1,Unnamed: 0,Title,Release Date,Team,Rating,Times Listed,Number of Reviews,Genres,Summary,Reviews,Plays,Playing,Backlogs,Wishlist,corpus,fixed_corpus
0,0,Elden Ring,"Feb 25, 2022","['Bandai Namco Entertainment', 'FromSoftware']",4.5,3.9K,3.9K,"Adventure, RPG","Elden Ring is a fantasy, action and open world...",The first playthrough of elden ring is one of ...,17K,3.8K,4.6K,4.8K,"adventure, rpg elden ring is a fantasy, action...",adventure rpg elden ring fantasy action open w...
1,1,Hades,"Dec 10, 2019",['Supergiant Games'],4.3,2.9K,2.9K,"Adventure, Brawler, Indie, RPG",A rogue-lite hack and slash dungeon crawler in...,convinced this is a roguelike for people who d...,21K,3.2K,6.3K,3.6K,"adventure, brawler, indie, rpg a rogue-lite ha...",adventure brawler indie rpg rogue lite hack sl...
2,2,The Legend of Zelda: Breath of the Wild,"Mar 03, 2017","['Nintendo', 'Nintendo EPD Production Group No...",4.4,4.3K,4.3K,"Adventure, RPG",The Legend of Zelda: Breath of the Wild is the...,This game is the game (that is not CS:GO) that...,30K,2.5K,5K,2.6K,"adventure, rpg the legend of zelda: breath of ...",adventure rpg legend zelda breath wild first 3...
3,3,Undertale,"Sep 15, 2015","['tobyfox', '8-4']",4.2,3.5K,3.5K,"Adventure, Indie, RPG, Turn Based Strategy","A small child falls into the Underground, wher...",soundtrack is tied for #1 with nier automata. ...,28K,679,4.9K,1.8K,"adventure, indie, rpg, turn based strategy a s...",adventure indie rpg turn based strategy small ...
4,4,Hollow Knight,"Feb 24, 2017",['Team Cherry'],4.4,3K,3K,"Adventure, Indie, Platform",A 2D metroidvania with an emphasis on close co...,"this games worldbuilding is incredible, with i...",21K,2.4K,8.3K,2.3K,"adventure, indie, platform a 2d metroidvania w...",adventure indie platform 2d metroidvania empha...
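
To make the filtering step concrete, here is a self-contained sketch on a single sentence. It uses `re.findall(r'\w+', ...)`, which extracts the same kind of tokens as `RegexpTokenizer(r'\w+')`, and a tiny hand-picked stop-word set as a stand-in for NLTK's English list. The names `toy_stop_words` and `remove_stop_words` are assumptions made for illustration.

```python
import re

# Small stand-in for stopwords.words('english'); the real list is much longer
toy_stop_words = {'a', 'an', 'and', 'in', 'is', 'of', 'the', 'with'}

def remove_stop_words(text, stop_words):
    # Lowercase, split into runs of word characters, then drop the stop words
    tokens = re.findall(r'\w+', text.lower())
    return ' '.join(token for token in tokens if token not in stop_words)

print(remove_stop_words('A rogue-lite hack and slash dungeon crawler', toy_stop_words))
# rogue lite hack slash dungeon crawler
```

Note that the hyphen in "rogue-lite" is treated as a separator, which matches the `fixed_corpus` values shown above.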


## Step 2: Convert Text Data to Vectors

The next step will involve making a vector of word counts for each corpus. An entry in the vector will indicate how many times a particular word appeared in the corpus. To avoid producing a lengthy output, the number of unique words among all corpuses will be displayed instead.

In [336]:
# Find all of the unique words among all the corpuses
unique_words = game_data['fixed_corpus'].str.split().explode().value_counts()
len(list(unique_words.index))

14929

Now that we know the unique words and how many there are, it's time to vectorize each game's corpus. The following block of code will find the count of each unique word for each corpus and output it into a DataFrame. A row will correspond to a game's vector of word counts.

In [338]:
# Heads up, this block of code takes roughly 2-3 minutes to run!
counts_dict = {}
for word in unique_words.index:
    re_pat = fr'\b{word}\b'
    counts_dict[word] = game_data['fixed_corpus'].str.count(re_pat).astype(int).tolist()
    
counts_df = pd.DataFrame(counts_dict).set_index(game_data['Title'])
counts_df.head()

Title,game,n,like,games,one,de,adventure,time,really,que,...,ganon,chatgpt,pérola,satisfatória,ganondorf,auhuhahhhh,unfavourable,guilt,yooooo,helicopters
Elden Ring,6,2,0,2,2,0,1,4,0,0,...,0,0,0,0,0,0,0,0,0,0
Hades,4,0,4,1,1,1,1,1,3,3,...,0,0,0,0,0,0,0,0,0,0
The Legend of Zelda: Breath of the Wild,8,0,1,2,1,8,1,1,3,12,...,0,0,0,0,0,0,0,0,0,0
Undertale,5,0,0,0,1,0,1,0,1,2,...,0,0,0,0,0,0,0,0,0,0
Hollow Knight,9,12,3,3,2,0,2,2,1,1,...,0,0,0,0,0,0,0,0,0,0
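
The regex loop scans every corpus once per vocabulary word, which is likely where the multi-minute runtime comes from. Because each `fixed_corpus` is already a string of space-separated tokens, a single pass per corpus with `collections.Counter` should produce the same counts. The sketch below shows the idea on two made-up corpuses; `toy_corpuses`, `vocab`, and `vectors` are illustrative names.

```python
from collections import Counter

# Hypothetical cleaned corpuses, keyed by game title
toy_corpuses = {
    'Game A': 'adventure rpg open world adventure',
    'Game B': 'indie platform adventure',
}

# Count the tokens once per corpus
counters = {title: Counter(text.split()) for title, text in toy_corpuses.items()}

# Fix the vocabulary order so that every vector has the same layout
vocab = sorted(set().union(*counters.values()))

# One count vector per game; a Counter returns 0 for words it has not seen
vectors = {title: [counts[word] for word in vocab] for title, counts in counters.items()}

print(vocab)              # ['adventure', 'indie', 'open', 'platform', 'rpg', 'world']
print(vectors['Game A'])  # [2, 0, 1, 0, 1, 1]
print(vectors['Game B'])  # [1, 1, 0, 1, 0, 0]
```

The resulting rows could then be loaded into a DataFrame indexed by title, giving the same shape as `counts_df`.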


## Step 3: Implement a Function to Compute Similarity

The similarity between the query and the item descriptions will be calculated using cosine similarity. The following block of code defines a function that calculates the cosine similarity between two vectors represented by one-dimensional NumPy arrays.

In [360]:
# Calculate the cosine similarity between two vectors using NumPy
def cosine_similarity(a, b):
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    # A zero vector (e.g. a query with none of the significant words) has no direction, so score it as 0
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return np.dot(a, b) / (norm_a * norm_b)
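
For reference, the cosine similarity of two vectors is their dot product divided by the product of their lengths, so it depends on the direction of the vectors and not on their magnitude. A small worked sketch follows; the function name `cos_sim` and the example vectors are illustrative, and scoring a zero vector as 0.0 is an assumption of this sketch.

```python
import numpy as np

def cos_sim(a, b):
    norm_a, norm_b = np.linalg.norm(a), np.linalg.norm(b)
    if norm_a == 0 or norm_b == 0:
        return 0.0   # a zero vector has no direction to compare
    return float(np.dot(a, b) / (norm_a * norm_b))

game = np.array([2, 0, 1])
same_direction = np.array([4, 0, 2])   # same word proportions, twice the counts
no_overlap = np.array([0, 3, 0])       # shares no words with `game`
empty_query = np.array([0, 0, 0])      # query with no known words

print(round(cos_sim(game, same_direction), 6))  # 1.0
print(cos_sim(game, no_overlap))                # 0.0
print(cos_sim(game, empty_query))               # 0.0
```

The first result shows why word counts need no length normalization here: a long corpus and a short query can still score 1.0 if their words occur in the same proportions.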

## Step 4: Return the Top Matches

With the cosine similarity function implemented, we just need the user query so that we can output the top five video game recommendations. First, we implement the function that makes the recommendations.

In [361]:
def make_recommendations(query):
    # Lowercase the query so that it can match the lowercased corpuses
    query = query.lower()

    # Reused code for vectorizing the corpuses, but this is for the user query
    query_word_counts_dict = {}
    for word in unique_words.index:
        re_pat = fr'\b{word}\b'
        query_word_counts_dict[word] = pd.Series([query]).str.count(re_pat).astype(int).tolist()

    query_word_counts_df = pd.DataFrame(query_word_counts_dict)
    query_vector = np.array(query_word_counts_df.iloc[0])

    # Stores all cosine similarity scores between the user query and each game
    cosine_similarity_dict = {}

    # For each game's vectorized corpus, calculate the cosine similarity between it and the user query
    for game in counts_df.index:
        cosine_similarity_dict[game] = cosine_similarity(np.array(counts_df.loc[game]), query_vector)

    # Sort the scores in descending order and return the titles of the top five games
    cos_sim_scores = pd.Series(cosine_similarity_dict, name='cos_sim_scores').sort_values(ascending=False)
    return list(cos_sim_scores.index[:5])
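
The per-game loop can also be expressed as a single matrix-vector product, which may be noticeably faster once the count matrix is large. The sketch below assumes a small made-up count matrix (`counts`, one row per game) and query vector (`query_vec`); `top_k_titles` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def top_k_titles(counts, titles, query_vec, k=5):
    """Return the k titles whose rows have the highest cosine similarity to query_vec."""
    counts = np.asarray(counts, dtype=float)
    query_vec = np.asarray(query_vec, dtype=float)
    dots = counts @ query_vec                      # one dot product per game
    norms = np.linalg.norm(counts, axis=1) * np.linalg.norm(query_vec)
    # Where a norm is zero, leave the score at 0 instead of dividing
    scores = np.divide(dots, norms, out=np.zeros_like(dots), where=norms > 0)
    # A stable sort on the negated scores keeps the original order among ties
    order = np.argsort(-scores, kind='stable')[:k]
    return [titles[i] for i in order]

titles = ['Game A', 'Game B', 'Game C']
counts = [[2, 0, 1],
          [0, 3, 0],
          [1, 1, 1]]
query_vec = [1, 0, 0]   # the query mentions only the first vocabulary word

print(top_k_titles(counts, titles, query_vec, k=2))  # ['Game A', 'Game C']
```

With the notebook's data, `counts` would correspond to `counts_df.to_numpy()` and `titles` to `list(counts_df.index)`.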

Before running the cell below, be sure that ALL cells above this block have been run in order. The code block below will ask you to input a query. You can make other queries simply by rerunning the cell. After submitting the query, you will get a list of five game recommendations. Consider giving some of them a try!

In [363]:
query = input()
make_recommendations(query)

['Elden Ring',
 'Borderlands 2',
 'Bully',
 'Mario Party Superstars',
 'Rain World']

# Salary Expectation

When I found the job posting, I saw that the hourly rate would be $20-$30 per hour for 20 hours a week. Therefore, assuming about four weeks per month, I would expect the monthly salary to be around $1,600-$2,400.