# MyGamePass #
## Collaborative Based Filtering ##

One of the most compelling ways I would discover new games as a kid was when my best friends would recommend them to me.  My cousin changed my life when he loaned me The Legend of Zelda: Ocarina of Time as a kid and I fell in love.

This experience can be taken to a whole new level with machine learning.  With extensive user engagement data, we can perform analysis to find similar user profiles as your own to predict what ratings you would give a game you've not played before.  

Using these predicted ratings that satisfy our requirements recommendations are provided based on playing habits.

## Matrix Factorization ##

One of the major challenges in putting recommender system applications into production is the nearly impossible compute requirements of an in-memory system when deployed at scale.  Consider if we had 10s of millions of users and millions of games, the standard memory based collaborative filtering would become very challenging.

The solution for this is Matrix Factorization.  We will be using the Funk Singular Vector Decomposition method, or FunkSVD named after the inventor Simon Funk.  It is a part of the Surprise package which has become very popular for its built-in recommender system evaluation tools.

The SVD will generate a predicted rating for every user-item pair in our prepared dataset.

In [25]:
# Import the trinity
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt

# Import the surprise packages
from surprise import Dataset
from surprise.reader import Reader
from surprise.prediction_algorithms.matrix_factorization import SVD as FunkSVD

# Import GridSearchCV for algorithm tuning
from surprise.model_selection import GridSearchCV

# Import train_test_split
from surprise.model_selection import train_test_split

Again to highlight, additional extensive data cleaning and preprocessing steps were performed in the data cleanup notebooks.  The prepared data will be loaded from here.

In [2]:
# Read in the prepared dataframe from the user_cleanup notebook
user_df = pd.read_csv('data/user_clean.csv')
user_df.head()

Unnamed: 0,user_id,app_name,hours_played,appid,user_num_games,game_popularity,rating_5,rating_10
0,151603712,Fallout 4,87.0,377160,16,144,5,10
1,25096601,Fallout 4,1.6,377160,25,144,2,4
2,4834220,Fallout 4,19.8,377160,5,144,4,8
3,65229865,Fallout 4,0.5,377160,16,144,1,2
4,65958466,Fallout 4,123.0,377160,46,144,5,10


column: description

- user_id: unique id for each user
- app_name: name of video game
- hours_played: amount of hours this user played the video game
- appid: steam appid for joining
- user_num_games: total number of games the user has played in this dataset
- game_popularity: total number of users who have played this game in the dataset
- rating_5: converted the hours_played to a 1-5 rating for Surprise package modeling
- rating_10: converted the hours_played to a 1-10 rating for Surprise package modeling

two ratings were prepared to test impact on accuracy of the model

In [3]:
# And read in the prepared games dataframe from the steam
games_df = pd.read_csv('data/games.csv')
games_df.head()

Unnamed: 0,appid,name,average_playtime,total_ratings,percent_positive_ratings,description
0,10,Counter-Strike,300,127873,0.973888,Play the world's number 1 online action game. ...
1,20,Team Fortress Classic,277,3951,0.839787,One of the most popular online action games of...
2,30,Day of Defeat,187,3814,0.895648,Enlist in an intense brand of Axis vs. Allied ...
3,40,Deathmatch Classic,258,1540,0.826623,Enjoy fast-paced multiplayer gaming with Death...
4,50,Half-Life: Opposing Force,300,5538,0.947996,Return to the Black Mesa Research Facility as ...


column: description

- appid: steam appid for joining
- name: video game name
- average_playtime: average playtime based on master steam store dataset
- total_ratings: total number of ratings
- percent_positive_ratings: percentage of ratings that are positive
- description: text description of the game

In [4]:
# Merge the two dataframes on appid
df = user_df.merge(games_df,on='appid')
df = df.drop('name',1)
df.head()

Unnamed: 0,user_id,app_name,hours_played,appid,user_num_games,game_popularity,rating_5,rating_10,average_playtime,total_ratings,percent_positive_ratings,description
0,151603712,Fallout 4,87.0,377160,16,144,5,10,300,155753,0.708661,"<h1>Review Highlights</h1><p><img src=""https:/..."
1,25096601,Fallout 4,1.6,377160,25,144,2,4,300,155753,0.708661,"<h1>Review Highlights</h1><p><img src=""https:/..."
2,4834220,Fallout 4,19.8,377160,5,144,4,8,300,155753,0.708661,"<h1>Review Highlights</h1><p><img src=""https:/..."
3,65229865,Fallout 4,0.5,377160,16,144,1,2,300,155753,0.708661,"<h1>Review Highlights</h1><p><img src=""https:/..."
4,65958466,Fallout 4,123.0,377160,46,144,5,10,300,155753,0.708661,"<h1>Review Highlights</h1><p><img src=""https:/..."


In [5]:
# Let's take a look at one of the most prominent users in the dataset, user 24469287
df[df['user_id'] == 24469287]

Unnamed: 0,user_id,app_name,hours_played,appid,user_num_games,game_popularity,rating_5,rating_10,average_playtime,total_ratings,percent_positive_ratings,description
20,24469287,Fallout 4,119.0,377160,140,144,5,10,300,155753,0.708661,"<h1>Review Highlights</h1><p><img src=""https:/..."
228,24469287,Left 4 Dead 2,0.6,550,140,621,1,2,300,260207,0.967649,"Set in the zombie apocalypse, Left 4 Dead 2 (L..."
950,24469287,Left 4 Dead,2.4,500,140,160,2,4,300,18899,0.949839,"From Valve (the creators of Counter-Strike, Ha..."
2236,24469287,The Banner Saga,12.4,237990,140,28,4,8,164,10881,0.896057,Live through an epic role-playing Viking saga ...
2284,24469287,BioShock Infinite,33.0,8870,140,196,5,9,300,83288,0.953823,"Indebted to the wrong people, with his life on..."
...,...,...,...,...,...,...,...,...,...,...,...,...
23935,24469287,The Fall,0.7,290770,140,4,1,2,106,3151,0.883529,Experience the first story in a mind bending t...
23939,24469287,Greed Corp,0.4,48950,140,12,1,1,38,467,0.854390,<strong>Turn-based strategy at its finest</str...
23951,24469287,Ultratron,0.4,219190,140,4,1,1,84,605,0.900826,Experience the addictive gameplay of old-schoo...
23955,24469287,Capsized,0.3,95300,140,10,1,1,152,935,0.773262,<strong>Capsized</strong> is a fast paced 2D p...


In [6]:
# Let's find this users favorite games using the 1-5 rating scale
print(f"Shape:{df[(df['user_id'] == 24469287) & (df['rating_5'] == 5)].shape}")
display(df[(df['user_id'] == 24469287) & (df['rating_5'] == 5)])

Shape:(10, 12)


Unnamed: 0,user_id,app_name,hours_played,appid,user_num_games,game_popularity,rating_5,rating_10,average_playtime,total_ratings,percent_positive_ratings,description
20,24469287,Fallout 4,119.0,377160,140,144,5,10,300,155753,0.708661,"<h1>Review Highlights</h1><p><img src=""https:/..."
2284,24469287,BioShock Infinite,33.0,8870,140,196,5,9,300,83288,0.953823,"Indebted to the wrong people, with his life on..."
5025,24469287,Grand Theft Auto V,34.0,271590,140,205,5,9,300,468369,0.702568,<strong>Partner with legendary impresario Tony...
5770,24469287,Mad Max,25.0,234140,140,21,5,9,300,40033,0.904804,"Become Mad Max, the lone warrior in a savage p..."
6331,24469287,Saints Row IV,22.0,206420,140,128,5,9,300,59036,0.916238,The US President must save the Earth from alie...
6909,24469287,The Talos Principle,25.0,257510,140,23,5,9,230,18022,0.956442,"<img src=""https://steamcdn-a.akamaihd.net/stea..."
8688,24469287,Borderlands 2,31.0,49520,140,359,5,9,300,155616,0.929178,A new era of shoot and loot is about to begin....
13302,24469287,Terraria,29.0,105600,140,398,5,9,300,263397,0.970398,"Dig, Fight, Explore, Build: The very world is..."
18952,24469287,StarDrive 2,28.0,252450,140,4,5,9,300,1725,0.56058,<h1>Digital Deluxe Edition</h1><p><strong>The ...
23860,24469287,Eador. Masters of the Broken World,23.0,232050,140,4,5,9,272,2044,0.667808,"<h1>Eador. Imperium Released!</h1><p><a href=""..."


- Looking at this list, although there are a lot of great games here it feels like there needs to be more granularity in this dataset
- For example, playing a game 23 hours is vastly different than playing a game 119 hours
- Let's see if the model agrees, however.  More granularity may just be more noise to overfit

In [7]:
# Let's find this users favorite games based on the 1-10 rating scale
print(f"Shape:{df[(df['user_id'] == 24469287) & (df['rating_10'] == 10)].shape}")
display(df[(df['user_id'] == 24469287) & (df['rating_10'] == 10)])

Shape:(1, 12)


Unnamed: 0,user_id,app_name,hours_played,appid,user_num_games,game_popularity,rating_5,rating_10,average_playtime,total_ratings,percent_positive_ratings,description
20,24469287,Fallout 4,119.0,377160,140,144,5,10,300,155753,0.708661,"<h1>Review Highlights</h1><p><img src=""https:/..."


- This game did stand out from the list before, and with the more detailed scale it shows as the only 10 rating

In [8]:
# Prepare the dataframes for the surprise package
# Dataframe needs to contain 3 columns: user id, item id, and rating
# For the 1-10 scale
rating_10_df = df.filter(['user_id','appid','rating_10'])
rating_10_df = rating_10_df.sort_values(by=['user_id','appid'])
# And the 1-5 scale
rating_5_df = df.filter(['user_id','appid','rating_5'])
rating_5_df = rating_5_df.sort_values(by=['user_id','appid'])

## Funk SVD ##
- I will try it first with the 1-10 ratings, it seemed like the added detail would be better

In [9]:
# Confirm dataframe is set up properly (user, item, rating)
rating_10_df.head()

Unnamed: 0,user_id,appid,rating_10
1395,5250,440,2
2950,5250,570,1
5928,5250,620,8
10707,5250,630,6
14383,76767,70,3


In [10]:
# initialize the reader with 1-10 rating scale
my_reader = Reader(rating_scale=(0,10))

# load the dataframe with the reader
my_dataset_10 = Dataset.load_from_df(rating_10_df, my_reader)

In [11]:
%%time

# Set the parameter grid for optimization
param_grid = {
    # Number of latent factors. More factors could give better results, but can also lead overfitting
    'n_factors': [50, 100, 150], 
    # Number of epochs. Number of iterations the algorithm will run
    'n_epochs': [10, 20, 50],
    # Learning rate. The speed at which algorithm learns. Larger values give faster learning, but smaller values give more accurate learning.
    'lr_all': [0.005, 0.1],
    'biased': [False] }

# Set GridSearchCV with 5 fold cross-validation using the FunkSVD
GS = GridSearchCV(FunkSVD, param_grid, measures=['rmse','mae','fcp'], cv=5)

# Fit the model to the data
GS.fit(my_dataset_10)

CPU times: user 2min 14s, sys: 1.17 s, total: 2min 16s
Wall time: 2min 19s


In [12]:
# Print the evaluation metrics
print('Root Mean Squared Error (RMSE):',GS.best_score['rmse'])
print('Mean Absolute Error (MAE):',GS.best_score['mae'])
print('Fraction of Concordant Pairs (FCP):',GS.best_score['fcp'])

Root Mean Squared Error (RMSE): 2.8109738497994576
Mean Absolute Error (MAE): 2.2582732917825865
Fraction of Concordant Pairs (FCP): 0.6167397061841069


- Root Mean Squared Error and Mean Absolute Error are measures of error between the predicted ratings and the actual ratings
    - The lower the number the better
- Fraction of Concordant Pairs describe the relationship between the pairs.  A pair is concordant if one user gave a game a high rating and the predicted rating for that game is also high, or vice vera
    - FCP ranges from 0.0 on the low end to 1.0 on the high end
    - High FCP means most of the pairs (predicted ratings and actual ratings) agree in overall sentiment
- This is something to work with.  Humans are finicky, especially with entertainment preferences.  Not great, but not terrible.

Looking at these metrics, the FCP score is preferred, but let's check the best parameters of the two measures

In [13]:
GS.best_params['rmse']

{'n_factors': 50, 'n_epochs': 20, 'lr_all': 0.005, 'biased': False}

In [14]:
GS.best_params['fcp']

{'n_factors': 50, 'n_epochs': 10, 'lr_all': 0.005, 'biased': False}

We will optimize on the fcp from here

- let's compare this fcp score to the 0-5 rating scale data

In [15]:
rating_5_df.head()

Unnamed: 0,user_id,appid,rating_5
1395,5250,440,1
2950,5250,570,1
5928,5250,620,4
10707,5250,630,3
14383,76767,70,2


In [16]:
# Set the reader with accurate rating scale
my_reader = Reader() # default 1-5 rating scale

# Set the dataset
# Remember that the df parameter has to have 3 columns:
# User ids, Item ids (game appid), Ratings
my_dataset_5 = Dataset.load_from_df(rating_5_df, my_reader)

In [17]:
%%time
GS.fit(my_dataset_5)

CPU times: user 2min 14s, sys: 1.03 s, total: 2min 15s
Wall time: 2min 16s


In [18]:
GS.best_score['fcp']

0.625718926810974

In [19]:
GS.best_params['fcp']

{'n_factors': 50, 'n_epochs': 20, 'lr_all': 0.005, 'biased': False}

Using the same gridsearch with the 1-5 rating dataset caused a slight increase in the best FCP score.  Additionally, the best parameters for that score were different for each input.

Let's focus on the 1-5 rating scale as it is more accurate.  This is likely due to an increase in noise for the 1-10 rating, the more simplified model led to more accurate predictions.

In [20]:
%%time

# Expand the parameter grid to really fine tune (this will take longer)
param_grid = {
    # Number of latent factors. More factors could give better results, but can also lead overfitting
    'n_factors': [50, 100, 150, 200, 250], 
    # Number of epochs. Number of iterations the algorithm will run
    'n_epochs': [10, 20, 50, 75, 100],
    # Learning rate. The speed at which algorithm learns. Larger values give faster learning, but smaller values give more accurate learning.
    'lr_all': [0.005, 0.01, 0.1, 1, 10],
    'biased': [False] }

# Set GridSearchCV with 5 fold cross validation (focusing on the 'fcp' measure)
GS2 = GridSearchCV(FunkSVD, param_grid, measures=['fcp'], cv=5)

# Fit the model
GS2.fit(my_dataset_5)

CPU times: user 41min 6s, sys: 27.2 s, total: 41min 33s
Wall time: 42min 23s


In [22]:
# Check the score - previous best was 0.6257
print('Fraction of Concordant Pairs (FCP):',GS2.best_score['fcp'])

Fraction of Concordant Pairs (FCP): 0.6317900321038092


- Nice! An increase to 0.6318.  I'll take it.

In [23]:
# Check best params - previous best was 50 factors, 20 epochs, 0.005 learning rate
GS2.best_params['fcp']

{'n_factors': 200, 'n_epochs': 100, 'lr_all': 0.01, 'biased': False}

- Looks like the hyperparameters indeed did get optimized with the expanded gridsearch.  We will use these settings from here on out

## Making Predictions with Collaborative Filtering ##

Now that we have our hyperparameters optimized, let's build our collaborative recommender system.

We will use a train_test_split to confirm the accuracy of the predictions and look closer at the performance by extracting the predictions into a dataframe to explore

In [26]:
# Split the data into the train set and the test set, reserving 25% of the data for the testset
trainset, testset = train_test_split(my_dataset_5, test_size=0.25)

In [27]:
# Initialize the FunkSVD
my_svd = FunkSVD(n_factors=200, 
                 n_epochs=100, 
                 lr_all=0.01, 
                 biased=False,
                 verbose=0)
# Fit train set
my_svd.fit(trainset)

# Test the predictions on the testset
my_pred = my_svd.test(testset)

In [28]:
# Put my_pred result in a dataframe
df_prediction = pd.DataFrame(my_pred, columns=['user_id',
                                                     'appid',
                                                     'actual_rating',
                                                     'predicted_rating',
                                                     'details'])

# Calculate the difference of actual and prediction into diff column
df_prediction['diff'] = abs(df_prediction['predicted_rating'] - 
                            df_prediction['actual_rating'])

In [29]:
df_prediction.head()

Unnamed: 0,user_id,appid,actual_rating,predicted_rating,details,diff
0,248254297,570,2.0,2.370514,{'was_impossible': False},0.370514
1,38731746,231430,4.0,1.426768,{'was_impossible': False},2.573232
2,59157389,107410,5.0,1.0,{'was_impossible': False},4.0
3,185258131,220240,5.0,3.329104,{'was_impossible': False},1.670896
4,216785107,355840,4.0,1.490933,{'was_impossible': False},2.509067


In [30]:
df_prediction[df_prediction['diff'] <= 1]['user_id'].count() / df_prediction['user_id'].count()

0.5088056356067884

- 50.9% of the dataset the predicted rating is within 1 point of the acutal rating.  Not bad.

Now let's continue with the full trainset and full testset

In [31]:
%%time
# Build full trainset
full_trainset = my_dataset_5.build_full_trainset()

# Re-Fit the FunkSVD from before with the full_trainset
my_svd.fit(full_trainset)

CPU times: user 10.8 s, sys: 40.7 ms, total: 10.9 s
Wall time: 10.9 s


<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7ff2ee85caf0>

In [34]:
# Define the full test set, fill the empty ratings with -1 where the user has not played the game
full_testset = full_trainset.build_anti_testset(fill=-1)

# Make the predictions on the full_testset
my_prediction = my_svd.test(full_testset)

In [35]:
# Put my_pred result in a dataframe like before
df_prediction = pd.DataFrame(my_pred, columns=['user_id',
                                                     'appid',
                                                     'actual_rating',
                                                     'predicted_rating',
                                                     'details'])

In [36]:
# Check our favorite user id `24469287` for the top predictions
newdf = df_prediction[df_prediction['user_id'] == 24469287].sort_values(
    by=['predicted_rating'], ascending=False).head()

# Merge the dataframes
merge_df = newdf.merge(games_df, how='left', 
                    left_on=['appid'], right_on=['appid'])

# Show the top recommendations
merge_df

Unnamed: 0,user_id,appid,actual_rating,predicted_rating,details,name,average_playtime,total_ratings,percent_positive_ratings,description
0,24469287,206420,5.0,4.313017,{'was_impossible': False},Saints Row IV,300,59036,0.916238,The US President must save the Earth from alie...
1,24469287,291650,3.0,3.649052,{'was_impossible': False},Pillars of Eternity,300,13393,0.862988,Prepare to be enchanted by a world where the c...
2,24469287,8870,5.0,3.488895,{'was_impossible': False},BioShock Infinite,300,83288,0.953823,"Indebted to the wrong people, with his life on..."
3,24469287,239820,2.0,3.244829,{'was_impossible': False},Game Dev Tycoon,300,28809,0.950154,"<h1>Just Updated</h1><p><img src=""https://%CDN..."
4,24469287,237990,4.0,3.139739,{'was_impossible': False},The Banner Saga,164,10881,0.896057,Live through an epic role-playing Viking saga ...


You could take this collaborative modeling even further with a hybrid model as mentioned at the end of the content modeling notebook.  A hybrid model could function by finding the top predictions based on similar profiles, and then also looking for additional games that are similar to those based on their content to expand the recommendations.

Additionally, with much more detailed user data you could perform extensive analysis to better define a similar user profile.  For example, someone who only plays on the weekends may prefer different games than someone who regularly plays after work every day, even if the total hours played appears to be similar.  Of course, demographic data would be valuable as well.  

A particularly interesting analysis I would like to perform would be to cross-reference a users Netflix profile as you can learn a lot about a users entertainment preferences and cross-over between the two medium.

# MyGamePass #
## Ben Polzin ##