In this workbook I cover all aspects of analysing our Twitch data using the Surprise package. More information about Surprise can be found here: https://surprise.readthedocs.io/en/stable/index.html

We start by taking the grid from our EDA Surprise Funnel Workbook and preparing it for Surprise. The package requires three fields: User, Item, and Rating. Usually, a rating is an evaluation given by the user herself (I can rate a movie I watched or a book I read 4 out of 5 stars), but in our case the rating is a custom success metric that measures how successful a certain game or genre of content has been for a particular streamer. Each streamer has one 5-rated game, and any other games are scored as a fraction of that core content. This way we avoid the imbalance caused by the few streamers with millions of views, who would otherwise skew the results heavily.

In [1]:
import pandas as pd
import numpy as np

In [2]:
grid = pd.read_csv('final_game_user_grid.csv')

In [3]:
grid.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31751 entries, 0 to 31750
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   user_name     31751 non-null  object
 1   game_name     31751 non-null  object
 2   game_genres   31634 non-null  object
 3   language      31751 non-null  object
 4   started_at    31751 non-null  object
 5   viewer_count  31751 non-null  int64 
 6   max           31751 non-null  int64 
dtypes: int64(2), object(5)
memory usage: 1.7+ MB


In [4]:
grid = grid.dropna()

In [5]:
grid = grid.dropna(how='any',axis=0) 

In [6]:
# Work on an explicit copy and assign through .loc to avoid chained assignment
x = grid.copy()
# Placeholder or missing genres are folded into a single 'Other' bucket
is_placeholder = x['game_genres'].str.contains('NONE|RETROGAMEPLACEHOLDER', na=True)
x.loc[is_placeholder, 'game_genres'] = 'Other'



In [7]:
grid = x

To begin, we're calculating the metric we'll be using to compare users and games. Since we're focused on the streamers, we don't have the traditional "I like this movie so I will rate it a 4.5 out of 5.0" ratings. Instead we're calculating how many viewers each streamer attracted with a particular game compared to their max viewer potential over the last week. For each user, one game is their ultimate streaming "5 out of 5" benchmark, and all other games they play are compared to that one, and normalized to ratings between 1 and 5. We are also using all the genres for each game to pinpoint how successful various genres and games have been for each streamer.

In [565]:
max_value_username = pd.DataFrame(grid.groupby('user_name')['max'].max().reset_index())

In [566]:
max_val_dict = max_value_username.groupby('user_name')['max'].apply(list).to_dict()

In [567]:
grid['max_game'] = grid['user_name'].map(max_val_dict)

In [568]:
grid['max_game_int'] = grid.max_game.str[0].astype(int)
grid = grid.drop('max_game', axis = 1)

In [569]:
grid['score'] = grid['max']/grid['max_game_int']

In [570]:
from sklearn.preprocessing import minmax_scale
grid['scaled_score'] = minmax_scale(grid['score'], feature_range=(1, 5))

In [571]:
grid = grid.dropna()

Now we have a listing for each user pairing them with each game they play, what genre it belongs to, and how many people watched them play each game compared to the max viewers they ever got for a stream during the week we examined.
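
The scoring steps above can be traced by hand on a tiny example. This is a minimal sketch in plain Python, assuming made-up streamers and viewer counts (`rows` is hypothetical, not taken from the grid); it mirrors the three steps of the cells above: per-streamer best, fraction of that best, then a min-max rescale of the whole column onto 1 to 5.

```python
# Hypothetical sample rows: (user_name, game_name, peak viewers for that game)
rows = [
    ("alice", "Minecraft", 200),
    ("alice", "Rocket League", 50),
    ("bob", "Fortnite", 40000),
    ("bob", "Minecraft", 20000),
]

# Step 1: each streamer's best peak across all of their games
best = {}
for user, _, viewers in rows:
    best[user] = max(best.get(user, 0), viewers)

# Step 2: score each game as a fraction of the streamer's own best
scores = [viewers / best[user] for user, _, viewers in rows]

# Step 3: min-max scale all scores onto the 1-5 rating range
# (assumes the scores are not all identical, otherwise hi - lo is zero)
lo, hi = min(scores), max(scores)
scaled = [1 + 4 * (s - lo) / (hi - lo) for s in scores]

for (user, game, _), rating in zip(rows, scaled):
    print(f"{user:6s} {game:14s} {rating:.2f}")
```

Note that the small streamer and the large streamer both end up with a 5 for their best game, which is the point of normalizing by each streamer's own maximum.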

In [572]:
grid.groupby('user_name')['scaled_score'].count().reset_index().sort_values('scaled_score', ascending=False)[:5]

Unnamed: 0,user_name,scaled_score
27473,Fonbet_RocketLeague,378
77300,StreamerHouse,375
123375,luke4316live,333
110114,gaules,315
27472,Fonbet_RLH,270


In [573]:
grid.groupby('game_genres')['scaled_score'].count().reset_index().sort_values('scaled_score', ascending=False)[:5]

Unnamed: 0,game_genres,scaled_score
0,Action,211633
28,Shooter,165931
7,FPS,131286
23,RPG,118595
15,MMORPG,88901


In [574]:
grid.groupby('game_name')['scaled_score'].count().reset_index().sort_values('scaled_score', ascending=False)[:5]

Unnamed: 0,game_name,scaled_score
128,Grand Theft Auto V,37860
180,Minecraft,35715
243,Rocket League,35043
37,Black Desert Online,33723
97,Escape From Tarkov,33366


In [575]:
min_number_scores = 5
filter_users = grid['user_name'].value_counts() > min_number_scores
filter_users = filter_users[filter_users].index.tolist()

In [576]:
grid_new = grid[(grid['user_name'].isin(filter_users))]
print('The original data frame shape:\t{}'.format(grid.shape))
print('The new data frame shape:\t{}'.format(grid_new.shape))

The original data frame shape:	(1364027, 10)
The new data frame shape:	(1137673, 10)


After reshaping the grid, we are going to extract the recommendations for game genre, game titles, and games similar to those already rated by the streamers as a three-pronged recommender approach.

In [4]:
import pickle

In [578]:
# Context managers close the file handles once the dump/load is done
with open("./Data/final_grid_06_07_19.pkl", "wb") as f:
    pickle.dump(grid_new, f)

In [579]:
with open("./Data/final_grid_06_07_19.pkl", "rb") as f:
    grid_new = pickle.load(f)

Preparing for genre recommendations based on viewership scores:

In [580]:
# Average score per (streamer, genre) and per (streamer, game); the result column is named 'mean'
genres_base_df = grid_new.groupby(by = ['user_name', 'game_genres'])['scaled_score'].agg(['mean'])
games_base_df = grid_new.groupby(by = ['user_name', 'game_name'])['scaled_score'].agg(['mean'])

In [581]:
genres_base_df.head(5)

user_name,game_genres,mean
00NothingLabs,Fighting,2.421053
00NothingLabs,Open World,1.578947
00NothingLabs,RPG,1.578947
00NothingLabs,Shooter,1.578947
00elu00,Action,4.0


In [582]:
games_base_df.head(5)

user_name,game_name,mean
00NothingLabs,Mortal Kombat 11,2.421053
00NothingLabs,Tom Clancy's The Division 2,1.578947
00elu00,Dead by Daylight,4.666667
00elu00,Deathgarden,2.0
01joga,PUBG MOBILE,3.75


In [583]:
genres_base_df.columns = genres_base_df.columns.map(''.join)
games_base_df.columns = games_base_df.columns.map(''.join)

In [584]:
genres_base_df = genres_base_df.reset_index()
games_base_df = games_base_df.reset_index()

In [585]:
genres_base_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 181215 entries, 0 to 181214
Data columns (total 3 columns):
user_name      181215 non-null object
game_genres    181215 non-null object
mean           181215 non-null float64
dtypes: float64(1), object(2)
memory usage: 4.1+ MB


In [586]:
games_base_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 93606 entries, 0 to 93605
Data columns (total 3 columns):
user_name    93606 non-null object
game_name    93606 non-null object
mean         93606 non-null float64
dtypes: float64(1), object(2)
memory usage: 2.1+ MB


In [587]:
genres_base_df.head(5)

Unnamed: 0,user_name,game_genres,mean
0,00NothingLabs,Fighting,2.421053
1,00NothingLabs,Open World,1.578947
2,00NothingLabs,RPG,1.578947
3,00NothingLabs,Shooter,1.578947
4,00elu00,Action,4.0


In [588]:
games_base_df.head(5)

Unnamed: 0,user_name,game_name,mean
0,00NothingLabs,Mortal Kombat 11,2.421053
1,00NothingLabs,Tom Clancy's The Division 2,1.578947
2,00elu00,Dead by Daylight,4.666667
3,00elu00,Deathgarden,2.0
4,01joga,PUBG MOBILE,3.75


Using Surprise to predict genres/games for a streamer based on their existing game and genre ratings

In [2]:
import surprise
from surprise import Dataset, accuracy, Reader, NMF, NormalPredictor, BaselineOnly, CoClustering, SlopeOne, SVD, KNNBaseline
from surprise.model_selection import GridSearchCV, cross_validate, train_test_split

In [590]:
reader = Reader(rating_scale=(1, 5))
genre_base_data = Dataset.load_from_df(genres_base_df[['user_name', 'game_genres', 'mean']], reader)

In [591]:
game_base_data = Dataset.load_from_df(games_base_df[['user_name', 'game_name', 'mean']], reader)

Now it is time to select the algorithms we will use to generate recommendations for our streamers based on their existing behavior and also on how similar the games they already stream are to those they do not yet stream. 

The first step is to evaluate various algorithms available in Surprise to see which produce the lowest RMSE and are computationally friendly so we can use them in a dynamic app and recommend things on the fly. We perform cross validation on all fast algorithms to see which performs best with either of our datasets, genres and games.
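
Since RMSE is the yardstick for the comparison, here is a minimal sketch of how it is computed, assuming a handful of made-up (true, estimated) rating pairs. The `rmse` helper and the `held_out` data are hypothetical; Surprise computes the same quantity internally on each fold.

```python
import math

def rmse(pairs):
    """Root-mean-square error over (true_rating, estimated_rating) pairs."""
    squared_errors = [(true - est) ** 2 for true, est in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical held-out ratings on the 1-5 scale and a model's estimates
held_out = [(5.0, 4.5), (1.0, 2.0), (3.0, 3.0), (2.0, 1.5)]
print(round(rmse(held_out), 4))
```

Because the errors are squared before averaging, one large miss weighs more than several small ones, so a low RMSE on the 1-to-5 scale means the model is rarely far off.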

In [593]:
benchmark = []
# Iterate over all algorithms
for algorithm in [SlopeOne(), NormalPredictor(), BaselineOnly(), NMF(), CoClustering(), SVD()]:

    # Perform cross validation
    results = cross_validate(algorithm, genre_base_data, measures=['RMSE'], return_train_measures=True, cv=5, verbose=True)

    # Average each metric over the folds and label the row with the algorithm's class name
    # (Series.append was removed in pandas 2.0, so the label is assigned directly)
    tmp = pd.DataFrame.from_dict(results).mean(axis=0)
    tmp['Algorithm'] = type(algorithm).__name__
    benchmark.append(tmp)

pd.DataFrame(benchmark).set_index('Algorithm').sort_values('test_rmse')

Evaluating RMSE of algorithm SlopeOne on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.5857  0.5934  0.5716  0.5848  0.5865  0.5844  0.0071  
RMSE (trainset)   0.3755  0.3733  0.3784  0.3757  0.3750  0.3756  0.0016  
Fit time          0.54    0.60    0.61    0.58    0.60    0.59    0.03    
Test time         0.31    0.29    0.29    0.30    0.29    0.30    0.01    
Evaluating RMSE of algorithm NormalPredictor on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.1046  1.1042  1.1017  1.1126  1.1032  1.1053  0.0038  
RMSE (trainset)   1.1061  1.1043  1.1074  1.1044  1.1104  1.1065  0.0023  
Fit time          0.22    0.21    0.23    0.23    0.22    0.22    0.01    
Test time         0.19    0.19    0.18    0.18    0.18    0.18    0.01    
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating 

Algorithm,test_rmse,train_rmse,fit_time,test_time
SlopeOne,0.584388,0.375585,0.58622,0.29737
NMF,0.663043,0.148451,12.621102,0.205193
SVD,0.683948,0.548392,8.569584,0.241699
BaselineOnly,0.736155,0.695059,0.481939,0.170293
CoClustering,0.75996,0.623429,8.188446,0.189113
NormalPredictor,1.105266,1.106512,0.221594,0.181772


For genre data, SlopeOne performs best, with the lowest test RMSE (0.584). The Slope One family of item-based, rating-based collaborative filtering algorithms was proposed to reduce overfitting, improve performance, and ease implementation. Instead of fitting a linear regression from one item's ratings to another's, $f(x) = ax + b$, it uses a simpler form with a single free parameter, $f(x) = x + b$. The free parameter is simply the average difference between the two items' ratings. It has been shown to be much more accurate than linear regression in some instances, and it takes half the storage or less.
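
To make the $f(x) = x + b$ idea concrete, here is a minimal sketch of the basic (unweighted) Slope One scheme in plain Python. The `ratings` dictionary and both helper functions are hypothetical illustrations; Surprise's own `SlopeOne` may differ in detail.

```python
# Hypothetical ratings on a 1-5 scale: user -> {genre: rating}
ratings = {
    "ann": {"FPS": 5.0, "RPG": 3.0},
    "ben": {"FPS": 4.0, "RPG": 3.0, "MOBA": 2.0},
    "cat": {"RPG": 4.0, "MOBA": 3.0},
}

def average_deviation(target, other):
    """Mean of (rating[target] - rating[other]) over users who rated both, or None."""
    diffs = [r[target] - r[other] for r in ratings.values() if target in r and other in r]
    return sum(diffs) / len(diffs) if diffs else None

def slope_one_predict(user, target):
    """Average of f(x) = x + b over the user's rated items, where b is the average deviation."""
    estimates = []
    for other, rating in ratings[user].items():
        b = average_deviation(target, other)
        if other != target and b is not None:
            estimates.append(rating + b)
    return sum(estimates) / len(estimates) if estimates else None

print(slope_one_predict("ann", "MOBA"))
```

For ann, MOBA sits 2 points below FPS and 1 point below RPG on average among co-raters, giving estimates of 3 and 2, so the prediction is their mean, 2.5.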

In [594]:
benchmark = []
# Iterate over all algorithms
for algorithm in [SlopeOne(), NormalPredictor(), BaselineOnly(), NMF(), CoClustering(), SVD()]:

    # Perform cross validation
    results = cross_validate(algorithm, game_base_data, measures=['RMSE'], return_train_measures=True, cv=5, verbose=True)

    # Average each metric over the folds and label the row with the algorithm's class name
    # (Series.append was removed in pandas 2.0, so the label is assigned directly)
    tmp = pd.DataFrame.from_dict(results).mean(axis=0)
    tmp['Algorithm'] = type(algorithm).__name__
    benchmark.append(tmp)

pd.DataFrame(benchmark).set_index('Algorithm').sort_values('test_rmse')

Evaluating RMSE of algorithm SlopeOne on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9616  0.9684  0.9598  0.9618  0.9601  0.9624  0.0031  
RMSE (trainset)   0.3379  0.3353  0.3368  0.3346  0.3354  0.3360  0.0012  
Fit time          0.57    0.53    0.51    0.51    0.48    0.52    0.03    
Test time         0.15    0.13    0.15    0.13    0.13    0.14    0.01    
Evaluating RMSE of algorithm NormalPredictor on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.1551  1.1522  1.1555  1.1631  1.1635  1.1579  0.0046  
RMSE (trainset)   1.1556  1.1541  1.1521  1.1519  1.1542  1.1536  0.0014  
Fit time          0.09    0.11    0.11    0.11    0.11    0.10    0.01    
Test time         0.12    0.10    0.10    0.10    0.10    0.10    0.01    
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating 

Algorithm,test_rmse,train_rmse,fit_time,test_time
BaselineOnly,0.798883,0.748651,0.28055,0.098541
SVD,0.80349,0.598826,4.237655,0.108464
SlopeOne,0.962354,0.335998,0.52038,0.13793
NMF,0.98123,0.10169,7.611222,0.107752
CoClustering,1.030427,0.676783,5.911154,0.096219
NormalPredictor,1.157878,1.153591,0.103438,0.104182


For the game data, we are going to use the BaselineOnly algorithm which produces the lowest test RMSE.

Based on the results of the algorithm selector, we will proceed with SlopeOne for genre predictions and with the Baseline algorithm for game predictions. More information about SlopeOne can be found here: https://arxiv.org/abs/cs/0702144
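
For intuition about what the baseline model fits, here is a minimal sketch of a baseline estimate of the form global mean + streamer bias + game bias, with the biases found by alternating least squares. It follows the ALS update as I understand it from the Surprise documentation; the `fit_baselines` function and the `ratings` triples are hypothetical, and the regularization values are illustrative.

```python
from collections import defaultdict

# Hypothetical (user, game, rating) triples on the 1-5 scale
ratings = [
    ("ann", "Minecraft", 5.0), ("ann", "Fortnite", 3.0),
    ("ben", "Minecraft", 4.0), ("ben", "Fortnite", 2.0),
    ("cat", "Minecraft", 5.0),
]

def fit_baselines(ratings, n_epochs=10, reg_u=12, reg_i=5):
    """Alternate between item and user biases; reg_* shrink biases backed by few ratings."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    bu, bi = defaultdict(float), defaultdict(float)
    by_user, by_item = defaultdict(list), defaultdict(list)
    for u, i, r in ratings:
        by_user[u].append((i, r))
        by_item[i].append((u, r))
    for _ in range(n_epochs):
        for i, rated in by_item.items():
            bi[i] = sum(r - mu - bu[u] for u, r in rated) / (reg_i + len(rated))
        for u, rated in by_user.items():
            bu[u] = sum(r - mu - bi[i] for i, r in rated) / (reg_u + len(rated))
    return mu, bu, bi

mu, bu, bi = fit_baselines(ratings)
# A streamer the model has never seen has no user bias, so only the game bias moves the estimate
print(round(mu + bi["Minecraft"], 3), round(mu + bi["Fortnite"], 3))
```

One consequence worth keeping in mind: for a streamer with no history, a model of this shape ranks games purely by their game bias, so every new streamer would see the same ordering.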

Next we perform cross-validation to see how our chosen algorithms perform:

In [595]:
algo_genre_base = SlopeOne()
cross_validate(algo_genre_base, genre_base_data, measures=['RMSE'], cv=7, verbose=True)

Evaluating RMSE of algorithm SlopeOne on 7 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Fold 6  Fold 7  Mean    Std     
RMSE (testset)    0.5703  0.5647  0.5678  0.5765  0.5766  0.5738  0.5766  0.5723  0.0045  
Fit time          0.55    0.59    0.61    0.65    0.72    0.63    0.63    0.63    0.05    
Test time         0.23    0.22    0.23    0.23    0.24    0.23    0.24    0.23    0.01    


{'test_rmse': array([0.57027159, 0.56471004, 0.56780976, 0.5765284 , 0.5765854 ,
        0.57377448, 0.57662914]),
 'fit_time': (0.5517978668212891,
  0.5928771495819092,
  0.6136910915374756,
  0.6544570922851562,
  0.7177228927612305,
  0.6294460296630859,
  0.6278128623962402),
 'test_time': (0.22820138931274414,
  0.22416305541992188,
  0.22986793518066406,
  0.228363037109375,
  0.2439868450164795,
  0.23336386680603027,
  0.23992705345153809)}

In [596]:
bsl_options = {'method': 'als',
               'n_epochs': 10,
               'reg_u': 12,
               'reg_i': 5
               }
algo_games_base = BaselineOnly(bsl_options=bsl_options)
cross_validate(algo_games_base, game_base_data, measures=['RMSE'], cv=7, verbose=True)

Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Evaluating RMSE of algorithm BaselineOnly on 7 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Fold 6  Fold 7  Mean    Std     
RMSE (testset)    0.8005  0.8022  0.7937  0.7947  0.7949  0.7968  0.7998  0.7975  0.0030  
Fit time          0.35    0.31    0.37    0.31    0.33    0.34    0.34    0.34    0.02    
Test time         0.07    0.10    0.08    0.07    0.07    0.08    0.07    0.08    0.01    


{'test_rmse': array([0.80050161, 0.80216882, 0.79373707, 0.79473671, 0.79493957,
        0.79677756, 0.79979135]),
 'fit_time': (0.34818196296691895,
  0.3067960739135742,
  0.3692131042480469,
  0.3146500587463379,
  0.3276832103729248,
  0.3410341739654541,
  0.3416450023651123),
 'test_time': (0.0689859390258789,
  0.0962369441986084,
  0.08499288558959961,
  0.07309079170227051,
  0.06678414344787598,
  0.07574701309204102,
  0.07406401634216309)}

Now we split the data into train and test sets and produce predictions for the genres and games streamers would like. RMSE is low for both algorithms on the 1-to-5 scale (about 0.57 for genres and 0.80 for games in the 7-fold runs above). The lower genre error makes sense: each genre covers many games, so there are more observations per genre than per individual game.

In [597]:
genre_trainset, genre_testset = train_test_split(genre_base_data, test_size=0.25)
genre_base_predictions = algo_genre_base.fit(genre_trainset).test(genre_testset)
accuracy.rmse(genre_base_predictions)

RMSE: 0.5997


0.5996935050233947

In [598]:
game_trainset, game_testset = train_test_split(game_base_data, test_size=0.25)
game_base_predictions = algo_games_base.fit(game_trainset).test(game_testset)
accuracy.rmse(game_base_predictions)

Estimating biases using als...
RMSE: 0.7948


0.794770382172477

Saving the models and predictions to use in the predictor app:

In [5]:
#pickle.dump(algo_genre_base, open("./Data/SlopeOne_genre_model.pkl", "wb" ) )
algo_genre_base = pickle.load( open( "./Data/SlopeOne_genre_model.pkl", "rb" ) )

In [6]:
#pickle.dump(algo_games_base, open("./Data/BaselineOnly_game_model.pkl", "wb" ) )
algo_games_base = pickle.load( open( "./Data/BaselineOnly_game_model.pkl", "rb" ) )

In [7]:
#pickle.dump(genre_base_predictions, open("./Data/SlopeOne_genre_model_predictions.pkl", "wb" ) )
genre_base_predictions = pickle.load( open( "./Data/SlopeOne_genre_model_predictions.pkl", "rb" ) )

In [8]:
#pickle.dump(game_base_predictions, open("./Data/BaselineOnly_game_model_predictions.pkl", "wb" ) )
game_base_predictions = pickle.load( open( "./Data/BaselineOnly_game_model_predictions.pkl", "rb" ) )

In [9]:
#pickle.dump(genres_base_df, open("./Data/genres.pkl", "wb" ) )
genres_base_df = pickle.load( open( "./Data/genres.pkl", "rb" ) )

In [10]:
#pickle.dump(games_base_df, open("./Data/games.pkl", "wb" ) )
games_base_df = pickle.load( open( "./Data/games.pkl", "rb" ) )
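
The dump/load pairs above can be wrapped in two small helpers that close their file handles. This is a minimal sketch using only the standard library; `save_pickle`, `load_pickle`, and the stand-in dictionary are hypothetical names for illustration, not part of the notebook.

```python
import os
import pickle
import tempfile

def save_pickle(obj, path):
    """Write obj to path; the with-block closes the file even if pickling fails."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)

def load_pickle(path):
    """Read back an object written by save_pickle."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Round-trip a stand-in object through a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_model.pkl")
    save_pickle({"algo": "SlopeOne", "rmse": 0.5997}, path)
    restored = load_pickle(path)

print(restored)
```

The same two calls would work for the fitted algorithms, the prediction lists, and the data frames, since all of them are ordinary picklable Python objects.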

Taking the inputs from the user:

In [100]:
streamer_name = input('What is your streamer name? ')
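# Split on commas and strip whitespace so 'FPS,Sport' and 'FPS, Sport' parse the same way
streamer_genres = [g.strip() for g in input('Which game genres do you currently stream? ').split(',')]
streamer_games = [g.strip() for g in input('Which games do you currently stream? ').split(',')]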

What is your streamer name? jfortson78
Which game genres do you currently stream? FPS, Sport
Which games do you currently stream? Fortnite, Madden NFL 19


In [101]:
streamer_name, streamer_genres, streamer_games

('jfortson78', ['FPS', 'Sport'], ['Fortnite', 'Madden NFL 19'])

Making a list of streamers' current genres and games by combining any information we already have in our dataset and their own inputs into the app:

In [72]:
genres_base_df.head()

Unnamed: 0,user_name,game_genres,mean
0,00NothingLabs,Fighting,2.421053
1,00NothingLabs,Open World,1.578947
2,00NothingLabs,RPG,1.578947
3,00NothingLabs,Shooter,1.578947
4,00elu00,Action,4.0


In [102]:
def display_current_genres(streamer_name):
    user_genres = list(genres_base_df[genres_base_df['user_name']==streamer_name]['game_genres'])
    return user_genres
recorder_genres_list = display_current_genres(streamer_name)
full_genres = set(recorder_genres_list + streamer_genres)
full_genres = list(full_genres)
full_genres

['FPS', 'Sport']

In [103]:
def display_current_games(streamer_name):
    user_games = list(games_base_df[games_base_df['user_name']==streamer_name]['game_name'])
    return user_games
recorder_games_list = display_current_games(streamer_name)
full_games = set(recorder_games_list + streamer_games)
full_games = list(full_games)
full_games

['Madden NFL 19', 'Fortnite']
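
Going through `set` removes duplicates but does not keep the order in which items were entered, as the output above shows. If a stable order matters for display in the app, one option is the sketch below; `merge_preserving_order` and `parse_list` are hypothetical helpers, not part of the notebook.

```python
def merge_preserving_order(recorded, typed):
    """Combine two lists, dropping duplicates but keeping first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(recorded + typed))

def parse_list(raw):
    """Split a comma-separated answer, tolerating stray spaces and empty entries."""
    return [part.strip() for part in raw.split(",") if part.strip()]

typed_games = parse_list("Fortnite,Madden NFL 19 , Fortnite")
print(merge_preserving_order(["Minecraft"], typed_games))
```

Games already recorded in the dataset come first, followed by anything new the streamer typed, with repeats dropped.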

####  Predicting Genres for Streamers (user-based similarities) ####

In [75]:
from collections import defaultdict

In [76]:
def get_top_n(predictions, n=10):
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Sort each user's predictions by estimated rating and keep the n highest.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n

Collecting the raw item ids (in Surprise nomenclature) for genres and games. After removing what the streamer already streams, the result is every genre and game the user has not had exposure to:

In [77]:
iids_genre = genres_base_df['game_genres'].unique()
iids_genre_to_predict = np.setdiff1d(iids_genre, full_genres, assume_unique = True)

In [78]:
iids_game = games_base_df['game_name'].unique()
iids_game_to_predict = np.setdiff1d(iids_game, full_games, assume_unique = True)

In [79]:
iids_genre_to_predict

array(['Fighting', 'Open World', 'RPG', 'Action', 'Horror', 'Other',
       'IRL', 'MMORPG', 'MOBA', 'Simulation', 'Adventure Game',
       'Driving/Racing Game', 'Strategy', 'Indie Game', 'Roguelike',
       'Compilation', 'Puzzle', 'Stealth', 'Sports Game', 'Creative',
       'Series: Souls', 'Platformer', 'RTS', 'Card & Board Game',
       'Rhythm & Music Game', 'Gambling Game', 'Survival', 'Metroidvania',
       'Educational Game', 'Point and Click', 'Flight Simulator',
       'Hidden Objects', 'Visual Novel'], dtype=object)

In [80]:
iids_game_to_predict

array(['Mortal Kombat 11', "Tom Clancy's The Division 2",
       'Dead by Daylight', 'Deathgarden', 'PUBG MOBILE', 'Dark Souls',
       'Retro', 'Dota 2', 'Path of Exile', 'Talk Shows & Podcasts',
       'Call of Duty: Black Ops 4', 'World of Tanks', 'Arma 3',
       'Outer Wilds', 'Apex Legends', 'My Summer Car',
       'Escape From Tarkov', 'Diablo III: Reaper of Souls',
       "Don't Starve Together", 'Games + Demos', 'Days Gone',
       "PLAYERUNKNOWN'S BATTLEGROUNDS", 'Smite', 'Black Desert Online',
       'FINAL FANTASY XIV Online', 'DayZ', 'Total War: Three Kingdoms',
       'Void Bastards', 'The Elder Scrolls Online', 'Battlefield V',
       'Mordhau', 'Battalion 1944', 'Rocket League',
       'Sekiro: Shadows Die Twice', 'World of Warships',
       "Conqueror's Blade", 'Monster Hunter World', 'Art',
       'Just Chatting', 'Realm Royale', 'Sea of Thieves',
       'Warhammer: Chaosbane', 'Dark Souls III', 'Super Mario Maker',
       'Science & Technology', 'Red Dead Redemption 

Making a personal testset for the user. The "true" rating is filled with a placeholder of 0, since the actual ratings are not known and we are trying to predict the expected rating:

In [81]:
genre_testset_personal = [[streamer_name, iid, 0.] for iid in iids_genre_to_predict]
game_testset_personal = [[streamer_name, iid, 0.] for iid in iids_game_to_predict]

Producing a list of predictions based on the inputs:

In [82]:
personal_genre_predictions = algo_genre_base.test(genre_testset_personal)
personal_game_predictions = algo_games_base.test(game_testset_personal)

In [83]:
personal_genre_df = pd.DataFrame(personal_genre_predictions, columns=['uid', 'iid', 'rui', 'est', 'details'])    

In [84]:
personal_genre_pred = personal_genre_df[['iid', 'est']]

In [85]:
personal_game_df = pd.DataFrame(personal_game_predictions, columns=['uid', 'iid', 'rui', 'est', 'details'])    

In [86]:
personal_game_pred = personal_game_df[['iid', 'est']]

In [87]:
top_n_genres = get_top_n(personal_genre_predictions)
top_n_games = get_top_n(personal_game_predictions)

In [88]:
top_n_genres

defaultdict(list,
            {'jfortson78': [('Fighting', 3.566848473427548),
              ('Open World', 3.566848473427548),
              ('RPG', 3.566848473427548),
              ('Action', 3.566848473427548),
              ('Horror', 3.566848473427548),
              ('Other', 3.566848473427548),
              ('IRL', 3.566848473427548),
              ('MMORPG', 3.566848473427548),
              ('MOBA', 3.566848473427548),
              ('Simulation', 3.566848473427548)]})
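
Every genre above receives the same estimate, 3.5668, so the top-10 order carries no ranking information. One plausible explanation, which I have not verified against this model, is that the streamer does not appear in the training set, in which case Surprise falls back to a default estimate and, as far as I know, flags the prediction through `details['was_impossible']`. The sketch below shows how such cases could be detected; the `Prediction` tuple is a stand-in that assumes the same field names as Surprise's, and the sample values are hypothetical.

```python
from collections import namedtuple

# Stand-in with the same field names as Surprise's Prediction tuple (assumed here)
Prediction = namedtuple("Prediction", ["uid", "iid", "r_ui", "est", "details"])

# Hypothetical predictions: two fell back to a default, one was actually computed
predictions = [
    Prediction("jfortson78", "Fighting", 0.0, 3.5668, {"was_impossible": True}),
    Prediction("jfortson78", "RPG", 0.0, 3.5668, {"was_impossible": True}),
    Prediction("jfortson78", "MOBA", 0.0, 4.1, {"was_impossible": False}),
]

def usable(predictions):
    """Keep only predictions the model could actually compute."""
    return [p for p in predictions if not p.details.get("was_impossible", False)]

def all_tied(predictions):
    """True when every estimate is identical, i.e. the ranking carries no information."""
    return len({p.est for p in predictions}) <= 1

print([p.iid for p in usable(predictions)])
print(all_tied(predictions[:2]))
```

A check like `all_tied` before showing recommendations would let the app fall back to the item-based similarity approach when the user-based estimates are uninformative.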

In [89]:
top_n_games

defaultdict(list,
            {'jfortson78': [('The Jackbox Party Pack 3', 4.062363081116334),
              ('Call of Duty: Black Ops II', 3.9908953498432735),
              ('Pokémon Ultra Sun/Ultra Moon', 3.9754841902087326),
              ('Visage', 3.9680701013431072),
              ('Minecraft', 3.915990585923289),
              ('The Elder Scrolls IV: Oblivion', 3.9144011379310846),
              ('The Jackbox Party Pack 4', 3.9086796813716798),
              ('Silent Hill', 3.8990685919404933),
              ('Satisfactory', 3.8925097559050457),
              ('Rocket League', 3.8904878288840434)]})

In [90]:
for uid, user_ratings in top_n_genres.items():
    print('For ' + uid + ', the recommended genres are:'+ str([iid for (iid, _) in user_ratings]))
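# Look the streamer up directly instead of relying on the loop variable leaking out of the loop
genre_user_based_list = [iid for (iid, _) in top_n_genres[streamer_name]]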

For jfortson78, the recommended genres are:['Fighting', 'Open World', 'RPG', 'Action', 'Horror', 'Other', 'IRL', 'MMORPG', 'MOBA', 'Simulation']


In [91]:
for uid, user_ratings in top_n_games.items():
    print('For ' + uid + ', the recommended games are:'+ str([iid for (iid, _) in user_ratings]))
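# Look the streamer up directly instead of relying on the loop variable leaking out of the loop
game_user_based_list = [iid for (iid, _) in top_n_games[streamer_name]]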

For jfortson78, the recommended games are:['The Jackbox Party Pack 3', 'Call of Duty: Black Ops II', 'Pokémon Ultra Sun/Ultra Moon', 'Visage', 'Minecraft', 'The Elder Scrolls IV: Oblivion', 'The Jackbox Party Pack 4', 'Silent Hill', 'Satisfactory', 'Rocket League']


#### Predicting Similar Genres Based on Current Genre/Game (item-based similarity) #### 

In [92]:
from surprise import KNNBaseline
from surprise import Dataset
from surprise import get_dataset_dir

In [632]:
# Same averages as before, grouped item-first; the result column is named 'mean'
genre_group = grid_new.groupby(by = ['game_genres', 'user_name'])['scaled_score'].agg(['mean'])
game_group = grid_new.groupby(by = ['game_name', 'user_name'])['scaled_score'].agg(['mean'])

In [633]:
genre_group = genre_group.reset_index()
game_group = game_group.reset_index()

In [93]:
#pickle.dump(genre_group, open("./Data/genre_group.pkl", "wb" ) )
genre_group = pickle.load( open( "./Data/genre_group.pkl", "rb" ) )

#pickle.dump(game_group, open("./Data/game_group.pkl", "wb" ) )
game_group = pickle.load( open( "./Data/game_group.pkl", "rb" ) )

In [94]:
# genre_group and game_group were loaded in the previous cell; one Reader serves both datasets
reader = Reader(rating_scale=(1, 5))
genre_group_data = Dataset.load_from_df(genre_group[['user_name', 'game_genres', 'mean']], reader)
game_group_data = Dataset.load_from_df(game_group[['user_name', 'game_name', 'mean']], reader)

We're using the KNNBaseline algorithm to choose the genres and games with the highest similarity to those already streamed, as it was the fastest of the KNN algorithms available and can be used in the app to make on-the-spot predictions. We now run a grid search to find the best parameters and then fit the resulting estimators on our data.
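
To show what "similarity" means for the item-based case, here is a minimal sketch of the mean squared difference (MSD) similarity between two games, one of the two measures in the grid search. It is written in plain Python on hypothetical ratings; the formula follows the Surprise documentation as I understand it, and `min_support` plays the same role as in `sim_options`.

```python
# Hypothetical ratings: game -> {streamer: rating on the 1-5 scale}
ratings_by_game = {
    "Fortnite": {"ann": 5.0, "ben": 4.0, "cat": 2.0},
    "Apex Legends": {"ann": 5.0, "ben": 3.0, "cat": 2.0},
    "Minecraft": {"ann": 1.0, "cat": 5.0},
}

def msd_similarity(game_a, game_b, min_support=1):
    """1 / (1 + mean squared rating difference) over streamers who rated both games."""
    a, b = ratings_by_game[game_a], ratings_by_game[game_b]
    common = a.keys() & b.keys()
    if len(common) < min_support:
        return 0.0  # too few shared streamers to trust the comparison
    msd = sum((a[u] - b[u]) ** 2 for u in common) / len(common)
    return 1.0 / (1.0 + msd)

def most_similar(game, k=2):
    """The k games whose ratings track `game` most closely."""
    others = [g for g in ratings_by_game if g != game]
    return sorted(others, key=lambda g: msd_similarity(game, g), reverse=True)[:k]

print(most_similar("Fortnite"))
```

Here Apex Legends comes out as the closest neighbour of Fortnite because the three shared streamers rate the two games almost identically, while Minecraft's ratings move in the opposite direction.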

In [95]:
import random

from surprise.model_selection import GridSearchCV


# Load the full dataset.
data = genre_group_data
raw_ratings = data.raw_ratings

# shuffle ratings if you want
random.shuffle(raw_ratings)

# A = 90% of the data, B = 10% of the data
threshold = int(.9 * len(raw_ratings))
A_raw_ratings = raw_ratings[:threshold]
B_raw_ratings = raw_ratings[threshold:]

data.raw_ratings = A_raw_ratings  # data is now the set A

# Select your best algo with grid search.
print('Grid Search...')
param_grid = {'bsl_options': {'method': ['als', 'sgd'],
                              'reg': [1, 2]},
              'k': [2,3],
              'sim_options': {'name': ['msd', 'cosine'],
                              'min_support': [1, 5],
                              'user_based': [False]}
              }
grid_search = GridSearchCV(KNNBaseline, param_grid, measures=['rmse'], cv=3)
grid_search.fit(data)

KNN_genre_algo = grid_search.best_estimator['rmse']

# retrain on the whole set A
trainset = data.build_full_trainset()
KNN_genre_algo.fit(trainset)

# Compute biased accuracy on A
predictions = KNN_genre_algo.test(trainset.build_testset())
print('Biased accuracy on A,', end='   ')
accuracy.rmse(predictions)

# Compute unbiased accuracy on B
testset = data.construct_testset(B_raw_ratings)  # testset is now the set B
genre_predictions = KNN_genre_algo.test(testset)
print('Unbiased accuracy on B,', end=' ')
accuracy.rmse(genre_predictions)


Grid Search...
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.

0.481351625218479

In [96]:
import random

from surprise.model_selection import GridSearchCV


# Load the full dataset.
data = game_group_data
raw_ratings = data.raw_ratings

# shuffle ratings if you want
random.shuffle(raw_ratings)

# A = 90% of the data, B = 10% of the data
threshold = int(.9 * len(raw_ratings))
A_raw_ratings = raw_ratings[:threshold]
B_raw_ratings = raw_ratings[threshold:]

data.raw_ratings = A_raw_ratings  # data is now the set A

# Select your best algo with grid search.
print('Grid Search...')
param_grid = {'bsl_options': {'method': ['als', 'sgd'],
                              'reg': [1, 2]},
              'k': [2, 3],
              'sim_options': {'name': ['msd', 'cosine'],
                              'min_support': [1, 5],
                              'user_based': [False]}
              }
grid_search = GridSearchCV(KNNBaseline, param_grid, measures=['rmse'], cv=3)
grid_search.fit(data)

KNN_game_algo = grid_search.best_estimator['rmse']

# retrain on the whole set A
trainset = data.build_full_trainset()
KNN_game_algo.fit(trainset)

# Compute biased accuracy on A
game_predictions = KNN_game_algo.test(trainset.build_testset())
print('Biased accuracy on A,', end='   ')
accuracy.rmse(game_predictions)

# Compute unbiased accuracy on B
testset = data.construct_testset(B_raw_ratings)  # testset is now the set B
game_predictions = KNN_game_algo.test(testset)
print('Unbiased accuracy on B,', end=' ')
accuracy.rmse(game_predictions)


Grid Search...
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing

Estimating biases using sgd...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using sgd...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using sgd...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using sgd...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using sgd...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using sgd...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using sgd...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using sgd...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using sgd...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using sgd...
Computing the m

0.8876411080616802

Now we're going to use the best estimator from the grid search to find the nearest neighbors of the genres and games we received as inputs from the streamers:

In [97]:
genre_group_trainset = genre_group_data.build_full_trainset()
genre_group_testset = genre_group_trainset.build_anti_testset()
KNN_genre_predictions = KNN_genre_algo.fit(genre_group_trainset).test(genre_group_testset)
accuracy.rmse(KNN_genre_predictions)

Estimating biases using sgd...
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.6615


0.6614788265828276
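One caveat when reading this RMSE (and the game RMSE in the next cell): an anti-testset contains only the (user, item) pairs that are absent from the trainset, and by default Surprise fills their "true" rating with the trainset's global mean. The figure therefore reflects how far the estimates sit from the global mean rather than predictive accuracy. Below is a minimal pure-Python sketch of that construction; the ratings are made up and `build_anti_testset_sketch` is a hypothetical helper, not a Surprise function.

```python
from itertools import product

# Toy ratings as (user, item, rating); values are illustrative, not from our data.
ratings = [
    ("alice", "FPS", 5.0),
    ("alice", "Simulation", 3.0),
    ("bob", "FPS", 4.0),
]


def build_anti_testset_sketch(ratings):
    """Return every (user, item) pair without a known rating,
    filled with the global mean (mimicking Surprise's default fill)."""
    known = {(u, i) for u, i, _ in ratings}
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    global_mean = sum(r for _, _, r in ratings) / len(ratings)
    return [(u, i, global_mean)
            for u, i in product(users, items)
            if (u, i) not in known]


anti = build_anti_testset_sketch(ratings)
print(anti)
```

The unbiased figure remains the one computed on the held-out set B in the grid search cells.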

In [98]:
game_group_trainset = game_group_data.build_full_trainset()
game_group_testset = game_group_trainset.build_anti_testset()
KNN_game_predictions = KNN_game_algo.fit(game_group_trainset).test(game_group_testset)
accuracy.rmse(KNN_game_predictions)

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.3144


0.314351405496346

To evaluate the closest neighbors of items, we're going to make a list of every genre and game that needs to be evaluated and predict how similar they are to those already streamed by our target person.

Note that there are two cases to handle here. If the streamer already has a Twitch sign-in name and streams content, they might be in our existing database. If they have a sign-in name but are not yet streaming, we still need to be able to provide some recommendations. So we're going to retrieve the recommendations for them either way, doing our best to estimate their expected success.
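The two cases can be told apart by checking whether the sign-in name has any predictions attached to it. The following is a hedged sketch in plain Python, not the notebook's actual logic: `recommend`, `known_predictions` and `neighbor_items` are hypothetical names, and the flat fallback weight of 1.0 is an assumption that mirrors the weights used later in this workbook.

```python
def recommend(streamer, known_predictions, neighbor_items, top_n=10):
    """Return up to top_n (item, est) pairs for a streamer.

    known_predictions: dict mapping a user name to a list of (item, est) pairs.
    neighbor_items: items similar to the games/genres the streamer named.
    """
    personal = sorted(known_predictions.get(streamer, []),
                      key=lambda pair: pair[1], reverse=True)[:top_n]
    if personal:
        # Active streamer found in the dataset: use their own estimates.
        return personal
    # Not in the dataset: fall back on the neighbors with a flat weight.
    return [(item, 1.0) for item in neighbor_items][:top_n]


known = {"WildEraLIVE": [("Shooter", 4.30), ("FPS", 4.44)]}
print(recommend("WildEraLIVE", known, ["Simulation"]))
print(recommend("new_streamer", known, ["Simulation", "Shooter"]))
```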

In [107]:
# Build the lists of genres/games to evaluate by converting each raw name to the inner id used by Surprise
KNN_genre_inner_id_list = []
for genre in full_genres:
    inner = KNN_genre_algo.trainset.to_inner_iid(genre)
    KNN_genre_inner_id_list.append(inner)

KNN_game_inner_id_list = []
for game in full_games:
    inner = KNN_game_algo.trainset.to_inner_iid(game)
    KNN_game_inner_id_list.append(inner)

In [105]:
KNN_genre_inner_id_list

[1]

In [108]:
KNN_game_inner_id_list

[311, 81]

In [109]:
# Retrieve inner ids of the nearest neighbors of the genres and games in question.
KNN_genre_neighbors_list = []
for inner in KNN_genre_inner_id_list:
    genre_neighbors = KNN_genre_algo.get_neighbors(inner, k=3)
    KNN_genre_neighbors_list.append(genre_neighbors)

KNN_game_neighbors_list = []
for inner in KNN_game_inner_id_list:
    game_neighbors = KNN_game_algo.get_neighbors(inner, k=3)
    KNN_game_neighbors_list.append(game_neighbors)

In [110]:
print(KNN_genre_neighbors_list)
print(KNN_game_neighbors_list)

[[2, 9, 7]]
[[0, 1, 2], [21, 6, 31]]
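Conceptually, a neighbor lookup ranks all other items by their similarity to the target item and keeps the top k. Here is a small sketch with a made-up similarity matrix; the values and the helper `nearest_neighbors` are illustrative assumptions, not Surprise's internals.

```python
import heapq

# Made-up symmetric item-item similarity matrix, indexed by inner id.
sim = [
    [1.0, 0.2, 0.9, 0.4],
    [0.2, 1.0, 0.1, 0.8],
    [0.9, 0.1, 1.0, 0.3],
    [0.4, 0.8, 0.3, 1.0],
]


def nearest_neighbors(sim, iid, k):
    """Return the inner ids of the k items most similar to iid."""
    others = [(sim[iid][j], j) for j in range(len(sim)) if j != iid]
    top = heapq.nlargest(k, others, key=lambda pair: pair[0])
    return [j for _, j in top]


print(nearest_neighbors(sim, 0, k=2))
```

In this sketch, tied similarities are resolved in favor of the lowest inner ids. A result such as `[0, 1, 2]` may therefore indicate that many similarities are tied, so it could be worth inspecting the similarity values before relying on such neighbors.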


In [111]:
KNN_genres_predictions_df = pd.DataFrame(KNN_genre_predictions, columns=['uid', 'iid', 'rui', 'est', 'details'])    

In [112]:
KNN_genres_predictions_df.head(3)

           uid                  iid       rui       est                                   details
0  WildEraLIVE                  FPS  3.567691  4.439771  {'actual_k': 2, 'was_impossible': False}
1  WildEraLIVE              Shooter  3.567691  4.303374  {'actual_k': 2, 'was_impossible': False}
2  WildEraLIVE  Driving/Racing Game  3.567691  4.287514  {'actual_k': 2, 'was_impossible': False}


In [113]:
# Provides genres if the person is already an active streamer and is in the dataset. The frame is empty if they are not.

KNN_reco_genres = KNN_genres_predictions_df[KNN_genres_predictions_df['uid']==streamer_name].sort_values('est', ascending = False)[['iid', 'est']][:10]

In [114]:
KNN_reco_genres

Empty DataFrame
Columns: [iid, est]
Index: []


In [115]:
KNN_games_predictions_df = pd.DataFrame(KNN_game_predictions, columns=['uid', 'iid', 'rui', 'est', 'details'])   

In [116]:
KNN_reco_games = KNN_games_predictions_df[KNN_games_predictions_df['uid']==streamer_name].sort_values('est', ascending = False)[['iid', 'est']][:10]

In [117]:
KNN_reco_games

Empty DataFrame
Columns: [iid, est]
Index: []


In [118]:
# Prioritize the two closest neighbors of each original genre/game mentioned

genre_final_list = []
for item in KNN_genre_neighbors_list:
    genre_final_list.append(item[0])
    genre_final_list.append(item[1])

game_final_list = []
for item in KNN_game_neighbors_list:
    game_final_list.append(item[0])
    game_final_list.append(item[1])

In [119]:
print(genre_final_list)
print(game_final_list)

[2, 9]
[0, 1, 21, 6]


To do: come up with a way to weigh the neighbors that appear most frequently across all genres/games, and combine them with the per-user recommendations.
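One possible way to do this weighting is to count how often each neighbor appears across the neighbor lists, since deduplicating with `set()` discards that frequency. The sketch below is an assumption about how it could be done, not something the workbook implements; the third neighbor list is made up to create overlaps, and `neighbor_weights` is a hypothetical helper.

```python
from collections import Counter

# Illustrative neighbor lists (inner ids); the third list is made up.
neighbor_lists = [[0, 1, 2], [21, 6, 31], [1, 6, 40]]


def neighbor_weights(neighbor_lists):
    """Weight each neighbor by the number of input items it is close to."""
    return Counter(n for neighbors in neighbor_lists for n in neighbors)


weights = neighbor_weights(neighbor_lists)
# Most frequent neighbors first; these counts could replace a flat weight of 1.
print(weights.most_common())
```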

In [120]:
genres_KNN_List = [KNN_genre_algo.trainset.to_raw_iid(iiid) for iiid in set(genre_final_list)]

games_KNN_List = [KNN_game_algo.trainset.to_raw_iid(iiid) for iiid in set(game_final_list)]
print('The nearest neighbors of your current genres are:' + str(genres_KNN_List))
print('The nearest neighbors of your current games are:' + str(games_KNN_List))

The nearest neighbors of your current genres are:['Simulation', 'Shooter']
The nearest neighbors of your current games are:['The Elder Scrolls V: Skyrim', 'Sea of Thieves', 'Escape From Tarkov', 'Art']


The cells below add a flat weight of 1 to each of the closest neighbors of the streamer's current genres and games. Ideally, this would become part of the metric itself, provided the algorithm can be made fast enough to run live.

In [121]:
genres_KNN_List, games_KNN_List

(['Simulation', 'Shooter'],
 ['The Elder Scrolls V: Skyrim',
  'Sea of Thieves',
  'Escape From Tarkov',
  'Art'])

In [122]:
genres_KNN_df = pd.DataFrame(genres_KNN_List)
games_KNN_df = pd.DataFrame(games_KNN_List)

In [123]:
genres_KNN_df['est'] = np.ones(len(genres_KNN_List))
games_KNN_df['est'] = np.ones(len(games_KNN_List))

In [124]:
games_KNN_df['iid'] = games_KNN_df[0]
genres_KNN_df['iid'] = genres_KNN_df[0]

In [125]:
genre_frames = [personal_genre_pred, KNN_reco_genres, genres_KNN_df]
game_frames = [personal_game_pred, KNN_reco_games, games_KNN_df]

In [126]:
genre_result = pd.concat(genre_frames, sort=False)
game_result = pd.concat(game_frames, sort=False)


In [127]:
genre_result.shape, game_result.shape

((35, 3), (354, 3))

In [128]:
genre_result[['iid', 'est']].sort_values('est', ascending = False)[:10]

                    iid       est
 0             Fighting  3.566848
25        Gambling Game  3.566848
19             Creative  3.566848
20        Series: Souls  3.566848
21           Platformer  3.566848
22                  RTS  3.566848
23    Card & Board Game  3.566848
24  Rhythm & Music Game  3.566848
26             Survival  3.566848
 1           Open World  3.566848


In [129]:
# Sum only the numeric 'est' column so the helper column 0 (raw names) is ignored
genre_result.groupby(by='iid')[['est']].sum().sort_values('est', ascending = False)[:10]

                          est
iid
Simulation           4.566848
Action               3.566848
Adventure Game       3.566848
Survival             3.566848
Strategy             3.566848
Stealth              3.566848
Sports Game          3.566848
Series: Souls        3.566848
Roguelike            3.566848
Rhythm & Music Game  3.566848
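
The 4.566848 for Simulation is the baseline estimate of 3.566848 plus the flat neighbor weight of 1. That arithmetic can be reproduced without pandas, as in the sketch below. It is only an illustration: the dictionaries hold a handful of values, and treating Shooter as present only in the neighbor set is an assumption made for the example.

```python
from collections import defaultdict

# A few baseline estimates and the flat neighbor weights (illustrative subset).
baseline = {"Simulation": 3.566848, "Action": 3.566848, "Survival": 3.566848}
neighbor_bonus = {"Simulation": 1.0, "Shooter": 1.0}

# Equivalent of concatenating the frames and summing 'est' per 'iid'.
totals = defaultdict(float)
for source in (baseline, neighbor_bonus):
    for item, est in source.items():
        totals[item] += est

ranked = sorted(totals.items(), key=lambda pair: pair[1], reverse=True)
for item, est in ranked:
    print(f"{item:<12} {est:.6f}")
```

An item that appears only among the neighbors ends up with just its weight of 1, so it ranks below every item that has a baseline estimate.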


In [130]:
# Sum only the numeric 'est' column so the helper column 0 (raw names) is ignored
game_result.groupby(by='iid')[['est']].sum().sort_values('est', ascending = False)[:10]

                                     est
iid
Escape From Tarkov               4.77304
The Elder Scrolls V: Skyrim     4.642488
Art                             4.631819
Sea of Thieves                  4.326709
The Jackbox Party Pack 3        4.062363
Call of Duty: Black Ops II      3.990895
Pokémon Ultra Sun/Ultra Moon    3.975484
Visage                           3.96807
Minecraft                       3.915991
The Elder Scrolls IV: Oblivion  3.914401
