In this workbook I cover all aspects of analysing our Twitch data using the Surprise package. More information about Surprise can be found here: https://surprise.readthedocs.io/en/stable/index.html

We start by taking the grid from our EDA Surprise Funnel Workbook and preparing it for use with Surprise. The package requires three fields for every observation: User, Item, and Rating. Usually, a rating is an evaluation given by the user herself (I can rate a movie I watched or a book I read 4 out of 5 stars), but in our case the rating is a custom success metric that measures how successful a certain game or genre of content has been for a particular streamer. Each streamer has one 5-rated game, and any other games are scored as a fraction of that core content. This avoids the imbalance created by the few streamers with millions of views, who would otherwise skew the results heavily.

In [190]:
import pandas as pd
import numpy as np

In [191]:
grid = pd.read_csv('final_game_user_grid.csv')

In [192]:
grid.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70545 entries, 0 to 70544
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   user_name     70545 non-null  object
 1   user_id       70545 non-null  int64 
 2   game_id       70545 non-null  int64 
 3   game_name     70545 non-null  object
 4   game_genres   70351 non-null  object
 5   language      70543 non-null  object
 6   started_at    70545 non-null  object
 7   viewer_count  70545 non-null  int64 
 8   max           70545 non-null  int64 
dtypes: int64(4), object(5)
memory usage: 4.8+ MB


In [193]:
grid = grid.dropna()
grid.shape

(70349, 9)

In [194]:
grid = grid.dropna(how='any',axis=0) 
grid.shape

(70349, 9)

In [195]:
grid.head()

Unnamed: 0,user_name,user_id,game_id,game_name,game_genres,language,started_at,viewer_count,max
0,龜狗,48093884,21779,League of Legends,{MOBA},zh,2020-04-19T04:18:22Z,1931,1931
1,黒田瑞貴,225658233,511224,Apex Legends,"{FPS,Shooter}",ja,2020-04-20T12:19:02Z,165,165
2,黑色柳丁,18156459,18122,World of Warcraft,"{MMORPG,RPG}",zh,2020-04-20T10:53:28Z,102,102
3,黑色柳丁,18156459,18122,World of Warcraft,"{MMORPG,RPG}",zh,2020-04-20T10:53:28Z,74,74
4,黑色柳丁,18156459,18122,World of Warcraft,"{MMORPG,RPG}",zh,2020-04-16T11:29:02Z,71,71


In [196]:
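# Strip braces and quotes literally (regex=False keeps "{" from being read as a pattern)
grid['game_genres'] = grid['game_genres'].str.replace("{", "", regex=False).str.replace("}", "", regex=False)
grid['game_genres'] = grid['game_genres'].str.replace('"', "", regex=False)
grid['game_genres'] = grid['game_genres'].str.replace("'", "", regex=False)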

To begin, we're calculating the metric we'll be using to compare users and games. 

Since we're focused on the streamers, we don't have the traditional "I like this movie so I will rate it a 4.5 out of 5.0" ratings. 

Instead, we calculate how many viewers each streamer attracted with a particular game compared to their maximum viewer count over the week.

For each user, one game is their ultimate streaming "5 out of 5" benchmark, and all other games they play are compared to that one, and normalized to ratings between 1 and 5. We are also using all the genres for each game to pinpoint how successful various genres and games have been for each streamer.
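As a rough illustration of the metric described above, the sketch below computes the per-user ratio on a few toy rows. The first user's numbers come from the sample rows shown earlier; `quiet_streamer`, `Tetris`, and the `user_best` column are made up for the example. It uses `groupby(...).transform('max')`, which is an alternative to the dictionary lookup used in this workbook, not the workbook's own code.

```python
import pandas as pd

# Toy data: the first three rows reuse viewer peaks from the sample output above;
# the last row is a hypothetical streamer who never had a viewer.
toy = pd.DataFrame({
    "user_name": ["黑色柳丁", "黑色柳丁", "黑色柳丁", "quiet_streamer"],
    "game_name": ["World of Warcraft"] * 3 + ["Tetris"],
    "max": [102, 74, 71, 0],
})

# Broadcast each user's best stream back onto all of that user's rows.
toy["user_best"] = toy.groupby("user_name")["max"].transform("max")

# Ratio of each stream to the user's best; 0 / 0 produces NaN rather than an error.
toy["score"] = toy["max"] / toy["user_best"]

print(toy[["user_name", "max", "user_best", "score"]])
```

Note that a streamer whose best stream had 0 viewers ends up with a NaN score, so those rows have to be dropped before the data is handed to Surprise.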

In [197]:
max_value_username = pd.DataFrame(grid.groupby('user_name')['max'].max().reset_index())
max_value_username.head()

Unnamed: 0,user_name,max
0,0011002200,0
1,00PixieDust,6
2,00carlos03,1
3,00에스프레쏘00,27
4,02theliveR,65


In [198]:
max_val_dict = max_value_username.groupby('user_name')['max'].apply(list).to_dict()

In [199]:
grid['max_game'] = grid['user_name'].map(max_val_dict)
grid.head()

Unnamed: 0,user_name,user_id,game_id,game_name,game_genres,language,started_at,viewer_count,max,max_game
0,龜狗,48093884,21779,League of Legends,MOBA,zh,2020-04-19T04:18:22Z,1931,1931,[1931]
1,黒田瑞貴,225658233,511224,Apex Legends,"FPS,Shooter",ja,2020-04-20T12:19:02Z,165,165,[165]
2,黑色柳丁,18156459,18122,World of Warcraft,"MMORPG,RPG",zh,2020-04-20T10:53:28Z,102,102,[102]
3,黑色柳丁,18156459,18122,World of Warcraft,"MMORPG,RPG",zh,2020-04-20T10:53:28Z,74,74,[102]
4,黑色柳丁,18156459,18122,World of Warcraft,"MMORPG,RPG",zh,2020-04-16T11:29:02Z,71,71,[102]


In [200]:
grid['max_game_int'] = grid.max_game.str[0].astype(int)
grid = grid.drop('max_game', axis = 1)

In [201]:
grid.sample(10)

Unnamed: 0,user_name,user_id,game_id,game_name,game_genres,language,started_at,viewer_count,max,max_game_int
29313,Molinero1990,139141695,504461,Super Smash Bros. Ultimate,"Fighting,Platformer",de,2020-04-16T19:05:13Z,20,20,20
36854,konnichiwatwitch,257569998,502732,Garena Free Fire,"Shooter,Strategy",es,2020-04-20T02:22:27Z,1,1,1
47284,GianniBianchini,218808044,417752,Talk Shows & Podcasts,IRL,en,2020-04-20T16:00:22Z,1,1,1
67421,AkinaFair,82895940,115977,The Witcher 3: Wild Hunt,"RPG,Action",en,2020-04-16T19:22:17Z,6,6,6
29988,Minion777,31549282,515448,Resident Evil 3,"Action,Adventure Game",en,2020-04-19T23:51:39Z,450,450,751
9641,TheChickenWing,81486617,491487,Dead by Daylight,"Action,Horror",en,2020-04-20T13:54:32Z,51,51,68
58875,ChrisPlaysFUNGames,62672112,511399,Super Mario Maker 2,Platformer,en,2020-04-18T20:02:01Z,15,15,15
67731,Agent13Cookie,515478471,493244,Deceit,"Shooter,Action,FPS",en,2020-04-19T23:32:03Z,2,2,2
9912,That_Dahlia,408707982,506246,Fallout 76,"RPG,Shooter",en,2020-04-19T23:54:22Z,23,23,23
2077,Yavanoa,40024560,32959,Heroes of the Storm,MOBA,fr,2020-04-19T19:27:02Z,2,2,2


In [202]:
grid['score'] = grid['max']/grid['max_game_int']
grid.head()

Unnamed: 0,user_name,user_id,game_id,game_name,game_genres,language,started_at,viewer_count,max,max_game_int,score
0,龜狗,48093884,21779,League of Legends,MOBA,zh,2020-04-19T04:18:22Z,1931,1931,1931,1.0
1,黒田瑞貴,225658233,511224,Apex Legends,"FPS,Shooter",ja,2020-04-20T12:19:02Z,165,165,165,1.0
2,黑色柳丁,18156459,18122,World of Warcraft,"MMORPG,RPG",zh,2020-04-20T10:53:28Z,102,102,102,1.0
3,黑色柳丁,18156459,18122,World of Warcraft,"MMORPG,RPG",zh,2020-04-20T10:53:28Z,74,74,102,0.72549
4,黑色柳丁,18156459,18122,World of Warcraft,"MMORPG,RPG",zh,2020-04-16T11:29:02Z,71,71,102,0.696078


In [203]:
from sklearn.preprocessing import minmax_scale
grid['scaled_score'] = minmax_scale(grid['score'], feature_range=(1, 5))

In [204]:
grid.tail()

Unnamed: 0,user_name,user_id,game_id,game_name,game_genres,language,started_at,viewer_count,max,max_game_int,score,scaled_score
70540,_문님,232779448,493057,PLAYERUNKNOWN'S BATTLEGROUNDS,"Shooter,FPS",ko,2020-04-20T16:02:01Z,15,15,35,0.428571,2.714286
70541,_문님,232779448,493057,PLAYERUNKNOWN'S BATTLEGROUNDS,"Shooter,FPS",ko,2020-04-19T16:09:19Z,14,14,35,0.4,2.6
70542,__수현,423848194,493057,PLAYERUNKNOWN'S BATTLEGROUNDS,"Shooter,FPS",ko,2020-04-19T14:18:18Z,18,18,18,1.0,5.0
70543,__수현,423848194,493057,PLAYERUNKNOWN'S BATTLEGROUNDS,"Shooter,FPS",ko,2020-04-19T14:18:18Z,16,16,18,0.888889,4.555556
70544,__수현,423848194,493057,PLAYERUNKNOWN'S BATTLEGROUNDS,"Shooter,FPS",ko,2020-04-19T14:18:18Z,15,15,18,0.833333,4.333333
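
The scaling step is a single linear map applied to the whole `score` column, not per user. The sketch below spells out the arithmetic in plain Python; `min_max_scale` is a hypothetical helper written for this illustration, and the 0.0 entry assumes that the smallest score in the full grid is 0, which is consistent with the rows above (0.4 maps to 2.6 and 1.0 maps to 5.0).

```python
def min_max_scale(values, lo=1.0, hi=5.0):
    """Linearly map values so that min(values) -> lo and max(values) -> hi."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # Choice made for this sketch only; a constant column has no spread to scale.
        raise ValueError("cannot scale a constant column")
    span = v_max - v_min
    return [lo + (hi - lo) * (v - v_min) / span for v in values]


# 14/35 and 15/35 are the two scores of the first streamer in the rows above;
# 0.0 is the assumed global minimum and 1.0 the global maximum.
scores = [0.0, 14 / 35, 15 / 35, 1.0]
scaled = min_max_scale(scores)
print([round(s, 6) for s in scaled])  # [1.0, 2.6, 2.714286, 5.0]
```

With a minimum of 0 and a maximum of 1 the map reduces to `1 + 4 * score`, which is why each streamer's best stream lands exactly on 5.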


In [205]:
#grid.to_csv('final_rating_game.csv', index = False)

In [206]:
grid = grid.dropna()

Now we have a listing for each user pairing them with each game they play, what genre it belongs to, and how many people watched them play each game compared to the max viewers they ever got for a stream during the week we examined.

In [207]:
grid.groupby('user_name')['scaled_score'].count().reset_index().sort_values('scaled_score', ascending=False)[:5]

Unnamed: 0,user_name,scaled_score
30841,gaules,9
7888,FroggedTV,9
16826,OgamingLoL,9
15146,Monstercat,9
21060,SlotRoom247,9


In [208]:
grid.groupby('game_genres')['scaled_score'].count().reset_index().sort_values('scaled_score', ascending=False)[:5]

Unnamed: 0,game_genres,scaled_score
1,"Action,Adventure Game",5311
21,"FPS,Shooter",3604
14,Card & Board Game,3566
36,MOBA,3372
6,"Action,RPG",3352


In [209]:
grid.groupby('game_name')['scaled_score'].count().reset_index().sort_values('scaled_score', ascending=False)[:5]

Unnamed: 0,game_name,scaled_score
54,Fortnite,900
46,FIFA 20,898
63,Grand Theft Auto V,898
91,Minecraft,895
169,VALORANT,895


In [210]:
min_number_scores = 5
filter_users = grid['user_name'].value_counts() > min_number_scores
filter_users = filter_users[filter_users].index.tolist()

In [211]:
grid_new = grid[(grid['user_name'].isin(filter_users))]
print('The original data frame shape:\t{}'.format(grid.shape))
print('The new data frame shape:\t{}'.format(grid_new.shape))

The original data frame shape:	(69428, 12)
The new data frame shape:	(4699, 12)
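
A small sketch of the same filter on toy data may help when choosing the threshold. The frame `rows` and the user names `a`, `b`, `c` are hypothetical. Because the comparison is a strict `>`, a streamer with exactly 5 rows is dropped.

```python
import pandas as pd

# Hypothetical streamers with 6, 5 and 2 rows respectively.
rows = pd.DataFrame({"user_name": ["a"] * 6 + ["b"] * 5 + ["c"] * 2})
min_number_scores = 5

counts = rows["user_name"].value_counts()
keep = counts[counts > min_number_scores].index  # strictly more than 5 rows
filtered = rows[rows["user_name"].isin(keep)]

print(sorted(filtered["user_name"].unique().tolist()))  # only "a" survives
```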


After reshaping the grid, we are going to extract the recommendations for game genre, game titles, and games similar to those already rated by the streamers as a three-pronged recommender approach.

Preparing for genre recommendations based on viewership scores:

In [212]:
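# Average scaled score per (streamer, genre) and per (streamer, game); 'mean' avoids passing np.mean to agg
genres_base_df = grid_new.groupby(by = ['user_name', 'game_genres'])['scaled_score'].agg(['mean'])
games_base_df = grid_new.groupby(by = ['user_name', 'game_name'])['scaled_score'].agg(['mean'])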

In [336]:
genres_base_df.head(5)

Unnamed: 0,user_name,game_genres,mean
0,0monstro,"MMORPG,RPG",3.626866
1,1Gn0rance,"Action,RPG",3.512821
2,1ST3NM1,MMORPG,4.257028
3,24_Flash,Shooter,4.677054
4,24_Flash,"Shooter,FPS",4.31898


In [214]:
games_base_df.head(5)

Unnamed: 0_level_0,Unnamed: 1_level_0,mean
user_name,game_name,Unnamed: 2_level_1
0monstro,Last Oasis,3.626866
1Gn0rance,Path of Exile,3.512821
1ST3NM1,Knight Online,4.257028
24_Flash,Fortnite,4.677054
24_Flash,VALORANT,4.31898


In [215]:
genres_base_df.columns = genres_base_df.columns.map(''.join)
games_base_df.columns = games_base_df.columns.map(''.join)

In [216]:
genres_base_df = genres_base_df.reset_index()
games_base_df = games_base_df.reset_index()

In [217]:
genres_base_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 861 entries, 0 to 860
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   user_name    861 non-null    object 
 1   game_genres  861 non-null    object 
 2   mean         861 non-null    float64
dtypes: float64(1), object(2)
memory usage: 20.3+ KB


In [218]:
games_base_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 868 entries, 0 to 867
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   user_name  868 non-null    object 
 1   game_name  868 non-null    object 
 2   mean       868 non-null    float64
dtypes: float64(1), object(2)
memory usage: 20.5+ KB


In [219]:
genres_base_df.head(5)

Unnamed: 0,user_name,game_genres,mean
0,0monstro,"MMORPG,RPG",3.626866
1,1Gn0rance,"Action,RPG",3.512821
2,1ST3NM1,MMORPG,4.257028
3,24_Flash,Shooter,4.677054
4,24_Flash,"Shooter,FPS",4.31898


In [220]:
games_base_df.head(5)

Unnamed: 0,user_name,game_name,mean
0,0monstro,Last Oasis,3.626866
1,1Gn0rance,Path of Exile,3.512821
2,1ST3NM1,Knight Online,4.257028
3,24_Flash,Fortnite,4.677054
4,24_Flash,VALORANT,4.31898


Using Surprise to predict genres/games for a streamer based on their existing games and genres ratings

In [221]:
import surprise
from surprise import Dataset, accuracy, Reader, NMF, NormalPredictor, BaselineOnly, CoClustering, SlopeOne, SVD, KNNBaseline
from surprise.model_selection import GridSearchCV, cross_validate, train_test_split

In [222]:
reader = Reader(rating_scale=(1, 5))
genre_base_data = Dataset.load_from_df(genres_base_df[['user_name', 'game_genres', 'mean']], reader)

In [223]:
game_base_data = Dataset.load_from_df(games_base_df[['user_name', 'game_name', 'mean']], reader)

Now it is time to select the algorithms we will use to generate recommendations for our streamers based on their existing behavior and also on how similar the games they already stream are to those they do not yet stream. 

The first step is to evaluate various algorithms available in Surprise to see which produce the lowest RMSE and are computationally friendly so we can use them in a dynamic app and recommend things on the fly. We perform cross validation on all fast algorithms to see which performs best with either of our datasets, genres and games.
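
For readers who have not met the metric before, here is a minimal, library-free sketch of what a cross-validated RMSE measures. The helpers `rmse` and `k_fold_rmse` are hypothetical names, and the predictor is a plain global mean rather than any Surprise algorithm, so the numbers are not comparable to the tables in this workbook.

```python
import math
import random


def rmse(pairs):
    """Root mean squared error over (true, estimated) pairs."""
    return math.sqrt(sum((t - e) ** 2 for t, e in pairs) / len(pairs))


def k_fold_rmse(ratings, k=5, seed=0):
    """Cross-validate a global-mean predictor on a list of ratings."""
    shuffled = ratings[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # k roughly equal parts
    scores = []
    for i, test in enumerate(folds):
        # Train on every fold except the held-out one.
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        mean = sum(train) / len(train)
        scores.append(rmse([(r, mean) for r in test]))
    return scores


# Errors of 1, 0 and 1 give sqrt(2/3).
print(rmse([(5.0, 4.0), (3.0, 3.0), (1.0, 2.0)]))
print(k_fold_rmse([1, 2, 3, 4, 5, 1, 2, 3, 4, 5], k=5))
```

A test RMSE far above the train RMSE, as NMF and SlopeOne show in the benchmark tables, is the usual sign that a model fits the training folds much better than unseen data.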

In [224]:
benchmark = []
# Iterate over all algorithms
for algorithm in [SlopeOne(), NormalPredictor(), BaselineOnly(), NMF(), CoClustering(), SVD()]:


    # Perform cross validation
    results = cross_validate(algorithm, genre_base_data, measures=['RMSE'], return_train_measures=True, cv=5, verbose=True)
    
    # Get results & append algorithm name
    tmp = pd.DataFrame.from_dict(results).mean(axis=0)
    tmp = pd.concat([tmp, pd.Series([str(algorithm).split(' ')[0].split('.')[-1]], index=['Algorithm'])])
    benchmark.append(tmp)
    
pd.DataFrame(benchmark).set_index('Algorithm').sort_values('test_rmse')  

Evaluating RMSE of algorithm SlopeOne on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8229  0.8884  0.9512  0.8251  0.7635  0.8502  0.0641  
RMSE (trainset)   0.2229  0.1789  0.1971  0.2123  0.2256  0.2074  0.0174  
Fit time          0.01    0.01    0.00    0.00    0.00    0.01    0.00    
Test time         0.00    0.00    0.00    0.00    0.00    0.00    0.00    
Evaluating RMSE of algorithm NormalPredictor on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9955  0.9928  0.9484  0.9520  0.9958  0.9769  0.0219  
RMSE (trainset)   1.0148  0.9568  1.0008  1.0547  0.9941  1.0043  0.0317  
Fit time          0.00    0.00    0.00    0.00    0.00    0.00    0.00    
Test time         0.00    0.00    0.00    0.00    0.00    0.00    0.00    
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating 

Unnamed: 0_level_0,test_rmse,train_rmse,fit_time,test_time
Algorithm,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
BaselineOnly,0.703237,0.644708,0.002022,0.000795
SVD,0.70862,0.507932,0.04365,0.001027
NMF,0.835393,0.072297,0.067451,0.002026
SlopeOne,0.850229,0.207358,0.005558,0.001143
CoClustering,0.895164,0.693736,0.07922,0.0008
NormalPredictor,0.976884,1.004251,0.000864,0.00155


For the genre data we proceed with SlopeOne, even though BaselineOnly (0.703) and SVD (0.709) reach a lower test RMSE than SlopeOne (0.850) in the table above. The Slope One family of item-based, rating-based collaborative filtering algorithms was proposed to reduce overfitting, improve performance, and ease implementation. Instead of fitting a linear regression from one item's ratings to another item's ratings ($f(x) = ax + b$), it uses a simpler predictor with a single free parameter ($f(x) = x + b$). The free parameter is simply the average difference between the two items' ratings. This simpler form was shown to be much more accurate than linear regression in some instances, and it takes half the storage or less.
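
To make the single-parameter idea concrete, here is a small sketch of the basic, unweighted Slope One scheme in plain Python. The function name `slope_one_predict` and the toy ratings are made up for illustration, and Surprise's own `SlopeOne` class differs in its details, so treat this as an explanation of the principle rather than a reimplementation.

```python
def slope_one_predict(ratings, user, target):
    """Basic (unweighted) Slope One estimate of `user`'s rating for `target`.

    ratings: dict mapping user -> {item: rating}.
    Returns None when no item rated by `user` co-occurs with `target`.
    """
    estimates = []
    for item, user_rating in ratings[user].items():
        if item == target:
            continue
        # Average difference (target - item) over users who rated both items.
        diffs = [r[target] - r[item] for r in ratings.values()
                 if target in r and item in r]
        if diffs:
            b = sum(diffs) / len(diffs)
            estimates.append(user_rating + b)  # f(x) = x + b
    return sum(estimates) / len(estimates) if estimates else None


# Hypothetical ratings for three users and three items.
toy = {
    "john": {"A": 5, "B": 3, "C": 2},
    "mark": {"A": 3, "B": 4},
    "lucy": {"B": 2, "C": 5},
}
print(slope_one_predict(toy, "lucy", "A"))  # 5.25
```

Here A is rated 0.5 higher than B on average and 3 higher than C, so Lucy's ratings of 2 and 5 give estimates of 2.5 and 8, which average to 5.25. The raw estimate can leave the rating scale, so a practical implementation would clip it to the 1 to 5 range.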

In [225]:
benchmark = []
# Iterate over all algorithms
for algorithm in [SlopeOne(), NormalPredictor(), BaselineOnly(), NMF(), CoClustering(), SVD()]:


    # Perform cross validation
    results = cross_validate(algorithm, game_base_data, measures=['RMSE'], return_train_measures=True, cv=5, verbose=True)
    
    # Get results & append algorithm name
    tmp = pd.DataFrame.from_dict(results).mean(axis=0)
    tmp = pd.concat([tmp, pd.Series([str(algorithm).split(' ')[0].split('.')[-1]], index=['Algorithm'])])
    benchmark.append(tmp)
    
pd.DataFrame(benchmark).set_index('Algorithm').sort_values('test_rmse')  

Evaluating RMSE of algorithm SlopeOne on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8102  0.9477  0.8467  0.9236  0.8765  0.8810  0.0500  
RMSE (trainset)   0.1741  0.1591  0.1755  0.1650  0.1756  0.1699  0.0067  
Fit time          0.01    0.01    0.01    0.01    0.01    0.01    0.00    
Test time         0.00    0.00    0.00    0.00    0.00    0.00    0.00    
Evaluating RMSE of algorithm NormalPredictor on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.0355  0.9512  0.9682  1.0293  1.0210  1.0011  0.0345  
RMSE (trainset)   0.9925  0.9808  1.0058  0.9739  0.9866  0.9879  0.0109  
Fit time          0.00    0.00    0.00    0.00    0.00    0.00    0.00    
Test time         0.00    0.00    0.00    0.00    0.00    0.00    0.00    
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating 

Unnamed: 0_level_0,test_rmse,train_rmse,fit_time,test_time
Algorithm,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
SVD,0.713561,0.506151,0.04024,0.001109
BaselineOnly,0.714918,0.639666,0.001939,0.00089
NMF,0.85622,0.079258,0.078619,0.001086
SlopeOne,0.880955,0.169867,0.005759,0.001207
CoClustering,0.910979,0.713615,0.077834,0.000847
NormalPredictor,1.00106,0.987916,0.000947,0.001222


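For the game data, we are going to use the BaselineOnly algorithm. Its test RMSE (0.715) is essentially tied with SVD (0.714), the lowest in the table, and it fits roughly 20 times faster (about 0.002 s versus 0.040 s per fold).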

Based on the results of the algorithm selector, we will proceed with SlopeOne for genre predictions and with the Baseline algorithm for game predictions. More information about SlopeOne can be found here: https://arxiv.org/abs/cs/0702144

Next we perform cross-validation to see how our chosen algorithms perform:

In [226]:
algo_genre_base = SlopeOne()
cross_validate(algo_genre_base, genre_base_data, measures=['RMSE'], cv=7, verbose=True)

Evaluating RMSE of algorithm SlopeOne on 7 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Fold 6  Fold 7  Mean    Std     
RMSE (testset)    0.8028  0.9276  0.9304  0.8694  0.7507  0.8968  0.8994  0.8682  0.0624  
Fit time          0.01    0.01    0.01    0.01    0.01    0.01    0.01    0.01    0.00    
Test time         0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    


{'test_rmse': array([0.80280949, 0.92757467, 0.93040153, 0.86942057, 0.75070622,
        0.89678486, 0.89940161]),
 'fit_time': (0.005972146987915039,
  0.005824089050292969,
  0.007110118865966797,
  0.005393266677856445,
  0.005966901779174805,
  0.005309104919433594,
  0.006371021270751953),
 'test_time': (0.0009241104125976562,
  0.0008308887481689453,
  0.0011897087097167969,
  0.0011010169982910156,
  0.0011429786682128906,
  0.0009257793426513672,
  0.0015020370483398438)}

In [227]:
bsl_options = {'method': 'als',
               'n_epochs': 10,
               'reg_u': 12,
               'reg_i': 5
               }
algo_games_base = BaselineOnly(bsl_options=bsl_options)
cross_validate(algo_games_base, game_base_data, measures=['RMSE'], cv=7, verbose=True)

Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Evaluating RMSE of algorithm BaselineOnly on 7 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Fold 6  Fold 7  Mean    Std     
RMSE (testset)    0.7330  0.7821  0.7714  0.6184  0.7143  0.7164  0.6512  0.7124  0.0552  
Fit time          0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    
Test time         0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    


{'test_rmse': array([0.73300186, 0.78214757, 0.77142298, 0.61843997, 0.71433334,
        0.71638114, 0.65123915]),
 'fit_time': (0.0019719600677490234,
  0.0025997161865234375,
  0.0018451213836669922,
  0.001990079879760742,
  0.0020227432250976562,
  0.0018582344055175781,
  0.0021369457244873047),
 'test_time': (0.0005698204040527344,
  0.0005650520324707031,
  0.0004937648773193359,
  0.0008409023284912109,
  0.0005128383636474609,
  0.0006039142608642578,
  0.0005490779876708984)}
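
The baseline model predicts a rating as the global mean plus a user bias plus an item bias. The sketch below is a simplified pure-Python version of the alternating least squares updates, written from the description in the Surprise documentation; it is not the library's code. The function names and toy ratings are hypothetical, and the default `n_epochs`, `reg_u` and `reg_i` simply mirror the `bsl_options` used above.

```python
from collections import defaultdict


def baseline_als(ratings, n_epochs=10, reg_u=12, reg_i=5):
    """Estimate the global mean, user biases and item biases.

    ratings: list of (user, item, rating) triples.
    """
    mu = sum(r for _, _, r in ratings) / len(ratings)
    by_user, by_item = defaultdict(list), defaultdict(list)
    for u, i, r in ratings:
        by_user[u].append((i, r))
        by_item[i].append((u, r))
    b_u = {u: 0.0 for u in by_user}
    b_i = {i: 0.0 for i in by_item}
    for _ in range(n_epochs):
        # Update item biases with user biases held fixed, then the reverse.
        for i, rows in by_item.items():
            b_i[i] = sum(r - mu - b_u[u] for u, r in rows) / (reg_i + len(rows))
        for u, rows in by_user.items():
            b_u[u] = sum(r - mu - b_i[i] for i, r in rows) / (reg_u + len(rows))
    return mu, b_u, b_i


def predict(mu, b_u, b_i, user, item):
    """Baseline estimate; an unknown user or item contributes a bias of 0."""
    return mu + b_u.get(user, 0.0) + b_i.get(item, 0.0)


# Hypothetical ratings: item "x" is rated higher than item "y" by both users.
toy = [("a", "x", 5), ("b", "x", 4), ("a", "y", 2), ("b", "y", 1)]
mu, b_u, b_i = baseline_als(toy)
print(mu, b_i)
print(predict(mu, b_u, b_i, "new_user", "x"))
```

Larger regularization values shrink the biases toward zero, which matters here because most streamers and games have only a handful of ratings.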

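Now we split the data into train and test sets to produce the predictions for genres and games streamers would like. On the 1 to 5 rating scale, the cross-validated RMSE is about 0.87 for genres (SlopeOne) and about 0.71 for games (BaselineOnly). The game model scores better even though each genre pools several games and therefore has more observations per item; this most likely reflects the choice of algorithm, since BaselineOnly also beat SlopeOne on the genre data in the benchmark above.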

In [228]:
genre_trainset, genre_testset = train_test_split(genre_base_data, test_size=0.25)
genre_base_predictions = algo_genre_base.fit(genre_trainset).test(genre_testset)
accuracy.rmse(genre_base_predictions)

RMSE: 0.9014


0.9013714892958293

In [229]:
game_trainset, game_testset = train_test_split(game_base_data, test_size=0.25)
game_base_predictions = algo_games_base.fit(game_trainset).test(game_testset)
accuracy.rmse(game_base_predictions)

Estimating biases using als...
RMSE: 0.6484


0.6483606568201572

Saving the models and predictions to use in the predictor app:

In [230]:
import pickle

In [231]:
pickle.dump(algo_genre_base, open("Data/SlopeOne_genre_model.pkl", "wb" ) )
#algo_genre_base = pickle.load( open( "./Data/SlopeOne_genre_model.pkl", "rb" ) )

In [232]:
pickle.dump(algo_games_base, open("Data/BaselineOnly_game_model.pkl", "wb" ) )
#algo_games_base = pickle.load( open( "./Data/BaselineOnly_game_model.pkl", "rb" ) )

In [233]:
pickle.dump(genre_base_predictions, open("Data/SlopeOne_genre_model_predictions.pkl", "wb" ) )
#genre_base_predictions = pickle.load( open( "./Data/SlopeOne_genre_model_predictions.pkl", "rb" ) )

In [234]:
pickle.dump(game_base_predictions, open("Data/BaselineOnly_game_model_predictions.pkl", "wb" ) )
#game_base_predictions = pickle.load( open( "./Data/BaselineOnly_game_model_predictions.pkl", "rb" ) )

In [235]:
pickle.dump(genres_base_df, open("Data/genres.pkl", "wb" ) )
#genres_base_df = pickle.load( open( "./Data/genres.pkl", "rb" ) )

In [236]:
pickle.dump(games_base_df, open("Data/games.pkl", "wb" ) )
#games_base_df = pickle.load( open( "./Data/games.pkl", "rb" ) )
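
The `pickle.dump(obj, open(...))` pattern above leaves closing the file to the garbage collector. A possible alternative, sketched here with hypothetical helper names `save_pickle` and `load_pickle`, uses a context manager so the file is flushed and closed as soon as the object is written. The example writes to a temporary folder instead of the `Data` folder used in this workbook.

```python
import pickle
import tempfile
from pathlib import Path


def save_pickle(obj, path):
    """Serialize obj to path, creating parent folders and closing the file."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as fh:
        pickle.dump(obj, fh)


def load_pickle(path):
    """Load an object previously written by save_pickle."""
    with Path(path).open("rb") as fh:
        return pickle.load(fh)


# Round trip through a throwaway folder.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "Data" / "example.pkl"
    save_pickle({"genre": "MOBA", "est": 3.67}, target)
    restored = load_pickle(target)

print(restored)
```

Pickle files should only be loaded from a trusted source, since unpickling can execute arbitrary code.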

Taking the inputs from the user:

In [337]:
streamer_name = input('What is your streamer name? ')
streamer_genres = list(input ('Which game genres do you currently stream? ').split(', '))
streamer_games = list(input ('Which games do you currently stream? ').split(', '))

What is your streamer name? dvvgv
Which game genres do you currently stream? cdd
Which games do you currently stream? dsdsvv


In [338]:
streamer_name, streamer_genres, streamer_games

('dvvgv', ['cdd'], ['dsdsvv'])

Making a list of streamers' current genres and games by combining any information we already have in our dataset and their own inputs into the app:

In [339]:
genres_base_df.head()

Unnamed: 0,user_name,game_genres,mean
0,0monstro,"MMORPG,RPG",3.626866
1,1Gn0rance,"Action,RPG",3.512821
2,1ST3NM1,MMORPG,4.257028
3,24_Flash,Shooter,4.677054
4,24_Flash,"Shooter,FPS",4.31898


In [340]:
def display_current_genres(streamer_name):
    user_genres = list(genres_base_df[genres_base_df['user_name']==streamer_name]['game_genres'])
    return user_genres
recorder_genres_list = display_current_genres(streamer_name)
full_genres = set(recorder_genres_list + streamer_genres)
full_genres = list(full_genres)
full_genres

['cdd']

In [341]:
def display_current_games(streamer_name):
    user_games = list(games_base_df[games_base_df['user_name']==streamer_name]['game_name'])
    return user_games
recorder_games_list = display_current_games(streamer_name)
full_games = set(recorder_games_list + streamer_games)
full_games = list(full_games)
full_games

['dsdsvv']

####  Predicting Genres for Streamers (user-based similarities) ####

In [342]:
from collections import defaultdict

In [343]:
def get_top_n(predictions, n=10):
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the n highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n

Collecting the raw item ids for genres and games (the names Surprise sees in the data, as opposed to its internal integer ids) and removing those the streamer already streams. The results are all the genres and games the user has not had exposure to:

In [344]:
iids_genre = genres_base_df['game_genres'].unique()
iids_genre_to_predict = np.setdiff1d(iids_genre, full_genres, assume_unique = True)

In [345]:
iids_game = games_base_df['game_name'].unique()
iids_game_to_predict = np.setdiff1d(iids_game, full_games, assume_unique = True)
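
A short sketch of how `np.setdiff1d` behaves on toy input, since the `assume_unique` flag changes the order of the result. The `catalog` and `already_streamed` values are hypothetical.

```python
import numpy as np

catalog = np.array(["MOBA", "Shooter", "RPG", "IRL"])  # hypothetical item ids
already_streamed = ["RPG", "cdd"]                      # "cdd" is not in the catalog

# With assume_unique=True the original order of the first array is preserved.
kept_order = np.setdiff1d(catalog, already_streamed, assume_unique=True)
# Without it, the result is de-duplicated and sorted.
sorted_out = np.setdiff1d(catalog, already_streamed)

print(kept_order.tolist())  # ['MOBA', 'Shooter', 'IRL']
print(sorted_out.tolist())  # ['IRL', 'MOBA', 'Shooter']
```

An entry that does not match any item id exactly, such as a typo in the streamer's input, removes nothing from the candidate list.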

In [346]:
iids_genre_to_predict

array(['MMORPG,RPG', 'Action,RPG', 'MMORPG', 'Shooter', 'Shooter,FPS',
       'Card & Board Game', 'FPS,Shooter', 'Action,Shooter', 'IRL',
       'RPG,Action', 'Sports Game', 'Platformer', 'IRL,Creative',
       'Simulation,Action', 'MOBA', 'FPS,Shooter,MOBA', 'Simulation,FPS',
       'Adventure Game,RPG', 'Driving/Racing Game', 'FPS,Shooter,RPG',
       'RPG', 'RPG,Shooter', 'Action,Horror', 'Strategy,Autobattler',
       'RPG,Simulation', 'Gambling Game', 'Strategy,Simulation',
       'Action,Adventure Game', 'Action,Simulation', 'Simulation',
       'RPG,Mobile Game', 'NONE', 'Creative', 'FPS,Shooter,Horror',
       'Adventure Game', 'RTS,Strategy', 'Fighting,Platformer',
       'Adventure Game,Action', 'Simulation,Puzzle', 'Strategy,RTS',
       'Action', 'Shooter,Horror', 'Simulation,RPG', 'MMORPG,Stealth',
       'Strategy', 'Sports Game,Simulation', 'Action,Open World',
       'Horror,Indie Game', 'Shooter,Strategy', 'RPG,Indie Game',
       'Adventure Game,Horror', 'Creative,IR

In [347]:
iids_game_to_predict

array(['Last Oasis', 'Path of Exile', 'Knight Online', 'Fortnite',
       'VALORANT', 'Hearthstone', 'Poker', 'Apex Legends',
       'Final Fantasy VII Remake', 'Just Chatting', 'Resident Evil 6',
       'The Witcher 3: Wild Hunt', 'FIFA 20', 'Super Mario Maker 2',
       'Music & Performing Arts', 'World of Tanks', 'League of Legends',
       'World of Warcraft', 'Old School RuneScape', 'Escape From Tarkov',
       'Overwatch', 'Tibia', 'Deadside', 'Grand Theft Auto V',
       'Destiny 2', 'Call of Duty: Modern Warfare',
       'Pokémon Sword/Shield', 'Fallout 76', 'Dead by Daylight',
       'Counter-Strike: Global Offensive', 'Teamfight Tactics', 'DayZ',
       'Retro', 'Dota 2', 'Talk Shows & Podcasts', 'Slots', 'RimWorld',
       'Mount & Blade II: Bannerlord', 'World of Warships',
       'Rocket League', 'Magic: The Gathering', 'Sea of Thieves',
       'X4: Foundations', 'ASMR', 'Summoners War: Sky Arena',
       'Resident Evil 3', 'Drug Dealer Simulator', 'iRacing.com',
       'H

Making a personal testset for the user. Since the actual ratings are not known and we are trying to predict the expected rating, each entry carries a placeholder true rating of 0:

In [348]:
genre_testset_personal = [[streamer_name, iid, 0.] for iid in iids_genre_to_predict]
game_testset_personal = [[streamer_name, iid, 0.] for iid in iids_game_to_predict]

Producing a list of predictions based on the inputs:

In [349]:
personal_genre_predictions = algo_genre_base.test(genre_testset_personal)
personal_game_predictions = algo_games_base.test(game_testset_personal)

In [350]:
personal_genre_df = pd.DataFrame(personal_genre_predictions, columns=['uid', 'iid', 'rui', 'est', 'details'])   
personal_genre_df

Unnamed: 0,uid,iid,rui,est,details
0,dvvgv,"MMORPG,RPG",0.0,3.671783,"{'was_impossible': True, 'reason': 'User and/o..."
1,dvvgv,"Action,RPG",0.0,3.671783,"{'was_impossible': True, 'reason': 'User and/o..."
2,dvvgv,MMORPG,0.0,3.671783,"{'was_impossible': True, 'reason': 'User and/o..."
3,dvvgv,Shooter,0.0,3.671783,"{'was_impossible': True, 'reason': 'User and/o..."
4,dvvgv,"Shooter,FPS",0.0,3.671783,"{'was_impossible': True, 'reason': 'User and/o..."
5,dvvgv,Card & Board Game,0.0,3.671783,"{'was_impossible': True, 'reason': 'User and/o..."
6,dvvgv,"FPS,Shooter",0.0,3.671783,"{'was_impossible': True, 'reason': 'User and/o..."
7,dvvgv,"Action,Shooter",0.0,3.671783,"{'was_impossible': True, 'reason': 'User and/o..."
8,dvvgv,IRL,0.0,3.671783,"{'was_impossible': True, 'reason': 'User and/o..."
9,dvvgv,"RPG,Action",0.0,3.671783,"{'was_impossible': True, 'reason': 'User and/o..."
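
Every estimate in the table above is the same value and is flagged `was_impossible: True`, which suggests that the model had nothing to go on for this streamer and returned a default estimate instead. One possible way to keep such rows out of a ranking is sketched below. The `Prediction` namedtuple is a local stand-in with the same five fields as the tuples shown above, and `rank_possible` and the sample values are hypothetical.

```python
from collections import namedtuple

# Stand-in with the same field order as the columns uid, iid, rui, est, details above.
Prediction = namedtuple("Prediction", ["uid", "iid", "r_ui", "est", "details"])


def rank_possible(predictions, n=10):
    """Return the n highest (iid, est) pairs, skipping fallback estimates."""
    usable = [p for p in predictions
              if not p.details.get("was_impossible", False)]
    usable.sort(key=lambda p: p.est, reverse=True)
    return [(p.iid, p.est) for p in usable[:n]]


preds = [
    Prediction("dvvgv", "MOBA", 0.0, 3.671783, {"was_impossible": True}),
    Prediction("known_user", "Shooter", 0.0, 4.2, {"was_impossible": False}),
    Prediction("known_user", "RPG", 0.0, 4.6, {"was_impossible": False}),
]
print(rank_possible(preds))  # [('RPG', 4.6), ('Shooter', 4.2)]
```

If the filtered list comes back empty, the app could fall back to overall popularity rather than presenting identical scores as a ranking.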


In [294]:
personal_genre_pred = personal_genre_df[['iid', 'est']]
personal_genre_pred

Unnamed: 0,iid,est
0,"MMORPG,RPG",3.671783
1,"Action,RPG",3.671783
2,MMORPG,3.671783
3,Shooter,3.671783
4,Card & Board Game,3.671783
5,"FPS,Shooter",3.671783
6,"Action,Shooter",3.671783
7,"RPG,Action",3.671783
8,Sports Game,3.671783
9,Platformer,3.671783


In [295]:
personal_game_df = pd.DataFrame(personal_game_predictions, columns=['uid', 'iid', 'rui', 'est', 'details'])    
personal_game_df

Unnamed: 0,uid,iid,rui,est,details
0,Anomaly,Last Oasis,0.0,3.713036,{'was_impossible': False}
1,Anomaly,Path of Exile,0.0,3.896992,{'was_impossible': False}
2,Anomaly,Knight Online,0.0,3.822232,{'was_impossible': False}
3,Anomaly,Hearthstone,0.0,3.794633,{'was_impossible': False}
4,Anomaly,Poker,0.0,3.690705,{'was_impossible': False}
...,...,...,...,...,...
92,Anomaly,The Legend of Zelda: Breath of the Wild,0.0,3.692920,{'was_impossible': False}
93,Anomaly,Griftlands,0.0,3.382086,{'was_impossible': False}
94,Anomaly,Outlast,0.0,3.845196,{'was_impossible': False}
95,Anomaly,Bloodborne,0.0,3.284652,{'was_impossible': False}


In [351]:
personal_game_pred = personal_game_df[['iid', 'est']]

In [352]:
top_n_genres = get_top_n(personal_genre_predictions)
top_n_games = get_top_n(personal_game_predictions)

In [353]:
top_n_genres

defaultdict(list,
            {'dvvgv': [('MMORPG,RPG', 3.6717834370828575),
              ('Action,RPG', 3.6717834370828575),
              ('MMORPG', 3.6717834370828575),
              ('Shooter', 3.6717834370828575),
              ('Shooter,FPS', 3.6717834370828575),
              ('Card & Board Game', 3.6717834370828575),
              ('FPS,Shooter', 3.6717834370828575),
              ('Action,Shooter', 3.6717834370828575),
              ('IRL', 3.6717834370828575),
              ('RPG,Action', 3.6717834370828575)]})

In [354]:
top_n_games

defaultdict(list,
            {'dvvgv': [('Animal Crossing: New Horizons', 3.97842757683727),
              ('Grand Theft Auto V', 3.9771313763459273),
              ('Green Hell', 3.9424081820698818),
              ('Destiny 2', 3.9379344729177483),
              ('Marbles On Stream', 3.8728006601004243),
              ('The Jackbox Party Pack 3', 3.852383406558277),
              ('Path of Exile', 3.8466129478023663),
              ('Poly Bridge', 3.834971789360166),
              ('H1Z1', 3.83318371818989),
              ('Heroes of the Storm', 3.8324781378772257)]})

In [357]:
for uid, user_ratings in top_n_genres.items():
    print('For ' + uid + ', the recommended genres are:'+ str([iid for (iid, _) in user_ratings]))
genre_user_based_list = [iid for (iid, _) in top_n_genres[streamer_name]]

For dvvgv, the recommended genres are:['MMORPG,RPG', 'Action,RPG', 'MMORPG', 'Shooter', 'Shooter,FPS', 'Card & Board Game', 'FPS,Shooter', 'Action,Shooter', 'IRL', 'RPG,Action']


In [358]:
for uid, user_ratings in top_n_games.items():
    print('For ' + uid + ', the recommended games are:'+ str([iid for (iid, _) in user_ratings]))
game_user_based_list = [iid for (iid, _) in top_n_games[streamer_name]]

For dvvgv, the recommended games are:['Animal Crossing: New Horizons', 'Grand Theft Auto V', 'Green Hell', 'Destiny 2', 'Marbles On Stream', 'The Jackbox Party Pack 3', 'Path of Exile', 'Poly Bridge', 'H1Z1', 'Heroes of the Storm']


## Plot for predictions

In [259]:
genre_user_based_list

['MMORPG,RPG',
 'Action,RPG',
 'MMORPG',
 'Shooter',
 'Shooter,FPS',
 'Card & Board Game',
 'FPS,Shooter',
 'Action,Shooter',
 'RPG,Action',
 'Sports Game']

In [260]:
recommendation = pd.DataFrame(genre_user_based_list,columns=['recommended_genre'])
recommendation

Unnamed: 0,recommended_genre
0,"MMORPG,RPG"
1,"Action,RPG"
2,MMORPG
3,Shooter
4,"Shooter,FPS"
5,Card & Board Game
6,"FPS,Shooter"
7,"Action,Shooter"
8,"RPG,Action"
9,Sports Game


In [261]:
recommendation['recommended_genre'].unique()

array(['MMORPG,RPG', 'Action,RPG', 'MMORPG', 'Shooter', 'Shooter,FPS',
       'Card & Board Game', 'FPS,Shooter', 'Action,Shooter', 'RPG,Action',
       'Sports Game'], dtype=object)

In [262]:
from itertools import chain
game_genre_initial = recommendation['recommended_genre'].map(lambda x: x.split(',')).values.tolist()
all_game_genre = list(chain(*game_genre_initial))
all_game_genre

['MMORPG',
 'RPG',
 'Action',
 'RPG',
 'MMORPG',
 'Shooter',
 'Shooter',
 'FPS',
 'Card & Board Game',
 'FPS',
 'Shooter',
 'Action',
 'Shooter',
 'RPG',
 'Action',
 'Sports Game']

In [263]:
from collections import Counter
count_recommended_genre = Counter(all_game_genre)
count_recommended_genre

Counter({'MMORPG': 2,
         'RPG': 3,
         'Action': 3,
         'Shooter': 4,
         'FPS': 2,
         'Card & Board Game': 1,
         'Sports Game': 1})

In [264]:
count_recommended_genre_df = pd.DataFrame.from_dict(count_recommended_genre, orient='index').reset_index()
count_recommended_genre_df = count_recommended_genre_df.rename(columns={'index':'genres_recommended', 0:'count'})

In [265]:
count_recommended_genre_df

Unnamed: 0,genres_recommended,count
0,MMORPG,2
1,RPG,3
2,Action,3
3,Shooter,4
4,FPS,2
5,Card & Board Game,1
6,Sports Game,1


In [266]:
import plotly.graph_objects as go
import plotly.express as px
fig_genre = px.pie(count_recommended_genre_df,values='count',names='genres_recommended')
fig_genre.show()