In this workbook I cover all aspects of analysing our Twitch data using the Surprise package. More information about Surprise can be found here: https://surprise.readthedocs.io/en/stable/index.html

We start by taking the grid from our EDA Surprise Funnel Workbook and preparing it for Surprise. The package requires three fields: User, Item, and Rating. Usually, a rating is an evaluation given by the user herself (I can rate a movie I watched or a book I read 4 out of 5 stars), but in our case the rating is a custom success metric that measures how successful a certain game or genre of content has been for a particular streamer. Each streamer has one 5-rated game, and any other games are scored relative to that core content. This avoids the imbalance caused by the few streamers with millions of views, who would otherwise skew the results heavily.

In [1]:
import pandas as pd
import numpy as np

In [2]:
grid = pd.read_csv('final_game_user_grid.csv')

In [3]:
grid.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70545 entries, 0 to 70544
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   user_name     70545 non-null  object
 1   user_id       70545 non-null  int64 
 2   game_id       70545 non-null  int64 
 3   game_name     70545 non-null  object
 4   game_genres   70351 non-null  object
 5   language      70543 non-null  object
 6   started_at    70545 non-null  object
 7   viewer_count  70545 non-null  int64 
 8   max           70545 non-null  int64 
dtypes: int64(4), object(5)
memory usage: 4.8+ MB


In [4]:
grid = grid.dropna()
grid.shape

(70349, 9)

In [5]:
grid = grid.dropna(how='any', axis=0)  # same as the dropna() defaults used above, so nothing further is dropped
grid.shape

(70349, 9)

In [6]:
grid.head()

Unnamed: 0,user_name,user_id,game_id,game_name,game_genres,language,started_at,viewer_count,max
0,龜狗,48093884,21779,League of Legends,{MOBA},zh,2020-04-19T04:18:22Z,1931,1931
1,黒田瑞貴,225658233,511224,Apex Legends,"{FPS,Shooter}",ja,2020-04-20T12:19:02Z,165,165
2,黑色柳丁,18156459,18122,World of Warcraft,"{MMORPG,RPG}",zh,2020-04-20T10:53:28Z,102,102
3,黑色柳丁,18156459,18122,World of Warcraft,"{MMORPG,RPG}",zh,2020-04-20T10:53:28Z,74,74
4,黑色柳丁,18156459,18122,World of Warcraft,"{MMORPG,RPG}",zh,2020-04-16T11:29:02Z,71,71


To begin, we're calculating the metric we'll be using to compare users and games. 

Since we're focused on the streamers, we don't have the traditional "I like this movie so I will rate it a 4.5 out of 5.0" ratings. 

Instead we're calculating 
>>how many viewers each streamer attracted with a particular game compared to their max viewer potential over the week

For each user, one game is their ultimate streaming "5 out of 5" benchmark, and all other games they play are compared to that one, and normalized to ratings between 1 and 5. We are also using all the genres for each game to pinpoint how successful various genres and games have been for each streamer.

In [7]:
max_value_username = pd.DataFrame(grid.groupby('user_name')['max'].max().reset_index())
max_value_username.head()

Unnamed: 0,user_name,max
0,0011002200,0
1,00PixieDust,6
2,00carlos03,1
3,00에스프레쏘00,27
4,02theliveR,65


In [8]:
max_val_dict = max_value_username.groupby('user_name')['max'].apply(list).to_dict()

In [9]:
grid['max_game'] = grid['user_name'].map(max_val_dict)
grid.head()

Unnamed: 0,user_name,user_id,game_id,game_name,game_genres,language,started_at,viewer_count,max,max_game
0,龜狗,48093884,21779,League of Legends,{MOBA},zh,2020-04-19T04:18:22Z,1931,1931,[1931]
1,黒田瑞貴,225658233,511224,Apex Legends,"{FPS,Shooter}",ja,2020-04-20T12:19:02Z,165,165,[165]
2,黑色柳丁,18156459,18122,World of Warcraft,"{MMORPG,RPG}",zh,2020-04-20T10:53:28Z,102,102,[102]
3,黑色柳丁,18156459,18122,World of Warcraft,"{MMORPG,RPG}",zh,2020-04-20T10:53:28Z,74,74,[102]
4,黑色柳丁,18156459,18122,World of Warcraft,"{MMORPG,RPG}",zh,2020-04-16T11:29:02Z,71,71,[102]


In [10]:
grid['max_game_int'] = grid.max_game.str[0].astype(int)
grid = grid.drop('max_game', axis = 1)
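
Cells 7 to 10 build a per-user lookup and map it back onto the grid. As a hedged alternative, the same column could likely be produced in one step with `groupby(...).transform('max')`. The toy frame below is made up for illustration and only borrows the column names from our grid.

```python
import pandas as pd

# Toy data (illustrative values, not taken from the real grid)
toy = pd.DataFrame({
    'user_name': ['a', 'a', 'a', 'b'],
    'max': [102, 74, 71, 165],
})

# transform('max') broadcasts each user's maximum back onto every one of their rows
toy['max_game_int'] = toy.groupby('user_name')['max'].transform('max')
print(toy)
```

This avoids the intermediate list-valued `max_game` column and the `.str[0]` unpacking.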

In [11]:
grid.sample(10)

Unnamed: 0,user_name,user_id,game_id,game_name,game_genres,language,started_at,viewer_count,max,max_game_int
38244,KellFLuz,136422709,24241,FINAL FANTASY XIV Online,{MMORPG},pt,2020-04-19T19:01:37Z,28,28,28
21404,protaxr6,188687033,460630,Tom Clancy's Rainbow Six: Siege,"{FPS,Shooter}",es,2020-04-19T23:34:32Z,35,35,35
43769,ibai,83232866,73586,Outlast,"{""Adventure Game"",Horror}",es,2020-04-19T17:52:37Z,18941,18941,22907
25092,offpc,56758194,515448,Resident Evil 3,"{Action,""Adventure Game""}",zh,2020-04-20T02:14:13Z,11,11,16
12766,sporkerific,43788773,271304,7 Days to Die,"{Simulation,FPS}",en,2020-04-19T17:29:18Z,11,11,11
5332,Venalis,61292850,512969,Last Oasis,"{MMORPG,RPG}",en,2020-04-19T20:28:18Z,675,675,699
12414,Starlordzz,103589936,506468,Nioh 2,"{Action,RPG}",en,2020-04-18T22:44:09Z,0,0,0
42296,iTehJambajoe,47647062,513319,NBA 2K20,"{""Sports Game"",Simulation}",en,2020-04-19T17:27:23Z,30,30,30
26870,Nerdy_Senpai,410049739,509538,Animal Crossing: New Horizons,{Simulation},en,2020-04-19T18:14:24Z,35,35,35
49401,FlyestRaven,97685994,32959,Heroes of the Storm,{MOBA},en,2020-04-20T00:44:27Z,1,1,2


In [12]:
grid['score'] = grid['max']/grid['max_game_int']
grid.head()

Unnamed: 0,user_name,user_id,game_id,game_name,game_genres,language,started_at,viewer_count,max,max_game_int,score
0,龜狗,48093884,21779,League of Legends,{MOBA},zh,2020-04-19T04:18:22Z,1931,1931,1931,1.0
1,黒田瑞貴,225658233,511224,Apex Legends,"{FPS,Shooter}",ja,2020-04-20T12:19:02Z,165,165,165,1.0
2,黑色柳丁,18156459,18122,World of Warcraft,"{MMORPG,RPG}",zh,2020-04-20T10:53:28Z,102,102,102,1.0
3,黑色柳丁,18156459,18122,World of Warcraft,"{MMORPG,RPG}",zh,2020-04-20T10:53:28Z,74,74,102,0.72549
4,黑色柳丁,18156459,18122,World of Warcraft,"{MMORPG,RPG}",zh,2020-04-16T11:29:02Z,71,71,102,0.696078


In [13]:
from sklearn.preprocessing import minmax_scale
grid['scaled_score'] = minmax_scale(grid['score'], feature_range=(1, 5))

In [14]:
grid.tail()

Unnamed: 0,user_name,user_id,game_id,game_name,game_genres,language,started_at,viewer_count,max,max_game_int,score,scaled_score
70540,_문님,232779448,493057,PLAYERUNKNOWN'S BATTLEGROUNDS,"{Shooter,FPS}",ko,2020-04-20T16:02:01Z,15,15,35,0.428571,2.714286
70541,_문님,232779448,493057,PLAYERUNKNOWN'S BATTLEGROUNDS,"{Shooter,FPS}",ko,2020-04-19T16:09:19Z,14,14,35,0.4,2.6
70542,__수현,423848194,493057,PLAYERUNKNOWN'S BATTLEGROUNDS,"{Shooter,FPS}",ko,2020-04-19T14:18:18Z,18,18,18,1.0,5.0
70543,__수현,423848194,493057,PLAYERUNKNOWN'S BATTLEGROUNDS,"{Shooter,FPS}",ko,2020-04-19T14:18:18Z,16,16,18,0.888889,4.555556
70544,__수현,423848194,493057,PLAYERUNKNOWN'S BATTLEGROUNDS,"{Shooter,FPS}",ko,2020-04-19T14:18:18Z,15,15,18,0.833333,4.333333
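
Because `minmax_scale` is applied to the whole `score` column at once, every row is rescaled with the same global minimum and maximum. The sketch below reproduces that arithmetic with NumPy on made-up values. It assumes, as the output above suggests, that the raw scores span 0 to 1, and it shows that a streamer whose weekly maximum is 0 yields 0/0 = NaN, which is why the later `dropna()` still removes rows.

```python
import numpy as np

def scale_to_range(scores, low=1.0, high=5.0):
    """Min-max scale, ignoring NaNs when finding the minimum and maximum."""
    scores = np.asarray(scores, dtype=float)
    s_min, s_max = np.nanmin(scores), np.nanmax(scores)
    return low + (scores - s_min) * (high - low) / (s_max - s_min)

# Illustrative values only
viewers = np.array([15.0, 35.0, 0.0, 0.0])
user_max = np.array([35.0, 35.0, 35.0, 0.0])  # the last streamer never had a viewer

with np.errstate(invalid='ignore'):            # silence the 0/0 warning
    score = viewers / user_max                 # the last entry becomes NaN

scaled = scale_to_range(score)
print(scaled)
```

With scores spanning 0 to 1, the scaling reduces to `1 + 4 * score`, which matches rows such as 0.428571 mapping to 2.714286.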


In [15]:
#grid.to_csv('final_rating_game.csv', index = False)

In [16]:
grid = grid.dropna()

Now we have a listing for each user pairing them with each game they play, what genre it belongs to, and how many people watched them play each game compared to the max viewers they ever got for a stream during the week we examined.

In [17]:
grid.groupby('user_name')['scaled_score'].count().reset_index().sort_values('scaled_score', ascending=False)[:5]

Unnamed: 0,user_name,scaled_score
30841,gaules,9
7888,FroggedTV,9
16826,OgamingLoL,9
15146,Monstercat,9
21060,SlotRoom247,9


In [18]:
grid.groupby('game_genres')['scaled_score'].count().reset_index().sort_values('scaled_score', ascending=False)[:5]

Unnamed: 0,game_genres,scaled_score
12,"{Action,""Adventure Game""}",5311
29,"{FPS,Shooter}",3604
4,"{""Card & Board Game""}",3566
40,{MOBA},3372
17,"{Action,RPG}",3352


In [19]:
grid.groupby('game_name')['scaled_score'].count().reset_index().sort_values('scaled_score', ascending=False)[:5]

Unnamed: 0,game_name,scaled_score
54,Fortnite,900
46,FIFA 20,898
63,Grand Theft Auto V,898
91,Minecraft,895
169,VALORANT,895


In [20]:
min_number_scores = 5
filter_users = grid['user_name'].value_counts() > min_number_scores
filter_users = filter_users[filter_users].index.tolist()

In [21]:
grid_new = grid[(grid['user_name'].isin(filter_users))]
print('The original data frame shape:\t{}'.format(grid.shape))
print('The new data frame shape:\t{}'.format(grid_new.shape))

The original data frame shape:	(69428, 12)
The new data frame shape:	(4699, 12)


After reshaping the grid, we are going to extract the recommendations for game genre, game titles, and games similar to those already rated by the streamers as a three-pronged recommender approach.

Preparing for genre recommendations based on viewership scores:

In [22]:
genres_base_df = grid_new.groupby(by = ['user_name', 'game_genres'])['scaled_score'].agg([np.mean])
games_base_df = grid_new.groupby(by = ['user_name', 'game_name'])['scaled_score'].agg([np.mean])

In [23]:
genres_base_df.head(5)

Unnamed: 0_level_0,Unnamed: 1_level_0,mean
user_name,game_genres,Unnamed: 2_level_1
0monstro,"{MMORPG,RPG}",3.626866
1Gn0rance,"{Action,RPG}",3.512821
1ST3NM1,{MMORPG},4.257028
24_Flash,"{Shooter,FPS}",4.31898
24_Flash,{Shooter},4.677054


In [24]:
games_base_df.head(5)

Unnamed: 0_level_0,Unnamed: 1_level_0,mean
user_name,game_name,Unnamed: 2_level_1
0monstro,Last Oasis,3.626866
1Gn0rance,Path of Exile,3.512821
1ST3NM1,Knight Online,4.257028
24_Flash,Fortnite,4.677054
24_Flash,VALORANT,4.31898


In [25]:
genres_base_df.columns = genres_base_df.columns.map(''.join)
games_base_df.columns = games_base_df.columns.map(''.join)

In [26]:
genres_base_df = genres_base_df.reset_index()
games_base_df = games_base_df.reset_index()
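
Cells 22, 25 and 26 together produce a flat three-column frame. A possibly tidier equivalent, sketched here on made-up data, is pandas named aggregation, which names the output column directly and avoids passing `np.mean` in a list (recent pandas versions may warn about that form).

```python
import pandas as pd

# Toy data (illustrative values, not taken from the real grid)
toy = pd.DataFrame({
    'user_name': ['24_Flash', '24_Flash', '24_Flash', '0monstro'],
    'game_name': ['Fortnite', 'Fortnite', 'VALORANT', 'Last Oasis'],
    'scaled_score': [5.0, 4.0, 4.5, 3.0],
})

# Named aggregation: output column 'mean' is the mean of 'scaled_score'
games_toy = (
    toy.groupby(['user_name', 'game_name'])
       .agg(mean=('scaled_score', 'mean'))
       .reset_index()
)
print(games_toy)
```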

In [27]:
genres_base_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 861 entries, 0 to 860
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   user_name    861 non-null    object 
 1   game_genres  861 non-null    object 
 2   mean         861 non-null    float64
dtypes: float64(1), object(2)
memory usage: 20.3+ KB


In [28]:
games_base_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 868 entries, 0 to 867
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   user_name  868 non-null    object 
 1   game_name  868 non-null    object 
 2   mean       868 non-null    float64
dtypes: float64(1), object(2)
memory usage: 20.5+ KB


In [29]:
genres_base_df.head(5)

Unnamed: 0,user_name,game_genres,mean
0,0monstro,"{MMORPG,RPG}",3.626866
1,1Gn0rance,"{Action,RPG}",3.512821
2,1ST3NM1,{MMORPG},4.257028
3,24_Flash,"{Shooter,FPS}",4.31898
4,24_Flash,{Shooter},4.677054


In [30]:
games_base_df.head(5)

Unnamed: 0,user_name,game_name,mean
0,0monstro,Last Oasis,3.626866
1,1Gn0rance,Path of Exile,3.512821
2,1ST3NM1,Knight Online,4.257028
3,24_Flash,Fortnite,4.677054
4,24_Flash,VALORANT,4.31898


Using Surprise to predict genres/games for a streamer based on their existing games and genres ratings

In [31]:
import surprise
from surprise import Dataset, accuracy, Reader, NMF, NormalPredictor, BaselineOnly, CoClustering, SlopeOne, SVD, KNNBaseline
from surprise.model_selection import GridSearchCV, cross_validate, train_test_split

In [32]:
reader = Reader(rating_scale=(1, 5))
genre_base_data = Dataset.load_from_df(genres_base_df[['user_name', 'game_genres', 'mean']], reader)

In [33]:
game_base_data = Dataset.load_from_df(games_base_df[['user_name', 'game_name', 'mean']], reader)

Now it is time to select the algorithms we will use to generate recommendations for our streamers based on their existing behavior and also on how similar the games they already stream are to those they do not yet stream. 

The first step is to evaluate various algorithms available in Surprise to see which produce the lowest RMSE and are computationally friendly so we can use them in a dynamic app and recommend things on the fly. We perform cross validation on all fast algorithms to see which performs best with either of our datasets, genres and games.
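
For reference, RMSE is the square root of the mean squared difference between predicted and actual ratings, so on our 1 to 5 scale an RMSE near 1.0 means predictions are typically off by something on the order of one rating point. A minimal sketch with invented numbers:

```python
import math

def rmse(pairs):
    """Root mean squared error over (actual, predicted) pairs."""
    return math.sqrt(sum((a - p) ** 2 for a, p in pairs) / len(pairs))

# Invented (actual, predicted) ratings
pairs = [(5.0, 4.0), (3.0, 3.0), (2.0, 4.0), (4.0, 3.0)]
print(round(rmse(pairs), 4))
```

Squaring penalises large misses more than small ones, which is why a single bad prediction can move the score noticeably on a small dataset like ours.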

In [34]:
benchmark = []
# Iterate over all algorithms
for algorithm in [SlopeOne(), NormalPredictor(), BaselineOnly(), NMF(), CoClustering(), SVD()]:


    # Perform cross validation
    results = cross_validate(algorithm, genre_base_data, measures=['RMSE'], return_train_measures=True, cv=5, verbose=True)
    
    # Get results & append algorithm name
    tmp = pd.DataFrame.from_dict(results).mean(axis=0)
    # Series.append was removed in pandas 2.0; pd.concat gives the same result
    tmp = pd.concat([tmp, pd.Series([type(algorithm).__name__], index=['Algorithm'])])
    benchmark.append(tmp)
    
pd.DataFrame(benchmark).set_index('Algorithm').sort_values('test_rmse')  

Evaluating RMSE of algorithm SlopeOne on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8439  0.9232  0.8707  0.8728  0.8478  0.8717  0.0283  
RMSE (trainset)   0.2146  0.1598  0.2039  0.2077  0.2243  0.2020  0.0223  
Fit time          0.01    0.01    0.01    0.01    0.00    0.01    0.00    
Test time         0.00    0.00    0.00    0.00    0.00    0.00    0.00    
Evaluating RMSE of algorithm NormalPredictor on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9329  1.0114  0.9660  0.9430  1.0887  0.9884  0.0570  
RMSE (trainset)   0.9804  0.9622  1.0432  1.0006  1.0049  0.9983  0.0272  
Fit time          0.00    0.00    0.00    0.00    0.00    0.00    0.00    
Test time         0.00    0.00    0.00    0.00    0.00    0.00    0.00    
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating 

Unnamed: 0_level_0,test_rmse,train_rmse,fit_time,test_time
Algorithm,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
SVD,0.709333,0.511888,0.037618,0.000911
BaselineOnly,0.711147,0.643543,0.001755,0.000679
NMF,0.852828,0.073837,0.063704,0.001463
SlopeOne,0.871671,0.202041,0.005837,0.001022
CoClustering,0.912478,0.69328,0.083411,0.000914
NormalPredictor,0.988406,0.998263,0.00075,0.001066


For genre data, SVD and BaselineOnly produce the lowest test RMSE (about 0.71), while SlopeOne comes in at about 0.87 with a much lower train RMSE (about 0.20). We nevertheless proceed with SlopeOne for genres. The Slope One family of easily implemented item-based, rating-based collaborative filtering algorithms was proposed to drastically reduce overfitting, improve performance and ease implementation. Essentially, instead of using linear regression from one item's ratings to another item's ratings ($f(x) = ax + b$), it uses a simpler form of regression with a single free parameter ($f(x) = x + b$). The free parameter is then simply the average difference between the two items' ratings. It was shown to be much more accurate than linear regression in some instances, and it takes half the storage or less.
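
To make the single-parameter idea concrete, here is a minimal pure-Python sketch of the basic (unweighted) Slope One scheme on invented ratings. It is an illustration only and is not Surprise's `SlopeOne` implementation, which may differ in details; `slope_one_predict` is a hypothetical helper name.

```python
def slope_one_predict(ratings, user, target):
    """Predict `user`'s rating of `target` with basic (unweighted) Slope One."""
    estimates = []
    for other, r_other in ratings[user].items():
        if other == target:
            continue
        # b = average of (target - other) over users who rated both items
        diffs = [r[target] - r[other] for r in ratings.values()
                 if target in r and other in r]
        if diffs:
            b = sum(diffs) / len(diffs)
            estimates.append(r_other + b)   # f(x) = x + b
    return sum(estimates) / len(estimates) if estimates else None

# Invented ratings: user -> {genre: rating}
ratings = {
    'a': {'MOBA': 5.0, 'FPS': 3.0},
    'b': {'MOBA': 4.0, 'FPS': 2.0, 'RPG': 3.0},
    'c': {'FPS': 2.0, 'RPG': 3.5},
}
prediction = slope_one_predict(ratings, 'c', 'MOBA')
print(prediction)
```

Here MOBA is rated on average 2 points above FPS and 1 point above RPG, so streamer `c` gets estimates of 2.0 + 2 and 3.5 + 1, which average to 4.25.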

In [35]:
benchmark = []
# Iterate over all algorithms
for algorithm in [SlopeOne(), NormalPredictor(), BaselineOnly(), NMF(), CoClustering(), SVD()]:


    # Perform cross validation
    results = cross_validate(algorithm, game_base_data, measures=['RMSE'], return_train_measures=True, cv=5, verbose=True)
    
    # Get results & append algorithm name
    tmp = pd.DataFrame.from_dict(results).mean(axis=0)
    # Series.append was removed in pandas 2.0; pd.concat gives the same result
    tmp = pd.concat([tmp, pd.Series([type(algorithm).__name__], index=['Algorithm'])])
    benchmark.append(tmp)
    
pd.DataFrame(benchmark).set_index('Algorithm').sort_values('test_rmse')  

Evaluating RMSE of algorithm SlopeOne on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8892  0.8842  0.9468  0.8391  0.8751  0.8869  0.0347  
RMSE (trainset)   0.1764  0.1707  0.1540  0.1779  0.1699  0.1698  0.0085  
Fit time          0.01    0.01    0.01    0.00    0.00    0.01    0.00    
Test time         0.00    0.00    0.00    0.00    0.00    0.00    0.00    
Evaluating RMSE of algorithm NormalPredictor on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9593  1.0730  1.0387  1.0115  1.0276  1.0220  0.0373  
RMSE (trainset)   1.0234  0.9975  1.0356  0.9961  1.0327  1.0171  0.0170  
Fit time          0.00    0.00    0.00    0.00    0.00    0.00    0.00    
Test time         0.00    0.00    0.00    0.00    0.00    0.00    0.00    
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating 

Unnamed: 0_level_0,test_rmse,train_rmse,fit_time,test_time
Algorithm,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
BaselineOnly,0.713877,0.639823,0.001944,0.000734
SVD,0.716651,0.506276,0.038151,0.000942
NMF,0.863259,0.076103,0.076985,0.00096
SlopeOne,0.886892,0.169779,0.005556,0.001083
CoClustering,0.905517,0.720214,0.078422,0.000743
NormalPredictor,1.022018,1.017081,0.000806,0.001017


For the game data, we are going to use the BaselineOnly algorithm which produces the lowest test RMSE.

We will proceed with SlopeOne for genre predictions, even though SVD and BaselineOnly scored a lower test RMSE on the genre data, and with BaselineOnly for game predictions. More information about SlopeOne can be found here: https://arxiv.org/abs/cs/0702144

Next we perform cross-validation to see how our chosen algorithms perform:

In [36]:
algo_genre_base = SlopeOne()
cross_validate(algo_genre_base, genre_base_data, measures=['RMSE'], cv=7, verbose=True)

Evaluating RMSE of algorithm SlopeOne on 7 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Fold 6  Fold 7  Mean    Std     
RMSE (testset)    0.9348  0.8509  0.8793  0.7670  0.9153  0.8575  0.8582  0.8662  0.0500  
Fit time          0.01    0.01    0.01    0.01    0.01    0.01    0.01    0.01    0.00    
Test time         0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    


{'test_rmse': array([0.93483987, 0.85094237, 0.87928626, 0.7669717 , 0.91533467,
        0.85750808, 0.85819518]),
 'fit_time': (0.005942106246948242,
  0.0053539276123046875,
  0.00918889045715332,
  0.005586862564086914,
  0.0057599544525146484,
  0.006255149841308594,
  0.005381107330322266),
 'test_time': (0.0011148452758789062,
  0.0009427070617675781,
  0.0011761188507080078,
  0.0008409023284912109,
  0.0018880367279052734,
  0.0007379055023193359,
  0.0010061264038085938)}

In [37]:
bsl_options = {'method': 'als',
               'n_epochs': 10,
               'reg_u': 12,
               'reg_i': 5
               }
algo_games_base = BaselineOnly(bsl_options=bsl_options)
cross_validate(algo_games_base, game_base_data, measures=['RMSE'], cv=7, verbose=True)

Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Evaluating RMSE of algorithm BaselineOnly on 7 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Fold 6  Fold 7  Mean    Std     
RMSE (testset)    0.7261  0.6721  0.7078  0.7246  0.6881  0.7187  0.7423  0.7114  0.0224  
Fit time          0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    
Test time         0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    


{'test_rmse': array([0.72614108, 0.67205069, 0.70784651, 0.724598  , 0.68812599,
        0.71874717, 0.74225313]),
 'fit_time': (0.0019419193267822266,
  0.0020689964294433594,
  0.0029811859130859375,
  0.0024900436401367188,
  0.003913164138793945,
  0.0026319026947021484,
  0.0019321441650390625),
 'test_time': (0.0005550384521484375,
  0.0005609989166259766,
  0.0009799003601074219,
  0.0005941390991210938,
  0.00061798095703125,
  0.0005769729614257812,
  0.0006048679351806641)}
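
BaselineOnly predicts a rating as the global mean plus a user bias and an item bias. The sketch below is a plain-Python illustration of the alternating least squares updates as we understand them from the Surprise documentation, not the library's actual code. `reg_u`, `reg_i` and `n_epochs` mirror the `bsl_options` above, `als_baselines` is a hypothetical helper name, and the ratings are invented.

```python
from collections import defaultdict

def als_baselines(ratings, reg_u=12, reg_i=5, n_epochs=10):
    """ratings: list of (user, item, rating). Returns (mu, user_bias, item_bias)."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    bu, bi = defaultdict(float), defaultdict(float)
    by_user, by_item = defaultdict(list), defaultdict(list)
    for u, i, r in ratings:
        by_user[u].append((i, r))
        by_item[i].append((u, r))
    for _ in range(n_epochs):
        # Item step: user biases held fixed, residuals shrunk by reg_i
        for i, rows in by_item.items():
            bi[i] = sum(r - mu - bu[u] for u, r in rows) / (reg_i + len(rows))
        # User step: item biases held fixed, residuals shrunk by reg_u
        for u, rows in by_user.items():
            bu[u] = sum(r - mu - bi[i] for i, r in rows) / (reg_u + len(rows))
    return mu, dict(bu), dict(bi)

# Invented ratings
toy = [('a', 'Fortnite', 5.0), ('a', 'VALORANT', 4.0), ('b', 'Fortnite', 3.0)]
mu, bu, bi = als_baselines(toy)
estimate = mu + bu['b'] + bi['VALORANT']   # streamer b on a game they have not streamed
print(round(estimate, 3))
```

Larger `reg_u` and `reg_i` values pull the biases toward zero, so streamers or games with only a few ratings stay close to the global mean.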

Now we split the data into train and test sets to produce the predictions for genres and games streamers would like. In the 7-fold cross-validation above, the mean test RMSE is about 0.87 for genres (SlopeOne) and about 0.71 for games (BaselineOnly), so the game model has the lower error. In the earlier benchmarks each algorithm scored almost the same on both datasets, so this gap comes mostly from the choice of algorithm rather than from the data.

In [38]:
genre_trainset, genre_testset = train_test_split(genre_base_data, test_size=0.25)
genre_base_predictions = algo_genre_base.fit(genre_trainset).test(genre_testset)
accuracy.rmse(genre_base_predictions)

RMSE: 0.8745


0.8744919422412785

In [39]:
game_trainset, game_testset = train_test_split(game_base_data, test_size=0.25)
game_base_predictions = algo_games_base.fit(game_trainset).test(game_testset)
accuracy.rmse(game_base_predictions)

Estimating biases using als...
RMSE: 0.7035


0.7034600959466067

Loading the previously saved models and predictions for use in the predictor app (these cells fail if the pickle files are not present under Data/):

In [40]:
import pickle

In [41]:
algo_genre_base = pickle.load( open( "Data/SlopeOne_genre_model.pkl", "rb" ) )

FileNotFoundError: [Errno 2] No such file or directory: 'Data/SlopeOne_genre_model.pkl'

In [None]:
algo_games_base = pickle.load( open( "Data/BaselineOnly_game_model.pkl", "rb" ) )

In [None]:
genre_base_predictions = pickle.load( open( "Data/SlopeOne_genre_model_predictions.pkl", "rb" ) )

In [None]:
game_base_predictions = pickle.load( open( "Data/BaselineOnly_game_model_predictions.pkl", "rb" ) )

In [None]:
genres_base_df = pickle.load( open( "Data/genres.pkl", "rb" ) )

In [None]:
games_base_df = pickle.load( open( "Data/games.pkl", "rb" ) )
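
The load cells above assume the pickle files already exist. A minimal sketch of how such files could be written and read back with context managers follows. `save_pickle` and `load_pickle` are hypothetical helper names, and the demo writes a stand-in dictionary to a temporary directory rather than to the real `Data/` folder.

```python
import os
import pickle
import tempfile

def save_pickle(obj, path):
    """Write obj to path; the with-block closes the file even on errors."""
    with open(path, 'wb') as f:
        pickle.dump(obj, f)

def load_pickle(path):
    """Read an object back from path."""
    with open(path, 'rb') as f:
        return pickle.load(f)

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo_model.pkl')
    save_pickle({'algo': 'SlopeOne', 'n_items': 3}, demo_path)  # stand-in object
    restored = load_pickle(demo_path)

print(restored)
```

In the notebook the same helper could be called on the fitted algorithms and prediction lists right after cells 38 and 39, before the app tries to load them.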

Taking the inputs from the user:

In [None]:
streamer_name = input('What is your streamer name? ')
streamer_genres = list(input ('Which game genres do you currently stream? ').split(', '))
streamer_games = list(input ('Which games do you currently stream? ').split(', '))

In [None]:
streamer_name, streamer_genres, streamer_games

Making a list of streamers' current genres and games by combining any information we already have in our dataset and their own inputs into the app:

In [None]:
genres_base_df.head()

In [None]:
def display_current_genres(streamer_name):
    user_genres = list(genres_base_df[genres_base_df['user_name']==streamer_name]['game_genres'])
    return user_genres
recorder_genres_list = display_current_genres(streamer_name)
full_genres = set(recorder_genres_list + streamer_genres)
full_genres = list(full_genres)
full_genres

In [None]:
def display_current_games(streamer_name):
    user_games = list(games_base_df[games_base_df['user_name']==streamer_name]['game_name'])
    return user_games
recorder_games_list = display_current_games(streamer_name)
full_games = set(recorder_games_list + streamer_games)
full_games = list(full_games)
full_games

####  Predicting Genres for Streamers (user-based similarities) ####

In [None]:
from collections import defaultdict

In [None]:
def get_top_n(predictions, n=10):
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and keep the n highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n

Collecting the raw ids of genres and games (Surprise calls the identifiers from the original data raw ids, as opposed to its internal integer inner ids). The results are all the genres and games the user has not had exposure to:

In [None]:
iids_genre = genres_base_df['game_genres'].unique()
iids_genre_to_predict = np.setdiff1d(iids_genre, full_genres, assume_unique = True)

In [None]:
iids_game = games_base_df['game_name'].unique()
iids_game_to_predict = np.setdiff1d(iids_game, full_games, assume_unique = True)

In [None]:
iids_genre_to_predict

In [None]:
iids_game_to_predict

Making a personal testset for the user. Since the actual ratings are not known and we are trying to predict the expected rating, we fill the true-rating slot with a placeholder of 0:

In [None]:
genre_testset_personal = [[streamer_name, iid, 0.] for iid in iids_genre_to_predict]
game_testset_personal = [[streamer_name, iid, 0.] for iid in iids_game_to_predict]

Producing a list of predictions based on the inputs:

In [None]:
personal_genre_predictions = algo_genre_base.test(genre_testset_personal)
personal_game_predictions = algo_games_base.test(game_testset_personal)
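
One caveat worth checking: when Surprise cannot compute a real estimate (for example when SlopeOne is asked about a streamer it did not see during training), it falls back to a default prediction and flags it with `details['was_impossible']`. The sketch below filters such entries out; it uses a local namedtuple as a stand-in with the same five fields that `get_top_n` unpacks, so that it runs without Surprise, and `drop_impossible` is a hypothetical helper name.

```python
from collections import namedtuple

# Stand-in with the same five fields as the tuples unpacked in get_top_n
Prediction = namedtuple('Prediction', ['uid', 'iid', 'r_ui', 'est', 'details'])

def drop_impossible(predictions):
    """Keep only predictions that were not flagged as impossible."""
    return [p for p in predictions if not p.details.get('was_impossible', False)]

# Invented predictions
demo = [
    Prediction('streamer', '{MOBA}', 0.0, 4.2, {'was_impossible': False}),
    Prediction('streamer', '{FPS,Shooter}', 0.0, 3.9,
               {'was_impossible': True, 'reason': 'unknown user'}),
]
kept = drop_impossible(demo)
print([p.iid for p in kept])
```

If every prediction for a brand-new streamer is flagged, the ranking carries no information and the app may want to fall back to overall popularity instead.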

In [None]:
personal_genre_df = pd.DataFrame(personal_genre_predictions, columns=['uid', 'iid', 'rui', 'est', 'details'])    

In [None]:
personal_genre_pred = personal_genre_df[['iid', 'est']]

In [None]:
personal_game_df = pd.DataFrame(personal_game_predictions, columns=['uid', 'iid', 'rui', 'est', 'details'])    

In [None]:
personal_game_pred = personal_game_df[['iid', 'est']]

In [None]:
top_n_genres = get_top_n(personal_genre_predictions)
top_n_games = get_top_n(personal_game_predictions)

In [None]:
top_n_genres

In [None]:
top_n_games

In [None]:
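for uid, user_ratings in top_n_genres.items():
    print('For ' + uid + ', the recommended genres are: ' + str([iid for (iid, _) in user_ratings]))
# Look the streamer up explicitly instead of relying on the leftover loop variable
genre_user_based_list = [iid for (iid, _) in top_n_genres[streamer_name]]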

In [None]:
for uid, user_ratings in top_n_games.items():
    print('For ' + uid + ', the recommended games are: ' + str([iid for (iid, _) in user_ratings]))
# Look the streamer up explicitly instead of relying on the leftover loop variable
game_user_based_list = [iid for (iid, _) in top_n_games[streamer_name]]