### Preprocessing

Uses the following data:  
1. games_df, a table of general information about (almost) all Steam games
2. review_table, a table of 6+million reviews scraped from the Steam store, listing the user, game, etc
3. recently_played_df, a table of users' recently played games with playtimes

Accomplishes the following:
1. Generate tables containing vectors for every game and every user. Our inputs will be limited to items in this table.
2. Generate reduced tables containing vectors of only the games and users that have rich enough info to be useful for prediction.

In [1]:
import pandas as pd
import numpy as np

from bidict import bidict

import pickle
import pyarrow as pa
import pyarrow.parquet as pq


import scipy.sparse as sp
from scipy.sparse import coo_matrix, csr_matrix, lil_matrix, save_npz

%store -r tags_dict

In [2]:
# Load

with open('../data/interim/1 - Games DF - Wrangled.pkl', 'rb') as file :
    games_df = pickle.load(file)

review_table = pq.read_table('../data/interim/cleaned_reviews.parquet')

with open('../data/interim/recently_played_cleaned.pkl', 'rb') as file :
    recently_played_df = pickle.load(file)

### Make games vectors/matrices

We'll produce one matrix with all known games. This will be used to vectorize input.  
  
Then we'll produce a matrix that contains only those games with sufficient information to be subjects of recommendation.  

In [3]:
# Isolate the info we'll use for the vectors
games_df_tags_only = games_df[['app_id', 'tags', 'tag_list']]
games_df_tags_only.head()

Unnamed: 0,app_id,tags,tag_list
0,730,"[FPS, Shooter, Multiplayer, Competitive, Actio...","[FPS, Shooter, Multiplayer, Competitive, Actio..."
1,553850,"[Action, Online Co-Op, Multiplayer, Third-Pers...","[Action, Online Co-Op, Third-Person Shooter, M..."
2,1086940,"[RPG, Choices Matter, Story Rich, Character Cu...","[RPG, Choices Matter, Story Rich, Character Cu..."
3,1245620,"[Souls-like, Dark Fantasy, Open World, RPG, Di...","[Souls-like, Dark Fantasy, Open World, RPG, Di..."
4,1623730,"[Multiplayer, Open World, Survival, Creature C...","[Multiplayer, Open World, Survival, Creature C..."


In [4]:
# Make a handy dict for the app_ids and their new indexes
game_to_full_index = bidict()
for index, row in games_df_tags_only.iterrows() :
    game_to_full_index[index] = row['app_id']
game_to_full_index = game_to_full_index.inverse

In [5]:
# Get our list of columns
used_tags = set(tags_dict.values())

In [6]:
# Prepare our sparse matrix
# NOTE: We will weight PRIMARY tags more strongly than non-primary tags.
# A lower weight_ratio favors primary tags more.

weight_ratio = 0.8

matrix_values = []

for index, row in games_df_tags_only.iterrows() :
    skips = set()
    for tag in row['tags'] :
        skips.add(tag)
        tup = (index, tag, 1)
        matrix_values.append(tup)
    for tag in row['tag_list'] :
        if tag not in skips :
                    tup = (index, tag, weight_ratio)
                    matrix_values.append(tup)

matrix_values[0]

(0, 'FPS', 1)

In [7]:
# To make the matrix, we must index our tags.
tag_to_col_index = bidict()
i = 0
for value in tags_dict.values() :
    tag_to_col_index[value] = i
    i += 1

In [8]:
# Make the matrix

rows = [row[0] for row in matrix_values]
columns = [tag_to_col_index[row[1]] for row in matrix_values]
values = [row[2] for row in matrix_values]

matrix_row_count = max(rows)+1
matrix_col_count = max(columns)+1

game_tags_matrix = coo_matrix((values, (rows, columns)), shape=(matrix_row_count, matrix_col_count))
game_tags_matrix = csr_matrix(game_tags_matrix)

In [9]:
# Good job, everybody! Let's save it and move on.
save_npz('../data/processed/full_game_tag_matrix.npz', game_tags_matrix)

# And also the index dicts.
with open('../data/processed/tag_to_col_index.pkl', "wb") as file :
    pickle.dump(tag_to_col_index, file)

with open('../data/processed/game_to_full_index.pkl', "wb") as file :
    pickle.dump(game_to_full_index, file)


Now we subset this matrix to include only games with 10+ tags.

This index will be smaller, but we must be careful to note the original index values. That's the only way we can relate the rows in this matrix to any other matrix.

In [10]:
nonzero_counts = pd.Series(game_tags_matrix.getnnz(axis=1))
can_keep = nonzero_counts >= 10
game_tags_matrix_reduced = game_tags_matrix[can_keep]

## This creates a dict with:
##  KEYS == reduced matrix index
##  VALUES == corresponding full matrix index
game_reduced_index_to_full_index = bidict()
i=0
for index, value in can_keep.items() :
    if value==True :
        game_reduced_index_to_full_index[i]=index
        i += 1

In [11]:
# Save! That! Matrix!
save_npz('../data/processed/reduced_game_tag_matrix.npz', game_tags_matrix_reduced)

# And the dict, of course.
with open('../data/processed/game_reduced_index_to_full_index.pkl', 'wb') as file :
    pickle.dump(game_reduced_index_to_full_index, file)

### Now we make the users matrices...

In [12]:
recently_played_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4181501 entries, 0 to 4181500
Data columns (total 4 columns):
 #   Column       Dtype 
---  ------       ----- 
 0   user         object
 1   app_id       int64 
 2   playtime_2w  int64 
 3   playtime_f   int64 
dtypes: int64(3), object(1)
memory usage: 127.6+ MB


In [13]:
review_table.shape

(6747619, 11)

In [14]:
# Because our games_df does not contain every single game on Steam, it's possible
# that a game will be touched in a review or recently_played about which we cannot
# make inference.
# Let's make a set of all usable games to help limit our tables to legal values.
usable_app_ids = set(games_df['app_id'].values)
len(usable_app_ids)

100892

In [15]:
# Now let's reduce the above datasets to only those which touch usable games.
recently_played_df = recently_played_df[recently_played_df['app_id'].isin(usable_app_ids)].reset_index(drop=True)
recently_played_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3593525 entries, 0 to 3593524
Data columns (total 4 columns):
 #   Column       Dtype 
---  ------       ----- 
 0   user         object
 1   app_id       int64 
 2   playtime_2w  int64 
 3   playtime_f   int64 
dtypes: int64(3), object(1)
memory usage: 109.7+ MB


In [16]:
# To subset the review_talbe, we'll have to pandacize it first.

review_df = review_table.to_pandas()
review_df = review_df[review_df['app_id'].isin(usable_app_ids)].reset_index(drop=True)
review_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5726093 entries, 0 to 5726092
Data columns (total 10 columns):
 #   Column           Dtype  
---  ------           -----  
 0   user             object 
 1   app_id           int64  
 2   positive         int64  
 3   total_playtime   float64
 4   review_playtime  float64
 5   text             object 
 6   helpful_count    int64  
 7   review_date      object 
 8   edit_date        object 
 9   date_scraped     object 
dtypes: float64(2), int64(3), object(5)
memory usage: 436.9+ MB


In [17]:
# We should save processed versions of these for use in the modeling notebook.
usable_review_table = pa.Table.from_pandas(review_df)
pq.write_table(usable_review_table, '../data/processed/usable_review_table.parquet')

with open('../data/processed/usable_recently_played.pkl', 'wb') as file :
    pickle.dump(recently_played_df, file)

In [18]:
# The two tables above contain different but overlapping sets of users.
# In order to combine them into a single users matrix, we must first create
# a unified index for all touched users.

recently_users = set(recently_played_df['user'].values)
review_users = set(review_table['user'].to_pylist())
touched_users = recently_users | review_users

user_to_full_index = bidict()
i=0
for user in touched_users :
    user_to_full_index[user] = i
    i+=1

In [19]:
# We also need a unified index for games.

recently_games =set(recently_played_df['app_id'].values)
review_games = set(review_table['app_id'].to_pylist())
touched_games = recently_games | review_games

game_to_col_index = bidict()
i=0
for game in touched_games :
    game_to_col_index[game] = i
    i+=1

In [20]:
# Now let's prepare our points.

recently_played_points = recently_played_df.apply( \
                                lambda row: (user_to_full_index[row['user']], game_to_col_index[row['app_id']], 0.2), axis=1) \
                                .tolist()

In [21]:
# Since we'll have to update this matrix in a sec, let's make it a lil_matrix first.

user_info_matrix = lil_matrix((len(touched_users), len(touched_games)), dtype=float)

for point in recently_played_points :
    user_info_matrix[point[0], point[1]] = point[2]

In [22]:
# Now let's prepare the review data to update that matrix.

positive_reviews = review_df[review_df['positive']==1][['user', 'app_id']]
negative_reviews = review_df[review_df['positive']==0][['user', 'app_id']]

positive_points = positive_reviews.apply( \
                lambda row: (user_to_full_index[row['user']], game_to_col_index[row['app_id']], 1), axis=1) \
                .tolist()
                                
negative_points = negative_reviews.apply( \
                lambda row: (user_to_full_index[row['user']], game_to_col_index[row['app_id']], -1), axis=1) \
                .tolist()

In [23]:
## Now, we update the matrix.

for point in positive_points :
    user_info_matrix[point[0], point[1]] = point[2]

for point in negative_points :
    user_info_matrix[point[0], point[1]] = point[2]

In [24]:
## Just for funsies, make it the same format as our games matrix.
user_info_matrix = user_info_matrix.tocsr()

In [25]:
# Save! That! Matrix!
save_npz('../data/processed/user_info_matrix.npz', user_info_matrix)

# And the dicts, of course.
with open('../data/processed/user_to_full_index.pkl', 'wb') as file :
    pickle.dump(user_to_full_index, file)

with open('../data/processed/game_to_col_index.pkl', 'wb') as file :
    pickle.dump(game_to_col_index, file)

In [26]:
# Now we create a subset of the matrix that contains only users with enough info for prediction.
# Where should we put the threshold?

nonzero_counts = pd.Series(user_info_matrix.getnnz(axis=1))
nonzero_counts.describe()

count    1.887354e+06
mean     4.779766e+00
std      7.041132e+00
min      0.000000e+00
25%      1.000000e+00
50%      3.000000e+00
75%      6.000000e+00
max      5.517000e+03
dtype: float64

In [27]:
# I'll arbitrarily choose 5. Sue me! I dare you.

can_keep = nonzero_counts >= 5
user_info_matrix_reduced = user_info_matrix[can_keep]

## This creates a dict with:
##  KEYS == reduced matrix index
##  VALUES == corresponding full matrix index

user_reduced_index_to_full_index = bidict()
i=0
for index, value in can_keep.items() :
    if value==True :
        user_reduced_index_to_full_index[i]=index
        i += 1

In [28]:
# Save! That! Matrix!
save_npz('../data/processed/user_info_matrix_reduced.npz', user_info_matrix_reduced)

# etc etc
with open('../data/processed/user_reduced_index_to_full_index.pkl', 'wb') as file :
    pickle.dump(user_reduced_index_to_full_index, file)

At time of inference, the top X consine similarity users' rows will be called up, and the most common games not already played by the user will be recommended.

# Modeling

Describes a system for taking a single user_id as an input and generating game recommendations.

In [29]:
from scipy import stats

from scipy.sparse import load_npz
from sklearn.metrics.pairwise import cosine_similarity

from bayes_opt import BayesianOptimization

import time
import random

import warnings
warnings.filterwarnings('ignore')

### Load data

In [30]:
# Load the dfs to display the rec results in a human-readable way
with open('../data/interim/1 - Games DF - Wrangled.pkl', 'rb') as file :
    games_df = pickle.load(file)
with open('../data/processed/usable_recently_played.pkl', 'rb') as file :
    recently_played_df = pickle.load(file)
review_table = pq.read_table('../data/processed/usable_review_table.parquet')
with open('../data/raw/all_users', 'rb') as file :
    all_users = pickle.load(file)

# Load the tables used to define input
game_tags_matrix = load_npz('../data/processed/full_game_tag_matrix.npz')
user_info_matrix = load_npz('../data/processed/user_info_matrix.npz')

# Load the tables used for inference
game_tags_matrix_reduced = load_npz('../data/processed/reduced_game_tag_matrix.npz')
user_info_matrix_reduced = load_npz('../data/processed/user_info_matrix_reduced.npz')

# Load the index converters for the games/tags matrix
with open('../data/processed/tag_to_col_index.pkl', "rb") as file :
    tag_to_col_index = pickle.load(file)
with open('../data/processed/game_reduced_index_to_full_index.pkl', 'rb') as file :
    game_reduced_index_to_full_index = pickle.load(file)
with open('../data/processed/game_to_full_index.pkl', 'rb') as file :
    game_to_full_index = pickle.load(file)

# Load the index converters for the users/games matrix
with open('../data/processed/user_to_full_index.pkl', 'rb') as file :
    user_to_full_index = pickle.load(file)
with open('../data/processed/game_to_col_index.pkl', 'rb') as file :
    game_to_col_index = pickle.load(file)
with open('../data/processed/user_reduced_index_to_full_index.pkl', 'rb') as file :
    user_reduced_index_to_full_index = pickle.load(file)

## Define the functions

### Collaborative recommendations, step 1: find similar users

In [31]:
# Get an arbitrary number of similar users
def get_similar_users(target_user_row, similar_user_limit=50, test_user_indices=[], testing=False, verbose=False) :
    """
    Takes a user id and returns a sorted descending series of the X most similar users:
        keys = user index (in the reduced matrix)
        values = cosine similarity
    Rows that have played no games that the target user hasn't also played are removed,*
    as they have no novel info for prediction.
    """

    # Go ahead and run cosine similarity now, so that the scores can be associated with the correct index (the reduced user matrix index).
    row_cosine_similarities = pd.Series(cosine_similarity(target_user_row, user_info_matrix_reduced)[0], name="similarity_score")

    # If we're testing, remove all test users from this step
    if testing==True :
        for index in row_cosine_similarities.index :
            if index in test_user_indices :
                row_cosine_similarities = row_cosine_similarities.drop(index)

    # Remove the 1s.
    # This eliminates any user whose play profile is identical to that of the target user,
    # meaning they would be useless for prediction.
    # Also removes the target user.
    row_cosine_similarities = row_cosine_similarities[row_cosine_similarities < 1]

    row_cosine_similarities.sort_values(ascending=False, inplace=True)
    most_similar_users = row_cosine_similarities[:similar_user_limit]

    if verbose==True :
        print(f"Top {similar_user_limit} most similar users:")
        for index, value in most_similar_users.items() :
            print(f"{round(value, 9)} -- {index}")
            
    return(most_similar_users)

### Collaborative recommendations, step 2: generate scores

In [32]:
# Generate suggestions based on that.
# First, find out all the games these "similar users" have played.
# Then, weight those playes by the similarity score, then sum them across users.

# NOTE：I could speed all this up by just using arrays instead of rows/columns


def get_collab_scores(similar_users, collab_filter_limit=50, target_user_touched_games=[], verbose=False) :
    """
    Takes a series in the format:
        keys: user index (in the reduced matrix)
        values: similarity score to target user (float)

    Looks at all those users' games, calculates a utility score for their played games
    based on the group's collective utility score for those games.

    Returns a series of len {collab_filter_limit} in the format:
        keys: app_id
        values: collab filter score
    """

    # We need to make a column in a df for each game, so we grab all games
    # here (indexed by game col index).
    # We remove any games already touched by the target user along the way.
    relevant_games = set()
    for user in similar_users.keys() :
        for game in user_info_matrix[user].indices :
            if game not in target_user_touched_games :
                relevant_games.add(game)
    
    similar_users_df = pd.DataFrame(index=similar_users.index)

    # Create a column for each game
    for game in relevant_games :
        similar_users_df[game] = 0

    # Now fill those columns with the scores
    for user in similar_users.keys() :
        for game in user_info_matrix[user].indices :
            similar_users_df.loc[user, game] = user_info_matrix_reduced[user, game]

    # Some games have an outlier number of comments, which makes them ALWAYS appear
    # in the collab recs list.
    # This bit of code counteracts that by multiplying each value in each game's col
    # by the inverse log of the sum of the col.
    # Needlessly complex? Maybe. But the motivation behind nerfing the *individual*
    # values in the df instead of just nerfing the games at the end is because 
    # nerfing them at the end (using log) will also nerf the effect of the similarity
    # score, which I don't want.
    # So, finding a way to modify the individual game preference scores before applying
    # the similarity score to the column sums seems preferable.

    # To do this, we can't have negative values. Even zeroes are untrustworthy.
    # Increasing the values here does not skew the overall results, since the
    # results will be normalized.
    # minimum = np.nanmin(similar_users_df.values)

    # if minimum < 0 :
    #     similar_users_df = similar_users_df - (minimum - 1.0001)
    # else :
    #     similar_users_df = similar_users_df + 1.0001

    # for col in similar_users_df.columns :
    #     col_sum = similar_users_df[col].sum()
    #     col_coef = 1/np.log(col_sum ** 1.3)
    #     similar_users_df[col] = similar_users_df[col] * col_coef
    # NOTE: This code just isn't having the desired result! A new approach is needed.
    # Maybe just identify any outliers, and nerf that column iff they appear??? dxdx

    # TODO: Alternatively, consider applying this step while compiling the matrices...!

    # Now multiply the scores by the similarity score
    for user, row in similar_users_df.iterrows() :
        for game in relevant_games :
            if row[game] != 0 :
                similar_users_df.loc[user, game] = similar_users[user] * row[game]

    # Now collect those scores
    # NOTE: CONVERTS FROM GAME COL INDEX BACK TO APP_ID AT THIS STEP
    collab_filt_rec_scores = {}
    for game in relevant_games:
        collab_filt_rec_scores[game_to_col_index.inverse[game]] = similar_users_df[game].sum()

    collab_filt_scores = pd.Series(collab_filt_rec_scores, name="collab_rec_score").sort_values(ascending=False)

    # Limit the output
    collab_filt_scores = collab_filt_scores[:collab_filter_limit]
    
    # Normalize
    collab_filt_scores = stats.zscore(collab_filt_scores.astype(float))

    # To make them interactible with later scores, let's standardize them
    # scaler = MinMaxScaler()
    # collab_filt_scores = pd.Series(scaler.fit_transform(collab_filt_rec_scores.values.reshape(-1,1)).flatten(), index=collab_filt_rec_scores.index)
    # if verbose==True :
    #     for app_id, score in collab_filt_scores.items() :
    #         print(f"{round(score, 3)} -- {games_df[games_df['app_id']==app_id]['title'].values[0]}")
    
    return collab_filt_scores

### Content-based recommendations

In [33]:
# NOTE: STRETCH GOAL: Determine multimodality, generate different lists of scores

In [34]:
# Determine most similar games for each, multiply by user preference, MinMaxScale.

def get_content_scores(target_user_row, content_filter_limit, verbose=False) :

    """
    Takes a row from the users/games table, then does the following:
        1. Finds recs for each from the reduced game/tags matrix
        2. Creates a descending-sorted 10-row Series:
            keys = game's index (relative to main games_df)
            values = queried game's cosine similarity score to the queried game
        3. Weights all values in the series by the user's preference for the game
        4. Combines all resulting series into a single series with sim scores summed
        5. Returns the series
    """

    # Get the most recent games.
    # I already have the games in terms of col indices in the users matrix.
    # I need the indices for the games matrix, to do cos similarity.

    # Go from game col index to app_id
    # played_game_app_ids = [game_to_col_index.inverse[game] for game in target_user_row.indices]
    # # Go from app_id to full game row index
    # played_game_full_matrix_row_indices = [game_to_full_index[game] for game in played_game_app_ids]

    # Find the sim scores for each game, adding them to the main list
    similarity_series_list = []
    full_row_indexes = []

    for query_index in target_user_row.indices :

        # Get the reduced game/tags matrix index
        current_app_id = game_to_col_index.inverse[query_index] 
        current_full_row_index = game_to_full_index[current_app_id]
        try :
            reduced_row_index = game_reduced_index_to_full_index.inverse[current_full_row_index]
        except :
            ####
            continue 

        # Let's grab the full row index. We will later use this to remove
        # already-touched games from the recommendations.
        full_row_indexes.append(current_full_row_index)

        # Find the similarity score between games
        # The resulting series is indexed by the reduced games matrix
        row_cosine_similarities = pd.Series(cosine_similarity(game_tags_matrix[current_full_row_index], game_tags_matrix_reduced)[0])
        # Reindex the predictions back to the full game matrix index
        row_cosine_similarities.index = [game_reduced_index_to_full_index[index] for index in row_cosine_similarities.index]
        row_cosine_similarities.sort_values(ascending=False, inplace=True)
        # Since we cannot know ahead of time how many games the user has touched,
        # and the user may have only touched one game,
        # we can safely limit the results here to the overall content filter limit.
        # This will ensure that the full number of scores are returned no matter 
        # how many games were touched. 
        # This is still indexed by the full game matrix.
        top_similar = row_cosine_similarities[:content_filter_limit]

        # Now, get a coefficient to represent the user's preference for the game in question.
        # All games similar to this game will be modified by this coefficient.
        # First we find the game's column in the full user matrix
        preference_coefficient = target_user_row[0, query_index]
        # Then we just multiply.
        top_similar = top_similar * preference_coefficient

        # That's all we need for the score! Let's append.
        similarity_series_list.append(top_similar)

        # if verbose == True :
        #     print(f"Recs for {games_df.loc[current_full_row_index]['title']}:")
        #     for rec in top_similar.items() :
        #         print(f"{round(rec[1], 3)} -- {games_df.loc[rec[0]]['title']}")

    # Combine the serieses into the main series.
    # Here it's still indexed by full game matrix index.
    final_scores = pd.Series()
    for similarity_series in similarity_series_list :
        final_scores = final_scores.add(similarity_series, fill_value=0)
    final_scores = final_scores.sort_values(ascending=False)

    # Remove already-played games from the main series
    for game in full_row_indexes :
        try :
            final_scores = final_scores.drop(labels=game)
        except :
            continue

    # Normalize the series to make it similar to the collaborative score series
    # I suppose the original values were some wonky kind of float that didn't work
    # with scipy, so we coerce them here.
    content_filt_scores = stats.zscore(final_scores.astype(float))

    # Return the desired number of values
    if len(content_filt_scores) < content_filter_limit :
        content_filter_limit = len(content_filt_scores)
    content_filt_scores = content_filt_scores[:content_filter_limit]

    # Set index to app_id
    content_filt_scores.index = [game_to_full_index.inverse[game] for game in content_filt_scores.index]

    # if verbose == True :
    #     print('------------------')
    #     for game, score in content_filt_scores.items() :
    #         print(f"{round(score, 3)} -- {games_df[games_df['app_id']==game]['title'].values[0]}")     
    
    return(content_filt_scores)

### Combine the recs for final set of recs

In [35]:
# NOTE: Is it worth checking the scale discrepancy between the two and tweaking the ratio
# programmatically for each distribution?

# It may be that tweaking the relative weights of the collaborative and content-based filter results
# can improve accuracy. Let's define this as a function so we can play with that programmatically
# later, if need be.

def combine_scores(collaborative, content_based, double_bonus=0, popular_bias=0, ratio=0.5, recs=10) :
    """
    Takes a series of collaborative filtering scores (key=app_id, value=score)
    And a series of content based filtering scores with the same schema
    And a 0-1 ratio of importance between the two (higher ratio favors collaborative scores)
    And the "double_bonus", which is multiplied/summed to the score of each game that appears in both lists
    And a "popular_bias" which is multiplied to the pos_review_percent and added to the score
    And the number of recommendations to be returned

    Returns a series of game app_ids and recommendation scores
    """

    # Define the ratio-modified scores
    collaborative = collaborative * ratio
    content_based = content_based * (1-ratio)

    # Add them into the base final scores series
    final_recs = collaborative.add(content_based, fill_value=0)

    # Apply doubles bonus, if any
    doubles = []
    for game in collaborative.index :
        if game in content_based.index :
            doubles.append(game)
    for game in doubles :
        final_recs[game] += (final_recs[game] * double_bonus)

    # Apply popularity bonus, if any
    for index in final_recs.index :
        positive_review_percent = games_df[games_df['app_id']==index]['positive_review_percent'].values[0]
        final_recs[index] += (positive_review_percent * popular_bias)

    # Sort descending
    final_recs = final_recs.sort_values(ascending=False)

    # Determine length
    if len(final_recs) < recs :
        recs = len(final_recs)
    
    # Determine final recs
    final_recs = final_recs[:recs]

    return final_recs

### Create unified function

In [36]:
def get_recs(user, similar_user_limit=50, collab_filter_limit=50, content_filter_limit=50, double_bonus=0, popular_bias=0, ratio=0.5, recs=10, test_user_indices=[], test_user_rows=[], testing=False, verbose=False, show_result=False) :

    """
    Takes a user via full user matrix index
    Generates a series of recommendations
    
    """

    # Let's-a-go!
    begin = time.time()

    # Grab some useful info about the target user
    if testing==True :
        # target_user_id = user_reduced_index_to_full_index.inverse[game_reduced_index_to_full_index[user]]
        target_user_row = test_user_rows[test_user_indices.index(user)]
        # print(target_user_row)
    else :
        # target_user_id = user_to_full_index.inverse[user]
        target_user_row = user_info_matrix[user]

    target_user_touched_games = set(target_user_row.indices)

    # Display some stuff
    if show_result==True or verbose==True :
        print("------ User profile:")
        for game in target_user_touched_games :
            app_id = game_to_col_index.inverse[game]
            score = target_user_row[0, game]
            print(f"{score} - {games_df[games_df['app_id']==app_id]['title'].values[0]}")
        print("--------------------")

    # Get similar users
    most_similar_users = get_similar_users(target_user_row, similar_user_limit=similar_user_limit, test_user_indices=test_user_indices, testing=testing, verbose=False)
    if verbose==True :
        print("\nTop 5 most similar users\n")
        print(most_similar_users.head())
        print("--------------------------")

    # Print the params here to make it easier to intuit what's happening in the results
    if verbose==True :
        print(f"similar_user_limit: {similar_user_limit}")
        print(f"collab_filter_limit: {collab_filter_limit}")
        print(f"content_filter_limit: {content_filter_limit}")
        print(f"double_bonus: {round(double_bonus, 3)}")
        print(f"popular_bias: {round(popular_bias, 3)}")
        print(f"ratio col/con: {round(ratio, 3)}")
        print(f"num of recs: {recs}\n")

    # Get scores from them
    collab_filt_scores = get_collab_scores(most_similar_users, collab_filter_limit=collab_filter_limit, target_user_touched_games=target_user_touched_games, verbose=False)
    if verbose==True :
        print("Top 8 collab filt scores")
        for index, item in collab_filt_scores.head(8).items() :
            print(f"{round(item, 3)} -- {games_df[games_df['app_id']==index]['title'].values[0]}")
        print("--------------------------")

    # Get content filtering scores
    content_filt_scores = get_content_scores(target_user_row, content_filter_limit=content_filter_limit,  verbose=False)
    if verbose==True :
        print("Top 8 content filt scores")
        for index, item in content_filt_scores.head(8).items() :
            print(f"{round(item, 3)} -- {games_df[games_df['app_id']==index]['title'].values[0]}")
        print("--------------------------")
    
    # Calculate final scores
    final_recs = combine_scores(collab_filt_scores, content_filt_scores, double_bonus=double_bonus, popular_bias=popular_bias, ratio=ratio, recs=recs)

    if show_result==True or verbose==True:
        print('')
        print('------ Recommendations')
        for index, score in final_recs.items() :
            print(f"{round(score, 3)} -- {games_df[games_df['app_id']==index]['title'].values[0]}")
        print("--------------------")
        print(f"\nRuntime: {round(time.time()-begin, 2)}s\n")

    return final_recs

## Execution!

In [37]:
# NOTE: Commented out so that running the notebook skips directly to testing/evaluation

# params = {
#     "similar_user_limit":50,
#     "collab_filter_limit":50,
#     "content_filter_limit":50,
#     "double_bonus":2,
#     "popular_bias":3,
#     "ratio":0.2,
#     "recs":20,
#     "verbose":False,
#     "show_result": True
# }

# recs = get_recs(50, **params)

### Testing

The basic idea is to remove a game or games from a user's profile (preferably a user with a significant number of touched games) and see if the engine recommends that game for the modified user profile.

In [38]:
# Programmatically assemble a set of usable user profiles (10+ touched games)

def get_test_users(user_count=50, minimum_games=10) :
    """
    Returns user_count number of test users (reduced user matrix index) as a list
    Each user must have at least minimum_games number of touched games and disliked at least one game
    """
    
    # VARS
    test_user_indices = []
    checked_indices = set()
    total_users = user_info_matrix_reduced.shape[0]

    while len(test_user_indices) < user_count :
        # Find a user at random
        index = random.randint(1, total_users)
        # Make sure you haven't done this one before
        if index not in checked_indices :
            # Make sure the user has enough games
            if len(user_info_matrix_reduced[index].indices) >= minimum_games :
                # Make sure they dislike at least one game
                if -1 in user_info_matrix_reduced[index].data :
                    # Log 'em!
                    test_user_indices.append(index)
        # Log 'em!
        checked_indices.add(index)
    
    return test_user_indices

In [39]:
# Randomly remove one or more game(s) from each, and save as a list of rows.

def create_test_rows(test_user_indices) :
    """
    Takes a list of reduced user matrix indices (test users)
    Returns 3 items:
    1. A list of those users' rows with the most- and least- liked games removed
    2. A list of those users' most-liked games (in matching index order)
    3. A list of those users' least-liked games (in matching index order)
    """

    test_rows = []
    liked_games = []
    disliked_games = []

    for test_user in test_user_indices :
        current_row = user_info_matrix_reduced[test_user]

        # Pick a game they LIKED VERY MUCH to remove
        liked_game = np.argmax(current_row.data)

        # Pick a game they HATED to remove.
        # It's possible that all values are positive, in which case this variable will
        # have no meaning. We will check that at the evaluation phase.
        # For now, we'll pull the value no matter what to perserve index relationships.
        disliked_game = np.argmin(current_row.data)

        # Save the values for evaluation
        liked = (test_user, current_row.indices[liked_game], current_row.data[liked_game])
        liked_games.append(liked)
        disliked = (test_user, current_row.indices[disliked_game], current_row.data[disliked_game])
        disliked_games.append(disliked)

        # Remove the values

        lil_row = current_row.tolil()
        lil_row[0, liked[1]] = 0
        lil_row[0, disliked[1]] = 0
        current_row = lil_row.tocsr()
        current_row.eliminate_zeros()

        # Add it to the list
        test_rows.append(current_row)
        
    return test_rows, liked_games, disliked_games


In [40]:
# Evaluation function

# Checks to see if POSITIVE game is in top X recs: +
# Checks to see if NEGATIVE game is in recs at all: -

# TODO: Reframe the holder lists as a series, where the game is the index, values are empty lists, 1 and -1 are added to the list where applicable
#   This will allow us to more easily see which users are best served, and how often we propose positive AND negative games to the same user

def binary_evaluator(results, verbose=False, show_result=False) :
    """"
    Takes a list of tuples with the schema:
        1. Series of recommendations (key=app_id, value=utility score)
        2. app_id of favorite game
        3. app_id of least favorite game

    Evaluates the recommendations in each tuple by:
        Adding 1 to the overall score if the favorite game is recommended
        Subtracting 1 from the overall score if the least favorite game is recommended
    
    Returns the score as an int
    """

    good = []
    bad = []

    for result in results :
        # Grab the series of recommendations
        recs = result[0].index
        # See if we recommended the liked game, and score accordingly
        if result[1] in recs :
            good.append(1)
        # See if we recommended the disliked game, and score accordingly
        if result[2] in recs :
            bad.append(1)
    
    validated = sum(good)
    disproved = sum(bad)
    score = validated - disproved

    print(f"Validated recommendations: {validated}")
    print(f"Disproved recommendations: {disproved}")

    return score

In [41]:
# Another evaluation function

# If a POSITIVE game is in the recs, add the utility score to the score (so higher ranking = higher score)
# If a NEGATIVE game is in the recs, subtract (same as above) (for same reason)

def ranked_evaluator(results, verbose=False, show_result=True) :
    """"
    Takes a list of tuples with the schema:
        1. Series of recommendations (key=app_id, value=utility score)
        2. app_id of favorite game
        3. app_id of least favorite game

    Evaluates the recommendations in each tuple by:
        Adding the inverse rank of the positive game if it's included in the recs
        Subtracting the inverse rank of the negative game if it's included in the recs
    
    Returns the score as an int
    """

    good = []
    bad = []

    score = 0
    user = 0

    for result in results :
        # Grab the series of recommendations
        recs = result[0].index
        # See if we recommended the liked game, and score accordingly
        if result[1] in recs :
            # Get the ranking (since it's as an index, it starts from 0 and needs a little boost to prevent 1/0)
            ranking = recs.get_loc(result[1]) + 1
            good.append((user, ranking))
            score += 1/ranking
        # See if we recommended the disliked game, and score accordingly
        if result[2] in recs :
            # Samesies, except we neg it
            ranking = recs.get_loc(result[2]) + 1
            bad.append((user, ranking))
            score -= 1/ranking
        user += 1 
    
    # validated = sum(good)
    # disproved = sum(bad)
    # score = validated - disproved

    if show_result == True :
        print(f"Positives: {len(good)}")
        print(f"Negatives: {len(bad)}")
        print(f"Score: {score}")
        print(f"\n---\nGood: \n{good}")
        print(f"\n---\nBad: \n{bad}")

    # print(f"Validated recommendations: {validated}")
    # print(f"Disproved recommendations: {disproved}")

    return score

In [42]:
# Run inference on each modified profile


def run_a_test(test_users=50, evaluator=ranked_evaluator, similar_user_limit=50, collab_filter_limit=50, \
               content_filter_limit=50, double_bonus=0, popular_bias=0, ratio=0.5, recs=10, \
               testing=True, verbose=False, show_result=False) :
    
    global results

    # Generate X test user indices
    test_user_indices = get_test_users(test_users)

    test_rows, liked_games, disliked_games = create_test_rows(test_user_indices)

    # test_user_full_indices = [user_reduced_index_to_full_index[user] for user in test_user_indices]
    results = []

    params = {
        "test_user_indices":test_user_indices,
        "test_user_rows":test_rows,
        "similar_user_limit":similar_user_limit,
        "collab_filter_limit":collab_filter_limit,
        "content_filter_limit":content_filter_limit,
        "double_bonus":double_bonus,
        "popular_bias":popular_bias,
        "ratio":ratio,
        "recs":recs,
        "verbose":verbose,
        "show_result":show_result,
        "testing":testing
        }

    for i in range(len(test_user_indices)) :
        result = get_recs(test_user_indices[i], **params)
        results.append((result, \
                        game_to_col_index.inverse[liked_games[i][1]], \
                        game_to_col_index.inverse[disliked_games[i][1]]))
        
    score = evaluator(results, verbose=verbose, show_result=show_result)

    return score

In [43]:
# NOTE: Commented out so that running the notebook skips directly to testing/evaluation
minimums = []
col_sums=[]
test_params = {
    "test_users":1,
    "evaluator":ranked_evaluator,
    "similar_user_limit":50,
    "collab_filter_limit":20,
    "content_filter_limit":10,
    "double_bonus":1.79,
    "popular_bias":0,
    "ratio":0.71,
    "recs":20,
    "testing":True,
    "verbose":True,
    "show_result":True
}

run_a_test(**test_params)

------ User profile:
-1.0 - Overwatch® 2
1.0 - MORDHAU
-1.0 - Brawlhalla
1.0 - Conan Exiles
1.0 - PUBG: BATTLEGROUNDS
1.0 - Bean Battles
-1.0 - PlanetSide 2
1.0 - HELLDIVERS™ 2
1.0 - Battlefield 4™
-1.0 - Red Dead Online
1.0 - Max Payne 3
1.0 - Vampire Survivors
1.0 - Among Us
1.0 - CRSED: F.O.A.D.
1.0 - Garry's Mod
--------------------

Top 5 most similar users

97243     0.438529
138125    0.425596
155504    0.390360
221531    0.390130
344787    0.386501
Name: similarity_score, dtype: float64
--------------------------
similar_user_limit: 50
collab_filter_limit: 20
content_filter_limit: 10
double_bonus: 1.79
popular_bias: 0
ratio col/con: 0.71
num of recs: 20

Top 8 collab filt scores
3.136 -- Counter-Strike 2
2.799 -- Team Fortress 2
0.265 -- Dota 2
-0.365 -- Caliber
-0.365 -- Red Orchestra 2: Heroes of Stalingrad with Rising Storm
-0.365 -- Warface: Clutch
-0.365 -- Bloons TD 6
-0.365 -- PAYDAY 2
--------------------------
Top 8 content filt scores
1.784 -- Realm Royale Reforged
1.

-1.0

### O   P   T   I   M   I   Z   E

In [44]:
def boptimize(test_users=50, evaluator=ranked_evaluator, n_iter=4, init_points=10, verbose=False, show_result=False) :
    
    # We'll define our test set here, so the Bayesian bit below will
    # execute on the same subset each time.

    # Generate X test user indices
    test_user_indices = get_test_users(test_users)

    # Generate test rows
    test_rows, liked_games, disliked_games = create_test_rows(test_user_indices)


    # Define the scoring function within the main function so that it has native access to variables
    def bayes_test(test_user_indices=test_user_indices, liked_games=liked_games, \
               disliked_games=disliked_games, test_rows=test_rows, evaluator=evaluator, \
               similar_user_limit=50, collab_filter_limit=50, \
               content_filter_limit=50, double_bonus=0, popular_bias=0, ratio=0.5, \
               recs=10, testing=True, verbose=verbose, show_result=show_result) :
    
        params = {
        "test_user_indices":test_user_indices,
        "test_user_rows":test_rows,
        "similar_user_limit":int(similar_user_limit),
        "collab_filter_limit":int(collab_filter_limit),
        "content_filter_limit":int(content_filter_limit),
        "double_bonus":double_bonus,
        "popular_bias":popular_bias,
        "ratio":ratio,
        "recs":int(recs),
        "verbose":verbose,
        "show_result":show_result,
        "testing":testing
        }
        
        results = []

        for i in range(len(test_user_indices)) :
            result = get_recs(test_user_indices[i], **params)
            results.append((result, \
                            game_to_col_index.inverse[liked_games[i][1]], \
                            game_to_col_index.inverse[disliked_games[i][1]]))
            
        score = evaluator(results)
        print(score)

        return score


    # Prepare params for Bayes
    param_bounds = {
        "similar_user_limit":(100, 140),
        "collab_filter_limit":(40, 100),
        "content_filter_limit":(10, 50),
        "double_bonus":(0, 3),
        "popular_bias":(2, 5),
        "ratio":(0.2, 0.95),
        "recs":(20, 20),
        }
    
    # Execute
    optimizer = BayesianOptimization(f=bayes_test, pbounds=param_bounds, random_state=42)
    optimizer.maximize(init_points=init_points, n_iter=n_iter)
    best_params = optimizer.max
    print(best_params)

    return best_params

I ran the following cell many times, recording the result each time.

This final iteration represents a honing-in on the most optimal parameters.

A fuller discussion will be the forthcoming project report.

In [45]:
boptimize(test_users=70, n_iter=15, init_points=8, show_result=True)

|   iter    |  target   | collab... | conten... | double... | popula... |   ratio   |   recs    | simila... |
-------------------------------------------------------------------------------------------------------------
------ User profile:
1.0 - Castle Crashers®
1.0 - Session: Skate Sim
1.0 - Supraland
1.0 - FIGHTING EX LAYER
0.2 - Nioh: Complete Edition
0.2 - SMITE®
0.2 - NBA 2K Playgrounds 2
1.0 - Mafia III: Definitive Edition
1.0 - Mortal Kombat 11
0.2 - Animal Shelter
1.0 - SOULCALIBUR VI
1.0 - Risk of Rain 2
1.0 - Skater XL - The Ultimate Skateboarding Game
0.2 - WWE 2K BATTLEGROUNDS
-1.0 - Borderlands Game of the Year Enhanced
0.2 - DARK SOULS™ III
1.0 - Borderlands 2
--------------------

------ Recommendations
7.001 -- Street Fighter V
5.529 -- Counter-Strike 2
5.168 -- BlazBlue Centralfiction
5.101 -- Killer Instinct
5.06 -- GUILTY GEAR Xrd -SIGN-
5.028 -- Mafia: Definitive Edition
4.908 -- THE KING OF FIGHTERS XV
4.907 -- TEKKEN 8
4.882 -- Far Cry 3
4.876 -- BlazBlue: Chrono

{'target': 3.5095959595959596,
 'params': {'collab_filter_limit': 75.54487413172255,
  'content_filter_limit': 11.85801650879991,
  'double_bonus': 1.822634555704315,
  'popular_bias': 2.5115723710618747,
  'ratio': 0.24878869473895965,
  'recs': 20.0,
  'similar_user_limit': 138.62528132298237}}

### Conclusion

While the bulk of the conclusions and analysis will be in the forthcoming presentation step, we can see by the 'target' score that we are able to games that the user will enjoy much more frequently than games the user will not enjoy, and that the optimal parameters for our evaluation function are as follows (I had to remove the outputs to make the notebook readable):  

{'target': 13.0,  
 'params': {'collab_filter_limit': 103.48734651630105,  
  'content_filter_limit': 32.921446485352185,  
  'double_bonus': 1.7930894746349533,  
  'popular_bias': 1.8472839810604431,  
  'ratio': 0.7140703845572055,  
  'recs': 20.0,  
  'similar_user_limit': 250.0}}