In [1]:
version = "REPLACE_PACKAGE_VERSION"

---
# Assignment 4: Collaborative Filtering (50 pts)

Our newsfeed business eventually went bankrupt because we were unable to serve our subscribers' long-term information needs well using Adaptive Filtering. Back from a long trip, fully recharged, you take a freelance job as an external consultant for Steam, a software platform that sells video games. You learned during the trip that yet another way of making recommendations to users is through Collaborative Filtering, and your supervisor provided you with a [data set](https://cseweb.ucsd.edu/~jmcauley/datasets.html#steam_data) ("Version 1: User and Item Data") about many Steam users' time spent playing various games. You are confident about offering useful recommendations to users this time. 

The function below reads and processes the user-game data, and returns a [CSR](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html) user-game matrix. The user-game matrix is very similar to the "user-item matrix" in the lecture slides, in that each row represents a user and each column represents a game (item). The `(i, j)` entry of the user-game matrix represents user `i`'s time spent playing game `j` measured in thousand hours, which we also call "playtime". Playtime is a natural proxy for a user's preference on games --- the more time a user spent playing a game, the more attractive the game is. To allow a qualitative analysis of the filtering results later, this function also returns a `list` containing the names of the games in each column. 

In [2]:
import gzip
from scipy.sparse import csr_matrix

def read_data(max_user=1000):
    items_to_cols = dict()
    data, indices, indptr = [], [], [0]

    with gzip.open("assets/australian_users_items.json.gz", "rb") as f:
        user_count = 0
        for line in f:
            if max_user is not None and user_count >= max_user:
                break

            # Parse each line as a dict
            user_dict = eval(line)
            if not user_dict["items"]:
                continue
            elif all((item["playtime_forever"] == 0 for item in user_dict["items"])):
                continue

            # Ref: last example from https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html
            for item in user_dict["items"]:
                col = items_to_cols.setdefault(item["item_name"], len(items_to_cols))
                indices.append(col)
                data.append(item["playtime_forever"] / 1000)
            indptr.append(len(indices))
            user_count += 1
            
    user_game_mat = csr_matrix((data, indices, indptr), dtype=float)
    games = list(items_to_cols.keys())
    
    return user_game_mat, games

In [3]:
# Load the user-game matrix and a list of games
user_game_mat, games = read_data()

# Compute the proportion of non-zero entries in the user-game matrix
density = user_game_mat.count_nonzero() / (user_game_mat.shape[0] * user_game_mat.shape[1])

print(f"Density of user-game matrix = {density * 100:.2f}%") # A very sparse matrix
print(games[:5]) # Any games you are also playing?

Density of user-game matrix = 0.97%
['Counter-Strike', 'Team Fortress Classic', 'Day of Defeat', 'Deathmatch Classic', 'Half-Life: Opposing Force']


## Question 1: Memory-based Collaborative Filtering (20 pts)

Let's first implement memory-based Collaborative Filtering. Complete the function below that takes in a user-game matrix `user_game_mat` as returned by `read_data` and a zero-based user index `user_idx`. It returns a `np.ndarray` (or a `np.matrix`) of shape `(1, user_game_mat.shape[1])` that represents the predicted playtime for each game for the user at `user_idx`. The algorithm is described in the lecture slide titled "Memory-based Approach". We will use **Cosine Similarity** as the similarity measure between two users based on their **raw ratings**. You do **not** need to exclude the user at `user_idx` when averaging the ratings $v_{ij}$ to compute $\hat{v}_{aj}$, since we are interested in recommending games that are not currently being played by the user. 

**Hint:** Even though the algorithm only specifies how to predict a single rating score $r_{aj}$, a fully vectorised solution to this problem is possible. Many of the steps in the algorithm can be performed simultaneously for all users/games. The calculation of $\hat{v}_{aj}$ can be simplified as a matrix operation. 

In [4]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from scipy.sparse import issparse


def do_memory_cf(matrix, user_index):
    
    if issparse(matrix):
        matrix_dense = matrix.toarray()
    else:
        matrix_dense = matrix
        
    # Calculate the average playtime for each user
    avg_playtime_per_user = np.mean(matrix_dense, axis=1)
    
    # Adjust user playtimes by subtracting the user's average playtime
    adjusted_playtimes = matrix_dense - avg_playtime_per_user.reshape(-1, 1)
    
    # Extract the playtime vector for the target user
    target_user_vector = matrix_dense[user_index, :].reshape(1, -1)
    
    # Compute the similarity of the target user to all users
    similarity_scores = cosine_similarity(target_user_vector, matrix_dense).flatten()
    
    # Ensure adjusted_playtimes is aligned correctly for dot product
    weighted_sum = similarity_scores.dot(adjusted_playtimes)
    
    # Normalize the weighted sum by the total similarity to get the prediction
    prediction = weighted_sum / np.sum(similarity_scores)

    # Adjust the prediction by adding back the mean playtime of the target user
    final_prediction = prediction + avg_playtime_per_user[user_index]
    
    # Ensure the final prediction is shaped correctly as a 1xN array (where N is the number of games)
    final_prediction = final_prediction.reshape(1, -1)
    
    return final_prediction

In [5]:
# Autograder tests
import scipy.sparse
import numpy as np

user_game_mat, games = read_data()

user_idx = 0 # This may vary in the hidden tests

stu_pred_playtime = do_memory_cf(user_game_mat, user_idx)

# Some sanity checks
assert isinstance(stu_pred_playtime, np.ndarray), f"Q1: Your function should return a np.ndarray. "
assert stu_pred_playtime.ndim == 2, f"Q1: Your np.ndarray should be 2-dimensional. "
assert stu_pred_playtime.shape == (1, user_game_mat.shape[1]), f"Q1: Your np.ndarray is of an incorrect shape. "

# Test it with only one user
stu_pred_playtime = do_memory_cf(user_game_mat[0], 0)

if not np.allclose(stu_pred_playtime, user_game_mat[0].todense(), atol=1e-4, rtol=1e-2):
    raise AssertionError(f"Q1: When there is only one user, the user's raw playtime should be returned. ")

# Some hidden tests

del user_game_mat, games, user_idx, stu_pred_playtime

We could do some qualitative analysis on the results of our memory-based Collaborative Filtering. For example, we can determine what *new* games should be recommended to a user and how the recommendations compare with the games that the user spent most of the time playing. Do the recommendations make sense? You may need to google each game a bit to understand what type of game it is. We encourage you to perform other kinds of analysis as you see fit, such as computing the RMSE between the true and predicted playtime. 

In [6]:
import scipy.sparse
import numpy as np

user_game_mat, games = read_data()

user_idx, top_n = 1, 5

# Top favourite games
user_vec = user_game_mat[user_idx].toarray().flatten()
cur_max_inds = np.argsort(-user_vec)
cur_top_games = [games[i] for i in cur_max_inds[:top_n]]
print(f"Top {top_n} favourite games:")
print(cur_top_games)

print()

# Top recommended games
stu_pred_playtime = do_memory_cf(user_game_mat, user_idx)
stu_pred_playtime = np.asarray(stu_pred_playtime).flatten()
rec_max_inds = np.argsort(-stu_pred_playtime[user_vec == 0]) # user_vec == 0 => games currently not played by the user
rec_top_games = [games[i] for i in rec_max_inds[:top_n]]
print(f"Top {top_n} recommended games:")
print(rec_top_games)

print()

# RMSE between true & predicted playtime
rmse = np.sqrt(np.sum((user_vec[user_vec != 0] - stu_pred_playtime[user_vec != 0]) ** 2))
print(f"RMSE = {rmse}")

Top 5 favourite games:
['The Elder Scrolls V: Skyrim', 'Terraria', 'Saints Row: The Third', 'Portal 2', "Deus Ex: Human Revolution - Director's Cut"]

Top 5 recommended games:
['Fallout 3', 'Dishonored', "Life Is Strange™ - Directors' Commentary - 3. Intentions", "Life Is Strange™ - Directors' Commentary - 5. A lively world", "Hunted: The Demon's Forge"]

RMSE = 20.65125916832172


## Question 2: Matrix-based Collaborative Filtering (30 pts)

Collaborative Filtering can also be done via factoring the user-game matrix, which we now explore. 

### Question 2a: SVD-based Collaborative Filtering (20 pts)

The main idea behind SVD-based Collaborative Filtering is to approximate the sparse user-game matrix with a rank-$k$ matrix: $R \approx U_{k}\Sigma_{k}V_{k}^{T}$. $k$ is also referred to as the number of "factors" which are represented by the columns of $U_{k}$ or the rows of $V_{k}^{T}$. Complete the function below that takes in a `user_game_mat` and a `user_idx` as usual, and performs SVD-based Collaborative Filtering with `num_factors` and `random_state`. Likewise, it should return a `np.ndarray` (or a `np.matrix`) of shape `(1, user_game_mat.shape[1])` that represents the predicted playtime for each game for the user at `user_idx`.

**Hint:** The idea of SVD-based Collaborative Filtering may look simple, but it may well take a while to figure out how to implement that with `sklearn` (unless you implement your own SVD). It might be useful to think about what *data* you have available, what *methods* are associated with `sklearn`'s SVD and what the *shapes* of your inputs to those methods should be. **The ideal solution uses just one line of code for making predictions.**

In [7]:
from sklearn.decomposition import TruncatedSVD

def do_svd_cf(user_game_mat, user_idx, num_factors, random_state=None):
    """
    Do SVD-based collaborative filtering for user at user_idx
    """
    svd = TruncatedSVD(n_components=num_factors, random_state=random_state)
    user_game_matrix_reduced = svd.fit_transform(user_game_mat)
    user_game_matrix_approx = np.dot(user_game_matrix_reduced, svd.components_)
    
    # Extract the predicted playtime for the specified user
    pred_playtime = user_game_matrix_approx[user_idx, :].reshape(1, -1)
    

    return pred_playtime

In [8]:
# Autograder tests
import scipy.sparse
import numpy as np

user_game_mat, games = read_data()

# These won't change in the hidden tests
num_factors, random_state = 100, 42

user_idx = 0 # This may vary in the hidden tests

stu_pred_playtime = do_svd_cf(user_game_mat, user_idx, num_factors, random_state)

# Some sanity checks
assert isinstance(stu_pred_playtime, np.ndarray), f"Q2a: Your function should return a np.ndarray. "
assert stu_pred_playtime.ndim == 2, f"Q2a: Your np.ndarray should be 2-dimensional. "
assert stu_pred_playtime.shape == (1, user_game_mat.shape[1]), f"Q2a: Your np.ndarray is of an incorrect shape. "

# Some hidden tests

del user_game_mat, games, user_idx, stu_pred_playtime

Of course, we could carry out the same qualitative analysis on the results of our SVD-based Collaborative Filtering and compute the RMSE between the true and predicted playtime. Do the recommendations make (more) sense? 

In [9]:
import scipy.sparse
import numpy as np

user_game_mat, games = read_data()

user_idx, top_n = 1, 5
num_factors, random_state = 100, 42

# Top favourite games
user_vec = user_game_mat[user_idx].toarray().flatten()
cur_max_inds = np.argsort(-user_vec)
cur_top_games = [games[i] for i in cur_max_inds[:top_n]]
print(f"Top {top_n} favourite games:")
print(cur_top_games)

print()

# Top recommended games
stu_pred_playtime = do_svd_cf(user_game_mat, user_idx, num_factors, random_state)
stu_pred_playtime = np.asarray(stu_pred_playtime).flatten()
rec_max_inds = np.argsort(-stu_pred_playtime[user_vec == 0]) # user_vec == 0 => games currently not played by the user
rec_top_games = [games[i] for i in rec_max_inds[:top_n]]
print(f"Top {top_n} recommended games:")
print(rec_top_games)

print()

# RMSE between true & predicted playtime
rmse = np.sqrt(np.sum((user_vec[user_vec != 0] - stu_pred_playtime[user_vec != 0]) ** 2))
print(f"RMSE = {rmse}")

Top 5 favourite games:
['The Elder Scrolls V: Skyrim', 'Terraria', 'Saints Row: The Third', 'Portal 2', "Deus Ex: Human Revolution - Director's Cut"]

Top 5 recommended games:
["Assassin's Creed Liberation", 'Darkest of Days', 'Call of Duty', 'Half-Life 2', 'Trine']

RMSE = 7.902019281745345


### Question 2b: NMF-based Collaborative Filtering (10 pts)

NMF offers yet another way of factoring the user-game matrix. Complete the function below for NMF-based Collaborative Filtering, which accepts the same inputs and produces the same output as your function in the last question. Use `init="nndsvd"` for your `NMF`. 

**Hint:** If you have come through the challenge of implementing SVD-based Collaborative Filtering with `sklearn`, then this question should be a breeze. 

In [10]:
from sklearn.decomposition import NMF

def do_nmf_cf(user_game_mat, user_idx, num_factors, random_state=None):
    """
    Do NMF-based collaborative filtering for user at user_idx
    """
    nmf = NMF(n_components=num_factors, init='nndsvd', random_state=random_state)
    W = nmf.fit_transform(user_game_mat)
    H = nmf.components_
    
    # Reconstruct the matrix and extract predictions
    user_game_matrix_approx = np.dot(W, H)
    pred_playtime = user_game_matrix_approx[user_idx, :].reshape(1, -1)
    

    return pred_playtime

In [11]:
# Autograder tests
import scipy.sparse
import numpy as np

user_game_mat, games = read_data()

# These won't change in the hidden tests
num_factors, random_state = 100, 42

user_idx = 0 # This may vary in the hidden tests

stu_pred_playtime = do_nmf_cf(user_game_mat, user_idx, num_factors, random_state)

# Some sanity checks
assert isinstance(stu_pred_playtime, np.ndarray), f"Q2b: Your function should return a np.ndarray. "
assert stu_pred_playtime.ndim == 2, f"Q2b: Your np.ndarray should be 2-dimensional. "
assert stu_pred_playtime.shape == (1, user_game_mat.shape[1]), f"Q2b: Your np.ndarray is of an incorrect shape. "

# Some hidden tests - may take a while

del user_game_mat, games, user_idx, stu_pred_playtime

We could do the same qualitative analysis again but this time on the results of our NMF-based Collaborative Filtering. How do the recommendations compare with that from SVD-based Collaborative Filtering? Better or worse?

In [12]:
import scipy.sparse
import numpy as np

user_game_mat, games = read_data()

user_idx, top_n = 1, 5
num_factors, random_state = 100, 42

# Top favourite games
user_vec = user_game_mat[user_idx].toarray().flatten()
cur_max_inds = np.argsort(-user_vec)
cur_top_games = [games[i] for i in cur_max_inds[:top_n]]
print(f"Top {top_n} favourite games:")
print(cur_top_games)

print()

# Top recommended games
stu_pred_playtime = do_nmf_cf(user_game_mat, user_idx, num_factors, random_state)
stu_pred_playtime = np.asarray(stu_pred_playtime).flatten()
rec_max_inds = np.argsort(-stu_pred_playtime[user_vec == 0]) # user_vec == 0 => games currently not played by the user
rec_top_games = [games[i] for i in rec_max_inds[:top_n]]
print(f"Top {top_n} recommended games:")
print(rec_top_games)

print()

# RMSE between true & predicted playtime
rmse = np.sqrt(np.sum((user_vec[user_vec != 0] - stu_pred_playtime[user_vec != 0]) ** 2))
print(f"RMSE = {rmse}")

Top 5 favourite games:
['The Elder Scrolls V: Skyrim', 'Terraria', 'Saints Row: The Third', 'Portal 2', "Deus Ex: Human Revolution - Director's Cut"]

Top 5 recommended games:
["Assassin's Creed Liberation", 'WARP', 'Dragon Saga', 'Beat Hazard', 'Darkest of Days']

RMSE = 8.490670969341032


## References for data used

Kang, W.C., & McAuley, J. (2018). Self-Attentive Sequential Recommendation. In 2018 IEEE International Conference on Data Mining (ICDM) (pp. 197-206).

Wan, M., & McAuley, J. (2018). Item Recommendation on Monotonic Behavior Chains. In Proceedings of the 12th ACM Conference on Recommender Systems (pp. 86–94). Association for Computing Machinery.

Pathak, A., Gupta, K., & McAuley, J. (2017). Generating and Personalizing Bundle Recommendations on Steam. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 1073–1076). Association for Computing Machinery.