## Home task: Collaborative Filtering

Find a public dataset and apply a collaborative filtering recommendation algorithm to it.

For collaborative filtering we use the Kaggle dataset [Steam Video Games](https://www.kaggle.com/datasets/tamber/steam-video-games), which contains 200,000 Steam user interactions and can be used to recommend video games.

The dataset has four columns: user-id, game-title, behavior-name, and value. The behaviors included are *purchase* and *play*. The value indicates the degree to which the behavior was performed: for *purchase* the value is always 1, and for *play* the value is the number of hours the user has played the game.

#### Load dataset and prepare for algorithm

In [1]:
import os

CWD = os.getcwd()
DATA_DIR = os.path.join(CWD, 'data')
STEAM_FILE = os.path.join(DATA_DIR, 'steam-200k.csv')

In [2]:
import pandas as pd

# Read data from CSV file
steam_data = pd.read_csv(
    STEAM_FILE,
    header=None,
    usecols=[0, 1, 2, 3],
    names=['User ID', 'Steam game', 'User behavior', 'Hours played']
)
steam_data.head()

|   | User ID | Steam game | User behavior | Hours played |
|---|---|---|---|---|
| 0 | 151603712 | The Elder Scrolls V Skyrim | purchase | 1.0 |
| 1 | 151603712 | The Elder Scrolls V Skyrim | play | 273.0 |
| 2 | 151603712 | Fallout 4 | purchase | 1.0 |
| 3 | 151603712 | Fallout 4 | play | 87.0 |
| 4 | 151603712 | Spore | purchase | 1.0 |


A given user and game can have two entries, one where `User behavior` is *purchase* and one where it is *play*, and therefore two values of `Hours played`. We reduce these to a single entry per user and game with a single value of `Hours played`.

To achieve this, we first set `Hours played` to zero for *purchase* entries, because their value of 1 is a flag rather than a number of hours. We then group entries by `User ID` and `Steam game` and select the highest value of `Hours played`. Games that were purchased but never played end up with 0 hours.

In [3]:
# Set 'Hours played' to zero in entries where 'User behavior' is 'purchase'
steam_data.loc[steam_data['User behavior'] == 'purchase', 'Hours played'] = 0
steam_data.head()

|   | User ID | Steam game | User behavior | Hours played |
|---|---|---|---|---|
| 0 | 151603712 | The Elder Scrolls V Skyrim | purchase | 0.0 |
| 1 | 151603712 | The Elder Scrolls V Skyrim | play | 273.0 |
| 2 | 151603712 | Fallout 4 | purchase | 0.0 |
| 3 | 151603712 | Fallout 4 | play | 87.0 |
| 4 | 151603712 | Spore | purchase | 0.0 |


In [4]:
# Collapse the table to one row per user and game, with three columns: 'User ID', 'Steam game' and 'Hours played'
ratings = steam_data.groupby(['User ID', 'Steam game'])['Hours played'].max().reset_index()
ratings.head()

|   | User ID | Steam game | Hours played |
|---|---|---|---|
| 0 | 5250 | Alien Swarm | 4.9 |
| 1 | 5250 | Cities Skylines | 144.0 |
| 2 | 5250 | Counter-Strike | 0.0 |
| 3 | 5250 | Counter-Strike Source | 0.0 |
| 4 | 5250 | Day of Defeat | 0.0 |
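
Why the *purchase* value is zeroed before taking the maximum can be checked on a toy frame. The following is a minimal sketch with made-up rows (the user ID and the titles `Game A` and `Game B` are hypothetical, not taken from the dataset). Without zeroing, a game played for less than one hour would be reported as 1.0, because the purchase flag would win the `max`.

```python
import pandas as pd

# Hypothetical toy data: one user, one briefly played game, one unplayed game
toy = pd.DataFrame({
    'User ID': [1, 1, 1],
    'Steam game': ['Game A', 'Game A', 'Game B'],
    'User behavior': ['purchase', 'play', 'purchase'],
    'Hours played': [1.0, 0.4, 1.0],
})
keys = ['User ID', 'Steam game']

# Without zeroing, the purchase flag (1.0) masks the 0.4 hours actually played
naive = toy.groupby(keys)['Hours played'].max()

# With zeroing, the maximum is the real play time (0.0 if never played)
fixed = toy.copy()
fixed.loc[fixed['User behavior'] == 'purchase', 'Hours played'] = 0.0
collapsed = fixed.groupby(keys)['Hours played'].max()

print(naive.tolist())      # [1.0, 1.0]
print(collapsed.tolist())  # [0.4, 0.0]
```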


In [5]:
# Construct a dictionary mapping game titles to game IDs
games = ratings['Steam game'].unique()
game_ids = dict((game, i) for i, game in enumerate(games))

# Show first 10 entries in game_ids
pd.Series(game_ids).head(10)

Alien Swarm                 0
Cities Skylines             1
Counter-Strike              2
Counter-Strike Source       3
Day of Defeat               4
Deathmatch Classic          5
Deus Ex Human Revolution    6
Dota 2                      7
Half-Life                   8
Half-Life 2                 9
dtype: int64
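
As an aside, pandas can build the same kind of title-to-ID mapping in a single call. This is only a sketch on a few titles, assuming IDs assigned in order of first appearance are acceptable; `game_ids_alt` is a name introduced here for illustration.

```python
import pandas as pd

titles = pd.Series(['Alien Swarm', 'Dota 2', 'Alien Swarm', 'Half-Life'])

# codes[i] is the integer ID of titles[i]; uniques[k] is the title with ID k
codes, uniques = pd.factorize(titles)

# Same shape as the dictionary built with enumerate()
game_ids_alt = {title: i for i, title in enumerate(uniques)}

print(codes.tolist())  # [0, 1, 0, 2]
print(game_ids_alt)    # {'Alien Swarm': 0, 'Dota 2': 1, 'Half-Life': 2}
```

Here `uniques[k]` also serves as the reverse lookup from ID to title.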

In [6]:
# Replace the game title with the game IDs
ratings['game id'] = ratings['Steam game'].map(game_ids)

# Construct a dataframe in the format [uid, iid, rating] where
# uid - column with user IDs
# iid - column with item (game) IDs
# rating - column with ratings given by users to items (games)
ratings.rename(columns={'User ID': 'user id', 'Hours played': 'rating'}, inplace=True)
ratings = ratings[['user id', 'game id', 'rating']]
ratings.head()

|   | user id | game id | rating |
|---|---|---|---|
| 0 | 5250 | 0 | 4.9 |
| 1 | 5250 | 1 | 144.0 |
| 2 | 5250 | 2 | 0.0 |
| 3 | 5250 | 3 | 0.0 |
| 4 | 5250 | 4 | 0.0 |


In [7]:
r_min = ratings['rating'].min()
r_max = ratings['rating'].max()
a = 1
b = 5

# Normalize the ratings to range [1, 5] using min-max scaling
ratings['rating'] = a + (b - a) * (ratings['rating'] - r_min) / (r_max - r_min)
ratings.head()

|   | user id | game id | rating |
|---|---|---|---|
| 0 | 5250 | 0 | 1.001668 |
| 1 | 5250 | 1 | 1.049005 |
| 2 | 5250 | 2 | 1.0 |
| 3 | 5250 | 3 | 1.0 |
| 4 | 5250 | 4 | 1.0 |
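
The scaling can be checked by hand. The sketch below assumes `r_min = 0` and `r_max = 11754`; these bounds are not stated in the text and are inferred from the displayed output, so treat them as an assumption. The helper name `min_max_scale` is introduced here for illustration.

```python
def min_max_scale(x, r_min, r_max, a=1.0, b=5.0):
    """Map x linearly from [r_min, r_max] onto [a, b]."""
    return a + (b - a) * (x - r_min) / (r_max - r_min)

# Assumed bounds, inferred from the displayed output (not stated in the text)
R_MIN, R_MAX = 0.0, 11754.0

# Hours from the first rows of the table, plus both ends of the range
for hours in (0.0, 4.9, 144.0, 11754.0):
    print(hours, round(min_max_scale(hours, R_MIN, R_MAX), 6))
```

Under these assumed bounds, 4.9 and 144 hours map to roughly 1.001668 and 1.049005, matching the table. The mapping is linear, so a few very large play times compress typical values into a narrow band just above 1.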


In [8]:
from surprise import Reader, Dataset

# Convert the ratings DataFrame to the format expected by Surprise;
# the rating scale matches the [1, 5] normalization above
reader = Reader(rating_scale=(1, 5))
dataset = Dataset.load_from_df(ratings, reader)

# Use the whole dataset for training
data = dataset.build_full_trainset()

#### Apply SVD++ algorithm

In [9]:
from surprise import SVDpp

# Use SVD++ algorithm on training data
algo = SVDpp()
algo.fit(data);
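
For reference, the prediction rule that SVD++ fits, in the form given in the Surprise documentation, is:

$$
\hat{r}_{ui} = \mu + b_u + b_i + q_i^{T}\left(p_u + |I_u|^{-\frac{1}{2}} \sum_{j \in I_u} y_j\right)
$$

Here $\mu$ is the global mean rating, $b_u$ and $b_i$ are the user and item biases, $p_u$ and $q_i$ are the user and item factor vectors, $I_u$ is the set of items rated by user $u$, and $y_j$ are item factors that capture the implicit signal of which items the user rated at all. In this notebook that implicit signal corresponds to which games a user owns, regardless of hours played.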

#### Evaluate the algorithm

In [10]:
from surprise.accuracy import mae, mse, rmse

# Get the predictions for the training data (this measures training error, not held-out error)
test_data = ratings.to_numpy()
predictions = algo.test(test_data)

print('Predictions evaluation:')
print(f'MAE: {mae(predictions, verbose=False):.4f}')
print(f'MSE: {mse(predictions, verbose=False):.4f}')
print(f'RMSE: {rmse(predictions, verbose=False):.4f}')

Predictions evaluation:
MAE: 0.0151
MSE: 0.0030
RMSE: 0.0550
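
The three metrics are simple functions of the per-prediction error. The sketch below computes them on a few made-up (true, estimated) pairs; the numbers and the helper names are hypothetical and only illustrate the definitions, not the library's implementation.

```python
import math

def mean_abs_error(errors):
    """MAE: average magnitude of the errors."""
    return sum(abs(e) for e in errors) / len(errors)

def mean_sq_error(errors):
    """MSE: average squared error; penalizes large errors more."""
    return sum(e * e for e in errors) / len(errors)

def root_mean_sq_error(errors):
    """RMSE: square root of MSE, in the same units as the ratings."""
    return math.sqrt(mean_sq_error(errors))

# Hypothetical (true rating, estimated rating) pairs
pairs = [(1.0, 1.1), (1.5, 1.4), (2.0, 2.3)]
errors = [est - true for true, est in pairs]

print(f'MAE: {mean_abs_error(errors):.4f}')
print(f'MSE: {mean_sq_error(errors):.4f}')
print(f'RMSE: {root_mean_sq_error(errors):.4f}')
```

Because RMSE is the square root of MSE, the reported values can be cross-checked: the square root of 0.0030 is about 0.0548, which agrees with the reported RMSE of 0.0550 once rounding of the MSE is taken into account.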


#### Create recommendations for chosen user

In [11]:
# Choose a random user to generate recommendations for
user_id = ratings['user id'].sample(random_state=1).squeeze()
print('Chosen user is', user_id)

Chosen user is 17567828


In [12]:
# A set of all game IDs
all_game_ids = set(ratings['game id'])

# A set of game IDs rated by the chosen user
rated_game_ids = set(ratings.loc[ratings['user id'] == user_id, 'game id'])

# A set of game IDs that have not yet been rated by the chosen user
unrated_game_ids = all_game_ids - rated_game_ids

print('Number of games:', len(all_game_ids))
print('Number of rated games:', len(rated_game_ids))
print('Number of unrated games:', len(unrated_game_ids))

Number of games: 5155
Number of rated games: 144
Number of unrated games: 5011


In [13]:
# Construct the test set for the chosen user: one entry per unrated game (the true rating is unknown, hence None)
test_user_data = [
    [user_id, i, None]
    for i in unrated_game_ids
]

# Get the predictions and sort them to recommend games with the highest predicted rating
user_predictions = algo.test(test_user_data)
user_predictions.sort(key=lambda x: x.est, reverse=True)

In [14]:
# Show top 10 game recommendations for the chosen user
print(f'Recommendations for user {user_id}:')
for i, pred in enumerate(user_predictions[:10], start=1):
    game = games[pred.iid]  # Game IDs are positions in `games`, so indexing recovers the title
    print(f'{i}: "{game}" (id {pred.iid}) with rating {pred.est:.4f}')

Recommendations for user 17567828:
1: "Football Manager 2015" (id 1443) with rating 1.1265
2: "Football Manager 2012" (id 2176) with rating 1.1233
3: "Football Manager 2013" (id 1441) with rating 1.1189
4: "Football Manager 2010" (id 2174) with rating 1.1151
5: "Football Manager 2011" (id 2175) with rating 1.1149
6: "Football Manager 2014" (id 1442) with rating 1.1091
7: "FINAL FANTASY XIV A Realm Reborn" (id 1070) with rating 1.0957
8: "Sam & Max 203 Night of the Raving Dead" (id 2016) with rating 1.0899
9: "Starbound - Soundtrack" (id 4482) with rating 1.0852
10: "The Repopulation" (id 3457) with rating 1.0765
