# Task 2

In this task, we will build a collaborative filtering recommender system using user data from Steam, a gaming platform. The dataset has four attributes: the user ID, game, user behaviour, and playtime (if the behaviour is play). This data allows us to infer user preferences in the absence of concrete, explicit ratings given by users towards games.


The main goal is to develop a personalised recommender system which uses previous interaction history between user and games to determine which games a user might enjoy in future. The end result should show a list of N games linked to a specific user_id tailored to the user’s unique profile. As the Steam dataset is 200,000 rows, Spark is a good module to use as this can be considered Big Data.


We will use Alternating Least Squares (ALS) algorithm for matrix factorisation, before analysing the performance of the algorithm using metrics such as RMSE and Precision@10. There will also be some examples of hyperparameter tuning, and all of the experiments will be tracked using MLflow. 



In [0]:
%pip install mlflow

Python interpreter will be restarted.
Collecting mlflow
  Downloading mlflow-2.22.1-py3-none-any.whl (29.0 MB)
Collecting sqlalchemy<3,>=1.4.0
  Downloading sqlalchemy-2.0.41-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.2 MB)
Collecting graphene<4
  Downloading graphene-3.4.3-py2.py3-none-any.whl (114 kB)
Collecting docker<8,>=4.0.0
  Downloading docker-7.1.0-py3-none-any.whl (147 kB)
Collecting Flask<4
  Downloading flask-3.1.1-py3-none-any.whl (103 kB)
Collecting gunicorn<24
  Downloading gunicorn-23.0.0-py3-none-any.whl (85 kB)
Collecting markdown<4,>=3.3
  Downloading markdown-3.8-py3-none-any.whl (106 kB)
Collecting alembic!=1.10.0,<2
  Downloading alembic-1.16.1-py3-none-any.whl (242 kB)
Collecting mlflow-skinny==2.22.1
  Downloading mlflow_skinny-2.22.1-py3-none-any.whl (6.3 MB)
Collecting gitpython<4,>=3.1.9
  Downloading GitPython-3.1.44-py3-none-any.whl (207 kB)
Collecting sqlparse<1,>=0.4.0
  Downloading sqlparse-0.5.3-py3-none-any.whl (44 kB)
Collecting opent

In [0]:
# Import all the necessary libraries

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when
from pyspark.ml.feature import StringIndexer
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.mllib.evaluation import RankingMetrics

import mlflow
import logging
import mlflow.spark

from pyspark.sql.functions import col, countDistinct, count, when, avg, explode, collect_list, log1p

from pyspark.sql.functions import sum as sum_

In [0]:
logging.getLogger("mlflow").setLevel(logging.ERROR)
mlflow.pyspark.ml.autolog()


#### Data Overview

The dataset used for this project is `steam-200k.csv`, which contains 200,000 records of user-game interactions collected from the Steam platform. The attributes are as follows:

- **user_id**: A unique identifier for each user.
- **game**: The title of the game involved in the interaction.
- **behaviour**: The action performed by the user towards the game. This is either `purchase` or `play`.
- **value**: For `purchase` interactions, this is set to 1, while for `play` interactions, this is set to the number of hours played.

The first three attributes should make up a primary key for this table, but there might be anomalies for games with downloadable content or games which have been gifted to friends. These interactions are treated as implicit feedback as there are no ratings given by the users.

Having an overview of the data ensures that we know what to look for in the exploratory data analysis section.


In [0]:
df = spark.read.csv("dbfs:/FileStore/tables/steam_200k.csv", inferSchema=True)
df = df.withColumnRenamed("_c0", "user_id") \
       .withColumnRenamed("_c1", "game") \
       .withColumnRenamed("_c2", "behaviour") \
       .withColumnRenamed("_c3", "value")


In [0]:
df.head(10)


Out[4]: [Row(user_id=151603712, game='The Elder Scrolls V Skyrim', behaviour='purchase', value=1.0),
 Row(user_id=151603712, game='The Elder Scrolls V Skyrim', behaviour='play', value=273.0),
 Row(user_id=151603712, game='Fallout 4', behaviour='purchase', value=1.0),
 Row(user_id=151603712, game='Fallout 4', behaviour='play', value=87.0),
 Row(user_id=151603712, game='Spore', behaviour='purchase', value=1.0),
 Row(user_id=151603712, game='Spore', behaviour='play', value=14.9),
 Row(user_id=151603712, game='Fallout New Vegas', behaviour='purchase', value=1.0),
 Row(user_id=151603712, game='Fallout New Vegas', behaviour='play', value=12.1),
 Row(user_id=151603712, game='Left 4 Dead 2', behaviour='purchase', value=1.0),
 Row(user_id=151603712, game='Left 4 Dead 2', behaviour='play', value=8.9)]

#### Exploratory Data Analysis (EDA)

Verifying the validity of the data is crucial before moving on to developing the recommender system. Firstly, we discovered some basic information about the data, specifically around the user_ids and games. 

The dataset contains:
- **Number of unique users**: 12393
- **Number of unique games**: 5155
- **Distribution of behaviour**: ~65% `purchase` entries and ~35% `play` entries  

This final point was expected, as a game has to be purchased before it is played on.

We then split the dataset into two separate dataframes: one where the behaviour was `play` and one for `purchase`. This would allow us to see how users interact with games, and how these interactions might become more apparent during preprocessing and modelling.

Using the ‘play’ dataframe, we checked for the top games by average playtime. The top two were Eastside Hockey Manager and Baldur’s Gate II, with many Football Manager titles taking up the spots from 3-12. A good recommender system might bundle these manager titles together for specific users.

To validate data integrity of the dataset, we then searched for duplicate rows. As we touched upon earlier, if a row has the same user_id, game name, and behaviour, then it is a duplicate row and should be analysed further.

- **719** duplicate rows across the entire df
- **12** of these duplicate rows were where ‘behaviour’ = ‘play’
- **707** duplicate purchase entries

This could be explained by bundle purchases including downloadable content, or it might be a problem with the Steam software logging redundant entries. I made the decision to remove all of the duplicate purchase entries while keeping the multiple play entries, as these represent actual user engagement. This would remove 707 rows, and prepare the newly cleaned dataset for modelling.


In [0]:
# Number of unique users
num_users = df.select("user_id").distinct().count()
print(f"Number of unique users: {num_users}")

# Number of unique games
num_games = df.select("game").distinct().count()
print(f"Number of unique games: {num_games}")

# Proportion of 'purchase' vs 'play'
behavior_counts = df.groupBy("behaviour").count()
behavior_counts.show(10)

Number of unique users: 12393
Number of unique games: 5155
+---------+------+
|behaviour| count|
+---------+------+
| purchase|129511|
|     play| 70489|
+---------+------+



In [0]:
# Filter only "play" and "purchase" behaviour
play_df = df.filter(df.behaviour == "play")
purchase_df = df.filter(df.behaviour == "purchase")

In [0]:
# Group by game and calculate average playtime
avg_playtime = (
    play_df.groupBy("game")
           .agg(avg("value").alias("avg_playtime"))
           .orderBy(col("avg_playtime").desc())
)

# Show top 20 most played (on average)
avg_playtime.show(20, truncate=False)


+----------------------------------------------+------------------+
|game                                          |avg_playtime      |
+----------------------------------------------+------------------+
|Eastside Hockey Manager                       |1295.0            |
|Baldur's Gate II Enhanced Edition             |475.2555555555556 |
|FIFA Manager 09                               |411.0             |
|Perpetuum                                     |400.975           |
|Football Manager 2014                         |391.9846153846154 |
|Football Manager 2012                         |390.45316455696195|
|Football Manager 2010                         |375.04857142857145|
|Football Manager 2011                         |365.7032258064516 |
|Freaking Meatbags                             |331.0             |
|Out of the Park Baseball 16                   |330.4             |
|Football Manager 2015                         |315.364935064935  |
|Football Manager 2013                         |

In [0]:
# Group by the three key columns and count occurrences
df.groupBy("user_id", "game", "behaviour") \
  .count() \
  .filter("count > 1") \
  .limit(10).display()

user_id,game,behaviour,count
11373749,Sid Meier's Civilization IV Warlords,purchase,2
2259650,Grand Theft Auto Vice City,purchase,2
164561444,Sid Meier's Civilization IV Beyond the Sword,purchase,2
2259650,Sid Meier's Civilization IV,purchase,2
81585721,Grand Theft Auto III,purchase,2
31733621,Sid Meier's Civilization IV Colonization,purchase,2
84513749,Sid Meier's Civilization IV Colonization,purchase,2
64787956,Grand Theft Auto Vice City,purchase,2
100351493,Sid Meier's Civilization IV Warlords,purchase,2
33865373,Grand Theft Auto III,purchase,2


In [0]:
# Group play_df by user_id and game, then count
duplicate_plays = play_df.groupBy("user_id", "game") \
                         .count() \
                         .filter("count > 1")

# Join to original play_df to see the actual duplicate rows
duplicate_play_rows = play_df.join(duplicate_plays.drop("count"), on=["user_id", "game"], how="inner")

# Show results
duplicate_play_rows.show(5, truncate=False)

+---------+----------------------------+---------+-----+
|user_id  |game                        |behaviour|value|
+---------+----------------------------+---------+-----+
|118664413|Grand Theft Auto San Andreas|play     |1.9  |
|118664413|Grand Theft Auto San Andreas|play     |0.2  |
|50769696 |Grand Theft Auto San Andreas|play     |10.9 |
|50769696 |Grand Theft Auto San Andreas|play     |3.1  |
|71411882 |Grand Theft Auto III        |play     |1.1  |
+---------+----------------------------+---------+-----+
only showing top 5 rows



In [0]:
# Group purchase_df by user_id and game, then count
duplicate_purchases = purchase_df.groupBy("user_id", "game") \
                         .count() \
                         .filter("count > 1")

# Join to original purchase_df to see the actual duplicate rows
duplicate_purchase_rows = purchase_df.join(duplicate_purchases.drop("count"), on=["user_id", "game"], how="inner")

# Show results
duplicate_purchase_rows.show(5, truncate=False)

+--------+--------------------------------------------+---------+-----+
|user_id |game                                        |behaviour|value|
+--------+--------------------------------------------+---------+-----+
|11373749|Sid Meier's Civilization IV                 |purchase |1.0  |
|11373749|Sid Meier's Civilization IV                 |purchase |1.0  |
|11373749|Sid Meier's Civilization IV Beyond the Sword|purchase |1.0  |
|11373749|Sid Meier's Civilization IV Beyond the Sword|purchase |1.0  |
|11373749|Sid Meier's Civilization IV Warlords        |purchase |1.0  |
+--------+--------------------------------------------+---------+-----+
only showing top 5 rows



In [0]:
# Drop duplicate purchases where user_id + game are the same
clean_purchase_df = purchase_df.dropDuplicates(["user_id", "game"])

# Combine clean purchases back with original plays
df_cleaned = clean_purchase_df.union(play_df)

In [0]:
df_cleaned.describe().show(10)

+-------+-------------------+----------------+---------+------------------+
|summary|            user_id|            game|behaviour|             value|
+-------+-------------------+----------------+---------+------------------+
|  count|             199293|          199293|   199293|            199293|
|   mean| 1.03718100450071E8|           140.0|     null|17.934246561595128|
| stddev|7.212047654050417E7|             0.0|     null|138.29795241565554|
|    min|               5250|     007 Legends|     play|               0.1|
|    max|          309903146|theHunter Primal| purchase|           11754.0|
+-------+-------------------+----------------+---------+------------------+



#### Data Preprocessing and Feature Engineering

To prepare the dataset for training a filtering model, we need to construct a singular rating which will act as the combined value for each user-game interaction. We are relying on the implicit feedback mentioned before, which includes playtime and purchase habits. 

A new column called `rating` was created, and the value was calculated as follows:

- If the user purchased the game, the rating is set to 1, indicating clear interest in the game.
- If the user played the game, a logarithmic transformation `log1p` is applied to the playtime in hours. This transformation:
  - compresses very large values, reducing the effect of outliers
  - preserves the structure of engagement (more hours still gives a higher rating)

After this change, the new dataframe will have adjusted ratings for each combination of user_id and game, which reflect the user’s interest in the game. 


In [0]:
ratings = df_cleaned.withColumn("rating",
    when(col("behaviour")=="purchase", 1.0)
   .otherwise(log1p(col("value")))
).select("user_id", "game", "rating")
ratings.cache()
ratings.describe("rating").show(10)


+-------+-------------------+
|summary|             rating|
+-------+-------------------+
|  count|             199293|
|   mean| 1.3724203573639617|
| stddev| 1.0847619258183574|
|    min|0.09531017980432487|
|    max|   9.37203396097417|
+-------+-------------------+



#### Indexing the `game` Column

Before training the ALS model, we must convert the important categorical features (such as game names) into numerical IDs, as ALS requires integer-based identifiers. That meant employing the `StringIndexer` on the `game` column, encoding each game into a newly created `game_id`. The `game_id` column was then cast to integer type to ensure that it was compatible with ALS.

We then checked the first 5 rows of this new dataframe `indexed` which had the game names, the game IDs, and the ratings created before.

In [0]:

game_indexer = StringIndexer(inputCol="game", outputCol="game_id", handleInvalid="skip")
indexed = game_indexer.fit(ratings).transform(ratings)
indexed = indexed.withColumn("game_id", col("game_id").cast("int"))


In [0]:
indexed.head(5)

Out[15]: [Row(user_id=76767, game='Half-Life', rating=1.0, game_id=45),
 Row(user_id=298950, game='Natural Selection 2', rating=1.0, game_id=169),
 Row(user_id=577614, game='Day of Defeat', rating=1.0, game_id=28),
 Row(user_id=975449, game='Borderlands', rating=1.0, game_id=103),
 Row(user_id=975449, game='Swords and Soldiers HD', rating=1.0, game_id=1610)]

#### Aggregating Ratings

The final step before training the recommender system was to combine the ratings for `purchase` and `play` for each combination of `user_id` and `game_id` into one unified value. This was done simply by using the `sum` function to add the two ratings (1 for purchase, a logarithmic rating for play) into one total score. Some users had multiple records for playing the same game multiple times (as we noted before), and so this justified the use of aggregation.

The new dataframe, `aggRatings`, consisted of the `user_id`, `game_id`, and `rating`, where the user_id and game_id would only have one value for each combination. All of these were numerical values, and could now be passed into the ALS algorithm for collaborative filtering.


In [0]:
aggRatings = (indexed
  .groupBy("user_id", "game_id")
  .agg(sum_("rating").alias("rating"))
)

aggRatings.limit(10).display()


user_id,game_id,rating
57103808,958,2.88706964903238
110906645,4,4.025291075795535
130950166,0,1.0953101798043248
235692659,114,2.6094379124341005
99264709,344,2.568615917913845
11403772,32,3.681021528714291
109175936,104,7.07993319509559
118664413,118,2.2470322937863827
92593907,487,1.0
108454875,385,3.2512917986064958


#### Training the ALS Model and Hyperparameter Tuning

Here, we train the collaborative filtering model, with the aim of retrieving a list of games for each user tailored to their specific profile.

Firstly, we split the dataset into a training set and a test set using a fixed random seed. The test set takes up 20% of the dataset.

We then used the Alternating Least Squares algorithm from Spark MLib, which is well-suited for this type of task where implicit feedback is involved. Hyperparameter tuning is a major part of ensuring the ALS model performs optimally, so we conducted a grid search over several combinations of `alpha` and `regParam`:

- `alpha`: [5.0, 10.0, 20.0]
- `regParam`: [0.01, 0.1]
- `rank` and `maxIter` were fixed (at 6 and 4 respectively) as cycling through these values caused the notebook to run very slowly.

We used root mean squared error (RMSE) as an evaluation metric, and whichever `alpha` and `regParam` gave the lowest score would be stored for use as our values for generating final recommendations. These values turned out to be `alpha` = 20 and `regParam` = 0.01, with RMSE = 2.1161, so these were put into the final ALS algorithm with higher `rank` and `maxIter` to finalise the model.


In [0]:
trainDF, testDF = aggRatings.randomSplit([0.8, 0.2], seed=42)


#### Experiment Tracking with MLflow

To manage and track the experiments, MLflow was integrated into the training process. MLflow would automatically log the parameters and RMSE score for each ALS model run, enabling us to compare results efficiently. We could then identify the best model configuration. Recording these metrics is good practice in tasks like these, where reproducibility is key.


In [0]:
alpha_values = [5.0, 10.0, 20.0]
reg_values = [0.01, 0.1]

best_rmse = float("inf")
best_model = None

evaluator = RegressionEvaluator(
    metricName="rmse", labelCol="rating", predictionCol="prediction"
)

for alpha in alpha_values:
    for reg in reg_values:
        with mlflow.start_run():
            als = ALS(
                userCol="user_id",
                itemCol="game_id",
                ratingCol="rating",
                implicitPrefs=True,
                coldStartStrategy="drop",
                rank=6,
                maxIter=4,
                alpha=alpha,
                regParam=reg
            )
            model = als.fit(trainDF)

            predictions = model.transform(testDF).dropna()
            rmse = evaluator.evaluate(predictions)

            # MLflow logging
            mlflow.log_param("alpha", alpha)
            mlflow.log_param("regParam", reg)
            mlflow.log_param("rank", 6)
            mlflow.log_param("maxIter", 4)
            mlflow.log_metric("rmse", rmse)

            mlflow.spark.log_model(model, "als_model")

            print(f"Alpha: {alpha}, RegParam: {reg} → RMSE: {rmse:.4f}")

            # Save best model parameters
            if rmse < best_rmse:
                best_rmse = rmse
                best_alpha = alpha
                best_reg = reg



Alpha: 5.0, RegParam: 0.01 → RMSE: 2.2521
Alpha: 5.0, RegParam: 0.1 → RMSE: 2.2713
Alpha: 10.0, RegParam: 0.01 → RMSE: 2.1791
Alpha: 10.0, RegParam: 0.1 → RMSE: 2.2031
Alpha: 20.0, RegParam: 0.01 → RMSE: 2.1161
Alpha: 20.0, RegParam: 0.1 → RMSE: 2.1333


In [0]:
als = ALS(
    userCol="user_id",
    itemCol="game_id",
    ratingCol="rating",
    implicitPrefs=True,
    coldStartStrategy="drop",
    rank=10,
    regParam=reg,
    alpha=alpha,
    maxIter=10
)

model = als.fit(trainDF)

In [0]:
predictions = model.transform(testDF).dropna()
predictions.show(10)


+-------+-------+------------------+----------+
|user_id|game_id|            rating|prediction|
+-------+-------+------------------+----------+
|   5250|     34|               1.0|0.94050694|
|  76767|     32| 1.587786664902119| 0.9030746|
|  76767|     61|3.3702437414678603| 0.9001788|
|  86540|      4| 3.856470206220483| 0.9188202|
|  86540|    376|               1.0|0.61156476|
|  86540|    554|               1.0| 0.4846811|
|  86540|   1039|               1.0| 0.3096421|
| 103360|     73|               1.0|0.78958786|
| 229911|     73|               1.0|0.90568924|
| 229911|    127|               1.0| 0.8322181|
+-------+-------+------------------+----------+
only showing top 10 rows



In [0]:
evaluator = RegressionEvaluator(
    metricName="rmse",
    labelCol="rating",
    predictionCol="prediction"
)
rmse = evaluator.evaluate(predictions)
print(f"Test RMSE = {rmse:.4f}")


Test RMSE = 2.0934


#### Generating Recommendations

With the ALS model trained and evaluated, we used it to generate game recommendations for each user. The code returned 10 games for each user in the dataset based on predicted interest.

The output was a nested list of game IDs and predicted ratings. To make the data interpretable in a tabular format, we applied the `explode` function, which created 10 rows per user, each with its own game and rating.

We then selected and renamed the relevant fields to create a new DataFrame, `flat`, which contains:
- `user_id`
- `game_id`
- `predicted_rating`

This structure allows for easier filtering, joining with metadata, and interpretation of results in the following steps.


In [0]:
userRecs = model.recommendForAllUsers(10)

# Explode the nested list of game_ids and ratings
exploded = userRecs.select("user_id", explode("recommendations").alias("rec"))

# Flatten to get user_id, game_id, and rating
flat = exploded.select(
    col("user_id"),
    col("rec.game_id").alias("game_id"),
    col("rec.rating").alias("predicted_rating")
)



#### Mapping Recommendations to Game Titles

To make the generated recommendations readable, we joined the dataframe `flat` with the original game titles from the indexed dataset. This allows the `user_id` to match up with game titles rather than a vague integer `game_id`.

The resulting dataframe, `prettyRecs`, contains `user_id`, `game` and `predicted_rating`, all of which are key features in the recommender system. This final structure enables easy interpretation of the data. The data was ordered by `user_id` and then `predicted_rating` in order to group games by `user_id`, and then rank the games from 1 to 10.

Initial analysis of the results reveals that certain game franchises appear very frequently appear the top recommendations across users. This includes popular titles such as ‘Counter Strike’, ‘Half-Life’, and ‘Portal’.


In [0]:
# Select what we need from indexed
game_lookup = indexed.select("game_id", "game").distinct()

# Join to map game_id to game
prettyRecs = flat.join(game_lookup, on="game_id", how="left") \
                 .select("user_id", "game", "predicted_rating") \
                 .orderBy("user_id", col("predicted_rating").desc())

prettyRecs.show(50, truncate=False)



+-------+--------------------------------------------+----------------+
|user_id|game                                        |predicted_rating|
+-------+--------------------------------------------+----------------+
|5250   |Half-Life 2 Lost Coast                      |1.1426936       |
|5250   |Half-Life 2 Deathmatch                      |1.115324        |
|5250   |Half-Life 2                                 |1.1047341       |
|5250   |Portal                                      |1.0443429       |
|5250   |Counter-Strike Source                       |1.0302583       |
|5250   |Half-Life 2 Episode One                     |0.9985522       |
|5250   |Counter-Strike                              |0.9933051       |
|5250   |Counter-Strike Condition Zero               |0.976385        |
|5250   |Counter-Strike Condition Zero Deleted Scenes|0.97527313      |
|5250   |Half-Life 2 Episode Two                     |0.97502095      |
|76767  |Counter-Strike Source                       |1.134582  

#### Identifying Franchise-Based Recommendations: Football Manager Case Study

To explore the types of games being recommended by the model, we filtered the `prettyRecs` dataframe to show titles which contain the phrase ‘Football Manager’. There were 6 editions of this game found in the list of top average playtimes per game, given in the EDA section. Usually, when people engage with one edition of this game, they are more inclined to play the others, which makes the franchise a strong candidate for implicit engagement-based recommendations.

By displaying these entries, we observed that these titles were often clustered within a single user’s top 10 recommendations. For example, `user_id` = 26813952 had six Football Manager games in their top 10 recommended games.

Another block of code was added for extra analysis and observation. Here, we found the users that had at least one Football Manager game in their top 10, and then aggregated these counts to determine the distribution of recommendations. Users with one version of this title usually had 4 or 6 in their recommended top 10, confirming the hypothesis from before.

This analysis highlights how strongly the model associates certain users with one specific franchise, reflecting preference patterns.


In [0]:
# Filter for games that contain the exact phrase "Football Manager"
fm_recs = prettyRecs.filter(col("game").contains("Football Manager"))

# Show the results
fm_recs.limit(10).display()


user_id,game,predicted_rating
18604016,Football Manager 2013,0.98235375
18604016,Football Manager 2012,0.9562315
18604016,Football Manager 2014,0.92788017
18604016,Football Manager 2015,0.9116237
23472549,Football Manager 2013,0.87253237
23472549,Football Manager 2015,0.83456916
23472549,Football Manager 2014,0.77847433
23472549,Football Manager 2012,0.72247535
24444528,Football Manager 2013,0.3320759
24444528,Football Manager 2015,0.3158424


In [0]:
# Count how many Football Manager games were recommended per user
fm_count_by_user = fm_recs.groupBy("user_id").count().withColumnRenamed("count", "fm_count")

# Count how many users had N Football Manager games recommended
distribution = fm_count_by_user.groupBy("fm_count").count().orderBy("fm_count")
distribution.show(10)




+--------+-----+
|fm_count|count|
+--------+-----+
|       1|   61|
|       2|   42|
|       3|   51|
|       4|  118|
|       5|   60|
|       6|  162|
|       7|   32|
|       8|   20|
+--------+-----+



#### Further Model Evaluation

Earlier in the notebook, we used RMSE as a metric to evaluate the model’s predictive accuracy. For recommender systems, it is also important to assess whether the correct items are being predicted near the top of the list. Therefore, three different widely-used ranking metrics were used:

- Precision@10: This measures the proportion of recommended games that match the actual list.
- Mean Average Precision (MAP): Considers both the precision and the order across all positions.
- NDCG@10: Gives higher weight to games found earlier in the list.

The results were given as follows:

- Precision@10: 0.08504941599281232
- MAP: 0.2632924678474174
- NDCG@10: 0.3399809231797672

This provides evidence of a serviceable recommender system, as implicit data can be very noisy and tricky to predict. For large, sparse datasets like this Steam dataset, this is an acceptable performance.


In [0]:
# Top 10 predictions per user
preds = (
    userRecs
    .select("user_id", explode("recommendations").alias("rec"))
    .select("user_id", col("rec.game_id").alias("game_id"))
    .groupBy("user_id")
    .agg(collect_list("game_id").alias("preds"))
)


In [0]:
# Actual games interacted with per user
labels = (
    testDF
    .groupBy("user_id")
    .agg(collect_list("game_id").alias("labels"))
)


In [0]:
# Create RDD of predictions and labels
metrics_rdd = (
    preds
    .join(labels, "user_id")
    .rdd
    .map(lambda row: (row.preds, row.labels))
)

# Evaluate ranking metrics from predictions
metrics = RankingMetrics(metrics_rdd)
print("Precision@10:", metrics.precisionAt(10))
print("MAP:", metrics.meanAveragePrecision)
print("NDCG@10:", metrics.ndcgAt(10))




Precision@10: 0.08504941599281232
MAP: 0.2632924678474174
NDCG@10: 0.3399809231797672


#### Conclusion

In this project, we created a collaborative filtering recommender system using implicit feedback data from Steam. The system was built using ALS and trained on a real dataset containing 200,000 user-game interactions.

Data preprocessing steps were employed, such as deduplicating various rows of data and normalising play hours. Player behaviour was analysed and converted into a singular ‘rating’ value, and categorical features were converted into numeric indices for matrix factorisation.

We then applied hyperparameter tuning and trained the model, before evaluating its performance with different types of metrics such as RMSE and Precision@10. This verified that the model was recommending games that align with user interests.

One case study was that of the Football Manager franchise. The model effectively grouped titles of this series within the same user IDs, reflecting patterns that often take place within gaming culture. The same games would appear at the top of most lists recommended to users, such as ‘Counter Strike’, so there is still room for improvement. The model would benefit from explicit data and more hyperparameter tuning with `rank` and `maxIter`. Nevertheless, the recommender system performed to an acceptable level and there were obvious signs of success, while being a scalable approach.
