<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc" style="margin-top: 1em;"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Setup" data-toc-modified-id="Setup-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Setup</a></span></li><li><span><a href="#Libraries" data-toc-modified-id="Libraries-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Libraries</a></span></li><li><span><a href="#Data" data-toc-modified-id="Data-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Data</a></span></li><li><span><a href="#Evaluation" data-toc-modified-id="Evaluation-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Evaluation</a></span><ul class="toc-item"><li><span><a href="#Without-Reranking" data-toc-modified-id="Without-Reranking-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Without Reranking</a></span></li><li><span><a href="#With-Reranking" data-toc-modified-id="With-Reranking-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>With Reranking</a></span></li></ul></li></ul></div>

# Introduction

This notebook evaluates how the reranking algorithm affects NDCG@10 and the coverage of longtail items in each user's top-10 recommendations.

# Setup

In [1]:
%%capture
%cd ..

# Libraries

In [2]:
import pandas as pd

In [3]:
from typing import List
from pyspark.sql.types import Row

In [4]:
from src import preference_reranker as pr

In [5]:
from lenskit import topn

In [6]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Data

In [7]:
spark = SparkSession.builder.getOrCreate()

In [8]:
test_df = spark.read.parquet("/tmp/ml-20m/test_df.parquet/").toPandas()
test_df.head(5)

    user   item  rating            timestamp
0  68342   4973     4.5  2011-04-28 03:30:45
1  68342   6287     2.5  2011-04-26 02:42:38
2  68342     39     3.0  2011-04-26 02:39:37
3  68342  30810     5.0  2011-04-26 02:37:39
4  68342   8361     4.0  2011-04-26 02:56:28


In [9]:
eval_df = spark.read.parquet("/tmp/ml-20m/evaluation_dataset.parquet") \
    .orderBy("user", "rank")
eval_df.show(5)

+-----+------------------+----+----+
| item|             score|user|rank|
+-----+------------------+----+----+
|  318| 4.000287098418736| 318|   1|
| 2959| 3.866550017363461| 318|   2|
| 7502|3.8327863796647827| 318|   3|
|  296| 3.830272846994498| 318|   4|
|77658|3.8224656579497074| 318|   5|
+-----+------------------+----+----+
only showing top 5 rows



In [10]:
eval_df.selectExpr("count(DISTINCT user)", "min(rank)", "max(rank)").show()

+--------------------+---------+---------+
|count(DISTINCT user)|min(rank)|max(rank)|
+--------------------+---------+---------+
|                 430|        1|      100|
+--------------------+---------+---------+



In [11]:
movie_cat = spark.read.parquet("/tmp/ml-20m/movie_categories.parquet") \
  .withColumnRenamed("movieId", "item") \
  .withColumnRenamed("category", "item_category")

movie_cat.show(5)

+----+-------------+
|item|item_category|
+----+-------------+
| 296|    shorthead|
| 356|    shorthead|
| 318|    shorthead|
| 593|    shorthead|
| 480|    shorthead|
+----+-------------+
only showing top 5 rows



In [12]:
user_pref = spark.read.parquet("/tmp/ml-20m/user_preference.parquet") \
      .withColumnRenamed("userId", "user") 
user_pref.show(5)

+-----+-------------+
| user|longtail_pref|
+-----+-------------+
|69363|         0.05|
|28486|         0.15|
|83970|         0.23|
|38051|         0.13|
|28546|         0.17|
+-----+-------------+
only showing top 5 rows



# Evaluation

Set up a LensKit `RecListAnalysis` to calculate NDCG:

In [13]:
rla = topn.RecListAnalysis()
rla.add_metric(topn.ndcg)
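
For reference, here is a minimal sketch of how NDCG@k can be computed for a single user. It assumes the textbook `log2(rank + 1)` discount and uses the test rating as the gain. The names `dcg_at_k` and `ndcg_at_k` are made up for this illustration, and LensKit's `topn.ndcg` may differ in its discounting and edge-case handling.

```python
import numpy as np


def dcg_at_k(gains, k):
    """Discounted cumulative gain of the first k gains (textbook log2 discount)."""
    gains = np.asarray(gains, dtype=float)[:k]
    # Position 1 is divided by log2(2) = 1, position 2 by log2(3), and so on
    discounts = np.log2(np.arange(2, gains.size + 2))
    return float(np.sum(gains / discounts))


def ndcg_at_k(rec_items, truth, k=10):
    """NDCG@k for one user.

    rec_items: item ids in recommendation order (best first).
    truth: dict mapping the user's test items to their ratings.
    """
    # Items that are not in the test set contribute a gain of zero
    gains = [truth.get(item, 0.0) for item in rec_items]
    # The ideal list places the user's test items in descending rating order
    ideal_dcg = dcg_at_k(sorted(truth.values(), reverse=True), k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy example: one of the two test items appears, at position 2
truth = {4973: 4.5, 39: 3.0}
print(f"{ndcg_at_k([318, 4973, 296], truth, k=10):.4f}")
```

Because the discount grows with position, the same set of items scores differently depending on the order in which they are listed.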

## Without Reranking

NDCG:

In [14]:
df = eval_df.filter("rank <= 10").toPandas() 
print(f"NDCG@10: {rla.compute(df, test_df).ndcg.mean():.4f}")

NDCG@10: 0.0351


Longtail Coverage:

In [15]:
eval_df \
    .filter("rank <= 10") \
    .join(movie_cat, "item", "left") \
    .groupBy("item_category") \
    .agg(F.expr("count(item) AS total_items"),
         F.expr("count(DISTINCT item) AS unique_items")) \
    .show(truncate=False)

+-------------+-----------+------------+
|item_category|total_items|unique_items|
+-------------+-----------+------------+
|longtail     |1250       |27          |
|shorthead    |3050       |29          |
+-------------+-----------+------------+



## With Reranking

For each user, collect the candidate set (every scored item together with its category) and look up the user's longtail preference:

In [16]:
user_details = eval_df \
    .join(movie_cat, "item", "left") \
    .withColumn("movie", F.struct(F.col("item").alias("movie_id"),
                                  F.col("score").alias("base_score"),
                                  F.col("item_category").alias("category"))) \
    .groupBy("user") \
    .agg(F.expr("collect_set(movie) AS candidate_set")) \
    .join(user_pref, "user", "left") \
    .toPandas()

user_details.head(5)

     user                                      candidate_set  longtail_pref
0  122961  [(47, 4.362047692584852, shorthead), (97, 4.32...           0.26
1  119757  [(48780, 3.588963104626058, shorthead), (1201,...           0.42
2   23124  [(750, 4.291495498505641, shorthead), (5368, 4...           0.24
3    2517  [(4011, 4.246780052696041, shorthead), (729, 4...           0.34
4    5148  [(318, 4.227326995556364, shorthead), (94466, ...           0.12


Loop over the rows to construct a reranked recommendation list for each user:

In [17]:
recsize = 10
longtail_weight = 0.8

In [18]:
def pyspark_rows_to_movies(rows: List[Row]) -> List[pr.Movie]:
    """Convert the collected Spark structs into `pr.Movie` objects."""
    return [pr.Movie(row.movie_id, row.base_score, row.category)
            for row in rows]

reranked_recs = []

# Each row of user_details unpacks to (user, candidate_set, longtail_pref)
for user, candidate_set, longtail_pref in user_details.itertuples(name=None, index=False):
    candidate_set = pyspark_rows_to_movies(candidate_set)
    new_recs = pr.construct_reclist(candidate_set=candidate_set,
                                    size=recsize,
                                    longtail_pref=longtail_pref,
                                    longtail_weight=longtail_weight)

    reranked_recs.append(new_recs)

user_details["reranked_recs"] = reranked_recs

user_details.head(5)

     user                                      candidate_set  longtail_pref                                      reranked_recs
0  122961  [(47, 4.362047692584852, shorthead), (97, 4.32...           0.26  [(318, 4.657749438403102, shorthead), (26587, ...
1  119757  [(48780, 3.588963104626058, shorthead), (1201,...           0.42  [(318, 3.915625297514382, shorthead), (77658, ...
2   23124  [(750, 4.291495498505641, shorthead), (5368, 4...           0.24  [(44555, 4.4193419879089415, shorthead), (7765...
3    2517  [(4011, 4.246780052696041, shorthead), (729, 4...           0.34  [(318, 4.5429949618464205, shorthead), (7926, ...
4    5148  [(318, 4.227326995556364, shorthead), (94466, ...           0.12  [(318, 4.227326995556364, shorthead), (77658, ...


Transform `user_details` into `eval_df` form:

In [19]:
df = user_details[["user", "reranked_recs"]] \
    .explode("reranked_recs")

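# Unpack each Movie object into flat columns
df["item"] = df["reranked_recs"].apply(lambda m: m.movie_id)
df["score"] = df["reranked_recs"].apply(lambda m: m.base_score)

df = df[["item", "score", "user"]]
# Caution: ascending=True gives rank 1 to the lowest base score in each list,
# the opposite of eval_df, where rank 1 is the highest-scored item.
df["rank"] = df.groupby("user")["score"].rank("dense", ascending=True)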

df = df.sort_values(by=["user", "rank"])

df.head(5)

       item     score  user  rank
192   26587  3.794953   318   1.0
192    7926  3.801407   318   2.0
192  100553  3.807512   318   3.0
192     858  3.813999   318   4.0
192      50  3.817681   318   5.0
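
The cell above ranks by `base_score` in ascending order, so rank 1 goes to the lowest-scored item in each reranked list, whereas in `eval_df` rank 1 is the highest-scored item. If the reranker returns its list best-first (an assumption, since the internals of `pr.construct_reclist` are not shown in this notebook), an alternative is to take the rank from each item's position in that list. A minimal sketch on toy data, with item labels that are purely illustrative:

```python
import pandas as pd

# Toy stand-in for `user_details`; each list is assumed to already be
# in final recommendation order (best first).
toy_details = pd.DataFrame({
    "user": [1, 2],
    "reranked_recs": [["a", "b", "c"], ["x", "y"]],
})

toy_df = toy_details.explode("reranked_recs") \
    .rename(columns={"reranked_recs": "item"})

# explode keeps the within-list order, so cumcount is the 0-based list position
toy_df["rank"] = toy_df.groupby("user").cumcount() + 1
toy_df = toy_df.reset_index(drop=True)

print(toy_df)
```

If the metric depends on the order of the recommendation list, the two ranking conventions can give different NDCG values, so it is worth checking which one matches the reranker's intent.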


NDCG:

In [20]:
print(f"NDCG@10: {rla.compute(df, test_df).ndcg.mean():.4f}")

NDCG@10: 0.0235


Longtail Coverage:

In [21]:
# Merge the categories once, then compute both counts per category
pd_movie_cat = movie_cat.toPandas()

df.merge(pd_movie_cat, on="item", how="left") \
  .groupby("item_category") \
  .agg(total_items=("item", "size"),
       unique_items=("item", "nunique"))

               total_items  unique_items
item_category
longtail              1698            28
shorthead             2602            30
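
To put the two runs side by side, the short sketch below recomputes the longtail share of recommendation slots and the relative change in NDCG@10 from the figures printed above. The numbers are copied by hand from the cell outputs, and the `results` dictionary and `longtail_share` helper are introduced here only for illustration.

```python
# Figures copied from the outputs above (430 users x 10 slots = 4300 items)
results = {
    "without reranking": {"ndcg": 0.0351, "longtail": 1250, "shorthead": 3050},
    "with reranking": {"ndcg": 0.0235, "longtail": 1698, "shorthead": 2602},
}


def longtail_share(run):
    """Fraction of all recommended slots filled by longtail items."""
    return run["longtail"] / (run["longtail"] + run["shorthead"])


for name, run in results.items():
    print(f"{name}: NDCG@10={run['ndcg']:.4f}, "
          f"longtail share={longtail_share(run):.1%}")

base, reranked = results["without reranking"], results["with reranking"]
ndcg_change = (reranked["ndcg"] - base["ndcg"]) / base["ndcg"]
print(f"Relative NDCG@10 change: {ndcg_change:+.1%}")
```

In this run, reranking raises the longtail share of slots from about 29% to about 39%, while the number of distinct longtail items only moves from 27 to 28 and NDCG@10 drops by roughly a third.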
