# SVD
https://github.com/microsoft/recommenders/blob/main/examples/02_model_collaborative_filtering/surprise_svd_deep_dive.ipynb

## Goal: try to combine the approach for ALS with a new dataset

## Imports + setup

In [1]:
# automatically reload modules that change on disk
%load_ext autoreload
%autoreload 2

import sys
import os
import numpy as np
import surprise
import papermill as pm
import scrapbook as sb
import pandas as pd


import pyspark
from pyspark.ml.recommendation import ALS
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, FloatType, IntegerType, LongType, StructType, StructField
from pyspark.ml.feature import Tokenizer, RegexTokenizer, StopWordsRemover
from pyspark.ml.feature import HashingTF, CountVectorizer, VectorAssembler
from pyspark.sql.window import Window

from recommenders.utils.timer import Timer
from recommenders.datasets import movielens
from recommenders.utils.notebook_utils import is_jupyter
from recommenders.datasets.python_splitters import python_random_split
from recommenders.datasets.spark_splitters import spark_random_split
from recommenders.evaluation.python_evaluation import (rmse, mae, rsquared, exp_var, map_at_k, ndcg_at_k, precision_at_k, 
                                                     recall_at_k, get_top_k_items)
from recommenders.models.surprise.surprise_utils import predict, compute_ranking_predictions
from recommenders.evaluation.spark_evaluation import SparkRatingEvaluation, SparkRankingEvaluation, SparkDiversityEvaluation
from recommenders.utils.spark_utils import start_or_get_spark

print("System version: {}".format(sys.version))
print("Surprise version: {}".format(surprise.__version__))
print("Spark version: {}".format(pyspark.__version__))

System version: 3.7.0 (default, Oct  9 2018, 10:31:47) 
[GCC 7.3.0]
Surprise version: 1.1.1
Spark version: 2.4.8


In [2]:
# Select MovieLens data size: 100k, 1m, 10m, or 20m
MOVIELENS_DATA_SIZE = '100k'

# topk, user, item column names
TOP_K = 20
COL_USER="UserId"
COL_ITEM="MovieId"
COL_RATING="Rating"


In [3]:
spark = start_or_get_spark("ALS PySpark", memory="16g")
spark

In [4]:
spark.conf.set("spark.sql.crossJoin.enabled", "true")

## Load Data

Load the data with the title and genres columns included, since the content-based diversity metrics need them.

In [5]:
data_full = movielens.load_pandas_df(
    size=MOVIELENS_DATA_SIZE,
    header=['UserId', 'MovieId', 'Rating', 'Timestamp'],
    title_col='title',
    genres_col='genres'
)

data_full.head()

100%|██████████| 4.81k/4.81k [00:00<00:00, 7.04kKB/s]


Unnamed: 0,UserId,MovieId,Rating,Timestamp,title,genres
0,196,242,3.0,881250949,Kolya (1996),Comedy
1,63,242,3.0,875747190,Kolya (1996),Comedy
2,226,242,5.0,883888671,Kolya (1996),Comedy
3,154,242,3.0,879138235,Kolya (1996),Comedy
4,306,242,5.0,876503793,Kolya (1996),Comedy


Keep only the columns used by Surprise

In [6]:
data = data_full[['UserId', 'MovieId', 'Rating']]
data.head()

Unnamed: 0,UserId,MovieId,Rating
0,196,242,3.0
1,63,242,3.0
2,226,242,5.0
3,154,242,3.0
4,306,242,5.0


Make pyspark version of data

In [7]:
data_spark=spark.createDataFrame(data_full) 
data_spark.printSchema()
data_spark.show(5)

root
 |-- UserId: long (nullable = true)
 |-- MovieId: long (nullable = true)
 |-- Rating: double (nullable = true)
 |-- Timestamp: long (nullable = true)
 |-- title: string (nullable = true)
 |-- genres: string (nullable = true)

+------+-------+------+---------+------------+------+
|UserId|MovieId|Rating|Timestamp|       title|genres|
+------+-------+------+---------+------------+------+
|   196|    242|   3.0|881250949|Kolya (1996)|Comedy|
|    63|    242|   3.0|875747190|Kolya (1996)|Comedy|
|   226|    242|   5.0|883888671|Kolya (1996)|Comedy|
|   154|    242|   3.0|879138235|Kolya (1996)|Comedy|
|   306|    242|   5.0|876503793|Kolya (1996)|Comedy|
+------+-------+------+---------+------------+------+
only showing top 5 rows



## Train with surprise - copy paste from recommenders

Note that Surprise has a lot of built-in support for cross-validation and grid search, inspired by scikit-learn, but here we will use the tools provided by Recommenders instead.

We start by splitting our data into trainset and testset with the python_random_split function.

In [8]:
train, test = python_random_split(data, 0.75)

Surprise needs to build an internal model of the data. We here use the load_from_df method to build a Dataset object, and then indicate that we want to train on all the samples of this dataset by using the build_full_trainset method.

In [9]:
# 'reader' is being used to get rating scale (for MovieLens, the scale is [1, 5]).
# 'rating_scale' parameter can be used instead for the later version of surprise lib:
# https://github.com/NicolasHug/Surprise/blob/master/surprise/dataset.py
train_set = surprise.Dataset.load_from_df(train, reader=surprise.Reader('ml-100k')).build_full_trainset()
train_set

<surprise.trainset.Trainset at 0x2b53127e0ba8>

The SVD has a lot of parameters. The most important ones are:

- n_factors, which controls the dimension of the latent space (i.e. the size of the vectors $p_u$ and $q_i$). Usually, the quality of the training-set predictions improves as n_factors increases.
- n_epochs, which defines the number of iterations of the SGD procedure.

Note that both parameters also affect the training time.
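
For intuition about what n_factors controls, here is a minimal sketch of the biased matrix-factorization estimate that Surprise documents for SVD, $\hat{r}_{ui} = \mu + b_u + b_i + q_i^T p_u$. The function name and the toy values are hypothetical; they are not learned from MovieLens.

```python
def predict_rating(mu, b_u, b_i, p_u, q_i):
    """Estimate a rating as mu + b_u + b_i + <q_i, p_u>."""
    if len(p_u) != len(q_i):
        raise ValueError("p_u and q_i must both have length n_factors")
    # Dot product of the user and item latent vectors
    interaction = sum(p * q for p, q in zip(p_u, q_i))
    return mu + b_u + b_i + interaction


# Toy example with n_factors = 3 (made-up numbers)
example = predict_rating(
    mu=3.5, b_u=0.2, b_i=-0.1,
    p_u=[0.1, 0.4, -0.2],
    q_i=[0.3, 0.5, 0.1],
)
print(round(example, 2))
```

Each extra factor adds one more term to the dot product, which is why a larger n_factors gives the model more capacity and also more work per SGD step.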

We will here set n_factors to 200 and n_epochs to 30. To train the model, we simply need to call the fit() method.

In [10]:
svd = surprise.SVD(random_state=0, n_factors=200, n_epochs=30, verbose=True)

with Timer() as train_time:
    svd.fit(train_set)

print("Took {} seconds for training.".format(train_time.interval))

Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9
Processing epoch 10
Processing epoch 11
Processing epoch 12
Processing epoch 13
Processing epoch 14
Processing epoch 15
Processing epoch 16
Processing epoch 17
Processing epoch 18
Processing epoch 19
Processing epoch 20
Processing epoch 21
Processing epoch 22
Processing epoch 23
Processing epoch 24
Processing epoch 25
Processing epoch 26
Processing epoch 27
Processing epoch 28
Processing epoch 29
Took 7.270010836422443 seconds for training.


## Predict with spark top k and surprise

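Use a cross join to create all possible user-item pairs with Spark. Note that spark_random_split produces its own train/test split, independent of the python_random_split used to train the SVD model above, so test_df_spark may contain ratings the model was trained on, which can make the evaluation metrics below look optimistic.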

In [11]:
train_df_spark, test_df_spark = spark_random_split(data_spark.select(COL_USER, COL_ITEM, COL_RATING), ratio=0.75, seed=123)
users = train_df_spark.select(COL_USER).distinct()
items = train_df_spark.select(COL_ITEM).distinct()
user_item = users.crossJoin(items)
user_item.show(5)

+------+-------+
|UserId|MovieId|
+------+-------+
|    26|     29|
|    26|    474|
|    26|     26|
|    26|    964|
|    26|   1677|
+------+-------+
only showing top 5 rows



In [12]:
print(users.count())
print(items.count())
print(user_item.count())

943
1636
1542748
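
As a sanity check on those counts, a cross join pairs every user with every item, so its size is the product of the two inputs. The sketch below illustrates this in plain Python on a toy input; `cross_join` is a hypothetical helper, not part of Spark or Recommenders.

```python
from itertools import product


def cross_join(users, items):
    """Return every (user, item) pair, like users.crossJoin(items)."""
    return list(product(users, items))


# 2 users x 3 items -> 6 candidate pairs
toy_pairs = cross_join([26, 63], [29, 474, 964])
print(len(toy_pairs))

# Size expected for the counts printed above: users * items
expected_full_size = 943 * 1636
print(expected_full_size)
```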


Convert user_item to a pandas DataFrame and predict with Surprise

In [13]:
user_item_pd = user_item.toPandas()
user_item_pd.head()

Unnamed: 0,UserId,MovieId
0,26,29
1,26,474
2,26,26
3,26,964
4,26,1677


In [14]:
predictions = predict(svd, user_item_pd, usercol='UserId', itemcol='MovieId')
predictions.head()

Unnamed: 0,UserId,MovieId,prediction
0,26,29,2.412635
1,26,474,3.729531
2,26,26,2.828721
3,26,964,3.157108
4,26,1677,3.127901


Convert user item predictions back to spark for further analysis

In [15]:
dfs_pred=spark.createDataFrame(predictions) 
dfs_pred.printSchema()
dfs_pred.show(5)

root
 |-- UserId: long (nullable = true)
 |-- MovieId: long (nullable = true)
 |-- prediction: double (nullable = true)

+------+-------+------------------+
|UserId|MovieId|        prediction|
+------+-------+------------------+
|    26|     29|2.4126350802716225|
|    26|    474|3.7295314050140655|
|    26|     26| 2.828721184303861|
|    26|    964|  3.15710797823856|
|    26|   1677| 3.127900808925026|
+------+-------+------------------+
only showing top 5 rows



## Get top K

In [16]:
# Remove seen items - Remember we only used training data to create user_item
dfs_pred_exclude_train = dfs_pred.alias("pred").join(
    train_df_spark.alias("train"),
    (dfs_pred[COL_USER] == train_df_spark[COL_USER]) & (dfs_pred[COL_ITEM] == train_df_spark[COL_ITEM]),
    how='outer'
)

top_all = dfs_pred_exclude_train.filter(dfs_pred_exclude_train["train.Rating"].isNull()) \
    .select('pred.' + COL_USER, 'pred.' + COL_ITEM, 'pred.' + "prediction")

print(top_all.count())

window = Window.partitionBy(COL_USER).orderBy(F.col("prediction").desc())
top_k_reco = top_all.select("*", F.row_number().over(window).alias("rank")).filter(F.col("rank") <= TOP_K).drop("rank")
 
print(top_k_reco.count())
top_k_reco.show()

1467632
18860
+------+-------+------------------+
|UserId|MovieId|        prediction|
+------+-------+------------------+
|    26|    100| 4.722966097416943|
|    26|     98| 4.447861951589148|
|    26|    127| 4.412481083524716|
|    26|    169| 4.368625041008343|
|    26|    483| 4.343002776784539|
|    26|    272| 4.316561844743819|
|    26|    318|  4.22089996249464|
|    26|     22| 4.170004295248998|
|    26|     50|4.1654411384184025|
|    26|    183| 4.158751781269825|
|    26|     64| 4.132351259430597|
|    26|    408| 4.125565516468394|
|    26|    427| 4.116743294473558|
|    26|    114| 4.077805546156425|
|    26|    513| 4.055861117290725|
|    26|    132| 4.047984543408269|
|    26|    187| 4.039850247665109|
|    26|     89| 4.035442825815474|
|    26|    199|4.0301437840058005|
|    26|    173| 4.002678974547844|
+------+-------+------------------+
only showing top 20 rows
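
The two steps in the cell above (drop pairs already seen in training, then keep the K highest-scored items per user) can be sketched in plain Python. This is an illustration on toy data, not the Spark implementation; `top_k_unseen` and its inputs are hypothetical names.

```python
from collections import defaultdict


def top_k_unseen(predictions, seen_pairs, k):
    """predictions: iterable of (user, item, score); seen_pairs: set of (user, item)."""
    per_user = defaultdict(list)
    for user, item, score in predictions:
        # Anti-join: skip pairs that appear in the training data
        if (user, item) not in seen_pairs:
            per_user[user].append((item, score))
    # Rank each user's remaining items by score and keep the first k
    return {
        user: sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
        for user, scored in per_user.items()
    }


toy_predictions = [
    (1, 10, 4.9), (1, 20, 4.1), (1, 30, 4.5),
    (2, 10, 3.0), (2, 20, 3.7),
]
toy_seen = {(1, 10)}
toy_top = top_k_unseen(toy_predictions, toy_seen, k=2)
print(toy_top)
```

With 943 users and TOP_K = 20, the same logic yields 943 × 20 = 18,860 rows, which matches the second count printed above.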



## Generate content based version

Extract the movie features (title and genres)

In [17]:
# Get movie features "title" and "genres"
movies = (
    data_spark.groupBy("MovieId", "title", "genres").count()
    .na.drop()  # remove rows with null values
    .withColumn("genres", F.split(F.col("genres"), r"\|"))  # convert to array of genres
    .withColumn("title", F.regexp_replace(F.col("title"), r"[\(),:^0-9]", ""))  # strip parentheses, commas, colons, carets, and digits (drops the year)
    .drop("count")  # remove unused columns
)
movies.show(5)

+-------+--------------------+--------------------+
|MovieId|               title|              genres|
+-------+--------------------+--------------------+
|   1240|Ghost in the Shel...| [Animation, Sci-Fi]|
|    133| Gone with the Wind |[Drama, Romance, ...|
|    468|               Rudy |             [Drama]|
|    497|   Bringing Up Baby |            [Comedy]|
|    294|          Liar Liar |            [Comedy]|
+-------+--------------------+--------------------+
only showing top 5 rows



tokenize title and remove stop words

In [18]:
# tokenize "title" column
title_tokenizer = Tokenizer(inputCol="title", outputCol="title_words")
tokenized_data = title_tokenizer.transform(movies)


# remove stop words
remover = StopWordsRemover(inputCol="title_words", outputCol="text")
clean_data = remover.transform(tokenized_data).drop("title", "title_words")

convert text input into feature vectors

In [19]:
# step 1: perform HashingTF on column "text"
text_hasher = HashingTF(inputCol="text", outputCol="text_features", numFeatures=1024)
hashed_data = text_hasher.transform(clean_data)
hashed_data.show(5)

# step 2: fit a CountVectorizerModel from column "genres".
count_vectorizer = CountVectorizer(inputCol="genres", outputCol="genres_features")
count_vectorizer_model = count_vectorizer.fit(hashed_data)
vectorized_data = count_vectorizer_model.transform(hashed_data)
vectorized_data.show(5)

# step 3: assemble features into a single vector
assembler = VectorAssembler(
    inputCols=["text_features", "genres_features"],
    outputCol="features",
)

feature_data = assembler.transform(vectorized_data).select("MovieId", "features")
feature_data = feature_data.withColumn("MovieId",F.col("MovieId").cast("integer"))
feature_data.printSchema()
feature_data.show(10, False)

+-------+--------------------+--------------------+--------------------+
|MovieId|              genres|                text|       text_features|
+-------+--------------------+--------------------+--------------------+
|   1240| [Animation, Sci-Fi]|[ghost, shell, ko...|(1024,[308,321,52...|
|    133|[Drama, Romance, ...|        [gone, wind]|(1024,[224,462],[...|
|    468|             [Drama]|              [rudy]|  (1024,[774],[1.0])|
|    497|            [Comedy]|    [bringing, baby]|(1024,[408,868],[...|
|    294|            [Comedy]|        [liar, liar]|  (1024,[657],[2.0])|
+-------+--------------------+--------------------+--------------------+
only showing top 5 rows

+-------+--------------------+--------------------+--------------------+--------------------+
|MovieId|              genres|                text|       text_features|     genres_features|
+-------+--------------------+--------------------+--------------------+--------------------+
|   1240| [Animation, Sci-Fi]|[ghost
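
To make the two vectorization steps more concrete, here is a small plain-Python sketch of the ideas behind them: the hashing trick (each token is mapped to a bucket index and counted) and a vocabulary-based count vector. It is an illustration only. Spark uses its own hash function, so bucket indices will differ from the output above, and the vocabulary shown is a made-up example.

```python
import zlib
from collections import Counter


def hashing_tf(tokens, num_features=1024):
    """Map each token to a bucket in [0, num_features) and count occurrences."""
    buckets = (zlib.crc32(tok.encode("utf-8")) % num_features for tok in tokens)
    return dict(Counter(buckets))


def count_vector(genres, vocabulary):
    """Count how often each vocabulary term appears in the genre list."""
    index = {term: pos for pos, term in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    for genre in genres:
        if genre in index:
            vector[index[genre]] += 1
    return vector


# A repeated token lands in the same bucket, so its count is 2
liar_features = hashing_tf(["liar", "liar"])
print(liar_features)

# Hypothetical vocabulary; Spark learns its own ordering when fitting
toy_vocabulary = ["Drama", "Comedy", "Sci-Fi", "Animation"]
print(count_vector(["Animation", "Sci-Fi"], toy_vocabulary))
```

The hashed representation needs no fitted vocabulary but can suffer collisions, whereas the count vector is exact for a small, closed set such as genres.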

## Evaluation Metrics definitions

In [20]:
def get_ranking_results(ranking_eval):
    metrics = {
        "Precision@k": ranking_eval.precision_at_k(),
        "Recall@k": ranking_eval.recall_at_k(),
        "NDCG@k": ranking_eval.ndcg_at_k(),
        "Mean average precision": ranking_eval.map_at_k()
      
    }
    return metrics  

def get_diversity_results(diversity_eval):
    metrics = {
        "catalog_coverage":diversity_eval.catalog_coverage(),
        "distributional_coverage":diversity_eval.distributional_coverage(), 
        "novelty": diversity_eval.novelty(), 
        "diversity": diversity_eval.diversity(), 
        "serendipity": diversity_eval.serendipity()
    }
    return metrics

def get_rating_results(rating_eval):
    metrics = {
     'rmse': rating_eval.rmse(),
     'mean absolute error' : rating_eval.mae(),
     'R squared': rating_eval.rsquared(),
     'explained variance': rating_eval.exp_var()
    }
    return metrics

def generate_summary(data, algo, k, ranking_metrics, diversity_metrics):
    summary = {"Data": data, "Algo": algo, "K": k}

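    if ranking_metrics is None:
        # Fall back to NaN placeholders, using the same keys as get_ranking_results
        ranking_metrics = {
            "Precision@k": np.nan,
            "Recall@k": np.nan,
            "NDCG@k": np.nan,
            "Mean average precision": np.nan,
        }
    # dict.update adds new keys and overwrites existing ones
    summary.update(ranking_metrics)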
    summary.update(diversity_metrics)
    return summary

diversity_cols = ["catalog_coverage", "distributional_coverage","novelty", "diversity", "serendipity"]
ranking_cols = [ "Precision@k", "Recall@k", "NDCG@k", "Mean average precision"]
rating_cols = ['rmse', 'mean absolute error', 'R squared','explained variance']
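
For reference, the sketch below shows the common per-user definitions of Precision@k and Recall@k on a toy list. It is meant as an illustration of what the ranking metrics measure, not as the exact computation performed by SparkRankingEvaluation; the function name and data are hypothetical.

```python
def precision_recall_at_k(recommended, relevant, k):
    """recommended: ranked list of items; relevant: set of ground-truth items."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# 2 of the 4 recommended items are relevant; 3 relevant items exist in total
toy_precision, toy_recall = precision_recall_at_k(
    recommended=[100, 98, 127, 169],
    relevant={98, 169, 50},
    k=4,
)
print(toy_precision, toy_recall)
```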

## Evaluation - collaborative based

### Diversity

In [21]:
#calculate
svd_diversity_eval = SparkDiversityEvaluation(
    train_df = train_df_spark, 
    reco_df = top_k_reco,
    col_user = COL_USER, 
    col_item = COL_ITEM
)

svd_diversity_metrics = get_diversity_results(svd_diversity_eval)

#display
diversity_results = pd.DataFrame(columns=diversity_cols)
diversity_results.loc[1] = svd_diversity_metrics
diversity_results.head()



Unnamed: 0,catalog_coverage,distributional_coverage,novelty,diversity,serendipity
1,0.339853,7.533735,9.419237,0.740317,0.798199
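
To help read the catalog_coverage value, here is a minimal sketch that assumes the metric is the share of distinct catalog items that appear in at least one recommendation list. The helper below is hypothetical and operates on toy data.

```python
def catalog_coverage(recommendations, catalog_items):
    """recommendations: iterable of (user, item); catalog_items: items in the training data."""
    catalog = set(catalog_items)
    recommended = {item for _, item in recommendations}
    # Only count recommended items that belong to the catalog
    return len(recommended & catalog) / len(catalog)


toy_recommendations = [(1, 10), (2, 10), (2, 20)]
toy_coverage = catalog_coverage(toy_recommendations, [10, 20, 30, 40])
print(toy_coverage)
```

Under that assumption, a coverage of about 0.34 over the 1,636 training items would correspond to roughly 556 distinct movies being recommended to at least one user.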


### Ranking

In [22]:
#calculate
svd_ranking_eval = SparkRankingEvaluation(
    test_df_spark, 
    top_all, 
    k = TOP_K, 
    col_user="UserId", 
    col_item="MovieId",
    col_rating="Rating", 
    col_prediction="prediction",
    relevancy_method="top_k"
)

svd_ranking_metrics = get_ranking_results(svd_ranking_eval)

#display
ranking_results = pd.DataFrame(columns=ranking_cols)
ranking_results.loc[1] = svd_ranking_metrics
ranking_results.head()

Unnamed: 0,Precision@k,Recall@k,NDCG@k,Mean average precision
1,0.157067,0.121627,0.208343,0.057682


### Rating

In [23]:
#calculate
svd_rating_eval = SparkRatingEvaluation(
    test_df_spark, 
    top_all,  
    col_user="UserId", 
    col_item="MovieId",
    col_rating="Rating", 
    col_prediction="prediction")

svd_rating_results = get_rating_results(svd_rating_eval)

#display
rating_res = pd.DataFrame(columns=rating_cols)
rating_res.loc[1] = svd_rating_results
rating_res.head()

Unnamed: 0,rmse,mean absolute error,R squared,explained variance
1,0.584161,0.424652,0.72836,0.728374
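
As a reminder of what the first two rating metrics measure, the sketch below computes RMSE and MAE on a toy set of ratings using only the standard library. The function names are prefixed with `toy_` so they do not shadow the `rmse` and `mae` imported from Recommenders.

```python
import math


def toy_rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def toy_mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    absolute = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return sum(absolute) / len(absolute)


toy_true = [3.0, 5.0, 4.0]
toy_pred = [2.5, 4.5, 4.0]
print(toy_rmse(toy_true, toy_pred), toy_mae(toy_true, toy_pred))
```

RMSE penalizes large errors more heavily than MAE, so the gap between the two gives a rough sense of how uneven the errors are.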


## Evaluation - content based

### Diversity

In [24]:
#calculate
svd_diversityContent_eval = SparkDiversityEvaluation(
    train_df = train_df_spark, 
    reco_df = top_k_reco,
    item_feature_df = feature_data, 
    item_sim_measure="item_feature_vector",
    col_user = COL_USER, 
    col_item = COL_ITEM
)

svd_diversityContent_results = get_diversity_results(svd_diversityContent_eval)

#display
content_diversity_results = pd.DataFrame(columns=diversity_cols)
content_diversity_results.loc[1] = svd_diversityContent_results
content_diversity_results.head()

Unnamed: 0,catalog_coverage,distributional_coverage,novelty,diversity,serendipity
1,0.339853,7.533735,9.419237,0.877599,0.884189
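
Only diversity and serendipity change here, because they depend on item-to-item similarity while the coverage and novelty values do not. The sketch below assumes that item similarity is the cosine similarity between feature vectors and that a list's diversity is one minus the average pairwise similarity of its items. It is a toy illustration with hypothetical names, not the Recommenders implementation.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def intra_list_diversity(item_vectors):
    """One minus the mean pairwise cosine similarity within a recommendation list."""
    pairs = list(combinations(item_vectors, 2))
    if not pairs:
        return 0.0
    mean_similarity = sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)
    return 1 - mean_similarity


# Two identical items and one orthogonal item: similarities are 0, 1, 0
toy_diversity = intra_list_diversity([[1, 0], [0, 1], [1, 0]])
print(toy_diversity)
```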


In [25]:
print('Content-based diversity')
content_diversity_results.head()

Content-based diversity


Unnamed: 0,catalog_coverage,distributional_coverage,novelty,diversity,serendipity
1,0.339853,7.533735,9.419237,0.877599,0.884189


## Display all results

In [26]:
print("Results for K={} of the Movielens size:{} ".format(TOP_K,MOVIELENS_DATA_SIZE))
print()

Results for K=20 of the Movielens size:100k 



In [27]:
print("DIVERSITY METRICS - COLLABORATIVE")
print()
diversity_results.head()

DIVERSITY METRICS - COLLABORATIVE



Unnamed: 0,catalog_coverage,distributional_coverage,novelty,diversity,serendipity
1,0.339853,7.533735,9.419237,0.740317,0.798199


In [28]:
print("DIVERSITY METRICS - CONTENT BASED")
print()
content_diversity_results.head()

DIVERSITY METRICS - CONTENT BASED



Unnamed: 0,catalog_coverage,distributional_coverage,novelty,diversity,serendipity
1,0.339853,7.533735,9.419237,0.877599,0.884189


In [29]:
print("RATING METRICS")
print()
rating_res.head()

RATING METRICS



Unnamed: 0,rmse,mean absolute error,R squared,explained variance
1,0.584161,0.424652,0.72836,0.728374


In [30]:
print("RANKING METRICS")
print()
ranking_results.head()

RANKING METRICS



Unnamed: 0,Precision@k,Recall@k,NDCG@k,Mean average precision
1,0.157067,0.121627,0.208343,0.057682
