Music Recommendation System
===========================

In general, [recommender
systems](https://en.wikipedia.org/wiki/Recommender_system) are
algorithms designed to suggest relevant items or products to users. Over
the last decades they have attracted much interest because they can
improve the user experience while also generating more profit for
companies. Today, these systems are found in several well-known services
such as Netflix, Amazon, YouTube and Spotify. As an indicator of how
valuable these algorithms are for such companies: back in 2006 Netflix
announced the open [Netflix Prize
Competition](https://www.netflixprize.com/) for the best algorithm to
predict users' movie ratings based on collected data. The team whose
algorithm improved on the state-of-the-art performance by at least 10%
was promised a prize of 1,000,000 USD. **In this notebook we are going
to develop a system for recommending musical artists to users given
their listening history**. We will implement a model related to the
matrix factorization discussed in the preceding notebook.

Problem Setting
---------------

We let $U$ be the set containing all $m$ users and let $I$ be the set
containing all $n$ available items. Now, we introduce the matrix $R \in
\mathbb{R}^{m \times n}$ whose elements $r_{ui}$ encode a possible
interaction between user $u \in U$ and item $i \in I$. This matrix is
often very sparse because most of the possible user-item interactions
are never observed. Depending on the type of information encoded in the
interaction matrix $R$, one usually refers to either *explicit* or
*implicit* data.
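
To make the notion of a sparse interaction matrix concrete, here is a minimal sketch in plain Scala. The toy values, the names `observed` and `r`, and the choice of a `Map` as storage are all assumptions made for illustration; they are not part of the dataset used later.

```scala
// Toy interaction data: (user, item) -> observed value r_ui.
// All values here are made up for illustration.
val observed: Map[(Int, Int), Double] = Map(
  (0, 0) -> 12.0,
  (0, 2) -> 3.0,
  (1, 1) -> 7.0,
  (2, 2) -> 1.0
)
val m = 3 // number of users
val n = 4 // number of items

// Entry r_ui of R; unobserved interactions are treated as 0.
def r(u: Int, i: Int): Double = observed.getOrElse((u, i), 0.0)

// Fraction of the m x n entries that were never observed.
val sparsity: Double = 1.0 - observed.size.toDouble / (m * n)
println(f"sparsity = $sparsity%.3f")
```

Only the observed entries are stored, which is what makes very sparse matrices manageable in practice.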

For explicit data, $r_{ui}$ contains information directly related to
user $u$'s preference for item $i$, e.g. movie ratings. In the case of
implicit data, $r_{ui}$ contains indirect information about a user's
preference for an item, obtained by observing past user behavior.
Examples could be the number of times a user played a song or visited a
webpage. Note that in the implicit case we lack information about items
that the user dislikes: if a user of a music service has not played any
songs from a particular artist, it could mean either that the user
simply doesn't like that artist or that the user hasn't encountered the
artist yet but would like them once discovered.

Given the observations in the interaction matrix $R$, we would like our
model to suggest unseen items relevant to the users.

Collaborative Filtering
-----------------------

Broadly speaking, recommender algorithms can be divided into two
categories: [content
based](https://en.wikipedia.org/wiki/Recommender_system#Content-based_filtering)
and [collaborative
filtering](https://en.wikipedia.org/wiki/Collaborative_filtering) (CF).
Here, we will just focus on collaborative filtering which is a technique
using patterns of user-item interactions and disregarding any additional
information about the user or item themselves. It is based on the
assumption that if a user similar to you likes an item, then there is a
high probability that you also like that particular item. In other
words, similar users have similar taste.

There are different approaches to CF, and we have chosen a latent factor
model approach inspired by low-rank SVD factorization of matrices. The
aim is to uncover latent features explaining the observed $r_{ui}$
values. Each user $u$ is associated with a user-feature vector $x_u \in
\mathbb{R}^f$ and similarly each item $i$ is associated with an
item-feature vector $y_i \in \mathbb{R}^f$. We then want the dot
products $x_u^Ty_i$ to explain the observed $r_{ui}$ values. With all
user and item features at hand in the latent space $\mathbb{R}^f$, we
can estimate user $u$'s preference for an unseen item $j$ by simply
computing $x_u^Ty_j$.

We transform the problem of finding the vectors $x_u, y_i$ into a
minimization problem as suggested in the paper [Collaborative Filtering
for Implicit Feedback
Datasets](https://www.researchgate.net/publication/220765111_Collaborative_Filtering_for_Implicit_Feedback_Datasets).
First we introduce the binarized quantity $p_{ui}$ defined by:

$$p_{ui}=\begin{cases}1 & \text{if } r_{ui}>0, \\ 0 & \text{if } r_{ui}=0,\end{cases}$$

encoding whether user $u$ has interacted with and supposedly likes item
$i$. However, our confidence that user $u$ likes item $i$ given that
$p_{ui}=1$ should vary with the actual $r_{ui}$ value. As an example, we
would be more confident that a user likes an artist they have listened
to hundreds of times than an artist they have played only once.
Therefore we introduce the confidence $c_{ui}$:

$$c_{ui}=1+\alpha r_{ui},$$

where $\alpha$ is a hyperparameter. From the above equation we can see
that the confidence for an unobserved user-item interaction defaults to 1.
Now we formulate the minimization problem:

$$\min_{X,Y}\sum_{u\in U,\, i \in I}c_{ui}(p_{ui}-x_u^Ty_i)^2+\lambda\Big(\sum_{u\in U}\|x_u\|^2+\sum_{i\in I}\|y_i\|^2\Big),$$

where $X \in \mathbb{R}^{m \times f}$ and $Y \in \mathbb{R}^{n \times f}$
are matrices holding the $x_u$ and $y_i$ as rows, respectively. Notice
the regularization term in the objective function, which penalizes large
feature vectors.
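
As a small illustration of how the pieces of the objective fit together, the following sketch evaluates it for a tiny dense example in plain Scala. All numbers, the values of `alpha` and `lambda`, and the helper names `dot`, `p` and `c` are assumptions chosen for readability, not values used elsewhere in this notebook.

```scala
// Hypothetical toy setup: 2 users, 3 items, f = 2 latent features.
val alpha = 1.0   // confidence hyperparameter (assumed value)
val lambda = 0.1  // regularization strength (assumed value)

// Observed interactions r_ui as a dense m x n array (made-up numbers).
val r: Array[Array[Double]] = Array(
  Array(4.0, 0.0, 1.0),
  Array(0.0, 9.0, 0.0)
)
// Latent vectors x_u (one per user) and y_i (one per item), made up.
val x: Array[Array[Double]] = Array(Array(1.0, 0.0), Array(0.0, 1.0))
val y: Array[Array[Double]] = Array(Array(1.0, 0.0), Array(0.0, 1.0), Array(0.5, 0.5))

def dot(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (s, t) => s * t }.sum

def p(rui: Double): Double = if (rui > 0) 1.0 else 0.0 // binarized preference
def c(rui: Double): Double = 1.0 + alpha * rui          // confidence

// Confidence-weighted squared error summed over ALL (u, i) pairs, observed or not.
val dataTerm: Double = (for {
  u <- r.indices
  i <- r(u).indices
} yield c(r(u)(i)) * math.pow(p(r(u)(i)) - dot(x(u), y(i)), 2)).sum

// L2 penalty on all latent vectors.
val penalty: Double = lambda * (x.map(v => dot(v, v)).sum + y.map(v => dot(v, v)).sum)

val objective: Double = dataTerm + penalty
println(s"data term = $dataTerm, penalty = $penalty, objective = $objective")
```

Note that the sum runs over every user-item pair, including the unobserved ones, which enter with confidence 1 and target 0.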

Dataset
-------

For this application we use a
[dataset](https://grouplens.org/datasets/hetrec-2011/) containing
user-artist listening information from the online music service
[Last.fm](http://www.last.fm).

One of the available files contains triplets (`userID`, `artistID`,
`play_count`) describing the number of times a user has played an
artist. Another file contains tuples (`artistID`, `name`) mapping the
artist IDs to actual artist names. There are a total of 92834 (`userID`,
`artistID`, `play_count`) triplets containing 1892 unique `userID`s and
17632 unique `artistID`s. Since the observations in the dataset do not
contain direct information about artist preferences, this is an implicit
dataset in the sense described above. Based on this dataset we want
our model to give artist recommendations to the users.

In [None]:
import spark.implicits._
import org.apache.spark.sql.functions._

  

>     import spark.implicits._
>     import org.apache.spark.sql.functions._

  

**Let's load the data!**

In [None]:
// Load the (userID, artistID, play_count) triplets.
val fileName_data="dbfs:/FileStore/tables/project4/hetrec2011-lastfm-2k/user_artists.dat"
val df_raw = spark.read.format("csv").option("header", "true").option("delimiter", "\t").option("inferSchema","true").load(fileName_data).withColumnRenamed("weight","play_count")
df_raw.cache()
df_raw.orderBy(rand()).show(5)

// Load the (artistID, name) tuples.
val fileName_names="dbfs:/FileStore/tables/project4/hetrec2011-lastfm-2k/artists.dat"
val artist_names = spark.read.format("csv").option("header", "true").option("delimiter", "\t").option("inferSchema","true").load(fileName_names).withColumnRenamed("id","artistID").select("artistID","name")
artist_names.cache()
artist_names.show(5)

  

>     +------+--------+----------+
>     |userID|artistID|play_count|
>     +------+--------+----------+
>     |  1553|    2478|       171|
>     |    65|    1858|       254|
>     |   496|       9|       161|
>     |   304|     163|        19|
>     |   747|     507|       626|
>     +------+--------+----------+
>     only showing top 5 rows
>
>     +--------+-----------------+
>     |artistID|             name|
>     +--------+-----------------+
>     |       1|     MALICE MIZER|
>     |       2|  Diary of Dreams|
>     |       3|Carpathian Forest|
>     |       4|     Moi dix Mois|
>     |       5|      Bella Morte|
>     +--------+-----------------+
>     only showing top 5 rows
>
>     fileName_data: String = dbfs:/FileStore/tables/project4/hetrec2011-lastfm-2k/user_artists.dat
>     df_raw: org.apache.spark.sql.DataFrame = [userID: int, artistID: int ... 1 more field]
>     fileName_names: String = dbfs:/FileStore/tables/project4/hetrec2011-lastfm-2k/artists.dat
>     artist_names: org.apache.spark.sql.DataFrame = [artistID: int, name: string]

  

We print some statistics and visualize the raw data.

In [None]:
val n_data = df_raw.count().asInstanceOf[Long].floatValue(); // Number of observations
val n_users = df_raw.agg(countDistinct("userID")).collect()(0)(0).asInstanceOf[Long].floatValue(); // Number of unique users
val n_artists = df_raw.agg(countDistinct("artistID")).collect()(0)(0).asInstanceOf[Long].floatValue(); // Number of unique artists
val sparsity = 1-n_data/(n_users*n_artists) //Sparsity of the data

println("Number of data points: " + n_data)
println("Number of users: " + n_users)
println("Number of artists: " + n_artists)
print("Sparsity:" + sparsity.toString + "\n")


  

>     Number of data points: 92834.0
>     Number of users: 1892.0
>     Number of artists: 17632.0
>     Sparsity:0.9972172
>     n_data: Float = 92834.0
>     n_users: Float = 1892.0
>     n_artists: Float = 17632.0
>     sparsity: Float = 0.9972172

In [None]:
display(df_raw.select("play_count"))

  


In [None]:
display(df_raw.select("play_count").filter($"play_count"<1000))

  


  

We count the total plays and number of unique listeners for each artist.

In [None]:
// Compute some statistics for the artists.
val artist_data_raw = df_raw.groupBy("artistID").agg(count("artistID") as "unique_users",
                                                      sum("play_count") as "total_plays_artist")
// Sort after the join, since a join is not guaranteed to preserve row order.
artist_data_raw.join(artist_names,"artistID").sort(desc("total_plays_artist")).show(5) // Top artists based on total plays
artist_data_raw.join(artist_names,"artistID").sort(desc("unique_users")).show(5) // Top artists based on number of unique listeners

  

>     +--------+------------+------------------+------------------+
>     |artistID|unique_users|total_plays_artist|              name|
>     +--------+------------+------------------+------------------+
>     |     289|         522|           2393140|    Britney Spears|
>     |      72|         282|           1301308|      Depeche Mode|
>     |      89|         611|           1291387|         Lady Gaga|
>     |     292|         407|           1058405|Christina Aguilera|
>     |     498|         399|            963449|          Paramore|
>     +--------+------------+------------------+------------------+
>     only showing top 5 rows
>
>     +--------+------------+------------------+--------------+
>     |artistID|unique_users|total_plays_artist|          name|
>     +--------+------------+------------------+--------------+
>     |      89|         611|           1291387|     Lady Gaga|
>     |     289|         522|           2393140|Britney Spears|
>     |     288|         484|            905423|       Rihanna|
>     |     227|         480|            662116|   The Beatles|
>     |     300|         473|            532545|    Katy Perry|
>     +--------+------------+------------------+--------------+
>     only showing top 5 rows
>
>     artist_data_raw: org.apache.spark.sql.DataFrame = [artistID: int, unique_users: bigint ... 1 more field]

In [None]:
display(artist_data_raw.select("total_plays_artist"))

  


In [None]:
display(artist_data_raw.select("total_plays_artist").filter($"total_plays_artist"<10000))

  


In [None]:
display(artist_data_raw.select("unique_users"))

  


  

We count the total plays and the number of unique artists each user has
listened to.

In [None]:
// Compute statistics for each user.
val user_data_raw = df_raw.groupBy("userID").agg(count("userID") as "unique_artists",
                                                  sum("play_count") as "total_plays_user")
user_data_raw.sort(desc("total_plays_user")).show(5) // Show users with most total plays

  

>     +------+--------------+----------------+
>     |userID|unique_artists|total_plays_user|
>     +------+--------------+----------------+
>     |   757|            50|          480039|
>     |  2000|            50|          468409|
>     |  1418|            50|          416349|
>     |  1642|            50|          388251|
>     |  1094|            50|          379125|
>     +------+--------------+----------------+
>     only showing top 5 rows
>
>     user_data_raw: org.apache.spark.sql.DataFrame = [userID: int, unique_artists: bigint ... 1 more field]

In [None]:
display(user_data_raw.select("total_plays_user"))

  


  

Now we join all statistics into a single dataframe.

In [None]:
// Merge all statistics and data into a single dataframe.
val df_joined = df_raw.join(artist_data_raw, "artistID").join(user_data_raw, "userID").join(artist_names,"artistID").select("userID", "artistID","play_count", "name", "unique_artists","unique_users", "total_plays_user","total_plays_artist")
df_joined.show(5)

  

>     +------+--------+----------+-------------+--------------+------------+----------------+------------------+
>     |userID|artistID|play_count|         name|unique_artists|unique_users|total_plays_user|total_plays_artist|
>     +------+--------+----------+-------------+--------------+------------+----------------+------------------+
>     |     2|      51|     13883|  Duran Duran|            50|         111|          168737|            348919|
>     |     2|      52|     11690|    Morcheeba|            50|          23|          168737|             18787|
>     |     2|      53|     11351|          Air|            50|          75|          168737|             44230|
>     |     2|      54|     10300| Hooverphonic|            50|          18|          168737|             15927|
>     |     2|      55|      8983|Kylie Minogue|            50|         298|          168737|            449292|
>     +------+--------+----------+-------------+--------------+------------+----------------+------------------+
>     only showing top 5 rows
>
>     df_joined: org.apache.spark.sql.DataFrame = [userID: int, artistID: int ... 6 more fields]

  

Collaborative filtering models suffer from the [cold-start
problem](https://yuspify.com/blog/cold-start-problem-recommender-systems/),
meaning they have difficulty making inferences about new users or items
with few observed interactions. Therefore we will filter out artists
with fewer than 20 unique listeners and users that have listened to
fewer than 5 artists.

In [None]:
// Remove artists with fewer than 20 unique users, and recompute the statistics.
val df_filtered_1 = df_joined.filter($"unique_users">=20).select(df_joined("userID"),df_joined("artistID"),df_joined("play_count"))
val artist_data_1 = df_filtered_1.groupBy("artistID").agg(count("artistID") as "unique_users",
                                                          sum("play_count") as "total_plays_artist")
                                                     .withColumnRenamed("artistID","artistID_1")

val user_data_1 = df_filtered_1.groupBy("userID").agg(count("userID") as "unique_artists",
                                                          sum("play_count") as "total_plays_user")
                                                     .withColumnRenamed("userID","userID_1")

val df_joined_filtered_1 = df_filtered_1.join(artist_data_1, artist_data_1("artistID_1")===df_filtered_1("artistID"))
                                        .join(user_data_1, user_data_1("userID_1")===df_filtered_1("userID"))
                                        .select(df_filtered_1("userID"),df_filtered_1("artistID"),df_filtered_1("play_count"), 
                                                 artist_data_1("unique_users"),artist_data_1("total_plays_artist"),
                                                 user_data_1("unique_artists"), user_data_1("total_plays_user"))

// Remove users with fewer than 5 unique artists, and recompute the statistics.
val df_filtered_2 = df_joined_filtered_1.filter($"unique_artists">=5).select(df_filtered_1("userID"),df_filtered_1("artistID"),
                                                                             df_filtered_1("play_count"))

val artist_data = df_filtered_2.groupBy("artistID").agg(count("artistID") as "unique_users",
                                                        sum("play_count") as "total_plays_artist")
                                                   .withColumnRenamed("artistID","artistID_2")

val user_data = df_filtered_2.groupBy("userID").agg(count("userID") as "unique_artists",
                                                         sum("play_count") as "total_plays_user")
                                                   .withColumnRenamed("userID","userID_2")

// Now we collect our new filtered data.
val user_artist_data = df_filtered_2.join(artist_data, artist_data("artistID_2")===df_filtered_2("artistID"))
                                    .join(user_data, user_data("userID_2")===df_filtered_2("userID"))
                                    .select("userID","artistID","play_count","unique_users","total_plays_artist","unique_artists","total_plays_user")

user_artist_data.show(5)

  

>     +------+--------+----------+------------+------------------+--------------+----------------+
>     |userID|artistID|play_count|unique_users|total_plays_artist|unique_artists|total_plays_user|
>     +------+--------+----------+------------+------------------+--------------+----------------+
>     |   148|    1118|       214|          62|             53915|            12|            3026|
>     |   148|    1206|       245|          50|             32827|            12|            3026|
>     |   148|     206|       214|          83|             36944|            12|            3026|
>     |   148|     233|       170|         138|            160317|            12|            3026|
>     |   148|     429|       430|         162|             91740|            12|            3026|
>     +------+--------+----------+------------+------------------+--------------+----------------+
>     only showing top 5 rows
>
>     df_filtered_1: org.apache.spark.sql.DataFrame = [userID: int, artistID: int ... 1 more field]
>     artist_data_1: org.apache.spark.sql.DataFrame = [artistID_1: int, unique_users: bigint ... 1 more field]
>     user_data_1: org.apache.spark.sql.DataFrame = [userID_1: int, unique_artists: bigint ... 1 more field]
>     df_joined_filtered_1: org.apache.spark.sql.DataFrame = [userID: int, artistID: int ... 5 more fields]
>     df_filtered_2: org.apache.spark.sql.DataFrame = [userID: int, artistID: int ... 1 more field]
>     artist_data: org.apache.spark.sql.DataFrame = [artistID_2: int, unique_users: bigint ... 1 more field]
>     user_data: org.apache.spark.sql.DataFrame = [userID_2: int, unique_artists: bigint ... 1 more field]
>     user_artist_data: org.apache.spark.sql.DataFrame = [userID: int, artistID: int ... 5 more fields]

  

Below we can see that we have reduced the amount of data. The number of
users is about the same as before, but the number of artists is
significantly reduced.

In [None]:
val n_data_new = user_artist_data.count().asInstanceOf[Long].floatValue(); // Number of observations
val n_users_new = user_artist_data.agg(countDistinct("userID")).collect()(0)(0).asInstanceOf[Long].floatValue(); // Number of unique users
val n_artists_new = user_artist_data.agg(countDistinct("artistID")).collect()(0)(0).asInstanceOf[Long].floatValue(); // Number of unique artists
val sparsity_new = 1-n_data_new/(n_users_new*n_artists_new) // Compute the sparsity of the filtered data

println("Number of data points: " + n_data_new)
println("Number of users: " + n_users_new)
println("Number of artists: " + n_artists_new)
print("Sparsity:" + sparsity_new.toString + "\n")


  

>     Number of data points: 53114.0
>     Number of users: 1819.0
>     Number of artists: 804.0
>     Sparsity:0.9636821
>     n_data_new: Float = 53114.0
>     n_users_new: Float = 1819.0
>     n_artists_new: Float = 804.0
>     sparsity_new: Float = 0.9636821

In [None]:
display(user_artist_data.select("play_count"))

  


In [None]:
display(artist_data.select("total_plays_artist"))

  


In [None]:
display(artist_data.select("unique_users"))

  


  

The total number of plays is correlated with the number of unique
listeners (as expected), as illustrated in the figure below.

In [None]:
display(artist_data)

  

[Figure: total plays per artist plotted against the number of unique listeners]

  

In the
[paper](https://www.researchgate.net/publication/220765111_Collaborative_Filtering_for_Implicit_Feedback_Datasets)
mentioned above, the authors suggest scaling the $r_{ui}$ values if they
tend to vary over a large range, as they do in our case. They presented
a log scaling scheme, but after testing different approaches we found
that scaling by taking the square root of the observed play counts (thus
reducing the range) worked best.
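
To get a feeling for how much each scheme compresses the range, the sketch below applies both to a few play counts that appear in the tables earlier in this notebook. The helper `spread`, the value of `eps` and the exact form `log(1 + r/eps)` are assumptions made for illustration; the paper's scheme may differ in detail.

```scala
// Play counts taken from rows shown earlier in this notebook.
val playCounts: Seq[Double] = Seq(19.0, 171.0, 626.0, 13883.0)

// Square-root scaling, as used in this notebook.
val sqrtScaled: Seq[Double] = playCounts.map(r => math.sqrt(r))

// Log scaling in the spirit of the paper; eps is a hypothetical value.
val eps = 1.0
val logScaled: Seq[Double] = playCounts.map(r => math.log(1.0 + r / eps))

// Ratio between the largest and the smallest value as a crude measure of range.
def spread(xs: Seq[Double]): Double = xs.max / xs.min

println(f"raw  max/min: ${spread(playCounts)}%.1f")
println(f"sqrt max/min: ${spread(sqrtScaled)}%.1f")
println(f"log  max/min: ${spread(logScaled)}%.1f")
```

The square root compresses the range less aggressively than the logarithm, so very heavy listening still translates into noticeably higher confidence.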

In [None]:
//Scaling the play_counts
val user_artist_data_scaled = user_artist_data
.withColumn("scaled_value", sqrt(col("play_count"))).drop("play_count").withColumnRenamed("scaled_value","play_count")
user_artist_data_scaled.show(5)

  

>     +------+--------+------------+------------------+--------------+----------------+------------------+
>     |userID|artistID|unique_users|total_plays_artist|unique_artists|total_plays_user|        play_count|
>     +------+--------+------------+------------------+--------------+----------------+------------------+
>     |   148|     436|         124|             88270|            12|            3026| 18.33030277982336|
>     |   148|    1206|          50|             32827|            12|            3026|15.652475842498529|
>     |   148|     512|          67|             62933|            12|            3026|15.620499351813308|
>     |   148|     429|         162|             91740|            12|            3026| 20.73644135332772|
>     |   148|    1943|          25|             13035|            12|            3026| 17.26267650163207|
>     +------+--------+------------+------------------+--------------+----------------+------------------+
>     only showing top 5 rows
>
>     user_artist_data_scaled: org.apache.spark.sql.DataFrame = [userID: int, artistID: int ... 5 more fields]

  

Display the scaled data

In [None]:
display(user_artist_data_scaled.select("play_count"))

  



  

We split our scaled dataset into training, validation and test sets.

In [None]:
// Split data into training, validation and test sets.
val Array(training_set, validation_set, test_set) = user_artist_data_scaled.select("userID","artistID","play_count").randomSplit(Array(0.6, 0.2, 0.2))
training_set.cache()
validation_set.cache()
test_set.cache()

  

>     training_set: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [userID: int, artistID: int ... 1 more field]
>     validation_set: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [userID: int, artistID: int ... 1 more field]
>     test_set: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [userID: int, artistID: int ... 1 more field]
>     res109: test_set.type = [userID: int, artistID: int ... 1 more field]

  

Alternating Least Squares
-------------------------

By looking at the minimization problem again, we see that if one of $X$
and $Y$ is fixed, the cost function is quadratic in the other and hence
the minimum can be computed in closed form. Thus, we can alternate
between re-computing the user and artist features, and it turns out that
the overall cost function is guaranteed not to increase in any
iteration. This procedure is called [Alternating Least
Squares](https://datasciencemadesimpler.wordpress.com/tag/alternating-least-squares/)
and is implemented in Spark. Recall the objective:

$$\min_{X,Y}\sum_{u\in U,\, i \in I}c_{ui}(p_{ui}-x_u^Ty_i)^2+\lambda\Big(\sum_{u\in U}\|x_u\|^2+\sum_{i\in I}\|y_i\|^2\Big).$$

The solutions to the respective quadratic problems are:

$$x_u=(Y^TC^uY+\lambda\,\mathrm{Id})^{-1}Y^TC^up(u) \quad \forall u\in U,$$
$$y_i=(X^TC^iX+\lambda\,\mathrm{Id})^{-1}X^TC^ip(i) \quad \forall i\in I,$$

where $C^u$ and $C^i$ are diagonal matrices with diagonal entries
$c_{ui}$, $i \in I$, and $c_{ui}$, $u \in U$, respectively. The vectors
$p(u)$ and $p(i)$ contain all binarized observations for user $u$ and
artist $i$ respectively. The computational bottleneck is computing
$Y^TC^uY$ (which requires time $O(f^2n)$ for each user). However, we can
rewrite the product as $Y^TC^uY=Y^TY+Y^T(C^u-\mathrm{Id})Y$. Now we see
that the term $Y^TY$ does not depend on $u$ and that $C^u-\mathrm{Id}$
has only as many non-zero entries as the number of artists user $u$ has
interacted with (which is usually much smaller than the total number of
artists). Hence, this representation is computationally much cheaper. A
similar approach can be applied to $X^TC^iX$. The matrix inversions need
to be done on matrices of size $f \times f$, where $f$ is the dimension
of the latent feature space, and are thus relatively cheap.
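
The rewriting of $Y^TC^uY$ can be checked numerically on a toy example. The sketch below is plain Scala with made-up values; the matrix `Y`, the confidences `c` and the helper `weightedGram` are hypothetical names introduced here, and the code is not how Spark implements ALS internally.

```scala
type Vec = Array[Double]
type Mat = Array[Array[Double]]

// Hypothetical item-factor matrix Y (n = 4 items as rows, f = 2 features).
val Y: Mat = Array(Array(1.0, 2.0), Array(0.5, -1.0), Array(3.0, 0.0), Array(-2.0, 1.0))
// Confidences c_ui for one user u; items 0 and 2 were observed (c > 1).
val c: Vec = Array(5.0, 1.0, 3.0, 1.0)
val f = Y.head.length

// Sum of w(i) * y_i y_i^T over the given item indices.
def weightedGram(items: Seq[Int], w: Int => Double): Mat = {
  val g = Array.fill(f, f)(0.0)
  for (i <- items; a <- 0 until f; b <- 0 until f)
    g(a)(b) += w(i) * Y(i)(a) * Y(i)(b)
  g
}

// Direct computation of Y^T C^u Y, touching all n items.
val direct: Mat = weightedGram(Y.indices, i => c(i))

// Shared part Y^T Y, which can be computed once for all users.
val YtY: Mat = weightedGram(Y.indices, _ => 1.0)
// User-specific correction Y^T (C^u - Id) Y, touching only the observed items.
val observedItems: Seq[Int] = Y.indices.filter(i => c(i) != 1.0)
val correction: Mat = weightedGram(observedItems, i => c(i) - 1.0)

val viaTrick: Mat = Array.tabulate(f, f)((a, b) => YtY(a)(b) + correction(a)(b))
println(direct.map(_.mkString(" ")).mkString("\n"))
println(viaTrick.map(_.mkString(" ")).mkString("\n"))
```

Both computations give the same $f \times f$ matrix, but the second one only loops over the items the user has interacted with once $Y^TY$ is available.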

Once we have all the user and artist features, we can produce a
recommendation list of artists for user $u$ by computing the dot
products $x_u^Ty_i$ for all artists $i$ and sorting the artists in
descending order of these values.
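
A minimal sketch of this ranking step in plain Scala is shown below. The latent vectors and artist IDs are invented for illustration, and `dot` is a helper defined here; in the notebook itself Spark produces the recommendation lists.

```scala
def dot(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (s, t) => s * t }.sum

// Hypothetical latent vectors (f = 2); the artist IDs are made up.
val xU: Array[Double] = Array(0.9, 0.1)
val artists: Map[Int, Array[Double]] = Map(
  10 -> Array(1.0, 0.0),
  11 -> Array(0.0, 1.0),
  12 -> Array(0.7, 0.7),
  13 -> Array(0.2, 0.1)
)

// Score every artist by the dot product and sort by descending score.
val ranked: Seq[(Int, Double)] =
  artists.toSeq
    .map { case (id, yI) => (id, dot(xU, yI)) }
    .sortBy { case (_, score) => -score }

ranked.foreach { case (id, score) => println(f"artist $id%d: $score%.2f") }
```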

Evaluation
----------

One approach to measure the performance of the model would be to measure
the RMSE:

$$\sqrt{\frac{1}{\#\text{observations}}\sum_{u,i}(p_{ui}^t-x_u^Ty_i)^2},$$

where the $p_{ui}^t$ are the binarized observations from the test set.
However, this metric is not very suitable for this particular
application since for the zero entries of $p_{ui}^t$ we don't know if
the user dislikes the artist or just hasn't discovered it. In the
[paper](https://www.researchgate.net/publication/220765111_Collaborative_Filtering_for_Implicit_Feedback_Datasets)
they suggest the mean percentile rank metric:

$$\overline{rank}=\frac{\sum_{u,i}r^t_{ui}\,rank_{ui}}{\sum_{u,i}r^t_{ui}},$$

where $rank_{ui}$ is the percentile rank of artist $i$ in the produced
recommendation list for user $u$. Hence if artist $j$ is in the first
place in the list for user $u$ we get $rank_{uj}=0\%$, and if it is in
the last place we get $rank_{uj}=100\%$. Thus, this metric is a weighted
average of the percentiles of the artists the users have listened to. If
user $u$ has listened to artist $j$ many times we have a large $r_{uj}$
value, and if the artist is ranked very low in the recommendation list
for this user, it will increase the value of $\overline{rank}$
drastically. If the model instead correctly ranks this artist near the
top, the product $r_{uj}\,rank_{uj}$ will be small. Hence, low values of
$\overline{rank}$ are desired.
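
The metric can be illustrated on a tiny example in plain Scala. The recommendation lists, the test-set values and the helper `percentileRanks` below are hypothetical and only meant to show the arithmetic; positions are mapped to percentiles as position divided by (list length minus one).

```scala
// Percentile rank of each item in a ranked list, in percent:
// 0 for the first position, 100 for the last.
def percentileRanks(ranked: Seq[Int]): Map[Int, Double] = {
  val n = ranked.length
  ranked.zipWithIndex.map { case (item, pos) =>
    item -> (if (n > 1) 100.0 * pos / (n - 1) else 0.0)
  }.toMap
}

// Hypothetical recommendation lists per user (best first).
val lists: Map[Int, Seq[Int]] = Map(
  1 -> Seq(10, 12, 13, 11, 14),
  2 -> Seq(14, 10, 11, 12, 13)
)
// Hypothetical test-set observations (user, artist, r^t_ui).
val testPlays: Seq[(Int, Int, Double)] = Seq((1, 10, 8.0), (1, 11, 2.0), (2, 13, 5.0))

// Pair each observation's weighted percentile with its weight.
val weighted: Seq[(Double, Double)] = testPlays.map { case (u, i, r) =>
  (r * percentileRanks(lists(u))(i), r)
}
val meanRank: Double = weighted.map(_._1).sum / weighted.map(_._2).sum
println(f"mean percentile rank = $meanRank%.2f")
```

Here user 1's most played artist sits at the top of the list and contributes nothing to the numerator, while user 2's only test artist sits at the bottom and pulls the score up.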

Unfortunately, the $\\overline{rank}$ metric is not implemented in Spark
yet, so below we have written our own function for computing it given
the ranked artist lists for each user. We also remove an artist from the
recommendation list for a user if we have observed that the user
listened to that artist in the training data. This eliminates the easy
recommendation, that is, recommending the same artists we know that the
user has already listened to.

In [None]:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.Dataset 
import org.apache.spark.sql.Row
import org.apache.spark.sql.DataFrame

// Function for computing the mean percentile rank metric.
// Input: 
// - predictions_scores_new: DataFrame with userIDs and corresponding recommendation lists.
// - training_set: DataFrame with the observations in the training set.
// - validation_set: DataFrame with the observations needed for the evaluation of the metric.
// Output: Float corresponding to the mean_rank score.
def eval_model(predictions_scores_new: DataFrame, training_set: DataFrame, validation_set: DataFrame) : Float = {
  
  val predictions_scores = predictions_scores_new.withColumnRenamed("userID","userID_new") // Avoiding duplicate column names.
  val recommendations = predictions_scores.withColumn("recommendations", explode($"recommendations")) // Rearrange the recommendation lists.
                                      .select("userID_new","recommendations.artistID", "recommendations.rating")
  
  val recommendations_filtered = recommendations.join(training_set, training_set("userID")===recommendations("userID_new") && training_set("artistID")===recommendations("artistID"), "leftanti") // Erase artists appearing in the training set for each user.
  
  // Compute ranking percentiles (0 for the top recommendation, 1 for the last).
  val recommendations_percentiles = recommendations_filtered.withColumn("rank",percent_rank()
                                                            .over(Window.partitionBy("userID_new").orderBy(desc("rating")))) 
  // Store everything in a single DataFrame.
  val table_data = recommendations_percentiles.join(validation_set, recommendations_percentiles("userID_new")===validation_set("userID") && recommendations_percentiles("artistID")===validation_set("artistID"))
  
  // Compute the sum in the numerator of the metric (in percent).
  val numerator = table_data.withColumn("ru1rankui", $"rank"*$"play_count"*100.0)
                            .agg(sum("ru1rankui"))
                            .collect()(0)(0).asInstanceOf[Double]
  
  // Compute the sum in the denominator of the metric.
  val denominator = table_data.agg(sum("play_count"))
                              .collect()(0)(0)
                              .asInstanceOf[Double]
  // Compute the mean percentile rank.
  val rank_score = numerator/denominator
  rank_score.toFloat
}

  

>     import org.apache.spark.sql.expressions.Window
>     import org.apache.spark.sql.Dataset
>     import org.apache.spark.sql.Row
>     import org.apache.spark.sql.DataFrame
>     eval_model: (predictions_scores_new: org.apache.spark.sql.DataFrame, training_set: org.apache.spark.sql.DataFrame, validation_set: org.apache.spark.sql.DataFrame)Float

  

Now we import the [ALS
module](https://spark.apache.org/docs/latest/api/scala/org/apache/spark/ml/recommendation/ALSModel.html)
from Spark and start the training. We perform a grid search over the
hyper-parameters: the latent dimension $f$, confidence parameter
$\\alpha$ and regularization parameter $\\lambda$. We choose the
parameter combinations based on the performance on the validation set.

In [None]:
import org.apache.spark.ml.recommendation.ALS
// Number of iterations in the ALS algorithm
val numIter = 10

val ranks = List(10,50,100,150) // Dimension of latent feature space
val lambdas = List(0.1, 1.0, 2.0) // Regularization parameter
val alphas = List(0.5, 1.0, 5.0) // Confidence parameter

// Loop over all parameter combinations
for ( alpha <- alphas ){
  for ( lambda <- lambdas ){
    for ( rank <- ranks ){
      val als = new ALS()
        .setRank(rank)
        .setMaxIter(numIter)
        .setRegParam(lambda)
        .setUserCol("userID")
        .setItemCol("artistID")
        .setRatingCol("play_count")
        .setImplicitPrefs(true) // Indicate we have implicit data
        .setAlpha(alpha)
        .setNonnegative(true) // Constrain to non-negative values
      
      // Fit the model
      val model = als.fit(training_set)
      
      model.setColdStartStrategy("drop") // This is to ensure we handle unseen users or unseen artists safely during the prediction.
           .setUserCol("userID")
           .setItemCol("artistID")
      // Generate the recommendations
      val predictions_scores = model.recommendForUserSubset(validation_set,n_artists_new.toInt)
      
      // Evaluate the model
      println("rank=" + rank + ", alpha=" + alpha + ", lambda=" + lambda + ", mean_rank=" + eval_model(predictions_scores, training_set, validation_set))
    }
  }
}


  

>     rank=10, alpha=0.5, lambda=0.1, mean_rank=10.410393
>     rank=50, alpha=0.5, lambda=0.1, mean_rank=10.524447
>     rank=100, alpha=0.5, lambda=0.1, mean_rank=11.609247
>     rank=150, alpha=0.5, lambda=0.1, mean_rank=13.049584
>     rank=10, alpha=0.5, lambda=1.0, mean_rank=9.834879
>     rank=50, alpha=0.5, lambda=1.0, mean_rank=8.388225
>     rank=100, alpha=0.5, lambda=1.0, mean_rank=8.468931
>     rank=150, alpha=0.5, lambda=1.0, mean_rank=8.435649
>     rank=10, alpha=0.5, lambda=2.0, mean_rank=9.819813
>     rank=50, alpha=0.5, lambda=2.0, mean_rank=8.098052
>     rank=100, alpha=0.5, lambda=2.0, mean_rank=7.8016405
>     rank=150, alpha=0.5, lambda=2.0, mean_rank=7.7979865
>     rank=10, alpha=1.0, lambda=0.1, mean_rank=10.217629
>     rank=50, alpha=1.0, lambda=0.1, mean_rank=10.891969
>     rank=100, alpha=1.0, lambda=0.1, mean_rank=12.113014
>     rank=150, alpha=1.0, lambda=0.1, mean_rank=12.9173355
>     rank=10, alpha=1.0, lambda=1.0, mean_rank=9.814854
>     rank=50, alpha=1.0, lambda=1.0, mean_rank=8.9062605
>     rank=100, alpha=1.0, lambda=1.0, mean_rank=8.994811
>     rank=150, alpha=1.0, lambda=1.0, mean_rank=9.323116
>     rank=10, alpha=1.0, lambda=2.0, mean_rank=9.574796
>     rank=50, alpha=1.0, lambda=2.0, mean_rank=8.102262
>     rank=100, alpha=1.0, lambda=2.0, mean_rank=7.919162
>     rank=150, alpha=1.0, lambda=2.0, mean_rank=7.806969
>     rank=10, alpha=5.0, lambda=0.1, mean_rank=11.04583
>     rank=50, alpha=5.0, lambda=0.1, mean_rank=12.466616
>     rank=100, alpha=5.0, lambda=0.1, mean_rank=13.031898
>     rank=150, alpha=5.0, lambda=0.1, mean_rank=13.128654
>     rank=10, alpha=5.0, lambda=1.0, mean_rank=10.496872
>     rank=50, alpha=5.0, lambda=1.0, mean_rank=10.822062
>     rank=100, alpha=5.0, lambda=1.0, mean_rank=10.659593
>     rank=150, alpha=5.0, lambda=1.0, mean_rank=10.679214
>     rank=10, alpha=5.0, lambda=2.0, mean_rank=10.021061
>     rank=50, alpha=5.0, lambda=2.0, mean_rank=9.592773
>     rank=100, alpha=5.0, lambda=2.0, mean_rank=9.313808
>     rank=150, alpha=5.0, lambda=2.0, mean_rank=9.194027
>     import org.apache.spark.ml.recommendation.ALS
>     numIter: Int = 10
>     ranks: List[Int] = List(10, 50, 100, 150)
>     lambdas: List[Double] = List(0.1, 1.0, 2.0)
>     alphas: List[Double] = List(0.5, 1.0, 5.0)
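
Picking the winner from the log above can be sketched in plain Scala. This is only an illustration: the `GridResult` case class and its field names are hypothetical, and only a handful of rows are copied from the output (the remaining rows all have a larger `mean_rank`). A lower expected percentile rank is better, so we take the minimum.

```scala
// One row of the grid-search log (hypothetical helper, not part of the notebook).
case class GridResult(rank: Int, alpha: Double, lambda: Double, meanRank: Double)

// A few rows copied from the validation output above.
val results = Seq(
  GridResult(10, 0.5, 0.1, 10.410393),
  GridResult(150, 0.5, 2.0, 7.7979865),
  GridResult(100, 0.5, 2.0, 7.8016405),
  GridResult(150, 1.0, 2.0, 7.806969),
  GridResult(150, 5.0, 2.0, 9.194027)
)

// Lower expected percentile rank is better, so select the minimum.
val best = results.minBy(_.meanRank)
println(s"best: rank=${best.rank}, alpha=${best.alpha}, lambda=${best.lambda}, mean_rank=${best.meanRank}")
```

In the notebook itself, the same idea would amount to collecting the tuples inside the loop instead of only printing them.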

  

We get our final model by choosing $f=150$, $\\alpha=0.5$ and
$\\lambda=2.0$, training the model again and evaluating it on the test set.
We observe $\\overline{rank}\\approx 7.75 \\%$ on the test set.

In [None]:
// Retrain the best model.

val numIter_final=10
val rank_final=150
val alpha_final=0.5
val lambda_final=2.0
val als_final = new ALS()
        .setRank(rank_final)
        .setMaxIter(numIter_final)
        .setRegParam(lambda_final)
        .setUserCol("userID")
        .setItemCol("artistID")
        .setRatingCol("play_count")
        .setImplicitPrefs(true)
        .setAlpha(alpha_final)
        .setNonnegative(true)
val model_final = als_final.fit(training_set)
model_final.setColdStartStrategy("drop")
     .setUserCol("userID")
     .setItemCol("artistID")

// Evaluate on the validation set.
val predictions_scores_val = model_final.recommendForUserSubset(validation_set,n_artists_new.toInt)
println("Validation set: mean_rank=" + eval_model(predictions_scores_val, training_set, validation_set))

// Evaluate on the test set.
val predictions_scores_test = model_final.recommendForUserSubset(test_set,n_artists_new.toInt)
println("Test set: mean_rank=" + eval_model(predictions_scores_test, training_set, test_set))

  

>     Validation set: mean_rank=7.7979865
>     Test set: mean_rank=7.75016
>     numIter_final: Int = 10
>     rank_final: Int = 150
>     alpha_final: Double = 0.5
>     lambda_final: Double = 2.0
>     als_final: org.apache.spark.ml.recommendation.ALS = als_f778f6cc23db
>     model_final: org.apache.spark.ml.recommendation.ALSModel = als_f778f6cc23db
>     predictions_scores_val: org.apache.spark.sql.DataFrame = [userID: int, recommendations: array<struct<artistID:int,rating:float>>]
>     predictions_scores_test: org.apache.spark.sql.DataFrame = [userID: int, recommendations: array<struct<artistID:int,rating:float>>]

  

Model Comparison
----------------

We compare our model with two naive ones.

**Random Recommendations**: First we produce a random recommendation
list for each user and evaluate the metric. Note that for a random
ranking the expected percentile rank of an artist is 50%, so
we expect $\\overline{rank}\\approx 50 \\%$ for this random model.

In [None]:
case class Rating(artistID: Int, rating: Float) // Simple class for getting the recommendations in suitable form.

// Generate a randomly ordered array of artistIDs; the row number serves as a placeholder score.
val random = artist_data.select("artistID_2").distinct().orderBy(rand())
  .withColumn("idx", monotonically_increasing_id)
  .withColumn("rownumber", row_number.over(Window.orderBy(desc("idx"))))
  .drop("idx").sort(desc("rownumber"))
  .collect.map(row => Rating(row.getInt(0), row.getInt(1).toFloat))

val test_users = test_set.select("userID").distinct()

// Attach the same array to every user, then keep only the test users.
val prediction_scores = user_artist_data.select("userID").distinct()
  .withColumn("recommendations", typedLit(random))
  .join(test_users, "userID")

  

>     defined class Rating
>     random: Array[Rating] = Array(Rating(1131,804.0), Rating(504,803.0), Rating(985,802.0), Rating(1914,801.0), Rating(3739,800.0), Rating(792,799.0), Rating(3470,798.0), Rating(1424,797.0), Rating(599,796.0), Rating(3481,795.0), Rating(674,794.0), Rating(779,793.0), Rating(2815,792.0), Rating(204,791.0), Rating(2006,790.0), Rating(1609,789.0), Rating(88,788.0), Rating(2014,787.0), Rating(1943,786.0), Rating(1047,785.0), Rating(3200,784.0), Rating(1155,783.0), Rating(540,782.0), Rating(2619,781.0), Rating(250,780.0), Rating(507,779.0), Rating(2018,778.0), Rating(1596,777.0), Rating(1833,776.0), Rating(330,775.0), Rating(1394,774.0), Rating(441,773.0), Rating(221,772.0), Rating(6120,771.0), Rating(680,770.0), Rating(307,769.0), Rating(3258,768.0), Rating(1372,767.0), Rating(4828,766.0), Rating(157,765.0), Rating(262,764.0), Rating(841,763.0), Rating(340,762.0), Rating(3078,761.0), Rating(610,760.0), Rating(2044,759.0), Rating(53,758.0), Rating(426,757.0), Rating(704,756.0), Rating(63,755.0), Rating(1249,754.0), Rating(472,753.0), Rating(51,752.0), Rating(523,751.0), Rating(1180,750.0), Rating(1713,749.0), Rating(1073,748.0), Rating(64,747.0), Rating(481,746.0), Rating(1545,745.0), Rating(4251,744.0), Rating(56,743.0), Rating(527,742.0), Rating(686,741.0), Rating(85,740.0), Rating(923,739.0), Rating(5926,738.0), Rating(3404,737.0), Rating(532,736.0), Rating(227,735.0), Rating(65,734.0), Rating(2783,733.0), Rating(2343,732.0), Rating(1790,731.0), Rating(1104,730.0), Rating(163,729.0), Rating(440,728.0), Rating(2544,727.0), Rating(823,726.0), Rating(2661,725.0), Rating(3501,724.0), Rating(697,723.0), Rating(547,722.0), Rating(816,721.0), Rating(1364,720.0), Rating(1198,719.0), Rating(1130,718.0), Rating(2380,717.0), Rating(320,716.0), Rating(187,715.0), Rating(548,714.0), Rating(802,713.0), Rating(712,712.0), Rating(1300,711.0), Rating(2137,710.0), Rating(3110,709.0), Rating(1451,708.0), Rating(67,707.0), Rating(915,706.0), Rating(468,705.0), Rating(575,704.0), 
Rating(290,703.0), Rating(488,702.0), Rating(1351,701.0), Rating(237,700.0), Rating(534,699.0), Rating(1470,698.0), Rating(498,697.0), Rating(471,696.0), Rating(1366,695.0), Rating(492,694.0), Rating(1203,693.0), Rating(183,692.0), Rating(1398,691.0), Rating(2600,690.0), Rating(846,689.0), Rating(1512,688.0), Rating(444,687.0), Rating(726,686.0), Rating(3173,685.0), Rating(96,684.0), Rating(1192,683.0), Rating(3332,682.0), Rating(1414,681.0), Rating(2342,680.0), Rating(152,679.0), Rating(2835,678.0), Rating(2091,677.0), Rating(518,676.0), Rating(4385,675.0), Rating(546,674.0), Rating(3411,673.0), Rating(2616,672.0), Rating(1866,671.0), Rating(2521,670.0), Rating(1098,669.0), Rating(1167,668.0), Rating(679,667.0), Rating(500,666.0), Rating(859,665.0), Rating(530,664.0), Rating(744,663.0), Rating(4180,662.0), Rating(875,661.0), Rating(3171,660.0), Rating(1298,659.0), Rating(84,658.0), Rating(25,657.0), Rating(366,656.0), Rating(910,655.0), Rating(1035,654.0), Rating(3259,653.0), Rating(806,652.0), Rating(542,651.0), Rating(344,650.0), Rating(229,649.0), Rating(554,648.0), Rating(1643,647.0), Rating(231,646.0), Rating(171,645.0), Rating(524,644.0), Rating(545,643.0), Rating(211,642.0), Rating(949,641.0), Rating(2716,640.0), Rating(1206,639.0), Rating(951,638.0), Rating(525,637.0), Rating(2139,636.0), Rating(375,635.0), Rating(349,634.0), Rating(646,633.0), Rating(1453,632.0), Rating(153,631.0), Rating(418,630.0), Rating(323,629.0), Rating(234,628.0), Rating(715,627.0), Rating(1409,626.0), Rating(1886,625.0), Rating(326,624.0), Rating(1455,623.0), Rating(439,622.0), Rating(998,621.0), Rating(1766,620.0), Rating(1678,619.0), Rating(3227,618.0), Rating(360,617.0), Rating(1824,616.0), Rating(3417,615.0), Rating(235,614.0), Rating(2206,613.0), Rating(2632,612.0), Rating(1428,611.0), Rating(400,610.0), Rating(851,609.0), Rating(607,608.0), Rating(682,607.0), Rating(1936,606.0), Rating(2605,605.0), Rating(1099,604.0), Rating(544,603.0), Rating(1079,602.0), 
Rating(1075,601.0), Rating(197,600.0), Rating(4087,599.0), Rating(857,598.0), Rating(3280,597.0), Rating(70,596.0), Rating(959,595.0), Rating(2787,594.0), Rating(4821,593.0), Rating(3462,592.0), Rating(808,591.0), Rating(1613,590.0), Rating(1360,589.0), Rating(2179,588.0), Rating(296,587.0), Rating(797,586.0), Rating(1114,585.0), Rating(1216,584.0), Rating(188,583.0), Rating(1239,582.0), Rating(429,581.0), Rating(253,580.0), Rating(7324,579.0), Rating(955,578.0), Rating(1038,577.0), Rating(7340,576.0), Rating(2548,575.0), Rating(1707,574.0), Rating(2624,573.0), Rating(2130,572.0), Rating(1110,571.0), Rating(1195,570.0), Rating(1429,569.0), Rating(1040,568.0), Rating(517,567.0), Rating(969,566.0), Rating(2535,565.0), Rating(174,564.0), Rating(1511,563.0), Rating(378,562.0), Rating(314,561.0), Rating(1191,560.0), Rating(1947,559.0), Rating(924,558.0), Rating(69,557.0), Rating(230,556.0), Rating(1041,555.0), Rating(329,554.0), Rating(1121,553.0), Rating(310,552.0), Rating(4025,551.0), Rating(475,550.0), Rating(533,549.0), Rating(300,548.0), Rating(191,547.0), Rating(1672,546.0), Rating(5150,545.0), Rating(416,544.0), Rating(1745,543.0), Rating(1418,542.0), Rating(691,541.0), Rating(2821,540.0), Rating(539,539.0), Rating(293,538.0), Rating(602,537.0), Rating(945,536.0), Rating(1787,535.0), Rating(3859,534.0), Rating(433,533.0), Rating(465,532.0), Rating(331,531.0), Rating(1358,530.0), Rating(77,529.0), Rating(768,528.0), Rating(278,527.0), Rating(59,526.0), Rating(907,525.0), Rating(2094,524.0), Rating(954,523.0), Rating(681,522.0), Rating(299,521.0), Rating(1045,520.0), Rating(1410,519.0), Rating(3693,518.0), Rating(562,517.0), Rating(285,516.0), Rating(1505,515.0), Rating(182,514.0), Rating(450,513.0), Rating(1415,512.0), Rating(906,511.0), Rating(279,510.0), Rating(2801,509.0), Rating(4079,508.0), Rating(1222,507.0), Rating(815,506.0), Rating(493,505.0), Rating(2850,504.0), Rating(461,503.0), Rating(459,502.0), Rating(3186,501.0), Rating(412,500.0), 
Rating(1449,499.0), Rating(510,498.0), Rating(1109,497.0), Rating(4616,496.0), Rating(325,495.0), Rating(1519,494.0), Rating(4118,493.0), Rating(982,492.0), Rating(641,491.0), Rating(1964,490.0), Rating(2265,489.0), Rating(1852,488.0), Rating(2794,487.0), Rating(308,486.0), Rating(2020,485.0), Rating(2855,484.0), Rating(342,483.0), Rating(362,482.0), Rating(30,481.0), Rating(453,480.0), Rating(1416,479.0), Rating(843,478.0), Rating(2797,477.0), Rating(202,476.0), Rating(928,475.0), Rating(45,474.0), Rating(683,473.0), Rating(538,472.0), Rating(3444,471.0), Rating(1854,470.0), Rating(5988,469.0), Rating(629,468.0), Rating(283,467.0), Rating(1755,466.0), Rating(1343,465.0), Rating(166,464.0), Rating(2637,463.0), Rating(874,462.0), Rating(755,461.0), Rating(238,460.0), Rating(352,459.0), Rating(632,458.0), Rating(503,457.0), Rating(487,456.0), Rating(436,455.0), Rating(1934,454.0), Rating(289,453.0), Rating(1043,452.0), Rating(1037,451.0), Rating(5547,450.0), Rating(1243,449.0), Rating(311,448.0), Rating(733,447.0), Rating(1116,446.0), Rating(3768,445.0), Rating(1444,444.0), Rating(89,443.0), Rating(155,442.0), Rating(1338,441.0), Rating(1810,440.0), Rating(421,439.0), Rating(245,438.0), Rating(192,437.0), Rating(3427,436.0), Rating(1377,435.0), Rating(2990,434.0), Rating(3489,433.0), Rating(196,432.0), Rating(732,431.0), Rating(2080,430.0), Rating(2277,429.0), Rating(612,428.0), Rating(1281,427.0), Rating(614,426.0), Rating(1547,425.0), Rating(3741,424.0), Rating(769,423.0), Rating(536,422.0), Rating(1193,421.0), Rating(470,420.0), Rating(1122,419.0), Rating(1241,418.0), Rating(1988,417.0), Rating(176,416.0), Rating(693,415.0), Rating(1379,414.0), Rating(2834,413.0), Rating(2444,412.0), Rating(1632,411.0), Rating(212,410.0), Rating(295,409.0), Rating(407,408.0), Rating(821,407.0), Rating(258,406.0), Rating(918,405.0), Rating(830,404.0), Rating(190,403.0), Rating(620,402.0), Rating(1376,401.0), Rating(485,400.0), Rating(215,399.0), Rating(1350,398.0), 
Rating(2498,397.0), Rating(389,396.0), Rating(3324,395.0), Rating(968,394.0), Rating(2840,393.0), Rating(3902,392.0), Rating(2956,391.0), Rating(490,390.0), Rating(305,389.0), Rating(1119,388.0), Rating(251,387.0), Rating(1892,386.0), Rating(1905,385.0), Rating(52,384.0), Rating(4814,383.0), Rating(1804,382.0), Rating(2524,381.0), Rating(1504,380.0), Rating(486,379.0), Rating(161,378.0), Rating(193,377.0), Rating(585,376.0), Rating(964,375.0), Rating(962,374.0), Rating(4704,373.0), Rating(605,372.0), Rating(210,371.0), Rating(531,370.0), Rating(297,369.0), Rating(622,368.0), Rating(718,367.0), Rating(464,366.0), Rating(9,365.0), Rating(618,364.0), Rating(972,363.0), Rating(4115,362.0), Rating(997,361.0), Rating(1097,360.0), Rating(4262,359.0), Rating(917,358.0), Rating(4562,357.0), Rating(889,356.0), Rating(1515,355.0), Rating(2407,354.0), Rating(616,353.0), Rating(858,352.0), Rating(2102,351.0), Rating(880,350.0), Rating(1183,349.0), Rating(528,348.0), Rating(1580,347.0), Rating(309,346.0), Rating(3201,345.0), Rating(3940,344.0), Rating(1958,343.0), Rating(3397,342.0), Rating(1083,341.0), Rating(225,340.0), Rating(205,339.0), Rating(217,338.0), Rating(3475,337.0), Rating(1645,336.0), Rating(1001,335.0), Rating(5149,334.0), Rating(3373,333.0), Rating(371,332.0), Rating(2977,331.0), Rating(1458,330.0), Rating(1100,329.0), Rating(961,328.0), Rating(3109,327.0), Rating(908,326.0), Rating(868,325.0), Rating(1709,324.0), Rating(1062,323.0), Rating(863,322.0), Rating(55,321.0), Rating(2901,320.0), Rating(2608,319.0), Rating(3081,318.0), Rating(952,317.0), Rating(728,316.0), Rating(1714,315.0), Rating(189,314.0), Rating(2838,313.0), Rating(2370,312.0), Rating(1873,311.0), Rating(2786,310.0), Rating(722,309.0), Rating(1196,308.0), Rating(687,307.0), Rating(1863,306.0), Rating(445,305.0), Rating(2523,304.0), Rating(708,303.0), Rating(1411,302.0), Rating(886,301.0), Rating(423,300.0), Rating(1475,299.0), Rating(644,298.0), Rating(2614,297.0), Rating(107,296.0), 
Rating(5416,295.0), Rating(785,294.0), Rating(1452,293.0), Rating(1052,292.0), Rating(2744,291.0), Rating(1383,290.0), Rating(3466,289.0), Rating(1633,288.0), Rating(321,287.0), Rating(184,286.0), Rating(1274,285.0), Rating(2379,284.0), Rating(748,283.0), Rating(3302,282.0), Rating(603,281.0), Rating(75,280.0), Rating(1513,279.0), Rating(2352,278.0), Rating(1048,277.0), Rating(301,276.0), Rating(1775,275.0), Rating(845,274.0), Rating(198,273.0), Rating(854,272.0), Rating(1464,271.0), Rating(430,270.0), Rating(1541,269.0), Rating(332,268.0), Rating(2220,267.0), Rating(298,266.0), Rating(757,265.0), Rating(2226,264.0), Rating(1106,263.0), Rating(1814,262.0), Rating(447,261.0), Rating(2938,260.0), Rating(2940,259.0), Rating(932,258.0), Rating(793,257.0), Rating(1201,256.0), Rating(6453,255.0), Rating(1145,254.0), Rating(322,253.0), Rating(228,252.0), Rating(1384,251.0), Rating(2595,250.0), Rating(1601,249.0), Rating(328,248.0), Rating(813,247.0), Rating(1019,246.0), Rating(220,245.0), Rating(730,244.0), Rating(1378,243.0), Rating(316,242.0), Rating(5000,241.0), Rating(2531,240.0), Rating(1046,239.0), Rating(241,238.0), Rating(1957,237.0), Rating(1188,236.0), Rating(377,235.0), Rating(3767,234.0), Rating(812,233.0), Rating(458,232.0), Rating(4675,231.0), Rating(327,230.0), Rating(257,229.0), Rating(1463,228.0), Rating(7,227.0), Rating(8589,226.0), Rating(3333,225.0), Rating(2580,224.0), Rating(1013,223.0), Rating(223,222.0), Rating(526,221.0), Rating(428,220.0), Rating(72,219.0), Rating(1459,218.0), Rating(1772,217.0), Rating(1684,216.0), Rating(609,215.0), Rating(1977,214.0), Rating(497,213.0), Rating(993,212.0), Rating(710,211.0), Rating(1186,210.0), Rating(1534,209.0), Rating(1510,208.0), Rating(690,207.0), Rating(1091,206.0), Rating(154,205.0), Rating(1425,204.0), Rating(1456,203.0), Rating(1081,202.0), Rating(615,201.0), Rating(511,200.0), Rating(1874,199.0), Rating(466,198.0), Rating(181,197.0), Rating(1246,196.0), Rating(1527,195.0), Rating(98,194.0), 
Rating(170,193.0), Rating(1034,192.0), Rating(324,191.0), Rating(1061,190.0), Rating(707,189.0), Rating(1151,188.0), Rating(1242,187.0), Rating(2025,186.0), Rating(99,185.0), Rating(506,184.0), Rating(424,183.0), Rating(706,182.0), Rating(318,181.0), Rating(405,180.0), Rating(856,179.0), Rating(947,178.0), Rating(267,177.0), Rating(355,176.0), Rating(233,175.0), Rating(1795,174.0), Rating(1427,173.0), Rating(6217,172.0), Rating(1969,171.0), Rating(1853,170.0), Rating(81,169.0), Rating(2439,168.0), Rating(922,167.0), Rating(1778,166.0), Rating(977,165.0), Rating(1976,164.0), Rating(4566,163.0), Rating(1401,162.0), Rating(489,161.0), Rating(1014,160.0), Rating(930,159.0), Rating(1673,158.0), Rating(2668,157.0), Rating(705,156.0), Rating(3502,155.0), Rating(1244,154.0), Rating(246,153.0), Rating(980,152.0), Rating(1807,151.0), Rating(3046,150.0), Rating(4316,149.0), Rating(1089,148.0), Rating(3616,147.0), Rating(167,146.0), Rating(1150,145.0), Rating(689,144.0), Rating(630,143.0), Rating(386,142.0), Rating(2607,141.0), Rating(425,140.0), Rating(3057,139.0), Rating(1123,138.0), Rating(867,137.0), Rating(512,136.0), Rating(1105,135.0), Rating(1681,134.0), Rating(172,133.0), Rating(701,132.0), Rating(317,131.0), Rating(1426,130.0), Rating(1179,129.0), Rating(302,128.0), Rating(703,127.0), Rating(734,126.0), Rating(2347,125.0), Rating(946,124.0), Rating(2542,123.0), Rating(3279,122.0), Rating(2092,121.0), Rating(261,120.0), Rating(936,119.0), Rating(501,118.0), Rating(3400,117.0), Rating(180,116.0), Rating(71,115.0), Rating(432,114.0), Rating(294,113.0), Rating(1983,112.0), Rating(717,111.0), Rating(499,110.0), Rating(2083,109.0), Rating(1339,108.0), Rating(97,107.0), Rating(1182,106.0), Rating(3097,105.0), Rating(1400,104.0), Rating(786,103.0), Rating(4435,102.0), Rating(2176,101.0), Rating(1295,100.0), Rating(978,99.0), Rating(1990,98.0), Rating(203,97.0), Rating(434,96.0), Rating(195,95.0), Rating(714,94.0), Rating(304,93.0), Rating(986,92.0), Rating(420,91.0), 
Rating(291,90.0), Rating(162,89.0), Rating(15,88.0), Rating(920,87.0), Rating(966,86.0), Rating(601,85.0), Rating(456,84.0), Rating(1200,83.0), Rating(356,82.0), Rating(1556,81.0), Rating(437,80.0), Rating(1044,79.0), Rating(383,78.0), Rating(2923,77.0), Rating(222,76.0), Rating(226,75.0), Rating(1072,74.0), Rating(1520,73.0), Rating(208,72.0), Rating(957,71.0), Rating(1413,70.0), Rating(1406,69.0), Rating(1639,68.0), Rating(370,67.0), Rating(621,66.0), Rating(2959,65.0), Rating(1447,64.0), Rating(374,63.0), Rating(860,62.0), Rating(709,61.0), Rating(5605,60.0), Rating(159,59.0), Rating(2446,58.0), Rating(1009,57.0), Rating(1032,56.0), Rating(3317,55.0), Rating(403,54.0), Rating(790,53.0), Rating(5079,52.0), Rating(265,51.0), Rating(2346,50.0), Rating(2121,49.0), Rating(1369,48.0), Rating(913,47.0), Rating(1747,46.0), Rating(484,45.0), Rating(515,44.0), Rating(604,43.0), Rating(953,42.0), Rating(186,41.0), Rating(716,40.0), Rating(903,39.0), Rating(207,38.0), Rating(1042,37.0), Rating(173,36.0), Rating(543,35.0), Rating(1974,34.0), Rating(1803,33.0), Rating(724,32.0), Rating(735,31.0), Rating(479,30.0), Rating(1375,29.0), Rating(61,28.0), Rating(455,27.0), Rating(1181,26.0), Rating(1118,25.0), Rating(58,24.0), Rating(333,23.0), Rating(877,22.0), Rating(3763,21.0), Rating(1090,20.0), Rating(288,19.0), Rating(537,18.0), Rating(898,17.0), Rating(810,16.0), Rating(535,15.0), Rating(4177,14.0), Rating(292,13.0), Rating(209,12.0), Rating(3488,11.0), Rating(306,10.0), Rating(1390,9.0), Rating(1705,8.0), Rating(206,7.0), Rating(1700,6.0), Rating(1185,5.0), Rating(199,4.0), Rating(1412,3.0), Rating(805,2.0), Rating(2582,1.0))
>     test_users: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [userID: int]
>     prediction_scores: org.apache.spark.sql.DataFrame = [userID: int, recommendations: array<struct<artistID:int,rating:float>>]

  

The actual value we get is $\\overline{rank}\\approx 50.86 \\%$ which
agrees with the above reasoning.

In [None]:
println("Random_model: mean_rank=" + eval_model(prediction_scores, training_set, test_set))

  

>     Random_model: mean_rank=50.860817
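
The 50% expectation can also be checked with a small simulation in plain Scala, without Spark. This is a sketch under stated assumptions: the play counts are made up, and the metric is taken to be the play-count-weighted mean percentile rank (0% for the top of the list, 100% for the bottom), which may differ in detail from what `eval_model` computes.

```scala
import scala.util.Random

// Hypothetical play counts for 100 artists of a single user (skewed on purpose).
val playCounts: Vector[Double] = Vector.tabulate(100)(i => 1000.0 / (i + 1))
val n = playCounts.length

// Assumed metric: play-count-weighted mean percentile rank of one ranking.
// ranking(k) is the index of the artist placed at position k (0 = top of the list).
def meanPercentileRank(ranking: Seq[Int]): Double = {
  val weighted = ranking.zipWithIndex.map { case (artist, pos) =>
    playCounts(artist) * (100.0 * pos / (n - 1))
  }.sum
  weighted / playCounts.sum
}

val rng = new Random(42) // fixed seed so the run is reproducible
val trials = 2000
val average =
  (1 to trials).map(_ => meanPercentileRank(rng.shuffle((0 until n).toVector))).sum / trials
println(f"average mean_rank over $trials random rankings: $average%.2f")
```

A single random ranking can land a few percentage points away from 50%, which is consistent with the 50.86% observed above; averaging over many rankings brings the value close to 50%.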

  

**Popular Recommendations:** We recommend to each user the list of artists
sorted by their total number of plays in the training dataset. The
overall most popular artists therefore appear at the top of the
recommendations regardless of the user, so these recommendations are not personalized.
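
The idea can be illustrated on a toy example with plain Scala collections before turning to the Spark version. The triples below are hypothetical data, not taken from the dataset, and the names `plays`, `mostPopular` and `recommendations` are only used for this sketch.

```scala
// Hypothetical (userID, artistID, play_count) triples standing in for the training set.
val plays = Seq(
  (1, 10, 5L), (1, 20, 1L),
  (2, 10, 3L), (2, 30, 7L),
  (3, 20, 2L), (3, 10, 4L)
)

// Total plays per artist, sorted so that the most played artist comes first.
val mostPopular: Seq[(Int, Long)] = plays
  .groupBy { case (_, artist, _) => artist }
  .map { case (artist, rows) => artist -> rows.map(_._3).sum }
  .toSeq
  .sortBy { case (_, total) => -total }

// Every user receives the same list, regardless of listening history.
val users = plays.map(_._1).distinct
val recommendations: Map[Int, Seq[Int]] =
  users.map(u => u -> mostPopular.map(_._1)).toMap
println(recommendations)
```

Since the list does not depend on the user, all entries of `recommendations` are identical, which is exactly what makes this baseline non-personalized.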

In [None]:
// Array of artistIDs sorted by total plays, most played first.
val most_popular = artist_data.select("artistID_2", "total_plays_artist").sort(desc("total_plays_artist"))
  .collect.map(row => Rating(row.getInt(0), row.getLong(1).toFloat))
val test_users = test_set.select("userID").distinct()

// Attach the same array to every user, then keep only the test users.
val prediction_scores = user_artist_data.select("userID").distinct()
  .withColumn("recommendations", typedLit(most_popular))
  .join(test_users, "userID")

  

>     most_popular: Array[Rating] = Array(Rating(289,2261407.0), Rating(72,1299323.0), Rating(89,1257059.0), Rating(292,1058405.0), Rating(498,963449.0), Rating(67,921068.0), Rating(288,905422.0), Rating(701,688529.0), Rating(227,662116.0), Rating(300,532452.0), Rating(333,525844.0), Rating(344,525291.0), Rating(378,513473.0), Rating(679,505661.0), Rating(295,499257.0), Rating(511,493024.0), Rating(461,489065.0), Rating(486,485532.0), Rating(190,485076.0), Rating(163,466104.0), Rating(55,449292.0), Rating(154,385165.0), Rating(466,384405.0), Rating(257,384307.0), Rating(707,371916.0), Rating(917,368710.0), Rating(792,350035.0), Rating(51,348844.0), Rating(65,330757.0), Rating(475,319471.0), Rating(203,318221.0), Rating(157,296861.0), Rating(207,288520.0), Rating(198,277309.0), Rating(377,265362.0), Rating(291,253027.0), Rating(614,251440.0), Rating(173,245878.0), Rating(503,237148.0), Rating(687,215777.0), Rating(903,213103.0), Rating(302,207761.0), Rating(187,205143.0), Rating(1412,203665.0), Rating(1098,202178.0), Rating(1672,200949.0), Rating(458,200027.0), Rating(229,191979.0), Rating(234,190232.0), Rating(306,188634.0), Rating(56,175895.0), Rating(325,166644.0), Rating(533,165975.0), Rating(294,162288.0), Rating(233,160317.0), Rating(209,159733.0), Rating(230,155321.0), Rating(455,153101.0), Rating(159,151103.0), Rating(228,148452.0), Rating(299,144710.0), Rating(681,139363.0), Rating(704,138049.0), Rating(2044,136963.0), Rating(1249,133931.0), Rating(418,133772.0), Rating(599,133511.0), Rating(1369,131397.0), Rating(349,130855.0), Rating(424,130227.0), Rating(1104,129745.0), Rating(298,129352.0), Rating(1459,126314.0), Rating(318,123682.0), Rating(285,120665.0), Rating(1246,120336.0), Rating(316,117229.0), Rating(220,116174.0), Rating(544,115026.0), Rating(680,114690.0), Rating(959,114439.0), Rating(195,113631.0), Rating(689,113467.0), Rating(310,110218.0), Rating(686,110123.0), Rating(412,102309.0), Rating(439,100819.0), Rating(217,100298.0), 
Rating(918,99854.0), Rating(2102,99845.0), Rating(706,99769.0), Rating(441,99296.0), Rating(161,97089.0), Rating(7,96201.0), Rating(2179,95889.0), Rating(868,95880.0), Rating(889,94551.0), Rating(331,94501.0), Rating(1048,94398.0), Rating(464,94179.0), Rating(293,93828.0), Rating(1001,91869.0), Rating(429,91740.0), Rating(329,91169.0), Rating(301,91066.0), Rating(816,90365.0), Rating(1541,90154.0), Rating(152,88594.0), Rating(436,88270.0), Rating(982,88086.0), Rating(2548,87617.0), Rating(735,87432.0), Rating(304,86342.0), Rating(1400,86195.0), Rating(523,85160.0), Rating(1047,83954.0), Rating(265,83373.0), Rating(1014,83206.0), Rating(930,83061.0), Rating(1090,82999.0), Rating(225,81822.0), Rating(906,80222.0), Rating(808,80184.0), Rating(188,79847.0), Rating(540,79486.0), Rating(779,78578.0), Rating(1044,78363.0), Rating(518,78342.0), Rating(63,77046.0), Rating(1131,76012.0), Rating(1244,75991.0), Rating(182,75803.0), Rating(499,75634.0), Rating(2020,75372.0), Rating(1934,75169.0), Rating(279,74834.0), Rating(748,74039.0), Rating(1513,73961.0), Rating(543,73596.0), Rating(290,72962.0), Rating(932,71710.0), Rating(504,71488.0), Rating(1180,71434.0), Rating(1470,71250.0), Rating(536,70828.0), Rating(1045,70589.0), Rating(683,70360.0), Rating(923,70314.0), Rating(226,69833.0), Rating(88,69629.0), Rating(487,68917.0), Rating(488,68306.0), Rating(314,67979.0), Rating(352,67822.0), Rating(59,67259.0), Rating(191,67187.0), Rating(3501,66688.0), Rating(525,65902.0), Rating(920,65886.0), Rating(3057,65641.0), Rating(1673,65344.0), Rating(212,65145.0), Rating(1116,64897.0), Rating(456,64596.0), Rating(972,63661.0), Rating(328,63582.0), Rating(2521,63510.0), Rating(512,62933.0), Rating(843,62838.0), Rating(199,62735.0), Rating(321,61849.0), Rating(1034,61748.0), Rating(537,61679.0), Rating(813,61163.0), Rating(841,60939.0), Rating(375,60167.0), Rating(757,59938.0), Rating(615,59821.0), Rating(296,59353.0), Rating(1406,58965.0), Rating(528,58312.0), Rating(517,58163.0), 
Rating(1375,57847.0), Rating(562,57728.0), Rating(1444,57537.0), Rating(538,57009.0), Rating(907,56599.0), Rating(898,56006.0), Rating(1713,55944.0), Rating(1186,55683.0), Rating(768,55642.0), Rating(324,55463.0), Rating(715,55137.0), Rating(323,54909.0), Rating(155,54825.0), Rating(2094,54663.0), Rating(58,54597.0), Rating(1243,53918.0), Rating(1118,53915.0), Rating(1037,53353.0), Rating(425,53054.0), Rating(1854,52706.0), Rating(1075,52137.0), Rating(2850,51988.0), Rating(1684,51641.0), Rating(802,51602.0), Rating(500,51566.0), Rating(527,50645.0), Rating(542,50241.0), Rating(618,49881.0), Rating(1464,49843.0), Rating(416,49647.0), Rating(961,49603.0), Rating(167,49521.0), Rating(1505,49290.0), Rating(854,49014.0), Rating(548,48631.0), Rating(2531,48397.0), Rating(64,47993.0), Rating(969,47200.0), Rating(172,46606.0), Rating(830,46466.0), Rating(524,46370.0), Rating(493,46295.0), Rating(204,46061.0), Rating(430,45886.0), Rating(1358,45672.0), Rating(1339,45666.0), Rating(709,45355.0), Rating(978,44959.0), Rating(2346,44878.0), Rating(70,44785.0), Rating(468,44673.0), Rating(481,44240.0), Rating(2343,44186.0), Rating(53,44182.0), Rating(646,44182.0), Rating(1188,44151.0), Rating(81,43961.0), Rating(703,43860.0), Rating(470,43710.0), Rating(726,43644.0), Rating(497,43559.0), Rating(1643,43536.0), Rating(370,43458.0), Rating(3280,43440.0), Rating(1191,43303.0), Rating(951,43252.0), Rating(2083,42418.0), Rating(1109,42398.0), Rating(423,42305.0), Rating(238,42195.0), Rating(340,41552.0), Rating(3859,41447.0), Rating(197,41316.0), Rating(966,41094.0), Rating(1114,41028.0), Rating(908,40837.0), Rating(1372,40664.0), Rating(342,40510.0), Rating(1099,40473.0), Rating(815,40428.0), Rating(355,40288.0), Rating(440,40234.0), Rating(222,40203.0), Rating(97,40042.0), Rating(366,39882.0), Rating(1072,39658.0), Rating(622,39491.0), Rating(71,39172.0), Rating(311,39111.0), Rating(30,38949.0), Rating(3200,38781.0), Rating(534,38608.0), Rating(465,38190.0), Rating(1377,37800.0), 
Rating(3616,37663.0), Rating(717,37631.0), Rating(196,37602.0), Rating(428,37532.0), Rating(1613,37521.0), Rating(210,37500.0), Rating(1119,37375.0), Rating(453,37287.0), Rating(718,37209.0), Rating(1043,37136.0), Rating(206,36944.0), Rating(691,36911.0), Rating(851,36669.0), Rating(403,36370.0), Rating(705,35978.0), Rating(1424,35947.0), Rating(724,35924.0), Rating(506,35700.0), Rating(386,35347.0), Rating(1185,35258.0), Rating(1390,35056.0), Rating(192,34976.0), Rating(1772,34934.0), Rating(913,34746.0), Rating(1110,34641.0), Rating(1195,34637.0), Rating(1853,34500.0), Rating(603,34259.0), Rating(1504,34222.0), Rating(221,34165.0), Rating(526,34090.0), Rating(532,34002.0), Rating(1062,33892.0), Rating(857,33757.0), Rating(1803,33624.0), Rating(4566,33603.0), Rating(4821,33559.0), Rating(360,33380.0), Rating(327,33055.0), Rating(2940,33037.0), Rating(2624,32919.0), Rating(1206,32827.0), Rating(547,32234.0), Rating(1239,32136.0), Rating(162,32087.0), Rating(251,32020.0), Rating(1089,31803.0), Rating(1360,31791.0), Rating(1081,31704.0), Rating(2524,31516.0), Rating(1974,31514.0), Rating(3324,31421.0), Rating(952,31404.0), Rating(166,31367.0), Rating(2523,31309.0), Rating(374,31157.0), Rating(1097,31086.0), Rating(1179,31048.0), Rating(1378,30983.0), Rating(383,30891.0), Rating(223,30543.0), Rating(371,30524.0), Rating(790,30446.0), Rating(2840,30349.0), Rating(535,30324.0), Rating(1705,30297.0), Rating(2342,30205.0), Rating(957,30095.0), Rating(432,30083.0), Rating(77,29944.0), Rating(1032,29701.0), Rating(821,29399.0), Rating(1122,29138.0), Rating(472,29041.0), Rating(1983,28956.0), Rating(407,28823.0), Rating(45,28487.0), Rating(4262,28262.0), Rating(928,28262.0), Rating(1035,28256.0), Rating(604,28188.0), Rating(1520,28143.0), Rating(986,28125.0), Rating(1632,28106.0), Rating(3489,28064.0), Rating(859,28038.0), Rating(1512,28029.0), Rating(2661,27969.0), Rating(693,27941.0), Rating(1379,27871.0), Rating(962,27818.0), Rating(714,27680.0), Rating(330,27607.0), 
Rating(697,27597.0), Rating(1451,27570.0), Rating(716,27569.0), Rating(1041,27321.0), Rating(690,27304.0), Rating(489,27179.0), Rating(964,27175.0), Rating(3502,27153.0), Rating(993,27126.0), Rating(710,26972.0), Rating(305,26762.0), Rating(793,26747.0), Rating(317,26534.0), Rating(2619,26329.0), Rating(320,26323.0), Rating(545,26167.0), Rating(1181,26071.0), Rating(437,25748.0), Rating(2006,25719.0), Rating(812,25567.0), Rating(1414,25453.0), Rating(485,25380.0), Rating(3081,25153.0), Rating(389,25020.0), Rating(1274,24811.0), Rating(176,24799.0), Rating(501,24781.0), Rating(332,24585.0), Rating(1639,24567.0), Rating(1556,24546.0), Rating(632,24476.0), Rating(250,24400.0), Rating(362,24133.0), Rating(922,24109.0), Rating(184,23910.0), Rating(1106,23881.0), Rating(1083,23851.0), Rating(4675,23825.0), Rating(2977,23824.0), Rating(2379,23772.0), Rating(507,23743.0), Rating(267,23652.0), Rating(326,23611.0), Rating(2608,23611.0), Rating(1409,23497.0), Rating(734,23472.0), Rating(2544,23423.0), Rating(641,23397.0), Rating(954,23324.0), Rating(810,23282.0), Rating(2092,23242.0), Rating(1091,23154.0), Rating(733,23102.0), Rating(1976,23022.0), Rating(806,22873.0), Rating(953,22865.0), Rating(1755,22781.0), Rating(2139,22760.0), Rating(1515,22756.0), Rating(1242,22748.0), Rating(444,22661.0), Rating(546,22518.0), Rating(846,22441.0), Rating(863,22356.0), Rating(1151,22321.0), Rating(1886,22299.0), Rating(682,22138.0), Rating(998,22085.0), Rating(433,22039.0), Rating(2277,21972.0), Rating(1964,21943.0), Rating(3046,21772.0), Rating(3693,21504.0), Rating(1601,21462.0), Rating(3768,21368.0), Rating(605,21268.0), Rating(607,21204.0), Rating(208,21160.0), Rating(3227,21097.0), Rating(1428,21092.0), Rating(3411,21065.0), Rating(4177,21052.0), Rating(1042,20997.0), Rating(1394,20977.0), Rating(2834,20906.0), Rating(1866,20893.0), Rating(1447,20760.0), Rating(3110,20697.0), Rating(1376,20674.0), Rating(1398,20670.0), Rating(283,20662.0), Rating(3258,20635.0), Rating(601,20415.0), 
Rating(3109,20402.0), Rating(6217,20366.0), Rating(4814,20337.0), Rating(805,20294.0), Rating(4828,20292.0), Rating(3739,20257.0), Rating(308,20231.0), Rating(1061,20209.0), Rating(2347,20205.0), Rating(153,20203.0), Rating(107,20123.0), Rating(180,20046.0), Rating(215,19999.0), Rating(1510,19988.0), Rating(5079,19846.0), Rating(1410,19783.0), Rating(3397,19774.0), Rating(1416,19756.0), Rating(490,19736.0), Rating(755,19654.0), Rating(98,19647.0), Rating(728,19637.0), Rating(997,19603.0), Rating(492,19454.0), Rating(1100,19425.0), Rating(1810,19406.0), Rating(732,19335.0), Rating(915,19328.0), Rating(823,19260.0), Rating(69,19249.0), Rating(1383,19176.0), Rating(1013,19131.0), Rating(2080,19046.0), Rating(2014,18911.0), Rating(4079,18905.0), Rating(1009,18885.0), Rating(99,18793.0), Rating(52,18787.0), Rating(1873,18727.0), Rating(856,18719.0), Rating(858,18706.0), Rating(674,18586.0), Rating(2786,18578.0), Rating(936,18556.0), Rating(602,18451.0), Rating(479,18450.0), Rating(1415,18447.0), Rating(262,18404.0), Rating(2838,18315.0), Rating(193,18280.0), Rating(1200,18213.0), Rating(84,18136.0), Rating(985,18041.0), Rating(630,17930.0), Rating(2091,17926.0), Rating(1145,17862.0), Rating(231,17806.0), Rating(1121,17786.0), Rating(1411,17783.0), Rating(1455,17718.0), Rating(237,17674.0), Rating(1418,17626.0), Rating(2835,17594.0), Rating(171,17561.0), Rating(1130,17492.0), Rating(3201,17408.0), Rating(1633,17346.0), Rating(2446,17265.0), Rating(307,17200.0), Rating(405,17190.0), Rating(786,17142.0), Rating(1196,17125.0), Rating(1155,17092.0), Rating(629,17068.0), Rating(181,16937.0), Rating(450,16857.0), Rating(1073,16706.0), Rating(877,16681.0), Rating(1707,16658.0), Rating(2265,16625.0), Rating(356,16612.0), Rating(1182,16581.0), Rating(3940,16561.0), Rating(2959,16561.0), Rating(2637,16533.0), Rating(1216,16477.0), Rating(1203,16446.0), Rating(170,16324.0), Rating(241,16307.0), Rating(471,16155.0), Rating(5988,16116.0), Rating(955,16071.0), Rating(1350,16069.0), 
Rating(510,16056.0), Rating(867,16022.0), Rating(554,15976.0), Rating(1123,15947.0), Rating(924,15946.0), Rating(3317,15883.0), Rating(1545,15692.0), Rating(1475,15647.0), Rating(297,15634.0), Rating(420,15615.0), Rating(1449,15593.0), Rating(5000,15570.0), Rating(1988,15413.0), Rating(3488,15351.0), Rating(1300,15181.0), Rating(25,15166.0), Rating(186,15113.0), Rating(174,15030.0), Rating(531,15014.0), Rating(744,14995.0), Rating(1547,14991.0), Rating(3333,14885.0), Rating(7340,14846.0), Rating(3902,14810.0), Rating(2607,14790.0), Rating(2206,14748.0), Rating(1413,14723.0), Rating(2542,14721.0), Rating(8589,14703.0), Rating(949,14694.0), Rating(1183,14646.0), Rating(322,14610.0), Rating(2605,14601.0), Rating(874,14566.0), Rating(1678,14557.0), Rating(910,14550.0), Rating(2025,14536.0), Rating(2787,14523.0), Rating(309,14522.0), Rating(1019,14519.0), Rating(1052,14512.0), Rating(278,14504.0), Rating(1040,14434.0), Rating(484,14332.0), Rating(708,14327.0), Rating(1046,14313.0), Rating(1892,14308.0), Rating(1700,14307.0), Rating(1458,14305.0), Rating(6453,14303.0), Rating(3259,14268.0), Rating(644,14234.0), Rating(875,14220.0), Rating(1527,14163.0), Rating(211,14090.0), Rating(1366,14087.0), Rating(1201,13975.0), Rating(785,13906.0), Rating(880,13896.0), Rating(3373,13873.0), Rating(2176,13813.0), Rating(246,13716.0), Rating(1384,13693.0), Rating(1645,13661.0), Rating(712,13640.0), Rating(2018,13569.0), Rating(1167,13520.0), Rating(1038,13504.0), Rating(5149,13501.0), Rating(1193,13473.0), Rating(1580,13461.0), Rating(400,13418.0), Rating(1745,13415.0), Rating(1609,13407.0), Rating(1863,13405.0), Rating(1463,13401.0), Rating(3763,13340.0), Rating(722,13295.0), Rating(1453,13234.0), Rating(2632,13229.0), Rating(4118,13112.0), Rating(1519,13084.0), Rating(1943,13035.0), Rating(5150,12933.0), Rating(3097,12876.0), Rating(61,12833.0), Rating(202,12799.0), Rating(1401,12773.0), Rating(620,12753.0), Rating(459,12715.0), Rating(6120,12661.0), Rating(1874,12565.0), 
Rating(2498,12551.0), Rating(1852,12430.0), Rating(3404,12425.0), Rating(946,12345.0), Rating(2226,12314.0), Rating(4087,12255.0), Rating(426,12240.0), Rating(886,12176.0), Rating(96,12008.0), Rating(205,11918.0), Rating(245,11867.0), Rating(434,11859.0), Rating(261,11769.0), Rating(75,11768.0), Rating(1957,11732.0), Rating(189,11728.0), Rating(3427,11703.0), Rating(2815,11701.0), Rating(2407,11637.0), Rating(1452,11628.0), Rating(1714,11576.0), Rating(2220,11552.0), Rating(447,11518.0), Rating(1426,11484.0), Rating(1936,11450.0), Rating(2938,11388.0), Rating(1456,11374.0), Rating(1425,11295.0), Rating(5605,11286.0), Rating(3475,11249.0), Rating(3470,11214.0), Rating(3171,11203.0), Rating(1681,11191.0), Rating(616,11151.0), Rating(235,11149.0), Rating(845,11083.0), Rating(2582,11049.0), Rating(5416,10997.0), Rating(85,10997.0), Rating(2744,10906.0), Rating(2439,10902.0), Rating(1343,10875.0), Rating(3417,10861.0), Rating(2352,10754.0), Rating(1778,10741.0), Rating(2444,10670.0), Rating(1804,10651.0), Rating(2580,10646.0), Rating(1338,10643.0), Rating(1709,10641.0), Rating(2535,10640.0), Rating(947,10599.0), Rating(968,10577.0), Rating(4616,10506.0), Rating(2797,10505.0), Rating(609,10474.0), Rating(1990,10447.0), Rating(4115,10419.0), Rating(3078,10401.0), Rating(860,10379.0), Rating(2380,10373.0), Rating(3173,10309.0), Rating(5547,10282.0), Rating(1351,10162.0), Rating(1534,10122.0), Rating(3481,10098.0), Rating(2716,10078.0), Rating(1295,10043.0), Rating(2794,10040.0), Rating(1807,10035.0), Rating(3466,10032.0), Rating(4435,10007.0), Rating(1969,9997.0), Rating(2668,9898.0), Rating(3302,9848.0), Rating(515,9786.0), Rating(1192,9718.0), Rating(1824,9718.0), Rating(2137,9716.0), Rating(421,9706.0), Rating(4385,9669.0), Rating(4025,9611.0), Rating(1427,9582.0), Rating(585,9510.0), Rating(258,9498.0), Rating(1747,9433.0), Rating(1198,9381.0), Rating(2614,9380.0), Rating(4562,9357.0), Rating(1298,9348.0), Rating(1947,9255.0), Rating(2923,9223.0), Rating(530,9185.0), 
Rating(797,9083.0), Rating(3767,9069.0), Rating(1105,9037.0), Rating(1833,9034.0), Rating(610,9024.0), Rating(1814,8969.0), Rating(15,8963.0), Rating(1222,8953.0), Rating(621,8806.0), Rating(183,8747.0), Rating(4704,8736.0), Rating(253,8728.0), Rating(2990,8727.0), Rating(9,8716.0), Rating(445,8673.0), Rating(2370,8635.0), Rating(1241,8633.0), Rating(1787,8627.0), Rating(1795,8625.0), Rating(2855,8561.0), Rating(1766,8518.0), Rating(730,8451.0), Rating(3444,8427.0), Rating(612,8415.0), Rating(5926,8324.0), Rating(3400,8289.0), Rating(769,8236.0), Rating(1775,8161.0), Rating(2130,8147.0), Rating(1905,8140.0), Rating(2616,8011.0), Rating(1977,7994.0), Rating(2801,7958.0), Rating(539,7833.0), Rating(7324,7745.0), Rating(1150,7717.0), Rating(4316,7461.0), Rating(1364,7434.0), Rating(1511,7382.0), Rating(1958,7237.0), Rating(2901,7144.0), Rating(3332,7056.0), Rating(2121,6877.0), Rating(1596,6870.0), Rating(980,6833.0), Rating(4180,6709.0), Rating(2956,6600.0), Rating(2600,6563.0), Rating(3279,6517.0), Rating(575,6416.0), Rating(1790,6287.0), Rating(3186,6258.0), Rating(2821,6171.0), Rating(4251,5986.0), Rating(1914,5768.0), Rating(945,5527.0), Rating(1281,5526.0), Rating(1429,5442.0), Rating(977,5277.0), Rating(1079,4919.0), Rating(3741,4676.0), Rating(2783,4282.0), Rating(2595,3187.0), Rating(3462,2301.0))
>     test_users: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [userID: int]
>     prediction_scores: org.apache.spark.sql.DataFrame = [userID: int, recommendations: array<struct<artistID:int,rating:float>>]

  

For this model we get $\\overline{rank}\\approx 24.6 \\%$, which is
better than the random model but much worse than our ALS model, which
achieved $\\overline{rank}\\approx 7.75 \\%$.

In [None]:
println("Popular_model: mean_rank=" + eval_model(prediction_scores, training_set, test_set))

  

>     Popular_model: mean_rank=24.553421
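
To make the comparison of these numbers more concrete, here is a minimal sketch of a play-count-weighted mean percentile rank on toy data. It assumes the metric places each held-out artist at a percentile position in the user's ranked list (0 % for the top of the list, 100 % for the bottom) and weights that position by the play count. The helper `meanPercentileRank` and the toy values are hypothetical; this is not the notebook's `eval_model`, which operates on Spark DataFrames.

```scala
// Hypothetical helper on plain Scala collections (not the notebook's eval_model).
// ranked: for each userID, the artistIDs ordered from most to least recommended.
// test:   held-out observations as (userID, artistID, playCount).
def meanPercentileRank(ranked: Map[Int, Seq[Int]], test: Seq[(Int, Int, Double)]): Double = {
  val weighted = for {
    (user, artist, count) <- test
    list <- ranked.get(user).toSeq
    pos = list.indexOf(artist)
    if pos >= 0 && list.size > 1
  } yield (count * 100.0 * pos / (list.size - 1), count) // (weighted percentile, weight)
  weighted.map(_._1).sum / weighted.map(_._2).sum
}

// Toy example: one user with five ranked artists and two held-out observations.
val ranked = Map(1 -> Seq(10, 20, 30, 40, 50))
val test = Seq((1, 10, 3.0), (1, 30, 1.0))

// Artist 10 sits at 0 %, artist 30 at 50 %; weighted by play counts 3 and 1.
println(meanPercentileRank(ranked, test)) // prints 12.5
```

Under this definition lower values are better, and an ordering that ignores the user entirely would be expected to land near 50 %.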

  

Below we define two functions: one presenting a user's top artists based
on observations in the training set, and one presenting recommended
undiscovered artists generated by our model.

In [None]:
import org.apache.spark.ml.recommendation.ALSModel

// Function for showing the favorite artists for a given user based on the training set.
// Input:
// - userID: Int, the id of the user.
// - n: Int, number of top artists that should be presented.
// - user_artist_data: DataFrame with observations.
// - artist_names: DataFrame mapping artistIDs to actual artist names.
// Output:
// - DataFrame with the user's top artists
def userHistory(userID: Int, n: Int, user_artist_data: DataFrame, artist_names: DataFrame): DataFrame = {
  
  // Filter on the userID, sort the artists w.r.t. the play count and append the actual artist names.
  // Note: Spark does not guarantee that the join preserves the sort order.
  val data = user_artist_data.filter($"userID"===userID).sort(desc("play_count")).join(artist_names, "artistID")
  val history = data.select("userID","artistID","name")
  history.show(n)
  history
}

// Function for presenting recommended artists for a user.
// Input:
// - model: ALSModel, the trained model.
// - userID: DataFrame with the userIDs to generate recommendations for.
// - n: Int, number of top artists that should be presented (applied by the caller, e.g. via show(n)).
// - training_set: DataFrame used during the training.
// - artist_names: DataFrame mapping artistIDs to actual artist names.
// Output:
// - DataFrame with the user's recommended artists, sorted by descending rating
def recommendToUser(model: ALSModel, userID: DataFrame, n: Int, training_set: DataFrame, artist_names: DataFrame) : DataFrame = {
  // Generate recommendations using the model. All n_artists_new artists are scored (rather than
  // only n) so that enough candidates remain after the observed artists are removed below.
  val recommendations = model.recommendForUserSubset(userID, n_artists_new.toInt).withColumn("recommendations", explode($"recommendations"))
                                      .select("userID","recommendations.artistID", "recommendations.rating").join(artist_names, "artistID").select("userID","artistID","name","rating")
  
  // Remove artists already observed in the training set and sort explicitly,
  // since the joins do not guarantee any row order.
  recommendations.join(training_set,training_set("userID")===recommendations("userID") && training_set("artistID")===recommendations("artistID"),"leftanti")
                 .orderBy($"userID", desc("rating"))
}



  

>     import org.apache.spark.ml.recommendation.ALSModel
>     userHistory: (userID: Int, n: Int, user_artist_data: org.apache.spark.sql.DataFrame, artist_names: org.apache.spark.sql.DataFrame)org.apache.spark.sql.DataFrame
>     recommendToUser: (model: org.apache.spark.ml.recommendation.ALSModel, userID: org.apache.spark.sql.DataFrame, n: Int, training_set: org.apache.spark.sql.DataFrame, artist_names: org.apache.spark.sql.DataFrame)org.apache.spark.sql.DataFrame
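
The filtering idea behind `recommendToUser` can be illustrated without Spark. The following is a minimal sketch on plain Scala collections, assuming a single user and toy scores: drop every (user, artist) pair already seen in training, sort the rest by predicted rating, and keep the first `n`. The case class `Scored`, the helper `topUnseen` and all values are hypothetical and only mirror the logic of the anti-join.

```scala
// Hypothetical toy types and data, not taken from the notebook's dataset.
case class Scored(userID: Int, artistID: Int, rating: Double)

// Keep the n best-scored artists that the user has not listened to in the training data.
def topUnseen(scores: Seq[Scored], seen: Set[(Int, Int)], n: Int): Seq[Scored] =
  scores
    .filterNot(s => seen.contains((s.userID, s.artistID))) // plays the role of the "leftanti" join
    .sortBy(s => -s.rating)                                // highest predicted preference first
    .take(n)

val scores = Seq(Scored(1, 10, 0.9), Scored(1, 20, 0.8), Scored(1, 30, 0.7), Scored(1, 40, 0.6))
val seen = Set((1, 10), (1, 30))

// Artists 10 and 30 were already played, so 20 and 40 remain.
println(topUnseen(scores, seen, 2).map(_.artistID)) // prints List(20, 40)
```

Because some of the best-scored artists are usually ones the user already knows, more than `n` candidates have to be scored before filtering; otherwise fewer than `n` recommendations may be left.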

  

Let's generate some recommendations for a user.

In [None]:
println("Listening history:")
// Print top 5 artists for userID 302
val sub_data = userHistory(302, 5, training_set, artist_names)

// Generate top 5 recommendations of undiscovered artists.
val recommendations = recommendToUser(model_final, sub_data, 5, training_set, artist_names)
println("Recommendations:")
recommendations.show(5)

  

>     Listening history:
>     +------+--------+--------------+
>     |userID|artistID|          name|
>     +------+--------+--------------+
>     |   302|      55| Kylie Minogue|
>     |   302|      89|     Lady Gaga|
>     |   302|     265|   Céline Dion|
>     |   302|     288|       Rihanna|
>     |   302|     299|Jennifer Lopez|
>     +------+--------+--------------+
>     only showing top 5 rows
>
>     Recommendations:
>     +------+--------+------------------+----------+
>     |userID|artistID|              name|    rating|
>     +------+--------+------------------+----------+
>     |   302|     289|    Britney Spears|0.88509434|
>     |   302|     292|Christina Aguilera| 0.8353728|
>     |   302|     300|        Katy Perry| 0.7862937|
>     |   302|      67|           Madonna| 0.7838166|
>     |   302|     295|           Beyoncé|0.76221865|
>     +------+--------+------------------+----------+
>     only showing top 5 rows
>
>     sub_data: org.apache.spark.sql.DataFrame = [userID: int, artistID: int ... 1 more field]
>     recommendations: org.apache.spark.sql.DataFrame = [userID: int, artistID: int ... 2 more fields]