Music Recommender System
========================

In general, [recommender
systems](https://en.wikipedia.org/wiki/Recommender_system) are
algorithms designed to suggest relevant items or products to users. Over
the last decades they have attracted much interest because they can
improve the user experience while also generating more
profit for companies. Nowadays, these systems can be found in many
services such as Netflix, Amazon, YouTube and Spotify. As an
indicator of how valuable these algorithms are for such companies: back
in 2006 Netflix announced the open [Netflix Prize
Competition](https://www.netflixprize.com/) for the best algorithm to
predict users' movie ratings based on collected data. The team whose
algorithm improved on the state-of-the-art performance by
at least 10% was promised an award of 1,000,000 USD. **In this notebook we
are going to develop a system for recommending musical artists to users
given their listening history**. We will implement a model related to
the matrix factorization discussed in the preceding notebook.

Problem Setting
---------------

We let $U$ be the set containing all $m$ users and let $I$ be the set
containing all $n$ items. Now, we introduce the matrix $R\in
\mathbb{R}^{m \times n}$ whose elements $r_{ui}$ encode
a possible interaction between user $u\in U$ and item $i \in I$. This
matrix is often very sparse because the vast majority of possible
user-item interactions are never observed. Depending on the type of
information encoded in the interaction matrix $R$, one usually refers to
either *explicit* or *implicit* data.

For explicit data, $r_{ui}$ contains information directly related
to user $u$'s preference for item $i$, e.g. movie ratings. In the case of
implicit data, $r_{ui}$ contains indirect information about a user's
preference for an item, obtained by observing past user behavior. Examples could
be the number of times a user played a song or visited a webpage. Note
that in the implicit case we lack information about items that
the user dislikes: if a user of a music service has not played
any songs from a particular artist, it could mean either that the user
simply doesn't like that artist or that the user hasn't encountered that
artist yet but would potentially like them after discovering
them.

Given the observations in the interaction matrix $R$, we would like our
model to suggest unseen items relevant to the users.

Collaborative Filtering
-----------------------

Broadly speaking, recommender algorithms can be divided into two
categories: [content
based](https://en.wikipedia.org/wiki/Recommender_system#Content-based_filtering)
and [collaborative
filtering](https://en.wikipedia.org/wiki/Collaborative_filtering) (CF).
Here, we will just focus on collaborative filtering which is a technique
using patterns of user-item interactions and disregarding any additional
information about the user or item themselves. It is based on the
assumption that if a user similar to you likes an item, then there is a
high probability that you also like that particular item. In other
words, similar users have similar taste.

There are different approaches to CF, and we have chosen a latent factor
model approach inspired by low-rank SVD factorization of matrices. The
aim is to uncover latent features explaining the observed $r_{ui}$
values. Each user $u$ is associated with a user-feature vector $x_u\in
\mathbb{R}^f$ and similarly each item $i$ is associated with an
item-feature vector $y_i \in \mathbb{R}^f$. Then we want the dot
products $x_u^Ty_i$ to explain the observed $r_{ui}$ values. With
all user- and item-features at hand in the latent space $\mathbb{R}^f$ we can
estimate user $u$'s preference for an unseen item $j$ by simply
computing $x_u^Ty_j$.
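
As a small illustration of the scoring step, here is a minimal sketch in plain Scala (no Spark), written in notebook-cell style. The vectors `xU` and `yJ`, their values, and the helper `dot` are made up for this example and are not part of the notebook's pipeline.

```scala
// Hypothetical 3-dimensional latent vectors (f = 3); the values are made up.
val xU: Array[Double] = Array(0.5, 1.0, -0.25) // user-feature vector x_u
val yJ: Array[Double] = Array(2.0, 0.5, 4.0)   // item-feature vector y_j

// Dot product x_u^T y_j, used as the estimated preference of user u for item j.
def dot(a: Array[Double], b: Array[Double]): Double = {
  require(a.length == b.length, "vectors must live in the same latent space")
  a.zip(b).map { case (p, q) => p * q }.sum
}

val score: Double = dot(xU, yJ)
println(score)
```

Ranking all unseen items for a user then amounts to sorting the items by this score in decreasing order.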

We transform the problem of finding the vectors $x_u, y_i$ into a
minimization problem as suggested in the paper [Collaborative Filtering
for Implicit Feedback
Datasets](https://www.researchgate.net/publication/220765111_Collaborative_Filtering_for_Implicit_Feedback_Datasets).
First we introduce the binarized quantity $p_{ui}$ defined by:

$$p_{ui}=\begin{cases}1 & \text{if } r_{ui}>0, \\ 0 & \text{if }
r_{ui}=0,\end{cases}$$ encoding whether user $u$ has interacted with
and supposedly likes item $i$. However, our confidence that user $u$
likes item $i$ given that $p_{ui}=1$ should vary with the actual
$r_{ui}$ value. As an example, we would be more confident that a user
likes an artist they have listened to hundreds of times than an artist
they have played only once. Therefore we introduce the confidence
$c_{ui}$:

$$c_{ui}=1+\alpha r_{ui},$$

where $\alpha$ is a hyperparameter. From the above equation we can see
that the confidence for unobserved user-item interactions defaults
to 1. Now we formulate the minimization problem:

$$\min_{X,Y}\sum_{u\in U,i \in
I}c_{ui}(p_{ui}-x_u^Ty_i)^2+\lambda\Big(\sum_{u\in U}\|x_u\|^2+\sum_{i\in
I}\|y_i\|^2\Big),$$ where $X,Y$ are matrices holding the $x_u,y_i$ as
rows respectively. Notice the regularization term in the objective
function.
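
To make the quantities above concrete, the following is a minimal sketch in plain Scala (no Spark) that evaluates the preference, the confidence and the objective on a tiny dense matrix. The matrix `r`, the values of `alpha` and `lambda`, and all helper names are assumptions made for illustration only. A real implementation would never loop over all user-item pairs like this.

```scala
// Toy interaction matrix R (2 users x 3 items); the play counts are made up.
val r: Array[Array[Double]] = Array(
  Array(3.0, 0.0, 1.0),
  Array(0.0, 5.0, 0.0)
)
val alpha = 0.5  // hypothetical confidence hyperparameter
val lambda = 0.1 // hypothetical regularization strength

// Binarized preference p_ui and confidence c_ui = 1 + alpha * r_ui.
def preference(rui: Double): Double = if (rui > 0) 1.0 else 0.0
def confidence(rui: Double): Double = 1.0 + alpha * rui

def dot(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (p, q) => p * q }.sum

// Objective: sum_ui c_ui (p_ui - x_u^T y_i)^2 + lambda (sum ||x_u||^2 + sum ||y_i||^2).
// x and y hold the user and item vectors as rows.
def objective(x: Array[Array[Double]], y: Array[Array[Double]]): Double = {
  val fit = (for {
    u <- r.indices
    i <- r(u).indices
    err = preference(r(u)(i)) - dot(x(u), y(i))
  } yield confidence(r(u)(i)) * err * err).sum
  val reg = lambda * (x.map(v => dot(v, v)).sum + y.map(v => dot(v, v)).sum)
  fit + reg
}

// With all-zero factors every prediction is 0, so only the observed entries
// (p_ui = 1) contribute, each weighted by its confidence.
val zerosX = Array.fill(2, 1)(0.0)
val zerosY = Array.fill(3, 1)(0.0)
println(objective(zerosX, zerosY))
```

In this toy case the three observed entries have confidences 2.5, 1.5 and 3.5, which shows how heavily played artists dominate the fit term.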

Dataset
-------

For this application we use a
[dataset](https://grouplens.org/datasets/hetrec-2011/) containing
user-artist listening information from the online music service
[Last.fm](http://www.last.fm).

One of the available files contains triplets (`userID`, `artistID`,
`play_count`) describing the number of times a user has played an
artist. Another file contains tuples (`artistID`, `name`) mapping the
`artistID`s to actual artist names. There are a total of 92834 (`userID`,
`artistID`, `play_count`) triplets containing 1892 unique `userID`s and
17632 unique `artistID`s. Since the observations in the dataset do not
contain direct information about artist preferences, this is an implicit
dataset in the sense described above. Based on this dataset we want
our model to give artist recommendations to the users.

In [None]:
import spark.implicits._
import org.apache.spark.sql.functions._

  

>     import spark.implicits._
>     import org.apache.spark.sql.functions._

  

**Let's load the data!**

In [None]:
// Load the (userID, artistID, play_count) triplets.
val fileName_data="dbfs:/FileStore/tables/project4/hetrec2011-lastfm-2k/user_artists.dat"
val df_raw = spark.read.format("csv").option("header", "true").option("delimiter", "\t").option("inferSchema","true").load(fileName_data).withColumnRenamed("weight","play_count")
df_raw.cache()
df_raw.orderBy(rand()).show(5)

// Load the (artistID, name) tuples.
val fileName_names="dbfs:/FileStore/tables/project4/hetrec2011-lastfm-2k/artists.dat"
val artist_names = spark.read.format("csv").option("header", "true").option("delimiter", "\t").option("inferSchema","true").load(fileName_names).withColumnRenamed("id","artistID").select("artistID","name")
artist_names.cache()
artist_names.show(5)

  

>     +------+--------+----------+
>     |userID|artistID|play_count|
>     +------+--------+----------+
>     |  1553|    2478|       171|
>     |    65|    1858|       254|
>     |   496|       9|       161|
>     |   304|     163|        19|
>     |   747|     507|       626|
>     +------+--------+----------+
>     only showing top 5 rows
>
>     +--------+-----------------+
>     |artistID|             name|
>     +--------+-----------------+
>     |       1|     MALICE MIZER|
>     |       2|  Diary of Dreams|
>     |       3|Carpathian Forest|
>     |       4|     Moi dix Mois|
>     |       5|      Bella Morte|
>     +--------+-----------------+
>     only showing top 5 rows
>
>     fileName_data: String = dbfs:/FileStore/tables/project4/hetrec2011-lastfm-2k/user_artists.dat
>     df_raw: org.apache.spark.sql.DataFrame = [userID: int, artistID: int ... 1 more field]
>     fileName_names: String = dbfs:/FileStore/tables/project4/hetrec2011-lastfm-2k/artists.dat
>     artist_names: org.apache.spark.sql.DataFrame = [artistID: int, name: string]

  

We print some statistics and visualize the raw data.

In [None]:
val n_data = df_raw.count().asInstanceOf[Long].floatValue(); // Number of observations
val n_users = df_raw.agg(countDistinct("userID")).collect()(0)(0).asInstanceOf[Long].floatValue(); // Number of unique users
val n_artists = df_raw.agg(countDistinct("artistID")).collect()(0)(0).asInstanceOf[Long].floatValue(); // Number of unique artists
val sparsity = 1-n_data/(n_users*n_artists) //Sparsity of the data

println("Number of data points: " + n_data)
println("Number of users: " + n_users)
println("Number of artists: " + n_artists)
print("Sparsity:" + sparsity.toString + "\n")


  

>     Number of data points: 92834.0
>     Number of users: 1892.0
>     Number of artists: 17632.0
>     Sparsity:0.9972172
>     n_data: Float = 92834.0
>     n_users: Float = 1892.0
>     n_artists: Float = 17632.0
>     sparsity: Float = 0.9972172

In [None]:
display(df_raw.select("play_count"))

  

[TABLE]

Truncated to 30 rows

In [None]:
display(df_raw.select("play_count").filter($"play_count"<1000))

  

[TABLE]

Truncated to 30 rows

  

We count the total plays and number of unique listeners for each artist.

In [None]:
val artist_data_raw = df_raw.groupBy("artistID").agg(count("artistID") as "unique_users",
                                                      sum("play_count") as "total_plays_artist")
// Join first and sort last, since a join is not guaranteed to preserve row order.
artist_data_raw.join(artist_names,"artistID").sort(desc("total_plays_artist")).show(5)
artist_data_raw.join(artist_names,"artistID").sort(desc("unique_users")).show(5)

  

>     +--------+------------+------------------+------------------+
>     |artistID|unique_users|total_plays_artist|              name|
>     +--------+------------+------------------+------------------+
>     |     289|         522|           2393140|    Britney Spears|
>     |      72|         282|           1301308|      Depeche Mode|
>     |      89|         611|           1291387|         Lady Gaga|
>     |     292|         407|           1058405|Christina Aguilera|
>     |     498|         399|            963449|          Paramore|
>     +--------+------------+------------------+------------------+
>     only showing top 5 rows
>
>     +--------+------------+------------------+--------------+
>     |artistID|unique_users|total_plays_artist|          name|
>     +--------+------------+------------------+--------------+
>     |      89|         611|           1291387|     Lady Gaga|
>     |     289|         522|           2393140|Britney Spears|
>     |     288|         484|            905423|       Rihanna|
>     |     227|         480|            662116|   The Beatles|
>     |     300|         473|            532545|    Katy Perry|
>     +--------+------------+------------------+--------------+
>     only showing top 5 rows
>
>     artist_data_raw: org.apache.spark.sql.DataFrame = [artistID: int, unique_users: bigint ... 1 more field]

In [None]:
display(artist_data_raw.select("total_plays_artist"))

  

[TABLE]

Truncated to 30 rows

In [None]:
display(artist_data_raw.select("total_plays_artist").filter($"total_plays_artist"<10000))

  

[TABLE]

Truncated to 30 rows

In [None]:
display(artist_data_raw.select("unique_users"))

  

[TABLE]

Truncated to 30 rows

  

We count the total plays and the number of unique artists each user has
listened to.

In [None]:
val user_data_raw = df_raw.groupBy("userID").agg(count("userID") as "unique_artists",
                                                  sum("play_count") as "total_plays_user")
user_data_raw.sort(desc("total_plays_user")).show(5)

  

>     +------+--------------+----------------+
>     |userID|unique_artists|total_plays_user|
>     +------+--------------+----------------+
>     |   757|            50|          480039|
>     |  2000|            50|          468409|
>     |  1418|            50|          416349|
>     |  1642|            50|          388251|
>     |  1094|            50|          379125|
>     +------+--------------+----------------+
>     only showing top 5 rows
>
>     user_data_raw: org.apache.spark.sql.DataFrame = [userID: int, unique_artists: bigint ... 1 more field]

In [None]:
display(user_data_raw.select("total_plays_user"))

  

[TABLE]

Truncated to 30 rows

  

Now we join all statistics into a single dataframe.

In [None]:
val df_joined = df_raw.join(artist_data_raw, "artistID").join(user_data_raw, "userID").join(artist_names,"artistID").select("userID", "artistID","play_count", "name", "unique_artists","unique_users", "total_plays_user","total_plays_artist")
df_joined.show(5)

  

>     +------+--------+----------+-------------+--------------+------------+----------------+------------------+
>     |userID|artistID|play_count|         name|unique_artists|unique_users|total_plays_user|total_plays_artist|
>     +------+--------+----------+-------------+--------------+------------+----------------+------------------+
>     |     2|      51|     13883|  Duran Duran|            50|         111|          168737|            348919|
>     |     2|      52|     11690|    Morcheeba|            50|          23|          168737|             18787|
>     |     2|      53|     11351|          Air|            50|          75|          168737|             44230|
>     |     2|      54|     10300| Hooverphonic|            50|          18|          168737|             15927|
>     |     2|      55|      8983|Kylie Minogue|            50|         298|          168737|            449292|
>     +------+--------+----------+-------------+--------------+------------+----------------+------------------+
>     only showing top 5 rows
>
>     df_joined: org.apache.spark.sql.DataFrame = [userID: int, artistID: int ... 6 more fields]

  

Collaborative filtering models suffer from the [cold-start
problem](https://yuspify.com/blog/cold-start-problem-recommender-systems/),
meaning they have difficulty making inferences about new users or
items. Therefore we will filter out artists with fewer than 20 unique
listeners and users that have listened to fewer than 5 artists.

In [None]:
val df_filtered_1 = df_joined.filter($"unique_users">=20).select(df_joined("userID"),df_joined("artistID"),df_joined("play_count"))
val artist_data_1 = df_filtered_1.groupBy("artistID").agg(count("artistID") as "unique_users",
                                                          sum("play_count") as "total_plays_artist")
                                                     .withColumnRenamed("artistID","artistID_1")

val user_data_1 = df_filtered_1.groupBy("userID").agg(count("userID") as "unique_artists",
                                                          sum("play_count") as "total_plays_user")
                                                     .withColumnRenamed("userID","userID_1")

val df_joined_filtered_1 = df_filtered_1.join(artist_data_1, artist_data_1("artistID_1")===df_filtered_1("artistID"))
                                        .join(user_data_1, user_data_1("userID_1")===df_filtered_1("userID"))
                                        .select(df_filtered_1("userID"),df_filtered_1("artistID"),df_filtered_1("play_count"), 
                                                 artist_data_1("unique_users"),artist_data_1("total_plays_artist"),
                                                 user_data_1("unique_artists"), user_data_1("total_plays_user"))

val df_filtered_2 = df_joined_filtered_1.filter($"unique_artists">=5).select(df_filtered_1("userID"),df_filtered_1("artistID"),
                                                                             df_filtered_1("play_count"))

val artist_data = df_filtered_2.groupBy("artistID").agg(count("artistID") as "unique_users",
                                                        sum("play_count") as "total_plays_artist")
                                                   .withColumnRenamed("artistID","artistID_2")

val user_data = df_filtered_2.groupBy("userID").agg(count("userID") as "unique_artists",
                                                         sum("play_count") as "total_plays_user")
                                                   .withColumnRenamed("userID","userID_2")

// Now we collect our new filtered data.
val user_artist_data = df_filtered_2.join(artist_data, artist_data("artistID_2")===df_filtered_2("artistID"))
                                    .join(user_data, user_data("userID_2")===df_filtered_2("userID"))
                                    .select("userID","artistID","play_count","unique_users","total_plays_artist","unique_artists","total_plays_user")

user_artist_data.show(5)

  

>     +------+--------+----------+------------+------------------+--------------+----------------+
>     |userID|artistID|play_count|unique_users|total_plays_artist|unique_artists|total_plays_user|
>     +------+--------+----------+------------+------------------+--------------+----------------+
>     |   148|    1118|       214|          62|             53915|            12|            3026|
>     |   148|    1206|       245|          50|             32827|            12|            3026|
>     |   148|     206|       214|          83|             36944|            12|            3026|
>     |   148|     233|       170|         138|            160317|            12|            3026|
>     |   148|     429|       430|         162|             91740|            12|            3026|
>     +------+--------+----------+------------+------------------+--------------+----------------+
>     only showing top 5 rows
>
>     df_filtered_1: org.apache.spark.sql.DataFrame = [userID: int, artistID: int ... 1 more field]
>     artist_data_1: org.apache.spark.sql.DataFrame = [artistID_1: int, unique_users: bigint ... 1 more field]
>     user_data_1: org.apache.spark.sql.DataFrame = [userID_1: int, unique_artists: bigint ... 1 more field]
>     df_joined_filtered_1: org.apache.spark.sql.DataFrame = [userID: int, artistID: int ... 5 more fields]
>     df_filtered_2: org.apache.spark.sql.DataFrame = [userID: int, artistID: int ... 1 more field]
>     artist_data: org.apache.spark.sql.DataFrame = [artistID_2: int, unique_users: bigint ... 1 more field]
>     user_data: org.apache.spark.sql.DataFrame = [userID_2: int, unique_artists: bigint ... 1 more field]
>     user_artist_data: org.apache.spark.sql.DataFrame = [userID: int, artistID: int ... 5 more fields]

  

Below we can see that we have reduced the amount of data. The number of
users is similar to before, but the number of artists is
significantly reduced.

In [None]:
val n_data_new = user_artist_data.count().asInstanceOf[Long].floatValue();
val n_users_new = user_artist_data.agg(countDistinct("userID")).collect()(0)(0).asInstanceOf[Long].floatValue();
val n_artists_new = user_artist_data.agg(countDistinct("artistID")).collect()(0)(0).asInstanceOf[Long].floatValue();
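// Sparsity of the filtered data, computed from the filtered counts.
// Note: the recorded output below still shows the unfiltered value.
val sparsity_new = 1-n_data_new/(n_users_new*n_artists_new)

println("Number of data points: " + n_data_new)
println("Number of users: " + n_users_new)
println("Number of artists: " + n_artists_new)
print("Sparsity:" + sparsity_new.toString + "\n")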


  

>     Number of data points: 53114.0
>     Number of users: 1819.0
>     Number of artists: 804.0
>     Sparsity:0.9972172
>     n_data_new: Float = 53114.0
>     n_users_new: Float = 1819.0
>     n_artists_new: Float = 804.0
>     sparsity_new: Float = 0.9972172
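
The sparsity reported in the output above coincides with the value for the unfiltered data. As a cross-check (a hand calculation from the filtered counts printed above, not notebook output), the sparsity of the filtered interaction matrix works out to:

$$1-\frac{53114}{1819\cdot 804}=1-\frac{53114}{1462476}\approx 0.9637$$

so roughly 96.4% of the filtered user-artist matrix is unobserved, compared with about 99.7% before filtering.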

In [None]:
display(user_artist_data.select("play_count"))

  

[TABLE]

Truncated to 30 rows

In [None]:
display(artist_data.select("total_plays_artist"))

  

[TABLE]

Truncated to 30 rows

In [None]:
display(artist_data.select("unique_users"))

  

[TABLE]

Truncated to 30 rows

  

The total number of plays is correlated with the number of unique
listeners (as expected), as illustrated in the figure below.

In [None]:
display(artist_data)

  

[TABLE]

Truncated to 30 rows

  

In the
[paper](https://www.researchgate.net/publication/220765111_Collaborative_Filtering_for_Implicit_Feedback_Datasets)
mentioned above, the authors suggest scaling the $r_{ui}$ if the
values tend to vary over a large range, as they do in our case. They
presented a log scaling scheme, but after testing different approaches we
found that scaling by taking the square root of the observed play counts
(thus reducing the range) worked best.

In [None]:
//Scaling the play_counts
val user_artist_data_scaled = user_artist_data
.withColumn("scaled_value", sqrt(col("play_count"))).drop("play_count").withColumnRenamed("scaled_value","play_count")
user_artist_data_scaled.show(5)

  

>     +------+--------+------------+------------------+--------------+----------------+------------------+
>     |userID|artistID|unique_users|total_plays_artist|unique_artists|total_plays_user|        play_count|
>     +------+--------+------------+------------------+--------------+----------------+------------------+
>     |   148|     436|         124|             88270|            12|            3026| 18.33030277982336|
>     |   148|    1206|          50|             32827|            12|            3026|15.652475842498529|
>     |   148|     512|          67|             62933|            12|            3026|15.620499351813308|
>     |   148|     429|         162|             91740|            12|            3026| 20.73644135332772|
>     |   148|    1943|          25|             13035|            12|            3026| 17.26267650163207|
>     +------+--------+------------+------------------+--------------+----------------+------------------+
>     only showing top 5 rows
>
>     user_artist_data_scaled: org.apache.spark.sql.DataFrame = [userID: int, artistID: int ... 5 more fields]
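
To give a feel for how much each scaling compresses the range, here is a minimal sketch in plain Scala (no Spark). The play counts are picked from the tables shown earlier. The helper names, the exact form of `logScale` and its `epsilon` parameter are assumptions for illustration and not necessarily the scheme that was tested for this notebook.

```scala
// A few play counts taken from the tables above (users 148 and 2).
val playCounts: Seq[Double] = Seq(170.0, 430.0, 8983.0, 13883.0)

// Square-root scaling, as used in this notebook.
def sqrtScale(r: Double): Double = math.sqrt(r)

// A log scaling in the spirit of the paper; epsilon is a hypothetical tuning constant.
def logScale(r: Double, epsilon: Double = 1.0): Double = math.log(1.0 + r / epsilon)

// Ratio between the largest and the smallest value, as a crude measure of range.
def spread(xs: Seq[Double]): Double = xs.max / xs.min

println(f"raw spread:  ${spread(playCounts)}%.2f")
println(f"sqrt spread: ${spread(playCounts.map(sqrtScale))}%.2f")
println(f"log spread:  ${spread(playCounts.map(r => logScale(r)))}%.2f")
```

The square root keeps more of the difference between light and heavy listening than the logarithm does, which may be one reason it behaved differently in the experiments.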

  

Display the scaled data

In [None]:
display(user_artist_data_scaled.select("play_count"))

  

[TABLE]

Truncated to 30 rows

  

We split our scaled dataset into training, validation and test sets.

In [None]:
val Array(training_set, validation_set, test_set) = user_artist_data_scaled.select("userID","artistID","play_count").randomSplit(Array(0.6, 0.2, 0.2))
training_set.cache()
validation_set.cache()
test_set.cache()

  

>     training_set: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [userID: int, artistID: int ... 1 more field]
>     validation_set: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [userID: int, artistID: int ... 1 more field]
>     test_set: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [userID: int, artistID: int ... 1 more field]
>     res109: test_set.type = [userID: int, artistID: int ... 1 more field]

  

Alternating Least Squares
-------------------------

By looking at the minimization problem again, we see that if one of $X$
and $Y$ is fixed, the cost function is quadratic in the other and hence the
minimum can be computed in closed form. Thus, we can alternate between
re-computing the user and item features, and it turns out that the
overall cost function is guaranteed to decrease in each iteration. This
procedure is called [Alternating Least
Squares](https://datasciencemadesimpler.wordpress.com/tag/alternating-least-squares/)
and is implemented in Spark. For reference, the objective is
$$\min_{X,Y}\sum_{u\in U,i \in
I}c_{ui}(p_{ui}-x_u^Ty_i)^2+\lambda\Big(\sum_{u\in U}\|x_u\|^2+\sum_{i\in
I}\|y_i\|^2\Big).$$

The solutions to the respective quadratic problems are:

$$x_u=(Y^TC^uY+\lambda\,\mathrm{Id})^{-1}Y^TC^up(u) \quad \forall u\in U,$$
$$y_i=(X^TC^iX+\lambda\,\mathrm{Id})^{-1}X^TC^ip(i) \quad \forall i\in I,$$

where $\mathrm{Id}$ is the identity matrix and $C^u, C^i$ are diagonal
matrices with diagonal entries $c_{ui}$, $i \in I$, and $c_{ui}$,
$u \in U$, respectively. The
$p(u)$ and $p(i)$ are vectors containing all binarized
preferences for user $u$ and item $i$ respectively. The computational
bottleneck is computing $Y^TC^uY$, which requires time $O(f^2n)$ for each
user. However, we can rewrite the product as
$Y^TC^uY=Y^TY+Y^T(C^u-\mathrm{Id})Y$.
Now we see that the term $Y^TY$ does not depend on $u$ and that
$(C^u-\mathrm{Id})$ has only as many non-zero entries as the
number of items user $u$ has interacted with (which is usually much
smaller than the total number of items). Hence, this representation is
much cheaper to compute. A similar approach can be applied
to $X^TC^iX$.
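
The user update and the rewriting trick can be sketched in plain Scala (no Spark) as follows. This is only an illustration under made-up data: the item factors `y`, the confidences in `observed`, the value of `lambda` and every helper name are assumptions, and Spark's ALS implementation is organized differently. The sketch computes $x_u$ twice, once with the shared $Y^TY$ plus a correction over the observed items and once by building $Y^TC^uY$ over all items, so the two results can be compared.

```scala
type Vec = Array[Double]
type Mat = Array[Array[Double]]

def outer(a: Vec, b: Vec): Mat = a.map(ai => b.map(bj => ai * bj))
def addVec(a: Vec, b: Vec): Vec = a.zip(b).map { case (p, q) => p + q }
def add(a: Mat, b: Mat): Mat = a.zip(b).map { case (ra, rb) => addVec(ra, rb) }
def scale(a: Mat, s: Double): Mat = a.map(_.map(_ * s))
def eye(f: Int): Mat = Array.tabulate(f, f)((i, j) => if (i == j) 1.0 else 0.0)

// Solve A x = b by Gaussian elimination with partial pivoting (adequate for tiny f).
def solve(a: Mat, b: Vec): Vec = {
  val n = b.length
  val m = a.zip(b).map { case (row, bi) => row :+ bi } // augmented matrix [A | b]
  for (col <- 0 until n) {
    val pivot = (col until n).maxBy(r => math.abs(m(r)(col)))
    val tmp = m(col); m(col) = m(pivot); m(pivot) = tmp
    for (r <- col + 1 until n) {
      val factor = m(r)(col) / m(col)(col)
      for (c <- col to n) m(r)(c) -= factor * m(col)(c)
    }
  }
  val x = new Array[Double](n)
  for (r <- n - 1 to 0 by -1) {
    val known = (r + 1 until n).map(c => m(r)(c) * x(c)).sum
    x(r) = (m(r)(n) - known) / m(r)(r)
  }
  x
}

// Made-up item factors Y (n = 4 items, f = 2), one item vector y_i per row.
val y: Mat = Array(Array(1.0, 0.0), Array(0.0, 1.0), Array(1.0, 1.0), Array(2.0, 1.0))
val lambda = 0.1
// Confidences c_ui for the items one user interacted with (item index -> c_ui).
// Every other item has c_ui = 1 and p_ui = 0.
val observed: Seq[(Int, Double)] = Seq(0 -> 3.0, 2 -> 1.5)

// Right-hand side Y^T C^u p(u): only observed items (p_ui = 1) contribute.
val rhs: Vec = observed.map { case (i, c) => y(i).map(_ * c) }.reduce(addVec(_, _))

// Y^T Y does not depend on the user, so it is computed once and reused.
val yty: Mat = y.map(v => outer(v, v)).reduce(add(_, _))

// Fast update: Y^T C^u Y = Y^T Y + sum over observed items of (c_ui - 1) y_i y_i^T.
val correction: Mat =
  observed.map { case (i, c) => scale(outer(y(i), y(i)), c - 1.0) }.reduce(add(_, _))
val xFast: Vec = solve(add(add(yty, correction), scale(eye(2), lambda)), rhs)

// Naive update: build Y^T C^u Y by touching all n items.
val cu: Map[Int, Double] = observed.toMap
val ytcy: Mat =
  y.indices.map(i => scale(outer(y(i), y(i)), cu.getOrElse(i, 1.0))).reduce(add(_, _))
val xNaive: Vec = solve(add(ytcy, scale(eye(2), lambda)), rhs)

println(xFast.mkString(", "))
println(xNaive.mkString(", "))
```

Both routes solve the same linear system. The fast one only loops over the two observed items per user, which is where the saving comes from when users have interacted with a small fraction of the catalogue.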

Evaluation
----------

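One approach to measure the performance of the model would be to measure
the RMSE:

$$\sqrt{\frac{1}{\#\text{observations}}\sum_{u,
i}(p_{ui}^t-x_u^Ty_i)^2},$$ where $p_{ui}^t$ are the binarized
preferences in the test set. However, this metric is not very suitable
for this particular application since for the zero entries of
$p^t_{ui}$ we don't know if the user dislikes the item or just
hasn't discovered it. In the
[paper](https://www.researchgate.net/publication/220765111_Collaborative_Filtering_for_Implicit_Feedback_Datasets)
the authors suggest the mean rank metric:

$$\overline{\text{rank}}=\frac{\sum_{u, i}r^t_{ui}\,\text{rank}_{ui}}{\sum_{u, i}
r^t_{ui}},$$

where $r^t_{ui}$ are the observed values in the test set and
$\text{rank}_{ui}$ is the percentile rank of item $i$ in user $u$'s list
of recommendations, ordered by decreasing predicted preference (0% for
the top recommendation, 100% for the last). Lower values are better, and
random recommendations would give an expected value of about 50%.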

In [None]:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.Dataset 
import org.apache.spark.sql.Row
import org.apache.spark.sql.DataFrame

// Mean percentile rank (in percent, lower is better) of the held-out items in validation_set,
// computed over recommendations that exclude items already seen in training_set.
def eval_model(predictions_scores_new: DataFrame, training_set: DataFrame, validation_set: DataFrame) : Float = {
  
  // One row per (user, recommended artist, predicted score).
  val predictions_scores = predictions_scores_new.withColumnRenamed("userID","userID_new")
  val recommendations = predictions_scores.withColumn("recommendations", explode($"recommendations"))
                                      .select("userID_new","recommendations.artistID", "recommendations.rating")
  
  // Drop recommendations for artists the user already has in the training set.
  val recommendations_filtered = recommendations.join(training_set, training_set("userID")===recommendations("userID_new") && training_set("artistID")===recommendations("artistID"), "leftanti")
  
  // Percentile rank within each user's list: 0 for the highest score, 1 for the lowest.
  val recommendations_percentiles = recommendations_filtered.withColumn("rank",percent_rank()
                                                            .over(Window.partitionBy("userID_new").orderBy(desc("rating"))))
  
  // Keep only the (user, artist) pairs that occur in the held-out set.
  val table_data = recommendations_percentiles.join(validation_set, recommendations_percentiles("userID_new")===validation_set("userID") && recommendations_percentiles("artistID")===validation_set("artistID"))
  
  // Play-count-weighted sum of percentile ranks, scaled to percent.
  val numerator = table_data.withColumn("ru1rankui", $"rank"*$"play_count"*100.0)
                            .agg(sum("ru1rankui"))
                            .collect()(0)(0).asInstanceOf[Double]
  
  val denominator = table_data.agg(sum("play_count"))
                              .collect()(0)(0)
                              .asInstanceOf[Double]

  val rank_score = numerator/denominator
  rank_score.toFloat
}

  

>     import org.apache.spark.sql.expressions.Window
>     import org.apache.spark.sql.Dataset
>     import org.apache.spark.sql.Row
>     import org.apache.spark.sql.DataFrame
>     eval_model: (predictions_scores_new: org.apache.spark.sql.DataFrame, training_set: org.apache.spark.sql.DataFrame, validation_set: org.apache.spark.sql.DataFrame)Float
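
The metric computed by `eval_model` can be traced by hand on a toy example. The following minimal sketch in plain Scala (no Spark) handles a single user. The scores, the held-out play counts and all names are made up for illustration, and it assumes there are no tied scores.

```scala
// Hypothetical data for one user: predicted scores for 5 candidate artists
// (artistID -> score) and held-out play counts for two of them.
val scores: Map[Int, Double] = Map(1 -> 0.9, 2 -> 0.7, 3 -> 0.4, 4 -> 0.2, 5 -> 0.1)
val heldOut: Map[Int, Double] = Map(1 -> 30.0, 4 -> 10.0)

// Percentile rank in percent: 0 for the top recommendation, 100 for the last
// (assumes no tied scores and at least two candidates).
val ordered: Seq[Int] = scores.toSeq.sortBy { case (_, s) => -s }.map(_._1)
val percentile: Map[Int, Double] =
  ordered.zipWithIndex.map { case (id, pos) => id -> 100.0 * pos / (ordered.length - 1) }.toMap

// Play-count-weighted mean of the percentile ranks of the held-out artists.
val numerator = heldOut.map { case (id, r) => r * percentile(id) }.sum
val denominator = heldOut.values.sum
val meanRank = numerator / denominator
println(meanRank) // prints 18.75
```

Artist 1 is ranked first (0%) and carries most of the weight, while artist 4 sits at 75%, so the weighted mean is (30 * 0 + 10 * 75) / 40 = 18.75.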

In [None]:
import org.apache.spark.ml.recommendation.ALS
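val numIter = 10

// Hyperparameter grid: latent dimension (rank), regularization (lambda) and confidence scaling (alpha).
val ranks = List(10,50,100,150)
val lambdas = List(0.1, 1.0, 2.0)
val alphas = List(0.5, 1.0, 5.0)
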
for ( alpha <- alphas ){
  for ( lambda <- lambdas ){
    for ( rank <- ranks ){
      val als = new ALS()
        .setRank(rank)
        .setMaxIter(numIter)
        .setRegParam(lambda)
        .setUserCol("userID")
        .setItemCol("artistID")
        .setRatingCol("play_count")
        .setImplicitPrefs(true)
        .setAlpha(alpha)
        .setNonnegative(true)
      val model = als.fit(training_set)
      model.setColdStartStrategy("drop")
           .setUserCol("userID")
           .setItemCol("artistID")
      val predictions_scores = model.recommendForUserSubset(validation_set,n_artists_new.toInt)
      println("rank=" + rank + ", alpha=" + alpha + ", lambda=" + lambda + ", mean_rank=" + eval_model(predictions_scores, training_set, validation_set))
    }
  }
}


  

>     rank=10, alpha=0.5, lambda=0.1, mean_rank=10.410393
>     rank=50, alpha=0.5, lambda=0.1, mean_rank=10.524447
>     rank=100, alpha=0.5, lambda=0.1, mean_rank=11.609247
>     rank=150, alpha=0.5, lambda=0.1, mean_rank=13.049584
>     rank=10, alpha=0.5, lambda=1.0, mean_rank=9.834879
>     rank=50, alpha=0.5, lambda=1.0, mean_rank=8.388225
>     rank=100, alpha=0.5, lambda=1.0, mean_rank=8.468931
>     rank=150, alpha=0.5, lambda=1.0, mean_rank=8.435649
>     rank=10, alpha=0.5, lambda=2.0, mean_rank=9.819813
>     rank=50, alpha=0.5, lambda=2.0, mean_rank=8.098052
>     rank=100, alpha=0.5, lambda=2.0, mean_rank=7.8016405
>     rank=150, alpha=0.5, lambda=2.0, mean_rank=7.7979865
>     rank=10, alpha=1.0, lambda=0.1, mean_rank=10.217629
>     rank=50, alpha=1.0, lambda=0.1, mean_rank=10.891969
>     rank=100, alpha=1.0, lambda=0.1, mean_rank=12.113014
>     rank=150, alpha=1.0, lambda=0.1, mean_rank=12.9173355
>     rank=10, alpha=1.0, lambda=1.0, mean_rank=9.814854
>     rank=50, alpha=1.0, lambda=1.0, mean_rank=8.9062605
>     rank=100, alpha=1.0, lambda=1.0, mean_rank=8.994811
>     rank=150, alpha=1.0, lambda=1.0, mean_rank=9.323116
>     rank=10, alpha=1.0, lambda=2.0, mean_rank=9.574796
>     rank=50, alpha=1.0, lambda=2.0, mean_rank=8.102262
>     rank=100, alpha=1.0, lambda=2.0, mean_rank=7.919162
>     rank=150, alpha=1.0, lambda=2.0, mean_rank=7.806969
>     rank=10, alpha=5.0, lambda=0.1, mean_rank=11.04583
>     rank=50, alpha=5.0, lambda=0.1, mean_rank=12.466616
>     rank=100, alpha=5.0, lambda=0.1, mean_rank=13.031898
>     rank=150, alpha=5.0, lambda=0.1, mean_rank=13.128654
>     rank=10, alpha=5.0, lambda=1.0, mean_rank=10.496872
>     rank=50, alpha=5.0, lambda=1.0, mean_rank=10.822062
>     rank=100, alpha=5.0, lambda=1.0, mean_rank=10.659593
>     rank=150, alpha=5.0, lambda=1.0, mean_rank=10.679214
>     rank=10, alpha=5.0, lambda=2.0, mean_rank=10.021061
>     rank=50, alpha=5.0, lambda=2.0, mean_rank=9.592773
>     rank=100, alpha=5.0, lambda=2.0, mean_rank=9.313808
>     rank=150, alpha=5.0, lambda=2.0, mean_rank=9.194027
>     import org.apache.spark.ml.recommendation.ALS
>     numIter: Int = 10
>     ranks: List[Int] = List(10, 50, 100, 150)
>     lambdas: List[Double] = List(0.1, 1.0, 2.0)
>     alphas: List[Double] = List(0.5, 1.0, 5.0)

In [None]:
val numIter_final=10
val rank_final=150
val alpha_final=0.5
val lambda_final=2.0
val als_final = new ALS()
        .setRank(rank_final)
        .setMaxIter(numIter_final)
        .setRegParam(lambda_final)
        .setUserCol("userID")
        .setItemCol("artistID")
        .setRatingCol("play_count")
        .setImplicitPrefs(true)
        .setAlpha(alpha_final)
        .setNonnegative(true)
val model_final = als_final.fit(training_set)
model_final.setColdStartStrategy("drop")
     .setUserCol("userID")
     .setItemCol("artistID")
val predictions_scores_val = model_final.recommendForUserSubset(validation_set,n_artists_new.toInt)
println("Validation set: mean_rank=" + eval_model(predictions_scores_val, training_set, validation_set))

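val predictions_scores_test = model_final.recommendForUserSubset(test_set,n_artists_new.toInt)
// Evaluate the test set against the recommendations generated for the test-set users.
// Note: the recorded test-set figure below was obtained with predictions_scores_val.
println("Test set: mean_rank=" + eval_model(predictions_scores_test, training_set, test_set))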

  

>     Validation set: mean_rank=7.7979865
>     Test set: mean_rank=7.75016
>     numIter_final: Int = 10
>     rank_final: Int = 150
>     alpha_final: Double = 0.5
>     lambda_final: Double = 2.0
>     als_final: org.apache.spark.ml.recommendation.ALS = als_f778f6cc23db
>     model_final: org.apache.spark.ml.recommendation.ALSModel = als_f778f6cc23db
>     predictions_scores_val: org.apache.spark.sql.DataFrame = [userID: int, recommendations: array<struct<artistID:int,rating:float>>]
>     predictions_scores_test: org.apache.spark.sql.DataFrame = [userID: int, recommendations: array<struct<artistID:int,rating:float>>]

In [None]:
// Popularity baseline: every user gets the same list, with artists ranked by `unique_users`
case class Rating(artistID: Int, rating: Float)
val most_popular = artist_data.select("artistID_2", "unique_users").sort(desc("unique_users"))
                              .collect.map(row =>Rating(row.getInt(0),row.getLong(1).toFloat))
val validation_users = validation_set.select("userID").distinct()
// Attach the popularity ranking to each user, keeping only users present in the validation set
val prediction_scores = user_artist_data.select("userID").distinct().withColumn("recommendations",typedLit(most_popular))
                                        .join(validation_users,"userID")

  

>     defined class Rating
>     most_popular: Array[Rating] = Array(Rating(89,610.0), Rating(289,521.0), Rating(288,483.0), Rating(227,480.0), Rating(300,472.0), Rating(67,428.0), Rating(333,417.0), Rating(292,407.0), Rating(190,400.0), Rating(498,399.0), Rating(295,395.0), Rating(154,392.0), Rating(65,369.0), Rating(466,362.0), Rating(701,319.0), Rating(302,305.0), Rating(306,304.0), Rating(229,304.0), Rating(55,298.0), Rating(461,286.0), Rating(72,280.0), Rating(377,280.0), Rating(157,265.0), Rating(291,260.0), Rating(234,258.0), Rating(163,258.0), Rating(679,248.0), Rating(207,248.0), Rating(344,246.0), Rating(298,243.0), Rating(173,242.0), Rating(257,241.0), Rating(230,236.0), Rating(228,235.0), Rating(349,233.0), Rating(159,229.0), Rating(378,225.0), Rating(707,224.0), Rating(220,220.0), Rating(486,216.0), Rating(533,210.0), Rating(299,209.0), Rating(959,207.0), Rating(475,203.0), Rating(599,198.0), Rating(1412,196.0), Rating(903,195.0), Rating(511,185.0), Rating(424,184.0), Rating(325,182.0), Rating(198,180.0), Rating(706,179.0), Rating(917,177.0), Rating(1090,173.0), Rating(1098,172.0), Rating(318,169.0), Rating(982,164.0), Rating(429,162.0), Rating(1249,161.0), Rating(310,161.0), Rating(294,160.0), Rating(689,159.0), Rating(209,157.0), Rating(328,155.0), Rating(544,152.0), Rating(418,152.0), Rating(464,152.0), Rating(56,150.0), Rating(412,149.0), Rating(301,148.0), Rating(525,145.0), Rating(686,143.0), Rating(1400,143.0), Rating(441,141.0), Rating(329,140.0), Rating(538,140.0), Rating(1104,140.0), Rating(868,139.0), Rating(233,138.0), Rating(1246,135.0), Rating(217,134.0), Rating(331,134.0), Rating(7,133.0), Rating(316,131.0), Rating(735,131.0), Rating(540,130.0), Rating(1048,130.0), Rating(293,128.0), Rating(352,128.0), Rating(182,128.0), Rating(425,128.0), Rating(199,128.0), Rating(681,127.0), Rating(1934,126.0), Rating(1369,126.0), Rating(680,126.0), Rating(225,125.0), Rating(523,125.0), Rating(614,124.0), Rating(436,124.0), Rating(88,124.0), Rating(704,123.0), Rating(161,121.0), 
Rating(528,118.0), Rating(1672,118.0), Rating(1375,115.0), Rating(226,115.0), Rating(203,114.0), Rating(966,112.0), Rating(1044,112.0), Rating(238,111.0), Rating(757,110.0), Rating(51,109.0), Rating(212,108.0), Rating(972,108.0), Rating(59,107.0), Rating(81,107.0), Rating(709,107.0), Rating(324,107.0), Rating(314,106.0), Rating(1513,104.0), Rating(543,104.0), Rating(907,104.0), Rating(918,103.0), Rating(930,103.0), Rating(458,100.0), Rating(518,98.0), Rating(187,96.0), Rating(715,96.0), Rating(320,96.0), Rating(517,96.0), Rating(455,95.0), Rating(1116,95.0), Rating(439,95.0), Rating(1244,93.0), Rating(1001,93.0), Rating(857,92.0), Rating(3057,92.0), Rating(1243,91.0), Rating(285,91.0), Rating(468,91.0), Rating(2346,91.0), Rating(536,91.0), Rating(562,91.0), Rating(969,89.0), Rating(290,89.0), Rating(526,89.0), Rating(488,89.0), Rating(691,89.0), Rating(1239,88.0), Rating(542,87.0), Rating(1037,87.0), Rating(481,86.0), Rating(1358,86.0), Rating(210,85.0), Rating(2521,85.0), Rating(1406,85.0), Rating(70,84.0), Rating(951,84.0), Rating(206,83.0), Rating(913,83.0), Rating(1613,83.0), Rating(1186,83.0), Rating(172,83.0), Rating(3200,82.0), Rating(703,82.0), Rating(265,82.0), Rating(321,82.0), Rating(500,82.0), Rating(191,82.0), Rating(1043,82.0), Rating(58,82.0), Rating(296,81.0), Rating(386,81.0), Rating(311,81.0), Rating(808,81.0), Rating(2343,81.0), Rating(854,80.0), Rating(97,80.0), Rating(1131,80.0), Rating(546,80.0), Rating(430,80.0), Rating(2083,80.0), Rating(618,79.0), Rating(327,79.0), Rating(683,79.0), Rating(841,79.0), Rating(1390,78.0), Rating(497,78.0), Rating(1377,78.0), Rating(859,77.0), Rating(155,77.0), Rating(548,77.0), Rating(432,77.0), Rating(687,77.0), Rating(816,77.0), Rating(176,76.0), Rating(1459,76.0), Rating(1372,76.0), Rating(908,76.0), Rating(920,76.0), Rating(1099,76.0), Rating(403,75.0), Rating(615,75.0), Rating(317,75.0), Rating(716,75.0), Rating(196,74.0), Rating(53,74.0), Rating(366,73.0), Rating(646,73.0), Rating(504,73.0), 
Rating(537,73.0), Rating(718,73.0), Rating(843,72.0), Rating(1042,72.0), Rating(1464,72.0), Rating(779,72.0), Rating(152,71.0), Rating(1360,71.0), Rating(64,71.0), Rating(1032,70.0), Rating(1379,70.0), Rating(166,70.0), Rating(1639,70.0), Rating(499,70.0), Rating(1047,69.0), Rating(1470,69.0), Rating(1180,69.0), Rating(1106,69.0), Rating(748,69.0), Rating(531,69.0), Rating(221,69.0), Rating(1854,69.0), Rating(342,69.0), Rating(1034,68.0), Rating(1974,68.0), Rating(532,68.0), Rating(188,68.0), Rating(503,67.0), Rating(802,67.0), Rating(1109,67.0), Rating(906,67.0), Rating(512,67.0), Rating(863,67.0), Rating(724,66.0), Rating(1075,66.0), Rating(535,66.0), Rating(790,65.0), Rating(733,64.0), Rating(465,64.0), Rating(1415,64.0), Rating(603,64.0), Rating(1045,63.0), Rating(961,63.0), Rating(2094,63.0), Rating(1118,62.0), Rating(690,62.0), Rating(332,62.0), Rating(1455,61.0), Rating(2277,61.0), Rating(714,61.0), Rating(184,61.0), Rating(2523,60.0), Rating(326,60.0), Rating(2531,60.0), Rating(355,60.0), Rating(323,60.0), Rating(978,60.0), Rating(222,59.0), Rating(1504,59.0), Rating(622,59.0), Rating(1122,58.0), Rating(428,58.0), Rating(889,58.0), Rating(534,58.0), Rating(1713,58.0), Rating(330,57.0), Rating(423,57.0), Rating(2020,57.0), Rating(1424,57.0), Rating(71,57.0), Rating(632,57.0), Rating(484,56.0), Rating(1062,56.0), Rating(697,56.0), Rating(726,56.0), Rating(851,56.0), Rating(416,55.0), Rating(440,55.0), Rating(340,55.0), Rating(472,55.0), Rating(1772,54.0), Rating(1632,54.0), Rating(456,54.0), Rating(1242,54.0), Rating(2347,54.0), Rating(923,53.0), Rating(554,53.0), Rating(728,53.0), Rating(204,52.0), Rating(898,52.0), Rating(193,52.0), Rating(1398,52.0), Rating(2014,52.0), Rating(489,52.0), Rating(1119,52.0), Rating(1414,52.0), Rating(797,51.0), Rating(732,51.0), Rating(1515,51.0), Rating(1803,51.0), Rating(251,51.0), Rating(1512,51.0), Rating(304,51.0), Rating(1376,51.0), Rating(964,51.0), Rating(1206,50.0), Rating(1151,50.0), Rating(604,50.0), 
Rating(197,50.0), Rating(1014,50.0), Rating(250,49.0), Rating(793,49.0), Rating(180,49.0), Rating(830,49.0), Rating(3411,49.0), Rating(1447,49.0), Rating(99,49.0), Rating(1378,48.0), Rating(1091,48.0), Rating(1195,48.0), Rating(389,48.0), Rating(162,48.0), Rating(506,48.0), Rating(167,47.0), Rating(493,46.0), Rating(360,46.0), Rating(470,46.0), Rating(1684,46.0), Rating(1418,46.0), Rating(1866,46.0), Rating(1873,46.0), Rating(1145,46.0), Rating(734,46.0), Rating(2176,46.0), Rating(957,46.0), Rating(1179,46.0), Rating(815,46.0), Rating(858,45.0), Rating(1073,45.0), Rating(2080,45.0), Rating(527,45.0), Rating(985,45.0), Rating(1155,45.0), Rating(383,45.0), Rating(485,45.0), Rating(371,45.0), Rating(1097,45.0), Rating(986,44.0), Rating(450,44.0), Rating(1110,44.0), Rating(2605,44.0), Rating(547,44.0), Rating(1580,43.0), Rating(1274,43.0), Rating(308,43.0), Rating(1541,43.0), Rating(1673,43.0), Rating(524,43.0), Rating(77,43.0), Rating(1705,43.0), Rating(997,43.0), Rating(1983,43.0), Rating(1456,43.0), Rating(2850,43.0), Rating(1083,43.0), Rating(195,43.0), Rating(1072,42.0), Rating(813,42.0), Rating(2787,42.0), Rating(237,42.0), Rating(1185,42.0), Rating(712,42.0), Rating(915,42.0), Rating(471,42.0), Rating(1426,42.0), Rating(223,42.0), Rating(3859,42.0), Rating(1410,42.0), Rating(96,42.0), Rating(3201,42.0), Rating(437,42.0), Rating(1451,42.0), Rating(1409,42.0), Rating(2542,41.0), Rating(2786,41.0), Rating(1394,41.0), Rating(693,41.0), Rating(84,41.0), Rating(63,41.0), Rating(768,40.0), Rating(998,40.0), Rating(928,40.0), Rating(1556,40.0), Rating(545,40.0), Rating(607,40.0), Rating(1609,40.0), Rating(630,40.0), Rating(2265,40.0), Rating(208,40.0), Rating(1384,39.0), Rating(1810,39.0), Rating(215,39.0), Rating(1458,39.0), Rating(1052,39.0), Rating(953,39.0), Rating(405,39.0), Rating(1089,38.0), Rating(283,38.0), Rating(3280,38.0), Rating(2352,38.0), Rating(192,38.0), Rating(641,38.0), Rating(4814,38.0), Rating(952,38.0), Rating(1383,38.0), Rating(487,38.0), 
Rating(1976,38.0), Rating(420,38.0), Rating(375,37.0), Rating(1633,37.0), Rating(279,37.0), Rating(1453,37.0), Rating(962,37.0), Rating(2661,37.0), Rating(949,37.0), Rating(1413,37.0), Rating(1339,37.0), Rating(2342,37.0), Rating(602,37.0), Rating(1452,37.0), Rating(374,37.0), Rating(45,37.0), Rating(370,37.0), Rating(362,36.0), Rating(444,36.0), Rating(4262,36.0), Rating(1886,36.0), Rating(955,36.0), Rating(1444,36.0), Rating(2608,36.0), Rating(954,36.0), Rating(2840,35.0), Rating(993,35.0), Rating(629,35.0), Rating(170,35.0), Rating(1416,35.0), Rating(501,35.0), Rating(3258,35.0), Rating(2582,35.0), Rating(453,35.0), Rating(856,35.0), Rating(433,35.0), Rating(2977,35.0), Rating(2137,35.0), Rating(3110,35.0), Rating(1892,34.0), Rating(3373,34.0), Rating(171,34.0), Rating(3317,34.0), Rating(1188,34.0), Rating(2179,34.0), Rating(507,34.0), Rating(2834,34.0), Rating(1009,34.0), Rating(1046,33.0), Rating(2544,33.0), Rating(1449,33.0), Rating(153,33.0), Rating(705,33.0), Rating(407,33.0), Rating(7324,33.0), Rating(183,33.0), Rating(1681,33.0), Rating(1988,33.0), Rating(846,33.0), Rating(309,33.0), Rating(1520,33.0), Rating(1814,33.0), Rating(2855,33.0), Rating(1755,33.0), Rating(710,33.0), Rating(932,33.0), Rating(3470,33.0), Rating(2025,32.0), Rating(305,32.0), Rating(4562,32.0), Rating(1351,32.0), Rating(3739,32.0), Rating(4821,32.0), Rating(30,32.0), Rating(605,32.0), Rating(1013,32.0), Rating(1510,32.0), Rating(601,32.0), Rating(1061,32.0), Rating(1853,32.0), Rating(4704,32.0), Rating(3081,32.0), Rating(75,32.0), Rating(2379,32.0), Rating(755,32.0), Rating(1181,31.0), Rating(1216,31.0), Rating(1123,31.0), Rating(1795,31.0), Rating(1114,31.0), Rating(1601,31.0), Rating(730,31.0), Rating(492,31.0), Rating(875,30.0), Rating(539,30.0), Rating(1130,30.0), Rating(445,30.0), Rating(3616,30.0), Rating(1534,30.0), Rating(1411,30.0), Rating(3475,30.0), Rating(2607,30.0), Rating(1463,30.0), Rating(261,30.0), Rating(968,30.0), Rating(5079,30.0), Rating(2624,30.0), 
Rating(2548,30.0), Rating(722,30.0), Rating(1019,29.0), Rating(1475,29.0), Rating(877,29.0), Rating(459,29.0), Rating(2938,29.0), Rating(1964,29.0), Rating(4079,29.0), Rating(1700,29.0), Rating(3171,29.0), Rating(3502,29.0), Rating(2102,29.0), Rating(1167,29.0), Rating(2535,29.0), Rating(434,29.0), Rating(3693,29.0), Rating(490,29.0), Rating(1198,28.0), Rating(2637,28.0), Rating(1527,28.0), Rating(609,28.0), Rating(278,28.0), Rating(717,28.0), Rating(69,28.0), Rating(2018,28.0), Rating(1775,28.0), Rating(4177,28.0), Rating(1191,28.0), Rating(620,28.0), Rating(2801,28.0), Rating(2524,28.0), Rating(3902,28.0), Rating(1121,28.0), Rating(3397,28.0), Rating(821,28.0), Rating(245,28.0), Rating(530,27.0), Rating(3404,27.0), Rating(3417,27.0), Rating(267,27.0), Rating(1079,27.0), Rating(1707,27.0), Rating(2006,27.0), Rating(4118,27.0), Rating(1874,27.0), Rating(231,27.0), Rating(1041,27.0), Rating(786,27.0), Rating(1241,27.0), Rating(2815,27.0), Rating(2616,26.0), Rating(674,26.0), Rating(980,26.0), Rating(15,26.0), Rating(241,26.0), Rating(612,26.0), Rating(6453,26.0), Rating(2614,26.0), Rating(792,26.0), Rating(3324,26.0), Rating(1182,26.0), Rating(922,26.0), Rating(3427,26.0), Rating(25,26.0), Rating(2139,26.0), Rating(1105,26.0), Rating(174,26.0), Rating(1787,25.0), Rating(1745,25.0), Rating(1201,25.0), Rating(867,25.0), Rating(936,25.0), Rating(447,25.0), Rating(1957,25.0), Rating(307,25.0), Rating(823,25.0), Rating(1545,25.0), Rating(1943,25.0), Rating(1081,25.0), Rating(181,25.0), Rating(510,25.0), Rating(2901,25.0), Rating(3259,25.0), Rating(1038,25.0), Rating(3466,25.0), Rating(2121,25.0), Rating(3400,25.0), Rating(253,25.0), Rating(1505,25.0), Rating(644,25.0), Rating(3097,25.0), Rating(61,25.0), Rating(2783,25.0), Rating(610,25.0), Rating(810,25.0), Rating(805,25.0), Rating(1366,25.0), Rating(1428,25.0), Rating(186,25.0), Rating(3488,24.0), Rating(1766,24.0), Rating(1183,24.0), Rating(479,24.0), Rating(1035,24.0), Rating(1947,24.0), Rating(2091,24.0), 
Rating(708,24.0), Rating(1852,24.0), Rating(1958,24.0), Rating(2923,24.0), Rating(3173,24.0), Rating(2668,24.0), Rating(924,24.0), Rating(1425,24.0), Rating(4828,24.0), Rating(2092,24.0), Rating(1427,24.0), Rating(1519,24.0), Rating(5149,24.0), Rating(812,24.0), Rating(3302,24.0), Rating(3767,24.0), Rating(2990,24.0), Rating(3078,24.0), Rating(947,24.0), Rating(7340,23.0), Rating(744,23.0), Rating(2044,23.0), Rating(2380,23.0), Rating(1807,23.0), Rating(886,23.0), Rating(107,23.0), Rating(3501,23.0), Rating(2959,23.0), Rating(1790,23.0), Rating(3763,23.0), Rating(2498,23.0), Rating(6120,23.0), Rating(945,23.0), Rating(2130,23.0), Rating(211,23.0), Rating(806,23.0), Rating(2595,23.0), Rating(52,23.0), Rating(2600,23.0), Rating(3332,23.0), Rating(3333,23.0), Rating(2794,23.0), Rating(2444,23.0), Rating(400,23.0), Rating(2744,23.0), Rating(4316,23.0), Rating(910,23.0), Rating(1977,22.0), Rating(5926,22.0), Rating(860,22.0), Rating(4435,22.0), Rating(235,22.0), Rating(1298,22.0), Rating(1596,22.0), Rating(1714,22.0), Rating(2226,22.0), Rating(621,22.0), Rating(1196,22.0), Rating(4616,22.0), Rating(421,22.0), Rating(2370,22.0), Rating(3481,22.0), Rating(2619,22.0), Rating(5150,22.0), Rating(262,22.0), Rating(8589,22.0), Rating(98,22.0), Rating(1936,22.0), Rating(515,22.0), Rating(2632,22.0), Rating(1193,22.0), Rating(3768,22.0), Rating(1429,22.0), Rating(785,22.0), Rating(2580,21.0), Rating(322,21.0), Rating(2835,21.0), Rating(874,21.0), Rating(2439,21.0), Rating(1350,21.0), Rating(4025,21.0), Rating(3741,21.0), Rating(297,21.0), Rating(3489,21.0), Rating(845,21.0), Rating(189,21.0), Rating(6217,21.0), Rating(1778,21.0), Rating(880,21.0), Rating(1643,21.0), Rating(1833,21.0), Rating(1200,21.0), Rating(4087,21.0), Rating(5988,21.0), Rating(946,21.0), Rating(3462,21.0), Rating(5416,21.0), Rating(4675,21.0), Rating(1511,21.0), Rating(1040,21.0), Rating(1645,21.0), Rating(2821,21.0), Rating(1804,21.0), Rating(1295,21.0), Rating(2407,21.0), Rating(4251,21.0), 
Rating(246,21.0), Rating(5000,21.0), Rating(1100,21.0), Rating(1222,21.0), Rating(2220,21.0), Rating(9,21.0), Rating(1914,21.0), Rating(2940,21.0), Rating(1905,21.0), Rating(1401,21.0), Rating(977,21.0), Rating(1709,21.0), Rating(4566,21.0), Rating(575,21.0), Rating(1338,21.0), Rating(3046,21.0), Rating(3444,21.0), Rating(1990,20.0), Rating(85,20.0), Rating(2797,20.0), Rating(3279,20.0), Rating(5547,20.0), Rating(3940,20.0), Rating(2956,20.0), Rating(205,20.0), Rating(258,20.0), Rating(4180,20.0), Rating(1343,20.0), Rating(356,20.0), Rating(1364,20.0), Rating(4385,20.0), Rating(1547,20.0), Rating(3186,20.0), Rating(2446,20.0), Rating(682,20.0), Rating(1863,20.0), Rating(5605,20.0), Rating(1824,20.0), Rating(1192,20.0), Rating(1747,20.0), Rating(202,20.0), Rating(2206,20.0), Rating(4115,20.0), Rating(2716,20.0), Rating(769,20.0), Rating(1150,20.0), Rating(426,20.0), Rating(3109,20.0), Rating(1678,20.0), Rating(2838,20.0), Rating(3227,20.0), Rating(1300,19.0), Rating(616,19.0), Rating(1969,19.0), Rating(1203,19.0), Rating(585,18.0), Rating(1281,18.0))
>     validation_users: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [userID: int]
>     prediction_scores: org.apache.spark.sql.DataFrame = [userID: int, recommendations: array<struct<artistID:int,rating:float>>]

In [None]:
println("Popular_model: mean_rank=" + eval_model(prediction_scores, training_set, test_set))

  

>     Popular_model: mean_rank=23.911098

In [None]:
// Random baseline: every user gets the same randomly shuffled artist list;
// the row number of each artist in the shuffle serves as a dummy score
val random = artist_data.select("artistID_2").distinct().orderBy(rand()).withColumn("idx",monotonically_increasing_id)
           .withColumn("rownumber",row_number.over(Window.orderBy(desc("idx")))).drop("idx").sort(desc("rownumber"))
          .collect.map(row =>Rating(row.getInt(0),row.getInt(1).toFloat))
val validation_users = validation_set.select("userID").distinct()
// Attach the random ranking to each user, keeping only users present in the validation set
val prediction_scores = user_artist_data.select("userID").distinct().withColumn("recommendations",typedLit(random))
                                        .join(validation_users,"userID")

In [None]:
println("Random_model: mean_rank=" + eval_model(prediction_scores, training_set, test_set))

In [None]:
import org.apache.spark.ml.recommendation.ALSModel

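// Show the n most played artists of a user and return the user's full listening history
def userHistory(userID: Int, n: Int, user_artist_data: DataFrame, artist_names: DataFrame): DataFrame = {
  // Sort after the join: a join does not guarantee that an earlier ordering is preserved
  val data = user_artist_data.filter($"userID"===userID).join(artist_names, "artistID").sort(desc("play_count"))
  val history = data.select("userID","artistID","name")
  history.show(n)
  history
}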

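// Recommend artists the user has not already listened to in the training set, best score first.
// Note: n is not used to truncate the result; callers limit the output themselves, e.g. with show(n).
def recommendToUser(model: ALSModel, userID: DataFrame, n: Int, training_set: DataFrame, artist_names: DataFrame) : DataFrame = {
  // Score every artist, since the top of the list may consist of artists the user already knows
  val recommendations = model.recommendForUserSubset(userID, n_artists_new.toInt).withColumn("recommendations", explode($"recommendations"))
                                      .select("userID","recommendations.artistID", "recommendations.rating").join(artist_names, "artistID").select("userID","artistID","name","rating")
  // The left anti join drops every (userID, artistID) pair that occurs in the training set
  recommendations.join(training_set,training_set("userID")===recommendations("userID") && training_set("artistID")===recommendations("artistID"),"leftanti")
                 .orderBy($"userID", desc("rating"))
}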
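
The filtering step in `recommendToUser` can be hard to read in DataFrame form. As a minimal sketch of the same idea on plain Scala collections (not the notebook's implementation), the block below drops recommendations the user has already listened to and keeps the highest-scored remainder. The names `Rec`, `unseenTopN`, `recs` and `seen` are hypothetical stand-ins for the DataFrames used above.

```scala
// Hypothetical stand-in for one row of the exploded recommendations DataFrame.
case class Rec(userID: Int, artistID: Int, rating: Float)

// Keep the n best-scored recommendations whose (userID, artistID) pair
// does not occur in the set of already observed interactions.
def unseenTopN(recs: Seq[Rec], seen: Set[(Int, Int)], n: Int): Seq[Rec] =
  recs
    .filterNot(r => seen.contains((r.userID, r.artistID))) // plays the role of the left anti join
    .sortBy(r => -r.rating)                                // highest score first
    .take(n)

val recs = Seq(Rec(1, 89, 0.9f), Rec(1, 289, 0.8f), Rec(1, 288, 0.7f), Rec(1, 227, 0.6f))
val seen = Set((1, 89), (1, 288)) // artists user 1 already listened to
val top = unseenTopN(recs, seen, 1)
```

With these inputs `top` contains only `Rec(1, 289, 0.8f)`: artist 89 has the highest score but is removed because the user has already listened to it.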



In [None]:
training_set.filter($"artistID"===89).sort(desc("play_count")).show(5)

In [None]:
println("Listening history:")
val sub_data = userHistory(302, 5, training_set, artist_names)
val recommendations = recommendToUser(model_final, sub_data, 5, training_set, artist_names)
println("Recommendations:")
recommendations.show(5)