# Recommendation System in Spark using ALS

The idea is to obtain listening data and convert it into a form that ALS can interpret in order to generate top-k recommendations. The main signal is the number of songs each user has listened to per genre, artist, and language. The ALS models predict the top k genres, artists, and languages for a user, and the most popular songs belonging to those items are then recommended to the user.

In [18]:
pip install kaggle

Note: you may need to restart the kernel to use updated packages.


In [20]:
!kaggle competitions download -c kkbox-music-recommendation-challenge

Downloading kkbox-music-recommendation-challenge.zip to /Users/mahi/Documents/GitHub/BDA_Project
100%|███████████████████████████████████████▉| 344M/345M [00:19<00:00, 17.7MB/s]
100%|████████████████████████████████████████| 345M/345M [00:19<00:00, 18.8MB/s]


In [21]:
import zipfile
with zipfile.ZipFile("kkbox-music-recommendation-challenge.zip","r") as zip_ref:
    zip_ref.extractall()

In [22]:
!ls 

Project_BDA.ipynb
README.md
Recommendation System using ALS.ipynb
kkbox-music-recommendation-challenge.zip
members.csv
members.csv.7z
sample_submission.csv.7z
song_extra_info.csv.7z
songs.csv.7z
test.csv.7z
train.csv.7z


Unfortunately, I could not find a way to extract the 7z archives from within the notebook, so the alternative is to extract them with the local file explorer. Only train.csv and songs.csv are needed, so extract those two archives and leave the resulting files in the working folder.
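
One possible way to do the extraction from Python is the third-party `py7zr` package. This is an assumption: `py7zr` is not used anywhere in this notebook and would need `pip install py7zr` first. The helper names `missing_csvs` and `extract_7z` are hypothetical, and the sketch has not been run against the KKBox archives.

```python
from pathlib import Path


def missing_csvs(folder=".", wanted=("train.csv", "songs.csv")):
    """Return the .7z archives whose extracted CSV is not in `folder` yet."""
    folder = Path(folder)
    return [folder / f"{name}.7z" for name in wanted
            if not (folder / name).exists()]


def extract_7z(archive, dest="."):
    """Extract one .7z archive into `dest` using py7zr (third-party)."""
    import py7zr  # imported lazily so missing_csvs works without the package

    with py7zr.SevenZipFile(archive, mode="r") as z:
        z.extractall(path=dest)


if __name__ == "__main__":
    # Only extract archives that are present and not yet unpacked
    for archive in missing_csvs():
        if archive.exists():
            extract_7z(archive)
```

Extracting by hand, as described above, remains the fallback if the package is not available.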

## Creating Spark session and importing libraries required for running the code.

In [68]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split
from pyspark.ml.feature import StringIndexer
from pyspark.ml.recommendation import ALS

# Reuse the active Spark session if there is one, otherwise start a new one
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

The dataframe below holds every song in the dataset.

In [4]:
# Reading the songs data

# The CSV option is spelled inferSchema; an unrecognised key such as infer_schema is ignored
songs_df = spark.read.option('inferSchema','true').option('header','true').csv('songs.csv')

# Showing the top 5 rows

songs_df.show(5)

+--------------------+-----------+---------+-------------------+--------------------+-----------+--------+
|             song_id|song_length|genre_ids|        artist_name|            composer|   lyricist|language|
+--------------------+-----------+---------+-------------------+--------------------+-----------+--------+
|CXoTN1eb7AI+DntdU...|     247640|      465|張信哲 (Jeff Chang)|                董貞|     何啟弘|     3.0|
|o0kFgae9QtnYgRkVP...|     197328|      444|          BLACKPINK|TEDDY|  FUTURE BO...|      TEDDY|    31.0|
|DwVvVurfpuz+XPuFv...|     231781|      465|       SUPER JUNIOR|                null|       null|    31.0|
|dKMBWoZyScdxSkihK...|     273554|      465|              S.H.E|              湯小康|     徐世珍|     3.0|
|W3bqWd3T+VeHFzHAU...|     140329|      726|           貴族精選|         Traditional|Traditional|    52.0|
+--------------------+-----------+---------+-------------------+--------------------+-----------+--------+
only showing top 5 rows



Let's explore the dataset to know some stats!

In [8]:
# Total number of Genres

print("Total distinct Genres:", songs_df.select('genre_ids').distinct().count())

Total distinct Genres: 1047


In [10]:
# Total number of songs.

print("Total unique songs:", songs_df.select('song_id').distinct().count())

Total unique songs: 2296833


In [11]:
# Total number of artists

print("Total unique artists:",songs_df.select('artist_name').distinct().count())

Total unique artists: 222409


The dataframe below lists, for each user (msno), the songs they listened to (song_id) and the place in the app where each song was played. The target column is 1 if there was a recurring listening event within a month of the user's first listening event for that song, and 0 otherwise.

In [12]:
# Reading the data of users and songs

# The CSV option is spelled inferSchema; an unrecognised key such as infer_schema is ignored
df = spark.read.option('inferSchema','true').option('header','true').csv('train.csv')

# Showing top 5 rows

df.show(5)

+--------------------+--------------------+-----------------+-------------------+---------------+------+
|                msno|             song_id|source_system_tab| source_screen_name|    source_type|target|
+--------------------+--------------------+-----------------+-------------------+---------------+------+
|FGtllVqz18RPiwJj/...|BBzumQNXUHKdEBOB7...|          explore|            Explore|online-playlist|     1|
|Xumu+NIjS6QYVxDS4...|bhp/MpSNoqoxOIB+/...|       my library|Local playlist more| local-playlist|     1|
|Xumu+NIjS6QYVxDS4...|JNWfrrC7zNN7BdMps...|       my library|Local playlist more| local-playlist|     1|
|Xumu+NIjS6QYVxDS4...|2A87tzfnJTSWqD7gI...|       my library|Local playlist more| local-playlist|     1|
|FGtllVqz18RPiwJj/...|3qm6XTZ6MOCU11x8F...|          explore|            Explore|online-playlist|     1|
+--------------------+--------------------+-----------------+-------------------+---------------+------+
only showing top 5 rows



Information about the above data:
1. msno - unique user id
2. song_id - unique song id
3. source_system_tab - the source or tab in the app where song was played
4. source_screen_name - name of the source or tab in the app
5. source_type - type of source
6. target - 1 if the user listened again within a month, 0 otherwise

In [13]:
# Total number of distinct users 

print("Total number of unique users:",df.select('msno').distinct().count())

Total number of unique users: 30755


In [15]:
# Total number of rows in data to know the volume of data we are dealing with
print("Total number of rows:", df.count())

Total number of rows: 7377418


In [17]:
# Number of rows with target == 1 (a repeat listening event within a month)

print("Total number of listening events in a month:",df.where(df.target==1).count())

Total number of listening events in a month: 3714656


In [26]:
# Number of rows with target == 0 (no repeat listening event within a month); this counts rows, not users

print("Users who listened to a song after a month:",df.where(df.target == 0).count())

Users who listened to a song after a month: 3662762


The counts above show that the dataset is large: about 7.4 million rows. Because of local computational limits, processing all of it takes a long time even in Spark, so we work with a random sample of the data to derive predictions. The same procedure can be applied to any other sample.

In [23]:
# Take a 1% random sample; even this sample has about 74,000 rows (no seed is set, so the exact count varies between runs)

df_sample = df.sample(0.01)
df_sample.count()

73929
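
The sample size is not exactly 1% of the rows. Assuming `df.sample(0.01)` keeps each row independently with probability 0.01 (sampling without replacement), the count follows a binomial distribution. The arithmetic below uses only the row counts printed above and is a rough check that 73929 lies within the expected spread.

```python
import math

total_rows = 7377418   # df.count() printed above
fraction = 0.01        # fraction passed to df.sample
observed = 73929       # df_sample.count() printed above

# Mean and standard deviation of a Binomial(total_rows, fraction) count
expected = total_rows * fraction
std_dev = math.sqrt(total_rows * fraction * (1 - fraction))

# Distance of the observed count from the mean, in standard deviations
z_score = (observed - expected) / std_dev

print(f"expected rows: {expected:.0f}")
print(f"standard deviation: {std_dev:.0f}")
print(f"observed is {z_score:.2f} standard deviations above the mean")
```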

In [29]:
# The sample is roughly balanced between target = 1 and target = 0

print("Proportion of target = 1:", df_sample.where(df_sample.target == 1).count()/df_sample.count())

Proportion of target = 1: 0.49907343532308024


In [31]:
print("Proportion of target = 0:", df_sample.where(df_sample.target == 0).count()/df_sample.count())

Proportion of target = 0: 0.5009265646769198


## Data Preparation

The data now needs to be prepared before it can be fed into the ALS recommender.

In [37]:
# Number of songs listened to by each user in the sample, most active users first

df_sample.groupBy('msno').count().orderBy('count', ascending = False).take(10)

[Row(msno='MXIMDXO0j3UpaT7FvOSGW6Y5zfhlh+xYjTqGoUdMzEE=', count=62),
 Row(msno='KGXNZ/H3VxvET/+rGxlrAe7Gpz2eKMXyuSg3xh8Ij1M=', count=55),
 Row(msno='FGtllVqz18RPiwJj/edr2gV78zirAiY/9SmYvia+kCg=', count=54),
 Row(msno='cqjRBV/jWN2ujhc+z/4tz+Mj6xEfflAAt6qBXCqxKvw=', count=50),
 Row(msno='SZ5NNypqaTWljFO1HiVZwkw3713+rM9x/JNdJd8/fzc=', count=40),
 Row(msno='hYJpPvGod6vy09TnlXdQe3Q0vlxju5u5Ruf8V2XkTio=', count=39),
 Row(msno='OOUnJuX4SteRhUdJZ9B2DqtfiwsfcZVBefEhXLeBsFg=', count=38),
 Row(msno='o+5RNlSWrzvrphgBNGIo1FLkGxBgyICns6qXj3nS7Pk=', count=38),
 Row(msno='LThaiVqGGnVTPmTcmwN/LLo4fVb5dzkduzd7s1SgzIA=', count=37),
 Row(msno='gxxBbzV3eE2XGjUrFVB2FzAve55Oe1s86HD+OEh36Gw=', count=35)]

The most active user in the sample has 62 listening events.

In [38]:
# Coming back to songs data

songs_df.show(5)

+--------------------+-----------+---------+-------------------+--------------------+-----------+--------+
|             song_id|song_length|genre_ids|        artist_name|            composer|   lyricist|language|
+--------------------+-----------+---------+-------------------+--------------------+-----------+--------+
|CXoTN1eb7AI+DntdU...|     247640|      465|張信哲 (Jeff Chang)|                董貞|     何啟弘|     3.0|
|o0kFgae9QtnYgRkVP...|     197328|      444|          BLACKPINK|TEDDY|  FUTURE BO...|      TEDDY|    31.0|
|DwVvVurfpuz+XPuFv...|     231781|      465|       SUPER JUNIOR|                null|       null|    31.0|
|dKMBWoZyScdxSkihK...|     273554|      465|              S.H.E|              湯小康|     徐世珍|     3.0|
|W3bqWd3T+VeHFzHAU...|     140329|      726|           貴族精選|         Traditional|Traditional|    52.0|
+--------------------+-----------+---------+-------------------+--------------------+-----------+--------+
only showing top 5 rows



In the genre_ids column, a song may belong to multiple genres, with the ids separated by | in the cell. Each genre needs to become its own row, with the values of the other columns kept intact.

In [41]:
# A genre_ids cell can hold several ids, meaning the song belongs to several genres

songs_df.select('genre_ids').rdd.map(lambda x:x.genre_ids).take(15)

['465',
 '444',
 '465',
 '465',
 '726',
 '864|857|850|843',
 '458',
 '465',
 '465',
 '352|1995',
 '2157',
 '465',
 '726',
 '458',
 '359']

In [42]:
# Splitting into individual genre and repeating song entry

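# split takes a regular expression, so | is escaped; the raw string avoids an invalid escape warning in Python
songs_df = songs_df.withColumn('genre', explode(split(songs_df.genre_ids, r"\|")))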

songs_df.select('genre').take(20)

[Row(genre='465'),
 Row(genre='444'),
 Row(genre='465'),
 Row(genre='465'),
 Row(genre='726'),
 Row(genre='864'),
 Row(genre='857'),
 Row(genre='850'),
 Row(genre='843'),
 Row(genre='458'),
 Row(genre='465'),
 Row(genre='465'),
 Row(genre='352'),
 Row(genre='1995'),
 Row(genre='2157'),
 Row(genre='465'),
 Row(genre='726'),
 Row(genre='458'),
 Row(genre='359'),
 Row(genre='359')]
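
To make the split-and-explode step concrete, here is a minimal plain-Python sketch of the same row expansion. It is not Spark code: the rows are hypothetical dictionaries, `explode_genres` is an illustrative name, and `str.split` takes a literal separator whereas Spark's `split` takes a regular expression.

```python
def explode_genres(rows, sep="|"):
    """Yield one copy of each row per genre id, with the id in a new 'genre' key."""
    for row in rows:
        genre_ids = row.get("genre_ids")
        if genre_ids is None:
            # Mirrors explode(), which emits no row for a null array
            continue
        for genre in genre_ids.split(sep):
            yield {**row, "genre": genre}


# Hypothetical rows shaped like songs_df (song ids shortened for the example)
songs = [
    {"song_id": "a", "genre_ids": "465"},
    {"song_id": "b", "genre_ids": "864|857|850|843"},
    {"song_id": "c", "genre_ids": None},
]

exploded = list(explode_genres(songs))
for row in exploded:
    print(row["song_id"], row["genre"])
```

Song "b" produces four rows, one per genre, while every other column value is repeated unchanged.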

Now the user and song data need to be merged to identify the genre, artist, and language of each song a user listened to.

In [44]:
df_cut = df_sample.join(songs_df, 'song_id')
df_cut.show(5)

+--------------------+--------------------+-----------------+--------------------+---------------+------+-----------+---------+-----------+--------------------+--------+--------+-----+
|             song_id|                msno|source_system_tab|  source_screen_name|    source_type|target|song_length|genre_ids|artist_name|            composer|lyricist|language|genre|
+--------------------+--------------------+-----------------+--------------------+---------------+------+-----------+---------+-----------+--------------------+--------+--------+-----+
|o0kFgae9QtnYgRkVP...|kPgqI5lBSLx9/RUjC...|       my library|          My library|  local-library|     0|     197328|      444|  BLACKPINK|TEDDY|  FUTURE BO...|   TEDDY|    31.0|  444|
|o0kFgae9QtnYgRkVP...|tieAKb/19RM0CnPQx...|       my library| Local playlist more|  local-library|     1|     197328|      444|  BLACKPINK|TEDDY|  FUTURE BO...|   TEDDY|    31.0|  444|
|o0kFgae9QtnYgRkVP...|Qr8Cbh9/qXRCaQDpo...|         discover|Online playlis

Now we can see how many songs of each genre a user listened to. This count becomes the user's genre interest.

The aggregation works as follows:
1. Group the data by msno and genre.
2. Aggregate with count.
3. Rename the columns to User, genre, and genre_interest.

In [48]:
# Creating a dataframe with the genre interest of each user

user_genre = df_cut.groupBy('msno', 'genre').count().toDF('User', 'genre', 'genre_interest')

user_genre.orderBy('genre_interest', ascending=False).show(5)

+--------------------+-----+--------------+
|                User|genre|genre_interest|
+--------------------+-----+--------------+
|KGXNZ/H3VxvET/+rG...|  465|            38|
|cqjRBV/jWN2ujhc+z...|  465|            31|
|MXIMDXO0j3UpaT7Fv...|  465|            30|
|FGtllVqz18RPiwJj/...|  465|            26|
|dU4RbzpIRRd/EkA9X...|  465|            25|
+--------------------+-----+--------------+
only showing top 5 rows
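
The grouping and counting step can be pictured in plain Python with `collections.Counter`. This is only an illustration on hypothetical (user, genre) pairs, not the Spark code used in this notebook.

```python
from collections import Counter

# Hypothetical (user, genre) pairs, one per listening event after the join
events = [
    ("u1", "465"), ("u1", "465"), ("u1", "458"),
    ("u2", "465"), ("u2", "444"), ("u2", "444"),
]

# Counterpart of groupBy('msno', 'genre').count()
genre_interest = Counter(events)

# Counterpart of orderBy('genre_interest', ascending=False)
for (user, genre), interest in genre_interest.most_common():
    print(user, genre, interest)
```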



Similarly, we compute artist interest and language interest. The resulting dataframes are shown below.

In [50]:
# Artist interest

user_artist = df_cut.groupBy('msno', 'artist_name').count().toDF('User', 'artist_name', 'artist_interest')

user_artist.orderBy('artist_interest', ascending=False).show(5)

+--------------------+-------------------------+---------------+
|                User|              artist_name|artist_interest|
+--------------------+-------------------------+---------------+
|MXIMDXO0j3UpaT7Fv...|証聲音樂圖書館 ECHO MUSIC|             12|
|s59i0bBkU2+a9l66j...|          Various Artists|             10|
|2tmUzRCcD0l3et0ck...|証聲音樂圖書館 ECHO MUSIC|              8|
|N+/5izDHnbJo+15dP...|          Various Artists|              8|
|n7pT0Hb9KHJCHZgp0...|          Various Artists|              8|
+--------------------+-------------------------+---------------+
only showing top 5 rows



In [51]:
# Language Interest

user_language = df_cut.groupBy('msno', 'language').count().toDF('User', 'language', 'language_interest')

user_language.orderBy('language_interest', ascending=False).show(5)

+--------------------+--------+-----------------+
|                User|language|language_interest|
+--------------------+--------+-----------------+
|MXIMDXO0j3UpaT7Fv...|    52.0|               50|
|cqjRBV/jWN2ujhc+z...|     3.0|               40|
|SZ5NNypqaTWljFO1H...|    52.0|               36|
|DqwB7smOAIbNnnQbW...|    52.0|               34|
|FGtllVqz18RPiwJj/...|    52.0|               31|
+--------------------+--------+-----------------+
only showing top 5 rows



The next step is to find the popular songs in each genre. The reasoning is that we first find the top genres the user might be interested in, then look up the top songs of each genre and recommend those the user has not listened to yet.

In [52]:
# Aggregating the dataframe to find songs popular in each Genre

genre_pop = df_cut.groupBy('genre', 'song_id').count().toDF('genre', 'song', 'genre_song_popularity')
genre_pop.orderBy('genre_song_popularity', ascending=False).show(5)

+-----+--------------------+---------------------+
|genre|                song|genre_song_popularity|
+-----+--------------------+---------------------+
|  458|reXuGcEWDDCnL0K3T...|                  144|
|  458|FynUyq0+drmIARmK1...|                  137|
|  458|cy10N2j2sdY/X4BDU...|                  134|
|  458|T86YHdD4C9JSc274b...|                  134|
|  465|wBTWuHbjdjxnG1lQc...|                  127|
+-----+--------------------+---------------------+
only showing top 5 rows
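
The idea of recommending popular songs from a genre while skipping songs the user already played can be sketched in plain Python. The function name `top_unheard_songs` and the events are hypothetical; this is not the lookup used later in the notebook, only an illustration of the filtering step.

```python
from collections import Counter, defaultdict


def top_unheard_songs(events, user, genre, k=3):
    """Most played songs in `genre` that `user` has not played yet.

    `events` is an iterable of (user, song, genre) listening events.
    """
    popularity = defaultdict(Counter)   # genre -> Counter of song plays
    heard = defaultdict(set)            # user -> songs already played
    for event_user, song, event_genre in events:
        popularity[event_genre][song] += 1
        heard[event_user].add(song)

    ranked = [song for song, _ in popularity[genre].most_common()]
    return [song for song in ranked if song not in heard[user]][:k]


# Hypothetical listening events
events = [
    ("u1", "s1", "458"), ("u2", "s1", "458"), ("u3", "s1", "458"),
    ("u2", "s2", "458"), ("u3", "s2", "458"),
    ("u3", "s3", "458"),
    ("u1", "s4", "465"),
]

print(top_unheard_songs(events, user="u1", genre="458"))
```

User u1 has already played s1, so the most popular song in genre 458 is skipped and the next ones are returned.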



In [53]:
# Aggregating the dataframe to find songs popular for each artist

artist_pop = df_cut.groupBy('artist_name', 'song_id').count().toDF('artist_name', 'song', 'artist_song_popularity')
artist_pop.orderBy('artist_song_popularity', ascending=False).show(5)

+--------------------+--------------------+----------------------+
|         artist_name|                song|artist_song_popularity|
+--------------------+--------------------+----------------------+
|         Alan Walker|J4qKkLIoW7aYACuTu...|                   248|
|         Alan Walker|v/3onppBGoSpGsWb8...|                   182|
|         Alan Walker|zHqZ07gn+YvF36FWz...|                   156|
|周湯豪 (NICKTHEREAL)|reXuGcEWDDCnL0K3T...|                   144|
|         Eric 周興哲|FynUyq0+drmIARmK1...|                   137|
+--------------------+--------------------+----------------------+
only showing top 5 rows



In [54]:
# Aggregating the dataframe to find songs popular in each language

language_pop = df_cut.groupBy('language', 'song_id').count().toDF('language', 'song', 'language_song_popularity')
language_pop.orderBy('language_song_popularity', ascending=False).show(5)

+--------+--------------------+------------------------+
|language|                song|language_song_popularity|
+--------+--------------------+------------------------+
|    52.0|J4qKkLIoW7aYACuTu...|                     248|
|    52.0|v/3onppBGoSpGsWb8...|                     182|
|    52.0|zHqZ07gn+YvF36FWz...|                     156|
|     3.0|reXuGcEWDDCnL0K3T...|                     144|
|     3.0|FynUyq0+drmIARmK1...|                     137|
+--------+--------------------+------------------------+
only showing top 5 rows



In [58]:
# Distinct Genres in the data

print("Distinct Genres in the sample dataset:",df_cut.select('genre').distinct().count())

Distinct Genres in the sample dataset: 108


In [59]:
# Distinct users in the data

print("Distinct users in the sample dataset:",df_cut.select('msno').distinct().count())

Distinct users in the sample dataset: 18684


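Let's combine the three user dataframes to see each user's interest across the categories. Because the join key is only User, every genre row of a user is paired with every artist row and every language row of that user, which is why the same genre repeats in the output.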

In [65]:
# User vs Genre interest and Artist interest and language interest

User_side = user_genre.join(user_artist, "User", 'outer').join(user_language, 'User','outer')

User_side.orderBy(['genre_interest', 
                   'artist_interest', 
                   'language_interest'], ascending = False).show(5)

+--------------------+-----+--------------+--------------------+---------------+--------+-----------------+
|                User|genre|genre_interest|         artist_name|artist_interest|language|language_interest|
+--------------------+-----+--------------+--------------------+---------------+--------+-----------------+
|KGXNZ/H3VxvET/+rG...|  465|            38|           Shy Girls|              2|     3.0|               26|
|KGXNZ/H3VxvET/+rG...|  465|            38|   林宥嘉 (Yoga Lin)|              2|     3.0|               26|
|KGXNZ/H3VxvET/+rG...|  465|            38|             BIGBANG|              2|     3.0|               26|
|KGXNZ/H3VxvET/+rG...|  465|            38|蜂蜜幸運草電視原聲帶|              2|     3.0|               26|
|KGXNZ/H3VxvET/+rG...|  465|            38|            Maroon 5|              2|     3.0|               26|
+--------------------+-----+--------------+--------------------+---------------+--------+-----------------+
only showing top 5 rows
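
The repeated genre rows in this output come from joining on the user alone. A small plain-Python sketch with hypothetical per-user rows shows how the row count multiplies; the artist names are taken from the output above, the other values are made up.

```python
from itertools import product

# Hypothetical per-user rows from the three interest tables
genres = [("465", 38), ("458", 4)]
artists = [("BIGBANG", 2), ("Maroon 5", 2), ("Shy Girls", 2)]
languages = [("3.0", 26), ("52.0", 9)]

# Joining on the user alone pairs every genre row with every artist row
# and every language row for that user
joined = list(product(genres, artists, languages))

print(len(joined))  # 2 * 3 * 2 rows for this one user
```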



Let's look at the popular songs in each of these categories. The combined dataframe gives an idea of how the song statistics look side by side, and song_side is reused later to look up the most popular songs for each recommended genre, artist, and language.

In [67]:
# Popular songs in each category

song_side = genre_pop.join(artist_pop, 'song', 'outer').join(language_pop, "song", 'outer')

song_side.orderBy(['genre_song_popularity', 
                   'artist_song_popularity', 
                   'language_song_popularity'], ascending = False).show(5)

+--------------------+-----+---------------------+--------------------+----------------------+--------+------------------------+
|                song|genre|genre_song_popularity|         artist_name|artist_song_popularity|language|language_song_popularity|
+--------------------+-----+---------------------+--------------------+----------------------+--------+------------------------+
|reXuGcEWDDCnL0K3T...|  458|                  144|周湯豪 (NICKTHEREAL)|                   144|     3.0|                     144|
|FynUyq0+drmIARmK1...|  458|                  137|         Eric 周興哲|                   137|     3.0|                     137|
|T86YHdD4C9JSc274b...|  458|                  134|   周杰倫 (Jay Chou)|                   134|     3.0|                     134|
|cy10N2j2sdY/X4BDU...|  458|                  134|     五月天 (Mayday)|                   134|     3.0|                     134|
|M9rAajz4dYuRhZ7jL...|  465|                  127|     林俊傑 (JJ Lin)|                   127|     3.0|         

# Song Recommendations using ALS

ALS needs numeric user and item ids, so the string columns have to be indexed before the data is fed into the model.

In [69]:
# StringIndexer maps each distinct user string to a numeric index

indexer = StringIndexer(inputCol = "User", outputCol="UserIndex")

In [75]:
# String Indexed dataframe

indexed = indexer.fit(user_genre).transform(user_genre).drop('User')

In [76]:
# This is the dataframe layout used for training ALS and generating predictions

indexed.show(5)

+-----+--------------+---------+
|genre|genre_interest|UserIndex|
+-----+--------------+---------+
|  465|             1|   5645.0|
|  465|             8|    137.0|
|  444|             1|   2580.0|
|  465|             3|  10715.0|
|  465|             5|    286.0|
+-----+--------------+---------+
only showing top 5 rows
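
As a rough picture of what the indexing does: by default StringIndexer orders labels by descending frequency, so the most frequent label gets index 0, and it returns the index as a double (hence values such as 5645.0 above and the cast in the next cell). The plain-Python sketch below imitates that ordering on hypothetical user ids; `fit_string_index` is an illustrative name, and the alphabetical tie-breaking is an assumption.

```python
from collections import Counter


def fit_string_index(labels):
    """Map each label to an index, most frequent label first."""
    counts = Counter(labels)
    # Ties are broken alphabetically here; treat that detail as an assumption
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Hypothetical user ids; "userB" appears most often, so it gets index 0.0
users = ["userA", "userB", "userB", "userC", "userB", "userA"]
index = fit_string_index(users)
print(index)
```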



In [77]:
# ALS needs integer ids, so cast the genre and user index columns to int

indexed = indexed.selectExpr('cast(genre as int) as genre_id', 
                             'cast(UserIndex as int)', 
                             'genre_interest')
indexed.show(5)

+--------+---------+--------------+
|genre_id|UserIndex|genre_interest|
+--------+---------+--------------+
|     465|     5645|             1|
|     465|      137|             8|
|     444|     2580|             1|
|     465|    10715|             3|
|     465|      286|             5|
+--------+---------+--------------+
only showing top 5 rows



In [78]:
indexed.printSchema()

root
 |-- genre_id: integer (nullable = true)
 |-- UserIndex: integer (nullable = true)
 |-- genre_interest: long (nullable = true)



In [44]:
# Splitting into training and testing
training, test = indexed.randomSplit([0.8,0.2])

In [45]:
# Initialising the ALS
als = ALS()\
.setMaxIter(5)\
.setRegParam(0.01)\
.setUserCol("UserIndex")\
.setItemCol("genre_id")\
.setRatingCol("genre_interest")

In [46]:
# explainParams lists the hyperparameters that can be set on ALS
print(als.explainParams())

alpha: alpha for implicit preference (default: 1.0)
checkpointInterval: set checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the cache will get checkpointed every 10 iterations. Note: this setting will be ignored if the checkpoint directory is not set in the SparkContext. (default: 10)
coldStartStrategy: strategy for dealing with unknown or new users/items at prediction time. This may be useful in cross-validation or production scenarios, for handling user/item ids the model has not seen in the training data. Supported values: 'nan', 'drop'. (default: nan)
finalStorageLevel: StorageLevel for ALS model factors. (default: MEMORY_AND_DISK)
implicitPrefs: whether to use implicit preference (default: False)
intermediateStorageLevel: StorageLevel for intermediate datasets. Cannot be 'NONE'. (default: MEMORY_AND_DISK)
itemCol: column name for item ids. Ids must be within the integer value range. (default: item, current: genre_id)
maxIter: max number of iterations (>= 

In [47]:
# Training the ALS
alsModel = als.fit(training)
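
To give an intuition for what fitting does, here is a minimal plain-Python sketch of alternating least squares with a single latent factor. It is not Spark's implementation, which uses several factors (the rank parameter), its own regularisation scaling, and a distributed solver. The function name `als_rank1` and the ratings are hypothetical.

```python
def als_rank1(ratings, n_users, n_items, reg=0.01, iterations=20):
    """Alternating least squares with a single latent factor.

    `ratings` maps (user, item) -> observed value. Each half-step fixes one
    side and solves the regularised least-squares problem for the other,
    which has a closed form when the rank is 1.
    """
    user_f = [1.0] * n_users
    item_f = [1.0] * n_items
    for _ in range(iterations):
        # Fix item factors, solve for every user factor
        for u in range(n_users):
            obs = [(i, r) for (uu, i), r in ratings.items() if uu == u]
            num = sum(r * item_f[i] for i, r in obs)
            den = reg + sum(item_f[i] ** 2 for i, _ in obs)
            user_f[u] = num / den
        # Fix user factors, solve for every item factor
        for i in range(n_items):
            obs = [(u, r) for (u, ii), r in ratings.items() if ii == i]
            num = sum(r * user_f[u] for u, r in obs)
            den = reg + sum(user_f[u] ** 2 for u, _ in obs)
            item_f[i] = num / den
    return user_f, item_f


# Hypothetical interest counts; user 2 has no observation for item 0
ratings = {(0, 0): 2.0, (0, 1): 4.0, (1, 0): 1.0, (1, 1): 2.0, (2, 1): 6.0}
user_f, item_f = als_rank1(ratings, n_users=3, n_items=2)

# Predicted interest of user 2 in the unseen item 0
prediction = user_f[2] * item_f[0]
print(round(prediction, 2))
```

The observed values follow a simple pattern (item 1 is always twice item 0), so the model fills in the missing entry for user 2 with a value close to 3.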

In [48]:
# Generating Predictions
predictions = alsModel.transform(test)

In [49]:
# So for each user, we get 10 top recommended Genres
recomm_genre = alsModel.recommendForAllUsers(10).selectExpr('UserIndex', 'explode(recommendations) as recommended')
recomm_genre.show(10)

+---------+-----------------+
|UserIndex|      recommended|
+---------+-----------------+
|     1580| [545, 7.8692727]|
|     1580|[1995, 5.6934133]|
|     1580|[2086, 5.2582836]|
|     1580|  [850, 4.759304]|
|     1580|[1572, 4.7450814]|
|     1580|  [481, 4.554379]|
|     1580| [873, 4.4750085]|
|     1580|  [516, 4.389927]|
|     1580|  [465, 3.836222]|
|     1580| [388, 3.2870884]|
+---------+-----------------+
only showing top 10 rows



In [50]:
# Recommended genre ids for one user (UserIndex 1580), best first
top_genre_user = recomm_genre.where(recomm_genre.UserIndex == 1580).rdd.map(lambda x: x.recommended).map(lambda x: x.genre_id).collect()
top_genre_user

[545, 1995, 2086, 850, 1572, 481, 873, 516, 465, 388]

In [104]:
# Song recommendations: most popular songs in the second recommended genre (index 1)
song_side.sort('genre', 'genre_song_popularity', ascending=False).where(song_side.genre == top_genre_user[1]).select('song').show(10)

+--------------------+
|                song|
+--------------------+
|UMzNsDKRVjcNUYhlT...|
|Sbz9GfHVOAERJA7o4...|
|1iHRL+cugxRVTz0Zb...|
|KnZV077x3kJoBnJm/...|
|d90L+EZxCshVGS4rU...|
|5DfRZqAreMU3ovigj...|
|lTvKbkQhBBLCYrVyR...|
|fh+mNN016KM5hFM7D...|
|9LaJP77sG3MTpRIDO...|
|sjXtOEX2/uwyf36Y5...|
+--------------------+
only showing top 10 rows



# Song recommendations based on artists the user is interested in

In [52]:
# converting columns via string indexing
indexer = StringIndexer(inputCol = "User", outputCol="UserIndex")
indexer_artist = StringIndexer(inputCol = 'artist_name', outputCol = 'artist_index')

In [53]:
indexed_artist = indexer.fit(user_artist).transform(user_artist)
indexed_artist = indexer_artist.fit(indexed_artist).transform(indexed_artist)

In [54]:
indexed_artist.show(5)

+--------------------+--------------------+---------------+---------+------------+
|                User|         artist_name|artist_interest|UserIndex|artist_index|
+--------------------+--------------------+---------------+---------+------------+
|GH5CcteDnxOzg5zkI...|  信樂團 (Shin Band)|              1|    630.0|       120.0|
|S7FV7CKTmx2ymejZJ...|      Rag'N'Bone Man|              1|    541.0|      1400.0|
|5fFQq48H5HIiyfYkk...|   Peter Paul & Mary|              1|  11529.0|      4717.0|
|K5x68e9t0PySwLAoe...|周湯豪 (NICKTHEREAL)|              1|   3066.0|        26.0|
|4Lxw2WOLQw7LBrp6W...|              詹雅雯|              1|   9940.0|       206.0|
+--------------------+--------------------+---------------+---------+------------+
only showing top 5 rows



In [83]:
indexed_artist_als = indexed_artist.selectExpr('cast(artist_index as int)', 'cast(UserIndex as int)', 'artist_interest', 'artist_name')
indexed_artist_als.show(5)

+------------+---------+---------------+--------------------+
|artist_index|UserIndex|artist_interest|         artist_name|
+------------+---------+---------------+--------------------+
|         120|      630|              1|  信樂團 (Shin Band)|
|        1400|      541|              1|      Rag'N'Bone Man|
|        4717|    11529|              1|   Peter Paul & Mary|
|          26|     3066|              1|周湯豪 (NICKTHEREAL)|
|         206|     9940|              1|              詹雅雯|
+------------+---------+---------------+--------------------+
only showing top 5 rows



In [84]:
training_artist, test_artist = indexed_artist_als.randomSplit([0.8, 0.2])

In [85]:
als = ALS()\
.setMaxIter(5)\
.setRegParam(0.01)\
.setUserCol("UserIndex")\
.setItemCol("artist_index")\
.setRatingCol("artist_interest")

In [86]:
alsModel_artist = als.fit(training_artist)

In [87]:
predictions = alsModel_artist.transform(test_artist)

In [89]:
# Recommended artists(string indexed values)
recomm_artist = alsModel_artist.recommendForAllUsers(10).selectExpr('UserIndex', 'explode(recommendations) as recommended')
recomm_artist.show(10)

+---------+-----------------+
|UserIndex|      recommended|
+---------+-----------------+
|     1580| [1651, 4.933778]|
|     1580| [633, 4.1720943]|
|     1580|  [828, 3.925313]|
|     1580|[1781, 3.5492597]|
|     1580|[1559, 3.5222251]|
|     1580|[1094, 3.5146184]|
|     1580|[1118, 3.4517207]|
|     1580| [764, 3.4117632]|
|     1580|[1713, 3.3856668]|
|     1580| [2444, 3.355938]|
+---------+-----------------+
only showing top 10 rows



In [61]:
# Recommended (artist_index, rating) pairs for one user, converted back to a dataframe
top_artist_user = recomm_artist.where(recomm_artist.UserIndex == 1580).rdd.map(lambda x: x.recommended).collect()
artist_user = sc.parallelize(top_artist_user).toDF()
artist_user.show(10)

+------------+------------------+
|artist_index|            rating|
+------------+------------------+
|         637| 4.602288246154785|
|         603|4.6009321212768555|
|        1603| 4.285262107849121|
|         497| 4.222316265106201|
|         479|  4.03449010848999|
|         686| 4.012558460235596|
|         971| 4.010550022125244|
|        1354| 3.862031936645508|
|         498| 3.829875946044922|
|         594|3.8296868801116943|
+------------+------------------+



In [105]:
# Map the recommended artist indices back to artist names
recomm_artists = indexed_artist.select(indexed_artist.artist_index.cast('int'), 'artist_name').join(artist_user, "artist_index", 'inner').distinct().rdd.map(
lambda x: x.artist_name).collect()
recomm_artists

['Muse',
 '約書亞樂團',
 'Richard Clayderman',
 'Eir Aoi (藍井エイル)',
 'SCANDAL',
 'High School Musical Original Soundtrack',
 'Gallant x Tablo x Eric Nam',
 'Ella Fitzgerald',
 'Nogizaka46 (乃木坂46)',
 '滾石金韻民歌百大精選']

In [122]:
# Most popular songs of the second recommended artist (index 1)
song_side.sort('artist_name', 'artist_song_popularity', ascending=False).where(song_side.artist_name == recomm_artists[1]).select('song').show(10)

+--------------------+
|                song|
+--------------------+
|s8KQo0qsDIS42y5xg...|
|sOYNImNuRdogfOvo9...|
|s8KQo0qsDIS42y5xg...|
|cR3/3Zf2wzp9uQteN...|
|t/HLyst7l4EjgPCuQ...|
|gJwLE8jSKMh2YEuLX...|
|QLDzqRYCPufbZve5u...|
|1fruU7bvpRMfDMkco...|
|2hyXkI5tyKL0oEkTS...|
|FHLjRKekX6BnbLjDj...|
+--------------------+
only showing top 10 rows



# Song recommendations based on language

In [143]:
indexer = StringIndexer(inputCol = "User", outputCol="UserIndex")

In [147]:
indexed_language = indexer.fit(user_language).transform(user_language)

In [148]:
indexed_language = indexed_language.select(indexed_language.language.cast('int'), 'UserIndex', 'language_interest')
indexed_language.show()

+--------+---------+-----------------+
|language|UserIndex|language_interest|
+--------+---------+-----------------+
|       3|    719.0|                7|
|      52|   1305.0|                3|
|      52|  15835.0|                2|
|       3|    614.0|                5|
|       3|   1077.0|                9|
|       3|   3951.0|                2|
|      52|    108.0|               11|
|      52|  11014.0|                1|
|       3|   7422.0|                1|
|       3|   9733.0|                5|
|      52|  12981.0|                1|
|      52|   7342.0|                1|
|      52|    615.0|                5|
|       3|  16188.0|                2|
|      -1|   3441.0|                1|
|       3|  13053.0|                3|
|       3|   7082.0|                4|
|      52|   7750.0|                4|
|       3|   9763.0|                2|
|      52|    510.0|                4|
+--------+---------+-----------------+
only showing top 20 rows



In [163]:
indexed_language = indexed_language.fillna({'language':'0'})

In [164]:
als = ALS()\
.setMaxIter(5)\
.setRegParam(0.01)\
.setUserCol("UserIndex")\
.setItemCol("language")\
.setRatingCol("language_interest")

In [165]:
training_language, test_language = indexed_language.randomSplit([0.8, 0.2])

In [167]:
alsModel_language = als.fit(training_language)

In [168]:
predictions = alsModel_language.transform(test_language)

In [171]:
# Top k languages recommended for the user
recomm_language = alsModel_language.recommendForAllUsers(10).selectExpr('UserIndex', 'explode(recommendations) as recommended')
recomm_language.show(10)

+---------+----------------+
|UserIndex|     recommended|
+---------+----------------+
|     1580|  [3, 3.7705626]|
|     1580| [52, 1.8470466]|
|     1580| [24, 0.9479289]|
|     1580|  [10, 0.895591]|
|     1580|[45, 0.89142084]|
|     1580| [59, 0.8420744]|
|     1580| [38, 0.6820886]|
|     1580|[31, 0.57250977]|
|     1580|[17, 0.54873055]|
|     1580|[-1, 0.36088538]|
+---------+----------------+
only showing top 10 rows



In [178]:
# Top 10 languages for a user
top_languages_user = recomm_language.where(recomm_language.UserIndex == 1580).rdd.map(lambda x: x.recommended).map(lambda x: x.language).collect()
top_languages_user

[3, 52, 24, 10, 45, 59, 38, 31, 17, -1]

In [179]:
# Most popular songs in the second recommended language (index 1)
song_side.sort('language', 'language_song_popularity', ascending=False).where(song_side.language == top_languages_user[1]).select('song').show(10)

+--------------------+
|                song|
+--------------------+
|J4qKkLIoW7aYACuTu...|
|J4qKkLIoW7aYACuTu...|
|v/3onppBGoSpGsWb8...|
|v/3onppBGoSpGsWb8...|
|zHqZ07gn+YvF36FWz...|
|zHqZ07gn+YvF36FWz...|
|+LztcJcPEEwsikk6+...|
|IKMFuL0f5Y8c63Hg9...|
|9YYrODwrXpDcCjOJy...|
|9YYrODwrXpDcCjOJy...|
+--------------------+
only showing top 10 rows



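The model is not evaluated with RMSE. The random train/test split leaves some users and items in the test set that never appear in the training set, so the model cannot produce a prediction for them, and with the default coldStartStrategy of nan those rows would make the RMSE undefined. The goal is to recommend songs the user has not discovered yet. Note, however, that the lookups above rank songs by popularity and do not filter out songs the user has already played.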
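
If an RMSE were wanted anyway, one option is to skip the rows with NaN predictions, which is roughly what setting coldStartStrategy to drop does according to the parameter description printed earlier. The plain-Python sketch below shows the calculation on hypothetical (actual, predicted) pairs; `rmse_dropping_nan` is an illustrative name, not a Spark function.

```python
import math


def rmse_dropping_nan(pairs):
    """RMSE over (actual, predicted) pairs, skipping NaN predictions."""
    kept = [(a, p) for a, p in pairs if not math.isnan(p)]
    if not kept:
        return float("nan")
    return math.sqrt(sum((a - p) ** 2 for a, p in kept) / len(kept))


# Hypothetical test rows; the last user was unseen in training, so the
# prediction is NaN
pairs = [(3.0, 2.0), (1.0, 1.0), (5.0, 7.0), (2.0, float("nan"))]

print(rmse_dropping_nan(pairs))
```

Dropping rows this way evaluates the model only on users and items it has seen, so the score says nothing about cold-start behaviour.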

# Conclusion:

1. ALS is widely used by companies for making recommendations, but rarely as the only tool. It is typically combined with other recommendation models to obtain closer predictions of items.
2. Recommender systems are difficult to evaluate. Classical metrics such as MSE, accuracy, recall, or precision can be used, but some desired properties such as diversity (serendipity) and explainability cannot be assessed this way. Evaluation under real conditions (such as A/B testing or sample testing) is ultimately the only real way to evaluate a new recommender system, but it requires a certain confidence in the model.
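
As an illustration of the classical metrics mentioned in point 2, precision and recall at k can be computed for a single user as sketched below. The recommended genre ids are the ones printed earlier for UserIndex 1580. The held-out set of genres is hypothetical, since no such evaluation was run in this notebook.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall of the first k recommended items."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Recommended genre ids printed earlier for one user; the held-out set of
# genres is hypothetical
recommended = [545, 1995, 2086, 850, 1572, 481, 873, 516, 465, 388]
held_out = {465, 850, 444}

precision, recall = precision_recall_at_k(recommended, held_out, k=10)
print(precision, recall)
```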