### University of Virginia
### DS 5559: Big Data Analytics
### Music Recommendation
### Last updated: Feb 29, 2020

**Instructions**  
In this assignment, you will work with a recommendation algorithm based on user listening data from Audioscrobbler.

The code is outlined below. Make the requested modifications, run the code, and copy all answers to the **ANSWER SECTION** at the bottom of the notebook. 

NOTE: For a given userID, some or many recommendations might come back as `None`.  
These should be filtered out using a list comprehension as follows:

In [None]:
print([x for x in recommendationsForUser if x is not None])
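As a small illustration of where the `None` values come from, the sketch below looks up artist names with `dict.get`, which returns `None` for an ID that has no entry. The lookup table `artist_names` and the IDs are made up for this example; in the notebook the table is `artistByID`.

```python
# Hypothetical lookup table and recommended IDs, for illustration only.
artist_names = {1: 'Eric Clapton', 2: 'Eurythmics'}
recommended_ids = [1, 99, 2]  # 99 has no entry in the table

# dict.get returns None when the key is missing
names = [artist_names.get(artist_id) for artist_id in recommended_ids]
print(names)

# keep only the IDs that resolved to a name
named_only = [x for x in names if x is not None]
print(named_only)
```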

**TOTAL POINTS: 10**
***

**About the Alternating Least Squares Parameters**

`rank`  
The number of latent factors in the model, or equivalently, the number of columns $k$ in the user-feature and product-feature matrices. In nontrivial cases, this is also their rank.

`iterations`  
The number of iterations that the factorization runs. More iterations take more time but may produce a better factorization.

`lambda`  
A standard overfitting parameter. Higher values resist overfitting, but values that are too high hurt the factorization’s accuracy.

`alpha`  
Controls the relative weight of observed versus unobserved user-product interactions in the factorization.
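To make `alpha` more concrete: in the implicit-feedback formulation of Hu, Koren and Volinsky, which the Spark MLlib documentation cites for implicit ALS, a play count $r$ is converted into a binary preference and a confidence weight $c = 1 + \alpha r$. The sketch below only illustrates that conversion under this assumption; the function name `implicit_feedback_terms` is hypothetical and is not part of Spark.

```python
def implicit_feedback_terms(count, alpha=0.01):
    """Return (preference, confidence) for one play count.

    Assumes the Hu/Koren/Volinsky formulation: preference is 1 if the
    user played the artist at all, and confidence grows linearly with
    the play count, scaled by alpha.
    """
    preference = 1.0 if count > 0 else 0.0
    confidence = 1.0 + alpha * count
    return preference, confidence

# play counts taken from the first records of the data set, plus 0 for "unobserved"
for plays in (0, 1, 42, 329):
    print(plays, implicit_feedback_terms(plays))
```

With a small `alpha`, observed and unobserved pairs are weighted almost equally; a larger `alpha` makes heavily played artists count for much more than unobserved ones.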

In [1]:
# import modules
from pyspark import SparkContext
from pyspark import SparkConf
# import only the names used below instead of a wildcard import
from pyspark.mllib.recommendation import ALS, Rating

In [2]:
# set configurations
conf = SparkConf().setMaster("local").setAppName("autoscrobbler")

In [3]:
# set context
sc = SparkContext.getOrCreate(conf=conf)

In [4]:
# pathing and params
user_artist_data_file = 'user_artist_data.txt'
artist_data_file = 'artist_data.txt'
artist_alias_data_file  = 'artist_alias.txt'

numPartitions = 2
topk = 10

In [6]:
# read user_artist_data_file into RDD (417MB file, 24MM records of users’ plays of artists, along with count)
# specifically, each row holds: userID, artistID, count
rawDataRDD = sc.textFile(user_artist_data_file, numPartitions)
rawDataRDD.cache()

user_artist_data.txt MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:0

In [7]:
# read artist_data_file
rawArtistRDD = sc.textFile(artist_data_file, numPartitions)
rawArtistRDD.cache()

artist_data.txt MapPartitionsRDD[3] at textFile at NativeMethodAccessorImpl.java:0

In [8]:
# read artist_alias_data_file
rawAliasRDD = sc.textFile(artist_alias_data_file, numPartitions)
rawAliasRDD.cache()

artist_alias.txt MapPartitionsRDD[5] at textFile at NativeMethodAccessorImpl.java:0

In [9]:
# 1) (1 PT) Print the first 10 records from rawDataRDD
print(rawDataRDD.take(10))

['1000002 1 55', '1000002 1000006 33', '1000002 1000007 8', '1000002 1000009 144', '1000002 1000010 314', '1000002 1000013 8', '1000002 1000014 42', '1000002 1000017 69', '1000002 1000024 329', '1000002 1000025 1']


In [10]:
print(rawArtistRDD.take(10))

['1134999\t06Crazy Life', '6821360\tPang Nakarin', '10113088\tTerfel, Bartoli- Mozart: Don', '10151459\tThe Flaming Sidebur', '6826647\tBodenstandig 3000', '10186265\tJota Quest e Ivete Sangalo', '6828986\tToto_XX (1977', '10236364\tU.S Bombs -', '1135000\tartist formaly know as Mat', '10299728\tKassierer - Musik für beide Ohren']


In [11]:
def parseArtistIdNamePair(singlePair):
    # Parse one "id<TAB>name" line. Returns [] for malformed lines so that flatMap drops them.
    splitPair = singlePair.rsplit('\t')
    # we should have two items in the list - id and name of the artist.
    if len(splitPair) != 2:
        return []
    try:
        return [(int(splitPair[0]), splitPair[1])]
    except ValueError:
        # the id field is not an integer
        return []


In [12]:
artistByID = dict(rawArtistRDD.flatMap(lambda x: parseArtistIdNamePair(x)).collect())
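The parser returns a list rather than a single pair so that `flatMap` can silently drop malformed lines. The sketch below mimics this on a local list, with `itertools.chain.from_iterable` standing in for `flatMap`; the function name `parse_id_name` and the sample lines other than the first are made up for illustration.

```python
from itertools import chain

def parse_id_name(line):
    # Same logic as parseArtistIdNamePair: [] for malformed lines, else [(id, name)]
    parts = line.rsplit('\t')
    if len(parts) != 2:
        return []
    try:
        return [(int(parts[0]), parts[1])]
    except ValueError:
        return []

# One valid line, one without a tab, one with a non-numeric id
lines = ['1134999\t06Crazy Life', 'no tab here', 'abc\tSome Artist']

# chain.from_iterable flattens the per-line lists, similar to RDD.flatMap
pairs = list(chain.from_iterable(parse_id_name(line) for line in lines))
artist_by_id = dict(pairs)
print(artist_by_id)
```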

In [13]:
artist_vals = artistByID.values()

In [14]:
# 2) (1 PT) Print 10 values from artistByID, using topk variable
from collections import Counter
print( Counter(artist_vals).most_common( topk ) )

[('06Crazy Life', 1), ('Pang Nakarin', 1), ('Terfel, Bartoli- Mozart: Don', 1), ('The Flaming Sidebur', 1), ('Bodenstandig 3000', 1), ('Jota Quest e Ivete Sangalo', 1), ('Toto_XX (1977', 1), ('U.S Bombs -', 1), ('artist formaly know as Mat', 1), ('Kassierer - Musik für beide Ohren', 1)]
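Every count in the output above is 1, so `most_common` is effectively returning names in first-seen order. If the goal is simply to print `topk` values, `itertools.islice` does that without counting. The dictionary below is a hypothetical stand-in for `artistByID`.

```python
from collections import Counter
from itertools import islice

# Hypothetical stand-in for artistByID
artist_by_id = {10: 'A', 20: 'B', 30: 'C', 40: 'D'}
topk = 2

# every name occurs once, so ties are returned in first-seen order
counted = Counter(artist_by_id.values()).most_common(topk)
print(counted)

# more direct: take the first topk values without counting
first_values = list(islice(artist_by_id.values(), topk))
print(first_values)
```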


In [15]:
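def parseArtistAlias(alias):
    # Parse one "aliasID<TAB>canonicalID" line. Returns [] for malformed lines so that flatMap drops them.
    splitPair = alias.rsplit('\t')
    # we should have two ids in the list.
    if len(splitPair) != 2:
        return []
    try:
        return [(int(splitPair[0]), int(splitPair[1]))]
    except ValueError:
        # one of the ids is missing or not an integer
        return []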

In [16]:
artistAlias = rawAliasRDD.flatMap(lambda x: parseArtistAlias(x)).collectAsMap()

In [17]:
# turn the artistAlias into a broadcast variable.
# This will distribute it to worker nodes efficiently, so we save bandwidth.
artistAliasBroadcast = sc.broadcast(artistAlias)

In [18]:
artistAliasBroadcast.value.get(2097174)

1007797

In [19]:
# Print the number of records from the largest RDD, rawDataRDD
print(rawDataRDD.count())

24296858


In [20]:
# Sample 10% of rawDataRDD using seed 314, to reduce runtime. Call it sample.
weights = [.1, .9]
seed = 314
sample, _ = rawDataRDD.randomSplit(weights, seed)
sample.cache()

PythonRDD[11] at RDD at PythonRDD.scala:53
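Conceptually, `randomSplit` assigns each record to exactly one split with probability proportional to the weights, so the 10% sample is approximate rather than exact. The pure-Python sketch below illustrates that idea on a local list; `random_split` is a hypothetical helper, not Spark's implementation, and it will not reproduce the same records as Spark for a given seed.

```python
import random

def random_split(records, weights, seed):
    # Normalize the weights into cumulative upper bounds in (0, 1]
    total = float(sum(weights))
    bounds, running = [], 0.0
    for w in weights:
        running += w / total
        bounds.append(running)

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for rec in records:
        draw = rng.random()
        # place the record in the first split whose upper bound exceeds the draw;
        # the last split also catches any floating-point rounding at the top end
        for idx, upper in enumerate(bounds):
            if draw < upper or idx == len(bounds) - 1:
                splits[idx].append(rec)
                break
    return splits

records = list(range(10000))
sample, rest = random_split(records, [.1, .9], seed=314)
print(len(sample), len(rest))
```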

In [22]:
# take the first 5 records from the sample. each row represents userID, artistID, count.
sample.take(5)

['1000002 1000014 42',
 '1000002 1000088 157',
 '1000002 1000139 56',
 '1000002 1000140 95',
 '1000002 1000210 23']

In [23]:
# Based on sampled data, build the matrix for model training
def mapSingleObservation(x):
    # Returns Rating object represented as (user, product, rating) tuple.
    # split each record into userID, artistID, count
    userID, artistID, count = map(int, x.split())
    # given possible aliasing, get finalArtistID; fall back to artistID when no alias exists
    finalArtistID = artistAliasBroadcast.value.get(artistID, artistID)
    return Rating(userID, finalArtistID, count)
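The alias step can be tried without Spark. In the sketch below, a `namedtuple` stands in for `pyspark.mllib.recommendation.Rating`, and a plain dictionary with the single alias pair shown earlier (2097174 maps to 1007797) stands in for the broadcast variable. The first input record is made up; the second comes from the sample.

```python
from collections import namedtuple

# Stand-in for pyspark.mllib.recommendation.Rating, so the sketch runs without Spark
Rating = namedtuple('Rating', ['user', 'product', 'rating'])

# One alias pair from the lookup shown earlier: 2097174 -> 1007797
artist_alias = {2097174: 1007797}

def map_single_observation(line):
    user_id, artist_id, count = map(int, line.split())
    # fall back to the original ID when no alias exists
    final_artist_id = artist_alias.get(artist_id, artist_id)
    return Rating(user_id, final_artist_id, float(count))

aliased = map_single_observation('1000002 2097174 5')    # hypothetical record with an aliased artist
unchanged = map_single_observation('1000002 1000014 42')  # record from the sample, no alias
print(aliased)
print(unchanged)
```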

In [24]:
trainData = sample.map(lambda x: mapSingleObservation(x))
trainData.cache()

PythonRDD[14] at RDD at PythonRDD.scala:53

In [25]:
# 3) (1 PT) Print the first 5 records from trainData
trainData.take(5)

[Rating(user=1000002, product=1000014, rating=42.0),
 Rating(user=1000002, product=1000088, rating=157.0),
 Rating(user=1000002, product=1000139, rating=56.0),
 Rating(user=1000002, product=1000140, rating=95.0),
 Rating(user=1000002, product=1000210, rating=23.0)]

In [26]:
# Train the ALS model, using seed 314, rank 10, iterations 5, alpha 0.01
model = ALS.trainImplicit(trainData, rank=10, iterations=5, alpha=0.01, seed=seed)

In [27]:
model

<pyspark.mllib.recommendation.MatrixFactorizationModel at 0x7f23c68d62d0>

In [28]:
# Model Evaluation

# fetch artists for a test user
testUserID = 1000002

# broadcast artistByID for speed
artistByIDBroadcast = sc.broadcast(artistByID)

# from trainData, collect the artists for the test user. Call the object artistsForUser.
# hint: apply .value.get(x.product) to the broadcast artistByID, where x is a Rating record in the RDD.
# without this lookup you get artist IDs rather than artist names.
artistsForUser = (trainData
                  .filter(lambda observation: observation.user == testUserID)
                  .map(lambda observation: artistByIDBroadcast.value.get(observation.product))
                  .collect())

In [31]:
# 4) (1 PT) Print the artist listens for testUserID = 1000002
print([x for x in artistsForUser if x is not None])

['Café Del Mar', 'Eric Clapton', 'Eurythmics']


In [43]:
# 5) (2 PTS) Make 10 recommendations for testUserID = 1000002
# request more candidates than needed, because artists without a name in artistByID are filtered out below
num_recomm = 500
recommendationsForUser = [artistByID.get(observation.product)
                          for observation in model.recommendProducts(testUserID, num_recomm)]
print([x for x in recommendationsForUser if x is not None])

['Dark Tranquillity', 'Eric Clapton', 'Scorpions', 'Enigma', 'Eurythmics', 'Gary Jules', 'Elvis Costello', 'Echo & the Bunnymen', 'Nena', 'Erasure', '植松伸夫']


In [45]:
# Train a second ALS model, same as first but with rank 20
model2 = ALS.trainImplicit(trainData, rank=20, iterations=5, alpha=0.01, seed=seed)

In [46]:
# 6) (2 PTS) Using the rank 20 model, make 10 recommendations for the same test user
recommendationsForUser_rank20 = [artistByID.get(observation.product)
                                 for observation in model2.recommendProducts(testUserID, num_recomm)]
print([x for x in recommendationsForUser_rank20 if x is not None])

['Eric Clapton', 'Dark Tranquillity', 'Scorpions', 'Gary Jules', 'Enigma', 'Eurythmics', 'Elvis Costello', '植松伸夫', 'Nena', 'Echo & the Bunnymen']
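One simple way to compare the two models is to measure how much their named recommendations overlap. The sketch below copies the two lists printed above and computes the shared artists, the artists unique to each model, and the Jaccard similarity. The variable names are chosen for this example only.

```python
# Named recommendations as printed above
rank10 = ['Dark Tranquillity', 'Eric Clapton', 'Scorpions', 'Enigma', 'Eurythmics',
          'Gary Jules', 'Elvis Costello', 'Echo & the Bunnymen', 'Nena', 'Erasure', '植松伸夫']
rank20 = ['Eric Clapton', 'Dark Tranquillity', 'Scorpions', 'Gary Jules', 'Enigma',
          'Eurythmics', 'Elvis Costello', '植松伸夫', 'Nena', 'Echo & the Bunnymen']

# artists recommended by both models, and by only one of them
shared = [a for a in rank10 if a in rank20]
only_rank10 = [a for a in rank10 if a not in rank20]
only_rank20 = [a for a in rank20 if a not in rank10]

# Jaccard similarity: size of the intersection divided by size of the union
jaccard = len(set(rank10) & set(rank20)) / len(set(rank10) | set(rank20))

print(len(shared), only_rank10, only_rank20, round(jaccard, 2))
```

This ignores ordering, so it is only a rough check. It says nothing about which model ranks the artists better.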


#### ANSWER SECTION (COPY ALL ANSWERS HERE)

##### ANSWER 1 (1 PT)
##### Print the first 10 records from rawDataRDD
['1000002 1 55',  
'1000002 1000006 33',  
'1000002 1000007 8',  
 '1000002 1000009 144',  
 '1000002 1000010 314',  
 '1000002 1000013 8',  
 '1000002 1000014 42',  
 '1000002 1000017 69',  
 '1000002 1000024 329',  
 '1000002 1000025 1']

##### ANSWER 2 (1 PT)
##### Print topk values from artistByID
[('06Crazy Life', 1),  
('Pang Nakarin', 1),  
('Terfel, Bartoli- Mozart: Don', 1),  
('The Flaming Sidebur', 1),  
('Bodenstandig 3000', 1),  
('Jota Quest e Ivete Sangalo', 1),  
('Toto_XX (1977', 1),  
('U.S Bombs -', 1),  
('artist formaly know as Mat', 1),  
('Kassierer - Musik für beide Ohren', 1)]

##### ANSWER 3 (1 PT)  
##### Print the first 5 records from trainData  
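[Rating(user=1000002, product=1000014, rating=42.0),  
 Rating(user=1000002, product=1000088, rating=157.0),  
 Rating(user=1000002, product=1000139, rating=56.0),  
 Rating(user=1000002, product=1000140, rating=95.0),  
 Rating(user=1000002, product=1000210, rating=23.0)]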

##### ANSWER 4 (1 PT)
##### Print the artist listens for testUserID = 1000002
['Café Del Mar', 'Eric Clapton', 'Eurythmics']

##### ANSWER 5 (2 PTS)
##### Make 10 recommendations for testUserID = 1000002
['Dark Tranquillity',  
'Eric Clapton',  
'Scorpions',  
'Enigma',  
'Eurythmics',  
'Gary Jules',  
'Elvis Costello',  
'Echo & the Bunnymen',  
'Nena',  
'Erasure']

##### ANSWER 6 (2 PTS)
##### Using the rank 20 model, make 10 recommendations for testUserID = 1000002
['Eric Clapton',  
'Dark Tranquillity',  
'Scorpions',  
'Gary Jules',  
'Enigma',  
'Eurythmics',  
'Elvis Costello',  
'植松伸夫',  
'Nena',  
'Echo & the Bunnymen']

##### ANSWER 7 (2 PTS)
##### How does the rank 10 model seem to perform versus the rank 20 model?
##### The contents of artistsForUser may help answer the question.

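**The two models return nearly the same artists, in a slightly different order.
The recommendations also make sense given artistsForUser: Eric Clapton and Eurythmics appear in the user's listening history and in both recommendation lists, and Erasure (rank 10 list) is synth-pop, similar to Eurythmics.**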