# Advanced Machine Learning (MScA, 32017)

# Project: Recommending Music with Audioscrobbler Data

### Yuri Balasanov, Mihail Tselishchev, &copy; iLykei 2017

## Fitting an ALS model to Audioscrobbler (LastFM) data

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, DoubleType, IntegerType, \
StringType, Row
from pyspark.ml.recommendation import ALS
import pyspark.sql.functions as func
import random
import time
from datetime import datetime

In [2]:
spark = SparkSession.builder.getOrCreate()
spark

## Data

Set the paths to the data files. `test_path` points to the test sample for which predictions are calculated at the end of this notebook.

In [3]:
# paths to files
artistdata_path = './data/artist_data.csv'
userartist_path = './data/clean_15_5.csv'
test_path = './data/LastFM_Test_Sample.csv'

In [4]:
# defining schemas
artistdata_struct = StructType([StructField('artistId', IntegerType()), \
                                StructField('name', StringType())])
userartist_struct = StructType([StructField('userId', IntegerType()), \
                                StructField('artistId', IntegerType()), \
                                StructField('count', IntegerType())])

In [5]:
# read artist names data
artistdata_df = spark.read.csv(artistdata_path, sep = '\t', schema = artistdata_struct)
artistdata_df.cache()
artistdata_df.show(10)

+--------+------------------+
|artistId|              name|
+--------+------------------+
| 2000001|        Portishead|
| 2000002|               Air|
| 2000003|     Severed Heads|
| 2000004|Marianne Faithfull|
| 2000005|   Peace Orchestra|
| 2000006|      Gallon Drunk|
| 2000007|             Breed|
| 2000008|         Omni Trio|
| 2000009|    The Last Poets|
| 2000010|    Rhythm & Sound|
+--------+------------------+
only showing top 10 rows



In [6]:
# read user-artist data
userartist_df = spark.read.csv(userartist_path, sep = ',', header=True, schema = userartist_struct)
userartist_df.cache()
userartist_df.show(10)

+-------+--------+-----+
| userId|artistId|count|
+-------+--------+-----+
|1000152| 2000001|   16|
|1000152| 2000002|    6|
|1000152| 2000011|    4|
|1000152| 2000015|    3|
|1000152| 2000023|   26|
|1000152| 2000024|   24|
|1000152| 2000026|   26|
|1000152| 2000032|    3|
|1000152| 2000039|   96|
|1000152| 2000044|    3|
+-------+--------+-----+
only showing top 10 rows



In [7]:
# split data:
(training, test) = userartist_df.randomSplit([0.9, 0.1], seed=0)
training.cache()
# remove 'count' column from test:
test = test.drop('count')
test.cache()
test.show(10)

+-------+--------+
| userId|artistId|
+-------+--------+
|1000152| 2000024|
|1000152| 2000137|
|1000152| 2000170|
|1000152| 2000173|
|1000152| 2000254|
|1000152| 2000275|
|1000152| 2000277|
|1000152| 2000414|
|1000152| 2000606|
|1000152| 2001006|
+-------+--------+
only showing top 10 rows



## Fitting model

Fit the ALS model. <br>
Hyperparameters to specify: <br>

-  `rank` between 5 and 40; default 10; the number of latent factors in the model
-  `regParam` between 0.01 and 8; default 0.1; regularization parameter $\lambda$
-  `alpha` between 1 and 40; default 1; parameter $\alpha$ appears in the expression for confidence $$c_{u,i}=1+\alpha r_{u,i}$$ or $$c_{u,i}=1+\alpha \ln\left(1+\frac{r_{u,i}}{\epsilon}\right).$$ If $\alpha=0$, confidence is always 1 regardless of rating $r_{u,i}$. As $\alpha$ grows, we pay more and more attention to how many times user $u$ consumed item $i$. Thus $\alpha$ controls the relative weight of observed versus unobserved ratings.
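
To make the role of $\alpha$ concrete, the sketch below evaluates both confidence expressions for a few play counts. The function names and the choice $\epsilon=1$ are illustrative assumptions; they are not part of Spark's API.

```python
import math

def confidence_linear(r, alpha):
    """Linear confidence: c = 1 + alpha * r."""
    return 1.0 + alpha * r

def confidence_log(r, alpha, eps=1.0):
    """Logarithmic confidence: c = 1 + alpha * ln(1 + r / eps)."""
    return 1.0 + alpha * math.log(1.0 + r / eps)

# 0 stands for an unobserved pair; the other play counts are illustrative.
counts = [0, 3, 16, 96]
for alpha in (0, 1, 40):
    linear = [confidence_linear(r, alpha) for r in counts]
    logged = [round(confidence_log(r, alpha), 2) for r in counts]
    print('alpha =', alpha, 'linear:', linear, 'log:', logged)
```

With $\alpha=0$ every pair gets confidence 1; for larger $\alpha$ the logarithmic form grows much more slowly with the play count than the linear one.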

Search for hyperparameters on a grid of 4-5 values in each range.
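
One possible way to enumerate such a grid is sketched below. The grid values are arbitrary picks inside the stated ranges, and `param_grid` is a hypothetical helper, not a Spark function.

```python
from itertools import product

# Illustrative values only: four points inside each recommended range.
grid_values = {
    'rank': [5, 10, 20, 40],
    'regParam': [0.01, 0.1, 1.0, 8.0],
    'alpha': [1.0, 5.0, 20.0, 40.0],
}

def param_grid(values):
    """Return every combination of the given hyperparameter values as a dict."""
    names = list(values)
    combos = product(*(values[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]

grid = param_grid(grid_values)
print(len(grid), 'candidate settings')
print(grid[0])
```

Each dictionary can then be unpacked into the constructor, for example `ALS(implicitPrefs=True, userCol="userId", itemCol="artistId", ratingCol="count", **params)`, and the fitted models compared on held-out data.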

In [8]:
# building a model
# Note that some hyperparameters should be tuned with cross-validation
# (here rank keeps its default value of 10, while alpha and regParam are set to 2 without tuning)
t1 = time.perf_counter()
model = ALS(implicitPrefs=True, userCol="userId", itemCol="artistId", ratingCol="count", 
            rank=10, alpha=2,regParam=2).fit(training)
t2 = time.perf_counter()
print('Fitting time:', t2-t1)

Fitting time: 8.682182056112335


## Predict test data

Download your test sample from the test shiny.

Use it in the following cell to predict ratings, save the results as a CSV file, and upload that file back to the test shiny for scoring.

Of course, predictions obtained without tuning hyperparameters and with a small training sample are not expected to be good.

In [22]:
# reading test file
test_struct = StructType([StructField('userId', IntegerType()), \
                          StructField('artistId', IntegerType())])
test_df = spark.read.csv(test_path, sep = '\t', schema = test_struct)
test_df.show(10)

+-------+--------+
| userId|artistId|
+-------+--------+
|1060367| 2342749|
|1094562| 2011589|
|1076129| 2009989|
|1111161| 2000995|
|1040252| 2006472|
|1111874| 2002337|
|1017609| 2013613|
|1097539| 2000918|
|1017830| 2145531|
|1002511| 2001848|
+-------+--------+
only showing top 10 rows



In [23]:
# Note that many predictions are NaN because some users and artists in the test sample
# do not appear in the small training data.
# The full train file has to be used to avoid this.
# However, even with the full train file, some users might be new.
# Which artists should we propose to them?
predictions = model.transform(test_df)
predictions.show(10)
assert predictions.count() == test_df.count()

+-------+--------+----------+
| userId|artistId|prediction|
+-------+--------+----------+
|1069012| 2000127|       NaN|
|1048066| 2000127|       NaN|
|1066775| 2000127|       NaN|
|1031070| 2000127|       NaN|
|1003677| 2000127|       NaN|
|1083357| 2000127|       NaN|
|1117592| 2000127|       NaN|
|1063308| 2000127|       NaN|
|1017626| 2000127|       NaN|
|1019146| 2000127|       NaN|
+-------+--------+----------+
only showing top 10 rows
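
One simple way to deal with the NaN rows while keeping the row count unchanged (so the assertion above still holds) is to replace them with a constant. This is only a sketch of one possible cold-start treatment, not the course's prescribed answer; the helper name `fill_cold_start` and the default value 0.0 are assumptions.

```python
def fill_cold_start(predictions, default=0.0, col='prediction'):
    """Replace NaN values in the prediction column with a constant, keeping every row."""
    # Imported lazily so the helper can be defined without a running Spark session.
    import pyspark.sql.functions as F
    return predictions.withColumn(
        col,
        F.when(F.isnan(F.col(col)), F.lit(default)).otherwise(F.col(col)),
    )
```

It would be applied as `predictions = fill_cold_start(model.transform(test_df))`. Spark 2.2 and later also accept `coldStartStrategy="drop"` in `ALS`, but that removes the NaN rows and therefore changes the row count.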



In [24]:
# Save test predictions to CSV-file
#timestamp = datetime.now().isoformat(sep='T', timespec='seconds')
#predictions.coalesce(1).write.csv('./data/test_predictions_{}.csv'.format(timestamp), 
#                                  sep = '\t')
predictions.coalesce(1).write.csv('test_predictions.csv', sep='\t')

Py4JJavaError: An error occurred while calling o618.csv.
: org.apache.spark.SparkException: Job aborted.
	...
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 630.0 failed 1 times, most recent failure: Lost task 0.0 in stage 630.0 (TID 4632, localhost, executor driver): org.apache.spark.SparkException: Task failed while writing rows
	...
Caused by: java.io.IOException: Mkdirs failed to create file:/C:/Users/JohntheGreat/Documents/MSCA/AdvancedMachineLearning/Week3_Recommender/data/test_predictions_2017-10-21T14:51:23.csv/_temporary/0/_temporary/attempt_20171021151752_0630_m_000000_0 (exists=false, cwd=file:/C:/Users/JohntheGreat/Documents/MSCA/AdvancedMachineLearning/Week3_Recommender)
	...
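
The `Mkdirs failed` message in the traceback names a directory containing colons (from the ISO timestamp), which Windows does not allow in file names; this is the likely cause of the failure. A minimal sketch of a colon-free timestamp follows. The format string and the name `safe_timestamp` are illustrative choices.

```python
from datetime import datetime

def safe_timestamp(now=None):
    """Return a timestamp like 20171021T145123, without ':' characters."""
    now = now or datetime.now()
    return now.strftime('%Y%m%dT%H%M%S')

out_path = './data/test_predictions_{}.csv'.format(safe_timestamp())
print(out_path)
```

The predictions could then be written with `predictions.coalesce(1).write.csv(out_path, sep='\t')`.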


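Check the saved results in the output directory (`test_predictions.csv` with the code above, or under `./data` if the commented-out path is used). <br>
Spark saves the solution as a folder containing several files. <br>
Only one of them should be a .csv file. Upload it to the test shiny.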
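
Spark names the data file inside the output folder `part-*.csv` and places marker files such as `_SUCCESS` next to it. The sketch below locates that single CSV; `find_part_csv` is a hypothetical helper, and the check for exactly one match assumes the output was written with `coalesce(1)` as above.

```python
import glob
import os

def find_part_csv(output_dir):
    """Return the path of the single part-*.csv file inside a Spark output folder."""
    matches = glob.glob(os.path.join(output_dir, 'part-*.csv'))
    if len(matches) != 1:
        raise ValueError('expected exactly one part file, found {}'.format(len(matches)))
    return matches[0]
```

For the cell above, `find_part_csv('test_predictions.csv')` would return the file to upload.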