# Implicit ALS

_Adapted from [@sophwatts](https://github.com/sophwats/Implicit-ALS)_

ALS is usually applied to a set of explicit user/product ratings. But what if the data isn't so self-explanatory?

### A day trip to the library
Consider, for example, the data collected by a local library. The library records which books each user took out and how long they kept them before returning them.

As such, we have no explicit indication that a user liked or disliked the books they took out: just because you borrowed a book does not mean that you enjoyed it, or even read it.
The missing data is also of interest: the fact that a user has not taken out a specific book could indicate that they dislike that genre, or simply that they haven't been to that section of the library.

Furthermore, the same user action could have many different causes. Suppose you withdraw a book three times. That might indicate that you loved the book, but it may also indicate that the book doesn't appeal to you as strongly as some other books you withdrew, so you never got round to reading it the first two times.

To make the situation even worse, implicit data is often dirty. For example, a user may withdraw a library book for their child using their account, or they may accidentally pick up a book that was sitting on the counter. 

### The solution
Building on the standard ALS approach, [Hu et al. (2008)](http://yifanhu.net/PUB/cf.pdf) presented a methodology for carrying out ALS when dealing with implicit data.

The general idea is that we have some recorded observations $r_{u,i}$ denoting the level of interaction user $u$ had with product $i$. For example, if user $1$ borrowed book $4$ once, we may set $r_{1,4}=1$. Alternatively, we may wish to let $r_{u,i}$ hold information about how many days the book was borrowed for. (There is a lot of freedom in this setup, so we need to make some data-specific decisions about how to define $r_{u,i}$.)

Given the set of observations $r_{u,i}$, a binary indicator $p_{u,i}$ is introduced where:

$ p_{u,i} = \begin{cases} 1 & \mbox{if } r_{u,i}>0 \\
0 & \mbox{otherwise.} \end{cases} $


A confidence parameter $\alpha$ lets us choose how much weight to place on the recorded $r_{u,i}$. This leads to the introduction of $c_{u,i}$, which we take to be the confidence we have in user $u$'s preference for product $i$:
$c_{u,i} = 1 + \alpha r_{u,i}$.
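
As a minimal sketch of these two definitions, the snippet below builds $p_{u,i}$ and $c_{u,i}$ from a small matrix of interaction counts. The matrix `r` and the value of `alpha` are made up for illustration and do not come from the library example or the dataset used later.

```python
import numpy as np

# Hypothetical interaction counts r[u, i]: 3 users x 4 books
r = np.array([
    [1.0, 0.0, 3.0, 0.0],
    [0.0, 0.0, 0.0, 2.0],
    [0.0, 1.0, 0.0, 0.0],
])

alpha = 0.5  # arbitrary illustrative value for the confidence parameter

# p[u, i] = 1 where an interaction was observed, 0 otherwise
p = (r > 0).astype(float)

# c[u, i] = 1 + alpha * r[u, i]; unobserved pairs keep the baseline confidence of 1
c = 1.0 + alpha * r
```

Note that every pair gets a confidence of at least 1, so unobserved pairs still contribute to the fit, only with less weight than observed ones.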

Let $N_u$ denote the number of users, and $N_p$ denote the number of products. Let $k\in \mathbb{Z}^+$ be a user-defined number of factors.
Now, in implicit ALS the goal is to find matrices $X\in \mathbb{R}^{N_u \times k}$ and $Y\in \mathbb{R}^{N_p \times k}$ such that the following cost function is minimised:

$\sum_{u,i} c_{u,i}(p_{u,i}-X_u^T Y_i)^2 + \lambda \left(\sum_u \| X_u\|^2 + \sum_{i} \| Y_i\|^2\right), $


where
$X_u$ is the $u$th row of $X$,
$Y_i$ is the $i$th row of $Y$, and
$\lambda$ is a user-defined regularisation parameter which prevents overfitting.
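
To make the cost function above concrete, here is a small NumPy sketch that evaluates it for given factor matrices. The function name `implicit_cost` and the toy inputs are assumptions for illustration; this is not how Spark computes the objective internally.

```python
import numpy as np

def implicit_cost(r, X, Y, alpha, lam):
    """Confidence-weighted squared error plus the L2 penalty on both factor matrices."""
    p = (r > 0).astype(float)   # binary preference p[u, i]
    c = 1.0 + alpha * r         # confidence c[u, i]
    pred = X @ Y.T              # pred[u, i] = X_u^T Y_i
    fit = np.sum(c * (p - pred) ** 2)
    penalty = lam * (np.sum(X ** 2) + np.sum(Y ** 2))
    return fit + penalty

# Hypothetical 2 users x 2 items example with k = 1 and all-zero factors
r_demo = np.array([[1.0, 0.0],
                   [0.0, 2.0]])
X0 = np.zeros((2, 1))
Y0 = np.zeros((2, 1))

# With zero factors only the observed pairs contribute: 1.5 + 2.0 = 3.5
cost0 = implicit_cost(r_demo, X0, Y0, alpha=0.5, lam=0.1)
```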

With the minimising $X$ and $Y$ in hand, the product $X_u^T Y_i$ gives an estimate of the preference $p_{u,i}$, including for user, product pairs where no interaction has yet occurred.
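
The "alternating" part of ALS comes from the fact that, with $Y$ held fixed, the cost is quadratic in each $X_u$ and has the closed-form minimiser $X_u = (Y^T C^u Y + \lambda I)^{-1} Y^T C^u p_u$ given by Hu et al., where $C^u$ is the diagonal matrix of the confidences $c_{u,i}$ and $p_u$ is the vector of preferences for user $u$; the same holds for each $Y_i$ with $X$ fixed. The following is a deliberately naive NumPy sketch of that loop on made-up data. The function names and the toy matrix are assumptions, and it makes no attempt at the efficiency tricks a real implementation such as Spark's would use.

```python
import numpy as np

def solve_side(F, c, p, lam):
    """Solve one weighted ridge regression per row of c and p, holding factors F fixed."""
    k = F.shape[1]
    out = np.zeros((c.shape[0], k))
    for u in range(c.shape[0]):
        Cu = np.diag(c[u])                    # diagonal confidence matrix for row u
        A = F.T @ Cu @ F + lam * np.eye(k)    # k x k, positive definite for lam > 0
        b = F.T @ Cu @ p[u]
        out[u] = np.linalg.solve(A, b)
    return out

def implicit_als(r, k=2, alpha=0.5, lam=0.1, n_iter=10, seed=0):
    """Alternate between solving for user factors X and item factors Y."""
    p = (r > 0).astype(float)
    c = 1.0 + alpha * r
    rng = np.random.default_rng(seed)
    Y = rng.normal(scale=0.1, size=(r.shape[1], k))  # random start for the item factors
    for _ in range(n_iter):
        X = solve_side(Y, c, p, lam)        # update users with items fixed
        Y = solve_side(X, c.T, p.T, lam)    # update items with users fixed
    return X, Y

# Hypothetical counts: two groups of users with disjoint tastes
r_demo = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0],
    [0.0, 0.0, 2.0, 1.0],
])

X, Y = implicit_als(r_demo)
scores = X @ Y.T   # scores[u, i] estimates the preference p[u, i]
```

Sorting a row of `scores` in descending order, after excluding the items the user has already interacted with, would give that user's recommendations.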

### Let's get going
We are going to run implicit ALS using the implementation given in the pyspark.ml.recommendation module.

The data we will be using can be found at http://www2.informatik.uni-freiburg.de/~cziegler/BX/

In [2]:
# Set up a Spark session and get its Spark context

import pyspark

spark = (pyspark.sql.SparkSession.builder
         .appName('implicitALS')
         .getOrCreate())
sc = spark.sparkContext

# The Data


Further down, we download and unzip the data. The two files we are interested in are BX-Books.csv and BX-Book-Ratings.csv, which follow these schemas:

### BX-Books.csv

| Field Name | Type | Description |
| ---------- | -----| ----------- |
|ISBN |  String | length 10, alphanumeric |
| Book-Title | String | Title of book |
|Book-Author | String| Name of author |
| Year-Of-Publication | String | yyyy|
|Publisher| String |Name of publisher |
|Image-URL-S | String| URL for small image on amazon.com |
|Image-URL-M | String| URL for medium image on amazon.com |
|Image-URL-L | String| URL for large image on amazon.com|


### BX-Book-Ratings.csv
| Field Name | Type | Description |
| ---------- | ---- | ----------- |
|User-ID |  Integer | Range from 2 to 278854 |
| ISBN | String| length 10, alphanumeric |
|Book-Rating| Integer | 1-10 denotes dislike-like. 0 denotes implicit interaction|

In [1]:
#Downloading and unzipping the data
!curl -O http://www2.informatik.uni-freiburg.de/~cziegler/BX/BX-CSV-Dump.zip
!unzip BX-CSV-Dump.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 24.8M  100 24.8M    0     0  2476k      0  0:00:10  0:00:10 --:--:-- 2895k
Archive:  BX-CSV-Dump.zip
  inflating: BX-Book-Ratings.csv     
  inflating: BX-Books.csv            
  inflating: BX-Users.csv            


The cell above unpacks three .csv files into the working directory. We are interested in the files "BX-Book-Ratings.csv" and "BX-Books.csv". The three columns of "BX-Book-Ratings.csv" are the user id, an ISBN which identifies the book, and a rating. A "0" in the rating column denotes that an implicit interaction occurred between the user and the book. It is this data that we are interested in. We first look at the head of the file, then write a header line to a new file and append the implicit rows to it using grep:

In [3]:
!head BX-Book-Ratings.csv

"User-ID";"ISBN";"Book-Rating"
"276725";"034545104X";"0"
"276726";"0155061224";"5"
"276727";"0446520802";"0"
"276729";"052165615X";"3"
"276729";"0521795028";"6"
"276733";"2080674722";"0"
"276736";"3257224281";"8"
"276737";"0600570967";"6"
"276744";"038550120X";"7"


In [4]:
!echo '"user_id";"isbn";"observation"' > implicit.csv

In [5]:
!grep '"0"' BX-Book-Ratings.csv >> implicit.csv

In [6]:
!head implicit.csv

"user_id";"isbn";"observation"
"276725";"034545104X";"0"
"276727";"0446520802";"0"
"276733";"2080674722";"0"
"276746";"0425115801";"0"
"276746";"0449006522";"0"
"276746";"0553561618";"0"
"276746";"055356451X";"0"
"276746";"0786013990";"0"
"276746";"0786014512";"0"


In [7]:
#Load in the data
ratings_df = spark.read.csv('implicit.csv', sep=';', header=True, inferSchema=True)

Let's have a look at the first 10 entries in the ratings file: 

In [8]:
ratings_df.show(10)

+-------+----------+-----------+
|user_id|      isbn|observation|
+-------+----------+-----------+
| 276725|034545104X|          0|
| 276727|0446520802|          0|
| 276733|2080674722|          0|
| 276746|0425115801|          0|
| 276746|0449006522|          0|
| 276746|0553561618|          0|
| 276746|055356451X|          0|
| 276746|0786013990|          0|
| 276746|0786014512|          0|
| 276747|0451192001|          0|
+-------+----------+-----------+
only showing top 10 rows



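The implicit ALS function we are going to use requires that product ids are integers. At the moment we have ISBNs, which contain a mixture of digits and letters, so we must map them to integers. One option is the RDD method zipWithIndex(), which pairs each element of an RDD with a unique index. Here we instead register the DataFrame as a view and use the SQL window function dense_rank(), which gives each distinct ISBN a consecutive integer id. In the same query we count how often each user, ISBN pair occurs and use that count as the rating.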

In [9]:
ratings_df.createGlobalTempView("ratings")

In [33]:
ratings_df = spark.sql("""
SELECT
  user_id
, dense_rank() OVER (ORDER BY isbn) AS isbn_id
, isbn
, COUNT(*) AS rating
FROM global_temp.ratings
GROUP BY 1, 3
ORDER BY 1""")
ratings_df.show(10)

+-------+-------+----------+------+
|user_id|isbn_id|      isbn|rating|
+-------+-------+----------+------+
|      2|  23897|0195153448|     1|
|      7|  42139| 034542252|     1|
|      8| 193834|1558746218|     1|
|      8|  75072|0425176428|     1|
|      8| 129080|0671870432|     1|
|      8| 130853|0679425608|     1|
|      8| 151562|0771074670|     1|
|      8|  66222|0393045218|     1|
|      8| 159520|080652121X|     1|
|      8| 151295|0771025661|     1|
+-------+-------+----------+------+
only showing top 10 rows
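
To show what the `dense_rank() OVER (ORDER BY isbn)` expression in the query above amounts to, here is a plain-Python sketch on a handful of ISBN strings. The sample list is hypothetical, and the ordering matches only under the assumption that the ISBNs are plain ASCII strings.

```python
# Hypothetical sample of ISBN strings, with one repeat
isbns = ["034545104X", "0155061224", "0446520802", "034545104X"]

# Sort the distinct ISBNs and number them from 1, so equal ISBNs share an id
# and the ids have no gaps
isbn_to_id = {isbn: idx for idx, isbn in enumerate(sorted(set(isbns)), start=1)}

# Replace each ISBN by its integer id
isbn_ids = [isbn_to_id[isbn] for isbn in isbns]
# isbn_ids is [2, 1, 3, 2]
```

Because the ids depend on the sort order of all distinct ISBNs, they are only stable as long as the set of ISBNs does not change.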



In [42]:
ratings_df.groupBy('user_id').sum('rating').show(10)

+-------+-----------+
|user_id|sum(rating)|
+-------+-----------+
|      2|          1|
|      7|          1|
|      8|         11|
|      9|          2|
|     10|          1|
|     14|          1|
|     16|          1|
|     17|          3|
|     20|          1|
|     22|          3|
+-------+-----------+
only showing top 10 rows



We now import the ALS class from the pyspark.ml.recommendation module, and build the model.

In [28]:
from pyspark.ml.recommendation import ALS
als = ALS(rank=5, maxIter=5, alpha=0.5, implicitPrefs=True,
          userCol="user_id", itemCol="isbn_id", ratingCol="rating",
          nonnegative=True)

In [29]:
training, test = ratings_df.randomSplit([0.8, 0.2])

model = als.fit(training)

In [30]:
# Use transform() to score the held-out user, item pairs in the test set
predictions = model.transform(test)

We can now look at predictions for a range of user, product pairs:

In [31]:
predictions.take(10)

[Row(user_id=13221, isbn_id=148, isbn='0002005395', rating=1, prediction=nan),
 Row(user_id=113817, isbn_id=833, isbn='0006174817', rating=1, prediction=0.0010701927822083235),
 Row(user_id=264317, isbn_id=1342, isbn='0006497683', rating=1, prediction=nan),
 Row(user_id=117539, isbn_id=2366, isbn='0020264801', rating=1, prediction=0.0008559214184060693),
 Row(user_id=108773, isbn_id=3175, isbn='0028642333', rating=1, prediction=2.8307115371717373e-06),
 Row(user_id=129358, isbn_id=3749, isbn='0060001461', rating=1, prediction=0.016894200816750526),
 Row(user_id=51350, isbn_id=3794, isbn='0060005564', rating=1, prediction=0.0035295472480356693),
 Row(user_id=78738, isbn_id=4101, isbn='0060116188', rating=1, prediction=0.0),
 Row(user_id=127429, isbn_id=5300, isbn='0060247827', rating=1, prediction=nan),
 Row(user_id=83637, isbn_id=6357, isbn='0060652071', rating=1, prediction=0.008162778802216053)]
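
The `nan` predictions above most likely come from users or items in the test split that did not appear in the training split, so the model has no factor vector for them. Spark's `ALS` has a `coldStartStrategy` parameter, and setting it to `"drop"` removes such rows from the output of `transform()`. The plain-Python sketch below imitates that behaviour with made-up factor dictionaries; the ids and values are hypothetical and the code is only an illustration of the idea.

```python
import math

# Hypothetical learned factors keyed by id; user 3 and item 30 were never seen in training
user_factors = {1: [0.2, 0.5], 2: [0.9, 0.1]}
item_factors = {10: [0.4, 0.3], 20: [0.7, 0.0]}

def predict(user_id, item_id):
    """Dot product of the two factor vectors, or NaN if either one is missing."""
    xu = user_factors.get(user_id)
    yi = item_factors.get(item_id)
    if xu is None or yi is None:
        return float("nan")
    return sum(a * b for a, b in zip(xu, yi))

test_pairs = [(1, 10), (3, 10), (2, 30), (2, 20)]
scored = [(u, i, predict(u, i)) for u, i in test_pairs]

# Keep only rows with a real prediction, which is the effect "drop" has on the output
kept = [row for row in scored if not math.isnan(row[2])]
```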

We can use `.filter()` and `.orderBy()` to view the 10 highest-scoring predictions across the test set.

In [24]:
from pyspark.sql.functions import isnan
# Drop NaN predictions, then sort by predicted score, highest first
(predictions.filter(~isnan(predictions.prediction))
            .orderBy("prediction", ascending=False).take(10))

[Row(user_id=76352, isbn_id=98956, isbn='051513287X', rating=1, prediction=0.8468062877655029),
 Row(user_id=198711, isbn_id=59705, isbn='0380710218', rating=1, prediction=0.7767812013626099),
 Row(user_id=76352, isbn_id=49281, isbn='0373218397', rating=1, prediction=0.7479233741760254),
 Row(user_id=102967, isbn_id=79183, isbn='0440211727', rating=1, prediction=0.6657320857048035),
 Row(user_id=35859, isbn_id=110729, isbn='055357695X', rating=1, prediction=0.6600107550621033),
 Row(user_id=76352, isbn_id=49251, isbn='0373218036', rating=1, prediction=0.6562633514404297),
 Row(user_id=55548, isbn_id=79183, isbn='0440211727', rating=1, prediction=0.6538708209991455),
 Row(user_id=76352, isbn_id=79602, isbn='044022165X', rating=1, prediction=0.6529889106750488),
 Row(user_id=76352, isbn_id=98819, isbn='0515128554', rating=1, prediction=0.6501860022544861),
 Row(user_id=198711, isbn_id=80996, isbn='0440484332', rating=1, prediction=0.6485165953636169)]

The `.recommendForAllItems()` method returns, for each item, the users with the highest predicted scores. Here we ask for the top 8 users per item:

In [25]:
model.recommendForAllItems(8).take(10)

[Row(isbn_id=148, recommendations=[Row(user_id=87555, rating=0.00033917705877684057), Row(user_id=36606, rating=0.0003086001379415393), Row(user_id=238120, rating=0.00028850819217041135), Row(user_id=189334, rating=0.00026371973217464983), Row(user_id=56856, rating=0.0002628100337460637), Row(user_id=23768, rating=0.000255756574915722), Row(user_id=156150, rating=0.00024303447571583092), Row(user_id=127233, rating=0.00022251176415011287)]),
 Row(isbn_id=471, recommendations=[Row(user_id=87555, rating=0.0176248736679554), Row(user_id=36606, rating=0.016952522099018097), Row(user_id=238120, rating=0.01626615785062313), Row(user_id=23768, rating=0.015157423913478851), Row(user_id=56856, rating=0.014275912195444107), Row(user_id=60244, rating=0.014070408418774605), Row(user_id=189334, rating=0.013647633604705334), Row(user_id=156150, rating=0.012675374746322632)]),
 Row(isbn_id=496, recommendations=[Row(user_id=141710, rating=9.460222882839986e-11), Row(user_id=221445, rating=8.86600723393

In [None]:
from pyspark.ml.evaluation import RegressionEvaluator
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",
                                predictionCol="prediction")
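
The evaluator above is configured for root-mean-square error but is not applied in this notebook. As a rough sketch of what that metric computes, here is a plain-Python version on made-up labels and scores. Skipping NaN predictions is an assumption made here for convenience; without it, a single NaN would make the result NaN.

```python
import math

def rmse(labels, predictions):
    """Root-mean-square error over the pairs whose prediction is not NaN."""
    pairs = [(y, yhat) for y, yhat in zip(labels, predictions) if not math.isnan(yhat)]
    if not pairs:
        return float("nan")
    return math.sqrt(sum((y - yhat) ** 2 for y, yhat in pairs) / len(pairs))

# Hypothetical labels and predicted scores, including one cold-start NaN
labels = [1, 1, 1, 1]
preds = [0.5, float("nan"), 1.0, 0.5]
error = rmse(labels, preds)
```

In Spark, the corresponding step would be to call `evaluator.evaluate()` on a predictions DataFrame from which the NaN rows have been removed.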

### Conclusion 
In this notebook we saw how to build a basic implicit ALS model in Spark. However, the data used was fairly plain, with "0"s being used for all implicit interactions. Further work should consider a dataset more suited to implicit ALS.