## Title

# Data Merge

### Description:

In this notebook we take a first look at the 4 initial datasets and concatenate them in order to work with the full dataset.

### Authors:

#### Hugo Cesar Octavio del Sueldo
#### Jose Lopez Galdon

### Date:
04/12/2020
### Version:
1.0

## PySpark Collaborative Filtering with ALS

A recommender system is an information filtering tool that seeks to predict which products a user will like and, based on that prediction, recommends a few products to that user. For example, Amazon can recommend new shopping items to buy, Netflix can recommend new movies to watch, and Google can recommend news that a user might be interested in. The two most widely used approaches for building a recommender system are content-based filtering (CBF) and collaborative filtering (CF).

To understand the concept of recommender systems, let us look at an example. The table below shows the user-item utility matrix $Y$, where the value $R_{ui}$ denotes the rating that user $u$ gave to item $i$ on a scale of 1–5. The missing entries (shown by ? in the table) are the items that have not been rated by the respective user.

![](https://miro.medium.com/max/764/1*swlCZkfOdnxKJnQ1xHjIkw.png)

The objective of the recommender system is to predict the ratings for these unrated items. The items with the highest predicted ratings can then be recommended to the respective users. In real-world problems, the utility matrix is expected to be very sparse, as each user only encounters a small fraction of the items in the vast pool of options available.
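As a minimal sketch of the utility matrix described above, the snippet below builds a small matrix with missing entries and measures how sparse it is. The ratings are hypothetical values chosen for illustration, and `np.nan` is assumed as the marker for the "?" entries.

```python
import numpy as np

# Hypothetical 4-user x 5-item utility matrix; np.nan marks unrated items ("?").
Y = np.array([
    [5.0, np.nan, 3.0, np.nan, 1.0],
    [np.nan, 4.0, np.nan, np.nan, 2.0],
    [1.0, np.nan, np.nan, 5.0, np.nan],
    [np.nan, np.nan, 4.0, np.nan, np.nan],
])

observed = ~np.isnan(Y)             # True where a rating exists
n_observed = int(observed.sum())    # number of known ratings
sparsity = 1.0 - n_observed / Y.size

# The (user, item) pairs whose ratings the recommender has to predict.
missing_pairs = [(int(u), int(i)) for u, i in np.argwhere(~observed)]

print(f"{n_observed} of {Y.size} entries observed, sparsity = {sparsity:.2f}")
```

Even in this toy example more than half of the entries are unknown; real catalogues are typically far sparser.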



### Explicit vs. Implicit Ratings

There are two ways to gather user preference data for recommending items. The first is to ask the user for **explicit** ratings, typically on a concrete rating scale (such as rating a movie from one to five stars), which makes it easier to extrapolate from the data to predict future ratings. The drawback of explicit data is that it puts the responsibility for data collection on the user, who may not want to take the time to enter ratings. **Implicit** data, on the other hand, is easy to collect in large quantities without any extra effort on the part of the user. Unfortunately, it is much more difficult to work with.

### Data Sparsity and Cold Start

In real-world problems, the utility matrix is expected to be very sparse, as each user only encounters a small fraction of the items in the vast pool of options available. The cold-start problem arises when a new user or a new item is added, because neither has any rating history yet.

### Approaches to Recommendation

The two widely used approaches for building a recommender system are content-based filtering (CBF) and collaborative filtering (CF), of which CF is the more widely used.

![](https://miro.medium.com/max/1400/1*EIBIiW2YiakP1ftxwPF8LA.png)

The figure above illustrates the concepts of CF and CBF. The primary difference between the two approaches is that CF looks for similar users to recommend items, while CBF looks for similar content to recommend items.

### Content-based Filtering (CBF)

The main idea behind CBF is to recommend items similar to the items previously liked by the user. For example, if the user has rated some items in the past, these items are used for *user-modeling*, where the user’s interests are quantified. Traditionally, each item is represented by a feature vector $x_i$, which can be boolean or real valued, and the user is represented by a weight vector of the same dimension. Given a new item $x$, represented in the same feature vector space, the user's preference for it (e.g., a rating) is predicted using the user model.

This can be achieved in two different ways:

• Predicting ratings from the previous ratings with parametric models: linear regression for multi-valued ratings and logistic regression for binary ratings.

• Similarity-based techniques that use distance measures over the item features to find items similar to those liked by the user.
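The similarity-based variant can be sketched as follows. This is an illustration under stated assumptions rather than a definitive implementation: the item feature vectors are hypothetical, the user model is assumed to be the mean of the liked items' feature vectors, and cosine similarity is assumed as the similarity measure.

```python
import numpy as np

# Hypothetical item feature vectors (rows = items, columns = features such as genres).
item_features = np.array([
    [1.0, 0.0, 0.0],   # item 0
    [1.0, 1.0, 0.0],   # item 1
    [0.0, 0.0, 1.0],   # item 2 (not yet seen by the user)
    [0.9, 0.1, 0.0],   # item 3 (not yet seen by the user)
])

liked = [0, 1]         # items the user rated highly in the past
candidates = [2, 3]    # new items to score

# User model (assumption): the mean feature vector of the liked items.
user_profile = item_features[liked].mean(axis=0)

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Score each candidate by its similarity to the user profile.
scores = {i: cosine(user_profile, item_features[i]) for i in candidates}
best = max(scores, key=scores.get)

print(scores, "-> recommend item", best)
```

Item 3 shares features with the liked items and is ranked first, while item 2 shares none and gets a similarity of zero.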

CBF can be applied even before a strong user base has been built, because it depends only on the item’s metadata (features); it therefore does not suffer from the cold-start problem for new items. However, this also makes it computationally intensive, as the similarity between each user and all the items must be computed. Since recommendations are based on similarity to items that the user already knows about, CBF leaves no room for serendipity and leads to over-specialisation. CBF also ignores the popularity of an item and other users' feedback.

![](https://miro.medium.com/max/1400/1*7_JHQ6-1nyHoB2ux1h0ZKw.png)

### Collaborative filtering (CF)

Collaborative filtering aggregates the past behaviour of all users. It recommends items to a user based on the items liked by another set of users whose likes (and dislikes) are similar to those of the user under consideration. This approach is also called *user-user* based CF.

*Item-item* based CF became popular later. Here, to recommend an item to a user, the similarity between the items liked by the user and the other items is calculated. User-user CF and item-item CF can each be achieved in two different ways: **memory-based** (neighbourhood approach) and **model-based** (latent factor model approach).

#### 1. The memory-based approach

Neighbourhood approaches are most effective at detecting very localized relationships (neighbours), ignoring other users. Their downsides are that, first, sparse data hinders scalability and, second, they perform poorly in terms of RMSE (root-mean-squared error) compared to more complex methods. User-based filtering and item-based filtering are the two ways to approach memory-based collaborative filtering.

**User-based Filtering**: To recommend items to user $u_1$ in the user-user neighbourhood approach, we first find a set of users whose likes and dislikes are similar to those of $u_1$, using a similarity metric that captures the intuition that $\mathrm{sim}(u_1, u_2) > \mathrm{sim}(u_1, u_3)$ when users $u_1$ and $u_2$ are similar and users $u_1$ and $u_3$ are dissimilar. This set of similar users is called the neighbourhood of user $u_1$.

![](https://miro.medium.com/max/1400/1*iVT1smbzov9Oohpw8SvmhQ.png)

**Item-based Filtering**: To recommend items to user $u_1$ in the item-item neighbourhood approach, the similarity between the items liked by the user and the other items is calculated.
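Both neighbourhood variants can be sketched with a few lines of NumPy. The ratings below are hypothetical, 0 is assumed to mean "not rated", and cosine similarity over co-rated entries is one possible choice of similarity metric (mean-centred variants such as Pearson correlation are common alternatives).

```python
import numpy as np

# Hypothetical ratings (rows = users u1..u3, columns = items); 0 means "not rated".
R = np.array([
    [5.0, 4.0, 0.0, 1.0],   # u1: has not rated item 2
    [4.0, 5.0, 5.0, 1.0],   # u2: tastes similar to u1
    [1.0, 1.0, 2.0, 5.0],   # u3: tastes dissimilar to u1
])

def cosine_on_common(a, b):
    """Cosine similarity restricted to the entries rated in both vectors."""
    both = (a > 0) & (b > 0)
    if not both.any():
        return 0.0
    return float(a[both] @ b[both] / (np.linalg.norm(a[both]) * np.linalg.norm(b[both])))

target, item = 0, 2   # predict u1's rating for item 2

# User-based: similarity of every user to the target user (self excluded).
sims = np.array([cosine_on_common(R[target], R[v]) for v in range(R.shape[0])])
sims[target] = 0.0

# Similarity-weighted average of the neighbours' ratings for the item.
raters = R[:, item] > 0
weights = sims * raters
prediction = float(weights @ R[:, item] / weights.sum())

# Item-based: the same similarity, applied to item columns instead of user rows.
item_sim = cosine_on_common(R[:, 0], R[:, 1])

print(f"sim(u1,u2)={sims[1]:.3f}, sim(u1,u3)={sims[2]:.3f}, prediction={prediction:.2f}")
```

Because u2 is the closer neighbour, the prediction is pulled towards u2's rating of 5 rather than u3's rating of 2.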

#### 2. The model-based approach

Latent factor model based collaborative filtering learns the (latent) user and item profiles (both of dimension $K$) through matrix factorization by minimizing the RMSE (root-mean-squared error) between the available ratings $y$ and their predicted values $\hat{y}$. Here each item $i$ is associated with a latent (feature) vector $x_i$, each user $u$ is associated with a latent (profile) vector $\theta_u$, and the predicted rating $\hat{y}_{ui}$ is expressed as

![](https://miro.medium.com/max/910/1*Qz04bNnnO7xg-qgckvVs6A.png)

![](https://miro.medium.com/max/1400/1*YlMGcZI9kJJL7FyKz9p7Vg.png)

Latent factor methods deliver prediction accuracy superior to other published CF techniques. They also address the sparsity issue faced by neighbourhood models in CF. The memory efficiency and ease of implementation of gradient-based matrix factorization models (SVD) made them the method of choice within the Netflix Prize competition. However, latent factor models are effective at estimating the overall association between all items at once but fail to identify strong associations among a small set of closely related items.
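A gradient-based latent factor model can be sketched as below. This is a minimal illustration, not a reference implementation: it assumes the predicted rating is the dot product of the user and item vectors, adds a small L2 penalty, and uses hypothetical ratings and hyperparameters (`K`, `lr`, `reg`, `epochs`).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical observed ratings as (user, item, rating) triples.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0),
           (2, 1, 2.0), (2, 2, 5.0), (3, 0, 1.0), (3, 2, 4.0)]
n_users, n_items, K = 4, 3, 2
lr, reg, epochs = 0.05, 0.02, 500   # assumed hyperparameters

theta = rng.normal(scale=0.1, size=(n_users, K))   # user profile vectors
x = rng.normal(scale=0.1, size=(n_items, K))       # item feature vectors

def rmse(theta, x):
    """RMSE between the observed ratings and the dot-product predictions."""
    errs = [r - theta[u] @ x[i] for u, i, r in ratings]
    return float(np.sqrt(np.mean(np.square(errs))))

rmse_before = rmse(theta, x)
for _ in range(epochs):
    for u, i, r in ratings:
        err = r - theta[u] @ x[i]
        theta_u = theta[u].copy()   # keep the old value for the item update
        # Gradient step on the squared error with L2 regularisation.
        theta[u] += lr * (err * x[i] - reg * theta_u)
        x[i] += lr * (err * theta_u - reg * x[i])
rmse_after = rmse(theta, x)

print(f"RMSE before: {rmse_before:.3f}, after: {rmse_after:.3f}")
```

Only the observed ratings enter the loss, which is how this family of models copes with a sparse utility matrix.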

### Recommendation using Alternating Least Squares (ALS)

Alternating Least Squares (ALS) matrix factorisation attempts to estimate the ratings matrix $R$ as the product of two lower-rank matrices, $X$ and $Y$, i.e. $X Y^T \approx R$. These lower-rank matrices are typically called ‘factor’ matrices. The general approach is iterative. During each iteration, one of the factor matrices is held constant, while the other is solved for using least squares. The newly solved factor matrix is then held constant while solving for the other factor matrix.
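The alternating scheme described above can be sketched in plain NumPy. This is an illustrative sketch, not Spark's implementation: the ratings are hypothetical, 0 is assumed to mean "not rated", and the regularisation term and the values of `rank`, `reg` and `n_iters` are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ratings matrix; 0 means "not rated".
R = np.array([
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
    [0.0, 1.0, 5.0, 4.0],
])
mask = R > 0
n_users, n_items = R.shape
rank, reg, n_iters = 2, 0.1, 50   # assumed hyperparameters

X = rng.normal(scale=0.1, size=(n_users, rank))   # user factors
Y = rng.normal(scale=0.1, size=(n_items, rank))   # item factors

def solve_rows(fixed, ratings, observed):
    """With `fixed` held constant, solve a regularised least-squares
    problem for each row of `ratings`, using only its observed entries."""
    out = np.zeros((ratings.shape[0], fixed.shape[1]))
    for r in range(ratings.shape[0]):
        F = fixed[observed[r]]
        A = F.T @ F + reg * np.eye(fixed.shape[1])
        b = F.T @ ratings[r, observed[r]]
        out[r] = np.linalg.solve(A, b)
    return out

def observed_rmse(X, Y):
    diff = (R - X @ Y.T)[mask]
    return float(np.sqrt(np.mean(diff ** 2)))

rmse_start = observed_rmse(X, Y)
for _ in range(n_iters):
    X = solve_rows(Y, R, mask)        # hold Y constant, solve for X
    Y = solve_rows(X, R.T, mask.T)    # hold X constant, solve for Y
rmse_end = observed_rmse(X, Y)

R_hat = X @ Y.T   # dense estimate, including the previously missing entries
print(f"RMSE on observed ratings: {rmse_start:.3f} -> {rmse_end:.3f}")
```

In PySpark the same idea is available through `pyspark.ml.recommendation.ALS`, whose `rank`, `maxIter` and `regParam` parameters play the roles of `rank`, `n_iters` and `reg` in this sketch.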