# Recommender System

In this lesson we will create a recommender system using movie ratings.

## Summary
- <a href='#1'>1. Context and Motivation</a>
- <a href='#2'>2. Types of Recommender Systems</a>
    - <a href='#2.1'>2.1. Popularity Based </a>
    - <a href='#2.2'>2.2. Content Based</a>
    - <a href='#2.3'>2.3. Collaborative Filtering based</a>
    - <a href='#2.4'>2.4. Hybrid</a>
    - <a href='#2.5'>2.5. Association Rule Mining based </a>
- <a href='#3'>3. CF based Example</a>
    - <a href='#3.1'>3.1. Dataset</a>
    - <a href='#3.2'>3.2. Exploratory Data Analysis</a>
    - <a href='#3.3'>3.3. Feature Engineering</a>
    - <a href='#3.4'>3.4. Splitting the Dataset</a>
    - <a href='#3.5'>3.5. Build and Train Recommender Model</a>
    - <a href='#3.6'>3.6. Predictions and Evaluation on Test Data</a>
    - <a href='#3.7'>3.7. Recommend Top Movies That Active User Might Like</a>
- <a href='#4'>4.  Exercises</a>  
- <a href='#5'>5.  References</a>

# <a id='1'>1. Context and Motivation</a>

Recommender systems serve many purposes, suggesting different kinds of items to users. 
For instance, recommendations commonly fall into the categories below:   
* Retail Products
* Jobs 
* Connections/Friends
* Movies/Music/Videos/Books/Articles
* Ads

A key aspect of recommender systems is that the recommended product or content should be something users might like but would not have discovered on their own.

**Examples:** Amazon’s products, Facebook’s friend suggestions, LinkedIn’s “People you may know,” Netflix’s movies, YouTube’s videos, Spotify’s music, and Coursera’s courses.

# <a id='2'>2. Types of Recommender Systems</a>

There are 5 types of Recommender Systems:

* Popularity Based
* Content Based  
* **Collaborative Filtering based** (created in this class)
* Hybrid 
* Association Rule Mining based

## <a id='2.1'>2.1 Popularity Based</a> 

This is the simplest type of recommender system. It recommends the items/content bought/viewed/liked/downloaded by most users.   
**It doesn’t produce personalized results, as the recommendations stay the same for every user.**
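
As a minimal sketch of this idea, the snippet below counts interactions per item and returns the most frequent ones. The interaction log and the `most_popular` helper are hypothetical, made up only for illustration.

```python
from collections import Counter

# Hypothetical interaction log of (user, item) pairs; not taken from the lesson's dataset.
interactions = [
    ("u1", "Star Wars"), ("u2", "Star Wars"), ("u3", "Star Wars"),
    ("u1", "Fargo"), ("u2", "Fargo"),
    ("u3", "Contact"),
]

def most_popular(interactions, n=2):
    """Return the n items with the most interactions, most popular first."""
    counts = Counter(item for _, item in interactions)
    return [item for item, _ in counts.most_common(n)]

top_items = most_popular(interactions)
print(top_items)  # ['Star Wars', 'Fargo']
```

Note that the function never looks at which user is asking, which is why every user receives the same list.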



## <a id='2.2'>2.2 Content Based</a> 

**Item Profile** 

This type of recommender system recommends items similar to those the user has liked in the past.   
The idea is to calculate a similarity score between any two items and recommend items to the user based on the profile of the user’s interests.


**User  Profile**

The other component of a content-based recommender system is the User Profile, which is built from the profiles of the items the user has liked or rated.
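
The two profiles can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the genre vectors are invented, the user profile is taken as the average of the liked item profiles, and cosine similarity is used as the similarity score (other choices are possible).

```python
import math

# Hypothetical item profiles: weights for (action, sci-fi, romance).
item_profiles = {
    "Star Wars": [1.0, 1.0, 0.0],
    "Alien": [0.5, 1.0, 0.0],
    "Titanic": [0.0, 0.0, 1.0],
    "Top Gun": [1.0, 0.0, 0.5],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def user_profile(liked):
    """Average the profiles of the items the user liked."""
    vectors = [item_profiles[title] for title in liked]
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def recommend(liked, n=2):
    """Rank items the user has not liked yet by similarity to the user profile."""
    profile = user_profile(liked)
    candidates = [title for title in item_profiles if title not in liked]
    ranked = sorted(candidates, key=lambda t: cosine(profile, item_profiles[t]), reverse=True)
    return ranked[:n]

recs = recommend(["Star Wars"])
print(recs)  # ['Alien', 'Top Gun']
```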

**Advantages:** 
* A content-based recommender works independently of other users’ data and hence can be applied to an individual’s historical data alone.
* The rationale behind a recommendation can be easily understood, as recommendations are based on the similarity score between the User Profile and the Item Profile.
* New and unknown items can also be recommended to users just based on historical interests and preferences of users.


**Limitations:**
* Item profiles can be biased or may not reflect exact attribute values, which can lead to incorrect recommendations.
* Recommendations depend entirely on the user's history: only items similar to those previously watched/liked can be recommended, and the user's new interests are not taken into account.

## <a id='2.3'>2.3 Collaborative Filtering Based</a> 

Collaborative filtering based recommender systems don’t require item attributes or descriptions for recommendations; instead, they work on user-item interactions. These interactions can be measured in various ways, such as ratings, items bought, time spent, shares on another platform, etc. Collaborative filtering helps answer questions such as:

* Which movie to watch?
* Which book to read?
* Which restaurant to go to? 
* Which place to travel to?

**The key task in collaborative filtering is to find the users who are most similar to you.**
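
A minimal sketch of that task, assuming cosine similarity over co-rated items as the similarity measure (one of several possible choices). The users and ratings are hypothetical.

```python
import math

# Hypothetical ratings: user -> {item: rating}.
ratings = {
    "ana":   {"Star Wars": 5, "Fargo": 4, "Contact": 1},
    "bruno": {"Star Wars": 5, "Fargo": 5, "Contact": 2, "Alien": 5},
    "carla": {"Star Wars": 1, "Fargo": 2, "Contact": 5, "Titanic": 5},
}

def similarity(u, v):
    """Cosine similarity computed over the items both users rated."""
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    dot = sum(ratings[u][i] * ratings[v][i] for i in common)
    norm_u = math.sqrt(sum(ratings[u][i] ** 2 for i in common))
    norm_v = math.sqrt(sum(ratings[v][i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

def most_similar(user):
    """Return the other user with the highest similarity to `user`."""
    others = [u for u in ratings if u != user]
    return max(others, key=lambda u: similarity(user, u))

neighbor = most_similar("ana")
# Items the neighbor rated that the active user has not seen yet.
unseen = sorted(set(ratings[neighbor]) - set(ratings["ana"]))
print(neighbor, unseen)  # bruno ['Alien']
```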


**Advantages:** 
* Content information of the item is not required, and recommendations can be made based on valuable user item interactions.
* Personalizing experience based on other users.

**Limitations:**
* Cold Start Problem: If a user has no history of item interactions, the recommender cannot find the k-nearest neighbors for the new user and cannot make recommendations.
* Missing values: Since the number of items is huge and very few users interact with all of them, some items are never rated by users and can’t be recommended.
* Cannot recommend new or unrated items: If an item is new and no user has interacted with it yet, it can’t be recommended to existing users until other users interact with it. 
* Poor Accuracy: Accuracy suffers because many components keep changing, such as the interests of users and the limited shelf life of items, and because items often have very few ratings.



## <a id='2.4'>2.4 Hybrid</a> 
A hybrid recommendation system combines multiple recommender systems; for instance, it can use content-based filtering first and then apply collaborative filtering to the output.
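
One way to read this "filter, then re-rank" combination is as a two-stage cascade. The sketch below assumes both sets of scores already exist; the score values, the `cascade` helper, and its `threshold` parameter are hypothetical.

```python
# Hypothetical scores for one user; both dictionaries are made up for illustration.
content_scores = {"Alien": 0.95, "Top Gun": 0.63, "Titanic": 0.05, "Contact": 0.80}
cf_scores = {"Alien": 3.9, "Top Gun": 4.4, "Titanic": 4.8, "Contact": 4.1}

def cascade(content_scores, cf_scores, threshold=0.5, n=2):
    """Shortlist by content similarity, then rank the shortlist by the CF score."""
    # Stage 1: keep only items the content-based filter considers similar enough.
    shortlist = [item for item, score in content_scores.items() if score >= threshold]
    # Stage 2: order the shortlist by the collaborative filtering prediction.
    return sorted(shortlist, key=lambda item: cf_scores[item], reverse=True)[:n]

hybrid = cascade(content_scores, cf_scores)
print(hybrid)  # ['Top Gun', 'Contact']
```

"Titanic" has the highest collaborative score here but never reaches the second stage, because the first stage filtered it out.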

## <a id='2.5'>2.5 Association Rule Mining Based</a> 
This recommender system uses association rules, mined with algorithms such as Apriori, to recommend the items that are most frequently bought together.
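
The snippet below is not the Apriori algorithm itself; it only sketches the two quantities association rules are usually built on, support and confidence, for pairs of items. The baskets are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase baskets.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "jam"},
]

# How often each item and each (sorted) pair of items appears across baskets.
item_counts = Counter(item for basket in baskets for item in basket)
pair_counts = Counter(
    pair for basket in baskets for pair in combinations(sorted(basket), 2)
)

def support(a, b):
    """Fraction of baskets that contain both a and b."""
    return pair_counts[tuple(sorted((a, b)))] / len(baskets)

def confidence(a, b):
    """Confidence of the rule a -> b: share of baskets with a that also contain b."""
    return pair_counts[tuple(sorted((a, b)))] / item_counts[a]

print(support("bread", "butter"))   # 0.5
print(confidence("milk", "bread"))  # 1.0
```

A rule such as "milk -> bread" with high confidence suggests recommending bread to someone who has milk in the basket.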

# <a id='3'>3. CF based Example</a>
This section focuses on building a **Collaborative Filtering Recommendation System from scratch using the ALS (Alternating Least Squares) method**.

## <a id='3.1'>3.1 Dataset</a> 

The dataset used in this chapter is a subset of the well-known open-source MovieLens dataset and contains a total of 0.1 million (100,000) records with three columns (userId, title, rating).

Put the data in **hdfs**: `hdfs dfs -put movie_ratings_df.csv`

In [None]:
df=spark.read.csv('movie_ratings_df.csv',inferSchema=True,header=True) # Read the dataset

## <a id='3.2'>3.2 Exploratory Data Analysis</a> 

In [None]:
print((df.count(), len(df.columns))) # Number of rows and columns

In [None]:
df.printSchema() # schema

In [None]:
from pyspark.sql.functions import rand

In [None]:
df.orderBy(rand()).show(10,False) # random sample from dataset

In [None]:
df.groupBy('userId').count().orderBy('count',ascending=False).show(10,False)  

In [None]:
# The user with the highest number of records has rated 737 movies.

In [None]:
df.groupBy('userId').count().orderBy('count',ascending=True).show(10,False)

In [None]:
# Each user has rated at least 20 movies

In [None]:
df.groupBy('title').count().orderBy('count',ascending=False).show(10,False)

In [None]:
# The movie with the highest number of ratings is Star Wars (1977), which has been rated 583 times.

## <a id='3.3'>3.3 Feature Engineering</a> 

In [None]:
from pyspark.ml.feature import StringIndexer, IndexToString

In [None]:
# Convert movie title column from categorical to numerical values using StringIndexer
stringIndexer = StringIndexer(inputCol="title", outputCol="title_new") 

In [None]:
model = stringIndexer.fit(df)

In [None]:
indexed = model.transform(df)

In [None]:
indexed.show(10)

In [None]:
indexed.groupBy('title_new').count().orderBy('count',ascending=False).show(10,False)
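
By default, `StringIndexer` orders labels by descending frequency, so the most frequently rated title gets index 0. The pure-Python sketch below imitates that behaviour on a tiny hypothetical list of titles; breaking ties alphabetically is an assumption of this sketch, and `fit_labels` is an illustrative helper, not part of PySpark.

```python
from collections import Counter

# Hypothetical title column.
titles = ["Fargo", "Star Wars", "Contact", "Star Wars", "Fargo", "Star Wars"]

def fit_labels(values):
    """Order distinct labels by descending frequency (ties broken alphabetically)."""
    counts = Counter(values)
    return sorted(counts, key=lambda label: (-counts[label], label))

labels = fit_labels(titles)  # position in this list = numeric index
index_of = {label: float(i) for i, label in enumerate(labels)}

encoded = [index_of[t] for t in titles]        # title -> index, like title_new
decoded = [labels[int(i)] for i in encoded]    # index -> title, like IndexToString

print(labels)   # ['Star Wars', 'Fargo', 'Contact']
print(encoded)  # [1.0, 0.0, 2.0, 0.0, 1.0, 0.0]
```

Keeping the ordered `labels` list is what makes the reverse mapping possible, which is the role `model.labels` plays later in the lesson.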

## <a id='3.4'>3.4 Splitting the Dataset</a> 

In [None]:
train,test=indexed.randomSplit([0.75,0.25])

In [None]:
train.count()

In [None]:
test.count()

## <a id='3.5'>3.5 Build and Train Recommender Model</a> 

In [None]:
from pyspark.ml.recommendation import ALS

ALS finds a K-dimensional feature vector for each user and item such that the dot product of
each user’s feature vector with each item’s feature vector approximates the user’s rating for that
item.

See https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.recommendation.ALS
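
To make the alternating idea concrete, here is a minimal pure-Python sketch for the special case K = 1, where each feature vector is a single number and the dot product is a plain product. It is not Spark's implementation; the ratings, the `solve` helper, and the regularization value are assumptions chosen for illustration.

```python
# Hypothetical observed ratings as (user, item, rating); missing pairs are simply absent.
ratings = [
    (0, 0, 5.0), (0, 1, 3.0),
    (1, 0, 4.0), (1, 2, 1.0),
    (2, 1, 3.0), (2, 2, 1.25),
]
n_users, n_items, reg = 3, 3, 0.01

user_f = [1.0] * n_users  # one feature per user (K = 1)
item_f = [1.0] * n_items  # one feature per item (K = 1)

def solve(fixed, triples, n, reg):
    """Regularized least-squares update of one side while the other side is fixed.

    Each triple is (index_to_update, index_on_fixed_side, rating).
    """
    num = [0.0] * n
    den = [reg] * n
    for target, other, r in triples:
        num[target] += r * fixed[other]
        den[target] += fixed[other] ** 2
    return [a / b for a, b in zip(num, den)]

def rmse():
    """Root mean squared error of the current factors on the observed ratings."""
    squared = [(r - user_f[u] * item_f[i]) ** 2 for u, i, r in ratings]
    return (sum(squared) / len(squared)) ** 0.5

initial_error = rmse()
for _ in range(10):
    # Fix the item features and solve for the users, then swap roles.
    user_f = solve(item_f, [(u, i, r) for u, i, r in ratings], n_users, reg)
    item_f = solve(user_f, [(i, u, r) for u, i, r in ratings], n_items, reg)
final_error = rmse()

print(initial_error, final_error)
```

With K > 1 each update becomes a small linear system per user or item instead of a single division, but the alternation between the two sides stays the same.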

In [None]:
rec=ALS(maxIter=10,regParam=0.01,userCol='userId',itemCol='title_new',ratingCol='rating',nonnegative=True, coldStartStrategy="drop")

In [None]:
rec_model=rec.fit(train)

## <a id='3.6'>3.6 Predictions and Evaluation on Test Data</a> 

In [None]:
predicted_ratings=rec_model.transform(test)

In [None]:
predicted_ratings.printSchema()

In [None]:
predicted_ratings.orderBy(rand()).show(10)

In [None]:
from pyspark.ml.evaluation import RegressionEvaluator

In [None]:
# Metric: root mean squared error (RMSE)
evaluator=RegressionEvaluator(metricName='rmse',predictionCol='prediction',labelCol='rating') 

In [None]:
rmse=evaluator.evaluate(predicted_ratings)

In [None]:
print(rmse)
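
For reference, RMSE is the square root of the mean of the squared differences between actual and predicted ratings. A minimal sketch with hypothetical numbers follows; the function is named `rmse_score` only to avoid clashing with the `rmse` variable above.

```python
import math

def rmse_score(actual, predicted):
    """Root mean squared error between two equal-length lists of ratings."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical actual and predicted ratings.
actual = [4.0, 3.0, 5.0, 1.0]
predicted = [3.5, 3.0, 4.0, 2.0]

score = rmse_score(actual, predicted)
print(score)  # 0.75
```

The value is expressed in the same units as the ratings, so 0.75 here means predictions are off by roughly three quarters of a rating point.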

## <a id='3.7'>3.7 Recommend Top Movies That Active User Might Like</a> 

In [None]:
unique_movies=indexed.select('title_new').distinct()

In [None]:
unique_movies.count()

In [None]:
a = unique_movies.alias('a')

In [None]:
user_id=85

In [None]:
# Filter the movies that this active user has already rated or seen.

watched_movies=indexed.filter(indexed['userId'] ==  user_id).select('title_new').distinct()  

In [None]:
watched_movies.count()

In [None]:
# So, this active user has already rated a total of 287 unique movies out of 1664.

In [None]:
b=watched_movies.alias('b')

In [None]:
# Left join: movies this user has not rated end up with a null b.title_new
total_movies = a.join(b, a.title_new == b.title_new,how='left')

In [None]:
total_movies.show(10,False)

In [None]:
from pyspark.sql.functions import col, lit
remaining_movies=total_movies.where(col("b.title_new").isNull()).select(a.title_new).distinct()

In [None]:
remaining_movies.count()

In [None]:
remaining_movies=remaining_movies.withColumn("userId",lit(int(user_id)))

In [None]:
remaining_movies.show(10,False)

In [None]:
recommendations=rec_model.transform(remaining_movies).orderBy('prediction',ascending=False)

In [None]:
recommendations.show(5,False)
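
The steps in this section (drop the movies already rated, score the rest, sort by predicted rating) can be summarized in plain Python. The movie indices and predicted scores below are hypothetical. In PySpark, the exclusion step could likely also be written with a `left_anti` join instead of a left join followed by a null filter.

```python
# Hypothetical movie indices and model output for a single user.
all_movies = {0, 1, 2, 3, 4}
watched = {0, 3}
predicted = {1: 3.2, 2: 4.7, 4: 4.1}  # predicted rating per unseen movie

# Keep only the movies the user has not rated yet.
remaining = all_movies - watched

# Highest predicted rating first, then take the top two.
top_movies = sorted(remaining, key=lambda m: predicted[m], reverse=True)[:2]
print(top_movies)  # [2, 4]
```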


In [None]:
movie_title = IndexToString(inputCol="title_new", outputCol="title",labels=model.labels)

In [None]:
final_recommendations=movie_title.transform(recommendations)

In [None]:
final_recommendations.show(10,False)

# <a id='4'>4. Exercises</a>

## Choose a user who rated Star Wars (1977) with a rating of 5 and check the highest predictions for that user.

# <a id='5'>5. References</a>

http://file.allitebooks.com/20181215/Machine%20Learning%20with%20PySpark.pdf