<img width="200" style="float:left" 
     src="https://upload.wikimedia.org/wikipedia/commons/f/f3/Apache_Spark_logo.svg" />

<img style="display: block;max-height:100px;float:left" src="https://upload.wikimedia.org/wikipedia/commons/thumb/0/08/Netflix_2015_logo.svg/2560px-Netflix_2015_logo.svg.png" />

# Sections
* [Description](#0)
* [1. Setup](#1)
  * [1.1 Start Hadoop](#1.1)  
  * [1.2 Search for Spark Installation](#1.2)
  * [1.3 Create SparkSession](#1.3)
* [2. Lab](#2)
  * [2.1 Check Lab Files](#2.1)
* [3. Data Exploration](#3)  
* [4. Collaborative Filtering](#4)
* [5. Recommendations](#5)
* [6. Tear Down](#6)
  * [6.1 Stop Hadoop](#6.1)

<a id='0'></a>
## Description
<p>
<p>One of the most common uses of big data is to predict what users want. 
This allows Google to show you relevant ads, Amazon to recommend relevant products, and Netflix to recommend movies that you might like. 
</p>
In this lab we will use Apache Spark to recommend movies to a user.
<div>The goals for this lab are:</div>
<ul>    
    <li>Practice the Spark ML API</li>
    <li>Explore the dataset</li>
    <li>Build a Collaborative Filtering model</li>
    <li>Make customized movie predictions for you 😉</li>
</ul>    
</p>

[YouTube Video](https://www.youtube.com/watch?v=FgGjc5oabrA)


<a id='1'></a>
## 1. Setup

Since we are going to process data stored in HDFS, let's start the service first.

<a id='1.1'></a>
### 1.1 Start Hadoop

Start Hadoop

Open a terminal and execute
```sh
hadoop-start.sh
```

<a id='1.2'></a>
### 1.2 Search for Spark Installation 
This step is required just because we are working in the course environment.

In [None]:
import findspark
findspark.init()

Set the pandas maximum column width to `None` so that long values are displayed in full.

In [None]:
import pandas as pd
pd.set_option('display.max_colwidth', None)

<a id='1.3'></a>
### 1.3 Create SparkSession

By setting this environment variable we can include extra libraries in our Spark cluster.<br/>

In [None]:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = ' pyspark-shell'

The first step is always to create the SparkSession.

In [None]:
from pyspark.sql.session import SparkSession

spark = (SparkSession.builder
    .appName("Movielens - Movie Recommender - MLlib")
    .config("spark.sql.warehouse.dir","hdfs://localhost:9000/warehouse")
    .enableHiveSupport()
    .getOrCreate())

<a id='2'></a>
## 2. Lab

https://www.youtube.com/watch?v=FgGjc5oabrA

<a id='2.1'></a>
### 2.1 Check Lab Files

To complete this lab, you must first upload the datasets to HDFS.<br/>

Check that the data is ready in HDFS:

http://localhost:50070/explorer.html#/datalake/std/movielens/ratings/

http://localhost:50070/explorer.html#/datalake/std/movielens/movies/

<a id='3'></a>
## 3. Data Exploration

We're going to be accessing this data a lot. 

Rather than reading it from the source over and over again, we'll cache both the movies DataFrame and the ratings DataFrame in the executors' memory.

In [None]:
movies = spark.read.parquet("hdfs://localhost:9000/datalake/std/movielens/movies/").cache()
print(f"There are {movies.count()} movies in the dataset")

In [None]:
ratings = spark.read.parquet("hdfs://localhost:9000/datalake/std/movielens/ratings/").cache()
print(f"There are {ratings.count()} ratings in the dataset")

Let's take a quick look at some of the data in the two DataFrames.

In [None]:
movies.printSchema()

In [None]:
movies.limit(5).toPandas()

In [None]:
ratings.printSchema()

In [None]:
ratings.limit(5).toPandas()

<a id='4'></a>
## 4. Collaborative Filtering

Before we jump into using machine learning, we need to break up the `ratings` dataset into two DataFrames:
* A training set, which we will use to train models
* A test set, which we will use for our experiments

To randomly split the dataset into multiple groups, we can use the [randomSplit()](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.randomSplit.html?highlight=randomsplit#pyspark.sql.DataFrame.randomSplit) transformation. `randomSplit()` takes a list of weights and a seed and returns multiple DataFrames. Use the seed given below.

In [None]:
# We'll use 80% for training and hold out 20% for testing
seed = 42
(trainingDF, testDF) = ratings.randomSplit([0.8, 0.2], seed=seed)

print(f"Training: {trainingDF.count()}, test: {testDF.count()}")
trainingDF.show(3)
testDF.show(3)
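
The two counts printed above will not be exactly 80% and 20% of the data, because `randomSplit()` assigns rows at random according to the weights rather than cutting the dataset at a fixed position. The pure-Python sketch below illustrates the idea on a plain list. The helper `random_split` is hypothetical and simplified; it is not Spark's actual implementation.

```python
import random
from itertools import accumulate

def random_split(rows, weights, seed):
    """Assign each row to one split by drawing a uniform random number (hypothetical helper)."""
    total = sum(weights)
    # Cumulative upper bound of each split, normalised to the range [0, 1].
    bounds = [w / total for w in accumulate(weights)]
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for index, upper in enumerate(bounds):
            if draw < upper:
                splits[index].append(row)
                break
        else:
            # Guard against floating-point rounding in the last bound.
            splits[-1].append(row)
    return splits

rows = list(range(1000))
training, test = random_split(rows, [0.8, 0.2], seed=42)
print(f"Training: {len(training)}, test: {len(test)}")
```

Running the sketch twice with the same seed yields the same two lists, which is why the lab fixes the seed.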

### 4.1 Baseline Model

Let's calculate the average movie rating in our dataset to use as our baseline model.

Because we are trying to predict a rating (a number), this is a **regression** problem, and we need a regression metric to evaluate the performance of the model.

We are going to use the **RMSE** (**R**oot **M**ean **S**quared **E**rror). The lower the error, the better the model.
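
As a quick illustration of the metric: RMSE is the square root of the mean of the squared differences between predicted and actual ratings. The sketch below computes it in plain Python on a handful of made-up ratings (not taken from the dataset), using their mean as a constant prediction in the spirit of the baseline model.

```python
import math

def rmse(predictions, labels):
    """Root mean squared error between two equally long sequences."""
    squared_errors = [(p - y) ** 2 for p, y in zip(predictions, labels)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up ratings, for illustration only.
labels = [4.0, 3.0, 5.0, 2.0]
average = sum(labels) / len(labels)   # the constant "model"
baseline = [average] * len(labels)    # predict the average for every row

print(f"Baseline RMSE: {rmse(baseline, labels):.3}")
```

Because the errors are squared before averaging, a few large misses weigh more heavily than many small ones.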

Let's calculate it for the baseline model:

In [None]:
from pyspark.ml.evaluation import RegressionEvaluator

rmseEvaluator = RegressionEvaluator(predictionCol="prediction", labelCol="rating", metricName="rmse")

In [None]:
from pyspark.sql.functions import lit, avg

averageRating = trainingDF.select(avg("rating")).first()[0]

baselineDF = trainingDF.withColumn("prediction", lit(averageRating))

baselineRmse = rmseEvaluator.evaluate(baselineDF)

print(f"Baseline RMSE: {baselineRmse:.3}")

### 4.2 Alternating Least Squares

Now we will use the Apache Spark ML Pipeline implementation of Alternating Least Squares, [ALS (Python)](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.recommendation.ALS.html?highlight=als#pyspark.ml.recommendation.ALS). 

To determine the best values for the hyperparameters, we will use ALS to train several models, and then we will select the best model and use the parameters from that model in the rest of this lab exercise.

The process we will use for determining the best model is as follows:
1. Pick a set of model parameters. The most important parameter to tune is the *rank*, which is the number of columns in the Users matrix or the number of rows in the Movies matrix. In general, a lower rank will mean higher error on the training dataset, but a high rank may lead to [overfitting](https://en.wikipedia.org/wiki/Overfitting). We will train models with ranks of 4 and 12 using the `trainingDF` dataset.


2. Set the appropriate parameters on the `ALS` object:
    * The "User" column will be set to the values in our `userId` DataFrame column.
    * The "Item" column will be set to the values in our `movieId` DataFrame column.
    * The "Rating" column will be set to the values in our `rating` DataFrame column.
    * `nonnegative` = True (whether to use nonnegative constraint for least squares)
    * `regParam` = 0.1.
    
   **Note**: Read the documentation for the ALS class **carefully**. It will help you accomplish this step.
   

3. Create multiple models using the `ParamGridBuilder` and the `CrossValidator`, one for each of our rank values.


4. We'll keep the model with the lowest error rate. Such a model will be selected automatically by the CrossValidator.
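
To make the role of *rank* concrete: ALS learns one vector of length *rank* per user and one per movie, and the predicted rating is the dot product of the two. The sketch below uses hand-picked rank-2 factors (hypothetical values, not learned from the data) to show how ratings are reconstructed from the two factor matrices.

```python
# Hypothetical rank-2 factors: one vector per user, one vector per movie.
user_factors = {
    "alice": [1.0, 2.0],
    "bob":   [2.0, 0.5],
}
movie_factors = {
    "Alien":        [1.0, 1.5],
    "Forrest Gump": [2.0, 0.0],
}

def predict(user, movie):
    """Predicted rating = dot product of the user and movie factor vectors."""
    return sum(u * m for u, m in zip(user_factors[user], movie_factors[movie]))

for user in user_factors:
    for movie in movie_factors:
        print(f"{user} -> {movie}: {predict(user, movie):.2f}")
```

A higher rank gives each vector more components with which to fit the observed ratings, which is why it can lower the training error and also increase the risk of overfitting.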

In [None]:
from pyspark.ml.recommendation import ALS

als = ALS(userCol="userId",
          itemCol="movieId",
          ratingCol="rating",
          maxIter=5,
          seed=seed,
          coldStartStrategy="drop",
          regParam=0.1,
          nonnegative=True)

<a id='4.3'></a>
### 4.3. Model Selection

Now that we have initialized the algorithm, we need to fit it to our training data, and evaluate how well it does on the validation dataset. 

Let's create a `CrossValidator` and `ParamGridBuilder` that will decide whether *rank* value *4* or *12* gives a lower *RMSE*.  

NOTE: This cell may take a few minutes to run.

In [None]:
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

grid = (ParamGridBuilder()
        .addGrid(als.rank, [4, 12]) 
        .build())

cv = CrossValidator(numFolds=3, estimator=als, estimatorParamMaps=grid, evaluator=rmseEvaluator, seed=seed)          

cvModel = cv.fit(trainingDF)
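
Conceptually, the `CrossValidator` splits the training data into `numFolds` parts, trains each candidate on all but one part, scores it on the held-out part, and averages the scores per candidate. The plain-Python sketch below mimics that procedure with two toy "models" (predict the training mean or the training median). The helper names and the ratings are hypothetical and serve only to show the mechanics.

```python
import math
import random
import statistics

def constant_rmse(prediction, labels):
    """RMSE of a single constant prediction against a list of labels."""
    return math.sqrt(sum((prediction - y) ** 2 for y in labels) / len(labels))

def k_fold_average_metrics(data, candidates, num_folds, seed):
    """Return the average validation RMSE of each candidate over num_folds folds."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::num_folds] for i in range(num_folds)]
    averages = []
    for fit in candidates.values():
        scores = []
        for i, validation in enumerate(folds):
            # Train on every fold except the i-th one.
            training = [x for j, fold in enumerate(folds) if j != i for x in fold]
            scores.append(constant_rmse(fit(training), validation))
        averages.append(sum(scores) / num_folds)
    return averages

# Made-up ratings, for illustration only.
ratings_sample = [5, 4, 4, 3, 5, 1, 4, 2, 5, 3, 4, 4]
candidates = {"mean": statistics.mean, "median": statistics.median}

avg_metrics = k_fold_average_metrics(ratings_sample, candidates, num_folds=3, seed=42)
best_name = min(zip(avg_metrics, candidates))[1]
print(dict(zip(candidates, avg_metrics)), "-> best:", best_name)
```

Here `avg_metrics` plays the role of `cvModel.avgMetrics`: one averaged score per entry in the parameter grid, in grid order.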

Now we have the model ready

In [None]:
type(cvModel)

The grid contains two values for *rank*, so `avgMetrics` holds two entries: the RMSE of each candidate averaged over the three folds. Let's check them.

In [None]:
cvModel.avgMetrics

The second model has a slightly lower RMSE. Let's check that it corresponds to rank=12.

In [None]:
bestModel = cvModel.bestModel
print(f"The best model was trained with rank {bestModel.rank}")

### 4.4 Model Evaluation

So far, we used the `trainingDF` dataset to evaluate the two models (baseline and ALS).

Since we used this dataset to determine what model is best, we cannot use it to test how good the model is; otherwise, we would be very vulnerable to [overfitting](https://en.wikipedia.org/wiki/Overfitting).

To decide how good our model is, we need to use the `testDF` dataset.  

We will use the best model we just created for predicting the ratings for the test dataset and then we will compute the RMSE.

The steps you should perform are:
* Run a prediction, using `bestModel` as created above, on the test dataset (`testDF`), producing a new `predictionsBestModelDF` DataFrame.
* Use the previously created RMSE evaluator, `rmseEvaluator`, to evaluate the resulting DataFrame.

In [None]:
predictionsBestModelDF = bestModel.transform(testDF)

# Run the previously created RMSE evaluator
alsRMSE = rmseEvaluator.evaluate(predictionsBestModelDF)

print(f"ALS RMSE: {alsRMSE:.3}")

In [None]:
predictionBaselineModelDF = testDF.withColumn("prediction", lit(averageRating))

baselineRMSE = rmseEvaluator.evaluate(predictionBaselineModelDF)

print(f"Baseline RMSE: {baselineRMSE:.3}")

<a id='5'></a>
## 5. Recommendations

The last part of this lab exercise is to predict which movies to recommend to you.  

To do that, you first need to add your own ratings to the dataset.

### Your Movie Ratings

To help you provide ratings for yourself, I have included the following code to list the names and movieIds of the 100 highest-rated movies that have more than 100 ratings.

In [None]:
movies.createOrReplaceTempView("movies")
ratings.createOrReplaceTempView("ratings")

In [None]:
top100RatedMovies = spark.sql("""
                                SELECT r.movieId, m.title, AVG(rating) AS avg_rating, COUNT(*) AS num_ratings
                                FROM ratings r JOIN movies m ON (r.movieId = m.movieId)
                                GROUP BY r.movieId, m.title
                                HAVING COUNT(*) > 100
                                ORDER BY avg_rating DESC
                                LIMIT 100
                                """)

# Uncomment one of the following lines to change how many rows pandas displays
# pd.set_option('display.max_rows', None)
# pd.set_option('display.max_rows', 20)
top100RatedMovies.toPandas()

The user ID 0 is unassigned, so we will use it for your ratings. We set the variable `myUserId` to 0 for you. 

Next, create a new DataFrame called `myRatingsDF` with your ratings for at least 10 movies. Each entry should be formatted as `(myUserId, movieId, rating, timestamp)`. As in the original dataset, ratings should be between 1 and 5 (inclusive).

If you have not seen at least 10 of these movies, you can increase the parameter passed to `LIMIT` in the above cell until there are 10 movies that you have seen (or you can also guess what your rating would be for movies you have not seen).
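
Before building the DataFrame, it can help to sanity-check your list. The helper below is a hypothetical convenience, not part of the lab: it checks that there are enough entries in the `(userId, movieId, rating, timestamp)` shape used in the next cell, that ratings fall between 1 and 5, and that no movie is rated twice.

```python
from datetime import datetime

def validate_ratings(rows, min_count=10):
    """Return a list of problems found in (userId, movieId, rating, timestamp) rows."""
    problems = []
    if len(rows) < min_count:
        problems.append(f"only {len(rows)} ratings, expected at least {min_count}")
    seen = set()
    for _user_id, movie_id, rating, _timestamp in rows:
        if not 1 <= rating <= 5:
            problems.append(f"movie {movie_id}: rating {rating} is outside 1..5")
        if movie_id in seen:
            problems.append(f"movie {movie_id} is rated more than once")
        seen.add(movie_id)
    return problems

# Deliberately faulty sample: too short, one rating of 6, one duplicated movieId.
example_time = datetime.now()
sample = [(0, 1214, 5, example_time), (0, 480, 6, example_time), (0, 1214, 4, example_time)]
for problem in validate_ratings(sample):
    print(problem)
```

An empty list means the entries passed all three checks.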

In [None]:
from datetime import datetime
myUserId = 0
now = datetime.now()
myRatedMovies = [
     (myUserId, 1214, 5, now), # Alien
     (myUserId, 480,  5, now), # Jurassic Park
     (myUserId, 260, 5, now),  # Star Wars: Episode IV - A New Hope
     (myUserId, 541, 5, now),  # Blade Runner
     (myUserId, 2571, 5, now), # Matrix, The
     (myUserId, 296,  5, now), # Pulp Fiction
     (myUserId, 356,  5, now), # Forrest Gump     
     (myUserId, 593, 5, now),  # Silence of the Lambs, The
]

myRatingsDF = spark.createDataFrame(myRatedMovies, ['userId', 'movieId', 'rating','timestamp'])
myRatingsDF.toPandas()

In [None]:
movies.join(myRatingsDF,"movieId").toPandas()

###  Add Your Movies to Training Dataset

Now that you have ratings for yourself, you need to add your ratings to the `trainingDF` dataset so that the model you train will incorporate your preferences.

In [None]:
trainingWithMyRatingsDF = trainingDF.unionByName(myRatingsDF)

countDiff = trainingWithMyRatingsDF.count() - trainingDF.count()
print(f"The training dataset now has {countDiff} more entries than the original training dataset")
assert (countDiff == myRatingsDF.count())

### Train a Model with Your Ratings

Now, train a model with your ratings added, using the parameters from sections 4.2 and 4.3. Make sure you include all of the parameters.

Note: This cell will take about 1 minute to run.

In [None]:
als.setRank(12)
myRatingsModel = als.fit(trainingWithMyRatingsDF)

### Predict Your Ratings

Now that we have trained a new model, let's predict what ratings you would give to the movies that you did not already provide ratings for. The code below filters out all of the movies you have rated, and creates a `predictedRatingsDF` DataFrame of the predicted ratings for all of your unseen movies.

In [None]:
# Create a list of the movieIds I rated
myRatedMovieIds = [x[1] for x in myRatedMovies]

# Filter out the movies I already rated.
notRatedDF = movies.filter(~ movies['movieId'].isin(myRatedMovieIds))

# Add a column with myUserId as "userId".
myUnratedMoviesDF = notRatedDF.withColumn('userId', lit(myUserId))       

# Use myRatingsModel to predict ratings for the movies that I did not manually rate.
predictedRatingsDF = myRatingsModel.transform(myUnratedMoviesDF)

In [None]:
predictedRatingsDF.createOrReplaceTempView("predictions")

In [None]:
predictedRatingsDF.limit(5).toPandas()

Let's create two more DataFrames to get links and trailers for the movies

In [None]:
links = spark.read.parquet("hdfs://localhost:9000/datalake/std/movielens/links/").cache()
trailers = spark.read.parquet("hdfs://localhost:9000/datalake/std/movielens/trailers/").cache()

links.createOrReplaceTempView("links")
trailers.createOrReplaceTempView("trailers")

In [None]:
links.limit(5).toPandas()

In [None]:
trailers.limit(5).toPandas()

Let's create a function to get the recommendations for any user.

In [None]:
def get_recs(userId, recs_number=5):
    query = f"""
        SELECT a.movieId,
               a.title,
               a.your_predicted_rating,
               l.imdbUrl,
               l.tmdbUrl,
               t.youtubeUrl
        FROM
        (SELECT p.movieId,
                p.title,               
                p.prediction AS your_predicted_rating
        FROM ratings r 
        INNER JOIN predictions p ON (r.movieId = p.movieId)        
        WHERE p.userId = {userId}
        GROUP BY p.movieId, p.title, p.prediction
        HAVING COUNT(*) > 75
        ORDER BY p.prediction DESC
        LIMIT {recs_number}
        ) a
        LEFT JOIN links l ON a.movieId=l.movieId
        LEFT JOIN trailers t ON a.movieId=t.movieId
        """
    return spark.sql(query)

Now print out the 10 movies with the highest predicted ratings for you

In [None]:
myRecs = get_recs(userId=0,recs_number=10)
myRecs.toPandas()

### Display Your Recommendations (Optional)

To display the recommendations, we are going to fetch the movie posters from the IMDb website.<br/>
This requires the Beautiful Soup library. Open a terminal and execute the following command:

```sh
pip3 install beautifulsoup4
```

In [None]:
import urllib.request
from bs4 import BeautifulSoup
from IPython.display import HTML, display

def fetch_movie_poster(url):
    html = urllib.request.urlopen(url).read()
    # Name the parser explicitly to avoid the "no parser was specified" warning
    soup = BeautifulSoup(html, "html.parser")
    for meta in soup.find_all("meta"):
        if 'property' in meta.attrs and meta.attrs['property'] == "og:image":
            return meta.attrs['content']
    return None

def display_posters(recs):
    html = "<table><tr>"
    for rec in recs:
        url=fetch_movie_poster(rec.imdbUrl)
        html+=f'<td>'
        html+=f'<img src="{url}" width="100"/>'
        html+=f'<a style="text-align: center" href="{rec.imdbUrl}">(IMDB)</a><br/>'
        html+=f'<a style="text-align: center" href="{rec.tmdbUrl}">(TMDB)</a>'
        html+='</td>'
    html+= "</tr></table>"    
    display(HTML(html)) 
    
def display_recs(recs):
    html = ""
    for rec in recs:        
        html += f'<iframe src="{rec.youtubeUrl}"></iframe>'
    display(HTML(html))

In [None]:
display_posters(myRecs.collect())

In [None]:
display_recs(myRecs.collect())
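
If installing `beautifulsoup4` is not an option, the `og:image` lookup done by `fetch_movie_poster` can be approximated with the standard library alone. The sketch below is an alternative under stated assumptions, not the lab's method: the class and function names are hypothetical, and it is shown on an inline HTML string rather than a live page.

```python
from html.parser import HTMLParser

class OgImageParser(HTMLParser):
    """Remember the content of the first <meta property="og:image"> tag."""

    def __init__(self):
        super().__init__()
        self.image_url = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if (tag == "meta"
                and attributes.get("property") == "og:image"
                and self.image_url is None):
            self.image_url = attributes.get("content")

def extract_og_image(html_text):
    """Return the og:image URL found in html_text, or None if there is none."""
    parser = OgImageParser()
    parser.feed(html_text)
    return parser.image_url

# Inline sample page with a placeholder URL, for illustration only.
sample_html = """
<html><head>
  <meta property="og:title" content="Example"/>
  <meta property="og:image" content="https://example.com/poster.jpg"/>
</head><body></body></html>
"""
print(extract_og_image(sample_html))
```

To use it on a real page, decode the bytes returned by `urllib.request.urlopen(url).read()` to a string before passing them to `extract_og_image`.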

<a id='6'></a>
## 6. Tear Down

Once we complete the lab, we can stop all the services.

<a id='6.1'></a>
### 6.1 Stop Hadoop

Stop Hadoop

Open a terminal and execute
```sh
hadoop-stop.sh
```