# Building Recommendation Engines with PySpark

This course will show you how to build recommendation engines using `Alternating Least Squares` in PySpark. Using the popular MovieLens dataset and the Million Songs dataset, this course will take you step by step through the intuition of the Alternating Least Squares algorithm as well as the code to train, test and implement ALS models on various types of customer data.

## Table of Contents

- [Introduction](#intro)
- [How does ALS work?](#als)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns

path = "data/dc36/"

In [None]:
from pyspark import SparkContext
sc = SparkContext("local", "First App")
print(sc)

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('First App').getOrCreate()

In [None]:
# Return spark version
print(spark.version)

# Return python version
import sys
print(sys.version_info)

---
<a id='intro'></a>

## Why learn how to build recommendation engines?

<img src="images/spark6_001.png" alt="" style="width: 800px;"/>

<img src="images/spark6_002.png" alt="" style="width: 800px;"/>

<img src="images/spark6_003.png" alt="" style="width: 800px;"/>

## See the power of a recommendation engine

Taylor and Jane both like watching movies. Taylor only likes dramas, comedies, and romances. Jane likes only action, adventure, and otherwise exciting films. One of the greatest benefits of `ALS-based recommendation engines` is that they can identify movies or items that users will like, even if they themselves think that they might not like them. Take a look at the movie ratings that Taylor and Jane have provided below. It would stand to reason that their different preferences would generate different recommendations.

- Take a look at TJ_ratings using the .show() method and any other methods you prefer to see how each of them rated the various movies they've seen.
- Input user names into the get_ALS_recs() function provided to see what movies ALS recommends for Jane and Taylor based on the ratings provided. Do the ratings make sense to you?

```
In [1]: TJ_ratings.show()
+---------+--------------------+------+
|user_name|          movie_name|rating|
+---------+--------------------+------+
|   Taylor|            Twilight|   4.9|
|   Taylor|  A Walk to Remember|   4.5|
|   Taylor|        The Notebook|   5.0|
|   Taylor|Raiders of the Lo...|   1.2|
|   Taylor|      The Terminator|   1.0|
|   Taylor|      Mrs. Doubtfire|   1.0|
|     Jane|            Iron Man|   4.8|
|     Jane|Raiders of the Lo...|   4.9|
|     Jane|      The Terminator|   4.6|
|     Jane|           Anchorman|   1.2|
|     Jane|        Pretty Woman|   1.0|
|     Jane|           Toy Story|   1.2|
+---------+--------------------+------+

In [2]: get_ALS_recs(["Taylor","Jane"])
    userId  pred_rating                 title          genres
0   Taylor         3.89   Seven Pounds (2008)           Drama
1   Taylor         3.61      Cure, The (1995)           Drama
2   Taylor         3.55  Kiss Me, Guido (1997          Comedy
3   Taylor         3.29  You've Got Mail (199  Comedy|Romance
4   Taylor         3.27  10 Things I Hate Abo  Comedy|Romance
5   Taylor         3.26  Corrina, Corrina (19  Comedy|Drama|R
6     Jane         4.96           Fear (1996)        Thriller
7     Jane         4.85  Lord of the Rings: T  Adventure|Fant
8     Jane         4.70  Lord of the Rings: T  Adventure|Fant
9     Jane         4.55  No Holds Barred (198          Action
10    Jane         4.54  Lord of the Rings: T  Action|Adventu
11    Jane         4.30  Band of Brothers (20  Action|Drama|W
12    Jane         4.26   Transformers (2007)  Action|Sci-Fi|
```

## Recommendation Engine Types and Data Types

<img src="images/spark6_004.png" alt="" style="width: 800px;"/>

<img src="images/spark6_005.png" alt="" style="width: 800px;"/>

## Collaborative vs Content-Based Filtering Part II

Look at the df dataframe using the .show() method and/or the .columns method, and determine whether it is best suited for "collaborative filtering", "content-based filtering", or "both".

```
In [1]: df.show()
+------+-------+-----------------+--------+--------+-------------+------+
|UserId|MovieId|      Movie_Title|   Genre|Language|Year_Produced|rating|
+------+-------+-----------------+--------+--------+-------------+------+
| User1|   2112|     Finding Nemo|Animated| English|         2003|     3|
| User1|   2113|   The Terminator|  Action| English|         1984|     0|
| User1|   2114|       Spinal Tap|  Satire| English|         1984|     4|
| User1|   2115|Life Is Beautiful|   Drama| Italian|         1998|     4|
| User2|   2112|     Finding Nemo|Animated| English|         2003|     4|
| User2|   2113|   The Terminator|  Action| English|         1984|     0|
| User2|   2114|       Spinal Tap|  Satire| English|         1984|     0|
| User2|   2115|Life Is Beautiful|   Drama| Italian|         1998|     4|
| User3|   2112|     Finding Nemo|Animated| English|         2003|     1|
| User3|   2113|   The Terminator|  Action| English|         1984|     2|
| User3|   2114|       Spinal Tap|  Satire| English|         1984|     1|
| User3|   2115|Life Is Beautiful|   Drama| Italian|         1998|     0|
| User4|   2112|     Finding Nemo|Animated| English|         2003|     3|
| User4|   2113|   The Terminator|  Action| English|         1984|     1|
| User4|   2114|       Spinal Tap|  Satire| English|         1984|     0|
| User4|   2115|Life Is Beautiful|   Drama| Italian|         1998|     0|
+------+-------+-----------------+--------+--------+-------------+------+
```

Possible Answers
- Collaborative filtering
- Content-based filtering
- Both collaborative and content-based filtering (correct)

Because this dataset includes descriptive tags like genre and language, as well as user ratings, it is suited for both collaborative and content-based filtering.

## Implicit vs Explicit Data

Recall the differences between implicit and explicit ratings. Take a look at the df1 dataframe to understand whether the data includes implicit or explicit ratings data.

- Use the .columns and .show() methods to get an idea of the data provided, and see if the data includes implicit or explicit ratings.
- Type "implicit" or "explicit" based on whether you think this data contains "implicit" ratings or "explicit" ratings. Name your response answer.

```
In [1]: df1.columns
Out[1]: ['Movie_Title', 'Genre', 'Num_Views']

In [2]: df1.show()
+--------------------+------------------+---------+
|         Movie_Title|             Genre|Num_Views|
+--------------------+------------------+---------+
|        Finding Nemo|Animated Childrens|       12|
|           Toy Story|Animated Childrens|        6|
|            Iron Man|            Action|        1|
|     Captain America|            Action|        1|
|     The Incredibles|Animated Childrens|        9|
|              Frozen|Animated Childrens|       22|
|The Shawshank Red...|             Drama|        2|
|  Rabbit Proof Fence|             Drama|        2|
|Searching for Sug...|       Documentary|        3|
|              Powder|             Drama|        1|
|        The Fugitive|            Action|        2|
+--------------------+------------------+---------+
```
This dataset includes user behavior counts which are used as implicit ratings.

## Ratings data types

Markus watches a lot of movies, including documentaries, superhero movies, classics, and dramas. Drawing on your previous experience with Spark, use the markus_ratings dataframe, which contains data on the number of times Markus has seen movies in various genres, and think about whether these are implicit or explicit ratings. Use the groupBy() method to determine which genre has the highest rating, which could likely influence what recommendations ALS would generate for Markus.

- Use the groupBy() method to group the markus_ratings dataframe by "Genre".
- Apply the .sum() method to get the total number of movies watched for each genre.
- Be sure to add the .show() method at the end to view the counts.

```
In [1]: markus_ratings.show()
+--------------------+------------------+---------+
|         Movie_Title|             Genre|Num_Views|
+--------------------+------------------+---------+
|        Finding Nemo|Animated Childrens|       12|
|           Toy Story|Animated Childrens|        6|
|            Iron Man|            Action|        1|
|     Captain America|            Action|        1|
|     The Incredibles|Animated Childrens|        9|
|              Frozen|Animated Childrens|       22|
|The Shawshank Red...|             Drama|        2|
|  Rabbit Proof Fence|             Drama|        2|
|Searching for Sug...|       Documentary|        3|
|              Powder|             Drama|        1|
|        The Fugitive|            Action|        2|
+--------------------+------------------+---------+

In [2]: # Group the data by "Genre"
        markus_ratings.groupBy("Genre").sum().show()
+------------------+--------------+
|             Genre|sum(Num_Views)|
+------------------+--------------+
|             Drama|             5|
|       Documentary|             3|
|            Action|             4|
|Animated Childrens|            49|
+------------------+--------------+
```
Markus seems to like animated children's movies. Or perhaps his 3 kids use his movie streaming account more than he does.

## Uses for Recommendation Engines

<img src="images/spark6_006.png" alt="" style="width: 800px;"/>

<img src="images/spark6_007.png" alt="" style="width: 800px;"/>

<img src="images/spark6_008.png" alt="" style="width: 800px;"/>

<img src="images/spark6_009.png" alt="" style="width: 800px;"/>

`Latent features` aren't directly observable by humans, and need mathematical operations to uncover them.

## Confirm understanding of latent features

Matrix P is provided here. Its columns represent movies and its rows represent several latent features. Use your understanding of Spark commands to view matrix P and see if you can determine what some of the latent features might represent. After examining the matrix, look at the dataframe Pi, which contains a rough approximation of what these latent features could represent, and see how close your guesses were.

- Examine matrix P using the .show() method.
- Examine matrix Pi using the .show() method.

```
<script.py> output:
    +--------+------------+--------+---------+------------+------+----------+
    |Iron Man|Finding Nemo|Avengers|Toy Story|Forrest Gump|Wall-E|Green Mile|
    +--------+------------+--------+---------+------------+------+----------+
    |     0.2|         2.4|     0.1|      2.4|           0|   2.5|         0|
    |     1.5|         1.4|     1.4|      1.3|         1.8|   1.8|       2.5|
    |     2.5|         1.1|     2.4|      0.9|         0.2|   0.9|      0.09|
    |     1.9|           2|     1.5|      2.2|         1.2|   0.3|      0.01|
    |       0|           0|       0|      2.3|         2.2|     0|       2.5|
    +--------+------------+--------+---------+------------+------+----------+
    
    +---------+--------+------------+--------+---------+------------+------+----------+
    | Lat Feat|Iron Man|Finding Nemo|Avengers|Toy Story|Forrest Gump|Wall-E|Green Mile|
    +---------+--------+------------+--------+---------+------------+------+----------+
    | Animated|     0.2|         2.4|     0.1|      2.4|           0|   2.5|         0|
    |    Drama|     1.5|         1.4|     1.4|      1.3|         1.8|   1.8|       2.5|
    |Superhero|     2.5|         1.1|     2.4|      0.9|         0.2|   0.9|      0.09|
    |   Comedy|     1.9|           2|     1.5|      2.2|         1.2|   0.3|      0.01|
    |Tom Hanks|       0|           0|       0|      1.8|         2.2|     0|       2.5|
    +---------+--------+------------+--------+---------+------------+------+----------+
```

---
<a id='als'></a>

## How does ALS work?

<img src="images/spark6_010.png" alt="" style="width: 800px;"/>

<img src="images/spark6_011.png" alt="" style="width: 800px;"/>

<img src="images/spark6_012.png" alt="" style="width: 800px;"/>

<img src="images/spark6_013.png" alt="" style="width: 800px;"/>

<img src="images/spark6_014.png" alt="" style="width: 800px;"/>

<img src="images/spark6_015.png" alt="" style="width: 800px;"/>

<img src="images/spark6_016.png" alt="" style="width: 800px;"/>

<img src="images/spark6_017.png" alt="" style="width: 800px;"/>

<img src="images/spark6_018.png" alt="" style="width: 800px;"/>

<img src="images/spark6_019.png" alt="" style="width: 800px;"/>

## Matrix Multiplication

To understand matrix multiplication more directly, let's do some matrix operations manually.

- Matrices a and b are Pandas dataframes. Use the .head() method on each of them to view them.
- Work out the product of these two matrices on your own.
- Enter the values of the product of the a and b matrices into the product array, created using np.array().
- Use the validation on the last line of code to evaluate your estimate. The np.dot() function multiplies two matrices together.

```
In [1]: print("Matrix A: ")
        print (a.head())
        
        print("Matrix B: ")
        print (b.head())
Matrix A: 
     0  1
One  2  2
Two  3  3
Matrix B: 
     0  1
One  1  2
Two  4  4

In [2]: product = np.array([[10,12], [15,18]])

In [3]: product == np.dot(a,b)
Out[3]: 
array([[ True,  True],
       [ True,  True]])
```
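The by-hand step above can be sketched in code. The following is a minimal illustration, assuming the values of `a` and `b` printed above are copied into plain NumPy arrays; `manual_matmul` is a hypothetical helper written for this note, not part of the course environment.

```python
import numpy as np

# Values copied from the printed matrices A and B above.
a = np.array([[2, 2], [3, 3]])
b = np.array([[1, 2], [4, 4]])


def manual_matmul(x, y):
    """Multiply two 2-D arrays with explicit loops (illustrative helper)."""
    n_rows, n_inner = x.shape
    n_inner_y, n_cols = y.shape
    if n_inner != n_inner_y:
        raise ValueError("columns of x must equal rows of y")
    out = np.zeros((n_rows, n_cols), dtype=x.dtype)
    for i in range(n_rows):
        for j in range(n_cols):
            # Entry (i, j) is the dot product of row i of x and column j of y.
            out[i, j] = sum(x[i, k] * y[k, j] for k in range(n_inner))
    return out


product = manual_matmul(a, b)
print(product)
```

For example, the top-left entry is 2 × 1 + 2 × 4 = 10, which matches the `product` array entered in the exercise.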

## Matrix Multiplication Part II

Let's put your matrix multiplication skills to the test.

- Use the .shape attribute to understand the dimensions of matrices C and D, and determine whether these two matrices can be multiplied together or not.
- If they can be multiplied, use the np.matmul() function to multiply them. If not, set C_times_D to None.

```
In [1]: print(C.shape)
        
        # Print the dimensions of D
        print(D.shape)
(4, 5)
(3, 2)

In [2]: C_times_D = np.matmul()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
    C_times_D = np.matmul()
ValueError: invalid number of arguments

In [3]: C_times_D = None
```
C has 5 columns but D has only 3 rows. Because the number of columns in C differs from the number of rows in D, C and D cannot be multiplied.
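The shape rule can be wrapped in a small guard. This is a sketch only: `matmul_or_none` is a hypothetical helper, and the arrays below are placeholders filled with ones that merely reproduce the shapes printed above (the actual contents of C and D are not shown in the exercise). `E` is an extra made-up matrix added to show a compatible case.

```python
import numpy as np


def matmul_or_none(left, right):
    """Return left @ right, or None when the inner dimensions do not match."""
    # A product exists only if columns of `left` equal rows of `right`.
    if left.shape[1] != right.shape[0]:
        return None
    return np.matmul(left, right)


# Placeholder matrices with the shapes from the exercise.
C = np.ones((4, 5))
D = np.ones((3, 2))
C_times_D = matmul_or_none(C, D)  # shapes (4, 5) and (3, 2) are incompatible

# A made-up matrix whose row count matches C's column count.
E = np.ones((5, 3))
C_times_E = matmul_or_none(C, E)  # result has shape (4, 3)
```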

## Overview of Matrix Factorization

<img src="images/spark6_020.png" alt="" style="width: 800px;"/>

<img src="images/spark6_021.png" alt="" style="width: 800px;"/>

<img src="images/spark6_022.png" alt="" style="width: 800px;"/>

## Matrix Factorization

Matrix G is provided here as a Pandas dataframe. View it to understand what it looks like. Look at the possible factor matrices H, I, and J (also Pandas dataframes), and determine which two matrices will produce the matrix G when multiplied together.

- Take a look at matrix G using the print command
- Take a look at the matrices H, I, and J and determine which pair of those matrices will produce G when multiplied together.
- Input your answer into the np.matmul() code provided.

```
# Take a look at Matrix G using the following print function
print("Matrix G:")
print(G)

# Take a look at the matrices H, I, and J and determine which pair of those matrices will produce G when multiplied together. 
print("Matrix H:")
print(H)
print("Matrix I:")
print(I)
print("Matrix J:")
print(J)

# Multiply the two matrices that are factors of the matrix G
prod = np.matmul(H, J)
print(G == prod)

<script.py> output:
    Matrix G:
       0  1
    0  6  6
    1  3  3
    Matrix H:
       0  1
    0  2  2
    1  1  1
    Matrix I:
       0  1
    0  3  3
    1  3  3
    Matrix J:
       0  1
    0  1  1
    1  2  2
          0     1
    0  True  True
    1  True  True
```
Matrices H and J are factors of G.
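Rather than checking each pair by eye, every ordered pair of candidates can be tested. The sketch below assumes the printed values of G, H, I, and J are copied into NumPy arrays; the `candidates` dictionary and `matches` list are names introduced here for illustration.

```python
import itertools

import numpy as np

# Values copied from the printed matrices above.
G = np.array([[6, 6], [3, 3]])
candidates = {
    "H": np.array([[2, 2], [1, 1]]),
    "I": np.array([[3, 3], [3, 3]]),
    "J": np.array([[1, 1], [2, 2]]),
}

# Matrix multiplication is not commutative, so order matters:
# permutations() yields both (H, J) and (J, H).
matches = [
    (left, right)
    for left, right in itertools.permutations(candidates, 2)
    if np.array_equal(candidates[left] @ candidates[right], G)
]
print(matches)
```

Only the ordered pair H followed by J reproduces G; reversing the order gives a different matrix.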

## Non-Negative Matrix Factorization

It's possible for one matrix to have two equally close factorizations, where one contains only non-negative values and the other contains some negative values.

The matrix M has been factored twice using two different factorizations. Take a look at each pair of factor matrices L and U, and W and H to see the differences. Then use their products to see that they produce essentially the same product.

- Use print() to view the L and U matrices. Notice that some values in matrices L and U are negative.
- Use print() to view the W and H matrices. Notice that all values in these two matrices are non-negative (zero or positive).
- The L and U matrices and W and H matrices have been multiplied together to produce the LU and WH matrices respectively. Use getRMSE(product_matrix, original_matrix) to see how close LU is to M compared to how close WH is to M. Are they similar?

```
# View the L, U, W, and H matrices.
print("Matrices L and U:") 
print(L)
print(U)

print("Matrices W and H:")
print(W)
print(H)

# Calculate RMSE between LU and M
print("RMSE of LU: ", getRMSE(LU, M))

# Calculate RMSE between WH and M
print("RMSE of WH: ", getRMSE(WH, M))

<script.py> output:
    Matrices L and U:
          0         1         2  3
    0  1.00  0.000000  0.000000  0
    1  0.01 -0.421053  0.098316  1
    2  1.00  0.000000  1.000000  0
    3  0.10  1.000000  0.000000  0
       0     1      2         3
    0  1  2.00  1.000  2.000000
    1  0 -0.19 -0.099 -0.198000
    2  0  0.00  1.000 -1.000000
    3  0  0.00  0.000  0.194947
    Matrices W and H:
          0     1     2     3
    0  2.61  0.24  0.00  0.12
    1  0.00  0.05  0.02  0.17
    2  1.97  0.00  0.58  0.83
    3  0.05  0.00  0.00  0.00
          0     1     2     3
    0  0.38  0.65  0.34  0.41
    1  0.00  1.20  0.15  3.72
    2  0.42  1.09  1.38  0.07
    3  0.00  0.11  0.65  0.17
    RMSE of LU:  0.072
    RMSE of WH:  0.072
```
Did you notice that LU and WH are essentially the same product, even though L and U contain some negative values while W and H contain only non-negative values?
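The `getRMSE()` helper is supplied by the course environment and its source is not shown. A plausible stand-in, assuming it computes the root of the mean squared element-wise difference, could look like the sketch below. Both `get_rmse` and `is_non_negative` are hypothetical names, and the small matrices are made up for illustration.

```python
import numpy as np


def get_rmse(pred_matrix, actual_matrix):
    """Root mean squared error between two equally shaped matrices."""
    pred = np.asarray(pred_matrix, dtype=float)
    actual = np.asarray(actual_matrix, dtype=float)
    return float(np.sqrt(np.mean((pred - actual) ** 2)))


def is_non_negative(matrix):
    """True when every entry is zero or positive."""
    return bool((np.asarray(matrix) >= 0).all())


# Made-up matrices: only one of four entries differs, by 2.
actual = np.array([[1.0, 2.0], [3.0, 4.0]])
pred = np.array([[1.0, 2.0], [3.0, 6.0]])
example_rmse = get_rmse(pred, actual)  # sqrt((0 + 0 + 0 + 4) / 4)
```

Applying `is_non_negative` to each factor matrix is a quick way to tell a factorization like W and H apart from one like L and U.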

## How ALS Alternates to Generate Predictions

<img src="images/spark6_023.png" alt="" style="width: 800px;"/>

<img src="images/spark6_024.png" alt="" style="width: 800px;"/>

<img src="images/spark6_025.png" alt="" style="width: 800px;"/>

<img src="images/spark6_026.png" alt="" style="width: 800px;"/>

<img src="images/spark6_027.png" alt="" style="width: 800px;"/>

<img src="images/spark6_028.png" alt="" style="width: 800px;"/>

<img src="images/spark6_029.png" alt="" style="width: 800px;"/>

<img src="images/spark6_030.png" alt="" style="width: 800px;"/>

<img src="images/spark6_031.png" alt="" style="width: 800px;"/>

<img src="images/spark6_032.png" alt="" style="width: 800px;"/>

<img src="images/spark6_033.png" alt="" style="width: 800px;"/>

## Estimating Recommendations

Use your knowledge of matrix multiplication to determine which movie will have the highest recommendation for User_3. The ratings matrix has been factorized into U and P with ALS.

- View the left factor matrix, U, using the print function.
- Did you see anything interesting about User_3? Now inspect the right factor matrix, P. Use the print function.

```
# View left factor matrix
print(U)

<script.py> output:
            U_LF_1  U_LF_2  U_LF_3  U_LF_4
    User_1    0.80    0.01    0.30     0.8
    User_2    0.40    0.01    0.06     0.2
    User_3    0.05    2.10    0.01     2.2
    User_4    0.30    0.01    0.20     0.2
    User_5    0.10    1.50    0.90     0.0
    User_6    0.00    0.03    0.40     0.5
    User_7    0.01    0.02    0.66     0.4
    User_8    0.90    0.70    0.00     1.0
    User_9    1.00    2.00    0.04     0.2
    
# View right factor matrix
print(P)

<script.py> output:
            Movie_1  Movie_2  Movie_3  Movie_4
    P_LF_1      0.5      0.1      0.4     1.10
    P_LF_2      0.2      2.0      0.0     0.01
    P_LF_3      0.3      1.9      0.6     0.90
    P_LF_4      1.0      0.2      1.0     0.89
```

- Looking at U and P, which movie do you think will have the highest recommendation for User_3?
- Multiply U and P using numpy's matmul to obtain recommendations. Call this UP.
- Complete the code to print UP.

```
# Multiply factor matrices
UP = np.matmul(U,P)

# Convert to pandas DataFrame
print(pd.DataFrame(UP, columns = P.columns, index = U.index))

<script.py> output:
            Movie_1  Movie_2  Movie_3  Movie_4
    User_1      NaN      NaN      NaN      NaN
    User_2      NaN      NaN      NaN      NaN
    User_3      NaN      NaN      NaN      NaN
    User_4      NaN      NaN      NaN      NaN
    User_5      NaN      NaN      NaN      NaN
    User_6      NaN      NaN      NaN      NaN
    User_7      NaN      NaN      NaN      NaN
    User_8      NaN      NaN      NaN      NaN
    User_9      NaN      NaN      NaN      NaN
```

Did you guess Movie 2? It has the highest predicted rating at 4.664 out of 5.

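Note: the table printed above shows NaN instead of the predicted ratings. The most likely cause is label alignment: if np.matmul(U, P) returns a labeled pandas object, wrapping it in pd.DataFrame with P's column names reindexes it and finds no matching labels. The 4.664 figure can still be checked by hand as the dot product of User_3's row in U and the Movie_2 column in P: 0.05 × 0.1 + 2.10 × 2.0 + 0.01 × 1.9 + 2.2 × 0.2 = 4.664.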
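One way to get numeric predictions rather than NaN is to multiply the underlying NumPy arrays and attach the labels afterwards. The sketch below rebuilds U and P from the values printed above; using `.to_numpy()` is a suggested workaround under that assumption, not the course's own solution.

```python
import numpy as np
import pandas as pd

# Factor matrices rebuilt from the printed output above.
U = pd.DataFrame(
    [
        [0.80, 0.01, 0.30, 0.8],
        [0.40, 0.01, 0.06, 0.2],
        [0.05, 2.10, 0.01, 2.2],
        [0.30, 0.01, 0.20, 0.2],
        [0.10, 1.50, 0.90, 0.0],
        [0.00, 0.03, 0.40, 0.5],
        [0.01, 0.02, 0.66, 0.4],
        [0.90, 0.70, 0.00, 1.0],
        [1.00, 2.00, 0.04, 0.2],
    ],
    index=[f"User_{i}" for i in range(1, 10)],
    columns=[f"U_LF_{k}" for k in range(1, 5)],
)
P = pd.DataFrame(
    [
        [0.5, 0.1, 0.4, 1.10],
        [0.2, 2.0, 0.0, 0.01],
        [0.3, 1.9, 0.6, 0.90],
        [1.0, 0.2, 1.0, 0.89],
    ],
    index=[f"P_LF_{k}" for k in range(1, 5)],
    columns=[f"Movie_{j}" for j in range(1, 5)],
)

# Multiply the raw arrays so pandas labels cannot interfere,
# then label rows with users and columns with movies.
UP = pd.DataFrame(
    np.matmul(U.to_numpy(), P.to_numpy()),
    index=U.index,
    columns=P.columns,
)

# Movie with the highest predicted rating for User_3.
best_movie = UP.loc["User_3"].idxmax()
print(best_movie, UP.loc["User_3", best_movie])
```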

## RMSE As ALS Alternates

As you know, ALS will alternate between the two factor matrices, adjusting their values each time to iteratively come closer and closer to approximating the original ratings matrix. This exercise is intended to illustrate this to you.

Matrix T is a ratings matrix, and matrices F1, F2, F3, F4, F5, and F6 are the products of the ALS factor matrices after 1, 2, 3, 4, 5, and 6 iterations respectively. Follow the instructions below to see how the RMSE changes as ALS iterates.

- Use getRMSE(pred_matrix, actual_matrix) to calculate the RMSE of F1.
- Put F2, F3, F4, F5, and F6 into one list called Fs.
- Complete the getRMSEs(listOfPredMatrices, actualValues) function to calculate the RMSE of each matrix in the Fs list.

```
# Use getRMSE(preds, actuals) to calculate the RMSE of matrices T and F1.
getRMSE(F1, T)

# Create list of F2, F3, F4, F5, and F6
Fs = [F2, F3, F4, F5, F6]

# Calculate RMSEs for F2 - F6
getRMSEs(Fs, T)

<script.py> output:
    F1:  2.4791263858912522
    F2: 0.4389326310548279
    F3: 0.17555006757053257
    F4: 0.15154042416388636
    F5: 0.13191130368008455
    F6: 0.04533823201006271

```
Do you see how the RMSE gets smaller and smaller as ALS continues to iterate?
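The alternation itself can be sketched in a few lines of NumPy. This is a deliberately simplified illustration and not Spark's implementation: it assumes every rating in the matrix is observed and uses plain unregularized least squares, whereas real ALS fits only the observed entries and adds a regularization term. The matrix `T` below is made up, and `als_rmse_trace` is a hypothetical helper.

```python
import numpy as np


def rmse(pred, actual):
    """Root mean squared error between two equally shaped matrices."""
    return float(np.sqrt(np.mean((pred - actual) ** 2)))


def als_rmse_trace(T, rank=2, n_iter=6, seed=0):
    """Alternate least-squares solves for U and P; return RMSE per iteration."""
    rng = np.random.default_rng(seed)
    T = np.asarray(T, dtype=float)
    U = rng.random((T.shape[0], rank))
    trace = []
    for _ in range(n_iter):
        # Hold U fixed and solve for the P that minimizes ||U P - T||.
        P = np.linalg.lstsq(U, T, rcond=None)[0]
        # Hold P fixed and solve for the U that minimizes ||U P - T||.
        U = np.linalg.lstsq(P.T, T.T, rcond=None)[0].T
        trace.append(rmse(U @ P, T))
    return trace


# Made-up, fully observed ratings matrix (5 users x 4 movies).
T = np.array(
    [
        [5, 3, 1, 1],
        [4, 3, 1, 1],
        [1, 1, 5, 4],
        [1, 2, 4, 5],
        [3, 3, 3, 3],
    ]
)
rmse_trace = als_rmse_trace(T)
for i, value in enumerate(rmse_trace, start=1):
    print(f"Iteration {i}: RMSE = {value}")
```

Because each solve minimizes the squared error with the other factor held fixed, the error in this simplified setting cannot increase from one step to the next.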

## Data Preparation for Spark ALS

<img src="images/spark6_034.png" alt="" style="width: 800px;"/>

<img src="images/spark6_035.png" alt="" style="width: 800px;"/>

<img src="images/spark6_036.png" alt="" style="width: 800px;"/>

<img src="images/spark6_037.png" alt="" style="width: 800px;"/>

<img src="images/spark6_038.png" alt="" style="width: 800px;"/>

<img src="images/spark6_039.png" alt="" style="width: 800px;"/>

<img src="images/spark6_040.png" alt="" style="width: 800px;"/>

<img src="images/spark6_041.png" alt="" style="width: 800px;"/>

<img src="images/spark6_042.png" alt="" style="width: 800px;"/>

<img src="images/spark6_043.png" alt="" style="width: 800px;"/>

<img src="images/spark6_044.png" alt="" style="width: 800px;"/>

<img src="images/spark6_045.png" alt="" style="width: 800px;"/>

<img src="images/spark6_046.png" alt="" style="width: 800px;"/>

<img src="images/spark6_047.png" alt="" style="width: 800px;"/>

## Correct format and distinct users

Take a look at the R dataframe. Notice that it is in conventional or `"wide"` format with a different movie in each column. Also notice that the user and movie names are not in integer format. Follow the steps to properly prepare this data for ALS.

- Import the monotonically_increasing_id function from pyspark.sql.functions and view the R dataframe using the .show() method.
- Use the to_long() function to convert the R dataframe into a `"long"` data frame. Call the new dataframe ratings.
- Create a dataframe called users that contains all the .distinct() users from the dataframe and repartition the dataframe into one partition using the .coalesce(1) method.
- Use the monotonically_increasing_id() function inside withColumn() to create a new column in the users dataframe that contains a unique integer for each user. Call this column userId. Be sure to call the .persist() method on the final dataframe so that the generated integer IDs are cached and stay consistent across later operations.

```
# Import monotonically_increasing_id and show R
from pyspark.sql.functions import monotonically_increasing_id
R.show()

# Use the to_long() function to convert the dataframe to the "long" format.
ratings = to_long(R)
ratings.show()

# Get unique users and repartition to 1 partition
users = ratings.select("User").distinct().coalesce(1)

# Create a new column of unique integers called "userId" in the users dataframe.
users = users.withColumn("userId", monotonically_increasing_id()).persist()
users.show()

<script.py> output:
    +----------------+-----+----+----------+--------+
    |            User|Shrek|Coco|Swing Kids|Sneakers|
    +----------------+-----+----+----------+--------+
    |    James Alking|    3|   4|         4|       3|
    |Elvira Marroquin|    4|   5|      null|       2|
    |      Jack Bauer| null|   2|         2|       5|
    |     Julia James|    5|null|         2|       2|
    +----------------+-----+----+----------+--------+
    
    +----------------+----------+------+
    |            User|     Movie|Rating|
    +----------------+----------+------+
    |    James Alking|     Shrek|     3|
    |    James Alking|      Coco|     4|
    |    James Alking|Swing Kids|     4|
    |    James Alking|  Sneakers|     3|
    |Elvira Marroquin|     Shrek|     4|
    |Elvira Marroquin|      Coco|     5|
    |Elvira Marroquin|  Sneakers|     2|
    |      Jack Bauer|      Coco|     2|
    |      Jack Bauer|Swing Kids|     2|
    |      Jack Bauer|  Sneakers|     5|
    |     Julia James|     Shrek|     5|
    |     Julia James|Swing Kids|     2|
    |     Julia James|  Sneakers|     2|
    +----------------+----------+------+
    
    +----------------+------+
    |            User|userId|
    +----------------+------+
    |Elvira Marroquin|     0|
    |      Jack Bauer|     1|
    |    James Alking|     2|
    |     Julia James|     3|
    +----------------+------+
```
Each user now has a unique integer assigned to it.
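The `to_long()` helper is provided by the course and its source is not shown. To make the wide-to-long reshaping concrete, here is a pandas analogue (not Spark code) built from the R values printed above. `melt()` and `factorize()` stand in for `to_long()` and `monotonically_increasing_id()` respectively; this is an assumption made for illustration, and the IDs they assign may differ from the ones Spark produced.

```python
import pandas as pd

# Wide ratings table rebuilt from the printed R dataframe (None = null).
R = pd.DataFrame(
    {
        "User": ["James Alking", "Elvira Marroquin", "Jack Bauer", "Julia James"],
        "Shrek": [3, 4, None, 5],
        "Coco": [4, 5, 2, None],
        "Swing Kids": [4, None, 2, 2],
        "Sneakers": [3, 2, 5, 2],
    }
)

# One row per (User, Movie) pair, dropping pairs with no rating.
ratings = (
    R.melt(id_vars="User", var_name="Movie", value_name="Rating")
    .dropna(subset=["Rating"])
    .reset_index(drop=True)
)

# Assign one integer per distinct user, in order of first appearance.
ratings["userId"] = pd.factorize(ratings["User"])[0]
print(ratings)
```

The long table has 13 rows because 3 of the 16 user-movie cells in R are null.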

## Assigning integer IDs to movies

Let's do the same thing to the movies. Then let's join the new user IDs and movie IDs into one dataframe.

- Use the .select() and the .distinct() methods to extract all unique Movies from the ratings dataframe.
- Repartition the movies dataframe to one partition using coalesce().
- Complete the partial code provided to assign unique integer IDs to each movie. Name the new column movieId and call the .persist() method on the resulting dataframe.
- Join the ratings dataframe to the users dataframe and subsequently to the movies dataframe. Call the result movie_ratings.

```
# Extract the distinct movie id's
movies = ratings.select("Movie").distinct() 

# Repartition the data to have only one partition.
movies = movies.coalesce(1) 

# Create a new column of movieId integers. 
movies = movies.withColumn("movieId", monotonically_increasing_id()).persist() 

# Join the ratings, users and movies dataframes
movie_ratings = ratings.join(users, "User", "left").join(movies, "Movie", "left")
movie_ratings.show()

<script.py> output:
    +----------+----------------+------+------+-------+
    |     Movie|            User|Rating|userId|movieId|
    +----------+----------------+------+------+-------+
    |     Shrek|    James Alking|     3|     2|      3|
    |      Coco|    James Alking|     4|     2|      1|
    |Swing Kids|    James Alking|     4|     2|      2|
    |  Sneakers|    James Alking|     3|     2|      0|
    |     Shrek|Elvira Marroquin|     4|     0|      3|
    |      Coco|Elvira Marroquin|     5|     0|      1|
    |  Sneakers|Elvira Marroquin|     2|     0|      0|
    |      Coco|      Jack Bauer|     2|     1|      1|
    |Swing Kids|      Jack Bauer|     2|     1|      2|
    |  Sneakers|      Jack Bauer|     5|     1|      0|
    |     Shrek|     Julia James|     5|     3|      3|
    |Swing Kids|     Julia James|     2|     3|      2|
    |  Sneakers|     Julia James|     2|     3|      0|
    +----------+----------------+------+------+-------+
```
You now have distinct userId and movieId values that are integer data types.

## ALS Parameters and Hyperparameters

<img src="images/spark6_048.png" alt="" style="width: 800px;"/>

<img src="images/spark6_049.png" alt="" style="width: 800px;"/>

<img src="images/spark6_050.png" alt="" style="width: 800px;"/>

<img src="images/spark6_051.png" alt="" style="width: 800px;"/>

<img src="images/spark6_052.png" alt="" style="width: 800px;"/>

## Build Out An ALS Model

Let's specify your first ALS model. Complete the code below to build your first ALS model.

Recall that you can use the .columns attribute on the ratings dataframe to see the names of the columns that contain user, movie, and ratings data. Spark needs to know the names of these columns in order to perform ALS correctly.

- Before building our ALS model, we need to split the data into training data and test data. Use the randomSplit() method to split the ratings dataframe into training_data and test_data using a 0.8/0.2 split respectively and a seed for the random number generator of 42.
- Tell Spark which columns contain the userCol, itemCol and ratingCol. Use the .columns method if needed. Complete the hyperparameters. Set the rank to 10, the maxIter to 15, the regParam or lambda to .1, the coldStartStrategy to "drop", the nonnegative argument should be set to True, and since our data contains explicit ratings, set the implicitPrefs argument to False.
- Now fit the als model to the training_data portion of the ratings data by calling the als.fit() method on the training_data provided. Call the fitted model model.
- Generate predictions on the test_data portion of the ratings data by calling the model.transform() method on the test_data provided. Call the predictions test_predictions. Feel free to view the predictions by calling the .show() method on test_predictions.

```
In [1]: ratings.show(5)
+------+-------+------+
|userId|movieId|rating|
+------+-------+------+
|     2|      3|   3.0|
|     2|      1|   4.0|
|     2|      2|   4.0|
|     2|      0|   3.0|
|     0|      3|   4.0|
+------+-------+------+
only showing top 5 rows

# Split the ratings dataframe into training and test data
(training_data, test_data) = ratings.randomSplit([0.8, 0.2], seed=42)

# Set the ALS hyperparameters
from pyspark.ml.recommendation import ALS
als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating", rank = 10, maxIter = 15, regParam = .1,
          coldStartStrategy="drop", nonnegative = True, implicitPrefs = False)

# Fit the model to the training_data
model = als.fit(training_data)

# Generate predictions on the test_data
test_predictions = model.transform(test_data)
test_predictions.show()

<script.py> output:
    +------+-------+------+----------+
    |userId|movieId|rating|prediction|
    +------+-------+------+----------+
    |     0|      1|   5.0| 2.4220588|
    |     0|      3|   4.0| 2.5182998|
    |     1|      0|   5.0| 1.4837145|
    +------+-------+------+----------+
```
You just built out your first ALS model and generated some test predictions. It's a toy dataset, so the results aren't particularly meaningful, but you now know how to do this.

## Build RMSE Evaluator

Now that you know how to fit a model to training data and generate test predictions, you need a way to evaluate how well your model performs. For this we'll build an evaluator. Evaluators in Spark can be built out in various ways. For our purposes, we want a RegressionEvaluator that calculates the RMSE. After we build our RegressionEvaluator, we can fit the model to our data and generate predictions.

- Import the required RegressionEvaluator class from the pyspark.ml.evaluation module.
- Complete the evaluator code, specifying the metric name to be "rmse". Set the labelCol to the name of the column in our ratings data that contains our ratings (use the ratings.columns attribute to see column names) and set the prediction column name to "prediction".
- Confirm that the evaluator was properly created by extracting each of the three parameters from it. Do this by running the following 3 lines of code, each within a print statement:
    - evaluator.getMetricName()
    - evaluator.getLabelCol()
    - evaluator.getPredictionCol()

```
# Import RegressionEvaluator
from pyspark.ml.evaluation import RegressionEvaluator

# Complete the evaluator code
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")

# Extract the 3 parameters
print(evaluator.getMetricName())
print(evaluator.getLabelCol())
print(evaluator.getPredictionCol())

<script.py> output:
    rmse
    rating
    prediction
```
You now know how to build a model, generate predictions, and build an evaluator to tell you how well the model predicted the test values.

## Get RMSE

Now that you know how to build a model and generate predictions, and have an evaluator to tell you how well it predicts ratings, you can calculate the RMSE to see how well an ALS model performed. We'll use the evaluator that we built in the previous exercise to calculate and print the RMSE.

- Call the .evaluate() method on our evaluator to calculate our RMSE on the test_predictions dataframe. Call the result RMSE.
- Print the RMSE

```
In [1]: test_predictions.show(5)
+------+-------+------+----------+
|userId|movieId|rating|prediction|
+------+-------+------+----------+
|     1|      1|   2.0| 2.0127134|
|     2|      1|   4.0| 4.0069914|
|     0|      1|   5.0| 4.7319183|
|     3|      3|   5.0|  4.730774|
|     2|      3|   3.0| 2.9880555|
+------+-------+------+----------+
only showing top 5 rows

# Evaluate the "test_predictions" dataframe
RMSE = evaluator.evaluate(test_predictions)

# Print the RMSE
print (RMSE)

<script.py> output:
    0.16853197489754093
```
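This RMSE means that the model's test predictions are typically about 0.17 away from the true ratings. RMSE squares the errors before averaging, so large misses count for more than they would in a plain average of absolute errors.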
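To see what the evaluator computes, the RMSE can be worked out by hand on the five rows shown by `test_predictions.show(5)` above. The sketch below uses only those five rows, so its result is close to, but not the same as, the RMSE reported for the full test set. The mean absolute error (MAE) is included purely for comparison and is not part of the exercise.

```python
import numpy as np

# The five (rating, prediction) pairs printed by test_predictions.show(5).
ratings = np.array([2.0, 4.0, 5.0, 5.0, 3.0])
predictions = np.array([2.0127134, 4.0069914, 4.7319183, 4.730774, 2.9880555])

errors = predictions - ratings

# RMSE: square the errors, average them, then take the square root.
rmse = float(np.sqrt(np.mean(errors ** 2)))

# MAE: plain average of the absolute errors, shown for comparison.
mae = float(np.mean(np.abs(errors)))

print("RMSE on the five rows:", rmse)
print("MAE on the five rows: ", mae)
```

On these rows the RMSE is noticeably larger than the MAE because two of the five predictions miss by roughly 0.27 while the other three are nearly exact, and squaring gives the larger misses more weight.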

In [None]:
# Terminate the cluster
spark.stop()

<img src="images/spark6_053.png" alt="" style="width: 800px;"/>
