# How does ALS work?

In this track you will review basic concepts of matrix multiplication and matrix factorization, and dive into how the Alternating Least Squares algorithm works and what arguments and hyperparameters it uses to return the best recommendations possible. You will also learn important techniques for properly preparing your data for ALS in Spark.

## Preparing the environment

### Importing libraries

In [1]:
import numpy as np
import pandas as pd

from typing import List
from environment import SEED
from sklearn.decomposition import NMF
from pyspark.sql.types import (StructType, StructField,
                               DoubleType, IntegerType, StringType, TimestampType)
from pyspark.sql import SparkSession, Row, DataFrame as SparkDataframe, functions as F
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

### Connect to Spark

In [2]:
spark = (SparkSession.builder
                     .master('local[*]')
                     .appName('spark_application')
                     .config("spark.sql.repl.eagerEval.enabled", True)  # eval DataFrame in notebooks
                     .getOrCreate())

sc = spark.sparkContext
print(f'Spark version: {spark.version}')

Spark version: 3.5.1


## Loading data

### Links

In [3]:
# Reading the file
schema_links = StructType([
    StructField("movieId", IntegerType()),
    StructField("imdbId", StringType()),
    StructField("tmdbId", IntegerType())
])
links_data = spark.read.csv('data-sources/links.csv', header=True, schema=schema_links)

# Reviewing the result
links_data.createOrReplaceTempView("links")
print(f'Dataframe shape: ({links_data.count()}, {len(links_data.columns)})')
links_data.printSchema()
links_data.limit(2)

Dataframe shape: (87585, 3)
root
 |-- movieId: integer (nullable = true)
 |-- imdbId: string (nullable = true)
 |-- tmdbId: integer (nullable = true)



movieId,imdbId,tmdbId
1,114709,862
2,113497,8844


### Movies

In [4]:
# Reading the file
schema_movies = StructType([
    StructField("movieId", IntegerType()),
    StructField("title", StringType()),
    StructField("genres", StringType())
])
movies_data = spark.read.csv('data-sources/movies.csv', header=True, schema=schema_movies)

# Reviewing the result
movies_data.createOrReplaceTempView("movies")
print(f'Dataframe shape: ({movies_data.count()}, {len(movies_data.columns)})')
movies_data.printSchema()
movies_data.limit(2)

Dataframe shape: (87585, 3)
root
 |-- movieId: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- genres: string (nullable = true)



movieId,title,genres
1,Toy Story (1995),Adventure|Animati...
2,Jumanji (1995),Adventure|Childre...


### Ratings

In [5]:
# Reading the file
schema_ratings = StructType([
    StructField("userId", IntegerType()),
    StructField("movieId", IntegerType()),
    StructField("rating", DoubleType()),
    StructField("timestamp", IntegerType())
])
ratings_data = spark.read.csv('data-sources/ratings.csv', header=True, schema=schema_ratings)

# Cleaning and mutating some columns
ratings_data = ratings_data.withColumn('timestamp', F.to_timestamp(F.from_unixtime('timestamp')))
date_range = ratings_data.select('timestamp').agg(F.min('timestamp'), F.max('timestamp')).collect()[0]
print(f"Date range: {date_range[0]} - {date_range[1]}")

# Taking just last month
ratings_data = ratings_data.where('timestamp >= "2023-09-12"')

# Reviewing the result
ratings_data.createOrReplaceTempView("ratings")
print(f'Dataframe shape: ({ratings_data.count()}, {len(ratings_data.columns)})')
ratings_data.printSchema()
ratings_data.limit(2)

Date range: 1995-01-09 05:46:44 - 2023-10-12 20:29:07
Dataframe shape: (92234, 4)
root
 |-- userId: integer (nullable = true)
 |-- movieId: integer (nullable = true)
 |-- rating: double (nullable = true)
 |-- timestamp: timestamp (nullable = true)



userId,movieId,rating,timestamp
28,7458,3.5,2023-09-22 12:42:52
28,285593,3.0,2023-09-22 21:20:05


### Tags

In [6]:
# Reading the file
schema_tags = StructType([
    StructField("userId", IntegerType()),
    StructField("movieId", IntegerType()),
    StructField("tag", StringType()),
    StructField("timestamp", IntegerType())
])
tags_data = spark.read.csv('data-sources/tags.csv', header=True, schema=schema_tags)

# Cleaning and mutating some columns
tags_data = tags_data.withColumn('timestamp', F.to_timestamp(F.from_unixtime('timestamp')))

# Reviewing the result
tags_data.createOrReplaceTempView("tags")
print(f'Dataframe shape: ({tags_data.count()}, {len(tags_data.columns)})')
tags_data.printSchema()
tags_data.limit(2)

Dataframe shape: (2000072, 4)
root
 |-- userId: integer (nullable = true)
 |-- movieId: integer (nullable = true)
 |-- tag: string (nullable = true)
 |-- timestamp: timestamp (nullable = true)



userId,movieId,tag,timestamp
22,26479,Kevin Kline,2020-02-29 23:01:26
22,79592,misogyny,2020-02-11 20:58:17


### Tables catalogue

In [7]:
spark.catalog.listTables()

[Table(name='links', catalog=None, namespace=[], description=None, tableType='TEMPORARY', isTemporary=True),
 Table(name='movies', catalog=None, namespace=[], description=None, tableType='TEMPORARY', isTemporary=True),
 Table(name='ratings', catalog=None, namespace=[], description=None, tableType='TEMPORARY', isTemporary=True),
 Table(name='tags', catalog=None, namespace=[], description=None, tableType='TEMPORARY', isTemporary=True)]

## User defined functions

### getRMSE

In [8]:
def getRMSE(pred, actual):
    """Returns the RMSE between predictions and actual observations.

    Parameters:
        pred: pandas dataframe or 2D numpy array of predicted values
        actual: pandas dataframe or 2D numpy array of the values that pred tries to predict
    Returns: RMSE value rounded to 3 decimal places
    """
    RMSE = (((pred - actual)**2).sum().sum()/(pred.shape[0]*pred.shape[1]))**.5
    return round(RMSE, 3)

### to_long

In [9]:
def to_long(df: SparkDataframe, 
            cols_a: List[str]=['User'], 
            col_b: str='Movie', 
            col_c: str='Rating') -> SparkDataframe:
    """ 
    Converts a traditional or "wide" dataframe into a "row-based" dataframe, 
    also known as a "dense" or "long" dataframe.
    
    Parameters:
    - df: Spark dataframe in wide format
    - cols_a: list of column names that identify each row and are kept as-is
    - col_b: name of the second column, which will hold the names of the remaining columns (all except cols_a)
    - col_c: name of the third column, which will hold the dataset values
    
    Returns: Row-based dataframe (cols_a, col_b, col_c) with no null values
    """
    cols = [c for c in df.columns if c not in cols_a]
    
    # Create and explode an array of (column_name, column_value) structs
    kvs = F.explode(F.array([
        F.struct(F.lit(c).alias(col_b), F.col(c).alias(col_c)) 
        for c in cols
    ])).alias("kvs")
    return (df.select(cols_a + [kvs])
              .select(cols_a + [f"kvs.{col_b}", f"kvs.{col_c}"])
              .filter(f"{col_c} IS NOT NULL"))

# Knowledge Base

## Matrix multiplication: `np.dot` vs `np.matmul`

To multiply two matrices, the number of columns in the first array needs to be equal to the number of rows in the second array.
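
A minimal sketch of the shape rule, using small arrays whose names and values are made up for illustration. It also shows where `np.dot` and `np.matmul` stop agreeing: for 2-D inputs they return the same result, but for stacked (3-D) inputs `np.matmul` multiplies matrix by matrix along the leading axis, while `np.dot` combines every matrix in the first stack with every matrix in the second.

```python
import numpy as np

# Hypothetical example arrays (not part of the course data)
x = np.arange(6).reshape(2, 3)    # shape (2, 3)
y = np.arange(12).reshape(3, 4)   # shape (3, 4)

# Columns of x (3) match rows of y (3), so the product has shape (2, 4)
xy = np.matmul(x, y)
same_in_2d = np.array_equal(xy, np.dot(x, y))

# Reversing the order breaks the rule: y has 4 columns, x has 2 rows
try:
    np.matmul(y, x)
    shape_error = False
except ValueError:
    shape_error = True

# With 3-D inputs the two functions return different shapes
s = np.ones((2, 2, 3))
t = np.ones((2, 3, 2))
matmul_shape = np.matmul(s, t).shape   # one product per pair of stacked matrices
dot_shape = np.dot(s, t).shape         # every matrix in s with every matrix in t

print(xy.shape, same_in_2d, shape_error, matmul_shape, dot_shape)
```

Since every matrix in this notebook is 2-D, the two functions can be used interchangeably here.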

### Ex. 1 - Matrix multiplication

To understand matrix multiplication more directly, let's do some matrix operations manually.

**Instructions:**

1. Matrices `a` and `b` are Pandas dataframes. Review them.
2. Work out the product of these two matrices on your own.
3. Enter the values of the product of the `a` and `b` matrices into the `product` array, created using `np.array()`.
4. Use the validation on the last line of code to evaluate your estimate. The `.dot()` method multiplies two matrices together.

In [10]:
a = pd.DataFrame([[2, 2], [3, 3]], index=['One', 'Two'])
b = pd.DataFrame([[1, 2], [4, 4]], index=['One', 'Two'])

print(f'''
Matrix a:
{a}

Matrix b:
{b}
''')

# Complete the matrix with the product of matrices a and b
product = np.array([[2*1 + 2*4, 2*2 + 2*4], 
                    [3*1 + 3*4, 3*2 + 3*4]])
print(f'''
product:
{product}
''')

# Run this validation to see how your estimate performs
# For 2D arrays, np.dot is equal to np.matmul, np.matmul requires arrays not dataframes
print(f'''
Manual operation equal?: {np.array_equal(product, np.dot(a, b))}
np.matmul equal to np.dot?: {np.array_equal(np.matmul(a.values, b.values), np.dot(a, b))}
''')


Matrix a:
     0  1
One  2  2
Two  3  3

Matrix b:
     0  1
One  1  2
Two  4  4


product:
[[10 12]
 [15 18]]


Manual operation equal?: True
np.matmul equal to np.dot?: True



In [11]:
# Another example
a = pd.DataFrame([[1, 2, 3], [4, 5, 6]])
b = pd.DataFrame([[7, 8], [9, 10], [11, 12]])

print(f'''
Matrix a:
{a.to_string(index=False, header=False)}

Matrix b:
{b.to_string(index=False, header=False)}
''')

# Complete the matrix with the product of matrices a and b
product = np.array([[1*7 + 2*9 + 3*11, 1*8 + 2*10 + 3*12],
                    [4*7 + 5*9 + 6*11, 4*8 + 5*10 + 6*12]])
print(f'''
product:
{product}
''')

# Run this validation to see how your estimate performs
# For 2D arrays, np.dot is equal to np.matmul, np.matmul requires arrays not dataframes
print(f'''
Manual operation equal?: {np.array_equal(product, np.dot(a, b))}
np.matmul equal to np.dot?: {np.array_equal(np.matmul(a.values, b.values), np.dot(a, b))}
''')


Matrix a:
1 2 3
4 5 6

Matrix b:
 7  8
 9 10
11 12


product:
[[ 58  64]
 [139 154]]


Manual operation equal?: True
np.matmul equal to np.dot?: True



### Ex. 2 - Non-negative matrix factorization

It's possible for one matrix to have two equally close factorizations where one has only non-negative values and the other has some negative values.

The matrix `M` has been factored twice using two different factorizations. Take a look at each pair of factor matrices `L` and `U`, and `W` and `H` to see the differences. Then use their products to see that they produce essentially the same product.

**Instructions:**

1. Use `print()` to view the `L` and `U` matrices. Notice that some values in matrices `L` and `U` are negative.
2. Use `print()` to view the `W` and `H` matrices. Notice that all values in these two matrices are non-negative.
3. The `L` and `U` matrices and `W` and `H` matrices have been multiplied together to produce the `LU` and `WH` matrices respectively. Use `getRMSE(product_matrix, original_matrix)` to see how close `LU` is to `M` compared to how close `WH` is to `M`. Are they similar?

In [12]:
# Setting the matrix
M = np.array([[1, 2, 1, 2],
              [0, 0, 0, 0],
              [1, 2, 2, 1],
              [0, 0, 0, 0]])

# Setting the factorized result of matrix M - possibility 1
L = np.array([[1.00, 0.000000, 0.000000, 0],
              [0.01, -0.421053, 0.098316, 1],
              [1.00, 0.000000, 1.000000, 0],
              [0.10, 1.000000, 0.000000, 0]])
U = np.array([[1, 2.00, 1.000, 2.000000],
              [0, -0.19, -0.099, -0.198000],
              [0, 0.00, 1.000, -1.000000],
              [0, 0.00, 0.000, 0.194947]])

# Setting the factorized result of matrix M - possibility 2
W = np.array([[2.61, 0.24, 0.00, 0.12],
              [0.00, 0.05, 0.02, 0.17],
              [1.97, 0.00, 0.58, 0.83],
              [0.05, 0.00, 0.00, 0.00]])

H = np.array([[0.38, 0.65, 0.34, 0.41],
              [0.00, 1.20, 0.15, 3.72],
              [0.42, 1.09, 1.38, 0.07],
              [0.00, 0.11, 0.65, 0.17]])

# Making the matrix multiplication of the possible factors
LU = np.dot(L, U)
WH = np.dot(W, H)

# Calculate RMSE between LU and M
print("How far is LU from M? RMSE:", getRMSE(LU, M))

# Calculate RMSE between WH and M
print("How far is WH from M? RMSE:", getRMSE(WH, M))

How far is LU from M? RMSE: 0.072
How far is WH from M? RMSE: 0.071


### Matrix Factorization - Using `numpy.linalg.svd`
```
A == np.dot(US, V)
```

where:
```
U, S, V = np.linalg.svd(A)
US = U * S
```

In [13]:
A = np.array([[1, 2, 1, 2],
              [0, 0, 0, 0],
              [1, 2, 2, 1],
              [0, 0, 0, 0]])

U, S, V = np.linalg.svd(A, full_matrices=True)  # If True (default), u and vh have the shapes (..., M, M) 
                                                # and (..., N, N), respectively.
                                                # Otherwise, the shapes are (..., M, K) and (..., K, N),
                                                # respectively, where K = min(M, N).

# Print the results
print("U * S:\n", U * S)
print("V:\n", V)

# Calculate RMSE between USV and A
USV = np.dot(U * S, V)
print("\nHow far is USV from A? RMSE:", getRMSE(USV, A))
np.allclose(A, USV)

U * S:
 [[-3.082207   -0.70710678  0.          0.        ]
 [ 0.          0.          0.          0.        ]
 [-3.082207    0.70710678  0.          0.        ]
 [ 0.          0.          0.          0.        ]]
V:
 [[-3.24442842e-01 -6.48885685e-01 -4.86664263e-01 -4.86664263e-01]
 [-4.66686951e-18 -4.75961049e-17  7.07106781e-01 -7.07106781e-01]
 [ 0.00000000e+00 -7.27606875e-01  4.85071250e-01  4.85071250e-01]
 [-9.45905303e-01  2.22565954e-01  1.66924465e-01  1.66924465e-01]]

How far is USV from A? RMSE: 0.0


True
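
The expression `U * S` above relies on NumPy broadcasting: it scales column `j` of `U` by the singular value `S[j]`, which for a square matrix is the same as `U @ np.diag(S)`. Note also that the third value returned by `np.linalg.svd` is already the transposed factor (NumPy's documentation calls it `Vh`), so no extra transpose is needed. A short check under those assumptions; the name `Vh` is used here only to make that explicit:

```python
import numpy as np

A = np.array([[1, 2, 1, 2],
              [0, 0, 0, 0],
              [1, 2, 2, 1],
              [0, 0, 0, 0]])

U, S, Vh = np.linalg.svd(A)

# Broadcasting multiplies each column of U by the matching singular value
scaled_by_broadcast = U * S
scaled_by_diag = U @ np.diag(S)
same_scaling = np.allclose(scaled_by_broadcast, scaled_by_diag)

# A has only two linearly independent rows, so only two singular values are non-zero
n_nonzero = int(np.sum(S > 1e-10))

# The three factors multiply back to A
reconstructs = np.allclose(A, scaled_by_broadcast @ Vh)

print(same_scaling, n_nonzero, reconstructs)
```

The two zero singular values explain why the last two columns of `U * S` in the output above are all zeros.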

### Matrix Factorization - Using `sklearn.decomposition.NMF`
```
A == np.dot(W, H)
```

where:
```
model = NMF(n_components=?, init='random', random_state=SEED)  # n_components: number of cols for the array W.
                                                               # Components to infer
W = model.fit_transform(A)
H = model.components_
```

In [14]:
A = np.array([[1, 2, 1, 2],
              [0, 0, 0, 0],
              [1, 2, 2, 1],
              [0, 0, 0, 0]])

model = NMF(n_components=3, init='random', random_state=0)
W = model.fit_transform(A) # W.shape = (4, 3)
H = model.components_ # H.shape = (3, 4)

# Print the results
print("W:\n", W)
print("H:\n", H)

# Calculate RMSE between WH and A
WH = np.dot(W, H)
print("\nHow far is WH from A? RMSE:", getRMSE(WH, A))
np.allclose(A, WH)

W:
 [[1.40955799 0.18471342 1.13367593]
 [0.         0.         0.        ]
 [0.80188415 1.32305167 0.31916526]
 [0.         0.         0.        ]]
H:
 [[0.32474734 1.27990823 0.55551258 0.84884422]
 [0.4617867  0.72266071 1.17496175 0.07324487]
 [0.40306531 0.05504503 0.         0.69683021]]

How far is WH from A? RMSE: 0.0


False
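
The two checks above appear to disagree: the RMSE prints as `0.0`, yet `np.allclose` returns `False`. A likely explanation is that `getRMSE` rounds to three decimals, so any error below `0.0005` is displayed as zero, while `np.allclose` uses much tighter default tolerances (`rtol=1e-05`, `atol=1e-08`). The sketch below reproduces the effect with made-up arrays rather than the NMF output:

```python
import numpy as np

# Hypothetical target and a reconstruction that is off by 1e-4 everywhere
target = np.array([[1.0, 2.0],
                   [2.0, 1.0]])
approx = target + 1e-4

# Same formula as getRMSE, before and after rounding
rmse_exact = np.sqrt(np.mean((approx - target) ** 2))
rmse_rounded = round(float(rmse_exact), 3)

# np.allclose(a, b) requires |a - b| <= atol + rtol * |b| for every element
close_default = np.allclose(target, approx)
close_relaxed = np.allclose(target, approx, atol=1e-3)

print(rmse_exact, rmse_rounded, close_default, close_relaxed)
```

When comparing factorizations, it can help to print the unrounded RMSE or to pass an explicit `atol` that matches the precision you care about.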

## Data preparation for Spark ALS

### Conventional Databases

In [15]:
conventional_df = (ratings_data.select('userId', 'movieId', 'rating')
                               .join(movies_data.select('movieId', 'title')
                                                .where(F.col('title').isin([
                                                    'Good Will Hunting (1997)',
                                                    'Batman Forever (1995)',
                                                    'Incredibles, The (2004)',
                                                    'Shawshank Redemption, The (1994)',
                                                    'Coco (2017)'
                                                ])), on=['movieId'])
                               .groupBy('userId')
                               .pivot('title')
                               .agg(F.avg('rating'))
                               .limit(10))
conventional_df

userId,Batman Forever (1995),Coco (2017),Good Will Hunting (1997),"Incredibles, The (2004)","Shawshank Redemption, The (1994)"
16339,,,,5.0,
58707,,,,,3.0
65905,3.5,,,3.0,
78700,,,3.0,5.0,5.0
72938,,,5.0,,
86775,,4.5,,,
177859,,,,4.0,
78042,,,,,5.0
95943,,,2.0,,
143788,,1.0,,,5.0


### Row-based Data Format

In [16]:
# Using unpivot
rowbase_df = (conventional_df.unpivot(['UserId'], conventional_df.columns[1:], 'Movies', 'Ratings')
                             .where('Ratings IS NOT NULL')
                             .sort('UserId', 'Movies'))
rowbase_df.show(10, truncate=False)

+------+--------------------------------+-------+
|UserId|Movies                          |Ratings|
+------+--------------------------------+-------+
|16339 |Incredibles, The (2004)         |5.0    |
|58707 |Shawshank Redemption, The (1994)|3.0    |
|65905 |Batman Forever (1995)           |3.5    |
|65905 |Incredibles, The (2004)         |3.0    |
|72938 |Good Will Hunting (1997)        |5.0    |
|78042 |Shawshank Redemption, The (1994)|5.0    |
|78700 |Good Will Hunting (1997)        |3.0    |
|78700 |Incredibles, The (2004)         |5.0    |
|78700 |Shawshank Redemption, The (1994)|5.0    |
|86775 |Coco (2017)                     |4.5    |
+------+--------------------------------+-------+
only showing top 10 rows



In [17]:
# Using a user defined function `to_long`
row_base_df = to_long(conventional_df,
                      cols_a=['userId'], col_b='Movies', col_c='Rating').sort('UserId', 'Movies')
row_base_df.show(10, truncate=False)

+------+--------------------------------+------+
|userId|Movies                          |Rating|
+------+--------------------------------+------+
|16339 |Incredibles, The (2004)         |5.0   |
|58707 |Shawshank Redemption, The (1994)|3.0   |
|65905 |Batman Forever (1995)           |3.5   |
|65905 |Incredibles, The (2004)         |3.0   |
|72938 |Good Will Hunting (1997)        |5.0   |
|78042 |Shawshank Redemption, The (1994)|5.0   |
|78700 |Good Will Hunting (1997)        |3.0   |
|78700 |Incredibles, The (2004)         |5.0   |
|78700 |Shawshank Redemption, The (1994)|5.0   |
|86775 |Coco (2017)                     |4.5   |
+------+--------------------------------+------+
only showing top 10 rows



### Steps to get integer IDs in case userId and movieId were strings

1. Extract unique userIds and movieIds
2. Assign unique integers to each ID
3. Rejoin unique integer IDs back to the ratings data

In [18]:
# User integers IDs
users = row_base_df.select('userId').distinct()
users = users.coalesce(1)
users = users.withColumn("userIntId", F.monotonically_increasing_id()).persist()
users.show(3)

# Movie integer IDs
movies = row_base_df.select("Movies").distinct()
movies = movies.coalesce(1)
movies = movies.withColumn("movieId", F.monotonically_increasing_id()).persist()
movies.show(3)

# Joining UserIds and MovieIds
ratings_w_int_ids = (row_base_df.join(users, "userId", "left")
                                .join(movies, "Movies", "left"))
ratings_w_int_ids.show(10)

+------+---------+
|userId|userIntId|
+------+---------+
| 16339|        0|
| 58707|        1|
| 65905|        2|
+------+---------+
only showing top 3 rows

+--------------------+-------+
|              Movies|movieId|
+--------------------+-------+
|Incredibles, The ...|      0|
|Shawshank Redempt...|      1|
|Batman Forever (1...|      2|
+--------------------+-------+
only showing top 3 rows

+--------------------+------+------+---------+-------+
|              Movies|userId|Rating|userIntId|movieId|
+--------------------+------+------+---------+-------+
|Incredibles, The ...| 16339|   5.0|        0|      0|
|Shawshank Redempt...| 58707|   3.0|        1|      1|
|Batman Forever (1...| 65905|   3.5|        2|      2|
|Incredibles, The ...| 65905|   3.0|        2|      0|
|Good Will Hunting...| 78700|   3.0|        3|      3|
|Incredibles, The ...| 78700|   5.0|        3|      0|
|Shawshank Redempt...| 78700|   5.0|        3|      1|
|Good Will Hunting...| 72938|   5.0|        4|    

### Ex. 3 - Correct format and distinct users
Take a look at the `df` dataframe. Notice that it is in conventional or "wide" format with a different movie in each column. Also notice that the user and movie names are not in integer format. Follow the steps to properly prepare this data for ALS.

**Instructions:**

1. Import the `monotonically_increasing_id` function from `pyspark.sql.functions` and view the `df` dataframe using the `.show()` method.
2. Convert the `df` dataframe into a "long" dataframe using the `to_long()` function or the `.unpivot()` method. Call the new dataframe `ratings_df`.
3. Create a dataframe called `users` that contains all the `.distinct()` users from the dataframe and reduce it to one partition using the `.coalesce(1)` method.
4. Use the `monotonically_increasing_id()` function inside of `withColumn()` to create a new column in the `users` dataframe that contains a unique integer for each user. Call this column `userId`. Be sure to call the `.persist()` method on the final dataframe to ensure the new integer IDs persist.
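
Why coalesce to one partition first? Spark's documentation for `monotonically_increasing_id` states that the generated IDs are unique and increasing but not consecutive: the partition ID goes in the upper 31 bits and the record number within each partition in the lower 33 bits. The plain-Python sketch below imitates that layout to show the effect; `simulated_ids` is a hypothetical helper written for this illustration, not a Spark API:

```python
from typing import List

def simulated_ids(rows_per_partition: List[int]) -> List[int]:
    """Imitate the documented ID layout: (partition index << 33) + row number."""
    ids = []
    for partition_index, n_rows in enumerate(rows_per_partition):
        for row_number in range(n_rows):
            ids.append((partition_index << 33) + row_number)
    return ids

# Four users spread over two partitions: IDs jump by 2**33 at the boundary
ids_two_partitions = simulated_ids([2, 2])

# The same four users in a single partition: IDs are small and consecutive
ids_one_partition = simulated_ids([4])

print(ids_two_partitions)
print(ids_one_partition)
```

With a single partition the IDs stay small and consecutive, which matters because Spark ALS expects user and item IDs that fit within the integer value range.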

In [19]:
# Setting the data
df = spark.createDataFrame([
        ['James Alking', 3, 4, 4, 3],
        ['Elvira Marroquin', 4, 5, None, 2],
        ['Jack Bauer', None, 2, 2, 5],
        ['Julia James', 5, None, 2, 2]
    ], schema=('User', 'Shreck', 'Coco', 'Swing Kids', 'Sneakers'))
df.show(truncate=False)

+----------------+------+----+----------+--------+
|User            |Shreck|Coco|Swing Kids|Sneakers|
+----------------+------+----+----------+--------+
|James Alking    |3     |4   |4         |3       |
|Elvira Marroquin|4     |5   |NULL      |2       |
|Jack Bauer      |NULL  |2   |2         |5       |
|Julia James     |5     |NULL|2         |2       |
+----------------+------+----+----------+--------+



In [20]:
# Convert the dataframe to the "long" format.
ratings_df = (df.unpivot(['User'], df.columns[1:], 'Movie', 'Rating')
                .where('Rating IS NOT NULL')
                .sort('User', 'Movie'))
ratings_df.show(truncate=False)

+----------------+----------+------+
|User            |Movie     |Rating|
+----------------+----------+------+
|Elvira Marroquin|Coco      |5     |
|Elvira Marroquin|Shreck    |4     |
|Elvira Marroquin|Sneakers  |2     |
|Jack Bauer      |Coco      |2     |
|Jack Bauer      |Sneakers  |5     |
|Jack Bauer      |Swing Kids|2     |
|James Alking    |Coco      |4     |
|James Alking    |Shreck    |3     |
|James Alking    |Sneakers  |3     |
|James Alking    |Swing Kids|4     |
|Julia James     |Shreck    |5     |
|Julia James     |Sneakers  |2     |
|Julia James     |Swing Kids|2     |
+----------------+----------+------+



In [21]:
# Get unique users and repartition to 1 partition
users = ratings_df.select("User").distinct().coalesce(1)
users = users.withColumn("userId", F.monotonically_increasing_id()).persist()
users.show(truncate=False)

+----------------+------+
|User            |userId|
+----------------+------+
|Elvira Marroquin|0     |
|Jack Bauer      |1     |
|James Alking    |2     |
|Julia James     |3     |
+----------------+------+



### Ex. 4 - Assigning integer IDs to movies

Let's do the same thing to the movies. Then let's join the new user IDs and movie IDs into one dataframe.

**Instructions:**

1. Use the `.select()` and the `.distinct()` methods to extract all unique movies from the `ratings_df` dataframe.
2. Reduce the `movies` dataframe to one partition using `coalesce()`.
3. Complete the partial code provided to assign unique integer IDs to each movie. Name the new column `movieId` and call the `.persist()` method on the resulting dataframe.
4. Join the `ratings_df` dataframe to the `users` dataframe and subsequently to the `movies` dataframe. Call the result `movie_ratings_df`.

In [22]:
# Get unique movie and repartition to 1 partition
movies = ratings_df.select("Movie").distinct().coalesce(1)
movies = movies.withColumn("movieId", F.monotonically_increasing_id()).persist()
movies.show(truncate=False)

+----------+-------+
|Movie     |movieId|
+----------+-------+
|Sneakers  |0      |
|Coco      |1      |
|Swing Kids|2      |
|Shreck    |3      |
+----------+-------+



In [23]:
# Join the ratings, users and movies dataframes
movie_ratings_df = (ratings_df.join(users, "User", "left")
                              .join(movies, "Movie", "left"))
movie_ratings_df.show(truncate=False)

+----------+----------------+------+------+-------+
|Movie     |User            |Rating|userId|movieId|
+----------+----------------+------+------+-------+
|Shreck    |James Alking    |3     |2     |3      |
|Coco      |James Alking    |4     |2     |1      |
|Swing Kids|James Alking    |4     |2     |2      |
|Sneakers  |James Alking    |3     |2     |0      |
|Shreck    |Elvira Marroquin|4     |0     |3      |
|Coco      |Elvira Marroquin|5     |0     |1      |
|Sneakers  |Elvira Marroquin|2     |0     |0      |
|Coco      |Jack Bauer      |2     |1     |1      |
|Swing Kids|Jack Bauer      |2     |1     |2      |
|Sneakers  |Jack Bauer      |5     |1     |0      |
|Shreck    |Julia James     |5     |3     |3      |
|Swing Kids|Julia James     |2     |3     |2      |
|Sneakers  |Julia James     |2     |3     |0      |
+----------+----------------+------+------+-------+



## ALS parameters and hyperparameters

### ALS model code

In [24]:
# Review data
df = ratings_data.select('userId', 'MovieId', 'Rating').repartition(5)
df.show(3)

+------+-------+------+
|userId|MovieId|Rating|
+------+-------+------+
| 13147|   2001|   2.0|
|  3135| 163066|   4.0|
|  6587|  66659|   3.5|
+------+-------+------+
only showing top 3 rows



In [25]:
# Split into train and test
df_train, df_test = df.randomSplit([0.8, 0.2], seed=SEED)
print(f"Training set: {df_train.count()}, Testing set: {df_test.count()}")

Training set: 73917, Testing set: 18317


In [26]:
# Build ALS model
als_model = ALS(userCol="userId", itemCol="MovieId", ratingCol="Rating",
                nonnegative=True, coldStartStrategy="drop", implicitPrefs=False).fit(df_train)

In [27]:
# Generate predictions on test dataset
predictions = als_model.transform(df_test)
predictions.show(3)

+------+-------+------+----------+
|userId|MovieId|Rating|prediction|
+------+-------+------+----------+
|  6654|  48304|   2.5| 2.7344885|
|  6654|  57274|   2.5| 3.5387907|
|  6654| 204318|   3.5| 3.0853581|
+------+-------+------+----------+
only showing top 3 rows



### Evaluating the model

In [28]:
# Complete the evaluator code
evaluator = RegressionEvaluator(metricName="rmse", labelCol="Rating", predictionCol="prediction")
print(f'''
RMSE: {evaluator.evaluate(predictions)}
 MAE: {evaluator.evaluate(predictions, {evaluator.metricName: "mae"})}
  R²: {evaluator.evaluate(predictions, {evaluator.metricName: "r2"})}
 MSE: {evaluator.evaluate(predictions, {evaluator.metricName: "mse"})}
''')


RMSE: 0.9161828173411929
 MAE: 0.6896194025121015
  R²: 0.1660545516063615
 MSE: 0.8393909547912431



### Ex. 5 - Build out an ALS model

Let's build your first ALS model by completing the code below.

Recall that you can use the `.columns` attribute of the ratings dataframe to see the names of the columns that contain user, movie, and ratings data. Spark needs to know the names of these columns in order to perform ALS correctly.

**Instructions:**

1. Before building our ALS model, we need to split the data into training data and test data. Use the `randomSplit()` method to split the ratings dataframe into training data and test data using a `0.8/0.2` split, respectively, and `SEED` as the seed for the random number generator.
2. Tell Spark
    - which columns hold the user, item and rating data through `userCol`, `itemCol` and `ratingCol`. Use the `.columns` attribute if needed.
3. Complete the hyperparameters.
    - Set the `rank` to `10`,
    - the `maxIter` to `15`,
    - the `regParam` or lambda to `.1`,
    - the `coldStartStrategy` to `"drop"`,
    - the `nonnegative` argument should be set to `True`,
    - and since our data contains explicit ratings, set the `implicitPrefs` argument to `False`.
4. Now fit the ALS model to the training portion of the ratings data by calling the `.fit()` method on the training data provided.
5. Generate predictions on the test portion of the ratings data by calling the `.transform()` method on the test data provided. Feel free to view the predictions by calling the `.show()` method on the predictions data.
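
To make the hyperparameters above less abstract, here is a minimal NumPy sketch of the alternating idea behind ALS, run on the toy ratings from Ex. 3 (rows are users, columns are movies, `NaN` marks a missing rating). It is an illustration under simplifying assumptions, not Spark's implementation: it runs on a single machine, applies no non-negativity constraint, and uses a plain ridge penalty. The function name `als_sketch` and its arguments are hypothetical; `rank`, `reg` and `n_iter` play the roles of `rank`, `regParam` and `maxIter`.

```python
import numpy as np

def als_sketch(R, rank=2, reg=0.1, n_iter=15, seed=0):
    """Factor R ~ U @ V.T using only the observed (non-NaN) entries."""
    observed = ~np.isnan(R)
    n_users, n_items = R.shape
    rng = np.random.default_rng(seed)
    U = rng.random((n_users, rank))   # user factors
    V = rng.random((n_items, rank))   # item factors
    penalty = reg * np.eye(rank)

    def objective():
        # Squared error on observed cells plus the ridge penalty
        residual = (R - U @ V.T)[observed]
        return np.sum(residual ** 2) + reg * (np.sum(U ** 2) + np.sum(V ** 2))

    history = [objective()]
    for _ in range(n_iter):
        # Hold item factors fixed; solve a small ridge regression per user
        for u in range(n_users):
            seen = observed[u]
            if seen.any():
                Vs = V[seen]
                U[u] = np.linalg.solve(Vs.T @ Vs + penalty, Vs.T @ R[u, seen])
        # Hold user factors fixed; solve a small ridge regression per item
        for i in range(n_items):
            seen = observed[:, i]
            if seen.any():
                Us = U[seen]
                V[i] = np.linalg.solve(Us.T @ Us + penalty, Us.T @ R[seen, i])
        history.append(objective())
    return U, V, history

nan = np.nan
R = np.array([[3, 4, 4, 3],
              [4, 5, nan, 2],
              [nan, 2, 2, 5],
              [5, nan, 2, 2]], dtype=float)

U, V, history = als_sketch(R)
predictions = U @ V.T   # also contains estimates for the missing cells
print(np.round(predictions, 2))
print("objective:", round(history[0], 3), "->", round(history[-1], 3))
```

Each inner step is an ordinary regularized least-squares problem with a closed-form solution, which is why alternating between the two factor matrices never increases the objective.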

In [29]:
# Loading the data
df = spark.createDataFrame([Row(userId=2, movieId=3, rating=3.0),
                            Row(userId=2, movieId=1, rating=4.0),
                            Row(userId=2, movieId=2, rating=4.0),
                            Row(userId=2, movieId=0, rating=3.0),
                            Row(userId=0, movieId=3, rating=4.0),
                            Row(userId=0, movieId=1, rating=5.0),
                            Row(userId=0, movieId=0, rating=2.0),
                            Row(userId=1, movieId=1, rating=2.0),
                            Row(userId=1, movieId=2, rating=2.0),
                            Row(userId=1, movieId=0, rating=5.0),
                            Row(userId=3, movieId=3, rating=5.0),
                            Row(userId=3, movieId=2, rating=2.0),
                            Row(userId=3, movieId=0, rating=2.0)])
df.printSchema()
df.show()

root
 |-- userId: long (nullable = true)
 |-- movieId: long (nullable = true)
 |-- rating: double (nullable = true)

+------+-------+------+
|userId|movieId|rating|
+------+-------+------+
|     2|      3|   3.0|
|     2|      1|   4.0|
|     2|      2|   4.0|
|     2|      0|   3.0|
|     0|      3|   4.0|
|     0|      1|   5.0|
|     0|      0|   2.0|
|     1|      1|   2.0|
|     1|      2|   2.0|
|     1|      0|   5.0|
|     3|      3|   5.0|
|     3|      2|   2.0|
|     3|      0|   2.0|
+------+-------+------+



In [30]:
# Split the df dataframe into training and test data
(df_train, df_test) = df.randomSplit([0.8, 0.2], seed=42)

# Set the ALS hyperparameters
model = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
            rank=10, maxIter=15, regParam=0.1,
            coldStartStrategy="drop", nonnegative=True, implicitPrefs=False)

# Fit the model to df_train
model = model.fit(df_train)

# Generate predictions on the df_test
df_predictions = model.transform(df_test)
df_predictions.show()

+------+-------+------+----------+
|userId|movieId|rating|prediction|
+------+-------+------+----------+
|     2|      3|   3.0| 3.0715334|
|     1|      2|   2.0| 3.5046484|
|     3|      3|   5.0| 0.8877176|
|     3|      2|   2.0| 1.7279035|
+------+-------+------+----------+



### Ex. 6 - Build RMSE evaluator

Now that you know how to fit a model to training data and generate test predictions, you need a way to evaluate how well your model performs. For this we'll build an evaluator. Evaluators in Spark can be built in various ways. For our purposes, we want a `RegressionEvaluator` that calculates the `RMSE`. After we build our `RegressionEvaluator`, we can fit the model to our data and generate predictions.

**Instructions:**

1. Import the required `RegressionEvaluator` class from the `pyspark.ml.evaluation` module. Already done!
2. Complete the evaluator code, specifying the metric name to be `"rmse"`. Set the `labelCol` to the name of the column in our ratings data that contains our ratings (use the `.columns` attribute to see column names) and set the prediction column name to `"prediction"`.
3. Confirm that the evaluator was properly created by extracting each of the three parameters from it. Do this by running the following 3 lines of code, each within a print statement:
```
evaluator.getMetricName()
evaluator.getLabelCol()
evaluator.getPredictionCol()
```

In [31]:
# Complete the evaluator code
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")

# Extract the 3 parameters
print(evaluator.getMetricName())
print(evaluator.getLabelCol())
print(evaluator.getPredictionCol())

rmse
rating
prediction


### Ex. 7 - Get RMSE

Now that you know how to build a model and generate predictions, and have an evaluator to tell you how well it predicts ratings, you can calculate the `RMSE` to see how well an ALS model performed. We'll use the evaluator that we built in the previous exercise to calculate and print the `RMSE`.

**Instructions:**

1. Call the `.evaluate()` method on our evaluator to calculate our `RMSE` on the test predictions dataframe. Call the result `RMSE`.
2. Print the `RMSE`.

In [32]:
# Evaluate the "df_predictions" dataframe
RMSE = evaluator.evaluate(df_predictions)

# Print the RMSE
print(RMSE)

2.1939682652076633


## Close session

In [33]:
spark.stop()