# Final Project - Evaluating Ideal Number of Weights Using Spark

For this exercise, we'll use Spark on an AWS EC2 machine. 

### AWS Machine Specs 
The machine is a T2 Extra Large machine with 8 cores and 32GB of RAM.

### Spark Setup
We'll use PySpark with two partitions. The machine has eight cores, and Spark will use them to run jobs in parallel.

### Objective
The objective of this exercise is to evaluate the optimal number of user and item weights using Spark. We use a distributed system because factorizing these large, sparse rating matrices requires a lot of computational power.

### Datasets

* Book Review Data (From a previous exercise)
* Jester Dataset (Used in this exercise)

### Import Modules and Libraries for Use

In [3]:
# Standard library
import io
import json
import os
import pickle
import sys
import warnings
import zipfile
from io import BytesIO
from urllib.request import urlopen
from zipfile import ZipFile

# Third-party libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pyspark  # needed below for pyspark.sql.SparkSession
import requests
import seaborn as sns
import surprise
from pyspark.sql import functions as func
from sklearn.preprocessing import MinMaxScaler

warnings.filterwarnings('ignore')
%matplotlib inline

## Set Up Spark

### Open Secrets File to Get Path to Spark Master 

(I don't want my internal IP in the public domain)

In [2]:
with open('data/secrets.json') as f:
    pyspark_config = json.load(f)

### Create Spark Session with Link to Master

In [7]:
spark = pyspark.sql.SparkSession.builder.master(pyspark_config['master']).getOrCreate()

### Get Default Parallelism (since we have eight cores)

We can see that Spark has automatically detected that we can run eight parallel operations.

In [6]:
spark.sparkContext.defaultParallelism

8

### Get Data and Import into Spark DataFrame Format

In [8]:
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS

### Book Data

In [None]:
ratings_books = spark.read.csv('data/ratings_books.csv',
                               sep=',',
                               inferSchema=True,
                               header=True)

# ALS expects integer user and item IDs
for col_name in ['user_id', 'book_id']:
    ratings_books = ratings_books.withColumn(col_name, ratings_books[col_name].cast('int'))

# Ratings are cast to float
ratings_books = ratings_books.withColumn('rating', ratings_books['rating'].cast('float'))

# Keep only the three columns used by ALS
ratings_books = ratings_books.select('user_id', 'book_id', 'rating')

### Create Test Train Split for Book Data

In [7]:
train_books, test_books = ratings_books.randomSplit([0.7, 0.3], seed=42)

### Create the Spark ALS Instance and get Cross Val Scores

In [None]:
# The ALS instance
als_book = ALS(userCol='user_id',
              itemCol='book_id',
              ratingCol='rating',
              nonnegative=True,
              seed=42)

# The parameter grid to search

als_book_paramgrid = (ParamGridBuilder()
                 .addGrid(als_book.rank, [10, 20])
                 .addGrid(als_book.maxIter, [10])
                 .addGrid(als_book.regParam, [0.1, 0.5, 1.0])
                 # Note: alpha only affects ALS when implicitPrefs=True (the default is False)
                 .addGrid(als_book.alpha, [0.1, 0.5, 1.0])
                 .build())

# The evaluation function for determining the best model
rmse_eval_book = RegressionEvaluator(labelCol='rating',
                                predictionCol='prediction', 
                                metricName='rmse')

# The cross validation instance
cv_book = CrossValidator(estimator=als_book,
                    parallelism=8,
                    collectSubModels=True,
                    estimatorParamMaps=als_book_paramgrid,
                    evaluator=rmse_eval_book,
                    numFolds=3, 
                    seed=42)

# Fit the models and find the best one!
als_book_cv = cv_book.fit(train_books.dropna())
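
To get a feel for the cost of this search, the size of the grid can be counted directly. The sketch below is plain Python that mirrors the value lists from the grid above; the dictionary and variable names are illustrative and not part of the Spark API.

```python
from itertools import product

# Value lists copied from the parameter grid above.
grid = {
    "rank": [10, 20],
    "maxIter": [10],
    "regParam": [0.1, 0.5, 1.0],
    "alpha": [0.1, 0.5, 1.0],
}
num_folds = 3

# Every combination of one value per parameter.
combos = list(product(*grid.values()))
n_combos = len(combos)            # 2 * 1 * 3 * 3
n_cv_fits = n_combos * num_folds  # one fit per combination per fold

print(n_combos, n_cv_fits)  # 18 54
```

After the folds are scored, `CrossValidator` refits the winning combination on the full training set, which adds one more fit.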

### Check Number of Weights for Best Model (Book Data)

In [11]:
als_book_best = als_book_cv.bestModel
dict(Rank=als_book_best.rank,
     User_Factors=len(als_book_best.userFactors.toPandas()['features'][0]),
     Item_Factors=len(als_book_best.itemFactors.toPandas()['features'][0]))

{'Item_Factors': 10, 'Rank': 10, 'User_Factors': 10}
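
The rank fixes the length of every user and item factor vector, so the total number of learned weights grows linearly with it. A minimal sketch, using hypothetical user and item counts (the real counts are not stated here) and a helper name, `count_weights`, introduced only for illustration:

```python
def count_weights(n_users: int, n_items: int, rank: int) -> int:
    """Total entries in the user and item factor matrices for a given rank."""
    return rank * (n_users + n_items)

# Hypothetical dataset sizes, for illustration only.
n_users, n_items = 50_000, 10_000

for rank in (10, 20):
    print(rank, count_weights(n_users, n_items, rank))
```

Doubling the rank doubles the number of weights to fit, which is why a smaller rank is preferable when it scores as well or better.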

### Calculate RMSE

In [12]:
# The aggregate expression can be defined up front and applied to the DataFrame later
rms_calc = func.sqrt(func.mean('diff_sq'))
als_book_pred = als_book_best.transform(test_books)
als_book_pred = als_book_pred.withColumn('diff_sq', (als_book_pred['rating'] - als_book_pred['prediction'])**2)
als_book_pred.dropna().select(rms_calc.alias('rmse')).show()

+------------------+
|              rmse|
+------------------+
|1.2680396014053497|
+------------------+
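
The cell above computes RMSE as the square root of the mean squared difference between rating and prediction. The same calculation on a small NumPy array, with made-up ratings and a helper `rmse` that is not part of the original notebook, looks roughly like this:

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Made-up ratings and predictions, for illustration only.
actual = [4.0, 3.0, 5.0, 1.0]
predicted = [3.0, 3.0, 4.0, 3.0]

# Squared differences are 1, 0, 1, 4, so the mean is 1.5.
example_rmse = rmse(actual, predicted)
print(example_rmse)
```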



### Compare with Jester Data

In [19]:
ratings_jokes = spark.read.csv('data/jester_ratings.csv',
                               sep=',',
                               inferSchema=True,
                               header=True)

# ALS expects integer user and item IDs
for col_name in ['user', 'item']:
    ratings_jokes = ratings_jokes.withColumn(col_name, ratings_jokes[col_name].cast('int'))

# Ratings are cast to float
ratings_jokes = ratings_jokes.withColumn('rating', ratings_jokes['rating'].cast('float'))

# Keep only the three columns used by ALS
ratings_jokes = ratings_jokes.select('user', 'item', 'rating')

In [20]:
train_jokes, test_jokes = ratings_jokes.randomSplit([0.7, 0.3], seed=42)

In [23]:
# The ALS instance
als_jokes = ALS(userCol='user',
              itemCol='item',
              ratingCol='rating',
              nonnegative=False,
              seed=42)

# The parameter grid to search

als_jokes_paramgrid = (ParamGridBuilder()
                 .addGrid(als_jokes.rank, [10, 20])
                 .addGrid(als_jokes.maxIter, [10])
                 .addGrid(als_jokes.regParam, [0.1, 0.5, 1.0])
                 # Note: alpha only affects ALS when implicitPrefs=True (the default is False)
                 .addGrid(als_jokes.alpha, [0.1, 0.5, 1.0])
                 .build())

# The evaluation function for determining the best model
rmse_eval_jokes = RegressionEvaluator(labelCol='rating',
                                predictionCol='prediction', 
                                metricName='rmse')

# The cross validation instance
cv_jokes = CrossValidator(estimator=als_jokes,
                    parallelism=8,
                    collectSubModels=True,
                    estimatorParamMaps=als_jokes_paramgrid,
                    evaluator=rmse_eval_jokes,
                    numFolds=3, 
                    seed=42)

# Fit the models and find the best one!
als_jokes_cv = cv_jokes.fit(train_jokes.dropna())
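
One thing worth checking for both cross-validation runs: Spark's `ALS` defaults to `coldStartStrategy='nan'`, so users or items that appear only in a validation fold receive NaN predictions, and a single NaN turns that fold's RMSE into NaN. Whether this happened in these runs is not visible in the output, so treat this as a caution rather than a diagnosis. The NumPy sketch below uses made-up numbers to show the effect; passing `coldStartStrategy='drop'` to `ALS` is the documented way to exclude such rows before evaluation.

```python
import numpy as np

# Made-up values; the NaN stands in for a cold-start prediction.
actual = np.array([4.0, 3.0, 5.0])
predicted = np.array([3.5, np.nan, 4.0])

# A single NaN prediction makes the whole RMSE NaN.
rmse_with_nan = np.sqrt(np.mean((actual - predicted) ** 2))

# Dropping the NaN rows first gives a usable score.
mask = ~np.isnan(predicted)
rmse_dropped = np.sqrt(np.mean((actual[mask] - predicted[mask]) ** 2))

# A minimum search over scores that contain NaN is no longer meaningful:
# argmin returns the position of the first NaN, not of the 0.9.
scores = np.array([np.nan, 0.9, np.nan])
best_index = int(np.argmin(scores))

print(rmse_with_nan, rmse_dropped, best_index)
```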

### Check Number of Weights for Best Model (Jester Data)

In [24]:
als_jokes_best = als_jokes_cv.bestModel
dict(Rank=als_jokes_best.rank,
     User_Factors=len(als_jokes_best.userFactors.toPandas()['features'][0]),
     Item_Factors=len(als_jokes_best.itemFactors.toPandas()['features'][0]))

{'Item_Factors': 10, 'Rank': 10, 'User_Factors': 10}

### Look at RMSE

In [26]:
# The aggregate expression can be defined up front and applied to the DataFrame later
rms_calc = func.sqrt(func.mean('diff_sq'))
als_jokes_pred = als_jokes_best.transform(test_jokes)
als_jokes_pred = als_jokes_pred.withColumn('diff_sq', (als_jokes_pred['rating'] - als_jokes_pred['prediction'])**2)
als_jokes_pred.dropna().select(rms_calc.alias('rmse')).show()

+-----------------+
|             rmse|
+-----------------+
|4.521681825280404|
+-----------------+



## Summary

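The RMSE for the Jester data (about 4.52) looks large next to the book data (about 1.27), but Jester ratings range from -10 to +10, so its errors are measured on a 20-point scale. For both datasets, cross-validation selected a rank of 10 over 20, which suggests that 10 is the better number of weights of the two values tested.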
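
To put an RMSE in the context of its rating scale, it can be divided by the width of that scale. A minimal sketch, assuming a helper `normalized_rmse` that is not part of the original notebook; only the Jester range of -10 to +10 comes from the text above, and the book rating range would need to be read from the data before applying the same function.

```python
def normalized_rmse(rmse: float, rating_min: float, rating_max: float) -> float:
    """RMSE expressed as a fraction of the rating scale's width."""
    return rmse / (rating_max - rating_min)

# Jester RMSE from the output above, on the -10 to +10 scale.
jester_nrmse = normalized_rmse(4.521681825280404, -10.0, 10.0)
print(round(jester_nrmse, 3))  # 0.226
```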