# DSCI 632 - Applied Cloud Computing - Final Project

### Team Members
- Aman Ostwal (ago34@drexel.edu)
- Darshit Rai (dr3264@drexel.edu)
- Sanskruti Chavanke (sc4323@drexel.edu)

## Configuration:

In [1]:
# Install the `findspark` and `pyspark` packages quietly
!pip install -q findspark pyspark

# Import the `os` module for operating system functionalities
import os

## Load Packages and Functions

In [2]:
# Initialize PySpark in Colab
import findspark
findspark.init()
from pyspark import SparkContext
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.sql import SparkSession

In [3]:
def get_movie_title_from_id(movieId):
    # Filter the DataFrame to find the row corresponding to the given movieId
    row = movie_titles.filter(movie_titles["movieId"] == movieId).select("title").first()
    # first() returns None when no row matches, so fall back to a default title
    return row[0] if row is not None else "Unknown Movie"
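
Each call to the function above runs a Spark job. Because the title table is small, one alternative is to collect it once into a plain dictionary and look titles up locally. The sketch below is only an illustration: `sample_rows`, `title_by_id`, and `lookup_title` are hypothetical names, and the rows are a hand-copied sample rather than the real `movie_titles` DataFrame.

```python
# Hypothetical sample rows standing in for the collected title table
sample_rows = [
    (1, "Toy Story (1995)"),
    (2, "Jumanji (1995)"),
    (6, "Heat (1995)"),
]

# Build the lookup once. In the notebook this could instead be built with
# {row["movieId"]: row["title"] for row in movie_titles.collect()}
title_by_id = dict(sample_rows)


def lookup_title(movie_id, titles=title_by_id):
    """Return the title for movie_id, or a default when the id is unknown."""
    return titles.get(movie_id, "Unknown Movie")


print(lookup_title(6))
print(lookup_title(999))
```

This trades a little driver memory for avoiding one Spark job per recommended movie.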

In [4]:
def get_user_recommended_movies(recs_df, userId):
    # Filter recommendations DataFrame to get recommendations for the given userId
    rows = recs_df.filter(recs_df["userId"] == userId).select("recommendations").collect()
    # An empty result means the userId has no recommendations or is not in the dataset
    if not rows:
        print("No recommendations found for this user. Try another userId.")
        return
    # Iterate through recommendations and print movie titles with predicted ratings
    for row in rows:
        for movie in row.recommendations:
            movie_id, rating = movie
            movie_title = get_movie_title_from_id(movie_id)
            print(f"Movie: \n{movie_title}\nPredicted Rating: {rating}\n")

## Import Data

In [5]:
# Get or create the SparkContext (SparkContext and SparkSession were imported above)
sc = SparkContext.getOrCreate()

# Get or create the SparkSession
spark = SparkSession.builder.getOrCreate()

# Print SparkContext information
print('Master : ', sc.master)  # Print the URL of the master
print('Cores  : ', sc.defaultParallelism)  # Print the number of cores being used

Master :  local[*]
Cores  :  2


In [6]:
import pandas as pd

# Define the file path
file_path = "/content/movies.csv"

# Read CSV file into a DataFrame
movie_titles = spark.read.csv(file_path, header=True, inferSchema=True)

# Show the first few rows of the DataFrame
movie_titles.show(truncate=False)

+-------+-------------------------------------+-------------------------------------------+
|movieId|title                                |genres                                     |
+-------+-------------------------------------+-------------------------------------------+
|1      |Toy Story (1995)                     |Adventure|Animation|Children|Comedy|Fantasy|
|2      |Jumanji (1995)                       |Adventure|Children|Fantasy                 |
|3      |Grumpier Old Men (1995)              |Comedy|Romance                             |
|4      |Waiting to Exhale (1995)             |Comedy|Drama|Romance                       |
|5      |Father of the Bride Part II (1995)   |Comedy                                     |
|6      |Heat (1995)                          |Action|Crime|Thriller                      |
|7      |Sabrina (1995)                       |Comedy|Romance                             |
|8      |Tom and Huck (1995)                  |Adventure|Children               

In [7]:
# Define the file path
file_path = "/content/ratings.csv"

# Read CSV file into a DataFrame
ratings = spark.read.csv(file_path, header=True, inferSchema=True)

# Show the first few rows of the DataFrame
ratings.show(truncate=False)

+------+-------+------+---------+
|userId|movieId|rating|timestamp|
+------+-------+------+---------+
|1     |1      |4.0   |964982703|
|1     |3      |4.0   |964981247|
|1     |6      |4.0   |964982224|
|1     |47     |5.0   |964983815|
|1     |50     |5.0   |964982931|
|1     |70     |3.0   |964982400|
|1     |101    |5.0   |964980868|
|1     |110    |4.0   |964982176|
|1     |151    |5.0   |964984041|
|1     |157    |5.0   |964984100|
|1     |163    |5.0   |964983650|
|1     |216    |5.0   |964981208|
|1     |223    |3.0   |964980985|
|1     |231    |5.0   |964981179|
|1     |235    |4.0   |964980908|
|1     |260    |5.0   |964981680|
|1     |296    |3.0   |964982967|
|1     |316    |3.0   |964982310|
|1     |333    |5.0   |964981179|
|1     |349    |4.0   |964982563|
+------+-------+------+---------+
only showing top 20 rows



## ALS Model Creation

We'll split the data into training and test sets with an 80/20 ratio and set `seed` to 50 for reproducibility.

In [8]:
# Selecting required columns from the ratings DataFrame
ratings = ratings.select("userId", "movieId", "rating")

# Splitting the data into training and testing sets with 80/20 ratio
# Setting seed to 50 for reproducibility
(training_data, test_data) = ratings.randomSplit([0.8, 0.2], seed=50)
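
Note that `randomSplit` assigns each row to a split at random, so the resulting sizes are only approximately 80/20. The pure-Python sketch below illustrates that idea; `random_split` is a hypothetical helper written for this example and is not how Spark implements the split internally.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to a split using an independent uniform draw."""
    total = sum(weights)
    # Cumulative boundaries, e.g. [0.8, 1.0] for weights [0.8, 0.2]
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)

    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    splits = [[] for _ in weights]
    for row in rows:
        x = rng.random()
        for k, bound in enumerate(bounds):
            if x < bound:
                splits[k].append(row)
                break
        else:
            # Guard against floating-point rounding in the last boundary
            splits[-1].append(row)
    return splits


rows = list(range(1000))
train, test = random_split(rows, [0.8, 0.2], seed=50)
print(len(train), len(test))
```

Running it twice with the same seed gives the same split, which is the property the notebook relies on.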

Next, we initialize the model. We set the following parameters before tuning hyperparameters:
- `nonnegative`: `True`. Constrains the learned factors to be non-negative, since a negative rating has no meaning in this context.
- `coldStartStrategy`: `"drop"`. A user or movie that appears only in the test set has no learned factors, so its prediction is `NaN`. Setting `"drop"` removes those rows from the predictions, so the RMSE is computed only on rows the model can actually score.
- `implicitPrefs`: `False`. We have explicit ratings, so we don't need to use implicit feedback.
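
To see why the cold-start setting matters for evaluation, here is a small pure-Python sketch. The three prediction rows are made up for illustration; the point is only that a single `NaN` prediction turns the whole RMSE into `NaN`, while dropping it first gives a usable number.

```python
import math

# (userId, rating, prediction). NaN marks a hypothetical user that the
# model never saw during training, so it has no prediction.
predictions = [
    (1, 4.0, 3.5),
    (2, 3.0, 3.0),
    (3, 5.0, float("nan")),  # cold-start user
]


def rmse(rows):
    """Root mean squared error over (userId, rating, prediction) rows."""
    return math.sqrt(sum((r - p) ** 2 for _, r, p in rows) / len(rows))


# Equivalent in spirit to coldStartStrategy="drop": remove NaN predictions
kept = [row for row in predictions if not math.isnan(row[2])]

print(rmse(predictions))  # NaN, because one prediction is NaN
print(rmse(kept))         # finite, computed on the two scorable rows
```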

In [9]:
# Import ALS from PySpark ML recommendation module
from pyspark.ml.recommendation import ALS

# Initialize ALS model with specified parameters
# userCol: Column name for user IDs
# itemCol: Column name for item IDs
# ratingCol: Column name for ratings
# nonnegative: Whether to use non-negative constraint for least squares
# coldStartStrategy: Strategy for dealing with unknown or new users/items during prediction ("drop" will drop any rows in the DataFrame of predictions that contain NaN values)
# implicitPrefs: Whether to treat ratings as implicit feedback (default is False)

als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating", nonnegative=True, coldStartStrategy="drop", implicitPrefs=False)
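
For intuition about what ALS does, the sketch below runs alternating least squares with a single latent factor (rank 1) on a toy rating list. It is a simplified illustration, not Spark's implementation: the data, the function name `als_rank1`, and the plain ridge penalty are assumptions made for this example (Spark, for instance, scales the regularization by the number of ratings per user or item).

```python
# Toy (user, item, rating) triples; user 2 has not rated item 0
toy_ratings = [(0, 0, 4.0), (0, 1, 5.0), (1, 0, 2.0), (1, 1, 2.5), (2, 1, 5.0)]


def als_rank1(ratings, n_users, n_items, reg, iters=20):
    """Alternate closed-form ridge updates for rank-1 user and item factors."""
    u = [1.0] * n_users  # one factor per user
    v = [1.0] * n_items  # one factor per item
    for _ in range(iters):
        # Hold item factors fixed and solve for each user factor
        for i in range(n_users):
            seen = [(j, r) for (a, j, r) in ratings if a == i]
            u[i] = sum(r * v[j] for j, r in seen) / (sum(v[j] ** 2 for j, _ in seen) + reg)
        # Hold user factors fixed and solve for each item factor
        for j in range(n_items):
            seen = [(i, r) for (i, b, r) in ratings if b == j]
            v[j] = sum(r * u[i] for i, r in seen) / (sum(u[i] ** 2 for i, _ in seen) + reg)
    return u, v


u, v = als_rank1(toy_ratings, n_users=3, n_items=2, reg=0.05)

# A predicted rating is the product of the user factor and the item factor
print("Observed (user 0, item 0):", u[0] * v[0])
print("Unobserved (user 2, item 0):", u[2] * v[0])
```

With positive ratings and a positive starting point, every update here stays non-negative, which mirrors the intent of `nonnegative=True`.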


Now we'll build the hyperparameter grid with `ParamGridBuilder`:

In [10]:
# Import ParamGridBuilder from PySpark ML tuning module
from pyspark.ml.tuning import ParamGridBuilder

# Define parameter grid for hyperparameter tuning
param_grid = ParamGridBuilder() \
                  .addGrid(als.rank, [5, 20]) \
                  .addGrid(als.maxIter, [5]) \
                  .addGrid(als.regParam, [0.01, 0.05, 1]) \
                  .build()
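
The grid is the Cartesian product of the listed values, so 2 ranks × 1 iteration count × 3 regularization values gives 6 combinations. A quick way to check this without Spark is sketched below; the `grid` dictionary simply mirrors the values passed to `addGrid` above.

```python
from itertools import product

# Same values as the addGrid calls above
grid = {"rank": [5, 20], "maxIter": [5], "regParam": [0.01, 0.05, 1]}

# Every combination of one value per hyperparameter
combos = [dict(zip(grid, values)) for values in product(*grid.values())]

for combo in combos:
    print(combo)
print("Number of combinations:", len(combos))
```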

Next, we'll create our evaluator and use RMSE as our metric:

In [11]:
# Import RegressionEvaluator from PySpark ML evaluation module
from pyspark.ml.evaluation import RegressionEvaluator

# Initialize RegressionEvaluator with specified metric, label column, and prediction column
# metricName: The metric to use for evaluation (Root Mean Squared Error in this case)
# labelCol: Column name for the true ratings
# predictionCol: Column name for the predicted ratings
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")

# Print the number of models to be tested based on the parameter grid
print("Number of models to be tested: ", len(param_grid))

Number of models to be tested:  6


Create CrossValidator:

In [12]:
# Import CrossValidator from PySpark ML tuning module
from pyspark.ml.tuning import CrossValidator

# Initialize CrossValidator with specified parameters
# estimator: The estimator (ALS model in this case) to be cross-validated
# estimatorParamMaps: Parameter grid for hyperparameter tuning
# evaluator: The evaluator to use for model selection (RegressionEvaluator in this case)
# numFolds: Number of folds for cross-validation
cv = CrossValidator(estimator=als,
                    estimatorParamMaps=param_grid,
                    evaluator=evaluator,
                    numFolds=5)
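
With 5 folds, each of the 6 parameter combinations is trained 5 times, each time holding out a different fifth of the training data for validation, and the best combination is then refit on all of the training data. The sketch below illustrates the fold logic in plain Python. It uses contiguous folds for readability, which is a simplification (Spark assigns rows to folds randomly), and `k_fold_indices` is a hypothetical helper.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_idx, valid_idx) pairs for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_rows // k + (1 if f < n_rows % k else 0) for f in range(k)]
    start = 0
    for size in fold_sizes:
        valid = list(range(start, start + size))
        train = [i for i in range(n_rows) if i < start or i >= start + size]
        yield train, valid
        start += size


folds = list(k_fold_indices(n_rows=10, k=5))
for train_idx, valid_idx in folds:
    print(valid_idx, "held out;", len(train_idx), "rows used for training")

# Total number of ALS fits for this notebook's settings
n_param_maps, n_folds = 6, 5
total_fits = n_param_maps * n_folds + 1  # +1 for the final refit of the best setting
print("Total fits:", total_fits)
```

This is why the `cv.fit` call is by far the slowest cell in the notebook.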

Fit Data:

In [13]:
# Fit the CrossValidator to the training data, which performs model selection
model = cv.fit(training_data)

# Get the best model selected by the CrossValidator
best_model = model.bestModel

Get information on the best model:

In [14]:
# Print the type of the best_model
print(type(best_model))

# Print information about the best model
print("\n**Best Model**")
print("Rank:", best_model.rank)  # Print the rank of the best model
print("MaxIter:", best_model._java_obj.parent().getMaxIter())  # Print the maximum number of iterations of the best model
print("RegParam:", best_model._java_obj.parent().getRegParam())  # Print the regularization parameter of the best model

<class 'pyspark.ml.recommendation.ALSModel'>

**Best Model**
Rank: 5
MaxIter: 5
RegParam: 0.05


## Performance Evaluation

Let's generate predictions on the test data:

In [15]:
# Make predictions on the test data using the best model
test_predictions = best_model.transform(test_data)

# Show the first 20 rows (the default for show()) of the predictions DataFrame
test_predictions.show(truncate=False)

+------+-------+------+----------+
|userId|movieId|rating|prediction|
+------+-------+------+----------+
|580   |44022  |3.5   |3.7855005 |
|133   |471    |4.0   |2.9903438 |
|108   |1959   |5.0   |4.121835  |
|155   |3175   |4.0   |3.4476712 |
|34    |3997   |2.0   |2.8021786 |
|368   |1645   |3.0   |2.8253365 |
|28    |1645   |2.5   |3.069725  |
|587   |1238   |4.0   |3.861096  |
|332   |1645   |3.5   |3.0652223 |
|577   |1959   |4.0   |3.559896  |
|384   |2122   |1.0   |2.7819238 |
|606   |1645   |3.5   |3.1752186 |
|606   |1829   |3.5   |2.1398973 |
|606   |1959   |3.5   |3.8126364 |
|223   |1342   |1.0   |4.016542  |
|91    |471    |1.0   |3.1500528 |
|91    |6620   |3.5   |3.367694  |
|330   |1580   |4.0   |3.2828496 |
|157   |3175   |2.0   |3.9116929 |
|232   |68135  |4.0   |3.2063005 |
+------+-------+------+----------+
only showing top 20 rows



In [16]:
# Evaluate the predictions using the evaluator
RMSE = evaluator.evaluate(test_predictions)

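# Print the root mean squared error (RMSE), which is in rating units (stars)
print(RMSE)
# Print the RMSE scaled by 100; note that this is not a percentage error
print(RMSE * 100)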

0.9120234454843942
91.20234454843941


## Generate Recommendations:

In [17]:
# Generate top movie recommendations for each user using the best model
userRecs = best_model.recommendForAllUsers(5)

# Show the first rows of the recommendations DataFrame without truncating the column values
userRecs.show(truncate=False)

+------+---------------------------------------------------------------------------------------------------------+
|userId|recommendations                                                                                          |
+------+---------------------------------------------------------------------------------------------------------+
|1     |[{157775, 6.435184}, {150554, 6.435184}, {149566, 6.435184}, {149508, 6.435184}, {142444, 6.435184}]     |
|2     |[{8477, 6.008811}, {1916, 5.985858}, {157775, 5.719481}, {149566, 5.719481}, {147410, 5.719481}]         |
|3     |[{3837, 7.170146}, {74754, 6.3216515}, {446, 5.8772163}, {42018, 5.770948}, {52712, 5.720191}]           |
|4     |[{68945, 5.804837}, {6376, 5.7241693}, {7579, 5.5963883}, {171867, 5.45473}, {84847, 5.443246}]          |
|5     |[{157775, 5.8970184}, {149566, 5.8970184}, {143031, 5.8970184}, {140133, 5.8970184}, {139640, 5.8970184}]|
|6     |[{8477, 6.354791}, {157775, 6.128828}, {149566, 6.128828}, {147410, 6.12

In [18]:
# Convert the recommendations DataFrame to a Pandas DataFrame
userRecs_pandas = userRecs.toPandas()

# Display the first few rows of the Pandas DataFrame
userRecs_pandas.head()

   userId                                    recommendations
0       1  [(157775, 6.435184001922607), (150554, 6.43518...
1       2  [(8477, 6.008810997009277), (1916, 5.985857963...
2       3  [(3837, 7.1701459884643555), (74754, 6.3216514...
3       4  [(68945, 5.804837226867676), (6376, 5.72416925...
4       5  [(157775, 5.8970184326171875), (149566, 5.8970...


In [19]:
# Call the function to get the title for movieId 2906
get_movie_title_from_id(2906)

'Random Hearts (1999)'

In [20]:
# Call the function to get recommended movies for user 10
get_user_recommended_movies(userRecs, 10)

Movie: 
Zed & Two Noughts, A (1985)
Predicted Rating: 5.8545379638671875

Movie: 
Buffalo '66 (a.k.a. Buffalo 66) (1998)
Predicted Rating: 5.813051223754883

Movie: 
Emma (2009)
Predicted Rating: 5.685601711273193

Movie: 
Pride and Prejudice (1995)
Predicted Rating: 5.434269905090332

Movie: 
Superstar (1999)
Predicted Rating: 5.258571147918701
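
The predicted ratings above are all greater than 5. `nonnegative=True` keeps the factors at or above zero but does not cap predictions at the top of the rating scale. If the values are shown to users as ratings, one option is to clip them to the scale. The sketch below assumes a scale of 0.5 to 5.0, which is an assumption and not something stated in this notebook; `clip_rating` is a hypothetical helper.

```python
# Assumed rating scale (not confirmed by the notebook)
MIN_RATING, MAX_RATING = 0.5, 5.0


def clip_rating(pred, lo=MIN_RATING, hi=MAX_RATING):
    """Clamp a predicted rating to the [lo, hi] range."""
    return max(lo, min(hi, pred))


# Predicted ratings for user 10, copied from the output above
user_10_predictions = [
    5.8545379638671875,
    5.813051223754883,
    5.685601711273193,
    5.434269905090332,
    5.258571147918701,
]

clipped = [clip_rating(p) for p in user_10_predictions]
print(clipped)
```

Clipping changes only the displayed value. The ranking of recommendations should still use the raw predictions, since clipped values can tie.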

