# TITLE

# CONTENTS

# I. BUSINESS UNDERSTANDING

# II. DATA UNDERSTANDING

**Iteration 1:** Collaborative Filtering Using Matrix Factorization

Implement collaborative filtering with Matrix Factorization using ALS.
Evaluate the model using RMSE or MAE.
Generate top-5 movie recommendations for users.

**Iteration 2:** Hybrid Approach with Content-Based Filtering

Combine collaborative and content-based recommendations.
Re-evaluate the hybrid model.
Compare scores with the pure collaborative filtering model from Iteration 1.

**Iteration 3:** Addressing the Cold Start Problem

Implement a temporary solution for new users (cold start) using demographics or movie popularity.
Enhance the model with implicit feedback or deep learning approaches.
Validate the cold start handling mechanism.

**Iteration 4:** Implementing Surprise for Collaborative Filtering

Integrate the Surprise library.
Choose and optimize collaborative filtering algorithms from Surprise.
Evaluate the model using Surprise's metrics.
Generate top-5 movie recommendations for users using the best-performing Surprise-based collaborative filtering model. Compare to past model iterations.

In [1]:
import pandas as pd

from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS
from pyspark.sql.functions import expr, explode, col
from pyspark.sql.types import StructType, StructField, IntegerType, FloatType

from pyspark.sql.functions import abs, col
from pyspark.ml.evaluation import RegressionEvaluator


from surprise import Dataset, Reader, SVD
from surprise.model_selection import train_test_split
from surprise import accuracy

In [2]:
ratings_df = pd.read_csv('Data/ratings.csv')
movies_df = pd.read_csv('Data/movies.csv')
tags_df = pd.read_csv('Data/tags.csv')
links_df = pd.read_csv('Data/links.csv')

In [3]:
ratings_df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [4]:
movies_df.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [5]:
ratings_df.shape

(100836, 4)

In [6]:
ratings_df.describe()

Unnamed: 0,userId,movieId,rating,timestamp
count,100836.0,100836.0,100836.0,100836.0
mean,326.127564,19435.295718,3.501557,1205946000.0
std,182.618491,35530.987199,1.042529,216261000.0
min,1.0,1.0,0.5,828124600.0
25%,177.0,1199.0,3.0,1019124000.0
50%,325.0,2991.0,3.5,1186087000.0
75%,477.0,8122.0,4.0,1435994000.0
max,610.0,193609.0,5.0,1537799000.0


In [7]:
ratings_df.isnull().sum()

userId       0
movieId      0
rating       0
timestamp    0
dtype: int64

It looks like our "ratings_df" is in good shape with no missing values. We have a dataset with 100,836 ratings from 610 unique users for various movies. The "movieId" column contains IDs for the movies, which range from 1 to 193,609. User ratings range from 0.5 to 5.0, with an average rating of approximately 3.50. The "timestamp" column contains timestamp information, which might not be directly relevant for collaborative filtering but can be used for time-based recommendations or analysis. With a clean and well-structured dataset, we can proceed with a collaborative filtering approach to our first model iteration.

# MODELING

## Matrix Factorization with ALS -  PySpark

In [8]:
# Create a Spark session
spark = SparkSession.builder.appName("MovieRecommendations").getOrCreate()

# Convert your ratings data to a Spark DataFrame
ratings_spark = spark.createDataFrame(filtered_ratings_df)

# Define ALS model parameters
als = ALS(
    rank=10,              # Number of latent factors (you can adjust this)
    maxIter=10,           # Number of iterations
    regParam=0.01,        # Regularization parameter (you can adjust this)
    userCol="userId",
    itemCol="movieId",
    ratingCol="rating",
)

# Fit the ALS model (includes matrix factorization) to the entire dataset
als_model = als.fit(ratings_spark)

# Generate recommendations for all users
user_recommendations = als_model.recommendForAllUsers(10)  # Number of recommendations per user

# Filter recommendations for a specific user (e.g., user with ID 1)
user_id = 1
user_recommendations = user_recommendations.filter(col("userId") == user_id)

# Extract and rename columns for readability
user_recommendations = user_recommendations.select(
    col("userId"),
    explode(col("recommendations")).alias("recommendations")
)

# Extract movieId and rating from the recommendations struct
user_recommendations = user_recommendations.select(
    "userId",
    col("recommendations.movieId").alias("movieId"),
    col("recommendations.rating").alias("predictedRating")
)

# Sort the recommendations by predicted rating in descending order
user_recommendations = user_recommendations.orderBy(col("predictedRating").desc())

user_recommendations.show()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/11/05 11:13:33 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/11/05 11:13:45 WARN GarbageCollectionMetrics: To enable non-built-in garbage collector(s) List(G1 Concurrent GC), users should configure it(them) to spark.eventLog.gcMetrics.youngGenerationGarbageCollectors or spark.eventLog.gcMetrics.oldGenerationGarbageCollectors
23/11/05 11:13:46 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS
23/11/05 11:13:46 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.blas.VectorBLAS
23/11/05 11:13:47 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.lapack.JNILAPACK

+------+-------+---------------+
|userId|movieId|predictedRating|
+------+-------+---------------+
|     1|   6380|        7.36554|
|     1|   2693|       7.282355|
|     1|   3846|      6.5048866|
|     1|    232|      6.4766016|
|     1|    971|      6.3779616|
+------+-------+---------------+



                                                                                

In [9]:
# Evaluate the model on the entire dataset using RMSE
predictions = als_model.transform(ratings_spark)
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")
rmse = evaluator.evaluate(predictions)
print(f"Root Mean Squared Error (RMSE) on the entire dataset: {rmse}")

[Stage 160:>                                                        (0 + 2) / 2]

Root Mean Squared Error (RMSE) on the entire dataset: 0.50513894782959


                                                                                

In [11]:
# Calculate the absolute difference between the actual rating and predicted rating
predictions = als_model.transform(ratings_spark)
predictions = predictions.withColumn("abs_diff", abs(col("rating") - col("prediction")))

# Calculate the mean of the absolute differences
mae = predictions.select("abs_diff").agg({"abs_diff": "mean"}).collect()[0][0]

# Print the MAE
print(f"Mean Absolute Error (MAE) on the entire dataset: {mae}")

                                                                                

Mean Absolute Error (MAE) on the entire dataset: 0.35528256076354925


## Surprise collab filtering model: matrix factorization with SVD

In [12]:
# Create a Surprise dataset from the ratings data in ratings_df
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(ratings_df[['userId', 'movieId', 'rating']], reader)

In [13]:
# Split the data into training and testing sets
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)

In [14]:
# Initialize the SVD algorithm
surprise_model_1 = SVD(n_factors=100, n_epochs=20, random_state=42)

In [15]:
# Train the model on the training set
surprise_model_1.fit(trainset)

# Make predictions on the test set
surprise_preds = surprise_model_1.test(testset)

In [16]:
# Now, you can make recommendations for a specific user
user_id = 100  # Change this to the user for whom you want to make recommendations
user_movies = ratings_df[ratings_df['userId'] == user_id]['movieId']

In [17]:
# Get a list of movies the user has not rated
all_movies = ratings_df['movieId'].unique()
unrated_movies = set(all_movies) - set(user_movies)

In [18]:
# Predict ratings for unrated movies
user_ratings = []
for movie_id in unrated_movies:
    pred = surprise_model_1.predict(user_id, movie_id)
    user_ratings.append((movie_id, pred.est))

In [19]:
# Sort the predicted ratings in descending order
user_ratings.sort(key=lambda x: x[1], reverse=True)

# Get the top N recommended movies (e.g., top 10)
top_n = 10
recommended_movies = user_ratings[:top_n]

In [20]:
# Print the recommended movies
print(f'Top {top_n} recommended movies for user {user_id}:')
for movie_id, estimated_rating in recommended_movies:
    movie_title = movies_df[movies_df['movieId'] == movie_id]['title'].values[0]
    print(f'{movie_title} (Estimated Rating: {estimated_rating:.2f})')

Top 10 recommended movies for user 100:
Boondock Saints, The (2000) (Estimated Rating: 4.65)
Boot, Das (Boat, The) (1981) (Estimated Rating: 4.60)
Braveheart (1995) (Estimated Rating: 4.59)
In the Name of the Father (1993) (Estimated Rating: 4.59)
Rear Window (1954) (Estimated Rating: 4.58)
Third Man, The (1949) (Estimated Rating: 4.58)
Casablanca (1942) (Estimated Rating: 4.58)
Some Like It Hot (1959) (Estimated Rating: 4.57)
Seventh Seal, The (Sjunde inseglet, Det) (1957) (Estimated Rating: 4.56)
Seven Samurai (Shichinin no samurai) (1954) (Estimated Rating: 4.55)


In [21]:
# Evaluate the model's performance using RMSE (Root Mean Squared Error)
rmse = accuracy.rmse(surprise_preds)
print(f'RMSE: {rmse}')

RMSE: 0.8807
RMSE: 0.8807462819979623


In [22]:
# Calculate MAE (Mean Absolute Error)
mae = accuracy.mae(surprise_preds)
print(f'MAE: {mae}')

MAE:  0.6766
MAE: 0.6765729095860605


## model 3: surprise matrix factorization with SVD - tuned hyperparameters

In [25]:
from surprise import SVD
from surprise import Dataset, Reader
from surprise.model_selection import train_test_split
from surprise import accuracy
import numpy as np

# Create a Surprise dataset from the ratings data in ratings_df
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(ratings_df[['userId', 'movieId', 'rating']], reader)

# Split the data into training and testing sets
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)

# Define hyperparameter search space
n_factors = [100, 150, 200]
n_epochs = [20, 25, 30]
reg_all = [0.02, 0.1, 0.2]

best_rmse = np.inf
best_mae = np.inf
best_model = None

# Perform a manual grid search
for n in n_factors:
    for epoch in n_epochs:
        for reg in reg_all:
            print(f"Training model with n_factors={n}, n_epochs={epoch}, and reg_all={reg}")
            # Initialize the SVD algorithm
            surprise_model = SVD(n_factors=n, n_epochs=epoch, reg_all=reg, random_state=42)

            # Train the model on the training set
            surprise_model.fit(trainset)

            # Make predictions on the test set
            surprise_preds = surprise_model.test(testset)

            # Evaluate the model's performance using RMSE (Root Mean Squared Error)
            rmse = accuracy.rmse(surprise_preds)

            # Calculate MAE (Mean Absolute Error)
            mae = accuracy.mae(surprise_preds)

            print(f'RMSE: {rmse}, MAE: {mae}')

            # Check if this model's RMSE and MAE are better than the best so far
            if rmse < best_rmse and mae < best_mae:
                best_rmse = rmse
                best_mae = mae
                best_model = surprise_model

print(f'Best RMSE: {best_rmse}')
print(f'Best MAE: {best_mae}')

Training model with n_factors=100, n_epochs=20, and reg_all=0.02
RMSE: 0.8807
MAE:  0.6766
RMSE: 0.8807462819979623, MAE: 0.6765729095860605
Training model with n_factors=100, n_epochs=20, and reg_all=0.1
RMSE: 0.8768
MAE:  0.6747
RMSE: 0.8768471300806494, MAE: 0.6746908977104873
Training model with n_factors=100, n_epochs=20, and reg_all=0.2
RMSE: 0.8804
MAE:  0.6780
RMSE: 0.8804417094197639, MAE: 0.6780371572841168
Training model with n_factors=100, n_epochs=25, and reg_all=0.02
RMSE: 0.8810
MAE:  0.6760
RMSE: 0.881006689163421, MAE: 0.6759554089170796
Training model with n_factors=100, n_epochs=25, and reg_all=0.1
RMSE: 0.8742
MAE:  0.6722
RMSE: 0.8741750964150049, MAE: 0.6721650045131462
Training model with n_factors=100, n_epochs=25, and reg_all=0.2
RMSE: 0.8786
MAE:  0.6763
RMSE: 0.8786425364637055, MAE: 0.676336252694741
Training model with n_factors=100, n_epochs=30, and reg_all=0.02
RMSE: 0.8827
MAE:  0.6766
RMSE: 0.8827039289713046, MAE: 0.676625096950452
Training model with 