# TITLE

# CONTENTS

# BUSINESS UNDERSTANDING

# DATA UNDERSTANDING

**Iteration 1:** Collaborative Filtering Using Matrix Factorization

- Implement collaborative filtering with matrix factorization using ALS.
- Evaluate the model using RMSE or MAE.
- Generate top-5 movie recommendations for users.

**Iteration 2:** Hybrid Approach with Content-Based Filtering

- Combine collaborative and content-based recommendations.
- Re-evaluate the hybrid model.
- Compare scores with the pure collaborative filtering model from Iteration 1.

**Iteration 3:** Addressing the Cold Start Problem

- Implement a temporary solution for new users (cold start) using demographics or movie popularity.
- Enhance the model with implicit feedback or deep learning approaches.
- Validate the cold start handling mechanism.

**Iteration 4:** Implementing Surprise for Collaborative Filtering

- Integrate the Surprise library.
- Choose and optimize collaborative filtering algorithms from Surprise.
- Evaluate the model using Surprise's metrics.
- Generate top-5 movie recommendations for users using the best-performing Surprise-based collaborative filtering model.
- Compare the results against the earlier model iterations.

In [1]:
import pandas as pd

ratings_df = pd.read_csv('Data/ratings.csv')
movies_df = pd.read_csv('Data/movies.csv')
tags_df = pd.read_csv('Data/tags.csv')
links_df = pd.read_csv('Data/links.csv')

In [2]:
ratings_df.head()

| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1 | 4.0 | 964982703 |
| 1 | 1 | 3 | 4.0 | 964981247 |
| 2 | 1 | 6 | 4.0 | 964982224 |
| 3 | 1 | 47 | 5.0 | 964983815 |
| 4 | 1 | 50 | 5.0 | 964982931 |


In [3]:
movies_df.head()

| | movieId | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |


In [4]:
ratings_df.shape

(100836, 4)

In [5]:
ratings_df.describe()

| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| count | 100836.0 | 100836.0 | 100836.0 | 100836.0 |
| mean | 326.127564 | 19435.295718 | 3.501557 | 1205946000.0 |
| std | 182.618491 | 35530.987199 | 1.042529 | 216261000.0 |
| min | 1.0 | 1.0 | 0.5 | 828124600.0 |
| 25% | 177.0 | 1199.0 | 3.0 | 1019124000.0 |
| 50% | 325.0 | 2991.0 | 3.5 | 1186087000.0 |
| 75% | 477.0 | 8122.0 | 4.0 | 1435994000.0 |
| max | 610.0 | 193609.0 | 5.0 | 1537799000.0 |


In [6]:
ratings_df.isnull().sum()

userId       0
movieId      0
rating       0
timestamp    0
dtype: int64

`ratings_df` has no missing values. It contains 100,836 ratings from 610 users (user IDs run from 1 to 610). Movie IDs run from 1 to 193,609, but they are not contiguous: the median `movieId` is only 2,991, so the maximum ID says little about how many distinct movies were rated. Ratings range from 0.5 to 5.0, with a mean of about 3.50 and a standard deviation of about 1.04. The `timestamp` column holds Unix epoch seconds; it is not needed for collaborative filtering, but it could support time-based analysis or recommendations later. With the data clean and well structured, we can proceed to collaborative filtering for the first model iteration.
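Because collaborative filtering operates on the user-item matrix, it helps to know how sparse that matrix is. The following is a minimal sketch, assuming the ratings are available as `(userId, movieId)` pairs; the helper name `matrix_density` and the toy pairs are illustrative and not part of the notebook.

```python
def matrix_density(pairs):
    """Return (n_users, n_items, density) for an iterable of (user, item) pairs."""
    pairs = set(pairs)  # de-duplicate repeated (user, item) pairs
    users = {u for u, _ in pairs}
    items = {i for _, i in pairs}
    if not users or not items:
        return 0, 0, 0.0
    # Density = observed cells / all cells of the user-item matrix
    density = len(pairs) / (len(users) * len(items))
    return len(users), len(items), density


# Toy data (hypothetical): 3 users, 4 movies, 6 observed ratings
toy_pairs = [(1, 10), (1, 20), (2, 10), (2, 30), (3, 40), (3, 20)]
n_users, n_items, density = matrix_density(toy_pairs)
print(n_users, n_items, density)
```

On the real data, the same helper could be fed `zip(ratings_df["userId"], ratings_df["movieId"])`.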

# MODELING

## Matrix Factorization with ALS - PySpark

In [7]:
# Import only the Spark pieces used in this cell
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS
from pyspark.sql.functions import explode, col

# Create a Spark session
spark = SparkSession.builder.appName("MovieRecommendations").getOrCreate()

# Convert your ratings data to a Spark DataFrame
ratings_spark = spark.createDataFrame(ratings_df)

# Define ALS model parameters
als = ALS(
    rank=10,              # Number of latent factors (you can adjust this)
    maxIter=10,           # Number of iterations
    regParam=0.01,        # Regularization parameter (you can adjust this)
    userCol="userId",
    itemCol="movieId",
    ratingCol="rating",
)

# Fit the ALS matrix factorization to the entire dataset (no train/test split)
model = als.fit(ratings_spark)

# Generate recommendations for all users
user_recommendations = model.recommendForAllUsers(5)  # Number of recommendations per user

# Filter recommendations for a specific user (e.g., user with ID 1)
user_id = 1
user_recommendations = user_recommendations.filter(col("userId") == user_id)

# Extract and rename columns for readability
user_recommendations = user_recommendations.select(
    col("userId"),
    explode(col("recommendations")).alias("recommendations")
)

# Extract movieId and rating from the recommendations struct
user_recommendations = user_recommendations.select(
    "userId",
    col("recommendations.movieId").alias("movieId"),
    col("recommendations.rating").alias("predictedRating")
)

# Sort the recommendations by predicted rating in descending order
user_recommendations = user_recommendations.orderBy(col("predictedRating").desc())

user_recommendations.show()

+------+-------+---------------+
|userId|movieId|predictedRating|
+------+-------+---------------+
|     1|   3036|       6.909803|
|     1|   7748|       6.611859|
|     1|   1096|      6.5637517|
|     1|   3814|        6.45934|
|     1|   4396|      6.0607758|
+------+-------+---------------+
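The predicted ratings above exceed 5.0 even though the observed ratings lie between 0.5 and 5.0; ALS places no bound on its predictions. One possible post-processing step, sketched here under the assumption that out-of-range scores should simply be clamped for display, is shown below. The helper `clip_rating` is hypothetical.

```python
MIN_RATING, MAX_RATING = 0.5, 5.0  # rating bounds reported by ratings_df.describe()


def clip_rating(score, low=MIN_RATING, high=MAX_RATING):
    """Clamp a predicted rating into the [low, high] interval."""
    return max(low, min(high, score))


# Predicted ratings for user 1, copied from the table above
raw = [6.909803, 6.611859, 6.5637517, 6.45934, 6.0607758]
clipped = [clip_rating(s) for s in raw]
print(clipped)
```

All five values clamp to 5.0, which hides their relative order, so ranking should still use the raw scores. In the PySpark cell, a larger `regParam` may also reduce this kind of overshoot.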



                                                                                

In [8]:
from pyspark.ml.evaluation import RegressionEvaluator

# Compute RMSE on the same data the model was fit on,
# so this is training error rather than a held-out estimate
predictions = model.transform(ratings_spark)
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")
rmse = evaluator.evaluate(predictions)
print(f"Root Mean Squared Error (RMSE) on the entire dataset: {rmse}")

Root Mean Squared Error (RMSE) on the entire dataset: 0.5030397826355685
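
This RMSE is measured on the same ratings the model was fit on, so it likely understates the error on unseen ratings. A minimal, library-free sketch of a hold-out split follows; the helper `holdout_split`, the 80/20 ratio, and the seed are assumptions for illustration. In PySpark the analogous step would be `ratings_spark.randomSplit([0.8, 0.2], seed=42)`, fitting on the first part and evaluating on the second, with `coldStartStrategy="drop"` set on `ALS` so that users or movies absent from the training part do not produce NaN predictions.

```python
import random


def holdout_split(rows, test_fraction=0.2, seed=42):
    """Shuffle rows reproducibly and split them into (train, test) lists."""
    rows = list(rows)
    rng = random.Random(seed)  # local RNG so the global random state is untouched
    rng.shuffle(rows)
    n_test = int(round(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]


# Toy ratings (hypothetical): (userId, movieId, rating)
toy = [(u, m, 3.0 + 0.5 * ((u + m) % 3)) for u in range(1, 6) for m in range(1, 5)]
train, test = holdout_split(toy)
print(len(train), len(test))
```

Every rating lands in exactly one of the two parts, and the fixed seed makes the split repeatable across runs.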


In [9]:
from pyspark.sql import functions as F  # namespaced import avoids shadowing Python's built-in abs

# Absolute error between the actual and predicted rating (again on the training data)
predictions = model.transform(ratings_spark)
predictions = predictions.withColumn("abs_diff", F.abs(F.col("rating") - F.col("prediction")))

# Mean of the absolute errors
mae = predictions.agg(F.mean("abs_diff")).collect()[0][0]

# Print the MAE
print(f"Mean Absolute Error (MAE) on the entire dataset: {mae}")

Mean Absolute Error (MAE) on the entire dataset: 0.3537580657590326
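
For reference, both metrics reduce to a few lines of arithmetic. The sketch below uses plain Python and made-up numbers (not the notebook's predictions) to show what is being computed; the function names `rmse_score` and `mae_score` are illustrative.

```python
import math


def rmse_score(actual, predicted):
    """Root mean squared error: sqrt(mean((a - p)^2))."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae_score(actual, predicted):
    """Mean absolute error: mean(|a - p|)."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)


# Made-up ratings for illustration only
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.0, 3.0, 4.0, 4.0]
print(rmse_score(actual, predicted), mae_score(actual, predicted))
```

RMSE is never smaller than MAE because squaring weights large errors more heavily, which is consistent with the roughly 0.50 and 0.35 reported above. `RegressionEvaluator` also accepts `metricName="mae"`, which would avoid the manual column arithmetic.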
