# DES431 Project: Recommendation System

# Background

**MovieLens** is a movie recommendation system operated by GroupLens, a research group at the University of Minnesota. MovieLens has been developed to provide personalized movie recommendations to its users based on their viewing history and preferences.

# Task

1. This project is to be completed by a group of three students.
2. Propose and implement your own recommendation system based on the MovieLens dataset.
   - Use `ratings_train.csv` as the training set and `ratings_valid.csv` as the validation set.
   - Your recommendation system may utilize information from `movies.csv` for making recommendations.
   - The structure of the data files is detailed at `https://files.grouplens.org/datasets/movielens/ml-latest-small-README.html`.
   - The goal of the recommendation system is to minimize the root-mean-square error (RMSE), i.e., to minimize the difference between the predicted and actual ratings.
   - Implement a function named `predict_rating`. This function should accept a DataFrame with two columns: `userId` and `movieId`, and return the DataFrame with an additional column named `rating`, containing predicted ratings of a `movieId` by a `userId`.
   - The `predict_rating` function must be compatible with an undisclosed test set that has the same format as the validation set. Your implementation will be evaluated on the test set. Failure to comply will result in a 50% deduction of your score.
   - You are required to modify the given program to enhance recommendation quality. Submitting the unaltered original program will be considered plagiarism.
3. Prepare slides for a 7-minute presentation that explains your proposed technique and algorithm for making recommendations, and demonstrates your RMSE results on the validation set.
4. Submit your Python notebook and the presentation slides in PDF format via Google Classroom by April 30, 2024, at 23:59. All members of the group must individually submit their work to Google Classroom. Late submissions will not be accepted and will incur a 10% deduction. Do not procrastinate. Plagiarism and code duplication will be rigorously checked.
5. Present your work on May 1, 2024, within a 7-minute timeframe. Presentations exceeding 7 minutes will result in point deductions.
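
The `predict_rating` contract and the RMSE objective described in item 2 can be illustrated with a minimal sketch. It assumes a global-mean baseline and a tiny invented training frame; the names `toy_train`, `toy_valid`, `predict_rating_baseline`, and `rmse` are hypothetical and are not part of the provided program.

```python
import numpy as np
import pandas as pd

# Invented toy training data (not taken from MovieLens).
toy_train = pd.DataFrame({
    "userId": [1, 1, 2, 2],
    "movieId": [10, 20, 10, 30],
    "rating": [4.0, 3.0, 5.0, 2.0],
})

def predict_rating_baseline(df):
    """Return a copy of df with an added `rating` column (global-mean baseline)."""
    out = df[["userId", "movieId"]].copy()
    out["rating"] = toy_train["rating"].mean()
    return out

def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Invented validation pairs with their true ratings.
toy_valid = pd.DataFrame({
    "userId": [1, 2],
    "movieId": [30, 20],
    "rating": [3.0, 4.0],
})

toy_pred = predict_rating_baseline(toy_valid[["userId", "movieId"]])
toy_rmse = rmse(toy_valid["rating"], toy_pred["rating"])
print(toy_pred)
print(f"RMSE = {toy_rmse:.4f}")
```

The baseline predicts 3.5 for every pair, so both toy predictions are off by 0.5 and the RMSE is 0.5. Any real submission should beat this kind of constant predictor on the validation set.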


In [171]:
# Edit this cell for the group name and members
# Approach: content-based filtering with a TF-IDF matrix over combined titles and genres,
# plus title and genre similarity

In [172]:
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.feature_extraction.text import TfidfVectorizer


# Loading data

In [173]:
ratings_train = pd.read_csv('ratings_train.csv')
ratings_valid = pd.read_csv('ratings_valid.csv')
movies = pd.read_csv('movies.csv')

In [174]:
ratings_train.describe()

,userId,movieId,rating,timestamp
count,96464.0,96464.0,96464.0,96464.0
mean,327.86935,19105.768059,3.509325,1204483000.0
std,183.95296,35243.409786,1.041385,216528300.0
min,1.0,1.0,0.5,828124600.0
25%,177.0,1196.0,3.0,1013395000.0
50%,330.0,2959.0,3.5,1182909000.0
75%,479.0,7486.0,4.0,1435993000.0
max,610.0,193609.0,5.0,1537799000.0


In [175]:
ratings_train.head(10)

,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
5,1,70,3.0,964982400
6,1,101,5.0,964980868
7,1,110,4.0,964982176
8,1,151,5.0,964984041
9,1,157,5.0,964984100


In [176]:
movies.head(10)

,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
5,6,Heat (1995),Action|Crime|Thriller
6,7,Sabrina (1995),Comedy|Romance
7,8,Tom and Huck (1995),Adventure|Children
8,9,Sudden Death (1995),Action
9,10,GoldenEye (1995),Action|Adventure|Thriller


# Constructing model and predicting ratings

In [177]:
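# TF-IDF over the combined title and genre text
# Series.replace only matches whole cell values, so use .str.replace to swap the '|' separators
movies['combined_features'] = movies['title'] + " " + movies['genres'].str.replace('|', ' ', regex=False)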

# Create a TF-IDF Vectorizer
tfidf = TfidfVectorizer(stop_words='english')

# Apply TF-IDF to combined features
tfidf_matrix = tfidf.fit_transform(movies['combined_features'])

# Calculate cosine similarity matrix from the TF-IDF matrix
content_similarity = cosine_similarity(tfidf_matrix)

# Convert to DataFrame, ensuring correct indexing
content_similarity_df = pd.DataFrame(content_similarity, index=movies['movieId'], columns=movies['movieId'])
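
To make the TF-IDF step concrete, here is a minimal sketch on three rows copied from the `movies.head(10)` output above. The names `toy_movies`, `toy_text`, `toy_tfidf`, and `toy_sim` are hypothetical, and the vectorizer settings mirror the cell above rather than adding anything new.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Three movies copied from the movies.head(10) output.
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 6],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Heat (1995)"],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Adventure|Children|Fantasy",
        "Action|Crime|Thriller",
    ],
})

# Combine title and genres into one text field per movie.
toy_text = toy_movies["title"] + " " + toy_movies["genres"].str.replace("|", " ", regex=False)

# Vectorize the text and compare every pair of movies.
toy_tfidf = TfidfVectorizer(stop_words="english").fit_transform(toy_text)
toy_sim = pd.DataFrame(
    cosine_similarity(toy_tfidf),
    index=toy_movies["movieId"],
    columns=toy_movies["movieId"],
)
print(toy_sim.round(3))
```

Toy Story and Jumanji share three genre terms, so they score higher with each other than either does with Heat. The release year in each title is tokenized as well, which is why Heat still has a small nonzero similarity to the other two.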

In [178]:
# Calculate the average rating for each movie

avg_rating = ratings_train.groupby('movieId')['rating'].mean().reset_index()
avg_rating.columns = ['movieId', 'averageRating']
ratings_with_avg = pd.merge(ratings_train, avg_rating, on='movieId', how='left')

# Display the first few entries of the average ratings
print(ratings_with_avg.head())


   userId  movieId  rating  timestamp  averageRating
0       1        1     4.0  964982703       3.920930
1       1        3     4.0  964981247       3.259615
2       1        6     4.0  964982224       3.946078
3       1       47     5.0  964983815       3.975369
4       1       50     5.0  964982931       4.237745


In [179]:
# Create the utility matrix (rows: movies, columns: users); unrated entries are filled with 0
utility_matrix = ratings_train.pivot_table(index='movieId', columns='userId', values='rating').fillna(0)

# Compute the cosine similarity matrix
similarity_matrix = cosine_similarity(utility_matrix)

# Convert the similarity matrix to a DataFrame for better readability and manipulation
similarity_df = pd.DataFrame(similarity_matrix, index=utility_matrix.index, columns=utility_matrix.index)

# Show the similarity matrix
print(similarity_df.head())

movieId    1         2         3         4         5         6         7       \
movieId                                                                         
1        1.000000  0.410562  0.296917  0.035573  0.295509  0.376316  0.277491   
2        0.410562  1.000000  0.282438  0.106415  0.252313  0.297009  0.228576   
3        0.296917  0.282438  1.000000  0.092406  0.405341  0.284257  0.402831   
4        0.035573  0.106415  0.092406  1.000000  0.197276  0.089685  0.275035   
5        0.295509  0.252313  0.405341  0.197276  1.000000  0.292412  0.456264   

movieId    8         9         10      ...  193565  193567  193571  193573  \
movieId                                ...                                   
1        0.115186  0.232586  0.395573  ...     0.0     0.0     0.0     0.0   
2        0.149095  0.044835  0.417693  ...     0.0     0.0     0.0     0.0   
3        0.334122  0.304840  0.242954  ...     0.0     0.0     0.0     0.0   
4        0.168453  0.000000  0.095598  ...
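
The utility matrix and its item-item cosine similarity are easier to follow on a small case. The sketch below uses invented ratings; `toy_ratings`, `toy_utility`, and `toy_item_sim` are hypothetical names.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Invented ratings: users 1 and 2 rated movies 10 and 20, user 3 rated only movie 30.
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3],
    "movieId": [10, 20, 10, 20, 30],
    "rating":  [4.0, 5.0, 4.0, 5.0, 3.0],
})

# Rows are movies, columns are users; a missing rating becomes 0.
toy_utility = toy_ratings.pivot_table(
    index="movieId", columns="userId", values="rating"
).fillna(0)

# Cosine similarity between the rating vectors of every pair of movies.
toy_item_sim = pd.DataFrame(
    cosine_similarity(toy_utility),
    index=toy_utility.index,
    columns=toy_utility.index,
)
print(toy_utility)
print(toy_item_sim)
```

Movies 10 and 20 get a similarity of 1 because their rating vectors are proportional, even though the ratings themselves differ: cosine similarity ignores vector length. Movie 30 shares no raters with the others, so its similarity to them is 0. Filling with 0 also means an unrated movie is treated like a very low rating, which is a known limitation of this encoding.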

In [180]:
# Inputs for predicting the active user's rating: rating-based item similarity,
# plus each movie's average rating as a fallback

# Reuse the utility matrix and the rating-based cosine similarity computed in the previous cell
rating_similarity_df = similarity_df

# Calculate average ratings for each movie
avg_rating = ratings_train.groupby('movieId')['rating'].mean()

In [181]:
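# Split the pipe-separated genre string into a list of genres (run this cell only once,
# because a second run would see lists instead of strings)
movies['genres'] = movies['genres'].apply(lambda x: x.split('|') if isinstance(x, str) else ['Unknown'])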

# Binarize genres
mlb = MultiLabelBinarizer()
genres_matrix = mlb.fit_transform(movies['genres'])

# Compute genre similarity
genre_similarity = cosine_similarity(genres_matrix)
genre_similarity_df = pd.DataFrame(genre_similarity, index=movies['movieId'], columns=movies['movieId'])
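
As an illustration of the genre binarization, the sketch below encodes the genre lists of three movies from the `movies.head(10)` output. The names `toy_genres`, `toy_mlb`, `toy_matrix`, and `toy_genre_sim` are hypothetical.

```python
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MultiLabelBinarizer

# Genre lists for three movies from the movies.head(10) output.
toy_genres = [
    "Adventure|Children|Fantasy".split("|"),  # Jumanji (1995)
    "Adventure|Children".split("|"),          # Tom and Huck (1995)
    "Action".split("|"),                      # Sudden Death (1995)
]

# One binary column per genre, one row per movie.
toy_mlb = MultiLabelBinarizer()
toy_matrix = toy_mlb.fit_transform(toy_genres)
print(toy_mlb.classes_)
print(toy_matrix)

# Cosine similarity between the binary genre vectors.
toy_genre_sim = cosine_similarity(toy_matrix)
print(toy_genre_sim.round(3))
```

Jumanji and Tom and Huck share two genres out of three and two respectively, which gives 2 / sqrt(3 * 2), about 0.816. Sudden Death shares no genre with either, so its similarity to both is 0.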


In [182]:
weight_rating = 0.3
weight_genre = 0.3
weight_content = 0.4

combined_similarity_df = (weight_rating * rating_similarity_df +
                           weight_genre * genre_similarity_df +
                           weight_content * content_similarity_df)
combined_similarity_df.fillna(0, inplace=True)
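
One detail of the weighted sum is worth seeing on a small case: pandas aligns DataFrames by label when adding them, so a movie missing from one matrix produces `NaN` in the sum, and `fillna(0)` then zeroes that entry entirely. The sketch below uses invented similarity values and two components instead of three; all `toy_*` names, `combined`, and `combined_aligned` are hypothetical.

```python
import pandas as pd

# Rating-based similarity covers only movies 1 and 2 (movie 3 has no training ratings).
toy_rating_sim = pd.DataFrame(
    [[1.0, 0.5], [0.5, 1.0]], index=[1, 2], columns=[1, 2]
)
# Genre-based similarity covers movies 1, 2, and 3.
toy_genre_sim = pd.DataFrame(
    [[1.0, 0.2, 0.8], [0.2, 1.0, 0.4], [0.8, 0.4, 1.0]],
    index=[1, 2, 3], columns=[1, 2, 3],
)

w_rating, w_genre = 0.5, 0.5

# Same pattern as the cell above: add, then replace NaN with 0.
combined = (w_rating * toy_rating_sim + w_genre * toy_genre_sim).fillna(0)
print(combined)

# Possible alternative (an assumption, not the notebook's method): pad the smaller
# matrix with zeros first so the genre signal survives for movie 3.
toy_rating_padded = toy_rating_sim.reindex(
    index=toy_genre_sim.index, columns=toy_genre_sim.columns, fill_value=0.0
)
combined_aligned = w_rating * toy_rating_padded + w_genre * toy_genre_sim
print(combined_aligned)
```

In `combined`, every entry involving movie 3 is 0, including its genre similarity of 0.8 to movie 1. In the notebook this only affects movies absent from `ratings_train`, which `predict_rating` already handles with a global-mean fallback, so the padded variant matters only if that fallback is replaced.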

In [183]:
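def predict_rating(df):
    """Predict a rating for every (userId, movieId) pair in df.

    Returns a DataFrame with the columns userId, movieId, and rating.
    """
    global_mean = ratings_train['rating'].mean()
    predictions = []

    for _, row in df.iterrows():
        user_id = row['userId']
        item_id = row['movieId']
        if item_id not in utility_matrix.index or user_id not in utility_matrix.columns:
            # Cold start: the movie or the user never appears in the training set
            prediction = global_mean
        else:
            sim_scores = combined_similarity_df.loc[item_id]
            user_ratings = utility_matrix.loc[:, user_id]
            # Zeros in the utility matrix mark unrated movies, so keep only real ratings
            nonzero_ratings = user_ratings[user_ratings > 0]
            similar_items = sim_scores[nonzero_ratings.index].drop(item_id, errors='ignore')
            # Realign the ratings in case the target movie was dropped above
            nonzero_ratings = nonzero_ratings.loc[similar_items.index]

            if similar_items.sum() > 0:
                # Similarity-weighted average of the user's own ratings
                prediction = np.dot(similar_items, nonzero_ratings) / similar_items.sum()
            else:
                # No similar rated movie: fall back to the movie's average rating
                prediction = avg_rating.get(item_id, np.mean(avg_rating))

        predictions.append((user_id, item_id, prediction))

    return pd.DataFrame(predictions, columns=['userId', 'movieId', 'rating'])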
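
The core of `predict_rating` is a similarity-weighted average of the ratings the user has already given. A minimal worked example with invented numbers follows; `toy_similar_items`, `toy_user_ratings`, `toy_prediction`, and `toy_plain_mean` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Invented similarities between the target movie and three movies the user rated.
toy_similar_items = pd.Series([0.8, 0.1, 0.1], index=[10, 20, 30])
# The user's ratings for those same three movies.
toy_user_ratings = pd.Series([5.0, 3.0, 2.0], index=[10, 20, 30])

# Weighted average: sum(similarity * rating) / sum(similarity).
toy_prediction = np.dot(toy_similar_items, toy_user_ratings) / toy_similar_items.sum()
toy_plain_mean = toy_user_ratings.mean()

print(f"weighted prediction = {toy_prediction:.3f}")
print(f"unweighted mean     = {toy_plain_mean:.3f}")
```

The weighted prediction is (0.8 * 5 + 0.1 * 3 + 0.1 * 2) / 1.0 = 4.5, compared with an unweighted mean of about 3.33. The most similar movie dominates the estimate. When many similarities are small and roughly equal, the prediction instead drifts toward the user's overall mean rating.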


In [184]:
# Prepare df for prediction
r = ratings_valid[['userId', 'movieId']]

# Predict ratings
ratings_pred = predict_rating(r)

In [185]:
ratings_pred.head(10)

,userId,movieId,rating
0,4,45,3.570359
1,4,52,3.574186
2,4,58,3.551974
3,4,222,3.48783
4,4,247,3.565393
5,4,265,3.492657
6,4,319,3.55074
7,4,345,3.562767
8,4,417,3.603083
9,4,441,3.648042


In [186]:
from sklearn.metrics import mean_squared_error

r_true = ratings_valid['rating'].to_numpy()
r_pred = ratings_pred['rating'].to_numpy()

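# Take the square root explicitly; the `squared` argument is not available in recent scikit-learn releases
rmse = np.sqrt(mean_squared_error(r_true, r_pred))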
print(f"RMSE = {rmse:.4f}")

RMSE = 0.8992
