# Final Project
M2 MIAGE-ID classique
<br>ZOU Yongkang

# Data
- movie_ratings_500_id.pkl contains the interactions between users and movies
- movie_metadata.pkl contains detailed information about movies, e.g. genres, actors and directors of the movies.

# Goal

- Construct your own recommender systems
- Compare their performance against at least one of the baselines



# Baselines

## User-Based Collaborative Filtering
This approach predicts $\hat{r}_{(u,i)}$ by leveraging the ratings given to $i$ by $u$'s similar users. Formally, it is written as:

\begin{equation}
\hat{r}_{(u,i)} = \frac{\sum\limits_{v \in \mathcal{N}_i(u)}sim_{(u,v)}r_{vi}}{\sum\limits_{v \in \mathcal{N}_i(u)}|sim_{(u,v)}|}
\end{equation}
where $sim_{(u,v)}$ is the similarity between user $u$ and $v$. Usually, $sim_{(u,v)}$ can be computed by Pearson Correlation or Cosine Similarity.
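
To make the formula concrete, here is a minimal sketch on a toy rating matrix. It assumes cosine similarity, treats a rating of 0 as "not rated", and takes $\mathcal{N}_i(u)$ to be all other users who rated item $i$. The matrix `R` and the helper `predict_user_based` are illustrative names, not part of the project data.

```python
import numpy as np

def predict_user_based(R, u, i):
    """Predict R[u, i] as the similarity-weighted mean of neighbours' ratings of item i.

    R is a dense user x item array in which 0 means 'not rated'.
    """
    # Cosine similarity between user u and every user (rows of R)
    norms = np.linalg.norm(R, axis=1)
    sims = R @ R[u] / (norms * norms[u])
    # Neighbourhood N_i(u): every other user who rated item i
    neighbours = R[:, i] > 0
    neighbours[u] = False
    denom = np.abs(sims[neighbours]).sum()
    if denom == 0:
        return np.nan  # no usable neighbour for this item
    return float(sims[neighbours] @ R[neighbours, i] / denom)

# Toy data: 3 users x 3 items, user 0 has not rated item 2
R = np.array([
    [5, 3, 0],
    [4, 3, 4],
    [1, 5, 2],
], dtype=float)
prediction = predict_user_based(R, u=0, i=2)
print(round(prediction, 3))
```

The prediction falls between the two neighbours' ratings (4 and 2) and sits closer to the rating of the more similar user.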

## Item-Based Collaborative Filtering
This approach exploits the ratings given to similar items by the target user. The idea is formalized as follows:

\begin{equation}
\hat{r}_{(u,i)} = \frac{\sum\limits_{j \in \mathcal{N}_u(i)}sim_{(i,j)}r_{uj}}{\sum\limits_{j \in \mathcal{N}_u(i)}|sim_{(i,j)}|}
\end{equation}
where $sim_{(i,j)}$ is the similarity between item $i$ and $j$. Usually, $sim_{(i,j)}$ can be computed by Pearson Correlation or Cosine Similarity.
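
As a small worked example of this idea, the sketch below predicts one rating from a hand-written item-item similarity matrix. It assumes the neighbourhood $\mathcal{N}_u(i)$ is the set of items the user has already rated (0 means "not rated"). The names `item_sim`, `user_ratings` and `predict_item_based` are hypothetical.

```python
import numpy as np

def predict_item_based(user_ratings, item_sim, i):
    """Predict the user's rating of item i from the items they already rated."""
    # Neighbourhood N_u(i): items rated by the user, excluding i itself
    rated = user_ratings > 0
    rated[i] = False
    denom = np.abs(item_sim[i, rated]).sum()
    if denom == 0:
        return np.nan  # no rated item is similar to i
    return float(item_sim[i, rated] @ user_ratings[rated] / denom)

# Toy symmetric item-item similarity matrix (3 items)
item_sim = np.array([
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.3],
    [0.1, 0.3, 1.0],
])
# The user rated item 1 with 4 and item 2 with 2; item 0 is unrated
user_ratings = np.array([0.0, 4.0, 2.0])
prediction = predict_item_based(user_ratings, item_sim, i=0)
print(round(prediction, 3))
```

Here the prediction is $(0.8 \cdot 4 + 0.1 \cdot 2) / (0.8 + 0.1)$, so it is dominated by the rating of the most similar item.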

## Vanilla MF (You may use the package Surprise if you do not want to write the training function yourself)
Vanilla MF predicts a rating as the inner product of the vectors that represent the user and the item. Each user is represented by a vector $\textbf{p}_u \in \mathbb{R}^d$, each item is represented by a vector $\textbf{q}_i \in \mathbb{R}^d$, and $\hat{r}_{(u,i)}$ is computed as the inner product of $\textbf{p}_u$ and $\textbf{q}_i$. The core idea of Vanilla MF is depicted in the following figure and follows the idea of SVD that we saw during the TD.

![picture](https://drive.google.com/uc?export=view&id=1EAG31Qw9Ti6hB7VqdONUlijWd4rXVobC)


\begin{equation}
\hat{r}_{(u,i)} = \textbf{p}_u{\textbf{q}_i}^T
\end{equation}
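
If you prefer to write the training loop yourself, one possible sketch is plain stochastic gradient descent on the squared error with L2 regularisation. The function `train_mf`, its hyperparameters, and the toy `triples` below are assumptions chosen for illustration; they are not tuned for the project data.

```python
import numpy as np

def train_mf(triples, n_users, n_items, d=8, lr=0.02, reg=0.02, epochs=500, seed=0):
    """Fit user vectors P and item vectors Q so that P[u] @ Q[i] approximates r."""
    rng = np.random.default_rng(seed)
    P = rng.normal(scale=0.1, size=(n_users, d))
    Q = rng.normal(scale=0.1, size=(n_items, d))
    for _ in range(epochs):
        for u, i, r in triples:
            err = r - P[u] @ Q[i]
            p_u = P[u].copy()  # keep the old user vector for the item update
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * p_u - reg * Q[i])
    return P, Q

# Toy (user index, item index, rating) triples
triples = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (1, 2, 4.0), (2, 1, 5.0), (2, 2, 2.0)]
P, Q = train_mf(triples, n_users=3, n_items=3)
train_rmse = np.sqrt(np.mean([(r - P[u] @ Q[i]) ** 2 for u, i, r in triples]))
print("training RMSE:", train_rmse)
```

On real data the user and movie ids would first be mapped to contiguous integer indices, and the error should be measured on held-out ratings rather than on the training triples.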

## Some variants of SVD



- SVD with bias: $\hat{r}_{ui} = \mu + b_u + b_i + {q_i}^Tp_u$
- SVD++: $\hat{r}_{ui} = \mu + b_u + b_i + {q_i}^T(p_u + |I_u|^{-\frac{1}{2}}\sum\limits_{j \in I_u}y_j)$
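
The two prediction rules above can be sketched as small functions. The parameter values are made up for illustration, and `Y_u` is assumed to hold one implicit-feedback vector $y_j$ per item in $I_u$.

```python
import numpy as np

def predict_biased(mu, b_u, b_i, p_u, q_i):
    """SVD with bias: global mean + user bias + item bias + interaction."""
    return float(mu + b_u + b_i + q_i @ p_u)

def predict_svdpp(mu, b_u, b_i, p_u, q_i, Y_u):
    """SVD++: the user vector is shifted by the normalised sum of the y_j."""
    if len(Y_u) > 0:
        implicit = Y_u.sum(axis=0) / np.sqrt(len(Y_u))
    else:
        implicit = np.zeros_like(p_u)
    return float(mu + b_u + b_i + q_i @ (p_u + implicit))

mu, b_u, b_i = 3.5, 0.2, -0.1
p_u = np.array([1.0, 0.0])
q_i = np.array([0.5, 0.5])
# One row per item the user interacted with (|I_u| = 4)
Y_u = np.array([[0.2, 0.0], [0.0, 0.2], [0.2, 0.2], [0.0, 0.0]])

biased = predict_biased(mu, b_u, b_i, p_u, q_i)
svdpp = predict_svdpp(mu, b_u, b_i, p_u, q_i, Y_u)
print(biased, svdpp)
```

With these numbers the biased prediction is 4.1, and the implicit term raises the SVD++ prediction to 4.3.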

## Factorization machine (FM)

FM takes into account user-item interactions and other features, such as users' contexts and items' attributes. It captures the second-order interactions between the vectors representing these features, thereby enriching FM's expressiveness. However, interactions involving less relevant features may introduce noise, because all interactions share the same weight. For example, you may use FM to take the features of items into account.

\begin{equation}
\hat{y}_{FM}(\textbf{X}) = w_0 + \sum\limits_{j =1}^nw_jx_j + \sum\limits_{j=1}^n\sum\limits_{k=j+1}^n\textbf{v}_j^T\textbf{v}_kx_jx_k
\end{equation}

where $\textbf{X} \in \mathbb{R}^n$ is the feature vector, $n$ denotes the number of features, $w_0$ is the global bias, $w_j$ is the weight of the $j$-th feature, $\textbf{v}_j^T\textbf{v}_k$ is the weight of the interaction between the $j$-th and $k$-th features, and $\textbf{v}_j \in \mathbb{R}^d$ is the vector representing the $j$-th feature.
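
The sketch below evaluates the FM equation in two ways: a direct double loop over feature pairs, and the standard linear-time reformulation $\frac{1}{2}\sum_f \big[(\sum_j v_{jf}x_j)^2 - \sum_j v_{jf}^2x_j^2\big]$. The random parameters and the feature layout (one-hot user, one-hot item, multi-hot genres) are assumptions for illustration only.

```python
import numpy as np

def fm_predict_naive(x, w0, w, V):
    """Direct evaluation: loop over every feature pair (j, k) with j < k."""
    total = w0 + w @ x
    n = len(x)
    for j in range(n):
        for k in range(j + 1, n):
            total += (V[j] @ V[k]) * x[j] * x[k]
    return float(total)

def fm_predict(x, w0, w, V):
    """Same value in O(n * d) using the sum-of-squares identity."""
    xv = V.T @ x                   # sum_j v_j x_j, shape (d,)
    x2v2 = (V ** 2).T @ (x ** 2)   # sum_j v_j^2 x_j^2, shape (d,)
    pairwise = 0.5 * np.sum(xv ** 2 - x2v2)
    return float(w0 + w @ x + pairwise)

rng = np.random.default_rng(0)
n, d = 7, 3
# Hypothetical layout: 2 user slots, 2 item slots, 3 genre slots
x = np.array([1, 0, 0, 1, 1, 0, 1], dtype=float)
w0 = 0.5
w = rng.normal(size=n)
V = rng.normal(size=(n, d))

fast = fm_predict(x, w0, w, V)
slow = fm_predict_naive(x, w0, w, V)
print(fast, slow)
```

Both functions return the same value up to floating-point error; the second avoids the quadratic loop over feature pairs.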

## MLP

You may also represent users and items by vectors and then feed them into an MLP to make predictions.
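
As a rough illustration of this idea, the following sketch shows only the forward pass of such a model in NumPy: it looks up a user vector and an item vector, concatenates them, and passes them through one hidden ReLU layer. All sizes and the randomly initialised weights are assumptions; training (for example with a deep learning framework) is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, d, hidden = 4, 5, 8, 16

# Randomly initialised embeddings and layer weights (untrained)
user_emb = rng.normal(scale=0.1, size=(n_users, d))
item_emb = rng.normal(scale=0.1, size=(n_items, d))
W1 = rng.normal(scale=0.1, size=(2 * d, hidden))
b1 = np.zeros(hidden)
W2 = rng.normal(scale=0.1, size=(hidden, 1))
b2 = np.zeros(1)

def mlp_predict(user_idx, item_idx):
    """Score a batch of (user, item) index pairs with a one-hidden-layer MLP."""
    x = np.concatenate([user_emb[user_idx], item_emb[item_idx]], axis=1)
    h = np.maximum(0.0, x @ W1 + b1)  # ReLU hidden layer
    return (h @ W2 + b2).ravel()      # one score per pair

scores = mlp_predict(np.array([0, 1, 3]), np.array([2, 2, 4]))
print(scores)
```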

## Metrics

- \begin{equation}
RMSE = \sqrt{\frac{1}{|\mathcal{T}|}\sum\limits_{(u,i)\in\mathcal{T}}{(\hat{r}_{(u,i)}-r_{ui})}^2}
\end{equation}

- \begin{equation}
MAE = \frac{1}{|\mathcal{T}|}\sum\limits_{(u,i)\in\mathcal{T}}{|\hat{r}_{(u,i)}-r_{ui}|}
\end{equation}
- Bonus: you may also consider NDCG and HR under the top-k setting
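
The metrics above can be sketched as follows. RMSE and MAE follow the equations directly; for the top-k metrics the sketch assumes binary relevance (an item is either relevant or not), which is one common convention among several. The function names are illustrative.

```python
import numpy as np

def rmse(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

def mae(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_pred - y_true)))

def hit_rate_at_k(ranked_items, relevant_items, k=10):
    """1.0 if at least one relevant item appears in the top k, else 0.0."""
    return float(any(item in relevant_items for item in ranked_items[:k]))

def ndcg_at_k(ranked_items, relevant_items, k=10):
    """Binary-relevance NDCG: discounted gain divided by the ideal gain."""
    dcg = sum(1.0 / np.log2(rank + 2)
              for rank, item in enumerate(ranked_items[:k])
              if item in relevant_items)
    ideal_hits = min(len(relevant_items), k)
    idcg = sum(1.0 / np.log2(rank + 2) for rank in range(ideal_hits))
    return float(dcg / idcg) if idcg > 0 else 0.0

example_rmse = rmse([4, 3, 5], [3, 3, 4])
example_mae = mae([4, 3, 5], [3, 3, 4])
example_hr = hit_rate_at_k(["b", "a", "c"], {"a"}, k=2)
example_ndcg = ndcg_at_k(["b", "a", "c"], {"a"}, k=2)
print(example_rmse, example_mae, example_hr, example_ndcg)
```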


# Requirements
- Minimizing the RMSE and MAE
- Try to compare different methods that you have adopted and interpret the results that you have obtained
- Construct a recommender system that returns the top 10 movies that the users have not watched in the past
- Before January 7th

# Loading Data

In [97]:
import pandas as pd

movie_meta = pd.read_pickle("movie_metadata.pkl")
movie_ratings = pd.read_pickle("movie_ratings_500_id.pkl")

Example data:


```
movie_meta:
{'tt0305224':
    {'director': 'Peter Segal',
    'genre': ['Comedy'],
    'actors': ['Jack Nicholson','Adam Sandler','Marisa Tomei','Woody Harrelson','John Turturro'],
    'title': 'Anger Management'
    },
  'tt0245046':
    {'director': 'Gillian Armstrong',
    'genre': ['Drama', 'Romance', 'Thriller'],
    'actors': ['Cate Blanchett', 'James Fleet', 'Abigail Cruttenden'],
    'title': 'Charlotte Gray'
    },
 'tt0185125':
  {'director': 'Pedro Almodóvar',
    'genre': ['Drama'],
    'actors': ['Cecilia Roth','Marisa Paredes','Candela Peña','Penélope Cruz'],
    'title': 'All About My Mother'
    },
  ...
}
```



```
movie_ratings:
{'tt0305224': [
    {'user_rating': '4','user_rating_date': '2005-07-05','user_id': '1380819'},
    {'user_rating': '3', 'user_rating_date': '2005-07-05', 'user_id': '185150'},
    {'user_rating': '4', 'user_rating_date': '2005-07-06', 'user_id': '1351377'},
    {'user_rating': '2', 'user_rating_date': '2005-07-06', 'user_id': '386143'},
    {'user_rating': '3', 'user_rating_date': '2003-12-23', 'user_id': '2173336'},
    {'user_rating': '2', 'user_rating_date': '2004-02-26', 'user_id': '716091'},
    {'user_rating': '1', 'user_rating_date': '2004-08-28', 'user_id': '671513'},
    {'user_rating': '4', 'user_rating_date': '2005-07-08', 'user_id': '1227848'},
    {'user_rating': '4', 'user_rating_date': '2004-08-02', 'user_id': '712664'},
    {'user_rating': '3', 'user_rating_date': '2004-09-13', 'user_id': '1907667'},
    {'user_rating': '3', 'user_rating_date': '2003-12-18', 'user_id': '69867'},
    {'user_rating': '1', 'user_rating_date': '2003-10-23', 'user_id': '1402412'},
    {'user_rating': '1', 'user_rating_date': '2004-05-18', 'user_id': '1601783'},
    {'user_rating': '2', 'user_rating_date': '2005-07-11', 'user_id': '2640108'},
    ...],
  ...
}
```




In [98]:
# Convert the nested dictionary movie_meta to a DataFrame
meta = pd.DataFrame.from_dict(movie_meta, orient='index')

In [99]:
# Convert the nested dictionary movie_ratings to a DataFrame
rows_list = []

for movie_id, movie_rating_list in movie_ratings.items():
    for rating_info in movie_rating_list:
        # Copy each record so the original movie_ratings dict is not modified
        rows_list.append({**rating_info, 'movie_id': movie_id})

ratings = pd.DataFrame(rows_list)
ratings['user_rating'] = pd.to_numeric(ratings['user_rating'])

In [100]:
meta.head()

Unnamed: 0,director,genre,actors,title
tt0305224,Peter Segal,[Comedy],"[Jack Nicholson, Adam Sandler, Marisa Tomei, W...",Anger Management
tt0245046,Gillian Armstrong,"[Drama, Romance, Thriller]","[Cate Blanchett, James Fleet, Abigail Cruttenden]",Charlotte Gray
tt0185125,Pedro Almodóvar,[Drama],"[Cecilia Roth, Marisa Paredes, Candela Peña, P...",All About My Mother
tt0196229,Ben Stiller,[Comedy],"[Ben Stiller, Owen Wilson, Christine Taylor, W...",Zoolander
tt0308644,Marc Forster,"[Biography, Drama, Family]","[Johnny Depp, Kate Winslet, Julie Christie, Du...",Finding Neverland


In [101]:
ratings.head()

Unnamed: 0,user_rating,user_rating_date,user_id,movie_id
0,4,2005-07-05,1380819,tt0305224
1,3,2005-07-05,185150,tt0305224
2,4,2005-07-06,1351377,tt0305224
3,2,2005-07-06,386143,tt0305224
4,3,2003-12-23,2173336,tt0305224


# Exploratory Data Analysis (EDA)

## EDA of dataframe meta

In [102]:
meta

Unnamed: 0,director,genre,actors,title
tt0305224,Peter Segal,[Comedy],"[Jack Nicholson, Adam Sandler, Marisa Tomei, W...",Anger Management
tt0245046,Gillian Armstrong,"[Drama, Romance, Thriller]","[Cate Blanchett, James Fleet, Abigail Cruttenden]",Charlotte Gray
tt0185125,Pedro Almodóvar,[Drama],"[Cecilia Roth, Marisa Paredes, Candela Peña, P...",All About My Mother
tt0196229,Ben Stiller,[Comedy],"[Ben Stiller, Owen Wilson, Christine Taylor, W...",Zoolander
tt0308644,Marc Forster,"[Biography, Drama, Family]","[Johnny Depp, Kate Winslet, Julie Christie, Du...",Finding Neverland
...,...,...,...,...
tt0203019,George Tillman Jr.,"[Biography, Drama]","[Cuba Gooding Jr., Robert De Niro, Charlize Th...",Men of Honor
tt0169547,Sam Mendes,"[Drama, Romance]","[Kevin Spacey, Annette Bening, Thora Birch, We...",American Beauty
tt0227538,Robert Rodriguez,"[Action, Adventure, Comedy]","[Alexa PenaVega, Daryl Sabara, Antonio Banderas]",Spy Kids
tt0374536,Nora Ephron,"[Comedy, Fantasy, Romance]","[Nicole Kidman, Will Ferrell, Shirley MacLaine...",Bewitched


In [103]:
# install python-Levenshtein to check for potential spelling errors
!pip install python-Levenshtein



In [104]:
# Check edit distances to flag names that might be misspellings of each other
from Levenshtein import distance as levenshtein_distance

director_namelist = meta['director'].explode().unique()
genre_list = meta['genre'].explode().unique()
actor_namelist = meta['actors'].explode().unique()
movie_list = meta['title'].explode().unique()

def report_possible_typos(label, names, max_distance=4):
    """Print every ordered pair of distinct names whose edit distance is below max_distance."""
    print(f"check for {label}:")
    for name1 in names:
        for name2 in names:
            if name1 != name2:
                dist = levenshtein_distance(name1, name2)
                if dist < max_distance:
                    print(f"'{name1}' and '{name2}' might be a typo. Distance: {dist}")

report_possible_typos("director_namelist", director_namelist)
print()
report_possible_typos("genre_list", genre_list)
print()
report_possible_typos("actor_namelist", actor_namelist)
print()
report_possible_typos("movie_list", movie_list)

check for director_namelist:
'Peter Webber' and 'Peter Weir' might be a typo. Distance: 3
'Tim Roth' and 'Joe Roth' might be a typo. Distance: 3
'Mike Binder' and 'Mike Barker' might be a typo. Distance: 3
'Mike Barker' and 'Mike Binder' might be a typo. Distance: 3
'Joe Roth' and 'Tim Roth' might be a typo. Distance: 3
'Michael Mann' and 'Michael Bay' might be a typo. Distance: 3
'Michael Mann' and 'Michael Mayer' might be a typo. Distance: 3
'Peter Care' and 'Peter Berg' might be a typo. Distance: 3
'Michael Bay' and 'Michael Mann' might be a typo. Distance: 3
'Michael Bay' and 'Michael Mayer' might be a typo. Distance: 3
'Peter Weir' and 'Peter Webber' might be a typo. Distance: 3
'Peter Weir' and 'Peter Berg' might be a typo. Distance: 3
'Michael Mayer' and 'Michael Mann' might be a typo. Distance: 3
'Michael Mayer' and 'Michael Bay' might be a typo. Distance: 3
'Peter Berg' and 'Peter Care' might be a typo. Distance: 3
'Peter Berg' and 'Peter Weir' might be a typo. Distance: 3
'Jo

In [105]:
# Using explode to count each director, genre, actor and title individually
unique_director = meta['director'].explode().nunique()
unique_genres = meta['genre'].explode().nunique()
unique_actors = meta['actors'].explode().nunique()
unique_movies = meta['title'].explode().nunique()

print("Number of directors:", unique_director)
print("Number of genres:", unique_genres)
print("Number of actors:", unique_actors)
print("Number of movies:", unique_movies)

Number of directors: 389
Number of genres: 20
Number of actors: 745
Number of movies: 528


In [106]:
director_counts = meta['director'].explode().value_counts()
most_frequent_director = director_counts.idxmax()
print("Director frequencies:\n", director_counts)
print("")
print("The most frequent director is:", most_frequent_director)

Director frequencies:
 Steven Soderbergh     7
Joel Schumacher       6
Steven Spielberg      5
Woody Allen           4
Jay Roach             4
                     ..
Henry Bean            1
Sean Penn             1
Stephen Sommers       1
Wes Craven            1
George Tillman Jr.    1
Name: director, Length: 389, dtype: int64

The most frequent director is: Steven Soderbergh


In [107]:
genre_counts = meta['genre'].explode().value_counts()
most_frequent_genre = genre_counts.idxmax()
print("Genre frequencies:\n", genre_counts)
print("")
print("The most frequent genre is:", most_frequent_genre)

Genre frequencies:
 Drama        349
Comedy       204
Romance      145
Crime        123
Action        92
Thriller      88
Adventure     77
Mystery       58
Fantasy       37
Biography     35
Sci-Fi        26
Horror        20
History       17
Family        16
Animation     14
War           11
Music         10
Sport          9
Western        5
Musical        4
Name: genre, dtype: int64

The most frequent genre is: Drama


In [108]:
actors_counts = meta['actors'].explode().value_counts()
most_frequent_actor = actors_counts.idxmax()
print("Actor frequencies:\n", actors_counts)
print("")
print("The most frequent actor is:", most_frequent_actor)

Actor frequencies:
 Ben Affleck               18
Julianne Moore            15
Philip Seymour Hoffman    15
Cate Blanchett            14
Penélope Cruz             14
                          ..
Mercedes Ruehl             1
Ice Cube                   1
Kristen Stewart            1
Sharon Stone               1
Aitana Sánchez-Gijón       1
Name: actors, Length: 745, dtype: int64

The most frequent actor is: Ben Affleck


In [109]:
# Check for any missing values
print(meta.isnull().sum())

director    0
genre       0
actors      0
title       0
dtype: int64



## EDA of dataframe ratings

In [110]:
ratings

Unnamed: 0,user_rating,user_rating_date,user_id,movie_id
0,4,2005-07-05,1380819,tt0305224
1,3,2005-07-05,185150,tt0305224
2,4,2005-07-06,1351377,tt0305224
3,2,2005-07-06,386143,tt0305224
4,3,2003-12-23,2173336,tt0305224
...,...,...,...,...
259813,5,2005-07-09,1139877,tt0361862
259814,4,2005-07-11,1460015,tt0361862
259815,5,2005-07-11,1098265,tt0361862
259816,4,2005-07-11,1962894,tt0361862


In [111]:
ratings_unique = ratings.drop_duplicates()
ratings_unique

Unnamed: 0,user_rating,user_rating_date,user_id,movie_id
0,4,2005-07-05,1380819,tt0305224
1,3,2005-07-05,185150,tt0305224
2,4,2005-07-06,1351377,tt0305224
3,2,2005-07-06,386143,tt0305224
4,3,2003-12-23,2173336,tt0305224
...,...,...,...,...
259813,5,2005-07-09,1139877,tt0361862
259814,4,2005-07-11,1460015,tt0361862
259815,5,2005-07-11,1098265,tt0361862
259816,4,2005-07-11,1962894,tt0361862


In [112]:
ratings.nunique()

user_rating             5
user_rating_date     2144
user_id             36968
movie_id              528
dtype: int64

In [113]:
movie_ratings_count = ratings.groupby('movie_id')['user_rating'].count()
print(movie_ratings_count)

movie_id
tt0118661    500
tt0118715    500
tt0118744    500
tt0118863    500
tt0119079    500
            ... 
tt0385307    500
tt0388973    500
tt0395169    500
tt0401792    500
tt0405159    500
Name: user_rating, Length: 528, dtype: int64


In [114]:
user_ratings_count = ratings.groupby('user_id')['user_rating'].count()
print(user_ratings_count)

user_id
1000192     1
1000287     6
1000433     1
1000457     1
1000460    14
           ..
999527      1
999586      1
999591      1
999652      2
999743      6
Name: user_rating, Length: 36968, dtype: int64


## Meta and Ratings

In [115]:
# Check whether the two DataFrames use the same movie_ids

meta_movie_ids = set(meta.index)
ratings_movie_ids = set(ratings['movie_id'])

# check for the amount of movie_ids
same_number_of_ids = meta.index.nunique() == ratings['movie_id'].nunique()

# Check if every movie_id in ratings is present in meta
ratings_in_meta = ratings_movie_ids.issubset(meta_movie_ids)

# Check if every movie_id in meta is present in ratings
meta_in_ratings = meta_movie_ids.issubset(ratings_movie_ids)

# check for the same elements
same_ids = meta_movie_ids == ratings_movie_ids

print(f"Both DataFrames have the same number of movie_ids: {same_number_of_ids}")
print(f"Both DataFrames have the exact same movie_ids: {same_ids}")

Both DataFrames have the same number of movie_ids: True
Both DataFrames have the exact same movie_ids: True


# User-Based Collaborative Filter

This approach predicts $\hat{r}_{(u,i)}$ by leveraging the ratings given to $i$ by $u$'s similar users. Formally, it is written as:

\begin{equation}
\hat{r}_{(u,i)} = \frac{\sum\limits_{v \in \mathcal{N}_i(u)}sim_{(u,v)}r_{vi}}{\sum\limits_{v \in \mathcal{N}_i(u)}|sim_{(u,v)}|}
\end{equation}
where $sim_{(u,v)}$ is the similarity between user $u$ and $v$. Usually, $sim_{(u,v)}$ can be computed by Pearson Correlation or Cosine Similarity.

## Construction of the filtering model

In [116]:
from scipy.sparse import coo_matrix

# Assuming 'ratings' is your original ratings data
ratings_matrix = ratings.pivot_table(index='user_id', columns='movie_id', values='user_rating')
stacked_ratings = ratings_matrix.stack(dropna=True)

# Calculating sparsity
num_holes = ratings_matrix.isna().sum().sum()
total_elements = ratings_matrix.size
sparsity = num_holes / total_elements

print("Sparsity of the matrix:", sparsity)

Sparsity of the matrix: 0.9866890406444886


With 36968 distinct user_ids, a full user-user cosine similarity matrix would have about 1.37 billion entries (roughly 11 GB as float64), which is too expensive to compute and store.
<br>We therefore compute similarities only among the most active users.

In [117]:
def user_activity(ratings, top_n_users):
    """Print and return the rating counts of the top_n_users most active users."""
    # Count ratings per user, sorted from most to least active
    activity_counts = ratings['user_id'].value_counts()

    # Keep the number of ratings for each of the top users (order is preserved)
    user_ratings_count = activity_counts.head(top_n_users).to_dict()

    print(f"if we only select the top {top_n_users} most active users :")

    active_users = list(user_ratings_count.keys())
    first_user_id = active_users[0]
    last_ranked_user_id = active_users[-1]
    print("First user: ",f"User {first_user_id} has given {user_ratings_count[first_user_id]} ratings.")
    print(f"{len(active_users)}th user: ",f"User {last_ranked_user_id} has given {user_ratings_count[last_ranked_user_id]} ratings.")

    return user_ratings_count

# Example usage:
# select only 5000 users
user_ratings_count = user_activity(ratings, 5000)

if we only select the top 5000 most active users :
First user:  User 1174530 has given 341 ratings.
5000th user:  User 1449252 has given 10 ratings.


In [118]:
def get_active_users(ratings_matrix, top_n_users):
    user_activity = ratings_matrix.notnull().sum(axis=1).sort_values(ascending=False)
    active_user_ids = user_activity.head(top_n_users).index
    active_user_indices = [user_to_index[user_id] for user_id in active_user_ids]
    return active_user_indices

In [119]:
# user_id values are not a contiguous sequence, which makes positional lookups awkward,
# so we map each user_id to a contiguous index starting at 0 to simplify later operations
import numpy as np
import pandas as pd

unique_user_ids = np.unique(ratings_matrix.index.values)
user_to_index = {user_id: index for index, user_id in enumerate(unique_user_ids)}
index_to_user = {index: user_id for user_id, index in user_to_index.items()}

##test
##print("User to Index:", list(user_to_index.items())[:5])
##print("Index to User:", list(index_to_user.items())[:5])

In [120]:
from sklearn.metrics.pairwise import cosine_similarity
from scipy.sparse import csr_matrix

def user_based_prediction(ratings_matrix, user_to_index, target_user_id, top_n_users=5000, similarity_threshold=0.1):

    # Step 1: Check for user_id
    target_user_id = str(target_user_id)
    if target_user_id not in user_to_index:
        raise ValueError(f"User {target_user_id} does not exist in the user_to_index mapping.")
    target_user_index = user_to_index[target_user_id]

    # Step 2: Get active users
    active_user_indices = get_active_users(ratings_matrix, top_n_users)

    # Ensure the target user is taken into consideration
    if target_user_index not in active_user_indices:
        active_user_indices.append(target_user_index)
    active_users = [index_to_user[idx] for idx in active_user_indices]

    # Create ratings matrix for active users
    active_ratings_matrix = ratings_matrix.loc[active_users]
    active_ratings_matrix_sparse = csr_matrix(np.nan_to_num(active_ratings_matrix.to_numpy()))

    # Step 3: Compute cosine similarity
    user_similarity = cosine_similarity(active_ratings_matrix_sparse)
    target_user_similarity_index = active_user_indices.index(target_user_index)
    target_user_similarity = user_similarity[target_user_similarity_index]
    target_user_similarity[target_user_similarity < similarity_threshold] = 0

    # Step 4: Predict ratings for all movies
    predicted_scores = {}
    for movie_id in ratings_matrix.columns:
        movie_index = ratings_matrix.columns.get_loc(movie_id)
        movie_ratings = active_ratings_matrix_sparse[:, movie_index].toarray().ravel()
        weighted_scores = movie_ratings * target_user_similarity
        sum_of_weights = np.sum(target_user_similarity[(movie_ratings > 0) & (target_user_similarity > 0)])
        if sum_of_weights > 0:
            predicted_score = np.sum(weighted_scores) / sum_of_weights
            predicted_scores[movie_id] = predicted_score
    return predicted_scores

def recommend(predicted_scores, meta, ratings_matrix, target_user_id, top_n=10):
    user_ratings = ratings_matrix.loc[target_user_id]
    filtered_scores = {movie_id: score for movie_id, score in predicted_scores.items() if pd.isna(user_ratings[movie_id])}
    top_movies = sorted(filtered_scores.items(), key=lambda x: x[1], reverse=True)[:top_n]
    top_movie_info = [(movie_id, meta.loc[movie_id, 'title'], score) for movie_id, score in top_movies]
    return top_movie_info

In [121]:
# Example usage
## user_id '1000192' is not in the top active users
predict_1000192 = user_based_prediction(ratings_matrix, user_to_index, '1000192')
print("Top 10 recommended movie titles for user 1000192 :")
print(recommend(predict_1000192,meta,ratings_matrix,'1000192'))

## user_id '1174530' is in the top active users
predict_1174530 = user_based_prediction(ratings_matrix, user_to_index, '1174530')
print("Top 10 recommended movie titles for user 1174530 :")
print(recommend(predict_1174530,meta,ratings_matrix,'1174530'))

Top 10 recommended movie titles for user 1000192 :
[('tt0120601', 'Being John Malkovich', 5.0), ('tt0120623', "A Bug's Life", 5.0), ('tt0168786', 'Antwone Fisher', 5.0), ('tt0335345', 'The Passion of the Christ', 5.0), ('tt0246578', 'Donnie Darko', 4.783273257725223), ('tt0141974', 'The War Zone', 4.547004949122404), ('tt0368447', 'The Village', 4.522353948261479), ('tt0137523', 'Fight Club', 4.515501611501895), ('tt0274558', 'The Hours', 4.368577184051996), ('tt0230600', 'The Others', 4.35099446585942)]
Top 10 recommended movie titles for user 1174530 :
[('tt0141974', 'The War Zone', 4.332339136178121), ('tt0335126', 'Grand Champion', 4.068154657039985), ('tt0160547', 'More Dogs Than Bones', 4.0), ('tt0347618', 'Neko no ongaeshi', 3.9811087156990084), ('tt0185125', 'All About My Mother', 3.919631270508301), ('tt0246578', 'Donnie Darko', 3.888081983265853), ('tt0161010', 'The Trench', 3.783519113983769), ('tt0124315', 'The Cider House Rules', 3.780113884364557), ('tt0198021', 'Where th

## Metrics of the model

We have 36968 users, and ranking them by number of ratings shows:
<br>First user:  User 1174530 has given 341 ratings.
<br>100th user:  User 1990901 has given 170 ratings.
<br>500th user:  User 193828 has given 86 ratings.
<br>1000th user:  User 1571931 has given 54 ratings.
<br>5000th user:  User 1449252 has given 10 ratings.


In [122]:
def calculate_rmse_for_active_users(ratings_matrix, user_to_index, index_to_user, top_n_users=5000, similarity_threshold=0.1):
    # Initialize the sum of squared differences and count of ratings
    sum_squared_diff = 0
    ratings_count = 0

    # Evaluate on the 1000 most active users only; top_n_users sets the neighbourhood size used for prediction
    active_user_indices = get_active_users(ratings_matrix, 1000)
    active_users = [index_to_user[idx] for idx in active_user_indices]

    # Iterate through each active user to predict ratings and calculate RMSE
    for user_id in active_users:
        # Predict ratings for the user
        try:
            predicted_scores = user_based_prediction(
                ratings_matrix,
                user_to_index,
                user_id,
                top_n_users,
                similarity_threshold
            )
        except ValueError:
            # Skip if the user is not in user_to_index mapping
            continue

        # Get the actual ratings from the user
        actual_ratings = ratings_matrix.loc[user_id].dropna()

        # Calculate the squared differences for known ratings
        for movie_id, actual_rating in actual_ratings.items():
            # Check if the movie was rated in predictions
            if movie_id in predicted_scores:
                predicted_rating = predicted_scores[movie_id]
                diff = actual_rating - predicted_rating
                sum_squared_diff += diff ** 2
                ratings_count += 1

    # Calculate the RMSE
    rmse = np.sqrt(sum_squared_diff / ratings_count) if ratings_count != 0 else np.nan
    return rmse

In [123]:
rmse = calculate_rmse_for_active_users(ratings_matrix, user_to_index, index_to_user, top_n_users=10, similarity_threshold=0.1)
print(rmse)

0.7583771424767344


In [None]:
rmse = calculate_rmse_for_active_users(ratings_matrix, user_to_index, index_to_user, top_n_users=100, similarity_threshold=0.1)
print(rmse)

In [None]:
rmse = calculate_rmse_for_active_users(ratings_matrix, user_to_index, index_to_user, top_n_users=100, similarity_threshold=0.3)
print(rmse)

In [None]:
rmse = calculate_rmse_for_active_users(ratings_matrix, user_to_index, index_to_user, top_n_users=100, similarity_threshold=0.05)
print(rmse)

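To minimise the RMSE, we should search for the best pair **(top_n_users, similarity_threshold)**. Note that this RMSE is measured on ratings that were also used to build the neighbourhood (the target user is always included and has similarity 1 with themselves), so it is an optimistic, in-sample estimate; a held-out test split would give a fairer comparison.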
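
One simple way to run such a search is an exhaustive grid over candidate values. In the sketch below, `grid_search` is a hypothetical helper and `toy_rmse` is a stand-in scoring function so that the example runs on its own; in the notebook, the `evaluate` argument would instead wrap `calculate_rmse_for_active_users`.

```python
from itertools import product

def grid_search(evaluate, top_n_values, threshold_values):
    """Try every (top_n_users, similarity_threshold) pair and keep the lowest score."""
    results = [
        (evaluate(top_n, threshold), top_n, threshold)
        for top_n, threshold in product(top_n_values, threshold_values)
    ]
    best = min(results, key=lambda result: result[0])
    return best, results

def toy_rmse(top_n, threshold):
    # Stand-in for the real evaluation; lowest at top_n=100, threshold=0.1
    return 0.7 + abs(top_n - 100) / 1000 + abs(threshold - 0.1)

best, results = grid_search(toy_rmse, [10, 100, 1000], [0.05, 0.1, 0.3])
print("best (score, top_n_users, similarity_threshold):", best)

# In the notebook, evaluate could be:
# lambda n, t: calculate_rmse_for_active_users(
#     ratings_matrix, user_to_index, index_to_user,
#     top_n_users=n, similarity_threshold=t)
```

Because each evaluation recomputes the similarities for every evaluated user, a coarse grid is a sensible starting point before refining around the best pair.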

# Item-Based Collaborative Filtering

This approach exploits the ratings given to similar items by the target user. The idea is formalized as follows:

\begin{equation}
\hat{r}_{(u,i)} = \frac{\sum\limits_{j \in \mathcal{N}_u(i)}sim_{(i,j)}r_{uj}}{\sum\limits_{j \in \mathcal{N}_u(i)}|sim_{(i,j)}|}
\end{equation}
where $sim_{(i,j)}$ is the similarity between item $i$ and $j$. Usually, $sim_{(i,j)}$ can be computed by Pearson Correlation or Cosine Similarity.

## Construction of the Item_Based Filtering Model

According to EDA of meta:
<br>Number of directors: 389
<br>Number of genres: 20
<br>Number of actors: 745
<br>Number of movies: 528

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

def calculate_cosine_similarity(meta, ratings_matrix):
    # Define TF-IDF Vectorizer
    tfidf = TfidfVectorizer()

    # Generate TF-IDF matrix for genres
    tfidf_matrix_genre = tfidf.fit_transform(meta['genre'].apply(lambda x: ' '.join(x)))
    cosine_sim_genre = cosine_similarity(tfidf_matrix_genre, tfidf_matrix_genre)

    # Generate TF-IDF matrix for directors
    tfidf_matrix_director = tfidf.fit_transform(meta['director'])
    cosine_sim_director = cosine_similarity(tfidf_matrix_director, tfidf_matrix_director)

    # Generate TF-IDF matrix for actors
    tfidf_matrix_actors = tfidf.fit_transform(meta['actors'].apply(lambda x: ' '.join(x)))
    cosine_sim_actors = cosine_similarity(tfidf_matrix_actors, tfidf_matrix_actors)

    # Combine the cosine similarity matrices
    combined_cosine_sim = (cosine_sim_genre*1/20 + cosine_sim_director*1/389 + cosine_sim_actors*1/745)/(1/20+1/389+1/745)


    # Reindex the DataFrame to match the order of movie_ids in ratings_matrix
    ordered_cosine_sim = pd.DataFrame(combined_cosine_sim, index=meta.index, columns=meta.index)
    ordered_cosine_sim = ordered_cosine_sim.reindex(index=ratings_matrix.columns, columns=ratings_matrix.columns)

    return ordered_cosine_sim

# Usage
# Ensure ratings_matrix is defined and accessible
movies_cosine_similarity = calculate_cosine_similarity(meta, ratings_matrix)
movies_cosine_similarity

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

def compute_genre_cosine_similarity(meta, ratings_matrix):
    # Define TF-IDF Vectorizer
    tfidf = TfidfVectorizer()

    # Generate TF-IDF matrix for genres
    tfidf_matrix_genre = tfidf.fit_transform(meta['genre'].apply(lambda x: ' '.join(x)))
    cosine_sim_genre = cosine_similarity(tfidf_matrix_genre, tfidf_matrix_genre)

    # Reindex the DataFrame to match the order of movie_ids in ratings_matrix
    ordered_cosine_sim = pd.DataFrame(cosine_sim_genre, index=meta.index, columns=meta.index)
    ordered_cosine_sim = ordered_cosine_sim.reindex(index=ratings_matrix.columns, columns=ratings_matrix.columns)

    return ordered_cosine_sim

# Usage
# Ensure ratings_matrix is defined and accessible
# The function and its result have different names, so re-running this cell does not overwrite the function
genre_cosine_similarity = compute_genre_cosine_similarity(meta, ratings_matrix)
genre_cosine_similarity

In [None]:
def item_based_prediction(ratings_matrix, cosine_sim_matrix, user_to_index, target_user_id):
    # Check if target_user_id is in user_to_index mapping
    if target_user_id not in user_to_index:
        raise ValueError(f"User {target_user_id} does not exist in the user_to_index mapping.")

    # Get the target user's row position from the user_to_index mapping
    target_user_index = user_to_index[target_user_id]

    # Get the user's ratings; remember which items were rated before filling NaN with 0
    raw_ratings = ratings_matrix.iloc[target_user_index]  # Use iloc for positional indexing
    rated_mask = raw_ratings.notna().to_numpy()
    user_ratings = raw_ratings.fillna(0).to_numpy()

    # Work on a plain array whose rows and columns follow ratings_matrix.columns
    if isinstance(cosine_sim_matrix, pd.DataFrame):
        sim = cosine_sim_matrix.loc[ratings_matrix.columns, ratings_matrix.columns].to_numpy()
    else:
        # A numpy array is assumed to be already ordered like ratings_matrix.columns
        sim = np.asarray(cosine_sim_matrix)

    # Initialize a Series to hold the predicted ratings
    predicted_ratings = pd.Series(index=ratings_matrix.columns, dtype=float)

    # Neutral fallback: the mean of the ratings the user actually gave
    fallback = raw_ratings.mean()

    # Predict ratings for each item
    for position, item_id in enumerate(ratings_matrix.columns):
        item_sims = sim[:, position]

        # Numerator: similarity-weighted sum over rated items (unrated items contribute 0)
        weighted_ratings_sum = np.dot(item_sims, user_ratings)

        # Denominator: sum of |similarity| over the items the user rated, i.e. N_u(i)
        sum_of_weights = np.abs(item_sims[rated_mask]).sum()

        # Predict the rating for the item
        if sum_of_weights != 0:
            predicted_ratings[item_id] = weighted_ratings_sum / sum_of_weights
        else:
            predicted_ratings[item_id] = fallback

    return predicted_ratings

In [None]:
# Example usage:
predict_item_1000192 = item_based_prediction(ratings_matrix, movies_cosine_similarity, user_to_index, '1000192')
print("Top 10 recommended movie titles for user 1000192:")
print(recommend(predict_item_1000192,meta,ratings_matrix,'1000192'))

predict_item_1380819 = item_based_prediction(ratings_matrix, movies_cosine_similarity, user_to_index, '1380819')
print("Top 10 recommended movie titles for user 1380819:")
print(recommend(predict_item_1380819,meta,ratings_matrix,'1380819'))

## Metrics of Model

As in the RMSE calculation for user_based_prediction, we have 36968 users, and ranking them by number of ratings shows:
<br>First user: User 1174530 has given 341 ratings.
<br>100th user: User 1990901 has given 170 ratings.
<br>500th user: User 193828 has given 86 ratings.
<br>1000th user: User 1571931 has given 54 ratings.
<br>5000th user: User 1449252 has given 10 ratings.

To keep the computation tractable, the RMSE is computed only over the 1000 most active users.
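As an illustration of how the most active users can be selected, here is a minimal sketch. It assumes a users-by-movies DataFrame in which missing ratings are NaN; `top_active_users` and the toy data are hypothetical names for this example and are not the notebook's own `get_active_users` helper (which returns positional indices rather than labels).

```python
import numpy as np
import pandas as pd

def top_active_users(ratings_matrix: pd.DataFrame, n: int) -> list:
    """Return the labels of the n users with the most non-missing ratings."""
    counts = ratings_matrix.notna().sum(axis=1)  # number of ratings per user (row)
    ranked = counts.sort_values(ascending=False, kind="stable")
    return ranked.head(n).index.tolist()

# Toy users x movies matrix: u1 has 2 ratings, u2 has 1, u3 has 3
toy = pd.DataFrame(
    [[5, np.nan, 3], [np.nan, np.nan, 4], [1, 2, 3]],
    index=["u1", "u2", "u3"],
    columns=["m1", "m2", "m3"],
)
top2 = top_active_users(toy, 2)
print(top2)
```

Restricting the evaluation to active users trades coverage for speed: the resulting RMSE describes users with many ratings and may not reflect performance on sparse users.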

In [None]:
def calculate_rmse_for_item_based(ratings_matrix, cosine_sim_matrix, user_to_index, index_to_user):
    # Initialize the sum of squared differences and count of ratings
    sum_squared_diff = 0
    ratings_count = 0

    # Get the indices of the most active users
    active_user_indices = get_active_users(ratings_matrix, 1000)
    active_users = [index_to_user[idx] for idx in active_user_indices]

    # Iterate through each active user to predict ratings and calculate RMSE
    for user_id in active_users:
        # Predict ratings for the user
        try:
            predicted_ratings = item_based_prediction(
                ratings_matrix,
                cosine_sim_matrix,
                user_to_index,
                user_id
            )
        except ValueError as e:
            # Skip if the user is not in user_to_index mapping or any other issue
            print(f"Skipping user {user_id}: {e}")
            continue

        # Get the actual ratings from the user
        actual_ratings = ratings_matrix.loc[user_id].dropna()

        # Calculate the squared differences for known ratings
        for movie_id, actual_rating in actual_ratings.items():
            # Check if the movie was rated in predictions
            if movie_id in predicted_ratings:
                predicted_rating = predicted_ratings[movie_id]
                diff = actual_rating - predicted_rating
                sum_squared_diff += diff ** 2
                ratings_count += 1

    # Calculate the RMSE
    rmse = np.sqrt(sum_squared_diff / ratings_count) if ratings_count != 0 else np.nan
    return rmse

In [None]:
rmse = calculate_rmse_for_item_based(ratings_matrix, movies_cosine_similarity, user_to_index, index_to_user)
print(rmse)

In [None]:
rmse = calculate_rmse_for_item_based(ratings_matrix, genre_cosine_similarity, user_to_index, index_to_user)
print(rmse)

# Vanilla MF

(You may use the package Surprise if you do not want to write the training function yourself.)

Vanilla MF is the inner product of vectors that represent users and items. Each user is represented by a vector $\textbf{p}_u \in \mathbb{R}^d$, each item is represented by a vector $\textbf{q}_i \in \mathbb{R}^d$, and $\hat{r}_{(u,i)}$ is computed as the inner product of $\textbf{p}_u$ and $\textbf{q}_i$. The core idea of Vanilla MF is depicted in the following figure and follows the idea of SVD, as seen during the TD.

\begin{equation}
\hat{r}_{(u,i)} = \textbf{p}_u{\textbf{q}_i}^T
\end{equation}
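To make the equation concrete, the sketch below computes $\hat{r}_{(u,i)}$ for one user-item pair and applies a single stochastic gradient step on the regularised squared error. The latent dimension, learning rate, regularisation strength and the observed rating are assumed toy values; this is not how Surprise implements its `SVD` class, which also includes bias terms.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3                                  # latent dimension (assumed)
p_u = rng.normal(scale=0.1, size=d)    # user vector
q_i = rng.normal(scale=0.1, size=d)    # item vector
r_ui = 4.0                             # observed rating (toy value)
lr, reg = 0.01, 0.02                   # learning rate and L2 penalty (assumed)

def predict(p, q):
    """Vanilla MF prediction: inner product of the user and item vectors."""
    return float(p @ q)

err_before = r_ui - predict(p_u, q_i)

# One SGD step on (r_ui - p_u . q_i)^2 + reg * (|p_u|^2 + |q_i|^2)
p_new = p_u + lr * (err_before * q_i - reg * p_u)
q_new = q_i + lr * (err_before * p_u - reg * q_i)

err_after = r_ui - predict(p_new, q_new)
print(abs(err_before), abs(err_after))
```

Repeating this update over all observed ratings for several epochs is the basic training loop of matrix factorization.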

In [None]:
!pip install scikit-surprise

In [None]:
from surprise import Dataset, Reader, SVD
from surprise.model_selection import train_test_split
import pandas as pd

def mf_recommend(user_id, meta):
    # Create a Reader object with the rating scale
    reader = Reader(rating_scale=(1, 5))

    # Load the data from the DataFrame
    data = Dataset.load_from_df(ratings[['user_id', 'movie_id', 'user_rating']], reader)

    # Train on the full dataset (no hold-out split; evaluation is done in a separate cell)
    trainset = data.build_full_trainset()

    # Instantiate and train the SVD algorithm
    algo = SVD()
    algo.fit(trainset)

    # Get the list of all movie ids
    movies_ids = meta.index.unique()

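    # Predict ratings for all movies that the user hasn't rated.
    # The third field is a placeholder "true" rating required by algo.test; it does not affect pred.est.
    rated_movies = set(ratings.loc[ratings['user_id'] == user_id, 'movie_id'])
    testset = [(user_id, movie_id, 4) for movie_id in movies_ids if movie_id not in rated_movies]
    predictions = algo.test(testset)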

    # Retrieve the top 10 movies with the highest estimated rating
    top_preds = sorted(predictions, key=lambda x: x.est, reverse=True)[:10]

    # Map the predictions to movie titles
    top_movies = [(pred.iid, meta.loc[pred.iid, 'title'], pred.est) for pred in top_preds]

    return top_movies

In [None]:
# Example usage
print("Top 10 recommended movie titles for user 1174530 :")
print(mf_recommend('1174530', meta))

In [None]:
from surprise import Dataset, Reader
from surprise.model_selection import train_test_split
from surprise import SVD
from surprise import accuracy
import pandas as pd

# Specify the scale of the ratings
reader = Reader(rating_scale=(1, 5))

# Load data
data = Dataset.load_from_df(ratings[['user_id', 'movie_id', 'user_rating']], reader)

trainset, testset = train_test_split(data, test_size=0.2)

# Use the SVD algorithm for matrix factorization
algo = SVD()

# Train the algorithm
algo.fit(trainset)

# Predict ratings for the test set
predictions = algo.test(testset)

# Compute and print the RMSE (verbose=False avoids printing it twice)
rmse = accuracy.rmse(predictions, verbose=False)
print(f'RMSE: {rmse}')

# SVD++

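Variants of SVD
- SVD++: $\hat{r}_{(u,i)} = \mu + b_u + b_i + \textbf{q}_i^T\left(\textbf{p}_u + |I_u|^{-\frac{1}{2}}\sum\limits_{j \in I_u}\textbf{y}_j\right)$, where $\mu$ is the global mean rating, $b_u$ and $b_i$ are the user and item biases, $I_u$ is the set of items rated by user $u$, and $\textbf{y}_j$ is an implicit-feedback vector associated with item $j$.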
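The SVD++ prediction rule can be evaluated by hand on toy values. The sketch below is only an illustration of the formula: `svdpp_predict` and all parameter values are hypothetical, and the parameters would normally be learned (for example by Surprise's `SVDpp`).

```python
import numpy as np

def svdpp_predict(mu, b_u, b_i, q_i, p_u, Y, rated_items):
    """mu + b_u + b_i + q_i^T (p_u + |I_u|^(-1/2) * sum of y_j over rated items)."""
    if rated_items:
        implicit = Y[rated_items].sum(axis=0) / np.sqrt(len(rated_items))
    else:
        implicit = 0.0  # no implicit feedback available for this user
    return mu + b_u + b_i + q_i @ (p_u + implicit)

mu, b_u, b_i = 3.5, 0.2, -0.1          # global mean and biases (toy values)
p_u = np.array([0.5, 0.0])             # user factors
q_i = np.array([1.0, 2.0])             # item factors
Y = np.array([[0.1, 0.1],              # one implicit vector y_j per item
              [0.3, -0.1],
              [0.0, 0.4],
              [0.2, 0.2]])
rated = [0, 1, 2, 3]                   # I_u: items rated by the user

r_hat = svdpp_predict(mu, b_u, b_i, q_i, p_u, Y, rated)
print(r_hat)  # 3.5 + 0.2 - 0.1 + [1, 2] . ([0.5, 0] + [0.3, 0.3]) = 5.0
```

The extra term shifts the user vector according to which items were rated, regardless of the rating values, which is what distinguishes SVD++ from the biased SVD model.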

In [None]:
from surprise import Dataset, Reader, SVDpp
from surprise.model_selection import train_test_split
import pandas as pd

def svdpp_recommend(user_id, meta):
    # Create a Reader object with the rating scale
    reader = Reader(rating_scale=(1, 5))

    # Load the data from the DataFrame
    data = Dataset.load_from_df(ratings[['user_id', 'movie_id', 'user_rating']], reader)

    # Train on the full dataset (no hold-out split; evaluation is done in a separate cell)
    trainset = data.build_full_trainset()

    # Instantiate and train the SVDpp algorithm
    algo_pp = SVDpp()
    algo_pp.fit(trainset)

    # Get the list of all movie ids
    movies_ids = meta.index.unique()

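    # Predict ratings for all movies that the user hasn't rated.
    # The third field is a placeholder "true" rating required by algo_pp.test; it does not affect pred.est.
    rated_movies = set(ratings.loc[ratings['user_id'] == user_id, 'movie_id'])
    testset = [(user_id, movie_id, 4) for movie_id in movies_ids if movie_id not in rated_movies]
    predictions_pp = algo_pp.test(testset)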

    # Retrieve the top 10 movies with the highest estimated rating
    top_preds = sorted(predictions_pp, key=lambda x: x.est, reverse=True)[:10]

    # Map the predictions to movie titles
    top_movies = [(pred.iid, meta.loc[pred.iid, 'title'], pred.est) for pred in top_preds]

    return top_movies

In [None]:
# Example usage
print("Top 10 recommended movie titles for user 1174530 :")
print(svdpp_recommend('1174530', meta))

In [None]:
from surprise import Dataset, Reader
from surprise.model_selection import train_test_split
from surprise import SVDpp
from surprise import accuracy
import pandas as pd

# Specify the scale of the ratings
reader = Reader(rating_scale=(1, 5))

# Load data
data = Dataset.load_from_df(ratings[['user_id', 'movie_id', 'user_rating']], reader)

trainset, testset = train_test_split(data, test_size=0.2)

# Use the SVD++ algorithm for matrix factorization
algo_pp = SVDpp()

# Train the algorithm
algo_pp.fit(trainset)

# Predict ratings for the test set
predictions_pp = algo_pp.test(testset)

# Compute and print the RMSE (verbose=False avoids printing it twice)
rmse = accuracy.rmse(predictions_pp, verbose=False)
print(f'RMSE: {rmse}')

# Factorization machine (FM)

FM takes into account user-item interactions and other features, such as users' contexts and items' attributes. It captures the second-order interactions between the vectors representing these features, which enriches the model's expressiveness. However, interactions involving less relevant features may introduce noise, since every interaction is weighted in the same way. For example, you may use FM to take the features of items into account.

\begin{equation}
\hat{y}_{FM}(\textbf{X}) = w_0 + \sum\limits_{j =1}^nw_jx_j + \sum\limits_{j=1}^n\sum\limits_{k=j+1}^n\textbf{v}_j^T\textbf{v}_kx_jx_k
\end{equation}

where $\textbf{X} \in \mathbb{R}^n$ is the feature vector, $n$ denotes the number of features, $w_0$ is the global bias, $w_j$ is the weight of the $j$-th feature, $\textbf{v}_j \in \mathbb{R}^d$ is the vector representing the $j$-th feature, and $\textbf{v}_j^T\textbf{v}_k$ is the weight of the interaction between the $j$-th and $k$-th features.
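The FM equation above can be checked numerically. The sketch below evaluates it twice: once with the literal double sum, and once with the well-known linear-time rewriting of the pairwise term, $\frac{1}{2}\sum_f\big[(\sum_j v_{jf}x_j)^2 - \sum_j v_{jf}^2x_j^2\big]$. The function names and the random parameters are hypothetical and serve only as an illustration.

```python
import numpy as np

def fm_predict_naive(x, w0, w, V):
    """Literal FM equation: bias + linear term + sum over all pairs j < k."""
    n = len(x)
    pairwise = sum(
        (V[j] @ V[k]) * x[j] * x[k]
        for j in range(n)
        for k in range(j + 1, n)
    )
    return w0 + w @ x + pairwise

def fm_predict_fast(x, w0, w, V):
    """Same value in O(n * d), using the sum-of-squares identity."""
    xv = V.T @ x                            # shape (d,): sum_j v_jf * x_j
    sq = (V ** 2).T @ (x ** 2)              # shape (d,): sum_j v_jf^2 * x_j^2
    return w0 + w @ x + 0.5 * float(np.sum(xv ** 2 - sq))

rng = np.random.default_rng(0)
n, d = 5, 2                                 # number of features, factor dimension
w0 = 0.3
w = rng.normal(size=n)
V = rng.normal(size=(n, d))
x = np.array([1.0, 0.0, 0.0, 1.0, 0.0])     # e.g. one-hot user (index 0) and item (index 3)

naive = fm_predict_naive(x, w0, w, V)
fast = fm_predict_fast(x, w0, w, V)
print(naive, fast)
```

With a one-hot user and a one-hot item as the only active features, the pairwise term reduces to a single inner product between the user's and the item's vectors, which is how FM generalises matrix factorization.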


In [None]:
!pip install fastFM

In [None]:
print(ratings.columns)

In [None]:
import numpy as np
from fastFM import als
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from scipy.sparse import csr_matrix

def prepare_data(ratings):
    """Map user and movie IDs to integer indices and build (user, movie) index pairs."""
    # Create unique integer indices for users and movies
    user_encoder = {uid: i for i, uid in enumerate(ratings['user_id'].unique())}
    movie_encoder = {mid: i for i, mid in enumerate(ratings['movie_id'].unique())}

    # Feature matrix of (user_index, movie_index) pairs and target vector
    X = np.column_stack([ratings['user_id'].map(user_encoder), ratings['movie_id'].map(movie_encoder)])
    y = ratings['user_rating'].values
    return X, y, user_encoder, movie_encoder

def to_one_hot(X, n_users, n_movies):
    """One-hot encode (user_index, movie_index) pairs: two active features per row.

    Columns [0, n_users) identify the user, columns [n_users, n_users + n_movies) the movie.
    The shape is fixed so that train, test and recommendation matrices share the same columns.
    """
    n = len(X)
    rows = np.repeat(np.arange(n), 2)
    cols = np.column_stack([X[:, 0], n_users + X[:, 1]]).ravel()
    return csr_matrix((np.ones(2 * n), (rows, cols)), shape=(n, n_users + n_movies))

def train_fm(X_sparse, y):
    # Using ALS from fastFM to train the model
    fm = als.FMRegression(n_iter=1000, init_stdev=0.1, rank=2, l2_reg_w=0.1, l2_reg_V=0.5)
    fm.fit(X_sparse, y)

    return fm

def predict_and_evaluate(fm, X_test_sparse, y_test):
    # Prediction on the held-out set
    y_pred_test = fm.predict(X_test_sparse)

    # Calculation of RMSE on test set
    rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_test))

    return rmse_test, y_pred_test

# Split the dataset
X, y, user_encoder, movie_encoder = prepare_data(ratings)
n_users, n_movies = len(user_encoder), len(movie_encoder)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model on one-hot encoded (user, movie) features
fm = train_fm(to_one_hot(X_train, n_users, n_movies), y_train)

# Predict and evaluate
rmse_test, y_pred_test = predict_and_evaluate(fm, to_one_hot(X_test, n_users, n_movies), y_test)

print(f'RMSE: {rmse_test}')

In [None]:
def fm_recommend(user_id, meta, ratings, user_encoder, movie_encoder):

    # Ensure the user_id exists in the encoder dictionary
    if user_id not in user_encoder:
        raise ValueError("User ID not found in the data.")

    user_index = user_encoder[user_id]

    # List out rated movies
    reviewed_movies = ratings[ratings['user_id'] == user_id]['movie_id'].unique()

    # Candidate movies: in the metadata, unseen by the user, and known to the encoder
    candidates = np.setdiff1d(meta.index.unique(), reviewed_movies)
    test_movies = np.array([movie_id for movie_id in candidates if movie_id in movie_encoder])
    test_movie_indices = np.array([movie_encoder[movie_id] for movie_id in test_movies])

    # One row per (user, candidate movie) pair, encoded exactly like the training data
    X_pairs = np.column_stack([np.full(len(test_movie_indices), user_index), test_movie_indices])
    X_test_sparse = to_one_hot(X_pairs, len(user_encoder), len(movie_encoder))

    # Use the trained FM model (the global `fm` from the previous cell) to predict
    y_pred = fm.predict(X_test_sparse)

    # Pair up the predicted ratings with movie ids
    predicted_ratings = list(zip(test_movies, y_pred))

    # Retrieve the top 10 movies with the highest estimated rating
    top_ratings = sorted(predicted_ratings, key=lambda x: x[1], reverse=True)[:10]

    # Map the predictions to movie titles
    top_movies = [(movie_id, meta.loc[movie_id, 'title'], rating) for movie_id, rating in top_ratings]

    return top_movies

In [None]:
# Example usage:
print("Top 10 recommended movie titles for user 1174530 :")
print(fm_recommend('1174530', meta, ratings, user_encoder, movie_encoder))

# MLP

You may also represent users and items by vectors and then feed them into an MLP to make predictions.
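The idea of feeding user and item vectors into an MLP can be sketched as a forward pass in plain NumPy. Everything here is an assumption for illustration: the sizes, the randomly initialised parameters (`P`, `Q`, `W1`, `W2`) and the `mlp_forward` name are hypothetical, and in practice the embeddings and weights would be learned jointly, for example with `torch.nn.Embedding` layers. The cells that follow take a different route and use genre indicators as input features.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, d, hidden = 4, 5, 8, 16      # toy sizes (assumed)

# Randomly initialised parameters; in practice these would be trained
P = rng.normal(scale=0.1, size=(n_users, d))   # one vector per user
Q = rng.normal(scale=0.1, size=(n_items, d))   # one vector per item
W1 = rng.normal(scale=0.1, size=(2 * d, hidden))
b1 = np.zeros(hidden)
W2 = rng.normal(scale=0.1, size=(hidden, 1))
b2 = np.zeros(1)

def mlp_forward(user_idx, item_idx):
    """Concatenate user and item vectors, then apply a one-hidden-layer MLP."""
    x = np.concatenate([P[user_idx], Q[item_idx]], axis=1)  # (batch, 2d)
    h = np.maximum(x @ W1 + b1, 0.0)                        # ReLU hidden layer
    return (h @ W2 + b2).ravel()                            # (batch,) predicted ratings

scores = mlp_forward(np.array([0, 1, 3]), np.array([2, 2, 4]))
print(scores.shape)
```

Unlike the inner product used by matrix factorization, the MLP can model non-linear interactions between the two vectors, at the cost of more parameters to train.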

In [7]:
ratings

Unnamed: 0,user_rating,user_rating_date,user_id,movie_id
0,4,2005-07-05,1380819,tt0305224
1,3,2005-07-05,185150,tt0305224
2,4,2005-07-06,1351377,tt0305224
3,2,2005-07-06,386143,tt0305224
4,3,2003-12-23,2173336,tt0305224
...,...,...,...,...
259813,5,2005-07-09,1139877,tt0361862
259814,4,2005-07-11,1460015,tt0361862
259815,5,2005-07-11,1098265,tt0361862
259816,4,2005-07-11,1962894,tt0361862


In [8]:
meta

Unnamed: 0,director,genre,actors,title
tt0305224,Peter Segal,[Comedy],"[Jack Nicholson, Adam Sandler, Marisa Tomei, W...",Anger Management
tt0245046,Gillian Armstrong,"[Drama, Romance, Thriller]","[Cate Blanchett, James Fleet, Abigail Cruttenden]",Charlotte Gray
tt0185125,Pedro Almodóvar,[Drama],"[Cecilia Roth, Marisa Paredes, Candela Peña, P...",All About My Mother
tt0196229,Ben Stiller,[Comedy],"[Ben Stiller, Owen Wilson, Christine Taylor, W...",Zoolander
tt0308644,Marc Forster,"[Biography, Drama, Family]","[Johnny Depp, Kate Winslet, Julie Christie, Du...",Finding Neverland
...,...,...,...,...
tt0203019,George Tillman Jr.,"[Biography, Drama]","[Cuba Gooding Jr., Robert De Niro, Charlize Th...",Men of Honor
tt0169547,Sam Mendes,"[Drama, Romance]","[Kevin Spacey, Annette Bening, Thora Birch, We...",American Beauty
tt0227538,Robert Rodriguez,"[Action, Adventure, Comedy]","[Alexa PenaVega, Daryl Sabara, Antonio Banderas]",Spy Kids
tt0374536,Nora Ephron,"[Comedy, Fantasy, Romance]","[Nicole Kidman, Will Ferrell, Shirley MacLaine...",Bewitched


In [80]:
# merge "meta" and "ratings"
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

meta_movie = meta.reset_index()
meta_movie.rename(columns={'index': 'movie_id'}, inplace=True)

merged_df = pd.merge(ratings, meta_movie, on='movie_id', how='inner')

mlb = MultiLabelBinarizer()
genres_encoded = mlb.fit_transform(merged_df['genre'])

# Now genres_encoded is a numpy array. Convert it back to DataFrame and join with the original data
genres_encoded_df = pd.DataFrame(genres_encoded, columns=mlb.classes_)
merged_df = merged_df.join(genres_encoded_df)

# For directors
# director_dummies = pd.get_dummies(merged_df['director'], prefix='director')
# merged_df = pd.concat([merged_df, director_dummies], axis=1)

# For actors
# Ensuring each entry in the actors column is a list (if not already)
# merged_df['actors'] = merged_df['actors'].apply(lambda x: x if isinstance(x, list) else [])

# mlb_actors = MultiLabelBinarizer()
# actors_encoded = mlb_actors.fit_transform(merged_df['actors'])

# Now actors_encoded is a numpy array. Convert it back to DataFrame and join with the original data
# actors_encoded_df = pd.DataFrame(actors_encoded, columns=mlb_actors.classes_)
# merged_df = merged_df.join(actors_encoded_df)

# Keep only the label and the genre indicators: drop the identifiers, the raw
# metadata columns and the rating date (no date features are used by this model)
merged_df = merged_df.drop(columns=['user_id', 'movie_id', 'title', 'director', 'genre', 'actors', 'user_rating_date'])

merged_df.head()

Unnamed: 0,user_rating,Action,Adventure,Animation,Biography,Comedy,Crime,Drama,Family,Fantasy,...,Horror,Music,Musical,Mystery,Romance,Sci-Fi,Sport,Thriller,War,Western
0,4,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,3,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,4,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,2,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,3,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [81]:
# This will print the data type of each column
print(merged_df.dtypes)

user_rating    int64
Action         int64
Adventure      int64
Animation      int64
Biography      int64
Comedy         int64
Crime          int64
Drama          int64
Family         int64
Fantasy        int64
History        int64
Horror         int64
Music          int64
Musical        int64
Mystery        int64
Romance        int64
Sci-Fi         int64
Sport          int64
Thriller       int64
War            int64
Western        int64
dtype: object


In [82]:
from sklearn.model_selection import train_test_split

# Choosing the label and features
X = merged_df.drop('user_rating', axis=1)  # features (all columns except 'user_rating')
y = merged_df['user_rating']  # label

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [83]:
import torch
from torch import nn

class MLP(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(MLP, self).__init__()
        self.layer1 = nn.Linear(input_size, hidden_size)  # Input to hidden layer
        self.relu = nn.ReLU()                            # Activation function
        self.layer2 = nn.Linear(hidden_size, output_size) # Hidden layer to output

    def forward(self, x):
        x = self.relu(self.layer1(x)) # Pass input through layer1, then relu
        x = self.layer2(x)            # Then through layer2
        return x

In [84]:
# Initialize the model
num_features = X_train.shape[1]  # Number of input features (one per genre column)
hidden_size = 100  # Number of units in the hidden layer
output_size = 1  # Single value output (rating)

# Use num_features as input_size
model = MLP(num_features, hidden_size, output_size)

# Define the Loss function and optimizer
criterion = nn.MSELoss()  # Mean Squared Error Loss
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # Adam optimizer

In [85]:
from torch.utils.data import DataLoader, TensorDataset
# Training model

# Convert features and labels to tensors
X_train_tensor = torch.FloatTensor(X_train.values)
y_train_tensor = torch.FloatTensor(y_train.values)

# Create TensorDataset
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)

# Define a batch size
batch_size = 64  # You can adjust this

# Create DataLoader
train_loader = DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)

# Training loop with mini-batches
for epoch in range(10):  # number of epochs
    for inputs, targets in train_loader:
        # Forward pass
        y_pred = model(inputs)

        # Compute loss
        loss = criterion(y_pred, targets.view(-1, 1))  # Make sure the target shape matches prediction

        # Backward pass and updates
        optimizer.zero_grad()  # zero the gradient buffers
        loss.backward()        # backpropagation
        optimizer.step()       # update weights

        # Print loss (or accumulate to print later)
        print(f'Epoch {epoch}, Loss: {loss.item()}')

[Streaming output truncated; only the last 5000 lines are shown.]
Epoch 8, Loss: 1.194939374923706
Epoch 8, Loss: 0.7724367380142212
Epoch 8, Loss: 0.9828863143920898
Epoch 8, Loss: 1.2075608968734741
Epoch 8, Loss: 0.9911182522773743
Epoch 8, Loss: 1.005747675895691
Epoch 8, Loss: 1.2213311195373535
Epoch 8, Loss: 0.9192814230918884
Epoch 8, Loss: 1.4675692319869995
Epoch 8, Loss: 1.259120225906372
Epoch 8, Loss: 1.2723184823989868
Epoch 8, Loss: 0.8400248289108276
Epoch 8, Loss: 0.8606266379356384
Epoch 8, Loss: 1.138293981552124
Epoch 8, Loss: 1.0424809455871582
Epoch 8, Loss: 1.085294246673584
Epoch 8, Loss: 1.21248197555542
Epoch 8, Loss: 1.1288424730300903
Epoch 8, Loss: 1.0070065259933472
Epoch 8, Loss: 1.218961238861084
Epoch 8, Loss: 1.3694554567337036
Epoch 8, Loss: 1.3853120803833008
Epoch 8, Loss: 1.0365772247314453
Epoch 8, Loss: 1.1731882095336914
Epoch 8, Loss: 1.159329891204834
Epoch 8, Loss: 0.8979784250259399
Epoch 8, Loss: 1.3239318132400513
Epoch 8, Loss: 0.8839197158813477
Epoch 8, Loss: 0

In [86]:
import torch
from sklearn.metrics import mean_squared_error

# Assuming X_test and y_test are available
X_test_tensor = torch.FloatTensor(X_test.values)
y_test_tensor = torch.FloatTensor(y_test.values)

# Make predictions using the trained model (no gradients are needed for evaluation)
with torch.no_grad():
    y_pred = model(X_test_tensor)

# Convert predictions and ground truth to flat numpy arrays
y_pred_np = y_pred.numpy().ravel()
y_test_np = y_test_tensor.numpy()

# Calculate RMSE as the square root of sklearn's mean_squared_error
# (the `squared=False` argument is deprecated in recent scikit-learn releases)
rmse = mean_squared_error(y_test_np, y_pred_np) ** 0.5
print(f'RMSE: {rmse}')

RMSE: 1.0706043243408203


In [91]:
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

def get_user_features(user_id, ratings_df, meta_df):
    # Merge ratings and meta data
    meta_movie = meta_df.reset_index()
    meta_movie.rename(columns={'index': 'movie_id'}, inplace=True)
    merged_df = pd.merge(ratings_df, meta_movie, on='movie_id', how='inner')

    # Filter data for the target user
    user_data = merged_df[merged_df['user_id'] == user_id]

    # Get all unique genres from the entire dataset
    all_genres = set()
    for genres_list in meta_df['genre']:
        all_genres.update(genres_list)

    # Encode genres
    mlb = MultiLabelBinarizer(classes=list(all_genres))
    genres_encoded = mlb.fit_transform(user_data['genre'])
    genres_encoded_df = pd.DataFrame(genres_encoded, columns=mlb.classes_)

    # Extract user features
    user_features = user_data.drop(columns=['user_id', 'movie_id', 'title', 'director', 'genre', 'actors'])
    user_features = user_features.drop_duplicates()

    # Optionally, extract date components
    user_features['user_rating_date'] = pd.to_datetime(user_features['user_rating_date'])
    user_features['year'] = user_features['user_rating_date'].dt.year
    user_features['month'] = user_features['user_rating_date'].dt.month
    user_features['day'] = user_features['user_rating_date'].dt.day
    user_features = user_features.drop(columns=['user_rating_date'])

    # Join with encoded genres
    user_features = user_features.join(genres_encoded_df)

    return user_features

In [92]:
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Reset the index of the meta DataFrame and rename the index column to 'movie_id'
meta_movie = meta.reset_index()
meta_movie.rename(columns={'index': 'movie_id'}, inplace=True)

# Merge ratings and meta_movie on 'movie_id'
movie_merged_df = pd.merge(ratings, meta_movie, on='movie_id', how='inner')

# Encode genres using MultiLabelBinarizer
mlb = MultiLabelBinarizer()
genres_encoded = mlb.fit_transform(movie_merged_df['genre'])

# Convert genres_encoded to a DataFrame with appropriate column names
genres_encoded_df = pd.DataFrame(genres_encoded, columns=mlb.classes_)

# Combine genres_encoded_df with movie_merged_df
movie_merged_df = pd.concat([movie_merged_df, genres_encoded_df], axis=1)

# Fill missing genre columns with 0
movie_merged_df[mlb.classes_] = movie_merged_df[mlb.classes_].fillna(0)

# Drop columns related to user information
movie_merged_df = movie_merged_df.drop(columns=['user_id', 'user_rating_date'])

# Drop 'title', 'genre' columns (and other columns you don't need)
movie_merged_df = movie_merged_df.drop(columns=['director', 'actors', 'title', 'genre'])

# Use 'movie_id' as the index
movie_merged_df.set_index('movie_id', inplace=True)

# Keep only the genre columns as movie features (still one row per rating, so movie IDs repeat)
movie_features = movie_merged_df.drop(columns=['user_rating'])


# Print the first few rows of movie_features
print(movie_features.head())

           Action  Adventure  Animation  Biography  Comedy  Crime  Drama  \
movie_id                                                                   
tt0305224       0          0          0          0       1      0      0   
tt0305224       0          0          0          0       1      0      0   
tt0305224       0          0          0          0       1      0      0   
tt0305224       0          0          0          0       1      0      0   
tt0305224       0          0          0          0       1      0      0   

           Family  Fantasy  History  Horror  Music  Musical  Mystery  Romance  \
movie_id                                                                        
tt0305224       0        0        0       0      0        0        0        0   
tt0305224       0        0        0       0      0        0        0        0   
tt0305224       0        0        0       0      0        0        0        0   
tt0305224       0        0        0       0      0        0   

In [94]:
import pandas as pd

def MLP_recommend(model, user_features, movie_features, meta_df):
    """
    Recommend the top 10 movies for a given user based on user features and movie features.
    """
    # Convert user features to a DataFrame if it's a dictionary
    if isinstance(user_features, dict):
        user_features = pd.DataFrame([user_features])

    # Ensure user_features and movie_features have the same columns in the same order
    user_features = user_features[movie_features.columns]

    # Convert user_features and movie_features to tensors
    user_tensor = torch.FloatTensor(user_features.values)
    movie_tensor = torch.FloatTensor(movie_features.values)

    # Run the model on the user's feature rows
    user_predictions = model(user_tensor)  # Shape: (num_user_rows, 1)

    # Flatten predictions
    user_predictions = user_predictions.view(-1)

    # Sort movies based on predicted ratings (in descending order)
    sorted_movie_indices = user_predictions.argsort(descending=True)

    # Get the top 10 movie IDs and their predicted ratings
    top_10_movie_ids = movie_features.index[sorted_movie_indices][:10].tolist()
    top_10_ratings = user_predictions[sorted_movie_indices][:10].tolist()

    # Create a list of tuples with (movie_id, movie_title, predicted_rating)
    top_movies = [(movie_id, meta_df.loc[movie_id, 'title'], rating) for movie_id, rating in zip(top_10_movie_ids, top_10_ratings)]

    return top_movies

In [96]:
# Example usage:
print("Top 10 recommended movie titles for user 1174530 :")
user_features = get_user_features('1174530', ratings, meta)
print(MLP_recommend(model, user_features, movie_features, meta))

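# Note: the 'nan' ratings below are most likely not caused by too few epochs.
# In get_user_features, genres_encoded_df gets a fresh 0..n-1 index while user_features
# keeps the row labels of merged_df, so the join leaves most genre columns as NaN and the
# model propagates them. The repeated title arises because movie_features has one row per
# rating (not one per movie) and is indexed with positions taken from the user's rows.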

Top 10 recommended movie titles for user 1174530 :
[('tt0305224', 'Anger Management', nan), ('tt0305224', 'Anger Management', nan), ('tt0305224', 'Anger Management', nan), ('tt0305224', 'Anger Management', nan), ('tt0305224', 'Anger Management', nan), ('tt0305224', 'Anger Management', nan), ('tt0305224', 'Anger Management', nan), ('tt0305224', 'Anger Management', nan), ('tt0305224', 'Anger Management', nan), ('tt0305224', 'Anger Management', nan)]
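One possible source of NaN values like those in the output above is index misalignment in a pandas `join`: a DataFrame built from a NumPy array gets a fresh `0..n-1` index, whereas rows obtained by filtering keep their original labels. The toy frames below are hypothetical and only illustrate the pitfall and one way around it, passing the original index when building the encoded frame.

```python
import pandas as pd

# Rows kept after a filter retain their original labels (here 10 and 42)
rows = pd.DataFrame({"user_rating": [4, 5]}, index=[10, 42])
genres = [[1, 0], [0, 1]]  # encoded genres, one list per row above

# Fresh 0..n-1 index: no label matches, so the join fills the genre columns with NaN
misaligned = rows.join(pd.DataFrame(genres, columns=["Comedy", "Drama"]))

# Reusing the original index keeps the rows aligned
aligned = rows.join(pd.DataFrame(genres, columns=["Comedy", "Drama"], index=rows.index))

print(misaligned)
print(aligned)
```

Any NaN in the input features propagates through the linear layers of a neural network, so the predictions become NaN regardless of how long the model was trained.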
