
This task tests your ability to apply Recommendation Engine concepts and techniques to a real-world Movie Recommender System.


Task: Build a Movie Recommender system with the following methods:

- Popularity
- Content Filter
- Collaborative Filter
- Matrix Factorization

Also, try the following libraries on the dataset:

- Turicreate
- Surprise

Dataset: MovieLens 20M

Source: https://grouplens.org/datasets/movielens/20m/

Hints:

1. Read movies.csv, ratings.csv and tags.csv. genome-scores.csv and genome-tags.csv are not needed.

2. Create a content filtering method on metadata obtained from merging movies and tags.

3. Form the metadata by joining all tag fields for each movie title.

4. Build a TfidfVectorizer model and TruncatedSVD for the content filter (latent matrix 1) on this data.

5. Create a collaborative filter on the user-movie matrix (formed from a pivot table on the ratings data).

6. Create latent matrix 2 on this data.

7. Code a hybrid model.


Import Libraries

In [56]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds


Load the dataset

In [57]:
movies = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/movies.csv')
ratings = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ratings.csv')
tags = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/tags.csv')

In [58]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [59]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,2,3.5,1112486027
1,1,29,3.5,1112484676
2,1,32,3.5,1112484819
3,1,47,3.5,1112484727
4,1,50,3.5,1112484580


In [60]:
tags.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,18,4141,Mark Waters,1240597180
1,65,208,dark hero,1368150078
2,65,353,dark hero,1368150079
3,65,521,noir thriller,1368149983
4,65,592,dark hero,1368150078


Create content filtering method on metadata obtained from merging movies and tags

In [61]:
# Sample 1% of the ratings to keep memory use manageable
ratings_small = ratings.sample(frac=0.01, random_state=42)

In [62]:
# Calculate average ratings per movie
average_ratings = ratings.groupby('movieId')['rating'].mean().reset_index()
average_ratings.columns = ['movieId', 'average_rating']

In [63]:
merged_df = pd.merge(movies, tags, on='movieId', how='left')
merged_df.head()


Unnamed: 0,movieId,title,genres,userId,tag,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1644.0,Watched,1417737000.0
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1741.0,computer animation,1183903000.0
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1741.0,Disney animated feature,1183933000.0
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1741.0,Pixar animation,1183935000.0
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1741.0,Téa Leoni does not star in this movie,1245094000.0


Metadata should be formed by joining all tag fields for each movie title.

In [64]:
# Create a metadata column by combining all tags for each movie.
# dropna() keeps untagged movies from ending up with the literal string 'nan'.
merged_df['metadata'] = merged_df.groupby('movieId')['tag'].transform(lambda x: ' '.join(x.dropna().astype(str)))
merged_df = merged_df.drop_duplicates(subset='movieId')
merged_df.head()

Unnamed: 0,movieId,title,genres,userId,tag,timestamp,metadata
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1644.0,Watched,1417737000.0,Watched computer animation Disney animated fea...
436,2,Jumanji (1995),Adventure|Children|Fantasy,1629.0,time travel,1394473000.0,time travel adapted from:book board game child...
559,3,Grumpier Old Men (1995),Comedy|Romance,2274.0,old people that is actually funny,1208582000.0,old people that is actually funny sequel fever...
577,4,Waiting to Exhale (1995),Comedy|Drama|Romance,9197.0,chick flick,1308559000.0,chick flick revenge characters chick flick cha...
583,5,Father of the Bride Part II (1995),Comedy,9197.0,Diane Keaton,1305523000.0,Diane Keaton family sequel Steve Martin weddin...
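
Because the merge is a left join, movies without any tag carry NaN in the `tag` column. The following is a minimal sketch, on made-up toy frames (`movies_toy` and `tags_toy` are illustrative names, not part of the notebook), of one way to build a single metadata string per movie while keeping missing tags out of the text:

```python
import pandas as pd

# Toy stand-ins for movies.csv and tags.csv (values are illustrative only).
movies_toy = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["A (1995)", "B (1995)", "C (1995)"],
})
tags_toy = pd.DataFrame({
    "movieId": [1, 1, 2],
    "tag": ["pixar", "animation", "board game"],
})

# One metadata string per movie; dropna() keeps missing tags out of the text.
tag_text = (
    tags_toy.dropna(subset=["tag"])
    .groupby("movieId")["tag"]
    .agg(" ".join)
    .rename("metadata")
    .reset_index()
)

# Left join keeps untagged movies; they get an empty metadata string.
metadata_toy = movies_toy.merge(tag_text, on="movieId", how="left")
metadata_toy["metadata"] = metadata_toy["metadata"].fillna("")
print(metadata_toy)
```

Aggregating the tags first and merging afterwards also avoids the intermediate frame with one row per (movie, tag) pair.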


In [65]:
print(merged_df.describe())
print(merged_df.info())
print(merged_df.isna())

             movieId         userId     timestamp
count   27278.000000   19545.000000  1.954500e+04
mean    59855.480570   45139.216475  1.313807e+09
std     44429.314697   38836.035925  8.811192e+07
min         1.000000      18.000000  1.135936e+09
25%      6931.250000    6988.000000  1.243302e+09
50%     68068.000000   39214.000000  1.331660e+09
75%    100293.250000   70201.000000  1.395863e+09
max    131262.000000  138436.000000  1.427747e+09
<class 'pandas.core.frame.DataFrame'>
Index: 27278 entries, 0 to 473296
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   movieId    27278 non-null  int64  
 1   title      27278 non-null  object 
 2   genres     27278 non-null  object 
 3   userId     19545 non-null  float64
 4   tag        19545 non-null  object 
 5   timestamp  19545 non-null  float64
 6   metadata   27278 non-null  object 
dtypes: float64(2), int64(1), object(4)
memory usage: 1.7+ MB
None
        movieId  ti

In [66]:
# Fill NaN values with empty string
merged_df['metadata'] = merged_df['metadata'].fillna('')

In [67]:
# Filter ratings_small to include only movies present in merged_df
valid_movie_ids = merged_df['movieId'].unique()
ratings_small = ratings_small[ratings_small['movieId'].isin(valid_movie_ids)]

Build a Tfidf Vectorizer model and TruncatedSVD for Content filter - Latent matrix 1 on this data

Latent matrix 1 captures content features from the movie metadata (here, the joined tags) and is the basis for content-based filtering.

In [68]:
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(merged_df['metadata'])

In [69]:
svd = TruncatedSVD(n_components=200)
latent_matrix_1 = svd.fit_transform(tfidf_matrix)
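
As a rough illustration of this step, the sketch below runs the same TF-IDF then TruncatedSVD pipeline on a four-document toy corpus. The documents, the variable names and the choice of two components are assumptions made for the example; `random_state` is fixed only so that repeated runs give the same factors.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Toy metadata strings (illustrative, not taken from MovieLens).
docs = [
    "pixar animation toys friendship",
    "pixar animation fish ocean",
    "board game jungle adventure",
    "dark hero noir thriller",
]

tfidf_toy = TfidfVectorizer(stop_words="english")
X = tfidf_toy.fit_transform(docs)          # sparse, shape (n_docs, n_terms)

# n_components cannot exceed the number of TF-IDF features.
svd_toy = TruncatedSVD(n_components=2, random_state=42)
latent = svd_toy.fit_transform(X)          # dense, shape (n_docs, 2)

print(X.shape, latent.shape)
print("explained variance kept:", svd_toy.explained_variance_ratio_.sum())
```

The summed `explained_variance_ratio_` gives a quick feel for how much of the TF-IDF signal a given number of components retains.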


Create a Collab filter on User Movie matrix (formed from pivot table on ratings data)

In [71]:
# Create User-Item Matrix
user_movie_matrix = ratings.pivot(index='userId', columns='movieId', values='rating').fillna(0)


In [72]:
# Convert to sparse matrix
user_movie_matrix_sparse = csr_matrix(user_movie_matrix.values)

In [73]:
# Dimensionality reduction with truncated SVD (k = 50 latent factors)
U, sigma, Vt = svds(user_movie_matrix_sparse, k=50)
sigma = np.diag(sigma)  # turn the 1-D singular values into a diagonal matrix
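
Pivoting the ratings table first produces a dense users x movies frame that is mostly zeros. A possible alternative, sketched here on toy data with hypothetical names (`ratings_toy`, `R`), is to build the sparse matrix directly from the (user, movie, rating) triplets and pass it to `svds`:

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# Toy ratings (illustrative values only).
ratings_toy = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3, 3, 4],
    "movieId": [10, 20, 10, 30, 20, 30, 10],
    "rating":  [4.0, 3.0, 5.0, 2.0, 4.5, 1.0, 3.5],
})

# Map raw ids to contiguous row/column positions.
user_codes, user_ids = pd.factorize(ratings_toy["userId"], sort=True)
movie_codes, movie_ids = pd.factorize(ratings_toy["movieId"], sort=True)

# Build the sparse user x movie matrix without a dense pivot.
R = csr_matrix(
    (ratings_toy["rating"].to_numpy(), (user_codes, movie_codes)),
    shape=(len(user_ids), len(movie_ids)),
)

# k must be smaller than min(R.shape).
U_toy, s_toy, Vt_toy = svds(R, k=2)
print(R.shape, U_toy.shape, s_toy.shape, Vt_toy.shape)
```

`user_ids` and `movie_ids` keep the mapping from matrix positions back to the original ids, which is needed later to turn column positions into movie titles.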

Create latent matrix 2 on this data. Latent matrix 2 captures patterns in user ratings and preferences and is the basis for collaborative filtering.

In [74]:
# Rank-50 reconstruction of the ratings matrix, shape (n_users, n_movies)
latent_matrix_2 = np.dot(np.dot(U, sigma), Vt)
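
The product of the three SVD factors is a reconstruction of the full users x movies matrix, not a compact representation. If a per-movie latent matrix comparable to latent matrix 1 is wanted, one option is to keep the item factors instead. The sketch below uses a random toy matrix, and `item_factors` is a name chosen for this example:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)
# Toy ratings, 6 users x 5 movies, values 0..5 (0 = unrated).
R = csr_matrix(rng.integers(0, 6, size=(6, 5)).astype(float))

k = 2
U, s, Vt = svds(R, k=k)

# Rank-k reconstruction: one predicted score per (user, movie) pair.
reconstruction = U @ np.diag(s) @ Vt      # shape (n_users, n_movies)

# Item factors: one k-dimensional vector per movie.
item_factors = Vt.T * s                   # shape (n_movies, k)

print(reconstruction.shape, item_factors.shape)
```

The item factors have one row per movie and only `k` columns, so they can be aligned row by row with a content-based latent matrix.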

Code hybrid model

In [75]:
common_movie_ids = np.intersect1d(user_movie_matrix.columns, merged_df['movieId'].values)

In [32]:
print("Latent Matrix 1 shape:", latent_matrix_1.shape)
print("Latent Matrix 2 shape:", latent_matrix_2.shape)

Latent Matrix 1 shape: (27278, 200)
Latent Matrix 2 shape: (16325, 4649)


In [76]:
latent_matrix_1_df = pd.DataFrame(latent_matrix_1, index=merged_df['movieId'])
latent_matrix_1_filtered = latent_matrix_1_df.loc[common_movie_ids].values


In [77]:
latent_matrix_2_df = pd.DataFrame(latent_matrix_2.T, index=user_movie_matrix.columns)
latent_matrix_2_filtered = latent_matrix_2_df.loc[common_movie_ids].values.T

In [78]:
print("Latent Matrix 1 shape:", latent_matrix_1_filtered.shape)
print("Latent Matrix 2 shape:", latent_matrix_2_filtered.shape)

Latent Matrix 1 shape: (26744, 200)
Latent Matrix 2 shape: (138493, 26744)


In [80]:
hybrid_matrix = np.hstack((latent_matrix_1_filtered, latent_matrix_2_filtered.T))

print("Hybrid Matrix shape:", hybrid_matrix.shape)


Hybrid Matrix shape: (26744, 138693)


Recommendation Functions

In [81]:
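def popular_recommendations():
    # Count ratings per movie and keep the ten most-rated movieIds
    popularity_df = ratings.groupby('movieId').size().reset_index(name='count')
    top_movies = popularity_df.sort_values('count', ascending=False).head(10)
    # isin() returns rows in the order of `movies`, not ranked by count
    return movies[movies['movieId'].isin(top_movies['movieId'])]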

In [82]:
print(popular_recommendations())

      movieId                                      title  \
108       110                          Braveheart (1995)   
257       260  Star Wars: Episode IV - A New Hope (1977)   
293       296                        Pulp Fiction (1994)   
315       318           Shawshank Redemption, The (1994)   
352       356                        Forrest Gump (1994)   
476       480                       Jurassic Park (1993)   
523       527                    Schindler's List (1993)   
583       589          Terminator 2: Judgment Day (1991)   
587       593           Silence of the Lambs, The (1991)   
2486     2571                         Matrix, The (1999)   

                                genres  
108                   Action|Drama|War  
257            Action|Adventure|Sci-Fi  
293        Comedy|Crime|Drama|Thriller  
315                        Crime|Drama  
352           Comedy|Drama|Romance|War  
476   Action|Adventure|Sci-Fi|Thriller  
523                          Drama|War  
583        
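
The table above lists the ten most-rated titles in `movies` order rather than by rank. A small sketch of one way to keep the ranking and the rating count, using toy frames and a hypothetical helper `popular_ranked`:

```python
import pandas as pd

# Toy frames (illustrative values only).
movies_toy = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["A", "B", "C"],
})
ratings_toy = pd.DataFrame({
    "movieId": [3, 3, 3, 1, 1, 2],
    "rating":  [4.0, 5.0, 3.0, 2.0, 4.0, 5.0],
})

def popular_ranked(ratings_df, movies_df, n=10):
    """Return the n most-rated movies, most popular first."""
    counts = (
        ratings_df.groupby("movieId")
        .size()
        .reset_index(name="count")
        .sort_values("count", ascending=False)
        .head(n)
    )
    # Merging from the sorted counts preserves the popularity order.
    return counts.merge(movies_df, on="movieId", how="left")

top = popular_ranked(ratings_toy, movies_toy, n=2)
print(top)
```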

In [83]:
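def content_based_recommendations(movie_id):
    # Row position in tfidf_matrix; merged_df keeps its pre-deduplication index
    # labels, so a positional lookup is needed rather than merged_df.index.
    movie_index = np.flatnonzero(merged_df['movieId'].values == movie_id)[0]
    sim_scores = cosine_similarity(tfidf_matrix[movie_index], tfidf_matrix)
    # Ten highest scores in ascending order; the query movie itself comes last.
    similar_movies = sim_scores.argsort().flatten()[-10:]
    # merged_df rows follow the order of `movies`, so positions carry over.
    return movies.iloc[similar_movies]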

In [84]:
print(content_based_recommendations(movie_id=1))

       movieId                       title  \
21168   103141  Monsters University (2013)   
8278      8961     Incredibles, The (2004)   
11614    50872          Ratatouille (2007)   
15401    78499          Toy Story 3 (2010)   
6271      6377         Finding Nemo (2003)   
5121      5218              Ice Age (2002)   
4790      4886       Monsters, Inc. (2001)   
2270      2355        Bug's Life, A (1998)   
3027      3114          Toy Story 2 (1999)   
0            1            Toy Story (1995)   

                                                 genres  
21168                        Adventure|Animation|Comedy  
8278         Action|Adventure|Animation|Children|Comedy  
11614                          Animation|Children|Drama  
15401  Adventure|Animation|Children|Comedy|Fantasy|IMAX  
6271                Adventure|Animation|Children|Comedy  
5121                Adventure|Animation|Children|Comedy  
4790        Adventure|Animation|Children|Comedy|Fantasy  
2270                Adventure

In [85]:
def collaborative_filtering_recommendations(user_id, num_recommendations=10):
    user_ratings = user_movie_matrix.loc[user_id].values.reshape(1, -1)
    # (1, n_movies) x (n_movies, n_users): the result has one score per user, not per movie
    pred_ratings = np.dot(user_ratings, latent_matrix_2.T)
    # Positions of the highest scores, in ascending order
    recommended_movie_indices = pred_ratings.argsort().flatten()[-num_recommendations:]
    # Keep only positions that fall inside the 'movies' DataFrame
    valid_indices = [idx for idx in recommended_movie_indices if idx < len(movies)]
    return movies.iloc[valid_indices]

In [86]:
print(collaborative_filtering_recommendations(user_id=1))

      movieId                       title          genres
8404    25819  Mark of the Vampire (1935)  Horror|Mystery
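
A single result for ten requested recommendations suggests an index mismatch: multiplying the user's ratings row by `latent_matrix_2.T` yields one score per user, and most of those positions are then dropped by the `len(movies)` check. The sketch below shows one possible alternative on toy data: read the predicted scores from the user's own row of the reconstruction and map column positions back to movieIds. The toy frames and the helper `recommend_for_user` are hypothetical, and plain `np.linalg.svd` stands in for the `svds` step.

```python
import numpy as np
import pandas as pd

# Toy user x movie matrix indexed by userId, columns are movieIds.
um = pd.DataFrame(
    [[5.0, 0.0, 3.0, 0.0],
     [4.0, 0.0, 0.0, 2.0],
     [0.0, 1.0, 5.0, 0.0]],
    index=pd.Index([1, 2, 3], name="userId"),
    columns=pd.Index([10, 20, 30, 40], name="movieId"),
)
movies_toy = pd.DataFrame({"movieId": [10, 20, 30, 40],
                           "title": ["A", "B", "C", "D"]})

# Rank-2 reconstruction via plain SVD (stand-in for the svds step).
U, s, Vt = np.linalg.svd(um.to_numpy(), full_matrices=False)
pred = (U[:, :2] * s[:2]) @ Vt[:2]         # shape (n_users, n_movies)

def recommend_for_user(user_id, n=2):
    """Top-n unseen movies for user_id, best first (hypothetical helper)."""
    row = um.index.get_loc(user_id)        # label -> row position
    scores = pd.Series(pred[row], index=um.columns)
    scores = scores[um.loc[user_id] == 0]  # drop movies already rated
    top_ids = scores.sort_values(ascending=False).head(n).index
    return movies_toy.set_index("movieId").loc[top_ids].reset_index()

print(recommend_for_user(1))
```

Looking movies up by movieId rather than by position avoids relying on the matrix columns and the `movies` rows being in the same order.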


In [91]:
def hybrid_recommendations(user_id, movie_id):
    # Positional row of the movie in tfidf_matrix (merged_df index labels are not positions)
    movie_index = np.flatnonzero(merged_df['movieId'].values == movie_id)[0]
    user_ratings = user_movie_matrix.loc[user_id].values.reshape(1, -1)
    # TF-IDF row (one entry per term) followed by the user's ratings (one entry per movie)
    user_movie_vector = np.hstack((tfidf_matrix[movie_index].toarray(), user_ratings))

    # Stack per-movie content factors with the transposed collaborative matrix
    hybrid_matrix = np.hstack((latent_matrix_1_filtered, latent_matrix_2_filtered.T))

    # Requires hybrid_matrix.shape[1] == user_movie_vector.shape[1]
    pred_ratings = np.dot(hybrid_matrix, user_movie_vector.T)
    recommended_movies = pred_ratings.argsort().flatten()[-10:]
    return movies.iloc[recommended_movies]

In [92]:
print(hybrid_recommendations(user_id=1, movie_id=1))

ValueError: shapes (26744,138693) and (50607,1) not aligned: 138693 (dim 1) != 50607 (dim 0)
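
The error follows from the shapes. `hybrid_matrix` has 200 + 138493 = 138693 columns (SVD components plus one column per user), while `user_movie_vector` concatenates a raw TF-IDF row (one entry per vocabulary term) with a ratings row (one entry per movie), 50607 entries in total. One way to make the pieces compatible is to compare movies with movies: compute cosine similarity in each latent space and blend the two scores. The sketch below assumes both latent matrices have one row per movie in the same order. The random stand-in matrices, the weight `w_content` and the helper `hybrid_similar` are hypothetical, and this variant ignores the `user_id` argument.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
n_movies = 6
movie_ids = np.array([10, 20, 30, 40, 50, 60])

# Stand-ins for the two per-movie latent matrices (same row order assumed).
content_latent = rng.normal(size=(n_movies, 4))   # e.g. from TF-IDF + SVD
collab_latent = rng.normal(size=(n_movies, 3))    # e.g. item factors

def hybrid_similar(movie_id, n=3, w_content=0.5):
    """Rank other movies by a weighted mean of two cosine similarities."""
    pos = int(np.flatnonzero(movie_ids == movie_id)[0])
    sim_content = cosine_similarity(content_latent[pos:pos + 1], content_latent)[0]
    sim_collab = cosine_similarity(collab_latent[pos:pos + 1], collab_latent)[0]
    score = w_content * sim_content + (1 - w_content) * sim_collab
    score[pos] = -np.inf                           # exclude the query movie
    order = np.argsort(score)[::-1][:n]
    return movie_ids[order]

print(hybrid_similar(10))
```

Blending similarities rather than concatenating raw vectors keeps each space on a comparable scale, and the weight makes the balance between content and collaborative signals explicit.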