In [1]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/moviestitles/title.basics.tsv
/kaggle/input/moviestitles/title.episode.tsv
/kaggle/input/moviestitles/title.principals.tsv
/kaggle/input/moviestitles/title.ratings.tsv
/kaggle/input/moviestitles/name.basics.tsv
/kaggle/input/moviestitles/title.akas.tsv
/kaggle/input/moviestitles/title.crew.tsv
/kaggle/input/movies/movies.csv
/kaggle/input/movies/ratings.csv
/kaggle/input/movies/genome-tags.csv
/kaggle/input/movies/genome-scores.csv
/kaggle/input/movies/tags.csv
/kaggle/input/movies/links.csv


# **Load Data**
I've chosen to recommend films based on their ratings (`rating`) and genres, and to sort these recommendations according to their year of release (`year`), number of votes (`numVotes`) and average rating (`averageRating`).

In [2]:
import pandas as pd
import numpy as np
from surprise import SVD, Dataset, Reader
from surprise.model_selection import train_test_split
from surprise import accuracy
from collections import defaultdict

ratings = pd.read_csv('/kaggle/input/movies/ratings.csv')
movies = pd.read_csv('/kaggle/input/movies/movies.csv')
links = pd.read_csv('/kaggle/input/movies/links.csv')
basics = pd.read_csv('/kaggle/input/moviestitles/title.basics.tsv', sep='\t', low_memory=False)
imdb_ratings = pd.read_csv('/kaggle/input/moviestitles/title.ratings.tsv', sep='\t', low_memory=False)

## **Process Data**
To prepare the data, I took the following steps:

- **Identifier standardization**: Renaming and formatting identifiers ensures that DataFrames can be merged correctly on columns of common identifiers.
- **Data cleansing**: Removing irrelevant columns to keep only those I need.
- **Genre preparation**: Transforming genres into lists makes it easy to calculate similarity between films based on their genres.
- **Data type conversion**: Ensuring that identifiers are of the correct data type (integer or string) is essential for data merging and matching operations.

These data transformations and cleansing operations were essential to build an accurate and efficient recommendation model.

In [3]:
imdb_ratings=imdb_ratings.rename(columns={'tconst':'id'})
basics=basics.drop(columns=['endYear', 'isAdult', 'titleType', 'runtimeMinutes', 'originalTitle'])
basics=basics.rename(columns={'tconst':'id', 'primaryTitle':'title'})
basics['genres'] = basics['genres'].str.split(',')
movies['genres'] = movies['genres'].str.split('|')
links = links[links['tmdbId'].notnull() & ~links['tmdbId'].isin([np.inf, -np.inf])]
links['tmdbId'] = links['tmdbId'].astype(int)
links['imdbId'] = 'tt' + links['imdbId'].astype(str).str.zfill(7)

## **Merge dataframes**
- Merge all DataFrames to add IMDb (`imdbId`) and TMDb (`tmdbId`) identifiers to the movie data, which is crucial for linking information from different data sources.
- Combine IMDb notes with basic movie information, which is necessary to have complete data on each movie. 
- Combine all available information on each movie into a single DataFrame, including TMDb data, genres, IMDb notes, etc., which is crucial for linking information from different data sources.
- Merging the two genre lists. This gives a complete view of the genres for each film.

In [4]:
movies_with_links = pd.merge(movies, links, on='movieId')
imdb_ratings_with_basics = pd.merge(imdb_ratings, basics, left_on='id', right_on='id')
merged_data = pd.merge(movies_with_links, imdb_ratings_with_basics, left_on='imdbId', right_on='id')
merged_data = merged_data.drop(columns=['id'])

In [5]:
def merge_lists(list1, list2):
    return list(set(list1 + list2))

In [6]:
merged_data['genres'] = merged_data.apply(lambda row: merge_lists(row['genres_x'], row['genres_y']), axis=1)
user_movie_ratings = pd.merge(ratings, merged_data, on='movieId')
user_movie_ratings = user_movie_ratings.rename(columns={'title_y':'title'})
user_movie_ratings = user_movie_ratings.rename(columns={'startYear':'year'})
user_movie_ratings['genres'] = user_movie_ratings['genres'].astype(str)
df = user_movie_ratings[['userId', 'movieId', 'rating', 'title', 'genres', 'year', 'averageRating', 'numVotes']]

> Select columns relevant to the recommendation model and store them in the df DataFrame. The columns selected are :

- `userId` and `movieId`: Used by the `SVD model` to associate ratings with users and movies.
- `rating`: Used as training data for the `SVD model` to predict future user ratings for unrated movies.
- `title` and `genres`: Used to calculate similarity between films via `Jaccard similarity` and to display recommendations in an understandable way.
- `year`, `averageRating` and `numVotes`: Used to adjust recommendations to favor high-quality, recent or popular films.

In [7]:
del imdb_ratings
del basics
del movies
del ratings
del links
del movies_with_links
del imdb_ratings_with_basics
del merged_data
del user_movie_ratings

In [8]:
df.shape

(24981966, 8)

In [9]:
df.head()

Unnamed: 0,userId,movieId,rating,title,genres,year,averageRating,numVotes
0,1,296,5.0,Pulp Fiction,"['Comedy', 'Crime', 'Drama', 'Thriller']",1994,8.9,2237792
1,1,306,3.5,Three Colors: Red,"['Romance', 'Drama', 'Mystery']",1994,8.1,110608
2,1,307,5.0,Three Colors: Blue,"['Drama', 'Mystery', 'Music']",1993,7.8,111201
3,1,665,5.0,Underground,"['Comedy', 'Drama', 'War', 'Fantasy']",1995,8.0,61504
4,1,899,3.5,Singin' in the Rain,"['Comedy', 'Romance', 'Musical']",1952,8.3,262417


# **Prepare Sets for Model**

In [11]:
# All the rating are between 0.5 and 5
reader = Reader(rating_scale=(0.5, 5))

# Shuffle data
df = df.sample(frac=1, random_state=42).reset_index(drop=True)

# Separate data into training and test sets
test_size = 0.2
train_size = int((1 - test_size) * len(df))
train_df = df.iloc[:train_size]
test_df = df.iloc[train_size:]

# Convert to Surprise formats
train_data = Dataset.load_from_df(train_df[['userId', 'movieId', 'rating']], reader)
test_data = Dataset.load_from_df(test_df[['userId', 'movieId', 'rating']], reader)

# Build trainset and testset
trainset = train_data.build_full_trainset()
testset = list(test_df[['userId', 'movieId', 'rating']].itertuples(index=False, name=None))

# Dictionnary for genres
movie_genres = df.set_index('movieId')['genres'].str.split('|').to_dict()

In [12]:
print(trainset.n_users)
print(trainset.n_items)
print(trainset.n_ratings)
print(len(testset))

162541
56432
19985572
4996394


# **Users Score**
### **Model**
> The chosen model is implemented via the Surprise library. Singular Value Decomposition (SVD) is a powerful choice for recommender systems due to :

- It is capable of capturing latent relationships between users and items.
- It handles well the problem of data sparsity, common in recommender systems.
- It is efficient for large datasets.

In [13]:
model_svd = SVD(n_factors=100, n_epochs=20, lr_all=0.005, reg_all=0.02)
model_svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7a031c3f3850>

**Incorporating content similarity**

The system uses Jaccard similarity to compare film genres. This hybrid approach, combining collaborative filtering (SVD) with content-based filtering (genre similarity), overcomes the cold-start problem and improves the quality of recommendations.

In [14]:
def jaccard_similarity(set1, set2):
    intersection = len(set1.intersection(set2))
    union = len(set1.union(set2))
    return intersection / union if union != 0 else 0

def recommend_movies_for_two_users(user_id1, user_id2, model_svd, movie_genres, df, n=5):
    all_movies = df['movieId'].unique()
    
    # Predict ratings for each film for both users
    predictions1 = [model_svd.predict(user_id1, movie_id) for movie_id in all_movies]
    predictions2 = [model_svd.predict(user_id2, movie_id) for movie_id in all_movies]
    
    # Calculate the average of predictions
    avg_predictions = [(p1.est + p2.est) / 2 for p1, p2 in zip(predictions1, predictions2)]
    
    # Create a dictionary of average predictions
    movie_predictions = dict(zip(all_movies, avg_predictions))
    
    # Get movies already rated by the users
    user1_ratings = df[df['userId'] == user_id1].set_index('movieId')['rating']
    user2_ratings = df[df['userId'] == user_id2].set_index('movieId')['rating']
    
    # Calculate average ratings for films rated by the users
    common_movies = user1_ratings.index.intersection(user2_ratings.index)
    avg_ratings = (user1_ratings[common_movies] + user2_ratings[common_movies]) / 2
    
    # Adjust predictions based on content similarity and average scores
    for movie_id in all_movies:
        if movie_id in common_movies:
            avg_rating = avg_ratings[movie_id]
            movie_predictions[movie_id] = movie_predictions[movie_id] * (avg_rating / 5)
        else:
            # Calculer la similarité moyenne de genre avec les films notés
            genre_similarities = [jaccard_similarity(set(movie_genres[movie_id]), set(movie_genres[m])) 
                                  for m in common_movies if m in movie_genres]
            content_score = np.mean(genre_similarities) if genre_similarities else 0
            movie_predictions[movie_id] = 0.7 * movie_predictions[movie_id] + 0.3 * content_score
    
    # Sort movies
    sorted_predictions = sorted(movie_predictions.items(), key=lambda x: x[1], reverse=True)
    
    # Get details of recommended films
    recommended_movies = df[df['movieId'].isin([movie_id for movie_id, _ in sorted_predictions[:n]])][['movieId', 'title', 'genres', 'year', 'averageRating']].drop_duplicates('movieId')
    recommended_movies['predicted_score'] = [score for _, score in sorted_predictions[:n]]
    
    return recommended_movies.sort_values('predicted_score', ascending=False)

# **Recommend Movies for Pair**
### **The recommendation works as follow**
- **Predict Ratings**: Predict ratings for all movies for both users using the SVD model.
- **Average Predictions**: Calculate the average predicted ratings between the two users for each movie.
- **Adjust Predictions**:
    - For movies rated by both users, adjust predictions by the average rating and normalize by the maximum rating.
    - For movies not rated by both users, calculate the average genre similarity with rated movies and adjust predictions accordingly.
- **Sort Predictions**: Sort movies based on the adjusted predicted scores using .
- **Get Recommendations**: Retrieve details of the top n recommended movies based on predicted scores.

In [15]:
user_id1 = df['userId'].iloc[0]
user_id2 = df['userId'].iloc[1]
print(f"Generate recommendations for users {user_id1} and {user_id2}...")
recommendations = recommend_movies_for_two_users(user_id1, user_id2, model_svd, movie_genres, df)
print("\nRecommendations:")
recommendations

Generate recommendations for users 87306 and 139588...

Recommendations:


Unnamed: 0,movieId,title,genres,year,averageRating,predicted_score
26,58559,The Dark Knight,"['Crime', 'Action', 'IMAX', 'Drama']",2008,9.0,3.257338
461,170705,Band of Brothers,"['War', 'Action', 'Drama', 'History']",2001,9.4,3.254265
1338,5679,The Ring,"['Horror', 'Mystery', 'Thriller']",2002,7.1,3.249727
13247,86345,Louis C.K.: Hilarious,['Comedy'],2010,8.3,3.227058
82242,127098,Louis C.K.: Live at the Comedy Store,['Comedy'],2015,7.8,3.21571


**Find the most recommended movie**

In [16]:
most_recommended_movie = recommendations.iloc[0]

print("\nMost Recommended Movie:")
most_recommended_movie[['title', 'genres', 'year', 'averageRating', 'predicted_score']]


Most Recommended Movie:


title                                   The Dark Knight
genres             ['Crime', 'Action', 'IMAX', 'Drama']
year                                               2008
averageRating                                       9.0
predicted_score                                3.257338
Name: 26, dtype: object

**Create a list of the top 10 recommended movies**

In [17]:
top_10_recommendations = recommendations.head(10)
print("\nTop 10 Recommended Movies:")
top_10_recommendations[['title', 'genres', 'year', 'averageRating', 'predicted_score']]


Top 10 Recommended Movies:


Unnamed: 0,title,genres,year,averageRating,predicted_score
26,The Dark Knight,"['Crime', 'Action', 'IMAX', 'Drama']",2008,9.0,3.257338
461,Band of Brothers,"['War', 'Action', 'Drama', 'History']",2001,9.4,3.254265
1338,The Ring,"['Horror', 'Mystery', 'Thriller']",2002,7.1,3.249727
13247,Louis C.K.: Hilarious,['Comedy'],2010,8.3,3.227058
82242,Louis C.K.: Live at the Comedy Store,['Comedy'],2015,7.8,3.21571


# **Evaluate the model**
>`rmse` (Root Mean Squared Error) and `mae` (Mean Absolute Error) are calculated from the predictions and true rating values present in predictions. These metrics evaluate the accuracy of predictions compared to the true ratings of users in the test set.

>`precisions` and `recalls` are dictionaries where the keys are the user IDs and the values are the precision and recall scores respectively.

In [18]:
predictions = model_svd.test(testset)
rmse = accuracy.rmse(predictions)
mae = accuracy.mae(predictions)

RMSE: 0.7775
MAE:  0.5866


In [19]:
def precision_recall_at_k(predictions, k=10, threshold=3.5):
    user_est_true = defaultdict(list)
    for uid, _, true_r, est, _ in predictions:
        user_est_true[uid].append((est, true_r))

    precisions = dict()
    recalls = dict()
    for uid, user_ratings in user_est_true.items():
        user_ratings.sort(key=lambda x: x[0], reverse=True)
        n_rel = sum((true_r >= threshold) for (_, true_r) in user_ratings)
        n_rec_k = sum((est >= threshold) for (est, _) in user_ratings[:k])
        n_rel_and_rec_k = sum(((true_r >= threshold) and (est >= threshold))
                              for (est, true_r) in user_ratings[:k])
        precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 1
        recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 1

    return precisions, recalls

In [20]:
precisions, recalls = precision_recall_at_k(predictions, k=10, threshold=3.5)
print(f"Medium accuracy: {sum(prec for prec in precisions.values()) / len(precisions)}")
print(f"Average recall: {sum(rec for rec in recalls.values()) / len(recalls)}")

Medium accuracy: 0.837207967882356
Average recall: 0.585214235189616


# **Critics and Interpretations**

1. **Average Precision @10: 0.8381**
   - Average Precision @10 indicates the average proportion of recommended items that were actually relevant to the user among the top 10 recommendations. A precision of 0.8381 means that, on average, approximately 83.81% of the recommended movies in the top 10 were genuinely liked or useful for the users. This metric is important for evaluating the relevance of personalized recommendations.

2. **Average Recall @10: 0.5851**
   - Average Recall @10 measures the average proportion of relevant items that were effectively recommended among the top 10 recommendations. A recall of 0.5851 indicates that, on average, approximately 58.51% of the movies relevant to a user were included in the top 10 recommendations. This evaluates the model's effectiveness in capturing all relevant user preferences.

3. **RMSE (Root Mean Squared Error): 0.7776**
   - RMSE is a measure of the average difference between the model's predicted values and the actual values (in this case, movie ratings). An RMSE of 0.7776 means that, on average, the model's predictions have an error of 0.7776 stars compared to the actual user ratings. A lower RMSE indicates better predictive accuracy of the model.

4. **MAE (Mean Absolute Error): 0.5868**
   - MAE measures the average absolute difference between the model's predicted ratings and the actual ratings. A MAE of 0.5868 means that, on average, the model's predictions have an absolute error of 0.5868 stars compared to the actual user ratings. Similar to RMSE, a lower MAE signifies better predictive performance.

**Overall Interpretation:**
- The results show a relatively high average precision (@10), indicating that the recommendations are generally well-targeted and relevant for users.
- However, the average recall is slightly lower than average precision, suggesting that the model may not always capture all relevant user preferences within the top 10 recommendations.
- In terms of prediction errors (RMSE and MAE), the values are relatively low, indicating that the SVD model has a good ability to predict movie ratings accurately.

In conclusion, while the model demonstrates good overall performance in terms of precision and rating prediction, there may still be room for improvement to slightly increase recall and further reduce prediction errors.