# Exploring Movie Recommendation Strategies

Movie recommendation systems come in various forms, catering to different user preferences and needs. In this blog post, we'll delve into the implementation and mathematical foundation of two movie recommendation strategies: **Most Popular** and **Top Rated**. Neither strategy is personalized: every user receives the same list, ranked by popularity and by average rating, respectively.

## Most Popular Movie Recommendations


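The most popular movie recommendation strategy prioritizes movies based on the total number of views. Because the dataset records ratings rather than views, the number of ratings a movie has received serves as a proxy for its view count. The assumption is that movies with more views are likely to be well-received by a broader audience. The steps involved are: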

1. **Representation of Movies**: Movies are stored in a dictionary, with movie IDs as keys and the corresponding details (title and genres) as values.

2. **Counting Views**: The total number of views for each movie is calculated from the user reviews.

3. **Ranking Movies**: Movies are ranked based on the total number of views, and the top-k movies are recommended.
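The counting and ranking steps above can be sketched on a tiny hand-made table. This is a minimal illustration, not the class used later in this post: `toy_ratings` and `most_popular_ids` are hypothetical names, and the columns simply mirror those of `ratings.csv`.

```python
import pandas as pd

# Toy ratings table (made-up data); columns mirror ratings.csv.
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3, 3, 3],
    "movieId": [10, 20, 10, 30, 10, 20, 30],
    "rating":  [4.0, 3.0, 5.0, 2.0, 4.5, 3.5, 4.0],
})

def most_popular_ids(ratings, top_k=10):
    # Each rating row is treated as one "view" of the movie.
    views = ratings.groupby("movieId")["userId"].count()
    # Sort by view count, highest first, and keep the top_k movie IDs.
    return views.sort_values(ascending=False).head(top_k).index.tolist()

# Movie 10 was rated by all three users, so it ranks first.
print(most_popular_ids(toy_ratings, top_k=1))
```

Note that the order of movies with the same view count is not guaranteed by this sketch.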


## Top Rated Movie Recommendations


The top-rated movie recommendation strategy focuses on movies with the highest average ratings. The assumption is that movies with consistently high ratings are likely to be enjoyable for users. The steps involved are:

1. **Representation of Movies**: Movies are represented as dictionaries, similar to the most popular strategy.

2. **Calculating Average Ratings**: The average rating for each movie is computed from user reviews.

3. **Ranking Movies**: Movies are ranked based on average ratings, and the top-k movies are recommended.
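As a rough sketch of these steps, the snippet below averages ratings per movie on made-up data and ranks by the mean. The names `toy_ratings` and `top_rated_ids` are hypothetical, and the `min_ratings` filter is an assumption added here for illustration (it is not part of the implementation in this post): without it, a movie with a single 5-star rating outranks a movie with many high ratings.

```python
import pandas as pd

# Toy ratings table (made-up data); columns mirror ratings.csv.
toy_ratings = pd.DataFrame({
    "userId":  [1, 2, 3, 1, 2, 3],
    "movieId": [10, 10, 10, 20, 20, 30],
    "rating":  [4.0, 5.0, 4.5, 3.0, 3.5, 5.0],
})

def top_rated_ids(ratings, top_k=10, min_ratings=1):
    # Mean rating and number of ratings per movie.
    stats = ratings.groupby("movieId")["rating"].agg(["mean", "count"])
    # Hypothetical filter: ignore movies with fewer than min_ratings ratings.
    stats = stats[stats["count"] >= min_ratings]
    # Highest average first.
    return stats.sort_values("mean", ascending=False).head(top_k).index.tolist()

print(top_rated_ids(toy_ratings, top_k=2))                 # movie 30 leads with one rating
print(top_rated_ids(toy_ratings, top_k=2, min_ratings=2))  # movie 30 is filtered out
```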


In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from collections import OrderedDict

In [2]:
movies = pd.read_csv('datasets/ml-latest-small/movies.csv')
ratings = pd.read_csv('datasets/ml-latest-small/ratings.csv')

In [3]:
class MLDataset:
    def __init__(self, movies, ratings):
        self.movies = movies.copy()
        self.ratings = ratings.copy()
        self.m2m_new = {idx: i for i, idx in enumerate(ratings['movieId'].unique())}
        self.u2u_new = {idx: i for i, idx in enumerate(ratings['userId'].unique())}
    def convert(self):
        self.movies['movieId'] = self.movies['movieId'].apply(lambda x: self.m2m_new.get(x, -1))
        self.movies['movieId'] = self.movies['movieId'].astype(int)
        self.ratings['userId'] = self.ratings['userId'].map(self.u2u_new)
        self.ratings['movieId'] = self.ratings['movieId'].map(self.m2m_new)
        return self.movies, self.ratings
        


In [4]:
dataset = MLDataset(movies, ratings)
movies_m, ratings_m = dataset.convert()

In [5]:
rating_train, ratings_test = train_test_split(ratings, test_size=0.2, random_state=42)

In [6]:
class MovieRecSys:
    def __init__(self, movies, reviews):
        self.movies = self._get_movies(movies.copy())
        self.reviews = self._get_reviews(reviews.copy())
        self.movie_views = self._get_views()
        self.movie_avg_ratings = self._get_avg_ratings()

    def _get_movies(self, movies):
        movies['genres'] = movies['genres'].apply(lambda x: set(x.split('|')))
        movies_dict = movies.set_index('movieId').to_dict(orient='index')
        return movies_dict
    
    def _get_reviews(self, reviews):
        # Convert 'timestamp' to datetime
        reviews['timestamp'] = pd.to_datetime(reviews['timestamp'], unit='s')

        # Convert 'timestamp' to ISO format
        reviews['timestamp'] = reviews['timestamp'].dt.strftime('%Y-%m-%dT%H:%M:%S')
        return reviews
    
    def _get_views(self):
        reviews = self.reviews
        counts = reviews.groupby('movieId').count()['userId'].to_dict()
        ordered_counts = OrderedDict(sorted(counts.items(), key=lambda x: x[1], reverse=True))
        return ordered_counts
    
    def _get_avg_ratings(self):
        reviews = self.reviews
        # Mean rating per movie, ordered from highest to lowest average
        avg_ratings = reviews.groupby('movieId')['rating'].mean().to_dict()
        ordered_avg_ratings = OrderedDict(sorted(avg_ratings.items(), key=lambda x: x[1], reverse=True))
        return ordered_avg_ratings
    
    def most_popular(self, top_k=10):
        # Get the top k most viewed movies
        top_k_movies = list(self.movie_views.items())[:top_k]
        response = list(map(lambda x: f"id: {x[0]: <10} {self.movies[x[0]]['title']: <80} counts: {x[1]}", top_k_movies))
        return response
    
    def top_rated(self, top_k=10):
        # Get the top k movies by average rating
        top_k_movies = list(self.movie_avg_ratings.items())[:top_k]
        response = list(map(lambda x: f"id: {x[0]: <10} {self.movies[x[0]]['title']: <80} avg rating: {x[1]:.2f}", top_k_movies))
        return response


In [7]:
mdb = MovieRecSys(movies, rating_train)
print('Most viewed:')
print(*mdb.most_popular(), sep='\n')
print('\nTop Rated:')
print(*mdb.top_rated(), sep='\n')

Most viewed:
id: 356        Forrest Gump (1994)                                                              counts: 276
id: 318        Shawshank Redemption, The (1994)                                                 counts: 257
id: 296        Pulp Fiction (1994)                                                              counts: 256
id: 593        Silence of the Lambs, The (1991)                                                 counts: 226
id: 2571       Matrix, The (1999)                                                               counts: 214
id: 260        Star Wars: Episode IV - A New Hope (1977)                                        counts: 194
id: 110        Braveheart (1995)                                                                counts: 189
id: 480        Jurassic Park (1993)                                                             counts: 185
id: 589        Terminator 2: Judgment Day (1991)                                                counts: 182
id: 1          

In [8]:
mdb.reviews

| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 80568 | 509 | 7347 | 3.0 | 2015-07-04T07:23:17 |
| 50582 | 326 | 71462 | 4.0 | 2011-11-25T20:18:55 |
| 8344 | 57 | 2115 | 3.0 | 2000-08-09T05:15:55 |
| 99603 | 610 | 1127 | 4.0 | 2016-11-19T08:28:22 |
| 71701 | 462 | 2409 | 2.0 | 2007-03-21T00:50:49 |
| ... | ... | ... | ... | ... |
| 6265 | 42 | 4005 | 4.0 | 2001-07-27T18:37:39 |
| 54886 | 364 | 141 | 4.0 | 1997-07-21T00:02:47 |
| 76820 | 480 | 6867 | 4.0 | 2007-05-14T17:19:31 |
| 860 | 6 | 981 | 3.0 | 1996-10-17T12:49:27 |


In [9]:
def euc_similarity(user_1_items, user_2_items):
    common_items = set(user_1_items.keys()).intersection(set(user_2_items.keys()))

    if len(common_items) == 0:
        # No common movies, return a similarity of 0
        return 0
    dist = np.sqrt(np.sum([(user_1_items[key] - user_2_items[key])**2 for key in common_items]))
    similarity = 1 / (1 + dist)
    return similarity

def pearson_similarity(user_1_items, user_2_items):
    common_items = set(user_1_items.keys()).intersection(set(user_2_items.keys()))

    if len(common_items) < 3:
        # Fewer than three common movies (including none) is not enough
        # data to calculate a correlation, return a similarity of 0
        return 0
    ratings_user1 = [user_1_items[item] for item in common_items]
    ratings_user2 = [user_2_items[item] for item in common_items]

    corr = np.corrcoef(ratings_user1, ratings_user2)[0, 1]
    # corrcoef yields NaN when one user gave the same rating to every common movie
    return 0 if np.isnan(corr) else corr

    

### Cosine Similarity

One widely used metric for measuring similarity between two vectors is **cosine similarity**. Given two vectors $ \mathbf{a} $ and $ \mathbf{b} $, the cosine similarity is computed as:

$$ \text{cosine\_similarity}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \|\mathbf{b}\|} $$

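where $ \cdot $ represents the dot product, and $ \|\mathbf{a}\| $ denotes the Euclidean norm of vector $ \mathbf{a} $. In general the value ranges from -1 (vectors pointing in opposite directions) to 1 (vectors pointing in the same direction). Because movie ratings are non-negative, the cosine similarity of two rating vectors lies between 0 and 1.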
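The formula can be sketched in code as follows. The function name `cosine_similarity` is hypothetical and is not part of the code in this post. As an assumption, it takes the same dictionary inputs (movie ID to rating) as `euc_similarity` and `pearson_similarity`, and compares only the movies both users have rated.

```python
import numpy as np

def cosine_similarity(user_1_items, user_2_items):
    # Assumption: restrict the vectors to movies rated by both users.
    common_items = sorted(set(user_1_items) & set(user_2_items))
    if not common_items:
        # No common movies, return a similarity of 0
        return 0.0
    a = np.array([user_1_items[k] for k in common_items], dtype=float)
    b = np.array([user_2_items[k] for k in common_items], dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0:
        # A zero vector has no direction, so treat the similarity as 0
        return 0.0
    return float(np.dot(a, b) / denom)

# Two users who rated movies 1 and 2 with swapped scores:
# dot product 24, norms 5 and 5, so the similarity is 24 / 25.
print(cosine_similarity({1: 3.0, 2: 4.0}, {1: 4.0, 2: 3.0}))
```

One caveat of this design: when two users share only a single movie, the two vectors are one-dimensional and the similarity is always 1, so in practice a minimum number of common movies may be worth requiring.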

### Similarity-Based Recommendation

The idea behind similarity-based recommendation is to identify users or items that are similar to a target user or item and recommend items based on the preferences of similar users. The steps involve:

1. **Representing Items and Users**: Items and users are represented as vectors, where each dimension corresponds to a feature or attribute.

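2. **Calculating Similarity**: The similarity between a target user and other users is computed using a similarity metric. The implementation below accepts any similarity function; the example run uses the Euclidean-distance-based `euc_similarity` defined above.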

3. **Ranking Recommendations**: Items that are liked by similar users but not yet interacted with by the target user are recommended. A weighted sum of ratings from similar users is used for ranking.
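The weighted sum in step 3 can be illustrated with a small worked example: each candidate movie's score is the sum of similarity times rating over the similar users who rated it, divided by the sum of those similarities. The sketch below uses made-up similarities and ratings; `weighted_scores`, `target`, and `neighbours` are hypothetical names chosen for this illustration.

```python
def weighted_scores(target_ratings, neighbours):
    """Rank unseen movies from a list of (similarity, ratings_dict) pairs."""
    totals, sim_sums = {}, {}
    for sim, ratings in neighbours:
        if sim <= 0:
            # Skip users who are not positively similar
            continue
        for movie_id, rating in ratings.items():
            if movie_id in target_ratings:
                # The target user has already rated this movie
                continue
            totals[movie_id] = totals.get(movie_id, 0.0) + sim * rating
            sim_sums[movie_id] = sim_sums.get(movie_id, 0.0) + sim
    # Normalise by the total similarity so scores stay on the rating scale
    scores = {m: totals[m] / sim_sums[m] for m in totals}
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

target = {10: 5.0}
neighbours = [
    (0.75, {10: 5.0, 20: 4.0, 30: 2.0}),  # very similar user
    (0.25, {20: 2.0, 30: 4.0}),           # less similar user
]
ranked = weighted_scores(target, neighbours)
# Movie 20: (0.75*4 + 0.25*2) / 1.0 = 3.5; movie 30: (0.75*2 + 0.25*4) / 1.0 = 2.5
print(ranked)
```

Movie 20 ranks above movie 30 because the more similar user preferred it, even though the plain averages of the two movies are equal.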


In [10]:
class SimilarityRecSys:
    def __init__(self, movies, reviews, similarity_function):
        self.movies = self._get_movies(movies.copy())
        self.reviews = self._get_reviews(reviews.copy())
        self.sim_function = similarity_function

    def _get_movies(self, movies):
        movies['genres'] = movies['genres'].apply(lambda x: set(x.split('|')))
        movies_dict = movies.set_index('movieId').to_dict(orient='index')
        return movies_dict
    
    def _get_reviews(self, reviews):
        # Build a nested dictionary: user ID -> {movie ID: rating}
        review_dict = {}
        for _, row in reviews.iterrows():
            user_id = row['userId']
            movie_id = int(row['movieId'])
            rating = row['rating']
            if user_id not in review_dict:
                review_dict[user_id] = {}
            review_dict[user_id][movie_id] = rating
        return review_dict
    
    def _sort_users(self, key_user):
        similarities = []
        for user in self.reviews:
            if user != key_user:
                similarities.append((user, self.sim_function(self.reviews[key_user], self.reviews[user])))
        return sorted(similarities, key=lambda x: x[1], reverse=True)

    def _get_rankings(self, similarities, key_user, similarity_threshold=0):
        movie_scores = {}
        sum_sim = {}
        for user, sim_score in similarities:
            if sim_score <= similarity_threshold:
                continue
            for movie_id, score in self.reviews[user].items():
                if self.reviews[key_user].get(movie_id, 0) == 0:
                    movie_scores.setdefault(movie_id, 0)
                    movie_scores[movie_id] += score*sim_score 
                    sum_sim.setdefault(movie_id, 0)
                    sum_sim[movie_id] += sim_score
        for movie_id, score in movie_scores.items():
            if not score > 0:
                print(movie_id, score)
        ranks = [(movie_id, score/sum_sim[movie_id]) for movie_id, score in movie_scores.items()]
        ranks = sorted(ranks, key=lambda x: x[1], reverse=True)
        return ranks

    def recommend_for_user(self, user, top_k=10):
        similarities = self._sort_users(user)
        movies_rankings = self._get_rankings(similarities, user)
        top_k_movies = movies_rankings[:top_k]
        response = list(map(lambda x: f"id: {x[0]: <10} {self.movies[x[0]]['title']: <80} counts: {x[1]}", top_k_movies))
        return response


In [11]:
src = SimilarityRecSys(movies, rating_train, euc_similarity)
users_test = ratings_test['userId'].unique()
src.recommend_for_user(users_test[0])

['id: 136556     Kung Fu Panda: Secrets of the Masters (2011)                                     counts: 5.000000000000001',
 'id: 6983       Jane Eyre (1944)                                                                 counts: 5.0',
 'id: 626        Thin Line Between Love and Hate, A (1996)                                        counts: 5.0',
 'id: 53         Lamerica (1994)                                                                  counts: 5.0',
 'id: 3567       Bossa Nova (2000)                                                                counts: 5.0',
 'id: 3940       Slumber Party Massacre III (1990)                                                counts: 5.0',
 'id: 3941       Sorority House Massacre (1986)                                                   counts: 5.0',
 'id: 3942       Sorority House Massacre II (1990)                                                counts: 5.0',
 'id: 3939       Slumber Party Massacre II (1987)                                         