# Hybrid Film Recommendation System

This project aims to build a **hybrid film recommendation system** that combines multiple approaches for suggesting movies to users. The system will leverage techniques such as *content-based filtering*, *collaborative filtering*, and *popularity-based recommendations* to generate personalized and diverse recommendations. By integrating these methods, we expect to create a more robust and accurate recommendation system that caters to users' preferences and interests while also considering popular and trending movies. The final recommendation will be an aggregation of the individual methods, ensuring a well-rounded and comprehensive set of suggestions for each user.


In [137]:
import pandas as pd
import numpy as np
import random
import os
from IPython.display import display, HTML

display(HTML("<style>.container { width:90% !important; }</style>"))

In [2]:
movie_data = pd.read_csv('data/film_data/prepared_film_data.csv')

Integrating a **popularity-based recommendation system** within a comprehensive movie recommendation engine is important for providing broadly appealing suggestions, particularly for users with limited interaction data. This approach complements personalized methods like content-based and collaborative filtering, ensuring a diverse set of recommendations that balances familiar favorites and undiscovered gems to enhance users' overall movie-watching experience.

In [138]:
# Simple popularity score combining vote count, average rating, and release year (slightly favoring recent films)

def calculate_popularity_score(df, alpha=0.8, beta=1, gamma=0.002):  # alpha, beta, gamma: weights for vote count, average rating, release year
    df = df.copy()
    df['norm_numVotes'] = df['numVotes'] / df['numVotes'].max()
    df['norm_averageRating'] = df['averageRating'] / df['averageRating'].max()
    df['norm_startYear'] = (df['startYear'] - df['startYear'].min()) / (df['startYear'].max() - df['startYear'].min())
    df['popularity_score'] = alpha *df['norm_numVotes'] + beta * df['norm_averageRating'] + gamma * df['norm_startYear']
    df = df.drop(columns=['norm_numVotes', 'norm_averageRating', 'norm_startYear'])
    
    return df
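
To make the weighting concrete, here is a small worked example on made-up numbers. The three films and their values are hypothetical; only the formula and the default weights come from `calculate_popularity_score`.

```python
import pandas as pd

# Hypothetical toy data, not taken from the real dataset
toy = pd.DataFrame({
    "primaryTitle": ["Old Classic", "Recent Hit", "Niche Gem"],
    "numVotes": [1_000_000, 500_000, 10_000],
    "averageRating": [9.0, 8.0, 9.0],
    "startYear": [1970, 2020, 1995],
})

alpha, beta, gamma = 0.8, 1, 0.002  # same defaults as calculate_popularity_score

# Votes and ratings are scaled by their maximum, the year is min-max scaled
norm_votes = toy["numVotes"] / toy["numVotes"].max()
norm_rating = toy["averageRating"] / toy["averageRating"].max()
year_range = toy["startYear"].max() - toy["startYear"].min()
norm_year = (toy["startYear"] - toy["startYear"].min()) / year_range

toy["popularity_score"] = alpha * norm_votes + beta * norm_rating + gamma * norm_year
ranked = toy.sort_values("popularity_score", ascending=False)
print(ranked[["primaryTitle", "popularity_score"]])
```

With these weights the year term contributes at most 0.002, so it acts as a tie-breaker rather than a real preference for recent films. "Niche Gem" has the same rating as "Old Classic" but ranks last because of its low vote count.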

In [139]:
movie_data_with_popularity = calculate_popularity_score(movie_data)
movie_data_with_popularity = movie_data_with_popularity.sort_values('popularity_score', ascending=False)

In [140]:
movie_data_with_popularity.head(10)

Unnamed: 0,tconst,primaryTitle,isAdult,startYear,runtimeMinutes,averageRating,numVotes,primaryName,Action,Adult,...,Romance,Sci-Fi,Short,Sport,Talk-Show,Thriller,War,Western,\N,popularity_score
22163,tt0111161,The Shawshank Redemption,0,1994.0,142.0,9.3,2733330.0,Frank Darabont,0,0,...,0,0,0,0,0,0,0,0,0,1.73155
183357,tt0468569,The Dark Knight,0,2008.0,152.0,9.0,2706445.0,Christopher Nolan,1,0,...,0,0,0,0,0,0,0,0,0,1.693899
206172,tt1375666,Inception,0,2010.0,148.0,8.8,2402346.0,Christopher Nolan,1,0,...,0,1,0,0,0,0,0,0,0,1.584925
23700,tt0137523,Fight Club,0,1999.0,139.0,8.8,2176102.0,David Fincher,0,0,...,0,0,0,0,0,0,0,0,0,1.518537
153386,tt0110912,Pulp Fiction,0,1994.0,154.0,8.9,2100399.0,Quentin Tarantino,0,0,...,0,0,0,0,0,0,0,0,0,1.506302
120287,tt0109830,Forrest Gump,0,1994.0,142.0,8.8,2126909.0,Robert Zemeckis,0,0,...,1,0,0,0,0,0,0,0,0,1.504061
152940,tt0068646,The Godfather,0,1972.0,175.0,9.2,1900604.0,Francis Ford Coppola,0,0,...,0,0,0,0,0,0,0,0,0,1.477484
180300,tt0167260,The Lord of the Rings: The Return of the King,0,2003.0,201.0,9.0,1880210.0,Peter Jackson,1,0,...,0,0,0,0,0,0,0,0,0,1.451996
270191,tt0133093,The Matrix,0,1999.0,136.0,8.7,1949993.0,Lana Wachowski,1,0,...,0,1,0,0,0,0,0,0,0,1.442358
180224,tt0120737,The Lord of the Rings: The Fellowship of the Ring,0,2001.0,178.0,8.8,1909123.0,Peter Jackson,1,0,...,0,0,0,0,0,0,0,0,0,1.440427


Now that we have a function that calculates popularity, let's build one that returns five movies for a user we know nothing about. The function optionally filters by genre and samples at random from the highest-scoring films, so the recommended movies aren't the same every time.

In [141]:
def get_pop_movies_with_random(df, num_movies=5, top_percent=0.01, min_votes=10000, random_seed=None, genre=None):
    # Expects df to be sorted by 'popularity_score' in descending order
    df = df[df['numVotes'] >= min_votes]

    if genre is not None:
        df = df[df[genre] == 1]

    # Candidate pool: the top slice of the (already sorted) frame, at least one film
    top_n = max(1, int(df.shape[0] * top_percent))
    top_movies = df.head(top_n)

    # DataFrame.sample draws from NumPy's generator, so the seed is passed as random_state
    # (random.seed would have no effect on it)
    num_movies = min(num_movies, top_movies.shape[0])
    selected_movies = top_movies.sample(num_movies, replace=False, random_state=random_seed)

    columns = ['primaryTitle', 'startYear', 'averageRating', 'numVotes']
    return selected_movies[columns]
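
The "take the top slice, then sample from it" idea can be sketched on synthetic data as follows. The frame and its scores are invented for illustration. Note that `DataFrame.sample` uses NumPy's random generator, so reproducibility is controlled through its `random_state` argument; Python's `random.seed` does not influence it.

```python
import numpy as np
import pandas as pd

# Synthetic frame: 200 hypothetical films, already sorted by a made-up score
rng = np.random.default_rng(0)
films = pd.DataFrame({
    "title": [f"film_{i}" for i in range(200)],
    "popularity_score": np.sort(rng.random(200))[::-1],
})

top_percent = 0.05
top_n = max(1, int(len(films) * top_percent))  # 10 films in the candidate pool
pool = films.head(top_n)

# The same random_state yields the same picks; omit it for a fresh draw each call
pick_a = pool.sample(3, random_state=42)
pick_b = pool.sample(3, random_state=42)
print(pick_a)
```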

In [142]:
print(get_pop_movies_with_random(movie_data_with_popularity, num_movies=5, min_votes= 1000, random_seed=None, genre='Drama'))

              primaryTitle  startYear  averageRating   numVotes
273985                Argo     2012.0            7.7   623363.0
153474  American History X     1998.0            8.5  1139907.0
282068          Your Name.     2016.0            8.4   283362.0
188822  500 Days of Summer     2009.0            7.7   528655.0
212344           The Sting     1973.0            8.3   268791.0


Incorporating a **content-based recommendation** system within a movie recommendation engine is key for delivering personalized suggestions tailored to users' unique preferences. This approach complements broader methods like popularity-based and collaborative filtering, ensuring diverse recommendations that reflect individual tastes. By considering specific movie features, content-based recommendations help users discover new, lesser-known titles that align with their interests, enhancing their overall movie-watching experience.

In [143]:
from sklearn.metrics.pairwise import cosine_similarity

In [144]:
user_watchlists = pd.read_csv('data/user_ratings_data/user_watchlists.csv')
user_watchlists.drop(columns=['Unnamed: 0'], inplace=True)
user_watchlists.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2470199 entries, 0 to 2470198
Data columns (total 3 columns):
 #   Column   Dtype 
---  ------   ----- 
 0   user_id  object
 1   imdb_id  object
 2   rating   int64 
dtypes: int64(1), object(2)
memory usage: 56.5+ MB


In [145]:
def create_user_profiles(movie_data, user_watchlists):
    genre_columns = ['Action', 'Adventure', 'Animation', 'Biography', 'Comedy', 'Crime', 'Documentary', 'Drama',
                     'Family', 'Fantasy', 'Film-Noir', 'History', 'Horror', 'Music', 'Musical', 'Mystery', 'News',
                     'Reality-TV', 'Romance', 'Sci-Fi', 'Short', 'Sport', 'Talk-Show', 'Thriller', 'War', 'Western']
    
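    # Attach genre flags to every rating; the inner join drops ratings for films missing from movie_data
    user_movie_data = user_watchlists.merge(movie_data, left_on='imdb_id', right_on='tconst', how='inner')
    
    # Scale each rating by rating / 10, so the effective weight is rating**2 / 10 and low ratings count for less
    user_movie_data['penalty_factor'] = user_movie_data['rating'] / 10
    
    user_movie_data[genre_columns] = user_movie_data[genre_columns].multiply(user_movie_data['rating'] * user_movie_data['penalty_factor'], axis="index")
    
    # Collect each row's weighted genres into one vector, then average the vectors per user
    user_movie_data['weighted_genres'] = user_movie_data[genre_columns].apply(lambda row: np.array(row), axis=1)
    user_profiles = user_movie_data.groupby('user_id')['weighted_genres'].apply(lambda x: np.mean(np.vstack(x), axis=0))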
    
    return user_profiles
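
As a rough illustration of what a profile looks like, the sketch below applies the same `rating * (rating / 10)` weighting to a tiny, invented watchlist. The three-genre list and the users are hypothetical, and the per-user average is taken with a plain `groupby(...).mean()`, which is a simplification of the function above rather than a copy of it.

```python
import pandas as pd

genres = ["Action", "Comedy", "Drama"]  # shortened, hypothetical genre list

# Hypothetical merged watchlist: one row per (user, film) with one-hot genres
rated = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "rating":  [10, 5, 8],
    "Action":  [1, 0, 0],
    "Comedy":  [0, 1, 0],
    "Drama":   [1, 1, 1],
})

# Same weighting as create_user_profiles: rating * (rating / 10)
weight = rated["rating"] * rated["rating"] / 10
weighted = rated[genres].multiply(weight, axis="index")
weighted["user_id"] = rated["user_id"]

# One averaged genre vector per user
profiles = weighted.groupby("user_id")[genres].mean()
print(profiles)
```

Because the weight is quadratic in the rating, a film rated 10 contributes four times as much as a film rated 5 (10.0 versus 2.5), which is why `u1` ends up with a much stronger Action than Comedy component.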

In [153]:
def recommend_movies_based_on_watchlist(movie_data, user_watchlists, new_user_watchlist, num_pool=50, num_recommendations=10, alpha=0.7):
    # Restrict candidates to films with enough votes, then score their popularity
    movie_data = movie_data[movie_data['numVotes'] >= 5000]
    movie_data_with_popularity = calculate_popularity_score(movie_data)
    
    # Build genre profiles for all users, including the new one
    combined_watchlist = pd.concat([user_watchlists, new_user_watchlist])
    user_id = new_user_watchlist['user_id'].iloc[0]  # assumes the watchlist belongs to a single user
    user_profiles = create_user_profiles(movie_data_with_popularity, combined_watchlist)
    
    user_profile = user_profiles.loc[user_id]
    
    genre_columns = ['Action', 'Adventure', 'Animation', 'Biography', 'Comedy', 'Crime', 'Documentary', 'Drama',
                     'Family', 'Fantasy', 'Film-Noir', 'History', 'Horror', 'Music', 'Musical', 'Mystery', 'News',
                     'Reality-TV', 'Romance', 'Sci-Fi', 'Short', 'Sport', 'Talk-Show', 'Thriller', 'War', 'Western']
    
    # Cosine similarity between the user's genre profile and every candidate film
    genre_matrix = movie_data_with_popularity[genre_columns].values
    similarity_scores = cosine_similarity([user_profile], genre_matrix)
    
    popularity_scores = movie_data_with_popularity['popularity_score']
    
    # Blend genre similarity with popularity; alpha is the weight on similarity
    weighted_similarity_scores = alpha * similarity_scores + (1 - alpha) * popularity_scores.values.reshape(1, -1)
    
    sorted_movie_indices = np.argsort(weighted_similarity_scores[0])[::-1]
    user_seen_movies = set(new_user_watchlist['imdb_id'])
    
    # Collect the best-scoring unseen films into a candidate pool
    recommended_movie_ids = []
    for movie_idx in sorted_movie_indices:
        movie_id = movie_data_with_popularity.iloc[movie_idx]['tconst']
        if movie_id not in user_seen_movies:
            recommended_movie_ids.append(movie_id)
        if len(recommended_movie_ids) >= num_pool:
            break
    
    # Sample from the pool; never ask for more films than the pool holds
    num_recommendations = min(num_recommendations, len(recommended_movie_ids))
    final_recommendations = random.sample(recommended_movie_ids, num_recommendations)
    columns = ['primaryTitle', 'startYear', 'averageRating', 'numVotes']
    
    return movie_data_with_popularity.loc[movie_data_with_popularity['tconst'].isin(final_recommendations)][columns]
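
One detail of the blend is worth a second look: cosine similarity on non-negative genre vectors lies in [0, 1], while `popularity_score` with the default weights can reach about 1.8 (`alpha + beta + gamma` of the scoring function), so the two terms are not on the same scale. The sketch below shows one possible variant that min-max scales popularity before mixing. The helper name `blend_scores` and the sample numbers are assumptions for illustration and are not part of the function above.

```python
import numpy as np

def blend_scores(similarity, popularity, alpha=0.7):
    """Hypothetical helper: min-max scale popularity, then mix it with similarity."""
    similarity = np.asarray(similarity, dtype=float)
    popularity = np.asarray(popularity, dtype=float)
    span = popularity.max() - popularity.min()
    # Guard against a constant popularity vector (span == 0)
    scaled = (popularity - popularity.min()) / span if span > 0 else np.zeros_like(popularity)
    return alpha * similarity + (1 - alpha) * scaled

sim = np.array([0.9, 0.2, 0.6])  # made-up cosine similarities
pop = np.array([0.7, 1.8, 1.2])  # made-up popularity scores
blended = blend_scores(sim, pop)
ranking = np.argsort(blended)[::-1]  # best candidate first
print(blended, ranking)
```

After scaling, `alpha` can be read directly as the share of the final score that comes from genre similarity.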

In [155]:
manto_watchlistas = pd.read_csv('data/user_ratings_data/test_ratings.csv')
karolinos_watchlistas = pd.read_csv('data/user_ratings_data/test_ratings2.csv')

In [156]:
print(recommend_movies_based_on_watchlist(movie_data, user_watchlists, manto_watchlistas, num_recommendations=10))

                          primaryTitle  startYear  averageRating  numVotes
179699   Ben-Hur: A Tale of the Christ     1925.0            7.8    7766.0
179952             The Hidden Fortress     1958.0            8.1   40306.0
180336  Crouching Tiger, Hidden Dragon     2000.0            7.9  275029.0
180725                         Kantara     2022.0            8.3   94177.0
193055                   Almost Famous     2000.0            7.9  282745.0
215937              Amar Akbar Anthony     1977.0            7.4    7921.0
216347               Riders of Justice     2020.0            7.5   54384.0
216452                         Dookudu     2011.0            7.4   13728.0
216599                              24     2016.0            7.8   22990.0
216636                   Minnal Murali     2021.0            7.8   30907.0


In [157]:
print(recommend_movies_based_on_watchlist(movie_data, user_watchlists, karolinos_watchlistas, num_recommendations=10))

                                         primaryTitle  startYear  \
38666                               The Kashmir Files     2022.0   
143786                            The Marathon Family     1982.0   
143848                                     Balkan Spy     1984.0   
145587                                    Kibar Feyzo     1978.0   
157650                             The Usual Suspects     1995.0   
180300  The Lord of the Rings: The Return of the King     2003.0   
180301          The Lord of the Rings: The Two Towers     2002.0   
180747                                   The Revenant     2015.0   
182874                              Kill Bill: Vol. 1     2003.0   
271392                                   Interstellar     2014.0   

        averageRating   numVotes  
38666             8.7   564031.0  
143786            8.8    16342.0  
143848            8.8    11902.0  
145587            8.7    16730.0  
157650            8.5  1101180.0  
180300            9.0  1880210.0  
180301

Now that we have a system for content-based recommendations, let's move on to collaborative filtering.
**Collaborative filtering** is a technique used in recommendation systems that identifies similar user preferences and generates recommendations based on those patterns. **Item-based collaborative filtering** measures how similar two items are from the ratings users have given them, not from the items' own attributes: two films are considered similar when the same users tend to rate them alike. This makes it a complementary approach to content-based filtering, which relies on movie features such as genre. Combining these methods can provide more diverse and personalized recommendations, enhancing the overall user experience.
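
A minimal sketch of the item-based idea on an invented ratings matrix is shown below. The users, films, and ratings are hypothetical, and the cosine similarity is written out with NumPy so the arithmetic is visible; it computes the same quantity as scikit-learn's `cosine_similarity` applied to the transposed matrix.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings: rows are users, columns are films, 0 means "not rated"
ratings = pd.DataFrame(
    [[9, 8, 0],
     [8, 9, 0],
     [0, 0, 7],
     [2, 0, 9]],
    index=["u1", "u2", "u3", "u4"],
    columns=["film_a", "film_b", "film_c"],
)

# Each film is represented by its column of user ratings
item_vectors = ratings.to_numpy(dtype=float).T
norms = np.linalg.norm(item_vectors, axis=1, keepdims=True)
item_sim = (item_vectors @ item_vectors.T) / (norms @ norms.T)

sim_df = pd.DataFrame(item_sim, index=ratings.columns, columns=ratings.columns)
print(sim_df.round(2))
```

Here `film_a` and `film_b` come out as highly similar because `u1` and `u2` rated both highly, while `film_b` and `film_c` share no raters and get a similarity of zero. No genre information is involved at any point.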

In [169]:
def create_user_movie_matrix(ratings_data, min_votes=10):
    # Keep only films with at least min_votes ratings; otherwise the matrix has too many columns
    popular_movies = ratings_data.groupby('imdb_id').filter(lambda x: len(x) >= min_votes)
    
    # Pivot to a user x film matrix; unrated cells become 0, and uint8 (instead of float64) saves space and time
    return popular_movies.pivot_table(index='user_id', columns='imdb_id', values='rating').fillna(0).astype('uint8')
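
Since almost every cell of this matrix is zero, a sparse representation is a possible alternative to the dense pivot. The sketch below builds a SciPy CSR matrix from long-format ratings; the tiny frame is invented, and the approach assumes each (user, film) pair appears at most once, because duplicate entries would be summed rather than averaged as `pivot_table` does.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical long-format ratings, same columns as user_watchlists
ratings = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u3"],
    "imdb_id": ["tt1", "tt2", "tt1", "tt3"],
    "rating":  [8, 6, 9, 7],
})

# Categorical codes give each user and film a stable integer position
users = ratings["user_id"].astype("category")
films = ratings["imdb_id"].astype("category")

# Only the observed ratings are stored; everything else is an implicit zero
matrix = csr_matrix(
    (ratings["rating"].to_numpy(),
     (users.cat.codes.to_numpy(), films.cat.codes.to_numpy())),
    shape=(users.cat.categories.size, films.cat.categories.size),
)
film_ids = list(films.cat.categories)  # column position -> imdb_id
print(matrix.shape, matrix.nnz)
```

scikit-learn's `cosine_similarity` accepts sparse input, so `matrix.T` could be passed to it directly, with `film_ids` used to map row positions back to `imdb_id` values.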

In [160]:
def compute_movie_similarity(user_movie_matrix):
    user_movie_matrix_filled = user_movie_matrix.fillna(0)   # cosine similarity cannot handle missing values
    return cosine_similarity(user_movie_matrix_filled.T)     # transpose so that each row is a film

In [175]:
def get_top_similar_movies(movie_id, user_movie_matrix, similarity_matrix, n=10):
    movie_idx = user_movie_matrix.columns.get_loc(movie_id)
    similar_movie_indices = np.argsort(similarity_matrix[movie_idx])[::-1][1:n+1]
    similar_movie_ids = [user_movie_matrix.columns[i] for i in similar_movie_indices]
    return similar_movie_ids
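
The slice `[1:n+1]` relies on the query film ranking first against itself. That usually holds, but when another film has an identical rating column the tie can put the duplicate first, and the query film then appears in its own results. A possible variant that excludes the query by index is sketched below; the helper name `top_similar_indices` and the sample row are hypothetical.

```python
import numpy as np

def top_similar_indices(similarity_row, query_idx, n=10):
    """Hypothetical helper: indices of the n most similar items, excluding the query."""
    scores = np.asarray(similarity_row, dtype=float).copy()
    scores[query_idx] = -np.inf  # push the query itself to the very end
    order = np.argsort(scores)[::-1]
    # Drop the query explicitly so it cannot appear even when n exceeds the item count
    return [int(i) for i in order if i != query_idx][:n]

# Made-up similarity row for item 2; item 3 is a perfect duplicate (similarity 1.0)
row = [0.3, 0.8, 1.0, 1.0, 0.1]
result = top_similar_indices(row, query_idx=2, n=3)
print(result)
```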

In [171]:
test = create_user_movie_matrix(user_watchlists)
test

user_id,tt0000001,tt0000003,tt0000005,tt0000008,tt0000010,tt0000012,tt0000013,tt0000014,tt0000016,tt0000023,...,tt9898858,tt9899086,tt9899090,tt9899922,tt9900092,tt9900782,tt9902160,tt9906260,tt9907782,tt9908860
ur0001220,0,0,0,0,0,8,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ur0002746,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ur0011762,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ur0019730,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ur0033913,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ur9972457,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ur99771122,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ur99782462,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ur9983981,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [173]:
test2 = compute_movie_similarity(test)

In [177]:
test3

['tt0870984',
 'tt0290673',
 'tt3774694',
 'tt1937390',
 'tt1974419',
 'tt8359848',
 'tt2382009',
 'tt1527186',
 'tt10944760',
 'tt1588170']

In [191]:
test3 = get_top_similar_movies('tt1191111', test, test2)
recommended_movies = movie_data[movie_data['tconst'].isin(test3)]
recommended_movies

Unnamed: 0,tconst,primaryTitle,isAdult,startYear,runtimeMinutes,averageRating,numVotes,primaryName,Action,Adult,...,Sport,Talk-Show,Thriller,War,Western,\N,norm_numVotes,norm_averageRating,norm_startYear,popularity_score
47587,tt1937390,Nymphomaniac: Vol. I,0,2013.0,117.0,6.9,127211.0,Lars von Trier,0,0,...,0,0,0,0,0,0,0.046541,0.69,0.922481,0.729077
50069,tt2382009,Nymphomaniac: Vol. II,0,2013.0,124.0,6.6,94923.0,Lars von Trier,0,0,...,0,0,0,0,0,0,0.034728,0.66,0.922481,0.689627
125682,tt3774694,Love,0,2015.0,135.0,6.1,61999.0,Gaspar Noé,0,0,...,0,0,0,0,0,0,0.022683,0.61,0.937984,0.630022
157821,tt0290673,Irreversible,0,2002.0,97.0,7.3,140084.0,Gaspar Noé,0,0,...,0,0,0,0,0,0,0.05125,0.73,0.837209,0.772675
183810,tt1588170,I Saw the Devil,0,2010.0,144.0,7.8,136271.0,Jee-woon Kim,1,0,...,0,0,0,0,0,0,0.049855,0.78,0.899225,0.821683
193670,tt1527186,Melancholia,0,2011.0,135.0,7.1,187541.0,Lars von Trier,0,0,...,0,0,0,0,0,0,0.068613,0.71,0.906977,0.766704
201532,tt10944760,Titane,0,2021.0,108.0,6.5,49580.0,Julia Ducournau,0,0,...,0,0,0,0,0,0,0.018139,0.65,0.984496,0.66648
246425,tt0870984,Antichrist,0,2009.0,108.0,6.5,130495.0,Lars von Trier,0,0,...,0,0,1,0,0,0,0.047742,0.65,0.891473,0.689977
248489,tt1974419,The Neon Demon,0,2016.0,117.0,6.1,99262.0,Nicolas Winding Refn,0,0,...,0,0,1,0,0,0,0.036315,0.61,0.945736,0.640944
264649,tt8359848,Climax,0,2018.0,97.0,6.9,72846.0,Gaspar Noé,0,0,...,0,0,0,0,0,0,0.026651,0.69,0.96124,0.713243


