## Recommendation Systems for Movies
- Movies listed in the Full MovieLens Dataset.Metadata on over 45,000 movies. 
- The dataset consists of movies released on or before July 2017.
- Data points include cast, crew, plot keywords, budget, revenue, posters, release dates, languages, production companies,    countries, TMDB vote counts and vote averages. 
- Dataset contain 26 million ratings from over 270,000 users. 
- Ratings are on a scale of 1-5 and have been obtained from the official GroupLens website
  In this notebook, there are the following techniques for providing movie recomendation system explored: 
- Recommendation based on top __Weighted scores for movies on platform__ - These movies can be recommended under "__Popular Picks__"
- __Content-based__ recommendation system 1. Genre based filtering: These movies can be recommended as "__Action Picks__," "__Drama Discoveries__," or "__Comedy Gems__" etc. 2. Movies recommended based on other movie features can be recommended under "__Because you watched x__"
- __Collaborative Filtering__ -based recommendation syste: Movies recommended based on collaborative filtering can be placed under "__Recommended for You__" or "__Popular among similar Users__" 


In [1]:
import pandas as pd
import os
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from ast import literal_eval
from sklearn.metrics import mean_absolute_error, mean_squared_error
from surprise import SVD, Reader, Dataset
from surprise import Dataset
from surprise import Reader
from surprise.model_selection import cross_validate, GridSearchCV
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet
import warnings; warnings.simplefilter('ignore')

In [2]:
data_path ='C:/Users/Nimish/Documents/ML Projects/Movie Recommendation - Kaggle/Data'
csv_files = os.listdir(data_path)
csv_files

['credits.csv',
 'keywords.csv',
 'links.csv',
 'movies_metadata.csv',
 'ratings.csv',
 'ratings_small.csv']

In [3]:
# Import CSV files and store them in DataFrames
for file in csv_files:
    file_path = os.path.join(data_path, file)
    df_name = file[:-4]  # Remove '.csv' from the file name
    #dfs[df_name] = pd.read_csv(file_path)
    globals()[f"{df_name}_df"] = pd.read_csv(file_path)

In [4]:
# Preprocessing Steps
movies_metadata_df['id'] = pd.to_numeric(movies_metadata_df['id'], errors='coerce')
movies_metadata_df['popularity'] = pd.to_numeric(movies_metadata_df['popularity'], errors='coerce')
movies_metadata_df = movies_metadata_df.drop([19730, 29503, 35587])
keywords_df['id'] = keywords_df['id'].astype('int')
credits_df['id'] = credits_df['id'].astype('int')
movies_metadata_df['id'] = movies_metadata_df['id'].astype('int')
#movies_metadata_df['genres'] = movies_metadata_df['genres'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

## Recommendation based on Top Movie scores computed based on votes and popularity

In [19]:
genre_series = movies_metadata_df.apply(lambda x: pd.Series(x['genres']),axis=1).stack().reset_index(level=1, drop=True)
genre_series.name = 'genre'
movies_df = movies_metadata_df.drop('genres', axis=1).join(genre_series)

In [23]:
movies_df['genre'].unique()

array(['Animation', 'Comedy', 'Family', 'Adventure', 'Fantasy', 'Romance',
       'Drama', 'Action', 'Crime', 'Thriller', 'Horror', 'History',
       'Science Fiction', 'Mystery', 'War', 'Foreign', nan, 'Music',
       'Documentary', 'Western', 'TV Movie', 'Carousel Productions',
       'Vision View Entertainment', 'Telescene Film Group Productions',
       'Aniplex', 'GoHands', 'BROSTA TV',
       'Mardock Scramble Production Committee', 'Sentai Filmworks',
       'Odyssey Media', 'Pulser Productions', 'Rogue State', 'The Cartel'],
      dtype=object)

## 1b. Recommendation based on Genre, Top Movies scored based Votes and Popularity

In [33]:
def top_n_genre_movies(gen, n=25):
    movies_gen = movies_df[movies_df['genre']==gen]
    movies_subset = movies_gen[['title','vote_average','vote_count','popularity','id','genre']]
    vote_count_threshold_gen = 500
    movies_subset = movies_subset[movies_subset['vote_count']>vote_count_threshold_gen]
    average_rating = movies_subset['vote_average'].mean()
    movies_subset['movie_score'] = (movies_subset['vote_average'] * movies_subset['vote_count'] + average_rating*vote_count_threshold_gen)/(movies_subset['vote_count']+vote_count_threshold_gen)
    top_movies = movies_subset.sort_values(by = 'movie_score',ascending = False).head(n)
    return top_movies
movies_recommendation_genre = top_n_genre_movies('Science Fiction')
print(movies_recommendation_genre[['title','movie_score','genre']])

                                       title  movie_score            genre
1154                 The Empire Strikes Back     8.070198  Science Fiction
15480                              Inception     8.045561  Science Fiction
22879                           Interstellar     8.032108  Science Fiction
256                                Star Wars     7.990979  Science Fiction
1225                      Back to the Future     7.889679  Science Fiction
23753                Guardians of the Galaxy     7.834045  Science Fiction
2458                              The Matrix     7.827607  Science Fiction
1163                      A Clockwork Orange     7.810923  Science Fiction
1167                      Return of the Jedi     7.768240  Science Fiction
1171                                   Alien     7.763062  Science Fiction
22168                                    Her     7.752926  Science Fiction
536                             Blade Runner     7.739960  Science Fiction
7208   Eternal Sunshine o

## Content based Recommendation System for Movies
- Here we will use other features related to user and movies to further personalize the recommendations. 
- An examples include using movie genre, Cast, Director to suit User preference 


In [5]:
movies_merged_df = movies_metadata_df.merge(credits_df,on='id')
movies_merged_df = movies_merged_df.merge(keywords_df,on='id')
movies_merged_df['cast'] = movies_merged_df['cast'].apply(literal_eval)
movies_merged_df['crew'] = movies_merged_df['crew'].apply(literal_eval)
movies_merged_df['keywords'] = movies_merged_df['keywords'].apply(literal_eval)

In [6]:
def director_crew(data):
    for item in data:
        if item["job"] == "Director":
            return item["name"]
    return np.nan

# Get director names
movies_merged_df["director"] = movies_merged_df["crew"].apply(director_crew)

# Extract and limit cast names to top3
movies_merged_df["cast"] = movies_merged_df["cast"].apply(lambda x: [i["name"].lower().replace(" ", "") for i in x if isinstance(x, list)][:3])

# Extract and normalize keywords
movies_merged_df["keywords"] = movies_merged_df["keywords"].apply(lambda x: [i["name"].lower() for i in x if isinstance(x, list)])

# Enumerate directors
movies_merged_df["director"] = movies_merged_df["director"].astype("str").apply(lambda x: x.lower().replace(" ", "")).apply(lambda x: [x])

# Extract genres, leave their ids, and make them as a list
movies_merged_df['genres'] = movies_merged_df['genres'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

In [7]:
def get_recommendations(title):
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:31]
    movie_indices = [i[0] for i in sim_scores]
    return titles.iloc[movie_indices]

In [120]:
movies_merged_df

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,status,tagline,title,video,vote_average,vote_count,cast,crew,keywords,director
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[Animation, Comedy, Family]",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,Released,,Toy Story,False,7.7,5415.0,"[tomhanks, timallen, donrickles]","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[jealousy, toy, boy, friendship, friends, riva...",[johnlasseter]
1,False,,65000000,"[Adventure, Fantasy, Family]",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,"[robinwilliams, jonathanhyde, kirstendunst]","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[board game, disappearance, based on children'...",[joejohnston]
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[Romance, Comedy]",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0,"[waltermatthau, jacklemmon, ann-margret]","[{'credit_id': '52fe466a9251416c75077a89', 'de...","[fishing, best friend, duringcreditsstinger, o...",[howarddeutch]
3,False,,16000000,"[Comedy, Drama, Romance]",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0,"[whitneyhouston, angelabassett, lorettadevine]","[{'credit_id': '52fe44779251416c91011acb', 'de...","[based on novel, interracial relationship, sin...",[forestwhitaker]
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,[Comedy],,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0,"[stevemartin, dianekeaton, martinshort]","[{'credit_id': '52fe44959251416c75039ed7', 'de...","[baby, midlife crisis, confidence, aging, daug...",[charlesshyer]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
46623,False,,0,"[Drama, Family]",http://www.imdb.com/title/tt6209470/,439050,tt6209470,fa,رگ خواب,Rising and falling between a man and woman.,...,Released,Rising and falling between a man and woman,Subdue,False,4.0,1.0,"[leilahatami, kouroshtahami, elhamkorda]","[{'credit_id': '5894a97d925141426c00818c', 'de...",[tragic love],[hamidnematollah]
46624,False,,0,[Drama],,111109,tt2028550,tl,Siglo ng Pagluluwal,An artist struggles to finish his work while a...,...,Released,,Century of Birthing,False,9.0,3.0,"[angelaquino, perrydizon, hazelorencio]","[{'credit_id': '52fe4af1c3a36847f81e9b15', 'de...","[artist, play, pinoy]",[lavdiaz]
46625,False,,0,"[Action, Drama, Thriller]",,67758,tt0303758,en,Betrayal,"When one of her hits goes wrong, a professiona...",...,Released,A deadly game of wits.,Betrayal,False,3.8,6.0,"[erikaeleniak, adambaldwin, juliedupage]","[{'credit_id': '52fe4776c3a368484e0c8387', 'de...",[],[markl.lester]
46626,False,,0,[],,227506,tt0008536,en,Satana likuyushchiy,"In a small town live two brothers, one a minis...",...,Released,,Satan Triumphant,False,0.0,0.0,"[iwanmosschuchin, nathalielissenko, pavelpavlov]","[{'credit_id': '533bccebc3a36844cf0011a7', 'de...",[],[yakovprotazanov]


In [11]:
movies_merged_short = movies_merged_df.head(25000)

In [12]:
s = movies_merged_short.apply(lambda x: pd.Series(x['keywords']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'keyword'
s = s.value_counts()
s = s[s > 1]
stemmer = SnowballStemmer('english')
stemmer.stem('dogs')
def filter_keywords(x):
    words = []
    for i in x:
        if i in s:
            words.append(i)
    return words
movies_merged_short['keywords'] = movies_merged_short['keywords'].apply(filter_keywords)
movies_merged_short['keywords'] = movies_merged_short['keywords'].apply(lambda x: [stemmer.stem(i) for i in x])
movies_merged_short['keywords'] = movies_merged_short['keywords'].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])
movies_merged_short['movies_meta'] = movies_merged_short['keywords'] + movies_merged_short['cast'] + movies_merged_short['director'] + movies_merged_short['genres']
movies_merged_short['movies_meta'] = movies_merged_short['movies_meta'].apply(lambda x: ' '.join(x))
count = CountVectorizer(analyzer='word',ngram_range=(1, 2),min_df=0.25, stop_words='english')
count_matrix = count.fit_transform(movies_merged_df['movies_meta'])
cosine_sim = cosine_similarity(count_matrix, count_matrix)
movies_merged_short = movies_merged_short.reset_index()
titles = movies_merged_df['title']
indices = pd.Series(movies_merged_short.index, index=movies_merged_short['title'])

MemoryError: Unable to allocate 16.2 GiB for an array with shape (46628, 46628) and data type float64

In [None]:
get_recommendations('The Dark Knight').head(10)

### 3. Movies Recommendations using Collaborative Filtering with SVD
- This model is based on historical interactions of a user and similar interactions with similar users. 
- The approach movie ratings provided by users in ratings_small dataset due to limitation in compute available. 

In [58]:
ratings_small_df1 = ratings_small_df.merge(movies_metadata_df[['id','title','vote_average','vote_count']], left_on = 'movieId', right_on = 'id')
ratings_small_df1.dropna()
movieid_list = ratings_small_df1['movieId'].unique()  # Number of unique movies in the dataset
movie_dict = movies_metadata_df.set_index('id')['title'].to_dict()

In [52]:
#SVD model training using small movie ratings data
reader = Reader(rating_scale=(0.5, 5))  
data = Dataset.load_from_df(ratings_small_df1[['userId', 'movieId', 'rating']], reader)

model = SVD()

cross_validate(model, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

trainset = data.build_full_trainset()
model.fit(trainset)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9038  0.8955  0.8870  0.9040  0.9060  0.8993  0.0071  
MAE (testset)     0.6980  0.6902  0.6869  0.6932  0.6983  0.6933  0.0044  
Fit time          0.47    0.49    0.50    0.48    0.49    0.49    0.01    
Test time         0.05    0.06    0.06    0.34    0.07    0.11    0.11    


<surprise.prediction_algorithms.matrix_factorization.SVD at 0x2bda3a0a430>

In [54]:
def get_top_n_recommendations(model, user_id, n=10):
    '''Top n movie recommendations using SVD Model trained earlier'''
    recommendations = []
    
    for movie_id in movieid_list:  # Assuming items are numbered from 1 to num_items
        prediction = model.predict(user_id, movie_id)
        recommendations.append((movie_id, prediction.est))
    
    # Sort recommendations by estimated rating
    recommendations.sort(key=lambda x: x[1], reverse=True)
    
    top_n_recommendations = recommendations[:n]
    return top_n_recommendations

In [60]:
# Recommended movies for user ID 20
user_id = 20
top_recommendations = get_top_n_recommendations(model, user_id, n=10)
print("Top Recommendations for User", user_id)
for movieId, estimated_rating in top_recommendations:
    movie_title = movie_dict[movieId]
    print("Movie ID:", movieId,"Movie Title:", movie_title, "Estimated Rating:", estimated_rating)

Top Recommendations for User 20
Movie ID: 497 Movie Title: The Green Mile Estimated Rating: 4.2359730745092525
Movie ID: 2690 Movie Title: Irma la Douce Estimated Rating: 4.030973885525562
Movie ID: 745 Movie Title: The Sixth Sense Estimated Rating: 4.030033013918051
Movie ID: 31658 Movie Title: Hour of the Gun Estimated Rating: 3.9853126545778776
Movie ID: 780 Movie Title: The Passion of Joan of Arc Estimated Rating: 3.9427401685184984
Movie ID: 6016 Movie Title: The Good Thief Estimated Rating: 3.903978025827506
Movie ID: 2186 Movie Title: Within the Woods Estimated Rating: 3.896197227940051
Movie ID: 1248 Movie Title: Hannibal Rising Estimated Rating: 3.8716323968341726
Movie ID: 1411 Movie Title: The Rapture Estimated Rating: 3.850195779939997
Movie ID: 534 Movie Title: Terminator Salvation Estimated Rating: 3.8328057360044996


In [65]:
# Recommended movies for user ID 500
user_id = 500
top_recommendations = get_top_n_recommendations(model, user_id, n=10)
print("Top Recommendations for User", user_id)
for movieId, estimated_rating in top_recommendations:
    movie_title = movie_dict[movieId]
    print("Movie ID:", movieId,"Movie Title:", movie_title, "Estimated Rating:", estimated_rating)

Top Recommendations for User 500
Movie ID: 2324 Movie Title: Local Color Estimated Rating: 4.032201646044475
Movie ID: 509 Movie Title: Notting Hill Estimated Rating: 3.9735308188924394
Movie ID: 4011 Movie Title: Beetlejuice Estimated Rating: 3.8673547109993383
Movie ID: 1280 Movie Title: 3-Iron Estimated Rating: 3.8385455670474733
Movie ID: 318 Movie Title: The Million Dollar Hotel Estimated Rating: 3.827051600964442
Movie ID: 916 Movie Title: Bullitt Estimated Rating: 3.8164827831196706
Movie ID: 3035 Movie Title: Frankenstein Estimated Rating: 3.8121605678560586
Movie ID: 866 Movie Title: Finding Neverland Estimated Rating: 3.810145469447373
Movie ID: 6016 Movie Title: The Good Thief Estimated Rating: 3.8100216776290226
Movie ID: 2359 Movie Title: Sicko Estimated Rating: 3.759862369850388
