## Recommendation Systems for Movies
- Movies listed in the Full MovieLens Dataset.Metadata on over 45,000 movies.
- The dataset consists of movies released on or before July 2017.
- Data points include cast, crew, plot keywords, budget, revenue, posters, release dates, languages, production companies,    countries, TMDB vote counts and vote averages.
- Dataset contain 26 million ratings from over 270,000 users.
- Ratings are on a scale of 1-5 and have been obtained from the official GroupLens website
  
In this notebook, there are the following techniques for providing movie recomendation system explored:
- Recommendation based on top __Weighted scores for movies on platform__ - These movies can be recommended under "__Popular Picks__"
- __Content-based__ recommendation system 1. Genre based filtering: These movies can be recommended as "__Action Picks__," "__Drama Discoveries__," or "__Comedy Gems__" etc. 2. Movies recommended based on other movie features can be recommended under "__Because you watched x__"
- __Collaborative Filtering__ -based recommendation syste: Movies recommended based on collaborative filtering can be placed under "__Recommended for You__" or "__Popular among similar Users__"


In [8]:
import pandas as pd
import os
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from ast import literal_eval
from sklearn.metrics import mean_absolute_error, mean_squared_error
from surprise import SVD, Reader, Dataset
from surprise import Dataset
from surprise import Reader
from surprise.model_selection import cross_validate, GridSearchCV
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet, stopwords
import warnings; warnings.simplefilter('ignore')

In [9]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [10]:
#data_path ='C:/Users/Nimish/Documents/ML Projects/Movie Recommendation - Kaggle/Data'
colab_data_path = '/content/drive/MyDrive/ML Projects/Movie Recommendation - Kaggle/Data'
csv_files = os.listdir(colab_data_path)
csv_files

['credits.csv',
 'movies_metadata.csv',
 'links_small.csv',
 'links.csv',
 'keywords.csv',
 'ratings.csv',
 'ratings_small.csv']

In [11]:
# Import CSV files and store them in DataFrames
for file in csv_files:
    file_path = os.path.join(colab_data_path, file)
    df_name = file[:-4]  # Remove '.csv' from the file name
    #dfs[df_name] = pd.read_csv(file_path)
    globals()[f"{df_name}_df"] = pd.read_csv(file_path,engine="python")

In [22]:
# Preprocessing Steps
movies_metadata_df['id'] = pd.to_numeric(movies_metadata_df['id'], errors='coerce')
movies_metadata_df['popularity'] = pd.to_numeric(movies_metadata_df['popularity'], errors='coerce')
movies_metadata_df = movies_metadata_df.drop([19730, 29503, 35587])
keywords_df['id'] = keywords_df['id'].astype('int')
credits_df['id'] = credits_df['id'].astype('int')
movies_metadata_df['id'] = movies_metadata_df['id'].astype('int')
movies_metadata_df['genres'] = movies_metadata_df['genres'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

## 1. Recommendation based on Top Movies on the platform
Weighted scores computed based on votes and popularity

In [23]:
genre_series = movies_metadata_df.apply(lambda x: pd.Series(x['genres']),axis=1).stack().reset_index(level=1, drop=True)
genre_series.name = 'genre'
movies_df = movies_metadata_df.drop('genres', axis=1).join(genre_series)

### 1-a Top movies on the platform based on weighted scores

In [14]:
def top_n_movies(n=25):
    movies_subset = movies_df[['title','vote_average','vote_count','popularity','id','genre']]
    vote_count_threshold = 5000
    movies_subset = movies_subset[movies_subset['vote_count']>vote_count_threshold]
    average_rating = movies_subset['vote_average'].mean()
    movies_subset['movie_score'] = (movies_subset['vote_average'] * movies_subset['vote_count'] + average_rating*vote_count_threshold_gen)/(movies_subset['vote_count']+vote_count_threshold_gen)
    top_movies = movies_subset.sort_values(by = 'movie_score',ascending = False).head(n)
    return top_movies
popular_movies = top_n_movies()
print(popular_movies[['title','movie_score']])

                                                   title  movie_score
314                             The Shawshank Redemption     8.091968
12481                                    The Dark Knight     8.042284
834                                        The Godfather     8.005579
2843                                          Fight Club     7.996791
292                                         Pulp Fiction     7.974433
15480                                          Inception     7.919109
351                                         Forrest Gump     7.899514
22879                                       Interstellar     7.886835
1154                             The Empire Strikes Back     7.840799
7000       The Lord of the Rings: The Return of the King     7.839113
18465                                   The Intouchables     7.820510
256                                            Star Wars     7.807039
4863   The Lord of the Rings: The Fellowship of the Ring     7.787612
46                  

### 1b Movie Recommendations based on Genre and weighted scores

In [25]:
def top_n_genre_movies(gen, n=25):
    movies_gen = movies_df[movies_df['genre']==gen]
    movies_subset = movies_gen[['title','vote_average','vote_count','popularity','id','genre']]
    vote_count_threshold_gen = 500
    movies_subset = movies_subset[movies_subset['vote_count']>vote_count_threshold_gen]
    print(movies_gen.shape,movies_subset.shape)
    average_rating = movies_subset['vote_average'].mean()
    movies_subset['movie_score'] = (movies_subset['vote_average'] * movies_subset['vote_count'] + average_rating*vote_count_threshold_gen)/(movies_subset['vote_count']+vote_count_threshold_gen)
    top_movies = movies_subset.sort_values(by = 'movie_score',ascending = False).head(n)
    return top_movies
movies_recommendation_genre = top_n_genre_movies('Science Fiction')
print(movies_recommendation_genre[['title','movie_score','genre']])

(3049, 24) (336, 6)
                                       title  movie_score            genre
1154                 The Empire Strikes Back     8.070198  Science Fiction
15480                              Inception     8.045561  Science Fiction
22879                           Interstellar     8.032108  Science Fiction
256                                Star Wars     7.990979  Science Fiction
1225                      Back to the Future     7.889679  Science Fiction
23753                Guardians of the Galaxy     7.834045  Science Fiction
2458                              The Matrix     7.827607  Science Fiction
1163                      A Clockwork Orange     7.810923  Science Fiction
1167                      Return of the Jedi     7.768240  Science Fiction
1171                                   Alien     7.763062  Science Fiction
22168                                    Her     7.752926  Science Fiction
536                             Blade Runner     7.739960  Science Fiction
7208 

## 2. Content based Recommendation System for Movies
- Here we will use other features related to user and movies to further personalize the recommendations.
- An examples include using movie genre, Cast, Director to suit User preference


###2a. Based on Genre, cast, director and keywords of a movie

In [34]:
def merge_dataframes(df1, df2, on='id'):
  """Merges two DataFrames based on a specific column."""
  merged_df = df1.merge(df2, on=on)
  return merged_df

def filter_dataframe(df, target_column, filter_df):
  """Filters a DataFrame based on values in another DataFrame."""
  filtered_df = df[df[target_column].isin(filter_df['movieId'])]
  return filtered_df

def literal_eval_column(df, column_name):
  """Applies literal_eval function to a specific column in a DataFrame."""
  df[column_name] = df[column_name].apply(literal_eval)
  return df

# Merge DataFrames
movies_merged_df = merge_dataframes(movies_metadata_df, credits_df)
movies_merged_df = merge_dataframes(movies_merged_df, keywords_df)

# Filter DataFrames
movies_merged_short = filter_dataframe(movies_merged_df, 'id', links_small_df)

# Apply literal_eval
movies_merged_short = literal_eval_column(movies_merged_short, 'cast')
movies_merged_short = literal_eval_column(movies_merged_short, 'crew')
movies_merged_short = literal_eval_column(movies_merged_short, 'keywords')

In [35]:
def director_crew(data):
    for item in data:
        if item["job"] == "Director":
            return item["name"]
    return np.nan

# Get director names
movies_merged_short["director"] = movies_merged_short["crew"].apply(director_crew)

# Extract and limit cast names to top3
movies_merged_short["cast"] = movies_merged_short["cast"].apply(lambda x: [i["name"].lower().replace(" ", "") for i in x if isinstance(x, list)][:3])

# Extract and normalize keywords
movies_merged_short["keywords"] = movies_merged_short["keywords"].apply(lambda x: [i["name"].lower() for i in x if isinstance(x, list)])

# Enumerate directors
movies_merged_short["director"] = movies_merged_short["director"].astype("str").apply(lambda x: x.lower().replace(" ", "")).apply(lambda x: [x]*3)

In [36]:
def get_recommendations(title):
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    filtered_scores = [arr for arr in sim_scores if arr[0] != idx]
    print(len(sim_scores), len(filtered_scores))
    filtered_scores = sorted(filtered_scores, key=lambda x: x[1], reverse=True)
    filtered_scores = filtered_scores[1:31]
    movie_indices = [i[0] for i in filtered_scores]
    return titles.iloc[movie_indices]

In [37]:
# Define stop words and stemmer
stop_words = set(stopwords.words('english'))
stemmer = SnowballStemmer('english')

def preprocess_keywords(keywords):
    preprocessed_keywords = []
    for keyword in keywords:
        # Lowercase, remove stop words, and stem
        if keyword.lower() not in stop_words:
            preprocessed_keywords.append(stemmer.stem(keyword.lower().replace(" ", "")))
    return preprocessed_keywords

# Apply preprocessing to all keywords
movies_merged_short['keywords'] = movies_merged_short['keywords'].apply(preprocess_keywords)

# Combine features and join
movies_merged_short['movies_meta'] = movies_merged_short['keywords'] + movies_merged_short['cast'] + movies_merged_short['director'] + movies_merged_short['genres']
movies_merged_short['movies_meta'] = movies_merged_short['movies_meta'].apply(lambda x: ' '.join(x))

# Create CountVectorizer with preprocessed keywords
count = CountVectorizer(analyzer='word', ngram_range=(1, 2), min_df=0.10, stop_words='english')
count_matrix = count.fit_transform(movies_merged_short['movies_meta'])
#svd = TruncatedSVD(n_components=4)
#reduced_matrix = svd.fit_transform(count_matrix)
#count_matrix = reduced_matrix

In [None]:
#reduced_matrix = reduced_matrix.astype(np.float16)
#count_matrix = reduced_matrix
#cosine_sim = cosine_similarity(count_matrix, count_matrix)

In [38]:
def check_and_convert_matrix(matrix):
  """
  Checks if the range of a sparse matrix is within (-32768, 32767) and reduces it to int16 if possible.

  Args:
    matrix: A sparse matrix.

  Returns:
    A converted sparse matrix (int16) if possible, otherwise the original matrix.
  """
  # Check if data type is already int16
  if matrix.dtype == np.int16:
    return matrix

  # Check if all values are within the range of int16
  if np.all(matrix.data >= -32768) and np.all(matrix.data <= 32767):
    # Convert data type to int16
    return matrix.astype(np.int16)
  else:
    print("Warning: Data range exceeds the range of int16. Using original data type.")
    return matrix

converted_matrix = check_and_convert_matrix(count_matrix)

In [39]:
cosine_sim = cosine_similarity(converted_matrix, converted_matrix)
movies_merged_short = movies_merged_short.reset_index()
titles = movies_merged_short['title']
indices = pd.Series(movies_merged_short.index, index=movies_merged_short['title'])

In [40]:
get_recommendations('The Godfather').head(10)

2858 2857


20                  Taxi Driver
57     The Shawshank Redemption
91                       Malice
133               Trainspotting
187            Bonnie and Clyde
219       To Kill a Mockingbird
225                  GoodFellas
230      The Godfather: Part II
250                 Stand by Me
260              Cool Hand Luke
Name: title, dtype: object

### 2b. Based on Movie overview and tagline

In [42]:
movies_merged_short['tagline'] = movies_merged_short['tagline'].fillna('')
movies_merged_short['description'] = movies_merged_short['overview'] + movies_merged_short['tagline']
movies_merged_short['description'] = movies_merged_short['description'].fillna('')
tfidf = TfidfVectorizer(analyzer='word',ngram_range=(1, 2),min_df=0.1, stop_words='english')
tfidf_matrix = tfidf.fit_transform(movies_merged_short['description'])

In [43]:
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
movies_merged_short = movies_merged_df.reset_index()
titles = movies_merged_short['title']
indices = pd.Series(movies_merged_short.index, index=movies_merged_short['title'])

In [44]:
get_recommendations('The Godfather').head(10)

2858 2857


1666            The Winter Guest
20                    Get Shorty
24             Leaving Las Vegas
29                Shanghai Triad
39      Cry, the Beloved Country
63                 Two If by Sea
64                      Bio-Dome
74                     Big Bully
119      The Boys of St. Vincent
146                      Amateur
Name: title, dtype: object

## 3. Movies Recommendations using Collaborative Filtering
- This model is based on historical interactions of a user and similar interactions with similar users.
- The approach movie ratings provided by users in ratings_small dataset due to limitation in compute available.

In [None]:
ratings_small_df1 = ratings_small_df.merge(movies_metadata_df[['id','title','vote_average','vote_count']], left_on = 'movieId', right_on = 'id')
ratings_small_df1.dropna()
movieid_list = ratings_small_df1['movieId'].unique()  # Number of unique movies in the dataset
movie_dict = movies_metadata_df.set_index('id')['title'].to_dict()

In [None]:
#SVD model training using small movie ratings data
reader = Reader(rating_scale=(0.5, 5))
data = Dataset.load_from_df(ratings_small_df1[['userId', 'movieId', 'rating']], reader)

model = SVD()

cross_validate(model, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

trainset = data.build_full_trainset()
model.fit(trainset)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9038  0.8955  0.8870  0.9040  0.9060  0.8993  0.0071  
MAE (testset)     0.6980  0.6902  0.6869  0.6932  0.6983  0.6933  0.0044  
Fit time          0.47    0.49    0.50    0.48    0.49    0.49    0.01    
Test time         0.05    0.06    0.06    0.34    0.07    0.11    0.11    


<surprise.prediction_algorithms.matrix_factorization.SVD at 0x2bda3a0a430>

In [None]:
def get_top_n_recommendations(model, user_id, n=10):
    '''Top n movie recommendations using SVD Model trained earlier'''
    recommendations = []

    for movie_id in movieid_list:  # Assuming items are numbered from 1 to num_items
        prediction = model.predict(user_id, movie_id)
        recommendations.append((movie_id, prediction.est))

    # Sort recommendations by estimated rating
    recommendations.sort(key=lambda x: x[1], reverse=True)

    top_n_recommendations = recommendations[:n]
    return top_n_recommendations

In [None]:
# Recommended movies for user ID 20
user_id = 20
top_recommendations = get_top_n_recommendations(model, user_id, n=10)
print("Top Recommendations for User", user_id)
for movieId, estimated_rating in top_recommendations:
    movie_title = movie_dict[movieId]
    print("Movie ID:", movieId,"Movie Title:", movie_title, "Estimated Rating:", estimated_rating)

Top Recommendations for User 20
Movie ID: 497 Movie Title: The Green Mile Estimated Rating: 4.2359730745092525
Movie ID: 2690 Movie Title: Irma la Douce Estimated Rating: 4.030973885525562
Movie ID: 745 Movie Title: The Sixth Sense Estimated Rating: 4.030033013918051
Movie ID: 31658 Movie Title: Hour of the Gun Estimated Rating: 3.9853126545778776
Movie ID: 780 Movie Title: The Passion of Joan of Arc Estimated Rating: 3.9427401685184984
Movie ID: 6016 Movie Title: The Good Thief Estimated Rating: 3.903978025827506
Movie ID: 2186 Movie Title: Within the Woods Estimated Rating: 3.896197227940051
Movie ID: 1248 Movie Title: Hannibal Rising Estimated Rating: 3.8716323968341726
Movie ID: 1411 Movie Title: The Rapture Estimated Rating: 3.850195779939997
Movie ID: 534 Movie Title: Terminator Salvation Estimated Rating: 3.8328057360044996


In [None]:
# Recommended movies for user ID 500
user_id = 500
top_recommendations = get_top_n_recommendations(model, user_id, n=10)
print("Top Recommendations for User", user_id)
for movieId, estimated_rating in top_recommendations:
    movie_title = movie_dict[movieId]
    print("Movie ID:", movieId,"Movie Title:", movie_title, "Estimated Rating:", estimated_rating)

Top Recommendations for User 500
Movie ID: 2324 Movie Title: Local Color Estimated Rating: 4.032201646044475
Movie ID: 509 Movie Title: Notting Hill Estimated Rating: 3.9735308188924394
Movie ID: 4011 Movie Title: Beetlejuice Estimated Rating: 3.8673547109993383
Movie ID: 1280 Movie Title: 3-Iron Estimated Rating: 3.8385455670474733
Movie ID: 318 Movie Title: The Million Dollar Hotel Estimated Rating: 3.827051600964442
Movie ID: 916 Movie Title: Bullitt Estimated Rating: 3.8164827831196706
Movie ID: 3035 Movie Title: Frankenstein Estimated Rating: 3.8121605678560586
Movie ID: 866 Movie Title: Finding Neverland Estimated Rating: 3.810145469447373
Movie ID: 6016 Movie Title: The Good Thief Estimated Rating: 3.8100216776290226
Movie ID: 2359 Movie Title: Sicko Estimated Rating: 3.759862369850388
