# Recommender Systems
## Content-Based Recommenders

The following tutorial and code are attributed to the following source:

Garodia, S. (2020). Content-based recommender systems in python. Medium. Retrieved from https://medium.com/analytics-vidhya/content-based-recommender-systems-in-python-2b330e01eb80. 

In [1]:
#importing necessary libraries
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

import sys
print("The Python version is %s.%s.%s" % sys.version_info[:3])

The Python version is 3.9.7


In [2]:
%pwd

'C:\\Users\\micha\\OneDrive\\Desktop\\Rockhurst University\\Classes\\BIA 6303 - Predictive Models\\Module7\\code'

In [3]:
cd C:\\Users\\micha\\OneDrive\\Desktop\\Rockhurst University\\Classes\\BIA 6303 - Predictive Models\\Module7\\data

C:\Users\micha\OneDrive\Desktop\Rockhurst University\Classes\BIA 6303 - Predictive Models\Module7\data


The MovieLens datasets are maintained by GroupLens, a social computing research lab at the University of Minnesota, and include several large datasets suitable for training recommender systems. We are going to look at only one file, 'movies_metadata', from the 100K movie review dataset. Much larger datasets are available at https://grouplens.org/datasets/movielens/.

In [4]:
# Load Movies Metadata
movies = pd.read_csv('movies_metadata.csv', low_memory=False) 

# Print the first three rows
movies.head(3)

| | adult | belongs_to_collection | budget | genres | homepage | id | imdb_id | original_language | original_title | overview | ... | release_date | revenue | runtime | spoken_languages | status | tagline | title | video | vote_average | vote_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | {'id': 10194, 'name': 'Toy Story Collection', ... | 30000000 | [{'id': 16, 'name': 'Animation'}, {'id': 35, '... | http://toystory.disney.com/toy-story | 862 | tt0114709 | en | Toy Story | Led by Woody, Andy's toys live happily in his ... | ... | 1995-10-30 | 373554033.0 | 81.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | | Toy Story | False | 7.7 | 5415.0 |
| 1 | False | | 65000000 | [{'id': 12, 'name': 'Adventure'}, {'id': 14, '... | | 8844 | tt0113497 | en | Jumanji | When siblings Judy and Peter discover an encha... | ... | 1995-12-15 | 262797249.0 | 104.0 | [{'iso_639_1': 'en', 'name': 'English'}, {'iso... | Released | Roll the dice and unleash the excitement! | Jumanji | False | 6.9 | 2413.0 |
| 2 | False | {'id': 119050, 'name': 'Grumpy Old Men Collect... | 0 | [{'id': 10749, 'name': 'Romance'}, {'id': 35, ... | | 15602 | tt0113228 | en | Grumpier Old Men | A family wedding reignites the ancient feud be... | ... | 1995-12-22 | 0.0 | 101.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | Still Yelling. Still Fighting. Still Ready for... | Grumpier Old Men | False | 6.5 | 92.0 |


We are going to focus on the overview column. In particular, we want to use the words in each overview as the features of that movie's 'item profile.'

In [5]:
movies.shape

(45466, 24)

In [6]:
movies['overview'][0] #overview of the first listed movie

"Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences."

First, we need to construct a TF-IDF matrix. Here's more information on TF-IDF from Wikipedia: https://en.wikipedia.org/wiki/Tf%E2%80%93idf

In [7]:
#Remove the common English 'stop words' when building the vocabulary
tfidf = TfidfVectorizer(stop_words='english')
#Replace missing overviews with an empty string so the vectorizer can process every row
movies['overview'] = movies['overview'].fillna('')
#Construct the TF-IDF matrix by applying the fit_transform method to the overview column
overview_matrix = tfidf.fit_transform(movies['overview'])
#Output the shape of overview_matrix
overview_matrix.shape

(45466, 75827)

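Notice that the matrix has 75,827 columns, one for each distinct term found in the overviews, compared with the 24 columns of the original movies_metadata file. Each row is the item profile of one movie.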
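To make the row and column structure concrete, here is a minimal sketch on a made-up three-sentence corpus. The names `toy_overviews`, `toy_tfidf`, and `toy_matrix` are hypothetical and are not part of the original tutorial; the point is only that the column count equals the size of the learned vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical three-document corpus, used only for illustration
toy_overviews = [
    "A cowboy doll is threatened by a new spaceman toy.",
    "Two siblings discover an enchanted board game.",
    "A spaceman toy and a cowboy doll become friends.",
]

toy_tfidf = TfidfVectorizer(stop_words="english")
toy_matrix = toy_tfidf.fit_transform(toy_overviews)

# One row per document, one column per distinct term kept in the vocabulary
print(toy_matrix.shape)

# The vocabulary_ attribute maps each term to its column position
print(sorted(toy_tfidf.vocabulary_))
```

The same relationship explains the shape above: 45,466 overviews produce 45,466 rows, and the 75,827 columns correspond to the terms the vectorizer kept.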

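Next, we calculate the cosine similarity between every pair of item profiles. Because TfidfVectorizer scales each row to unit Euclidean length by default, the plain dot product computed by linear_kernel equals the cosine similarity and skips the redundant normalization step.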

In [8]:
#may take a while to run
#requires 64-bit Python and plenty of RAM: the dense 45466 x 45466 float64 result is roughly 16.5 GB
similarity_matrix = linear_kernel(overview_matrix, overview_matrix)
similarity_matrix

array([[1.        , 0.01504121, 0.        , ..., 0.        , 0.00595453,
        0.        ],
       [0.01504121, 1.        , 0.04681953, ..., 0.        , 0.02198641,
        0.00929411],
       [0.        , 0.04681953, 1.        , ..., 0.        , 0.01402548,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.00595453, 0.02198641, 0.01402548, ..., 0.        , 1.        ,
        0.        ],
       [0.        , 0.00929411, 0.        , ..., 0.        , 0.        ,
        1.        ]])
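The cell above calls `linear_kernel` even though the goal is cosine similarity. The following sketch, on a made-up corpus with hypothetical variable names, checks why that works: with the default `norm='l2'` setting every non-empty TF-IDF row has unit length, so the dot product and the cosine similarity should agree.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical corpus, used only for illustration
toy_overviews = [
    "A cowboy doll is threatened by a new spaceman toy.",
    "Two siblings discover an enchanted board game.",
    "A spaceman toy and a cowboy doll become friends.",
]
toy_matrix = TfidfVectorizer(stop_words="english").fit_transform(toy_overviews)

# Euclidean length of each row (TfidfVectorizer uses norm='l2' by default)
row_norms = np.sqrt(np.asarray(toy_matrix.multiply(toy_matrix).sum(axis=1))).ravel()

dot_products = linear_kernel(toy_matrix, toy_matrix)
cosines = cosine_similarity(toy_matrix, toy_matrix)

print(np.allclose(row_norms, 1.0))          # rows are unit length
print(np.allclose(dot_products, cosines))   # dot product matches cosine similarity
```

If the vectorizer were created with `norm=None`, this equivalence would no longer hold and `cosine_similarity` would be the safer call.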

Next, we build a Series that maps each movie title to its row position in the similarity matrix, so that a movie can be looked up by title.

In [9]:
#movies index mapping
mapping = pd.Series(movies.index,index = movies['title'])
mapping

title
Toy Story                          0
Jumanji                            1
Grumpier Old Men                   2
Waiting to Exhale                  3
Father of the Bride Part II        4
                               ...  
Subdue                         45461
Century of Birthing            45462
Betrayal                       45463
Satan Triumphant               45464
Queerama                       45465
Length: 45466, dtype: int64
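One caveat with a title-based lookup is that it assumes titles are unique. The sketch below uses a made-up three-row catalogue (the names `toy_movies`, `toy_mapping`, and `unique_mapping` are hypothetical) to show what happens when a title repeats, along with one possible workaround.

```python
import pandas as pd

# Hypothetical miniature catalogue with one repeated title
toy_movies = pd.DataFrame({"title": ["Hamlet", "Jumanji", "Hamlet"]})
toy_mapping = pd.Series(toy_movies.index, index=toy_movies["title"])

# A unique title returns a single integer position
print(toy_mapping["Jumanji"])

# A repeated title returns a Series of positions rather than one integer
print(toy_mapping["Hamlet"])

# One possible workaround: keep only the first occurrence of each title
unique_mapping = toy_mapping[~toy_mapping.index.duplicated(keep="first")]
print(unique_mapping["Hamlet"])
```

Keeping the first occurrence is an arbitrary choice made for this illustration; disambiguating by release year or by the `id` column would be another option.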

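The following function returns the 14 movies with the highest cosine similarity scores for a movie that a user has already seen. The slice 1:15 skips the first entry, which is normally the input movie itself, so it keeps 14 results rather than 15.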

In [10]:
def recommend_movies_based_on_plot(movie_input):
    # Look up the row position of the input movie
    # (assumes the title is unique; a repeated title would return several positions)
    movie_index = mapping[movie_input]
    # Pair every movie's position with its similarity to the input movie
    similarity_score = list(enumerate(similarity_matrix[movie_index]))
    # Sort by similarity score in descending order
    similarity_score = sorted(similarity_score, key=lambda x: x[1], reverse=True)
    # Keep the 14 most similar movies; position 0 is skipped because it is
    # normally the input movie itself (similarity 1.0)
    similarity_score = similarity_score[1:15]
    # Return the titles at those positions
    movie_indices = [i[0] for i in similarity_score]
    return movies['title'].iloc[movie_indices]

Now we try out the content-based recommender system. Here are the top 14 movie recommendations for someone who watched "The Amazing Spider-Man".

In [11]:
recommend_movies_based_on_plot('The Amazing Spider-Man')

23187                    The Amazing Spider-Man 2
11780                                Spider-Man 3
5215                                   Spider-Man
39831    The Fjällbacka Murders: Friends for Life
8853                                  Next of Kin
42932                               Almost Angels
7909                                 Spider-Man 2
19456                    High, Wide, and Handsome
8808                                Peter-No-Tail
12540                   Forgetting Sarah Marshall
20424                         Dancing in the Rain
884                                       Charade
3368                                         Hook
20380                      The Summer I Turned 15
Name: title, dtype: object
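The full similarity matrix is expensive to hold in memory, yet each recommendation only needs one row of it. As a rough sketch of an alternative (not the approach used in the source tutorial), the hypothetical function `recommend_on_demand` below scores a single movie against all others at query time. The toy titles and overviews are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical miniature catalogue, used only for illustration
toy_movies = pd.DataFrame({
    "title": ["Toy A", "Game B", "Toy C", "Spider D"],
    "overview": [
        "A cowboy doll is threatened by a new spaceman toy.",
        "Two siblings discover an enchanted board game.",
        "A spaceman toy and a cowboy doll become friends.",
        "A teenager bitten by a spider gains strange powers.",
    ],
})
toy_matrix = TfidfVectorizer(stop_words="english").fit_transform(toy_movies["overview"])
toy_mapping = pd.Series(toy_movies.index, index=toy_movies["title"])


def recommend_on_demand(title, top_n=2):
    """Score one movie against all others without building the N x N matrix."""
    idx = toy_mapping[title]
    # Similarity of the query row to every row: shape (1, N), flattened to (N,)
    scores = linear_kernel(toy_matrix[idx:idx + 1], toy_matrix).ravel()
    # Exclude the query movie explicitly instead of assuming it sorts first
    scores[idx] = -1.0
    top = np.argsort(scores)[::-1][:top_n]
    return toy_movies["title"].iloc[top]


print(recommend_on_demand("Toy A"))
```

Excluding the query movie by position also avoids relying on it appearing first after sorting, which may not hold when another movie has an identical overview.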