# Recommender Engineering

In this notebook, I will leverage the filtered and pre-processed data to generate two movie recommender systems: content-based and collaborative-filtering. I will then evaluate the recommendations and choose one for production.

![contentbasedfiltering_vs_collaborativefiltering](/contentbasedfiltering_vs_collaborativefiltering.png)

The main difference between the two filtering methods is how the movies are vectorized. For collaborative filtering, each movie is represented by the ratings that users have given it. For content-based filtering, the features of a movie (i.e. its genres and overview) are used as its vector representation. In both cases, the cosine similarity between the vectors is calculated to recommend the most similar movies.

In [1]:
from scipy.sparse import coo_matrix

In [2]:
!pip install psycopg2-binary



In [3]:
import pandas as pd
import numpy as np
import psycopg2 as pg2
from psycopg2.extras import RealDictCursor, Json
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel
from sklearn.metrics import pairwise_distances
import seaborn as sns
from sys import getsizeof

sns.set()

import matplotlib.pyplot as plt
%matplotlib inline

## Content Based Filtering/Recommendation

In [4]:
merged = pd.read_csv('../assets/merged_processed.csv', lineterminator = '\n', index_col = 0)

In [5]:
merged.head()

Unnamed: 0,movie_id,tmdb_id,release_date,overview,title,genres,overview_tokenized,overview_features,genres_split,genres_features,all_features
0,1,862,1995-10-30,"Led by Woody, Andy's toys live happily in his ...",Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,"['Led', 'by', 'Woody', 'Andy', 's', 'toys', 'l...",Led by Woody Andy s toy live happily in his ro...,"['Adventure', 'Animation', 'Children', 'Comedy...",Adventure Animation Children Comedy Fantasy,Adventure Animation Children Comedy Fantasy Le...
1,2,8844,1995-12-15,When siblings Judy and Peter discover an encha...,Jumanji (1995),Adventure|Children|Fantasy,"['When', 'siblings', 'Judy', 'and', 'Peter', '...",When sibling Judy and Peter discover an enchan...,"['Adventure', 'Children', 'Fantasy']",Adventure Children Fantasy,Adventure Children Fantasy When sibling Judy a...
2,3,15602,1995-12-22,A family wedding reignites the ancient feud be...,Grumpier Old Men (1995),Comedy|Romance,"['A', 'family', 'wedding', 'reignites', 'the',...",A family wedding reignites the ancient feud be...,"['Comedy', 'Romance']",Comedy Romance,Comedy Romance A family wedding reignites the ...
3,129,110972,1996-02-09,Pie in the Sky is a 1996 American romantic com...,Pie in the Sky (1996),Comedy|Romance,"['Pie', 'in', 'the', 'Sky', 'is', 'a', '1996',...",Pie in the Sky is a 1996 American romantic com...,"['Comedy', 'Romance']",Comedy Romance,Comedy Romance Pie in the Sky is a 1996 Americ...
4,654,278978,1996-02-29,,And Nobody Weeps for Me (Und keiner weint mir ...,Drama|Romance,[],,"['Drama', 'Romance']",Drama Romance,Drama Romance


In [6]:
# assign the results back: replace() returns a new Series and does not modify in place
merged['overview'] = merged['overview'].fillna('').replace('No overview found.', '')
merged['overview_features'] = merged['overview_features'].fillna('')

I instantiate a tf-idf vectorizer with the built-in English stop-word list to filter out common words such as 'but', 'again' and 'with'.

In [7]:
tfidf = TfidfVectorizer(min_df = 1, stop_words = 'english')  # recent scikit-learn versions require an integer min_df >= 1

In [8]:
merged.isnull().sum()

movie_id               0
tmdb_id                0
release_date          12
overview               0
title                  0
genres                 0
overview_tokenized     0
overview_features      0
genres_split           0
genres_features        0
all_features           0
dtype: int64

In [9]:
tfidf_matrix = tfidf.fit_transform(merged['all_features'])

From the tf-idf matrix of the features, which combine the genres and the movie overview, I create a cosine similarity matrix. Because tf-idf rows are L2-normalized by default, this amounts to taking the dot product of the matrix with its own transpose.

In [10]:
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
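As a minimal sketch of why the cosine similarity of a tf-idf matrix is just a dot product, the cell below checks it on three made-up feature strings (they are illustrative, not rows from `merged`). It relies on `TfidfVectorizer` using `norm='l2'` by default.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Three made-up feature strings in the style of `all_features`.
docs = [
    'Adventure Children Fantasy siblings discover enchanted board game',
    'Adventure Animation Children toys live happily room',
    'Horror Thriller boy placed foster care',
]

demo_matrix = TfidfVectorizer(stop_words='english').fit_transform(docs)

# TfidfVectorizer L2-normalizes each row by default (norm='l2'),
# so every row vector has length 1.
row_norms = np.sqrt(demo_matrix.multiply(demo_matrix).sum(axis=1))
print(np.allclose(row_norms, 1.0))

# With unit-length rows, the plain dot product (linear_kernel)
# gives the same values as cosine_similarity.
print(np.allclose(cosine_similarity(demo_matrix), linear_kernel(demo_matrix)))
```

On a large matrix, `linear_kernel` skips the normalization step that `cosine_similarity` repeats, which may save some time.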

The code below that generates recommendations is borrowed from a DataCamp tutorial:
https://www.datacamp.com/community/tutorials/recommender-systems-python

In [11]:
indices = pd.Series(merged.index, index = merged['title']).drop_duplicates()

In [12]:
def content_recommendations(title, cosine_sim=cosine_sim):
    index = indices[title]
    sim_scores = list(enumerate(cosine_sim[index]))
    sorted_sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    top15_sorted_sim_scores = sorted_sim_scores[1:16]
    movie_indices = [i[0] for i in top15_sorted_sim_scores]
    return merged['title'].iloc[movie_indices]
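The ranking step inside `content_recommendations` can also be sketched with numpy on a single row of similarity scores. The scores below are made up, and dropping the query by its index (rather than assuming it sorts first) is my own variation, not part of the tutorial code.

```python
import numpy as np

# Made-up similarity scores of one query movie against five movies;
# position 2 is the query itself (similarity 1.0).
sim_row = np.array([0.10, 0.45, 1.00, 0.30, 0.05])
query_index = 2

# argsort on the negated scores gives positions from most to least similar.
ranked = np.argsort(-sim_row)

# Drop the query explicitly instead of assuming it is ranked first,
# since a duplicate entry could tie with it at similarity 1.0.
top_k = [int(i) for i in ranked if i != query_index][:3]
print(top_k)  # [1, 3, 0]
```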

In [13]:
content_recommendations('Toy Story (1995)')

15302                    Toy Story 3 (2010)
3060                     Toy Story 2 (1999)
24003     Toy Story Toons: Small Fry (2011)
10359        40-Year-Old Virgin, The (2005)
23383    Andy Hardy's Blonde Trouble (1944)
1936                  Child's Play 2 (1990)
1098           Rebel Without a Cause (1955)
1986                       Condorman (1981)
11446         For Your Consideration (2006)
21081       Andy Hardy's Double Life (1942)
3120                 Man on the Moon (1999)
1937                  Child's Play 3 (1991)
23110     Andy Hardy Meets Debutante (1940)
6520          What's Up, Tiger Lily? (1966)
506                           Malice (1993)
Name: title, dtype: object

In [14]:
content_recommendations('Batman & Robin (1997)')

9212                   Batman & Mr. Freeze: Subzero (1998)
21120               Batman: Mystery of the Batwoman (2003)
605                                          Batman (1989)
160                                  Batman Forever (1995)
14903                               Leaves of Grass (2009)
15464                    Batman: Under the Red Hood (2010)
16736                               For Love of Ivy (1968)
20925    Batman Unmasked: The Psychology of the Dark Kn...
1362                                 Batman Returns (1992)
7799                        Day After Tomorrow, The (2004)
18124                        Dark Knight Rises, The (2012)
10877                       Ice Age 2: The Meltdown (2006)
21702                               Super Cops, The (1974)
18015                              Batman and Robin (1949)
9292             Batman Beyond: Return of the Joker (2000)
Name: title, dtype: object

In [15]:
content_recommendations('Jumanji (1995)')

19603                    Indie Game: The Movie (2012)
21335                              Table No.21 (2013)
15373                      Slender Thread, The (1965)
17131                          Dark Angel, The (1935)
6254                                 Brainscan (1994)
13706                               Rhinoceros (1974)
8881                                   Quintet (1979)
9566                                 Word Wars (2004)
7828                       Last of Sheila, The (1973)
23564                      Beast Must Die, The (1974)
13599             Mindscape of Alan Moore, The (2003)
22547                               Love Games (2012)
6141                          Poolhall Junkies (2002)
8164                           Masks (Masques) (1987)
19537    Nightmaster (Watch the Shadows Dance) (1987)
Name: title, dtype: object

### Evaluation of Content-based filtering method

Some of the recommendations are questionable.
- **Toy Story (1995)**: While the first three recommendations are valid (users who enjoyed Toy Story are likely to enjoy its sequels and spin-offs), subsequent recommendations such as 40-Year-Old Virgin and Child's Play 2 are unexpected. They most likely appear because of how content-based filtering works behind the scenes: the movies are vectorized from their genres and overview, so two movies whose overviews share distinctive words end up with vectors close to each other.
- **Batman & Robin (1997)**: The same pattern as in the Toy Story example, but with the pitfall of the content-based method amplified. More than half of the recommendations are Batman-related, most likely because their overviews include words such as 'batman' and 'gotham'.
- **Jumanji (1995)**: Arguably the best set of recommendations of the three. Jumanji's overview does not contain a specific entity name that recurs in other movies' overviews, so the genre features may carry more weight in the recommendations.

Examine an output of the recommender:

In [16]:
print(merged[merged['title'] == '40-Year-Old Virgin, The (2005)']['all_features'].values)

[ 'Comedy Romance Andy Stitzer ha a pleasant life with a nice apartment and a job stamping invoice at an electronics store But at age 40 there s one thing Andy hasn t done and it s really bothering his sex obsessed male co worker Andy is still a virgin Determined to help Andy get laid the guy make it their mission to de virginize him But it all seems hopeless until Andy meet small business owner Trish a single mom']


In [17]:
print(merged[merged['title'].str.contains('Play 2')]['all_features'].values)

[ 'Horror Thriller When Andy s mother is admitted to a psychiatric hospital the young boy is placed in foster care and Chucky determined to claim Andy s soul is not far behind']


In [18]:
print(merged[merged['title'] == 'Toy Story (1995)']['all_features'].values)

[ 'Adventure Animation Children Comedy Fantasy Led by Woody Andy s toy live happily in his room until Andy s birthday brings Buzz Lightyear onto the scene Afraid of losing his place in Andy s heart Woody plot against Buzz But when circumstance separate Buzz and Woody from their owner the duo eventually learns to put aside their difference']


For example, comparing the features of 'Toy Story (1995)', '40-Year-Old Virgin, The (2005)' and 'Child's Play 2 (1990)', the likely reason the latter two were recommended is that the name 'Andy' appears repeatedly in all three overviews.

**Overall** <br>
Going forward, as briefly mentioned in the Jumanji evaluation and further examined in the evaluation of the Toy Story recommendations, removing entity names such as 'Andy' or names of places is likely to improve our recommendations. By doing so, the recommendations can truly be based on the verbs or other diction within the overview, which also leads me to believe that semantic analysis of the overviews may improve the recommender. 
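To illustrate the idea of removing entity names, here is a rough sketch using shortened versions of the two overviews printed above. The `STOP_WORDS` and `CHARACTER_NAMES` sets and the `content_tokens` helper are hypothetical and hand-made for this example; a real pipeline would more likely use a named-entity recognizer and a full stop-word list.

```python
# Overviews shortened from the `all_features` values printed above.
toy_story = 'Led by Woody Andy s toy live happily in his room until Andy s birthday'
childs_play_2 = ('When Andy s mother is admitted to a psychiatric hospital '
                 'the young boy is placed in foster care')

# Hypothetical hand-made lists, for illustration only.
STOP_WORDS = {'by', 's', 'in', 'his', 'until', 'when', 'is', 'to', 'a', 'the'}
CHARACTER_NAMES = {'andy', 'woody', 'buzz', 'chucky'}

def content_tokens(text, drop_names=False):
    """Lowercase, split on whitespace, and drop stop words (and optionally names)."""
    tokens = {t.lower() for t in text.split()} - STOP_WORDS
    return tokens - CHARACTER_NAMES if drop_names else tokens

# Words the two overviews have in common, with and without character names.
shared_with_names = content_tokens(toy_story) & content_tokens(childs_play_2)
shared_without_names = (content_tokens(toy_story, drop_names=True)
                        & content_tokens(childs_play_2, drop_names=True))

print(shared_with_names)     # {'andy'}
print(shared_without_names)  # set()
```

In this small sample, 'Andy' is the only content word the two overviews share, so once names are removed the overviews no longer overlap at all.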

## Collaborative Filtering/Recommendation

For computational efficiency, I stored my data in PostgreSQL on the server side, which makes the data easier to query and manipulate.

In [19]:
%run '../sql.py'

In [20]:
def con_cur_to_db(dbname=DBNAME):
    ''' 
    Returns both a connection and a cursor object for your database
    '''
    con = pg2.connect(host=IP_ADDRESS,
                  dbname=dbname,
                  user=USER,
                  password=PASSWORD)
    cur = con.cursor()
    return con, cur

def con_cur_to_db_dict(dbname=DBNAME):
    con = pg2.connect(host=IP_ADDRESS,
                  dbname=dbname,
                  user=USER,
                  password=PASSWORD)
    cur = con.cursor(cursor_factory=RealDictCursor)
    return con, cur
    
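def execute_query(query, dbname=DBNAME):
    '''
    Executes a query directly to a database, without having to create a cursor and connection each time. 
    '''
    con, cur = con_cur_to_db(dbname)
    try:
        cur.execute(query)
        data = cur.fetchall()
    finally:
        con.close()  # close the connection even if the query fails
    return data
    
def execute_query_dict(query, dbname=DBNAME):
    '''
    Same as execute_query, but each row is returned as a dictionary keyed by column name.
    '''
    con, cur = con_cur_to_db_dict(dbname)
    try:
        cur.execute(query)
        data = cur.fetchall()
    finally:
        con.close()  # close the connection even if the query fails
    return data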

I select the scaled ratings from the scaled table, together with the serial movie ids and user ids, and store the result in the variable `response`.

In [21]:
query = """
SELECT ROUND(a.scaled_rating::numeric, 7), b.s_movie_id, c.s_user_id
FROM scaled a
INNER JOIN serial_movie_id b
ON a.movie_id = b.movie_id
INNER JOIN serial_user_id c
ON a.user_id = c.user_id;
"""

In [22]:
response = execute_query(query)

In [23]:
response[:5]

[(Decimal('0.6499937'), 4125, 52292),
 (Decimal('-2.0933574'), 4168, 52292),
 (Decimal('-0.5017699'), 4170, 52292),
 (Decimal('-0.2067129'), 4230, 52292),
 (Decimal('0.4276300'), 4238, 52292)]

I unzip the response into three tuples holding the ratings, movie ids and user ids, in that order.

In [24]:
ratings, movie_ids, user_ids = [*zip(*response)]
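As a small sketch of what the unzip does, the cell below applies `zip(*...)` to three made-up rows shaped like the query result. The `_demo` names are only for illustration.

```python
# Toy rows shaped like the query result: (rating, s_movie_id, s_user_id).
# The values are made up for illustration.
rows = [(0.65, 4125, 52292), (-2.09, 4168, 52292), (0.43, 7, 11)]

# zip(*rows) transposes the list of row tuples into one tuple per column.
ratings_demo, movie_ids_demo, user_ids_demo = zip(*rows)

print(ratings_demo)    # (0.65, -2.09, 0.43)
print(movie_ids_demo)  # (4125, 4168, 7)
print(user_ids_demo)   # (52292, 52292, 11)
```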

For computational efficiency, I will use sparse matrices for the calculations. With the ratings, movie ids and user ids separated, I am ready to build a SciPy sparse matrix in which each entry is one user's rating of one movie.

In [25]:
sparse_mat = coo_matrix((ratings, (movie_ids, user_ids)), dtype='float32')

In [26]:
sparse_mat.shape

(15452, 138494)

The sparse matrix has 15,452 rows and 138,494 columns, with each row representing a movie and each column a user. In other words, each movie is now represented by a vector of 138,494 user ratings, or 'features', most of which are empty. I will later compute the cosine similarity between these movie vectors.
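To make the construction concrete, here is a minimal sketch with made-up (rating, movie id, user id) triples. It assumes ids start at 1, as the serial id tables appear to, which would be consistent with the matrix having one more row than there are titles.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical triples; ids start at 1, as in the serial id tables.
ratings_demo = [0.5, -1.0, 2.0]
movie_ids_demo = [1, 1, 2]
user_ids_demo = [1, 3, 3]

# coo_matrix((data, (row, col))) places data[k] at position (row[k], col[k]).
# Without an explicit shape, it is inferred as (max row id + 1, max col id + 1).
demo = coo_matrix((ratings_demo, (movie_ids_demo, user_ids_demo)), dtype='float32')

print(demo.shape)  # (3, 4)
print(demo.toarray())
# Row 0 and column 0 stay empty because no id equals 0.
```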

### Create serial_movie_id_title table 

First, I created a table in PostgreSQL that maps each serial_movie_id (the row and column positions of my cosine distance dataframe) to the title of the movie, so that I can relabel the rows and columns of the cosine distance dataframe with titles. This table is used heavily by the recommendation functions below.

In [27]:
query = """
SELECT * FROM serial_movie_id_title
"""

In [28]:
response = execute_query(query)

In [29]:
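serial_movie_id_title = pd.DataFrame(response)

serial_movie_id_title.columns = ['s_movie_id', 'title']

# s_movie_id stays a regular column: the functions below look it up by name
# (the earlier set_index call was never assigned, so it had no effect)

serial_movie_id_title.head()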


Unnamed: 0,s_movie_id,title
0,1,Toy Story (1995)
1,2,Jumanji (1995)
2,3,Grumpier Old Men (1995)
3,4,Waiting to Exhale (1995)
4,5,Father of the Bride Part II (1995)


In [30]:
serial_movie_id_title.shape

(15451, 2)

### Cosine similarity summarized

$$
\cos(\theta) = \frac{A \cdot B}{\left\| A\right\| \left\| B\right\| } = \frac{A \cdot B}{\sqrt{\sum{A_i^2}} \cdot \sqrt{\sum{B_i^2}}}
$$

Essentially, we take the dot product of two vectors and divide it by the product of their magnitudes, where the magnitude is the length of a vector. The result is the cosine of the angle between the two vectors (a.k.a. the cosine similarity), with -1 meaning the vectors point in opposite directions and 1 meaning they point in exactly the same direction.
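The formula can be checked by hand on a few small vectors. The sketch below is only an illustration; `cosine_sim_by_hand` is a hypothetical helper, not something used elsewhere in this notebook.

```python
import numpy as np

def cosine_sim_by_hand(a, b):
    """Dot product divided by the product of the two vector lengths."""
    return np.dot(a, b) / (np.sqrt(np.sum(a ** 2)) * np.sqrt(np.sum(b ** 2)))

a = np.array([1.0, 2.0, 0.0])

same_direction = cosine_sim_by_hand(a, 2 * a)                 # about 1
opposite_direction = cosine_sim_by_hand(a, -a)                # about -1
orthogonal = cosine_sim_by_hand(a, np.array([0.0, 0.0, 3.0])) # 0

print(same_direction, opposite_direction, orthogonal)
```

Note that scaling a vector (here `2 * a`) does not change the similarity; only the direction matters.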

In [31]:
cosine_matrix = cosine_similarity(sparse_mat, sparse_mat)

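The diagonal entries are 1 up to float32 rounding error (for example 1.00023127), since each movie is identical to itself; row 0 and column 0 are all zeros because no movie has a serial id of 0. Note that the off-diagonal scores are not confined to the range 0 to 1: the ratings were scaled before being loaded and can be negative (see the sample of `response` above), so the cosine similarity between two movies can also be negative, as with the value -0.1644212 in the output below.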

In [32]:
cosine_matrix

array([[ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  1.00023127,  0.13220523, ...,  0.00168974,
        -0.00227867, -0.00126228],
       [ 0.        ,  0.13220523,  0.99999362, ..., -0.00137833,
         0.00293332, -0.00112325],
       ..., 
       [ 0.        ,  0.00168974, -0.00137833, ...,  1.        ,
        -0.1644212 , -0.02935707],
       [ 0.        , -0.00227867,  0.00293332, ..., -0.1644212 ,
         0.99999988,  0.        ],
       [ 0.        , -0.00126228, -0.00112325, ..., -0.02935707,
         0.        ,  0.99999994]], dtype=float32)

By subtracting the cosine similarity scores from 1, I turn each similarity into a distance: the smaller the distance, the more closely the two vectors point in the same direction. Cosine similarity ranges from -1 to +1, so the distance ranges from 0 (cosine similarity of 1) to 2 (cosine similarity of -1). We want to recommend the movies with the smallest distances to the queried movie.

In [33]:
cosine_distance = 1 - cosine_matrix
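To make the mapping from similarity to distance concrete, here is a small sketch with made-up rating vectors. It shows that non-negative vectors give distances between 0 and 1, while vectors with negative entries (for example mean-centered ratings) can give distances above 1. The centering used here is only an illustration and is not necessarily how the `scaled` table was produced.

```python
import numpy as np

def cosine_distance_demo(a, b):
    """1 - cosine similarity for two 1-D vectors."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Made-up raw ratings on a 1-5 scale from three users for two movies.
raw_a = np.array([5.0, 1.0, 3.0])
raw_b = np.array([1.0, 5.0, 3.0])

# Non-negative vectors: the similarity is >= 0, so the distance cannot exceed 1.
d_raw = cosine_distance_demo(raw_a, raw_b)

# After subtracting each vector's mean, the two movies point in opposite
# directions, so the similarity is -1 and the distance reaches 2.
d_centered = cosine_distance_demo(raw_a - raw_a.mean(), raw_b - raw_b.mean())

print(d_raw, d_centered)
```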

In [34]:
cosine_distance

array([[  1.00000000e+00,   1.00000000e+00,   1.00000000e+00, ...,
          1.00000000e+00,   1.00000000e+00,   1.00000000e+00],
       [  1.00000000e+00,  -2.31266022e-04,   8.67794752e-01, ...,
          9.98310268e-01,   1.00227869e+00,   1.00126231e+00],
       [  1.00000000e+00,   8.67794752e-01,   6.37769699e-06, ...,
          1.00137830e+00,   9.97066677e-01,   1.00112319e+00],
       ..., 
       [  1.00000000e+00,   9.98310268e-01,   1.00137830e+00, ...,
          0.00000000e+00,   1.16442120e+00,   1.02935708e+00],
       [  1.00000000e+00,   1.00227869e+00,   9.97066677e-01, ...,
          1.16442120e+00,   1.19209290e-07,   1.00000000e+00],
       [  1.00000000e+00,   1.00126231e+00,   1.00112319e+00, ...,
          1.02935708e+00,   1.00000000e+00,   5.96046448e-08]], dtype=float32)

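Distances greater than 1 (for example 1.16442120) correspond to negative cosine similarities, which arise because the scaled ratings can be negative. None of the values shown comes close to the maximum distance of 2, which would require two movies with exactly opposite rating vectors.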

Convert the cosine distances into a dataframe

In [36]:
cosine_distance_df = pd.DataFrame(cosine_distance)

In [37]:
cosine_distance_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,15442,15443,15444,15445,15446,15447,15448,15449,15450,15451
0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
1,1.0,-0.000231,0.867795,0.942891,0.978433,0.932063,0.954288,0.937579,0.978023,0.984004,...,0.995371,0.998733,0.994089,0.996774,0.995118,0.993352,1.000163,0.99831,1.002279,1.001262
2,1.0,0.867795,6e-06,0.91259,0.947597,0.884898,0.964926,0.916033,0.920896,0.94124,...,1.005929,1.001054,0.998055,1.009926,0.995807,0.995381,1.000714,1.001378,0.997067,1.001123
3,1.0,0.942891,0.91259,-4.3e-05,0.950798,0.789566,0.952719,0.878553,0.932447,0.919004,...,1.007036,1.000609,0.992483,1.003399,0.99739,0.981669,0.999555,0.999758,1.000546,1.000818
4,1.0,0.978433,0.947597,0.950798,7e-06,0.932865,0.977778,0.94466,0.921253,0.96378,...,1.012262,1.000916,1.000612,1.0,1.00042,1.001828,1.0,1.000229,1.0,1.0


I drop row 0 and column 0 because the serial ids queried from PostgreSQL start at 1, so position 0 does not correspond to any movie.

In [38]:
cosine_distance_df.drop(0, axis = 0, inplace = True)

In [39]:
cosine_distance_df.drop(0, axis = 1, inplace = True)

I overwrite the columns and rows with the titles from serial_movie_id_title

In [40]:
cosine_distance_df.columns = [serial_movie_id_title['title']]

In [41]:
cosine_distance_df.index = [serial_movie_id_title['title']]

In [42]:
cosine_distance_df.head()

title,Toy Story (1995),Jumanji (1995),Grumpier Old Men (1995),Waiting to Exhale (1995),Father of the Bride Part II (1995),Heat (1995),Sabrina (1995),Tom and Huck (1995),Sudden Death (1995),GoldenEye (1995),...,Monty Python: Almost the Truth - Lawyers Cut (2009),Wild Card (2015),Hot Tub Time Machine 2 (2015),The Coven (2015),Focus (2015),The Second Best Exotic Marigold Hotel (2015),Run All Night (2015),Cinderella (2015),Frozen Fever (2015),Insurgent (2015)
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Toy Story (1995),-0.000231,0.867795,0.942891,0.978433,0.932063,0.954288,0.937579,0.978023,0.984004,0.909073,...,0.995371,0.998733,0.994089,0.996774,0.995118,0.993352,1.000163,0.99831,1.002279,1.001262
Jumanji (1995),0.867795,6e-06,0.91259,0.947597,0.884898,0.964926,0.916033,0.920896,0.94124,0.865359,...,1.005929,1.001054,0.998055,1.009926,0.995807,0.995381,1.000714,1.001378,0.997067,1.001123
Grumpier Old Men (1995),0.942891,0.91259,-4.3e-05,0.950798,0.789566,0.952719,0.878553,0.932447,0.919004,0.939663,...,1.007036,1.000609,0.992483,1.003399,0.99739,0.981669,0.999555,0.999758,1.000546,1.000818
Waiting to Exhale (1995),0.978433,0.947597,0.950798,7e-06,0.932865,0.977778,0.94466,0.921253,0.96378,0.974755,...,1.012262,1.000916,1.000612,1.0,1.00042,1.001828,1.0,1.000229,1.0,1.0
Father of the Bride Part II (1995),0.932063,0.884898,0.789566,0.932865,-4.3e-05,0.961784,0.839269,0.924316,0.921195,0.931484,...,1.007164,0.999551,1.000183,0.997643,0.99649,0.997482,1.000502,0.997634,0.99753,0.999121


In [43]:
cosine_distance_df.shape

(15451, 15451)

With the cosine_distance_df and serial_movie_id_title dataframes, I am able to make recommendations with a function.

## Generate recommender function

In [44]:
def get_similar_movies(query_movie_id, n = 16):
    '''
    Look up the movie at position query_movie_id in serial_movie_id_title and return the
    titles of the n - 1 movies (15 by default) closest to it in cosine_distance_df.
    This function is called from begin_user_search(), defined below.
    '''
    
    scores = cosine_distance_df[serial_movie_id_title.iloc[query_movie_id]['title']]
    sorted_scores = scores.iloc[:, 0].sort_values()
    top_n_titles = sorted_scores.head(n)
    titles = top_n_titles.reset_index()['title']
    return titles.iloc[1:, :] # skip the first row: the closest match is the queried movie itself
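The lookup can be illustrated on a tiny, made-up distance table. This sketch uses a plain single-level title index and a hypothetical `nearest_titles` helper, so it is a simplified variant of `get_similar_movies` rather than the same code.

```python
import pandas as pd

# Made-up symmetric distance table with a plain (single-level) title index.
titles = ['Movie A', 'Movie B', 'Movie C', 'Movie D']
demo_distances = pd.DataFrame(
    [[0.0, 0.2, 0.9, 0.5],
     [0.2, 0.0, 0.7, 0.4],
     [0.9, 0.7, 0.0, 0.3],
     [0.5, 0.4, 0.3, 0.0]],
    index=titles, columns=titles)

def nearest_titles(title, n=2):
    """Return the n titles with the smallest distance to `title`, excluding itself."""
    # Drop the query by label rather than by position, so the result
    # does not depend on the query sorting first.
    return demo_distances[title].drop(title).nsmallest(n).index.tolist()

print(nearest_titles('Movie A'))  # ['Movie B', 'Movie D']
```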

In [45]:
serial_movie_id_title.head()

Unnamed: 0,s_movie_id,title
0,1,Toy Story (1995)
1,2,Jumanji (1995)
2,3,Grumpier Old Men (1995)
3,4,Waiting to Exhale (1995)
4,5,Father of the Bride Part II (1995)


In [46]:
def begin_user_search():   
    '''
    Prompt for a search term, look it up in the serial_movie_id_title table, and print the
    15 most similar movies (by default) via the get_similar_movies function defined above.
    '''
    
    search_term = input("Please enter a search term: ")

    # regex=False treats the term literally, so terms containing '(' or '?' do not break the search
    search_mask = serial_movie_id_title.title.str.contains(search_term, regex=False)
    if search_mask.sum() == 0: # the search term does not match any part of any title
        print('Invalid Search Term') 
        return None # nothing to recommend if the movie could not be found
    elif search_mask.sum() == 1: # exactly one title matches the search term
        smid = serial_movie_id_title[search_mask]['s_movie_id'].iloc[0]
    else: # a broad term matches more than one title, so ask the user to pick one
        print('Please choose intended movie:')
        matches = serial_movie_id_title[search_mask][['title', 's_movie_id']].values
        smid_dict = {}
        for i, (title, match_smid) in enumerate(matches): 
            print(f'{i}, {title}')
            smid_dict[i] = match_smid
        refined_selection = input('Please enter the number of the correct title: ')
        smid = smid_dict[int(refined_selection)]

    # s_movie_id starts at 1, while get_similar_movies expects the 0-based dataframe position
    query_movie_id = int(smid) - 1
    
    print (get_similar_movies(query_movie_id))
    

## Demo of recommender:

In [47]:
begin_user_search()

Please enter a search term:  Toy Story


Please choose intended movie:
0, Toy Story (1995)
1, Toy Story 2 (1999)
2, Toy Story 3 (2010)
3, Toy Story of Terror (2013)
4, Toy Story Toons: Hawaiian Vacation (2011)
5, Toy Story Toons: Small Fry (2011)
6, Toy Story That Time Forgot (2014)


Please enter the number of the correct title:  1


                                        title
1                            Toy Story (1995)
2                        Bug's Life, A (1998)
3                       Monsters, Inc. (2001)
4                         Finding Nemo (2003)
5   Time Regained (Temps retrouvé, Le) (1999)
6                       Voices of Iraq (2004)
7                           Angel Eyes (2001)
8                   Lady and the Tramp (1955)
9             Who Framed Roger Rabbit? (1988)
10                                Babe (1995)
11                      Lion King, The (1994)
12                             Aladdin (1992)
13                             Shrek 2 (2004)
14                Beauty and the Beast (1991)
15                                 Big (1988)


In [48]:
begin_user_search()

Please enter a search term:  Hunting


Please choose intended movie:
0, Good Will Hunting (1997)
1, Hunting of the President, The (2004)
2, Hunting and Gathering (Ensemble, c'est tout) (2007)
3, Hunting Party, The (2007)
4, Screamers: The Hunting (2009)


Please enter the number of the correct title:  0


                               title
1                    Rain Man (1988)
2          Dead Poets Society (1989)
3          As Good as It Gets (1997)
4           Beautiful Mind, A (2001)
5        Jane Austen's Mafia! (1998)
6               Jerry Maguire (1996)
7             Mortal Thoughts (1991)
8                Forrest Gump (1994)
9   Shawshank Redemption, The (1994)
10                 Braveheart (1995)
11                  Gladiator (2000)
12           Sixth Sense, The (1999)
13         Christmas Carol, A (1938)
14            Green Mile, The (1999)
15             Ocean's Eleven (2001)


In [49]:
begin_user_search()

Please enter a search term:  Shining


Please choose intended movie:
0, Shining, The (1980)
1, Shining Through (1992)


Please enter the number of the correct title:  0


                                        title
1                               Psycho (1960)
2                  Clockwork Orange, A (1971)
3                      Poltergeist III (1988)
4                    Full Metal Jacket (1987)
5                       Apocalypse Now (1979)
6                                Alien (1979)
7                               Carrie (1976)
8                2001: A Space Odyssey (1968)
9                     Paris Is Burning (1990)
10                                Jaws (1975)
11           Silence of the Lambs, The (1991)
12                         Poltergeist (1982)
13                         Taxi Driver (1976)
14  Henry: Portrait of a Serial Killer (1986)
15                           Omen, The (1976)


## Evaluation of collaborative filtering recommender system

For the most part, the recommender system seems to provide legitimate movies that one may enjoy given a previously enjoyed movie. 

- **Toy Story**:
For example, when a user searches for Toy Story and chooses the intended movie from the list (in this case Toy Story 2), the system generates recommendations such as Toy Story (the first one), A Bug's Life, Monsters, Inc. and more. Most of the recommended movies are animated family adventures, although a few, such as Voices of Iraq and Angel Eyes, look out of place.

- **Good Will Hunting**:
Another classic movie. The returned recommendations are Rain Man, Dead Poets Society, As Good as It Gets, A Beautiful Mind and more. Most of the recommended movies are dramas that are highly rated and well-liked by the general public.

- **The Shining**:
To cover more of the genre spectrum, I also tried a thriller/horror movie, The Shining, starring Jack Nicholson. The top recommended movies are Psycho, A Clockwork Orange, Poltergeist III, Full Metal Jacket and more. Many of the movies are crime/thriller or horror; therefore, I am confident that users who enjoyed watching The Shining will enjoy the recommended movies.

**Overall**: <br> The recommender system tends to recommend movies that are of similar genres. While this may be accurate in that a user that enjoyed a thriller/crime movie is likely to find another thriller/crime enjoyable, it would be interesting and beneficial if the system can recommend across genres more frequently. <br> <br>

## Conclusion and Going Forward

Comparing the two types of recommender system, the collaborative filtering method seems to provide a more robust set of movie recommendations. The content-based filtering method faces three potential shortcomings. First, it usually does not recommend movies of different genres, because movies with overlapping genres have similar vectors. Second, named entities play a large role in which movies are recommended, as is evident in the Toy Story example. Third, the content-based method likely does not address the personal tastes of users. This is why I explored the collaborative filtering method, which represents movies as vectors of user ratings.

The results from the three movies in the collaborative filtering method lead me to believe that a user is likely to enjoy using this recommender system more than the content-based one. If I were to push one recommender to production, I would proceed with the collaborative filtering method.