# Movie Recommendation System: Content-Based Model

**Background**:

An accurate, personalized recommendation system can improve business and sales and build customer satisfaction.


**Types of Recommendation Engines**:
1. **Popularity Model (see "popularity_movie_recommendation" notebook)**
2. **Recommendation algorithms**
    - **Content-based filtering (see "content_based_movie_recommendation" notebook)**
        - Idea: if you like an item then you will also like a similar item
        - Based on similarity of the items being recommended
        - Works well when it is easy to determine the context/properties of each item, e.g. song or movie recommendation
        - A user profile is generated from data the user provides, either explicitly (ratings) or implicitly (clicking on a link), and is used to make suggestions
        - The more input users provide, the more accurate the recommender
        - Term Frequency (TF): frequency of a word in a document
        - Inverse Document Frequency (IDF): inverse of the document frequency across the whole corpus of documents
        - TF-IDF weighting dampens the effect of high-frequency words when determining the importance of an item (document)
        - A vector space model determines which items are closer to each other by computing proximity from the angle between their vectors
        - Each item is stored as a vector of its attributes in an n-dimensional space, and the angles between the vectors are calculated to determine their similarity
        - User profile vectors are built the same way, from the user's actions on the attributes of previously seen items
        - The similarity between an item and a user is determined in the same way
        - Cosine is used because its value increases as the angle decreases, which signifies more similarity
        - Vectors are length-normalized to unit length, and the cosine is then computed as the sum-product (dot product) of the vectors
        - Advantages:
            - No need for data on other users
            - Can recommend to users with unique tastes
            - Can recommend new and unpopular items
            - Can provide explanations for recommended items by listing the content features that caused an item to be recommended (e.g. movie genres)
        - Disadvantages:
            - Hard to find appropriate features
            - Does not recommend items outside a user's content profile
            - Unable to exploit quality judgments of other users
    - **Memory-based collaborative filtering (see "memory_based_collaborative_filtering_movie_recommendation" notebook)**
        1. User-user collaborative filtering
        2. Item-item collaborative filtering
    - **Model-based collaborative filtering**
        1. **Matrix Factorization**
            1. **Singular Value Decomposition (see "matrix_factorization_svd_movie_recommendation" notebook)**
            2. Probabilistic Matrix Factorization
            3. Non-negative Matrix Factorization
        2. **Deep learning/neural network (see "deep_learning_movie_recommendation" notebook)**
                
3. **Using a classifier to make recommendation**
    - Classifiers are parametric solutions that require some parameters (features) of the user and item to be defined first
    - Pros:
        - Incorporates personalization
        - Works even if the user's past history is short or not available
    - Cons:
        - Features might not be available or sufficient to create a good classifier
        - Making a good classifier becomes exponentially more difficult as the number of users and items grows
        
(https://medium.com/@james_aka_yale/the-4-recommendation-engines-that-can-predict-your-movie-tastes-bbec857b8223)
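
The TF-IDF, cosine similarity, and user profile ideas listed under content-based filtering can be sketched in a few lines of plain Python. This is a minimal illustration, not the notebook's implementation: the three toy movies, the two ratings, and the smoothed IDF formula are all assumptions chosen for the example.

```python
import math
from collections import Counter

# Toy "documents": each movie is described by its genre tokens (hypothetical data).
docs = {
    "Movie A": ["animation", "comedy"],
    "Movie B": ["animation", "comedy", "musical"],
    "Movie C": ["horror", "thriller"],
}

n_docs = len(docs)
vocab = sorted({tok for toks in docs.values() for tok in toks})
# Document frequency: number of documents that contain each term.
df = Counter(tok for toks in docs.values() for tok in set(toks))


def tfidf_vector(tokens):
    """TF-IDF vector over `vocab`, length-normalized to unit L2 norm."""
    tf = Counter(tokens)
    # Smoothed IDF (one common variant, assumed here): log((1 + N) / (1 + df)) + 1
    vec = [tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in vocab]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def cosine(u, v):
    """For unit-length vectors, cosine similarity is the sum-product."""
    return sum(a * b for a, b in zip(u, v))


item_vecs = {name: tfidf_vector(toks) for name, toks in docs.items()}

# Item-item similarity: shared genres give a positive score, no overlap gives zero.
print(round(cosine(item_vecs["Movie A"], item_vecs["Movie B"]), 3))
print(round(cosine(item_vecs["Movie A"], item_vecs["Movie C"]), 3))

# User profile: rating-weighted sum of the rated items' vectors, re-normalized.
user_ratings = {"Movie A": 5, "Movie C": 1}  # hypothetical ratings
profile = [
    sum(r * item_vecs[m][i] for m, r in user_ratings.items())
    for i in range(len(vocab))
]
p_norm = math.sqrt(sum(p * p for p in profile))
profile = [p / p_norm for p in profile]

# Score an unrated item against the profile exactly like an item-item comparison.
user_item_score = cosine(profile, item_vecs["Movie B"])
print(round(user_item_score, 3))
```

Because every vector has length 1, no division by norms is needed inside `cosine`; that is the same shortcut the notebook relies on later when it swaps a cosine computation for a plain dot product.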

**Data**:

We will use the MovieLens dataset, collected from the MovieLens online movie recommender service over several periods of time.
Users were selected at random to be included in the data.

The data includes:
- 100K ratings (1-5) from 1000 users on 1700 movies
- Each user has rated 20+ movies
- Simple demographic information for the users, such as gender, age, occupation, zip, etc.
- Genre information of movies

(https://grouplens.org/datasets/movielens/10m/)

In [10]:
import pandas as pd
import numpy as np
import scipy as sc
import pickle

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('fivethirtyeight')

import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)

# Data

## Users Data

In [18]:
users = pd.read_pickle('users.pickle')
users.head()

Unnamed: 0,user_id,age,sex,occupation,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213


## Ratings Data

In [17]:
ratings = pd.read_pickle('ratings.pickle')
ratings.head()

Unnamed: 0,user_id,movie_id,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


## Movies Data

In [16]:
movies = pd.read_pickle('movies.pickle')
movies.head()

Unnamed: 0,movie_id,movie_title,release_date,video_release_date,imdb_url,unknown,action,adventure,animation,childrens,comedy,crime,documentary,drama,fantasy,film_noir,horror,musical,mystery,romance,sci_fi,thriller,war,western,genres
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,animation|childrens|comedy
1,2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,action|adventure|thriller
2,3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,thriller
3,4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,action|comedy|drama
4,5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,crime|drama|thriller
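
The `genres` column in this table looks like the names of the genre flag columns that are set to 1, joined with `|` in column order. The pickle's preprocessing is not shown, so the following is only a sketch of how such a column could be derived, assuming that reading is right; the `toy` frame and the shortened `genre_cols` list are illustrative.

```python
import pandas as pd

# Two toy rows with a subset of the genre flag columns (values copied from the table above).
toy = pd.DataFrame(
    {
        "movie_title": ["Toy Story (1995)", "GoldenEye (1995)"],
        "action": [0, 1],
        "adventure": [0, 1],
        "animation": [1, 0],
        "childrens": [1, 0],
        "comedy": [1, 0],
        "thriller": [0, 1],
    }
)

genre_cols = ["action", "adventure", "animation", "childrens", "comedy", "thriller"]

# Join the names of the columns whose flag is 1, in column order, with "|".
toy["genres"] = toy[genre_cols].apply(
    lambda row: "|".join(col for col in genre_cols if row[col] == 1), axis=1
)
print(toy[["movie_title", "genres"]])
```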


# Content-Based Movie Recommendation Model

(https://medium.com/@james_aka_yale/the-4-recommendation-engines-that-can-predict-your-movie-tastes-bbec857b8223)

**We will use the `TfidfVectorizer` class from scikit-learn to transform the genre text into feature vectors that can be used as input to an estimator.**

In [14]:
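from sklearn.feature_extraction.text import TfidfVectorizer
# Unigrams and bigrams of genre names; min_df=1 keeps every term
# (recent scikit-learn releases reject the integer min_df=0)
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')
tfidf_matrix = tf.fit_transform(movies['genres'])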
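
To see what the vectorizer does with pipe-delimited genre strings, here is a small sketch on four sample strings (three copied from the movies table, plus a hypothetical `horror|sci_fi` row). It assumes the default token pattern, which treats `|` as a separator and keeps underscores inside a token.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

sample_genres = [
    "animation|childrens|comedy",
    "action|adventure|thriller",
    "thriller",
    "horror|sci_fi",  # hypothetical row, to show an underscore genre name
]

vec = TfidfVectorizer(analyzer="word", ngram_range=(1, 2), stop_words="english")
X = vec.fit_transform(sample_genres)

# Each genre becomes a token; ngram_range=(1, 2) also adds adjacent genre
# pairs such as "animation childrens" as separate features.
print(sorted(vec.vocabulary_))

# With the default norm="l2", every row is scaled to unit Euclidean length.
row_norms = np.asarray(np.sqrt(X.multiply(X).sum(axis=1))).ravel()
print(row_norms)
```

The bigram features mean that two movies sharing a pair of adjacent genres score higher than two movies sharing the same genres in a different order, which may or may not be desirable for genre data.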

**We will use cosine similarity as the numeric measure of how similar two movies are. Because `TfidfVectorizer` L2-normalizes each row by default, the plain dot product computed by `linear_kernel` already equals the cosine similarity, and it is faster than `cosine_similarity` because it skips the redundant normalization step.**

In [15]:
from sklearn.metrics.pairwise import linear_kernel
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
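
The equivalence between `linear_kernel` and `cosine_similarity` on TF-IDF output can be checked directly. This sketch uses three made-up genre strings and a default `TfidfVectorizer`; it is a sanity check under those assumptions, not a benchmark.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

sample_genres = ["animation|childrens|comedy", "animation|comedy", "horror|thriller"]
X = TfidfVectorizer().fit_transform(sample_genres)

dot = linear_kernel(X, X)      # plain dot products between rows
cos = cosine_similarity(X, X)  # normalizes rows first, then takes dot products

# Rows are already unit length, so both matrices should agree.
print(np.allclose(dot, cos))
print(np.round(dot, 3))
```

If the vectorizer were created with `norm=None`, the rows would no longer be unit length and the two functions would give different results, so the shortcut depends on the default normalization.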

**With the pairwise cosine similarity matrix for all movies in hand, we return the 20 movies most similar to a given title, ranked by cosine similarity score.**

In [19]:
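titles = movies['movie_title']
# Map each title to its row index in `movies`
indices = pd.Series(movies.index, index=movies['movie_title'])

def genre_rec(title):
    """Return the 20 titles whose genre vectors are most similar to `title`."""
    idx = indices[title]
    # Pair every movie index with its similarity to the query, best first
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Skip position 0, assumed to be the query movie itself. Movies with an identical
    # genre vector tie at 1.0 and keep index order, so the query can still appear.
    sim_scores = sim_scores[1:21]
    movie_indices = [i[0] for i in sim_scores]
    return titles.iloc[movie_indices]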

In [20]:
genre_rec('Toy Story (1995)')

421                Aladdin and the King of Thieves (1996)
1218                                Goofy Movie, A (1995)
101                                Aristocats, The (1970)
403                                      Pinocchio (1940)
624                        Sword in the Stone, The (1963)
945                         Fox and the Hound, The (1981)
968           Winnie the Pooh and the Blustery Day (1968)
1065                                         Balto (1995)
1077                              Oliver & Company (1988)
1408                            Swan Princess, The (1994)
1411    Land Before Time III: The Time of the Great Gi...
1469                              Gumby: The Movie (1995)
94                                         Aladdin (1992)
62                               Santa Clause, The (1994)
93                                      Home Alone (1990)
137                           D3: The Mighty Ducks (1996)
138                                  Love Bug, The (1969)
224           

In [22]:
genre_rec('Body Snatcher, The (1945)')

218                    Nightmare on Elm Street, A (1984)
350                              Prophecy II, The (1998)
378    Tales From the Crypt Presents: Demon Knight (1...
412    Tales from the Crypt Presents: Bordello of Blo...
423           Children of the Corn: The Gathering (1996)
435               American Werewolf in London, An (1981)
436              Amityville 1992: It's About Time (1992)
437                                Amityville 3-D (1983)
438                  Amityville: A New Generation (1993)
439                 Amityville II: The Possession (1982)
440                        Amityville Horror, The (1979)
441                         Amityville Curse, The (1990)
442                                    Birds, The (1963)
444                            Body Snatcher, The (1945)
445                               Burnt Offerings (1976)
446                                        Carrie (1976)
447                                     Omen, The (1976)
550                            
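
The output above lists 'Body Snatcher, The (1945)' (index 444) among its own recommendations. A likely cause, assuming several films share exactly the same genre string, is that they all tie at a similarity of 1.0, so dropping only the first sorted entry does not necessarily drop the query. The sketch below reproduces the effect on a toy similarity matrix and shows one possible variant that filters the query out by index; `top_similar` and the toy data are hypothetical names for illustration.

```python
# Toy similarity matrix: movies 0, 1 and 2 have identical genres (score 1.0 with each other).
toy_titles = ["Horror A", "Horror B", "Horror C", "Comedy D"]
toy_sim = [
    [1.0, 1.0, 1.0, 0.0],
    [1.0, 1.0, 1.0, 0.0],
    [1.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]


def top_similar_by_slice(idx, sim, n=20):
    """Slicing approach: sort everything, then drop whatever sorted first."""
    scores = sorted(enumerate(sim[idx]), key=lambda pair: pair[1], reverse=True)
    return [i for i, _ in scores[1 : n + 1]]


def top_similar(idx, sim, n=20):
    """Variant: remove the query by index before sorting, so it can never be returned."""
    scores = [(i, s) for i, s in enumerate(sim[idx]) if i != idx]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return [i for i, _ in scores[:n]]


query = 2  # "Horror C"
print([toy_titles[i] for i in top_similar_by_slice(query, toy_sim, n=2)])
print([toy_titles[i] for i in top_similar(query, toy_sim, n=2)])
```

Python's sort is stable, so tied scores stay in index order; that is why the slicing version drops "Horror A" rather than the query in this example.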

**We have no quantitative measure of the content-based recommender's performance here, but judging by inspection the genre recommendations look reasonable: the recommended items fall in the same genres as the movie the recommendations are based on (animation, children's, and comedy for Toy Story; horror for The Body Snatcher).**
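
The "same genres" judgment can be made a little more concrete with a rough overlap score. The sketch below averages the Jaccard overlap between the query's genre set and each recommendation's genre set. It is only a proxy for genre agreement, not a measure of recommendation quality, and the helper names and the example recommendation list are assumptions for illustration.

```python
def genre_set(genres):
    """Split a pipe-delimited genre string into a set of genre names."""
    return set(genres.split("|"))


def jaccard(a, b):
    """Jaccard overlap of two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_genre_overlap(query_genres, recommended_genres):
    """Average Jaccard overlap between the query and each recommendation."""
    q = genre_set(query_genres)
    overlaps = [jaccard(q, genre_set(g)) for g in recommended_genres]
    return sum(overlaps) / len(overlaps)


# Genre strings in the format of the movies table; this pairing is illustrative only.
query = "animation|childrens|comedy"
recs = ["animation|childrens|comedy", "action|comedy|drama", "thriller"]
score = mean_genre_overlap(query, recs)
print(round(score, 3))
```

A score near 1 means the recommendations share almost all genres with the query, which is expected by construction for a genre-based recommender, so this number says nothing about whether users would actually like the results.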