# Hybrid Approach


- [3.1 Our Model](#Our-Model)
- [3.2 Recommendation Step](#Recommendation-Step)

## Introduction

We now incorporate content-based recommendations into our collaborative filtering model, yielding a hybrid of the two recommender system approaches. Since our original dataset contains the Amazon movie IDs, we use the Amazon API to extract features of each movie. Using metadata such as genre, director, and cast, we assemble a movie similarity matrix that is potentially more representative of the true similarities between movies than the one from our earlier item-based approach, which was built from user ratings.

We repeat the modeling procedure from our item-based collaborative filtering model, replacing its rating-based similarity matrix with one computed from the features of the movies.

We present our procedure for developing the new similarity matrix, as well as some recommended movies for several users.

## Our Model

### Loading Data

We first load our dataset from before, which is a compressed dataset with 3 columns: movieID, reviewerID, and rating. Below is a sample of the dataset.

In [1]:
# importing required modules
import pandas as pd
from scipy.spatial.distance import cosine
import numpy as np
from sklearn import preprocessing

In [2]:
# The dataset we use is a compressed dataset with 3 columns:
# movieID, reviewerID, and rating.
# Here is a sample of the dataset.
data_full = pd.read_csv('subset.csv')[["movieID", "reviewerID", "rating"]]
data_full.head()

Unnamed: 0,movieID,reviewerID,rating
0,5019281,ADZPIG9QOCDG5,4.0
1,5019281,A35947ZP82G7JH,3.0
2,5019281,A3UORV8A9D5L2E,3.0
3,5019281,A1VKW06X1O2X7V,5.0
4,5019281,A3R27T4HADWFFJ,4.0


### Constructing the Movie Similarity Matrix

We now construct the movie similarity matrix using metadata from the Amazon API. Our collected data is stored in the ``movie_info.csv`` file.

In [3]:
data = pd.read_csv('movie_info.csv').iloc[:, range(1,13)]

The dataset contains many missing values, and the variables are categorical. Instead of imputing with the mode, which may introduce bias, we fill the missing values of each column with a category drawn at random from those present in that column. We present the imputed dataset below.

In [8]:
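# impute each column with one randomly chosen category present in that column
# (NaN itself is excluded from the candidate categories)
from random import randint
data = data.apply(lambda x: x.fillna(x.dropna().unique()[randint(0, x.nunique() - 1)]))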

In [9]:
data.head(n = 5)

Unnamed: 0,movidID,title,price,publisher,category,actor,audience_rating,brand,director,feature,genre,release_date
0,310263662,The Passion of the Christ,10.14,Zondervan,DVD,Kurt Russell,PG-13 (Parents Strongly Cautioned),Provident Distribution Group,Darren Aronofsky,Condition: Used - Good,Performing Arts,8/31/04
1,767002652,Upstairs Downstairs - The Premiere Season [VHS],9.99,A&E Home Video,Video,Pauline Collins,NR (Not Rated),Sony,Bill Bain,Condition: Used - Good,Hero,1/15/98
2,076780192X,Close Encounters of the Third Kind (Widescreen...,8.49,Sony Pictures Home E,Video,Richard Dreyfuss,PG (Parental Guidance Suggested),Sony,Steven Spielberg,Condition: Used - Good,Hero,8/29/00
3,767802470,Das Boot - The Director's Cut,9.99,Sony Pictures Home Entertainment,DVD,Herbert Gronemeyer,R (Restricted),Columbia/Tristar Studios,Wolfgang Petersen,Factory sealed DVD,Drama,12/9/97
4,767802519,Das Boot - The Director's Cut,9.99,Sony Pictures Home Entertainment,DVD,Herbert Gronemeyer,R (Restricted),Columbia/Tristar Studios,Wolfgang Petersen,Factory sealed DVD,Drama,12/9/97
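
The imputation cell above fills every missing value in a column with the same randomly chosen category. As a point of comparison, here is a minimal sketch of a per-cell variant that draws a fresh value for each missing entry. The helper `impute_random` and the toy DataFrame are hypothetical and not part of the notebook. Note that sampling from the observed values favors frequent categories, which differs from picking uniformly among the unique categories.

```python
import numpy as np
import pandas as pd

def impute_random(col, rng):
    """Fill each missing entry with a value sampled from the column's observed values."""
    observed = col.dropna().to_numpy()
    missing = col.isna()
    if observed.size == 0 or not missing.any():
        return col  # nothing to sample from, or nothing to fill
    filled = col.copy()
    filled[missing] = rng.choice(observed, size=int(missing.sum()))
    return filled

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
toy = pd.DataFrame({
    "genre": ["Drama", None, "Hero", None, "Drama"],
    "category": ["DVD", "Video", None, "DVD", "DVD"],
})
imputed = toy.apply(impute_random, rng=rng)
print(imputed)
```

Whether per-cell or per-column sampling is preferable depends on how much the imputed values should influence the similarity between movies with missing metadata.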


To compare movies, we use the cosine similarity between each pair of movies, which requires the data in each column to be numeric. We therefore use the ``LabelEncoder`` class in ``sklearn`` to encode the data. Note that this inadvertently makes the categorical variables "ordinal", i.e. it imposes an arbitrary rank order on the categories. The cosine values do depend on the integer codes assigned, but we assume that this does not materially affect our similarity comparison; the "ordinal" effect is more pronounced when predicting an outcome with models such as logistic regression. The table below shows some of the encoded metadata.

In [11]:
encodedCol = data.iloc[:, range(3,12)].apply(preprocessing.LabelEncoder().fit_transform)
encodedCol.head()

Unnamed: 0,publisher,category,actor,audience_rating,brand,director,feature,genre,release_date
0,26,0,31,4,13,12,7,11,50
1,1,1,41,1,14,5,7,6,3
2,11,1,42,2,14,45,7,6,48
3,12,0,15,5,6,50,11,4,23
4,12,0,15,5,6,50,11,4,23
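
To make the encoding step concrete, here is a small NumPy-only sketch of what label encoding does: each category is numbered by its position among the sorted unique values, which is how `LabelEncoder` assigns its codes. The helper names and toy vectors are illustrative assumptions. The last lines show one way to check how sensitive a cosine value is to the particular codes assigned.

```python
import numpy as np

def label_encode(values):
    """Number each category by its position among the sorted unique values."""
    classes, codes = np.unique(values, return_inverse=True)
    return classes, codes

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

genres = np.array(["Drama", "Hero", "Drama", "Performing Arts"])
classes, codes = label_encode(genres)
print(classes.tolist(), codes.tolist())

# Two toy movies described by (genre code, category code).
sim_original = cosine_similarity([1, 1], [2, 1])
# The same two movies if genre codes 0 and 1 had been assigned the other way round.
sim_relabelled = cosine_similarity([0, 1], [2, 1])
print(sim_original, sim_relabelled)
```

A one-hot encoding would remove this dependence on the code assignment, at the cost of a much wider feature matrix.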


Alarmingly, based on the movie titles, we have only 68 unique movies in this dataset, even though each entry has a unique movieID in the database. We first proceed with computing the similarity matrix for all 449 movies (assuming they are unique). If time permits, we will revisit this problem and learn to extract unique movies from the Amazon API.
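
A quick way to quantify this duplication is to compare the number of distinct IDs with the number of distinct titles. The sketch below uses a hypothetical three-row frame with the same `movidID` and `title` column names as the metadata table; the titles are toy values. Matching on the exact title string is an assumption and would miss editions whose titles differ only by a suffix such as "[VHS]".

```python
import pandas as pd

toy = pd.DataFrame({
    "movidID": ["767802470", "767802519", "076780192X"],
    "title": [
        "Das Boot - The Director's Cut",
        "Das Boot - The Director's Cut",
        "Close Encounters of the Third Kind",
    ],
})

n_ids = toy["movidID"].nunique()    # distinct product IDs
n_titles = toy["title"].nunique()   # distinct titles
# Keep one representative row per title (the first ID seen for that title).
unique_movies = toy.drop_duplicates(subset="title", keep="first")
print(n_ids, n_titles)
print(unique_movies)
```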

We use cosine similarity to measure the similarity between two non-zero vectors. Two vectors with the same orientation have a cosine similarity of 1, while two vectors at $90^{\circ}$ to each other have a cosine similarity of 0. We display the first 5 rows of the computed similarity matrix.

In [20]:
# generate the similarity matrix by filling each placeholder with a cosine similarity
# outer loop over each movie
for i in range(0, len(movie_similarity_matrix.columns)):
    # inner loop over every movie to compare against movie i
    for j in range(0, len(movie_similarity_matrix.columns)):
        # scipy's cosine() returns a distance, so the similarity is 1 - distance;
        # identical items will have a value of 1
        movie_similarity_matrix.iloc[i, j] = 1 - cosine(dfnew.iloc[:, i], dfnew.iloc[:, j])

movie_similarity_matrix.head()

Unnamed: 0,310263662,767002652,076780192X,767802470,767802519,767802624,767802659,767805267,767811100,767824571,...,B00006RCNW,B00007E2F5,B000083C6V,B00008DDVU,B00008DDXB,B00008EY63,B00008W64E,B00008YGRU,B00009MGEM,B00009OOFA
310263662,1.0,0.605415,0.881451,0.680368,0.680368,0.635233,0.635233,0.82512,0.82512,0.828958,...,0.478732,0.478732,0.478732,0.478732,0.478732,0.836313,0.836313,0.921959,0.921959,0.921959
767002652,0.605415,1.0,0.665278,0.443719,0.443719,0.933954,0.933954,0.3066,0.3066,0.388308,...,0.458631,0.458631,0.458631,0.458631,0.458631,0.913536,0.913536,0.54524,0.54524,0.54524
076780192X,0.881451,0.665278,1.0,0.891936,0.891936,0.627992,0.627992,0.895837,0.895837,0.927992,...,0.760439,0.760439,0.760439,0.760439,0.760439,0.885954,0.885954,0.871617,0.871617,0.871617
767802470,0.680368,0.443719,0.891936,1.0,1.0,0.447691,0.447691,0.906124,0.906124,0.910325,...,0.937806,0.937806,0.937806,0.937806,0.937806,0.673311,0.673311,0.642559,0.642559,0.642559
767802519,0.680368,0.443719,0.891936,1.0,1.0,0.447691,0.447691,0.906124,0.906124,0.910325,...,0.937806,0.937806,0.937806,0.937806,0.937806,0.673311,0.673311,0.642559,0.642559,0.642559
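
The double loop above calls `cosine` once per pair of movies. A vectorized alternative, sketched here under the assumption that movies are the rows of the feature table (the notebook's `dfnew` stores them as columns), normalizes every row to unit length and takes a single matrix product. The function name and toy data are illustrative.

```python
import numpy as np
import pandas as pd

def cosine_similarity_matrix(features):
    """Pairwise cosine similarity between the rows of a (movies x features) DataFrame."""
    X = features.to_numpy(dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / norms  # assumes no all-zero rows
    sim = unit @ unit.T
    return pd.DataFrame(sim, index=features.index, columns=features.index)

toy = pd.DataFrame(
    [[1, 0], [2, 0], [0, 3]],
    index=["a", "b", "c"],
    columns=["f1", "f2"],
)
sim = cosine_similarity_matrix(toy)
print(sim)
```

In this toy example, movies `a` and `b` point in the same direction and movie `c` is orthogonal to both, which matches the two reference cases described above.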


The table below shows the 9 nearest-neighbor movies for selected movies in the dataset.

In [23]:
# loop through the similarity matrix above and fill in the neighbor IDs
# shortlist the 9 most similar movies for each movie
# (position 0 is skipped on the assumption that it is the movie itself)
for i in range(0, len(movie_similarity_matrix.columns)):
    data_neighbors.iloc[i, :] = movie_similarity_matrix.iloc[:, i].sort_values(ascending = False)[1:10].index
    
data_neighbors.head()

Unnamed: 0,1,2,3,4,5,6,7,8,9
310263662,790745399,792101448,792101324,079074841X,783241542,783221088,783221487,6301208773,630150528X
767002652,1415718113,080017948X,1404983082,1415724784,1415724806,1417030321,1417054069,1400322715,6304089767
076780192X,783239408,783241038,B00005JKQZ,B00005JKVZ,B00005JKZY,B00005JL3A,B00005JL3T,792833171,792837614
767802470,767802470,B00005JL8F,B00005JL78,B00005JLBE,B00005JLBQ,B00005JLET,B00005JLF2,B00005JLKN,B00005JOC9
767802519,767802470,B00005JL8F,B00005JL78,B00005JLBE,B00005JLBQ,B00005JLET,B00005JLF2,B00005JLKN,B00005JOC9
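
In the table above, movie 767802470 appears as its own nearest neighbor, presumably because a duplicate entry with similarity 1.0 was sorted ahead of it, so that skipping position 0 dropped the duplicate instead. A minimal sketch of a variant that excludes each movie by label rather than by position follows; `top_neighbors` and the toy matrix are hypothetical, and unique movie IDs are assumed.

```python
import pandas as pd

def top_neighbors(sim, k):
    """For each movie, list the k most similar other movies (self excluded by label)."""
    rows = {}
    for movie in sim.index:
        scores = sim.loc[movie].drop(labels=movie)
        ranked = scores.sort_values(ascending=False, kind="stable")
        rows[movie] = ranked.index[:k].tolist()
    return pd.DataFrame.from_dict(rows, orient="index", columns=list(range(1, k + 1)))

sim = pd.DataFrame(
    [[1.0, 1.0, 0.2],
     [1.0, 1.0, 0.5],
     [0.2, 0.5, 1.0]],
    index=["A", "B", "C"],
    columns=["A", "B", "C"],
)
neighbors = top_neighbors(sim, k=2)
print(neighbors)
```

Here `A` and `B` play the role of duplicate entries with similarity 1.0; each lists the other as its nearest neighbor and never itself.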


## Recommendation Step

We now use the similarity matrix computed above to recommend movies to users, following the same procedure as in our earlier item-based filtering.
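
The scoring rule from the item-based section is not reproduced here, so the sketch below is an assumption rather than the notebook's exact procedure: it uses a common similarity-weighted average of each user's existing ratings. The function `score_items` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def score_items(ratings, sim):
    """Similarity-weighted average of each user's ratings for every movie.

    ratings: users x movies DataFrame, NaN where a user has not rated a movie.
    sim:     movies x movies similarity DataFrame with the same movie labels.
    """
    sim = sim.loc[ratings.columns, ratings.columns]
    rated = ratings.notna().to_numpy(dtype=float)
    R = ratings.fillna(0.0).to_numpy(dtype=float)
    S = sim.to_numpy(dtype=float)
    weighted = R @ S              # sum over rated movies j of r(u, j) * sim(j, i)
    weights = rated @ np.abs(S)   # sum over rated movies j of |sim(j, i)|
    scores = np.divide(weighted, weights,
                       out=np.zeros_like(weighted), where=weights > 0)
    scores = pd.DataFrame(scores, index=ratings.index, columns=ratings.columns)
    return scores.where(ratings.isna())  # keep scores only for unrated movies

ratings = pd.DataFrame([[5.0, np.nan, 1.0]], index=["u1"], columns=["A", "B", "C"])
sim = pd.DataFrame(
    [[1.0, 0.9, 0.1],
     [0.9, 1.0, 0.2],
     [0.1, 0.2, 1.0]],
    index=["A", "B", "C"],
    columns=["A", "B", "C"],
)
scores = score_items(ratings, sim)
top = scores.loc["u1"].sort_values(ascending=False).dropna().index[:1].tolist()
print(scores)
print(top)
```

Masking already rated movies before ranking keeps the recommendations limited to movies the user has not yet reviewed.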

In [31]:
for i in range(0,500):
    # sort each user's scores in descending order, then keep the movie IDs at positions 1 to 6
    data_recommend_new.iloc[i, 1:] = recommended_movie_matrix_new.iloc[i, :].sort_values(ascending=False).iloc[1:7].index.transpose()
data_recommend_new.head()
data_recommend_new[['1', '2', '3', '4', '5', '6']].head()

reviewerID,1,2,3,4,5,6
A00503563AVX48TRHJGR6,B00009OOFA,6304383827,6304117752,6304176287,6304178352,6304214502
A0056274FAHZQC4N2ZN8,B00005QZ7U,B00005QCYC,B00005O5CM,B00005LOUP,B00005R87Q,6305076154
A00700212KB3K0MVESPIY,6303965415,6304117752,6304176287,6304178352,6304214502,6304233639
A01174011QPNX7GZF4B92,6300151379,6300183513,156219464X,1558908242,1558807381,1424819253
A0195162PO4TYE44CR9T,6303965415,6304117752,6304176287,6304178352,6304214502,6304233639


The table above displays six of the top recommended movies for five of the users in our dataset. Unfortunately, we had too few examples to assess the accuracy of our hybrid model by masking, as we did for the item-based collaborative filtering method. Instead, we examine some of the movies recommended by the model.

For example, for the first reviewer in the table above (reviewerID A00503563AVX48TRHJGR6), the recommended movies corresponding to the Amazon IDs are:
1. Gods and Generals (B00009OOFA)
2. Emma (6304383827)
3. The Sound of Music (6304117752)
4. Willy Wonka & the Chocolate Factory (6304176287)
5. My Fair Lady (6304178352)
6. Heat (6304214502)

The above list consists mainly of musicals and drama movies. As for the second reviewer (reviewerID A0056274FAHZQC4N2ZN8), the recommended movies are:
1. Moulin Rouge (B00005QZ7U)
2. Jurassic Park Trilogy (B00005QCYC)
3. Legally Blonde (B00005O5CM)
4. Dr. Seuss' How the Grinch Stole Christmas (B00005LOUP)
5. The Fast and the Furious (B00005R87Q)
6. Planet of the Apes Collection (6305076154)
7. Fawlty Towers - The Complete Collection (6305076464)

which is a mix of mostly action movies and comedies. For the fourth reviewer (reviewerID A01174011QPNX7GZF4B92), we recommend:
1. Labyrinth (6300151379)
2. Rear Window (6300183513)
3. Grave of the Fireflies (156219464X)
4. Pulp Fiction (1558908242)
5. Legend (1558807381)
6. Stargate Sg-1: Season 1 (1424819253)

which consist mostly of crime action and drama films.