# DSC478 Project: Movie Recommender System
### Machine Learning Models and Results

FinalProject_MLmodels

Movie dataset sourced from https://www.kaggle.com/rounakbanik/the-movies-dataset#movies_metadata.csv 

By Harsha Puvvada, Paripon Thantong, Yili Lin

---------------------------------------------------

# Notebook Overview:

### Content based algorithms

- IMDB weighted average rating: displays the top 25 movies at the beginning so the user can choose 5 and rate them from 0 to 10
- TF-IDF + cosine similarity based on the movie overview text

### Collaborative Filtering Algorithms

- Singular Value Decomposition (matrix factorization): predicts a user's ratings for unseen movies from their existing ratings and shows their top picks
- K-nearest neighbors: finds similar movies based on user ratings and shows their top picks

### Bagging:
- predict top 5 movies based on all the ML models

-------------------------------------------------------
### How the program would work:

Initially, the user is shown two lists: the top 25 movies by IMDB weighted score and the top 25 by popularity. The user picks 5 movies from the two lists and rates each on a scale of 0 to 10. The program then passes the user's ratings to all the ML models and uses bagging to determine and display the top 5 picks.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn-whitegrid')
%matplotlib inline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics.pairwise import cosine_similarity

In [145]:
#Load Data
movieDF = pd.read_csv('./finalData/cleanedMoviesMetadata.csv')
ratingsDF = pd.read_csv('./finalData/cleanedRatings.csv')
smallRatingsDF = pd.read_csv('./finalData/cleanedSmallRatings.csv')

In [3]:
smallRatingsDF.shape

(23712, 5)

In [4]:
movieDF.shape

(16470, 14)

--------------------------------------------------------------------------------------------------------------------

# Results Array to store ML model outputs

In [5]:
recommendationList = []

# IMDB weighted Average Algorithm

Calculate the IMDB weighted score to show the user the top 25 movies based on votes.
The IMDB weighted score accounts for how many votes a movie received, so that a high average from only a few votes does not dominate the ranking.

Formula:
IMDB = [V/(V + M)]R + [M/(V + M)]C

- V = number of votes for the movie
- M = minimum number of votes required; here M is the 0.9 quantile of vote_count
- R = mean vote_average across the movies with more than M votes
- C = the movie's own vote_average
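For comparison, the commonly cited form of the IMDB weighted rating attaches the weight V/(V + M) to the movie's own average and M/(V + M) to the dataset-wide mean, so sparsely voted movies are pulled toward the mean. The sketch below shows that conventional form only; the function name `weighted_rating`, its parameter names, and the sample numbers are illustrative assumptions, not values from this notebook's data.

```python
def weighted_rating(votes, movie_mean, min_votes, dataset_mean):
    """Shrink a movie's mean rating toward the dataset mean when it has few votes."""
    total = votes + min_votes
    return (votes / total) * movie_mean + (min_votes / total) * dataset_mean

# Hypothetical numbers: two movies with the same 8.0 average but different vote counts.
few = weighted_rating(votes=50, movie_mean=8.0, min_votes=500, dataset_mean=6.5)
many = weighted_rating(votes=5000, movie_mean=8.0, min_votes=500, dataset_mean=6.5)
print(round(few, 3), round(many, 3))
```

With these inputs the movie with 5000 votes keeps a score close to its own 8.0 average, while the movie with 50 votes ends up close to the dataset mean.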

In [6]:
#Set minimum number of Votes required first (M)
vote_count = movieDF.vote_count
M = vote_count.astype(float).quantile(.9)
tempDF = movieDF[movieDF['vote_count']>M]
M

589.0

In [7]:
#Set R (mean rating of the qualifying movies), V (votes per movie), C (each movie's own rating)
R = tempDF.vote_average.mean()
V = tempDF.vote_count
C = tempDF.vote_average

In [8]:
#calculate IMDB and add to movieDF
IMDB = ((V / (V + M)) * R) + ((M / (V + M)) * C)
movieDF['IMDB'] = IMDB
movieDF.head(2)

Unnamed: 0,movieId,imdb_id,title,runtime,genres,cast,keywords,overview,budget,revenue,popularity,vote_average,vote_count,year,IMDB
0,862,tt0114709,Toy Story,81.0,"['Animation', 'Comedy', 'Family']","['Tom Hanks', 'Tim Allen', 'Don Rickles', 'Jim...","['jealousy', 'toy', 'boy', 'friendship', 'frie...","Led by Woody, Andy's toys live happily in his ...",30000000,373554033,21.946943,7.7,5415,1995,6.698536
1,8844,tt0113497,Jumanji,104.0,"['Adventure', 'Fantasy', 'Family']","['Robin Williams', 'Jonathan Hyde', 'Kirsten D...","['board game', 'disappearance', ""based on chil...",When siblings Judy and Peter discover an encha...,65000000,262797249,17.015539,6.9,2413,1995,6.650505


In [9]:
#Function to display top list in neat format:
def displayList(listName):
    listName = list(listName.title)
    for i in range(len(listName)):
        print("#{:2d} - {:s}".format(i+1, listName[i]))

In [10]:
#Create a list and display the top 25 movies with the highest IMDB weighted rating
top25 = movieDF.sort_values(by=['IMDB'], ascending = False).head(25)
print("######################################\n  Top 25 Movies Based on IMDB Rating\n######################################")
displayList(top25)

######################################
  Top 25 Movies Based on IMDB Rating
######################################
# 1 - Band of Brothers
# 2 - Mommy
# 3 - Sing Street
# 4 - Paperman
# 5 - Mary and Max
# 6 - Once Upon a Time in America
# 7 - The Elephant Man
# 8 - The Man from Earth
# 9 - The Best Offer
#10 - Blue Velvet
#11 - Okja
#12 - Carlito's Way
#13 - Before Sunset
#14 - Girl, Interrupted
#15 - Kung Fury
#16 - The Little Prince
#17 - Me and Earl and the Dying Girl
#18 - Amadeus
#19 - Raging Bull
#20 - Rushmore
#21 - Kubo and the Two Strings
#22 - Before Sunrise
#23 - Brazil
#24 - Evil Dead II
#25 - True Romance


In [11]:
#Show top 25 movies based on popularity
top25 = movieDF.sort_values(by=['popularity'], ascending = False).head(25)
print("######################################\n  Top 25 Movies Based on Popularity\n######################################")
displayList(top25)

######################################
  Top 25 Movies Based on Popularity
######################################
# 1 - Minions
# 2 - Wonder Woman
# 3 - Beauty and the Beast
# 4 - Baby Driver
# 5 - Big Hero 6
# 6 - Deadpool
# 7 - Guardians of the Galaxy Vol. 2
# 8 - Avatar
# 9 - John Wick
#10 - Gone Girl
#11 - The Hunger Games: Mockingjay - Part 1
#12 - War for the Planet of the Apes
#13 - Captain America: Civil War
#14 - Pulp Fiction
#15 - Pirates of the Caribbean: Dead Men Tell No Tales
#16 - The Dark Knight
#17 - Blade Runner
#18 - The Avengers
#19 - Captain Underpants: The First Epic Movie
#20 - The Circle
#21 - The Bad Batch
#22 - The Maze Runner
#23 - Dawn of the Planet of the Apes
#24 - Alien: Covenant
#25 - Ghost in the Shell


__________________________

# User Selection and User Rating
The code below asks the user to choose 5 movies and enter a rating for each one.

In [12]:
# Ask the user for 5 movie titles and a rating for each (no input validation is performed)
def askUser():
    print("please enter your rating in this format: movie title and then your rating out from 0.0 - 10.0")
    
    count = 0
    userRatings = []
    while(count != 5):
        movieTitle = input("Enter movie Title: ")
        rating = input("Enter your rating for {}: ".format(movieTitle))
        userRatings.append([movieTitle.strip(), rating.strip()])
        count += 1
    return userRatings

In [13]:
userRatings = askUser()

please enter your rating in this format: movie title and then your rating out from 0.0 - 10.0
Enter movie Title: 1408
Enter your rating for 1408: 9
Enter movie Title: 2 Days in Paris
Enter your rating for 2 Days in Paris: 7
Enter movie Title: 2010
Enter your rating for 2010: 8
Enter movie Title: 24 Hour Party People
Enter your rating for 24 Hour Party People: 8
Enter movie Title: xXx
Enter your rating for xXx: 10


In [14]:
userRatings = pd.DataFrame(data = userRatings,columns = ["MovieTitle","Rating"])
userRatings

Unnamed: 0,MovieTitle,Rating
0,1408,9
1,2 Days in Paris,7
2,2010,8
3,24 Hour Party People,8
4,xXx,10
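
The `askUser` function above stores whatever the user types. A minimal sketch of how a rating could be validated and rescaled before use is shown here; `parse_rating` and `to_five_point` are hypothetical helper names, and the 0 to 10 bounds are taken from the prompt text.

```python
def parse_rating(text, low=0.0, high=10.0):
    """Convert user input to a float rating, raising ValueError when it is invalid."""
    value = float(text.strip())
    if not low <= value <= high:
        raise ValueError("rating must be between {} and {}".format(low, high))
    return value

def to_five_point(rating_out_of_ten):
    """Rescale a 0-10 rating to a 0-5 scale by halving it."""
    return rating_out_of_ten / 2

print(to_five_point(parse_rating(" 9 ")))
```

Raising `ValueError` lets the calling loop decide whether to ask again or skip the entry.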


------------------------

# ML Model 1: TF-IDF + Cosine Similarity Recommender
Makes recommendations based on term frequency-inverse document frequency (TF-IDF). It computes the TF-IDF values of each movie's overview text and finds the movies whose overviews have the highest cosine similarity to the movie the user entered.
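
To illustrate the idea on a small scale, the sketch below computes TF-IDF weights and cosine similarity by hand for three made-up overviews. It is a simplified stand-in, not the scikit-learn `TfidfVectorizer` used in the cells below: it uses a plain `tf * log(N / df)` weighting with whitespace tokenization and no stop-word removal. The names `tfidf_vectors` and `cosine` and the sample texts are assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({t: c * math.log(n_docs / doc_freq[t]) for t, c in counts.items()})
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical overviews: the first two share the terms "toys" and "boy".
overviews = [
    "toys come alive when the boy leaves",
    "a boy and his toys go on an adventure",
    "detectives chase a killer through the city",
]
vecs = tfidf_vectors(overviews)
sim_toys = cosine(vecs[0], vecs[1])
sim_crime = cosine(vecs[0], vecs[2])
print(sim_toys > sim_crime)
```

Terms that appear in fewer documents receive larger weights, so the two overviews that share rarer words score as more similar.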

In [15]:
movieDF.head(2)

Unnamed: 0,movieId,imdb_id,title,runtime,genres,cast,keywords,overview,budget,revenue,popularity,vote_average,vote_count,year,IMDB
0,862,tt0114709,Toy Story,81.0,"['Animation', 'Comedy', 'Family']","['Tom Hanks', 'Tim Allen', 'Don Rickles', 'Jim...","['jealousy', 'toy', 'boy', 'friendship', 'frie...","Led by Woody, Andy's toys live happily in his ...",30000000,373554033,21.946943,7.7,5415,1995,6.698536
1,8844,tt0113497,Jumanji,104.0,"['Adventure', 'Fantasy', 'Family']","['Robin Williams', 'Jonathan Hyde', 'Kirsten D...","['board game', 'disappearance', ""based on chil...",When siblings Judy and Peter discover an encha...,65000000,262797249,17.015539,6.9,2413,1995,6.650505


In [16]:
#Store overview as strings into movieCorpus
movieCorpus = movieDF['overview']
movieCorpus

0        Led by Woody, Andy's toys live happily in his ...
1        When siblings Judy and Peter discover an encha...
2        A family wedding reignites the ancient feud be...
3        Cheated on, mistreated and stepped on, the wom...
4        Just when George Banks has recovered from his ...
                               ...                        
16465    Pretty, popular, and slim high-schooler Aly Sc...
16466    Hyperactive teenager Kelly is enrolled into a ...
16467    It's Halloween in the 100 Acre Wood, and Roo's...
16468    In this true-crime documentary, we delve into ...
16469    A film archivist revisits the story of Rustin ...
Name: overview, Length: 16470, dtype: object

In [17]:
print(movieCorpus[0])

Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences.


In [18]:
tf = TfidfVectorizer(analyzer='word',ngram_range=(1, 2),min_df=1, stop_words='english')
# dropna() removes rows without an overview, so matrix rows can be offset from movieDF's index
tfidf_matrix = tf.fit_transform(movieDF['overview'].dropna())
tfidf_matrix.shape

(16457, 407910)

In [19]:
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
cosine_sim[:4, :4]

array([[1.        , 0.00658054, 0.        , 0.        ],
       [0.00658054, 1.        , 0.01621647, 0.        ],
       [0.        , 0.01621647, 1.        , 0.        ],
       [0.        , 0.        , 0.        , 1.        ]])

In [20]:
# Build a 1-dimensional array with movie titles
titles = movieDF['title']
indices = pd.Series(movieDF.index, index=movieDF['title'])

In [21]:
# Return the 5 movies whose overviews have the highest cosine similarity to the given title
def tfidf_recommender(title):
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:21]  # skip the first entry (the movie itself), keep the next 20
    movie_indices = [i[0] for i in sim_scores]
    return titles.iloc[movie_indices].head(5)

In [22]:
tfidf_recommender('Avatar')

9853     James Gandolfini: Tribute To A Friend
10285                     Bulletproof Salesman
2749                     America's Sweethearts
3883                  White Hunter Black Heart
350                             Dangerous Game
Name: title, dtype: object

In [23]:
# Get 5 recommendations for each user input movie and store it into results array
tfidfRecommendations = []
for movie in userRatings['MovieTitle']:
    recommendation = tfidf_recommender(movie)
    print("For: ", movie)
    print(recommendation, "\n") 
    tfidfRecommendations.append(recommendation)

For:  1408
11825          Old Enough
15083     Vincent N Roxxy
12097    Hot Girls Wanted
10898         Seventh Son
12514        The Opponent
Name: title, dtype: object 

For:  2 Days in Paris
4114                                      The Reckoning
13030                              Hotel Transylvania 2
6190     VeggieTales: The Pirates Who Don't Do Anything
8954                                       The Congress
441                                   Striking Distance
Name: title, dtype: object 

For:  2010
9389      Just Before Dawn
11284    Midnight Crossing
10038     Messages Deleted
2886        Mission to Mir
10575       The Fraternity
Name: title, dtype: object 

For:  24 Hour Party People
13177          Familiar Strangers
3688      Only the Strong Survive
3625           Cradle 2 the Grave
12348    Mickey's Christmas Carol
4250                 Pure Country
Name: title, dtype: object 

For:  xXx
15441    Perry Mason: The Case of the Fatal Framing
4870              Charlie and the Ch

# ML Model 2: KNN Search
Makes recommendations with an unsupervised algorithm: k-nearest neighbors search. It uses the ratings in cleanedSmallRatings.csv to find the movies whose user-rating vectors are closest, by cosine distance, to those of the movies the user entered.
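
As a small illustration of the neighbor search, the sketch below ranks the rows of a toy movie-by-user matrix by cosine distance to a query row using NumPy. It is a brute-force stand-in for scikit-learn's `NearestNeighbors`; the function name `nearest_items` and the ratings are made up, and the sketch assumes every row has at least one non-zero rating.

```python
import numpy as np

def nearest_items(item_user, query_row, k):
    """Return (row, cosine distance) pairs for the k rows closest to query_row."""
    norms = np.linalg.norm(item_user, axis=1)
    sims = item_user @ item_user[query_row] / (norms * norms[query_row])
    distances = 1.0 - sims
    order = [i for i in np.argsort(distances) if i != query_row]
    return [(int(i), float(distances[i])) for i in order[:k]]

# Hypothetical 4 movies x 3 users rating matrix; 0 means "not rated".
ratings = np.array([
    [5.0, 4.0, 0.0],
    [4.0, 5.0, 0.0],
    [0.0, 1.0, 5.0],
    [5.0, 0.0, 1.0],
])
neighbors = nearest_items(ratings, query_row=0, k=2)
print(neighbors)
```

Excluding the query row by position, rather than dropping the first result, avoids relying on the query always being its own closest neighbor.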

In [24]:
#data Preprocessing
movie_data = movieDF
movie_rating = smallRatingsDF
movie_data = movie_data.rename(columns = {'id':'movieId'})
movie_data = movie_data[['movieId', 'title', 'runtime', 'genres',
       'cast', 'budget', 'revenue', 'popularity',
       'vote_average', 'vote_count', 'year']]
movie_data.head(3)

Unnamed: 0,movieId,title,runtime,genres,cast,budget,revenue,popularity,vote_average,vote_count,year
0,862,Toy Story,81.0,"['Animation', 'Comedy', 'Family']","['Tom Hanks', 'Tim Allen', 'Don Rickles', 'Jim...",30000000,373554033,21.946943,7.7,5415,1995
1,8844,Jumanji,104.0,"['Adventure', 'Fantasy', 'Family']","['Robin Williams', 'Jonathan Hyde', 'Kirsten D...",65000000,262797249,17.015539,6.9,2413,1995
2,15602,Grumpier Old Men,101.0,"['Romance', 'Comedy']","['Walter Matthau', 'Jack Lemmon', 'Ann-Margret...",0,0,11.7129,6.5,92,1995


In [25]:
# merge data
new_movie_data = pd.merge(movie_data, movie_rating, on='movieId')

In [26]:
## Drop rows with a missing title
combine_movie_rating = new_movie_data.dropna(axis = 0, subset = ['title'])

## Group data and rename the column name for another dataframe
movie_ratingCount = (new_movie_data.
     groupby(by = ['title'])['rating'].
     count().
     reset_index().
     rename(columns = {'rating': 'totalRatingCount'})
     [['title', 'totalRatingCount']]
    )
movie_ratingCount.head()

Unnamed: 0,title,totalRatingCount
0,!Women Art Revolution,2
1,10 Items or Less,11
2,10 Things I Hate About You,7
3,"10,000 BC",3
4,11'09''01 - September 11,1


In [27]:
# Merge 2 datasets by title
rating_with_totalRatingCount = combine_movie_rating.merge(movie_ratingCount, left_on = 'title', right_on = 'title', how = 'left')
rating_with_totalRatingCount.head(2)

Unnamed: 0.1,movieId,title,runtime,genres,cast,budget,revenue,popularity,vote_average,vote_count,year,Unnamed: 0,userId,rating,timestamp,totalRatingCount
0,949,Heat,170.0,"['Action', 'Crime', 'Drama', 'Thriller']","['Al Pacino', 'Robert De Niro', 'Val Kilmer', ...",60000000,187436818,17.924927,7.7,1886,1995,4118,23,3.5,1148721092,16
1,949,Heat,170.0,"['Action', 'Crime', 'Drama', 'Thriller']","['Al Pacino', 'Robert De Niro', 'Val Kilmer', ...",60000000,187436818,17.924927,7.7,1886,1995,15468,102,4.0,956598942,16


In [28]:
## Keep only ratings of 4.5 or higher
movie_list = rating_with_totalRatingCount[rating_with_totalRatingCount.rating >=4.5]
# Pick a random sample of 30 rows
movie_list_sample = movie_list.sample(30) 

In [29]:
# Drop duplicate and reshape data.
movie_list_sample = movie_list_sample.drop_duplicates(['userId', 'title'])
movie_list_sample_pivot = movie_list_sample.pivot(index = 'title', columns = 'userId', values = 'rating').fillna(0)

In [30]:
## Unsupervised KNN used as a nearest-neighbor search over movie rating vectors
## Parameters such as metric and n_neighbors can be tuned.
rating_with_totalRatingCount = rating_with_totalRatingCount.drop_duplicates(['userId', 'title'])
rating_with_totalRatingCount_pivot = rating_with_totalRatingCount.pivot(index = 'movieId', columns = 'userId', values = 'rating').fillna(0)
rating_with_totalRatingCount_matrix = csr_matrix(rating_with_totalRatingCount_pivot.values)

model_knn = NearestNeighbors(n_neighbors = 11,metric = 'cosine', algorithm = 'brute')
model_knn.fit(rating_with_totalRatingCount_matrix)

NearestNeighbors(algorithm='brute', leaf_size=30, metric='cosine',
                 metric_params=None, n_jobs=None, n_neighbors=11, p=2,
                 radius=1.0)

In [31]:
movie_list

Unnamed: 0.1,movieId,title,runtime,genres,cast,budget,revenue,popularity,vote_average,vote_count,year,Unnamed: 0,userId,rating,timestamp,totalRatingCount
3,949,Heat,170.0,"['Action', 'Crime', 'Drama', 'Thriller']","['Al Pacino', 'Robert De Niro', 'Val Kilmer', ...",60000000,187436818,17.924927,7.7,1886,1995,33669,242,5.0,956688825,16
7,949,Heat,170.0,"['Action', 'Crime', 'Drama', 'Thriller']","['Al Pacino', 'Robert De Niro', 'Val Kilmer', ...",60000000,187436818,17.924927,7.7,1886,1995,53461,387,5.0,974670478,16
8,949,Heat,170.0,"['Action', 'Crime', 'Drama', 'Thriller']","['Al Pacino', 'Robert De Niro', 'Val Kilmer', ...",60000000,187436818,17.924927,7.7,1886,1995,61635,452,4.5,1133735550,16
19,1408,Cutthroat Island,119.0,"['Action', 'Adventure']","['Geena Davis', 'Matthew Modine', 'Frank Lange...",98000000,10017322,7.284477,5.7,137,1995,800,11,5.0,1391658667,43
23,1408,Cutthroat Island,119.0,"['Action', 'Adventure']","['Geena Davis', 'Matthew Modine', 'Frank Lange...",98000000,10017322,7.284477,5.7,137,1995,9454,63,4.5,1079098267,43
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23729,2791,The Chronicles of Riddick: Dark Fury,34.0,"['Action', 'Animation', 'Science Fiction', 'Th...","['Vin Diesel', 'Rhiana Griffith', 'Keith David...",0,0,1.628881,5.9,157,2004,96571,646,5.0,953449488,106
23730,2791,The Chronicles of Riddick: Dark Fury,34.0,"['Action', 'Animation', 'Science Fiction', 'Th...","['Vin Diesel', 'Rhiana Griffith', 'Keith David...",0,0,1.628881,5.9,157,2004,97789,654,4.5,1145390525,106
23731,2791,The Chronicles of Riddick: Dark Fury,34.0,"['Action', 'Animation', 'Science Fiction', 'Th...","['Vin Diesel', 'Rhiana Griffith', 'Keith David...",0,0,1.628881,5.9,157,2004,98243,656,5.0,986242465,106
23743,2331,Jesus,240.0,"['History', 'Drama']","['Jeremy Sisto', 'Armin Mueller-Stahl', 'Debra...",20000000,0,2.524781,5.4,8,1999,33831,242,5.0,956683960,9


In [32]:
# Build a title x userId matrix from the highly rated entries (rating >= 4.5)
movie_list_sample1 = movie_list
movie_list_sample1 = movie_list_sample1.drop_duplicates(['userId','title'])
movie_list_sample1.index = [x for x in range(1, len(movie_list_sample1.values)+1)]
movie_list_sample1_pivot = movie_list_sample1.pivot(index = 'title', columns = 'userId', values = 'rating').fillna(0)

# Display the pivoted matrix
movie_list_sample1_pivot

title \ userId,2,3,4,5,6,7,8,9,10,11,...,661,662,664,665,666,667,668,669,670,671
10 Items or Less,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1408,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2 Days in Paris,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2010,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
24 Hour Party People,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Young Adam,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Young Black Stallion,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Zodiac,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
eXistenZ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [33]:
def Recommend_id(n):
    query_index = n   # row position of the query movie in the movieId x userId pivot
    # Request 6 neighbors; the closest one is normally the query movie itself
    distances, indices = model_knn.kneighbors(rating_with_totalRatingCount_pivot.iloc[query_index, :].values.reshape(1, -1), n_neighbors = 6)

    for i in range(0, len(distances.flatten())):
        if i == 0:
            # The title is read from the title-indexed pivot at the same row position
            print('Recommendations for: {0}\n'.format(movie_list_sample1_pivot.index[query_index]))
        else:
            print('{0}:Movie Id {1}, with distance of :{2}'.format(i, rating_with_totalRatingCount_pivot.index[indices.flatten()[i]], distances.flatten()[i]))            

In [34]:
print(Recommend_id(1))
print(Recommend_id(2))
print(Recommend_id(3))
print(Recommend_id(4))
print(Recommend_id(707))

Recommendations for: 1408

1:Movie Id 25, with distance of :0.49513058155839584
2:Movie Id 786, with distance of :0.5325538642689402
3:Movie Id 95, with distance of :0.5383533091562768
4:Movie Id 608, with distance of :0.539475353930275
5:Movie Id 16, with distance of :0.5698249448446511
None
Recommendations for: 2 Days in Paris

1:Movie Id 66, with distance of :0.6220139271694556
2:Movie Id 26386, with distance of :0.6663884534477982
3:Movie Id 711, with distance of :0.6840072602259729
4:Movie Id 107, with distance of :0.6910454076896677
5:Movie Id 1165, with distance of :0.6916226622033201
None
Recommendations for: 2010

1:Movie Id 1812, with distance of :0.4434468273072043
2:Movie Id 3048, with distance of :0.45778585788618364
3:Movie Id 2898, with distance of :0.45778585788618364
4:Movie Id 558, with distance of :0.4883858692153267
5:Movie Id 1595, with distance of :0.4946458421661398
None
Recommendations for: 24 Hour Party People

1:Movie Id 5680, with distance of :0.5635240694863

___________________

# ML Model 3: SVD Recommender
This model uses a matrix factorization method called singular value decomposition (SVD). SVD reduces the dimensionality of the utility matrix by extracting its latent factors: each user and each movie is mapped into a latent space of dimension r. Because users and movies then live in the same space, they become directly comparable, which helps expose the relationship between them.
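
The sketch below walks through the same steps on a toy matrix: subtract each user's mean, factorize, keep r latent factors, and rebuild the predicted ratings. It uses `np.linalg.svd` on a small dense matrix instead of the `scipy.sparse.linalg.svds` call used in this notebook, and the ratings and the choice r = 2 are arbitrary assumptions for illustration.

```python
import numpy as np

# Hypothetical 4 users x 5 movies utility matrix; 0 marks an unrated movie.
R = np.array([
    [5.0, 4.0, 0.0, 1.0, 0.0],
    [4.0, 5.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 5.0, 4.0, 4.0],
    [1.0, 0.0, 4.0, 5.0, 0.0],
])
user_mean = R.mean(axis=1, keepdims=True)
U, s, Vt = np.linalg.svd(R - user_mean, full_matrices=False)

r = 2  # number of latent factors to keep (arbitrary for this toy example)
approx = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :] + user_mean

# Rank the movies user 0 has not rated by their predicted score.
user = 0
unrated = np.where(R[user] == 0)[0]
ranked = unrated[np.argsort(-approx[user, unrated])]
print(ranked)
```

Keeping all singular values reproduces the original matrix exactly; truncating to r factors is what produces non-trivial predictions for the unrated cells.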

In [35]:
rating=pd.DataFrame(smallRatingsDF)
movie=pd.DataFrame(movieDF)
# Clean the data frames
#movie=movie.drop(columns=['Unnamed: 0']) # drop the first column
rating=rating.drop(columns=['Unnamed: 0','timestamp'])
movie=movie.rename(columns={'id':'movieId'}) # rename the movie ID column

In [36]:
movie.index = range(len(movie))
rating.index=range(len(rating))
# Add a 0-based running number (count_no) for each movie
movie['count_no']=movie.index
movie.movieId.unique().shape[0]

16188

In [37]:
modifiedUserRatings = userRatings
modifiedUserRatings['Rating'] = modifiedUserRatings['Rating'].astype(int)

In [38]:
# Convert the user's 0-10 ratings to the 0-5 scale used in the ratings table
userRatings['rating']=userRatings['Rating']/2

In [39]:
# Look up the movieId for each title the user rated (the first matching title is used)
movie_id_user=[]
for i in range(len(userRatings.index)):
    movie_id_user.append(list(movie[movie['title']==userRatings.loc[i]['MovieTitle']].movieId)[0])

In [40]:
userRatings['movieId']=np.array(movie_id_user)
userRatings

Unnamed: 0,MovieTitle,Rating,rating,movieId
0,1408,9,4.5,3021
1,2 Days in Paris,7,3.5,1845
2,2010,8,4.0,4437
3,24 Hour Party People,8,4.0,2750
4,xXx,10,5.0,7451


In [41]:
# Build one ratings-table row per rated movie for the new user (userId 672)
row1={'userId':int(672),'movieId':userRatings.loc[0]['movieId'],'rating':userRatings.loc[0]['rating']}
row2={'userId':int(672),'movieId':userRatings.loc[1]['movieId'],'rating':userRatings.loc[1]['rating']}
row3={'userId':int(672),'movieId':userRatings.loc[2]['movieId'],'rating':userRatings.loc[2]['rating']}
row4={'userId':int(672),'movieId':userRatings.loc[3]['movieId'],'rating':userRatings.loc[3]['rating']}
row5={'userId':int(672),'movieId':userRatings.loc[4]['movieId'],'rating':userRatings.loc[4]['rating']}

In [42]:
rating.loc[23712]=row1
rating.loc[23713]=row2
rating.loc[23714]=row3
rating.loc[23715]=row4
rating.loc[23716]=row5

In [43]:
rating.userId.unique()

array([  1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.,  11.,
        12.,  13.,  14.,  15.,  16.,  17.,  18.,  19.,  20.,  21.,  22.,
        23.,  24.,  25.,  26.,  27.,  28.,  29.,  30.,  31.,  32.,  33.,
        34.,  35.,  36.,  37.,  38.,  39.,  40.,  41.,  42.,  43.,  44.,
        45.,  46.,  47.,  48.,  49.,  50.,  51.,  52.,  53.,  54.,  55.,
        56.,  57.,  58.,  59.,  60.,  61.,  62.,  63.,  64.,  65.,  66.,
        67.,  68.,  69.,  70.,  71.,  72.,  73.,  74.,  75.,  76.,  77.,
        78.,  79.,  80.,  81.,  82.,  83.,  84.,  85.,  86.,  87.,  88.,
        89.,  90.,  91.,  92.,  93.,  94.,  95.,  96.,  97.,  98.,  99.,
       100., 101., 102., 103., 104., 105., 106., 107., 108., 109., 110.,
       111., 112., 113., 114., 115., 116., 117., 118., 119., 120., 121.,
       122., 123., 124., 125., 126., 127., 128., 129., 130., 131., 132.,
       133., 134., 135., 136., 137., 138., 139., 140., 141., 142., 143.,
       144., 145., 146., 147., 148., 149., 150., 15

In [44]:
# Merge ratings and movies into one dataset
df=pd.merge(rating,movie, on='movieId')
# Build the user x movie rating matrix used for the factorization
movie_matrix = df.pivot_table(index='userId', columns='count_no', values='rating').fillna(0)

In [45]:
R=movie_matrix.to_numpy()  # as_matrix() was removed from pandas; to_numpy() returns the same array
user_ratings_mean = np.mean(R, axis = 1)
R_demeaned = R - user_ratings_mean.reshape(-1, 1)



In [46]:
from scipy.sparse.linalg import svds
# Truncated SVD keeping 50 latent factors
U, sigma, Vt = svds(R_demeaned, k = 50)
sigma = np.diag(sigma)
# Rebuild the rating matrix from the factors and add each user's mean back
all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt) + user_ratings_mean.reshape(-1, 1)
preds_df = pd.DataFrame(all_user_predicted_ratings, columns = movie_matrix.columns)

In [47]:
def recommend_movies(predictions_df,  movies_df, original_ratings_df, num_recommendations=5):
    
    # Get and sort the user's predictions
    userID=672
    # Row position of the user in predictions_df: user IDs start at 1, not 0,
    # and two rows are missing, hence the offset of 3
    user_row_number = userID - 3
    sorted_user_predictions = predictions_df.iloc[user_row_number].sort_values(ascending=False)
    
    # Get the movies the user has already rated, with the movie information merged in
    # (df is the merged ratings/movie table built above)
    user_full = df[df.userId==(userID)]

    # Recommend the highest predicted rating movies that the user hasn't rated yet.
    recommendations = (movies_df[~movies_df['count_no'].isin(user_full['count_no'])].
         merge(pd.DataFrame(sorted_user_predictions).reset_index(), how = 'left',
               left_on = 'count_no',
               right_on = 'count_no').
         rename(columns = {user_row_number: 'Predictions'}).
         sort_values('Predictions', ascending = False).
                       iloc[:num_recommendations, :-1]
                      )

    return user_full, recommendations

already_rated, predictions = recommend_movies(preds_df,  movie, rating, 25)


In [48]:
recomend_list=list(predictions['movieId'])
recomend_list

[1968,
 111,
 1247,
 2020,
 1959,
 1956,
 2791,
 926,
 924,
 231,
 296,
 1639,
 4816,
 215,
 2605,
 8874,
 1073,
 2115,
 8970,
 866,
 2324,
 30707,
 1358,
 2321,
 3512]

# ML Model 4: Cosine Similarity Recommender
Makes recommendations based on user ratings. It computes the cosine similarity between movies' rating vectors and, for each title the user rated, returns the most similar movies.
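
A minimal sketch of item-item cosine similarity on a toy user-by-movie matrix follows. It keeps an explicit list that maps each matrix column to its movie ID, so lookups use 0-based column positions directly. The IDs in `movie_ids`, the ratings, and the helper `similar_movies` are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical users x movies matrix; column j holds the ratings for movie_ids[j].
movie_ids = [101, 205, 309]
user_item = np.array([
    [5.0, 4.0, 0.0],
    [4.0, 5.0, 1.0],
    [0.0, 1.0, 5.0],
])

# Normalise each movie column, then one matrix product gives all pairwise cosines.
unit = user_item / np.linalg.norm(user_item, axis=0, keepdims=True)
item_sim = unit.T @ unit

def similar_movies(movie_id, k):
    """Return the k movie IDs most similar to movie_id, excluding the movie itself."""
    pos = movie_ids.index(movie_id)  # 0-based column position of the query movie
    order = np.argsort(-item_sim[pos])
    return [movie_ids[i] for i in order if i != pos][:k]

top = similar_movies(101, 1)
print(top)
```

Mapping positions to IDs through a single list keeps the similarity row and the reported movie consistent without any index offsets.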

In [49]:
# Keep only movies that have at least one rating
movie=movie[movie.movieId.isin(rating['movieId'])]
# Keep only ratings whose movie is in the movie table
rating=rating[rating.movieId.isin(movie['movieId'])]

In [50]:
movie.index = range(len(movie))
rating.index=range(len(rating))
# Add a 0-based running number (count_no) for each movie
movie['count_no']=movie.index


In [51]:
#merge into one dataset
df=pd.merge(rating,movie, on='movieId')

In [52]:
# Build the user x movie rating matrix for the recommender
movie_matrix = df.pivot_table(index='userId', columns='count_no', values='rating').fillna(0)

In [53]:
# Calculate the item-item cosine similarity (movies are the columns of data_matrix)
def calculationSimilarity(data_matrix):
    item_similarity = cosine_similarity(data_matrix.T, dense_output=True)
    return item_similarity


In [54]:
item_similarity=calculationSimilarity(movie_matrix)

In [55]:
def rec_sys(df, item_similarity, keywords, k):
    '''
    :param df: merged ratings and movie data (not used directly)
    :param item_similarity: item-item similarity matrix
    :param keywords: movie title, or part of a title, to search for
    :param k: number of recommendations
    :return: list of recommended movieId values
    '''
    movie_list = []     # recommended movie IDs
    # count_no of the first movie whose title contains the keywords
    movie_id = list(movie[movie['title'].str.contains(keywords)].count_no)[0]
    # count_no is 0-based here, so the -1 / +1 offsets below shift the lookup by one row
    movie_similarity = item_similarity[movie_id - 1]    # similarity row used for the query
    movie_similarity_index = np.argsort(-movie_similarity)[1:k + 1]     # top k after the first hit

    for index in movie_similarity_index:
        rec_movie = list(movie[movie['count_no'] == index + 1].movieId)[0] # movie id
        movie_list.append(rec_movie)
    return movie_list

In [56]:
print(rec_sys(df,item_similarity,'1408',5))
print(rec_sys(df,item_similarity,'2 Days in Paris',5))
print(rec_sys(df,item_similarity,'2010',5))
print(rec_sys(df,item_similarity,'24 Hour Party People',5))
print(rec_sys(df,item_similarity,'xXx',5))

[640, 2892, 4978, 25750, 4553]
[8130, 74727, 1990, 49299, 4012]
[109161, 4518, 33138, 314, 6022]
[39446, 6963, 49013, 5686, 2193]
[2621, 2617, 671, 49299, 313]


### For all ML model outputs: record the movieId numbers in a CSV file (baggingData.csv)

# Bagging
Taking the output of all the ML models, we count how many times each movie is recommended and display the 5 movies that were recommended most often. This bagging step is a majority vote count.
Data is stored in baggingData.csv. It contains 100 movieId recommendations, 25 from each ML model.
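
The vote count can be sketched with the standard library alone. The example below tallies how often each ID appears across several recommendation lists and returns the most frequent ones; `top_voted` and the sample lists are hypothetical and not taken from baggingData.csv.

```python
from collections import Counter

def top_voted(recommendation_lists, k):
    """Count how often each movie ID is recommended and return the k most frequent."""
    votes = Counter(mid for recs in recommendation_lists for mid in recs)
    return votes.most_common(k)

# Hypothetical outputs from three recommenders.
model_outputs = [
    [25, 786, 95],
    [25, 49299, 1956],
    [49299, 640, 25],
]
winners = top_voted(model_outputs, 2)
print(winners)
```

`Counter.most_common` returns IDs together with their vote counts, which makes ties between movies with a single vote easy to spot.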

In [66]:
baggingDF = pd.read_csv('./finalData/baggingData.csv')
baggingDF.head()

Unnamed: 0,movieID
0,11825
1,15083
2,12097
3,10898
4,12514


In [108]:
#Count frequencies of movieID and keep the five most frequent IDs
finalRecommendations = baggingDF['movieID'].value_counts()[:5]
print(finalRecommendations)
# The index of the value_counts result holds the movie IDs
finalIds = finalRecommendations.index.tolist()
finalIds

25       2
49299    2
25750    1
74727    1
1956     1
Name: movieID, dtype: int64


[25, 49299, 25750, 74727, 1956]

In [161]:
def printOutput(finalIds):
    for item in finalIds:
        print(movieDF[movieDF["movieId"] == item].title, "\n\n\n")

In [167]:
print("#### Your top 5 movie recommendations are: #####\n")
printOutput(finalIds)


#### Your top 5 movie recommendations are: #####

5248    Jarhead
Name: title, dtype: object 



350    Dangerous Game
Name: title, dtype: object 



694    Children of the Corn IV: The Gathering
Name: title, dtype: object 



14070    Long Pigs
Name: title, dtype: object 



3610    Gerry
Name: title, dtype: object 



