# Capstone Demo Notebook

This is notebook 4 of 4. It demonstrates the functionality of the system. Once all of the cells below have been run, the demo section at the bottom is ready for testing.

In [1]:
# import relevant packages
import pandas as pd
import numpy as np

from surprise import Dataset
from surprise.reader import Reader
from surprise import accuracy
from collections import defaultdict
from surprise import dump

import pickle


In [2]:
# import relevant data
movies_reference = pd.read_csv('My Datasets/movies_reference.csv')

In [4]:
# reading in movie similarities data
movie_similarities = np.load('My Datasets/movie_similarities.npy')

In [4]:
# getting the mixed predictions pkl file
mixed_predictions = open('Models/mixed_predictions.pkl', 'rb')
KNN_SVD_predictions = pickle.load(mixed_predictions)

In [5]:
mixed_predictions.close()
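
The open, load and close steps above can also be written with a context manager, which closes the file even if `pickle.load` raises. The sketch below is only an illustration: the toy dictionary and temporary file path are made up, and the `{uid: [(iid, est), ...]}` layout is assumed from the format described in the `get_top_N_movies` docstring.

```python
import os
import pickle
import tempfile

# hypothetical stand-in for the hybrid predictions: {uid: [(iid, est), ...]}
toy_predictions = {44: [(2028, 4.5), (150, 4.3)], 60: [(2028, 4.4)]}

# write the toy dictionary to a temporary pickle file
tmp_path = os.path.join(tempfile.mkdtemp(), 'toy_predictions.pkl')
with open(tmp_path, 'wb') as f:
    pickle.dump(toy_predictions, f)

# the with-block closes the file automatically once loading is done
with open(tmp_path, 'rb') as f:
    loaded_predictions = pickle.load(f)
```

With this pattern a separate `close()` cell is not needed.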

# Functions

## 1. Getting Top N Ratings for Users

The code below is heavily influenced by the Surprise FAQ function [get_top_n](https://surprise.readthedocs.io/en/stable/FAQ.html). It takes the hybrid predicted ratings loaded above and sorts them by predicted rating. It then returns a dictionary holding only the top N items for each user, as specified in the function call, with predictions below the rating threshold dropped.

In [10]:
def get_top_N_movies(predictions, n=10, threshold=3.5):
    ''' Takes in dictionary of user ids and associated item id/predicted rating tuples.
    Returns the top N per user, as specified in the function call.
    Required format: dictionary of lists of tuples, ex. uid: [(iid, est)]
    Where uid = user id, iid = item id and est = predicted rating of item by user.
    Arguments: predictions, top N ratings, ratings threshold'''
    
    user_top_n_films = defaultdict(list)
    
    # going through the list of tuples, sorting by the predicted rating
    # slicing out only the top n for each user
    for uid, ratings in predictions.items():
        
        # sort the tuples in the keys by the predicted rating
        # lambda function calls x as each tuple, indexes to the second item (est)
        # sorts the tuple list by that estimated rating
        ratings.sort(key=lambda x: x[1], reverse=True)
        
        # separates off the top n ratings for each user
        user_top_n_films[uid] = ratings[:n]
    
    # keeping only the top n ratings that meet our rating threshold
    # (a new list is built because popping while iterating skips elements)
    for uid, ratings in user_top_n_films.items():
        user_top_n_films[uid] = [rating for rating in ratings if rating[1] >= threshold]
        
    return user_top_n_films

In [11]:
top_dict = get_top_N_movies(KNN_SVD_predictions, n=100, threshold=4)
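
To see the sort, slice and threshold steps on a small input, here is a minimal sketch on made-up data. The function name `top_n_sketch` and the toy predictions are hypothetical. Unlike the function above, it uses `sorted()` so the input lists are left untouched.

```python
from collections import defaultdict

def top_n_sketch(predictions, n=10, threshold=3.5):
    """Hypothetical helper: top n (iid, est) pairs per user, at or above threshold."""
    top_n = defaultdict(list)
    for uid, ratings in predictions.items():
        # sort a copy by the estimated rating (second tuple element), highest first
        ranked = sorted(ratings, key=lambda pair: pair[1], reverse=True)
        # slice the top n, then keep only pairs that meet the threshold
        top_n[uid] = [(iid, est) for iid, est in ranked[:n] if est >= threshold]
    return top_n

# made-up predictions in the {uid: [(iid, est), ...]} format
toy_predictions = {
    'a': [(1, 3.0), (2, 4.8), (3, 4.1), (4, 2.5)],
    'b': [(1, 4.9), (3, 3.9)],
}
toy_top = top_n_sketch(toy_predictions, n=3, threshold=4)
```

Here user `'a'` keeps items 2 and 3, and user `'b'` keeps only item 1, because the remaining predictions fall below the threshold of 4.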

## 2. Getting Top N Ratings for 2

The final function for predicting ratings for two users needs to find the items that appear on both users' top N lists. It then returns a dataframe of options that would rank highly for both users, along with their averaged predicted rating.

In [12]:
def ratings_for_two(user1, user2):
    '''Takes in two user ids, returns a dataframe of movies that would be recommended for both'''
    
    # uses top_n function to create list of movies that each user would like
    movies_for_one = top_dict[user1]
    movies_for_two = top_dict[user2]
    
    # instantiate empty list for the combined movies
    movies_for_both = []
    
    # filling movies for both list, averaging the couple rating
    for (iid1, rating1) in movies_for_one:
        for (iid2, rating2) in movies_for_two:
            # IF the movie ids match, append the id and an averaged rating to the list
            if iid1 == iid2:
                movies_for_both.append((iid1, ((rating1+rating2)/2)))
            
    
    # sort the list by the averaged rating
    movies_for_both.sort(key=lambda x: x[1], reverse=True)
    
    # instantiating dataframe for visibility
    top_for_both = pd.DataFrame(columns=['Title', 'Year Of Release', 'IMDB Rating /10', 'Couple Pred Rating /5', 'IMDB Vote Count', 'MovieId'])
    
    # for each movie + rating in the movies for both list
    for i, (iid, rating) in enumerate(movies_for_both):
        try:
        # populate dataframe, using the movies reference table for the values
            top_for_both.loc[i] = [str(movies_reference['title'][movies_reference['movieId'] == iid].values).strip("(?:[''])"),
                                   int(movies_reference['year_of_release'][movies_reference['movieId'] == iid].values),
                                   float(movies_reference['averageRating'][movies_reference['movieId'] == iid].values),
                                   round(rating,1),
                                   int(movies_reference['numVotes'][movies_reference['movieId'] == iid].values),
                                   int(movies_reference['movieId'][movies_reference['movieId'] == iid].values)]
        except TypeError:
            pass
    
    top_for_both = top_for_both[top_for_both['IMDB Vote Count'] > 45000]
    top_for_both = top_for_both[top_for_both['Year Of Release'] > 1965]
    
    top_for_both.reset_index(drop=True, inplace=True)
    
    # return the dataframe
    return top_for_both
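
The matching step inside `ratings_for_two` compares every pair of items with a nested loop. As a sketch of an alternative, the intersection can be done with a dictionary lookup, which avoids the inner loop. The helper name `shared_movies_sketch` and the toy lists are hypothetical, and the sketch assumes each item id appears at most once per user.

```python
def shared_movies_sketch(movies_for_one, movies_for_two):
    """Hypothetical helper: (iid, averaged rating) for items on both lists."""
    # map item id -> predicted rating for the second user
    ratings_two = dict(movies_for_two)
    # keep items present on both lists and average the two predictions
    shared = [(iid, (est_one + ratings_two[iid]) / 2)
              for iid, est_one in movies_for_one if iid in ratings_two]
    # sort by the averaged rating, highest first
    shared.sort(key=lambda pair: pair[1], reverse=True)
    return shared

# made-up top N lists in the [(iid, est), ...] format
toy_one = [(2028, 5.0), (3114, 4.5), (110, 4.0)]
toy_two = [(3114, 4.0), (2028, 4.0), (1036, 5.0)]
toy_shared = shared_movies_sketch(toy_one, toy_two)
```

Only items 2028 and 3114 appear on both toy lists, so only they are returned, each with the mean of the two predicted ratings.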

## 3. Finding Similar to Top N for 2

In [13]:
def similar_items_to_top_5(user1, user2):
    ''' Takes in two user ids, returns a dataframe of the 5 most similar movies
    to each of the top 5 movies recommended for both users'''
    
    df = ratings_for_two(user1, user2)
    
    to_concat_list = []

    for i, iid in enumerate(df['MovieId'].iloc[:5]):

        movie_index = (movies_reference[movies_reference['movieId'] == iid].index)
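        # take the title directly; str(...).strip("[\W']") would also strip a
        # leading or trailing 'W' from titles, since strip() works on characters
        similar_to = movies_reference.loc[movies_reference['movieId'] == iid, 'title'].iloc[0]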

        i = pd.DataFrame({'Movie':movies_reference['title'],
                          'Year of Release':movies_reference['year_of_release'],
                          'IMDB Rating':movies_reference['averageRating'],
                          'Similarity Score': np.array(movie_similarities[movie_index, :].squeeze()),
                         'Similar To:': similar_to})

        i = i.sort_values(by='Similarity Score', ascending=False).iloc[1:6,:]

        to_concat_list.append(i)

    similar_df = pd.concat(to_concat_list, axis=0)
    
    similar_df = similar_df.set_index(['Similar To:', 'Movie'])
    
    return similar_df
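
The function above sorts one row of the similarity matrix and skips the first result with `iloc[1:6]`, presumably because each movie is most similar to itself. The sketch below shows the same idea with `np.argsort` on a made-up 4x4 matrix. All names and values are hypothetical, and the movie's own index is excluded explicitly instead of by position.

```python
import numpy as np

# made-up titles and a symmetric similarity matrix with 1.0 on the diagonal
toy_titles = np.array(['A', 'B', 'C', 'D'])
toy_similarities = np.array([
    [1.0, 0.2, 0.7, 0.5],
    [0.2, 1.0, 0.1, 0.3],
    [0.7, 0.1, 1.0, 0.4],
    [0.5, 0.3, 0.4, 1.0],
])

def most_similar_sketch(index, k=2):
    """Hypothetical helper: the k most similar titles to the movie at `index`."""
    row = toy_similarities[index]
    # indices ordered from most to least similar
    order = np.argsort(row)[::-1]
    # drop the movie itself, then keep the top k
    order = order[order != index][:k]
    return list(zip(toy_titles[order].tolist(), row[order].tolist()))

toy_neighbours = most_similar_sketch(0, k=2)
```

For movie `'A'` the two nearest neighbours are `'C'` and `'D'`. Excluding the index explicitly stays correct even if another movie ties with a similarity of 1.0.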

## 4. Top In Each Genre

In [3]:
# reading in movie utility matrix 
movies_utility = pd.read_csv('My Datasets/movies_utility.csv', index_col=0)
# taking out the first 20 columns, these are all of the genres
movies_genres_df = movies_utility.iloc[:,:20]
# creation of new reference df, with the movieId & genres 
movies_reference_genre = pd.merge(movies_reference, movies_genres_df, on='movieId', how='left')

In [5]:
def top_in_genre():
    ''' Prompts the user for a genre, returns the top 15 films in that genre,
    ranked by number of IMDB votes and then by overall IMDB rating.'''
    
    print('Available Genre Choices:\n')
    for col in movies_genres_df.drop(columns='movieId').columns:
        print(f'{col}')

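    # match the input case-insensitively; .capitalize() alone would turn
    # 'sci-fi' into 'Sci-fi' and 'imax' into 'Imax', which are not column names
    genre_lookup = {col.lower(): col for col in movies_genres_df.drop(columns='movieId').columns}
    genre_choice = genre_lookup[input('\nWhat genre do you want to watch? ').strip().lower()]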

    best_in_genre = movies_reference_genre[['title', 'year_of_release','runtimeMinutes', 'numVotes', 'averageRating']][movies_reference_genre[genre_choice] == 1].sort_values(['numVotes', 'averageRating'], ascending=False).head(15)
    
    return best_in_genre
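
The ranking logic can be tried without the interactive prompt. The sketch below uses a made-up reference table with one-hot genre columns, filters on a genre and sorts by vote count, with rating as the tie-breaker. The frame, the helper name `top_in_genre_sketch` and its arguments are hypothetical. Genre names are matched case-insensitively because names such as `Sci-Fi` and `IMAX` contain capitals beyond the first letter.

```python
import pandas as pd

# made-up reference table with one-hot genre columns
toy_reference = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'numVotes': [500, 900, 900, 100],
    'averageRating': [8.0, 7.0, 7.5, 9.0],
    'Fantasy': [1, 1, 1, 0],
    'Sci-Fi': [0, 1, 0, 1],
})
genre_columns = ['Fantasy', 'Sci-Fi']

def top_in_genre_sketch(genre_input, n=2):
    """Hypothetical helper: top n films in a genre by votes, then rating."""
    # map lower-cased genre names back to the real column names
    lookup = {col.lower(): col for col in genre_columns}
    genre = lookup[genre_input.strip().lower()]
    # keep films flagged for the genre, rank by votes and then by rating
    in_genre = toy_reference[toy_reference[genre] == 1]
    ranked = in_genre.sort_values(['numVotes', 'averageRating'], ascending=False)
    return ranked[['title', 'numVotes', 'averageRating']].head(n)

toy_best = top_in_genre_sketch('sci-fi')
```

In the toy Fantasy genre, films `'B'` and `'C'` tie on votes, so the higher-rated `'C'` is listed first.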

## 5. Addition of New Ratings

In [16]:
def add_new_user(userId):
    
    '''Creates a new dataframe of user ratings through user input search of the existing
    reference dataframe. The new dataframe can then be added to the existing ratings
    dataset so that the model can be re-trained.'''
    
    # instantiate new empty dataframe
    new_user_df = pd.DataFrame(columns=['userId', 'movieId', 'rating'])
    
    # to keep while loop running, instantiating a variable to true.
    add_rating = True
    # while the above is true...
    while add_rating == True:
        print('Press q to quit')
        # search is the input of a string, keyword search for film title
        search = str(input('Try searching by title, pls capitalize each word: \n'))
        # if the user types 'q', program quits
        if search == 'q':
            break
        print('\nOptions listed below:\n')
        # prints out the options for that search
        print(movies_reference[['movieId', 'title']][movies_reference['title'].str.contains(search)])
        print('\nPlease choose a movie and rate (1-5):')
        # requires user specifies movie Id (read as text so 'q' can still quit)
        movieid = input("Movie ID: ")
        # quit option
        if movieid == 'q':
            break
        # requires user specifies rating
        new_rating = input("Rating: ")
        # quit option
        if new_rating == 'q':
            break
        movieid, new_rating = int(movieid), int(new_rating)
        # if the rating is out of bounds, have user re-type until it is valid
        while new_rating > 5 or new_rating < 1:
            print('Rating cannot be less than 1 or more than 5')
            new_rating = int(input("Rating: "))
        
        # creation of new row variable with information from user input
        new_row = {'userId': userId, 'movieId': movieid, 'rating': new_rating}
        
        # appending that row onto the user's rating dataframe
        # (DataFrame.append was removed in pandas 2.0, so pd.concat is used)
        new_user_df = pd.concat([new_user_df, pd.DataFrame([new_row])], ignore_index=True)
    
    return new_user_df
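
The docstring mentions adding the new ratings to the existing ratings dataset before re-training. The ratings file itself is not loaded in this notebook, so the sketch below is only an assumption about that step. It uses a made-up `existing_ratings` frame with the same three columns and a hypothetical helper `merge_ratings_sketch`.

```python
import pandas as pd

# made-up existing ratings, with the same columns add_new_user produces
existing_ratings = pd.DataFrame({
    'userId': [44, 60],
    'movieId': [2028, 3114],
    'rating': [5.0, 4.0],
})
# made-up output of add_new_user('haley')
new_user_ratings = pd.DataFrame({
    'userId': ['haley', 'haley'],
    'movieId': [260, 7451],
    'rating': [5, 5],
})

def merge_ratings_sketch(existing, new):
    """Hypothetical helper: stack new ratings under the existing ones."""
    combined = pd.concat([existing, new], ignore_index=True)
    # if a user rated the same movie twice, keep the most recent rating
    combined = combined.drop_duplicates(subset=['userId', 'movieId'], keep='last')
    return combined.reset_index(drop=True)

combined_ratings = merge_ratings_sketch(existing_ratings, new_user_ratings)
```

A combined frame like this could then be passed to Surprise's `Dataset.load_from_df` together with a `Reader(rating_scale=(1, 5))` to rebuild the training set. The 1 to 5 scale is taken from the input prompt above.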

------------

# Demo

Select two known user ids: `ratings_for_two(id1, id2)`

In [17]:
ratings_for_two(44,60)

Unnamed: 0,Title,Year Of Release,IMDB Rating /10,Couple Pred Rating /5,IMDB Vote Count,MovieId
0,Saving Private Ryan,1998,8.6,4.5,1249095,2028
1,Toy Story 2,1999,7.9,4.4,531981,3114
2,"Green Mile, The",1999,8.6,4.3,1162106,3147
3,Life Is Beautiful (La Vita è bella,1997,8.6,4.3,630999,2324
4,Apollo 13,1995,7.6,4.3,272113,150
5,Braveheart,1995,8.3,4.2,966758,110
6,October Sky,1999,7.8,4.2,83538,2501
7,Die Hard,1988,8.2,4.2,799366,1036
8,"Hunt for Red October, The",1990,7.6,4.2,182158,1610
9,"Christmas Story, A",1983,7.9,4.2,133465,2804


List of similar movies to the top 5 for the couple: `similar_items_to_top_5(id1,id2)`

In [18]:
similar_items_to_top_5(44,60)

Similar To:,Movie,Year of Release,IMDB Rating,Similarity Score
Saving Private Ryan,Schindler's List,1993,8.9,0.563621
Saving Private Ryan,"Boot, Das (Boat, The)",1981,8.3,0.477214
Saving Private Ryan,"Downfall (Untergang, Der)",2004,8.2,0.468293
Saving Private Ryan,Braveheart,1995,8.3,0.444878
Saving Private Ryan,Gladiator,2000,8.5,0.43069
Toy Story 2,Toy Story,1995,8.3,0.629941
Toy Story 2,"Monsters, Inc.",2001,8.1,0.521168
Toy Story 2,Pooh's Heffalump Movie,2005,6.4,0.512823
Toy Story 2,Toy Story 3,2010,8.2,0.501739
Toy Story 2,Pinocchio,1940,7.4,0.491354


Gives the top 15 movies in a specified genre, using user input: `top_in_genre()`

In [7]:
top_in_genre()

Available Genre Choices:

Adventure
Animation
Children
Comedy
Fantasy
Romance
Drama
Action
Crime
Thriller
Horror
Mystery
Sci-Fi
IMAX
Documentary
War
Musical
Western
Film-Noir

What genre do you want to watch? fantasy


Unnamed: 0,title,year_of_release,runtimeMinutes,numVotes,averageRating
4897,"Lord of the Rings: The Fellowship of the Ring,...",2001,178.0,1679027.0,8.8
7041,"Lord of the Rings: The Return of the King, The",2003,201.0,1659235.0,8.9
5853,"Lord of the Rings: The Two Towers, The",2002,179.0,1500435.0,8.7
6429,Pirates of the Caribbean: The Curse of the Bla...,2003,143.0,1021118.0,8.0
0,Toy Story,1995,81.0,896826.0,8.3
4790,"Monsters, Inc.",2001,92.0,823722.0,8.1
17499,Harry Potter and the Deathly Hallows: Part 2,2011,130.0,773494.0,8.1
15401,Toy Story 3,2010,103.0,764119.0,8.2
20046,"Hobbit: An Unexpected Journey, The",2012,169.0,762832.0,7.8
17040,Thor,2011,115.0,756202.0,7.0


Creates new dataframe of user ratings, through user input search: `add_new_user(userId)`

In [17]:
add_new_user('haley')

Press q to quit
Try searching by title, pls capitalize each word: 
Star Wars

Options listed below:

       movieId                                              title
257        260                 Star Wars: Episode IV - A New Hope
1171      1196     Star Wars: Episode V - The Empire Strikes Back
1184      1210         Star Wars: Episode VI - Return of the Jedi
2543      2628          Star Wars: Episode I - The Phantom Menace
5281      5378       Star Wars: Episode II - Attack of the Clones
10117    33493       Star Wars: Episode III - Revenge of the Sith
12926    61160                          Star Wars: The Clone Wars
15507    79006  Empire of Dreams: The Story of the 'Star Wars'...
20393   100089                    Star Wars Uncut: Director's Cut
22977   109713                      Star Wars: Threads of Destiny

Please choose a movie and rate (1-5):
Movie ID: 260
Rating: 5
Press q to quit
Try searching by title, pls capitalize each word: 
Mean Girls

Options listed below:

       m

Unnamed: 0,userId,movieId,rating
0,haley,260,5
1,haley,7451,5
