***
# <font color=blue size=10>Movie Recommendation Systems</font>
***

# <font color=blue>Introduction</font>
***

Welcome! This notebook contains a number of different recommenders for movies. The dataset used as a base for them is the movielens latest-small database, available at https://grouplens.org/datasets/movielens/. In order to use this notebook, you may either select "Run All" and follow the instructions or instantiate the user and run the specific recommender that you wish to use.

The <b>next steps</b> for this project include: create web app in Flask/Django and create a content based recommender.



# <font color=blue>Program Initialization</font>
***

Import required libraries: pandas and numpy. Read dataframes and inital table cleaning.

In [1]:
# Import libaries and dataframes
import pandas as pd
import numpy as np
movies = pd.read_csv("movies.csv")
movies = movies.set_index("movieId")
ratings = pd.read_csv("ratings.csv")
ratings = ratings.drop("timestamp",axis=1)
movies['mean_rating'] = ratings.groupby("movieId").mean()['rating']
movies['total_votes'] = ratings["movieId"].value_counts()
movies = movies.dropna() #Drop movies that don't appear in the ratings dataframe
last_user = ratings['userId'].max()

User class which will contain the user ratings and will be used for recommendation.

In [2]:
class User():
    def __init__(self):
        self.userId = last_user+1 #UserId
        self.most_popular = movies.sort_values(by=["total_votes"],ascending=False)
        self.most_popular = self.most_popular.head(60).index
        self.data = []
        self.get_movies_input()
        self.uratings = []
        self.umovieids = []
        for i in self.data:
            self.umovieids.append(i[0])
            self.uratings.append(i[1])
        
        self.uratings_df = pd.DataFrame({'movieId':self.umovieids,
                                         'rating':self.uratings}).set_index('movieId')
    
    #Method to receive ratings on the top 60 most popular movies.
    def get_movies_input(self): 
        proceed = True
        manual = True
        while proceed==True:
            for movie in self.most_popular:
                try:
                    auto_rating = float(input(f"Your rating for: {movies.loc[movie]['title']}"))
                    if auto_rating <= 5.0 and auto_rating >=0:
                        self.data.append([movie,auto_rating])
                    else:
                        print("Error: rating outside range 0 to 5")
                except:
                    pass
            while manual==True:
                if input("Do you want to add more movies manually? ('y' for yes)") in ['yes','y','Yes','Y']:
                    try:
                        manual_movie = int(input("Please enter movie id."))
                        manual_rating = float(input("Please enter movie rating."))
                    except:
                        print("Invalid ID or rating.")
                    if manual_movie in movies.index and manual_movie not in self.most_popular and manual_rating <= 5.0 and manual_rating >=0:
                        self.data.append([manual_movie,manual_rating])
                    if manual_movie not in movies.index:
                        print("Movie id not found.")
                    if manual_movie in self.most_popular:
                        print("Movie already computed.")
                    if manual_rating>5.0 or manual_rating<0:
                        print("Rating outside range 0 to 5")
                else:
                    manual = False            
            proceed = False
    
    def add_movies(self): #Method to add more movies
        data2=[]
        manual = True
        while manual==True:
            if input("Do you want to add more movies manually? ('y' for yes)") in ['yes','y','Yes','Y']:
                try:
                    manual_movie = int(input("Please enter movie id."))
                    manual_rating = float(input("Please enter movie rating."))
                except:
                    print("Invalid ID or rating.")
                if manual_movie in movies.index and manual_movie not in self.most_popular and manual_rating <= 5.0 and manual_rating >=0:
                    data2.append([manual_movie,manual_rating])
                if manual_movie not in movies.index:
                    print("Movie id not found.")
                if manual_movie in self.most_popular:
                    print("Movie already computed.")
                if manual_rating>5.0 or manual_rating<0:
                    print("Rating outside range 0 to 5")
            else:
                manual = False    
        for i in data2:
            self.umovieids.append(i[0])
            self.uratings.append(i[1])
        
        self.uratings_df = pd.DataFrame({'movieId':self.umovieids,
                                         'rating':self.uratings}).set_index('movieId')

In [3]:
new_user = User() #Instantiate class User

Your rating for: Forrest Gump (1994)5
Your rating for: Shawshank Redemption, The (1994)5
Your rating for: Pulp Fiction (1994)5
Your rating for: Silence of the Lambs, The (1991)3
Your rating for: Matrix, The (1999)4
Your rating for: Star Wars: Episode IV - A New Hope (1977)5
Your rating for: Jurassic Park (1993)2
Your rating for: Braveheart (1995)1
Your rating for: Terminator 2: Judgment Day (1991)2
Your rating for: Schindler's List (1993)3
Your rating for: Fight Club (1999)5
Your rating for: Toy Story (1995)1
Your rating for: Star Wars: Episode V - The Empire Strikes Back (1980)5
Your rating for: Usual Suspects, The (1995)4
Your rating for: American Beauty (1999)5
Your rating for: Seven (a.k.a. Se7en) (1995)5
Your rating for: Independence Day (a.k.a. ID4) (1996)3
Your rating for: Apollo 13 (1995)5
Your rating for: Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)2
Your rating for: Lord of the Rings: The Fellowship of the Ring, The (2001)5
Your rating for: S

Disclaimer: If you see some ratings above the first time you open this notebook, they were created randomly. The author is not responsible for any moral damage caused by seing these ratings!

# <font color=blue>First Recommender: Best Rated</font>
*** 

This simple recommender will return the N top best rated movies, excluding those already watched by the user.

In [4]:
class BestRatedRecommender:
    
    def __init__(self, user_ratings):
        best_rated = movies[(movies["total_votes"]>50)].sort_values("mean_rating",ascending=False)
        best_rated = best_rated[~best_rated.index.isin(user_ratings.index)]
        self.recommendations = best_rated
        
    def recommend(self,n=10):
        print("Here are the top {} best rated movies by our community you may have not seen yet:".format(n))
        display(self.recommendations.drop(['genres','mean_rating','total_votes'],axis=1).head(n))

In [5]:
BRR = BestRatedRecommender(new_user.uratings_df)
BRR.recommend(10)

Here are the top 10 best rated movies by our community you may have not seen yet:


Unnamed: 0_level_0,title
movieId,Unnamed: 1_level_1
1276,Cool Hand Luke (1967)
750,Dr. Strangelove or: How I Learned to Stop Worr...
904,Rear Window (1954)
1221,"Godfather: Part II, The (1974)"
48516,"Departed, The (2006)"
1213,Goodfellas (1990)
912,Casablanca (1942)
1208,Apocalypse Now (1979)
2329,American History X (1998)
1252,Chinatown (1974)


# <font color=blue>Second Recommender: Rare Pearls</font>
*** 

This recommender will return movies that don't have a lot of votes (above a minimum amount) but have high ratings.


In [6]:
class RarePearlsRecommender:

    def __init__(self, user_ratings):
        best_rated = movies[(movies["total_votes"]<=50)&(movies["total_votes"]>=10)].sort_values("mean_rating",ascending=False)
        best_rated = best_rated[~best_rated.index.isin(user_ratings.index)]
        self.recommendations = best_rated
        
    def recommend(self,n=10):
        print(f"{n} less known movies our community loves:")
        display(self.recommendations.drop(['genres','mean_rating','total_votes'],axis=1).head(n))

In [7]:
RPR = RarePearlsRecommender(new_user.uratings_df)
RPR.recommend()

10 less known movies our community loves:


Unnamed: 0_level_0,title
movieId,Unnamed: 1_level_1
1041,Secrets & Lies (1996)
3451,Guess Who's Coming to Dinner (1967)
1178,Paths of Glory (1957)
1104,"Streetcar Named Desire, A (1951)"
2360,"Celebration, The (Festen) (1998)"
1217,Ran (1985)
951,His Girl Friday (1940)
1927,All Quiet on the Western Front (1930)
3468,"Hustler, The (1961)"
922,Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)


# <font color=blue>Third Recommender: Bubble</font>
*** 

This recommender will calculate the K nearest neighbors to the user and recommend movies based 
on the best rated movies by this "bubble". Given the amount of empty values, the KNN model had
to be adapted so that it will only calculate the distance between points commom to each pair of
users. 

This of course is not a perfect representation of the distance. For example, two users may
enjoy a particular sci-fi movie because of a specific actor in it, but have different tastes
on the genre as a whole, which would cause their distance to increase have we had every information
available. 

In order to minimize this effect, the distance is only computed for pairs that have
a minimum number of movies in commom (see method get_user_distance). Also, the recommendations 
are separated in two groups: a general recommender, which only selects movies present in the 
90th quantile of neighbors_count, and another for the movies present in the 50th - 90th quantiles. These
values may be adapted. The latter recommender allows the suggestion of niche movies, over which we
don't have much confidence on whether they are relevant to the user or not (based on the explanation 
above).

In [8]:
class bubble_rec:
    
    def __init__(self, user, k = 50, n = None):
        self.n = n
        self.k = k
        self.user = user
        self.user_ratings = self.get_user_ratings(user)
        self.all_users = ratings['userId'].unique()
        if n:
            self.all_users = np.random.choice(all_users,size=n) #n allows the selection of less users in case of a big DataFrame
        
        # Calculates knn
        self.knn_df = self.modified_knn(user)
        
        #Get ratings from similar users
        self.similar_users = self.knn_df.index
        self.similar_users_ratings = ratings.set_index("userId").loc[self.similar_users]
        self.__recommendations = self.similar_users_ratings.groupby("movieId").mean()[['rating']]
        self.instances = self.similar_users_ratings.groupby("movieId").count()[['rating']]
        
        #Prepares recommendations matrix and calculates quantiles
        self.__recommendations = self.__recommendations.join(self.instances, lsuffix="_l", rsuffix="_r")
        self.__recommendations.columns = ['neighbors_ratings','neighbors_count']
        self.__recommendations = self.__recommendations.query("neighbors_count > %.2f" % 5)
        self.quantile90 = self.__recommendations.neighbors_count.quantile(q=0.9)
        self.quantile50 = self.__recommendations.neighbors_count.quantile(q=0.5)
        
        # Prepares border recommendations
        self.__border_recommendations = self.__recommendations.query("neighbors_count < %.2f" % (self.quantile90))
        self.__border_recommendations = self.__border_recommendations.query("neighbors_count >= %.2f" % (self.quantile50))        
        self.__border_recommendations = self.__border_recommendations.sort_values("neighbors_ratings", ascending=False)
        self.__border_recommendations = self.__border_recommendations.drop(self.user.umovieids,errors='ignore')
        self.__border_recommendations = self.__border_recommendations.join(movies) 
        
        # Finalizes recommendation matrix
        self.__recommendations = self.__recommendations.query("neighbors_count >= %.2f" % self.quantile90)  
        self.__recommendations = self.__recommendations.sort_values("neighbors_ratings", ascending=False)
        self.__recommendations = self.__recommendations.drop(user.umovieids,errors='ignore')
        self.__recommendations = self.__recommendations.join(movies)       
        
        
    def get_user_ratings(self,user): # Get ratings from user
        return pd.DataFrame(data={'movieId':user.umovieids,'rating':user.uratings}).set_index("movieId")
         
    def get_vector_norm(self,a,b): #Calculates vector norm
        return np.linalg.norm(a - b)
    
    def get_user_distance(self, user, userId2, minimum = 10): #Calculates the distance between two users
        user2_ratings = ratings[(ratings["userId"] == userId2)][['movieId','rating']].set_index("movieId")
        ratings_diff = self.user_ratings.join(user2_ratings, lsuffix="_l", rsuffix="_r").dropna()
        if(len(ratings_diff) < minimum): # If there are not enough movies in common between the pair, return None or big number.
            return None #[user.userId, userId2, 100000] 
        distance = self.get_vector_norm(ratings_diff['rating_l'],ratings_diff['rating_r'])
        return [user.userId, userId2, distance]  
    
    def modified_knn(self,user): #KNN method
        distances = [self.get_user_distance(user, userId2) for userId2 in self.all_users]
        distances = list(filter(None, distances)) #Filter the pairs with not enough information
        distances = pd.DataFrame(distances, columns = ["userId", "userId2", "distance"])
        distances = distances.sort_values("distance")
        distances = distances.set_index("userId2")
        return distances.head(self.k)
    
    def print_recommendations(self,n=10):
        print(f"Here are the top {n} movies people like you enjoyed:")
        display(self.__recommendations.drop(['neighbors_ratings','neighbors_count','mean_rating','total_votes','genres'],axis=1).head(n))
    

    def print_border(self,n=10):
        print(f"Here are the top {n} movies on the border of your bubble:")
        display(self.__border_recommendations.drop(['neighbors_ratings','neighbors_count','mean_rating','total_votes','genres'],axis=1).head(n))
        
        

In [9]:
rec = bubble_rec(user = new_user,k=100)

In [10]:
rec.print_recommendations(10)

Here are the top 10 movies people like you enjoyed:


Unnamed: 0_level_0,title
movieId,Unnamed: 1_level_1
1221,"Godfather: Part II, The (1974)"
7361,Eternal Sunshine of the Spotless Mind (2004)
1089,Reservoir Dogs (1992)
4973,"Amelie (Fabuleux destin d'Amélie Poulain, Le) ..."
4878,Donnie Darko (2001)


In [11]:
rec.print_border(10)

Here are the top 10 movies on the border of your bubble:


Unnamed: 0_level_0,title
movieId,Unnamed: 1_level_1
2324,Life Is Beautiful (La Vita è bella) (1997)
924,2001: A Space Odyssey (1968)
912,Casablanca (1942)
1136,Monty Python and the Holy Grail (1975)
2329,American History X (1998)
2502,Office Space (1999)
111,Taxi Driver (1976)
541,Blade Runner (1982)
750,Dr. Strangelove or: How I Learned to Stop Worr...
17,Sense and Sensibility (1995)


# <font color=blue>Fourth Recommender: Collaborative Filtering</font>
*** 

This recommender uses matrix factorization to fill out the blanks on a scarse matrix (user ratings per 
movie). The K number of latent features may be selected based on grid search hyperparameter optmization,
as it depends on the function parameters (top user limits or empty values limit). The number of steps 
is highly dependent on K.

This model uses stochastic gradient descent, as opposed to the batch algorithm further below. This means
that the matrices that contain the latent features values are updated for each element of the pivot table,
instead of calculating the gradient descent for all values. Therefore, epoch is used as a term instead of
steps, and in each epoch the algorithm runs through all non-null elements of the pivot table. Due to the relatively small dataset for SGD, we need to run through all elements more than 10 epochs.

The model calculates the root mean squared error, but it may also be evaluated by hiding part of the user 
ratings and comparing the model predictions. This requires the model to re-run due to the nature of matrix 
factorization. Precision@k and recall@k are returned. The best way to assert relevancy is user feedback.
In the absence of this, a value of 3.5 was defined as to determine whether the recommendation is 
relevant or not to the user. This model may be further developed by including L2 regularization. 

In [12]:
class StochasticGradientDescentRecommender:

    def __init__(self, user,K=30,epochs=10):
        self.user = user.userId
        self.new_user_ratings = user.uratings_df
        self.new_user_ratings['userId'] = self.user
        self.new_user_ratings = self.new_user_ratings.reset_index()
        self.K = K
        self.n = 0
        self.sse = 0
        self.mse = 0
        self.epochs = epochs
        self.original_alpha = 0.01        #The learning rate
        self.alpha = self.original_alpha 
        self.gamma = 0.99                 #The learning rate schedule
        self.epochcount = 0
        self.epochcount = self.epochs
        self.empty_values_limit = 12
        self.top_users_limit = 25
        self.break_limit = 0.2
        
    # Method for start model processing.
    
    def run(self, user_ratings=None, print_epochs=True, evaluating=False,ignore_break=False):
        
        #Resets alpha in case of second+ run.
        self.alpha_setter(self.original_alpha)
        
        #This is to allow for both the first run and evaluation run.
        if type(user_ratings) == type(None): 
            user_ratings = self.new_user_ratings
            
        # Get pivot table.
        self.ratings_df, self.pivot_df = self.__dataframe_reduction(user_ratings, ratings, movies)
        
        
        # This loop limits the amount of empty values in the pivot table in order to facilitate a converging descent.
        # However, this also limits the number of movies available for recommendation.
        for column in self.pivot_df.columns:
            try:
                if self.pivot_df[column].value_counts()[0]>self.empty_values_limit:
                    self.pivot_df = self.pivot_df.drop(column, axis=1)
            except:
                pass
         
        # Run the matrix factorization and calculate the complete table
        self.new_P, self.new_Q = self.__SGD_MatFac(self.pivot_df,print_epochs,ignore_break)
        self.new_R = np.dot(self.new_P,self.new_Q.T)
        
        #Prepare recommendation tables
        self.__recommend(self.new_R, self.pivot_df)
        
        if not evaluating:
            self.get_recommendations(n=10)

    #Calculates the pivot table based on user ratings. Due to the size of the original dataframes, the number of movies
    # and users considered is reduced to decrease processing time. 
    
    def __dataframe_reduction(self, user_ratings, ratings_df, movies_df): 
              
        ratings_df = ratings_df.drop(ratings_df[ratings_df["movieId"].isin(movies_df[movies_df.total_votes<20].index)].index)
        top_users = ratings.groupby(['userId'])['userId'].count()
        top_users = top_users.sort_values(ascending=False)
        top_users = top_users.head(self.top_users_limit)
        ratings_df = ratings_df[ratings_df['userId'].isin(top_users.index)]
        ratings_df = ratings_df.append(user_ratings)
        pivot_df = ratings_df.pivot(index='userId',columns='movieId', values='rating').fillna(0)
        return ratings_df, pivot_df

    
    # Stochastic Gradient Descent Matrix Factorization
    
    def __SGD_MatFac(self,R,print_epochs=True,ignore_break=False):
    
        print('Model run start.')
        if type(R) == pd.core.frame.DataFrame:
            R = R.values
            
        N = len(R)
        M = len(R[0])
        P = np.random.rand(N,self.K)*(5/self.K)
        Q = np.random.rand(M,self.K)*(5/self.K)
        Q=Q.T
        self.n = np.count_nonzero(R==0) #N for calculation of MSE/RMSE
        
        self.sample_index_list = []
        for i in range(N):
            for j in range(M):
                self.sample_index_list.append([i,j])
        np.random.shuffle(self.sample_index_list)
        
        
        for epoch in range(self.epochs):   
            e=0
            for i,j in self.sample_index_list:  
                if R[i][j]>0: #The error is only calculated for the known values.
                    eij = R[i][j]-np.dot(P[i,:],Q[:,j])
                    for k in range(self.K):
                        P[i][k] = P[i][k] + self.alpha * (eij * Q[k][j])
                        Q[k][j] = Q[k][j] + self.alpha * (eij * P[i][k])            
                    e = e+pow(R[i][j] - np.dot(P[i,:],Q[:,j]), 2)
            self.sse = e
            self.mse = self.sse/self.n
            self.rmse = np.sqrt(self.mse)
            if not ignore_break:
                if self.rmse<self.break_limit:
                    print('Breaking at Epoch: {}, SSE {:.2f}, RMSE: {:.4f}'.format(epoch,self.sse,self.rmse))
                    break
            if print_epochs:
                print('Epoch: {}, RMSE {:.4f}'.format(epoch,self.rmse))
            self.alpha_setter(self.alpha*self.gamma)
        print('Run complete.')
        return P, Q.T

    # Prepares recommendation matrix
    
    def __recommend(self, predicted_matrix, original_matrix):
        user_rated = original_matrix.loc[self.user]
        user_rated = user_rated.reset_index()
        user_predictions = pd.DataFrame(pd.DataFrame(predicted_matrix, index= original_matrix.index).loc[self.user])
        recommend_matrix = pd.DataFrame(user_rated).join(user_predictions,lsuffix='l', rsuffix='r').set_index('movieId')
        recommend_matrix.columns = ['original','predictions']
        self.recommend_matrix = recommend_matrix
        recommendations = recommend_matrix[recommend_matrix['original']==0].drop('original',axis=1).sort_values('predictions',ascending=False)
        recommendations_final = recommendations.join(movies).drop(['predictions','total_votes'],axis=1).head(10)
        self.recommendations_final = recommendations_final
        self.comparison_matrix = self.recommend_matrix[self.recommend_matrix.original!=0]
        
    # Setter for alpha (aka learning rate)
    
    def alpha_setter(self,new_alpha):
        self.alpha=new_alpha
        
    # Method to continue processing the model with additional epochs. 
    
    def continue_optimization(self,more_epochs,print_epochs=True,ignore_break=False):
        
        print('Restarting model.')
        R = self.pivot_df.values
        N = len(R)
        M = len(R[0])
        P = self.new_P
        Q = self.new_Q
        Q=Q.T
        
        
        for epoch in range(more_epochs): 
            e=0
            for i,j in self.sample_index_list:  
                    if R[i][j]>0: #The error is only calculated for the known values.
                        eij = R[i][j]-np.dot(P[i,:],Q[:,j])
                        for k in range(self.K):
                            P[i][k] = P[i][k] + self.alpha * (eij * Q[k][j])
                            Q[k][j] = Q[k][j] + self.alpha * (eij * P[i][k])
                        e = e+pow(R[i][j] - np.dot(P[i,:],Q[:,j]), 2)
            self.sse = e
            self.mse = self.sse/self.n
            self.rmse = np.sqrt(self.mse)
            if not ignore_break:
                if self.rmse<self.break_limit:
                    print('Breaking at epoch: {}, SSE {:.2f}, RMSE: {:.4f}'.format(epoch,self.sse,self.rmse))
                    break
            if print_epochs:
                print('Epoch: {}, RMSE {:.4f}'.format(self.epochcount,self.rmse))
            self.epochcount+=1
            self.alpha_setter(self.alpha*self.gamma)
        
        self.new_P = P
        self.new_Q = Q.T
        self.new_R = np.dot(self.new_P,self.new_Q.T)
        self.__recommend(self.new_R, self.pivot_df)
        print("Run complete.")
   
    # Prints results.
    
    def get_recommendations(self,n=10):        
        print("-------------------------------------------")
        print("Based on our community votes database, you may like:")
        display(self.recommendations_final.head(n))
    
    # Creates evaluation matrix by separting the available user ratings in "train" and "test" splits.
    
    def create_evaluation_matrix(self, sample_size,print_epochs,ignore_break=False):
        ratings_sample = self.new_user_ratings.sample(int(self.new_user_ratings.shape[0]*sample_size),random_state=42)['movieId']
        test_ratings = self.new_user_ratings.copy()
        for i in ratings_sample.values:
            test_ratings.loc[test_ratings['movieId'] == i,'rating']=0
        self.run(test_ratings, print_epochs=print_epochs,evaluating=True,ignore_break=ignore_break)
        self.evaluation_matrix = self.recommend_matrix[self.recommend_matrix['original']==0].sort_values('predictions',ascending=False)
        self.evaluation_matrix  = self.evaluation_matrix.join(self.new_user_ratings.set_index('movieId'), how='left').drop(['userId','original'],axis=1).dropna()       
        ##### Interpolation may be used if the values are not expanding towards the extremes (0 and 5).
#         interpolation_list = self.evaluation_matrix['predictions'].to_numpy()
#         self.evaluation_matrix['predictions'] = np.interp(interpolation_list, (interpolation_list.min(),interpolation_list.max()),(0,5))
        #####
        self.evaluation_values = self.evaluation_matrix.values
      
    # Method to evaluate the model based on Precision@k and Recall@k.
    
    def evaluate_model(self, relevancy=3.5, sample_size=0.5, epochs=50,print_epochs=False,ignore_break=False):                
        self.alpha_setter(self.original_alpha)
        self.epochs = epochs
        self.create_evaluation_matrix(sample_size,print_epochs,ignore_break)
        TP, FP, FN, TN = [0,0,0,0]
        k = int(self.new_user_ratings.shape[0]*sample_size)
        for i in range(len(self.evaluation_values)):
            if self.evaluation_values[i, 0]>=relevancy and self.evaluation_values[i, 1]>=relevancy:
                TP+=1
            elif self.evaluation_values[i, 0]>=relevancy and self.evaluation_values[i, 1]<relevancy:
                FN+=1
            elif self.evaluation_values[i, 0]<relevancy and self.evaluation_values[i, 1]>=relevancy:
                FP+=1
            else:
                TN+=1
            try:
                if i==9:
                    recall_at_10 = TP/(TP+FN)
                    precision_at_10 = TP/(TP+FP)
            except:
                pass
                
        recall_at_k = TP/(TP+FN)
        precision_at_k = TP/(TP+FP)
        print("-------------------------------------------")
        print("K: {},TP: {}, FP: {}, FN: {}, TN:{}".format(k, TP,FP,FN,TN))
        print("For a relevancy of {}, our model has a Precision@{} of {:.2f}% and a Recall@{} of {:.2f}%".format(relevancy, k, precision_at_k*100, k, recall_at_k*100))
        if k>15:
            try:
                print("For a relevancy of {}, our model has a Precision@10 of {:.2f}% and a Recall@10 of {:.2f}%".format(relevancy, precision_at_10*100, recall_at_10*100))
            except:
                pass
    
    # Returns the highest prediction value. Also, if there are too many empty
    # values in the original pivot table, the probabillity of one of the values of the complete table being too high
    # increases, due to a random chance of the dot product of the matrices containg the latent features not being
    # properly optmized.
    
    def get_max_prediction(self):
        print(f"Max user prediction {self.recommend_matrix[self.recommend_matrix['original']==0]['predictions'].max()}")
        print(f"Max general prediction {self.new_R.max()}")

In [16]:
SGDR1 = StochasticGradientDescentRecommender(new_user,epochs=50,K=20)


In [17]:
SGDR1.run()

Model run start.
Epoch: 0, RMSE 2.6481
Epoch: 1, RMSE 1.0007
Epoch: 2, RMSE 0.9846
Epoch: 3, RMSE 0.9813
Epoch: 4, RMSE 0.9786
Epoch: 5, RMSE 0.9753
Epoch: 6, RMSE 0.9707
Epoch: 7, RMSE 0.9646
Epoch: 8, RMSE 0.9563
Epoch: 9, RMSE 0.9455
Epoch: 10, RMSE 0.9319
Epoch: 11, RMSE 0.9152
Epoch: 12, RMSE 0.8955
Epoch: 13, RMSE 0.8731
Epoch: 14, RMSE 0.8485
Epoch: 15, RMSE 0.8224
Epoch: 16, RMSE 0.7955
Epoch: 17, RMSE 0.7682
Epoch: 18, RMSE 0.7409
Epoch: 19, RMSE 0.7138
Epoch: 20, RMSE 0.6871
Epoch: 21, RMSE 0.6610
Epoch: 22, RMSE 0.6355
Epoch: 23, RMSE 0.6110
Epoch: 24, RMSE 0.5874
Epoch: 25, RMSE 0.5649
Epoch: 26, RMSE 0.5434
Epoch: 27, RMSE 0.5231
Epoch: 28, RMSE 0.5038
Epoch: 29, RMSE 0.4856
Epoch: 30, RMSE 0.4684
Epoch: 31, RMSE 0.4522
Epoch: 32, RMSE 0.4369
Epoch: 33, RMSE 0.4224
Epoch: 34, RMSE 0.4088
Epoch: 35, RMSE 0.3960
Epoch: 36, RMSE 0.3840
Epoch: 37, RMSE 0.3727
Epoch: 38, RMSE 0.3621
Epoch: 39, RMSE 0.3522
Epoch: 40, RMSE 0.3429
Epoch: 41, RMSE 0.3342
Epoch: 42, RMSE 0.3261
Epoc

Unnamed: 0_level_0,title,genres,mean_rating
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1206,"Clockwork Orange, A (1971)",Crime|Drama|Sci-Fi|Thriller,3.995833
923,Citizen Kane (1941),Drama|Mystery,4.043478
1221,"Godfather: Part II, The (1974)",Crime|Drama,4.25969
924,2001: A Space Odyssey (1968),Adventure|Drama|Sci-Fi,3.894495
2502,Office Space (1999),Comedy|Crime,4.090426
1089,Reservoir Dogs (1992),Crime|Mystery|Thriller,4.20229
750,Dr. Strangelove or: How I Learned to Stop Worr...,Comedy|War,4.268041
51662,300 (2007),Action|Fantasy|War|IMAX,3.68125
1208,Apocalypse Now (1979),Action|Drama|War,4.219626
1199,Brazil (1985),Fantasy|Sci-Fi,4.177966


In [21]:
SGDR1.evaluate_model(relevancy=3.5, epochs=50,sample_size=0.5,print_epochs=True,ignore_break=True)

Model run start.
Epoch: 0, RMSE 2.6512
Epoch: 1, RMSE 0.9938
Epoch: 2, RMSE 0.9770
Epoch: 3, RMSE 0.9727
Epoch: 4, RMSE 0.9698
Epoch: 5, RMSE 0.9672
Epoch: 6, RMSE 0.9641
Epoch: 7, RMSE 0.9601
Epoch: 8, RMSE 0.9548
Epoch: 9, RMSE 0.9478
Epoch: 10, RMSE 0.9388
Epoch: 11, RMSE 0.9273
Epoch: 12, RMSE 0.9131
Epoch: 13, RMSE 0.8959
Epoch: 14, RMSE 0.8757
Epoch: 15, RMSE 0.8527
Epoch: 16, RMSE 0.8272
Epoch: 17, RMSE 0.7999
Epoch: 18, RMSE 0.7715
Epoch: 19, RMSE 0.7426
Epoch: 20, RMSE 0.7140
Epoch: 21, RMSE 0.6861
Epoch: 22, RMSE 0.6594
Epoch: 23, RMSE 0.6341
Epoch: 24, RMSE 0.6103
Epoch: 25, RMSE 0.5880
Epoch: 26, RMSE 0.5672
Epoch: 27, RMSE 0.5477
Epoch: 28, RMSE 0.5296
Epoch: 29, RMSE 0.5126
Epoch: 30, RMSE 0.4968
Epoch: 31, RMSE 0.4819
Epoch: 32, RMSE 0.4678
Epoch: 33, RMSE 0.4546
Epoch: 34, RMSE 0.4421
Epoch: 35, RMSE 0.4302
Epoch: 36, RMSE 0.4190
Epoch: 37, RMSE 0.4082
Epoch: 38, RMSE 0.3980
Epoch: 39, RMSE 0.3882
Epoch: 40, RMSE 0.3789
Epoch: 41, RMSE 0.3699
Epoch: 42, RMSE 0.3613
Epoc

## Batch gradient descent (without stochastic approach)

In [None]:
class GradientDescentRecommender:
    ''' This recommender uses matrix factorization to fill out the blanks on a scarse matrix (user ratings per 
        movie). The K number of latent features may be selected based on grid search hyperparameter optmization,
        as it depends on the function parameters (top user limits or empty values limit). The number of steps 
        is highly dependent on K. Increasing the number of steps may cause the gradient descent to diverge. Before 
        that, it also causes the algorithm to perform poorly on predictions. 
        
        The model calculates the root mean squared error, but it may also be evaluated by hiding part of the user 
        ratings and comparing the model predictions. This requires the model to re-run due to the nature of matrix 
        factorization. Precision@k and recall@k are returned. The best way to assert relevancy is user feedback.
        In the absence of this, a value of 3.5 was defined as to determine whether the recommendation is 
        relevant or not to the user. This model may be further developed by including L2 regularization. 
        '''
    
    
    def __init__(self, user,K=20,steps=500):
        self.user = user.userId
        self.new_user_ratings = user.uratings_df
        self.new_user_ratings['userId'] = self.user
        self.new_user_ratings = self.new_user_ratings.reset_index()
        self.K = K
        self.n = 0
        self.sse = 0
        self.mse = 0
        self.steps = steps
        self.alpha = 0.0005
        self.stepcount = 0
        self.stepcount = self.steps
        self.empty_values_limit = 25
        self.top_users_limit = 40
        
    # Method for start model processing.
    
    def run(self, user_ratings=None, print_steps=True, evaluating=False):
        
        #T his is to allow for both the first run and evaluation run.
        if type(user_ratings) == type(None): 
            user_ratings = self.new_user_ratings
            
        # Get pivot table.
        self.ratings_df, self.pivot_df = self.__dataframe_reduction(user_ratings, ratings, movies)
        
        
        # This loop limits the amount of empty values in the pivot table in order to facilitate a converging descent.
        # However, this also limits the number of movies available for recommendation.
        for column in self.pivot_df.columns:
            if self.pivot_df[column].value_counts()[0]>self.empty_values_limit:
                self.pivot_df = self.pivot_df.drop(column, axis=1)
         
        # Run the matrix factorization and calculate the complete table
        self.new_P, self.new_Q = self.__GD_MatFac(self.pivot_df,print_steps)
        self.new_R = np.dot(self.new_P,self.new_Q.T)
        
        #Prepare recommendation tables
        self.__recommend(self.new_R, self.pivot_df)
        
        if not evaluating:
            self.get_recommendations(n=10)

    #Calculates the pivot table based on user ratings. Due to the size of the original dataframes, the number of movies
    # and users considered is reduced to decrease processing time. 
    
    def __dataframe_reduction(self, user_ratings, ratings_df, movies_df): 
              
        ratings_df = ratings_df.drop(ratings_df[ratings_df["movieId"].isin(movies_df[movies_df.total_votes<50].index)].index)
        top_users = ratings.groupby(['userId'])['userId'].count()
        top_users = top_users.sort_values(ascending=False)
        top_users = top_users.head(self.top_users_limit)
        ratings_df = ratings_df[ratings_df['userId'].isin(top_users.index)]
        ratings_df = ratings_df.append(user_ratings)
        pivot_df = ratings_df.pivot(index='userId',columns='movieId', values='rating').fillna(0)
        return ratings_df, pivot_df

    
    #Gradient Descent Matrix Factorization
    
    def __GD_MatFac(self,R,print_steps=True):
    
        print('Model run start.')
        if type(R) == pd.core.frame.DataFrame:
            R = R.values
            
        N = len(R)
        M = len(R[0])
        P = np.random.rand(N,self.K)*(5/self.K)
        Q = np.random.rand(M,self.K)*(5/self.K)
        Q=Q.T
        self.n = np.count_nonzero(R==0) #N for calculation of MSE/RMSE
        
        for step in range(self.steps):
            for i in range(N):
                for j in range(M):
                    if R[i][j]>0: #The error is only calculated for the known values.
                        eij = R[i][j]-np.dot(P[i,:],Q[:,j])
                        for k in range(self.K):
                            P[i][k] = P[i][k] + self.alpha * (eij * Q[k][j])
                            Q[k][j] = Q[k][j] + self.alpha * (eij * P[i][k])
            e=0
            for i in range(N):
                for j in range(M):
                    if R[i][j]>0:
                        e = e+pow(R[i][j] - np.dot(P[i,:],Q[:,j]), 2)
            self.sse = e
            self.mse = self.sse/self.n
            self.rmse = np.sqrt(self.mse)
            
            if self.rmse<0.5:
                print('Breaking at Step: {}, SSE {:.2f}, RMSE: {:.4f}'.format(step,self.sse,self.rmse))
                break
        
            if print_steps:
                print('Step: {}, SSE {:.2f}, RMSE: {:.4f}'.format(step,self.sse,self.rmse))

        print('Run complete.')
        return P, Q.T

    # Prepares recommendation matrix
    
    def __recommend(self, predicted_matrix, original_matrix):
        user_rated = original_matrix.loc[self.user]
        user_rated = user_rated.reset_index()
        user_predictions = pd.DataFrame(pd.DataFrame(predicted_matrix, index= original_matrix.index).loc[self.user])
        recommend_matrix = pd.DataFrame(user_rated).join(user_predictions,lsuffix='l', rsuffix='r').set_index('movieId')
        recommend_matrix.columns = ['original','predictions']
        self.recommend_matrix = recommend_matrix
        recommendations = recommend_matrix[recommend_matrix['original']==0].drop('original',axis=1).sort_values('predictions',ascending=False)
        recommendations_final = recommendations.join(movies).drop(['predictions','total_votes'],axis=1).head(10)
        self.recommendations_final = recommendations_final
        self.comparison_matrix = self.recommend_matrix[self.recommend_matrix.original!=0]
        
    # Setter for alpha (aka learning rate)
    
    def alpha_setter(self,new_alpha):
        self.alpha=new_alpha
        
    # Method to continue processing the model with additional steps. 
    
    def continue_optimization(self,more_steps,print_steps=True):
        
        print('Restarting model.')
        R = self.pivot_df.values
        N = len(R)
        M = len(R[0])
        P = self.new_P
        Q = self.new_Q
        Q=Q.T
        
        
        for step in range(more_steps):
            
            for i in range(N):
                for j in range(M):
                    if R[i][j]>0:
                        eij = R[i][j]-np.dot(P[i,:],Q[:,j])
                        for k in range(self.K):

                            P[i][k] = P[i][k] + self.alpha * (eij * Q[k][j])
                            Q[k][j] = Q[k][j] + self.alpha * (eij * P[i][k])
            e=0
            for i in range(N):
                for j in range(M):
                    if R[i][j]>0:
                        e = e+pow(R[i][j] - np.dot(P[i,:],Q[:,j]), 2)
            self.sse = e
            self.mse = self.sse/self.n
            self.rmse = np.sqrt(self.mse)
                        
            if self.rmse<0.5:
                break
            if print_steps:
                print('Step: {}, SSE {:.2f}, RMSE: {:.4f}'.format(self.stepcount,self.sse,self.rmse))
            self.stepcount+=1
        
        self.new_P = P
        self.new_Q = Q.T
        self.new_R = np.dot(self.new_P,self.new_Q.T)
        self.__recommend(self.new_R, self.pivot_df)
        print("Run complete.")
   
    #Prints results.
    
    def get_recommendations(self,n=10):        
        print("-------------------------------------------")
        print("Based on our community votes database, you may like:")
        display(self.recommendations_final.head(n))
    
    # Creates evaluation matrix by separting the available user ratings in "train" and "test" splits.
    
    def create_evaluation_matrix(self, sample_size,print_steps):
        ratings_sample = self.new_user_ratings.sample(int(self.new_user_ratings.shape[0]*sample_size),random_state=42)['movieId']
        test_ratings = self.new_user_ratings.copy()
        for i in ratings_sample.values:
            test_ratings.loc[test_ratings['movieId'] == i,'rating']=0
        self.run(test_ratings, print_steps=print_steps,evaluating=True)
        self.evaluation_matrix = self.recommend_matrix[self.recommend_matrix['original']==0].sort_values('predictions',ascending=False)
        self.evaluation_matrix  = self.evaluation_matrix.join(self.new_user_ratings.set_index('movieId'), how='left').drop(['userId','original'],axis=1).dropna()       
        self.evaluation_values = self.evaluation_matrix.values
      
    # Method to evaluate the model based on Precision@k and Recall@k.
    
    def evaluate_model(self, relevancy=3.5, sample_size=0.5, steps=50,print_steps=False):                
        self.steps = steps
        self.create_evaluation_matrix(sample_size,print_steps)
        TP, FP, FN, TN = [0,0,0,0]
        k = int(self.new_user_ratings.shape[0]*sample_size)
        for i in range(len(self.evaluation_values)):
            if self.evaluation_values[i, 0]>=relevancy and self.evaluation_values[i, 1]>=relevancy:
                TP+=1
            elif self.evaluation_values[i, 0]>=relevancy and self.evaluation_values[i, 1]<relevancy:
                FN+=1
            elif self.evaluation_values[i, 0]<relevancy and self.evaluation_values[i, 1]>=relevancy:
                FP+=1
            else:
                TN+=1
            try:
                if i==9:
                    recall_at_10 = TP/(TP+FN)
                    precision_at_10 = TP/(TP+FP)
            except:
                pass
                
        recall_at_k = TP/(TP+FN)
        precision_at_k = TP/(TP+FP)
        print("-------------------------------------------")
        print("K: {},TP: {}, FP: {}, FN: {}, TN:{}".format(k, TP,FP,FN,TN))
        print("For a relevancy of {}, our model has a Precision@{} of {:.2f}% and a Recall@{} of {:.2f}%".format(relevancy, k, precision_at_k*100, k, recall_at_k*100))
        if k>15:
            try:
                print("For a relevancy of {}, our model has a Precision@10 of {:.2f}% and a Recall@10 of {:.2f}%".format(relevancy, precision_at_10*100, recall_at_10*100))
            except:
                pass
    
    # Returns the highest prediction value. Also, if there are too many empty
    # values in the original pivot table, the probabillity of one of the values of the complete table being too high
    # increases, due to a random chance of the dot product of the matrices containg the latent features not being
    # properly optmized.
    
    def get_max_prediction(self):
        print(f"Max user prediction {self.recommend_matrix[self.recommend_matrix['original']==0]['predictions'].max()}")
        print(f"Max general prediction {self.new_R.max()}")

# <font color=blue>Fifth Recommender: Since you watched (Content Based)</font>
***
# (WIP)