<a href="https://colab.research.google.com/github/gopal2812/mlblr/blob/master/recommendationengine.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
from google.colab import drive
drive.mount('/content/gdrive')
import numpy as np
import pandas as pd
import pickle


!ls -al 'gdrive/My Drive/recommendation'
#root_path = 'gdrive/My Drive/Ex_Files_Machine_Learning_EssT_ValueEstimate/'  #change dir to your project folder
data = pd.read_csv('gdrive/My Drive/recommendation/movie_ratings_data_set.csv')
data.describe()

Because we model the ratings matrix as the product of the user attributes matrix (U) and the movie attributes matrix (M), we can work backwards and use matrix factorization to find values for U and M. In code, we use an algorithm called low rank matrix factorization to do this. (The code below names the two matrices P and Q.)

Matrix factorization is the idea that a large matrix can be broken down into smaller matrices. Given a large matrix of numbers, our goal is to find two smaller matrices that multiply together to reproduce it. If you happen to be an expert in linear algebra, you might know that there are standard ways to factor a matrix, such as singular value decomposition. But those can't be applied directly here, because we only know some of the values in the large matrix. Many of its entries are blank because users haven't yet reviewed those movies.

So, instead of trying to directly factor the ratings array into two smaller matrices, we'll estimate values for the smaller matrices using an iterative algorithm. We'll guess and check until we get close to the right answer. Here's how it works:

First, we create the U and M matrices and set all the values to random numbers. Because U and M are full of random numbers, multiplying them at this point gives a random result.

Next, we check how different our calculated ratings matrix is from the real ratings matrix with the current values for U and M. We ignore all the spots in the ratings matrix where we don't have data and only look at the spots where we have actual user reviews. We call this difference the cost. The cost is how wrong we are.

Then we use a numerical optimization algorithm to search for the minimum cost. It tweaks the numbers in U and M a little at a time, with the goal of getting the cost a little closer to zero at each step. The function we'll use is called fmin_cg, provided by the SciPy library. It searches for the inputs that make a function return the smallest possible output, using the conjugate gradient method.

Finally, fmin_cg iterates, often hundreds of times, until the cost is as small as it can get. The values of U and M at that point are the ones we'll use. Since they're just approximations, they won't be perfect. When we multiply U and M to calculate movie ratings and check the result against the original movie ratings, we'll see that there's still a bit of difference. But as long as we get pretty close, that small difference won't matter.
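To make the "ignore the blanks" part of the cost concrete, here is a minimal sketch. The tiny ratings matrix, the `masked_cost` helper and the prediction matrices are all made up for illustration; it leaves out the regularization term that the full cost function in this notebook adds.

```python
import numpy as np

# Hypothetical 3-user x 2-movie ratings matrix; NaN marks a missing review.
ratings = np.array([[5.0, np.nan],
                    [4.0, 2.0],
                    [np.nan, 1.0]])


def masked_cost(predicted, ratings):
    """Half the sum of squared errors, counted only where a real rating exists."""
    mask = ~np.isnan(ratings)                       # True where we have a review
    errors = mask * (predicted - np.nan_to_num(ratings))
    return np.sum(np.square(errors)) / 2


# A prediction that matches every known rating. The values in the
# blank spots (99.0 and -7.0) are arbitrary and do not affect the cost.
perfect = np.array([[5.0, 99.0],
                    [4.0, 2.0],
                    [-7.0, 1.0]])
print(masked_cost(perfect, ratings))  # 0.0

# Being off by 2 stars on one known rating costs (2 ** 2) / 2.
off_by_two = perfect.copy()
off_by_two[1, 1] = 4.0
print(masked_cost(off_by_two, ratings))  # 2.0
```

The optimizer only ever sees this one number, so anything hidden by the mask cannot pull U and M in any direction.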

In [0]:
import numpy as np
from scipy.optimize import fmin_cg


def normalize_ratings(ratings):
    """
    Given an array of user ratings, subtract the mean of each product's ratings
    :param ratings: 2d array of user ratings
    :return: (normalized ratings array, the calculated means)
    """
    mean_ratings = np.nanmean(ratings, axis=0)
    return ratings - mean_ratings, mean_ratings


def cost(X, *args):
    """
    Cost function for low rank matrix factorization
    :param X: The matrices being factored (P and Q) rolled up as a contiguous array
    :param args: Array containing (num_users, num_products, num_features, ratings, mask, regularization_amount)
    :return: The cost with the current P and Q matrices
    """
    num_users, num_products, num_features, ratings, mask, regularization_amount = args

    # Unroll P and Q
    P = X[0:(num_users * num_features)].reshape(num_users, num_features)
    Q = X[(num_users * num_features):].reshape(num_products, num_features)
    Q = Q.T

    # Calculate current cost
    return (np.sum(np.square(mask * (np.dot(P, Q) - ratings))) / 2) + ((regularization_amount / 2.0) * np.sum(np.square(Q.T))) + ((regularization_amount / 2.0) * np.sum(np.square(P)))


def gradient(X, *args):
    """
    Calculate the cost gradients with the current P and Q.
    :param X: The matrices being factored (P and Q) rolled up as a contiguous array
    :param args: Array containing (num_users, num_products, num_features, ratings, mask, regularization_amount)
    :return: The gradient with the current X
    """
    num_users, num_products, num_features, ratings, mask, regularization_amount = args

    # Unroll P and Q
    P = X[0:(num_users * num_features)].reshape(num_users, num_features)
    Q = X[(num_users * num_features):].reshape(num_products, num_features)
    Q = Q.T

    # Calculate the current gradients for both P and Q
    P_grad = np.dot((mask * (np.dot(P, Q) - ratings)), Q.T) + (regularization_amount * P)
    Q_grad = np.dot((mask * (np.dot(P, Q) - ratings)).T, P) + (regularization_amount * Q.T)

    # Return the gradients as one rolled-up array as expected by fmin_cg
    return np.append(P_grad.ravel(), Q_grad.ravel())


def low_rank_matrix_factorization(ratings, mask=None, num_features=15, regularization_amount=0.01):
    """
    Factor a ratings array into two latent feature arrays (user features and product features)

    :param ratings: Matrix with user ratings to factor
    :param mask: A binary mask of which ratings are present in the ratings array to factor
    :param num_features: Number of latent features to generate for users and products
    :param regularization_amount: How much regularization to apply
    :return: (P, Q) - the factored latent feature arrays
    """
    num_users, num_products = ratings.shape

    # If no mask is provided, consider all 'NaN' elements as missing and create a mask.
    if mask is None:
        mask = np.invert(np.isnan(ratings))

    # Replace NaN values with zero
    ratings = np.nan_to_num(ratings)

    # Create P and Q and fill with random numbers to start
    np.random.seed(0)
    P = np.random.randn(num_users, num_features)
    Q = np.random.randn(num_products, num_features)

    # Roll up P and Q into a contiguous array as fmin_cg expects
    initial = np.append(P.ravel(), Q.ravel())

    # Create an args array as fmin_cg expects
    args = (num_users, num_products, num_features, ratings, mask, regularization_amount)

    # Call fmin_cg to minimize the cost function and thus find the best values for P and Q
    X = fmin_cg(cost, initial, fprime=gradient, args=args, maxiter=3000)

    # Unroll the new P and new Q arrays out of the contiguous array returned by fmin_cg
    nP = X[0:(num_users * num_features)].reshape(num_users, num_features)
    nQ = X[(num_users * num_features):].reshape(num_products, num_features)

    return nP, nQ.T


def RMSE(real, predicted):
    """
    Calculate the root mean squared error between a matrix of real ratings and predicted ratings
    :param real: A matrix containing the real ratings (with 'NaN' for any missing elements)
    :param predicted: A matrix of predictions
    :return: The RMSE as a float
    """
    return np.sqrt(np.nanmean(np.square(real - predicted)))
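
As a quick check of how the RMSE calculation treats missing reviews, here is a small sketch with made-up numbers. The lowercase `rmse` helper simply repeats the one-line formula above so the example runs on its own; the two arrays are hypothetical.

```python
import numpy as np


def rmse(real, predicted):
    """RMSE over the entries of `real` that are not NaN."""
    return np.sqrt(np.nanmean(np.square(real - predicted)))


# Hypothetical data: one of the four real ratings is missing.
real = np.array([[5.0, np.nan],
                 [3.0, 1.0]])
predicted = np.array([[4.0, 2.0],
                      [3.0, 3.0]])

# The known entries have squared errors 1, 0 and 4; the NaN entry is skipped,
# so the result is sqrt(5 / 3).
error = rmse(real, predicted)
print(round(float(error), 3))  # 1.291
```

Because `np.nanmean` divides by the number of non-NaN entries, the prediction for the unrated movie has no influence on the score.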

Recommendation systems work great when the user has already entered lots of reviews, but for first-time users we don't know enough yet to make personalized recommendations. There are three ways we can try to work around this problem. First, we could just not make any recommendations for new users. For some applications it might be okay to wait until the user has reviewed products before making recommendations. A second approach is to use product similarity to suggest similar products to users who haven't rated anything, instead of making personalized recommendations. A third option is to use the average rating of products to make recommendations. In other words, we recommend the products that have the best overall ratings to new users. This can be helpful because some movies are just generally considered better than others. If a movie has a 5 star average rating across all users, that's probably a better movie to recommend to a brand new user than a movie that has a 1 star average rating across all users.

To take average ratings into account we just need a small tweak to our recommendation algorithm. Suppose we have five ratings for the same movie from five different users. First we calculate the average rating for the movie across all users. In this case, the average rating for the movie is 4.2 out of 5. Next we subtract the average rating from each user's rating. For user number one, instead of recording the rating as 4, we subtract 4.2 and record it as -0.2. The idea is that this user rated the movie 0.2 stars under the average rating. These adjusted ratings are what we use to do matrix factorization and to make recommendations.

Let's see how that changes things. Assume that our system predicts an adjusted rating of 0.8 for a specific user. We know that the movie has an average rating of 4.2, so we just need to add the average back in to get the final rating for the user: 0.8 + 4.2 = 5 stars.

The cool part is how this works out for brand new users. We can assume that brand new users who haven't reviewed anything yet get a predicted adjusted rating of zero for every movie. When we add the average rating back in, the predicted rating for the movie ends up being 4.2 stars. So even though this user hasn't reviewed anything yet, we can recommend this movie based on how popular it is with other users.

Let's open up train_recommender_cold_start.py and see how to do this in code. This file contains the code to factor our review data set. We read the data set using the read_csv function and then create the ratings matrix using the pivot_table function. Now that we have a review matrix that covers every movie, we want to calculate the average rating of each movie. We can do this using the normalize_ratings function (matrix_factorization_utilities.normalize_ratings in the course files). This function takes in an array of ratings, so we pass in the ratings_df data set, converted to a NumPy array. (The original course code used as_matrix for the conversion; that method was removed in pandas 1.0, and to_numpy is its replacement.) The function returns two results. First, it returns a new copy of the ratings matrix, which we call normalized_ratings. This copy has the average rating subtracted from every user review. Second, it returns the means, the average rating for each movie. Next we factor the matrix to create U and M, and then multiply U and M to get the predicted ratings. After we predict ratings for all users, we add the average rating for each movie back in. Finally, at the bottom, we use the pickle.dump function to save a copy of the means to a file called means.dat. Running the program creates that file.

Now let's switch over to cold_start_recommendations.py. Our goal in this file is to recommend movies to a brand new user. First we use pickle.load to load the means.dat file. Then we load the movie list CSV file so we can print out movie titles. Next we use the mean ratings as the user's predicted ratings. Finally, we recommend movies to the user by returning them in order of their average rating. When we run it, the user is recommended the five products with the highest average ratings. Always recommending the highest rated movies to new users might not be perfect, but it's a good place to start until the user reviews some products.
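The arithmetic above can be sketched in a few lines of NumPy. Everything here is hypothetical: the text only says that user one rated the movie 4 and that the average is 4.2, so the other ratings (and the second movie) are invented to match. This is a sketch of the idea, not the contents of cold_start_recommendations.py.

```python
import numpy as np

# Hypothetical ratings: 5 users x 2 movies, NaN = not reviewed.
# The first column is chosen so that its average is 4.2.
ratings = np.array([[4.0, 1.0],
                    [5.0, np.nan],
                    [4.0, 2.0],
                    [3.0, np.nan],
                    [5.0, 3.0]])

# Average rating of each movie, ignoring missing reviews.
means = np.nanmean(ratings, axis=0)
print(means)                         # [4.2 2. ]

# Center every rating on its movie's average; user one's 4 becomes about -0.2.
normalized = ratings - means
print(round(float(normalized[0, 0]), 2))   # -0.2

# A predicted adjusted rating of 0.8 turns into 5 stars once the mean is added back.
print(round(float(0.8 + means[0]), 2))     # 5.0

# A brand new user gets an adjusted prediction of zero for every movie,
# so the final prediction is just the per-movie average.
new_user_prediction = np.zeros(ratings.shape[1]) + means

# Rank movies for the new user from highest to lowest average rating.
ranking = np.argsort(-new_user_prediction)
print(ranking)                       # [0 1]
```

In the notebook itself, the same effect would come from sorting the movie list by the saved means instead of by a row of predicted_ratings.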

In [0]:
# Convert the running list of user ratings into a matrix
ratings_df = pd.pivot_table(data, index='user_id', columns='movie_id', aggfunc=np.max)

# Normalize the ratings (center them around their mean)
# (DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() is its replacement)
normalized_ratings, means = normalize_ratings(ratings_df.to_numpy())

# Apply matrix factorization to find the latent features
U, M = low_rank_matrix_factorization(normalized_ratings,
                                     num_features=11,
                                     regularization_amount=1.1)

# Find all predicted ratings by multiplying U and M
predicted_ratings = np.matmul(U, M)

# Add back in the mean ratings for each product to de-normalize the predicted results
predicted_ratings = predicted_ratings + means

# Save features and predicted ratings to files for later use
with open("user_features.dat", "wb") as f:
    pickle.dump(U, f)
with open("product_features.dat", "wb") as f:
    pickle.dump(M, f)
with open("predicted_ratings.dat", "wb") as f:
    pickle.dump(predicted_ratings, f)
with open("means.dat", "wb") as f:
    pickle.dump(means, f)


In [0]:
import pickle
import pandas as pd

# Load prediction rules from data files
with open("user_features.dat", "rb") as f:
    U = pickle.load(f)
with open("product_features.dat", "rb") as f:
    M = pickle.load(f)
with open("predicted_ratings.dat", "rb") as f:
    predicted_ratings = pickle.load(f)

# Load movie titles
movies_df = pd.read_csv('gdrive/My Drive/recommendation/movies.csv', index_col='movie_id')
#movies_df = pd.read_csv('movies.csv', index_col='movie_id')

print("Enter a user_id to get recommendations (Between 1 and 100):")
user_id_to_search = int(input())

print("Movies we will recommend:")

user_ratings = predicted_ratings[user_id_to_search - 1]
movies_df['rating'] = user_ratings
movies_df = movies_df.sort_values(by=['rating'], ascending=False)

print(movies_df[['title', 'genre', 'rating']].head(5))

In [0]:
import pickle
import pandas as pd
import numpy as np

# Load prediction rules from data files
M = pickle.load(open("product_features.dat", "rb"))

# Swap the rows and columns of product_features just so it's easier to work with
M = np.transpose(M)

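# Reload movie titles. The previous cell re-sorted movies_df by rating, so its
# row order no longer lines up with the rows of M.
movies_df = pd.read_csv('gdrive/My Drive/recommendation/movies.csv', index_col='movie_id')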

# Choose a movie to find similar movies to. Let's find movies similar to movie #5:
movie_id = 5

# Get the chosen movie's name and genre
movie_information = movies_df.loc[movie_id]

print("We are finding movies similar to this movie:")
print("Movie title: {}".format(movie_information.title))
print("Genre: {}".format(movie_information.genre))

# Get the features for the chosen movie that we found via matrix factorization
current_movie_features = M[movie_id - 1]

print("The attributes for this movie are:")
print(current_movie_features)

# The main logic for finding similar movies:

# 1. Subtract the current movie's features from every other movie's features
difference = M - current_movie_features

# 2. Take the absolute value of that difference (so all numbers are positive)
absolute_difference = np.abs(difference)

# 3. Each movie has several features. Sum those features to get a total 'difference score' for each movie
total_difference = np.sum(absolute_difference, axis=1)

# 4. Create a new column in the movie list with the difference score for each movie
movies_df['difference_score'] = total_difference

# 5. Sort the movie list by difference score, from least different to most different
sorted_movie_list = movies_df.sort_values('difference_score')

# 6. Print the result, showing the 5 most similar movies to the chosen movie_id
print("The five most similar movies are:")
print(sorted_movie_list[['title', 'difference_score']][0:5])
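
The six steps above amount to ranking movies by the sum of absolute feature differences. Here is a minimal sketch on a made-up feature matrix; the three movies, their two features and the `target` index are all hypothetical.

```python
import numpy as np

# Hypothetical latent features: 3 movies x 2 features (one row per movie).
features = np.array([[1.0, 0.0],
                     [0.9, 0.1],
                     [-1.0, 2.0]])

target = 0  # row index of the movie we want neighbours for

# Steps 1-3: subtract, take absolute values, sum across features.
scores = np.sum(np.abs(features - features[target]), axis=1)

# Step 5: order movies from least different to most different.
order = np.argsort(scores)
print(order)       # [0 1 2]
print(scores[0])   # 0.0
```

Note that the chosen movie always has a difference score of zero against itself, so it appears first in the sorted list. If only other movies are wanted, the first entry can be skipped.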
