
<h1 id="Project--Recommendation-Systems:-Amazon-product-reviews">Project- Recommendation Systems: Amazon product reviews<a class="anchor-link" href="#Project--Recommendation-Systems:-Amazon-product-reviews">¶</a></h1><p>Welcome to the project on Recommendation Systems. We will work with the Amazon product reviews dataset for this project. The dataset contains ratings of different electronic products. It does not include information about the products or reviews to avoid bias while building the model.</p>
<hr/>
<h3 id="Context:">Context:<a class="anchor-link" href="#Context:">¶</a></h3><hr/>
<p>E-commerce websites like Amazon and Flipkart use different recommendation models to provide personalized suggestions to different users. Amazon currently uses item-to-item collaborative filtering, which scales to massive data sets and produces high-quality recommendations in real time.</p>
<hr/>
<h3 id="Objective:">Objective:<a class="anchor-link" href="#Objective:">¶</a></h3><hr/>
<p>Build a recommendation system to recommend products to customers based on their previous ratings for other products.</p>
<hr/>
<h3 id="Dataset:">Dataset:<a class="anchor-link" href="#Dataset:">¶</a></h3><hr/>
<p>The Amazon dataset contains the following attributes:</p>
<ul>
<li><strong>userId:</strong> Every user identified with a unique id</li>
<li><strong>productId:</strong> Every product identified with a unique id</li>
<li><strong>Rating:</strong> Rating of the corresponding product by the corresponding user</li>
<li><strong>timestamp:</strong> Time of the rating (ignore this column for this exercise)</li>
</ul>


In [None]:


#You can use the following code to mount the drive if you use Google Colab for this project. It is not necessary to use Colab for this project.
# from google.colab import drive
# drive.mount('/content/drive')




In [None]:


# Install the surprise package (needed in Google Colab) before importing it
!pip install surprise





<h3 id="Importing-Libraries">Importing Libraries<a class="anchor-link" href="#Importing-Libraries">¶</a></h3>


In [None]:


import warnings #Used to ignore the warning given as output of the code.
warnings.filterwarnings('ignore')

import numpy as np # Basic libraries of python for numeric and dataframe computations.
import pandas as pd

import matplotlib.pyplot as plt #Basic library for data visualization.
import seaborn as sns #Slightly advanced library for data visualization

# from sklearn.metrics.pairwise import cosine_similarity #To compute the cosine similarity between two vectors.
from collections import defaultdict #A dictionary output that does not raise a key error

from sklearn.metrics import mean_squared_error # A performance metrics in sklearn.





<h3 id="Loading-data">Loading data<a class="anchor-link" href="#Loading-data">¶</a></h3>


In [None]:


#Import the data set
df = pd.read_csv('ratings_Electronics.csv', header=None) #There are no headers in the data file

df.columns = ['user_id', 'prod_id', 'rating', 'timestamp'] #Adding column names

df = df.drop('timestamp', axis=1) #Dropping timestamp

df_copy = df.copy(deep=True) #Copying the data to another dataframe





<p><strong>This dataset is very large, with 7,824,482 observations, so building a model on all of it is computationally expensive. Moreover, many users have rated only a few products, and some products have been rated by very few users. Hence, we can reduce the dataset by making some logical assumptions.</strong></p>
<p>Here, we keep only users who have given at least 50 ratings and products that have at least 5 ratings, since when we shop online we prefer a product to have some minimum number of ratings.</p>


In [None]:


# Get the column containing the users
users = df.user_id
# Create a dictionary from users to their number of ratings
ratings_count = dict()
for user in users:
    # If we already have the user, just add 1 to their rating count
    if user in ratings_count:
        ratings_count[user] += 1
    # Otherwise, set their rating count to 1
    else:
        ratings_count[user] = 1    




In [None]:


# We want our users to have at least 50 ratings to be considered
RATINGS_CUTOFF = 50
remove_users = []
for user, num_ratings in ratings_count.items():
    if num_ratings < RATINGS_CUTOFF:
        remove_users.append(user)
df = df.loc[~df.user_id.isin(remove_users)]




In [None]:


# Get the column containing the products
products = df.prod_id
# Create a dictionary from products to their number of ratings
ratings_count = dict()
for prod in products:
    # If we already have the product, just add 1 to its rating count
    if prod in ratings_count:
        ratings_count[prod] += 1
    # Otherwise, set its rating count to 1
    else:
        ratings_count[prod] = 1




In [None]:


# We want our items to have at least 5 ratings to be considered
RATINGS_CUTOFF = 5
remove_items = []
for prod, num_ratings in ratings_count.items():
    if num_ratings < RATINGS_CUTOFF:
        remove_items.append(prod)
df_final = df.loc[~df.prod_id.isin(remove_items)]
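
<p>The counting loops above can also be written with <code>collections.Counter</code>. The following is a minimal sketch on a small, made-up list of (user, product, rating) tuples, using toy cutoffs of 2 instead of 50 and 5. The names <code>filter_ratings</code>, <code>min_user_ratings</code> and <code>min_item_ratings</code> are illustrative and are not part of the notebook.</p>

```python
from collections import Counter

def filter_ratings(rows, min_user_ratings, min_item_ratings):
    """Keep rows from active users first, then rows for sufficiently rated products."""
    # Stage 1: count ratings per user and drop rows from infrequent users
    user_counts = Counter(user for user, _, _ in rows)
    rows = [r for r in rows if user_counts[r[0]] >= min_user_ratings]

    # Stage 2: count ratings per product on the remaining rows only
    item_counts = Counter(item for _, item, _ in rows)
    return [r for r in rows if item_counts[r[1]] >= min_item_ratings]

# Hypothetical toy data: (user_id, prod_id, rating)
toy_rows = [
    ("u1", "p1", 5.0), ("u1", "p2", 4.0),
    ("u2", "p1", 3.0), ("u2", "p3", 5.0),
    ("u3", "p1", 4.0),
]

filtered = filter_ratings(toy_rows, min_user_ratings=2, min_item_ratings=2)
print(filtered)
```

<p>As in the cells above, the product filter runs after the user filter, so a retained user can end up with fewer ratings than the user cutoff once rarely rated products are dropped.</p>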




In [None]:


# see a few rows of the imported dataset
df_final.head()





<h3 id="Exploratory-Data-Analysis-(6-marks)">Exploratory Data Analysis (6 marks)<a class="anchor-link" href="#Exploratory-Data-Analysis-(6-marks)">¶</a></h3>



<p><strong>Please fill in the blanks and complete the code(and provide observations) for the following cells wherever mentioned.</strong></p>



<h4 id="Shape-of-the-data">Shape of the data<a class="anchor-link" href="#Shape-of-the-data">¶</a></h4>


In [None]:


# Check the number of rows and columns and provide observations
rows, columns = df_final.shape
print("No of rows: ", rows) 
print("No of columns: ", columns) 





<p><strong>Observations: There are 3 columns: user_id, prod_id, and rating. After filtering the data, we now have 65,290 rows, that is, 65,290 ratings of items.</strong></p>



<h4 id="Data-types">Data types<a class="anchor-link" href="#Data-types">¶</a></h4>


In [None]:


#Check Data types and provide observations
df_final.dtypes





<p><strong>Observations: user_id and prod_id are both objects, and rating is a float, which makes sense for a rating.</strong></p>



<h4 id="Checking-for-missing-values">Checking for missing values<a class="anchor-link" href="#Checking-for-missing-values">¶</a></h4>


In [None]:


# Check for missing values present and provide observations
df_final.isnull().sum()





<p><strong>Observations: It appears that there are no missing values in the data.</strong></p>



<h4 id="Summary-Statistics">Summary Statistics<a class="anchor-link" href="#Summary-Statistics">¶</a></h4>


In [None]:


# Summary statistics of 'rating' variable and provide observations
df_final['rating'].describe().T





<p><strong>Observations: The ratings run from 1 to 5, as expected. Most people rate products 4 or above: the 25th percentile is 4 and the average is 4.29.</strong></p>



<h4 id="Checking-the-rating-distribution">Checking the rating distribution<a class="anchor-link" href="#Checking-the-rating-distribution">¶</a></h4>


In [None]:


#Create the bar plot and provide observations

plt.figure(figsize = (12,6))
df_final['rating'].value_counts(normalize=True).plot(kind='bar')
plt.show()





<p><strong>Observations: Most people give a rating of 5, with 4 being the second most common. Very few people leave a rating of 2 or 1.</strong></p>



<h4 id="Checking-the-number-of-unique-users-and-items-in-the-dataset">Checking the number of unique users and items in the dataset<a class="anchor-link" href="#Checking-the-number-of-unique-users-and-items-in-the-dataset">¶</a></h4>


In [None]:


# Number of unique user id and product id in the data
print('Number of unique USERS in the filtered data = ', df_final['user_id'].nunique())
print('Number of unique ITEMS in the filtered data = ', df_final['prod_id'].nunique())





<ul>
<li>There are <strong>1540 unique users and 5689 products</strong> in the dataset</li>
</ul>



<h4 id="Users-with-most-number-of-ratings">Users with most number of ratings<a class="anchor-link" href="#Users-with-most-number-of-ratings">¶</a></h4>


In [None]:


# Top 10 users based on rating
most_rated = df_final.groupby('user_id').size().sort_values(ascending=False)[:10]
most_rated





<ul>
<li>The highest number of <strong>ratings by a user is 295</strong>, which is far fewer than the number of products present in the data. We can therefore build a recommendation system to recommend products that users have not yet interacted with.</li>
</ul>



<h3 id="Data-preparation--(2-Marks)">Data preparation  (2 Marks)<a class="anchor-link" href="#Data-preparation--(2-Marks)">¶</a></h3>


In [None]:


#Check the number of unique USERS and PRODUCTS in the final data and provide observations

print('The number of observations in the final data =', len(df_final))
print('Number of unique USERS in the final data = ', df_final['user_id'].nunique())
print('Number of unique PRODUCTS in the final data = ', df_final['prod_id'].nunique())





<p><strong>Observations: We have 1,540 different users and 5,689 different products. That is a potential total of 8,761,060 different observations. However, we have significantly fewer, with only 65,290.</strong></p>
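
<p>The gap between possible and observed ratings can be summarised as the density (or sparsity) of the user-item matrix. The sketch below only redoes the arithmetic with the counts reported above; the variable names are illustrative.</p>

```python
n_users = 1540       # unique users reported above
n_products = 5689    # unique products reported above
n_ratings = 65290    # observed ratings reported above

# Every user could in principle rate every product
possible = n_users * n_products

# Fraction of user-product pairs that actually have a rating
density = n_ratings / possible

print(f"Possible user-product pairs: {possible:,}")
print(f"Density:  {density:.4%}")
print(f"Sparsity: {1 - density:.4%}")
```

<p>Fewer than 1% of the possible user-product pairs have a rating, which is why the models below must predict ratings for pairs that were never observed.</p>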



<p>Now that we have explored and preprocessed the data, let's build the first recommendation system</p>



<h3 id="Rank-Based-Recommendation-System-(10-marks)">Rank Based Recommendation System (10 marks)<a class="anchor-link" href="#Rank-Based-Recommendation-System-(10-marks)">¶</a></h3>


In [None]:


df_final.head()




In [None]:


#Calculate the average rating for each product (select the rating column first so that non-numeric columns are not averaged)
average_rating = df_final.groupby('prod_id')['rating'].mean()

#Calculate the count of ratings for each product
count_rating = df_final.groupby('prod_id')['rating'].count()

#Create a dataframe with calculated average and count of ratings
final_rating = pd.DataFrame({'avg_rating':average_rating, 'rating_count':count_rating})

#Sort the dataframe by average of ratings in the descending order
final_rating = final_rating.sort_values('avg_rating', ascending = False)

final_rating.head()




In [None]:


#defining a function to get the top n products based on highest average rating and minimum interactions
def top_n_products(final_rating, n, min_interaction):
    
    #Keeping only products with more than min_interaction ratings
    recommendations = final_rating[final_rating['rating_count'] > min_interaction]
    
    #Sorting values w.r.t average rating 
    recommendations = recommendations.sort_values(by='avg_rating', ascending=False)
    
    return recommendations.index[:n]





<h4 id="Recommending-top-5-products-with-50-minimum-interactions-based-on-popularity">Recommending top 5 products with 50 minimum interactions based on popularity<a class="anchor-link" href="#Recommending-top-5-products-with-50-minimum-interactions-based-on-popularity">¶</a></h4>


In [None]:


list(top_n_products(final_rating, 5, 50))





<h4 id="Recommending-top-5-products-with-100-minimum-interactions-based-on-popularity">Recommending top 5 products with 100 minimum interactions based on popularity<a class="anchor-link" href="#Recommending-top-5-products-with-100-minimum-interactions-based-on-popularity">¶</a></h4>


In [None]:


list(top_n_products(final_rating, 5, 100))





<p>We have recommended the <strong>top 5</strong> products by using the popularity recommendation system. Now, let's build a recommendation system using collaborative filtering</p>



<h3 id="Collaborative-Filtering-Based-Recommendation-System">Collaborative Filtering Based Recommendation System<a class="anchor-link" href="#Collaborative-Filtering-Based-Recommendation-System">¶</a></h3>



<p>In this type of recommendation system, <code>we do not need any information</code> about the users or items. We only need user item interaction data to build a collaborative recommendation system. For example -</p>
<ol>
<li><b>Ratings</b> provided by users. For example - ratings of books on Goodreads, movie ratings on IMDb, etc.</li>
<li><b>Likes</b> of users on different Facebook posts, likes on YouTube videos</li>
<li><b>Use/buying</b> of a product by users. For example - buying different items on e-commerce sites</li>
<li><b>Reading</b> of articles by readers on various blogs</li>
</ol>



<h4 id="Types-of-Collaborative-Filtering">Types of Collaborative Filtering<a class="anchor-link" href="#Types-of-Collaborative-Filtering">¶</a></h4>



<ul>
<li>Similarity/Neighborhood based<ul>
<li>User User Similarity Based  </li>
<li>Item Item similarity based</li>
</ul>
</li>
<li>Model based</li>
</ul>



<h4 id="Building-a-baseline-user-user-similarity-based-recommendation-system">Building a baseline user user similarity based recommendation system<a class="anchor-link" href="#Building-a-baseline-user-user-similarity-based-recommendation-system">¶</a></h4>



<ul>
<li>Below we are building <strong>similarity-based recommendation systems</strong> using <code>cosine</code> similarity and using <strong>KNN to find similar users</strong> which are the nearest neighbor to the given user.  </li>
<li>We will be using a new library - <code>surprise</code> to build the remaining models, let's first import the necessary classes and functions from this library</li>
<li>Please use the following code to <code>install the surprise</code> library. You don't need to run the following code if the surprise library is installed on your system.</li>
</ul>
<p><strong>!pip install surprise</strong></p>


In [None]:


# To compute the accuracy of models
from surprise import accuracy

# class is used to parse a file containing ratings, data should be in structure - user ; item ; rating
from surprise.reader import Reader

# class for loading datasets
from surprise.dataset import Dataset

# for tuning model hyperparameters
from surprise.model_selection import GridSearchCV

# for splitting the rating data in train and test dataset
from surprise.model_selection import train_test_split

# for implementing similarity-based recommendation system
from surprise.prediction_algorithms.knns import KNNBasic

# for implementing matrix factorization based recommendation system
from surprise.prediction_algorithms.matrix_factorization import SVD

# for implementing KFold cross-validation
from surprise.model_selection import KFold

#For implementing clustering-based recommendation system
from surprise import CoClustering





<h4 id="Before-building-the-recommendation-systems,-let's--go-over-some-some-basic-terminologies-we-are-going-to-use:">Before building the recommendation systems, let's go over some basic terminology we are going to use:<a class="anchor-link" href="#Before-building-the-recommendation-systems,-let's--go-over-some-some-basic-terminologies-we-are-going-to-use:">¶</a></h4>



<p><strong>Relevant item</strong> - An item (product in this case) that is actually <strong>rated higher than the threshold rating (here 3.5)</strong> is relevant; if the <strong>actual rating is below the threshold, then it is a non-relevant item</strong>.</p>
<p><strong>Recommended item</strong> - An item whose <strong>predicted rating is higher than the threshold (here 3.5) is a recommended item</strong>; if the <strong>predicted rating is below the threshold, then that product will not be recommended to the user</strong>.</p>



<p><strong>False Negative (FN)</strong> - It is the <strong>frequency of relevant items that are not recommended to the user</strong>. If the relevant items are not recommended to the user, then the user might not buy the product/item. This would result in the <strong>loss of opportunity for the service provider</strong> which they would like to minimize.</p>
<p><strong>False Positive (FP)</strong> - It is the <strong>frequency of recommended items that are actually not relevant</strong>. In this case, the recommendation system is not doing a good job of finding and recommending the relevant items to the user. This would result in <strong>loss of resources for the service provider</strong> which they would also like to minimize.</p>



<p><strong>Recall</strong> - It is the <strong>fraction of actually relevant items that are recommended to the user</strong>, i.e., if 6 out of 10 relevant products are recommended to the user, then recall is 0.60. The higher the recall, the better the model. It is one of the metrics used to assess the performance of classification models.</p>
<p><strong>Precision</strong> - It is the <strong>fraction of recommended items that are actually relevant</strong>, i.e., if 6 out of 10 recommended items are found relevant by the user, then precision is 0.60. The higher the precision, the better the model. It is one of the metrics used to assess the performance of classification models.</p>



<p><strong>When building a recommendation system, it is customary to assess the model's performance in terms of how many recommendations are relevant and how many relevant items are recommended. Below are the most commonly used performance metrics for recommendation systems.</strong></p>



<h3 id="Precision@k-and-Recall@-k">Precision@k and Recall@k<a class="anchor-link" href="#Precision@k-and-Recall@-k">¶</a></h3>



<p><strong>Precision@k</strong> - It is the <strong>fraction of recommended items that are relevant in <code>top k</code> predictions</strong>. Value of k is the number of recommendations to be provided to the user. One can choose a variable number of recommendations to be given to a unique user.</p>
<p><strong>Recall@k</strong> - It is the <strong>fraction of relevant items that are recommended to the user in <code>top k</code> predictions</strong>.</p>
<p><strong>F1-Score@k</strong> - It is the <strong>harmonic mean of Precision@k and Recall@k</strong>. When <strong>precision@k and recall@k both seem to be important</strong> then it is useful to use this metric because it is representative of both of them.</p>
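
<p>To make these definitions concrete, here is a minimal sketch that computes the three metrics for a single user from a made-up list of (estimated, true) rating pairs. The function name <code>precision_recall_f1_at_k</code> and the toy data are illustrative; the sketch counts ratings greater than or equal to the threshold, and sets a metric to 0 when its denominator is 0.</p>

```python
def precision_recall_f1_at_k(user_ratings, k=10, threshold=3.5):
    """user_ratings: list of (estimated, true) rating pairs for one user."""
    # Rank the user's items by estimated rating, highest first
    ranked = sorted(user_ratings, key=lambda pair: pair[0], reverse=True)
    top_k = ranked[:k]

    # Relevant items among everything the user rated
    n_rel = sum(true >= threshold for _, true in ranked)
    # Recommended items within the top k
    n_rec_k = sum(est >= threshold for est, _ in top_k)
    # Items in the top k that are both recommended and relevant
    n_rel_and_rec_k = sum(est >= threshold and true >= threshold
                          for est, true in top_k)

    precision = n_rel_and_rec_k / n_rec_k if n_rec_k else 0.0
    recall = n_rel_and_rec_k / n_rel if n_rel else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical (estimated, true) ratings for one user
toy = [(4.8, 5.0), (4.5, 3.0), (4.1, 4.0), (3.9, 5.0), (3.0, 4.0), (2.5, 2.0)]
precision, recall, f1 = precision_recall_f1_at_k(toy, k=3)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

<p>With k=3, two of the three recommended items are relevant (precision 2/3), while only two of the user's four relevant items appear in the top 3 (recall 1/2).</p>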



<h3 id="Some-useful-functions">Some useful functions<a class="anchor-link" href="#Some-useful-functions">¶</a></h3>



<ul>
<li>The function below takes the <strong>recommendation model</strong> as input and prints the RMSE, <strong>precision@k and recall@k</strong>, and the F_1 score for that model on the test data.  </li>
<li>To compute <strong>precision and recall</strong>, <strong>top k</strong> predictions are taken under consideration for each user.</li>
</ul>


In [None]:


def precision_recall_at_k(model, k=10, threshold=3.5):
    """Print RMSE, precision@k, recall@k and F_1 score, averaged over users, on the global `testset`"""

    # First map the predictions to each user.
    user_est_true = defaultdict(list)
    
    #Making predictions on the test data
    predictions=model.test(testset)
    
    for uid, _, true_r, est, _ in predictions:
        user_est_true[uid].append((est, true_r))

    precisions = dict()
    recalls = dict()
    for uid, user_ratings in user_est_true.items():

        # Sort user ratings by estimated value
        user_ratings.sort(key=lambda x: x[0], reverse=True)

        # Number of relevant items
        n_rel = sum((true_r >= threshold) for (_, true_r) in user_ratings)

        # Number of recommended items in top k
        n_rec_k = sum((est >= threshold) for (est, _) in user_ratings[:k])

        # Number of relevant and recommended items in top k
        n_rel_and_rec_k = sum(((true_r >= threshold) and (est >= threshold))
                              for (est, true_r) in user_ratings[:k])

        # Precision@K: Proportion of recommended items that are relevant
        # When n_rec_k is 0, Precision is undefined. We here set Precision to 0 when n_rec_k is 0.

        precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 0

        # Recall@K: Proportion of relevant items that are recommended
        # When n_rel is 0, Recall is undefined. We here set Recall to 0 when n_rel is 0.

        recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 0
    
    #Mean of the per-user precisions.
    precision = round((sum(prec for prec in precisions.values()) / len(precisions)),3)
    #Mean of the per-user recalls.
    recall = round((sum(rec for rec in recalls.values()) / len(recalls)),3)
    
    accuracy.rmse(predictions)
    print('Precision: ', precision) #Command to print the overall precision
    print('Recall: ', recall) #Command to print the overall recall
    # F_1 score is the harmonic mean of precision and recall; it is set to 0 when both are 0.
    f1_score = round((2*precision*recall)/(precision+recall),3) if (precision+recall) != 0 else 0
    print('F_1 score: ', f1_score)





<p>Below we are loading the <strong><code>rating</code> dataset</strong>, which is a <strong>pandas dataframe</strong>, into a <strong>different format called <code>surprise.dataset.DatasetAutoFolds</code></strong> which is required by this library. To do this we will be <strong>using the classes <code>Reader</code> and <code>Dataset</code></strong></p>


In [None]:


df_final.head()




In [None]:


# instantiating Reader with the expected rating scale (ratings in this data run from 1 to 5)
reader = Reader(rating_scale=(1, 5))

# loading the rating dataset
data = Dataset.load_from_df(df_final[['user_id', 'prod_id', 'rating']], reader)

# splitting the data into train and test dataset
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)





<ul>
<li>Now we are <strong>ready to build the first baseline similarity-based recommendation system</strong> using the cosine similarity.</li>
<li><strong>KNNBasic</strong> is an algorithm provided by the <strong>surprise package</strong>. It predicts a rating from the ratings of the <strong>most similar users (or items)</strong>, as measured by the chosen similarity.</li>
</ul>
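
<p>As a rough illustration of the cosine similarity mentioned above, the sketch below compares two users using only the products both have rated. The function name <code>cosine_on_common_items</code> and the toy ratings are made up for illustration, and returning 0 when there is no overlap is an assumption of this sketch rather than a statement about the library's internals.</p>

```python
import math

def cosine_on_common_items(ratings_u, ratings_v):
    """Cosine similarity between two users over the items both have rated."""
    common = set(ratings_u) & set(ratings_v)
    if not common:
        return 0.0  # no shared items: treat similarity as zero in this sketch
    dot = sum(ratings_u[i] * ratings_v[i] for i in common)
    norm_u = math.sqrt(sum(ratings_u[i] ** 2 for i in common))
    norm_v = math.sqrt(sum(ratings_v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

# Hypothetical users: prod_id -> rating
alice = {"p1": 5.0, "p2": 3.0, "p3": 4.0}
bob = {"p1": 4.0, "p2": 2.0, "p4": 5.0}
carol = {"p3": 1.0, "p4": 5.0}

sim_ab = cosine_on_common_items(alice, bob)    # two shared products
sim_ac = cosine_on_common_items(alice, carol)  # one shared product
print(round(sim_ab, 4), round(sim_ac, 4))
```

<p>Note that two users who share a single product always get a similarity of 1 with positive ratings, however different their ratings are. This is one reason the neighbourhood settings tuned later can matter.</p>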



<ul>
<li>To compute <strong>precision and recall</strong>, a <strong>threshold of 3.5 and a k value of 10 are used for the recommended and relevant ratings</strong>. </li>
<li>In the <strong>present case precision and recall both need to be optimized as the service provider would like to minimize both the losses discussed</strong> above. Hence, the correct performance measure is the <strong>F_1 score</strong>. </li>
</ul>



<h3 id="Question:-Build-the-user-user-similarity-based-recommendation-system-(5-Marks)">Question: Build the user-user similarity-based recommendation system (5 Marks)<a class="anchor-link" href="#Question:-Build-the-user-user-similarity-based-recommendation-system-(5-Marks)">¶</a></h3><ul>
<li>Initialize the KNNBasic model using the sim_options provided, verbose=False, and random_state=1 (1 Mark)</li>
<li>Fit the model on the training data (1 Mark)</li>
<li>Use the precision_recall_at_k function to calculate the metrics on the test data (1 Mark)</li>
<li>Provide your observations on the output (2 Marks) </li>
</ul>


In [None]:


#Declaring the similarity options.
sim_options = {'name': 'cosine',
               'user_based': True}

#Initialize the KNNBasic model using sim_options provided, Verbose=False, and setting random_state=1
sim_user_user = KNNBasic(sim_options=sim_options, verbose=False, random_state=1)

# Fit the model on the training data
sim_user_user.fit(trainset)

# Let us compute precision@k, recall@k, and f_1 score
precision_recall_at_k(sim_user_user)





<p><strong>Observations: Precision is 86%, which means 86% of the recommended products are relevant. Recall of 86% means that, of all relevant products, 86% are recommended. An F1 score of 86% means that most recommended products were relevant and most relevant products were recommended.</strong></p>



<p>Let's now <strong>predict the rating for the user with <code>userId=A3LDPF5FMB782Z</code> and <code>productId=1400501466</code></strong> as shown below. Here the user has already interacted with (rated) the product with productId '1400501466'.</p>


In [None]:


#predicting rating for a sample user with an interacted product.
sim_user_user.predict("A3LDPF5FMB782Z", "1400501466", r_ui=5, verbose=True)





<p>Below is the <strong>list of users who have rated the product with product id "1400501466"</strong>. Any user missing from this list has not interacted with the product.</p>


In [None]:


df_final[df_final.prod_id=="1400501466"].user_id.unique()





<ul>
<li>User "A34BZM6S9L7QI4" <strong>does not appear in the output above, so this user has not rated the product with productId "1400501466"</strong>.</li>
</ul>



<p>Below we are <strong>predicting the rating for <code>userId=A34BZM6S9L7QI4</code> and a product which this user has not interacted with yet, i.e. <code>prod_id=1400501466</code></strong></p>


In [None]:


#predicting rating for a sample user with a non interacted product.
sim_user_user.predict("A34BZM6S9L7QI4", "1400501466", verbose=True)





<h4 id="Improving-similarity-based-recommendation-system-by-tuning-its-hyper-parameters">Improving similarity-based recommendation system by tuning its hyper-parameters<a class="anchor-link" href="#Improving-similarity-based-recommendation-system-by-tuning-its-hyper-parameters">¶</a></h4>



<p>Below we will be tuning hyperparameters for the <code>KNNBasic</code> algorithms. Let's try to understand some of the hyperparameters of KNNBasic algorithm:</p>



<ul>
<li><strong>k</strong> (int) – The (max) number of neighbors to take into account for aggregation. Default is 40.</li>
<li><strong>min_k</strong> (int) – The minimum number of neighbors to take into account for aggregation. If there are not enough neighbors, the prediction is set to the global mean of all ratings. Default is 1.</li>
<li><strong>sim_options</strong> (dict) – A dictionary of options for the similarity measure. And there are four similarity measures available in surprise - <ul>
<li>cosine</li>
<li>msd (default)</li>
<li>Pearson</li>
<li>Pearson baseline</li>
</ul>
</li>
</ul>
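
<p>To illustrate how <code>k</code> and <code>min_k</code> interact, here is a simplified sketch of a similarity-weighted average over neighbours. It follows the description above and is not the library's actual code: the function name <code>knn_estimate</code>, the toy neighbours, and the placeholder <code>global_mean=4.29</code> (the average rating reported in the summary statistics) are assumptions made for illustration.</p>

```python
def knn_estimate(neighbor_sims_and_ratings, k=40, min_k=1, global_mean=4.29):
    """Similarity-weighted average over the k most similar neighbours.

    neighbor_sims_and_ratings: list of (similarity, rating) pairs from users
    who rated the target item. Falls back to global_mean when fewer than
    min_k neighbours with positive similarity are available.
    """
    # Keep at most k neighbours, most similar first
    top = sorted(neighbor_sims_and_ratings, key=lambda p: p[0], reverse=True)[:k]
    # Only neighbours with positive similarity contribute
    used = [(sim, rating) for sim, rating in top if sim > 0]
    if len(used) < min_k:
        return global_mean
    return sum(sim * rating for sim, rating in used) / sum(sim for sim, _ in used)

# Hypothetical neighbours of one user for one product: (similarity, rating)
neighbours = [(0.9, 5.0), (0.8, 4.0), (0.3, 2.0)]

est_k2 = knn_estimate(neighbours, k=2, min_k=1)        # uses the two closest neighbours
est_min_k6 = knn_estimate(neighbours, k=40, min_k=6)   # too few neighbours: falls back
print(round(est_k2, 3), est_min_k6)
```

<p>A larger <code>min_k</code> makes the model fall back to the global mean more often for sparsely rated products, trading personalisation for stability.</p>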


In [None]:


# setting up parameter grid to tune the hyperparameters
param_grid = {'k': [20, 30, 40], 'min_k': [3, 6, 9],
              'sim_options': {'name': ['msd', 'cosine'],
                              'user_based': [True]}
              }

# performing 3-fold cross validation to tune the hyperparameters
gs = GridSearchCV(KNNBasic, param_grid, measures=['rmse'], cv=3, n_jobs=-1)

# fitting the data
gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])





<p>Once the grid search is <strong>complete</strong>, we can get the <strong>optimal values for each of those hyperparameters</strong> as shown above</p>



<p>Now let's build the <strong>final model by using tuned values of the hyperparameters</strong> which we received by using <strong>grid search cross-validation</strong></p>


In [None]:


# using the optimal similarity measure for user-user based collaborative filtering
sim_options = {'name': 'cosine',
               'user_based': True}

# creating an instance of KNNBasic with optimal hyperparameter values
sim_user_user_optimized = KNNBasic(sim_options=sim_options, k=40, min_k=6, random_state=1, verbose=False)

# training the algorithm on the trainset
sim_user_user_optimized.fit(trainset)

# Let us compute precision@k and recall@k also with k =10.
precision_recall_at_k(sim_user_user_optimized)





<ul>
<li>We can see from above that after tuning hyperparameters, <strong>F_1 score of the tuned model is slightly better than the baseline model</strong>. Along with this <strong>the RMSE of the model has gone down as compared to the model before hyperparameter tuning</strong>. Hence, we can say that the model performance has improved slightly after hyperparameter tuning.</li>
</ul>



<h3 id="Question:">Question:<a class="anchor-link" href="#Question:">¶</a></h3><ul>
<li>Predict the rating for the user with <code>userId</code>="A3LDPF5FMB782Z" and <code>prod_id</code>=1400501466 using the optimized model (1 Mark)</li>
<li>Predict the rating for userId="A34BZM6S9L7QI4" and a product which this user has not interacted with before, i.e. prod_id=1400501466, by using the optimized model (1 Mark)</li>
<li>Compare the output with the output from the baseline model (2 Marks)</li>
</ul>


In [None]:


#Use sim_user_user_optimized model to recommend for userId "A3LDPF5FMB782Z" and productId 1400501466.
sim_user_user_optimized.predict("A3LDPF5FMB782Z", "1400501466", r_ui=5, verbose=True)




In [None]:


#Use sim_user_user_optimized model to predict for userId "A34BZM6S9L7QI4" and productId "1400501466"
#This user has not rated the product, so no true rating (r_ui) is passed
sim_user_user_optimized.predict("A34BZM6S9L7QI4", "1400501466", verbose=True)





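<p><strong>Observations:</strong> The estimated rating is still not close to the actual rating of 5 for user A3LDPF5FMB782Z. User A34BZM6S9L7QI4 has not rated this product, so there is no actual rating to compare against.</p>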



<h2 id="Identifying-similar-users-to-a-given-user-(nearest-neighbors)">Identifying similar users to a given user (nearest neighbors)<a class="anchor-link" href="#Identifying-similar-users-to-a-given-user-(nearest-neighbors)">¶</a></h2>



<p>We can also find the <strong>users most similar to a given user</strong>, i.e. its <strong>nearest neighbors</strong>, using this KNNBasic algorithm. Below we find the 5 most similar users to the user with inner id 0, based on the similarity measure the optimized model was built with (<code>cosine</code> in the cell above).</p>


In [None]:


sim_user_user_optimized.get_neighbors(0,5) #Here 0 is the inner id of the above user.





<h4 id="Implementing-the-recommendation-algorithm-based-on-optimized-KNNBasic-model">Implementing the recommendation algorithm based on optimized KNNBasic model<a class="anchor-link" href="#Implementing-the-recommendation-algorithm-based-on-optimized-KNNBasic-model">¶</a></h4>



<p>Below we will be implementing a function where the input parameters are -</p>
<ul>
<li>data: a <strong>rating</strong> dataset</li>
<li>user_id: a user id <strong>against which we want the recommendations</strong></li>
<li>top_n: the <strong>number of products we want to recommend</strong></li>
<li>algo: the algorithm we want to use <strong>for predicting the ratings</strong></li>
<li>The output of the function is a <strong>set of top_n items</strong> recommended for the given user_id based on the given algorithm</li>
</ul>


In [None]:


def get_recommendations(data, user_id, top_n, algo):
    
    # creating an empty list to store the recommended product ids
    recommendations = []
    
    # creating a user-item interactions matrix (rows: users, columns: products, values: ratings)
    user_item_interactions_matrix = data.pivot(index='user_id', columns='prod_id', values='rating')
    
    # extracting those product ids which the user_id has not interacted with yet
    non_interacted_products = user_item_interactions_matrix.loc[user_id][user_item_interactions_matrix.loc[user_id].isnull()].index.tolist()
    
    # looping through each of the product ids which user_id has not interacted yet
    for item_id in non_interacted_products:
        
        # predicting the ratings for those non interacted product ids by this user
        est = algo.predict(user_id, item_id).est
        
        # appending the predicted ratings
        recommendations.append((item_id, est))

    # sorting the predicted ratings in descending order
    recommendations.sort(key=lambda x: x[1], reverse=True)

    return recommendations[:top_n] # returning top n highest predicted rating products for this user





<h4 id='Predicted-top-5-products-for-userId="A3LDPF5FMB782Z"-with-similarity-based-recommendation-system'>Predicted top 5 products for userId="A3LDPF5FMB782Z" with similarity based recommendation system<a class="anchor-link" href='#Predicted-top-5-products-for-userId="A3LDPF5FMB782Z"-with-similarity-based-recommendation-system'>¶</a></h4>


In [None]:


#Making top 5 recommendations for user_id "A3LDPF5FMB782Z" with a similarity-based recommendation engine.
recommendations = get_recommendations(df_final, "A3LDPF5FMB782Z", 5, sim_user_user_optimized)




In [None]:


#Building the dataframe for above recommendations with columns "prod_id" and "predicted_ratings"
pd.DataFrame(recommendations, columns=['prod_id', 'predicted_ratings'])





<h3 id="Correcting-the-Ratings-and-Ranking-the-above-products">Correcting the Ratings and Ranking the above products<a class="anchor-link" href="#Correcting-the-Ratings-and-Ranking-the-above-products">¶</a></h3>



<p>When comparing two products, the <strong>rating alone</strong> does not fully describe <strong>how likely a user is to like a product</strong>. The <strong>number of users who have rated the product</strong> also matters, so we calculate a <strong>"corrected_ratings"</strong> value for each product. Commonly, the higher the <strong>"rating_count" of a product, the more reliable its rating</strong>. For example, a <strong>product rated 4 with a rating_count of 3 is backed by far less evidence than a product rated 3 with a rating_count of 50</strong>. As an empirical heuristic, the <strong>correction is taken to be proportional to the inverse of the square root of the rating_count of the product</strong>, so it shrinks as more users rate the product.</p>


In [None]:


def ranking_products(recommendations, final_rating):
  # sort the products based on ratings count
  ranked_products = final_rating.loc[[items[0] for items in recommendations]].sort_values('rating_count', ascending=False)[['rating_count']].reset_index()

  # merge with the recommended products to get predicted ratings
  ranked_products = ranked_products.merge(pd.DataFrame(recommendations, columns=['prod_id', 'predicted_ratings']), on='prod_id', how='inner')

  # rank the products based on corrected ratings
  ranked_products['corrected_ratings'] = ranked_products['predicted_ratings'] - 1 / np.sqrt(ranked_products['rating_count'])

  # sort the products based on corrected ratings
  ranked_products = ranked_products.sort_values('corrected_ratings', ascending=False)
  
  return ranked_products





<p><strong>Note:</strong> In the <strong>corrected rating formula above</strong>, we could <strong>add the quantity <code>1/np.sqrt(n)</code> instead of subtracting it to get more optimistic predictions</strong>. Here we <strong>subtract it</strong> because some products already have a rating of 5, and <strong>a rating cannot exceed 5</strong>.</p>
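<p>The effect of the correction can be checked on a couple of made-up products. The sketch below is only an illustration: the function name, the sample numbers, and the clipping at 5 in the optimistic variant are assumptions, not part of the notebook's <code>ranking_products</code> function.</p>

```python
import math

def corrected_rating(predicted, rating_count, optimistic=False):
    """Shift a predicted rating by 1/sqrt(rating_count).

    Subtracting (the default) penalizes products with few ratings.
    Adding gives the optimistic variant, clipped here to the 5-star
    maximum (the clipping is an assumption made for this sketch).
    """
    adjustment = 1 / math.sqrt(rating_count)
    if optimistic:
        return min(5.0, predicted + adjustment)
    return predicted - adjustment

# hypothetical products with the same predicted rating
few_ratings = corrected_rating(4.0, 3)     # penalty 1/sqrt(3), about 0.577
many_ratings = corrected_rating(4.0, 50)   # penalty 1/sqrt(50), about 0.141
print(round(few_ratings, 3), round(many_ratings, 3))
```

<p>With equal predicted ratings, the product with more ratings ends up ranked higher after the correction.</p>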


In [None]:


#Applying the ranking products function and sorting it based on corrected ratings. 
ranking_products(recommendations, final_rating)





<h3 id="Item-Item-Similarity-based-collaborative-filtering-recommendation-systems">Item Item Similarity-based collaborative filtering recommendation systems<a class="anchor-link" href="#Item-Item-Similarity-based-collaborative-filtering-recommendation-systems">¶</a></h3>



<ul>
<li>Above we have seen <strong>similarity-based collaborative filtering</strong> where similarity is computed <strong>between users</strong>. Now let us look at similarity-based collaborative filtering where similarity is computed <strong>between items</strong>. </li>
</ul>
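<p>As a rough picture of what "similarity between items" means, the sketch below computes a cosine similarity between two items using only the users who rated both. The toy ratings and the helper name are hypothetical, and the code is a simplified stand-in rather than the library's own implementation.</p>

```python
import math

# toy ratings: {item: {user: rating}} (hypothetical data)
ratings = {
    "item_a": {"u1": 5, "u2": 3, "u3": 4},
    "item_b": {"u1": 4, "u2": 2, "u4": 5},
    "item_c": {"u3": 1, "u4": 5},
}

def item_cosine(item_x, item_y):
    """Cosine similarity over users who rated both items; 0.0 if there are none."""
    common = ratings[item_x].keys() & ratings[item_y].keys()
    if not common:
        return 0.0
    dot = sum(ratings[item_x][u] * ratings[item_y][u] for u in common)
    norm_x = math.sqrt(sum(ratings[item_x][u] ** 2 for u in common))
    norm_y = math.sqrt(sum(ratings[item_y][u] ** 2 for u in common))
    return dot / (norm_x * norm_y)

sim_ab = item_cosine("item_a", "item_b")  # two users in common
sim_ac = item_cosine("item_a", "item_c")  # a single user in common
print(round(sim_ab, 4), sim_ac)
```

<p>Note that a single common user always yields a similarity of 1, which is one reason sparse data can make item similarities unreliable.</p>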


In [None]:


#Declaring the similarity options.
sim_options = {'name': 'cosine',
               'user_based': False}

#KNN algorithm is used to find desired similar items.
sim_item_item = KNNBasic(sim_options=sim_options, random_state=1, verbose=False)

# Train the algorithm on the trainset, and predict ratings for the testset
sim_item_item.fit(trainset)

# Let us compute precision@k, recall@k, and f_1 score with k =10.
precision_recall_at_k(sim_item_item)





<ul>
<li>The baseline model gives a good F_1 score of ~84%. We will try to <strong>improve this below by tuning the hyperparameters</strong> of this algorithm with GridSearchCV.</li>
</ul>



<p>Let's now <strong>predict a rating for a user with <code>userId=A3LDPF5FMB782Z</code> and <code>prod_id=1400501466</code></strong> as shown below. Here the user has already interacted with (rated) the product with productId "1400501466".</p>


In [None]:


#predicting rating for a sample user with an interacted product.
sim_item_item.predict("A3LDPF5FMB782Z", "1400501466", r_ui=5, verbose=True)





<p>Below we are <strong>predicting the rating for the same product, <code>prod_id=1400501466</code>, but for a user who has not interacted with it yet, i.e. <code>userId=A34BZM6S9L7QI4</code></strong></p>


In [None]:


#predicting rating for a sample user with a non interacted product.
sim_item_item.predict("A34BZM6S9L7QI4", "1400501466", verbose=True)





<h4 id="Improving-similarity-based-recommendation-system-by-tuning-its-hyper-parameters">Improving similarity-based recommendation system by tuning its hyper-parameters<a class="anchor-link" href="#Improving-similarity-based-recommendation-system-by-tuning-its-hyper-parameters">¶</a></h4>



<p>Below we will be <strong>tuning hyperparameters for the <code>KNNBasic</code> algorithms</strong>. Let's try to understand <strong>some of the hyperparameters</strong> of the KNNBasic algorithm:</p>



<ul>
<li><strong>k</strong> (int) – The (max) number of neighbors to take into account for aggregation. Default is 40.</li>
<li><strong>min_k</strong> (int) – The minimum number of neighbors to take into account for aggregation. If there are not enough neighbors, the prediction is set to the global mean of all ratings. Default is 1.</li>
<li><strong>sim_options</strong> (dict) – A dictionary of options for the similarity measure. Four similarity measures are available in surprise: <ul>
<li>cosine</li>
<li>msd (default)</li>
<li>pearson</li>
<li>pearson_baseline</li>
</ul>
</li>
</ul>
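<p>The roles of <code>k</code> and <code>min_k</code> can be sketched in a few lines of plain Python. This is a simplified illustration of a similarity-weighted neighbor average with a global-mean fallback, under the assumption that only neighbors with positive similarity count. The function name and sample values are hypothetical.</p>

```python
def knn_predict(neighbors, k, min_k, global_mean):
    """Predict a rating from (similarity, rating) pairs.

    neighbors holds one pair per item the user has already rated.
    """
    # keep at most the k most similar neighbors
    top_k = sorted(neighbors, key=lambda pair: pair[0], reverse=True)[:k]
    # only neighbors with positive similarity contribute
    used = [(sim, r) for sim, r in top_k if sim > 0]
    if len(used) < min_k:
        return global_mean  # too few neighbors: fall back to the global mean
    return sum(sim * r for sim, r in used) / sum(sim for sim, _ in used)

neighbors = [(0.9, 5), (0.5, 3), (0.1, 1), (0.0, 4)]
weighted = knn_predict(neighbors, k=2, min_k=1, global_mean=4.2)
fallback = knn_predict(neighbors, k=4, min_k=4, global_mean=4.2)
print(round(weighted, 4), fallback)
```

<p>A larger <code>min_k</code> makes the model fall back to the global mean more often, which trades personalization for stability on sparse data.</p>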



<h3 id="Question:-Hyperparameter-tuning-the-item-item-similarity-based-model-(6-marks)">Question: Hyperparameter tuning the item-item similarity-based model (6 marks)<a class="anchor-link" href="#Question:-Hyperparameter-tuning-the-item-item-similarity-based-model-(6-marks)">¶</a></h3><ul>
<li>Use the following values for the param_grid and tune the model (3 Marks)<ul>
<li>'k':[10, 20, 30]</li>
<li>'min_k': [3, 6, 9]</li>
<li>'sim_options': {'name': ['msd', 'cosine'], 'user_based': [False]}</li>
</ul>
</li>
<li>Use GridSearchCV() to tune the model using the 'rmse' measure (2 Marks)</li>
<li>Print the best score and best parameters (1 Mark)</li>
</ul>


In [None]:


# setting up parameter grid to tune the hyperparameters
param_grid = {'k': [10, 20, 30], 'min_k': [3, 6, 9],
              'sim_options': {'name': ['msd', 'cosine'],
                              'user_based': [False]}
              }

# performing 3-fold cross validation to tune the hyperparameters
# (cv must be set explicitly; surprise defaults to 5 folds when cv is omitted)
gs = GridSearchCV(KNNBasic, param_grid, measures=['rmse'], cv=3)

# fitting the data
gs.fit(data)

# Find the best RMSE score
print(gs.best_score['rmse'])

# Find the combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])





<p>Once the <strong>grid search</strong> is complete, we can get the <strong>optimal values for each of those hyperparameters as shown above</strong></p>
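<p>To get a feel for how much work the search does, the grid can be expanded by hand. The sketch below only mimics the enumeration of parameter combinations with <code>itertools.product</code>. It is an illustration, not how the library implements the search internally.</p>

```python
from itertools import product

param_grid = {'k': [10, 20, 30], 'min_k': [3, 6, 9],
              'sim_options': {'name': ['msd', 'cosine'],
                              'user_based': [False]}}

# expand the nested sim_options dict into a list of concrete dicts
sim_keys = list(param_grid['sim_options'])
sim_choices = [dict(zip(sim_keys, values))
               for values in product(*param_grid['sim_options'].values())]

# every (k, min_k, sim_options) combination is one candidate model
combinations = [{'k': k, 'min_k': min_k, 'sim_options': sim}
                for k, min_k, sim in product(param_grid['k'],
                                             param_grid['min_k'],
                                             sim_choices)]
print(len(combinations))
```

<p>Each combination is trained and scored once per fold, so the total number of model fits is the number of combinations times the number of folds.</p>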



<p>Now let's build the <strong>final model</strong> by using <strong>tuned values of the hyperparameters</strong> which we received by using grid search cross-validation</p>



<h3 id="Question:-Use-the-best-parameters-from-GridSearchCV-to-build-the-optimized-item-item-similarity-based-model.-Compare-the-performance-of-the-optimized-model-with-the-baseline-model.-(5-Marks)">Question: Use the best parameters from GridSearchCV to build the optimized item-item similarity-based model. Compare the performance of the optimized model with the baseline model. (5 Marks)<a class="anchor-link" href="#Question:-Use-the-best-parameters-from-GridSearchCV-to-build-the-optimized-item-item-similarity-based-model.-Compare-the-performance-of-the-optimized-model-with-the-baseline-model.-(5-Marks)">¶</a></h3>


In [None]:


# using the optimal similarity measure for item-item based collaborative filtering
sim_options = {'name': 'msd', 'user_based': False}

# creating an instance of KNNBasic with the tuned hyperparameter values
# (set k and min_k to the values printed by gs.best_params['rmse'] above)
sim_item_item_optimized = KNNBasic(sim_options=sim_options, k=3, min_k=9, random_state=1, verbose=False)

# training the algorithm on the trainset
sim_item_item_optimized.fit(trainset)

# Let us compute precision@k, recall@k, f1_score@k and RMSE for the optimized item-item model
precision_recall_at_k(sim_item_item_optimized)





<p><strong>Observations: All our metrics improved with this tuned model. The Precision, Recall and F_1 score all went up. The RMSE went down. </strong></p>



<p>Let us now predict the <strong>rating for a user with <code>userId=A3LDPF5FMB782Z</code> and for <code>prod_id=1400501466</code></strong> with the <strong>optimized model</strong> as shown below</p>


In [None]:


sim_item_item_optimized.predict("A3LDPF5FMB782Z", "1400501466", r_ui=5, verbose=True)





<p>Below we are <strong>predicting the rating</strong> for the same product, <code>prod_id=1400501466</code>, but for a user who <strong>has not interacted with it before</strong>, i.e. <strong><code>userId=A34BZM6S9L7QI4</code></strong>, by using the optimized model as shown below -</p>


In [None]:


sim_item_item_optimized.predict("A34BZM6S9L7QI4", "1400501466", verbose=True)





<h4 id="Identifying-similar-users-to-a-given-user-(nearest-neighbors)">Identifying similar items to a given item (nearest neighbors)<a class="anchor-link" href="#Identifying-similar-users-to-a-given-user-(nearest-neighbors)">¶</a></h4>



<p>We can also find the <strong>nearest neighbors</strong> of a given entry with this <strong>KNNBasic algorithm</strong>. Because this model was built with <code>user_based: False</code>, the neighbors are items rather than users. Below we find the 5 items most similar to the item with inner id 0, based on the <code>msd</code> similarity measure.</p>
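<p>Conceptually, a nearest-neighbor lookup just sorts one row of the similarity matrix. The sketch below illustrates this on a small hand-written matrix. The helper name and the matrix values are hypothetical, and the neighbors may be users or items depending on the <code>user_based</code> setting.</p>

```python
def nearest_neighbors(sim_matrix, inner_id, k):
    """Return the inner ids of the k entries most similar to inner_id."""
    others = [(sim, j) for j, sim in enumerate(sim_matrix[inner_id])
              if j != inner_id]  # an entry is not its own neighbor
    others.sort(key=lambda pair: pair[0], reverse=True)
    return [j for _, j in others[:k]]

# hypothetical symmetric similarity matrix for four entries
sim_matrix = [
    [1.0, 0.2, 0.8, 0.5],
    [0.2, 1.0, 0.4, 0.9],
    [0.8, 0.4, 1.0, 0.1],
    [0.5, 0.9, 0.1, 1.0],
]
top_two = nearest_neighbors(sim_matrix, 0, k=2)
print(top_two)
```

<p>The ids returned are inner ids. In surprise they can be mapped back to the original ids with the trainset's <code>to_raw_iid</code> (or <code>to_raw_uid</code>) method.</p>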


In [None]:


sim_item_item_optimized.get_neighbors(0, k=5)





<h4 id='Predicted-top-5-products-for-userId="A1A5KUIIIHFF4U"-with-similarity-based-recommendation-system'>Predicted top 5 products for userId="A1A5KUIIIHFF4U" with similarity based recommendation system<a class="anchor-link" href='#Predicted-top-5-products-for-userId="A1A5KUIIIHFF4U"-with-similarity-based-recommendation-system'>¶</a></h4>


In [None]:


#Making top 5 recommendations for user_id A1A5KUIIIHFF4U with similarity-based recommendation engine.
recommendations = get_recommendations(df_final, "A1A5KUIIIHFF4U", 5, sim_item_item_optimized)




In [None]:


#Building the dataframe for above recommendations with columns "prod_id" and "predicted_ratings"
pd.DataFrame(recommendations, columns=['prod_id', 'predicted_ratings'])




In [None]:


#Applying the ranking products function and sorting it based on corrected ratings. 
ranking_products(recommendations, final_rating)





<ul>
<li>Now that we have seen <strong>similarity-based collaborative filtering algorithms</strong>, let us move on to <strong>model-based collaborative filtering algorithms</strong>.</li>
</ul>



<h3 id="Model-Based-Collaborative-Filtering---Matrix-Factorization">Model Based Collaborative Filtering - Matrix Factorization<a class="anchor-link" href="#Model-Based-Collaborative-Filtering---Matrix-Factorization">¶</a></h3>



<p>Model-based Collaborative Filtering is a <strong>personalized recommendation system</strong>: the recommendations are based on the past behavior of the user and do not depend on any additional information. We use <strong>latent features</strong> to find recommendations for each user.</p>



<h4 id="Singular-Value-Decomposition-(SVD)">Singular Value Decomposition (SVD)<a class="anchor-link" href="#Singular-Value-Decomposition-(SVD)">¶</a></h4>



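<p>SVD is used to <strong>compute the latent features</strong> from the <strong>user-item matrix</strong>. Classical SVD, however, does not work when there are <strong>missing values</strong> in the <strong>user-item matrix</strong>, so the <code>SVD</code> algorithm in surprise instead learns the latent features from the observed ratings only.</p>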



<h3 id="Question:-Build-the-matrix-factorization-recommendation-system-(using-random_state=1)-and-provide-your-observations-on-performance-of-the-model-(4-Marks)">Question: Build the matrix factorization recommendation system (using random_state=1) and provide your observations on performance of the model (4 Marks)<a class="anchor-link" href="#Question:-Build-the-matrix-factorization-recommendation-system-(using-random_state=1)-and-provide-your-observations-on-performance-of-the-model-(4-Marks)">¶</a></h3>


In [None]:


# using SVD matrix factorization
svd = SVD(random_state=1)

# training the algorithm on the trainset
svd.fit(trainset)

# Use the function precision_recall_at_k to compute precision@k, recall@k, F1-Score@k, and RMSE
precision_recall_at_k(svd)





<p><strong>Observations: Precision, recall, and F_1 score are all about the same as the previous model. However, the RMSE has dropped from 0.9526 to 0.8882.</strong></p>



<ul>
<li>Let's now predict the rating for a user with <code>userId="A3LDPF5FMB782Z"</code> and <code>prod_id="1400501466"</code> as shown below</li>
<li>Here the user has already rated the product.</li>
</ul>


In [None]:


#Making prediction.
svd.predict("A3LDPF5FMB782Z", "1400501466", r_ui=5, verbose=True)





<p>Below we are predicting the rating for the same product, <code>productId=1400501466</code>, but for a user who has not interacted with it before, i.e. <code>userId=A34BZM6S9L7QI4</code>, as shown below -</p>


In [None]:


#Making prediction. 
svd.predict("A34BZM6S9L7QI4", "1400501466", verbose=True)





<h4 id="Improving-matrix-factorization-based-recommendation-system-by-tuning-its-hyper-parameters">Improving matrix factorization based recommendation system by tuning its hyper-parameters<a class="anchor-link" href="#Improving-matrix-factorization-based-recommendation-system-by-tuning-its-hyper-parameters">¶</a></h4>



<p>In SVD, rating is predicted as -</p>



$$\hat{r}_{u i}=\mu+b_{u}+b_{i}+q_{i}^{T} p_{u}$$



<p>If user $u$ is unknown, then the bias $b_{u}$ and the factors $p_{u}$ are assumed to be zero. The same applies for item $i$ with $b_{i}$ and $q_{i}$.</p>
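<p>The prediction rule and the unknown-user case can be traced with a few hand-picked numbers. This is a minimal sketch: the function name and all parameter values below are made up for illustration and are not taken from the fitted model.</p>

```python
def svd_predict(mu, b_u, b_i, p_u, q_i):
    """r_hat = mu + b_u + b_i + q_i . p_u"""
    return mu + b_u + b_i + sum(q * p for q, p in zip(q_i, p_u))

mu = 4.0  # hypothetical global mean rating

# known user and known item: biases and latent factors all contribute
known = svd_predict(mu, b_u=0.25, b_i=-0.5, p_u=[0.5, 1.0], q_i=[1.0, 0.5])

# unknown user: b_u and p_u are taken as zero, leaving mu + b_i
unknown_user = svd_predict(mu, b_u=0.0, b_i=-0.5, p_u=[0.0, 0.0], q_i=[1.0, 0.5])
print(known, unknown_user)
```

<p>For an unknown user the prediction reduces to the global mean plus the item bias, so every new user receives the same estimate for a given product.</p>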



<p>To estimate all the unknowns, we minimize the following regularized squared error:</p>



$$\sum_{r_{u i} \in R_{\text {train }}}\left(r_{u i}-\hat{r}_{u i}\right)^{2}+\lambda\left(b_{i}^{2}+b_{u}^{2}+\left\|q_{i}\right\|^{2}+\left\|p_{u}\right\|^{2}\right)$$



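<p>The minimization is performed by a very straightforward <strong>stochastic gradient descent</strong>, where $e_{u i}=r_{u i}-\hat{r}_{u i}$ is the prediction error, $\gamma$ is the learning rate, and $\lambda$ is the regularization term:</p>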



$$\begin{aligned} b_{u} &amp; \leftarrow b_{u}+\gamma\left(e_{u i}-\lambda b_{u}\right) \\ b_{i} &amp; \leftarrow b_{i}+\gamma\left(e_{u i}-\lambda b_{i}\right) \\ p_{u} &amp; \leftarrow p_{u}+\gamma\left(e_{u i} \cdot q_{i}-\lambda p_{u}\right) \\ q_{i} &amp; \leftarrow q_{i}+\gamma\left(e_{u i} \cdot p_{u}-\lambda q_{i}\right) \end{aligned}$$
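<p>The four update rules above translate almost line by line into code. The sketch below applies them to a single rating, assuming both factor updates use the values from before the step. The function name and the starting values are hypothetical.</p>

```python
def sgd_step(r_ui, mu, b_u, b_i, p_u, q_i, lr, reg):
    """Apply one SGD update for a single observed rating r_ui."""
    pred = mu + b_u + b_i + sum(q * p for q, p in zip(q_i, p_u))
    err = r_ui - pred  # e_ui in the update rules
    new_b_u = b_u + lr * (err - reg * b_u)
    new_b_i = b_i + lr * (err - reg * b_i)
    # both factor updates use the values from before this step
    new_p_u = [p + lr * (err * q - reg * p) for p, q in zip(p_u, q_i)]
    new_q_i = [q + lr * (err * p - reg * q) for p, q in zip(p_u, q_i)]
    return new_b_u, new_b_i, new_p_u, new_q_i, err

# hypothetical starting values for one user-item pair
state = dict(mu=4.0, b_u=0.0, b_i=0.0, p_u=[0.1, 0.1], q_i=[0.1, 0.1])
b_u, b_i, p_u, q_i, err_before = sgd_step(5.0, lr=0.1, reg=0.02, **state)

# a second call reports the error that remains after the first update
*_, err_after = sgd_step(5.0, 4.0, b_u, b_i, p_u, q_i, lr=0.1, reg=0.02)
print(err_before, err_after)
```

<p>The error shrinks after the update because every parameter moved in the direction that reduces it, while the <code>reg</code> term pulls the parameters slightly back toward zero.</p>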



<p>There are many hyperparameters to tune in this algorithm. You can find the full list of hyperparameters <a href="https://surprise.readthedocs.io/en/stable/matrix_factorization.html#surprise.prediction_algorithms.matrix_factorization.SVD">here</a>.</p>



<p>Below we will be tuning only three hyperparameters -</p>
<ul>
<li><strong>n_epochs</strong>: The number of iterations of the SGD algorithm</li>
<li><strong>lr_all</strong>: The learning rate for all parameters</li>
<li><strong>reg_all</strong>: The regularization term for all parameters</li>
</ul>
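<p>To show where each of these three hyperparameters enters the training loop, here is a deliberately simplified sketch that fits only the global mean and the user and item biases (latent factors are omitted for brevity). The data, function names, and values are hypothetical, and this is not the library's implementation.</p>

```python
import math

# hypothetical (user, item, rating) triples
data = [("u1", "i1", 5), ("u1", "i2", 3), ("u2", "i1", 4),
        ("u2", "i3", 2), ("u3", "i2", 4)]

def train_biases(data, n_epochs, lr_all, reg_all):
    """Fit mu + b_u + b_i by SGD on the observed ratings."""
    mu = sum(r for _, _, r in data) / len(data)
    b_u, b_i = {}, {}
    for _ in range(n_epochs):  # n_epochs: full passes over the data
        for u, i, r in data:
            bu, bi = b_u.get(u, 0.0), b_i.get(i, 0.0)
            err = r - (mu + bu + bi)
            # lr_all scales every update; reg_all shrinks values toward zero
            b_u[u] = bu + lr_all * (err - reg_all * bu)
            b_i[i] = bi + lr_all * (err - reg_all * bi)
    return mu, b_u, b_i

def rmse(data, mu, b_u, b_i):
    """Root mean squared error of the bias-only predictions."""
    squared = [(r - (mu + b_u.get(u, 0.0) + b_i.get(i, 0.0))) ** 2
               for u, i, r in data]
    return math.sqrt(sum(squared) / len(squared))

rmse_untrained = rmse(data, *train_biases(data, n_epochs=0, lr_all=0.01, reg_all=0.2))
rmse_trained = rmse(data, *train_biases(data, n_epochs=20, lr_all=0.01, reg_all=0.2))
print(rmse_untrained, rmse_trained)
```

<p>Here the error is measured on the training data itself, which is why a grid search with cross-validation is needed to choose values that also generalize.</p>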


In [None]:


# set the parameter space to tune
param_grid = {'n_epochs': [10, 20, 30], 'lr_all': [0.001, 0.005, 0.01],
              'reg_all': [0.2, 0.4, 0.6]}

# performing 3-fold gridsearch cross validation
gs_ = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=3, n_jobs=-1)

# fitting data
gs_.fit(data)

# best RMSE score
print(gs_.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs_.best_params['rmse'])





<p>Once the <strong>grid search</strong> is complete, we can get the <strong>optimal values</strong> for each of those hyperparameters as shown above</p>



<p>Now we will <strong>build the final model</strong> by using the <strong>tuned values</strong> of the hyperparameters which we received by using grid search cross-validation</p>



<h3 id="Question:-Fit-the-SVD-model-using-the-hyperparameters-from-GridSearchCV-(use-random_state=1)-and-compare-the-output-with-the-baseline-model-(5-Marks)">Question: Fit the SVD model using the hyperparameters from GridSearchCV (use random_state=1) and compare the output with the baseline model (5 Marks)<a class="anchor-link" href="#Question:-Fit-the-SVD-model-using-the-hyperparameters-from-GridSearchCV-(use-random_state=1)-and-compare-the-output-with-the-baseline-model-(5-Marks)">¶</a></h3>


In [None]:


# Build the optimized SVD model using optimal hyperparameter search
svd_optimized = SVD(n_epochs=20, lr_all=0.01, reg_all=0.2, random_state=1)

# Train the algorithm on the trainset
svd_optimized = svd_optimized.fit(trainset)

# Use the function precision_recall_at_k to compute precision@k, recall@k, F1-Score@k, and RMSE
precision_recall_at_k(svd_optimized)





<p><strong>Observations: After tuning the hyperparameters, the model hasn't improved much.</strong></p>



<p>Let's now predict a rating for a user with <code>userId=A3LDPF5FMB782Z</code> and <code>productId=1400501466</code> with the optimized model as shown below</p>



<h3 id="Question:">Question:<a class="anchor-link" href="#Question:">¶</a></h3><ul>
<li>Using the optimized svd model, predict the rating for the user with <code>userId</code>="A3LDPF5FMB782Z" and <code>prod_id</code>="1400501466" (1 Mark)</li>
<li>Predict the rating for the user with userId="A34BZM6S9L7QI4", who has not interacted with this product (prod_id="1400501466") before, by using the optimized model (1 Mark)</li>
<li>Compare the output with the output from the baseline model (2 Marks)</li>
</ul>


In [None]:


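#Use the svd_optimized model to predict the rating for userId "A3LDPF5FMB782Z" and productId "1400501466".
#The product id is passed as a string, as in the earlier predict calls, so that it matches the raw ids.
svd_optimized.predict("A3LDPF5FMB782Z", "1400501466", r_ui=5, verbose=True)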




In [None]:


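#Use the svd_optimized model to predict the rating for userId "A34BZM6S9L7QI4" and productId "1400501466".
#The product id is passed as a string, as in the earlier predict calls, so that it matches the raw ids.
svd_optimized.predict("A34BZM6S9L7QI4", "1400501466", verbose=True)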





<p><strong>Observations:</strong> We can see that the model does an okay job of predicting the estimates.</p>


In [None]:


df_final.head()





<h3 id="Question:">Question:<a class="anchor-link" href="#Question:">¶</a></h3><ul>
<li>Get 5 recommendations for the user with user_id = 'A2XIOXRRYX0KZY' using the svd_optimized model. Hint: Use get_recommendations() function (2 Marks)</li>
<li>Rank the recommendations on the basis of the corrected ratings. Hint: Use ranking_products() function (2 Marks)</li>
</ul>


In [None]:


#Get top 5 recommendations for user_id A2XIOXRRYX0KZY using "svd_optimized" algorithm.
svd_recommendations = get_recommendations(df_final, 'A2XIOXRRYX0KZY', 5, svd_optimized)




In [None]:


#Ranking products based on above recommendations
ranking_products(svd_recommendations, final_rating)





<h3 id="Conclusion-(5-Marks)">Conclusion (5 Marks)<a class="anchor-link" href="#Conclusion-(5-Marks)">¶</a></h3><p><strong>Conclusion:</strong>
We built recommendation systems using user-user similarity, item-item similarity, and personalized (matrix factorization) models.</p>
<p>Overall, the personalized recommendation system has given the best performance in terms of the F1-Score and the lowest RMSE.</p>
<p>We can try to further improve the performance of these models using hyperparameter tuning.</p>
<p>We can also combine different recommendation techniques to build hybrid recommendation systems.</p>
