# Practice Project - Recommendation Systems: Amazon Product Reviews

## **Marks: 40**

__________

Welcome to the Practice Project on Recommendation Systems. We will work with the Amazon product reviews dataset for this project. The dataset contains ratings of different electronic products. It does not include information about the products or reviews to avoid bias while building the model. 

--------------
### Context: 
--------------

E-commerce websites like Amazon and Flipkart use different recommendation models to provide personalized suggestions to different users. Amazon currently uses item-to-item collaborative filtering, which scales to massive data sets and produces high-quality recommendations in real time.

----------------
### Objective:
----------------

Build a recommendation system to recommend products to customers based on their previous ratings for other products.

--------------
### Dataset:
--------------

The Amazon dataset contains the following attributes:

- **userId:** Every user identified with a unique id
- **productId:** Every product identified with a unique id
- **Rating:** Rating of the corresponding product by the corresponding user
- **timestamp:** Time of the rating (ignore this column for this exercise)

In [None]:
# uncomment if you are using google colab

# from google.colab import drive
# drive.mount('/content/drive')

### Importing Libraries
#### One of the first steps in any data science task is importing the necessary tools you will use.

In [None]:
# installing surprise library, only do it for first time
!pip install surprise

In [None]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from surprise import accuracy

# class is used to parse a file containing ratings, data should be in structure - user ; item ; rating
from surprise.reader import Reader

# class for loading datasets
from surprise.dataset import Dataset

# for tuning model hyperparameters
from surprise.model_selection import GridSearchCV

# for splitting the rating data in train and test dataset
from surprise.model_selection import train_test_split

# for implementing similarity based recommendation system
from surprise.prediction_algorithms.knns import KNNBasic

# for implementing matrix factorization based recommendation system
from surprise.prediction_algorithms.matrix_factorization import SVD

from collections import defaultdict

# for implementing cross validation
from surprise.model_selection import KFold

### Loading the data

In [None]:
#col_names = ['user_id', 'item_id', 'rating', 'timestamp']
rating = pd.read_csv('ratings_Electronics.csv', names=['user_id', 'item_id', 'rating', 'timestamp'])
rating = rating.drop('timestamp', axis=1)

Let's check the **info** of the data

In [None]:
rating.info()

- There are **7824482 observations** in the data. The raw file has **4 columns**, of which 3 remain after loading
- The timestamp column is stored as int64 rather than a DateTime type. We could convert it, but **we don't need timestamp for our analysis**, which is why **this column was dropped** right after loading

In [None]:
#Dropping timestamp column
#rating = rating.drop(['timestamp'], axis=1)

### **Question 1: Exploring the dataset (7 Marks)**

#### Q 1.1 Print the top 5 rows of the dataset (1 Mark)

In [None]:
#printing the top 5 rows of the dataset Hint use .head()
#remove _______- and complete the code
______________________

#### Q 1.2 Describe the distribution of ratings. (1 Mark)

In [None]:
plt.figure(figsize = (12, 4))

#remove ____________ and complete the code
sns.countplot(_______________)

plt.tick_params(labelsize = 10)
plt.title("Distribution of Ratings ", fontsize = 10)
plt.xlabel("Ratings", fontsize = 10)
plt.ylabel("Number of Ratings", fontsize = 10)
plt.ticklabel_format(useOffset=False, style='plain', axis='y')
plt.show()

**Write your Answer here:_______**

In [None]:
df=rating.copy()

**As this dataset is very large, with 7824482 observations, it is not computationally feasible to build a model on all of it. Moreover, many users have rated only a few products, and many products have been rated by very few users. Hence we can reduce the dataset using a few logical assumptions.**

Here, we will keep the users who have given at least 50 ratings and the products that have at least 5 ratings, since when we shop online we prefer products that already have some number of ratings.
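The two-step filter described above can also be written with pandas counting helpers instead of explicit loops. The sketch below runs on a made-up toy table with small, hypothetical cutoffs (2 and 2 rather than 50 and 5), so the result can be checked by hand. It is an illustration of the idea, not a replacement for the cells that follow.

```python
import pandas as pd

# Toy ratings table; the column names match the notebook, the values are made up.
toy = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2", "u2", "u3"],
    "item_id": ["a", "b", "c", "a", "b", "a"],
    "rating":  [5, 4, 3, 4, 5, 2],
})

MIN_USER_RATINGS = 2   # hypothetical cutoff (the notebook uses 50)
MIN_ITEM_RATINGS = 2   # hypothetical cutoff (the notebook uses 5)

# Step 1: keep only users with at least MIN_USER_RATINGS ratings.
user_counts = toy["user_id"].value_counts()
keep_users = user_counts[user_counts >= MIN_USER_RATINGS].index
filtered = toy[toy["user_id"].isin(keep_users)]

# Step 2: among the remaining rows, keep only items with at least MIN_ITEM_RATINGS ratings.
item_counts = filtered["item_id"].value_counts()
keep_items = item_counts[item_counts >= MIN_ITEM_RATINGS].index
filtered = filtered[filtered["item_id"].isin(keep_items)]

print(filtered)
```

Note that the item counts are computed after the user filter, which mirrors the order used in the notebook.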

In [None]:
# Get the column containing the users
users = rating.user_id
# Create a dictionary from users to their number of ratings
ratings_count = dict()
for user in users:
    # If we already have the user, just add 1 to their rating count
    if user in ratings_count:
        ratings_count[user] += 1
    # Otherwise, set their rating count to 1
    else:
        ratings_count[user] = 1    

In [None]:
# We want our users to have at least 50 ratings to be considered
RATINGS_CUTOFF = 50
remove_users = []
for user, num_ratings in ratings_count.items():
    if num_ratings < RATINGS_CUTOFF:
        remove_users.append(user)
rating = rating.loc[~rating.user_id.isin(remove_users)]

In [None]:
rating.shape

In [None]:
# Get the column containing the items
items = rating.item_id
# Create a dictionary from items to their number of ratings
ratings_count = dict()
for item in items:
    # If we already have the item, just add 1 to its rating count
    if item in ratings_count:
        ratings_count[item] += 1
    # Otherwise, set its rating count to 1
    else:
        ratings_count[item] = 1

In [None]:
# We want our items to have at least 5 ratings to be considered
RATINGS_CUTOFF = 5
remove_items = []
for item, num_ratings in ratings_count.items():
    if num_ratings < RATINGS_CUTOFF:
        remove_items.append(item)
rating = rating.loc[~rating.item_id.isin(remove_items)]

In [None]:
rating.shape

#### Q 1.3 What is the total number of unique users and unique items? (1 Mark)

In [None]:
#Finding number of unique users
#remove _______- and complete the code

rating['user_id']._____________

**Write your Answer here:**

- There are **1540 users** in the dataset

In [None]:
#Finding number of unique items
#remove _______- and complete the code
rating['item_id'].__________

**Write your Answer here:_____________**

#### Q 1.4 Is there any item that has been interacted with more than once by the same user? (1 Mark)

In [None]:
rating.groupby(['user_id', 'item_id']).count()

In [None]:
rating.groupby(['user_id', 'item_id']).count()['rating'].sum()

**Write your Answer here:____________**

#### Q 1.5 Which one is the most interacted item in the dataset? (1 Mark)

In [None]:
#remove _______- and complete the code
rating['item_id']._________________

**Write your Answer here:______________**

In [None]:
#Plotting distributions of ratings for 74 interactions with itemid B0088CJT4U
plt.figure(figsize=(7,7))

rating[rating['item_id'] == 'B0088CJT4U']['rating'].value_counts().plot(kind='bar')

plt.xlabel('Rating')

plt.ylabel('Count')

plt.show()

**Write your Answer here:______**

#### Q 1.6 Which user interacted the most with any item in the dataset? (1 Mark)

In [None]:
#remove _______- and complete the code
rating['user_id'].value_counts()

**Write your Answer here:___________**

#### Q 1.7 What is the distribution of the user-item interactions in this dataset? (1 Mark)

In [None]:
#Finding user-item interactions distribution
count_interactions = rating.groupby('user_id').count()['item_id']
count_interactions

In [None]:
#Plotting user-item interactions distribution

plt.figure(figsize=(15,7))

#remove _______- and complete the code
sns.histplot(____________)

plt.xlabel('Number of Interactions by Users')

plt.show()

**Write your Answer here:_________**

#### As we have now explored the data, let's start building Recommendation systems

### **Question 2: Create Rank-Based Recommendation System (3 Marks)**

### Model 1: Rank Based Recommendation System

Rank-based recommendation systems provide recommendations based on the most popular items. This kind of recommendation system is useful when we have **cold start** problems. Cold start refers to the situation where a new user enters the system and the machine is not able to recommend items to that user, because the user has no historical interactions in the dataset. In those cases, we can use a rank-based recommendation system to recommend items to the new user.

To build the rank-based recommendation system, we take the **average** of all the ratings provided to each item and then rank the items by their average rating.

In [None]:
#remove _______- and complete the code

#Calculating average ratings
average_rating = rating.groupby('item_id').___________

#Calculating the count of ratings
count_rating = rating.groupby('item_id')._____________

#Making a dataframe with the count and average of ratings
final_rating = pd.DataFrame({'avg_rating':average_rating, 'rating_count':count_rating})

In [None]:
final_rating.head()

Now, let's create a function to find the **top n items** for a recommendation based on the average ratings of items. We can also add a **threshold for a minimum number of interactions** for an item to be considered for recommendation.

In [None]:
def top_n_items(data, n, min_interaction=30):
    
    #Finding items with more than min_interaction ratings
    recommendations = data[data['rating_count'] > min_interaction]
    
    #Sorting values w.r.t average rating 
    recommendations = recommendations.sort_values(by='avg_rating', ascending=False)
    
    return recommendations.index[:n]

We can **use this function with different n's and minimum interactions** to get items to recommend
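To see why the interaction threshold matters, here is a small sketch on a made-up summary table shaped like `final_rating`. The helper `top_n_toy` and the table `toy_final` are hypothetical names used only for this illustration; the filtering and sorting logic follows `top_n_items` above.

```python
import pandas as pd

# Hypothetical per-item summary in the same shape as final_rating (values are made up).
toy_final = pd.DataFrame(
    {"avg_rating": [5.0, 4.6, 4.2, 3.9], "rating_count": [2, 40, 120, 75]},
    index=pd.Index(["A", "B", "C", "D"], name="item_id"),
)

def top_n_toy(data, n, min_interaction):
    # Same logic as top_n_items: keep items rated more than min_interaction times,
    # then sort by average rating, highest first.
    eligible = data[data["rating_count"] > min_interaction]
    return list(eligible.sort_values(by="avg_rating", ascending=False).index[:n])

# With a low threshold, item A leads even though only 2 users rated it.
print(top_n_toy(toy_final, n=2, min_interaction=1))

# With a higher threshold, item A is filtered out before ranking.
print(top_n_toy(toy_final, n=2, min_interaction=30))
```

A perfect average built on very few ratings is unreliable, which is the reason for ranking only items that clear the threshold.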

#### Recommending top 5 items with 50 minimum interactions based on popularity

In [None]:
#remove _______- and complete the code
list(top_n_items(_____________))

#### Now that we have seen how to apply the Rank-Based Recommendation System, let's apply Collaborative Filtering Based Recommendation System

### Model 2: Collaborative Filtering Based Recommendation System (7 Marks)

In this type of recommendation system, `we do not need any information` about the users or items. We only need user item interaction data to build a collaborative recommendation system. For example - 
<ol>
    <li><b>Ratings</b> provided by users. For example - ratings of books on Goodreads, movie ratings on IMDb etc</li>
    <li><b>Likes</b> of users on different Facebook posts, likes on YouTube videos</li>
    <li><b>Use/buying</b> of a product by users. For example - buying different items on e-commerce sites</li>
    <li><b>Reading</b> of articles by readers on various blogs</li>
</ol>

#### Types of Collaborative Filtering

- Similarity/Neighborhood based
- Model based

Below we build a similarity based recommendation system that uses `cosine` similarity and KNN to find the users that are the nearest neighbors of a given user.
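As a rough illustration of what cosine similarity between two users means, the sketch below compares rating dictionaries in plain Python. The function name, the toy users, and the choice to compare only items that both users have rated are assumptions of this sketch; it is not the library's implementation.

```python
import math

def cosine_on_common_items(ratings_u, ratings_v):
    """Cosine similarity of two users over the items both have rated."""
    common = set(ratings_u) & set(ratings_v)
    if not common:
        return 0.0  # assumption: no shared items means no measurable similarity
    dot = sum(ratings_u[i] * ratings_v[i] for i in common)
    norm_u = math.sqrt(sum(ratings_u[i] ** 2 for i in common))
    norm_v = math.sqrt(sum(ratings_v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

# Made-up users: item id -> rating
alice = {"item1": 5, "item2": 3, "item3": 4}
bob = {"item1": 4, "item2": 2, "item4": 5}
carol = {"item1": 1, "item2": 5}

print(cosine_on_common_items(alice, bob))    # similar tastes on item1 and item2
print(cosine_on_common_items(alice, carol))  # opposite tastes on the same items
```

Alice and Bob rank the shared items the same way, so their similarity is higher than that of Alice and Carol, and Bob would count as the nearer neighbor of Alice.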

We will be using a new library, `surprise`, to build the remaining models. The necessary classes and functions from this library were imported at the top of this notebook.

Below we are loading the `rating` dataset, which is a pandas dataframe, into a different format called `surprise.dataset.DatasetAutoFolds` which is required by this library. To do this we will be using the classes `Reader` and `Dataset`

In [None]:
from sklearn.preprocessing import LabelEncoder
df=rating[['user_id','item_id']].apply(LabelEncoder().fit_transform)
df['rating']=rating['rating']
df.head()

#### Making the dataset into surprise dataset and splitting it into train and test set

In [None]:
# instantiating Reader scale with expected rating scale
reader = Reader(rating_scale=(0, 5))

# loading the rating dataset
data = Dataset.load_from_df(df[['user_id', 'item_id', 'rating']], reader)

# splitting the data into train and test dataset
trainset, testset = train_test_split(data, test_size=0.7, random_state=42)

### Now we are ready to build the first baseline similarity based recommendation system using cosine similarity and KNN

In [None]:
#remove _______- and complete the code

algo_knn_user = KNNBasic(__________________)

# Train the algorithm on the trainset, and predict ratings for the testset
algo_knn_user.fit(__________)
predictions = algo_knn_user.test(__________)

# Then compute RMSE
accuracy.rmse(_______________)

#### Q 3.1 What is the RMSE for baseline user based collaborative filtering recommendation system? (1 Mark)

**Write your Answer here:__________**

#### Q 3.2 What is the predicted rating for the user with userId=0 for itemId=3906 and for itemId=100? (1 Mark)

Let us now predict the rating for the user with `userId=0` and `itemId=3906`, as shown below

In [None]:
#remove _______- and complete the code
algo_knn_user.predict(____,____, r_ui=4, verbose=True)

**Write your Answer here:________**

Below we predict the rating for the same `userId=0` but for an item with which this user has not interacted before, i.e. `itemId=100`, as shown below - 

In [None]:
algo_knn_user.predict(0,100, verbose=True)

**Write your Answer here:_____**

### Improving similarity based recommendation system by tuning its hyper-parameters

Below we will tune the hyperparameters of the `KNNBasic` algorithm. Let's try to understand the different hyperparameters of the KNNBasic algorithm - 

- **k** (int) – The (max) number of neighbors to take into account for aggregation. Default is 40.
- **min_k** (int) – The minimum number of neighbors to take into account for aggregation. If there are not enough neighbors, the prediction is set to the global mean of all ratings. Default is 1.
- **sim_options** (dict) – A dictionary of options for the similarity measure. There are four similarity measures available in surprise - 
    - cosine
    - msd (default)
    - pearson
    - pearson_baseline

For more details, please refer to the official documentation: https://surprise.readthedocs.io/en/stable/knn_inspired.html
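The roles of `k` and `min_k` described above can be sketched in a few lines of plain Python. This is a simplified illustration that follows the description in the list, with a hypothetical function name and made-up neighbor data; it is not the source code of `KNNBasic`.

```python
import heapq

def knn_basic_estimate(neighbors, k, min_k, global_mean):
    """neighbors: list of (similarity, rating) pairs for users who rated the item."""
    # Keep at most the k most similar neighbors.
    top_k = heapq.nlargest(k, neighbors, key=lambda pair: pair[0])
    # Assumption of this sketch: only positively similar neighbors contribute.
    used = [(sim, r) for sim, r in top_k if sim > 0]
    if not used or len(used) < min_k:
        return global_mean  # too few neighbors: fall back to the global mean
    weighted_sum = sum(sim * r for sim, r in used)
    return weighted_sum / sum(sim for sim, _ in used)

# Made-up neighbors of one user for one item: (similarity, rating given)
neighbors = [(0.9, 5), (0.8, 4), (0.1, 1)]

print(knn_basic_estimate(neighbors, k=2, min_k=1, global_mean=4.0))  # two closest only
print(knn_basic_estimate(neighbors, k=3, min_k=1, global_mean=4.0))  # all three
print(knn_basic_estimate(neighbors, k=3, min_k=4, global_mean=4.0))  # falls back to 4.0
```

A small `k` trusts only the closest neighbors, a large `k` averages over more (and less similar) users, and a large `min_k` makes the model fall back to the global mean more often.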

#### Q 3.3 Perform hyperparameter tuning for the baseline user based collaborative filtering recommendation system and find the RMSE for the tuned user based collaborative filtering recommendation system. (3 Marks)

In [None]:
#remove _______- and complete the code

# setting up parameter grid to tune the hyperparameters
param_grid = {__________}

# performing 3-fold cross validation to tune the hyperparameters
gs = GridSearchCV(_________, param_grid, measures=['rmse', 'mae'], cv=3, n_jobs=-1)

# fitting the data
gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

Once the grid search is complete, we can get the optimal values for each of those hyperparameters as shown above

Below we inspect the evaluation metrics, RMSE and MAE, on each split to see how each hyperparameter value affects the results

In [None]:
results_df = pd.DataFrame.from_dict(gs.cv_results)
results_df.head()

Now we will build the final model using the tuned hyperparameter values that we obtained from the grid search cross validation

In [None]:
#remove _______- and complete the code

# using the optimal similarity measure for user-user based collaborative filtering
# creating an instance of KNNBasic with optimal hyperparameter values
similarity_algo_optimized = KNNBasic(___________,verbose=False)

# training the algorithm on the trainset
similarity_algo_optimized.fit(___________)

# predicting ratings for the testset
predictions = similarity_algo_optimized.test(__________)

# computing RMSE on testset
accuracy.rmse(predictions)

**Write your Answer here:__________**


#### Q 3.4 What is the predicted rating for the user with userId=0 for itemId=3906 and for itemId=100 using tuned user based collaborative filtering? (1 Mark)

Let us now predict the rating for the user with `userId=0` and `itemId=3906` with the optimized model, as shown below

In [None]:
#remove _______- and complete the code
similarity_algo_optimized.predict(___,___, r_ui=4, verbose=True)

**Write your Answer here:__________**

Below we predict the rating for the same `userId=0` but for an item with which this user has not interacted before, i.e. `itemId=100`, using the optimized model as shown below - 

In [None]:
#remove _______- and complete the code
similarity_algo_optimized.predict(__,___, verbose=True)

**Write your Answer here:__________**

#### Identifying similar users to a given user (nearest neighbors)

We can also find the users most similar to a given user, i.e. its nearest neighbors, with this KNNBasic model. Below we find the 5 users most similar to `userId=0` based on the `msd` distance metric

In [None]:
similarity_algo_optimized.get_neighbors(0, k=5)

#### Implementing the recommendation algorithm based on optimized KNNBasic model

Below we will be implementing a function where the input parameters are - 

- data: a rating dataset
- user_id: the user id for which we want the recommendations
- top_n: the number of items we want to recommend
- algo: the algorithm we want to use to predict the ratings

In [None]:
def get_recommendations(data, user_id, top_n, algo):
    
    # creating an empty list to store (item id, predicted rating) tuples
    recommendations = []
    
    # creating a user-item interactions matrix
    user_item_interactions_matrix = data.pivot(index='user_id', columns='item_id', values='rating')
    
    # extracting those item ids which the user_id has not interacted with yet
    non_interacted_items = user_item_interactions_matrix.loc[user_id][user_item_interactions_matrix.loc[user_id].isnull()].index.tolist()
    
    # looping through each item id which user_id has not interacted with yet
    for item_id in non_interacted_items:
        
        # predicting the rating of this non-interacted item for this user
        est = algo.predict(user_id, item_id).est
        
        # appending the item id with its predicted rating
        recommendations.append((item_id, est))

    # sorting the predicted ratings in descending order
    recommendations.sort(key=lambda x: x[1], reverse=True)

    return recommendations[:top_n] # returning top n highest predicted rating items for this user

#### Predicted top 5 items for userId=4 with similarity based recommendation system

In [None]:
#remove _______- and complete the code
recommendations = get_recommendations(df,__,___,__________)

#### Q 3.5 Predict the top 5 items for userId=4 with similarity based recommendation system (1 Mark)

In [None]:
recommendations

### Model 3: Item Based Collaborative Filtering Recommendation System (7 Marks)

In [None]:
#remove _______- and complete the code

#defining similarity measure
sim_options = {__________}

#defining Nearest neighbour algorithm
algo_knn_item = KNNBasic(__________,verbose=False)

# Train the algorithm on the trainset or fitting the model on train dataset 
algo_knn_item.fit(___________)

#predict ratings for the testset
predictions = algo_knn_item.test(testset)

# Then compute RMSE
accuracy.rmse(predictions)

#### Q 4.1 What is the RMSE for baseline item based collaborative filtering recommendation system ? (1 Mark)

**Write your Answer here:______**

#### Let us now predict the ratings for the user with `userId=0` for `itemId=3906` and `itemId=100`

In [None]:
#remove _______- and complete the code
algo_knn_item.predict(___, ___, r_ui=4, verbose=True)

As we can see - the actual rating for this user-item pair is 4 and predicted rating is 4.29 by this similarity based baseline model

#### Let's predict the rating for the same `userId=0` but for an item with which this user has not interacted before, i.e. `itemId=22607`

In [None]:
#remove _______- and complete the code
algo_knn_item.predict(___,___, verbose=True)

As we can see the estimated rating for this user-item pair is 5.00 based on this similarity based baseline model

#### Q 4.3 Perform hyperparameter tuning for the baseline item based collaborative filtering recommendation system and find the RMSE for the tuned item based collaborative filtering recommendation system. (3 Marks)

In [None]:
#remove _______- and complete the code

# setting up parameter grid to tune the hyperparameters
param_grid = {__________}

# performing 3-fold cross validation to tune the hyperparameters
grid_obj = GridSearchCV(________,_________, measures=['rmse', 'mae'], cv=3)

# fitting the data
grid_obj.fit(_________)

# best RMSE score
print(grid_obj.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(grid_obj.best_params['rmse'])

Once the grid search is complete, we can get the optimal values for each of those hyperparameters as shown above:

Below we inspect the evaluation metrics, RMSE and MAE, on each split to see how each hyperparameter value affects the results

In [None]:
results_df = pd.DataFrame.from_dict(grid_obj.cv_results)
results_df.head()

In [None]:
#remove _______- and complete the code

# creating an instance of KNNBasic with optimal hyperparameter values
similarity_algo_optimized_item = KNNBasic(sim_options=________, k=__, min_k=_,verbose=False)

# training the algorithm on the trainset
similarity_algo_optimized_item.fit(___________)

# predicting ratings for the testset
predictions = similarity_algo_optimized_item.test(____________)

# computing RMSE on testset
accuracy.rmse(predictions)

**Write your Answer here:__________**

#### Q 4.4 What is the predicted rating for the user with userId=0 for itemId=3906 and for itemId=100 using tuned item based collaborative filtering? (1 Mark)

#### Let us now predict the rating for the user with `userId=0` and `itemId=3906` with the optimized model, as shown below

In [None]:
#remove _______- and complete the code
similarity_algo_optimized_item.predict(__,__, r_ui=4, verbose=True)

**Write your Answer here:______**

#### Let's predict the rating for the same `userId=0` but for an item with which this user has not interacted before, i.e. `itemId=100`, using the optimized model:

In [None]:
#remove _______- and complete the code
similarity_algo_optimized_item.predict(___,___, verbose=True)

**Write your Answer here:_______**

#### Identifying similar items to a given item (nearest neighbors)
Because this model is item based, `get_neighbors` returns the nearest items rather than the nearest users. Below we find the 5 items most similar to the item with inner id 4 based on the `msd` distance metric

In [None]:
#remove _______- and complete the code
similarity_algo_optimized_item.get_neighbors(___, k=5)

#### Predicted top 5 items for userId=4 with similarity based recommendation system

In [None]:
#remove _______- and complete the code
recommendations = get_recommendations(df,___,___, similarity_algo_optimized_item)

#### Q 4.5 Predict the top 5 items for userId=4 with similarity based recommendation system ? (1 Mark) 

In [None]:
recommendations

### Model 4: Model Based Collaborative Filtering - Matrix Factorization using SVD (7 Marks)

Model-based Collaborative Filtering is a **personalized recommendation system**: the recommendations are based on the past behavior of the user and do not depend on any additional information. We use **latent features** to find recommendations for each user.

**Latent Features:** The features that are not present in the empirical data but can be inferred from the data. For example, a user's preference for a particular brand or price range is never recorded directly, but it can be inferred from the pattern of that user's ratings.

#### Singular Value Decomposition (SVD)

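SVD is used to compute the latent features from the user-item matrix that we learned about earlier. However, classical SVD does not work when there are missing values in the user-item matrix, so the factors are instead learned from the observed ratings only, as described in the tuning section below.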

#### Building a baseline matrix factorization recommendation system

In [None]:
#remove _______- and complete the code

# using SVD matrix factorization
algo_svd = SVD()

# training the algorithm on the trainset
algo_svd.fit(_________)

# predicting ratings for the testset
predictions = algo_svd.test(___________)

# computing RMSE on the testset
accuracy.rmse(predictions)

#### Q 5.1 What is the RMSE for baseline SVD based collaborative filtering recommendation system? (1 Mark)

**Write your Answer here:_________**

#### Q 5.2 What is the predicted rating for the user with userId=0 for itemId=3906 and for itemId=100? (1 Mark)

Let us now predict the rating for the user with `userId=0` and `itemId=3906`, as shown below

In [None]:
#remove _______- and complete the code
algo_svd.predict(____,____, r_ui=4, verbose=True)

**Write your Answer here:_______**

Below we predict the rating for the same `userId=0` but for an item with which this user has not interacted before, i.e. `itemId=100`, as shown below - 

In [None]:
#remove _______- and complete the code
algo_svd.predict(____,___, verbose=True)

**Write your Answer here:______**

#### Improving matrix factorization based recommendation system by tuning its hyper-parameters

In SVD, rating is predicted as - 

$$\hat{r}_{u i}=\mu+b_{u}+b_{i}+q_{i}^{T} p_{u}$$

If user $u$ is unknown, then the bias $b_{u}$ and the factors $p_{u}$ are assumed to be zero. The same applies for item $i$ with $b_{i}$ and $q_{i}$.

To estimate all the unknowns, we minimize the following regularized squared error:

$$\sum_{r_{u i} \in R_{\text {train }}}\left(r_{u i}-\hat{r}_{u i}\right)^{2}+\lambda\left(b_{i}^{2}+b_{u}^{2}+\left\|q_{i}\right\|^{2}+\left\|p_{u}\right\|^{2}\right)$$

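The minimization is performed by a very straightforward **stochastic gradient descent**, where $e_{u i}=r_{u i}-\hat{r}_{u i}$ is the prediction error, $\gamma$ is the learning rate, and $\lambda$ is the regularization strength: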

$$\begin{aligned} b_{u} & \leftarrow b_{u}+\gamma\left(e_{u i}-\lambda b_{u}\right) \\ b_{i} & \leftarrow b_{i}+\gamma\left(e_{u i}-\lambda b_{i}\right) \\ p_{u} & \leftarrow p_{u}+\gamma\left(e_{u i} \cdot q_{i}-\lambda p_{u}\right) \\ q_{i} & \leftarrow q_{i}+\gamma\left(e_{u i} \cdot p_{u}-\lambda q_{i}\right) \end{aligned}$$
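To make the prediction formula and the update rules concrete, here is a minimal NumPy sketch that applies them repeatedly to a single made-up rating. The function names, the starting values, and the two-dimensional factors are assumptions chosen for illustration; `gamma` and `lam` play the roles of $\gamma$ and $\lambda$ above. It is not the library's training loop, which iterates over all ratings in the training set.

```python
import numpy as np

def predict(mu, b_u, b_i, p_u, q_i):
    # r_hat = mu + b_u + b_i + q_i . p_u
    return mu + b_u + b_i + float(np.dot(q_i, p_u))

def sgd_step(r_ui, mu, b_u, b_i, p_u, q_i, gamma, lam):
    """Apply the four update rules once for a single observed rating r_ui."""
    e_ui = r_ui - predict(mu, b_u, b_i, p_u, q_i)
    new_b_u = b_u + gamma * (e_ui - lam * b_u)
    new_b_i = b_i + gamma * (e_ui - lam * b_i)
    # Both factor updates use the old values of p_u and q_i.
    new_p_u = p_u + gamma * (e_ui * q_i - lam * p_u)
    new_q_i = q_i + gamma * (e_ui * p_u - lam * q_i)
    return new_b_u, new_b_i, new_p_u, new_q_i

# Made-up starting point: global mean 4.0, zero biases, small random-looking factors.
mu = 4.0
b_u, b_i = 0.0, 0.0
p_u = np.array([0.1, -0.2])
q_i = np.array([0.3, 0.1])
r_ui = 5.0  # the observed rating we want to fit

error_before = abs(r_ui - predict(mu, b_u, b_i, p_u, q_i))
for _ in range(20):
    b_u, b_i, p_u, q_i = sgd_step(r_ui, mu, b_u, b_i, p_u, q_i, gamma=0.01, lam=0.2)
error_after = abs(r_ui - predict(mu, b_u, b_i, p_u, q_i))

print(error_before, error_after)
```

The absolute error on this rating shrinks over the 20 steps. A larger `gamma` moves faster but can overshoot, and a larger `lam` pulls the biases and factors back toward zero.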

There are many hyperparameters to tune in this algorithm, you can find a full list of hyperparameters [here](https://surprise.readthedocs.io/en/stable/matrix_factorization.html#surprise.prediction_algorithms.matrix_factorization.SVD)

Below we will be tuning only three hyperparameters -
- **n_epochs**: The number of iterations of the SGD algorithm
- **lr_all**: The learning rate for all parameters
- **reg_all**: The regularization term for all parameters

#### Q 5.3 Perform hyperparameter tuning for the baseline SVD based collaborative filtering recommendation system and find the RMSE for the tuned SVD based collaborative filtering recommendation system. (3 Marks)

In [None]:
#remove _______- and complete the code

# set the parameter space to tune
param_grid = {'n_epochs': [10, 20, 30], 'lr_all': [0.001, 0.005, 0.01],
              'reg_all': [0.2, 0.4, 0.6]}

# performing 3-fold gridsearch cross validation
gs = GridSearchCV(_______, param_grid, measures=['rmse', 'mae'], cv=3, n_jobs=-1)

# fitting data
gs.fit(_______)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

Once the grid search is complete, we can get the optimal values for each of those hyperparameters as shown above

Below we inspect the evaluation metrics, RMSE and MAE, on each split to see how each hyperparameter value affects the results

In [None]:
results_df = pd.DataFrame.from_dict(gs.cv_results)
results_df.head()

Now we will build the final model using the tuned hyperparameter values that we obtained from the grid search cross validation

In [None]:
#remove _______- and complete the code

# building the optimized SVD model using optimal hyperparameter search
svd_algo_optimized = SVD(n_epochs=20, lr_all=0.01, reg_all=0.2)

# training the algorithm on the trainset
svd_algo_optimized.fit(_______)

# predicting ratings for the testset
predictions = svd_algo_optimized.test(_________)

# computing RMSE
accuracy.rmse(predictions)

#### Q 5.4 What is the predicted rating for the user with userId=0 for itemId=3906 and for itemId=100 using SVD based collaborative filtering? (1 Mark)

Let us now predict the rating for the user with `userId=0` and `itemId=3906` with the optimized model, as shown below

In [None]:
#remove _______- and complete the code
svd_algo_optimized.predict(___,___, r_ui=4, verbose=True)

**Write your Answer here:_________**

In [None]:
#remove _______- and complete the code
svd_algo_optimized.predict(___,___, verbose=True)

#### Q 5.5 Predict the top 5 items for userId=4 with the SVD based recommendation system

In [None]:
get_recommendations(df,___,___, svd_algo_optimized)

### Predicting ratings for already interacted items

Below we compare the predicted ratings with the actual ratings for the items that a user has already interacted with. This will help us understand how good our predictions are compared to the actual ratings provided by users

In [None]:
def predict_already_interacted_ratings(data, user_id, algo):
    
    # creating an empty list to store (item id, actual rating, predicted rating) tuples
    recommendations = []
    
    # creating a user-item interactions matrix
    user_item_interactions_matrix = data.pivot(index='user_id', columns='item_id', values='rating')
    
    # extracting those item ids which the user_id has interacted with already
    interacted_items = user_item_interactions_matrix.loc[user_id][user_item_interactions_matrix.loc[user_id].notnull()].index.tolist()
    
    # looping through each item id which user_id has interacted with already
    for item_id in interacted_items:
        
        # extracting the actual rating
        actual_rating = user_item_interactions_matrix.loc[user_id, item_id]
        
        # predicting the rating for this already interacted item
        predicted_rating = algo.predict(user_id, item_id).est
        
        # appending the actual and predicted ratings
        recommendations.append((item_id, actual_rating, predicted_rating))

    # sorting by actual rating in descending order
    recommendations.sort(key=lambda x: x[1], reverse=True)

    # returning the actual and predicted ratings of all interacted items as a dataframe
    return pd.DataFrame(recommendations, columns=['itemId', 'actual_rating', 'predicted_rating'])

Here we are comparing the predicted ratings by `similarity based recommendation` system against actual ratings for `userId=4`

In [None]:
predicted_ratings_for_interacted_items = predict_already_interacted_ratings(df,4, similarity_algo_optimized)
# named plot_data so that the surprise dataset stored in `data` is not overwritten
plot_data = predicted_ratings_for_interacted_items.melt(id_vars='itemId', value_vars=['actual_rating', 'predicted_rating'])
sns.displot(data=plot_data, x='value', hue='variable', kde=True);

**Write your Answer here:**

We can see that the distribution of predicted ratings closely follows the distribution of actual ratings. The predicted ratings fall into more bins than the actual ratings. This is expected, as actual ratings always take discrete values like 1, 2, 3, 4, 5, whereas predicted ratings can take continuous values because we aggregate the ratings of the nearest neighbors of a given user. Overall, the predictions look good compared to the distribution of actual ratings.

Below we are comparing the predicted ratings by `matrix factorization based recommendation` system against actual ratings for `userId=4`

In [None]:
predicted_ratings_for_interacted_items = predict_already_interacted_ratings(df,4, svd_algo_optimized)
# named plot_data so that the surprise dataset stored in `data` is not overwritten
plot_data = predicted_ratings_for_interacted_items.melt(id_vars='itemId', value_vars=['actual_rating', 'predicted_rating'])
sns.displot(data=plot_data, x='value', hue='variable', kde=True);

In [None]:
# instantiating Reader scale with expected rating scale
reader = Reader(rating_scale=(0, 5))

# loading the rating dataset
data = Dataset.load_from_df(rating[['user_id', 'item_id', 'rating']], reader)

# splitting the data into train and test dataset
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)

## Precision and Recall @ k

RMSE is not the only metric we can use here. We can also examine two fundamental measures, precision and recall. Both are computed at a cutoff k, meaning that only the k items with the highest predicted ratings for each user are considered as recommendations.

See the Precision and Recall @ k section of your notebook and follow the instructions to compute various precision/recall values at various values of k.

To learn more about precision and recall in recommendation systems, refer to these links:

https://surprise.readthedocs.io/en/stable/FAQ.html

https://medium.com/@m_n_malaeb/recall-and-precision-at-k-for-recommender-systems-618483226c54

### Question 6: Compute the precision and recall for each of the 4 models at k = 5 and 10. This gives 2 x 2 x 4 = 16 numerical values. Do you notice anything interesting about these values? (4 Marks)

In [None]:
from collections import defaultdict
# This function can be found in the Surprise documentation FAQ
def precision_recall_at_k(predictions, k=10, threshold=3.5):
    """Return precision and recall at k metrics for each user"""

    # First map the predictions to each user.
    user_est_true = defaultdict(list)
    for uid, _, true_r, est, _ in predictions:
        user_est_true[uid].append((est, true_r))

    precisions = dict()
    recalls = dict()
    for uid, user_ratings in user_est_true.items():

        # Sort user ratings by estimated value
        user_ratings.sort(key=lambda x: x[0], reverse=True)

        # Number of relevant items
        n_rel = sum((true_r >= threshold) for (_, true_r) in user_ratings)

        # Number of recommended items in top k
        n_rec_k = sum((est >= threshold) for (est, _) in user_ratings[:k])

        # Number of relevant and recommended items in top k
        n_rel_and_rec_k = sum(((true_r >= threshold) and (est >= threshold))
                              for (est, true_r) in user_ratings[:k])

        # Precision@K: Proportion of recommended items that are relevant
        # When n_rec_k is 0, Precision is undefined. We here set it to 0.

        precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 0

        # Recall@K: Proportion of relevant items that are recommended
        # When n_rel is 0, Recall is undefined. We here set it to 0.

        recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 0

    return precisions, recalls

In [None]:
# A basic cross-validation iterator
from surprise.model_selection import KFold
kf = KFold(n_splits=5)

# Make list of k values
K = [5, 10]

# Remove the blank (_______) and complete the code
# Make list of models
models = [_________________________________________________________]

for k in K:
    for model in models:
        print('> k={}, model={}'.format(k,model.__class__.__name__))
        p = []
        r = []
        for trainset, testset in kf.split(data):
            model.fit(trainset)
            predictions = model.test(testset, verbose=False)
            precisions, recalls = precision_recall_at_k(predictions, k=k, threshold=3.5)

            # Precision and recall can then be averaged over all users
            p.append(sum(prec for prec in precisions.values()) / len(precisions))
            r.append(sum(rec for rec in recalls.values()) / len(recalls))
        print('-----> Precision: ', round(sum(p) / len(p), 3))
        print('-----> Recall: ', round(sum(r) / len(r), 3))

### Question 7 (5 Marks)
#### 7.1 Compare the results from the baseline user-user and item-item based models.
#### 7.2 How do these baseline models compare with the tuned user-user and item-item models?
#### 7.3 The matrix factorization model is different from the collaborative filtering models. Briefly describe this difference. Also, compare the RMSE and the precision and recall for the models.
#### 7.4 Does it improve? Can you offer any reasoning as to why that might be?

**Write your Answer here:__________**

### Conclusions

In this case study, we saw three different ways of building recommendation systems: 
- rank-based using averages
- similarity-based collaborative filtering
- model-based (matrix factorization) collaborative filtering

We also looked at the advantages and disadvantages of these recommendation systems and when to use each kind. Once we have built these recommendation systems, we can use **A/B Testing** to measure their effectiveness.

Here is an article explaining how [Amazon uses **A/B Testing**](https://aws.amazon.com/blogs/machine-learning/using-a-b-testing-to-measure-the-efficacy-of-recommendations-generated-by-amazon-personalize/) to measure the effectiveness of its recommendation systems.
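As a rough illustration of how an A/B test result might be evaluated, the sketch below runs a two-proportion z-test on click-through counts for two recommenders. All counts are hypothetical, the choice of click-through rate as the metric is an assumption, and this is one common way to compare two variants, not the method described in the linked article.

```python
import math

# Hypothetical A/B test outcome: users shown each recommender and how many clicked
users_a, clicks_a = 10_000, 520   # variant A, e.g. similarity-based
users_b, clicks_b = 10_000, 585   # variant B, e.g. matrix factorization

rate_a = clicks_a / users_a
rate_b = clicks_b / users_b

# Pooled click rate under the null hypothesis of no difference
pooled = (clicks_a + clicks_b) / (users_a + users_b)
std_err = math.sqrt(pooled * (1 - pooled) * (1 / users_a + 1 / users_b))

# z-statistic for the difference between the two click rates
z = (rate_b - rate_a) / std_err

# Two-sided p-value from the standard normal distribution
p_value = math.erfc(abs(z) / math.sqrt(2))

print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

A small p-value suggests that the difference in click rates is unlikely to be due to chance alone. In practice, the metric, sample size, and significance level should be fixed before the test starts.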