For this assignment, we will practice collaborative filtering and the different model based recommendation methods. We will be giving book recommendation this time. The data set can be found [here](https://github.com/zygmuntz/goodbooks-10k).

The 4 methods we will use are as follows:
- User-Based Collaborative Filtering
- Item-Based Collaborative Filtering
- Matrix Factorization
- SVD++

In [1]:
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import euclidean_distances

In [2]:
# Load the datasets
books = pd.read_csv('books.csv') # Book metadata
ratings = pd.read_csv('ratings.csv') # User ratings

In [3]:
# Show you what the data looks like
books.head()

Unnamed: 0,book_id,goodreads_book_id,best_book_id,work_id,books_count,isbn,isbn13,authors,original_publication_year,original_title,...,ratings_count,work_ratings_count,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,image_url,small_image_url
0,1,2767052,2767052,2792775,272,439023483,9780439000000.0,Suzanne Collins,2008.0,The Hunger Games,...,4780653,4942365,155254,66715,127936,560092,1481305,2706317,https://images.gr-assets.com/books/1447303603m...,https://images.gr-assets.com/books/1447303603s...
1,2,3,3,4640799,491,439554934,9780440000000.0,"J.K. Rowling, Mary GrandPré",1997.0,Harry Potter and the Philosopher's Stone,...,4602479,4800065,75867,75504,101676,455024,1156318,3011543,https://images.gr-assets.com/books/1474154022m...,https://images.gr-assets.com/books/1474154022s...
2,3,41865,41865,3212258,226,316015849,9780316000000.0,Stephenie Meyer,2005.0,Twilight,...,3866839,3916824,95009,456191,436802,793319,875073,1355439,https://images.gr-assets.com/books/1361039443m...,https://images.gr-assets.com/books/1361039443s...
3,4,2657,2657,3275794,487,61120081,9780061000000.0,Harper Lee,1960.0,To Kill a Mockingbird,...,3198671,3340896,72586,60427,117415,446835,1001952,1714267,https://images.gr-assets.com/books/1361975680m...,https://images.gr-assets.com/books/1361975680s...
4,5,4671,4671,245494,1356,743273567,9780743000000.0,F. Scott Fitzgerald,1925.0,The Great Gatsby,...,2683664,2773745,51992,86236,197621,606158,936012,947718,https://images.gr-assets.com/books/1490528560m...,https://images.gr-assets.com/books/1490528560s...


In [4]:
ratings.head()

Unnamed: 0,user_id,book_id,rating
0,1,258,5
1,2,4081,4
2,2,260,5
3,2,9296,5
4,2,2318,3


There should be a total of 53424 unique users and 10000 books in this dataset. You can verify that if you wish.

## Preprocessing

The first step is to perform some preprocessing of the data. In particular, we will format the ratings data into the nice matrix we have seen in class. I will first merge the two files, so I will eliminate any ratings that does not have book metadata information (if any).

In [5]:
# Merge the two datasets
merged_data = pd.merge(books, ratings, on='book_id')[['user_id', 'book_id', 'rating', 'original_title']]

In [6]:
# Let's see what the merged data looks like
merged_data.head()

Unnamed: 0,user_id,book_id,rating,original_title
0,2886,1,5,The Hunger Games
1,6158,1,5,The Hunger Games
2,3991,1,4,The Hunger Games
3,5281,1,5,The Hunger Games
4,5721,1,5,The Hunger Games


It turns out that if we work with this data, you might run into memory issue. Hence I am going to keep only the user with ID less than or equal to 10000.

In [7]:
merged_data = merged_data[merged_data.user_id <= 10000]

#### Your tasks starts here. First create the rating matrix. Replace any missing values with 0 afterwards. 

In [8]:
rat = merged_data.copy()
#data['counts'] = np.ones(data.shape[0])
rat = rat.pivot_table(index = 'user_id', columns = 'book_id')
rat.columns = rat.columns.droplevel()  # drop the extra column name 'rating'
rat.fillna(0, inplace = True)   # replace missing value with 0
rat.head()

book_id,1,2,3,4,5,6,7,8,9,10,...,9991,9992,9993,9994,9995,9996,9997,9998,9999,10000
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,5.0,0.0,0.0,5.0,0.0,0.0,4.0,0.0,5.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,5.0,0.0,4.0,4.0,0.0,4.0,4.0,0.0,5.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## User-Based Collaborative Filtering
The first model to use will be the user-based collaborative filtering.

1. While this is not the best practice, I want you to use Euclidean distance to measure the similarity between users. Think carefully when you use this measure during the implementation.
2. Use 100 neighbors when calculating the predicted scores.
3. Give me the top 15 recommendations for user with user_id 1839. Give me the book titles and predicted ratings.
4. Also store the recommendations in a variable. We will compare this result with other models later.

In [9]:
# Calculate user similarity using Euclidean Distance
user_sim = pd.DataFrame(euclidean_distances(rat), index = rat.index, columns = rat.index)
user_sim 

user_id,1,2,3,4,5,6,7,8,9,10,...,9991,9992,9993,9994,9995,9996,9997,9998,9999,10000
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.000000,51.778374,42.918527,52.962251,57.026310,56.026779,62.016127,51.264022,52.239832,52.172790,...,52.962251,54.129474,51.874849,51.768716,72.601653,48.754487,52.459508,50.665570,53.385391,51.604263
2,51.778374,0.000000,39.937451,51.478151,54.763126,54.221767,59.991666,50.169712,50.833060,53.693575,...,50.119856,53.823787,51.439285,52.583267,68.992753,45.541190,49.264592,49.457052,53.525695,47.916594
3,42.918527,39.937451,0.000000,46.593991,45.077711,45.066617,51.264022,39.572718,44.147480,46.303348,...,40.261644,45.453273,43.231933,43.382024,64.521314,34.914181,44.452222,44.237993,45.011110,42.860238
4,52.962251,51.478151,46.593991,0.000000,60.852280,60.033324,61.830413,54.083269,54.129474,56.648036,...,56.797887,58.711157,55.910643,57.384667,72.194183,51.923020,48.218254,51.749396,57.922362,50.219518
5,57.026310,54.763126,45.077711,60.852280,0.000000,58.625933,64.015623,55.263008,58.352378,59.983331,...,54.525224,58.086143,57.818682,56.833089,73.082146,51.078371,58.906706,58.711157,58.154965,58.369513
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9996,48.754487,45.541190,34.914181,51.923020,51.078371,51.710734,57.140179,47.042534,48.187135,51.739733,...,47.010637,50.328918,47.265209,49.284886,66.558245,0.000000,49.909919,50.438081,47.968740,48.785244
9997,52.459508,49.264592,44.452222,48.218254,58.906706,57.192657,57.602083,50.139805,50.921508,53.009433,...,54.212545,58.258047,55.344376,54.166410,71.854019,49.909919,0.000000,51.913389,54.129474,51.951901
9998,50.665570,49.457052,44.237993,51.749396,58.711157,56.709788,62.072538,53.432200,52.668776,54.046276,...,54.589376,55.018179,52.535702,53.150729,73.293929,50.438081,51.913389,0.000000,55.865911,50.970580
9999,53.385391,53.525695,45.011110,57.922362,58.154965,58.455111,62.080593,55.533774,54.433446,57.515215,...,54.101756,56.035703,55.054518,56.409219,72.034714,47.968740,54.129474,55.865911,0.000000,53.413481


In [10]:
neighbors = 100 # Define the number of neighbors to use
nearest_neighbors = user_sim[1839].sort_values(ascending = False)[1:(neighbors+1)]   # select column '1839'

In [11]:
# Get predicted ratings for all unseen books
unseen_book_index = rat.columns[rat.loc[1839] == 0]
missing_ratings = []
for book_id in unseen_book_index:
    neighbors_ratings = rat.loc[nearest_neighbors.index,book_id]       # If it's item-base CF, loc[] need to be flipped --> loc[user_id, nearest_neighbors.index]
    missing_ratings.append(sum(nearest_neighbors * neighbors_ratings) / sum(nearest_neighbors))

missing_ratings = pd.Series(missing_ratings, index=unseen_book_index).sort_values(ascending = False)

In [12]:
# Get the top 15 recommendations 
top15_rating_ub  = pd.DataFrame(missing_ratings[:15])
top15_rating_ub.columns = ['pre_ratings']

In [13]:
# Get the book title and combine with the correspoding predicted rating
top15_rating_ub['Book_id'] = top15_rating_ub.index
top15_rating_ub  = pd.merge(top15_rating_ub, books,on='book_id')
top15_rating_ub  = top15_rating_ub [['Book_id','original_title', 'pre_ratings']]
top15_rating_ub.set_index('Book_id', inplace=True)
top15_rating_ub = top15_rating_ub.rename(columns={'original_title': 'original_title_ub', 'pre_ratings': 'pre_ratings_ub'})
top15_rating_ub 

Unnamed: 0_level_0,original_title_ub,pre_ratings_ub
Book_id,Unnamed: 1_level_1,Unnamed: 2_level_1
23,Harry Potter and the Chamber of Secrets,2.013137
59,Charlotte's Web,1.993616
18,Harry Potter and the Prisoner of Azkaban,1.860408
2,Harry Potter and the Philosopher's Stone,1.858571
24,Harry Potter and the Goblet of Fire,1.845202
15,Het Achterhuis: Dagboekbrieven 14 juni 1942 - ...,1.826924
21,Harry Potter and the Order of the Phoenix,1.752573
27,Harry Potter and the Half-Blood Prince,1.71361
32,Of Mice and Men,1.709761
7,The Hobbit or There and Back Again,1.654587


## Item-Based Collaborative Filtering
Next we will use item-based collaborative filtering. 

1. This time I want you to use cosine similarity to measure the similarity between items.
2. Use 100 neighbors when calculating the predicted scores.
3. Give me the top 15 recommendations for user with user_id 1839. Give me the book titles and predicted ratings.
4. Also store the recommendations in a variable.

In [14]:
# Calculate item similarity using cosine similarity
item_sim = cosine_similarity(rat.T)
item_sim = pd.DataFrame(item_sim, index = rat.columns, columns = rat.columns)
item_sim

book_id,1,2,3,4,5,6,7,8,9,10,...,9991,9992,9993,9994,9995,9996,9997,9998,9999,10000
book_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,1.000000,0.426320,0.469690,0.378218,0.341510,0.397019,0.294236,0.289858,0.325832,0.325946,...,0.016655,0.017877,0.024360,0.050888,0.037132,0.012379,0.006301,0.013394,0.057893,0.001413
2,0.426320,1.000000,0.486069,0.542027,0.456410,0.188067,0.511171,0.445516,0.445610,0.467768,...,0.024669,0.000000,0.013380,0.035157,0.032404,0.009075,0.005433,0.028263,0.022892,0.030204
3,0.469690,0.486069,1.000000,0.384906,0.303699,0.226173,0.283795,0.279462,0.367358,0.387361,...,0.011719,0.000000,0.007863,0.035419,0.021788,0.021900,0.015733,0.006696,0.025669,0.000000
4,0.378218,0.542027,0.384906,1.000000,0.594790,0.208651,0.451769,0.594925,0.375186,0.488981,...,0.049607,0.004629,0.024440,0.015482,0.053446,0.004258,0.015499,0.011847,0.037395,0.017243
5,0.341510,0.456410,0.303699,0.594790,1.000000,0.186000,0.420847,0.607781,0.320497,0.464527,...,0.049721,0.000000,0.015070,0.019910,0.060987,0.003833,0.025701,0.013794,0.031513,0.031184
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9996,0.012379,0.009075,0.021900,0.004258,0.003833,0.010635,0.000000,0.007059,0.005018,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,1.000000,0.000000,0.000000,0.000000,0.000000
9997,0.006301,0.005433,0.015733,0.015499,0.025701,0.000000,0.019745,0.018004,0.000000,0.016680,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,1.000000,0.107720,0.000000,0.042117
9998,0.013394,0.028263,0.006696,0.011847,0.013794,0.004239,0.046626,0.019787,0.017226,0.017967,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.107720,1.000000,0.000000,0.000000
9999,0.057893,0.022892,0.025669,0.037395,0.031513,0.054846,0.009122,0.027701,0.007501,0.020706,...,0.000000,0.000000,0.000000,0.000000,0.012210,0.000000,0.000000,0.000000,1.000000,0.000000


In [15]:
# Try to use function for this part
def IBCF(userid, n_neighbors, top_n, similarity):
    '''
    Input:
    userid: The user of interest
    n_neighbors: Number of neighbors for similarity count
    top_n: Top n recommendations to return
    similarity: The similarity matrix
    
    Output: 
    The top n recommendations
    '''
    
    # Obtain unseen movie indices
    unseen_book_index = rat.columns[rat.loc[userid] == 0]
    missing_ratings = []
    # Calculate predicted rating for each new movie
    for book_id in unseen_book_index:
        nearest_neighbors = similarity[book_id].sort_values(ascending = False)[1:(n_neighbors+1)]
        neighbors_ratings = rat.loc[userid, nearest_neighbors.index]
        missing_ratings.append(sum(nearest_neighbors * neighbors_ratings) / sum(nearest_neighbors))
    
    # Sort the predictions
    missing_ratings = pd.Series(missing_ratings, index=unseen_book_index).sort_values(ascending = False)
    
    # Extract only the top n movies
    recommend_books = pd.DataFrame(missing_ratings[:top_n])
    recommend_books.columns = ['pre_ratings']
    
    # Merge the predicted results with the books table to get the book titles
    recommend_books['Book_id'] = recommend_books.index
    recommend_books = pd.merge(recommend_books,books,on='book_id')
    recommend_books = recommend_books[['Book_id','original_title', 'pre_ratings']]
    recommend_books.set_index('Book_id', inplace=True)
    recommend_books = recommend_books.rename(columns={'original_title': 'original_title_ib', 'pre_ratings': 'pre_ratings_ib'})
    
    return recommend_books
    
#     # Print the recommendations
#     for i in range(top_n):
#         rec_book = recommend_books[i]
#         print("my number ", i+1, " recommendation is ", titles[titles['id'] == str(rec_book)]['title'].values[0], 
#               ", with a predicted rating of", missing_ratings.iloc[i])

In [16]:
top15_rating_ib = IBCF(1839, 100, 15, item_sim)
top15_rating_ib

Unnamed: 0_level_0,original_title_ib,pre_ratings_ib
Book_id,Unnamed: 1_level_1,Unnamed: 2_level_1
7213,,1.137208
6185,Secret Prey,1.106118
8853,Sudden Prey,1.096361
9388,Night Prey,1.081804
9468,Mortal Prey,1.068413
8824,Mind Prey,1.067748
5109,Chosen Prey,1.012271
6172,Heat Lightning,0.922163
5284,Bad Blood,0.818423
6698,,0.78646


## Matrix Factorization
Now we will turn to model based methods. First we will look at Matrix Factorization. You can use the code I presented in class. 

1. Use 3 latent factors.
2. Set the learning rate at 0.001 and beta at 0.01. Since it will take a while to run, 10 iterations will be fine.
3. Fit the model (it will take a while to run).
4. Give me the top 15 recommendations for user with user_id 1839. Return boths book names and predicted ratings.
5. Store the recommendations in a variable.

In [17]:
def matrix_factorization(R, P, Q, K, steps=10, alpha=0.001, beta=0.01):
    '''
    Inputs:
    R     : The ratings (of dimension M x N)
    P     : an initial matrix of dimension M x K
    Q     : an initial matrix of dimension N x K
    K     : the number of latent features
    steps : the maximum number of steps to perform the optimization
    alpha : the learning rate
    beta  : the regularization parameter

    Outputs:
    the final matrices P and Q
    '''

    for step in range(steps):
        for i in range(R.shape[0]):     # 10000
            for j in range(R.shape[1]):   # 9963
                if R[i][j] > 0: # Skipping over missing ratings (the one replaced with 0)
                    eij = R[i][j] - np.dot(P[i,:],Q[:,j])
                    for k in range(K):
                        P[i][k] = P[i][k] + alpha * (2 * eij * Q[k][j] - beta * P[i][k])
                        Q[k][j] = Q[k][j] + alpha * (2 * eij * P[i][k] - beta * Q[k][j])
        eR = np.dot(P,Q)
        e = 0
        for i in range(R.shape[0]):
            for j in range(R.shape[1]):
                if R[i][j] > 0:
                    e = e + pow(R[i][j] - np.dot(P[i,:],Q[:,j]), 2)
                    for k in range(K):
                        e = e + (beta/2) * ( pow(P[i][k],2) + pow(Q[k][j],2) )
        if e < 0.001: # tolerance
            break
    return P, Q

In [18]:
np.random.seed(862)

# Initializations
M = rat.shape[0] # Number of users
N = rat.shape[1] # Number of items
K = 3 # Number of latent features

# Initial estimate of P and Q
P = np.random.rand(M,K)
Q = np.random.rand(K,N)
rating_np = np.array(rat)

In [19]:
# Fitting
P, Q = matrix_factorization(rating_np, P, Q, K)

In [20]:
predicted_rating = np.matmul(P, Q)
predicted_rating = pd.DataFrame(predicted_rating, index = rat.index, columns = rat.columns)
predicted_rating_1839 = predicted_rating.loc[1839]
predicted_rating_1839

book_id
1        4.149693
2        3.926659
3        3.333679
4        4.012280
5        3.547584
           ...   
9996     2.323818
9997     3.659235
9998     3.489974
9999     3.473351
10000    3.439645
Name: 1839, Length: 9963, dtype: float64

In [21]:
unseen_book_index = rat.columns[rat.loc[1839] == 0]        # get the book index which user did not rate yet
missing_ratings = []

# find the predicted rating results for the missing ratings
for i in unseen_book_index:
    missing_ratings.append(predicted_rating_1839[i])
    
missing_ratings = pd.Series(missing_ratings, index=unseen_book_index).sort_values(ascending = False)    # sort the rating value
top15_rating_mf = pd.DataFrame(missing_ratings[:15])     # get the top 15 ratings
top15_rating_mf.columns = ['pre_ratings']

# Get the book title and combine with the correspoding predicted rating
top15_rating_mf['Book_id'] = top15_rating_mf.index
top15_rating_mf = pd.merge(top15_rating_mf,books,on='book_id')
top15_rating_mf = top15_rating_mf[['Book_id','original_title', 'pre_ratings']]
top15_rating_mf.set_index('Book_id', inplace=True)
top15_rating_mf = top15_rating_mf.rename(columns={'original_title': 'original_title_mf', 'pre_ratings': 'pre_ratings_mf'})
top15_rating_mf 

Unnamed: 0_level_0,original_title_mf,pre_ratings_mf
Book_id,Unnamed: 1_level_1,Unnamed: 2_level_1
8946,دیوان‎‎ [Dīvān],4.858059
4868,Jesus the Christ: A Study of the Messiah and H...,4.842447
3628,The Complete Calvin and Hobbes,4.794923
7401,The Brothers K,4.762942
3491,Just Mercy: A Story of Justice and Redemption,4.706392
6590,The Authoritative Calvin and Hobbes,4.701359
1010,The Essential Calvin and Hobbes: A Calvin and ...,4.673762
6920,The Indispensable Calvin and Hobbes: A Calvin ...,4.672812
6902,Standing for Something: 10 Neglected Virtues T...,4.668288
5207,The Days Are Just Packed: A Calvin and Hobbes ...,4.662562


## SVD++

While we briefly introduced the SVD++ model in class, we didn't see how to use that in Python. Here is your chance to practice this. First, you will need to install the [surprise](http://surpriselib.com/) library (if you havn't yet).

In lecture we described the factorization algorithm as SVD++. However, the surprise library called it SVD instead (and use SVD++ for a different yet similar algorithm). Your task here is to implement the [SVD](https://surprise.readthedocs.io/en/stable/matrix_factorization.html#surprise.prediction_algorithms.matrix_factorization.SVD) algorithm from the surprise library. I will walk you through as much as I can.

In order to use the surprise library, we need to first put the data into its accepted format. [Here](https://surprise.readthedocs.io/en/stable/getting_started.html#load-dom-dataframe-py) is an example on how it work. In general, you need to do the following:

1. Set up a Reader class
2. Load the dataframe 
3. Build the data set using the build_full_trainset() method (see [here](https://surprise.readthedocs.io/en/stable/trainset.html) or [here](https://stackoverflow.com/questions/49263964/datasetautofolds-object-has-no-attribute-global-mean-on-python-surprise))


In [22]:
# Load the libraries
from surprise import Reader
from surprise import Dataset
from surprise.prediction_algorithms.matrix_factorization import SVD

In [23]:
# Step 1: Set up the reader class
reader = Reader(rating_scale=(1,5))

In [24]:
# Step 2: Load the dataframe. Use the merged data from above (not the pivoted data)
data = Dataset.load_from_df(merged_data[['user_id', 'book_id', 'rating']], reader)

In [25]:
# Step 3: Build the train set
svd_data = data.build_full_trainset()

Now we have prepared the data set, you task is then to build the model. I have already imported the SVD algorithm for you. The usage is similar to any sklearn model: you first instantiate a model and set any hyperparamters, then but the model. For this model, use 5 latent factors, a learning rate of 0.01 for all parameters, and a regularization parameter of 0.1 for all parameters. Set a random state of 862.

In [26]:
model = SVD(n_factors=5, lr_all=0.01, reg_all=0.1, random_state=862)
model.fit(svd_data)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7fe9d6318dd0>

Now we have fitted the model, we can perform prediction. There are severals you can do this:

1. Calculate the individual ratings $r_{ui}$ by using the given equation in lecture or [here](https://surprise.readthedocs.io/en/stable/matrix_factorization.html#surprise.prediction_algorithms.matrix_factorization.SVD)
2. Calculate the overall rating matrix by doing some matrix multiplications and manipulations
3. Probably the easiest, is to use the predict function (see an example [here](https://surprise.readthedocs.io/en/stable/getting_started.html#predict-ratings2-py) and [here](https://predictivehacks.com/how-to-run-recommender-systems-in-python/). You may not need to use the str() function)


I will let you decide which you want to do, but the goal is the same, provide the top 15 recommendations (based on the predicted ratings) for user with user_id 1839. Show me the recommendations and the predicted values. Store the recommendations.

In [27]:
# # get the list of the ids that the userid 1839 has rated
# bids1839 = merged_data.loc[merged_data['user_id']==1839, 'book_id']

# # remove the rated books for the recommendations
# books_to_predict = np.setdiff1d(1839,bids1839)

In [28]:
books_to_predict = rat.columns[rat.loc[1839] == 0]   # Get the index of books which are not predicted by user yet 

my_recs = []
for bid in books_to_predict:
    my_recs.append((bid, model.predict(uid=1839,iid=bid).est))
    
top15_rating_svd = pd.DataFrame(my_recs, columns=['book_id', 'pre_ratings']).sort_values('pre_ratings', ascending=False).head(15)

In [29]:
# Get the book title and combine with the correspoding predicted rating
top15_rating_svd = pd.merge(top15_rating_svd,books,on='book_id')
top15_rating_svd = top15_rating_svd[['book_id','original_title', 'pre_ratings']]
top15_rating_svd.set_index('book_id', inplace=True)
top15_rating_svd = top15_rating_svd.rename(columns={'original_title': 'original_title_svd', 'pre_ratings': 'pre_ratings_svd'})
top15_rating_svd 

Unnamed: 0_level_0,original_title_svd,pre_ratings_svd
book_id,Unnamed: 1_level_1,Unnamed: 2_level_1
3628,The Complete Calvin and Hobbes,4.630109
8946,دیوان‎‎ [Dīvān],4.607762
7029,I Want My Hat Back,4.595853
4868,Jesus the Christ: A Study of the Messiah and H...,4.568464
7883,The Sandman: King of Dreams,4.565016
5919,,4.525111
4653,,4.508285
9076,Preach My Gospel (A Guide to Missionary Service),4.505468
6361,There's Treasure Everywhere: A Calvin and Hobb...,4.479181
6089,الرحيق المختوم: بحث في السيرة النبوية على صاح...,4.478206


## Comparison

We have tried to provide recommendations to user 1839 using 4 methods. You last task is to put these 4 recommendations in a dataframe, with the column names the methods you used, and print out the dataframe.

In [30]:
# Adjust the dataframes from 4 methods
top15_rating_ub = top15_rating_ub.reset_index().drop(columns=['Book_id'])
top15_rating_ib = top15_rating_ib.reset_index().drop(columns=['Book_id'])
top15_rating_mf = top15_rating_mf.reset_index().drop(columns=['Book_id'])
top15_rating_svd = top15_rating_svd.reset_index().drop(columns=['book_id'])

In [31]:
top15_rating = [top15_rating_ub, top15_rating_ib, top15_rating_mf, top15_rating_svd]  # List of dataframes
recommendations = pd.concat(top15_rating, axis=1)
recommendations

Unnamed: 0,original_title_ub,pre_ratings_ub,original_title_ib,pre_ratings_ib,original_title_mf,pre_ratings_mf,original_title_svd,pre_ratings_svd
0,Harry Potter and the Chamber of Secrets,2.013137,,1.137208,دیوان‎‎ [Dīvān],4.858059,The Complete Calvin and Hobbes,4.630109
1,Charlotte's Web,1.993616,Secret Prey,1.106118,Jesus the Christ: A Study of the Messiah and H...,4.842447,دیوان‎‎ [Dīvān],4.607762
2,Harry Potter and the Prisoner of Azkaban,1.860408,Sudden Prey,1.096361,The Complete Calvin and Hobbes,4.794923,I Want My Hat Back,4.595853
3,Harry Potter and the Philosopher's Stone,1.858571,Night Prey,1.081804,The Brothers K,4.762942,Jesus the Christ: A Study of the Messiah and H...,4.568464
4,Harry Potter and the Goblet of Fire,1.845202,Mortal Prey,1.068413,Just Mercy: A Story of Justice and Redemption,4.706392,The Sandman: King of Dreams,4.565016
5,Het Achterhuis: Dagboekbrieven 14 juni 1942 - ...,1.826924,Mind Prey,1.067748,The Authoritative Calvin and Hobbes,4.701359,,4.525111
6,Harry Potter and the Order of the Phoenix,1.752573,Chosen Prey,1.012271,The Essential Calvin and Hobbes: A Calvin and ...,4.673762,,4.508285
7,Harry Potter and the Half-Blood Prince,1.71361,Heat Lightning,0.922163,The Indispensable Calvin and Hobbes: A Calvin ...,4.672812,Preach My Gospel (A Guide to Missionary Service),4.505468
8,Of Mice and Men,1.709761,Bad Blood,0.818423,Standing for Something: 10 Neglected Virtues T...,4.668288,There's Treasure Everywhere: A Calvin and Hobb...,4.479181
9,The Hobbit or There and Back Again,1.654587,,0.78646,The Days Are Just Packed: A Calvin and Hobbes ...,4.662562,الرحيق المختوم: بحث في السيرة النبوية على صاح...,4.478206
