### Introduction

This notebook demonstrates several approaches to book recommendation.
It applies memory-based and model-based Collaborative Filtering (CF) and produces recommendations for user 1839 with each of the 4 methods.
    1. Memory-based -> Neighborhood-based CF
    a. User-based with Euclidean distance measure
    b. Item-based with cosine similarity measure
    
    2. Model-based -> Matrix Factorization based CF
    a. Matrix Factorization
    b. SVD++
The data set can be found [here](https://github.com/zygmuntz/goodbooks-10k).
#### Result 
The comparison shows that the model-based recommendations from Matrix Factorization and SVD++ partly agree (8 out of 15 match).
However, the memory-based recommendations match neither each other nor the model-based recommendations.

In [1]:
import numpy as np
import pandas as pd

In [2]:
# Load the datasets
books = pd.read_csv('books.csv') # Book metadata
ratings = pd.read_csv('ratings.csv') # User ratings

In [3]:
# Show what the data looks like
books.head()

Unnamed: 0,book_id,goodreads_book_id,best_book_id,work_id,books_count,isbn,isbn13,authors,original_publication_year,original_title,...,ratings_count,work_ratings_count,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,image_url,small_image_url
0,1,2767052,2767052,2792775,272,439023483,9780439000000.0,Suzanne Collins,2008.0,The Hunger Games,...,4780653,4942365,155254,66715,127936,560092,1481305,2706317,https://images.gr-assets.com/books/1447303603m...,https://images.gr-assets.com/books/1447303603s...
1,2,3,3,4640799,491,439554934,9780440000000.0,"J.K. Rowling, Mary GrandPré",1997.0,Harry Potter and the Philosopher's Stone,...,4602479,4800065,75867,75504,101676,455024,1156318,3011543,https://images.gr-assets.com/books/1474154022m...,https://images.gr-assets.com/books/1474154022s...
2,3,41865,41865,3212258,226,316015849,9780316000000.0,Stephenie Meyer,2005.0,Twilight,...,3866839,3916824,95009,456191,436802,793319,875073,1355439,https://images.gr-assets.com/books/1361039443m...,https://images.gr-assets.com/books/1361039443s...
3,4,2657,2657,3275794,487,61120081,9780061000000.0,Harper Lee,1960.0,To Kill a Mockingbird,...,3198671,3340896,72586,60427,117415,446835,1001952,1714267,https://images.gr-assets.com/books/1361975680m...,https://images.gr-assets.com/books/1361975680s...
4,5,4671,4671,245494,1356,743273567,9780743000000.0,F. Scott Fitzgerald,1925.0,The Great Gatsby,...,2683664,2773745,51992,86236,197621,606158,936012,947718,https://images.gr-assets.com/books/1490528560m...,https://images.gr-assets.com/books/1490528560s...


In [4]:
ratings.head()

Unnamed: 0,user_id,book_id,rating
0,1,258,5
1,2,4081,4
2,2,260,5
3,2,9296,5
4,2,2318,3


There should be a total of 53424 unique users and 10000 books in this dataset.

## Preprocessing

The first step is to preprocess the data, in particular to format the ratings data into a matrix.
First, merge the two files to eliminate any ratings that do not have book metadata (if any).
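As a minimal sketch of why the merge filters ratings, the toy frames below (hypothetical data, not from the data set) contain one rating whose `book_id` has no metadata row. `pd.merge` performs an inner join by default, so that rating disappears from the result; a left join with `indicator=True` is one way to list such rows explicitly.

```python
import pandas as pd

# Toy data (hypothetical): book 3 has a rating but no metadata row.
toy_books = pd.DataFrame({"book_id": [1, 2], "original_title": ["A", "B"]})
toy_ratings = pd.DataFrame(
    {"user_id": [10, 11, 12], "book_id": [1, 2, 3], "rating": [5, 4, 3]}
)

# pd.merge defaults to how="inner", so only book_ids present in both frames survive.
toy_merged = pd.merge(toy_books, toy_ratings, on="book_id")
print(toy_merged)

# A left join with indicator=True marks the ratings that lack metadata.
check = pd.merge(toy_ratings, toy_books, on="book_id", how="left", indicator=True)
orphans = check[check["_merge"] == "left_only"]
print(orphans[["user_id", "book_id"]])
```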

In [5]:
# Merge the two datasets
merged_data = pd.merge(books, ratings, on='book_id')[['user_id', 'book_id', 'rating', 'original_title']]

In [6]:
# Let's see what the merged data looks like
merged_data.head()

Unnamed: 0,user_id,book_id,rating,original_title
0,2886,1,5,The Hunger Games
1,6158,1,5,The Hunger Games
2,3991,1,4,The Hunger Games
3,5281,1,5,The Hunger Games
4,5721,1,5,The Hunger Games


It turns out that working with the full data set may cause memory issues. Hence, for this exercise, let's keep only the users with an ID less than or equal to 10000.
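A rough back-of-the-envelope calculation illustrates the concern. The sketch below assumes the rating matrix is stored densely as 64-bit floats (the pandas default for a pivot table that contains missing values); the helper name `dense_matrix_gib` is made up for this illustration.

```python
# Back-of-the-envelope size of a dense user x book matrix stored as float64.
BYTES_PER_FLOAT64 = 8

def dense_matrix_gib(n_users, n_books):
    """Return the size in GiB of a dense float64 matrix of the given shape."""
    return n_users * n_books * BYTES_PER_FLOAT64 / 2**30

full_gib = dense_matrix_gib(53424, 10000)    # all users in the data set
subset_gib = dense_matrix_gib(10000, 10000)  # users with ID <= 10000, at most 10000 books
print(f"full: {full_gib:.2f} GiB, subset: {subset_gib:.2f} GiB")
```

That is roughly 4 GiB for the full matrix against about 0.75 GiB for the subset, before counting any user-user or item-item similarity matrix built on top of it.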

In [7]:
merged_data = merged_data[merged_data.user_id <= 10000]

In [8]:
# find Nulls
nulls = {"Feature":[i for i in merged_data.columns] ,"Total records":merged_data.count() , "No: of nulls" : merged_data.isnull().sum() }
description = pd.DataFrame(data = nulls)
print(description)

                       Feature  Total records  No: of nulls
user_id                user_id        1169033             0
book_id                book_id        1169033             0
rating                  rating        1169033             0
original_title  original_title        1144769         24264


**Observation: We will ignore the NaN values in original_title for this assignment.**

#### First create the rating matrix. Replace any missing values with 0 afterwards. 

In [9]:
# Create the pivot table: one row per user, one column per book.
# Passing values='rating' keeps the text column original_title out of the aggregation
# and yields single-level columns, so no droplevel() call is needed.
ratings = merged_data.pivot_table(index = 'user_id', columns = 'book_id', values = 'rating')
ratings.head()

book_id,1,2,3,4,5,6,7,8,9,10,...,9991,9992,9993,9994,9995,9996,9997,9998,9999,10000
user_id,,,,,,,,,,,,,,,,,,,,,
1,,,,5.0,,,,,,4.0,...,,,,,,,,,,
2,,5.0,,,5.0,,,4.0,,5.0,...,,,,,,,,,,
3,,,,3.0,,,,,,,...,,,,,,,,,,
4,,5.0,,4.0,4.0,,4.0,4.0,,5.0,...,,,,,,,,,,
5,,,,,,4.0,,,,,...,,,,,,,,,,


In [10]:
ratings.shape

(10000, 9963)

In [11]:
# Replace all missing values with 0.
ratings.fillna(0, inplace = True)

In [12]:
ratings.head()

book_id,1,2,3,4,5,6,7,8,9,10,...,9991,9992,9993,9994,9995,9996,9997,9998,9999,10000
user_id,,,,,,,,,,,,,,,,,,,,,
1,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,5.0,0.0,0.0,5.0,0.0,0.0,4.0,0.0,5.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,5.0,0.0,4.0,4.0,0.0,4.0,4.0,0.0,5.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Match the book id with actual title

In [13]:
# Reload the book metadata to match the book id with actual title
titles = pd.read_csv('books.csv')
titles = titles.loc[:,['book_id', 'original_title']]
print(titles.shape)
titles.head()

(10000, 2)


Unnamed: 0,book_id,original_title
0,1,The Hunger Games
1,2,Harry Potter and the Philosopher's Stone
2,3,Twilight
3,4,To Kill a Mockingbird
4,5,The Great Gatsby


In [14]:
# find Nulls
nulls = {"Feature":[i for i in titles.columns] ,"Total records":titles.count() , "No: of nulls" : titles.isnull().sum() }
description = pd.DataFrame(data = nulls)
print(description)

                       Feature  Total records  No: of nulls
book_id                book_id          10000             0
original_title  original_title           9415           585


**Observation: As pointed out earlier, we will ignore the NaN values in original_title for this assignment.**

## User-Based Collaborative Filtering
The first model to use will be the user-based collaborative filtering.

1. While this is not the best practice, let's use Euclidean distance to measure the similarity between users. The smaller the distance, the more similar the two users.
2. Use 100 neighbors to calculate the predicted scores.
3. Get the top 15 recommendations for user with user_id 1839. Get the book titles and predicted ratings.
4. Also store the recommendations in a variable to compare this result with other models later.
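The steps above can be sketched on a toy matrix. This is an illustration with hypothetical data, not the notebook's implementation. It computes two weighted averages: one that uses the raw distances as weights, and one that first converts each distance `d` into a similarity `1 / (1 + d)`, which is an assumption made here so that closer neighbors count more.

```python
import numpy as np

# Toy ratings (hypothetical): rows are users, columns are books, 0 means unrated.
R = np.array([
    [5.0, 4.0, 0.0, 0.0],   # target user
    [5.0, 4.0, 5.0, 1.0],   # close to the target
    [1.0, 0.0, 2.0, 5.0],   # far from the target
])
target, k = 0, 2

# Euclidean distance from the target user to every user.
dist = np.linalg.norm(R - R[target], axis=1)

# Nearest neighbors, skipping the target itself.
order = np.argsort(dist)
neighbors = order[order != target][:k]

# Books the target user has not rated, and the neighbors' ratings for them.
unseen = np.where(R[target] == 0)[0]
neighbor_ratings = R[neighbors][:, unseen]

# Variant 1: raw distances as weights (farther neighbors weigh more).
d = dist[neighbors]
pred_dist = (d @ neighbor_ratings) / d.sum()

# Variant 2 (assumption): similarity weights 1 / (1 + d), closer neighbors weigh more.
w = 1.0 / (1.0 + d)
pred_sim = (w @ neighbor_ratings) / w.sum()

for book, a, b in zip(unseen, pred_dist, pred_sim):
    print(f"book column {book}: distance-weighted {a:.2f}, similarity-weighted {b:.2f}")
```

In this toy case the two weightings rank the two unseen books in opposite order, which shows how much the choice of weights can matter.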

In [15]:
# For Predicted rating and recommendation
user_id = 1839 
neighbors = 100
recommendations = 15

In [16]:
# Calculate the similarity score using Euclidean distance
from sklearn.metrics.pairwise import euclidean_distances

user_sim = euclidean_distances(ratings)
user_sim = pd.DataFrame(user_sim, index = ratings.index, columns = ratings.index)
user_sim

user_id,1,2,3,4,5,6,7,8,9,10,...,9991,9992,9993,9994,9995,9996,9997,9998,9999,10000
user_id,,,,,,,,,,,,,,,,,,,,,
1,0.000000,51.778374,42.918527,52.962251,57.026310,56.026779,62.016127,51.264022,52.239832,52.172790,...,52.962251,54.129474,51.874849,51.768716,72.601653,48.754487,52.459508,50.665570,53.385391,51.604263
2,51.778374,0.000000,39.937451,51.478151,54.763126,54.221767,59.991666,50.169712,50.833060,53.693575,...,50.119856,53.823787,51.439285,52.583267,68.992753,45.541190,49.264592,49.457052,53.525695,47.916594
3,42.918527,39.937451,0.000000,46.593991,45.077711,45.066617,51.264022,39.572718,44.147480,46.303348,...,40.261644,45.453273,43.231933,43.382024,64.521314,34.914181,44.452222,44.237993,45.011110,42.860238
4,52.962251,51.478151,46.593991,0.000000,60.852280,60.033324,61.830413,54.083269,54.129474,56.648036,...,56.797887,58.711157,55.910643,57.384667,72.194183,51.923020,48.218254,51.749396,57.922362,50.219518
5,57.026310,54.763126,45.077711,60.852280,0.000000,58.625933,64.015623,55.263008,58.352378,59.983331,...,54.525224,58.086143,57.818682,56.833089,73.082146,51.078371,58.906706,58.711157,58.154965,58.369513
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9996,48.754487,45.541190,34.914181,51.923020,51.078371,51.710734,57.140179,47.042534,48.187135,51.739733,...,47.010637,50.328918,47.265209,49.284886,66.558245,0.000000,49.909919,50.438081,47.968740,48.785244
9997,52.459508,49.264592,44.452222,48.218254,58.906706,57.192657,57.602083,50.139805,50.921508,53.009433,...,54.212545,58.258047,55.344376,54.166410,71.854019,49.909919,0.000000,51.913389,54.129474,51.951901
9998,50.665570,49.457052,44.237993,51.749396,58.711157,56.709788,62.072538,53.432200,52.668776,54.046276,...,54.589376,55.018179,52.535702,53.150729,73.293929,50.438081,51.913389,0.000000,55.865911,50.970580
9999,53.385391,53.525695,45.011110,57.922362,58.154965,58.455111,62.080593,55.533774,54.433446,57.515215,...,54.101756,56.035703,55.054518,56.409219,72.034714,47.968740,54.129474,55.865911,0.000000,53.413481


In [17]:
def UBCF(userid, n_neighbors, top_n, similarity,titles):
    '''
    Input:
    userid: The user of interest
    n_neighbors: Number of neighbors for similarity count
    top_n: Top n recommendations to return
    similarity: The similarity matrix
    titles: df of target id with actual title
    
    Output: 
    The top n recommendations with predicted rating in a dataframe
    '''
    # Get the nearest neighbors
    nearest_neighbors = similarity[userid].sort_values(ascending = True)[1:(n_neighbors+1)] # ascending = true for lesser the distance the better
    
    # Obtain predicted ratings for unseen target
    unseen_book_index = ratings.columns[ratings.loc[userid] == 0]
    missing_ratings = []
    for book_id in unseen_book_index:
        neighbors_ratings = ratings.loc[nearest_neighbors.index,book_id]
        
        # Store set of bookid, associated title and predicted rating.
        # Note: the weights here are distances, so farther neighbors weigh more than closer ones.
        missing_ratings.append((book_id, titles[titles['book_id'] == book_id]['original_title'].values[0],
                                 sum(nearest_neighbors * neighbors_ratings) / sum(nearest_neighbors)))
    
    # Sort the predictions 
    ranked_rec = pd.DataFrame(missing_ratings, columns=['book_id','Book_UBCF','Rating']).sort_values('Rating',ascending=False) 
    ranked_rec.reset_index(drop=True,inplace=True) # Reset Index
    
    return ranked_rec[:top_n] # Extract only top_n recommendations

In [18]:
ubcf = UBCF(user_id, neighbors, recommendations, user_sim, titles)
ubcf

Unnamed: 0,book_id,Book_UBCF,Rating
0,26,The Da Vinci Code,1.5208
1,35,O Alquimista,1.395913
2,2,Harry Potter and the Philosopher's Stone,1.226538
3,18,Harry Potter and the Prisoner of Azkaban,1.195236
4,21,Harry Potter and the Order of the Phoenix,1.177363
5,11,The Kite Runner,1.0842
6,24,Harry Potter and the Goblet of Fire,1.065464
7,27,Harry Potter and the Half-Blood Prince,1.014619
8,80,Le Petit Prince,1.005699
9,23,Harry Potter and the Chamber of Secrets,0.99457


In [19]:
# Another Method for reference only
"""
def UBCF(userid, n_neighbors, top_n, similarity,titles):
    '''
    Input:
    userid: The user of interest
    n_neighbors: Number of neighbors for similarity count
    top_n: Top n recommendations to return
    similarity: The similarity matrix
    titles: df of target id with actual title
    
    Output: 
    The top n recommendations with predicted rating in a dataframe
    '''
    # Get the nearest neighbors
    nearest_neighbors = similarity[userid].sort_values(ascending = True)[1:(n_neighbors+1)]
    
    # Obtain predicted ratings for unseen target
    unseen_book_index = ratings.columns[ratings.loc[userid] == 0]
    missing_ratings = []
    for book_id in unseen_book_index:
        neighbors_ratings = ratings.loc[nearest_neighbors.index,book_id]
        missing_ratings.append(sum(nearest_neighbors * neighbors_ratings) / sum(nearest_neighbors))
    
    # Sort the predictions
    missing_ratings = pd.Series(missing_ratings, index=unseen_book_index).sort_values(ascending = False)
    
    # Extract only the top n targets
    recommend_books = missing_ratings.index[:top_n]
     
    # Print the recommendations
    rec_number = []
    pred_book = []
    pred_rating =[]
    
    for i in range(top_n):
        rec_bookid = recommend_books[i]
        rec_books = titles[titles['book_id'] == rec_bookid]['original_title'].values[0]
        rec_rating = missing_ratings.iloc[i]
        
        rec_number.append(i+1)
        pred_book.append(rec_books)
        pred_rating.append(rec_rating)
        
    zipped = list(zip(rec_number, pred_book, pred_rating))
    ranked_rec = pd.DataFrame(zipped, columns = ['Recommendations', 'Book_UBCF', 'Rating'])
        
    return ranked_rec 
"""

## Item-Based Collaborative Filtering
Next we will use item-based collaborative filtering. 

1. This time let's use cosine similarity to measure the similarity between items.
2. Use 100 neighbors when calculating the predicted scores.
3. Get the top 15 recommendations for user with user_id 1839. Get the book titles and predicted ratings.
4. Also store the recommendations in a variable to compare this result with other models later.
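As an illustrative sketch of the item-based steps (toy data, hypothetical values), the snippet below computes cosine similarity between book columns, picks the `k` books most similar to one unrated book, and averages the user's own ratings on those books, weighted by similarity.

```python
import numpy as np

# Toy ratings (hypothetical): rows are users, columns are books, 0 means unrated.
R = np.array([
    [5.0, 0.0, 4.0, 1.0],   # target user has not rated book column 1
    [4.0, 5.0, 5.0, 0.0],
    [0.0, 1.0, 0.0, 5.0],
    [5.0, 4.0, 4.0, 1.0],
])
user, item, k = 0, 1, 2

# Cosine similarity between book columns.
norms = np.linalg.norm(R, axis=0)
item_sim = (R.T @ R) / np.outer(norms, norms)

# The k books most similar to the target book, excluding the book itself.
sims = item_sim[item].copy()
sims[item] = -np.inf
neighbors = np.argsort(sims)[::-1][:k]

# Similarity-weighted average of the user's own ratings on those books.
weights = item_sim[item, neighbors]
pred = weights @ R[user, neighbors] / weights.sum()
print(f"neighbors: {neighbors.tolist()}, predicted score: {pred:.2f}")
```

When a neighboring book is unrated by the user, its 0 enters the average as if it were a rating, which pulls the score down. That is likely part of why the memory-based scores in this notebook sit near 1 rather than on the 1 to 5 scale.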

In [20]:
# Use as many boxes as you need
from sklearn.metrics.pairwise import cosine_similarity

item_sim = cosine_similarity(ratings.T)
item_sim = pd.DataFrame(item_sim, index = ratings.columns, columns = ratings.columns)
item_sim

book_id,1,2,3,4,5,6,7,8,9,10,...,9991,9992,9993,9994,9995,9996,9997,9998,9999,10000
book_id,,,,,,,,,,,,,,,,,,,,,
1,1.000000,0.426320,0.469690,0.378218,0.341510,0.397019,0.294236,0.289858,0.325832,0.325946,...,0.016655,0.017877,0.024360,0.050888,0.037132,0.012379,0.006301,0.013394,0.057893,0.001413
2,0.426320,1.000000,0.486069,0.542027,0.456410,0.188067,0.511171,0.445516,0.445610,0.467768,...,0.024669,0.000000,0.013380,0.035157,0.032404,0.009075,0.005433,0.028263,0.022892,0.030204
3,0.469690,0.486069,1.000000,0.384906,0.303699,0.226173,0.283795,0.279462,0.367358,0.387361,...,0.011719,0.000000,0.007863,0.035419,0.021788,0.021900,0.015733,0.006696,0.025669,0.000000
4,0.378218,0.542027,0.384906,1.000000,0.594790,0.208651,0.451769,0.594925,0.375186,0.488981,...,0.049607,0.004629,0.024440,0.015482,0.053446,0.004258,0.015499,0.011847,0.037395,0.017243
5,0.341510,0.456410,0.303699,0.594790,1.000000,0.186000,0.420847,0.607781,0.320497,0.464527,...,0.049721,0.000000,0.015070,0.019910,0.060987,0.003833,0.025701,0.013794,0.031513,0.031184
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9996,0.012379,0.009075,0.021900,0.004258,0.003833,0.010635,0.000000,0.007059,0.005018,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,1.000000,0.000000,0.000000,0.000000,0.000000
9997,0.006301,0.005433,0.015733,0.015499,0.025701,0.000000,0.019745,0.018004,0.000000,0.016680,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,1.000000,0.107720,0.000000,0.042117
9998,0.013394,0.028263,0.006696,0.011847,0.013794,0.004239,0.046626,0.019787,0.017226,0.017967,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.107720,1.000000,0.000000,0.000000
9999,0.057893,0.022892,0.025669,0.037395,0.031513,0.054846,0.009122,0.027701,0.007501,0.020706,...,0.000000,0.000000,0.000000,0.000000,0.012210,0.000000,0.000000,0.000000,1.000000,0.000000


In [21]:
def IBCF(userid, n_neighbors, top_n, similarity, titles):
    '''
    Input:
    userid: The user of interest
    n_neighbors: Number of neighbors for similarity count
    top_n: Top n recommendations to return
    similarity: The similarity matrix
    titles: df of target id with actual title
    
    Output: 
    The top n recommendations with predicted rating in a dataframe
    '''
    
    # Obtain unseen target indices
    unseen_book_index = ratings.columns[ratings.loc[userid] == 0]
    missing_ratings = []
    # Calculate predicted rating for each new target
    for book_id in unseen_book_index:
        nearest_neighbors = similarity[book_id].sort_values(ascending = False)[1:(n_neighbors+1)] # ascending is false as larger the similarity, the better
        neighbors_ratings = ratings.loc[userid, nearest_neighbors.index]
        
        # Store set of bookid, associated title and predicted rating
        missing_ratings.append((book_id, titles[titles['book_id'] == book_id]['original_title'].values[0],
                                sum(nearest_neighbors * neighbors_ratings) / sum(nearest_neighbors)))
    
    # Sort the predictions 
    ranked_rec = pd.DataFrame(missing_ratings, columns=['book_id','Book_IBCF','Rating']).sort_values('Rating',ascending=False) 
    ranked_rec.reset_index(drop=True,inplace=True) # Reset Index
    
    return ranked_rec[:top_n] # Extract only top_n recommendations

In [22]:
ibcf = IBCF(user_id, neighbors, recommendations, item_sim, titles)
ibcf

Unnamed: 0,book_id,Book_IBCF,Rating
0,7213,,1.137208
1,6185,Secret Prey,1.106118
2,8853,Sudden Prey,1.096361
3,9388,Night Prey,1.081804
4,9468,Mortal Prey,1.068413
5,8824,Mind Prey,1.067748
6,5109,Chosen Prey,1.012271
7,6172,Heat Lightning,0.922163
8,5284,Bad Blood,0.818423
9,6698,,0.78646


## Matrix Factorization
Now we will turn to model based methods. First we will look at Matrix Factorization.  

1. Use 3 latent factors.
2. Set the learning rate at 0.001 and beta at 0.01. Since it will take a while to run, 10 iterations will be fine.
3. Fit the model (it will take a while to run).
4. Get the top 15 recommendations for user with user_id 1839. Return both book names and predicted ratings.
5. Also store the recommendations in a variable to compare this result with other models later.
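To make the fitting procedure concrete, here is a small sketch on a toy matrix (hypothetical data). It uses 3 latent factors, a learning rate of 0.001 and beta of 0.01 as in the list above, but runs more passes because the toy problem is tiny. It is a vectorized variant that updates whole factor vectors at once and uses the pre-update user vector for the item step, so it is not line-for-line identical to an element-wise loop.

```python
import numpy as np

rng = np.random.default_rng(862)

# Toy ratings (hypothetical): 0 marks a missing rating.
R = np.array([
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
    [0.0, 1.0, 5.0, 4.0],
])
K, alpha, beta = 3, 0.001, 0.01
P = rng.random((R.shape[0], K))   # user factors, M x K
Q = rng.random((K, R.shape[1]))   # item factors, K x N
observed = np.argwhere(R > 0)     # (row, column) pairs of known ratings

def squared_error(P, Q):
    """Sum of squared errors over the observed ratings only."""
    return sum((R[i, j] - P[i] @ Q[:, j]) ** 2 for i, j in observed)

before = squared_error(P, Q)
for _ in range(500):
    for i, j in observed:
        e_ij = R[i, j] - P[i] @ Q[:, j]
        p_old = P[i].copy()
        # Gradient step on both factor vectors for this single rating.
        P[i] += alpha * (2 * e_ij * Q[:, j] - beta * P[i])
        Q[:, j] += alpha * (2 * e_ij * p_old - beta * Q[:, j])
after = squared_error(P, Q)
print(f"squared error before: {before:.2f}, after: {after:.2f}")
```

Only observed entries drive the updates; the product of the two factor matrices then supplies estimates for the missing ones.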

In [23]:
def matrix_factorization(R, P, Q, K, steps=10, alpha=0.001, beta=0.01):
    '''
    Inputs:
    R     : The ratings (of dimension M x N)
    P     : an initial matrix of dimension M x K
    Q     : an initial matrix of dimension K x N
    K     : the number of latent features
    steps : the maximum number of steps to perform the optimization
    alpha : the learning rate
    beta  : the regularization parameter

    Outputs:
    the final matrices P and Q
    '''

    for step in range(steps):
        for i in range(R.shape[0]):
            for j in range(R.shape[1]):
                if R[i][j] > 0: # Skipping over missing ratings
                    eij = R[i][j] - np.dot(P[i,:],Q[:,j])
                    for k in range(K):
                        P[i][k] = P[i][k] + alpha * (2 * eij * Q[k][j] - beta * P[i][k])
                        Q[k][j] = Q[k][j] + alpha * (2 * eij * P[i][k] - beta * Q[k][j])
        # Accumulate the regularized squared error over the observed ratings
        e = 0
        for i in range(R.shape[0]):
            for j in range(R.shape[1]):
                if R[i][j] > 0:
                    e = e + pow(R[i][j] - np.dot(P[i,:],Q[:,j]), 2)
                    for k in range(K):
                        e = e + (beta/2) * ( pow(P[i][k],2) + pow(Q[k][j],2) )
        if e < 0.001: # tolerance
            break
    return P, Q

In [24]:
np.random.seed(862)

# Initializations
M = ratings.shape[0] # Number of users
N = ratings.shape[1] # Number of items
K = 3 # Number of latent features

# Initial estimate of P and Q
P = np.random.rand(M,K)
Q = np.random.rand(K,N)
rating_np = np.array(ratings)

In [25]:
# Run the fitting.
P, Q = matrix_factorization(rating_np, P, Q, K)

### Get the complete rating matrix

In [26]:
"""
# Perform prediction. 
# Multiply P and Q together to get a complete rating matrix.
predicted_rating = np.matmul(P, Q)
predicted_rating = pd.DataFrame(predicted_rating, index = ratings.index, columns = ratings.columns)
print(predicted_rating)
"""


### Get predicted ratings for a particular userid only

In [27]:
def MF_rec(userid, top_n, final_P, final_Q, titles):
    '''
    Input:
    userid: The user of interest
    top_n: Top n recommendations to return
    final_P,final_Q: the final matrices P and Q from matrix_factorization function
    titles: df of target id with actual title
    
    Output: 
    The top n recommendations with predicted rating in a dataframe
    '''
    
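    # Perform prediction. 
    # Multiply only the relevant row of P with Q to get this user's predicted ratings.
    # Look up the row position from the index instead of assuming user ids start at 1 with no gaps.
    user_row = ratings.index.get_loc(userid)
    user_predicted_rating = np.matmul(final_P[user_row], final_Q)
    predicted_rating = pd.Series(user_predicted_rating, index=ratings.columns)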

    
    # Obtain unseen target indices
    unseen_book_index = ratings.columns[ratings.loc[userid] == 0]
    missing_ratings = []
    for book_id in unseen_book_index:
        p_ratings = predicted_rating.loc[book_id]
        
        # Store set of bookid, associated title and predicted rating
        missing_ratings.append((book_id, titles[titles['book_id'] == book_id]['original_title'].values[0],p_ratings))

    # Sort the predictions 
    ranked_rec = pd.DataFrame(missing_ratings, columns=['book_id','Book_MF','Rating']).sort_values('Rating',ascending=False) 
    ranked_rec.reset_index(drop=True,inplace=True) # Reset Index
    
    return ranked_rec[:top_n] # Extract only top_n recommendations

In [28]:
mf_rec = MF_rec(user_id, recommendations, P, Q, titles)
mf_rec

Unnamed: 0,book_id,Book_MF,Rating
0,8946,دیوان‎‎ [Dīvān],4.858059
1,4868,Jesus the Christ: A Study of the Messiah and H...,4.842447
2,3628,The Complete Calvin and Hobbes,4.794923
3,7401,The Brothers K,4.762942
4,3491,Just Mercy: A Story of Justice and Redemption,4.706392
5,6590,The Authoritative Calvin and Hobbes,4.701359
6,1010,The Essential Calvin and Hobbes: A Calvin and ...,4.673762
7,6920,The Indispensable Calvin and Hobbes: A Calvin ...,4.672812
8,6902,Standing for Something: 10 Neglected Virtues T...,4.668288
9,5207,The Days Are Just Packed: A Calvin and Hobbes ...,4.662562


## SVD++

Install the [surprise](http://surpriselib.com/) library.

The factorization algorithm used here is the one this notebook refers to as SVD++. The surprise library, however, implements it under the name SVD (and uses the name SVD++ for a different, though similar, algorithm). Let's use the [SVD](https://surprise.readthedocs.io/en/stable/matrix_factorization.html#surprise.prediction_algorithms.matrix_factorization.SVD) algorithm from the surprise library.

In order to use the surprise library, we first need to put the data into its accepted format. [Here](https://surprise.readthedocs.io/en/stable/getting_started.html#load-dom-dataframe-py) is an example of how it works. In general, the steps are as follows:

1. Set up a Reader class
2. Load the dataframe 
3. Build the data set using the build_full_trainset() method (see [here](https://surprise.readthedocs.io/en/stable/trainset.html) or [here](https://stackoverflow.com/questions/49263964/datasetautofolds-object-has-no-attribute-global-mean-on-python-surprise))


In [29]:
# Load the libraries
from surprise import Reader
from surprise import Dataset
from surprise.prediction_algorithms.matrix_factorization import SVD

In [30]:
# Step 1: Set up the reader class
reader = Reader(rating_scale=(1,5))


In [31]:
# Step 2: Load the dataframe. Use the merged data from above (not the pivoted data)
data = Dataset.load_from_df(merged_data[['user_id', 'book_id', 'rating']], reader)


In [32]:
# Step 3: Build the train set
svd_data = data.build_full_trainset()


Now that we have prepared the data set, let's build the model. The usage is similar to any sklearn model: first instantiate a model and set any hyperparameters, then fit it. For this model, use 5 latent factors, a learning rate of 0.01 for all parameters, and a regularization parameter of 0.1 for all parameters. Set a random state of 862.

In [33]:
algo = SVD(n_factors=5, lr_all=.01, reg_all= 0.1, random_state=862) # Instantiate and set hyperparameters
algo.fit(svd_data) # Fit the model

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x1a9a512b7c8>

Now that we have fitted the model, we can make predictions. There are several ways to do this:

1. Calculate the individual ratings $r_{ui}$ by using the given equation in lecture or [here](https://surprise.readthedocs.io/en/stable/matrix_factorization.html#surprise.prediction_algorithms.matrix_factorization.SVD)
2. Calculate the overall rating matrix by doing some matrix multiplications and manipulations
3. Probably the easiest is to use the predict function (see examples [here](https://surprise.readthedocs.io/en/stable/getting_started.html#predict-ratings2-py) and [here](https://predictivehacks.com/how-to-run-recommender-systems-in-python/)). You may not need to use the str() function.


Any of these methods can be used, but the goal is the same: get the top 15 recommendations (based on the predicted ratings) for user with user_id 1839. Get the recommendations and the predicted values. Also, store the recommendations in a variable to compare this result with other models later.
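For the first option, the estimate documented for the library's SVD algorithm is the global mean plus a user bias, an item bias and the dot product of the item and user factor vectors. The sketch below evaluates that formula with made-up parameter values (they are not taken from the fitted model) and clips the result to the 1 to 5 rating scale, which is an assumption mirroring the `rating_scale` given to the Reader.

```python
import numpy as np

# Hypothetical parameters for one user and one book (not from the notebook's model).
mu = 3.9                                      # global mean rating
b_u, b_i = 0.2, -0.1                          # user bias and item bias
p_u = np.array([0.3, -0.1, 0.5, 0.0, 0.2])    # user factors (5 latent factors)
q_i = np.array([0.4, 0.2, 0.1, -0.3, 0.6])    # item factors

# Biased matrix-factorization estimate: mu + b_u + b_i + q_i . p_u
raw = mu + b_u + b_i + q_i @ p_u

# Clip to the rating scale used when the data was loaded.
est = float(np.clip(raw, 1, 5))
print(round(est, 2))
```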

In [34]:
def svd_cf(userid, top_n, fitted_SVD, titles):
    '''
    Input:
    userid: The user of interest
    top_n: Top n recommendations to return
    titles: df of target id with actual title
    fitted_SVD: Fitted model with SVD algorithm
    
    Output: 
    The top n recommendations with predicted rating in a dataframe
    '''
           
    # Obtain unseen target indices
    unseen_book_index = ratings.columns[ratings.loc[userid] == 0]
    missing_ratings = []
    
    # Calculate predicted rating for each new target
    for book_id in unseen_book_index:
        
        # Store set of bookid, associated title and predicted rating
        missing_ratings.append((book_id, titles[titles['book_id'] == book_id]['original_title'].values[0],
                                fitted_SVD.predict(uid=userid,iid=book_id).est)) # Extract predicted rating est
    
    # Sort the predictions        
    ranked_rec = pd.DataFrame(missing_ratings, columns=['book_id','Book_SVD','Rating']).sort_values('Rating',ascending=False) 
    ranked_rec.reset_index(drop=True,inplace=True) # Reset Index
    return ranked_rec[:top_n] # Extract only top_n recommendations


In [35]:
svdcf= svd_cf(user_id, recommendations, algo, titles)
svdcf

Unnamed: 0,book_id,Book_SVD,Rating
0,3628,The Complete Calvin and Hobbes,4.630109
1,8946,دیوان‎‎ [Dīvān],4.607762
2,7029,I Want My Hat Back,4.595853
3,4868,Jesus the Christ: A Study of the Messiah and H...,4.568464
4,7883,The Sandman: King of Dreams,4.565016
5,5919,,4.525111
6,4653,,4.508285
7,9076,Preach My Gospel (A Guide to Missionary Service),4.505468
8,6361,There's Treasure Everywhere: A Calvin and Hobb...,4.479181
9,6089,الرحيق المختوم: بحث في السيرة النبوية على صاح...,4.478206


## Comparison

We have tried to provide recommendations to user 1839 using 4 methods. Let's put these 4 recommendations in a dataframe, with the column names as the methods used, and print out the dataframe.

**Since some books have NaN values for the original title, let's add book_id to compare recommendations across the 4 methods.**

In [36]:
# Use as many boxes as you need.
result = pd.concat([ubcf[['book_id','Book_UBCF']],
                    ibcf[['book_id','Book_IBCF']],
                    mf_rec[['book_id','Book_MF']], 
                    svdcf[['book_id','Book_SVD']]], axis=1)
result

Unnamed: 0,book_id,Book_UBCF,book_id.1,Book_IBCF,book_id.2,Book_MF,book_id.3,Book_SVD
0,26,The Da Vinci Code,7213,,8946,دیوان‎‎ [Dīvān],3628,The Complete Calvin and Hobbes
1,35,O Alquimista,6185,Secret Prey,4868,Jesus the Christ: A Study of the Messiah and H...,8946,دیوان‎‎ [Dīvān]
2,2,Harry Potter and the Philosopher's Stone,8853,Sudden Prey,3628,The Complete Calvin and Hobbes,7029,I Want My Hat Back
3,18,Harry Potter and the Prisoner of Azkaban,9388,Night Prey,7401,The Brothers K,4868,Jesus the Christ: A Study of the Messiah and H...
4,21,Harry Potter and the Order of the Phoenix,9468,Mortal Prey,3491,Just Mercy: A Story of Justice and Redemption,7883,The Sandman: King of Dreams
5,11,The Kite Runner,8824,Mind Prey,6590,The Authoritative Calvin and Hobbes,5919,
6,24,Harry Potter and the Goblet of Fire,5109,Chosen Prey,1010,The Essential Calvin and Hobbes: A Calvin and ...,4653,
7,27,Harry Potter and the Half-Blood Prince,6172,Heat Lightning,6920,The Indispensable Calvin and Hobbes: A Calvin ...,9076,Preach My Gospel (A Guide to Missionary Service)
8,80,Le Petit Prince,5284,Bad Blood,6902,Standing for Something: 10 Neglected Virtues T...,6361,There's Treasure Everywhere: A Calvin and Hobb...
9,23,Harry Potter and the Chamber of Secrets,6698,,5207,The Days Are Just Packed: A Calvin and Hobbes ...,6089,الرحيق المختوم: بحث في السيرة النبوية على صاح...
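One simple way to quantify agreement between two methods is to intersect their `book_id` lists. The sketch below does this for the Matrix Factorization and SVD columns, with the ids copied by hand from the ten rows displayed above. Because it covers only those ten rows, its count is not directly comparable with a figure computed over the full top 15. The helper name `overlap` is made up for this illustration.

```python
# book_id values copied from the rows displayed in the MF and SVD tables above.
mf_ids = [8946, 4868, 3628, 7401, 3491, 6590, 1010, 6920, 6902, 5207]
svd_ids = [3628, 8946, 7029, 4868, 7883, 5919, 4653, 9076, 6361, 6089]

def overlap(a, b):
    """Return the sorted book_ids that appear in both recommendation lists."""
    return sorted(set(a) & set(b))

common = overlap(mf_ids, svd_ids)
print(len(common), common)
```

Comparing on `book_id` rather than on title avoids treating two books with missing titles as the same item.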


**Observation:** 
    We performed memory-based and model-based Collaborative Filtering (CF):
    1. Memory-based -> Neighborhood-based CF
    a. User-based with Euclidean distance measure
    b. Item-based with cosine similarity measure
    
    2. Model-based -> Matrix Factorization based CF
    a. Matrix Factorization
    b. SVD++
Recommendations are provided for user 1839 using the 4 methods. 
The comparison shows that the model-based recommendations from Matrix Factorization and SVD++ partly agree (8 out of 15 match).
However, the memory-based recommendations match neither each other nor the model-based recommendations.