### Book Recommendation System
In this notebook, we implement the collaborative filtering algorithm to build a recommender system: it takes a user's ratings for some books and recommends a few books that are probably good choices to read next.

In [155]:
import pandas as pd

#### Importing and exploring the data

In [156]:
books = pd.read_csv('./data/Books.csv')
users = pd.read_csv('./data/Users.csv')
ratings = pd.read_csv('./data/Ratings.csv')



In [157]:
books.head()

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


In [158]:
users.head()

Unnamed: 0,User-ID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


In [159]:
ratings.head()

Unnamed: 0,User-ID,ISBN,Book-Rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


In [160]:
print('Users shape: ', users.shape)
print('Ratings shape: ', ratings.shape)
print('Books shape: ', books.shape)

Users shape:  (278858, 3)
Ratings shape:  (1149780, 3)
Books shape:  (271360, 8)


#### Filtering data
We cannot realistically process all of this data. With about 1.15 million ratings from roughly 279 thousand users on 271 thousand books, the full book-by-user matrix would be huge and we would run into hardware limits.

Another issue is the cold start problem. For users who have rated only a few books, and for books that have only a few ratings, a recommender system cannot make accurate predictions.

So we keep only the ratings where the book has more than 250 ratings and the user has rated more than 350 books.

In [161]:
active_users = ratings.groupby('User-ID').agg('count')
active_users = active_users[active_users['ISBN'] > 350].reset_index()['User-ID']
active_users.shape

(463,)

In [162]:
most_rated_books = ratings.groupby('ISBN').agg('count')
most_rated_books = most_rated_books[most_rated_books['User-ID'] > 250].reset_index()['ISBN']
most_rated_books.shape

(121,)

In [163]:
filtered_ratings = ratings[(ratings['User-ID'].isin(active_users)) & (ratings['ISBN'].isin(most_rated_books))]
filtered_ratings.shape

(10510, 3)
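
The same two-sided filter can also be written in a single pass with `groupby(...).transform('size')`. The sketch below is only an illustration on a made-up ratings table; the data and the threshold names `MIN_USER_RATINGS` and `MIN_BOOK_RATINGS` are assumptions, not part of the notebook. Note that `size` counts rows while the notebook's `count` counts non-null values, so the two agree only when there are no missing entries.

```python
import pandas as pd

# Invented toy ratings, not taken from the real dataset
toy = pd.DataFrame({
    'User-ID': [1, 1, 1, 2, 2, 3],
    'ISBN': ['A', 'B', 'C', 'A', 'B', 'A'],
    'Book-Rating': [5, 0, 7, 8, 0, 9],
})

MIN_USER_RATINGS = 1  # hypothetical threshold: keep users with more than 1 rating
MIN_BOOK_RATINGS = 1  # hypothetical threshold: keep books with more than 1 rating

# For every row, how many ratings its user gave and its book received
user_counts = toy.groupby('User-ID')['ISBN'].transform('size')
book_counts = toy.groupby('ISBN')['User-ID'].transform('size')

toy_filtered = toy[(user_counts > MIN_USER_RATINGS) & (book_counts > MIN_BOOK_RATINGS)]
print(toy_filtered)
```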

#### Save reduced data

In [164]:
users[users['User-ID'].isin(active_users)].to_csv('./reduced-data/active_users.csv')
books[books['ISBN'].isin(most_rated_books)].to_csv('./reduced-data/most_rated_books.csv')
filtered_ratings.to_csv('./reduced-data/filtered_ratings.csv')

That leaves 463 active users and 121 frequently rated books, which should largely sidestep the cold start problem.

Next, we construct the Y matrix, one of the core matrices of the algorithm, with one row per book and one column per user. We build it with the DataFrame's `pivot` method.

In [165]:
filtered_ratings = filtered_ratings.pivot(index='ISBN', columns='User-ID', values='Book-Rating')
filtered_ratings.shape

(121, 451)
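
The result has 451 columns rather than 463 because `pivot` only creates a column for users that appear in `filtered_ratings`; active users who rated none of the 121 selected books drop out. To make the reshaping concrete, here is a minimal sketch on a made-up three-row table (the IDs and ratings are invented for illustration):

```python
import pandas as pd

# Invented long-format ratings: one row per (user, book) pair
toy = pd.DataFrame({
    'User-ID': [1, 1, 2],
    'ISBN': ['A', 'B', 'A'],
    'Book-Rating': [5, 0, 8],
})

# Books become rows, users become columns, ratings become cell values
toy_Y = toy.pivot(index='ISBN', columns='User-ID', values='Book-Rating')
print(toy_Y)
```

User 2 never rated book `B`, so that cell is `NaN` in `toy_Y`.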

In [166]:
filtered_ratings.head(10)

User-ID,2276,3363,6251,6543,6575,7158,7346,8681,11601,11676,...,269719,269728,270713,271284,274004,274061,274308,275970,277427,278418
ISBN,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
60392452,,,10.0,,,,,,,0.0,...,,,,,9.0,,,0.0,,
60502258,,0.0,,,8.0,,,,,8.0,...,,,,,,,,,,
60928336,,0.0,,,8.0,,0.0,0.0,,0.0,...,,,,,,,,0.0,,
60930535,,,0.0,,,,0.0,,,0.0,...,,,,,,,,,0.0,
60934417,,,,,9.0,,0.0,,,0.0,...,,,,,8.0,,,,0.0,
60938455,,,0.0,,0.0,,,,10.0,10.0,...,,,0.0,,10.0,,,,,
60959037,,,,,,,,,,0.0,...,,,,,,,,,,
60976845,,0.0,0.0,,0.0,0.0,,,,0.0,...,,,,,,,,,,
60987103,,0.0,0.0,10.0,9.0,,7.0,0.0,,9.0,...,,,,,,,,,,
61009059,,,7.0,8.0,0.0,,0.0,,,8.0,...,,,,0.0,,,,0.0,9.0,


In [167]:
filtered_ratings.notna().sum().sum()

10510

In [168]:
filtered_ratings.isna().sum().sum()

44061

We also construct the R matrix, which has the same shape as the Y matrix. For each (book, user) pair it holds 1 if the user has rated the book and 0 otherwise.

In [169]:
R = filtered_ratings.notna().astype('int')
R

User-ID,2276,3363,6251,6543,6575,7158,7346,8681,11601,11676,...,269719,269728,270713,271284,274004,274061,274308,275970,277427,278418
ISBN,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0060392452,0,0,1,0,0,0,0,0,0,1,...,0,0,0,0,1,0,0,1,0,0
0060502258,0,1,0,0,1,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
0060928336,0,1,0,0,1,0,1,1,0,1,...,0,0,0,0,0,0,0,1,0,0
0060930535,0,0,1,0,0,0,1,0,0,1,...,0,0,0,0,0,0,0,0,1,0
0060934417,0,0,0,0,1,0,1,0,0,1,...,0,0,0,0,1,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
0805063897,1,1,1,0,1,0,1,0,0,1,...,0,0,1,0,1,0,1,0,0,0
0842329129,0,0,0,0,0,0,0,0,0,1,...,1,0,0,1,0,0,0,0,0,0
0971880107,0,1,1,1,0,1,0,1,0,1,...,1,1,0,0,0,0,0,0,1,0
1400034779,0,0,1,0,1,0,0,0,0,1,...,0,0,0,0,1,0,0,0,1,0


We have to fill the `NaN` values of the Y matrix with zeros. Otherwise they would propagate into the cost function, because `NaN` multiplied by the 0 in R is still `NaN`.

In [170]:
filtered_ratings = filtered_ratings.fillna(0)
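
The effect is easy to reproduce in isolation. The following sketch uses NumPy and a made-up 2x2 example (the arrays and the constant predictions are assumptions for illustration) to show that masking with R alone does not remove the `NaN` values, while filling them first does:

```python
import numpy as np

# Invented 2x2 ratings matrix with two missing entries
Y = np.array([[5.0, np.nan],
              [np.nan, 8.0]])
R = (~np.isnan(Y)).astype(int)   # 1 where a rating exists, 0 otherwise
pred = np.ones_like(Y)           # pretend every prediction is 1.0

# NaN * 0 is still NaN, so the mask does not help here
cost_with_nan = np.sum(((pred - Y) * R) ** 2)

# After replacing NaN with 0, the masked entries contribute exactly 0
cost_filled = np.sum(((pred - np.nan_to_num(Y)) * R) ** 2)

print(cost_with_nan, cost_filled)
```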

In [171]:
import tensorflow as tf
from tensorflow import keras

In [172]:
num_features = 100
NUM_BOOKS, NUM_USERS = filtered_ratings.shape

In [173]:
tf.random.set_seed(1212)

W = tf.Variable(tf.random.normal((NUM_USERS, num_features), dtype=tf.float64), name='W')
X = tf.Variable(tf.random.normal((NUM_BOOKS, num_features), dtype=tf.float64), name='X')
b = tf.Variable(tf.random.normal((1, NUM_USERS), dtype=tf.float64), name='b')

In the next cell, we implement the cost function of the algorithm. We use a vectorized implementation instead of explicit for loops, which makes the computation much faster.
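
Written out, the quantity being computed is a regularized squared error over the rated pairs. The notation below is an assumption chosen to mirror the code: $\mathbf{x}^{(i)}$ is row $i$ of `X` (book $i$), $\mathbf{w}^{(j)}$ is row $j$ of `W` (user $j$), $b^{(j)}$ is entry $j$ of `b`, and $y^{(i,j)}$ and $r(i,j)$ are the entries of Y and R.

$$
J = \frac{1}{2} \sum_{(i,j)\,:\,r(i,j)=1} \left( \mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \left( \sum_{i} \lVert \mathbf{x}^{(i)} \rVert^2 + \sum_{j} \lVert \mathbf{w}^{(j)} \rVert^2 \right)
$$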

In [174]:
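def cost_func(W, X, b, Y, R, lambda_):
    # Prediction error for every (book, user) pair; R zeroes out the unrated pairs
    j = (tf.linalg.matmul(X, tf.transpose(W)) + b - Y) * R
    # Squared error plus L2 regularization on X and W (b is not regularized)
    J = 0.5 * tf.reduce_sum(j**2) + (lambda_/2) * (tf.reduce_sum(X**2) + tf.reduce_sum(W**2))
    return J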
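
One way to gain confidence in a vectorized cost is to compare it with a plain double loop on small random inputs. The sketch below does that in NumPy rather than TensorFlow; the function names `cost_func_loops` and `cost_func_vectorized` and the toy shapes are assumptions made for this illustration.

```python
import numpy as np

def cost_func_loops(W, X, b, Y, R, lambda_):
    """Reference implementation: visit every (book, user) pair explicitly."""
    num_books, num_users = Y.shape
    J = 0.0
    for i in range(num_books):
        for j in range(num_users):
            if R[i, j]:  # only rated pairs contribute to the error
                err = np.dot(W[j], X[i]) + b[0, j] - Y[i, j]
                J += 0.5 * err ** 2
    J += (lambda_ / 2) * (np.sum(X ** 2) + np.sum(W ** 2))
    return J

def cost_func_vectorized(W, X, b, Y, R, lambda_):
    """Same cost computed with matrix operations."""
    err = (X @ W.T + b - Y) * R
    return 0.5 * np.sum(err ** 2) + (lambda_ / 2) * (np.sum(X ** 2) + np.sum(W ** 2))

# Small random problem: 5 books, 4 users, 3 features
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
X = rng.normal(size=(5, 3))
b = rng.normal(size=(1, 4))
R = rng.integers(0, 2, size=(5, 4))
Y = rng.integers(0, 11, size=(5, 4)).astype(float) * R

loop_cost = cost_func_loops(W, X, b, Y, R, lambda_=1.0)
vec_cost = cost_func_vectorized(W, X, b, Y, R, lambda_=1.0)
print(loop_cost, vec_cost)
```

The two values should agree up to floating-point rounding.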

In [175]:
optimizer = keras.optimizers.Adam(learning_rate=0.1)

In the cell below, we train the model with gradient descent, using the Adam optimizer defined above, to find a good combination of values for our matrices.

In [176]:
num_iterations = 300
lambda_ = 1

for i in range(num_iterations):
    with tf.GradientTape() as tape:
        total_cost = cost_func(W, X, b, filtered_ratings, R, lambda_)

    gradients = tape.gradient(total_cost, [X, W, b])
    optimizer.apply_gradients(zip(gradients, [X, W, b]))
    if i % 20 == 0:
        print(f'cost in iteration {i}: {total_cost:0.1f}')


cost in iteration 0: 643036.9
cost in iteration 20: 27016.6
cost in iteration 40: 13407.9
cost in iteration 60: 8676.6
cost in iteration 80: 6406.0
cost in iteration 100: 5169.5
cost in iteration 120: 4444.1
cost in iteration 140: 3995.1
cost in iteration 160: 3704.6
cost in iteration 180: 3508.9
cost in iteration 200: 3371.9
cost in iteration 220: 3272.4
cost in iteration 240: 3197.5
cost in iteration 260: 3139.4
cost in iteration 280: 3093.1


In the cell below, we construct our predictions matrix, which we will use later to decide which books to recommend to the user.

In [177]:
import numpy as np

In [178]:
predicts = np.matmul(X.numpy(), np.transpose(W.numpy())) + b.numpy()
predicts = pd.DataFrame(predicts, index=filtered_ratings.index, columns=filtered_ratings.columns)

As a quick sanity check, we compare a few known ratings with the corresponding values in our predicts matrix. They should be close to each other. Keep in mind that the model was trained on these ratings, so this only shows that it fits the training data.

In [179]:
print(f"Original Rating: {filtered_ratings.loc['0061009059', 6251]} , Predicted Rating: {predicts.loc['0061009059', 6251]}")
print(f"Original Rating: {filtered_ratings.iloc[0, 2]}, Predicted Rating: {predicts.iloc[0, 2]}")
print(f"Original Rating: {filtered_ratings.loc['0060502258',11676]} , Predicted Rating: {predicts.loc['0060502258', 11676]}")

Original Rating: 7.0 , Predicted Rating: 6.793114280478201
Original Rating: 10.0, Predicted Rating: 9.743270281088911
Original Rating: 8.0 , Predicted Rating: 7.899537033054528
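
A handful of spot checks can be complemented by a single summary number over all rated pairs. A minimal sketch, assuming a helper named `masked_rmse` (not part of the original notebook) and a made-up 2x2 example:

```python
import numpy as np

def masked_rmse(Y, predictions, R):
    """Root mean squared error over the entries where R == 1."""
    Y = np.asarray(Y, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    mask = np.asarray(R).astype(bool)
    return float(np.sqrt(np.mean((predictions[mask] - Y[mask]) ** 2)))

# Invented example: only the diagonal entries are rated
Y_toy = np.array([[7.0, 0.0], [0.0, 10.0]])
R_toy = np.array([[1, 0], [0, 1]])
pred_toy = np.array([[6.0, 3.0], [2.0, 9.0]])

toy_rmse = masked_rmse(Y_toy, pred_toy, R_toy)
print(toy_rmse)
```

In the notebook this could be called as `masked_rmse(filtered_ratings, predicts, R)`. Since those are the ratings the model was trained on, the result measures fit rather than how well the model generalizes.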


Now we want to recommend 10 books to a single user. We pick the user in the second column of our DataFrame (position 1, User-ID 3363).

In [180]:
second_user = R.iloc[:, 1]
second_user = pd.DataFrame({'ISBN': second_user.index, 'Rated': second_user.values, 'Rating': filtered_ratings.iloc[:, 1].values, 'Predict': predicts.iloc[:, 1].values})
second_user['ISBN'] = second_user['ISBN'].astype('object')
second_user = second_user.merge(books[['ISBN', 'Book-Title']], on='ISBN')
second_user.head()

Unnamed: 0,ISBN,Rated,Rating,Predict,Book-Title
0,60392452,0,0.0,-0.338143,Stupid White Men ...and Other Sorry Excuses fo...
1,60502258,1,0.0,0.047814,The Divine Secrets of the Ya-Ya Sisterhood: A ...
2,60928336,1,0.0,-0.006038,Divine Secrets of the Ya-Ya Sisterhood: A Novel
3,60930535,0,0.0,-0.003478,The Poisonwood Bible: A Novel
4,60934417,0,0.0,1.458659,Bel Canto: A Novel


In [181]:
rated_books = second_user[second_user['Rated'] == 1][['Rating', 'Book-Title']]
for i in range(len(rated_books)):
    print(f'First user has rated {rated_books.iloc[i, 1]} {rated_books.iloc[i, 0]}')

First user has rated The Divine Secrets of the Ya-Ya Sisterhood: A Novel 0.0
First user has rated Divine Secrets of the Ya-Ya Sisterhood: A Novel 0.0
First user has rated Little Altars Everywhere: A Novel 0.0
First user has rated Wicked: The Life and Times of the Wicked Witch of the West 0.0
First user has rated The Girls' Guide to Hunting and Fishing 0.0
First user has rated The Nanny Diaries: A Novel 0.0
First user has rated Lucky : A Memoir 10.0
First user has rated White Oleander : A Novel (Oprah's Book Club) 0.0
First user has rated White Oleander : A Novel 0.0
First user has rated The Lovely Bones: A Novel 0.0
First user has rated Interview with the Vampire 0.0
First user has rated Rising Sun 0.0
First user has rated Timeline 0.0
First user has rated Midwives: A Novel 0.0
First user has rated Empire Falls 0.0
First user has rated Confessions of a Shopaholic (Summer Display Opportunity) 0.0
First user has rated The Da Vinci Code 0.0
First user has rated Red Dragon 0.0
First user h

In [182]:
not_rated_books = second_user[second_user['Rated'] == 0][['Rating', 'Book-Title', 'Predict']]
top_recommendations = not_rated_books.sort_values(by='Predict', ascending=False).iloc[:10, [1, 2]]
top_recommendations

Unnamed: 0,Book-Title,Predict
35,A Prayer for Owen Meany,3.036735
55,Into Thin Air : A Personal Account of the Mt. ...,3.024536
86,Violets Are Blue,2.888712
11,Bridget Jones's Diary,2.634911
81,The Bridges of Madison County,2.268242
104,Snow Falling on Cedars,2.223582
67,The Runaway Jury,2.211302
19,The Hours: A Novel,2.149533
116,Left Behind: A Novel of the Earth's Last Days ...,2.100236
64,The Client,2.073821
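
The steps above can be wrapped into a reusable function so that recommendations for any user take one call. This is a sketch under assumptions: `recommend_for_user` is a hypothetical helper, and the toy frames below only imitate the layout of `predicts`, `R`, and `books` (ISBNs as the row index, user IDs as columns).

```python
import pandas as pd

def recommend_for_user(user_id, predicts, R, books, n=10):
    """Return the n unrated books with the highest predicted rating."""
    unrated = R[user_id] == 0
    top = predicts.loc[unrated, user_id].sort_values(ascending=False).head(n)
    # Map each ISBN to one title; duplicate ISBNs are dropped to keep the lookup unique
    titles = books.drop_duplicates('ISBN').set_index('ISBN')['Book-Title']
    result = top.rename('Predict').to_frame()
    result.insert(0, 'Book-Title', titles.reindex(result.index).values)
    return result

# Invented data with the same layout as in the notebook
isbns = pd.Index(['A', 'B', 'C'], name='ISBN')
toy_predicts = pd.DataFrame([[1.0, 4.0], [3.0, 2.0], [5.0, 0.5]], index=isbns, columns=[10, 20])
toy_R = pd.DataFrame([[1, 0], [0, 0], [0, 1]], index=isbns, columns=[10, 20])
toy_books = pd.DataFrame({'ISBN': ['A', 'B', 'C'], 'Book-Title': ['Alpha', 'Beta', 'Gamma']})

recs = recommend_for_user(10, toy_predicts, toy_R, toy_books, n=2)
print(recs)
```

With the notebook's own data, the call would look like `recommend_for_user(3363, predicts, R, books)`, assuming the ISBN values in `books` match the index of `predicts`.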
