# Book Recommendation System
___

- model: User-based collaborative filtering + matrix factorization using neural networks
- dataset: [goodreads rating 10k](https://github.com/zygmuntz/goodbooks-10k)
- modules: sklearn, keras

**Outline**:

1. input function for getting rating from a new user
2. calculate the similarity between users and new user using cosine similarity (find the closest user)
3. use matrix factorization to calculate book rating for all the user
4. recommend top 10 books for new user based on similar user's book recommendation from matrix factorization


**reference**:
* matrix-factorization tutuorials: 
    - https://course.fast.ai/videos/?lesson=4
    - https://medium.com/@jdwittenauer/deep-learning-with-keras-recommender-systems-e7b99cb29929
    - https://towardsdatascience.com/building-a-book-recommendation-system-using-keras-1fba34180699
* user-based collaborative filtering: 
    - https://medium.com/@wwwbbb8510/comparison-of-user-based-and-item-based-collaborative-filtering-f58a1c8a3f1d
    - https://medium.com/sfu-cspmp/recommendation-systems-user-based-collaborative-filtering-using-n-nearest-neighbors-bf7361dc24e0

In [1]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from scipy import spatial
import random
import pickle
import warnings
import numpy as np

warnings.filterwarnings("ignore")

## Importing Data 

In [272]:
df_rating = pd.read_csv('goodbooks-10k-master/ratings.csv')
df_books = pd.read_csv('goodbooks-10k-master/books.csv')
df_bookList = pd.read_csv('book_list.csv')
df_booktags = pd.read_csv('goodbooks-10k-master/book_tags.csv')
tags = pd.read_csv('goodbooks-10k-master/tags.csv')

In [273]:
#Descriptive analysis

#unique number of users
print('unique number of users:')
print(df_rating['user_id'].nunique())

#unique number of books
print('unique number of books:')
print(df_rating['book_id'].nunique())

unique number of users:
53424
unique number of books:
10000


In [274]:
# change non-sequential userid factors to sequential factors
user_encoding = LabelEncoder()
df_rating['user_id_enc'] = user_encoding.fit_transform(df_rating['user_id'].values)
n_users = df_rating['user_id_enc'].nunique()

book_encoding = LabelEncoder()
df_rating['book_id_enc'] = book_encoding.fit_transform(df_rating['book_id'].values)
n_books = df_rating['book_id_enc'].nunique()

In [275]:
goodreads_book_id = list(set(df_books.goodreads_book_id))

In [276]:
#add book_tags to df_rating

def booktags(goodreads_book_id):
    book_tags = []
    n = 0

    for book_id in goodreads_book_id:
        tags = []
        for row in df_booktags.itertuples():
            if row[1] == book_id:
                tags.append(row[2])   
        n = n+1        
        book_tags.append([book_id, tags])

        if n%100 == 0:
            print(n,'/',len(goodreads_book_id))
        

In [277]:
#np.save('book_tag.npy', np.array(book_tags))
book_tags = np.load('book_tag.npy', allow_pickle=True)

In [278]:
#add book tags in book dataframe
bt = pd.DataFrame(book_tags)
bt.columns = ['goodreads_book_id', 'tags']
all_book= pd.merge(df_books, bt, on="goodreads_book_id")
book_df = all_book[all_book['language_code'].isin(['en', 'en-CA', 'en-GB', 'en-US','eng'])]

In [347]:
book_df.columns

Index(['book_id', 'goodreads_book_id', 'best_book_id', 'work_id',
       'books_count', 'isbn', 'isbn13', 'authors', 'original_publication_year',
       'original_title', 'title', 'language_code', 'average_rating',
       'ratings_count', 'work_ratings_count', 'work_text_reviews_count',
       'ratings_1', 'ratings_2', 'ratings_3', 'ratings_4', 'ratings_5',
       'image_url', 'small_image_url', 'tags'],
      dtype='object')

In [352]:
sub_book = book_df[['book_id', 'goodreads_book_id', 'isbn', 'isbn13', 'authors',
       'original_title', 'title', 'language_code', 'average_rating',
       'image_url', 'small_image_url', 'tags']]

In [354]:
sub_book.to_csv('goodreads.csv', index=False)

## 1. New User Rating Input Function

1. ask user for their favorite genre
2. create dataset of user's favorite genre books
3. ask user to rate books from the favorite genre book dataset

In [370]:
#book tag code for common genres
genres_name = ['young-adult','business','action', 'adventure', 'classic', 'graphic-novel','detective', 'true-crime','mystery', 'fantasy', 'historical-fiction', 'fiction', 'horror','romance', 'humor', 'science-fiction', 'suspense', 'non-fiction','thrillers', 'biographies', 'essays', 'self-help', 'memoir','history']
genres =[[i[1],i[0]] for g in genres_name for i in tags.values if i[1] == g]

In [371]:
genres =[[i[0],i[1]] for g in genres_name for i in tags.values if i[1] == g]

In [372]:
genres

[[33114, 'young-adult'],
 [5951, 'business'],
 [1540, 'action'],
 [1691, 'adventure'],
 [7404, 'classic'],
 [13547, 'graphic-novel'],
 [9336, 'detective'],
 [31254, 'true-crime'],
 [20939, 'mystery'],
 [11305, 'fantasy'],
 [14487, 'historical-fiction'],
 [11743, 'fiction'],
 [14821, 'horror'],
 [26138, 'romance'],
 [15048, 'humor'],
 [26837, 'science-fiction'],
 [29076, 'suspense'],
 [21689, 'non-fiction'],
 [30386, 'thrillers'],
 [4594, 'biographies'],
 [10891, 'essays'],
 [27095, 'self-help'],
 [19733, 'memoir'],
 [14552, 'history']]

In [341]:
#function that asks user about their favorite book genre
def favorite_genres(genres):
    favorite = []
    fav_code = []
    for cat, code in genres:
        print('Do you like',cat,'?')
        print('answer yes or no')
        
        while True:
            ans = input()
            
            if (ans != 'yes') and (ans != 'no'):
                print('wrong input!')
            
            elif (ans == 'yes'):
                favorite.append(cat)
                fav_code.append(code)
                break
            
            elif (ans == 'no'):
                break
                
    print (favorite)
    return fav_code

In [282]:
#function that creates dataset of user's favorite books
def df_favorite_book(u_fav_genre):
    result = []
    for code in u_fav_genre:
        for row in book_df.itertuples():
            if code in row[24]:
                if list(row)[1:] not in result:
                    result.append(list(row)[1:])

    df = pd.DataFrame(result, columns=list(book_df.columns))
    
    return df

In [324]:
#function that asks user to rate books
def user_rating(top_books):
    user_rating = []
    user_rating_count = 0
    
    bookList = list(top_books[:200])
    
    while user_rating_count < 30:
        twenty_books = random.sample(bookList, 1)

        for bookid in twenty_books:
            item = df_books[df_books['book_id'] == bookid]  
            bookList.remove(bookid)

            title = item['title'].values[0]
            author = item['authors'].values[0]

            print(title,'by', author)
            
            while True:
                rating = input("rating for the book: ")
                if rating == '':
                    break
                    print('rated:',user_rating_count,'/30')
                elif 1 <= int(rating) <=5:
                    user_rating.append([bookid, int(rating)])
                    user_rating_count+=1
                    print('rated:',user_rating_count,'/30')
                    break
                else:
                    print("Invalid rating. Must be between 0-5")
                    print('rated:',user_rating_count,'/30')
    return user_rating

# 2. Cosine Similarity between users and new user

1. change array to matrix with NaN as zero (row=user's rating, column = books)
2. normalize the user rating
3. Fill user's NaN ratings with average rating of user
4. Fill user's NaN ratings with average rating of movie

*I saved the np arrays into pickles for later usage*

In [302]:
# change array to matrix

def embedding():
    rating_emb = []

    for n in range(53424): #53424
        u_rating = df_rating[df_rating['user_id_enc']==n].values.tolist()    
        user_emb = np.array([0 for u_rating in range(10000)], dtype='f')

        for r in u_rating:
            user_emb[r[4]] = r[2] 

        rating_emb.append(user_emb) 

        if n%1000 ==0:
            print(n,'/',53424)
    
    return rating_emb

In [303]:
#np.save('embedding.npy', emb)
# embedding = np.load('embedding.npy', allow_pickle=True)

In [304]:
#normalize user rating
def rating_norm(embedding):
    norm = []
    
    for x,n in enumerate(embedding):
        zero = np.count_nonzero(n==0)
        avg = sum(n)/(len(n)-zero)
            
        for index, i in enumerate(n):
            if i != 0:
                n[index] = i-avg
        
        norm.append(n)

        if x%1000 == 0:
            print(x,'/53424')
                
    return norm


In [305]:
#np.save('r_norm.npy', r_norm)
# r_norm = np.load('r_norm.npy', allow_pickle=True)

In [306]:
def user_avg(rating_emb):
    norm = []
    
    for n in rating_emb:
        zero = np.count_nonzero(n==0)
        avg = sum(n)/(len(n)-zero)
    
        for index, i in enumerate(n):
            if i != 0:
                n[index] = i-avg
            else:
                n[index] = avg
                
        norm.append(n)
    
    return norm

In [307]:
#np.save('user_norm.npy', user_norm)
# user_norm = np.load('user_norm.npy',allow_pickle=True)

In [308]:
def book_average_list(r_norm):
    book_avgs = []

    for i in range(10000):
        x= np.array([row[i] for row in r_norm])
        zero = np.count_nonzero(x==0)
        avg = sum(x)/(len(x)-zero)
        book_avgs.append(avg)

        if i%1000 ==0:
            print(i,'/',10000)
    
    return np.array(book_avgs, dtype='f')

In [309]:
# r_norm = np.load('r_norm.npy', allow_pickle=True)
# book_avgs = book_average_list(r_norm)

In [310]:
# np.save('mov_avg.npy', np.array(mov_avgs, dtype='f'))
# book_avgs = np.load('book_avg.npy', allow_pickle=True)

In [311]:
def book_avg(rating_emb, book_avgs):
    book_emb = []
    
    for index,n in enumerate(rating_emb):
        for i,y in enumerate(n):
            if y == 0:
                n[i] = book_avgs[i] 
        book_emb.append(n)
        
        if index%1000 == 0:
            print(index,'/', len(rating_emb))
    
    return book_emb

In [312]:
# r_norm = np.load('r_norm.npy', allow_pickle=True) 
# book_norm = book_avg(r_norm, book_avgs)

In [313]:
#np.save('book_norm.npy', np.array(book_norm, dtype='f'))
book_norm = np.load('book_norm.npy', allow_pickle=True)

In [314]:
book_norm

array([[ 0.33607832,  0.39728734, -0.7451629 , ...,  0.39692053,
        -0.09444112,  0.13774939],
       [ 0.33607832,  0.5846154 , -0.7451629 , ...,  0.39692053,
        -0.09444112,  0.13774939],
       [ 0.33607832,  0.39728734, -0.7451629 , ...,  0.39692053,
        -0.09444112,  0.13774939],
       ...,
       [-0.21538462,  0.7846154 , -0.7451629 , ...,  0.39692053,
        -0.09444112,  0.13774939],
       [-0.45454547,  0.54545456, -0.7451629 , ...,  0.39692053,
        -0.09444112,  0.13774939],
       [-0.40601504,  0.59398496, -0.40601504, ...,  0.39692053,
        -0.09444112,  0.13774939]], dtype=float32)

In [315]:
#normalizing rating of new user
def norm_newuser(rating_emb):
    zero = np.count_nonzero(rating_emb==0)
    avg = sum(rating_emb)/(len(rating_emb)-zero)
    
    norm = [r-avg for r in rating_emb]
    return norm

In [316]:
#function that returns top 10 similar users
def top10_sim(user_ratings, newUser_input):
    newUser_emb = [rating for book_id, rating in newUser_input]
    newUser_rating = norm_newuser(newUser_emb)
    newUser_bookid = [book_id for book_id,rating in newUser_input]
    
    #create a vector of ratings that new user rated books
    user_vector = []
    for u_rating in user_ratings:
        user= []
        for index,rating in newUser_input:
            user.append(u_rating[index])
        user_vector.append(user)
        
    cosine = []
    for user_id, user in enumerate(user_vector):
        result = 1 - spatial.distance.cosine(user, newUser_rating)
        if np.isnan(result):
            cosine.append([user_id,0])
        else:
            cosine.append([user_id,result])
        
    cosine.sort(key=lambda x: x[1], reverse=True)
    
    return cosine[:10]

# 3. Matrix Factorization

In [294]:
X = df_rating[['user_id_enc', 'book_id_enc']].values 
y = df_rating['rating'].values

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=20)

In [295]:
X_train.shape, X_test.shape, y_train.shape

((4781183, 2), (1195296, 2), (4781183,))

In [None]:
## divide book and user from training data set

# column 0 is users and column 1 is books 
X_train_array = [X_train[:,0], X_train[:, 1]]
X_test_array = [X_test[:,0],X_test[:,1]]

In [None]:
from keras.models import Model
from keras.layers import Embedding, Input, Reshape, Dot, Add, Activation, Lambda
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping

In [None]:
# simple neural network w/o bias and activation
# 1. Create embedding for user and movie 
# 2. Dot product user and movie embedding 


def Recommendation(n_users, n_books, n_emb):
    user_input = Input(shape=(1,))
    user_emb = Embedding(n_users, n_emb)(user_input)
    user = Reshape((n_emb,))(user_emb)
    
    book_input = Input(shape=(1,))
    book_emb = Embedding(n_books, n_emb)(book_input)
    book = Reshape((n_emb,))(book_emb)
    
    output = Dot(axes=1)([user,book]) 

    model = Model(inputs=[user_input, book_input], outputs=output)
    opt = Adam(lr=0.001)
    model.compile(loss="mean_squared_error", optimizer=opt)
    
    return model

In [None]:
# for number of embedding 50 was chosen arbitrarily
model = Recommendation(n_users, n_books, n_emb=50)


In [None]:
model.summary()

In [None]:
#set early stopping monitor so the model stops training when it won't improve anymore
early_stopping_monitor = EarlyStopping(patience=2)

#train model
model.fit(x=X_train_array, y=y_train, batch_size=64 , epochs=5, validation_data=(X_test_array, y_test) , callbacks=[early_stopping_monitor])

In [None]:
# neural network w/ bias and activation
# 1. Create embedding for user and movie 
# 2. Dot product user and movie embedding 
# 3. Add bias to the dot product
# 4. use sigmoid as activation function 


def Recommendation2(n_users, n_books, n_emb):
    user_input = Input(shape=(1,))
    user_emb = Embedding(n_users, n_emb)(user_input)
    user = Reshape((n_emb,))(user_emb)
    user_bias_emb = Embedding(n_users, 1)(user_input)
    user_bias = Reshape((1,))(user_bias_emb)
    
    book_input = Input(shape=(1,))
    book_emb = Embedding(n_books, n_emb)(book_input)
    book = Reshape((n_emb,))(book_emb)
    book_bias_emb = Embedding(n_books, 1)(book_input)
    book_bias = Reshape((1,))(book_bias_emb)
    
    output = Dot(axes=1)([user,book]) 
    output = Add()([output, user_bias, book_bias])
    output = Activation('sigmoid')(output)
    output = Lambda(lambda output:output*(5-0)+0)(output) #5 is max rating and 0 is min rating

    model = Model(inputs=[user_input, book_input], outputs=output)
    opt = Adam(lr=0.001)
    model.compile(loss="mean_squared_error", optimizer=opt)
    
    return model

In [None]:
model2 = Recommendation2(n_users, n_books, n_emb=50)
model2.summary()

In [None]:
#set early stopping monitor so the model stops training when it won't improve anymore
early_stopping_monitor = EarlyStopping(patience=2)

#train model
model2.fit(x=X_train_array, y=y_train, batch_size=64 , epochs=5, verbose=1, validation_data=(X_test_array, y_test), callbacks=[early_stopping_monitor])

In [161]:
from keras.models import load_model 

# model2.save('nn_model2.h5') 

model2 = load_model('nn_model.h5')


Using TensorFlow backend.


In [158]:
##Predicting using model 2
# df_books = pd.read_csv('goodbooks-10k-master/books.csv')
books = np.array(list(set(df_rating.book_id_enc)))

all_users = []

for n in range(n_users):
    user = np.array([n for i in range(len(books))])
    all_users.append(user)
    
    if n%1000 ==0:
        print(n,'/',n_users)
        
        

0 / 53424
1000 / 53424
2000 / 53424
3000 / 53424
4000 / 53424
5000 / 53424
6000 / 53424
7000 / 53424
8000 / 53424
9000 / 53424
10000 / 53424
11000 / 53424
12000 / 53424
13000 / 53424
14000 / 53424
15000 / 53424
16000 / 53424
17000 / 53424
18000 / 53424
19000 / 53424
20000 / 53424
21000 / 53424
22000 / 53424
23000 / 53424
24000 / 53424
25000 / 53424
26000 / 53424
27000 / 53424
28000 / 53424
29000 / 53424
30000 / 53424
31000 / 53424
32000 / 53424
33000 / 53424
34000 / 53424
35000 / 53424
36000 / 53424
37000 / 53424
38000 / 53424
39000 / 53424
40000 / 53424
41000 / 53424
42000 / 53424
43000 / 53424
44000 / 53424
45000 / 53424
46000 / 53424
47000 / 53424
48000 / 53424
49000 / 53424
50000 / 53424
51000 / 53424
52000 / 53424
53000 / 53424


In [162]:
# create dataframe with rating
rating_emb = []

for n,u_emb in enumerate(all_users):
    predictions = model2.predict([u_emb, books])
    predictions = np.array([p[0] for p in predictions])
    rating_emb.append(predictions)

    if n%1000 ==0:
        print(n,'/', len(all_users))
    

0 / 53424
1000 / 53424
2000 / 53424
3000 / 53424
4000 / 53424
5000 / 53424
6000 / 53424
7000 / 53424
8000 / 53424
9000 / 53424
10000 / 53424
11000 / 53424
12000 / 53424
13000 / 53424
14000 / 53424
15000 / 53424
16000 / 53424
17000 / 53424
18000 / 53424
19000 / 53424
20000 / 53424
21000 / 53424
22000 / 53424
23000 / 53424
24000 / 53424
25000 / 53424
26000 / 53424
27000 / 53424
28000 / 53424
29000 / 53424
30000 / 53424
31000 / 53424
32000 / 53424
33000 / 53424
34000 / 53424
35000 / 53424
36000 / 53424
37000 / 53424
38000 / 53424
39000 / 53424
40000 / 53424
41000 / 53424
42000 / 53424
43000 / 53424
44000 / 53424
45000 / 53424
46000 / 53424
47000 / 53424
48000 / 53424
49000 / 53424
50000 / 53424
51000 / 53424
52000 / 53424
53000 / 53424


In [317]:
def top10_book_suggestion(rating_emb, method, scores):
    top10_books = []
    userid = method[scores.index(max(scores))]
    top10 = np.argsort(rating_emb[userid])[-10:]
    
    for bookid in top10:
        top10_books.append(df_books[df_books['book_id'] == bookid]['title'])
    
    return top10_books


# Putting all the functions together

In [325]:
u_fav_genre = favorite_genres(genres)

Do you like young-adult ?
answer yes or no
no
Do you like business ?
answer yes or no
yes
Do you like action ?
answer yes or no
no
Do you like adventure ?
answer yes or no
no
Do you like classic ?
answer yes or no
no
Do you like graphic-novel ?
answer yes or no
no
Do you like detective ?
answer yes or no
no
Do you like true-crime ?
answer yes or no
no
Do you like mystery ?
answer yes or no
no
Do you like fantasy ?
answer yes or no
no
Do you like fiction ?
answer yes or no
no
Do you like horror ?
answer yes or no
no
Do you like romance ?
answer yes or no
no
Do you like humor ?
answer yes or no
no
Do you like science-fiction ?
answer yes or no
no
Do you like suspense ?
answer yes or no
no
Do you like non-fiction ?
answer yes or no
no
Do you like thrillers ?
answer yes or no
no
Do you like biographies ?
answer yes or no
no
Do you like essays ?
answer yes or no
no
Do you like self-help ?
answer yes or no
no
Do you like memoir ?
answer yes or no
no
Do you like history ?
answer yes or no
no


In [326]:
top_books = df_favorite_book(u_fav_genre).sort_values(['average_rating'], ascending=False)['book_id']

In [328]:
newUser_input = user_rating(top_books)
newUser_embedding = [n[1] for n in newUser_input]
newUser_emb = norm_newuser(newUser_embedding)

Red Notice: A True Story of High Finance, Murder, and One Man’s Fight for Justice by Bill Browder
rating for the book: 
Barbarians at the Gate: The Fall of RJR Nabisco by Bryan Burrough, John Helyar
rating for the book: 
The Honest Truth About Dishonesty: How We Lie to Everyone - Especially Ourselves by Dan Ariely
rating for the book: 
Made to Stick: Why Some Ideas Survive and Others Die by Chip Heath, Dan Heath
rating for the book: 
Too Big to Fail: The Inside Story of How Wall Street and Washington Fought to Save the Financial System from Crisis — and Themselves by Andrew Ross Sorkin
rating for the book: 
Fermat's Enigma: The Epic Quest to Solve the World's Greatest Mathematical Problem by Simon Singh
rating for the book: 
Good to Great: Why Some Companies Make the Leap... and Others Don't by James C. Collins
rating for the book: 
The Design of Everyday Things by Donald A. Norman
rating for the book: 
The Medium is the Massage by Marshall McLuhan, Quentin Fiore, Jerome Agel
rating fo

In [331]:
r_norm = np.load('r_norm.npy', allow_pickle=True)
zero = top10_sim(r_norm, newUser_input)

user_norm = np.load('user_norm.npy', allow_pickle=True)
user = top10_sim(user_norm, newUser_input)

book_norm = np.load('book_norm.npy')
book = top10_sim(book_norm, newUser_input)

In [332]:
method =[zero[0][0], user[0][0], book[0][0]]
scores = [zero[0][1], user[0][1], book[0][1]]

In [376]:
top10_book_suggestion(rating_emb, method, scores)

[1902    Son (The Giver, #4)
 Name: title, dtype: object, 1616    Without Fail (Jack Reacher, #6)
 Name: title, dtype: object, 2588    Certain Girls (Cannie Shapiro #2)
 Name: title, dtype: object, 2080    Skin Game (The Dresden Files, #15)
 Name: title, dtype: object, 1893    Wedding Night
 Name: title, dtype: object, 502    2001: A Space Odyssey (Space Odyssey, #1)
 Name: title, dtype: object, 1613    We'll Always Have Summer (Summer, #3)
 Name: title, dtype: object, 2147    Monster
 Name: title, dtype: object, 2505    Casino Royale (James Bond, #1)
 Name: title, dtype: object, 3249    Shopgirl
 Name: title, dtype: object]