# Browsing Books - A Recommendation Engine for Books
For CMPT3520 Machine Learning II <br/>
Annabell Rodriguez, Laura Brin, Sandra Alex

## Introduction

### Business Problem

Like most recommendation systems, book recommendations are rooted in commerce, and what you are shown is often based on data collected specifically about you. E-commerce venues like Amazon, Chapters, and Google Play Books look at which books you have purchased before, which authors and genres you like, and what other people with similar literary tastes are reading. All of these go into recommending books for a specific person and can be extremely useful in e-commerce or on book rating sites like Goodreads and Bookish.

Public libraries face the same problem, but in aggregate. More than a million books are published around the world each year, and, alongside a resource selection criterion, a book recommendation system could help libraries understand what their readers may want. By understanding their own regular patrons, libraries can better predict and select new books that will be popular in their areas. This may also highlight demand for particular genres, languages, or age groups.


### Data

Two datasets are used for this solution. The first is a book recommendation dataset from Kaggle. It contains three files: books, users, and ratings. The books file has roughly 270,000 books identified by ISBN, with book title, author, year of publication, publisher, and thumbnail images of the book cover on Amazon. The ratings file has over a million ratings. The users file contains the age and location of 278,858 users. The data is sourced from bookcrossing.com. The second dataset is goodbooks-10k by GitHub user Zygmunt Zajac, extended by Olivier Simard-Hanley with fields for description, number of pages, and genres for roughly 10,000 books. This data is sourced from Goodreads.


https://github.com/zygmuntz/goodbooks-10k<br/>
https://github.com/malcolmosh/goodbooks-10k-extended<br/> 
https://www.kaggle.com/datasets/arashnic/book-recommendation-dataset<br/>

Additionally, some code formatting is taken from a Kaggle notebook by user Vishorita<br/>
https://www.kaggle.com/code/vishorita/best-recommendation-collabarative-filtering 


### Evaluation Metrics

Accuracy in any of its forms is hard to assess for recommendation systems without ground truth labels. While similarity measures like cosine similarity or Euclidean distance can tell us which clusters exist, they cannot tell us whether those clusters are meaningful. For this project we use a smoke test: we select a book, or a list of books, that we already know well and check whether the recommended books include the titles or authors we would expect.
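
A smoke test of this kind can be written down as a small check. The sketch below is illustrative only: the helper name `smoke_test`, the toy recommendation table, and the expected authors are assumptions made for the example and are not part of the notebook's pipeline.

```python
import pandas as pd

def smoke_test(recommended, expected_authors, author_col="authors_str"):
    """Return the fraction of expected authors that appear in the recommendations.

    `recommended` is a DataFrame of recommended books; `expected_authors` is a
    list of names we would expect to see based on prior knowledge.
    """
    found = [
        name for name in expected_authors
        if recommended[author_col].str.contains(name, case=False, regex=False).any()
    ]
    return len(found) / len(expected_authors)

# Hypothetical recommendations (toy data, not real model output).
toy_recs = pd.DataFrame({
    "original_title": ["Book A", "Book B", "Book C"],
    "authors_str": ["J.K. Rowling", "Rick Riordan", "Dan Brown"],
})

hit_rate = smoke_test(toy_recs, ["J.K. Rowling", "Rick Riordan"])
print(hit_rate)
```

A hit rate near 1 only says that the list contains the names we expected; it does not replace a proper offline evaluation.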

Content-based filtering: Content-based filtering has advantages over collaborative filtering when it comes to online predictions for new books. The model does not rely on finding relationships between users or ratings; instead it finds relationships between books through NLP on their text, matching similar words, genres, or authors.

Collaborative Filtering: Recommendations are much simpler for offline prediction where the books have been rated by multiple users. It still faces issues with sparsity, so we use an ensemble approach that looks for books first in a matrix restricted to frequently rated books and more active users, and then in the sparser matrix. Online predictions are more difficult for new books than for new users. New users will receive recommendations that start out poor or random and improve as additional knowledge is added implicitly or explicitly. New books, unless very popular and quickly reviewed, will start with very few ratings and may take time to appear on users' recommendation lists.
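
The two-stage lookup described above (dense matrix first, sparse matrix second) could be sketched as follows. The function names `similar_titles` and `recommend_two_stage` and the tiny rating matrices are hypothetical and exist only to illustrate the idea on toy data.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

def similar_titles(title, pt, top_n=2):
    """Return up to top_n titles most similar to `title` in a book-by-user matrix."""
    if title not in pt.index:
        return []
    sims = cosine_similarity(pt.values)
    row = sims[pt.index.get_loc(title)]
    order = np.argsort(row)[::-1]  # most similar first
    return [pt.index[i] for i in order if pt.index[i] != title][:top_n]

def recommend_two_stage(title, pt_dense, pt_sparse, top_n=2):
    """Try the dense matrix first; fall back to the sparse one if nothing is found."""
    recs = similar_titles(title, pt_dense, top_n)
    return recs if recs else similar_titles(title, pt_sparse, top_n)

# Toy book-by-user ratings (0 = not rated); book "D" is missing from the dense matrix.
pt_sparse_toy = pd.DataFrame(
    [[5, 4, 0, 0], [5, 5, 0, 0], [0, 0, 4, 5], [0, 1, 5, 5]],
    index=["A", "B", "C", "D"], columns=["u1", "u2", "u3", "u4"])
pt_dense_toy = pt_sparse_toy.loc[["A", "B", "C"]]

print(recommend_two_stage("A", pt_dense_toy, pt_sparse_toy, top_n=1))
print(recommend_two_stage("D", pt_dense_toy, pt_sparse_toy, top_n=1))
```

Here "A" is answered from the dense matrix, while "D" only exists in the sparse matrix and is answered by the fallback.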


## Browsing Books

### Loading data

In [None]:
#Loading Libraries
from __future__ import print_function

import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)
import collections
from mpl_toolkits.mplot3d import Axes3D
from IPython import display
from matplotlib import pyplot as plt
import sklearn
import sklearn.manifold
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
tf.logging.set_verbosity(tf.logging.ERROR)

# Add some convenience functions to Pandas DataFrame.
pd.options.display.max_rows = 10
pd.options.display.float_format = '{:.3f}'.format
def mask(df, key, function):
  """Returns a filtered dataframe, by applying function to key"""
  return df[function(df[key])]

def flatten_cols(df):
  df.columns = [' '.join(col).strip() for col in df.columns.values]
  return df

pd.DataFrame.mask = mask
pd.DataFrame.flatten_cols = flatten_cols

# Install Altair and activate its colab renderer.
#print("Installing Altair...")
#!pip install git+git://github.com/altair-viz/altair.git
#!pip install altair vega_datasets
import altair as alt
alt.data_transformers.enable('default', max_rows=None)
alt.renderers.enable('colab')
#print("Done installing Altair.")

from sklearn.metrics.pairwise import cosine_similarity

import re
import nltk
# Note: clean_text (defined later) needs the NLTK tokenizer, stopwords, and WordNet
# resources to be available locally (they can be fetched with nltk.download).
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

In [None]:
#Loading Dataset
from ast import literal_eval
 
books_small = pd.read_csv('https://raw.githubusercontent.com/malcolmosh/goodbooks-10k/master/books_enriched.csv', index_col=[0],dtype={"isbn":object}, converters={"genres": literal_eval, "authors": literal_eval})
#note books_enriched.csv is a modified version of books.csv from goodbooks-10k dataset
# Forward slashes work on Windows, macOS, and Linux
users = pd.read_csv("Datasets/Users.csv")
ratings = pd.read_csv("Datasets/Ratings.csv")
books_large = pd.read_csv("Datasets/Books.csv", dtype={"ISBN": object}, low_memory=False)


### Books

In [None]:
books_large.head()

In [None]:
books_small.head(5)

In [None]:
print(books_large.shape)
print(books_small.shape)

In [None]:
print(books_large.columns)
print(books_small.columns)

In [None]:
books_small = books_small[books_small['language_code'] == 'eng'].copy()
books_small = books_small[["authors", "description", "genres", "isbn", "original_title", "original_publication_year"]].copy()
books_small.shape

In [None]:
# number of books by number of genres
# (lists are unhashable, so count the list lengths rather than the lists themselves)
books_small['genres'].map(len).value_counts().sort_index()

In [None]:
books_small['genres']

In [None]:
books_small["isbn"].isin(ratings["ISBN"]).value_counts()

In [None]:
books_small["isbn"].isin(books_large["ISBN"]).value_counts()

In [None]:
books_small["original_title"].isin(books_large["Book-Title"]).value_counts()

In [None]:
books_large["Book-Title"].isin(books_small["original_title"]).value_counts()

Combining datasets

In [None]:
all_books=pd.merge(books_large,books_small,how="inner",right_on="original_title",left_on="Book-Title")
all_books.head()

In [None]:
all_books.shape

In [None]:
all_books.drop(['index','Book-Title','isbn13','books_count','publishDate','isbn','language_code','Book-Author', 'Year-Of-Publication', 'Publisher','small_image_url','Image-URL-S','Image-URL-M','Image-URL-L'],axis=1,inplace=True, errors='ignore')

In [None]:
all_books["ISBN"].isin(ratings["ISBN"]).value_counts()

In [None]:
all_books.head()

In [None]:
# adding a running count for rows that share the same title and authors (1 = first occurrence)
all_books['duplicate_count'] = all_books.groupby(['original_title', all_books['authors'].map(tuple)]).cumcount()+1
all_books.head()

In [None]:
all_books.shape

### Users

In [None]:
users.head()

In [None]:
# Split "city, state, country" into three columns; expand=True returns a DataFrame
users[['city', 'state', 'country']] = users["Location"].str.split(",", n=2, expand=True)
users.drop(['Location'],axis=1,inplace=True, errors='ignore')
users.head()

In [None]:
# Assign the result back instead of using inplace=True on a column selection
users["country"] = users["country"].fillna("Unknown")
users["state"] = users["state"].fillna("Unknown")
users["city"] = users["city"].fillna("Unknown")

In [None]:
country_list=users["country"].value_counts().where(users["country"].value_counts()>7500,other="Other")
print(country_list)


In [None]:
users["new_country"]=users["country"].apply(lambda x: x if country_list[x]!="Other" else "Other")
users["new_country"].value_counts()


In [None]:
state_list=users["state"].value_counts().where(users["state"].value_counts()>5000,other="Other")
users["new_state"]=users["state"].apply(lambda x: x if state_list[x]!="Other" else "Other")
users["new_state"].value_counts()

In [None]:
city_list=users["city"].value_counts().where(users["city"].value_counts()>1500,other="Other")
users["new_city"]=users["city"].apply(lambda x: x if city_list[x]!="Other" else "Other")
users["new_city"].value_counts()

In [None]:
users.describe()

In [None]:
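import random
random.seed(42)

# Replace implausible ages (100 or older) with a random age between 24 and 44.
# NaN fails the `x < 100` comparison, so missing ages are imputed the same way.
def impute_age(x):
    return x if x < 100 else random.randint(24, 44)

users['Age'] = users['Age'].apply(impute_age)
users.describe()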

In [None]:
users['Age'].value_counts()

In [None]:
users.describe(include=[object])

In [None]:
users.drop(['city','state','country'],axis=1,inplace=True, errors='ignore')

### Ratings

In [None]:
ratings.head()

In [None]:
#sort column list authors
all_books['authors']=all_books['authors'].apply(lambda x: sorted(x))
all_books['authors_str'] = all_books['authors'].apply(lambda x: ' '.join(map(str, x)))
all_books = all_books.copy()
all_books.head()

In [None]:
ratings_books=pd.merge(ratings,all_books,how="inner",on="ISBN")
ratings_books=pd.merge(ratings_books,all_books[all_books['duplicate_count'] == 1],how="inner", on=["original_title","original_publication_year", "authors_str"])
ratings_books

In [None]:
ratings_books['ISBN']=ratings_books.apply(lambda x: x['ISBN_y'] if x['duplicate_count_x']>1 and x['duplicate_count_y']==1 else x['ISBN_x'],axis=1)
ratings_books.head(5)

In [None]:
#removing when duplicate_count greater than 1
all_books=all_books[all_books['duplicate_count'] == 1]
#removing column duplicate_count
all_books.drop(['duplicate_count'],axis=1,inplace=True, errors='ignore')
all_books.reset_index(drop=True, inplace=True)
all_books.shape

In [None]:
all_books['Book_id']=all_books.index + 1

In [None]:
ratings_books.shape

In [None]:
#substituting ISBN in ratings_books with book_id
ratings_books=pd.merge(ratings_books,all_books,how="left",left_on="ISBN",right_on="ISBN")


In [None]:
# keeping only User-ID, Book_id, and Book-Rating (copy so later in-place edits are safe)
ratings_books = ratings_books[['User-ID', 'Book_id','Book-Rating']].copy()
ratings_books.duplicated().value_counts()

In [None]:
ratings_books.drop_duplicates(inplace=True)
ratings_books.shape

In [None]:
ratings_books.head(-5)

In [None]:
all_books.head()

### Visuals

To visualize genres, books are assigned a random genre from their genre list

In [None]:
all_books_final = all_books.copy()
# pick one genre at random from each book's genre list
# (avoids shadowing the built-in `filter`)
all_books_final['genre_rnd'] = all_books_final['genres'].apply(random.choice)

In [None]:
all_books_final.reset_index(drop=True, inplace=True)

In [None]:
all_books_final['genre_rnd'].value_counts()

In [None]:
book_ratings = all_books_final.merge(
    ratings_books
    .groupby("Book_id", as_index=False)
    .agg({'Book-Rating': ['count', 'mean']})
    .flatten_cols(),
    on='Book_id')

genre_filter = alt.selection_multi(fields=['genre_rnd'])

genre_chart = alt.Chart().mark_bar().encode(
    x="count()",
    y=alt.Y('genre_rnd'),
    color=alt.condition(
        genre_filter,
        alt.Color("genre_rnd:N"),
        alt.value('lightgray'))
).properties(height=600, selection=genre_filter)

In [None]:
(book_ratings[['Book_id', 'Book-Rating count', 'Book-Rating mean']]
 .sort_values('Book-Rating count', ascending=False)
 .head(10))

In [None]:
def filtered_hist(field, label, filter):
  """Creates a layered chart of histograms.
  The first layer (light gray) contains the histogram of the full data, and the
  second contains the histogram of the filtered data.
  Args:
    field: the field for which to generate the histogram.
    label: String label of the histogram.
    filter: an alt.Selection object to be used to filter the data.
  """
  base = alt.Chart().mark_bar().encode(
      x=alt.X(field, bin=alt.Bin(maxbins=10), title=label),
      y="count()",
  ).properties(
      width=300,
  )
  return alt.layer(
      base.transform_filter(filter),
      base.encode(color=alt.value('lightgray'), opacity=alt.value(.7)),
  ).resolve_scale(y='independent')

In [None]:
# Display the number of ratings and average rating per book.
alt.hconcat(
    filtered_hist('Book-Rating count', '# ratings / book', genre_filter),
    filtered_hist('Book-Rating mean', 'mean rating', genre_filter),
    genre_chart,
    data=book_ratings)

## Content-Based Filtering

In [None]:
all_books_final.head()

In [None]:
# Find the rows with duplicate book titles
duplicate_title_rows = all_books_final[all_books_final.duplicated(subset=['original_title', 'authors_str'], keep=False)]

# Sort the duplicate rows by book title for easier inspection
duplicate_title_rows = duplicate_title_rows.sort_values(by='original_title')

# Print the list of books with duplicate titles
print("List of books with duplicate titles:\n")
print(duplicate_title_rows[['original_title', 'authors_str']])

In [None]:
books = all_books_final.copy()

In [None]:
books['genres_str'] = books['genres'].apply(lambda x: ' '.join(x))

In [None]:
features=['original_title', 'authors_str', 'genres_str', 'description']
features[1:]
books[features[3]]

In [None]:
# Build the stop-word set and the lemmatizer once, rather than once per word or row
STOP_WORDS = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()

def clean_text(text):
    """Lowercase, strip punctuation, drop stop words, and lemmatize the text."""
    # Convert text to lowercase
    text = text.lower()

    # Remove punctuation and special characters
    text = re.sub(r'[^\w\s]', '', text)

    # Tokenize text
    words = nltk.word_tokenize(text)

    # Remove stopwords
    words = [word for word in words if word not in STOP_WORDS]

    # Apply lemmatization
    words = [LEMMATIZER.lemmatize(word) for word in words]

    # Reconstruct the text
    return ' '.join(words)

def combine_features(row, features=('original_title', 'authors_str', 'genres_str', 'description')):
    """Join the given columns of a row into one space-separated string."""
    return ' '.join(str(row[feature]) for feature in features)

# Clean and preprocess the text features
features = ['original_title', 'authors_str', 'genres_str', 'description']
books['combined_features'] = books.apply(combine_features, args=(features,), axis=1)
books['cleaned_combined_features'] = books['combined_features'].apply(clean_text)

In [None]:
books.head()

In [None]:
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(books['cleaned_combined_features'])

In [None]:
cosine_sim = cosine_similarity(tfidf_matrix)

In [None]:
def recommend_books_content(title, cosine_sim_internal=cosine_sim, top_n=5):
    # Get the index of the book that matches the title
    idx = books[books['original_title'] == title].index

    if len(idx) == 0:
        print("Book not found in dataset")
        return None

    # Get the pairwise similarity scores of all books with that book
    sim_scores = list(enumerate(cosine_sim_internal[idx[0]]))

    # Sort the books based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Keep the top_n most similar books, excluding the input book itself
    sim_scores = [score for score in sim_scores if score[0] != idx[0]][:top_n]

    # Get the book indices
    book_indices = [i[0] for i in sim_scores]

    # Return the top_n most similar books
    return books.iloc[book_indices][['original_title', 'authors', 'genres', 'description', 'cleaned_combined_features']]	

In [None]:
recommended_books = recommend_books_content("Fantastic Beasts and Where to Find Them")
recommended_books

In [None]:
recommended_books = recommend_books_content("The Princess Bride")
recommended_books

In [None]:
recommended_books = recommend_books_content("The Da Vinci Code")
recommended_books

### Testing without the description

In [None]:
books.head()

In [None]:
# Clean and preprocess the text features
features = ['original_title', 'authors_str', 'genres_str']
books['combined_features_2'] = books.apply(combine_features, args=(features,), axis=1)
books['cleaned_combined_features_2'] = books['combined_features_2'].apply(clean_text) 

In [None]:
books.head()

In [None]:
tfidf_vectorizer_2 = TfidfVectorizer()
tfidf_matrix_2 = tfidf_vectorizer_2.fit_transform(books['cleaned_combined_features_2'])

In [None]:
cosine_sim_2 = cosine_similarity(tfidf_matrix_2)

In [None]:
recommended_books = recommend_books_content("Fantastic Beasts and Where to Find Them", cosine_sim_internal=cosine_sim_2)
recommended_books

In [None]:
recommended_books = recommend_books_content("The Princess Bride", cosine_sim_internal=cosine_sim_2)
recommended_books

In [None]:
recommended_books = recommend_books_content("The Da Vinci Code", cosine_sim_internal=cosine_sim_2)
recommended_books

### With user preference

In [None]:
# Let's assume the user has read and liked the following books:
user_preferences = ['The Fault in Our Stars', 'Pride and Prejudice', 'Memoirs of a Geisha']

# Extract the cleaned_combined_features of these books
user_profile = books[books['original_title'].isin(user_preferences)]['cleaned_combined_features'].tolist()

# Combine the features of the user's preferred books
user_profile_combined = ' '.join(user_profile)

# Calculate the cosine similarity between the user profile and all books
tfidf_user_profile = tfidf_vectorizer.transform([user_profile_combined])
cosine_sim_user = cosine_similarity(tfidf_user_profile, tfidf_matrix)

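# Get the top N recommendations, most similar first, skipping books the user already listed
top_n = 5
ranked = cosine_sim_user.flatten().argsort()[::-1]
already_read = books['original_title'].isin(user_preferences).to_numpy()
book_indices = [i for i in ranked if not already_read[i]][:top_n]
recommended_books = books.iloc[book_indices][['original_title', 'authors']]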

print("Top", top_n, "book recommendations for the user:")
print(recommended_books)

In [None]:
# Extract the cleaned_combined_features of these books
user_profile_2 = books[books['original_title'].isin(user_preferences)]['cleaned_combined_features_2'].tolist()

# Combine the features of the user's preferred books
user_profile_combined_2 = ' '.join(user_profile_2)

# Calculate the cosine similarity between the user profile and all books
tfidf_user_profile_2 = tfidf_vectorizer_2.transform([user_profile_combined_2])
cosine_sim_user_2 = cosine_similarity(tfidf_user_profile_2, tfidf_matrix_2)

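# Get the top N recommendations, most similar first, skipping books the user already listed
top_n = 5
ranked_2 = cosine_sim_user_2.flatten().argsort()[::-1]
already_read_2 = books['original_title'].isin(user_preferences).to_numpy()
book_indices = [i for i in ranked_2 if not already_read_2[i]][:top_n]
recommended_books = books.iloc[book_indices][['original_title', 'authors']]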

print("Top", top_n, "book recommendations for the user:")
print(recommended_books)

## Collaborative Filtering Based Recommender System

Collaborative filtering provides recommendations to users based on neighbours with similar rating patterns. In the user-based variant, a similarity measure ranks the opinions of closer neighbours as more valuable than those of neighbours farther away. Essentially, proximity is used to weight other users’ ratings, which are then normalized and aggregated to produce a predicted rating, and therefore a rank, for each book for the target user. <br/>
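
The weighting step can be illustrated with a minimal sketch. It assumes a small users-by-books array in which 0 means "not rated"; the function `predict_rating` and the toy ratings are hypothetical and are not used elsewhere in this notebook, which works with item-to-item similarity instead.

```python
import numpy as np

def predict_rating(ratings, target_user, target_book, k=2):
    """Predict a rating as the similarity-weighted mean of the k nearest users' ratings."""
    # Cosine similarity between the target user and every user.
    norms = np.linalg.norm(ratings, axis=1)
    sims = ratings @ ratings[target_user] / (norms * norms[target_user] + 1e-12)

    # Only other users who actually rated the target book can contribute.
    candidates = np.where(ratings[:, target_book] > 0)[0]
    candidates = candidates[candidates != target_user]
    if candidates.size == 0:
        return None

    # Keep the k most similar candidates and normalize their weights.
    nearest = candidates[np.argsort(sims[candidates])[::-1][:k]]
    weights = sims[nearest]
    if weights.sum() <= 0:
        return None
    return float(weights @ ratings[nearest, target_book] / weights.sum())

# Toy ratings: rows are users, columns are books.
toy_ratings = np.array([
    [5.0, 4.0, 0.0],  # target user, has not rated book 2
    [5.0, 5.0, 4.0],  # similar taste, rated book 2 with 4
    [1.0, 0.0, 2.0],  # less similar taste, rated book 2 with 2
])

predicted = predict_rating(toy_ratings, target_user=0, target_book=2)
print(round(predicted, 2))
```

Because the second user is much closer to the target user than the third, the prediction lands nearer to 4 than to 2.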

For item-based filtering, the resulting matrix of users and books helps determine similarity of books not based on content but rather on the behaviour of the users. This can be helpful when combined with content-based filtering to provide books that the user may like but may not have been exposed to before. 


In [None]:
books_large.shape

In [None]:
ratings.shape

In [None]:
complete_df = ratings.merge(books_large, on=['ISBN'], how='inner')
print(complete_df.shape)
complete_df.head()

In [None]:
def create_pt(users_min_ratings=200, books_min_ratings=50):
    """Build a book-by-user rating matrix from active users and frequently rated books.

    Keeps users with more than `users_min_ratings` ratings and, among their ratings,
    books with at least `books_min_ratings` ratings. Missing ratings are filled with 0.
    """
    x = complete_df.groupby('User-ID').count()['Book-Rating']>users_min_ratings
    knowledgeable_users = x[x].index

    filtered_rating = complete_df[complete_df['User-ID'].isin(knowledgeable_users)]

    y = filtered_rating.groupby('Book-Title').count()['Book-Rating']>=books_min_ratings
    famous_books = y[y].index

    final_ratings =  filtered_rating[filtered_rating['Book-Title'].isin(famous_books)]

    pt = final_ratings.pivot_table(index='Book-Title',columns='User-ID'
                          ,values='Book-Rating')

    pt.fillna(0,inplace=True)

    return pt

In [None]:
pt = create_pt()
pt

In [None]:
book_sparse = csr_matrix(pt)
book_sparse

In [None]:
similarity_score = cosine_similarity(book_sparse)

In [None]:
similarity_score.shape

In [None]:
def recommend_books_collaborative(title, pt_internal=pt, cosine_sim_internal=similarity_score, top_n=5):
    # Get the index of the book that matches the title in pt
    index = np.where(pt_internal.index==title)[0]

    if len(index) == 0:
        print("Book not found in dataset")
        return None

    # Rank all books by similarity, then keep the top_n most similar, excluding the input book
    ranked = sorted(enumerate(cosine_sim_internal[index[0]]), key=lambda x: x[1], reverse=True)
    similar_books = [item for item in ranked if item[0] != index[0]][:top_n]

    # Map each title in the pivot table back to its first matching row in books_large
    book_indices = []
    for i in similar_books:
        book_indices.append((books_large[books_large['Book-Title'] == pt_internal.index[i[0]]]).head(1).index[0])

    # Return the top_n most similar books
    return books_large.iloc[book_indices][['Book-Title', 'Book-Author', 'Image-URL-M']]	

In [None]:
index = np.where(pt.index=="1984")[0]
index

In [None]:
sorted(list(enumerate(similarity_score[index[0]])),key=lambda x:x[1], reverse=True)

In [None]:
recommend_books_collaborative("Fantastic Beasts and Where to Find Them")

In [None]:
recommend_books_collaborative("The Princess Bride")

In [None]:
recommend_books_collaborative("The Da Vinci Code")

### Using knn

In [None]:
model_knn = NearestNeighbors(algorithm='brute')
model_knn.fit(book_sparse)

In [None]:
def recommend_books_collaborative_knn(title, pt_internal=pt, model=model_knn, top_n=5):
    # Get the index of the book that matches the title in pt
    index = np.where(pt_internal.index==title)[0]

    if len(index) == 0:
        print("Book not found in dataset")
        return None

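    # Ask for one extra neighbour because the nearest neighbour is normally the input book itself
    distances, suggestions = model.kneighbors(
        pt_internal.iloc[index[0], :].values.reshape(1, -1), n_neighbors=top_n + 1)

    # Map each neighbouring title back to its first matching row in books_large
    book_indices = []
    for i in suggestions[0]:
        if i == index[0]:
            continue  # skip the input book
        book_indices.append((books_large[books_large['Book-Title'] == pt_internal.index[i]]).head(1).index[0])
    book_indices = book_indices[:top_n]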

    # Return the top_n most similar books
    return books_large.iloc[book_indices][['Book-Title', 'Book-Author', 'Image-URL-M']]	

In [None]:
recommend_books_collaborative_knn("Fantastic Beasts and Where to Find Them")

In [None]:
recommend_books_collaborative_knn("The Princess Bride")

In [None]:
recommend_books_collaborative_knn("The Da Vinci Code")

### Adding more books

In [None]:
pt_new = create_pt(10, 10)
pt_new.shape

In [None]:
book_sparse_new = csr_matrix(pt_new)
book_sparse_new

In [None]:
similarity_score_new = cosine_similarity(book_sparse_new)
similarity_score_new.shape

In [None]:
recommend_books_collaborative("Fantastic Beasts and Where to Find Them", pt_internal=pt_new, cosine_sim_internal=similarity_score_new)

In [None]:
recommend_books_collaborative("The Princess Bride", pt_internal=pt_new, cosine_sim_internal=similarity_score_new)

In [None]:
recommend_books_collaborative("The Da Vinci Code", pt_internal=pt_new, cosine_sim_internal=similarity_score_new)

In [None]:
model_knn_new = NearestNeighbors(algorithm='brute')
model_knn_new.fit(book_sparse_new)

In [None]:
recommend_books_collaborative_knn("Fantastic Beasts and Where to Find Them", pt_internal=pt_new, model=model_knn_new)

In [None]:
recommend_books_collaborative_knn("The Princess Bride", pt_internal=pt_new, model=model_knn_new)

In [None]:
recommend_books_collaborative_knn("The Da Vinci Code", pt_internal=pt_new, model=model_knn_new)

## Performance Evaluation

Smoke tests indicate that an ensemble approach would be best at providing accurate book recommendations. Content-based recommendations that use book descriptions should be cross-referenced against a list generated without descriptions to remove erroneous suggestions. Additionally, mixing collaborative and content-based recommendations makes it more likely that a user is recommended something they would not have received from a single method. Restricting the rating matrix to active users and frequently rated books makes the calculations faster, but at the cost of dropping books with few ratings. A lookup for such a book may return no results, which can be mitigated by falling back to a content-based list.
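
One possible way to combine the two approaches is sketched below. It assumes each recommender is a callable that takes a title and returns a list of titles (or `None` when the book is unknown); the notebook's own functions return DataFrames, so `recommend_with_fallback` and the toy lookup tables are hypothetical stand-ins rather than the implemented method.

```python
def recommend_with_fallback(title, collaborative_fn, content_fn, top_n=5):
    """Interleave collaborative and content-based titles; content alone acts as a fallback."""
    collab = collaborative_fn(title) or []
    content = content_fn(title) or []
    merged = []
    # Alternate between the two lists, skipping duplicates and the query itself.
    for i in range(max(len(collab), len(content))):
        for source in (collab, content):
            if i < len(source) and source[i] != title and source[i] not in merged:
                merged.append(source[i])
    return merged[:top_n]

# Toy recommenders backed by dictionaries (made-up titles).
collab_lookup = {"Book A": ["Book B", "Book C"]}
content_lookup = {"Book A": ["Book C", "Book D"], "Book E": ["Book F"]}

mixed = recommend_with_fallback("Book A", collab_lookup.get, content_lookup.get)
fallback_only = recommend_with_fallback("Book E", collab_lookup.get, content_lookup.get)
print(mixed)
print(fallback_only)
```

"Book E" has no collaborative neighbours in the toy data, so its list comes entirely from the content-based recommender.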