This is the code for the Cosine Similarity recommendation system. The limitations of this is:
1. Only take book data into account, and doesn't compare user-to-user
2. The amount of memory and computing power is large. Currently to get around this I'm using 20000 instances of book data.
3. The output is the similarity of 1 book compared to the other in the dataset. Because of so, we cannot really use a new book not in the dataset, and this makes it kind of hard to evaluate. For recommending book for the user, I take their top m books, and look at most similar books.

*Tunable parameters:
- Number of training instances
- m: maximum number of books to consider for recommendations (take m books with highest rating from user and consider them for similarity)


In [1]:
import pandas as pd
import numpy as np

import nltk
from nltk.corpus import stopwords
from brs_data_preprocessing import get_preprocessed_data as preproc, merged_book_ratings as merge

nltk.download('stopwords')


[nltk_data] Downloading package stopwords to
[nltk_data]     /home/legoeuro/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [2]:
usersDf, bookDf, ratingDf = preproc('Users.csv', 'Books.csv', 'Ratings.csv')

#Data concatenation
bookInfo = []
for _, row in bookDf.iterrows():
    bookInfo.append(f"{row['Book-Title']} {row['Book-Author']} {row['Publisher']}")
bookDf["info"] = bookInfo

#configs
bookDf = bookDf.iloc[:20000]
ratingDf['ISBN'] = ratingDf['ISBN'].astype(str)
bookDf['ISBN'] = bookDf["ISBN"].astype(str)
bookDf['id'] = bookDf.index


Add index for later steps 

Preprocess and clean book information
.

In [3]:

#from https://www.kaggle.com/code/muhammadayman/recommendation-system-using-cosine-similarity#Feature-Engineering
stop = stopwords.words('english')
def preprocess(column):
    #make all words with lower letters
    column = column.str.lower()
    #getting rid of any punctution
    column = column.str.replace('http\S+|www.\S+|@|%|:|,|', '', case=False)
    #spliting each sentence to words to apply previous funtions on them 
    word_tokens = column.str.split()
    keywords = word_tokens.apply(lambda x: [item for item in x if item not in stop])
    #assemble words of each sentence again and assign them in new column
    for i in range(len(keywords)):
        keywords[i] = " ".join(keywords[i])
        column = keywords

    return column
bookDf['info'] = preprocess(bookDf['info'])
# bookDf['Book-Title'] = preprocess(bookDf['Book-Title'])
bookDf.head(5)

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,info,id
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,classical mythology mark p. o. morford oxford ...,0
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,clara callan richard bruce wright harperflamin...,1
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,decision normandy carlo d'este harperperennial,2
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,flu: story great influenza pandemic 1918 searc...,3
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company,mummies urumchi e. j. w. barber w. w. norton &...,4


In [4]:
# from sklearn import preprocessing

# mms = preprocessing.MinMaxScaler()

# bookDf['Year-Of-Publication'] = (bookDf['Year-Of-Publication'] - bookDf['Year-Of-Publication'].mean()) / bookDf['Year-Of-Publication'].std() 

Apply cosine vectorizer to all books; Find user top 10 books that they have not read

In [5]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from scipy.spatial.distance import pdist

CV = CountVectorizer()
titleVect = CV.fit_transform(bookDf['info'])
# authVect = CV.fit_transform(bookDf['Book-Author'])
# pubVect = CV.fit_transform(bookDf['Publisher'])

titleSim = cosine_similarity(titleVect)

normalize = 10
bookRatingMerge = merge(ratings_df=ratingDf, books_df=bookDf)
def recommendFromUser(userId, n, m):
    """
    userId: id of the user in the dataset
    n: number of recommendations
    m: maximum number of books to consider for recommendations (take m books with highest rating from user and consider them for similarity)
       made with the assumption that book-rating only goes from 1 to 10
    """
    userRatings = bookRatingMerge[bookRatingMerge['User-ID'] == userId]
    userRatings = userRatings[userRatings['Book-Rating'] > 5]

    ratingTruncated = userRatings.nlargest(m, 'Book-Rating')
    print(ratingTruncated.to_markdown())
    recommendations = []
    for _, row in bookDf.iterrows():
        bookId = row['ISBN']
        bookIndex = bookDf[bookDf['ISBN'] == bookId].index[0]
        sim = 0
        isBookRead = False
        for _, ratingRow in ratingTruncated.iterrows():
            ratingBookId = ratingRow['ISBN']
            ratingBookIndex = bookDf[bookDf['ISBN'] == ratingBookId].index[0]
            if bookIndex == ratingBookIndex:
                isBookRead = True
                break
            #weight by rating
            sim += ratingRow['Book-Rating']/(normalize) * titleSim[bookIndex][ratingBookIndex]
        if (not isBookRead):
            recommendations.append((bookId, sim)) 
    recommendations.sort(key=lambda x: x[1], reverse=True)
    return recommendations[:n]

#Example: get the top recommendation for the user with id 276709
recommendation = recommendFromUser(276709, 10, 10)
mappedRec = map(lambda x: (bookDf[bookDf['ISBN'] == x[0]]['Book-Title'].values[0], x[1]), recommendation)
print(list(mappedRec))



|        |       ISBN | Book-Title                                       | Book-Author       |   Year-Of-Publication | Publisher   | info                                                                      |    id |   User-ID |   Book-Rating |
|-------:|-----------:|:-------------------------------------------------|:------------------|----------------------:|:------------|:--------------------------------------------------------------------------|------:|----------:|--------------:|
| 344029 | 0515107662 | The Sherbrooke Bride (Bride Trilogy (Paperback)) | Catherine Coulter |                  1996 | Jove Books  | sherbrooke bride (bride trilogy (paperback)) catherine coulter jove books | 15978 |    276709 |            10 |
[('The Scottish Bride (Bride Trilogy (Paperback))', 0.9090909090909093), ('The Heiress Bride (Bride Trilogy (Paperback))', 0.9090909090909093), ('The Sherbrooke Twins', 0.6154574548966638), ('The Cove', 0.5393598899705937), ('The Maze', 0.5393598899705937), ('The E

In [11]:
def recommendFromBook(ISBN):
    recList = []
    thisBookId = bookDf[bookDf['ISBN'] == ISBN].index[0]
    for _, row in bookDf.iterrows():
        bookId = row['ISBN']
        bookIndex = bookDf[bookDf['ISBN'] == bookId].index[0]
        if (ISBN != bookId):
            recList.append((bookId, titleSim[bookIndex][thisBookId]))
    recList.sort(key=lambda x: x[1], reverse=True)
    recList = recList[slice(5)]
    return recList

#Example: get the top recommendation for the book with ISBN 034545104X
print(f"Book title: {bookDf[bookDf['ISBN'] == '080652121X']['Book-Title'].values[0]}")
rec = recommendFromBook('080652121X')
mappedRec = map(lambda x: (bookDf[bookDf['ISBN'] == x[0]]['Book-Title'].values[0], x[1]), rec)
print(list(mappedRec))

print(f"Book title: {bookDf[bookDf['ISBN'] == '0374157065']['Book-Title'].values[0]}")
rec2 = recommendFromBook('0374157065')
mappedRec = map(lambda x: (bookDf[bookDf['ISBN'] == x[0]]['Book-Title'].values[0], x[1]), rec2)
print(list(mappedRec))
    

Book title: Hitler's Secret Bankers: The Myth of Swiss Neutrality During the Holocaust
[('Secret', 0.2461829819586655), ('Young Adam', 0.2461829819586655), ('Secret of Nimh Storybook', 0.2279211529192759), ('The Diaries of Adam and Eve', 0.21320071635561041), ("The Modern Witch's Guide to Magic and Spells", 0.19069251784911848)]
Book title: Flu: The Story of the Great Influenza Pandemic of 1918 and the Search for the Virus That Caused It
[('The Great Victorian Collection', 0.36514837167011066), ('The Shackle', 0.3464101615137754), ('Before and After', 0.3464101615137754), ('The Pickup', 0.31622776601683794), ('Housekeeping', 0.31622776601683794)]
