This is the code for the XGBoost recommendation system. The limitations of this is:
1. Only take book data into account, and doesn't compare user-to-user
2. The amount of memory and computing power is large, with the whole book dataset python needs to allocate ~80Gb. Currently to get around this I'm using 20000 instances of book data.

In [92]:
import pandas as pd
import numpy as np

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/legoeuro/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [102]:
bookDf = pd.read_csv('../input/Books.csv', sep=',')
bookDf.drop(columns=['Image-URL-S', 'Image-URL-M', 'Image-URL-L'], inplace=True)

ratingDf = pd.read_csv('../input/Ratings.csv', sep=',')
usersDf = pd.read_csv('../input/Users.csv', sep=',')

#Data concatenation
bookInfo = []
for i in range(len(bookDf)):
    bookInfo.append(f"{bookDf['Book-Title'][i]} {bookDf['Book-Author'][i]} {bookDf['Publisher'][i]}")
bookDf["info"] = bookInfo

#configs
bookDf = bookDf.iloc[:20000]
ratingDf['ISBN'] = ratingDf['ISBN'].astype(str)
bookDf['ISBN'] = bookDf["ISBN"].astype(str)
bookDf['id'] = bookDf.index


  bookDf = pd.read_csv('../input/Books.csv', sep=',')


ValueError: You are trying to merge on object and int64 columns for key 'ISBN'. If you wish to proceed you should use pd.concat

Add index for later steps 

Preprocess and clean book information
.

In [94]:

#from https://www.kaggle.com/code/muhammadayman/recommendation-system-using-cosine-similarity#Feature-Engineering
stop = stopwords.words('english')
def preprocess(column):
    #make all words with lower letters
    column = column.str.lower()
    #getting rid of any punctution
    column = column.str.replace('http\S+|www.\S+|@|%|:|,|', '', case=False)
    #spliting each sentence to words to apply previous funtions on them 
    word_tokens = column.str.split()
    keywords = word_tokens.apply(lambda x: [item for item in x if item not in stop])
    #assemble words of each sentence again and assign them in new column
    for i in range(len(keywords)):
        keywords[i] = " ".join(keywords[i])
        column = keywords

    return column
bookDf['info'] = preprocess(bookDf['info'])
# bookDf['Book-Title'] = preprocess(bookDf['Book-Title'])
bookDf.head(5)

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,info
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,classical mythology mark p. o. morford oxford ...
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,clara callan richard bruce wright harperflamin...
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,decision normandy carlo d'este harperperennial
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,flu: story great influenza pandemic 1918 searc...
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company,mummies urumchi e. j. w. barber w. w. norton &...


In [95]:
# from sklearn import preprocessing

# mms = preprocessing.MinMaxScaler()

# bookDf['Year-Of-Publication'] = (bookDf['Year-Of-Publication'] - bookDf['Year-Of-Publication'].mean()) / bookDf['Year-Of-Publication'].std() 

Apply cosine vectorizer to all books; Find user top 10 books that they have not read

In [110]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from scipy.spatial.distance import pdist

CV = CountVectorizer()
titleVect = CV.fit_transform(bookDf['info'])
titleSim = cosine_similarity(titleVect)


# n: number of recommendations
# m: maximum number of books to consider for recommendations (take m books with highest rating from user and consider them for similarity)
# assuming book-rating only goes from 1 to 10
normalize = 10
bookRatingMerge = pd.merge(ratingDf, bookDf, on='ISBN', how='inner')
def recommend(userId, n, m):
    userRatings = bookRatingMerge[bookRatingMerge['User-ID'] == userId]
    userRatings = userRatings[userRatings['Book-Rating'] > 5]

    ratingTruncated = userRatings.nlargest(m, 'Book-Rating')
    print(ratingTruncated.to_markdown())
    recommendations = []
    for _, row in bookDf.iterrows():
        bookId = row['ISBN']
        bookIndex = bookDf[bookDf['ISBN'] == bookId].index[0]
        sim = 0
        isBookRead = False
        for _, ratingRow in ratingTruncated.iterrows():
            ratingBookId = ratingRow['ISBN']
            ratingBookIndex = bookDf[bookDf['ISBN'] == ratingBookId].index[0]
            if bookIndex == ratingBookIndex:
                isBookRead = True
                break
            #weight by rating
            sim += ratingRow['Book-Rating']/(normalize) * titleSim[bookIndex][ratingBookIndex]
        if (not isBookRead):
            recommendations.append((bookId, sim)) 
    recommendations.sort(key=lambda x: x[1], reverse=True)
    return recommendations[:n]

#Example: get the top recommendation for the user with id 276709
recommendation = recommend(276709, 10, 10)
mappedRec = map(lambda x: (bookDf[bookDf['ISBN'] == x[0]]['Book-Title'].values[0], x[1]), recommendation)
print(list(mappedRec))



|        |   User-ID |       ISBN |   Book-Rating | Book-Title                                       | Book-Author       |   Year-Of-Publication | Publisher   | info                                                                          |
|-------:|----------:|-----------:|--------------:|:-------------------------------------------------|:------------------|----------------------:|:------------|:------------------------------------------------------------------------------|
| 384744 |    276709 | 0515107662 |            10 | The Sherbrooke Bride (Bride Trilogy (Paperback)) | Catherine Coulter |                  1996 | Jove Books  | The Sherbrooke Bride (Bride Trilogy (Paperback)) Catherine Coulter Jove Books |
0515107662
0515107662
0515107662
0515107662
0515107662
0515107662
0515107662
0515107662
0515107662
0515107662
0515107662
0515107662
0515107662
0515107662
0515107662
0515107662
0515107662
0515107662
0515107662
0515107662
0515107662
0515107662
0515107662
0515107662
0515107662
05