# Content-Based and Collaborative Filtering Methods

This notebook examines the performance of two baseline recommender models: content-based filtering and collaborative filtering. Both were chosen because they are widely used in recommender-systems research.

## Imports

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Data initialization

In [6]:
ratings = pd.read_csv("Data/Ratings.csv")
books = pd.read_csv("Data/Books.csv", dtype={3: str})
users = pd.read_csv("Data/Users.csv")

## Collaborative Filtering

The following is based on [this Kaggle notebook](https://www.kaggle.com/code/gspmoreira/recommender-systems-in-python-101).

In [7]:
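# Attach book metadata to each rating, then drop columns that are not needed
ratings_with_book_titles = ratings.merge(books, on='ISBN')
ratings_with_book_titles.drop(columns=["ISBN", "Image-URL-S", "Image-URL-M"], inplace=True)
# Attach user metadata (without Age) to each rating
complete_df = ratings_with_book_titles.merge(users.drop(columns="Age"), on="User-ID")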

In [8]:
# Keep only the last comma-separated field of Location (the country)
complete_df['Location'] = complete_df['Location'].str.split(',').str[-1].str.strip()
complete_df.head()

,User-ID,Book-Rating,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-L,Location
0,276725,0,Flesh Tones: A Novel,M. J. Rose,2002,Ballantine Books,http://images.amazon.com/images/P/034545104X.0...,usa
1,2313,5,Flesh Tones: A Novel,M. J. Rose,2002,Ballantine Books,http://images.amazon.com/images/P/034545104X.0...,usa
2,2313,9,Ender's Game (Ender Wiggins Saga (Paperback)),Orson Scott Card,1986,Tor Books,http://images.amazon.com/images/P/0812533550.0...,usa
3,2313,8,In Cold Blood (Vintage International),TRUMAN CAPOTE,1994,Vintage,http://images.amazon.com/images/P/0679745580.0...,usa
4,2313,9,Divine Secrets of the Ya-Ya Sisterhood : A Novel,Rebecca Wells,1996,HarperCollins,http://images.amazon.com/images/P/0060173289.0...,usa


In [14]:
# Select user IDs with more than 200 book ratings
min_ratings_threshold = 200

# Count book ratings per user
num_ratings_per_user = complete_df.groupby('User-ID')['Book-Rating'].count()

# Filter users with more than the minimum threshold
knowledgeable_user_ids = num_ratings_per_user[num_ratings_per_user > min_ratings_threshold].index

In [15]:
# Filter ratings from knowledgeable users
knowledgeable_user_ratings = complete_df[complete_df['User-ID'].isin(knowledgeable_user_ids)]

In [16]:
# Keep books that received at least 50 ratings from the knowledgeable users
min_ratings_count_threshold = 50
rating_counts = knowledgeable_user_ratings.groupby('Book-Title').count()['Book-Rating']
popular_books = rating_counts[rating_counts >= min_ratings_count_threshold].index

In [17]:
final_ratings = knowledgeable_user_ratings[knowledgeable_user_ratings['Book-Title'].isin(popular_books)]

In [18]:
# Book x user rating matrix; missing ratings are filled with 0
pt = final_ratings.pivot_table(index='Book-Title', columns='User-ID', values='Book-Rating')
pt.fillna(0, inplace=True)
pt

Book-Title \ User-ID,254,2276,2766,2977,3363,4017,4385,6251,6323,6543,...,271705,273979,274004,274061,274301,274308,275970,277427,277639,278418
1984,9.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1st to Die: A Novel,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2nd Chance,0.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4 Blondes,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A Bend in the Road,0.0,0.0,7.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Year of Wonders,0.0,0.0,0.0,7.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,9.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
You Belong To Me,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Zen and the Art of Motorcycle Maintenance: An Inquiry into Values,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Zoya,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [19]:
from sklearn.metrics.pairwise import cosine_similarity 

In [20]:
similarity_score = cosine_similarity(pt)
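
Because each row of `pt` is one book's vector of user ratings, the similarity matrix compares books with one another. The following is a minimal sketch of the same idea in plain Python, assuming a hypothetical three-book, four-user matrix (`toy_ratings`, `cosine`, and `toy_similarity` are illustrative names, not part of the notebook):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length rating vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_u * norm_v)

# Hypothetical data: rows are books, columns are four users, 0 means "no rating"
toy_ratings = {
    "Book A": [9, 0, 8, 0],
    "Book B": [8, 0, 9, 0],
    "Book C": [0, 7, 0, 10],
}

# Pairwise book-to-book similarities
toy_similarity = {
    (a, b): cosine(toy_ratings[a], toy_ratings[b])
    for a in toy_ratings
    for b in toy_ratings
}

print(round(toy_similarity[("Book A", "Book B")], 4))
print(toy_similarity[("Book A", "Book C")])
```

Books A and B are rated by the same users and come out as highly similar, while A and C share no raters and have similarity 0.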

In [21]:
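def recommend(book_name):
    """Return [title, author] pairs for the five books most similar to book_name."""
    # Row position of the query book in the pivot table
    index = np.where(pt.index == book_name)[0][0]
    # Rank all books by similarity and skip the top hit (normally the query book itself)
    similar_books = sorted(enumerate(similarity_score[index]), key=lambda x: x[1], reverse=True)[1:6]

    data = []
    for book_idx, _ in similar_books:
        # Look up the title and author in the books table, one row per title
        temp_df = books[books['Book-Title'] == pt.index[book_idx]].drop_duplicates('Book-Title')
        item = list(temp_df['Book-Title'].values) + list(temp_df['Book-Author'].values)
        data.append(item)
    return data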

In [22]:
recommend("The Catcher in the Rye")

[["The Hitchhiker's Guide to the Galaxy", 'Douglas Adams'],
 ['The Nanny Diaries: A Novel', 'Emma McLaughlin'],
 ['A Wrinkle in Time', "Madeleine L'Engle"],
 ['To Kill a Mockingbird', 'Harper Lee'],
 ['Tis: A Memoir', 'Frank McCourt']]

In [23]:
# Install Surprise library
!pip install scikit-surprise

Collecting scikit-surprise
Successfully installed scikit-surprise-1.1.4


In [24]:
import pandas as pd
from surprise import Dataset, Reader, SVD
from surprise.model_selection import train_test_split
from surprise import accuracy

# Define the rating scale
reader = Reader(rating_scale=(0, 10))

# Load the data into Surprise's dataset format
data = Dataset.load_from_df(complete_df[['User-ID', 'Book-Title', 'Book-Rating']], reader)

# Split the dataset into training and testing sets
train_set, test_set = train_test_split(data, test_size=0.20, random_state=42)

# Define the SVD algorithm
cf_model = SVD()

# Train the algorithm on the training set
cf_model.fit(train_set)

# Make predictions on the test set
predictions = cf_model.test(test_set)

## Evaluations

In [25]:
# Evaluate the model
accuracy.rmse(predictions)

RMSE: 3.5167


3.5167040060379726
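
RMSE is the square root of the mean squared difference between true and estimated ratings. The sketch below computes it by hand on hypothetical `(true, estimated)` pairs. It also shows how the score changes when ratings of 0 are excluded, assuming, as in the Book-Crossing dataset, that a rating of 0 marks an implicit interaction rather than a score. The names `rmse` and `toy_pairs` are illustrative only.

```python
import math

def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    return math.sqrt(sum((true_r - est) ** 2 for true_r, est in pairs) / len(pairs))

# Hypothetical predictions: two ratings of 0 and two explicit ratings
toy_pairs = [(0, 3.0), (8, 7.0), (10, 8.0), (0, 4.0)]

rmse_all = rmse(toy_pairs)
rmse_explicit = rmse([(t, e) for t, e in toy_pairs if t != 0])

print(f"RMSE on all pairs:      {rmse_all:.4f}")
print(f"RMSE on explicit pairs: {rmse_explicit:.4f}")
```

On this toy data the error is noticeably larger when the 0 ratings are kept, so RMSE values computed on datasets with and without them are not directly comparable.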

In [196]:
from collections import defaultdict
from surprise import accuracy

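def precision_recall_at_k(predictions, k=10, threshold=7.0):
    """Return per-user precision@k and recall@k dictionaries.

    An item is relevant if its true rating is >= threshold and
    recommended if its estimated rating is >= threshold.
    """
    # Group (estimated, true) rating pairs by user
    user_est_true = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        user_est_true[uid].append((est, true_r))

    precisions = dict()
    recalls = dict()

    for uid, user_ratings in user_est_true.items():
        # Rank the user's test items by estimated rating, highest first
        user_ratings.sort(key=lambda x: x[0], reverse=True)
        # Relevant items among all of the user's test items
        n_rel = sum((true_r >= threshold) for (_, true_r) in user_ratings)
        # Recommended items in the top k
        n_rec_k = sum((est >= threshold) for (est, _) in user_ratings[:k])
        # Items in the top k that are both relevant and recommended
        n_rel_and_rec_k = sum(((true_r >= threshold) and (est >= threshold))
                              for (est, true_r) in user_ratings[:k])

        # Undefined ratios (zero denominator) are counted as 1 here
        precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 1
        recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 1

    return precisions, recalls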

def calculate_f1(precisions, recalls):
    f1_scores = {}
    for uid in precisions:
        if precisions[uid] + recalls[uid] > 0:
            f1_scores[uid] = 2 * (precisions[uid] * recalls[uid]) / (precisions[uid] + recalls[uid])
        else:
            f1_scores[uid] = 0
    return f1_scores

# Relevance threshold is set to 4.0 here (the function default is 7.0)
precisions, recalls = precision_recall_at_k(predictions, k=10, threshold=4.0)
f1_scores = calculate_f1(precisions, recalls)

avg_precision = sum(precisions.values()) / len(precisions)
avg_recall = sum(recalls.values()) / len(recalls)
avg_f1 = sum(f1_scores.values()) / len(f1_scores)

print(f'Precision@10: {avg_precision:.4f}')
print(f'Recall@10: {avg_recall:.4f}')
print(f'F1 Score@10: {avg_f1:.4f}')

Precision@10: 0.8653
Recall@10: 0.5070
F1 Score@10: 0.4553
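
To make the per-user computation concrete, here is a minimal sketch that applies the same counting rules to one hypothetical user at two different thresholds. The helper `precision_recall_single_user` and the data in `toy_user` are assumptions made for illustration.

```python
def precision_recall_single_user(user_ratings, k, threshold):
    """Precision@k and recall@k for one user's (estimated, true) rating pairs."""
    ranked = sorted(user_ratings, key=lambda x: x[0], reverse=True)
    top_k = ranked[:k]
    n_rel = sum(true_r >= threshold for _, true_r in ranked)
    n_rec_k = sum(est >= threshold for est, _ in top_k)
    n_rel_and_rec_k = sum(
        true_r >= threshold and est >= threshold for est, true_r in top_k
    )
    # Same convention as the notebook function: undefined ratios count as 1
    precision = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 1
    recall = n_rel_and_rec_k / n_rel if n_rel != 0 else 1
    return precision, recall

# Hypothetical (estimated, true) ratings for a single user
toy_user = [(8.5, 9), (7.9, 6), (7.2, 8), (6.1, 10), (3.0, 2)]

p7, r7 = precision_recall_single_user(toy_user, k=3, threshold=7.0)
p4, r4 = precision_recall_single_user(toy_user, k=3, threshold=4.0)

print(f"threshold=7.0: precision={p7:.4f}, recall={r7:.4f}")
print(f"threshold=4.0: precision={p4:.4f}, recall={r4:.4f}")
```

The same predictions yield different precision and recall at the two thresholds, so scores are only comparable across models when the threshold is the same.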


In [114]:
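def mean_average_precision(predictions, k=10, threshold=4.0):
    """Mean over users of an average-precision-style score at k.

    An item counts as relevant when its true rating is >= threshold.
    """
    # Group (estimated, true) rating pairs by user
    user_est_true = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        user_est_true[uid].append((est, true_r))

    average_precisions = []

    for uid, user_ratings in user_est_true.items():
        # Rank the user's test items by estimated rating, highest first
        user_ratings.sort(key=lambda x: x[0], reverse=True)
        # Number of relevant items among the top k
        n_rel_and_rec_k = sum(1 for (est, true_r) in user_ratings[:k] if true_r >= threshold)
        # Precision at each cutoff 1..k
        precisions_at_k = [sum(1 for (est, true_r) in user_ratings[:i+1] if true_r >= threshold) / (i + 1) for i in range(k)]
        # Average the precision over the first n_rel_and_rec_k cutoffs
        average_precision = sum(precisions_at_k[:n_rel_and_rec_k]) / min(k, n_rel_and_rec_k) if n_rel_and_rec_k != 0 else 0
        average_precisions.append(average_precision)

    return sum(average_precisions) / len(average_precisions)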

# Example usage
map_score = mean_average_precision(predictions, k=10)
print(f'MAP@10: {map_score:.4f}')

MAP@10: 0.5475
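
The function above averages precision over the first n cutoffs, where n is the number of relevant items in the top k. A more common definition of average precision instead averages precision at the ranks where relevant items actually appear. The sketch below shows that variant for a single ranked list, normalising by the number of relevant items found in the top k (other normalisations exist); `average_precision_at_k` and the toy lists are hypothetical.

```python
def average_precision_at_k(relevance, k):
    """Average precision over the top k of a ranked list of relevance flags."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance[:k], start=1):
        if is_relevant:
            hits += 1
            # Precision at this rank, counted only where a relevant item occurs
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

# Hypothetical ranked lists: True marks a relevant item at that rank
ap_mixed = average_precision_at_k([True, False, True, False, True], k=5)
ap_late = average_precision_at_k([False, False, True], k=3)
ap_none = average_precision_at_k([False, False], k=10)

print(f"{ap_mixed:.4f} {ap_late:.4f} {ap_none:.4f}")
```

Averaging this per-user value over all users gives MAP@k under this definition, which can differ from the value reported above.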


## Content-Based Filtering

The following code is based on [this Kaggle notebook](https://www.kaggle.com/code/hilalmleykeyuksel/book-recommender#CONTENT-BASED-COLLABORATIVE-FILTERING).

In [115]:
import re

df = complete_df.copy()
df.dropna(inplace=True)
df.reset_index(drop=True, inplace=True)
df.drop(columns=["Year-Of-Publication"], inplace=True)
# Drop ratings of 0
df.drop(index=df[df["Book-Rating"] == 0].index, inplace=True)
# Replace runs of punctuation and underscores in titles with a single space
df["Book-Title"] = df["Book-Title"].apply(lambda x: re.sub(r"[\W_]+", " ", x).strip())
df.head()

,User-ID,Book-Rating,New-User-ID_x,Book-ID_x,Book-Title,Book-Author,Publisher,Image-URL-L,Book-ID_y,Location,New-User-ID_y
1,2313,5,1655,0,Flesh Tones A Novel,M. J. Rose,Ballantine Books,http://images.amazon.com/images/P/034545104X.0...,0.0,usa,1655.0
2,2313,9,1655,383,Ender s Game Ender Wiggins Saga Paperback,Orson Scott Card,Tor Books,http://images.amazon.com/images/P/0812533550.0...,383.0,usa,1655.0
3,2313,8,1655,443,In Cold Blood Vintage International,TRUMAN CAPOTE,Vintage,http://images.amazon.com/images/P/0679745580.0...,443.0,usa,1655.0
4,2313,9,1655,1313,Divine Secrets of the Ya Ya Sisterhood A Novel,Rebecca Wells,HarperCollins,http://images.amazon.com/images/P/0060173289.0...,1313.0,usa,1655.0
5,2313,5,1655,1545,The Mistress of Spices,Chitra Banerjee Divakaruni,Anchor Books/Doubleday,http://images.amazon.com/images/P/0385482388.0...,1545.0,usa,1655.0


In [116]:
df.columns

Index(['User-ID', 'Book-Rating', 'New-User-ID_x', 'Book-ID_x', 'Book-Title',
       'Book-Author', 'Publisher', 'Image-URL-L', 'Book-ID_y', 'Location',
       'New-User-ID_y'],
      dtype='object')

In [117]:
from sklearn.feature_extraction.text import CountVectorizer

def content_based(bookTitle):
    """Print the five books whose title, author and publisher text is most similar to bookTitle."""
    bookTitle = str(bookTitle)

    if bookTitle not in df["Book-Title"].values:
        print(f"'{bookTitle}' was not found in the dataset.")
        return

    # Books with at most 200 ratings are treated as rare
    rating_count = df["Book-Title"].value_counts()
    rare_books = rating_count[rating_count <= 200].index
    common_books = df[~df["Book-Title"].isin(rare_books)]

    if bookTitle in rare_books:
        # Fall back to three randomly sampled common titles
        most_common = pd.Series(common_books["Book-Title"].unique()).sample(3).values
        print("No Recommendations for this Book ☹️ \n")
        print("YOU MAY TRY: \n")
        for title in most_common:
            print(title, "\n")
        return

    # One row per title, with a positional index aligned to the similarity matrix
    common_books = common_books.drop_duplicates(subset=["Book-Title"]).reset_index(drop=True)
    targets = ["Book-Title", "Book-Author", "Publisher"]
    all_features = [" ".join(row) for row in common_books[targets].values]

    # Bag-of-words vectors over the combined text, compared by cosine similarity
    vectorizer = CountVectorizer()
    common_booksVector = vectorizer.fit_transform(all_features)
    similarity = cosine_similarity(common_booksVector)

    index = common_books.index[common_books["Book-Title"] == bookTitle][0]
    # Skip the top hit (normally the query book itself) and keep the next five
    similar_booksSorted = sorted(enumerate(similarity[index]), key=lambda x: x[1], reverse=True)[1:6]
    recommendations = [[common_books.loc[i, "Book-Title"], common_books.loc[i, "Book-Author"]]
                       for i, _ in similar_booksSorted]
    print(recommendations)

In [118]:
content_based("The Catcher in the Rye")

[['The Fellowship of the Ring The Lord of the Rings Part 1', 'J.R.R. TOLKIEN'], ['The Lovely Bones A Novel', 'Alice Sebold'], ['The Da Vinci Code', 'Dan Brown'], ['Harry Potter and the Order of the Phoenix Book 5', 'J. K. Rowling'], ['The Five People You Meet in Heaven', 'Mitch Albom']]
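
The recommendations above depend entirely on which words the combined title, author and publisher strings share. The following sketch reproduces the idea in plain Python on hypothetical books. The tokenisation only approximates `CountVectorizer`'s defaults (lowercasing, tokens of two or more word characters), and `bag_of_words`, `cosine_counts`, and `toy_books` are illustrative names.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Token counts: lowercase, keep tokens of two or more word characters."""
    return Counter(re.findall(r"\b\w\w+\b", text.lower()))

def cosine_counts(c1, c2):
    """Cosine similarity between two token-count dictionaries."""
    dot = sum(c1[token] * c2[token] for token in c1)
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

# Hypothetical "title author publisher" feature strings
toy_books = {
    "Space Saga Origins": "Space Saga Origins Jane Doe Orbit Press",
    "Space Saga Return": "Space Saga Return Jane Doe Orbit Press",
    "Garden Cooking": "Garden Cooking John Roe Field Books",
}
vectors = {title: bag_of_words(text) for title, text in toy_books.items()}

sim_same_series = cosine_counts(vectors["Space Saga Origins"], vectors["Space Saga Return"])
sim_unrelated = cosine_counts(vectors["Space Saga Origins"], vectors["Garden Cooking"])

print(f"{sim_same_series:.4f} {sim_unrelated:.4f}")
```

The two books that share a series name, author and publisher score high, while the unrelated book shares no tokens and scores 0. No rating data is involved in this comparison.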


In [157]:
# Import necessary libraries from Surprise
from surprise import Reader, Dataset, SVD, accuracy
from surprise.model_selection import train_test_split

# Load the explicit ratings (df, with 0 ratings removed) into Surprise's dataset format
reader = Reader(rating_scale=(0, 10))
data2 = Dataset.load_from_df(df[['User-ID', 'Book-Title', 'Book-Rating']], reader)

# Split the dataset into training and testing sets
train2_set, test2_set = train_test_split(data2, test_size=0.20, random_state=42)

# Define the SVD algorithm (same model class as cf_model; content_based() is not used here)
cb_model = SVD()

# Train the algorithm on the training set
cb_model.fit(train2_set)

# Make predictions on the test set
predictions2 = cb_model.test(test2_set)

# Evaluate the model
rmse2 = accuracy.rmse(predictions2)

RMSE: 1.6409


In [195]:
precisions, recalls = precision_recall_at_k(predictions2, k=10, threshold=7.0)
f1_scores = calculate_f1(precisions, recalls)

avg_precision = sum(precisions.values()) / len(precisions)
avg_recall = sum(recalls.values()) / len(recalls)
avg_f1 = sum(f1_scores.values()) / len(f1_scores)

print(f'Precision@10: {avg_precision:.4f}')
print(f'Recall@10: {avg_recall:.4f}')
print(f'F1 Score@10: {avg_f1:.4f}')

Precision@10: 0.7786
Recall@10: 0.9165
F1 Score@10: 0.7337


In [177]:
map_score2 = mean_average_precision(predictions2, k=10)
print(f'MAP@10: {map_score2:.4f}')

MAP@10: 0.9723
