# Content-Based and Collaborative Filtering Methods

In this notebook we examine the performance of two baseline models: Content-Based Filtering and Collaborative Filtering. We chose these two because they are widely used in recommender systems research.

## Imports

In [1]:
import re
from collections import defaultdict

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import mean_squared_error
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split
from surprise import Dataset, Reader, SVD, accuracy

SEED = 42

## Data initialization

In [2]:
ratings = pd.read_csv("Data/Ratings.csv")
books = pd.read_csv("Data/Books.csv", dtype={3: str})
users = pd.read_csv("Data/Users.csv")

## Collaborative Filtering

The following is partly based on [this Kaggle notebook](https://www.kaggle.com/code/alnourabdalrahman9/collaborative-filtering-books-recommendation).

Collaborative filtering (CF) recommends books to a given user by matching that user's preferences with those of similar users.

### Data Merging

For more efficient handling, we merge the necessary columns into one data frame.

In [3]:
ratings = ratings.merge(books, on='ISBN').drop(columns=["ISBN", "Image-URL-S", "Image-URL-M", "Image-URL-L"])
ratings = ratings.merge(users.drop("Age", axis=1), on="User-ID")
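One property of the merges above is worth keeping in mind: `DataFrame.merge` performs an inner join by default, so ratings whose `ISBN` has no match in `books` are dropped. Here is a minimal sketch on made-up rows (the ISBNs and titles are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy data: one rating refers to an ISBN missing from the book table
toy_ratings = pd.DataFrame({
    "User-ID": [1, 1, 2],
    "ISBN": ["A", "B", "Z"],
    "Book-Rating": [5, 0, 9],
})
toy_books = pd.DataFrame({
    "ISBN": ["A", "B"],
    "Book-Title": ["First Book", "Second Book"],
})

# Default how="inner": the rating for ISBN "Z" has no match and is dropped
merged = toy_ratings.merge(toy_books, on="ISBN")
print(len(toy_ratings), len(merged))
```

Passing `how="left"` instead would keep unmatched ratings, with missing values in the book columns.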

In [4]:
ratings

Unnamed: 0,User-ID,Book-Rating,Book-Title,Book-Author,Year-Of-Publication,Publisher,Location
0,276725,0,Flesh Tones: A Novel,M. J. Rose,2002,Ballantine Books,"tyler, texas, usa"
1,2313,5,Flesh Tones: A Novel,M. J. Rose,2002,Ballantine Books,"cincinnati, ohio, usa"
2,2313,9,Ender's Game (Ender Wiggins Saga (Paperback)),Orson Scott Card,1986,Tor Books,"cincinnati, ohio, usa"
3,2313,8,In Cold Blood (Vintage International),TRUMAN CAPOTE,1994,Vintage,"cincinnati, ohio, usa"
4,2313,9,Divine Secrets of the Ya-Ya Sisterhood : A Novel,Rebecca Wells,1996,HarperCollins,"cincinnati, ohio, usa"
...,...,...,...,...,...,...,...
1031131,276442,7,Le Huit,Katherine Neville,2002,Le Cherche Midi,"genève, genève, switzerland"
1031132,276618,5,Ludwig Marum: Briefe aus dem Konzentrationslag...,Ludwig Marum,1984,C.F. MÃ¼ller,"stuttgart, \n/a\""., germany"""
1031133,276647,0,Christmas With Anne and Other Holiday Stories:...,L. M. Montgomery,2001,Starfire,"arlington heights, illinois, usa"
1031134,276647,10,Heaven (Coretta Scott King Author Award Winner),Angela Johnson,1998,Simon &amp; Schuster Children's Publishing,"arlington heights, illinois, usa"


In [5]:
# Not every location includes a city, so we keep only the last component (the country)
ratings['Location'] = ratings['Location'].str.split(',').str[-1].str.strip()
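To see what the string chain above does, here is a small sketch on sample values. The first three strings are modeled on the `Location` values shown earlier; the last one is a hypothetical edge case with a trailing comma.

```python
import pandas as pd

locations = pd.Series([
    "tyler, texas, usa",
    "genève, genève, switzerland",
    "arlington heights, illinois, usa",
    "paris, france,",  # hypothetical value with a trailing comma
])

# Split on commas, take the last piece, and trim surrounding whitespace
countries = locations.str.split(",").str[-1].str.strip()
print(countries.tolist())
```

The trailing-comma case yields an empty string rather than a country, so a few rows may end up with a blank `Location` after this step.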

### Filtering 

The Kaggle notebook proposes two ways to filter the data: a minimum number of ratings per user and a minimum number of ratings per book. Let's apply both, starting with the per-user minimum.

In [6]:
active_users = ratings.groupby('User-ID')['Book-Rating'].count()
# Keep only users with more than 200 ratings, a threshold chosen so that the kernel does not crash
active_user_ids = active_users[active_users > 200].index
#active_user_ids

In [7]:
books = ratings.groupby('Book-Title')['Book-Rating'].count()
book_titles = books.index
# Keep ratings from active users; book_titles holds every title in ratings, so no book is dropped yet
filtered_df = ratings[ratings['User-ID'].isin(active_user_ids) & ratings['Book-Title'].isin(book_titles)]

In [8]:
filtered_df

Unnamed: 0,User-ID,Book-Rating,Book-Title,Book-Author,Year-Of-Publication,Publisher,Location
37,6543,0,Flesh Tones: A Novel,M. J. Rose,2002,Ballantine Books,usa
38,6543,0,The Lovely Bones: A Novel,Alice Sebold,2002,"Little, Brown",usa
39,6543,0,The Da Vinci Code,Dan Brown,2003,Doubleday,usa
40,6543,0,Wild Animus,Rich Shapero,2004,Too Far,usa
41,6543,0,Four To Score (A Stephanie Plum Novel),Janet Evanovich,1999,St. Martin's Paperbacks,usa
...,...,...,...,...,...,...,...
892567,133868,0,"Bold Land, Bold Love",Connie Mason,1998,Love Spell,usa
892568,133868,0,Suddenly You,Lisa Kleypas,2001,"Avon Books, Harper Collins",usa
892569,133868,8,Heartless,Kat Martin,2001,St. Martin's Press,usa
892570,133868,0,Shifting Calder Wind,Janet Dailey,2004,Zebra Books,usa


Next, we apply the minimum ratings-per-book threshold.

In [9]:
counts_ratings = filtered_df.groupby('Book-Title').count()['Book-Rating']

In [10]:
# Keep books with at least 100 ratings, a threshold that suits kernel processing
popular = counts_ratings[counts_ratings >= 100].index

In [11]:
filtered_df = filtered_df[filtered_df['Book-Title'].isin(popular)]
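The two-stage filter can be sketched on a toy frame. The data is hypothetical and the thresholds are scaled down from 200 and 100 to 2 and 2. Note that the book counts are computed after the user filter, so the order of the two stages matters.

```python
import pandas as pd

# Hypothetical toy ratings
toy = pd.DataFrame({
    "User-ID":     [1, 1, 1, 2, 2, 2, 3],
    "Book-Title":  ["A", "B", "C", "A", "B", "D", "A"],
    "Book-Rating": [8, 0, 7, 9, 5, 0, 10],
})

# Stage 1: keep users with more than `min_user_ratings` ratings
min_user_ratings = 2
user_counts = toy.groupby("User-ID")["Book-Rating"].count()
active = user_counts[user_counts > min_user_ratings].index
step1 = toy[toy["User-ID"].isin(active)]

# Stage 2: keep books with at least `min_book_ratings` ratings among those users
min_book_ratings = 2
book_counts = step1.groupby("Book-Title")["Book-Rating"].count()
popular_titles = book_counts[book_counts >= min_book_ratings].index
step2 = step1[step1["Book-Title"].isin(popular_titles)]

print(sorted(step2["Book-Title"].unique()))
```

User 3 is removed in the first stage, so their rating of book "A" no longer counts toward that book's popularity in the second stage.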

In [12]:
filtered_df

Unnamed: 0,User-ID,Book-Rating,Book-Title,Book-Author,Year-Of-Publication,Publisher,Location
38,6543,0,The Lovely Bones: A Novel,Alice Sebold,2002,"Little, Brown",usa
39,6543,0,The Da Vinci Code,Dan Brown,2003,Doubleday,usa
40,6543,0,Wild Animus,Rich Shapero,2004,Too Far,usa
44,6543,0,Violets Are Blue,James Patterson,2002,Warner Vision,usa
48,6543,8,Fahrenheit 451,RAY BRADBURY,1987,Del Rey,usa
...,...,...,...,...,...,...,...
831409,148199,10,"The Golden Compass (His Dark Materials, Book 1)",PHILIP PULLMAN,2001,Yearling,canada
831463,148199,0,The Fellowship of the Ring (The Lord of the Ri...,J. R. R. Tolkien,2002,Houghton Mifflin Company,canada
841306,38781,0,The Nanny Diaries: A Novel,Emma McLaughlin,2003,St. Martin's Griffin,usa
856993,4385,0,The Five People You Meet in Heaven,Mitch Albom,2003,Hyperion,usa


### Similarity Matrix

In [13]:
pivot = filtered_df.pivot_table(index='Book-Title', columns='User-ID', values='Book-Rating')
pivot.fillna(0, inplace=True)

pivot.shape

(150, 801)

In [14]:
similarity_matrix = cosine_similarity(pivot)
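Because `pivot` has books as rows and `cosine_similarity` compares rows, the resulting matrix is book-by-book (150 x 150 for the shape above): two books are similar when the same users rated them in similar proportions. Here is a minimal sketch on hypothetical ratings:

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy ratings (not from the dataset)
toy = pd.DataFrame({
    "Book-Title":  ["A", "A", "B", "B", "C"],
    "User-ID":     [1, 2, 1, 2, 3],
    "Book-Rating": [8, 10, 4, 5, 9],
})

# Rows are books, columns are users; missing ratings become 0
toy_pivot = toy.pivot_table(
    index="Book-Title", columns="User-ID", values="Book-Rating"
).fillna(0)

# Each row is one book's rating vector, so the result is a book-by-book matrix
toy_sim = cosine_similarity(toy_pivot)
sim_df = pd.DataFrame(toy_sim, index=toy_pivot.index, columns=toy_pivot.index)
print(sim_df.round(3))
```

Books "A" and "B" get a similarity of 1.0 because their rating vectors point in the same direction (cosine similarity ignores magnitude), while "C" shares no raters with them and scores 0.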

We define a function that returns the top 10 CF recommendations for a book. It is based on the aforementioned Kaggle notebook and adjusted for wider applicability.

In [15]:
def recommend_cf(name):
    top_n = 10
    book_idx = pivot.index.get_loc(name)
    # Sort books by similarity, skipping the first entry since it is the book's similarity with itself
    similar_books = sorted(enumerate(similarity_matrix[book_idx]), key=lambda x: x[1], reverse=True)[1:top_n+1]
    # Return a list of recommended titles; _ since the score is not needed
    recommendations = [pivot.index[i] for i, _ in similar_books]
    return recommendations

In [16]:
recommend_cf("Violets Are Blue")

['Kiss the Girls',
 '2nd Chance',
 'The Testament',
 '1st to Die: A Novel',
 'When the Wind Blows',
 'The Partner',
 'The Summons',
 'Cradle and All',
 'Along Came a Spider (Alex Cross Novels)',
 'Good in Bed']
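`recommend_cf` raises a `KeyError` when the title is not in `pivot`, and it assumes the query book sorts first. A possible variant is sketched below; the name `recommend_cf_safe` and its extra parameters are hypothetical, and the toy matrix only stands in for the notebook's `pivot` and `similarity_matrix`.

```python
import numpy as np
import pandas as pd

def recommend_cf_safe(name, pivot, similarity_matrix, top_n=10):
    """Hypothetical variant of recommend_cf: returns [] for unknown titles."""
    if name not in pivot.index:
        return []
    book_idx = pivot.index.get_loc(name)
    scores = similarity_matrix[book_idx]
    # Rank all books by similarity, then drop the query book by position
    order = np.argsort(-scores, kind="stable")
    order = [i for i in order if i != book_idx][:top_n]
    return [pivot.index[i] for i in order]

# Toy stand-ins for the notebook's `pivot` and `similarity_matrix`
toy_pivot = pd.DataFrame(index=pd.Index(["A", "B", "C"], name="Book-Title"))
toy_sim = np.array([
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.3],
    [0.1, 0.3, 1.0],
])

print(recommend_cf_safe("A", toy_pivot, toy_sim, top_n=2))
print(recommend_cf_safe("Unknown Title", toy_pivot, toy_sim))
```

Dropping the query book by position rather than by rank also covers the case where another book ties with it at a similarity of 1.0.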

## Evaluation

Two metrics were chosen to assess the accuracy of the baseline models:

- RMSE
- Recall@20
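For reference, the two metrics as they are computed in the cells that follow can be written as below. The symbols $T$, $R$, $L_u^{(k)}$ and $k$ are notation introduced here, not names from the notebook.

$$
\mathrm{RMSE} = \sqrt{\frac{1}{|T|}\sum_{(u,i)\in T}\left(r_{u,i}-\hat{r}_{u,i}\right)^2}
$$

$$
\mathrm{Recall@}k = \frac{\left|\{(u,i)\in R : i \in L_u^{(k)}\}\right|}{|R|}
$$

Here $T$ is the set of test user-book pairs for which a prediction $\hat{r}_{u,i}$ exists, $R$ is the set of relevant test pairs (rating greater than 5), and $L_u^{(k)}$ is the list of $k$ titles retrieved for user $u$. The recall cells below use lists of 20 titles, and the ratio pools all relevant pairs rather than averaging per user.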

In [17]:
print(filtered_df.columns)

Index(['User-ID', 'Book-Rating', 'Book-Title', 'Book-Author',
       'Year-Of-Publication', 'Publisher', 'Location'],
      dtype='object')


In [18]:
train_data, test_data = train_test_split(filtered_df, test_size=0.2, random_state=SEED)
# pivot tables for training data
train_pivot = train_data.pivot_table(index='Book-Title', columns='User-ID', values='Book-Rating')
train_pivot.fillna(0, inplace=True)

In [19]:
similarity_matrix_train = cosine_similarity(train_pivot)

We can evaluate the method by comparing predicted and actual ratings. We first do this for a single user-book pair; the evaluation is extended to the whole test set later.

In [20]:
def predict_rating(user_id, book_title):
    if book_title not in train_pivot.index:
        return np.nan

    if user_id not in train_pivot.columns:
        return np.nan

    book_idx = train_pivot.index.get_loc(book_title)
    user_ratings = train_pivot[user_id]
    
    # weighted average of similar books
    sim_scores = similarity_matrix_train[book_idx]
    weighted_sum = np.dot(sim_scores, user_ratings)
    sum_of_similarities = np.sum(sim_scores)
    
    if sum_of_similarities == 0:
        return np.nan
    
    return weighted_sum / sum_of_similarities

In [21]:
predict_rating(6543, "The Da Vinci Code")

0.37610675165187146
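`predict_rating` computes a similarity-weighted average, $\hat{r}_{u,i} = \sum_j s_{ij}\, r_{u,j} / \sum_j s_{ij}$, where the sum runs over all books in `train_pivot`. Since unrated books are stored as 0, they add nothing to the numerator but still count in the denominator, which is likely one reason the prediction above is so low. The sketch below uses hypothetical numbers; the second formula is a variant shown for comparison only and is not what the notebook uses. It also assumes that a 0 means "not rated".

```python
import numpy as np

# Hypothetical similarity row for the target book and one user's ratings
# (0 stands for "not rated" after fillna(0), as in train_pivot)
sim_scores = np.array([1.0, 0.5, 0.25, 0.0])
user_ratings = np.array([0.0, 8.0, 0.0, 10.0])

# Same formula as predict_rating: weighted sum normalized over all books
pred_all = np.dot(sim_scores, user_ratings) / np.sum(sim_scores)

# Variant (an assumption, not used in the notebook): normalize over rated books only
rated = user_ratings > 0
pred_rated = np.dot(sim_scores[rated], user_ratings[rated]) / np.sum(sim_scores[rated])

print(round(pred_all, 3), round(pred_rated, 3))
```

With these numbers the notebook's formula gives about 2.29, while normalizing over rated books only gives 8.0.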

We had issues predicting ratings for the test set and calculating the RMSE, so I asked [ChatGPT for help](https://chatgpt.com/share/9d3f2300-a817-4e78-8ad0-74a2bcead0d3).

In [22]:
# predictions for the test set
test_data['Predicted_Rating'] = test_data.apply(lambda row: predict_rating(row['User-ID'], row['Book-Title']), axis=1)

In [23]:
data_for_validation = test_data.dropna(subset=['Predicted_Rating'])

### RMSE

In [24]:
rmse = np.sqrt(mean_squared_error(data_for_validation['Book-Rating'], data_for_validation['Predicted_Rating']))
round(rmse, 3)

3.873

### Recall@20

We extract the test ratings greater than 5, which we deem relevant.

In [344]:
relevant_df = test_data[test_data['Book-Rating'] > 5][['User-ID', 'Book-Title']]
unique_users = relevant_df['User-ID'].unique()

# remove users not appearing in user-item matrix
untrained_users = list(set(unique_users) - set(train_pivot.columns))
relevant_df = relevant_df[~relevant_df['User-ID'].isin(untrained_users)]

trained_users = [user for user in unique_users if user not in untrained_users]

In [345]:
def fetch_top_10(user_id):
    """Get the titles of the user's 20 highest-rated books in the training data"""
    top_20_books_rating = train_pivot[user_id].sort_values(ascending=False)[:20]
    return top_20_books_rating.index.to_list()

In [346]:
user_index = pd.Index(trained_users, name='User-ID')
pred_lists = pd.DataFrame(index=user_index, columns=['Top-20-Predictions'])

for user_id in pred_lists.index:
    top_20_preds = fetch_top_10(user_id)
    pred_lists.loc[user_id, 'Top-20-Predictions'] = top_20_preds

In [347]:
relevant_df['hit_at_20'] = False

for index, row in relevant_df.iterrows():
    user_id = row['User-ID']
    ground_book = row['Book-Title']
    top_20_preds = pred_lists.loc[user_id, 'Top-20-Predictions']

    # a hit means the relevant test book is in the user's top-20 list
    relevant_df.at[index, 'hit_at_20'] = ground_book in top_20_preds

In [348]:
tp = relevant_df['hit_at_20'].sum()
avg_recall_at_20 = tp / len(relevant_df)
print(avg_recall_at_20)

0.08527918781725888
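The recall value above pools all relevant user-book pairs: it is the share of pairs whose title appears in that user's list. The same computation can be sketched without `iterrows`. The inputs below are hypothetical toy data, with a plain dict standing in for `pred_lists`.

```python
import pandas as pd

# Hypothetical toy inputs: relevant test pairs and a per-user list of titles
toy_relevant = pd.DataFrame({
    "User-ID":    [1, 1, 2, 3],
    "Book-Title": ["A", "B", "C", "A"],
})
toy_top_k = {1: ["A", "C"], 2: ["A", "B"], 3: ["A"]}

# A pair counts as a hit when its title is in that user's list
hits = [
    title in toy_top_k[user]
    for user, title in zip(toy_relevant["User-ID"], toy_relevant["Book-Title"])
]
recall = sum(hits) / len(hits)
print(hits, recall)
```

Two of the four pairs are hits, so the pooled recall is 0.5. Averaging recall per user instead would weight users with few relevant books more heavily.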


## Content-Based Filtering

The following method is based on [this Kaggle notebook](https://www.kaggle.com/code/eyadgk/books-eda-vis-recommendation-systems#6-%7C%7C-Content-Based-Filtering-Recommender-System).

In [349]:
filtered_df_cbf = filtered_df.copy()
counts = pd.DataFrame(filtered_df_cbf['Book-Title'].value_counts())

In [350]:
counts.shape

(150, 1)

Let's remove books with 5 or fewer ratings.

In [351]:
rare = counts[counts['count'] <= 5].index
common = filtered_df_cbf[~filtered_df_cbf["Book-Title"].isin(rare)]
common.shape

(21099, 7)

Selecting only unique books to reduce redundancy.

In [352]:
# reset_index() returns a new frame and keeps the old index as a column for easier evaluation later
unique = common.drop_duplicates(subset=["Book-Title"]).reset_index()
unique

Unnamed: 0,index,User-ID,Book-Rating,Book-Title,Book-Author,Year-Of-Publication,Publisher,Location
0,38,6543,0,The Lovely Bones: A Novel,Alice Sebold,2002,"Little, Brown",usa
1,39,6543,0,The Da Vinci Code,Dan Brown,2003,Doubleday,usa
2,40,6543,0,Wild Animus,Rich Shapero,2004,Too Far,usa
3,44,6543,0,Violets Are Blue,James Patterson,2002,Warner Vision,usa
4,48,6543,8,Fahrenheit 451,RAY BRADBURY,1987,Del Rey,usa
...,...,...,...,...,...,...,...,...
145,17833,260897,0,The Poisonwood Bible: A Novel,Barbara Kingsolver,1999,Perennial,usa
146,17862,260897,0,Airframe,Michael Crichton,1997,Ballantine Books,usa
147,20019,278418,0,Watership Down,Richard Adams,1976,Avon,usa
148,23722,7158,0,Mystic River,Dennis Lehane,2001,William Morrow &amp; Company,usa


Since CBF recommends books similar to those the user rated positively before, we create a column that combines the descriptive information about each book.

In [353]:
# Cast the text columns to object dtype before concatenating them
text_cols = ['Book-Title', 'Book-Author', 'Publisher']
unique[text_cols] = unique[text_cols].astype('object')


In [354]:
# Concatenate them into one space-separated string per book
targets = ["Book-Title", "Book-Author", "Publisher"]
unique["cbf_data"] = unique[targets].apply(lambda row: " ".join(row.astype(str)), axis=1)


In [356]:
# Transform the combined text column into a document-term matrix
vectorizer = CountVectorizer()
common_booksVector = vectorizer.fit_transform(unique["cbf_data"])

In [357]:
# Calculate the pairwise cosine similarity between books
similarity = cosine_similarity(common_booksVector)
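To make the vectorization step concrete, here is a small sketch on three strings built the same way as `cbf_data` (title, author, publisher), using rows shown in the tables above. With default settings, `CountVectorizer` lowercases the text, drops single-character tokens, and keeps stop words.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Strings in the style of cbf_data: title + author + publisher
docs = [
    "The Da Vinci Code Dan Brown Doubleday",
    "The Lovely Bones: A Novel Alice Sebold Little, Brown",
    "Wild Animus Rich Shapero Too Far",
]

toy_vectorizer = CountVectorizer()
doc_term = toy_vectorizer.fit_transform(docs)  # sparse matrix: rows = books, columns = tokens

toy_similarity = cosine_similarity(doc_term)
print(sorted(toy_vectorizer.vocabulary_))
print(toy_similarity.round(3))
```

The first two strings share the tokens "the" and "brown" (an author surname in one, part of a publisher name in the other), which gives them a non-zero similarity of about 0.267. The third shares no tokens with either and scores 0. Overlaps of this kind may explain some of the recommendations produced by this approach.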

Let's find similar books based on the `similarity` matrix.

In [358]:
def recommend_cbf(name):
    top_n = 10
    # finding index of the book
    matching_rows = unique[unique['Book-Title'] == name].index
    
    if matching_rows.empty:
        return []
    
    book_index = matching_rows[0]
    # sorting by similarity score; ignoring the first one since it is the book itself
    similar_books = sorted(enumerate(similarity[book_index]), key=lambda x: x[1], reverse=True)[1:top_n+1]
    # no need for a score, only names of the books
    recommendations = [unique['Book-Title'].iloc[i] for i, _ in similar_books]
    
    return recommendations

In [364]:
recommend_cbf("The Da Vinci Code")

['The Beach House',
 'The Street Lawyer',
 'Angels &amp; Demons',
 'The Fellowship of the Ring (The Lord of the Rings, Part 1)',
 'The Lovely Bones: A Novel',
 'The King of Torts',
 'A Map of the World',
 'Wicked: The Life and Times of the Wicked Witch of the West',
 'The Deep End of the Ocean',
 'Harry Potter and the Order of the Phoenix (Book 5)']
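Some titles in the data contain HTML entities such as `&amp;` (see 'Angels &amp; Demons' above). With the default tokenizer this leaves a stray `amp` token that can match across unrelated books. The sketch below shows one possible cleanup using the standard library's `html.unescape`; applying it to `cbf_data` is an assumption and not part of the notebook.

```python
import html
from sklearn.feature_extraction.text import CountVectorizer

raw_title = "Angels &amp; Demons"

# With the default token pattern, the entity leaves a stray "amp" token
tokens_raw = sorted(CountVectorizer().fit([raw_title]).vocabulary_)

# html.unescape turns "&amp;" back into "&", which the tokenizer then ignores
clean_title = html.unescape(raw_title)
tokens_clean = sorted(CountVectorizer().fit([clean_title]).vocabulary_)

print(tokens_raw)
print(tokens_clean)
```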