# Content-Based and Collaborative Filtering Methods

In this notebook we examine the performance of two baseline models: Content-Based Filtering and Collaborative Filtering. We chose these two because they are widely used in recommender systems research.

## Imports

In [21]:
import pandas as pd
import numpy as np
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from surprise import Dataset, Reader, SVD, accuracy
from surprise.model_selection import train_test_split

from collections import defaultdict

## Data initialization

In [4]:
ratings = pd.read_csv("Data/Ratings.csv")
books = pd.read_csv("Data/Books.csv", dtype={3: str})
users = pd.read_csv("Data/Users.csv")

## Collaborative Filtering

The following is partly based on [this Kaggle notebook](https://www.kaggle.com/code/alnourabdalrahman9/collaborative-filtering-books-recommendation).

### Data Merging

To simplify handling, we merge the necessary columns into a single data frame.

In [5]:
ratings = ratings.merge(books, on='ISBN').drop(columns=["ISBN", "Image-URL-S", "Image-URL-M", "Image-URL-L"])
ratings = ratings.merge(users.drop("Age", axis=1), on="User-ID")

In [4]:
ratings

| | User-ID | Book-Rating | Book-Title | Book-Author | Year-Of-Publication | Publisher | Location |
|---|---|---|---|---|---|---|---|
| 0 | 276725 | 0 | Flesh Tones: A Novel | M. J. Rose | 2002 | Ballantine Books | tyler, texas, usa |
| 1 | 2313 | 5 | Flesh Tones: A Novel | M. J. Rose | 2002 | Ballantine Books | cincinnati, ohio, usa |
| 2 | 2313 | 9 | Ender's Game (Ender Wiggins Saga (Paperback)) | Orson Scott Card | 1986 | Tor Books | cincinnati, ohio, usa |
| 3 | 2313 | 8 | In Cold Blood (Vintage International) | TRUMAN CAPOTE | 1994 | Vintage | cincinnati, ohio, usa |
| 4 | 2313 | 9 | Divine Secrets of the Ya-Ya Sisterhood : A Novel | Rebecca Wells | 1996 | HarperCollins | cincinnati, ohio, usa |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1031131 | 276442 | 7 | Le Huit | Katherine Neville | 2002 | Le Cherche Midi | genève, genève, switzerland |
| 1031132 | 276618 | 5 | Ludwig Marum: Briefe aus dem Konzentrationslag... | Ludwig Marum | 1984 | C.F. MÃ¼ller | `stuttgart, \n/a\"., germany"` |
| 1031133 | 276647 | 0 | Christmas With Anne and Other Holiday Stories:... | L. M. Montgomery | 2001 | Starfire | arlington heights, illinois, usa |
| 1031134 | 276647 | 10 | Heaven (Coretta Scott King Author Award Winner) | Angela Johnson | 1998 | Simon &amp; Schuster Children's Publishing | arlington heights, illinois, usa |


In [6]:
# Not every location includes a city, so keep only the country (the text after the last comma)
ratings['Location'] = ratings['Location'].str.split(',').str[-1].str.strip()
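
A minimal sketch of what this split does, using a few hand-written location strings (the sample values are illustrative and not taken from the dataset). It also shows an edge case: when nothing follows the last comma, the result is an empty string rather than a country.

```python
import pandas as pd

# Hypothetical sample locations in the same "city, region, country" layout
sample = pd.Series(["tyler, texas, usa", "genève, genève, switzerland", "porto, ,"])

# Same chain as in the cell above: split on commas, take the last piece, trim whitespace
country = sample.str.split(',').str[-1].str.strip()
print(country.tolist())
```

Entries that come out empty may need separate handling before the column is used as a feature.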

### Filtering 

In [7]:
active_users = ratings.groupby('User-ID')['Book-Rating'].count()
# Keep users with more than 2000 ratings; this threshold was the limit the kernel could handle without crashing
active_user_ids = active_users[active_users > 2000].index
#active_user_ids

In [8]:
books = ratings.groupby('Book-Title')['Book-Rating'].count()
book_titles = books.index
# Keep ratings from users above the threshold; book_titles holds every title, so no book is excluded here
filtered_df = ratings[ratings['User-ID'].isin(active_user_ids) & ratings['Book-Title'].isin(book_titles)]
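
The same filtering pattern can be sketched on a toy frame. The data and the threshold of 2 ratings (instead of 2000) are assumptions chosen only to keep the example small.

```python
import pandas as pd

# Toy ratings: users 1 and 3 have three ratings each, user 2 has one
toy = pd.DataFrame({
    "User-ID": [1, 1, 1, 2, 3, 3, 3],
    "Book-Title": ["A", "B", "C", "A", "A", "B", "D"],
    "Book-Rating": [5, 0, 8, 7, 9, 6, 0],
})

# Count ratings per user and keep the IDs above the (illustrative) threshold
counts = toy.groupby("User-ID")["Book-Rating"].count()
active_ids = counts[counts > 2].index

# Boolean mask keeps only the rows that belong to active users
toy_filtered = toy[toy["User-ID"].isin(active_ids)]
print(sorted(toy_filtered["User-ID"].unique()))
```

A comparable threshold on `Book-Title` counts would shrink the matrix further, but the notebook keeps all titles.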

In [8]:
filtered_df

| | User-ID | Book-Rating | Book-Title | Book-Author | Year-Of-Publication | Publisher | Location |
|---|---|---|---|---|---|---|---|
| 4021 | 98391 | 9 | Flesh Tones: A Novel | M. J. Rose | 2002 | Ballantine Books | usa |
| 4022 | 98391 | 8 | At the Edge | David Dun | 2002 | Pinnacle Books | usa |
| 4023 | 98391 | 9 | Southampton Row (Charlotte &amp; Thomas Pitt N... | Anne Perry | 2002 | Ballantine Books | usa |
| 4024 | 98391 | 9 | The Lovely Bones: A Novel | Alice Sebold | 2002 | Little, Brown | usa |
| 4025 | 98391 | 10 | The Da Vinci Code | Dan Brown | 2003 | Doubleday | usa |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 509042 | 171118 | 8 | Too Many Clients | Rex Stout | 1990 | Bantam Books | canada |
| 509043 | 171118 | 0 | Nebula Award Stories: 5 | James Blish | 1983 | Bantam Books | canada |
| 509044 | 171118 | 0 | Wrong End of Time | John Brunner | 1971 | Doubleday | canada |
| 509045 | 171118 | 0 | Werewolf Principle | Clifford D. Simak | 1967 | Putnam Pub Group | canada |


### Similarity Matrix

In [9]:
pivot_table = filtered_df.pivot_table(index='Book-Title', columns='User-ID', values='Book-Rating')
pivot_table.fillna(0, inplace=True)

In [10]:
similarity_matrix = cosine_similarity(pivot_table)
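
To make the item-item similarity concrete, here is a small sketch on invented ratings (three books, three users; all values are assumptions). Each book becomes a row vector of user ratings, and cosine similarity compares the direction of those vectors, not their magnitude.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Toy ratings: books A and B are rated by users 1 and 2, book C only by user 3
toy = pd.DataFrame({
    "Book-Title": ["A", "A", "B", "B", "C"],
    "User-ID": [1, 2, 1, 2, 3],
    "Book-Rating": [8, 6, 4, 3, 9],
})

# Rows are books, columns are users; missing ratings become 0 as in the notebook
toy_pivot = toy.pivot_table(index="Book-Title", columns="User-ID", values="Book-Rating").fillna(0)

# Square matrix: entry [i, j] is the cosine similarity between book i and book j
toy_sim = cosine_similarity(toy_pivot)
print(pd.DataFrame(toy_sim, index=toy_pivot.index, columns=toy_pivot.index).round(2))
```

Books A and B have proportional rating vectors, so their similarity is 1 even though A is rated higher. A and C share no users, so their similarity is 0. Note that filling with 0 makes "not rated" indistinguishable from an actual rating of 0, which does occur in this data.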

We define a function that returns the top 10 CF recommendations. It is based on the Kaggle notebook referenced above, adjusted for wider applicability.

In [11]:
def recommend_cf(name):
    top_n = 10
    book_idx = pivot_table.index.get_loc(name)
    # Search through to find the most similar books
    similar_books = sorted(enumerate(similarity_matrix[book_idx]), key=lambda x: x[1], reverse=True)[1:top_n+1]
    # Return a list for recommended books; _ since the score is not needed
    recommendations = [pivot_table.index[i] for i, _ in similar_books]
    return recommendations

In [12]:
recommend_cf("The Catcher in the Rye")

['Girl, Interrupted',
 'House of Sand and Fog',
 "ANGELA'S ASHES",
 'Peace Like a River',
 'Empire Falls',
 'A Night to Remember',
 "Ender's Game (Ender Wiggins Saga (Paperback))",
 'Waiting (Vintage International)',
 'Life, the Universe and Everything',
 'Sideways Stories from Wayside School (Wayside School)']
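
The slice `[1:top_n+1]` in `recommend_cf` assumes that the queried book ranks first in its own similarity row. That usually holds, but when another book ties with it, the slice can drop the wrong entry. The sketch below uses a hypothetical helper, `top_n_similar` (not part of the notebook), that excludes the query by index instead.

```python
import numpy as np

def top_n_similar(sim_row, query_idx, top_n=10):
    """Return indices of the top_n most similar items, excluding query_idx."""
    # Stable sort on negated scores: descending similarity, ties keep index order
    order = np.argsort(-sim_row, kind="stable")
    # Drop the query item explicitly instead of assuming it sits at position 0
    order = order[order != query_idx]
    return order[:top_n].tolist()

# Item 1 is the query; item 0 ties with it at similarity 1.0
row = np.array([1.0, 1.0, 0.2, 0.9])
print(top_n_similar(row, query_idx=1, top_n=2))  # [0, 3]
```

With the same row, sorting and slicing `[1:3]` would return items 1 and 3, so the query would recommend itself and item 0 would be lost.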

## Content-Based Filtering

The following method is based on [this Kaggle notebook](https://www.kaggle.com/code/eyadgk/books-eda-vis-recommendation-systems#6-%7C%7C-Content-Based-Filtering-Recommender-System).

In [9]:
# .copy() yields an independent frame, so adding a column below does not trigger SettingWithCopyWarning
unique_books = filtered_df.drop_duplicates(subset=['Book-Title']).reset_index(drop=True).copy()
feature_columns = ["Book-Title", "Book-Author", "Publisher"]
# Join title, author and publisher into one text field per book
unique_books['combined_features'] = unique_books.apply(lambda x: ' '.join(x[feature_columns].astype(str)), axis=1)


In [12]:
unique_books['combined_features']

0         Flesh Tones: A Novel M. J. Rose Ballantine Books
1                     At the Edge David Dun Pinnacle Books
2        Southampton Row (Charlotte &amp; Thomas Pitt N...
3        The Lovely Bones: A Novel Alice Sebold Little,...
4                    The Da Vinci Code Dan Brown Doubleday
                               ...                        
59244              Too Many Clients Rex Stout Bantam Books
59245     Nebula Award Stories: 5 James Blish Bantam Books
59246             Wrong End of Time John Brunner Doubleday
59247    Werewolf Principle Clifford D. Simak Putnam Pu...
59248    Complete Guide to Effective English: Harbrace ...
Name: combined_features, Length: 59249, dtype: object

In [13]:
vectorizer = CountVectorizer()
feature_vectors = vectorizer.fit_transform(unique_books['combined_features'])

In [14]:
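from scipy.sparse import csr_matrix
# CountVectorizer already returns a SciPy sparse (CSR) matrix, so this conversion does not change the data
feature_vectors = csr_matrix(feature_vectors)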

In [16]:
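# Note: this reuses the name similarity_matrix and overwrites the collaborative-filtering matrix,
# so that matrix must be recomputed before recommend_cf is called again
similarity_matrix = cosine_similarity(feature_vectors)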
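
A small sketch of how bag-of-words counts drive this similarity. The three strings are hand-written in the same "title author publisher" layout and are illustrative only. The `stop_words="english"` variant is one possible mitigation for matches on filler words; it is an assumption for comparison, not something the notebook does.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative combined-feature strings (not rows from the dataset)
docs = [
    "The Catcher in the Rye J.D. Salinger Little Brown",
    "Fire in the Lake Frances FitzGerald Vintage",
    "Franny and Zooey J.D. Salinger Little Brown",
]

# Plain counts: every token of two or more characters contributes, including "the" and "in"
plain = cosine_similarity(CountVectorizer().fit_transform(docs))

# Variant that drops common English words before counting
no_stop = cosine_similarity(CountVectorizer(stop_words="english").fit_transform(docs))

print("plain:  ", plain[0].round(2))
print("no_stop:", no_stop[0].round(2))
```

With plain counts, the first two strings look similar only because they share "in" and "the". Once stop words are removed, that similarity drops to zero, while the third string still matches on author and publisher tokens.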

In [19]:
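def recommend_cbf(name):
    top_n = 10
    # Row position of the queried title (raises IndexError if the title is not present)
    book_index = unique_books[unique_books['Book-Title'] == name].index[0]
    # Rank all books by similarity; skip the first entry, which is normally the queried book itself
    similar_books = sorted(enumerate(similarity_matrix[book_index]), key=lambda x: x[1], reverse=True)[1:top_n+1]
    # Map positions back to titles; _ since the score is not needed
    recommendations = [unique_books['Book-Title'].iloc[i] for i, _ in similar_books]
    return recommendations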

In [20]:
recommend_cbf("The Catcher in the Rye")

['Catcher in the Rye',
 'The Little House in the Highlands',
 'The Clothes They Stood Up in and the Lady in the Van: And, the Lady in the Van',
 'The Castafiore Emerald (The Adventures of Tintin)',
 'Not the end of the world',
 'At the Highest Levels: The Inside Story of the End of the Cold War',
 'WIND IN THE WILLOWS: THE WILD WOOD (Wind in the Willows, No 3)',
 'Fire in the lake: The Vietnamese and the Americans in Vietnam',
 "The People's Almanac Presents the Book of Lists/the '90s Edition",
 'In the Walled Gardens: A Novel']

## Evaluations

Four main metrics were chosen to define accuracy for the baseline models:

- RMSE
- Precision
- Recall
- F1 
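
As a rough illustration of the four metrics listed above, the sketch below computes them for a single user in plain Python. It is not the notebook's evaluation code: the function names, the relevance threshold of 7, the cutoff `k=3`, and the ratings are all assumptions made for this example.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted ratings."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def precision_recall_f1_at_k(true_ratings, est_ratings, k=10, threshold=7.0):
    """Precision, recall and F1 over the k items with the highest estimated rating."""
    # Rank (estimate, truth) pairs by estimated rating, best first
    ranked = sorted(zip(est_ratings, true_ratings), key=lambda x: x[0], reverse=True)
    # Relevant: true rating at or above the threshold (over all items)
    n_rel = sum(t >= threshold for _, t in ranked)
    # Recommended: estimate at or above the threshold within the top k
    n_rec_k = sum(e >= threshold for e, _ in ranked[:k])
    n_rel_and_rec_k = sum(e >= threshold and t >= threshold for e, t in ranked[:k])
    precision = n_rel_and_rec_k / n_rec_k if n_rec_k else 0.0
    recall = n_rel_and_rec_k / n_rel if n_rel else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical true and estimated ratings for one user
true = [9, 8, 3, 7, 2]
est = [8.5, 6.0, 7.5, 7.2, 4.0]

print(round(rmse(true, est), 3))
print(precision_recall_f1_at_k(true, est, k=3, threshold=7.0))
```

In this example, two of the three recommended items are relevant and two of the three relevant items are recommended, so precision, recall, and F1 all equal 2/3.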