### Model Development and Training

##### Importing relevant libraries

In [1]:
!pip install nltk


In [2]:
import pandas as pd
import re
import nltk
import numpy as np
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from surprise import Dataset, Reader, KNNWithMeans
from surprise.model_selection import train_test_split
from collections import defaultdict

##### Loading Datasets

In [3]:
books_df = pd.read_csv('books_cleaned.csv')
ratings_df = pd.read_csv('ratings_cleaned.csv')

#### Analysis

Recommendations are based on which books have the highest average rating. They are therefore independent of any particular title.

##### Content-based model

Next, I will use a content-based approach to generating recommendations. In this approach, recommendations are generated based on the similarity between books. A TF-IDF matrix is created from book titles and authors, and cosine similarity is used to identify books most similar to a given book. This approach tailors recommendations by focusing on book features rather than user preferences.
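The idea of weighting terms with TF-IDF and comparing books by cosine similarity can be sketched with the standard library alone. This is a minimal illustration, not the notebook's implementation: `tfidf_vectors`, `cosine`, and the sample strings are hypothetical, and the whitespace tokenizer is a simplification of what `TfidfVectorizer` does. The smoothed IDF follows the formula scikit-learn documents for its default settings.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
        vec = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

def cosine(a, b):
    """Cosine similarity of two L2-normalized sparse vectors (a dot product)."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

# Hypothetical "title + author" strings.
docs = [
    "great gatsby f. scott fitzgerald",
    "tender night f. scott fitzgerald",
    "hobbit j.r.r. tolkien",
]
vecs = tfidf_vectors(docs)
sims = [cosine(vecs[0], v) for v in vecs]
```

In this toy corpus the first book is most similar to itself, then to the book that shares an author, and has zero similarity to the book with no terms in common.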

In [4]:
def remove_stopwords(text_series):
    try:
        stop_words = set(stopwords.words('english'))
    except LookupError:
        nltk.download('stopwords')
        stop_words = set(stopwords.words('english'))

    cleaned_text = text_series.apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))
    return cleaned_text

def combine_columns(df, titles, authors):
    df['combined'] = titles + ' ' + authors
    return df

def compute_tfidf_matrix(df):
    tfidf = TfidfVectorizer()
    tfidf_matrix = tfidf.fit_transform(df['combined'])
    return tfidf_matrix

def compute_cosine_similarity(tfidf_matrix):
    return cosine_similarity(tfidf_matrix)

def find_book_index(df, titles, search_title):
    clean_title = lambda x: re.sub(r'\(.*?\)', '', str(x)).lower().strip()
    search_title = clean_title(search_title)
    filtered_df = df[titles == search_title]
    if not filtered_df.empty:
        return filtered_df.index[0]
    else:
        print(f"'{search_title}' not found in the dataframe.")
        return None

def get_similar_books(df, cos_sim_matrix, idx, top_n=5, threshold=0.1):
    similarity_scores = list(enumerate(cos_sim_matrix[idx]))
    sorted_similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)
    book_indices = [i[0] for i in sorted_similarity_scores[1:] if i[1] > threshold][:top_n]
    return df.iloc[book_indices]['book_id'].values

#### Using the model

In [5]:
title = 'The Great Gatsby'
n = 10
books_df_titles = books_df['title']
books_df_titles_1 = remove_stopwords(books_df_titles)
books_df_authors = books_df['authors']
content_books_df = combine_columns(books_df, books_df_titles_1, books_df_authors)

tfidf_matrix = compute_tfidf_matrix(content_books_df)
cos_sim_matrix = compute_cosine_similarity(tfidf_matrix)
        
book_idx = find_book_index(content_books_df , books_df_titles, title)
            
if book_idx is not None:
    similar_books_ids = get_similar_books(content_books_df, cos_sim_matrix, book_idx, n)
    
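    # Map the book IDs to titles in books_df.
    # Note: isin() returns rows in books_df order, not in similarity order.
    book_titles = books_df[books_df['book_id'].isin(similar_books_ids)]['title'].values
    
    # Print each title on a new line (book_title avoids shadowing `title`)
    for book_title in book_titles:
        print(book_title)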

all creatures great and small (all creatures great and small, #1-2)
the great divorce
tender is the night
this side of paradise
z: a novel of zelda fitzgerald
the beautiful and damned
the curious case of benjamin button
perfect you
the short stories
the great brain (great brain #1)


##### Analysis

These are the books most similar in title and author to 'The Great Gatsby'.

##### Content-based model with user profile

Next, I will use a content-based approach with a user profile to generate recommendations. In this approach, recommendations are generated by matching user preferences with book features. The model builds a user profile from the user's interactions (e.g., ratings, likes) and identifies the features the user prefers. It then recommends books that share those features, giving suggestions tailored to the user's tastes.
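A user profile of this kind can be sketched as the mean of the feature vectors of the books a user liked, with candidates then ranked by cosine similarity to that mean. The names `mean_vector`, `book_vectors`, and `liked` are hypothetical, and the three-dimensional vectors stand in for real TF-IDF rows.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of equal-length feature vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity of two dense vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical 3-feature vectors (e.g. TF-IDF weights) for four books.
book_vectors = {
    "book_a": [1.0, 0.0, 0.0],
    "book_b": [0.8, 0.2, 0.0],
    "book_c": [0.0, 0.0, 1.0],
    "book_d": [0.9, 0.1, 0.0],
}
liked = ["book_a", "book_b"]  # books the user rated highly

# The profile is the average of the liked books' vectors.
profile = mean_vector([book_vectors[b] for b in liked])

# Rank the remaining books by similarity to the profile.
ranked = sorted(
    (b for b in book_vectors if b not in liked),
    key=lambda b: cosine(book_vectors[b], profile),
    reverse=True,
)
```

This sketch also filters out books the user has already rated. That filtering is an assumption about the desired behavior, not something the notebook's function does.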

In [6]:
def get_highly_rated_titles(books_df, ratings_df, user_id, rating_threshold):
    user_ratings = ratings_df[ratings_df['user_id'] == user_id]
    high_rated_books = user_ratings[user_ratings['rating'] >= rating_threshold]
    high_rated_book_ids = high_rated_books['book_id'].tolist()

    high_rated_titles = books_df[books_df['book_id'].isin(high_rated_book_ids)]['title'].tolist()
    return high_rated_titles

def build_user_profile(tfidf_matrix, books_df, high_rated_titles):
    # Get the indices of high rated titles
    relevant_indices = books_df[books_df['title'].isin(high_rated_titles)].index
    # Select only the rows corresponding to these indices in the tfidf_matrix
    user_profile = tfidf_matrix[relevant_indices].mean(axis=0)
    user_profile = np.asarray(user_profile).reshape(1, -1)  # Convert to numpy array and reshape
    return user_profile

def get_similar_books_user_profile(tfidf_matrix, books_df, ratings_df, high_rated_titles, title, top_n):
    user_profile = build_user_profile(tfidf_matrix, books_df, high_rated_titles)
    similarity_scores = cosine_similarity(tfidf_matrix, user_profile).flatten()
    similar_indices = similarity_scores.argsort()[::-1]
   
    similar_titles = books_df.iloc[similar_indices]['title'].tolist()
   
    similar_titles = [t for t in similar_titles if t != title][:top_n]
   
    return similar_titles


#### Using the model

In [7]:
title = 'The Great Gatsby'
n = 10
user_id = 1
rating_threshold = 0.1  # ratings are on a 1-5 scale, so this keeps every rated book
books_df_titles = books_df['title']
books_df_titles_1 = remove_stopwords(books_df_titles)
books_df_authors = books_df['authors']
content_books_df = combine_columns(books_df, books_df_titles_1, books_df_authors)

tfidf_matrix = compute_tfidf_matrix(content_books_df)

high_rated_titles = get_highly_rated_titles(books_df, ratings_df, user_id, rating_threshold)
              
similar_books = get_similar_books_user_profile(tfidf_matrix, content_books_df, ratings_df, high_rated_titles, title, n)
    
# Print each title on a new line
for title in similar_books:
    print(title)     

the story of a new name (the neapolitan novels #2)
my brilliant friend (the neapolitan novels #1)
those who leave and those who stay (the neapolitan novels #3)
the story of the lost child (the neapolitan novels, #4)
the girl who played with fire (millennium, #2)
the girl with the dragon tattoo (millennium, #1)
the girl who kicked the hornet's nest (millennium, #3)
people of the book
the pearl
year of wonders


##### Analysis

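These are the books whose title and author features are most similar to the profile of user 1, which is built from the books that user has rated. The title 'The Great Gatsby' is passed only to filter that book out of the results; it does not influence the ranking.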

What will be the result for user with id 2?

In [8]:
title = 'The Great Gatsby'
n = 10
user_id = 2
rating_threshold = 0.1
books_df_titles = books_df['title']
books_df_titles_1 = remove_stopwords(books_df_titles)
books_df_authors = books_df['authors']
content_books_df = combine_columns(books_df, books_df_titles_1, books_df_authors)

tfidf_matrix = compute_tfidf_matrix(content_books_df)

high_rated_titles = get_highly_rated_titles(books_df, ratings_df, user_id, rating_threshold)
              
# No dependency on book_idx from the earlier cell: the ranking uses only the user profile
similar_books = get_similar_books_user_profile(tfidf_matrix, content_books_df, ratings_df, high_rated_titles, title, n)

# Print each title on a new line
for book_title in similar_books:
    print(book_title)

the harry potter collection 1-4 (harry potter, #1-4)
harry potter and the chamber of secrets (harry potter, #2)
harry potter and the half-blood prince (harry potter, #6)
harry potter and the sorcerer's stone (harry potter, #1)
harry potter and the goblet of fire (harry potter, #4)
harry potter and the order of the phoenix (harry potter, #5)
harry potter and the deathly hallows (harry potter, #7)
harry potter collection (harry potter, #1-6)
harry potter boxed set, books 1-5 (harry potter, #1-5)
harry potter boxset (harry potter, #1-7)


##### Analysis

These are the books whose title and author features are most similar to the profile of user 2.
They are dramatically different from the titles recommended to user 1, which shows that the ranking is driven by the user profile rather than by the title 'The Great Gatsby'.

#### Collaborative filtering model
Next, I will use a collaborative filtering approach to generate recommendations.
A collaborative filtering model generates recommendations from rating patterns rather than book features. In its user-based form, it identifies users with similar tastes based on their interaction history (e.g., ratings) and recommends books that those users liked but the target user has not yet read. The implementation below uses the item-based form (`"user_based": False`): two books are considered similar when the same users rate them similarly, and the recommendations are the nearest neighbors of the given book. Either way, the approach draws on the collective preferences of the community.
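Item-to-item similarity, the variant the next cell configures with `"user_based": False`, can be sketched as cosine similarity over the users who rated both books. This is a simplified standard-library illustration with hypothetical data and helper names (`item_cosine`, `nearest_items`), not Surprise's implementation.

```python
import math
from collections import defaultdict

# Hypothetical (user_id, book_id, rating) triples.
ratings = [
    (1, "A", 5), (1, "B", 4),
    (2, "A", 4), (2, "B", 5), (2, "C", 1),
    (3, "A", 1), (3, "C", 5),
]

# Group ratings by item: {book_id: {user_id: rating}}.
by_item = defaultdict(dict)
for user, book, rating in ratings:
    by_item[book][user] = rating

def item_cosine(i, j):
    """Cosine similarity restricted to users who rated both items."""
    common = by_item[i].keys() & by_item[j].keys()
    if not common:
        return 0.0
    dot = sum(by_item[i][u] * by_item[j][u] for u in common)
    norm_i = math.sqrt(sum(by_item[i][u] ** 2 for u in common))
    norm_j = math.sqrt(sum(by_item[j][u] ** 2 for u in common))
    return dot / (norm_i * norm_j)

def nearest_items(target, n):
    """Return the n items most similar to target, best first."""
    others = [b for b in by_item if b != target]
    return sorted(others, key=lambda b: item_cosine(target, b), reverse=True)[:n]

neighbors = nearest_items("A", 2)
```

One property of this measure is worth keeping in mind: when two books share exactly one rater, their cosine similarity over co-raters is always 1.0. This may be one reason rarely rated titles can appear among the nearest neighbors of a popular book.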

In [9]:
def create_surprise_data(ratings_df):
    reader = Reader(rating_scale=(1, 5))
    data = Dataset.load_from_df(ratings_df[["user_id", "book_id", "rating"]], reader)
    return data

def build_and_train_model(trainset, k):
    sim_options = {
        "name": "cosine",
        "user_based": False,  # Compute similarities between items
    }
    algo = KNNWithMeans(k=k, min_k=1, sim_options=sim_options)
    algo.fit(trainset)
    return algo

def get_collaborative_recommendations(title, n, ratings_df, books_df, k):
    # Create Surprise data
    data = create_surprise_data(ratings_df)

    # Build and train the model
    trainset = data.build_full_trainset()
    algo = build_and_train_model(trainset, k)

    # Find the book_id for the given title
    book_id = books_df[books_df['title'].str.lower() == title.lower()]['book_id'].values[0]

    # Get the inner id corresponding to the book_id
    inner_id = algo.trainset.to_inner_iid(book_id)

    # Get the neighbors of the book (n similar books)
    neighbors = algo.get_neighbors(inner_id, k=n)

    # Convert inner ids of neighbors to book_ids
    neighbor_book_ids = [algo.trainset.to_raw_iid(inner_id) for inner_id in neighbors]

    # Map book_ids back to titles
    recommendations = books_df[books_df['book_id'].isin(neighbor_book_ids)]['title'].values

    return recommendations

#### Using the model

In [10]:
title = 'The Great Gatsby'
n = 10
k= 30

recommendations = get_collaborative_recommendations(title, n, ratings_df, books_df, k)
for title in recommendations:
    print(title)


Computing the cosine similarity matrix...
Done computing similarity matrix.
في ديسمبر تنتهي كل الأحلام
الجزار
2 ضباط
the magic (the secret, #3)
diary ng panget
fifty shades duo: fifty shades darker / fifty shades freed (fifty shades, #2-3)
until nico (until, #4)
صانع الظلام
طه الغريب
a thousand boy kisses


##### Analysis

These are the books that item-based collaborative filtering ranks as the nearest neighbors of 'The Great Gatsby', based on user ratings.

#### What are k and min_k?
In the context of the `KNNWithMeans` algorithm from the Surprise library:

##### `k`
- **Description**: `k` is the maximum number of nearest neighbors considered when predicting a user's rating for a book. It determines how many similar books (or users, if user-based) the algorithm looks at to compute the prediction.
- **Usage**: A higher `k` averages over more neighbors, which tends to give more stable but less personalized predictions. A lower `k` can give more personalized but less stable predictions.

##### `min_k`
- **Description**: `min_k` is the minimum number of neighbors required to make a neighbor-based prediction. If fewer than `min_k` neighbors are found, the algorithm falls back to a baseline estimate (the mean rating) rather than relying on a small, potentially unreliable set of neighbors.
- **Usage**: `min_k` ensures that predictions are not based on too few neighbors, which could make recommendations less reliable. The default value is `1`, meaning that at least one neighbor is needed to make a prediction.

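In the model, setting `k=30` and `min_k=1` means the algorithm considers up to 30 nearest neighbors when predicting a rating and requires at least one neighbor to make a prediction.
These values were not selected arbitrarily; they were informed by hyperparameter tuning.
Note, however, that the grid search output below reports `k=20`, `min_k=3`, and `pearson_baseline` similarity as the best configuration, which differs from the `k=30`, `min_k=1`, cosine settings used above.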
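How `k` and `min_k` interact in a mean-centered neighborhood prediction can be sketched as follows. This is a simplified illustration under stated assumptions, not Surprise's code: `predict_with_means` and the sample numbers are hypothetical.

```python
import heapq

def predict_with_means(target_mean, neighbors, k, min_k):
    """Predict a rating as the target item's mean plus a similarity-weighted
    average of the neighbors' deviations from their own means.

    neighbors: list of (similarity, rating, neighbor_mean) tuples for items
    the user has already rated.
    """
    # Keep the k most similar neighbors, ignoring non-positive similarities.
    top = heapq.nlargest(k, neighbors, key=lambda t: t[0])
    used = [(s, r, m) for s, r, m in top if s > 0]
    if len(used) < min_k or not used:
        # Too few neighbors: fall back to the item's mean rating.
        return target_mean
    sim_total = sum(s for s, _, _ in used)
    offset = sum(s * (r - m) for s, r, m in used) / sim_total
    return target_mean + offset

# Hypothetical neighbors: (similarity, user's rating, neighbor's mean rating).
neighbors = [(0.9, 5, 4.0), (0.5, 3, 3.5), (0.1, 1, 3.0)]

pred_k2 = predict_with_means(3.8, neighbors, k=2, min_k=1)
pred_fallback = predict_with_means(3.8, neighbors, k=2, min_k=3)
```

With `k=2` only the two most similar neighbors contribute. Raising `min_k` to 3 leaves too few neighbors, so the second call returns the plain item mean.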

#### Hyperparameter tuning

In [13]:
from surprise import KNNWithMeans
from surprise.model_selection import GridSearchCV

# Create the Surprise data
data = create_surprise_data(ratings_df)

# Define the parameter grid
param_grid = {
    'k': [10, 20, 30, 40, 50],  # Number of neighbors
    'min_k': [1, 2, 3, 4, 5],   # Minimum number of neighbors
    'sim_options': {
        'name': ['cosine', 'pearson', 'pearson_baseline'],
        'user_based': [False]  # Compute similarities between items
    }
}

# Perform grid search with cross-validation
gs = GridSearchCV(KNNWithMeans, param_grid, measures=['rmse', 'mae'], cv=5, n_jobs=-1)
gs.fit(data)

# Print the best RMSE score
print(f"Best RMSE score: {gs.best_score['rmse']}")

# Print the best parameters
print(f"Best parameters: {gs.best_params['rmse']}")



Best RMSE score: 0.7982112629347284
Best parameters: {'k': 20, 'min_k': 3, 'sim_options': {'name': 'pearson_baseline', 'user_based': False}}
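To get a sense of the cost of this search, the grid above can be expanded by hand. The sketch below only counts combinations with `itertools.product`; it does not train anything, and the flattened variable names are hypothetical.

```python
from itertools import product

# Same candidate values as the grid above, with sim_options flattened.
k_values = [10, 20, 30, 40, 50]
min_k_values = [1, 2, 3, 4, 5]
sim_names = ["cosine", "pearson", "pearson_baseline"]

# One dict per parameter combination the search has to evaluate.
combinations = [
    {"k": k, "min_k": min_k, "sim_options": {"name": name, "user_based": False}}
    for k, min_k, name in product(k_values, min_k_values, sim_names)
]

n_folds = 5  # cv=5 in the grid search above
n_fits = len(combinations) * n_folds
print(len(combinations), n_fits)  # prints: 75 375
```

Each of the 75 combinations is fitted once per fold, so the search trains 375 models, which is why it is run in parallel with `n_jobs=-1`.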


#### Hybrid model
I will use a combination of the content-based and collaborative filtering models to generate recommendations.

In [11]:
title = 'The Great Gatsby'
n = 10
user_id = 2
rating_threshold = 0.1
books_df_titles = books_df['title']
books_df_titles_1 = remove_stopwords(books_df_titles)
books_df_authors = books_df['authors']
content_books_df = combine_columns(books_df, books_df_titles_1, books_df_authors)

tfidf_matrix = compute_tfidf_matrix(content_books_df)

high_rated_titles = get_highly_rated_titles(books_df, ratings_df, user_id, rating_threshold)
              
content_recommendations = get_similar_books_user_profile(tfidf_matrix, content_books_df, ratings_df, high_rated_titles, title, n)
    
# Collaborative Filtering
k= 30
collaborative_recommendations = get_collaborative_recommendations(title, n, ratings_df, books_df, k)
    
# Hybrid approach: take the union of the collaborative and content-based recommendations.
# Note: a set has no defined order, so ranking information is lost here.
hybrid_recommendations = list(set(content_recommendations) | set(collaborative_recommendations))
       
# Keep at most n of the combined recommendations
hybrid_recommendations = hybrid_recommendations[:n]
       
# Print each title on a new line
for title in hybrid_recommendations:
    print(title) 

Computing the cosine similarity matrix...
Done computing similarity matrix.
harry potter and the deathly hallows (harry potter, #7)
harry potter boxed set, books 1-5 (harry potter, #1-5)
harry potter and the half-blood prince (harry potter, #6)
the magic (the secret, #3)
diary ng panget
the harry potter collection 1-4 (harry potter, #1-4)
الجزار
a thousand boy kisses
harry potter boxset (harry potter, #1-7)
fifty shades duo: fifty shades darker / fifty shades freed (fifty shades, #2-3)
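Because the union above goes through a `set`, which ten titles survive the `[:n]` cut is not tied to either model's ranking. One order-preserving alternative, sketched here with hypothetical names and data, is to alternate between the two ranked lists and skip duplicates.

```python
from itertools import chain, zip_longest

def interleave_unique(first, second, n):
    """Alternate between two ranked lists, skipping duplicates, and keep n."""
    merged = []
    seen = set()
    # zip_longest pads the shorter list with None, which is skipped below.
    for item in chain.from_iterable(zip_longest(first, second)):
        if item is None or item in seen:
            continue
        seen.add(item)
        merged.append(item)
    return merged[:n]

# Hypothetical ranked lists from the two models.
content_ranked = ["book_a", "book_b", "book_c"]
collab_ranked = ["book_b", "book_d"]

hybrid = interleave_unique(content_ranked, collab_ranked, n=4)
```

This only helps if both inputs are in rank order. In this notebook the collaborative titles are looked up with `isin()`, which returns them in `books_df` order, so that lookup would also need to preserve the neighbor ranking.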
