## Setup

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

Data preview

In [2]:
book_df = pd.read_csv("../datasets/clean/filtered_datasets/Final/final_books.csv")
book_df.shape

(2332, 11)

---
## Data preprocessing

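The descriptions have to be cleaned and tokenized before Word2Vec can be trained on them.

We apply the following steps:
1. Cast the descriptions to strings (rows with empty descriptions were already removed) and replace non-alphabetic characters with spaces
2. Tokenize each description and remove stop words
3. Write the cleaned text back to the table and store the tokens in a new column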

In [3]:
# Utilizing word embeddings to get better results
import re
# use nltk for the utilities in preprocessing
from nltk.corpus import stopwords

# use gensim for tokenizing (simple_preprocess) and for the Word2Vec model
from gensim.utils import simple_preprocess
from gensim.models import Word2Vec

stop_words = set(stopwords.words('english'))

# Rows with empty descriptions were removed upstream, so filling them is no longer needed:
# book_df['description'] = book_df['description'].fillna(' ')

# Convert 'description' column to string type
book_df['description'] = book_df['description'].astype(str)

for index, sentence in enumerate(book_df["description"]):
    # 1. Remove all special characters
    preprocessed_sentence = re.sub("[^a-zA-Z]", " ", sentence)
    # 2. Tokenize the sentence
    tokens = simple_preprocess(preprocessed_sentence)
    # 3. Remove stop words
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # 4. Join tokens back into a sentence
    processed_sentence = ' '.join(filtered_tokens)
    # 5. Update the 'description' column with the processed sentence
    book_df.at[index, 'description'] = processed_sentence


# Tokenize the 'description' column
book_df['tokenized_description'] = book_df['description'].apply(lambda x: x.split())
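
To see what the cleaning loop does to a single description, here is a minimal sketch that runs without NLTK or gensim. The stop-word set is a tiny hand-picked stand-in for NLTK's English list, the function name `clean_description` is hypothetical, and the length bounds are meant to mirror what I understand to be the defaults of `simple_preprocess` (tokens of 2 to 15 characters); treat all three as assumptions.

```python
import re

# Assumption: a tiny stand-in for NLTK's English stop-word list.
STOP_WORDS = {"a", "an", "the", "of", "on", "in", "into", "and", "is", "to", "with"}


def clean_description(text, stop_words=STOP_WORDS, min_len=2, max_len=15):
    """Return the cleaned token list for one description (hypothetical helper)."""
    # Replace every non-letter with a space, like the regex in the cell above.
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    # Lowercase, split on whitespace, and keep tokens within the length bounds.
    tokens = [t for t in letters_only.lower().split() if min_len <= len(t) <= max_len]
    # Drop stop words.
    return [t for t in tokens if t not in stop_words]


tokens = clean_description("A secret Arctic experiment turns into a frozen nightmare!")
print(tokens)
# ['secret', 'arctic', 'experiment', 'turns', 'frozen', 'nightmare']
```

Wrapping the steps in a function like this would also allow `book_df['description'].apply(...)` instead of an explicit loop, though the notebook's loop gives the same result.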


In [5]:
book_df.to_csv("./tokenized_book_df.csv")

### The tokenized data

In [4]:
pd.set_option('display.max_colwidth', 200) # -> to see more from the description
book_df[['Book-Title','categories','description']].head(2)

Unnamed: 0,Book-Title,categories,description
0,The Testament,Fiction,suicidal billionaire burnt washington litigator woman forsaken technology work wilds brazil brought together astounding mystery testament
1,Icebound,Fiction,secret arctic experiment turns frozen nightmare team scientists stranded drifting iceberg massive explosive charge battles elements survival discover one murderer reissue


---
## Word2Vec setup

We configure and train the Word2Vec model on the tokenized descriptions.

In [5]:
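# Passing `sentences` to the constructor builds the vocabulary and already runs an initial training pass
w2v = Word2Vec(sentences=book_df['tokenized_description'], vector_size=100, window=5, min_count=1, workers=4)
# Continue training for 10 more epochs; train() returns (trained word count, raw word count)
w2v.train(book_df['tokenized_description'], total_examples=len(book_df['tokenized_description']), epochs=10)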

(770738, 782650)

These are the three most frequent words in the vocabulary:

In [6]:
print(w2v.wv.index_to_key[0], w2v.wv.index_to_key[1], w2v.wv.index_to_key[2])

life one new
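
As far as I know, gensim orders `index_to_key` by descending word frequency, which is why the first three entries are read as the most common words. A quick, library-independent way to cross-check that reading is to count the tokens directly. The sketch below uses a small hypothetical token list in place of `book_df['tokenized_description']`.

```python
from collections import Counter

# Hypothetical tokenized descriptions standing in for book_df['tokenized_description'].
tokenized = [
    ["life", "one", "new", "life"],
    ["life", "story", "one"],
    ["new", "life", "one"],
]

# Count every token across all descriptions.
counts = Counter(token for doc in tokenized for token in doc)

# Take the three most frequent tokens.
top_three = [word for word, _ in counts.most_common(3)]
print(top_three)
# ['life', 'one', 'new']
```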


---
## Get book recommendations

We create a function that takes the title of the book in question and:
- gets its description
- uses our Word2Vec model and the cosine similarity function to find similar descriptions
- returns the most similar books by description

In [7]:
import warnings

# Suppress runtime warnings
warnings.filterwarnings("ignore", category=RuntimeWarning)

def get_similar_books_word2vec(title, amount=10, model=w2v, book_df=book_df):
    def mean_vector(tokens):
        # Average the vectors of the tokens the model knows; NaN if there are none
        return np.mean([model.wv[token] for token in tokens if token in model.wv], axis=0).reshape(1, -1)

    # Get the description tokens of the TARGET book and compute their average vector
    description_tokens = book_df.loc[book_df['Book-Title'] == title, 'tokenized_description'].iloc[0]
    target_vector = mean_vector(description_tokens)
    # Without a usable target vector no similarity can be computed
    if np.isnan(target_vector).any():
        return book_df.iloc[0:0]

    similarity_scores = {}
    for index, row in book_df.iterrows():
        if row['Book-Title'] != title:
            # Compute the average vector of the CURRENT book's description tokens
            book_vector = mean_vector(row['tokenized_description'])
            # Skip books whose descriptions contain no known tokens
            if np.isnan(book_vector).any():
                continue
            # Get the cosine similarity
            similarity_scores[index] = cosine_similarity(target_vector, book_vector)[0, 0]

    # Sort the index labels by similarity score in descending order
    sorted_indices = sorted(similarity_scores, key=similarity_scores.get, reverse=True)

    # iterrows() yields index labels, so select with .loc rather than .iloc
    return book_df.loc[sorted_indices[:amount]]

#### Function to get formatted recommendations

In [8]:
def get_book_recom(title):
    book = book_df.loc[book_df['Book-Title'] == title].iloc[0]
    print(book["Book-Title"]  + " - "+ book["categories"] +  " - " + book["description"])

    recommended_books = get_similar_books_word2vec(title,amount=5)
    print()
    return recommended_books

---
## Testing the recommendation model

__Let's test the recommendation model.__

For each query, the output shows the title, genre and description of both the original book and the recommended books,\
so that they can be compared side by side.

In [9]:
# pd.set_option('display.max_colwidth', 500) # -> to see more from the description

In [10]:
get_book_recom("Icebound")[["Book-Title","categories","description"]]

Icebound - Fiction - secret arctic experiment turns frozen nightmare team scientists stranded drifting iceberg massive explosive charge battles elements survival discover one murderer reissue



Unnamed: 0,Book-Title,categories,description
20,Strangers,Fiction,group seemingly unrelated people experiences sensations numbing terror fear groping way toward one another discover sinister shared secrets chilling climax changes lives forever reissue
320,"Fluke : Or, I Know Why the Winged Whale Sings",Fiction,humpback whales sing question marine behavioral biologist nate quinn crew poking charting recording photographing big wet gray marine mammals extraordinary day whale lifts tail air display cryptic...
165,One Thousand White Women : The Journals of May Dodd: A Novel,Fiction,one thousand white women story may dodd colorful assembly pioneer women auspices government travel western prairies intermarry among cheyenne indians covert controversial brides indians program la...
595,"Shards of a Broken Crown (Serpentwar Saga, Book 4)",Fiction,demon enemy routed well winter icy grasp loosening world emerald queen vanquished army broken back bitter sea treachery recourse lackey declared lord defeated amassing still fearsome remnants ruth...
2302,Boy Who Turned into a TV Set,Juvenile Fiction,although mother warns continues watch television much turn one ogden pettibone believe discovers clear color picture glowing stomach


In [11]:
get_book_recom("Midnight Voices")[["Book-Title","categories","description"]]

Midnight Voices - Fiction - caroline two children move new spouse apartment central park west son instinctive misgivings become horrifying reality young girl vanishes caroline daughter begins waste away



Unnamed: 0,Book-Title,categories,description
1648,Devil May Care,Fiction,ellie young rich engaged love carefree days marriage new responsibility anything goes including house sitting eccentric aunt kate palatial estate burton virginia ellie feels right home nearly invi...
2252,Pygmalion: A Romance in Five Acts (Penguin Classics),Literary Criticism,professor higgins succeeds transforming unkempt london flower girl society belle
902,Night Train to Memphis,Fiction,assistant curator munich national museum vicky bliss expert egypt ph solving crimes intelligence agency offers luxury nile cruise help solve murder stop heist egyptian antiquities takes plunge vic...
118,Four To Score (A Stephanie Plum Novel),Fiction,stephanie plum trenton new jersey favorite pistol packing condom carrying bounty hunter back trail revenge seeking waitress skipped bail help year old grandma mazur ex hooker lula transvestite mus...
318,Lucy Sullivan Is Getting Married,Fiction,happens psychic tells lucy getting married within year roommates panic going happen blissful existence eating take drinking much wine bringing men home never vacuuming lucy reassures friends far b...


In [12]:
get_book_recom("The Lord of the Rings")[["Book-Title","categories","description"]]

The Lord of the Rings - Fiction - epic detailing great war ring struggle good evil middle earth tiny hobbits play key role



Unnamed: 0,Book-Title,categories,description
84,Cryptonomicon,Fiction,extraordinary first volume promises epoch making masterpiece neal stephenson hacks secret histories nations private obsessions men decrypting dazzling virtuosity forces shaped century lawrence pri...
1278,Mitla Pass,Fiction,writer gideon zadok leaves glitter hollywood newly created state israel learns much love dangerous military operation covers war correspondent
1016,Demonic Males : Apes and the Origins of Human Violence,Nature,draws recent discoveries human evolution examine whether violence among men product primitive heritage searches solutions problems war rape murder
1454,Xenocide : Volume Three of the Ender Quartet (Ender),Fiction,war survival planet lusitania fought heart child named gloriously bright lusitania ender found world humans pequininos hive queen could live together three different intelligent species could find...
570,Jukebox Queen Of Malta: A Novel,Fiction,jukebox queen malta exquisite enchanting novel love war set island perilously balanced real rocco raven intrepid auto mechanic turned corporal brooklyn arrived malta mediterranean island neolithic...


In [13]:
w2v.save("Models/word2vec.model")