# Introduction

**Build a prototype system that assigns keywords to our books:**

Using the list of search queries and the book metadata, build a model that picks the most
relevant search queries for each book, to be used as keywords. The search queries
assigned to each book should be relevant to the themes and genres of the book as
described in the metadata.

*Breaking this down: there is no target variable (ground truth), so this is an unsupervised learning problem, along the lines of a recommendation engine. The prototype should make use of the genre information in the metadata.*

In [175]:
pip install pandas

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Note: you may need to restart the kernel to use updated packages.


In [176]:
pip install nltk



In [177]:
pip install scikit-learn



In [178]:
pip install numpy



# Imports

In [179]:
import pandas as pd, numpy as np
# import seaborn as sns, matplotlib.pyplot as plt
import nltk
nltk.download('stopwords')
nltk.download('punkt')
import re, string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

from nltk.stem import WordNetLemmatizer
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import wordnet
nltk.download('wordnet')

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/pramilabalan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/pramilabalan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/pramilabalan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/pramilabalan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


# Reading Files

In [180]:
metadata_df =  pd.read_csv('title_metadata.csv')

In [181]:
keywords_df = pd.read_csv('search_terms.csv')

In [182]:
metadata_df.shape

(500, 6)

In [183]:
metadata_df.head()

Unnamed: 0,id,title,description,fiction_flag,thema_codes,thema_descriptions
0,0,How to Wean Your Baby,<b>THE <i>SUNDAY TIMES </i>BESTSELLER</b>\r\n<...,non-fiction,"['VFX', 'WBH', 'VF', 'VS']","['Parenting: advice and issues', 'Health and w..."
1,1,Horse Under Water,A sunken U-Boat has lain undisturbed on the At...,fiction,"['FHD', 'FV', 'FJ', '1DSP-PT-P']","['Espionage and spy thriller', 'Historical fic..."
2,2,Lives of Girls and Women,<b>The only novel from bestselling author Alic...,fiction,"['FBA', 'FC', 'FXB', '1KBC-CA-O']",['Modern and contemporary fiction: general and...
3,3,The Precipice,"In this powerful collection of interviews, Noa...",non-fiction,"['JPB', 'QDTS', 'JPF', 'KCS', 'DNP', 'RNT']","['Comparative politics', 'Social and political..."
4,4,Little Wins,There are some 400 million people worldwide wh...,non-fiction,"['KJH', 'KJMB', 'KJW', 'VSC', 'VSPM', 'JMC']","['Entrepreneurship / Start-ups', 'Management: ..."


In [184]:
metadata_df.tail()

Unnamed: 0,id,title,description,fiction_flag,thema_codes,thema_descriptions
495,495,The Frank Family That Survived,'A major contribution to our understanding of ...,non-fiction,"['NHTZ1', 'DNBH', 'NHWR7', 'JBFG', '3MPBLB']","['The Holocaust', 'Biography: historical, poli..."
496,496,V.,"<b>The first novel from the great, incomparabl...",fiction,"['FBA', '1KBB-US-NAKC', '3MPQM']",['Modern and contemporary fiction: general and...
497,497,Business or Pleasure,<b>When Chandler Cohen accepts her next ghostw...,fiction,"['FRD', 'FU', 'FQ', 'FXD']","['Modern and Contemporary romance', 'Humorous ..."
498,498,The Dinner Guest,<b>LONGLISTED FOR THE 2018 MAN BOOKER INTERNAT...,fiction,"['FBA', 'FYT', 'FXL', 'FS', '1DSE']",['Modern and contemporary fiction: general and...
499,499,The Sanatorium,<b>'<i>The Sanatorium </i>will keep you checki...,fiction,"['FF', 'FHX', 'FFP', 'FFS', 'FXR', 'FXL', '1DF...","['Crime and mystery fiction', 'Psychological t..."


In [185]:
metadata_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   id                  500 non-null    int64 
 1   title               500 non-null    object
 2   description         497 non-null    object
 3   fiction_flag        500 non-null    object
 4   thema_codes         500 non-null    object
 5   thema_descriptions  500 non-null    object
dtypes: int64(1), object(5)
memory usage: 23.6+ KB


In [186]:
metadata_df.describe()

Unnamed: 0,id
count,500.0
mean,249.5
std,144.481833
min,0.0
25%,124.75
50%,249.5
75%,374.25
max,499.0


In [187]:
keywords_df.shape

(76707, 1)

In [188]:
keywords_df.head()

Unnamed: 0,search_term
0,winter killer
1,heat magazine uk
2,josie russell
3,korean
4,iron fey


In [189]:
keywords_df.tail()

Unnamed: 0,search_term
76702,iodine
76703,b088qr6qhq
76704,pokemon toys
76705,dambusters books
76706,jonathan stroud books


## Looking into metadata

In [190]:
metadata_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   id                  500 non-null    int64 
 1   title               500 non-null    object
 2   description         497 non-null    object
 3   fiction_flag        500 non-null    object
 4   thema_codes         500 non-null    object
 5   thema_descriptions  500 non-null    object
dtypes: int64(1), object(5)
memory usage: 23.6+ KB


In [191]:
metadata_df['title'].unique().size  # Number of unique titles

500

In [192]:
metadata_df['description'].unique().size # Number of unique descriptions

498

In [193]:
# Looking into duplicated description values
metadata_df[metadata_df['description'].duplicated(keep=False)]

Unnamed: 0,id,title,description,fiction_flag,thema_codes,thema_descriptions
272,272,Wine Folly: Magnum Edition,,non-fiction,"['WBXD1', 'WBXD', 'WTHD', 'GB', 'WB', 'WJX']","['Wines', 'Food and drink: alcoholic beverages..."
325,325,Nobody Leaves,,non-fiction,"['DNP', 'WTLC', 'DNL', '1DTP', '3MPQS']","['Reportage, journalism or collected columns',..."
408,408,Mind Games,,non-fiction,"['WDK', 'WDKX', 'VFD']","['Puzzles and quizzes', 'Trivia and quiz quest..."


*Three of the 500 records have a null description. Since this is a small share, dropping them would be reasonable, but we instead replace the empty cells with the string "null" so that the later text-processing steps do not have to handle missing values.*

In [194]:
metadata_df.shape

(500, 6)

In [195]:
metadata_df.fillna(value = "null", inplace=True)

In [196]:
# Again, looking into the "null" description content
metadata_df[metadata_df['description'].duplicated(keep=False)]

Unnamed: 0,id,title,description,fiction_flag,thema_codes,thema_descriptions
272,272,Wine Folly: Magnum Edition,,non-fiction,"['WBXD1', 'WBXD', 'WTHD', 'GB', 'WB', 'WJX']","['Wines', 'Food and drink: alcoholic beverages..."
325,325,Nobody Leaves,,non-fiction,"['DNP', 'WTLC', 'DNL', '1DTP', '3MPQS']","['Reportage, journalism or collected columns',..."
408,408,Mind Games,,non-fiction,"['WDK', 'WDKX', 'VFD']","['Puzzles and quizzes', 'Trivia and quiz quest..."


In [197]:
metadata_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   id                  500 non-null    int64 
 1   title               500 non-null    object
 2   description         500 non-null    object
 3   fiction_flag        500 non-null    object
 4   thema_codes         500 non-null    object
 5   thema_descriptions  500 non-null    object
dtypes: int64(1), object(5)
memory usage: 23.6+ KB


In [198]:
metadata_df[['thema_codes', 'thema_descriptions']].head()

Unnamed: 0,thema_codes,thema_descriptions
0,"['VFX', 'WBH', 'VF', 'VS']","['Parenting: advice and issues', 'Health and w..."
1,"['FHD', 'FV', 'FJ', '1DSP-PT-P']","['Espionage and spy thriller', 'Historical fic..."
2,"['FBA', 'FC', 'FXB', '1KBC-CA-O']",['Modern and contemporary fiction: general and...
3,"['JPB', 'QDTS', 'JPF', 'KCS', 'DNP', 'RNT']","['Comparative politics', 'Social and political..."
4,"['KJH', 'KJMB', 'KJW', 'VSC', 'VSPM', 'JMC']","['Entrepreneurship / Start-ups', 'Management: ..."


## Looking into keywords

In [199]:
keywords_df.head(20)

Unnamed: 0,search_term
0,winter killer
1,heat magazine uk
2,josie russell
3,korean
4,iron fey
5,ben
6,alan garner kindle
7,she hulk
8,san francisco longing
9,odin alex mason


In [200]:
keywords_df.shape

(76707, 1)

In [201]:
keywords_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 76707 entries, 0 to 76706
Data columns (total 1 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   search_term  76706 non-null  object
dtypes: object(1)
memory usage: 599.4+ KB


# Text Pre-processing

In [202]:
# Lemmatizing: reduce each word to its root form, e.g. young, younger and youngest all map to young.

def get_lemma(text):
    text = str(text)  # non-string values (e.g. NaN) are converted to strings
    lemmatizer = WordNetLemmatizer()

    def pos_tagger(nltk_tag):
        # Map a Penn Treebank tag to the matching WordNet part-of-speech constant
        if nltk_tag.startswith('J'):    # adjective
            return wordnet.ADJ
        elif nltk_tag.startswith('V'):  # verb
            return wordnet.VERB
        elif nltk_tag.startswith('N'):  # noun
            return wordnet.NOUN
        elif nltk_tag.startswith('R'):  # adverb
            return wordnet.ADV
        else:
            return None

    pos_tagged = nltk.pos_tag(nltk.word_tokenize(text))  # get part-of-speech tags
    wordnet_tagged = [(word, pos_tagger(tag)) for word, tag in pos_tagged]

    lemmatized_sentence = []
    for word, tag in wordnet_tagged:
        if tag is None:
            # no usable tag: keep the token as is
            lemmatized_sentence.append(word)
        else:
            # otherwise use the tag to lemmatize the token
            lemmatized_sentence.append(lemmatizer.lemmatize(word, tag))

    return " ".join(lemmatized_sentence)

In [203]:
get_lemma("This is the newest version of the sampling database 10")

'This be the new version of the sample database 10'
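
The tag mapping inside `get_lemma` works because Penn Treebank tags for one word class share a first letter. The sketch below isolates that step without NLTK, using the single-character codes that `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN` and `wordnet.ADV` stand for. The function name `treebank_to_wordnet` and the hand-written tags are assumptions for illustration only.

```python
# First letter of a Penn Treebank tag -> WordNet part-of-speech code.
_TAG_PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}


def treebank_to_wordnet(tag):
    """Return the WordNet POS code for a Treebank tag, or None if unmapped."""
    return _TAG_PREFIX_TO_WORDNET.get(tag[:1])


# Hand-written (word, tag) pairs, not produced by a real tagger.
tagged = [("newest", "JJS"), ("is", "VBZ"), ("database", "NN"),
          ("the", "DT"), ("10", "CD")]
mapped = [(word, treebank_to_wordnet(tag)) for word, tag in tagged]
```

Tokens that map to `None` (determiners, numbers, punctuation) are the ones `get_lemma` passes through unchanged.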

In [204]:
# Removing unnecessary words and punctuation

def get_clean_text(text):
    # lower case
    text = text.lower()

    # remove HTML tags
    text = re.sub(r'<.*?>', "", text)

    # remove punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))

    # remove stopwords (a set gives fast membership checks)
    stop_words = set(stopwords.words('english'))
    text = " ".join(word for word in word_tokenize(text) if word not in stop_words)

    # trim leading and trailing whitespace
    return text.strip()
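
The order of the cleaning steps matters: tags have to be removed before punctuation is stripped, otherwise `<b>` would lose its angle brackets and survive as the token `b`. The sketch below reproduces the same sequence with the standard library only. The tiny `STOP_WORDS` set and the name `clean_text_sketch` are stand-ins (assumptions), not NLTK's stopword list, and tags are replaced by a space rather than an empty string so that adjacent words cannot fuse.

```python
import re
import string

# Illustrative stand-in for NLTK's English stopword list.
STOP_WORDS = frozenset({"the", "a", "an", "of", "and", "is", "to", "in"})
_TAG_RE = re.compile(r"<.*?>")
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text_sketch(text):
    """Lower-case, drop HTML tags, drop punctuation, drop stopwords."""
    text = str(text).lower()
    text = _TAG_RE.sub(" ", text)        # tags first, while '<' and '>' still exist
    text = text.translate(_PUNCT_TABLE)  # then punctuation
    return " ".join(w for w in text.split() if w not in STOP_WORDS)


example = clean_text_sketch("<b>THE <i>SUNDAY TIMES </i>BESTSELLER</b>")
```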

*We first lemmatize and then clean the text. Before that, we combine the description and thema_descriptions columns so that all the text processing happens in one pass.*

In [205]:
metadata_df['complete_content']  = metadata_df['description'] + " " + metadata_df['thema_descriptions']

In [206]:
metadata_df.head()

Unnamed: 0,id,title,description,fiction_flag,thema_codes,thema_descriptions,complete_content
0,0,How to Wean Your Baby,<b>THE <i>SUNDAY TIMES </i>BESTSELLER</b>\r\n<...,non-fiction,"['VFX', 'WBH', 'VF', 'VS']","['Parenting: advice and issues', 'Health and w...",<b>THE <i>SUNDAY TIMES </i>BESTSELLER</b>\r\n<...
1,1,Horse Under Water,A sunken U-Boat has lain undisturbed on the At...,fiction,"['FHD', 'FV', 'FJ', '1DSP-PT-P']","['Espionage and spy thriller', 'Historical fic...",A sunken U-Boat has lain undisturbed on the At...
2,2,Lives of Girls and Women,<b>The only novel from bestselling author Alic...,fiction,"['FBA', 'FC', 'FXB', '1KBC-CA-O']",['Modern and contemporary fiction: general and...,<b>The only novel from bestselling author Alic...
3,3,The Precipice,"In this powerful collection of interviews, Noa...",non-fiction,"['JPB', 'QDTS', 'JPF', 'KCS', 'DNP', 'RNT']","['Comparative politics', 'Social and political...","In this powerful collection of interviews, Noa..."
4,4,Little Wins,There are some 400 million people worldwide wh...,non-fiction,"['KJH', 'KJMB', 'KJW', 'VSC', 'VSPM', 'JMC']","['Entrepreneurship / Start-ups', 'Management: ...",There are some 400 million people worldwide wh...


In [207]:
# Lemmatizing
metadata_df['clean_content'] = metadata_df['complete_content'].apply(lambda x: get_lemma(x))

In [208]:
metadata_df.head()

Unnamed: 0,id,title,description,fiction_flag,thema_codes,thema_descriptions,complete_content,clean_content
0,0,How to Wean Your Baby,<b>THE <i>SUNDAY TIMES </i>BESTSELLER</b>\r\n<...,non-fiction,"['VFX', 'WBH', 'VF', 'VS']","['Parenting: advice and issues', 'Health and w...",<b>THE <i>SUNDAY TIMES </i>BESTSELLER</b>\r\n<...,< b > THE < i > SUNDAY TIMES < /i > BESTSELLER...
1,1,Horse Under Water,A sunken U-Boat has lain undisturbed on the At...,fiction,"['FHD', 'FV', 'FJ', '1DSP-PT-P']","['Espionage and spy thriller', 'Historical fic...",A sunken U-Boat has lain undisturbed on the At...,A sunken U-Boat have lie undisturbed on the At...
2,2,Lives of Girls and Women,<b>The only novel from bestselling author Alic...,fiction,"['FBA', 'FC', 'FXB', '1KBC-CA-O']",['Modern and contemporary fiction: general and...,<b>The only novel from bestselling author Alic...,< b > The only novel from bestselling author A...
3,3,The Precipice,"In this powerful collection of interviews, Noa...",non-fiction,"['JPB', 'QDTS', 'JPF', 'KCS', 'DNP', 'RNT']","['Comparative politics', 'Social and political...","In this powerful collection of interviews, Noa...","In this powerful collection of interview , Noa..."
4,4,Little Wins,There are some 400 million people worldwide wh...,non-fiction,"['KJH', 'KJMB', 'KJW', 'VSC', 'VSPM', 'JMC']","['Entrepreneurship / Start-ups', 'Management: ...",There are some 400 million people worldwide wh...,There be some 400 million people worldwide who...


In [209]:
# Applying the text pre-processing function to remove HTML tags, punctuation and stopwords

metadata_df['clean_content'] = metadata_df['clean_content'].apply(lambda x: get_clean_text(x))

In [210]:
metadata_df.head()

Unnamed: 0,id,title,description,fiction_flag,thema_codes,thema_descriptions,complete_content,clean_content
0,0,How to Wean Your Baby,<b>THE <i>SUNDAY TIMES </i>BESTSELLER</b>\r\n<...,non-fiction,"['VFX', 'WBH', 'VF', 'VS']","['Parenting: advice and issues', 'Health and w...",<b>THE <i>SUNDAY TIMES </i>BESTSELLER</b>\r\n<...,sunday times bestseller charlotte give confide...
1,1,Horse Under Water,A sunken U-Boat has lain undisturbed on the At...,fiction,"['FHD', 'FV', 'FJ', '1DSP-PT-P']","['Espionage and spy thriller', 'Historical fic...",A sunken U-Boat has lain undisturbed on the At...,sunken uboat lie undisturbed atlantic ocean fl...
2,2,Lives of Girls and Women,<b>The only novel from bestselling author Alic...,fiction,"['FBA', 'FC', 'FXB', '1KBC-CA-O']",['Modern and contemporary fiction: general and...,<b>The only novel from bestselling author Alic...,novel bestselling author alice munro winner no...
3,3,The Precipice,"In this powerful collection of interviews, Noa...",non-fiction,"['JPB', 'QDTS', 'JPF', 'KCS', 'DNP', 'RNT']","['Comparative politics', 'Social and political...","In this powerful collection of interviews, Noa...",powerful collection interview noam chomsky exp...
4,4,Little Wins,There are some 400 million people worldwide wh...,non-fiction,"['KJH', 'KJMB', 'KJW', 'VSC', 'VSPM', 'JMC']","['Entrepreneurship / Start-ups', 'Management: ...",There are some 400 million people worldwide wh...,400 million people worldwide whose creativity ...


In [211]:
# Applying the same steps on keywords dataframe

In [212]:
keywords_df.head()

Unnamed: 0,search_term
0,winter killer
1,heat magazine uk
2,josie russell
3,korean
4,iron fey


In [213]:
keywords_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 76707 entries, 0 to 76706
Data columns (total 1 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   search_term  76706 non-null  object
dtypes: object(1)
memory usage: 599.4+ KB


In [214]:
keywords_df['clean_keywords'] = keywords_df['search_term'].apply(lambda x: get_lemma(x))

In [215]:
keywords_df.head(10)

Unnamed: 0,search_term,clean_keywords
0,winter killer,winter killer
1,heat magazine uk,heat magazine uk
2,josie russell,josie russell
3,korean,korean
4,iron fey,iron fey
5,ben,ben
6,alan garner kindle,alan garner kindle
7,she hulk,she hulk
8,san francisco longing,san francisco longing
9,odin alex mason,odin alex mason


In [216]:
keywords_df['clean_keywords'] = keywords_df['clean_keywords'].apply(lambda x: get_clean_text(x))

In [217]:
keywords_df.head(10)

Unnamed: 0,search_term,clean_keywords
0,winter killer,winter killer
1,heat magazine uk,heat magazine uk
2,josie russell,josie russell
3,korean,korean
4,iron fey,iron fey
5,ben,ben
6,alan garner kindle,alan garner kindle
7,she hulk,hulk
8,san francisco longing,san francisco longing
9,odin alex mason,odin alex mason


# Vectorizing Methods

There are a number of methods available to convert text data into numerical form. We considered the following:

* **Bag-of-words (BoW)**: This simply counts the occurrences of words in a document.

* **TF-IDF**: This assigns a weight to each word based on its importance in the document and across the entire corpus. TF-IDF addresses a limitation of BoW, which treats every word as equally important. By giving higher weight to rare words, TF-IDF can better capture what distinguishes a document. However, both methods ignore word order and context, resulting in a loss of semantic information.

* **GloVe**: Represents words as dense vectors in a continuous space. These embeddings capture semantic relationships between words.

* **BERT**: A transformer-based model that creates context-aware word embeddings. It captures bidirectional context by pre-training on a large corpus.

In this document, we will look at TF-IDF, GloVe, and BERT.
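
The difference between raw counts and TF-IDF weights can be sketched on a three-document toy corpus. The example is standard-library only and uses the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which is scikit-learn's documented default for `TfidfVectorizer`. The corpus and the helper names are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus (an assumption, not taken from the dataset).
docs = ["crime thriller book", "crime mystery fiction", "baby wean book"]
n_docs = len(docs)

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for doc in docs for term in set(doc.split()))


def bag_of_words(doc):
    """Raw term counts: every occurrence weighs the same."""
    return Counter(doc.split())


def tfidf(doc):
    """Counts scaled by smoothed idf, then L2-normalised."""
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
        for term, count in bag_of_words(doc).items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


counts = bag_of_words(docs[0])
weights = tfidf(docs[0])
```

In `counts` all three words of the first document have the value 1, whereas in `weights` the word `thriller` (found in one document) outweighs `crime` and `book` (each found in two).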

## TF-IDF

In [218]:
tfidf_vectorizer = TfidfVectorizer(max_features=5000)

In [219]:
X_metadata = tfidf_vectorizer.fit_transform(metadata_df['clean_content'])
X_keywords = tfidf_vectorizer.transform(keywords_df['clean_keywords'])
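
Because the vectorizer is fitted on the book metadata only, a search term contributes to the similarity only through tokens that also occur in the book vocabulary. Terms with no such token become all-zero vectors. The sketch below illustrates that effect with plain sets. It ignores `max_features` and the vectorizer's own tokenisation, and the sample strings are assumptions.

```python
# Toy stand-ins for metadata_df['clean_content'] and keywords_df['clean_keywords'].
book_texts = ["sunken uboat atlantic ocean", "baby wean book"]
search_terms = ["wean book", "pokemon toys", "atlantic", "b088qr6qhq"]

# Vocabulary learned from the books only (what fit_transform sees).
vocabulary = {token for text in book_texts for token in text.split()}


def in_vocab_tokens(term):
    """Tokens of a search term that the book vocabulary knows about."""
    return [token for token in term.split() if token in vocabulary]


# Search terms that would be represented by an all-zero vector.
unmatched = [term for term in search_terms if not in_vocab_tokens(term)]
```

Counting such terms on the real data would show how much of the search-term list can never be assigned to any book under this representation.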

In [220]:
X_metadata.shape

(500, 5000)

## Cosine Similarity with TF-IDF Vector

In [221]:
similarities = cosine_similarity(X_metadata, X_keywords)

In [222]:
len(similarities[0])

76707

In [223]:
similarities.shape

(500, 76707)

In [224]:
similarities.max()

0.8987579607093636

In [225]:
similarities.mean()

0.003018143822553078

In [226]:
similarities.min()

0.0
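
A minimum of 0.0 and a mean close to 0.003 are what one would expect if most search terms share no vocabulary with a given book. The sketch below computes cosine similarity on sparse `{term: weight}` dictionaries to make that visible. It is an illustration with made-up weights, not scikit-learn's implementation, and returning 0.0 for an empty vector is a choice made here to mirror the zero scores seen above.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an all-zero vector matches nothing
    return dot / (norm_u * norm_v)


book = {"baby": 1.0, "wean": 2.0, "book": 1.0}  # made-up weights

shared = cosine(book, {"wean": 1.0})                    # one shared term
disjoint = cosine(book, {"pokemon": 1.0, "toys": 1.0})  # no shared term
empty = cosine(book, {})                                # out-of-vocabulary term
```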

In [228]:
similarities[0].argsort()

array([    0, 49877, 49878, ...,  4310, 41347, 73322])

In [229]:
similarities[0].argsort()[-10:]

array([12937, 51487, 55733, 17338, 36138, 45068, 67217,  4310, 41347,
       73322])

In [230]:
similarities[0].argsort()[-10:][::-1]

array([73322, 41347,  4310, 67217, 45068, 36138, 17338, 55733, 51487,
       12937])

In [231]:
similarities[0][73322]

0.5285209771131831

In [232]:
similarities[0]

array([0.        , 0.        , 0.        , ..., 0.        , 0.02893859,
       0.02893859])

In [233]:
num_top_keywords = 5  # Number of search terms to assign to each book

In [234]:
results = []

In [235]:
book_metadata = metadata_df['clean_content'].copy()
keywords = keywords_df['clean_keywords'].copy()

In [236]:
# Dictionary mapping each book's row index to its keywords with the highest cosine similarity
book_keywords = {}

In [237]:
for i, book in enumerate(book_metadata):
    top_keywords_indices = similarities[i].argsort()[-num_top_keywords:][::-1]
    top_keywords = [keywords[idx] for idx in top_keywords_indices]
    book_keywords[i] = top_keywords
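
The selection loop keeps the `num_top_keywords` highest-scoring entries even when they repeat or score zero. The output for book 498 further down lists "dinner guest" twice, presumably because distinct search terms collapse to the same cleaned string. A possible variant, sketched below with NumPy, skips repeated labels and stops at a score floor. The function name `top_k_unique` and the `min_score` parameter are hypothetical and not part of the notebook.

```python
import numpy as np


def top_k_unique(scores, labels, k=5, min_score=0.0):
    """Return up to k distinct labels, best score first, with score > min_score."""
    order = np.argsort(-scores, kind="stable")  # indices by descending score
    picked, seen = [], set()
    for idx in order:
        if scores[idx] <= min_score:
            break  # everything after this point scores no better
        label = labels[idx]
        if label in seen:
            continue  # same cleaned keyword reached via another search term
        seen.add(label)
        picked.append(label)
        if len(picked) == k:
            break
    return picked


# Made-up scores for one book against five cleaned keywords.
scores = np.array([0.1, 0.9, 0.9, 0.0, 0.5])
labels = ["guest", "dinner guest", "dinner guest", "nan", "dinner"]
best = top_k_unique(scores, labels, k=5)
```

With a floor above zero, a book can end up with fewer than `k` keywords, which may be preferable to padding the list with unrelated terms.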


In [238]:
# book_keywords

In [239]:
keywords_result = pd.Series(book_keywords)

In [240]:
keywords_result.head()

0    [baby wean book, wean, annabel karmel wean, we...
1    [betrayal lie, secret lie, wilmington lie, lie...
2    [munro, orr munro, hp munro, kresley cole munr...
3    [expose, expose jaxson, expose edinburgh, expo...
4    [montessori toddler, toddler, book toddler, to...
dtype: object

In [241]:
keywords_result.tail()

495    [frank mansell, frank gardiner, frank worthing...
496    [pynchon, rankin ian, ian rankin rebus, rebus ...
497    [avery chandler, pulse jenny chandler, rise ch...
498    [dinner guest, dinner guest, grandfather anony...
499    [sanatorium book, sanatorium, crime thriller b...
dtype: object

In [242]:
metadata_df.tail()

Unnamed: 0,id,title,description,fiction_flag,thema_codes,thema_descriptions,complete_content,clean_content
495,495,The Frank Family That Survived,'A major contribution to our understanding of ...,non-fiction,"['NHTZ1', 'DNBH', 'NHWR7', 'JBFG', '3MPBLB']","['The Holocaust', 'Biography: historical, poli...",'A major contribution to our understanding of ...,major contribution understanding second world ...
496,496,V.,"<b>The first novel from the great, incomparabl...",fiction,"['FBA', '1KBB-US-NAKC', '3MPQM']",['Modern and contemporary fiction: general and...,"<b>The first novel from the great, incomparabl...",first novel great incomparable thomas pynchon ...
497,497,Business or Pleasure,<b>When Chandler Cohen accepts her next ghostw...,fiction,"['FRD', 'FU', 'FQ', 'FXD']","['Modern and Contemporary romance', 'Humorous ...",<b>When Chandler Cohen accepts her next ghostw...,chandler cohen accept next ghostwriting gig pe...
498,498,The Dinner Guest,<b>LONGLISTED FOR THE 2018 MAN BOOKER INTERNAT...,fiction,"['FBA', 'FYT', 'FXL', 'FS', '1DSE']",['Modern and contemporary fiction: general and...,<b>LONGLISTED FOR THE 2018 MAN BOOKER INTERNAT...,longlisted 2018 man booker international prize...
499,499,The Sanatorium,<b>'<i>The Sanatorium </i>will keep you checki...,fiction,"['FF', 'FHX', 'FFP', 'FFS', 'FXR', 'FXL', '1DF...","['Crime and mystery fiction', 'Psychological t...",<b>'<i>The Sanatorium </i>will keep you checki...,sanatorium keep check shoulder spinetingling a...


In [243]:
keywords_result.shape

(500,)

In [244]:
keywords_result[270:275] # This slice includes index 272, one of the rows where NaN was replaced with "null"

270    [candle rose, poisen rose, whiskey rose, steel...
271    [mhairi, mhairi oreilly, mhairi mcfarlane book...
272    [food medic, delia frugal food, food arthritis...
273    [convienance store woman, verge, toast, toast ...
274    [read upside, read, bonny read, read comprehen...
dtype: object

In [245]:
metadata_df.shape

(500, 8)

In [246]:
result_df = pd.concat([metadata_df, keywords_result], axis = 1)

In [247]:
result_df.head()

Unnamed: 0,id,title,description,fiction_flag,thema_codes,thema_descriptions,complete_content,clean_content,0
0,0,How to Wean Your Baby,<b>THE <i>SUNDAY TIMES </i>BESTSELLER</b>\r\n<...,non-fiction,"['VFX', 'WBH', 'VF', 'VS']","['Parenting: advice and issues', 'Health and w...",<b>THE <i>SUNDAY TIMES </i>BESTSELLER</b>\r\n<...,sunday times bestseller charlotte give confide...,"[baby wean book, wean, annabel karmel wean, we..."
1,1,Horse Under Water,A sunken U-Boat has lain undisturbed on the At...,fiction,"['FHD', 'FV', 'FJ', '1DSP-PT-P']","['Espionage and spy thriller', 'Historical fic...",A sunken U-Boat has lain undisturbed on the At...,sunken uboat lie undisturbed atlantic ocean fl...,"[betrayal lie, secret lie, wilmington lie, lie..."
2,2,Lives of Girls and Women,<b>The only novel from bestselling author Alic...,fiction,"['FBA', 'FC', 'FXB', '1KBC-CA-O']",['Modern and contemporary fiction: general and...,<b>The only novel from bestselling author Alic...,novel bestselling author alice munro winner no...,"[munro, orr munro, hp munro, kresley cole munr..."
3,3,The Precipice,"In this powerful collection of interviews, Noa...",non-fiction,"['JPB', 'QDTS', 'JPF', 'KCS', 'DNP', 'RNT']","['Comparative politics', 'Social and political...","In this powerful collection of interviews, Noa...",powerful collection interview noam chomsky exp...,"[expose, expose jaxson, expose edinburgh, expo..."
4,4,Little Wins,There are some 400 million people worldwide wh...,non-fiction,"['KJH', 'KJMB', 'KJW', 'VSC', 'VSPM', 'JMC']","['Entrepreneurship / Start-ups', 'Management: ...",There are some 400 million people worldwide wh...,400 million people worldwide whose creativity ...,"[montessori toddler, toddler, book toddler, to..."


In [248]:
result_df.shape

(500, 9)

In [249]:
result_df.tail()

Unnamed: 0,id,title,description,fiction_flag,thema_codes,thema_descriptions,complete_content,clean_content,0
495,495,The Frank Family That Survived,'A major contribution to our understanding of ...,non-fiction,"['NHTZ1', 'DNBH', 'NHWR7', 'JBFG', '3MPBLB']","['The Holocaust', 'Biography: historical, poli...",'A major contribution to our understanding of ...,major contribution understanding second world ...,"[frank mansell, frank gardiner, frank worthing..."
496,496,V.,"<b>The first novel from the great, incomparabl...",fiction,"['FBA', '1KBB-US-NAKC', '3MPQM']",['Modern and contemporary fiction: general and...,"<b>The first novel from the great, incomparabl...",first novel great incomparable thomas pynchon ...,"[pynchon, rankin ian, ian rankin rebus, rebus ..."
497,497,Business or Pleasure,<b>When Chandler Cohen accepts her next ghostw...,fiction,"['FRD', 'FU', 'FQ', 'FXD']","['Modern and Contemporary romance', 'Humorous ...",<b>When Chandler Cohen accepts her next ghostw...,chandler cohen accept next ghostwriting gig pe...,"[avery chandler, pulse jenny chandler, rise ch..."
498,498,The Dinner Guest,<b>LONGLISTED FOR THE 2018 MAN BOOKER INTERNAT...,fiction,"['FBA', 'FYT', 'FXL', 'FS', '1DSE']",['Modern and contemporary fiction: general and...,<b>LONGLISTED FOR THE 2018 MAN BOOKER INTERNAT...,longlisted 2018 man booker international prize...,"[dinner guest, dinner guest, grandfather anony..."
499,499,The Sanatorium,<b>'<i>The Sanatorium </i>will keep you checki...,fiction,"['FF', 'FHX', 'FFP', 'FFS', 'FXR', 'FXL', '1DF...","['Crime and mystery fiction', 'Psychological t...",<b>'<i>The Sanatorium </i>will keep you checki...,sanatorium keep check shoulder spinetingling a...,"[sanatorium book, sanatorium, crime thriller b..."


In [250]:
result_df.rename(columns = {0: "search_term"}, inplace=True)

In [251]:
result_df.tail()

Unnamed: 0,id,title,description,fiction_flag,thema_codes,thema_descriptions,complete_content,clean_content,search_term
495,495,The Frank Family That Survived,'A major contribution to our understanding of ...,non-fiction,"['NHTZ1', 'DNBH', 'NHWR7', 'JBFG', '3MPBLB']","['The Holocaust', 'Biography: historical, poli...",'A major contribution to our understanding of ...,major contribution understanding second world ...,"[frank mansell, frank gardiner, frank worthing..."
496,496,V.,"<b>The first novel from the great, incomparabl...",fiction,"['FBA', '1KBB-US-NAKC', '3MPQM']",['Modern and contemporary fiction: general and...,"<b>The first novel from the great, incomparabl...",first novel great incomparable thomas pynchon ...,"[pynchon, rankin ian, ian rankin rebus, rebus ..."
497,497,Business or Pleasure,<b>When Chandler Cohen accepts her next ghostw...,fiction,"['FRD', 'FU', 'FQ', 'FXD']","['Modern and Contemporary romance', 'Humorous ...",<b>When Chandler Cohen accepts her next ghostw...,chandler cohen accept next ghostwriting gig pe...,"[avery chandler, pulse jenny chandler, rise ch..."
498,498,The Dinner Guest,<b>LONGLISTED FOR THE 2018 MAN BOOKER INTERNAT...,fiction,"['FBA', 'FYT', 'FXL', 'FS', '1DSE']",['Modern and contemporary fiction: general and...,<b>LONGLISTED FOR THE 2018 MAN BOOKER INTERNAT...,longlisted 2018 man booker international prize...,"[dinner guest, dinner guest, grandfather anony..."
499,499,The Sanatorium,<b>'<i>The Sanatorium </i>will keep you checki...,fiction,"['FF', 'FHX', 'FFP', 'FFS', 'FXR', 'FXL', '1DF...","['Crime and mystery fiction', 'Psychological t...",<b>'<i>The Sanatorium </i>will keep you checki...,sanatorium keep check shoulder spinetingling a...,"[sanatorium book, sanatorium, crime thriller b..."


In [252]:
result_df.shape

(500, 9)

In [253]:
tfidf_output = result_df.explode('search_term')

In [254]:
tfidf_output.shape

(2500, 9)

In [255]:
tfidf_output.head()

Unnamed: 0,id,title,description,fiction_flag,thema_codes,thema_descriptions,complete_content,clean_content,search_term
0,0,How to Wean Your Baby,<b>THE <i>SUNDAY TIMES </i>BESTSELLER</b>\r\n<...,non-fiction,"['VFX', 'WBH', 'VF', 'VS']","['Parenting: advice and issues', 'Health and w...",<b>THE <i>SUNDAY TIMES </i>BESTSELLER</b>\r\n<...,sunday times bestseller charlotte give confide...,baby wean book
0,0,How to Wean Your Baby,<b>THE <i>SUNDAY TIMES </i>BESTSELLER</b>\r\n<...,non-fiction,"['VFX', 'WBH', 'VF', 'VS']","['Parenting: advice and issues', 'Health and w...",<b>THE <i>SUNDAY TIMES </i>BESTSELLER</b>\r\n<...,sunday times bestseller charlotte give confide...,wean
0,0,How to Wean Your Baby,<b>THE <i>SUNDAY TIMES </i>BESTSELLER</b>\r\n<...,non-fiction,"['VFX', 'WBH', 'VF', 'VS']","['Parenting: advice and issues', 'Health and w...",<b>THE <i>SUNDAY TIMES </i>BESTSELLER</b>\r\n<...,sunday times bestseller charlotte give confide...,annabel karmel wean
0,0,How to Wean Your Baby,<b>THE <i>SUNDAY TIMES </i>BESTSELLER</b>\r\n<...,non-fiction,"['VFX', 'WBH', 'VF', 'VS']","['Parenting: advice and issues', 'Health and w...",<b>THE <i>SUNDAY TIMES </i>BESTSELLER</b>\r\n<...,sunday times bestseller charlotte give confide...,wean book
0,0,How to Wean Your Baby,<b>THE <i>SUNDAY TIMES </i>BESTSELLER</b>\r\n<...,non-fiction,"['VFX', 'WBH', 'VF', 'VS']","['Parenting: advice and issues', 'Health and w...",<b>THE <i>SUNDAY TIMES </i>BESTSELLER</b>\r\n<...,sunday times bestseller charlotte give confide...,villette charlotte bronte


In [256]:
tfidf_output.to_csv("TfIDf_output.csv")

## BERT

In [257]:
pip install -U sentence-transformers



In [258]:

from sentence_transformers import SentenceTransformer # This is a library for sentence and text embeddings.

**multi-qa-MiniLM-L6-cos-v1** is a pre-trained MiniLM (BERT-style) model that has been fine-tuned for sentence embeddings and semantic search.

In [259]:
sbert_model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
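
Sentence-embedding models of this kind commonly turn per-token vectors into one vector per text by averaging over the real (non-padding) tokens and then scaling to unit length. Whether a given checkpoint does exactly this depends on its configuration. The NumPy sketch below illustrates that pooling step on made-up token vectors. It is not the library's internal code, and `mean_pool` is a hypothetical helper.

```python
import numpy as np


def mean_pool(token_vectors, attention_mask):
    """Average the vectors of unmasked tokens, then L2-normalise the result."""
    mask = attention_mask[:, None].astype(float)  # shape (tokens, 1)
    summed = (token_vectors * mask).sum(axis=0)   # padding rows contribute 0
    count = max(mask.sum(), 1.0)                  # avoid division by zero
    pooled = summed / count
    norm = np.linalg.norm(pooled)
    return pooled / norm if norm > 0 else pooled


# Three token positions, the last one being padding (mask = 0).
token_vectors = np.array([[1.0, 0.0],
                          [0.0, 1.0],
                          [9.0, 9.0]])
attention_mask = np.array([1, 1, 0])
sentence_vector = mean_pool(token_vectors, attention_mask)
```

Unit-length vectors make the dot product equal to cosine similarity, so the same ranking logic used for TF-IDF applies to these embeddings.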

In [260]:
metadata_df.head()

Unnamed: 0,id,title,description,fiction_flag,thema_codes,thema_descriptions,complete_content,clean_content
0,0,How to Wean Your Baby,<b>THE <i>SUNDAY TIMES </i>BESTSELLER</b>\r\n<...,non-fiction,"['VFX', 'WBH', 'VF', 'VS']","['Parenting: advice and issues', 'Health and w...",<b>THE <i>SUNDAY TIMES </i>BESTSELLER</b>\r\n<...,sunday times bestseller charlotte give confide...
1,1,Horse Under Water,A sunken U-Boat has lain undisturbed on the At...,fiction,"['FHD', 'FV', 'FJ', '1DSP-PT-P']","['Espionage and spy thriller', 'Historical fic...",A sunken U-Boat has lain undisturbed on the At...,sunken uboat lie undisturbed atlantic ocean fl...
2,2,Lives of Girls and Women,<b>The only novel from bestselling author Alic...,fiction,"['FBA', 'FC', 'FXB', '1KBC-CA-O']",['Modern and contemporary fiction: general and...,<b>The only novel from bestselling author Alic...,novel bestselling author alice munro winner no...
3,3,The Precipice,"In this powerful collection of interviews, Noa...",non-fiction,"['JPB', 'QDTS', 'JPF', 'KCS', 'DNP', 'RNT']","['Comparative politics', 'Social and political...","In this powerful collection of interviews, Noa...",powerful collection interview noam chomsky exp...
4,4,Little Wins,There are some 400 million people worldwide wh...,non-fiction,"['KJH', 'KJMB', 'KJW', 'VSC', 'VSPM', 'JMC']","['Entrepreneurship / Start-ups', 'Management: ...",There are some 400 million people worldwide wh...,400 million people worldwide whose creativity ...


In [261]:
metadata_df['bert_embeddings'] = metadata_df['clean_content'].apply(lambda x: sbert_model.encode(x))

In [262]:
metadata_df.head()

Unnamed: 0,id,title,description,fiction_flag,thema_codes,thema_descriptions,complete_content,clean_content,bert_embeddings
0,0,How to Wean Your Baby,<b>THE <i>SUNDAY TIMES </i>BESTSELLER</b>\r\n<...,non-fiction,"['VFX', 'WBH', 'VF', 'VS']","['Parenting: advice and issues', 'Health and w...",<b>THE <i>SUNDAY TIMES </i>BESTSELLER</b>\r\n<...,sunday times bestseller charlotte give confide...,"[-0.0536515, -0.05696627, -0.06004447, 0.03443..."
1,1,Horse Under Water,A sunken U-Boat has lain undisturbed on the At...,fiction,"['FHD', 'FV', 'FJ', '1DSP-PT-P']","['Espionage and spy thriller', 'Historical fic...",A sunken U-Boat has lain undisturbed on the At...,sunken uboat lie undisturbed atlantic ocean fl...,"[-0.080311686, 0.0620853, 0.04595251, -0.01543..."
2,2,Lives of Girls and Women,<b>The only novel from bestselling author Alic...,fiction,"['FBA', 'FC', 'FXB', '1KBC-CA-O']",['Modern and contemporary fiction: general and...,<b>The only novel from bestselling author Alic...,novel bestselling author alice munro winner no...,"[-0.06821202, -0.05175161, -0.0088345725, 0.06..."
3,3,The Precipice,"In this powerful collection of interviews, Noa...",non-fiction,"['JPB', 'QDTS', 'JPF', 'KCS', 'DNP', 'RNT']","['Comparative politics', 'Social and political...","In this powerful collection of interviews, Noa...",powerful collection interview noam chomsky exp...,"[0.01108276, 0.05567558, -0.030121665, -0.0085..."
4,4,Little Wins,There are some 400 million people worldwide wh...,non-fiction,"['KJH', 'KJMB', 'KJW', 'VSC', 'VSPM', 'JMC']","['Entrepreneurship / Start-ups', 'Management: ...",There are some 400 million people worldwide wh...,400 million people worldwide whose creativity ...,"[0.019774005, -0.01750415, -0.005929993, -0.04..."


In [263]:
keywords_df.head()

Unnamed: 0,search_term,clean_keywords
0,winter killer,winter killer
1,heat magazine uk,heat magazine uk
2,josie russell,josie russell
3,korean,korean
4,iron fey,iron fey


In [264]:
keywords_df['bert_embeddings']= keywords_df['clean_keywords'].apply(lambda x: sbert_model.encode(x))
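Encoding row by row with `apply` calls the model once per text. `SentenceTransformer.encode` also accepts a list of texts and a `batch_size` argument, so a whole column can be encoded in one call. The following is a minimal sketch under that assumption. `embed_texts` and `_StubEncoder` are hypothetical names that are not part of the notebook, and the stub only stands in for `sbert_model` so the snippet runs without downloading a model.

```python
import numpy as np


def embed_texts(model, texts, batch_size=64):
    """Encode all texts in one batched call and return a 2-D float array.

    `model` is assumed to expose encode(list_of_str, batch_size=...) the way
    sentence_transformers.SentenceTransformer does.
    """
    embeddings = model.encode(list(texts), batch_size=batch_size)
    return np.asarray(embeddings, dtype=np.float32)


class _StubEncoder:
    """Hypothetical stand-in for sbert_model; returns fixed-size dummy vectors."""

    def encode(self, sentences, batch_size=32):
        # One 4-dimensional vector per sentence, derived from its length.
        return np.array([[len(s), 1.0, 0.0, 0.0] for s in sentences])


demo = embed_texts(_StubEncoder(), ["winter killer", "korean"])
print(demo.shape)  # one row per text, one column per embedding dimension
```

With the real model, a call such as `embed_texts(sbert_model, keywords_df['clean_keywords'])` would return the keyword embeddings as a single 2-D array, which could make the later list-to-array conversion unnecessary.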

In [265]:
keywords_df.head()

Unnamed: 0,search_term,clean_keywords,bert_embeddings
0,winter killer,winter killer,"[-0.02831785, 0.102512956, 0.0096159475, 0.087..."
1,heat magazine uk,heat magazine uk,"[-0.0095325615, 0.0519015, -0.034713894, 0.101..."
2,josie russell,josie russell,"[0.008748813, -0.0020361133, 0.019947806, 0.03..."
3,korean,korean,"[-0.041517537, 0.03994494, 0.020216538, -0.028..."
4,iron fey,iron fey,"[-0.034213535, -0.0520466, -0.023690881, 0.130..."


In [266]:
metadata_df['bert_embeddings'].shape

(500,)

In [267]:
keywords_df['bert_embeddings'].shape

(76707,)

In [268]:
# This raises an error: each column is a 1-D Series whose cells are arrays,
# whereas cosine_similarity expects 2-D numeric arrays

# similarities_bert = cosine_similarity(keywords_df['bert_embeddings'], metadata_df['bert_embeddings'])

In [269]:

len(keywords_df['bert_embeddings'])

76707

In [270]:
# Convert the embeddings columns to NumPy arrays
metadata_embeddings = np.array(metadata_df['bert_embeddings'].tolist())
keywords_embeddings = np.array(keywords_df['bert_embeddings'].tolist())
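A pandas column whose cells are arrays reports shape `(n,)` with object dtype, which is why it cannot be passed to `cosine_similarity` directly. The toy example below (made-up values, not the notebook's data) sketches how stacking the cells produces the 2-D array that the function needs.

```python
import numpy as np
import pandas as pd

# Toy column: each cell holds a 3-dimensional embedding (values are made up).
toy = pd.Series([np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])])

print(toy.shape, toy.dtype)          # 1-D Series of Python objects

stacked = np.stack(toy.to_list())    # one row per cell
print(stacked.shape, stacked.dtype)  # 2-D numeric array
```

When every embedding has the same length, the stacked array is already two-dimensional, so a subsequent `reshape(len(...), -1)` leaves it unchanged.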

In [271]:


# Reshape metadata_embeddings and keywords_embeddings if necessary
metadata_embeddings = metadata_embeddings.reshape(len(metadata_df['bert_embeddings']), -1)
keywords_embeddings = keywords_embeddings.reshape(len(keywords_df['bert_embeddings']), -1)





In [272]:
# Compute cosine similarity between every book (rows) and every keyword (columns)
similarity_bert_matrix = cosine_similarity(metadata_embeddings, keywords_embeddings)
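Entry `[i, j]` of the resulting matrix is the similarity between book `i` and keyword `j`, so 500 books and 76,707 keywords give a matrix of shape `(500, 76707)`. To illustrate what the call computes, here is a small sketch of cosine similarity as a dot product of unit-length rows. `cosine_similarity_matrix` is a hypothetical helper written for illustration; it assumes no row is all zeros and is not a replacement for the scikit-learn function.

```python
import numpy as np


def cosine_similarity_matrix(a, b):
    """Return the (len(a), len(b)) matrix of cosine similarities between rows."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Scale every row to unit length; the dot product of unit vectors is the cosine.
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T


books = np.array([[1.0, 0.0], [1.0, 1.0]])              # 2 toy "book" vectors
terms = np.array([[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]])  # 3 toy "keyword" vectors
sim = cosine_similarity_matrix(books, terms)
print(sim.shape)
print(sim.round(3))
```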

In [273]:
# Computing output

## Cosine Similarity with BERT

In [274]:
num_top_keywords = 5

In [275]:
results = []

In [276]:
keywords_df.head()

Unnamed: 0,search_term,clean_keywords,bert_embeddings
0,winter killer,winter killer,"[-0.02831785, 0.102512956, 0.0096159475, 0.087..."
1,heat magazine uk,heat magazine uk,"[-0.0095325615, 0.0519015, -0.034713894, 0.101..."
2,josie russell,josie russell,"[0.008748813, -0.0020361133, 0.019947806, 0.03..."
3,korean,korean,"[-0.041517537, 0.03994494, 0.020216538, -0.028..."
4,iron fey,iron fey,"[-0.034213535, -0.0520466, -0.023690881, 0.130..."


In [277]:
book_metadata = metadata_df['clean_content'].copy()
keywords = keywords_df['clean_keywords'].copy()

In [278]:
book_keywords = {}  # Maps each book index (the dictionary key) to the keywords with the highest cosine similarity

In [279]:
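for i in range(len(book_metadata)):
    # Positions of the num_top_keywords highest scores for book i, best first
    top_keywords_indices = similarity_bert_matrix[i].argsort()[-num_top_keywords:][::-1]
    # .iloc selects by position, matching the column order of the similarity matrix
    top_keywords = [keywords.iloc[idx] for idx in top_keywords_indices]
    book_keywords[i] = top_keywords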
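The results for the first book list 'book baby' twice, which suggests that different search terms can collapse to the same cleaned text. One possible way to avoid repeats is to walk the ranked keywords and skip any text already chosen. This is a sketch only: `top_unique_keywords` is a hypothetical helper, and the scores and keywords below are made up.

```python
import numpy as np


def top_unique_keywords(scores, keywords, k=5):
    """Return up to k distinct keywords, highest score first."""
    order = np.argsort(scores)[::-1]  # indices from best to worst score
    picked = []
    seen = set()
    for idx in order:
        word = keywords[idx]
        if word in seen:
            continue  # same cleaned text already chosen
        seen.add(word)
        picked.append(word)
        if len(picked) == k:
            break
    return picked


toy_scores = np.array([0.2, 0.9, 0.9, 0.5, 0.1])
toy_keywords = ["wean", "book baby", "book baby", "baby food", "korean"]
print(top_unique_keywords(toy_scores, toy_keywords, k=3))
```

In the notebook this could be called with `similarity_bert_matrix[i]` and `keywords.to_list()`; sorting all 76,707 scores per book is slower than slicing the top five, so this trades some speed for distinct results.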


In [280]:
# book_keywords

In [281]:
keywords_bert_result = pd.Series(book_keywords )

In [282]:
keywords_bert_result.head()

0    [deliciously ella cookbook, book baby, book ba...
1    [submarine fiction, submarine thriller, sea li...
2    [book fiction, nonfiction book, fiction book, ...
3    [noam chomsky book, chomsky book, horrible his...
4    [billionaire obsession, passionate billionaire...
dtype: object

In [283]:
keywords_bert_result.tail()

495    [ww2 book, ww2 fiction novel, ww2 fiction, ww1...
496    [historical novel, historical fiction book, cl...
497    [avery chandler, rise chandler johnson, kerry ...
498    [man booker prize, booker prize shortlist, man...
499    [sanatorium book, sanitorium book, sanatorium,...
dtype: object

In [284]:
metadata_df.tail()

Unnamed: 0,id,title,description,fiction_flag,thema_codes,thema_descriptions,complete_content,clean_content,bert_embeddings
495,495,The Frank Family That Survived,'A major contribution to our understanding of ...,non-fiction,"['NHTZ1', 'DNBH', 'NHWR7', 'JBFG', '3MPBLB']","['The Holocaust', 'Biography: historical, poli...",'A major contribution to our understanding of ...,major contribution understanding second world ...,"[-0.026605265, 0.059623726, -0.07832027, -0.01..."
496,496,V.,"<b>The first novel from the great, incomparabl...",fiction,"['FBA', '1KBB-US-NAKC', '3MPQM']",['Modern and contemporary fiction: general and...,"<b>The first novel from the great, incomparabl...",first novel great incomparable thomas pynchon ...,"[-0.00030172136, 0.03130717, 0.024693009, 0.04..."
497,497,Business or Pleasure,<b>When Chandler Cohen accepts her next ghostw...,fiction,"['FRD', 'FU', 'FQ', 'FXD']","['Modern and Contemporary romance', 'Humorous ...",<b>When Chandler Cohen accepts her next ghostw...,chandler cohen accept next ghostwriting gig pe...,"[-0.031196805, -0.06850113, 0.022941642, -0.05..."
498,498,The Dinner Guest,<b>LONGLISTED FOR THE 2018 MAN BOOKER INTERNAT...,fiction,"['FBA', 'FYT', 'FXL', 'FS', '1DSE']",['Modern and contemporary fiction: general and...,<b>LONGLISTED FOR THE 2018 MAN BOOKER INTERNAT...,longlisted 2018 man booker international prize...,"[0.029906986, 0.047078524, -0.09180317, -0.048..."
499,499,The Sanatorium,<b>'<i>The Sanatorium </i>will keep you checki...,fiction,"['FF', 'FHX', 'FFP', 'FFS', 'FXR', 'FXL', '1DF...","['Crime and mystery fiction', 'Psychological t...",<b>'<i>The Sanatorium </i>will keep you checki...,sanatorium keep check shoulder spinetingling a...,"[-0.029930346, -0.05241185, 0.005281358, 0.046..."


In [285]:
keywords_bert_result.shape

(500,)

In [286]:
keywords_result[270:275]  # this range contains one of the NaN rows that we replaced with "null"

270    [candle rose, poisen rose, whiskey rose, steel...
271    [mhairi, mhairi oreilly, mhairi mcfarlane book...
272    [food medic, delia frugal food, food arthritis...
273    [convienance store woman, verge, toast, toast ...
274    [read upside, read, bonny read, read comprehen...
dtype: object

In [287]:
metadata_df.shape

(500, 9)

In [288]:
bert_result_df = pd.concat([metadata_df, keywords_bert_result], axis = 1)

In [289]:
bert_result_df.head()

Unnamed: 0,id,title,description,fiction_flag,thema_codes,thema_descriptions,complete_content,clean_content,bert_embeddings,0
0,0,How to Wean Your Baby,<b>THE <i>SUNDAY TIMES </i>BESTSELLER</b>\r\n<...,non-fiction,"['VFX', 'WBH', 'VF', 'VS']","['Parenting: advice and issues', 'Health and w...",<b>THE <i>SUNDAY TIMES </i>BESTSELLER</b>\r\n<...,sunday times bestseller charlotte give confide...,"[-0.0536515, -0.05696627, -0.06004447, 0.03443...","[deliciously ella cookbook, book baby, book ba..."
1,1,Horse Under Water,A sunken U-Boat has lain undisturbed on the At...,fiction,"['FHD', 'FV', 'FJ', '1DSP-PT-P']","['Espionage and spy thriller', 'Historical fic...",A sunken U-Boat has lain undisturbed on the At...,sunken uboat lie undisturbed atlantic ocean fl...,"[-0.080311686, 0.0620853, 0.04595251, -0.01543...","[submarine fiction, submarine thriller, sea li..."
2,2,Lives of Girls and Women,<b>The only novel from bestselling author Alic...,fiction,"['FBA', 'FC', 'FXB', '1KBC-CA-O']",['Modern and contemporary fiction: general and...,<b>The only novel from bestselling author Alic...,novel bestselling author alice munro winner no...,"[-0.06821202, -0.05175161, -0.0088345725, 0.06...","[book fiction, nonfiction book, fiction book, ..."
3,3,The Precipice,"In this powerful collection of interviews, Noa...",non-fiction,"['JPB', 'QDTS', 'JPF', 'KCS', 'DNP', 'RNT']","['Comparative politics', 'Social and political...","In this powerful collection of interviews, Noa...",powerful collection interview noam chomsky exp...,"[0.01108276, 0.05567558, -0.030121665, -0.0085...","[noam chomsky book, chomsky book, horrible his..."
4,4,Little Wins,There are some 400 million people worldwide wh...,non-fiction,"['KJH', 'KJMB', 'KJW', 'VSC', 'VSPM', 'JMC']","['Entrepreneurship / Start-ups', 'Management: ...",There are some 400 million people worldwide wh...,400 million people worldwide whose creativity ...,"[0.019774005, -0.01750415, -0.005929993, -0.04...","[billionaire obsession, passionate billionaire..."


In [290]:
bert_result_df.rename(columns = {0: "search_term"}, inplace=True)

In [291]:
bert_result_df.tail()

Unnamed: 0,id,title,description,fiction_flag,thema_codes,thema_descriptions,complete_content,clean_content,bert_embeddings,search_term
495,495,The Frank Family That Survived,'A major contribution to our understanding of ...,non-fiction,"['NHTZ1', 'DNBH', 'NHWR7', 'JBFG', '3MPBLB']","['The Holocaust', 'Biography: historical, poli...",'A major contribution to our understanding of ...,major contribution understanding second world ...,"[-0.026605265, 0.059623726, -0.07832027, -0.01...","[ww2 book, ww2 fiction novel, ww2 fiction, ww1..."
496,496,V.,"<b>The first novel from the great, incomparabl...",fiction,"['FBA', '1KBB-US-NAKC', '3MPQM']",['Modern and contemporary fiction: general and...,"<b>The first novel from the great, incomparabl...",first novel great incomparable thomas pynchon ...,"[-0.00030172136, 0.03130717, 0.024693009, 0.04...","[historical novel, historical fiction book, cl..."
497,497,Business or Pleasure,<b>When Chandler Cohen accepts her next ghostw...,fiction,"['FRD', 'FU', 'FQ', 'FXD']","['Modern and Contemporary romance', 'Humorous ...",<b>When Chandler Cohen accepts her next ghostw...,chandler cohen accept next ghostwriting gig pe...,"[-0.031196805, -0.06850113, 0.022941642, -0.05...","[avery chandler, rise chandler johnson, kerry ..."
498,498,The Dinner Guest,<b>LONGLISTED FOR THE 2018 MAN BOOKER INTERNAT...,fiction,"['FBA', 'FYT', 'FXL', 'FS', '1DSE']",['Modern and contemporary fiction: general and...,<b>LONGLISTED FOR THE 2018 MAN BOOKER INTERNAT...,longlisted 2018 man booker international prize...,"[0.029906986, 0.047078524, -0.09180317, -0.048...","[man booker prize, booker prize shortlist, man..."
499,499,The Sanatorium,<b>'<i>The Sanatorium </i>will keep you checki...,fiction,"['FF', 'FHX', 'FFP', 'FFS', 'FXR', 'FXL', '1DF...","['Crime and mystery fiction', 'Psychological t...",<b>'<i>The Sanatorium </i>will keep you checki...,sanatorium keep check shoulder spinetingling a...,"[-0.029930346, -0.05241185, 0.005281358, 0.046...","[sanatorium book, sanitorium book, sanatorium,..."


In [292]:
bert_result_df.shape

(500, 10)

In [293]:
bert_output = bert_result_df.explode('search_term')

In [294]:
bert_output.shape

(2500, 10)

In [295]:
bert_output.head()

Unnamed: 0,id,title,description,fiction_flag,thema_codes,thema_descriptions,complete_content,clean_content,bert_embeddings,search_term
0,0,How to Wean Your Baby,<b>THE <i>SUNDAY TIMES </i>BESTSELLER</b>\r\n<...,non-fiction,"['VFX', 'WBH', 'VF', 'VS']","['Parenting: advice and issues', 'Health and w...",<b>THE <i>SUNDAY TIMES </i>BESTSELLER</b>\r\n<...,sunday times bestseller charlotte give confide...,"[-0.0536515, -0.05696627, -0.06004447, 0.03443...",deliciously ella cookbook
0,0,How to Wean Your Baby,<b>THE <i>SUNDAY TIMES </i>BESTSELLER</b>\r\n<...,non-fiction,"['VFX', 'WBH', 'VF', 'VS']","['Parenting: advice and issues', 'Health and w...",<b>THE <i>SUNDAY TIMES </i>BESTSELLER</b>\r\n<...,sunday times bestseller charlotte give confide...,"[-0.0536515, -0.05696627, -0.06004447, 0.03443...",book baby
0,0,How to Wean Your Baby,<b>THE <i>SUNDAY TIMES </i>BESTSELLER</b>\r\n<...,non-fiction,"['VFX', 'WBH', 'VF', 'VS']","['Parenting: advice and issues', 'Health and w...",<b>THE <i>SUNDAY TIMES </i>BESTSELLER</b>\r\n<...,sunday times bestseller charlotte give confide...,"[-0.0536515, -0.05696627, -0.06004447, 0.03443...",book baby
0,0,How to Wean Your Baby,<b>THE <i>SUNDAY TIMES </i>BESTSELLER</b>\r\n<...,non-fiction,"['VFX', 'WBH', 'VF', 'VS']","['Parenting: advice and issues', 'Health and w...",<b>THE <i>SUNDAY TIMES </i>BESTSELLER</b>\r\n<...,sunday times bestseller charlotte give confide...,"[-0.0536515, -0.05696627, -0.06004447, 0.03443...",charlotte philby book
0,0,How to Wean Your Baby,<b>THE <i>SUNDAY TIMES </i>BESTSELLER</b>\r\n<...,non-fiction,"['VFX', 'WBH', 'VF', 'VS']","['Parenting: advice and issues', 'Health and w...",<b>THE <i>SUNDAY TIMES </i>BESTSELLER</b>\r\n<...,sunday times bestseller charlotte give confide...,"[-0.0536515, -0.05696627, -0.06004447, 0.03443...",oliver vegetable book


In [296]:
bert_output.to_csv("Output_Bert.csv")

# Comparing Results

In [297]:

comparing_df = pd.DataFrame(columns = ['title', 'description', 'search_term_tf_idf', 'search_term_bert'])

In [298]:
comparing_df['title'] = tfidf_output['title'].copy()

In [299]:
comparing_df['description'] = tfidf_output['description'].copy()

In [300]:
comparing_df['search_term_tf_idf'] = tfidf_output['search_term'].copy()

In [301]:
comparing_df['search_term_bert'] = bert_output['search_term'].copy()
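After `explode`, each book's index label repeats once per keyword, so assigning a column from one exploded frame to the other relies on the two indexes being identical. A more defensive variant, sketched here on made-up toy frames, resets both indexes first so that rows are paired by position. The names `tfidf_toy`, `bert_toy` and `side_by_side` are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the two exploded outputs (values are made up).
tfidf_toy = pd.DataFrame(
    {"title": ["A", "A", "B", "B"], "search_term": ["t1", "t2", "t3", "t4"]},
    index=[0, 0, 1, 1],
)
bert_toy = pd.DataFrame(
    {"title": ["A", "A", "B", "B"], "search_term": ["b1", "b2", "b3", "b4"]},
    index=[0, 0, 1, 1],
)

# Drop the repeated labels so rows are matched by position.
left = tfidf_toy.reset_index(drop=True)
right = bert_toy.reset_index(drop=True)
assert len(left) == len(right)

side_by_side = pd.DataFrame(
    {
        "title": left["title"],
        "search_term_tf_idf": left["search_term"],
        "search_term_bert": right["search_term"],
    }
)
print(side_by_side)
```

This pairing is only meaningful if both outputs list the books in the same order with the same number of keywords per book, as they do in this prototype.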

In [302]:
comparing_df.head(20)

Unnamed: 0,title,description,search_term_tf_idf,search_term_bert
0,How to Wean Your Baby,<b>THE <i>SUNDAY TIMES </i>BESTSELLER</b>\r\n<...,baby wean book,deliciously ella cookbook
0,How to Wean Your Baby,<b>THE <i>SUNDAY TIMES </i>BESTSELLER</b>\r\n<...,wean,book baby
0,How to Wean Your Baby,<b>THE <i>SUNDAY TIMES </i>BESTSELLER</b>\r\n<...,annabel karmel wean,book baby
0,How to Wean Your Baby,<b>THE <i>SUNDAY TIMES </i>BESTSELLER</b>\r\n<...,wean book,charlotte philby book
0,How to Wean Your Baby,<b>THE <i>SUNDAY TIMES </i>BESTSELLER</b>\r\n<...,villette charlotte bronte,oliver vegetable book
1,Horse Under Water,A sunken U-Boat has lain undisturbed on the At...,betrayal lie,submarine fiction
1,Horse Under Water,A sunken U-Boat has lain undisturbed on the At...,secret lie,submarine thriller
1,Horse Under Water,A sunken U-Boat has lain undisturbed on the At...,wilmington lie,sea lie
1,Horse Under Water,A sunken U-Boat has lain undisturbed on the At...,lie,ww2 naval fiction
1,Horse Under Water,A sunken U-Boat has lain undisturbed on the At...,lie,drown sea


In [303]:
comparing_df.to_csv("Comparing_results.csv")

# Conclusion


TF-IDF gave results that closely matched terms in the keyword corpus. However, where the description consisted largely of reviews or the author's profile, TF-IDF performed poorly; 'Lives of Girls and Women' by Alice Munro is one example. BERT, on the other hand, did well at finding related terms where synonyms were evidently present. For the book 'Horse Under Water', for instance, the TF-IDF search terms are dominated by the term 'lie', whereas BERT returns more appropriate results such as 'submarine fiction' and 'submarine thriller'.

More pre-processing could be applied to the description content. For example, passages containing the author's bio or reviews from renowned writers could be removed, since they do not describe the book or serve the keyword discovery task. Such text simply adds noise: it conveys readers' reactions rather than information about the book.
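As one possible starting point for such pre-processing, the sketch below drops text wrapped in `<b>...</b>` and then strips the remaining HTML tags. This rests on a strong assumption, namely that promotional lines are the ones set in bold. Several descriptions in this dataset also open with bold summary sentences, so the rule would need checking against real records before use. `strip_promotional_markup` and the sample text are hypothetical.

```python
import html
import re

# Assumption: promotional blurbs are wrapped in <b>...</b>.
BOLD_BLOCK = re.compile(r"<b>.*?</b>", flags=re.IGNORECASE | re.DOTALL)
TAG = re.compile(r"<[^>]+>")


def strip_promotional_markup(description):
    """Drop bold blurbs, then remove remaining tags and collapse whitespace."""
    without_blurbs = BOLD_BLOCK.sub(" ", description)
    without_tags = TAG.sub(" ", without_blurbs)
    return re.sub(r"\s+", " ", html.unescape(without_tags)).strip()


# Made-up sample in the style of the dataset's descriptions.
sample = "<b>THE <i>SUNDAY TIMES</i> BESTSELLER</b>\r\n<p>A guide to weaning.</p>"
print(strip_promotional_markup(sample))
```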

The current prototype fixes the number of keywords per book at 5, selected by cosine similarity. As a result, some assigned keywords have low similarity scores. Experiments could be run to find a suitable threshold for the cosine similarity score and for the number of keywords, which may yield more relevant keywords.
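A threshold experiment of this kind could start from a selection rule that keeps at most a fixed number of keywords but drops any whose score falls below a cutoff. The sketch below uses made-up scores, and the default `min_score=0.5` is an arbitrary placeholder rather than a recommended value. `select_keywords` is a hypothetical helper.

```python
import numpy as np


def select_keywords(scores, keywords, min_score=0.5, max_keywords=5):
    """Keep the best-scoring keywords that reach min_score, at most max_keywords."""
    order = np.argsort(scores)[::-1][:max_keywords]  # best candidates first
    return [(keywords[i], float(scores[i])) for i in order if scores[i] >= min_score]


# Made-up similarity scores for one book.
toy_scores = np.array([0.82, 0.31, 0.64, 0.12])
toy_keywords = ["submarine thriller", "sea lie", "ww2 naval fiction", "korean"]

# Compare how many keywords survive at different cutoffs.
for cutoff in (0.3, 0.6, 0.9):
    print(cutoff, select_keywords(toy_scores, toy_keywords, min_score=cutoff))
```

A book for which no keyword reaches the cutoff ends up with an empty list, so the experiment would also need to decide how such books are handled.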