## Predicting book genre from book description 

highly used resource: https://www.kaggle.com/code/prathameshgadekar/book-genre-prediction-nlp/notebook

Selecting MUltinomialNB since their analysis evaluated that to be the best model for that dataset, and since I am using the same dataframe, as I have not been able to find other dataframes with both book description and genre, I find it appropriate.


In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import missingno as msno #For missing value visualization

import plotly.offline as py
py.init_notebook_mode(connected=True)

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder


In [2]:
#For NLP
import re
import nltk
import string
# from wordcloud import WordCloud

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer

In [3]:
#For Modelling Purpose
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression,SGDClassifier
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier,AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB,BernoulliNB,MultinomialNB
from sklearn.dummy import DummyClassifier
from sklearn.tree import ExtraTreeClassifier
from sklearn.multiclass import OneVsRestClassifier

In [4]:
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/elisealstad/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

# Data cleaning

In [5]:
data = pd.read_csv('../assets/data.csv')
amazondf = pd.read_pickle('../assets/amazon_books_description.pkl')


In [6]:
data.drop('index',inplace = True,axis = 1)
data = data.rename(columns={'title':'Title', 'summary':'Description' })
new_labels_data =  {'fantasy':"Science Fiction & Fantasy" ,
                     'psychology': 'psychology and self-help'}
data['genre'] = data['genre'].replace(new_labels_data)


In [7]:
amazondf.head(1)

Unnamed: 0,Title,Author(s),Publish_Date,Description,ISBN,Page_Count,Categories,Average_Rating,Rating_Count,Language,genre,Title_org
0,That's That,Colin Broderick,2013-05-07,A brutally honest and deeply affecting memoir ...,9780307716347,368.0,Biography & Autobiography,,,en,,


In [8]:
df = pd.concat([data, amazondf])
df = (
    df.dropna(subset=['Description', 'genre'])
    .filter(items=['Title', 'Description','genre'])
)

In [9]:
df.shape

(6946, 3)

In [10]:
#cleaning unecessary text from the string 
Stopwords = set(stopwords.words('english'))

def clean(text):
    text = text.lower() #Converting to lowerCase
    text = re.sub('[%s]' % re.escape(string.punctuation), ' ',text) #removing punctuation
    
    text_tokens = word_tokenize(text) #removing stopwords
    tw = [word for word in text_tokens if not word in Stopwords]
    text = (" ").join(tw)
    
    splt = text.split(' ')
    output = [x for x in splt if len(x) > 3] #removing words with length<=3
    text = (" ").join(output)
    
    text = re.sub(r"\s+[a-zA-Z]\s+", ' ', text) #removing single character 
    text = re.sub('<.*?>+',' ',text) #removing HTML Tags
    text = re.sub('\n', ' ',text) #removal of new line characters
    text = re.sub(r'\s+', ' ',text) #removal of multiple spaces
    return text

In [11]:
df.genre.value_counts()

genre
thriller                      1212
Science Fiction & Fantasy     1010
history                        736
science                        647
horror                         600
crime                          500
romance                        276
psychology and self-help       258
sports                         239
travel                         212
Biographies & Memoirs          173
Business & Money               173
Children's Books               169
Politics & Social Sciences     160
Cookbooks, Food & Wine         151
Crafts, Hobbies & Home         146
Health                         143
Teen & YA                      141
Name: count, dtype: int64

# Data exploration

In [12]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from collections import Counter

def most_common_words(column, n=10):
    # Tokenize and lowercase the words
    words = word_tokenize(" ".join(column.str.lower()))

    # Remove stopwords (common words like 'the', 'and', etc.)
    stop_words = set(stopwords.words('english'))
    words = [word for word in words if word.isalnum() and word not in stop_words]

    # Calculate word frequencies
    word_freq = Counter(words)

    # Get the most common words
    common_words = word_freq.most_common(n)

    return common_words

for genre in df.genre.unique().tolist(): 
    most_common_words_description = most_common_words(df.query('genre==@genre')['Description'], n=10)
    print('\n\n',f"Most common words in {genre} column:")
    for word, count in most_common_words_description:
        print(f"{word}: {count}")



 Most common words in Science Fiction & Fantasy column:
one: 1061
world: 687
new: 651
king: 582
time: 565
two: 547
find: 547
back: 533
also: 485
life: 483


 Most common words in science column:
one: 691
time: 637
world: 486
earth: 464
planet: 425
new: 421
ship: 403
human: 375
first: 343
life: 337


 Most common words in crime column:
one: 550
murder: 450
man: 378
police: 326
also: 314
two: 313
poirot: 297
death: 295
house: 280
found: 280


 Most common words in history column:
one: 682
father: 578
war: 468
two: 442
time: 441
also: 439
story: 428
life: 415
first: 412
new: 410


 Most common words in horror column:
one: 731
anita: 541
house: 453
new: 404
also: 403
man: 385
back: 383
two: 371
life: 365
find: 347


 Most common words in thriller column:
one: 1136
new: 743
two: 585
life: 542
less: 536
man: 519
alex: 485
find: 438
time: 436
back: 420


 Most common words in psychology and self-help column:
book: 216
life: 194
people: 152
new: 146
us: 142
one: 120
love: 116
world: 94
make:

# Data pre-processing 

In [14]:
# data preprocessing 

def data_preprocessing(text):
    tokens = word_tokenize(text) #Tokenization
    tokens = [WordNetLemmatizer().lemmatize(word) for word in tokens] #Lemmetization
    tokens = [SnowballStemmer(language = 'english').stem(word) for word in tokens] #Stemming
    return " ".join(tokens)

In [15]:
df['Description'] = df['Description'].apply(data_preprocessing)
df

Unnamed: 0,Title,Description,genre
0,Drowned Wednesday,drown wednesday is the first truste among the ...,Science Fiction & Fantasy
1,The Lost Hero,"as the book open , jason awaken on a school bu...",Science Fiction & Fantasy
2,The Eyes of the Overworld,cugel is easili persuad by the merchant fianos...,Science Fiction & Fantasy
3,Magic's Promise,the book open with herald-mag vanyel return to...,Science Fiction & Fantasy
4,Taran Wanderer,taran and gurgi have return to caer dallben fo...,Science Fiction & Fantasy
...,...,...,...
837,Pretty Girls,lee child say it ’ s “ stunning… certain to be...,thriller
838,Go! All in One,this is the ebook of the print book and may no...,Business & Money
839,Who's There On Halloween?,who 's there on halloween ? encourag reader to...,Children's Books
840,Foreclosure Investing For Dummies,make foreclosur invest work for you with this ...,Business & Money


In [16]:
#Converting all the categorical features of 'genre' to numerical

labelencoder = LabelEncoder()
df['genre_vec'] = labelencoder.fit_transform(df['genre'])
df['genre_vec']

0       7
1       7
2       7
3       7
4       7
       ..
837    16
838     1
839     2
840     1
841    13
Name: genre_vec, Length: 6946, dtype: int64

In [17]:
labelencoder.inverse_transform(df['genre_vec'])

array(['Science Fiction & Fantasy', 'Science Fiction & Fantasy',
       'Science Fiction & Fantasy', ..., "Children's Books",
       'Business & Money', 'romance'], dtype=object)

In [18]:
cv = CountVectorizer()
X = cv.fit_transform(df['Description'])
y = df['genre_vec']

In [19]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

# Test different ML models 

In [20]:
models = [BernoulliNB(),MultinomialNB(),SGDClassifier(),LogisticRegression(),RandomForestClassifier(),GradientBoostingClassifier(),
         AdaBoostClassifier(),SVC(),DummyClassifier(),ExtraTreeClassifier(),KNeighborsClassifier()]

In [29]:
import time
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
Name = []
Accuracy = []
Precision = []
F1_Score = []
Recall = []
Time_Taken = []
for model in models:
    name = type(model).__name__
    Name.append(name)
    begin = time.time()
    model.fit(X_train,y_train)
    prediction = model.predict(X_test)
    end = time.time()
    Accuracy.append(accuracy_score(prediction,y_test))
    Precision.append(precision_score(prediction,y_test,average = 'macro'))
    Recall.append(recall_score(prediction, y_test, average='macro', zero_division=1))
    F1_Score.append(f1_score(prediction,y_test,average = 'macro'))
    Time_Taken.append(end-begin)
    print(name + ' Successfully Trained')

BernoulliNB Successfully Trained
MultinomialNB Successfully Trained
SGDClassifier Successfully Trained



lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression



LogisticRegression Successfully Trained
RandomForestClassifier Successfully Trained
GradientBoostingClassifier Successfully Trained
AdaBoostClassifier Successfully Trained
SVC Successfully Trained
DummyClassifier Successfully Trained
ExtraTreeClassifier Successfully Trained
KNeighborsClassifier Successfully Trained


In [32]:
Dict = {'Name':Name,'Accuracy':Accuracy,'Precision_score':Precision,'Recall_score':Precision,
        'F1_score':F1_Score,'Time Taken':Time_Taken}
model_df = pd.DataFrame(Dict)
model_df

Unnamed: 0,Name,Accuracy,Precision_score,Recall_score,F1_score,Time Taken
0,BernoulliNB,0.330935,0.137544,0.137544,0.133584,0.052388
1,MultinomialNB,0.534532,0.33941,0.33941,0.351079,0.022617
2,SGDClassifier,0.599281,0.535043,0.535043,0.538543,0.338006
3,LogisticRegression,0.605036,0.551865,0.551865,0.557967,6.776117
4,RandomForestClassifier,0.477698,0.359631,0.359631,0.380613,17.339131
5,GradientBoostingClassifier,0.530935,0.429237,0.429237,0.466317,351.766833
6,AdaBoostClassifier,0.298561,0.235048,0.235048,0.247421,9.82597
7,SVC,0.360432,0.193885,0.193885,0.197525,35.842732
8,DummyClassifier,0.174101,0.055556,0.055556,0.016476,0.000865
9,ExtraTreeClassifier,0.213669,0.186854,0.186854,0.183106,0.320879


# run and store best model 

In [53]:
# Select best fitting model 
model =LogisticRegression(max_iter=1000)
model.fit(X,y)
prediction = model.predict(X_test)

In [54]:
import pickle

filename = "model.pickle"

# save model
pickle.dump(model, open(filename, "wb"))

# load model
loaded_model = pickle.load(open(filename, "rb"))

# you can use loaded model to compute predictions
y_predicted = loaded_model.predict(X)

In [55]:
mybooks = pd.read_pickle('../assets/my_books.pkl')
mybooks = mybooks.query('Description.notna()')

In [56]:
mybooks['Description_cleaned'] = mybooks['Description'].apply(clean)

# data preprocessing 
mybooks['Description'] = mybooks['Description'].apply(data_preprocessing)
mybooks['Description']

0      japanes fairi tale - enchant , enigmat stori o...
1      ' a sensual feast of a novel , written with el...
2      a new york time , usa today , and washington p...
3      * the sunday time number one bestsel * * over ...
4      the addict no.1 bestsel that everyon is talk a...
                             ...                        
359    the key to rebecca is a grip thriller set dure...
360    winner of the pulitz prize , a new york time b...
361    one of the most influenti book of the twentiet...
362    in this deepli stir novel , acclaim author cri...
363    mr jone of manor farm is so lazi and drunken t...
Name: Description, Length: 332, dtype: object

In [57]:
#Converting all the categorical features of 'genre' to numerical
nyX = cv.transform(mybooks['Description'])

In [58]:
# you can use loaded model to compute predictions
genre = loaded_model.predict(nyX)

In [59]:
inv = labelencoder.inverse_transform(genre)
print(inv)


['Science Fiction & Fantasy' 'romance' 'Science Fiction & Fantasy'
 'psychology and self-help' 'romance' 'travel' 'thriller' 'thriller'
 'thriller' 'thriller' 'thriller' 'thriller' 'thriller'
 'Science Fiction & Fantasy' 'travel' 'romance' 'romance' 'thriller'
 'Teen & YA' 'thriller' 'thriller' 'thriller' 'thriller' 'thriller'
 'thriller' 'thriller' 'thriller' 'thriller' 'Teen & YA' 'romance'
 'romance' 'Science Fiction & Fantasy' 'Science Fiction & Fantasy'
 'psychology and self-help' 'thriller' 'thriller' 'horror' 'romance'
 'Science Fiction & Fantasy' 'romance' 'thriller' 'Cookbooks, Food & Wine'
 'thriller' 'thriller' 'thriller' 'thriller' 'Science Fiction & Fantasy'
 'thriller' 'Biographies & Memoirs' 'Science Fiction & Fantasy' 'travel'
 'thriller' 'thriller' 'crime' 'thriller' 'Business & Money' 'thriller'
 'thriller' 'romance' 'thriller' 'thriller' 'thriller' 'thriller'
 'thriller' 'psychology and self-help' 'Teen & YA' 'romance'
 'psychology and self-help' 'Business & Money' '

In [60]:
mybooks['genre'] = inv

In [61]:
mybooks[['Title','genre']].head(50)


Unnamed: 0,Title,genre
0,Night Train to the Stars,Science Fiction & Fantasy
1,The Language of Food,romance
2,The House in the Cerulean Sea,Science Fiction & Fantasy
3,Invisible Women: Data Bias in a World Designed...,psychology and self-help
4,Gone Girl,romance
5,Ute av verden,travel
6,Local Woman Missing,thriller
7,The Silent Patient,thriller
8,"The Devotion of Suspect X (Detective Galileo, #1)",thriller
9,The Secret History,thriller


In [62]:
mybooks.genre.value_counts()

genre
thriller                      114
romance                        50
Science Fiction & Fantasy      35
psychology and self-help       21
travel                         20
history                        14
Biographies & Memoirs          14
science                        13
Teen & YA                      11
Children's Books               11
crime                           6
Business & Money                6
horror                          4
Cookbooks, Food & Wine          3
sports                          3
Politics & Social Sciences      3
Health                          2
Crafts, Hobbies & Home          2
Name: count, dtype: int64

# Evaluating prediction of model 

It seems like a lot of books are not predicted corretly and a majority of my books are predicted as thriller. I do read a lot of thrillers but the share is too big too be true. It does seem like the initival dataset used for prediction have a substantial share of thrillers, which may cause it too predict too many books as thrillers. 

# Improving the model 
I will try and add another dataset for training, https://github.com/uchidalab/book-dataset/tree/master/Task2, which contains 270K books from amazon with title and category name. 

I will have to: 
- make sure common category names between two datasets are the same 
- remove some small and unecessary categories from the amazon dataframe. 
- collect description from google api in several rounds with max 49K books each time to now exceed limit of 50K per day. 

Model building: 
- use similar approach and test several models: https://www.kaggle.com/code/prathameshgadekar/book-genre-prediction-nlp/notebook
- select best predicting model 