## Predicting book genre from book description 
## Multi-class classification

highly used resource: https://www.kaggle.com/code/prathameshgadekar/book-genre-prediction-nlp/notebook

Selecting MUltinomialNB since their analysis evaluated that to be the best model for that dataset, and since I am using the same dataframe, as I have not been able to find other dataframes with both book description and genre, I find it appropriate.

Resources: 
    - dataset: https://github.com/uchidalab/book-dataset/tree/master and collect description from google books api
    - dataset:  https://www.kaggle.com/datasets/athu1105/book-genre-prediction
    - source reference: https://www.kaggle.com/code/majinx/nlp-book-genre-prediction-eda-and-modelling#Testing-Different-Models


In [63]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import missingno as msno #For missing value visualization

import plotly.offline as py
py.init_notebook_mode(connected=True)

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder


In [64]:
#For NLP
import re
import nltk
import string
# from wordcloud import WordCloud

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer

In [65]:
#For Modelling Purpose
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression,SGDClassifier
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier,AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB,BernoulliNB,MultinomialNB
from sklearn.dummy import DummyClassifier
from sklearn.tree import ExtraTreeClassifier
from sklearn.multiclass import OneVsRestClassifier

In [66]:
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/elisealstad/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

# Data cleaning

In [67]:
data = pd.read_csv('../assets/data.csv')
amazondf = pd.read_pickle('../assets/amazon_books_description.pkl')


In [68]:
data.drop('index',inplace = True,axis = 1)
data = data.rename(columns={'title':'Title', 'summary':'Description' })
new_labels_data =  {'fantasy':"Science Fiction & Fantasy" ,
                     'psychology': 'psychology and self-help'}
data['genre'] = data['genre'].replace(new_labels_data)


In [69]:
amazondf.head(1)

Unnamed: 0,Title,Author(s),Publish_Date,Description,ISBN,Page_Count,Categories,Average_Rating,Rating_Count,Language,genre,Title_org
0,That's That,Colin Broderick,2013-05-07,A brutally honest and deeply affecting memoir ...,9780307716347,368.0,Biography & Autobiography,,,en,,


In [70]:
df = pd.concat([data, amazondf])
df = (
    df.dropna(subset=['Description', 'genre'])
    .filter(items=['Title', 'Description','genre'])
)

In [71]:
df.shape

(7716, 3)

In [72]:
#cleaning unecessary text from the string 
Stopwords = set(stopwords.words('english'))

def clean(text):
    text = text.lower() #Converting to lowerCase
    text = re.sub('[%s]' % re.escape(string.punctuation), ' ',text) #removing punctuation
    
    text_tokens = word_tokenize(text) #removing stopwords
    tw = [word for word in text_tokens if not word in Stopwords]
    text = (" ").join(tw)
    
    splt = text.split(' ')
    output = [x for x in splt if len(x) > 3] #removing words with length<=3
    text = (" ").join(output)
    
    text = re.sub(r"\s+[a-zA-Z]\s+", ' ', text) #removing single character 
    text = re.sub('<.*?>+',' ',text) #removing HTML Tags
    text = re.sub('\n', ' ',text) #removal of new line characters
    text = re.sub(r'\s+', ' ',text) #removal of multiple spaces
    return text

In [73]:
df.genre.value_counts()

genre
thriller                      1263
Science Fiction & Fantasy     1051
history                        789
science                        647
horror                         600
crime                          500
romance                        325
psychology and self-help       306
sports                         294
travel                         246
Biographies & Memoirs          235
Business & Money               233
Children's Books               217
Politics & Social Sciences     215
Cookbooks, Food & Wine         208
Crafts, Hobbies & Home         198
Health                         196
Teen & YA                      193
Name: count, dtype: int64

# Data exploration

In [74]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from collections import Counter

def most_common_words(column, n=10):
    # Tokenize and lowercase the words
    words = word_tokenize(" ".join(column.str.lower()))

    # Remove stopwords (common words like 'the', 'and', etc.)
    stop_words = set(stopwords.words('english'))
    words = [word for word in words if word.isalnum() and word not in stop_words]

    # Calculate word frequencies
    word_freq = Counter(words)

    # Get the most common words
    common_words = word_freq.most_common(n)

    return common_words

for genre in df.genre.unique().tolist(): 
    most_common_words_description = most_common_words(df.query('genre==@genre')['Description'], n=10)
    print('\n\n',f"Most common words in {genre} column:")
    for word, count in most_common_words_description:
        print(f"{word}: {count}")



 Most common words in Science Fiction & Fantasy column:
one: 1088
world: 719
new: 677
king: 585
time: 581
find: 554
two: 552
back: 537
life: 495
also: 492


 Most common words in science column:
one: 691
time: 637
world: 486
earth: 464
planet: 425
new: 421
ship: 403
human: 375
first: 343
life: 337


 Most common words in crime column:
one: 550
murder: 450
man: 378
police: 326
also: 314
two: 313
poirot: 297
death: 295
house: 280
found: 280


 Most common words in history column:
one: 701
father: 579
war: 492
time: 450
story: 450
also: 449
new: 449
two: 447
book: 428
first: 427


 Most common words in horror column:
one: 731
anita: 541
house: 453
new: 404
also: 403
man: 385
back: 383
two: 371
life: 365
find: 347


 Most common words in thriller column:
one: 1172
new: 784
two: 595
life: 556
less: 536
man: 534
alex: 488
find: 462
time: 454
first: 437


 Most common words in psychology and self-help column:
book: 266
life: 231
new: 176
us: 174
people: 173
one: 144
love: 142
world: 109
mak

# Data pre-processing 

In [75]:
# data preprocessing 

def data_preprocessing(text):
    tokens = word_tokenize(text) #Tokenization
    tokens = [WordNetLemmatizer().lemmatize(word) for word in tokens] #Lemmetization
    tokens = [SnowballStemmer(language = 'english').stem(word) for word in tokens] #Stemming
    return " ".join(tokens)

In [76]:
df['Description'] = df['Description'].apply(data_preprocessing)
df

Unnamed: 0,Title,Description,genre
0,Drowned Wednesday,drown wednesday is the first truste among the ...,Science Fiction & Fantasy
1,The Lost Hero,"as the book open , jason awaken on a school bu...",Science Fiction & Fantasy
2,The Eyes of the Overworld,cugel is easili persuad by the merchant fianos...,Science Fiction & Fantasy
3,Magic's Promise,the book open with herald-mag vanyel return to...,Science Fiction & Fantasy
4,Taran Wanderer,taran and gurgi have return to caer dallben fo...,Science Fiction & Fantasy
...,...,...,...
814,Claimed,"sometim , the hero must be the villain ... fou...",romance
815,Outsmart Your Cancer,book & cd . this easy-to-read altern treatment...,Health
816,Bistronomy,"finalist for the iacp cookbook award , chef an...","Cookbooks, Food & Wine"
817,A Guide to Japanese Hot Springs,this text is a guid to over 160 of the best ho...,travel


In [77]:
#Converting all the categorical features of 'genre' to numerical

labelencoder = LabelEncoder()
df['genre_vec'] = labelencoder.fit_transform(df['genre'])
df['genre_vec']

0       7
1       7
2       7
3       7
4       7
       ..
814    13
815     5
816     3
817    17
818     5
Name: genre_vec, Length: 7716, dtype: int64

In [78]:
labelencoder.inverse_transform(df['genre_vec'])

array(['Science Fiction & Fantasy', 'Science Fiction & Fantasy',
       'Science Fiction & Fantasy', ..., 'Cookbooks, Food & Wine',
       'travel', 'Health'], dtype=object)

In [79]:
cv = CountVectorizer()
X = cv.fit_transform(df['Description'])
y = df['genre_vec']

In [80]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

# Test different ML models 

In [81]:
models = [BernoulliNB(),MultinomialNB(),SGDClassifier(),LogisticRegression(),RandomForestClassifier(),GradientBoostingClassifier(),
         AdaBoostClassifier(),SVC(),DummyClassifier(),ExtraTreeClassifier(),KNeighborsClassifier()]

In [82]:
import time
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
Name = []
Accuracy = []
Precision = []
F1_Score = []
Recall = []
Time_Taken = []
for model in models:
    name = type(model).__name__
    Name.append(name)
    begin = time.time()
    model.fit(X_train,y_train)
    prediction = model.predict(X_test)
    end = time.time()
    Accuracy.append(accuracy_score(prediction,y_test))
    Precision.append(precision_score(prediction,y_test,average = 'macro'))
    Recall.append(recall_score(prediction, y_test, average='macro', zero_division=1))
    F1_Score.append(f1_score(prediction,y_test,average = 'macro'))
    Time_Taken.append(end-begin)
    print(name + ' Successfully Trained')

BernoulliNB Successfully Trained
MultinomialNB Successfully Trained
SGDClassifier Successfully Trained



lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression



LogisticRegression Successfully Trained
RandomForestClassifier Successfully Trained


KeyboardInterrupt: 

In [None]:
Dict = {'Name':Name,'Accuracy':Accuracy,'Precision_score':Precision,'Recall_score':Precision,
        'F1_score':F1_Score,'Time Taken':Time_Taken}
model_df = pd.DataFrame(Dict)
model_df

Unnamed: 0,Name,Accuracy,Precision_score,Recall_score,F1_score,Time Taken
0,BernoulliNB,0.330935,0.137544,0.137544,0.133584,0.052388
1,MultinomialNB,0.534532,0.33941,0.33941,0.351079,0.022617
2,SGDClassifier,0.599281,0.535043,0.535043,0.538543,0.338006
3,LogisticRegression,0.605036,0.551865,0.551865,0.557967,6.776117
4,RandomForestClassifier,0.477698,0.359631,0.359631,0.380613,17.339131
5,GradientBoostingClassifier,0.530935,0.429237,0.429237,0.466317,351.766833
6,AdaBoostClassifier,0.298561,0.235048,0.235048,0.247421,9.82597
7,SVC,0.360432,0.193885,0.193885,0.197525,35.842732
8,DummyClassifier,0.174101,0.055556,0.055556,0.016476,0.000865
9,ExtraTreeClassifier,0.213669,0.186854,0.186854,0.183106,0.320879


# run and store best model 

In [83]:
# Select best fitting model 
model =LogisticRegression(max_iter=1000)
model.fit(X,y)
prediction = model.predict(X_test)

In [84]:
import pickle

filename = "model.pickle"

# save model
pickle.dump(model, open(filename, "wb"))

# load model
loaded_model = pickle.load(open(filename, "rb"))

# you can use loaded model to compute predictions
y_predicted = loaded_model.predict(X)

In [85]:
mybooks = pd.read_pickle('../assets/my_books.pkl')
mybooks = mybooks.query('Description.notna()')

In [86]:
mybooks['Description_cleaned'] = mybooks['Description'].apply(clean)

# data preprocessing 
mybooks['Description'] = mybooks['Description'].apply(data_preprocessing)
mybooks['Description']

0      japanes fairi tale - enchant , enigmat stori o...
1      ' a sensual feast of a novel , written with el...
2      a new york time , usa today , and washington p...
3      * the sunday time number one bestsel * * over ...
4      the addict no.1 bestsel that everyon is talk a...
                             ...                        
359    the key to rebecca is a grip thriller set dure...
360    winner of the pulitz prize , a new york time b...
361    one of the most influenti book of the twentiet...
362    in this deepli stir novel , acclaim author cri...
363    mr jone of manor farm is so lazi and drunken t...
Name: Description, Length: 332, dtype: object

In [87]:
#Converting all the categorical features of 'genre' to numerical
nyX = cv.transform(mybooks['Description'])

In [88]:
# you can use loaded model to compute predictions
genre = loaded_model.predict(nyX)

In [89]:
inv = labelencoder.inverse_transform(genre)
print(inv)


['Science Fiction & Fantasy' 'travel' 'Science Fiction & Fantasy'
 'psychology and self-help' 'romance' 'travel' 'thriller' 'thriller'
 'thriller' 'thriller' 'thriller' 'thriller' 'thriller'
 'Science Fiction & Fantasy' "Children's Books" 'romance' 'romance'
 'thriller' 'Teen & YA' 'thriller' 'thriller' 'thriller' 'thriller'
 'thriller' 'thriller' 'thriller' 'thriller' 'thriller' 'Teen & YA'
 'romance' 'romance' 'Science Fiction & Fantasy'
 'Science Fiction & Fantasy' 'psychology and self-help' 'thriller'
 'thriller' 'romance' 'romance' 'Science Fiction & Fantasy' 'romance'
 'horror' 'Cookbooks, Food & Wine' 'thriller' 'history' 'thriller'
 'romance' 'Science Fiction & Fantasy' 'thriller' 'Biographies & Memoirs'
 'Science Fiction & Fantasy' 'travel' 'thriller' 'thriller' 'crime'
 'thriller' 'Business & Money' 'thriller' 'thriller' 'romance' 'thriller'
 'thriller' 'thriller' 'thriller' 'thriller' 'psychology and self-help'
 'Teen & YA' 'romance' 'psychology and self-help' 'Business & Mo

In [90]:
mybooks['genre'] = inv

In [93]:
mybooks[['Title','genre']].sample(50)


Unnamed: 0,Title,genre
202,Eva Luna,Children's Books
77,Pretty Girls,thriller
88,Black Mountain,thriller
298,The Unseen (The Barrøy Chronicles #1),thriller
60,"The Da Vinci Code (Robert Langdon, #2)",thriller
280,The Thursday Murder Club (Thursday Murder Club...,crime
42,Maybe in Another Life,Science Fiction & Fantasy
292,Why We Sleep: Unlocking the Power of Sleep and...,psychology and self-help
91,Wahala,romance
287,The Heart's Invisible Furies,thriller


In [94]:
mybooks.genre.value_counts()

genre
thriller                      116
romance                        47
Science Fiction & Fantasy      32
psychology and self-help       23
history                        19
Children's Books               18
travel                         17
Teen & YA                      13
Biographies & Memoirs          11
science                         9
Politics & Social Sciences      6
crime                           5
Business & Money                5
sports                          3
Health                          3
horror                          2
Crafts, Hobbies & Home          2
Cookbooks, Food & Wine          1
Name: count, dtype: int64

# Evaluating prediction of model 

It seems like a lot of books are not predicted corretly and a majority of my books are predicted as thriller. I do read a lot of thrillers but the share is too big too be true. It does seem like the initival dataset used for prediction have a substantial share of thrillers, which may cause it too predict too many books as thrillers. 

# Improving the model 
I will try and add another dataset for training, https://github.com/uchidalab/book-dataset/tree/master/Task2, which contains 270K books from amazon with title and category name. 

I will have to: 
- make sure common category names between two datasets are the same 
- remove some small and unecessary categories from the amazon dataframe. 
- collect description from google api in several rounds with max 49K books each time to now exceed limit of 50K per day. 

Model building: 
- use similar approach and test several models: https://www.kaggle.com/code/prathameshgadekar/book-genre-prediction-nlp/notebook
- select best predicting model 