## Predicting book genre from book description 

Main reference, used heavily throughout: https://www.kaggle.com/code/prathameshgadekar/book-genre-prediction-nlp/notebook

I selected MultinomialNB because the analysis in that notebook found it to be the best model for this dataset. Since I am using the same dataframe (I have not been able to find other dataframes with both book description and genre), I find this choice appropriate.


In [3]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import missingno as msno #For missing value visualization

import plotly.offline as py
py.init_notebook_mode(connected=True)

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder


In [4]:
#For NLP
import re
import nltk
import string
# from wordcloud import WordCloud

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer

In [5]:
#For Modelling Purpose
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression,SGDClassifier
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier,AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB,BernoulliNB,MultinomialNB
from sklearn.dummy import DummyClassifier
from sklearn.tree import ExtraTreeClassifier
from sklearn.multiclass import OneVsRestClassifier

In [6]:
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/elisealstad/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [7]:
data = pd.read_csv('../assets/data.csv')

data.head()

Unnamed: 0,index,title,genre,summary
0,0,Drowned Wednesday,fantasy,Drowned Wednesday is the first Trustee among ...
1,1,The Lost Hero,fantasy,"As the book opens, Jason awakens on a school ..."
2,2,The Eyes of the Overworld,fantasy,Cugel is easily persuaded by the merchant Fia...
3,3,Magic's Promise,fantasy,The book opens with Herald-Mage Vanyel return...
4,4,Taran Wanderer,fantasy,Taran and Gurgi have returned to Caer Dallben...


In [8]:
data.drop('index',inplace = True,axis = 1)
data.head()

Unnamed: 0,title,genre,summary
0,Drowned Wednesday,fantasy,Drowned Wednesday is the first Trustee among ...
1,The Lost Hero,fantasy,"As the book opens, Jason awakens on a school ..."
2,The Eyes of the Overworld,fantasy,Cugel is easily persuaded by the merchant Fia...
3,Magic's Promise,fantasy,The book opens with Herald-Mage Vanyel return...
4,Taran Wanderer,fantasy,Taran and Gurgi have returned to Caer Dallben...


In [9]:
# Cleaning unnecessary text from the string
Stopwords = set(stopwords.words('english'))

def clean(text):
    text = text.lower() # converting to lowercase
    # text = re.sub('[%s]' % re.escape(string.punctuation), ' ',text) #removing punctuation

    text_tokens = word_tokenize(text) # tokenizing
    tw = [word for word in text_tokens if word not in Stopwords] # removing stopwords
    text = " ".join(tw)

    splt = text.split(' ')
    output = [x for x in splt if len(x) > 3] # removing words with length <= 3
    text = " ".join(output)

    text = re.sub(r"\s+[a-zA-Z]\s+", ' ', text) # removing single characters
    text = re.sub('<.*?>+',' ',text) # removing HTML tags
    text = re.sub('\n', ' ',text) # removing newline characters
    text = re.sub(r'\s+', ' ',text) # collapsing multiple spaces
    return text

In [10]:
data['summary'] = data['summary'].map(clean)
data['title'] = data['title'].map(clean)
data

Unnamed: 0,title,genre,summary
0,drowned wednesday,fantasy,drowned wednesday first trustee among morrow d...
1,lost hero,fantasy,book opens jason awakens school unable remembe...
2,eyes overworld,fantasy,cugel easily persuaded merchant fianosther att...
3,magic promise,fantasy,book opens herald-mage vanyel returning countr...
4,taran wanderer,fantasy,taran gurgi returned caer dallben following ev...
...,...,...,...
4652,hounded,fantasy,atticus sullivan last druids lives peacefully ...
4653,charlie chocolate factory,fantasy,charlie bucket wonderful adventure begins find...
4654,rising,fantasy,live dream children born free says like land f...
4655,frostbite,fantasy,rose loves dimitri dimitri might love tasha ma...


In [11]:
# Data preprocessing

# Create the lemmatizer and stemmer once instead of once per word
lemmatizer = WordNetLemmatizer()
stemmer = SnowballStemmer(language = 'english')

def data_preprocessing(text):
    tokens = word_tokenize(text) # tokenization
    tokens = [lemmatizer.lemmatize(word) for word in tokens] # lemmatization
    tokens = [stemmer.stem(word) for word in tokens] # stemming
    return " ".join(tokens)

In [12]:
data['summary'] = data['summary'].apply(data_preprocessing)
data['title'] = data['title'].apply(data_preprocessing)
data

Unnamed: 0,title,genre,summary
0,drown wednesday,fantasy,drown wednesday first truste among morrow day ...
1,lost hero,fantasy,book open jason awaken school unabl rememb any...
2,eye overworld,fantasy,cugel easili persuad merchant fianosth attempt...
3,magic promis,fantasy,book open herald-mag vanyel return countri val...
4,taran wander,fantasy,taran gurgi return caer dallben follow event t...
...,...,...,...
4652,hound,fantasy,atticus sullivan last druid life peac arizona ...
4653,charli chocol factori,fantasy,charli bucket wonder adventur begin find willi...
4654,rise,fantasy,live dream child born free say like land fathe...
4655,frostbit,fantasy,rose love dimitri dimitri might love tasha mas...


In [13]:
# Converting the categorical 'genre' labels to numerical codes

labelencoder = LabelEncoder()
data['genre_vec'] = labelencoder.fit_transform(data['genre'])
data['genre_vec']

0       1
1       1
2       1
3       1
4       1
       ..
4652    1
4653    1
4654    1
4655    1
4656    1
Name: genre_vec, Length: 4657, dtype: int64

In [14]:
labelencoder.inverse_transform(data['genre_vec'])

array(['fantasy', 'fantasy', 'fantasy', ..., 'fantasy', 'fantasy',
       'fantasy'], dtype=object)

In [15]:
cv = CountVectorizer()
X = cv.fit_transform(data['summary'])
y = data['genre_vec']

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

In [17]:
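model = MultinomialNB()
# NOTE: fitting on the full X, y means X_test was seen during training,
# so `prediction` is not a held-out estimate; fit on X_train, y_train to evaluate.
model.fit(X,y)
prediction = model.predict(X_test)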
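
Cell 17 fits on the full `X` and `y`, so predictions on `X_test` do not measure how the model behaves on unseen summaries. A minimal sketch of a held-out evaluation follows. It assumes a small toy corpus (hypothetical, not the notebook's data), and the names `toy_summaries`, `toy_genres`, and `eval_model` are illustrative only. As in the notebook, the vectorizer is fitted before splitting for simplicity.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy stand-in for data['summary'] / data['genre'] (hypothetical examples).
toy_summaries = [
    "wizard dragon magic kingdom", "elf sword magic quest",
    "dragon castle spell prince", "magic ring wizard forest",
    "detective murder killer chase", "spy bomb chase agent",
    "killer hostage police agent", "murder conspiracy spy escape",
]
toy_genres = ["fantasy"] * 4 + ["thriller"] * 4

toy_X = CountVectorizer().fit_transform(toy_summaries)

# stratify keeps the genre proportions equal in both splits;
# random_state makes the split reproducible.
Xtr, Xte, ytr, yte = train_test_split(
    toy_X, toy_genres, test_size=0.25, stratify=toy_genres, random_state=0
)

# Fit on the training split only, then score on the unseen test split.
eval_model = MultinomialNB().fit(Xtr, ytr)
y_pred = eval_model.predict(Xte)

accuracy = accuracy_score(yte, y_pred)
report = classification_report(yte, y_pred, zero_division=0)
print(f"held-out accuracy: {accuracy:.2f}")
print(report)
```

The per-genre precision and recall in the report are more informative than accuracy alone when some genres have far fewer examples than others.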

In [18]:
import pickle

filename = "model.pickle"

# save model
with open(filename, "wb") as f:
    pickle.dump(model, f)

# load model
with open(filename, "rb") as f:
    loaded_model = pickle.load(f)

# the loaded model can be used to compute predictions
y_predicted = loaded_model.predict(X)
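
The pickled model only accepts input produced by the fitted `CountVectorizer`, and its numeric output only makes sense together with the fitted `LabelEncoder`. One option, sketched here on toy data, is to pickle all three objects together. The `bundle` dictionary and the `predict_genre` helper are hypothetical names introduced for illustration, not part of the notebook.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder

# Toy training data (hypothetical examples).
texts = ["wizard dragon magic", "magic elf sword",
         "murder detective killer", "spy killer chase"]
genres = ["fantasy", "fantasy", "thriller", "thriller"]

vec = CountVectorizer()
enc = LabelEncoder()
clf = MultinomialNB().fit(vec.fit_transform(texts), enc.fit_transform(genres))

# Keep everything needed at prediction time in one object.
bundle = {"vectorizer": vec, "encoder": enc, "model": clf}
restored = pickle.loads(pickle.dumps(bundle))  # round trip through bytes


def predict_genre(bundle, descriptions):
    """Vectorize raw descriptions, predict, and decode to genre names."""
    features = bundle["vectorizer"].transform(descriptions)
    codes = bundle["model"].predict(features)
    return bundle["encoder"].inverse_transform(codes).tolist()


print(predict_genre(restored, ["dragon magic castle", "killer spy"]))
```

With a bundle like this, a later session can load one file and predict without refitting the vectorizer or encoder.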

In [19]:
mybooks = pd.read_pickle('../assets/my_books.pkl')
# .copy() avoids a SettingWithCopyWarning when new columns are assigned later
mybooks = mybooks.query('Description.notna()').copy()

In [20]:
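mybooks['Description_cleaned'] = mybooks['Description'].apply(clean)

# data preprocessing
# NOTE: this stems the raw 'Description', not 'Description_cleaned', so stop words
# and punctuation remain in the text (visible in the output below).
mybooks['Description'] = mybooks['Description'].apply(data_preprocessing)
mybooks['Description']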

0      japanes fairi tale - enchant , enigmat stori o...
1      ' a sensual feast of a novel , written with el...
2      a new york time , usa today , and washington p...
3      * the sunday time number one bestsel * * over ...
4      the addict no.1 bestsel that everyon is talk a...
                             ...                        
359    the key to rebecca is a grip thriller set dure...
360    winner of the pulitz prize , a new york time b...
361    one of the most influenti book of the twentiet...
362    in this deepli stir novel , acclaim author cri...
363    mr jone of manor farm is so lazi and drunken t...
Name: Description, Length: 332, dtype: object

In [21]:
# Vectorizing my book descriptions with the CountVectorizer fitted on the training summaries
nyX = cv.transform(mybooks['Description'])

In [22]:
# use the loaded model to predict a genre for each of my books
genre = loaded_model.predict(nyX)

In [23]:
inv = labelencoder.inverse_transform(genre)
print(inv)


['fantasy' 'thriller' 'thriller' 'thriller' 'thriller' 'science'
 'thriller' 'thriller' 'thriller' 'thriller' 'thriller' 'thriller'
 'thriller' 'science' 'travel' 'thriller' 'thriller' 'thriller' 'thriller'
 'thriller' 'thriller' 'thriller' 'thriller' 'thriller' 'thriller'
 'thriller' 'thriller' 'thriller' 'fantasy' 'thriller' 'thriller'
 'fantasy' 'fantasy' 'fantasy' 'thriller' 'thriller' 'thriller' 'thriller'
 'thriller' 'thriller' 'thriller' 'thriller' 'thriller' 'thriller'
 'thriller' 'thriller' 'thriller' 'thriller' 'thriller' 'thriller'
 'travel' 'thriller' 'thriller' 'thriller' 'thriller' 'thriller'
 'thriller' 'thriller' 'thriller' 'thriller' 'thriller' 'thriller'
 'thriller' 'thriller' 'thriller' 'thriller' 'thriller' 'thriller'
 'thriller' 'thriller' 'thriller' 'thriller' 'thriller' 'thriller'
 'sports' 'thriller' 'thriller' 'thriller' 'science' 'thriller' 'thriller'
 'thriller' 'thriller' 'thriller' 'science' 'thriller' 'thriller'
 'thriller' 'thriller' 'thriller' 'thriller'

In [24]:
mybooks['genre'] = inv

In [25]:
mybooks[['Title','genre']].head(50)


Unnamed: 0,Title,genre
0,Night Train to the Stars,fantasy
1,The Language of Food,thriller
2,The House in the Cerulean Sea,thriller
3,Invisible Women: Data Bias in a World Designed...,thriller
4,Gone Girl,thriller
5,Ute av verden,science
6,Local Woman Missing,thriller
7,The Silent Patient,thriller
8,"The Devotion of Suspect X (Detective Galileo, #1)",thriller
9,The Secret History,thriller


In [26]:
data.genre.value_counts()

genre
thriller      1023
fantasy        876
science        647
history        600
horror         600
crime          500
romance        111
psychology     100
sports         100
travel         100
Name: count, dtype: int64

In [27]:
mybooks.genre.value_counts()

genre
thriller      272
fantasy        21
science        19
sports          5
romance         5
psychology      5
travel          3
history         2
Name: count, dtype: int64

# Evaluating prediction of model 

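It seems that many books are not predicted correctly: a large majority of my books are predicted as thrillers. I do read a lot of thrillers, but the share is too big to be true. The initial dataset used for training has a substantial share of thrillers, which may cause the model to predict too many books as thrillers. Another possible cause is that cell 20 stems the raw `Description` rather than `Description_cleaned`, so my descriptions still contain stop words and punctuation that were removed from the training summaries.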
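
To put numbers on this, the two `value_counts()` outputs above can be compared as shares. The sketch below simply retypes those printed counts. The names `train_counts`, `pred_counts`, and `shares` are illustrative.

```python
import pandas as pd

# Counts copied from the two value_counts() outputs above.
train_counts = pd.Series({
    "thriller": 1023, "fantasy": 876, "science": 647, "history": 600,
    "horror": 600, "crime": 500, "romance": 111, "psychology": 100,
    "sports": 100, "travel": 100,
})
pred_counts = pd.Series({
    "thriller": 272, "fantasy": 21, "science": 19, "sports": 5,
    "romance": 5, "psychology": 5, "travel": 3, "history": 2,
})

shares = pd.DataFrame({
    "train_share": train_counts / train_counts.sum(),
    # Genres that were never predicted (horror, crime) get a share of 0.
    "predicted_share": (pred_counts / pred_counts.sum()).reindex(
        train_counts.index, fill_value=0.0
    ),
})
# Ratio above 1 means the genre is predicted more often than it occurs in training.
shares["ratio"] = shares["predicted_share"] / shares["train_share"]
print(shares.round(3))
```

Thrillers make up roughly 22% of the training data but about 82% of the predictions, which hints that class imbalance alone may not explain the skew.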

# Improving the model 
I will try adding another dataset for training, https://github.com/uchidalab/book-dataset/tree/master/Task2, which contains 270K books from Amazon with title and category name.

I will have to:
- make sure the category names shared by the two datasets are identical
- remove some small and unnecessary categories from the Amazon dataframe
- collect descriptions from the Google API in several rounds of at most 49K books each, so as not to exceed the limit of 50K per day
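
The three steps above could be sketched roughly as follows. This is a sketch on a toy dataframe: the helper names (`harmonize_labels`, `drop_small_categories`, `iter_batches`), the example mapping, and the thresholds are all hypothetical, and the actual API call is left out.

```python
import pandas as pd


def harmonize_labels(df, column, mapping):
    """Lowercase and strip category names, then apply a rename mapping."""
    out = df.copy()
    out[column] = out[column].str.strip().str.lower().replace(mapping)
    return out


def drop_small_categories(df, column, min_count):
    """Keep only rows whose category occurs at least min_count times."""
    counts = df[column].value_counts()
    keep = counts[counts >= min_count].index
    return df[df[column].isin(keep)]


def iter_batches(df, batch_size):
    """Yield consecutive slices of at most batch_size rows (one per API round)."""
    for start in range(0, len(df), batch_size):
        yield df.iloc[start:start + batch_size]


# Toy stand-in for the Amazon dataframe (hypothetical rows).
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d", "e"],
    "category": ["Thriller ", "thriller", "Sci-Fi", "sci-fi", "Calendars"],
})

harmonized = harmonize_labels(toy, "category", {"sci-fi": "science"})
kept = drop_small_categories(harmonized, "category", min_count=2)
batch_sizes = [len(batch) for batch in iter_batches(kept, batch_size=3)]
print(kept["category"].value_counts())
print(batch_sizes)
```

For the real data, `batch_size` would be 49,000, with each batch processed on a separate day to stay under the daily limit.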

Model building:
- use a similar approach and test several models: https://www.kaggle.com/code/prathameshgadekar/book-genre-prediction-nlp/notebook
- select the best-predicting model
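
One way to compare several models is cross-validation over a pipeline that includes the vectorizer, so each fold builds its vocabulary from its own training part. The sketch below uses a toy corpus (hypothetical) and three of the classifiers already imported in the notebook. The macro-averaged F1 score is an assumption on my part, chosen because it weights small genres equally. The referenced Kaggle notebook may compare models differently.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus standing in for the combined training data (hypothetical).
toy_texts = [
    "wizard dragon magic kingdom", "elf sword magic quest",
    "dragon castle spell prince", "magic ring wizard forest",
    "sword quest dragon prince", "spell forest elf kingdom",
    "detective murder killer chase", "spy bomb chase agent",
    "killer hostage police agent", "murder conspiracy spy escape",
    "police detective bomb escape", "agent killer conspiracy murder",
]
toy_labels = ["fantasy"] * 6 + ["thriller"] * 6

candidates = {
    "MultinomialNB": MultinomialNB(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "SGDClassifier": SGDClassifier(random_state=0),
}

scores = {}
for name, estimator in candidates.items():
    # The vectorizer is refitted inside each fold, on that fold's training part.
    pipeline = make_pipeline(CountVectorizer(), estimator)
    fold_scores = cross_val_score(
        pipeline, toy_texts, toy_labels, cv=3, scoring="f1_macro"
    )
    scores[name] = fold_scores.mean()

best_name = max(scores, key=scores.get)
for name, score in scores.items():
    print(f"{name}: mean macro F1 = {score:.3f}")
print("best:", best_name)
```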

In [52]:
amazondf = pd.read_pickle('../assets/amazon_books.pkl')
amazondf.head()

Unnamed: 0,title,author,category
8568,The Martian,Andy Weir,thriller
8569,Rogue Lawyer,John Grisham,thriller
8570,The Survivor (A Mitch Rapp Novel),Vince Flynn,thriller
8571,Career of Evil (Cormoran Strike),Robert Galbraith,thriller
8572,The Girl on the Train,Paula Hawkins,thriller


In [43]:
new_labels_data =  {'fantasy':"Science Fiction & Fantasy" ,
                     'psychology': 'psychology and self-help'}
data['genre'] = data['genre'].replace(new_labels_data)

In [55]:
# rename column 'category' to 'genre'
# concat the dataframes
# lowercase all category names

In [58]:
# collect descriptions for the first 49K books without a description via API calls