## Predicting book genre from book description 

Main resource used: https://www.kaggle.com/code/prathameshgadekar/book-genre-prediction-nlp/notebook

I selected MultinomialNB because the analysis in that notebook found it to be the best model for this dataset. Since I am using the same dataframe (I have not been able to find other dataframes with both book description and genre), I consider that choice appropriate here.
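To illustrate the idea behind this choice, here is a minimal sketch of a bag-of-words MultinomialNB classifier. The toy descriptions and genres are invented for illustration and are not taken from the dataset; `toy_pipeline` and the other names are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy descriptions and genres, invented purely for illustration
toy_descriptions = [
    "detective hunts killer through dark city streets",
    "killer stalks detective after murder in the city",
    "wizard and dragon battle for the magic kingdom",
    "young wizard learns magic to save the kingdom",
]
toy_genres = ["thriller", "thriller", "fantasy", "fantasy"]

# CountVectorizer turns each text into word counts;
# MultinomialNB models those counts per genre
toy_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
toy_pipeline.fit(toy_descriptions, toy_genres)

toy_prediction = toy_pipeline.predict(["a dragon guards the magic kingdom"])
print(toy_prediction)
```

The new description shares most of its words with the fantasy examples, so the classifier assigns it to that genre.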


In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import missingno as msno #For missing value visualization

import plotly.offline as py
py.init_notebook_mode(connected=True)

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder


In [2]:
#For NLP
import re
import nltk
import string
# from wordcloud import WordCloud

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer

In [3]:
#For Modelling Purpose
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression,SGDClassifier
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier,AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB,BernoulliNB,MultinomialNB
from sklearn.dummy import DummyClassifier
from sklearn.tree import ExtraTreeClassifier
from sklearn.multiclass import OneVsRestClassifier

In [4]:
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/elisealstad/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

# Data cleaning

In [5]:
data = pd.read_csv('../assets/data.csv')
amazondf = pd.read_pickle('../assets/amazon_books_description.pkl')


In [6]:
data.drop('index',inplace = True,axis = 1)
data = data.rename(columns={'title':'Title', 'summary':'Description' })
new_labels_data =  {'fantasy':"Science Fiction & Fantasy" ,
                     'psychology': 'psychology and self-help'}
data['genre'] = data['genre'].replace(new_labels_data)


In [7]:
amazondf.head(1)

Unnamed: 0,Title,Author(s),Publish_Date,Description,ISBN,Page_Count,Categories,Average_Rating,Rating_Count,Language,genre,Title_org
0,That's That,Colin Broderick,2013-05-07,A brutally honest and deeply affecting memoir ...,9780307716347,368.0,Biography & Autobiography,,,en,,


In [8]:
df = pd.concat([data, amazondf])
df = (
    df.dropna(subset=['Description', 'genre'])
    .filter(items=['Title', 'Description','genre'])
)

In [9]:
#cleaning unnecessary text from the string
Stopwords = set(stopwords.words('english'))

def clean(text):
    text = text.lower() #Converting to lowercase
    text = re.sub('<.*?>+',' ',text) #removing HTML tags (done first, before punctuation removal strips the angle brackets)
    text = re.sub('[%s]' % re.escape(string.punctuation), ' ',text) #removing punctuation

    text_tokens = word_tokenize(text) #tokenizing
    tw = [word for word in text_tokens if word not in Stopwords] #removing stopwords
    text = (" ").join(tw)

    splt = text.split(' ')
    output = [x for x in splt if len(x) > 3] #removing words with length<=3
    text = (" ").join(output)

    text = re.sub(r"\s+[a-zA-Z]\s+", ' ', text) #removing single characters
    text = re.sub('\n', ' ',text) #removal of new line characters
    text = re.sub(r'\s+', ' ',text) #removal of multiple spaces
    return text

In [10]:
from collections import Counter

def most_common_words(column, n=10):
    # Tokenize and lowercase the words
    words = word_tokenize(" ".join(column.str.lower()))

    # Remove stopwords (common words like 'the', 'and', etc.)
    stop_words = set(stopwords.words('english'))
    words = [word for word in words if word.isalnum() and word not in stop_words]

    # Calculate word frequencies
    word_freq = Counter(words)

    # Get the most common words
    common_words = word_freq.most_common(n)

    return common_words

# Get most common words in the 'Description' column
most_common_words_description = most_common_words(df['Description'], n=10)

# Display the result
print("Most common words in 'Description' column:")
for word, count in most_common_words_description:
    print(f"{word}: {count}")

Unnamed: 0,Title,Description,genre
0,Drowned Wednesday,drowned wednesday first trustee among morrow d...,Science Fiction & Fantasy
1,The Lost Hero,book opens jason awakens school unable remembe...,Science Fiction & Fantasy
2,The Eyes of the Overworld,cugel easily persuaded merchant fianosther att...,Science Fiction & Fantasy
3,Magic's Promise,book opens herald mage vanyel returning countr...,Science Fiction & Fantasy
4,Taran Wanderer,taran gurgi returned caer dallben following ev...,Science Fiction & Fantasy
...,...,...,...
837,Pretty Girls,child says stunning… certain book year kathy r...,thriller
838,Go! All in One,ebook printed book include media website acces...,Business & Money
839,Who's There On Halloween?,halloween encourages readers play along clues ...,Children's Books
840,Foreclosure Investing For Dummies,make foreclosure investing work practical easy...,Business & Money


# Data exploration

In [11]:
df.genre.value_counts()

genre
thriller                      1212
Science Fiction & Fantasy     1010
history                        736
science                        647
horror                         600
crime                          500
romance                        276
psychology and self-help       258
sports                         239
travel                         212
Biographies & Memoirs          173
Business & Money               173
Children's Books               169
Politics & Social Sciences     160
Cookbooks, Food & Wine         151
Crafts, Hobbies & Home         146
Health                         143
Teen & YA                      141
Name: count, dtype: int64

In [15]:
for genre in df.genre.unique().tolist():
    print(genre)
    display(df.query('genre == @genre').genre.value_counts())

Science Fiction & Fantasy


genre
Science Fiction & Fantasy    1010
Name: count, dtype: int64

science


genre
science    647
Name: count, dtype: int64

crime


genre
crime    500
Name: count, dtype: int64

history


genre
history    736
Name: count, dtype: int64

horror


genre
horror    600
Name: count, dtype: int64

thriller


genre
thriller    1212
Name: count, dtype: int64

psychology and self-help


genre
psychology and self-help    258
Name: count, dtype: int64

romance


genre
romance    276
Name: count, dtype: int64

sports


genre
sports    239
Name: count, dtype: int64

travel


genre
travel    212
Name: count, dtype: int64

Biographies & Memoirs


genre
Biographies & Memoirs    173
Name: count, dtype: int64

Cookbooks, Food & Wine


genre
Cookbooks, Food & Wine    151
Name: count, dtype: int64

Children's Books


genre
Children's Books    169
Name: count, dtype: int64

Teen & YA


genre
Teen & YA    141
Name: count, dtype: int64

Health


genre
Health    143
Name: count, dtype: int64

Crafts, Hobbies & Home


genre
Crafts, Hobbies & Home    146
Name: count, dtype: int64

Business & Money


genre
Business & Money    173
Name: count, dtype: int64

Politics & Social Sciences


genre
Politics & Social Sciences    160
Name: count, dtype: int64

# Data pre-processing 

In [None]:
# data preprocessing

# Create the lemmatizer and stemmer once instead of once per token
lemmatizer = WordNetLemmatizer()
stemmer = SnowballStemmer(language = 'english')

def data_preprocessing(text):
    tokens = word_tokenize(text) #Tokenization
    tokens = [lemmatizer.lemmatize(word) for word in tokens] #Lemmatization
    tokens = [stemmer.stem(word) for word in tokens] #Stemming
    return " ".join(tokens)

In [None]:
df['Description'] = df['Description'].apply(data_preprocessing)
df

In [None]:
#Encoding the categorical 'genre' labels as integers

labelencoder = LabelEncoder()
df['genre_vec'] = labelencoder.fit_transform(df['genre'])
df['genre_vec']

In [None]:
labelencoder.inverse_transform(df['genre_vec'])

In [None]:
cv = CountVectorizer()
X = cv.fit_transform(df['Description'])
y = df['genre_vec']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

# Test different ML models 

In [None]:
models = [BernoulliNB(),MultinomialNB(),SGDClassifier(),LogisticRegression(),RandomForestClassifier(),GradientBoostingClassifier(),
         AdaBoostClassifier(),SVC(),DummyClassifier(),ExtraTreeClassifier(),KNeighborsClassifier()]

In [None]:
import time
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

Name = []
Accuracy = []
Precision = []
F1_Score = []
Recall = []
Time_Taken = []
for model in models:
    name = type(model).__name__
    Name.append(name)
    begin = time.time()
    model.fit(X_train,y_train)
    prediction = model.predict(X_test)
    end = time.time()
    # sklearn metrics expect (y_true, y_pred); swapping them exchanges precision and recall
    Accuracy.append(accuracy_score(y_test,prediction))
    Precision.append(precision_score(y_test,prediction,average = 'macro'))
    Recall.append(recall_score(y_test,prediction,average = 'macro'))
    F1_Score.append(f1_score(y_test,prediction,average = 'macro'))
    Time_Taken.append(end-begin)
    print(name + ' Successfully Trained')

In [None]:
Dict = {'Name':Name,'Accuracy':Accuracy,'Precision_score':Precision,'Recall_score':Recall,
        'F1_score':F1_Score,'Time Taken':Time_Taken}
model_df = pd.DataFrame(Dict)
model_df

# Run and store best model

In [None]:
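# Select best fitting model and refit it on the full dataset
model = MultinomialNB()
model.fit(X,y)
# Note: X_test is part of X here, so these predictions are not a held-out evaluation
prediction = model.predict(X_test)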

In [None]:
import pickle

filename = "model.pickle"

# save model
with open(filename, "wb") as f:
    pickle.dump(model, f)

# load model
with open(filename, "rb") as f:
    loaded_model = pickle.load(f)

# you can use loaded model to compute predictions
y_predicted = loaded_model.predict(X)

In [None]:
mybooks = pd.read_pickle('../assets/my_books.pkl')
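# .copy() so that the column assignments further down do not write to a slice of the original frame
mybooks = mybooks.query('Description.notna()').copy()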

In [None]:
mybooks['Description_cleaned'] = mybooks['Description'].apply(clean)

# data preprocessing (applied to the cleaned text, matching how the training descriptions were prepared)
mybooks['Description'] = mybooks['Description_cleaned'].apply(data_preprocessing)
mybooks['Description']

In [None]:
#Vectorizing the descriptions with the CountVectorizer fitted on the training data
nyX = cv.transform(mybooks['Description'])

In [None]:
# you can use loaded model to compute predictions
genre = loaded_model.predict(nyX)

In [None]:
inv = labelencoder.inverse_transform(genre)
print(inv)


In [None]:
mybooks['genre'] = inv

In [None]:
mybooks[['Title','genre']].head(50)


In [None]:
mybooks.genre.value_counts()

# Evaluating prediction of model 

It seems that many books are not predicted correctly, and a majority of my books are predicted as thrillers. I do read a lot of thrillers, but the share is too big to be true. The initial dataset used for training has a substantial share of thrillers, which may cause the model to predict too many books as thrillers.
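As a rough check of this suspicion, the sketch below computes each genre's share of the training data from the counts shown in the data exploration section, together with the "balanced" weight each genre would need to contribute equally. The names `genre_counts`, `genre_share` and `balanced_weight` are introduced here for illustration only.

```python
import pandas as pd

# Genre counts copied from the value_counts() output above
genre_counts = pd.Series({
    "thriller": 1212, "Science Fiction & Fantasy": 1010, "history": 736,
    "science": 647, "horror": 600, "crime": 500, "romance": 276,
    "psychology and self-help": 258, "sports": 239, "travel": 212,
    "Biographies & Memoirs": 173, "Business & Money": 173,
    "Children's Books": 169, "Politics & Social Sciences": 160,
    "Cookbooks, Food & Wine": 151, "Crafts, Hobbies & Home": 146,
    "Health": 143, "Teen & YA": 141,
})

# Share of the training data per genre; this is also roughly the
# class prior that MultinomialNB learns when fit_prior=True
genre_share = genre_counts / genre_counts.sum()

# "Balanced" weight: n_samples / (n_classes * class_count)
balanced_weight = genre_counts.sum() / (len(genre_counts) * genre_counts)

print(genre_share.round(3).head())
print(balanced_weight.round(2).head())
```

One possible way to act on this, as an assumption rather than a tested fix, is to pass per-sample weights via the `sample_weight` argument of `MultinomialNB.fit`, or to use `MultinomialNB(fit_prior=False)`, which uses a uniform class prior.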

# Improving the model 
I will try adding another dataset for training, https://github.com/uchidalab/book-dataset/tree/master/Task2, which contains 270K books from Amazon with title and category name.

I will have to:
- make sure common category names are the same across the two datasets
- remove some small and unnecessary categories from the Amazon dataframe
- collect descriptions from the Google API in several rounds of at most 49K books each, so as not to exceed the limit of 50K per day
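The steps above could be sketched roughly as follows. The dataframe contents, the `category` column name, the `category_to_genre` mapping and the `iter_batches` helper are all hypothetical placeholders, and the actual API call is left out.

```python
import pandas as pd

# Hypothetical stand-in for the Amazon title/category table
amazon_titles = pd.DataFrame({
    "Title": ["Book A", "Book B", "Book C", "Book D", "Book E"],
    "category": ["Mystery, Thriller & Suspense", "Calendars", "History",
                 "History", "Mystery, Thriller & Suspense"],
})

# 1) Harmonize category names with the genre labels already in use
#    (illustrative mapping, not the final one)
category_to_genre = {
    "Mystery, Thriller & Suspense": "thriller",
    "History": "history",
}
amazon_titles["genre"] = amazon_titles["category"].replace(category_to_genre)

# 2) Drop categories that are too small to be useful
min_books_per_genre = 2  # tiny threshold for the toy data
genre_sizes = amazon_titles["genre"].value_counts()
kept_genres = genre_sizes[genre_sizes >= min_books_per_genre].index
amazon_titles = amazon_titles[amazon_titles["genre"].isin(kept_genres)].reset_index(drop=True)

# 3) Split the remaining titles into batches, one per day of API requests
def iter_batches(frame, batch_size):
    """Yield consecutive slices of `frame` with at most `batch_size` rows."""
    for start in range(0, len(frame), batch_size):
        yield frame.iloc[start:start + batch_size]

# With the real data, batch_size would be 49_000
batches = list(iter_batches(amazon_titles, batch_size=3))
print([len(batch) for batch in batches])
```

Saving each batch's descriptions to disk before requesting the next one would make it possible to resume after a failed round.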

Model building: 
- use similar approach and test several models: https://www.kaggle.com/code/prathameshgadekar/book-genre-prediction-nlp/notebook
- select the best-predicting model
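A minimal sketch of this comparison is shown below, using cross-validation instead of a single train/test split. This is a substitution on my part, not the approach of the Kaggle notebook. The toy corpus is invented, and the two candidate models stand in for the longer list tested earlier.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus invented for illustration; the real input would be the
# preprocessed descriptions and their genre labels
texts = [
    "detective hunts killer city",
    "killer murder detective night",
    "murder suspect detective city",
    "killer escapes police night",
    "wizard dragon magic kingdom",
    "dragon attacks magic castle",
    "wizard learns magic spell",
    "kingdom castle dragon spell",
]
labels = ["thriller"] * 4 + ["fantasy"] * 4

candidates = {
    "MultinomialNB": MultinomialNB(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}

cv_scores = {}
for name, estimator in candidates.items():
    # Vectorizer inside the pipeline so it is refit on each training fold
    pipeline = make_pipeline(CountVectorizer(), estimator)
    fold_scores = cross_val_score(pipeline, texts, labels, cv=2, scoring="f1_macro")
    cv_scores[name] = fold_scores.mean()

# Pick the candidate with the highest mean macro F1
best_name = max(cv_scores, key=cv_scores.get)
print(cv_scores, best_name)
```

Macro F1 is used here because it weights all genres equally, which seems relevant given the imbalance between thrillers and the smaller categories.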