# Portfolio 3: Predicting the Genre of Books from Summaries

We'll use a set of book summaries from the [CMU Book Summaries Corpus](http://www.cs.cmu.edu/~dbamman/booksummaries.html) in this experiment.  The corpus contains 16,559 summaries and includes metadata, taken from Freebase, about the genre of each book.  Each book can have more than one genre and there are 227 genres listed in total.  To simplify the problem of genre prediction we will select a small number of target genres that occur frequently in the collection and keep only the books with these genre labels.  This will give us one genre label per book.

Our goal in this portfolio is to take this data and build a predictive model to classify the books into one of the five target genres.  We will need to extract suitable features from the texts and select a suitable model to classify them. We will build at least one model, and could build a second to compare the results.

## Data Preparation

The first task is to read the data. It is made available in tab-separated format but has no column headings. We can use `read_csv` to read this but we need to set the separator to `\t` (tab) and supply the column names.  The names come from the [ReadMe](data/booksummaries/README.txt) file.

In [1]:
import pandas as pd
import numpy as np
import nltk
import re
from tqdm import tqdm
import matplotlib.pyplot as plt
%matplotlib inline
names = ['wid', 'fid', 'title', 'author', 'date', 'genres', 'summary']
books = pd.read_csv("data/booksummaries/booksummaries.txt", sep="\t", header=None, names=names, keep_default_na=False)
books.head()

|   | wid | fid | title | author | date | genres | summary |
|---|---|---|---|---|---|---|---|
| 0 | 620 | /m/0hhy | Animal Farm | George Orwell | 1945-08-17 | {"/m/016lj8": "Roman \u00e0 clef", "/m/06nbt":... | Old Major, the old boar on the Manor Farm, ca... |
| 1 | 843 | /m/0k36 | A Clockwork Orange | Anthony Burgess | 1962 | {"/m/06n90": "Science Fiction", "/m/0l67h": "N... | Alex, a teenager living in near-future Englan... |
| 2 | 986 | /m/0ldx | The Plague | Albert Camus | 1947 | {"/m/02m4t": "Existentialism", "/m/02xlf": "Fi... | The text of The Plague is divided into five p... |
| 3 | 1756 | /m/0sww | An Enquiry Concerning Human Understanding | David Hume |  |  | The argument of the Enquiry proceeds by a ser... |
| 4 | 2080 | /m/0wkt | A Fire Upon the Deep | Vernor Vinge |  | {"/m/03lrw": "Hard science fiction", "/m/06n90... | The novel posits that space around the Milky ... |
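
The `genres` column in the output above looks like a JSON dictionary mapping Freebase IDs to genre names, and it is an empty string when no genre is recorded (because of `keep_default_na=False`). As a minimal sketch, assuming each non-empty cell is valid JSON, one cell could be turned into a list of genre names as follows. The helper name `genre_names` and the sample string are illustrative and not part of the notebook.

```python
import json

def genre_names(cell):
    """Return the genre names stored in one `genres` cell (hypothetical helper)."""
    if not cell:  # empty string: no genre metadata for this book
        return []
    return list(json.loads(cell).values())

# Illustrative cell built from IDs visible in the table; real cells may differ.
sample = '{"/m/03lrw": "Hard science fiction", "/m/06n90": "Science Fiction"}'
print(genre_names(sample))
print(genre_names(""))
```

Working with parsed names would allow exact matching against a target genre, whereas the notebook matches on substrings of the raw cell.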


We next filter the data so that only our target genre labels are included, and we assign each text to just one of the genre labels.  A text could carry two of these labels (e.g. Science Fiction and Fantasy); in that case it keeps whichever matching label comes last in the `target_genres` list, because each pass of the loop overwrites earlier assignments.

In [2]:
target_genres = ["Children's literature",
                 'Science Fiction',
                 'Novel',
                 'Fantasy',
                 'Mystery']

# create a Series of empty strings the same length as the list of books
genre = pd.Series(np.repeat("", books.shape[0]))
# look for each target genre and set the corresponding entries in the genre series to the genre label
for g in target_genres:
    genre[books['genres'].str.contains(g)] = g

# add this to the book dataframe and then select only those rows that have a genre label
# drop some useless columns
books['genre'] = genre
genre_books = books[genre!=''].drop(['genres', 'fid', 'wid'], axis=1)

genre_books.shape

(8954, 5)
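
The effect of list order on the assigned label can be checked on a single value. The sketch below is a plain-Python analogue of the masked assignment in the loop above, not the notebook's code: `assign_genre` and the sample cell are hypothetical, and the `in` test stands in for `str.contains`, which behaves like a substring test for these literal genre names.

```python
target_genres = ["Children's literature", "Science Fiction", "Novel", "Fantasy", "Mystery"]

def assign_genre(genres_cell, targets=target_genres):
    """Mimic the loop for one cell: each matching target overwrites the previous one."""
    label = ""
    for g in targets:
        if g in genres_cell:  # substring test on the raw cell text
            label = g
    return label

# Hypothetical cell that carries two of the target genres.
cell = '{"a": "Science Fiction", "b": "Fantasy"}'
print(assign_genre(cell))                                          # Fantasy
print(assign_genre(cell, targets=["Fantasy", "Science Fiction"]))  # Science Fiction
```

Reordering `target_genres` therefore changes the per-genre counts for books that carry more than one target label.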

In [3]:
# check how many books we have in each genre category
genre_books.groupby('genre').count()

| genre | title | author | date | summary |
|---|---|---|---|---|
| Children's literature | 1092 | 1092 | 1092 | 1092 |
| Fantasy | 2311 | 2311 | 2311 | 2311 |
| Mystery | 1396 | 1396 | 1396 | 1396 |
| Novel | 2258 | 2258 | 2258 | 2258 |
| Science Fiction | 1897 | 1897 | 1897 | 1897 |


## Modelling

With the data prepared, we now build a model and present our results.

In [4]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import f1_score, accuracy_score
from sklearn import preprocessing

In [5]:
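def clean_summary(text):
    # drop apostrophes so contractions stay as one token
    text = re.sub("'", "", text)
    # replace every character that is not an ASCII letter with a space
    text = re.sub("[^a-zA-Z]", " ", text)
    # collapse runs of whitespace and lowercase
    text = ' '.join(text.split())
    text = text.lower()
    return text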

In [6]:
genre_books['clean_summary'] = genre_books['summary'].apply(clean_summary)
genre_books.head(1)

|   | title | author | date | summary | genre | clean_summary |
|---|---|---|---|---|---|---|
| 0 | Animal Farm | George Orwell | 1945-08-17 | Old Major, the old boar on the Manor Farm, ca... | Children's literature | old major the old boar on the manor farm calls... |
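
To see what the cleaning step keeps and discards, it helps to run it on a short string. The sketch below repeats the function from the cell above so that it runs on its own; the sample sentences are made up for illustration.

```python
import re

def clean_summary(text):
    text = re.sub("'", "", text)           # drop apostrophes: "doesn't" -> "doesnt"
    text = re.sub("[^a-zA-Z]", " ", text)  # every non-ASCII-letter becomes a space
    return " ".join(text.split()).lower()  # collapse whitespace, then lowercase

print(clean_summary("Alex, a 15-year-old, doesn't obey!"))  # alex a year old doesnt obey
print(clean_summary("Roman à clef"))                        # roman clef
```

Digits and accented letters are removed along with punctuation, so numbers and non-English names in a summary do not reach the feature extraction step when the cleaned text is used.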


In [7]:
# requires the NLTK stopword list (nltk.download('stopwords')) to be available
stop_words = set(stopwords.words('english'))
def remove_stopwords(text):
    no_stopword_text = [w for w in text.split() if w not in stop_words]
    return ' '.join(no_stopword_text)

genre_books['clean_summary'] = genre_books['clean_summary'].apply(remove_stopwords)

In [8]:
le = preprocessing.LabelEncoder()
le.fit(genre_books['genre'])
y = le.transform(genre_books['genre'])

In [9]:
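# NOTE: this splits the raw 'summary' column, not 'clean_summary', so the vectoriser
# below is fitted on uncleaned text; pass genre_books['clean_summary'] to use the cleaned text
X_train, X_test, y_train, y_test = train_test_split(genre_books['summary'], y, test_size=0.2)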
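
The split above is unseeded, so the reported scores will vary from run to run, and it does not control the genre mix in each part even though the genre counts range from 1,092 to 2,311. A minimal sketch of a reproducible, stratified split on toy data follows; `texts` and `labels` are stand-ins for the summaries and encoded genres, and the seed value 42 is an arbitrary choice.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy stand-ins for the summaries and encoded labels (illustrative only).
texts = [f"summary {i}" for i in range(20)]
labels = [0] * 10 + [1] * 5 + [2] * 5

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels,
    test_size=0.2,
    stratify=labels,   # keep class proportions the same in both parts
    random_state=42,   # fixed seed so the split is repeatable
)
print(Counter(y_tr), Counter(y_te))
```

With `stratify`, each class contributes the same fraction of its examples to the test set, which makes per-genre comparisons between runs more meaningful.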

In [10]:
tf_idf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=10000)
train = tf_idf_vectorizer.fit_transform(X_train)
test = tf_idf_vectorizer.transform(X_test)

In [11]:
lr = LogisticRegression()
clf = OneVsRestClassifier(lr)
clf.fit(train, y_train)

OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weight=None,
                                                 dual=False, fit_intercept=True,
                                                 intercept_scaling=1,
                                                 l1_ratio=None, max_iter=100,
                                                 multi_class='auto',
                                                 n_jobs=None, penalty='l2',
                                                 random_state=None,
                                                 solver='lbfgs', tol=0.0001,
                                                 verbose=0, warm_start=False),
                    n_jobs=None)

In [12]:
y_pred = clf.predict(test)
f1_score(y_test, y_pred, average="micro"), accuracy_score(y_test, y_pred)

(0.7213847012841987, 0.7213847012841987)
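
The two numbers above are identical, which is expected rather than a coincidence: when every book has exactly one true and one predicted label, each mistake counts as one false positive (for the predicted class) and one false negative (for the true class), so micro-averaged F1 reduces to accuracy. The sketch below checks this from first principles on made-up labels; the function names and data are illustrative and do not use scikit-learn.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 from counts pooled over all classes."""
    pairs = list(zip(y_true, y_pred))
    classes = set(y_true) | set(y_pred)
    tp = sum(t == c and p == c for c in classes for t, p in pairs)
    fp = sum(t != c and p == c for c in classes for t, p in pairs)
    fn = sum(t == c and p != c for c in classes for t, p in pairs)
    return 2 * tp / (2 * tp + fp + fn)

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up labels for eight books over five classes.
y_true = [0, 1, 2, 2, 1, 0, 3, 4]
y_pred = [0, 2, 2, 2, 1, 1, 3, 0]
print(micro_f1(y_true, y_pred), accuracy(y_true, y_pred))  # 0.625 0.625
```

Because the genres are unevenly represented, a macro average (`average="macro"` in `f1_score`) would give additional information about the smaller classes.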

In [13]:
pred_probability = clf.predict_proba(test)

In [14]:
threshold_value = 0.3
predprobability = (pred_probability >= threshold_value).astype(int)
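
Thresholding the probabilities produces a multi-hot matrix rather than one label per book, so a row can end up with several genres or with none. The sketch below shows both cases on hypothetical probabilities (three books, five genres); the values are invented for illustration. The thresholded matrix is not used by the later cells of this notebook.

```python
import numpy as np

# Hypothetical class probabilities; each row sums to 1.
proba = np.array([
    [0.70, 0.10, 0.10, 0.05, 0.05],  # one confident genre
    [0.40, 0.35, 0.10, 0.10, 0.05],  # two genres clear the threshold
    [0.25, 0.25, 0.20, 0.15, 0.15],  # no genre clears the threshold
])
threshold_value = 0.3
multi_hot = (proba >= threshold_value).astype(int)

print(multi_hot)
print(multi_hot.sum(axis=1))  # number of genres assigned per book: [1 2 0]
```

For a single-label task such as this one, `clf.predict` (or the argmax of each row) is the simpler way to obtain exactly one genre per book.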

In [17]:
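def prediction(text):
    # clean the text, remove stop words, then vectorise and classify
    # (the parameter was renamed from `eval`, which shadows the Python built-in)
    text = clean_summary(text)
    text = remove_stopwords(text)
    text_vec = tf_idf_vectorizer.transform([text])
    pred = clf.predict(text_vec)
    return le.inverse_transform(pred)

# spot-check ten randomly chosen books from the test set
for i in range(10):
    k = X_test.sample(1).index[0]
    print("Book: ", genre_books['title'].loc[k], "\nPredicted genre: ", prediction(X_test.loc[k]))
    print("Actual genre: ", genre_books['genre'].loc[k], "\n")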

Book:  Lirael 
Predicted genre:  ['Fantasy']
Actual genre:  Fantasy 

Book:  When the Moon Forgot 
Predicted genre:  ["Children's literature"]
Actual genre:  Children's literature 

Book:  The Prisoner of Zenda 
Predicted genre:  ['Fantasy']
Actual genre:  Novel 

Book:  Aelita 
Predicted genre:  ['Science Fiction']
Actual genre:  Science Fiction 

Book:  On the Road 
Predicted genre:  ['Novel']
Actual genre:  Novel 

Book:  Ellen Foster 
Predicted genre:  ['Novel']
Actual genre:  Novel 

Book:  No More Dead Dogs 
Predicted genre:  ['Novel']
Actual genre:  Children's literature 

Book:  The Good Master 
Predicted genre:  ["Children's literature"]
Actual genre:  Children's literature 

Book:  The Saint and the Fiction Makers 
Predicted genre:  ['Mystery']
Actual genre:  Mystery 

Book:  TIM Defender of the Earth 
Predicted genre:  ['Science Fiction']
Actual genre:  Science Fiction 

