In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline


## Predicting the Genre of Books from Summaries

We'll use a set of book summaries from the [CMU Book Summaries Corpus](http://www.cs.cmu.edu/~dbamman/booksummaries.html) in this experiment. The corpus contains a large number of summaries (16,559) and includes metadata about the genre of each book, taken from Freebase. Each book can have more than one genre and there are 227 genres listed in total. To simplify the problem of genre prediction we will select a small number of target genres that occur frequently in the collection and keep only the books with these genre labels. This will give us one genre label per book.

Your goal in this portfolio is to take this data and build a predictive model to classify the books into one of the five target genres.  You will need to extract suitable features from the texts and select a suitable model to classify them. You should build at least one model but you could build two and compare the results if you have time.

You should report on each stage of your experiment as you work with the data.


## Data Preparation

The first task is to read the data. It is made available in tab-separated format but has no column headings. We can use `read_csv` to read this but we need to set the separator to `\t` (tab) and supply the column names.  The names come from the [ReadMe](data/booksummaries/README.txt) file.

In [2]:
names = ['wid', 'fid', 'title', 'author', 'date', 'genres', 'summary']

books = pd.read_csv("data/booksummaries/booksummaries.txt", sep="\t", header=None, names=names, keep_default_na=False)
books.head()

| | wid | fid | title | author | date | genres | summary |
|---|---|---|---|---|---|---|---|
| 0 | 620 | /m/0hhy | Animal Farm | George Orwell | 1945-08-17 | {"/m/016lj8": "Roman \u00e0 clef", "/m/06nbt":... | Old Major, the old boar on the Manor Farm, ca... |
| 1 | 843 | /m/0k36 | A Clockwork Orange | Anthony Burgess | 1962 | {"/m/06n90": "Science Fiction", "/m/0l67h": "N... | Alex, a teenager living in near-future Englan... |
| 2 | 986 | /m/0ldx | The Plague | Albert Camus | 1947 | {"/m/02m4t": "Existentialism", "/m/02xlf": "Fi... | The text of The Plague is divided into five p... |
| 3 | 1756 | /m/0sww | An Enquiry Concerning Human Understanding | David Hume | | | The argument of the Enquiry proceeds by a ser... |
| 4 | 2080 | /m/0wkt | A Fire Upon the Deep | Vernor Vinge | | {"/m/03lrw": "Hard science fiction", "/m/06n90... | The novel posits that space around the Milky ... |


We next filter the data so that only books with one of our target genre labels are included, and we assign each text to just one of those labels. A text may carry two of the target labels (e.g. Science Fiction and Fantasy); in that case the code below keeps the one that appears later in `target_genres`, because later matches overwrite earlier ones.

In [3]:
target_genres = ["Children's literature",
                 'Science Fiction',
                 'Novel',
                 'Fantasy',
                 'Mystery']

# create a Series of empty strings the same length as the list of books
genre = pd.Series(np.repeat("", books.shape[0]))
# look for each target genre and set the corresponding entries in the genre series to the genre label
for g in target_genres:
    genre[books['genres'].str.contains(g)] = g

# add this to the book dataframe and then select only those rows that have a genre label
# drop some useless columns
books['genre'] = genre
genre_books = books[genre!=''].drop(['genres', 'fid', 'wid'], axis=1)

genre_books.shape

(8954, 5)
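
Because the loop assigns labels in list order, a book tagged with several target genres ends up with whichever one appears last in `target_genres`. The sketch below reproduces that precedence on plain Python strings. The helper `assign_genre` and the sample strings are made up for illustration, and a substring test stands in for `str.contains`.

```python
# Hypothetical helper mirroring the loop above on a single genres string.
target_genres = ["Children's literature", 'Science Fiction', 'Novel', 'Fantasy', 'Mystery']

def assign_genre(genres_field, targets=target_genres):
    """Return the last target genre found in genres_field, or '' if none match."""
    label = ''
    for g in targets:
        if g in genres_field:   # substring test, like str.contains
            label = g           # later targets overwrite earlier ones
    return label

# Illustrative inputs (not rows from the corpus)
both = '{"/m/x1": "Science Fiction", "/m/x2": "Fantasy"}'
none = '{"/m/x3": "Existentialism"}'

print(assign_genre(both))        # Fantasy
print(repr(assign_genre(none)))  # ''
```

Reordering `target_genres` would therefore change which label such books receive, and with it the class counts.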

In [4]:
# check how many books we have in each genre category
genre_books.groupby('genre').count()

| genre | title | author | date | summary |
|---|---|---|---|---|
| Children's literature | 1092 | 1092 | 1092 | 1092 |
| Fantasy | 2311 | 2311 | 2311 | 2311 |
| Mystery | 1396 | 1396 | 1396 | 1396 |
| Novel | 2258 | 2258 | 2258 | 2258 |
| Science Fiction | 1897 | 1897 | 1897 | 1897 |


## Modelling

We start the process by cleaning up the summary texts. This includes:

- Conversion to lowercase
- Tokenization
- Removal of numeric tokens
- Removal of punctuation
- Removal of stopwords
- Lemmatization

In [5]:
import nltk
# nltk.download('punkt')
# nltk.download('wordnet')
# nltk.download('stopwords')
import re
from string import punctuation
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer

wnl = WordNetLemmatizer()
# a set gives fast membership tests when filtering tokens
stopwords = set(nltk.corpus.stopwords.words('english'))

In [6]:
tokenized_summaries = []

# compile the token filters once, outside the loop
re_num = re.compile(r'[0-9]')
re_punc = re.compile('[%s]' % re.escape(punctuation))

# step 1: convert whole text to lowercase
summaries = genre_books.summary.str.lower().to_list()

for summary in summaries:
    # step 2: tokenize all the words
    tokens = word_tokenize(summary)

    # step 3: remove numeric tokens (any token that starts with a digit)
    tokens = [i for i in tokens if not re_num.match(i)]

    # step 4: remove punctuation (any token that starts with a punctuation character)
    tokens = [i for i in tokens if not re_punc.match(i)]

    # step 5: remove all the stopwords
    tokens = [word for word in tokens if word not in stopwords]

    # step 6: lemmatize all the tokens
    tokens = [wnl.lemmatize(word) for word in tokens]

    tokenized_summaries.append(' '.join(tokens))
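
Note that `re.match` only tests the start of a token, so the numeric and punctuation filters above drop any token that *begins* with a digit or punctuation character, not only tokens made up entirely of them. A small stand-alone sketch with invented tokens shows the effect:

```python
import re
from string import punctuation

re_num = re.compile(r'[0-9]')
re_punc = re.compile('[%s]' % re.escape(punctuation))

# Illustrative tokens, not taken from the corpus
tokens = ['1984', '2nd', 'b2b', "'s", 'well-known', '...', 'orwell']

kept = [t for t in tokens if not re_num.match(t) and not re_punc.match(t)]
print(kept)  # ['b2b', 'well-known', 'orwell']

# A stricter "purely numeric" test, shown only for comparison
only_numeric = [t for t in tokens if re.fullmatch(r'[0-9]+', t)]
print(only_numeric)  # ['1984']
```

For this task the looser filter is probably harmless, since tokens such as `2nd` or `'s` carry little genre information.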

Next, we use [`TfidfVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) from Scikit-Learn in order to obtain a matrix of TF-IDF features of the summaries.

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=2, max_features=70000, strip_accents='unicode',
                            analyzer='word', token_pattern=r'\w+', use_idf=True, 
                            smooth_idf=True, sublinear_tf=True)
tfidf_matrix = vectorizer.fit_transform(tokenized_summaries)
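
To make the `sublinear_tf` and `smooth_idf` options concrete, here is a small pure-Python sketch of the weighting they select: term frequency becomes `1 + ln(tf)`, inverse document frequency becomes `ln((1 + n) / (1 + df)) + 1`, and each document vector is then scaled to unit length (the vectorizer's default `norm='l2'`). The toy documents are invented, and the sketch ignores `min_df`, `max_features` and accent stripping.

```python
import math
from collections import Counter

# Toy documents, not from the corpus
docs = [
    "wizard dragon wizard castle",
    "detective murder castle",
    "spaceship planet wizard",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# document frequency: number of documents containing each term
df = Counter(term for doc in tokenized for term in set(doc))

def tfidf(doc):
    """Return a dict of L2-normalised TF-IDF weights for one tokenized document."""
    weights = {}
    for term, count in Counter(doc).items():
        tf = 1 + math.log(count)                           # sublinear_tf=True
        idf = math.log((1 + n_docs) / (1 + df[term])) + 1  # smooth_idf=True
        weights[term] = tf * idf
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vec = tfidf(tokenized[0])
print(sorted(vec, key=vec.get, reverse=True))  # ['wizard', 'dragon', 'castle']
```

The repeated term `wizard` ranks first, but the logarithm keeps it from dominating, and `dragon` outranks `castle` because it occurs in fewer documents.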

We are going to divide the dataset into training and test sets. We will take 20% of the total dataset as the testing dataset.

In [8]:
from sklearn import metrics
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(tfidf_matrix, genre_books['genre'], test_size=0.2)

print('Training: x:', x_train.shape, 'y:', y_train.shape)
print('Testing: x:', x_test.shape, 'y:', y_test.shape)

Training: x: (7163, 38553) y: (7163,)
Testing: x: (1791, 38553) y: (1791,)
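
The split above is drawn at random on every run, so the scores reported below will vary slightly between runs. One option, not used in the run shown here, is to pass `random_state` for repeatability and `stratify` to keep the genre proportions equal in both parts. A minimal sketch on toy arrays standing in for `tfidf_matrix` and `genre_books['genre']` (the seed value 42 is arbitrary):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the feature matrix and the genre labels
X = np.arange(40).reshape(20, 2)
y = np.array(['Fantasy'] * 10 + ['Mystery'] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

print(X_tr.shape, X_te.shape)        # (16, 2) (4, 2)
classes, counts = np.unique(y_te, return_counts=True)
print(dict(zip(classes.tolist(), counts.tolist())))  # two test rows per class
```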


Our next step is to build a model. We are going to look at two methods: logistic regression and multinomial naive Bayes.

## Logistic Regression

We are going to use `LogisticRegression` from Scikit-Learn's `linear_model` module to build the model.

In [9]:
from sklearn import linear_model
reg_classifier = linear_model.LogisticRegression(solver='sag', max_iter=200, random_state=450)
reg_classifier.fit(x_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=200,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=450, solver='sag', tol=0.0001, verbose=0,
                   warm_start=False)

In [10]:
predictions = reg_classifier.predict(x_test)
print('F1 score:', metrics.f1_score(y_test, predictions, average='macro'))
print('Accuracy:', metrics.accuracy_score(y_test, predictions))

F1 score: 0.6929177780161613
Accuracy: 0.711892797319933


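Now let us spot-check this model on the first 5 books from the original dataframe. These books were part of the data that was split above, so some of them were probably seen during training; treat this as an illustration rather than an evaluation.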

In [11]:
idx = 0
correct = 0
total = 0

while total < 5:
    try:
        # the rows of tfidf_matrix are the vectorized tokenized_summaries
        prediction = reg_classifier.predict(tfidf_matrix[idx])[0]
        actual = genre_books['genre'][idx]
        if prediction == actual:
            correct += 1
        print(genre_books['title'][idx], '\nPredicted:', prediction, '\nActual:', actual, '\n')
        total += 1
    except KeyError:
        # genre_books keeps the row labels of books, so filtered-out rows are skipped
        continue
    finally:
        idx += 1

print('Got', correct, 'out of', total, 'correct.')

Animal Farm 
Predicted: Children's literature 
Actual: Children's literature 

A Clockwork Orange 
Predicted: Novel 
Actual: Novel 

The Plague 
Predicted: Novel 
Actual: Novel 

A Fire Upon the Deep 
Predicted: Fantasy 
Actual: Fantasy 

A Wizard of Earthsea 
Predicted: Science Fiction 
Actual: Fantasy 

Got 4 out of 5 correct.
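
One caveat about the loop above: `idx` serves both as a position in `tokenized_summaries` and as an index label in `genre_books`. Since `genre_books` keeps the row labels of `books`, the two agree only up to the first filtered-out row (label 3 in the preview earlier). After that, the summary being classified and the title being printed may belong to different books. A minimal sketch of the issue and of a positional lookup with `iloc`, on a toy frame with invented values:

```python
import pandas as pd

# Toy frame whose index has a gap at 3, as happens after filtering rows
toy = pd.DataFrame({'title': ['A', 'B', 'C', 'E'],
                    'genre': ['Novel', 'Novel', 'Fantasy', 'Mystery']},
                   index=[0, 1, 2, 4])
texts = ['text a', 'text b', 'text c', 'text e']  # aligned with toy by position

print(3 in toy.index, 4 in toy.index)  # False True

# Positional lookup keeps row i of the frame and texts[i] together
pairs = [(toy['title'].iloc[i], texts[i]) for i in range(len(toy))]
print(pairs[3])  # ('E', 'text e')
```

With the real objects, the corresponding lookups would be `tfidf_matrix[i]` together with `genre_books['genre'].iloc[i]` and `genre_books['title'].iloc[i]`.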


## Multinomial Naive Bayes

For this method, we will use `naive_bayes.MultinomialNB` from Scikit-Learn.

In [12]:
from sklearn.naive_bayes import MultinomialNB
mnb_classifier = MultinomialNB(alpha=.45)
mnb_classifier.fit(x_train, y_train)

MultinomialNB(alpha=0.45, class_prior=None, fit_prior=True)

In [13]:
predictions = mnb_classifier.predict(x_test)
print('F1 score:', metrics.f1_score(y_test, predictions, average='macro'))
print('Accuracy:', metrics.accuracy_score(y_test, predictions))

F1 score: 0.5636639050191085
Accuracy: 0.644891122278057


We can see that both the F1 score and the accuracy are lower for the naive Bayes classifier than for logistic regression. The same spot check on the first 5 books from the original dataframe points in the same direction, although five books are far too few to draw conclusions from.

In [14]:
idx = 0
correct = 0
total = 0

while total < 5:
    try:
        # the rows of tfidf_matrix are the vectorized tokenized_summaries
        prediction = mnb_classifier.predict(tfidf_matrix[idx])[0]
        actual = genre_books['genre'][idx]
        if prediction == actual:
            correct += 1
        print(genre_books['title'][idx], '\nPredicted:', prediction, '\nActual:', actual, '\n')
        total += 1
    except KeyError:
        # genre_books keeps the row labels of books, so filtered-out rows are skipped
        continue
    finally:
        idx += 1

print('Got', correct, 'out of', total, 'correct.')

Animal Farm 
Predicted: Novel 
Actual: Children's literature 

A Clockwork Orange 
Predicted: Novel 
Actual: Novel 

The Plague 
Predicted: Novel 
Actual: Novel 

A Fire Upon the Deep 
Predicted: Fantasy 
Actual: Fantasy 

A Wizard of Earthsea 
Predicted: Science Fiction 
Actual: Fantasy 

Got 3 out of 5 correct.


## Saving the model

Finally, we will save both models to disk.

In [15]:
from joblib import dump
dump(reg_classifier, 'books-genre-logistic-regression.gz')
dump(mnb_classifier, 'books-genre-naive-bayes.gz')

['books-genre-naive-bayes.gz']

The models can later be loaded with the `joblib.load` function.
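
A reloaded classifier can only score new text if that text is turned into features by the same fitted vectorizer, so it may be worth saving the two together. The sketch below is one way to do that on invented toy data. The bundle layout and the file name are assumptions, not part of the notebook above.

```python
import os
import tempfile
from joblib import dump, load
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training data, invented for illustration
texts = ['wizard dragon castle', 'dragon spell wizard',
         'detective murder clue', 'murder clue inspector']
labels = ['Fantasy', 'Fantasy', 'Mystery', 'Mystery']

vec = TfidfVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(texts), labels)

# Save the fitted vectorizer together with the classifier (hypothetical file name)
path = os.path.join(tempfile.mkdtemp(), 'toy-genre-model.gz')
dump({'vectorizer': vec, 'classifier': clf}, path)

# Later: reload the bundle and classify unseen text
bundle = load(path)
new_text = ['wizard castle spell']
restored_pred = bundle['classifier'].predict(bundle['vectorizer'].transform(new_text))
print(restored_pred[0])
```

New summaries would also need the same cleaning steps (lowercasing, stopword removal, lemmatization) before being passed to the vectorizer.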