In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline


## Predicting the Genre of Books from Summaries

We'll use a set of book summaries from the [CMU Book Summaries Corpus](http://www.cs.cmu.edu/~dbamman/booksummaries.html) in this experiment. The corpus contains 16,559 summaries along with metadata, taken from Freebase, about the genres of each book. A book can have more than one genre, and 227 genres are listed in total. To simplify the problem of genre prediction, we will pick a small number of target genres that occur frequently in the collection and keep only the books that carry at least one of them. Each remaining book is then given exactly one genre label.

Your goal in this portfolio is to take this data and build predictive models to classify the books into one of the five target genres.  You will need to extract suitable features from the texts and select suitable models to classify them. You should build and evaluate at least TWO models and compare the prediction results.

You should report on each stage of your experiment as you work with the data.


## Data Preparation

The first task is to read the data. It is made available in tab-separated format but has no column headings. We can use `read_csv` to read this but we need to set the separator to `\t` (tab) and supply the column names.  The names come from the [ReadMe](data/booksummaries/README.txt) file.

In [2]:
names = ['wid', 'fid', 'title', 'author', 'date', 'genres', 'summary']

books = pd.read_csv("data/booksummaries/booksummaries.txt", sep="\t", header=None, names=names, keep_default_na=False)
books.head()

Unnamed: 0,wid,fid,title,author,date,genres,summary
0,620,/m/0hhy,Animal Farm,George Orwell,1945-08-17,"{""/m/016lj8"": ""Roman \u00e0 clef"", ""/m/06nbt"":...","Old Major, the old boar on the Manor Farm, ca..."
1,843,/m/0k36,A Clockwork Orange,Anthony Burgess,1962,"{""/m/06n90"": ""Science Fiction"", ""/m/0l67h"": ""N...","Alex, a teenager living in near-future Englan..."
2,986,/m/0ldx,The Plague,Albert Camus,1947,"{""/m/02m4t"": ""Existentialism"", ""/m/02xlf"": ""Fi...",The text of The Plague is divided into five p...
3,1756,/m/0sww,An Enquiry Concerning Human Understanding,David Hume,,,The argument of the Enquiry proceeds by a ser...
4,2080,/m/0wkt,A Fire Upon the Deep,Vernor Vinge,,"{""/m/03lrw"": ""Hard science fiction"", ""/m/06n90...",The novel posits that space around the Milky ...
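As a small illustration of the `read_csv` options described above, the sketch below parses two made-up rows in the corpus layout from an in-memory string instead of the real file. It suggests why `keep_default_na=False` is useful here: empty `date` and `genres` fields stay as empty strings rather than turning into `NaN`. The toy rows and the variable names `toy_tsv`, `kept` and `default` are assumptions for the example only.

```python
import io
import pandas as pd

# Two made-up rows in the corpus layout: seven tab-separated fields, no header.
# The second row has empty date and genres fields.
toy_tsv = (
    "620\t/m/0hhy\tAnimal Farm\tGeorge Orwell\t1945-08-17\t{}\tOld Major ...\n"
    "1756\t/m/0sww\tAn Enquiry\tDavid Hume\t\t\tThe argument ...\n"
)
names = ['wid', 'fid', 'title', 'author', 'date', 'genres', 'summary']

# Same options as the cell above: empty fields are kept as empty strings.
kept = pd.read_csv(io.StringIO(toy_tsv), sep="\t", header=None,
                   names=names, keep_default_na=False)
# pandas defaults: empty fields are read as NaN.
default = pd.read_csv(io.StringIO(toy_tsv), sep="\t", header=None, names=names)

print(repr(kept.loc[1, 'genres']))        # the empty field as a string
print(pd.isna(default.loc[1, 'genres']))  # whether the default parse gave NaN
```

Keeping empty strings matters for the later filtering step, because `str.contains` returns missing values for `NaN` entries and those would have to be handled before the result could serve as a boolean mask.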


We next filter the data so that only our target genre labels are included, and we assign each text to just one of the genre labels. A text may carry two of these labels (e.g. Science Fiction and Fantasy); in that case the loop below keeps whichever label appears last in `target_genres`.

In [3]:
target_genres = ["Children's literature",
                 'Science Fiction',
                 'Novel',
                 'Fantasy',
                 'Mystery']

# create a Series of empty strings the same length as the list of books
genre = pd.Series(np.repeat("", books.shape[0]))
# look for each target genre and set the corresponding entries in the genre series to the genre label
for g in target_genres:
    genre[books['genres'].str.contains(g)] = g

# add this to the book dataframe and then select only those rows that have a genre label
# drop some useless columns
books['genre'] = genre
genre_books = books[genre!=''].drop(['genres', 'fid', 'wid'], axis=1)

genre_books.shape


(8954, 5)
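The labelling loop has two side effects that are easy to miss, and a toy run makes them visible. The sketch below applies the same loop to four made-up genre strings (the Freebase ids such as `/m/x1` are invented for the example). It passes `regex=False` to make the substring match explicit, which is an addition to the original call.

```python
import pandas as pd

target_genres = ["Children's literature", 'Science Fiction', 'Novel', 'Fantasy', 'Mystery']

# Made-up genre strings in the style of the corpus 'genres' column.
toy = pd.DataFrame({'genres': [
    '{"/m/x1": "Science Fiction", "/m/x2": "Fantasy"}',
    '{"/m/x3": "Novel", "/m/x4": "Mystery"}',
    '{"/m/x5": "Novella"}',
    '{"/m/x6": "Existentialism"}',
]})

# Same idea as the cell above: start with empty labels, then overwrite
# the label for every row whose genre string contains the target genre.
genre = pd.Series([""] * len(toy), index=toy.index)
for g in target_genres:
    genre[toy['genres'].str.contains(g, regex=False)] = g

print(genre.tolist())  # ['Fantasy', 'Mystery', 'Novel', '']
```

The first two rows show that later entries in `target_genres` overwrite earlier ones. The third row shows that the match is a plain substring test, so a genre such as "Novella" is counted as "Novel". Whether that matters for the real corpus would need checking against the actual genre names.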

In [4]:
genre_books.head()

Unnamed: 0,title,author,date,summary,genre
0,Animal Farm,George Orwell,1945-08-17,"Old Major, the old boar on the Manor Farm, ca...",Children's literature
1,A Clockwork Orange,Anthony Burgess,1962,"Alex, a teenager living in near-future Englan...",Novel
2,The Plague,Albert Camus,1947,The text of The Plague is divided into five p...,Novel
4,A Fire Upon the Deep,Vernor Vinge,,The novel posits that space around the Milky ...,Fantasy
6,A Wizard of Earthsea,Ursula K. Le Guin,1968,"Ged is a young boy on Gont, one of the larger...",Fantasy


In [5]:
# check how many books we have in each genre category
genre_books.groupby('genre').count()


Unnamed: 0_level_0,title,author,date,summary
genre,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Children's literature,1092,1092,1092,1092
Fantasy,2311,2311,2311,2311
Mystery,1396,1396,1396,1396
Novel,2258,2258,2258,2258
Science Fiction,1897,1897,1897,1897


## Feature Extraction

Now you take over to build a suitable model and present your results.

Firstly, you need to perform feature extraction to produce feature vectors for the predictive models.


In [7]:
Y = genre_books.genre

In [24]:
genre_books.summary[0]

' Old Major, the old boar on the Manor Farm, calls the animals on the farm for a meeting, where he compares the humans to parasites and teaches the animals a revolutionary song, \'Beasts of England\'. When Major dies, two young pigs, Snowball and Napoleon, assume command and turn his dream into a philosophy. The animals revolt and drive the drunken and irresponsible Mr Jones from the farm, renaming it "Animal Farm". They adopt Seven Commandments of Animal-ism, the most important of which is, "All animals are equal". Snowball attempts to teach the animals reading and writing; food is plentiful, and the farm runs smoothly. The pigs elevate themselves to positions of leadership and set aside special food items, ostensibly for their personal health. Napoleon takes the pups from the farm dogs and trains them privately. Napoleon and Snowball struggle for leadership. When Snowball announces his plans to build a windmill, Napoleon has his dogs chase Snowball away and declares himself leader. N

In [34]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_features=2000, stop_words = 'english')
X = vectorizer.fit_transform(genre_books.summary).toarray()


In [35]:
X.shape, Y.shape

((8954, 2000), (8954,))
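To make the TF-IDF step more concrete, here is a minimal sketch on three made-up sentences. It shows that `fit_transform` returns a SciPy sparse matrix (the cell above converts it to a dense array with `.toarray()`), that English stop words are dropped from the vocabulary, and that with the default `norm='l2'` each row has unit length. The toy documents and the names `toy_docs`, `toy_vec` and `X_sparse` are assumptions for the example.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up one-line "summaries".
toy_docs = [
    "The wizard casts a spell on the dragon",
    "A detective investigates the murder",
    "The spaceship lands on a distant planet",
]

toy_vec = TfidfVectorizer(stop_words='english')
X_sparse = toy_vec.fit_transform(toy_docs)  # sparse matrix, one row per document

print(sparse.issparse(X_sparse), X_sparse.shape)
print(sorted(toy_vec.vocabulary_))          # learned terms, stop words excluded

# Euclidean length of each row; the default norm='l2' scales rows to length 1.
row_norms = np.sqrt(X_sparse.multiply(X_sparse).sum(axis=1))
print(np.allclose(row_norms, 1.0))
```

Keeping the matrix sparse saves memory on the full corpus. `LogisticRegression` and `MultinomialNB` accept sparse input directly, while `GaussianNB` needs a dense array.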

In [36]:
# X[0], genre_books.summary[0]

In [37]:
from sklearn.model_selection import train_test_split
# split features and labels in a single call so the rows stay aligned
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=142)
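The sketch below checks, on toy arrays, that passing features and labels to `train_test_split` together keeps each row paired with its label, and that splitting the features alone with the same `random_state` selects the same rows. The arrays `X_toy` and `y_toy` are made up so that the pairing can be verified after shuffling.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: row i of X_toy is [2*i, 2*i + 1] and its label is i,
# so a row and its label can be matched up after shuffling.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=142)

# Every row must still sit next to its own label in both parts.
aligned = bool((X_tr[:, 0] == 2 * y_tr).all() and (X_te[:, 0] == 2 * y_te).all())

# Splitting X on its own with the same seed selects the same rows.
X_tr_alone, X_te_alone = train_test_split(X_toy, test_size=0.2, random_state=142)
same_rows = bool(np.array_equal(X_tr, X_tr_alone) and np.array_equal(X_te, X_te_alone))

print(aligned, same_rows)
```

Two separate calls therefore work as long as the seed and the number of rows match, but a single call removes the risk of the two drifting apart. The optional `stratify` argument can additionally keep the class proportions similar in both parts.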

## Model Training

Then, train two predictive models from the given data set.

In [38]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score

In [39]:
lr = LogisticRegression()
lr.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression()
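The warning above means the solver stopped at its default iteration limit before converging. One option the message itself suggests is a larger `max_iter`. The sketch below shows this on synthetic data from `make_classification`; the value 1000 and the toy data are assumptions, and the results reported in this notebook were produced with the default settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic three-class problem standing in for the TF-IDF features.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_classes=3, random_state=0)

# Allow the solver more iterations than the default before it gives up.
lr_toy = LogisticRegression(max_iter=1000)
lr_toy.fit(X_toy, y_toy)

print(lr_toy.n_iter_)             # iterations actually used by the solver
print(lr_toy.predict(X_toy[:5]))  # predictions for the first five rows
```

If `n_iter_` stays below `max_iter`, the solver stopped because it converged rather than because it ran out of iterations.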

In [42]:
predicted = lr.predict(X_test)

## Model Evaluation

Finally, evaluate and compare the learned predictive models.

In [43]:
print(accuracy_score(y_test, predicted)) 
print(confusion_matrix(y_test, predicted))

0.6677833612506979
[[116  41  16  56   8]
 [ 16 338  12  42  45]
 [  3  30 173  54  11]
 [ 27  24  30 306  43]
 [  3  54  17  63 263]]
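A note on reading these numbers: in scikit-learn's `confusion_matrix`, rows are true labels and columns are predicted labels, and when no `labels` argument is given the classes appear in sorted order. The toy example below, with made-up predictions, shows how accuracy and per-class recall follow from the matrix.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true and predicted labels for six books.
y_true = ["Fantasy", "Fantasy", "Mystery", "Novel", "Novel", "Novel"]
y_pred = ["Fantasy", "Novel", "Mystery", "Novel", "Novel", "Fantasy"]

labels = sorted(set(y_true) | set(y_pred))  # row and column order
cm = confusion_matrix(y_true, y_pred, labels=labels)

acc = accuracy_score(y_true, y_pred)        # correct predictions / all predictions
recall = cm.diagonal() / cm.sum(axis=1)     # per true class: correct / row total

print(labels)
print(cm)      # [[1 0 1], [0 1 0], [1 0 2]]
print(acc)     # 4 of 6 correct
print(recall)  # 0.5, 1.0 and 2/3
```

Under that ordering, the rows of the 5x5 matrix above would correspond to Children's literature, Fantasy, Mystery, Novel and Science Fiction.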


In [19]:
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
# Converting string labels into numbers.

Y=le.fit_transform(genre_books.genre)
print(Y.shape)

(8954,)
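`LabelEncoder` maps each distinct label to an integer according to the sorted order of the labels, and `inverse_transform` maps the integers back. A toy sketch with made-up labels:

```python
from sklearn.preprocessing import LabelEncoder

toy_labels = ["Novel", "Fantasy", "Mystery", "Novel"]

toy_le = LabelEncoder()
codes = toy_le.fit_transform(toy_labels)

print(toy_le.classes_.tolist())                   # ['Fantasy', 'Mystery', 'Novel']
print(codes.tolist())                             # [2, 0, 1, 2]
print(toy_le.inverse_transform([0, 2]).tolist())  # ['Fantasy', 'Novel']
```

The integer codes are only identifiers; their numeric order carries no meaning for the classifier.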


In [20]:
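# NOTE: positional split. Row 7162 falls in both slices, and these labels only
# match X_train/X_test if X was split by the same positions rather than shuffled.
y_train = Y[0:7163]
y_test = Y[7162:]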

In [21]:
lr = LogisticRegression()
lr.fit(X_train, y_train)
predicted = lr.predict(X_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


In [22]:
print(accuracy_score(y_test, predicted)) 
print(confusion_matrix(y_test, predicted))

0.5876116071428571
[[106  64  13  53  10]
 [ 36 319  25  60  40]
 [ 13  31 105  39  11]
 [ 38  52  44 296  53]
 [  6  79  17  55 227]]


In [39]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()

In [40]:
gnb.fit(X_train, y_train)
predicted = gnb.predict(X_test)

In [41]:
print(accuracy_score(y_test, predicted)) 
print(confusion_matrix(y_test, predicted))

0.3359375
[[ 70  71  32  41  32]
 [ 60 211  53  59  97]
 [ 23  60  57  34  25]
 [ 69  99  84 135  96]
 [ 22 145  63  25 129]]


In [42]:
from sklearn.naive_bayes import MultinomialNB
mnb = MultinomialNB()
mnb.fit(X_train, y_train)
predicted = mnb.predict(X_test)

In [43]:
print(accuracy_score(y_test, predicted)) 
print(confusion_matrix(y_test, predicted))

0.35714285714285715
[[  0 147   0  94   5]
 [  0 336   0 135   9]
 [  0  86   0 106   7]
 [  0 204   0 269  10]
 [  0 275   0  74  35]]


In [52]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(max_depth=15, random_state=0)
clf.fit(X_train, y_train)
predicted = clf.predict(X_test)

In [53]:
print(accuracy_score(y_test, predicted)) 
print(confusion_matrix(y_test, predicted))

0.3950892857142857
[[ 27  69  25  77  48]
 [ 19 214  32 116  99]
 [  8  46  36  78  31]
 [ 17  81  41 255  89]
 [  4 112  32  60 176]]


In [56]:
vectorizer = TfidfVectorizer(max_features=50)
X = vectorizer.fit_transform(genre_books.summary).toarray()

In [57]:
X.shape

(8954, 50)

In [58]:
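# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(vectorizer.get_feature_names_out().tolist())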

['about', 'after', 'all', 'an', 'and', 'are', 'as', 'at', 'be', 'been', 'but', 'by', 'for', 'from', 'had', 'has', 'have', 'he', 'her', 'him', 'his', 'in', 'into', 'is', 'it', 'not', 'of', 'on', 'one', 'out', 'she', 'that', 'the', 'their', 'them', 'then', 'they', 'this', 'time', 'to', 'two', 'up', 'was', 'when', 'where', 'which', 'while', 'who', 'will', 'with']


In [69]:
from sklearn.neighbors import KNeighborsClassifier
KNN = KNeighborsClassifier(n_neighbors=10)

In [70]:
KNN.fit(X_train, y_train)
predicted = KNN.predict(X_test)

In [71]:
print(accuracy_score(y_test, predicted)) 
print(confusion_matrix(y_test, predicted))

0.35825892857142855
[[ 41  88   9  78  30]
 [ 34 275  17  94  60]
 [ 14  77  26  56  26]
 [ 33 169  25 198  58]
 [  8 183  30  61 102]]


In [73]:
print(vectorizer.get_stop_words())

None


In [76]:
from nltk.corpus import stopwords 
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\44999038\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [81]:
stop_words = stopwords.words('english') 

In [83]:
stop_words[12]

"you'd"

In [91]:
from nltk.tokenize import word_tokenize

In [97]:
# keep only the whitespace-separated tokens that are not in the NLTK stop word list
stop_set = set(stop_words)  # set lookups are faster than scanning the list
filtered_summary = [[w for w in s.split() if w not in stop_set]
                    for s in genre_books.summary]

In [104]:
len(filtered_summary)

8954
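One caveat with filtering on `s.split()`: the tokens keep their capitalisation and attached punctuation, so "The" or "farm," will not match a lowercase stop word list. The sketch below uses a tiny hypothetical stop list (`toy_stop`, standing in for `stopwords.words('english')`) so that it runs without downloading NLTK data.

```python
import string

# Hypothetical mini stop list, used here instead of the NLTK list.
toy_stop = {"the", "on", "a", "for"}
text = "The animals on the farm, call for a meeting."

# Filtering raw whitespace tokens, as in the cell above.
naive = [w for w in text.split() if w not in toy_stop]

# Lowercase each token and strip surrounding punctuation before the lookup.
normalised = []
for w in text.split():
    w = w.lower().strip(string.punctuation)
    if w and w not in toy_stop:
        normalised.append(w)

print(naive)       # ['The', 'animals', 'farm,', 'call', 'meeting.']
print(normalised)  # ['animals', 'farm', 'call', 'meeting']
```

`TfidfVectorizer` lowercases and tokenises the text itself before applying its `stop_words` list, which may be why the later cells pass the raw summaries to the vectorizer rather than `filtered_summary`.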

In [178]:
vectorizer = TfidfVectorizer(stop_words='english', token_pattern=r'(?u)\b[A-Za-z]+\b', max_features = 2000)
X = vectorizer.fit_transform(genre_books.summary).toarray()
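The effect of the custom `token_pattern` can be checked with the `re` module alone. The sketch below compares it with scikit-learn's default pattern on a made-up sentence, lowercased first because the vectorizer lowercases text before tokenising.

```python
import re

custom = r'(?u)\b[A-Za-z]+\b'   # pattern used in the cell above
default = r'(?u)\b\w\w+\b'      # scikit-learn's default token_pattern

text = "In 1984, O'Brien met 2 rebels at room101.".lower()

custom_tokens = re.findall(custom, text)
default_tokens = re.findall(default, text)

print(custom_tokens)   # ['in', 'o', 'brien', 'met', 'rebels', 'at']
print(default_tokens)  # ['in', '1984', 'brien', 'met', 'rebels', 'at', 'room101']
```

The custom pattern drops numbers and any token that mixes letters and digits, but it also admits single-letter tokens such as "o", which the default pattern excludes.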

In [179]:
le = preprocessing.LabelEncoder()
Y=le.fit_transform(genre_books.genre)
# print(genre_encoded.shape)

In [180]:
from sklearn.model_selection import train_test_split
# split features and labels in a single call so the rows stay aligned
train_X, test_X, train_Y, test_Y = train_test_split(X, Y, test_size=0.2, random_state=142)

In [181]:
# train_X.shape, train_Y.shape

In [182]:
# test_X.shape, test_Y.shape

In [183]:
lr = LogisticRegression()
lr.fit(train_X, train_Y)
predicted = lr.predict(test_X)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


In [184]:
print(accuracy_score(test_Y, predicted)) 
print(confusion_matrix(test_Y, predicted))

0.6694584031267449
[[115  41  16  57   8]
 [ 14 344  11  40  44]
 [  3  29 174  54  11]
 [ 28  27  31 305  39]
 [  5  56  17  61 261]]


In [158]:
str1 = min(genre_books.summary, key=len)
len(str1.split())

1

In [190]:
test_Y[2], predicted[2]

(3, 3)

In [192]:
clf = RandomForestClassifier(max_depth=15, random_state=0)
clf.fit(train_X, train_Y)
predicted = clf.predict(test_X)

In [193]:
print(accuracy_score(test_Y, predicted)) 

0.5862646566164154
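Since the cells above were run in several passes, a compact way to compare models is to fit each one on the same split inside a single loop. The following is only a sketch on a handful of made-up sentences: the documents, the labels and the choice of models are assumptions, and the scores it prints say nothing about the real corpus. It also fits the vectorizer on the training documents only and reuses it for the test documents, which is a variation on the cells above, where the vectorizer is fitted on all summaries before splitting.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Made-up training and test documents with one genre label each.
train_docs = [
    "a wizard casts a spell on the dragon",
    "the dragon guards an enchanted castle",
    "a detective investigates the murder",
    "the inspector questions a murder suspect",
    "the spaceship lands on a distant planet",
    "astronauts repair the damaged spaceship",
]
train_labels = ["Fantasy", "Fantasy", "Mystery", "Mystery",
                "Science Fiction", "Science Fiction"]
test_docs = [
    "the wizard fights a dragon",
    "a detective solves the murder",
    "the spaceship leaves the planet",
]
test_labels = ["Fantasy", "Mystery", "Science Fiction"]

# Learn the vocabulary from the training documents, then reuse it for the test set.
vec = TfidfVectorizer(stop_words='english')
X_tr = vec.fit_transform(train_docs)
X_te = vec.transform(test_docs)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "multinomial naive Bayes": MultinomialNB(),
}

# Fit every model on the same training data and score it on the same test data.
scores = {}
for name, model in models.items():
    model.fit(X_tr, train_labels)
    scores[name] = accuracy_score(test_labels, model.predict(X_te))

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```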
