<h1>Kaggle Competition: Spooky Author Identification</h1>
(https://www.kaggle.com/c/spooky-author-identification)

This notebook ultimately assigns snippets of books to one of three authors.

It does so thanks to having learnt these writers' style and vocabulary via Deep Learning techniques.

In this notebook, we explore three different techniques:

1) Recurrent Neural Network with LSTM and a single embedding layer.

2) The same as 1) but with an additional 1D convolutional layer 

3) The same as 2) but with pre-trained 300-dimensional GloVe word embeddings.

Put together only thanks to:


*   https://github.com/msahamed/yelp_comments_classification_nlp
*   http://nbviewer.jupyter.org/github/SDS-AAU/M3-2018/blob/master/notebooks/Hatespeech_LSTM_SDS.ipynb?fbclid=IwAR3yEslQ96DPfy4sBm3ABxYtP4X8xoh-RKBuzhE5ZfKb757Mp9XjD36oIyQ

In [1]:
# Keras
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Flatten, LSTM, Conv1D, MaxPooling1D, Dropout, Activation, Embedding
#from keras.layers.embeddings import Embedding

# Plotly
import plotly.offline as py
import plotly.graph_objs as go
py.init_notebook_mode(connected=True)
import matplotlib.pyplot as plt

# NLTK
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

# Others
import re
import requests
import zipfile
import io

import string
import numpy as np
import pandas as pd

from sklearn.manifold import TSNE

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/janpetr/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


First, we need to download the necessary files. It's a bit more complicated than with a simple Anaconda Python notebook, but still okay.

1) Training dataset

2) Testing dataset

3) Glove pre-trained words

# 1) Training dataset
!wget https://raw.githubusercontent.com/SDS-AAU/M3-2018/master/assignments/individual/data/train.csv

# 2) Testing dataset
!wget https://raw.githubusercontent.com/SDS-AAU/M3-2018/master/assignments/individual/data/test.csv

In [2]:
# Fetch the training set so that `r` is defined before it is written to disk
url = 'https://raw.githubusercontent.com/SDS-AAU/M3-2018/master/assignments/individual/data/train.csv'
r = requests.get(url)

with open('train.csv', 'wb') as train:
    train.write(r.content)

print("File downloaded successfully")

File downloaded successfully


In [3]:
# Fetch the test set so that `r` is defined before it is written to disk
url = 'https://raw.githubusercontent.com/SDS-AAU/M3-2018/master/assignments/individual/data/test.csv'
r = requests.get(url)

with open('test.csv', 'wb') as test:
    test.write(r.content)

print("File downloaded successfully")

File downloaded successfully


# 3) GloVe pre-trained word vectors
!wget http://nlp.stanford.edu/data/glove.6B.zip

Let's see what the datasets look like.

In [7]:
df = pd.read_csv('train.csv')
df.head()

Unnamed: 0,id,text,author
0,id26305,"This process, however, afforded me no means of...",EAP
1,id17569,It never once occurred to me that the fumbling...,HPL
2,id11008,"In his left hand was a gold snuff box, from wh...",EAP
3,id27763,How lovely is spring As we looked from Windsor...,MWS
4,id12958,"Finding nothing else, not even gold, the Super...",HPL


In [8]:
df["author"] = df["author"].astype('category')
df["author_label"] = df["author"].cat.codes
df.head()

Unnamed: 0,id,text,author,author_label
0,id26305,"This process, however, afforded me no means of...",EAP,0
1,id17569,It never once occurred to me that the fumbling...,HPL,1
2,id11008,"In his left hand was a gold snuff box, from wh...",EAP,0
3,id27763,How lovely is spring As we looked from Windsor...,MWS,2
4,id12958,"Finding nothing else, not even gold, the Super...",HPL,1


*   EAP: 0
*   HPL: 1
*   MWS: 2
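
The mapping above follows from how pandas numbers the categories of a string column: the distinct labels are sorted and counted from 0. Here is a minimal pure-Python sketch of the same idea. The `authors` list is just the five sample rows shown above, not the full dataset, and `label_to_code` is a name made up for this illustration.

```python
# Author labels from the first five training rows shown above.
authors = ["EAP", "HPL", "EAP", "MWS", "HPL"]

# Sort the distinct labels and number them from 0, mirroring the
# order pandas uses for the categories of a string column.
label_to_code = {label: code for code, label in enumerate(sorted(set(authors)))}

# Encode every row with its integer code.
codes = [label_to_code[author] for author in authors]

print(label_to_code)
print(codes)
```

Keeping this dictionary around is handy later, when the numeric columns of the prediction matrix have to be renamed back to author initials.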

In [9]:
df_test = pd.read_csv('test.csv')
df_test.head()

Unnamed: 0,id,text
0,id02310,"Still, as I urged our leaving Ireland with suc..."
1,id24541,"If a fire wanted fanning, it could readily be ..."
2,id00134,And when they had broken down the frail door t...
3,id27757,While I was thinking how I should possibly man...
4,id04081,I am not sure to what limit his knowledge may ...


Let's do a bit of language preprocessing. Comparing the results with and without this step, though, the preprocessing actually decreased the accuracy of the models.

In [10]:
def clean_text(text):
    
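    ## Remove punctuation
    ## NOTE: str.translate expects a translation table; passing string.punctuation
    ## directly leaves printable characters unchanged, so the regexes below do the actual cleaning
    text = text.translate(string.punctuation)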
    
    ## Convert words to lower case and split them
    text = text.lower().split()
    
    ## Remove stop words
    stops = set(stopwords.words("english"))
    text = [w for w in text if w not in stops and len(w) >= 3]
    
    text = " ".join(text)

    # Clean the text
    text = re.sub(r"[^A-Za-z0-9^,!.\/'+-=]", " ", text)
    text = re.sub(r"what's", "what is ", text)
    text = re.sub(r"\'s", " ", text)
    text = re.sub(r"\'ve", " have ", text)
    text = re.sub(r"n't", " not ", text)
    text = re.sub(r"i'm", "i am ", text)
    text = re.sub(r"\'re", " are ", text)
    text = re.sub(r"\'d", " would ", text)
    text = re.sub(r"\'ll", " will ", text)
    text = re.sub(r",", " ", text)
    text = re.sub(r"\.", " ", text)
    text = re.sub(r"!", " ! ", text)
    text = re.sub(r"\/", " ", text)
    text = re.sub(r"\^", " ^ ", text)
    text = re.sub(r"\+", " + ", text)
    text = re.sub(r"\-", " - ", text)
    text = re.sub(r"\=", " = ", text)
    text = re.sub(r"'", " ", text)
    text = re.sub(r"(\d+)(k)", r"\g<1>000", text)
    text = re.sub(r":", " : ", text)
    text = re.sub(r" e g ", " eg ", text)
    text = re.sub(r" b g ", " bg ", text)
    text = re.sub(r" u s ", " american ", text)
    text = re.sub(r"\0s", "0", text)
    text = re.sub(r" 9 11 ", "911", text)
    text = re.sub(r"e - mail", "email", text)
    text = re.sub(r"j k", "jk", text)
    text = re.sub(r"\s{2,}", " ", text)
    
    text = text.split()
    stemmer = SnowballStemmer('english')
    stemmed_words = [stemmer.stem(word) for word in text]
    text = " ".join(stemmed_words)

    return text
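
A side note on the punctuation step in `clean_text`: `str.translate` needs a translation table, and a table that deletes characters is usually built with `str.maketrans`. Below is a small sketch of that approach. The helper name `strip_punctuation` and the constant `PUNCT_TABLE` are made up for illustration and are not part of the notebook.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_punctuation(text):
    """Return `text` with ASCII punctuation removed."""
    return text.translate(PUNCT_TABLE)


sample = "This process, however, afforded me no means!"
stripped = strip_punctuation(sample)
print(stripped)

# Passing the raw punctuation string as the table leaves this sample untouched.
unchanged = sample.translate(string.punctuation) == sample
print(unchanged)
```

Swapping this into `clean_text` would change the cleaned text (apostrophes would disappear before the contraction regexes run), so the numbers reported in this notebook would no longer apply as they are.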

In [11]:
df['text'] = df['text'].map(lambda x: clean_text(x))
df_test['text'] = df_test['text'].map(lambda x: clean_text(x))

In [12]:
df['text'].head()

0    process howev afford mean ascertain dimens dun...
1                  never occur fumbl might mere mistak
2    left hand gold snuff box which caper hill cut ...
3    love spring look windsor terrac sixteen fertil...
4    find noth els even gold superintend abandon at...
Name: text, dtype: object

In [13]:
df_test['text'].head()

0    still urg leav ireland inquietud impati father...
1    fire want fan could readili fan newspap govern...
2    broken frail door found this : two clean pick ...
3    think possibl manag without them one actual tu...
4                       sure limit knowledg may extend
Name: text, dtype: object

Next, we turn the text into padded integer sequences, the type of input the Deep Learning models work with.

In [14]:
### Create sequence
vocabulary_size = 20000
tokenizer = Tokenizer(num_words= vocabulary_size)
tokenizer.fit_on_texts(df['text'])
sequences = tokenizer.texts_to_sequences(df['text'])
data = pad_sequences(sequences, maxlen=50)

In [15]:
data

array([[    0,     0,     0, ...,  3276,    15,   156],
       [    0,     0,     0, ...,    17,   168,  1850],
       [    0,     0,     0, ...,   219,   521,  2573],
       ...,
       [    0,     0,     0, ...,    25,   469, 10248],
       [    0,     0,     0, ...,  1782,  6748,   341],
       [    0,     0,     0, ...,  1562,   511,  4130]], dtype=int32)
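
The leading zeros in the array come from padding: by default `pad_sequences` pads at the front and, for texts longer than `maxlen`, keeps only the last `maxlen` tokens. A pure-Python sketch of that behaviour follows. `pad_pre` is a hypothetical helper written for this illustration, not a Keras function.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad `seq` with `value`, or keep only its last `maxlen` items."""
    seq = list(seq)[-maxlen:]                      # drop tokens from the front if too long
    return [value] * (maxlen - len(seq)) + seq     # fill the front if too short


short = pad_pre([3276, 15, 156], 6)
long = pad_pre([1, 2, 3, 4, 5, 6, 7, 8], 6)

print(short)
print(long)
```

With `maxlen=50`, any snippet longer than 50 tokens therefore loses its beginning, not its end.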

In [16]:
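### Create sequences for the test set
### NOTE: this refits the tokenizer on the test texts, so the word indices below
### follow a different vocabulary than the one used for the training sequences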
vocabulary_size =  20000
tokenizer = Tokenizer(num_words= vocabulary_size)
tokenizer.fit_on_texts(df_test['text'])
sequences = tokenizer.texts_to_sequences(df_test['text'])
data_test = pad_sequences(sequences, maxlen=50)

In [17]:
data_test

array([[    0,     0,     0, ...,    17,   513,  1020],
       [    0,     0,     0, ...,  1509,  1595,  3862],
       [    0,     0,     0, ...,  1685,  2348,   854],
       ...,
       [    0,     0,     0, ...,   598,   202,   302],
       [    0,     0,     0, ...,  1874,   380, 11823],
       [    0,     0,     0, ...,   317,   106,  1350]], dtype=int32)
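
One thing to keep in mind here: a tokenizer fit separately on the test texts hands out its own word indices, so the same word can get a different number than it had during training. The toy sketch below shows the effect with a tiny hand-rolled word index. `fit_word_index` and `to_sequence` are simplified stand-ins written for this example, not the Keras `Tokenizer`, and the texts are made up.

```python
from collections import Counter


def fit_word_index(texts):
    """Map each word to a 1-based rank, most frequent first (ties alphabetical)."""
    counts = Counter(word for text in texts for word in text.split())
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i for i, word in enumerate(ranked, start=1)}


def to_sequence(text, word_index):
    """Encode a text, dropping words that are missing from the index."""
    return [word_index[w] for w in text.split() if w in word_index]


train_texts = ["gold snuff box", "gold hill", "never occur"]
test_texts = ["snuff box found", "box gold"]

train_index = fit_word_index(train_texts)
test_index = fit_word_index(test_texts)

# The same word gets a different index in the two vocabularies.
print(train_index["gold"], test_index["gold"])

# Encoding test text with the training index keeps the numbers consistent.
shared = to_sequence("box gold", train_index)
print(shared)
```

In the notebook, the equivalent would presumably be to call `texts_to_sequences` on `df_test['text']` with the tokenizer that was fit on `df['text']`, without fitting it again.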

In [18]:
from keras.utils import to_categorical
import keras
from numpy import argmax
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

We would like to see how the models are doing before blindly classifying the book snippets, right?

Let's hold out 20% of the training set (an 80/20 split) for evaluation beforehand then.

In [19]:
X_train, X_test, y_train, y_test = train_test_split(data, df["author_label"], test_size=0.2, random_state=0)

labels_train = keras.utils.to_categorical(y_train, num_classes=3)
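
For reference, `to_categorical` turns each integer label into a one-hot row. The sketch below mimics that in plain Python. `one_hot` is a hypothetical helper for illustration only.

```python
def one_hot(labels, num_classes):
    """Return one row per label with a 1.0 at the label's position."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0
        rows.append(row)
    return rows


# The author codes of the first five training rows shown earlier.
encoded = one_hot([0, 1, 0, 2, 1], num_classes=3)
for row in encoded:
    print(row)
```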

# Long short-term memory (RNN model)

Building the model and training it on the 80% portion of the data.

In [20]:
## Network architecture
model_lstm = Sequential()
model_lstm.add(Embedding(20000, 100, input_length=50))
model_lstm.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model_lstm.add(Dense(3, activation='sigmoid'))
model_lstm.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

## Fit the model
model_lstm.fit(X_train, labels_train, validation_split=0.4, epochs=3)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.src.callbacks.History at 0x2880fac90>

In [21]:
model_lstm.evaluate(X_train,labels_train)



[0.2933073043823242, 0.900976836681366]

Let's see how accurate the model is when predicting the authors of the held-out 20% of the training dataset.

In [22]:
y_pred = model_lstm.predict(X_test)

cr_pred = np.argmax(y_pred, axis=1)

print(classification_report(y_test, cr_pred))

pd.crosstab(y_test,cr_pred)

              precision    recall  f1-score   support

           0       0.80      0.82      0.81      1600
           1       0.81      0.77      0.79      1102
           2       0.78      0.80      0.79      1214

    accuracy                           0.80      3916
   macro avg       0.80      0.80      0.80      3916
weighted avg       0.80      0.80      0.80      3916



col_0,0,1,2
author_label,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,1308,126,166
1,151,846,105
2,173,67,974


80% is not a bad result IMO. You can see the guessed values in the crosstab.

The rows represent the actual authors while the columns show the guessed ones.

Edgar Allan Poe was correctly guessed 82% (1308/1600) of the time.
H.P. Lovecraft sees a lower accuracy with 77% (846/1102), while Mary Wollstonecraft Shelley was classified correctly 80% (974/1214) of the time.
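
The per-author percentages can be computed directly from the crosstab: divide each diagonal entry by its row total. Here is a short sketch using the counts printed above. The dictionary layout is just one convenient way to hold the table.

```python
# Rows are the actual authors, columns the predicted ones (EAP, HPL, MWS),
# copied from the LSTM crosstab above.
confusion = {
    "EAP": [1308, 126, 166],
    "HPL": [151, 846, 105],
    "MWS": [173, 67, 974],
}
authors = list(confusion)

# Share of each author's snippets that were labelled correctly (recall).
recall = {
    author: confusion[author][i] / sum(confusion[author])
    for i, author in enumerate(authors)
}

# Overall accuracy: correct predictions over all predictions.
correct = sum(confusion[author][i] for i, author in enumerate(authors))
total = sum(sum(row) for row in confusion.values())
accuracy = correct / total

for author in authors:
    print(f"{author}: {recall[author]:.4f}")
print(f"accuracy: {accuracy:.4f}")
```

These values line up with the `recall` column of the classification report.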

In [23]:
labels = keras.utils.to_categorical(df['author_label'], num_classes=3)

model_lstm.fit(data, labels, validation_split=0.4, epochs=3)

y_pred = model_lstm.predict(data_test)

cr_pred = np.argmax(y_pred, axis=1) # gives a list of predicted values (picks the one with the highest probability)

Epoch 1/3
Epoch 2/3
Epoch 3/3


Saving the file in the desired format for submission / uploading to the Kaggle competition.


In [24]:
y_pred_df = pd.DataFrame(y_pred)
y_pred_df['id'] = df_test['id']
y_pred_df.rename(columns={0:'EAP',
                          1:'HPL',
                          2:'MWS'}, 
                 inplace=True)
print(y_pred_df.head())
y_pred_df.to_csv("lstm_result.csv", encoding='utf-8', index=False)

        EAP       HPL       MWS       id
0  0.665064  0.165110  0.716942  id02310
1  0.730497  0.465806  0.337229  id24541
2  0.914970  0.273127  0.177266  id00134
3  0.871512  0.071910  0.795108  id27757
4  0.444277  0.008854  0.991362  id04081
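
Notice that the three values in a row do not add up to 1. That is because each sigmoid output is computed independently of the others. Assuming the scoring expects one probability distribution per snippet, a simple way to get there is to divide each row by its sum, as sketched below with the first two rows printed above. A softmax output layer would be the other common option, but that changes the model and is not what was run here.

```python
# First two prediction rows from the output above (EAP, HPL, MWS).
rows = [
    [0.665064, 0.165110, 0.716942],
    [0.730497, 0.465806, 0.337229],
]

# Rescale every row so that its entries sum to 1.
normalised = [[p / sum(row) for p in row] for row in rows]

for row in normalised:
    print([round(p, 4) for p in row], "sum =", round(sum(row), 6))
```

The ranking of the authors within a row stays the same; only the scale changes.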


# Long short-term memory + Convolutional Layer

In this architecture, we take the previous model and add one convolutional layer. Convolutions are best known from image processing, but the 1D variant used here slides over the word sequence instead. This way, the model runs slightly faster, although seemingly at the expense of accuracy.

In [25]:
def create_conv_model():
    model_conv = Sequential()
    model_conv.add(Embedding(vocabulary_size, 100, input_length=50))
    model_conv.add(Dropout(0.2))
    model_conv.add(Conv1D(64, 5, activation='relu'))
    model_conv.add(MaxPooling1D(pool_size=4))
    model_conv.add(LSTM(100))
    model_conv.add(Dense(3, activation='sigmoid'))
    model_conv.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
        
    return model_conv
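
Part of the speed-up comes from the LSTM seeing a much shorter sequence. Assuming the Keras defaults (no padding and stride 1 for `Conv1D`, pooling stride equal to `pool_size` for `MaxPooling1D`), the sequence length shrinks as worked out below. The two helper functions are written for this illustration only.

```python
def conv1d_valid_length(steps, kernel_size, stride=1):
    """Output length of a 1D convolution without padding."""
    return (steps - kernel_size) // stride + 1


def maxpool1d_valid_length(steps, pool_size, stride=None):
    """Output length of 1D max pooling without padding."""
    stride = pool_size if stride is None else stride
    return (steps - pool_size) // stride + 1


after_conv = conv1d_valid_length(50, kernel_size=5)         # 50 - 5 + 1
after_pool = maxpool1d_valid_length(after_conv, pool_size=4)

print("after Conv1D:", after_conv)
print("after MaxPooling1D:", after_pool)
```

So the LSTM processes 11 time steps of 64 features each, instead of 50 steps of 100 features.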

In [26]:
model_conv = create_conv_model()
model_conv.fit(X_train, labels_train, validation_split=0.4, epochs = 3)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.src.callbacks.History at 0x1781f7510>

In [27]:
y_pred = model_conv.predict(X_test)

cr_pred = np.argmax(y_pred, axis=1)

print(classification_report(y_test, cr_pred))

pd.crosstab(y_test,cr_pred)

             precision    recall  f1-score   support

          0       0.75      0.79      0.77      1600
          1       0.75      0.75      0.75      1102
          2       0.76      0.70      0.73      1214

avg / total       0.75      0.75      0.75      3916



col_0,0,1,2
author_label,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,1260,161,179
1,187,832,83
2,241,121,852


Given the fact that we actually expanded the initial architecture, the decrease in accuracy is not so satisfying.

Edgar Allan Poe was correctly guessed 79% (1260/1600) of the time.
H.P. Lovecraft receives a worse accuracy with 75% (832/1102), while Mary Wollstonecraft Shelley was classified correctly 70% (852/1214) of the time.

In [27]:
labels = keras.utils.to_categorical(df['author_label'], num_classes=3)

model_conv.fit(data, labels, validation_split=0.4, epochs=3)

y_pred = model_conv.predict(data_test)

cr_pred = np.argmax(y_pred, axis=1) # gives a list of predicted values (picks the one with the highest probability)

Epoch 1/3
Epoch 2/3
Epoch 3/3


Saving the file in the desired format for submission / uploading to the Kaggle competition.

In [28]:
y_pred_df = pd.DataFrame(y_pred)
y_pred_df['id'] = df_test['id']
y_pred_df.rename(columns={0:'EAP',
                          1:'HPL',
                          2:'MWS'}, 
                 inplace=True)
print(y_pred_df.head())
y_pred_df.to_csv("conv_result.csv", encoding='utf-8', index=False)

        EAP       HPL       MWS       id
0  0.042708  0.955861  0.008144  id02310
1  0.687001  0.206237  0.021206  id24541
2  0.111675  0.860745  0.005154  id00134
3  0.468072  0.001704  0.583217  id27757
4  0.147549  0.010035  0.887747  id04081


# LSTM + CNN + Glove

This model builds upon the previous architecture by utilizing pre-trained GloVe word embeddings. The yelp project works with 100-dimensional vectors; I have decided to use the 300-dimensional ones, because why not. The switch from 100D to 300D improved the accuracy slightly.

In [29]:
embeddings_index = dict()

# Each line holds a word followed by the 300 components of its vector;
# the with-block makes sure the file is closed once it has been read
with open('glove.6B.300d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Loaded %s word vectors.' % len(embeddings_index))

Loaded 400000 word vectors.


In [30]:
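# Create a weight matrix: row i holds the GloVe vector of the word with index i
# NOTE: `tokenizer` was last fit on the test texts (cell 16), so these indices
# follow that vocabulary rather than the one behind the training sequences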
embedding_matrix = np.zeros((vocabulary_size, 300))
for word, index in tokenizer.word_index.items():
    if index > vocabulary_size - 1:
        break
    else:
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[index] = embedding_vector
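
Words without a GloVe vector keep an all-zero row in `embedding_matrix`, so it is worth checking how much of the vocabulary is actually covered. The sketch below does that on toy data. Both dictionaries are tiny made-up stand-ins for `tokenizer.word_index` and `embeddings_index`, and the assumption that stemmed tokens such as "howev" are missing from GloVe is mine, not something verified in this notebook.

```python
# Hypothetical stand-in for tokenizer.word_index (stemmed tokens).
word_index = {"gold": 1, "howev": 2, "never": 3, "dimens": 4}

# Hypothetical stand-in for embeddings_index (2-D vectors for brevity).
embeddings_index = {
    "gold": [0.1, 0.2],
    "never": [0.3, 0.4],
    "however": [0.5, 0.6],
}

# Split the vocabulary into words with and without a pre-trained vector.
covered = [w for w in word_index if w in embeddings_index]
missing = [w for w in word_index if w not in embeddings_index]
coverage = len(covered) / len(word_index)

print("covered:", covered)
print("missing:", missing)
print(f"coverage: {coverage:.0%}")
```

If the coverage on the real vocabulary turns out to be low, skipping the stemming step for this model might be worth a try, since GloVe was trained on unstemmed words.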

In [31]:
model_glove = Sequential()
model_glove.add(Embedding(vocabulary_size, 300, input_length=50, weights=[embedding_matrix], trainable=False))
model_glove.add(Dropout(0.2))
model_glove.add(Conv1D(64, 5, activation='relu'))
model_glove.add(MaxPooling1D(pool_size=4))
model_glove.add(LSTM(300))
model_glove.add(Dense(3, activation='sigmoid'))
model_glove.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [32]:
model_glove.fit(X_train, labels_train, validation_split=0.4, epochs=3)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.src.callbacks.History at 0x179231710>

In [33]:
y_pred = model_glove.predict(X_test)

cr_pred = np.argmax(y_pred, axis=1)

print(classification_report(y_test, cr_pred))

pd.crosstab(y_test,cr_pred)

              precision    recall  f1-score   support

           0       0.57      0.72      0.63      1600
           1       0.65      0.33      0.44      1102
           2       0.52      0.57      0.55      1214

    accuracy                           0.56      3916
   macro avg       0.58      0.54      0.54      3916
weighted avg       0.58      0.56      0.55      3916



col_0,0,1,2
author_label,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,1145,116,339
1,430,368,304
2,439,79,696


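Looks like the more complex we go, the worse results we get. Two things may be hurting this model in particular, though: the embedding matrix was built from `tokenizer` after it had been refit on the test texts, so its rows may not line up with the word indices in `X_train`, and stemmed tokens such as "howev" probably have no GloVe vector and stay as zero rows.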

Edgar Allan Poe - 72% (1145/1600).
H.P. Lovecraft - 33% (368/1102).
Mary Wollstonecraft Shelley - 57% (696/1214).

In [34]:
labels = keras.utils.to_categorical(df['author_label'], num_classes=3)

model_glove.fit(data, labels, validation_split=0.4, epochs=3)

y_pred = model_glove.predict(data_test)

cr_pred = np.argmax(y_pred, axis=1) # gives a list of predicted values (picks the one with the highest probability)

Epoch 1/3
Epoch 2/3
Epoch 3/3


Saving the file in the desired format for submission / uploading to the Kaggle competition.

In [35]:
y_pred_df = pd.DataFrame(y_pred)
y_pred_df['id'] = df_test['id']
y_pred_df.rename(columns={0:'EAP',
                          1:'HPL',
                          2:'MWS'}, 
                 inplace=True)
print(y_pred_df.head())
y_pred_df.to_csv("glove_result.csv", encoding='utf-8', index=False)

        EAP       HPL       MWS       id
0  0.198286  0.822722  0.030150  id02310
1  0.398770  0.414429  0.105664  id24541
2  0.110008  0.828182  0.061010  id00134
3  0.286303  0.010103  0.709364  id27757
4  0.545852  0.115983  0.305244  id04081
