# Introduction

![](https://www.telegraph.co.uk/content/dam/films/2016/10/28/cthulhu_trans_NvBQzQNjv4BqeWq0Odl7YRxHNYM74_QBWlbFJiGQSGUwQFXFdwSXZiw.jpg?imwidth=1400)

Here, I'll be looking at Kaggle's Spooky Author dataset. This is of interest to me for two reasons. First, it is an excuse to practice some basic NLP techniques. Many of the techniques I've learned over the past few months, such as latent Dirichlet allocation and recurrent neural networks, have gone unused. Second, as someone interested in horror literature as a whole, I figured this dataset would be a good place to start.

I will be using several different libraries.
* **NumPy**, **Pandas**, and **Sklearn** for data analysis and machine learning.
* **Plotly** for plotting data, and **WordCloud** for generating word clouds.

---

To begin, let me start with the following quote.

> "The most merciful thing in the world, I think, is the inability of the human mind to correlate all its contents."
> > HP Lovecraft

Well, I guess he never heard of deep learning techniques! Oh well.

# Loading the Data

In [None]:
import base64
import numpy as np
import pandas as pd
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

In [None]:
# Loading in the training data with Pandas
df = pd.read_csv("../input/train.csv")

First, let us take a look at the data itself.

In [None]:
df.head()

So we observe that each entry has three attributes, namely
* an identification number,
* the text itself,
* and an abbreviation of the author.

The `id` column can safely be ignored for the rest of the analysis; the other two columns are what we will work with.

Let's read a single text entry to get an idea of what the data looks like.

In [None]:
# Reading the full text of the first entry.
df['text'][0]

This is written by none other than Edgar Allan Poe, as indicated by the dataframe.

The entry consists of one fairly long sentence, and sentence-sized excerpts like this are exactly what we will be working with. Next, let us check how many entries we have.

In [None]:
print(df.shape)

Hence our training data consists of 19579 entries, each with three attributes (the latter we already knew).

Note that the authors are labelled by their initials. Each author has their own distinct style, which is summarized below.

1. **[EAP - Edgar Allan Poe](https://en.wikipedia.org/wiki/Edgar_Allan_Poe)** : American writer of poetry and short stories that revolve around mystery, the grisly, and the grim. Widely credited with originating "detective fiction," notably with "The Murders in the Rue Morgue."

2. **[HPL - HP Lovecraft](https://en.wikipedia.org/wiki/H._P._Lovecraft)** : Best known for his "Cthulhu Mythos." His writing style focuses heavily on fear of the unknown.

3. **[MWS - Mary Shelley](https://en.wikipedia.org/wiki/Mary_Shelley)** : Probably the most diverse of the three - she was a novelist, dramatist, travel-writer, and biographer. Her most well known work is "Frankenstein."

# Exploratory Data Analysis

## Summary statistics of the training set

Here we can visualize some basic statistics of the data, like the number of entries for each author. For this purpose, I will use the handy Plotly visualisation library to draw a simple bar plot. Plotly is probably one of the better-looking graphics libraries out there, although it takes a little more code than libraries such as *Seaborn*.

In [None]:
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls

In [None]:
z = {'EAP': 'Edgar Allan Poe', 'MWS': 'Mary Shelley', 'HPL': 'HP Lovecraft'}
# Compute the counts once so that the labels and the bar heights share the same order.
author_counts = df.author.value_counts()
data = [go.Bar(
            x = [z[a] for a in author_counts.index],
            y = author_counts.values,
            marker= dict(colorscale='Jet',
                         color = author_counts.values
                        ),
            text='Text entries attributed to Author'
    )]

layout = go.Layout(
    title='Target variable distribution'
)

fig = go.Figure(data=data, layout=layout)

py.iplot(fig, filename='basic-bar')

This gives us a good idea of how many entries we have to work with for each author.
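
The same information can also be read off as plain numbers. Below is a minimal sketch, assuming a tiny stand-in frame called `toy` (hypothetical, not the real training data) with the same `author` column; on the real data one would use `df` instead.

```python
import pandas as pd

# Hypothetical stand-in for the training frame; the notebook itself uses `df`.
toy = pd.DataFrame({
    "author": ["EAP", "EAP", "HPL", "MWS", "EAP", "MWS"],
})

# Absolute counts per author, sorted from most to least frequent.
counts = toy["author"].value_counts()
# Fractions per author; these sum to 1.
shares = toy["author"].value_counts(normalize=True)

# Both series are indexed by author, so they align automatically.
summary = pd.DataFrame({"count": counts, "share": shares.round(3)})
print(summary)
```

Looking at the shares is a quick way to judge whether the classes are balanced enough that plain accuracy would be a sensible metric later on.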

## WordClouds

One very handy visualization tool for a data scientist working on natural language processing is the word cloud. A word cloud (as the name suggests) is an image made up of the distinct words in a text or book, where the size of each word is proportional to its frequency in that text (the number of times the word appears). Here, instead of dealing with an actual book, our words are simply taken from the "text" column.
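
To make the "size follows frequency" idea concrete, here is a rough sketch of the counting step using `collections.Counter`. The two sentences and the tiny stopword set are made up for illustration; the WordCloud library does its own tokenizing and stopword removal internally, so this is not its actual implementation.

```python
import re
from collections import Counter

# Hypothetical mini-corpus; in the notebook this would be one author's texts.
texts = [
    "The night was dark and the sea was strange.",
    "A strange dark shape rose from the sea.",
]

# A deliberately tiny stopword set, for illustration only.
stopwords = {"the", "was", "and", "a", "from"}

# Lowercase, split into words, and drop the stopwords.
tokens = [
    word
    for text in texts
    for word in re.findall(r"[a-z']+", text.lower())
    if word not in stopwords
]

# Map each remaining word to the number of times it appears.
frequencies = Counter(tokens)
print(frequencies.most_common(3))
```

A dictionary like `frequencies` can be handed to `WordCloud.generate_from_frequencies` if you want full control over the counting step.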

**Store the text of each author in a NumPy array**

After importing the word cloud tools and defining a plotting helper, we collect the texts of Edgar Allan Poe, HP Lovecraft and Mary Shelley in three separate arrays:

In [None]:
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

In [None]:
def generate_word_cloud(text, title):
    # Generate word cloud.
    wc = WordCloud(background_color='black', max_words=1000,
                  stopwords=STOPWORDS, max_font_size=40)
    wc.generate(" ".join(text))
    
    # Plot word cloud using matplotlib.
    plt.figure(figsize=(16, 13))
    plt.title(title, fontsize=20)
    plt.imshow(wc.recolor(colormap='Pastel2', random_state=42), alpha=0.98)
    plt.axis('off')

In [None]:
eap = df[df.author=="EAP"]["text"].values
hpl = df[df.author=="HPL"]["text"].values
mws = df[df.author=="MWS"]["text"].values

## HP Lovecraft

In [None]:
generate_word_cloud(hpl, "HP Lovecraft")

As one can see, certain Lovecraftian themes are present. Good examples would be "dark," "thought," and "strange." These themes are buried among several common words, such as "place" and "seemed." Despite removing stopwords, it is almost inevitable that some common words will still dominate.

## Edgar Allan Poe

In [None]:
generate_word_cloud(eap, "Edgar Allan Poe")

## Mary Shelley

In [None]:
generate_word_cloud(mws, "Mary Shelley")

On the other hand, one can see that Mary Shelley's words revolve around primal instincts and themes of morality, ranging from the positive to the negative end of the spectrum, such as "friend", "fear", "hope", and "spirit". One word that stands out is "Raymond." I had to dig a little to figure out why such a word appeared, to be honest. Among Mary Shelley's lesser-known works is [The Last Man](https://en.wikipedia.org/wiki/The_Last_Man), in which Lord Raymond is one of the characters.

**Term frequencies**

An alternative strategy is simply to plot the term frequencies as a bar chart.

In [None]:
def plot_frequent_word(count_vec, author, feature_names):
    # Pair each word with its total count and sort by count, highest first.
    zipped = list(zip(feature_names, count_vec))
    x, y = (list(x) for x in zip(*sorted(zipped, key=lambda x: x[1], reverse=True)))
    # Plotting the Plotly bar chart for the top 50 word frequencies
    data = [go.Bar(
                x = x[0:50],
                y = y[0:50],
                marker= dict(colorscale='Jet',
                             color = y[0:50]
                            ),
                text='Word counts'
        )]
    layout = go.Layout(
    title='Top 50 Word frequencies (%s)' % (author)
    )

    fig = go.Figure(data=data, layout=layout)

    py.iplot(fig, filename='basic-bar')

tf_vectorizer_hpl = CountVectorizer(max_df=0.95, 
                                     min_df=2,
                                     stop_words='english',
                                     decode_error='ignore')

tf_vectorizer_eap = CountVectorizer(max_df=0.95, 
                                     min_df=2,
                                     stop_words='english',
                                     decode_error='ignore')

tf_vectorizer_mws = CountVectorizer(max_df=0.95, 
                                     min_df=2,
                                     stop_words='english',
                                     decode_error='ignore')

tf_hpl = tf_vectorizer_hpl.fit_transform(hpl)
tf_eap = tf_vectorizer_eap.fit_transform(eap)
tf_mws = tf_vectorizer_mws.fit_transform(mws)
    
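# get_feature_names() was removed in scikit-learn 1.2; older versions may need it instead.
feature_names_hpl = tf_vectorizer_hpl.get_feature_names_out()
feature_names_eap = tf_vectorizer_eap.get_feature_names_out()
feature_names_mws = tf_vectorizer_mws.get_feature_names_out()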


count_vec_hpl = np.asarray(tf_hpl.sum(axis=0)).ravel()
count_vec_eap = np.asarray(tf_eap.sum(axis=0)).ravel()
count_vec_mws = np.asarray(tf_mws.sum(axis=0)).ravel()

plot_frequent_word(count_vec_hpl, 'HP Lovecraft', feature_names_hpl)
plot_frequent_word(count_vec_eap, 'Edgar Allan Poe', feature_names_eap)
plot_frequent_word(count_vec_mws, 'Mary Shelley', feature_names_mws)
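
The `min_df` and `max_df` arguments used above decide which words make it into the vocabulary at all. Here is a small sketch of their effect on a made-up four-document corpus (the documents are hypothetical, chosen so that the pruning is easy to follow).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents: "dark" is in all four, "strange" and "shape" in one each.
docs = [
    "dark night dark sea",
    "dark sea strange",
    "dark shape",
    "dark night",
]

# No pruning: every token is kept.
full = CountVectorizer().fit(docs)

# min_df=2 drops words found in fewer than 2 documents;
# max_df=0.95 drops words found in more than 95% of the documents.
pruned = CountVectorizer(min_df=2, max_df=0.95).fit(docs)

print(sorted(full.vocabulary_))
print(sorted(pruned.vocabulary_))
```

Here "dark" is removed for being in every document, and "strange" and "shape" for being in only one, which leaves "night" and "sea". On the real data the same thresholds trim both very rare words and near-universal ones.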

# Latent Dirichlet Allocation

Latent Dirichlet Allocation, denoted by LDA from here on (not to be confused with [Linear Discriminant Analysis](https://en.wikipedia.org/wiki/Linear_discriminant_analysis)), is a widely used method for extracting topics from a corpus of documents. A fairly technical, although understandable, explanation can be found in the [original paper by Blei, Ng, and Jordan](http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf).

Without going into too much detail about the math behind the algorithm, there are two basic assumptions behind the algorithm.

1. Each document in the corpus is a mixture of topics,
2. and each topic is a collection of keywords (more precisely, a probability distribution over words).
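
These two assumptions map onto two matrices that scikit-learn exposes after fitting. The sketch below uses a made-up six-sentence corpus and two topics, purely to show the shapes involved; it is not the configuration used in this notebook.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus with two loosely separated themes.
docs = [
    "dark sea strange night",
    "strange dark shape sea",
    "night sea dark",
    "friend hope love heart",
    "love hope friend",
    "heart friend love hope",
]

counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# Assumption 1: each document is a mixture of topics.
# transform() returns one row per document; each row sums to 1.
doc_topic = lda.transform(counts)

# Assumption 2: each topic is a distribution over words.
# components_ holds unnormalized weights, so normalize each row.
topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)

print(doc_topic.shape, topic_word.shape)
print(np.round(doc_topic, 2))
```

The keyword lists printed later in this notebook are just the highest-weighted columns in each row of the topic-word matrix.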

Like other forms of unsupervised learning, we don't get names for our topics, although a topic's theme can usually be inferred from its keywords. We also have to set the number of topics manually. In this case, we choose 10 topics for each author. Note, however, that there is no reason to expect that 10 is the right number of topics for any of these corpora. Some code has been borrowed [from here](https://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html#sphx-glr-auto-examples-applications-plot-topics-extraction-with-nmf-lda-py).
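
One common, if imperfect, way to sanity-check the number of topics is to compare perplexity on held-out text for a few candidate values. This is only a sketch on a made-up corpus: the documents, the candidate values, and the variable names are all assumptions for illustration, and I did not use this procedure to pick the number of topics here.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for one author's texts.
train_docs = [
    "dark sea strange night",
    "strange dark shape sea",
    "night sea dark",
    "friend hope love heart",
    "love hope friend",
    "heart friend love hope",
] * 5
heldout_docs = [
    "dark strange sea",
    "hope love friend",
]

# Fit the vocabulary on the training texts only, then reuse it for the held-out texts.
vectorizer = CountVectorizer()
train_counts = vectorizer.fit_transform(train_docs)
heldout_counts = vectorizer.transform(heldout_docs)

perplexities = {}
for n_topics in (2, 5, 10):
    model = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    model.fit(train_counts)
    # Lower held-out perplexity suggests a better fit, but it is only a rough guide.
    perplexities[n_topics] = model.perplexity(heldout_counts)

for n_topics, value in perplexities.items():
    print(n_topics, round(value, 2))
```

Perplexity does not always agree with how interpretable the topics look to a human reader, so reading the keyword lists remains the more important check.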


In [None]:
from sklearn.decomposition import NMF, LatentDirichletAllocation

def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
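
The slice `[:-n_top_words - 1:-1]` in `print_top_words` is a little cryptic, so here is a small standalone illustration of what it does. The `weights` array is made up; in the helper above it corresponds to one row of `model.components_`.

```python
import numpy as np

# Hypothetical word weights for a single topic.
weights = np.array([0.1, 0.7, 0.05, 0.4, 0.2])
n_top_words = 3

# argsort() gives indices from smallest to largest weight;
# walking backwards from the end picks the n largest, biggest first.
top = weights.argsort()[:-n_top_words - 1:-1]

# An equivalent, arguably more readable, spelling.
top_alt = weights.argsort()[::-1][:n_top_words]

print(top, top_alt)
```

Both expressions return the indices of the three largest weights in descending order, which are then used to look up the matching words in `feature_names`.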

## Mary Shelley

In [None]:
lda = LatentDirichletAllocation(n_components=10, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)

lda.fit(tf_mws)

# get_feature_names() was removed in scikit-learn 1.2; older versions may need it instead.
tf_feature_names = tf_vectorizer_mws.get_feature_names_out()

print_top_words(lda, tf_feature_names, 20)

While I am not intimately familiar with Mary Shelley's work, there are a few things that pop out at me.

* **Topic #4**: Lord Raymond and Adrian are both characters in The Last Man.

## HP Lovecraft

In [None]:
lda = LatentDirichletAllocation(n_components=10, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)

lda.fit(tf_hpl)

# get_feature_names() was removed in scikit-learn 1.2; older versions may need it instead.
tf_feature_names = tf_vectorizer_hpl.get_feature_names_out()

print_top_words(lda, tf_feature_names, 20)

## Edgar Allan Poe

In [None]:
lda = LatentDirichletAllocation(n_components=10, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)

lda.fit(tf_eap)

# get_feature_names() was removed in scikit-learn 1.2; older versions may need it instead.
tf_feature_names = tf_vectorizer_eap.get_feature_names_out()

print_top_words(lda, tf_feature_names, 20)