# Exploratory Data Analysis
Our training and test datasets were obtained from [this challenge](https://www.kaggle.com/arthurtok/spooky-nlp-and-topic-modelling-tutorial?select=train.zip).

In [36]:
# imports
import pandas as pd
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from nltk.stem import WordNetLemmatizer

In [37]:
train = pd.read_csv("train.csv")

In [38]:
train.head()

| | id | text | author |
|---|---|---|---|
| 0 | id26305 | This process, however, afforded me no means of... | EAP |
| 1 | id17569 | It never once occurred to me that the fumbling... | HPL |
| 2 | id11008 | In his left hand was a gold snuff box, from wh... | EAP |
| 3 | id27763 | How lovely is spring As we looked from Windsor... | MWS |
| 4 | id12958 | Finding nothing else, not even gold, the Super... | HPL |


Our dataframe has 3 columns: `id`, which is the primary key; `text`, which is a passage written by one author; and `author`, which takes one of 3 values.

**EAP** is Edgar Allan Poe
**HPL** is H.P. Lovecraft
**MWS** is Mary Shelley

In [39]:
train.shape

(19579, 3)

The shape shows that we have 19,579 rows of data across 3 columns.

In [40]:
train.isna().sum()

id        0
text      0
author    0
dtype: int64

Fortunately, there is no missing data.

In [41]:
train.describe()

| | id | text | author |
|---|---|---|---|
| count | 19579 | 19579 | 19579 |
| unique | 19579 | 19579 | 3 |
| top | id24217 | He declined bearing the cartel, however, and i... | EAP |
| freq | 1 | 1 | 7900 |


In [42]:
train.groupby('author').count()

| author | id | text |
|---|---|---|
| EAP | 7900 | 7900 |
| HPL | 5635 | 5635 |
| MWS | 6044 | 6044 |


Edgar Allan Poe's work has the most entries (7,900), followed by Mary Shelley (6,044) and H.P. Lovecraft (5,635).
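
To put these counts in proportion, here is a minimal sketch in plain Python that converts the three counts from the groupby output into each author's share of the dataset. The variable names are illustrative and are not part of the notebook.

```python
# Row counts per author, copied from the groupby output
author_counts = {"EAP": 7900, "HPL": 5635, "MWS": 6044}

total = sum(author_counts.values())  # 19579, the number of rows in train

# Share of the dataset contributed by each author, as a percentage
author_share = {
    author: round(100 * count / total, 1)
    for author, count in author_counts.items()
}
print(author_share)
# {'EAP': 40.3, 'HPL': 28.8, 'MWS': 30.9}
```

Poe accounts for roughly 40% of the rows, so the classes are somewhat imbalanced, but no author is rare.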

## NLP

There are four steps to preprocessing text for NLP:

1. Tokenization - Splitting the text into its individual constituent words.
2. Stopword removal - Discarding words that occur so frequently that their counts do not help distinguish one text from another. (As an aside, also consider discarding words that occur very infrequently.)
3. Lemmatization - Combining variants of a word into a single parent word that still conveys the same meaning.
4. Vectorization - Converting text into a numeric vector format. One of the simplest methods is the bag-of-words approach, which builds a matrix with one row per document in the corpus and one column per vocabulary word. In its simplest form, this matrix stores word counts.

### 1. Tokenization

Tokenization splits a text into a list of its words and punctuation marks. Here is how to do it with nltk (note that the slice `[1:20]` skips the first token, `This`):

In [43]:
first_text = train.text.values[0]
print(first_text)
print("="*90)
first_text_list = nltk.word_tokenize(first_text)
first_text_list[1:20]

This process, however, afforded me no means of ascertaining the dimensions of my dungeon; as I might make its circuit, and return to the point whence I set out, without being aware of the fact; so perfectly uniform seemed the wall.


['process',
 ',',
 'however',
 ',',
 'afforded',
 'me',
 'no',
 'means',
 'of',
 'ascertaining',
 'the',
 'dimensions',
 'of',
 'my',
 'dungeon',
 ';',
 'as',
 'I',
 'might']
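
As a rough illustration of what a tokenizer does, the sketch below splits text into word and punctuation tokens with a regular expression. It is a simplification and not how `nltk.word_tokenize` is implemented (NLTK handles contractions, abbreviations and other cases that this pattern does not); `simple_tokenize` is a hypothetical helper name.

```python
import re

def simple_tokenize(text):
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("This process, however, afforded me no means")
print(tokens)
# ['This', 'process', ',', 'however', ',', 'afforded', 'me', 'no', 'means']
```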

### 2. Stopwords

Stopword removal drops words like 'the' and 'to', which are very common in English. Punctuation tokens are not stopwords, so they remain in the output.

In [44]:
stopwords = nltk.corpus.stopwords.words('english')
first_text_list_cleaned = [word for word in first_text_list if word.lower() not in stopwords]
first_text_list_cleaned[1:20]

[',',
 'however',
 ',',
 'afforded',
 'means',
 'ascertaining',
 'dimensions',
 'dungeon',
 ';',
 'might',
 'make',
 'circuit',
 ',',
 'return',
 'point',
 'whence',
 'set',
 ',',
 'without']
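
Since the commas and semicolons survive the stopword filter, one possible refinement is to drop non-alphabetic tokens in the same pass. The sketch below assumes a tiny hand-written stopword set (`toy_stopwords`, hypothetical) so that it runs without the NLTK corpus download; `str.isalpha()` would also discard numbers and tokens such as `n't`, which may or may not be desirable.

```python
# A tiny illustrative stopword set; NLTK's English list is much longer
toy_stopwords = {"this", "me", "no", "of", "the", "my"}

tokens = ["This", "process", ",", "however", ",", "afforded", "me",
          "no", "means", "of", "ascertaining", "the", "dimensions"]

# Keep a token only if it is purely alphabetic and not a stopword
cleaned = [
    tok for tok in tokens
    if tok.isalpha() and tok.lower() not in toy_stopwords
]
print(cleaned)
# ['process', 'however', 'afforded', 'means', 'ascertaining', 'dimensions']
```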

### 3. Lemmatization 
This stage reduces the different inflected forms of a word to a single term.

Here is an example:

In [45]:
lemm = WordNetLemmatizer()
print("The lemmatized form of leaves is: {}".format(lemm.lemmatize("leaves")))

The lemmatized form of leaves is: leaf
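
Conceptually, a lemmatizer is a lookup from an inflected form to its dictionary form. The toy sketch below uses a hand-written table (`toy_lemmas`, hypothetical) purely to show the idea; `WordNetLemmatizer` consults the WordNet database instead, and it treats each word as a noun unless a `pos` argument is passed.

```python
# Hypothetical lookup table mapping inflected forms to their lemma
toy_lemmas = {"leaves": "leaf", "feet": "foot", "children": "child"}

def toy_lemmatize(word):
    # Fall back to the word itself when no lemma is known
    return toy_lemmas.get(word, word)

lemmatized = [toy_lemmatize(w) for w in ["leaves", "feet", "dungeon"]]
print(lemmatized)
# ['leaf', 'foot', 'dungeon']
```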


### 4. Vectorization

Lastly, we want to vectorize the text. We are going to use the Bag of Words approach to change a sentence into a list of numbers.

In [46]:
sentence = ["I love to eat Burgers", 
            "I love to eat Fries"]
vectorizer = CountVectorizer()
sentence_transform = vectorizer.fit_transform(sentence)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print("The features are:\n {}".format(vectorizer.get_feature_names_out().tolist()))
print("\nThe vectorized array looks like:\n {}".format(sentence_transform.toarray()))

The features are:
 ['burgers', 'eat', 'fries', 'love', 'to']

The vectorized array looks like:
 [[1 1 0 1 1]
 [0 1 1 1 1]]
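
To see where these numbers come from, here is a plain-Python sketch that rebuilds the same matrix by hand. It assumes `CountVectorizer`'s default settings: text is lowercased and only tokens of two or more word characters are kept, which is why `I` does not appear among the features. The helper name `tokenize` is illustrative.

```python
import re

sentences = ["I love to eat Burgers", "I love to eat Fries"]

def tokenize(doc):
    # Lowercase, then keep tokens of two or more word characters,
    # mirroring CountVectorizer's default token pattern
    return re.findall(r"\b\w\w+\b", doc.lower())

tokenized = [tokenize(doc) for doc in sentences]

# The vocabulary is the sorted set of all tokens
vocabulary = sorted({tok for doc in tokenized for tok in doc})

# One row per sentence, one column per vocabulary word
matrix = [[doc.count(word) for word in vocabulary] for doc in tokenized]

print(vocabulary)  # ['burgers', 'eat', 'fries', 'love', 'to']
print(matrix)      # [[1, 1, 0, 1, 1], [0, 1, 1, 1, 1]]
```

Each column counts one vocabulary word, so the two rows differ only in the `burgers` and `fries` columns.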


## Topic Modelling

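We now turn to two topic modelling techniques:

1. LDA (Latent Dirichlet Allocation): Models each topic as a probability distribution over the words in the corpus, so each topic assigns a different weight to each word, and models each document as a mixture of topics.
2. NMF (Non-negative Matrix Factorization): Approximates a non-negative input matrix as the product of two smaller non-negative matrices, here a document-topic matrix and a topic-word matrix.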
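
To make the NMF idea concrete, the sketch below factorizes a small count matrix `V` into non-negative matrices `W` (documents by topics) and `H` (topics by words) using multiplicative updates. This is a plain-Python illustration under assumptions of my own (function names, iteration count, random initialization); it is not scikit-learn's `NMF` implementation.

```python
import random

def matmul(a, b):
    # Plain nested-list matrix product
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(row) for row in zip(*m)]

def nmf(v, rank, n_iter=200, seed=0, eps=1e-9):
    rng = random.Random(seed)
    n_rows, n_cols = len(v), len(v[0])
    # Strictly positive random starting point
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n_rows)]
    h = [[rng.random() + 0.1 for _ in range(n_cols)] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps)
              for j in range(n_cols)] for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps)
              for j in range(rank)] for i in range(n_rows)]
    return w, h

def squared_error(v, w, h):
    approx = matmul(w, h)
    return sum((v[i][j] - approx[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))

counts = [[1, 1, 0, 1, 1], [0, 1, 1, 1, 1]]
w0, h0 = nmf(counts, rank=2, n_iter=0)   # the random starting point
w, h = nmf(counts, rank=2, n_iter=200)   # after the updates
print(squared_error(counts, w0, h0), squared_error(counts, w, h))
```

The second printed error should be smaller than the first, and every entry of `w` and `h` stays non-negative, which is what lets the rows of `h` be read as word weights per topic.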

In [47]:
class LemmaCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        # Wrap the default analyzer (lowercasing, tokenization, stopword removal)
        # so that every token it yields is also lemmatized
        lemm = WordNetLemmatizer()
        analyzer = super().build_analyzer()
        return lambda doc: (lemm.lemmatize(w) for w in analyzer(doc))

In [48]:
# Storing the entire training text in a list
text = list(train.text.values)
# Instantiating our subclassed count vectorizer
tf_vectorizer = LemmaCountVectorizer(max_df=0.95, 
                                     min_df=2,
                                     stop_words='english',
                                     decode_error='ignore')
tf = tf_vectorizer.fit_transform(text)
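
The `max_df=0.95` and `min_df=2` arguments prune the vocabulary by document frequency: a term is dropped if it appears in more than 95% of documents or in fewer than 2 documents. Here is a plain-Python sketch of that rule on a made-up corpus; `filter_vocabulary` and the toy documents are hypothetical.

```python
from collections import Counter

def filter_vocabulary(tokenized_docs, min_df=2, max_df=0.95):
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        # Count each term once per document
        doc_freq.update(set(tokens))
    max_count = max_df * n_docs  # max_df is a proportion, min_df a count
    return sorted(term for term, count in doc_freq.items()
                  if min_df <= count <= max_count)

toy_docs = [
    ["dark", "night", "house"],
    ["dark", "night", "sea"],
    ["dark", "old", "house"],
    ["dark", "sea", "night"],
]
kept = filter_vocabulary(toy_docs)
print(kept)
# ['house', 'night', 'sea']
```

Here `dark` is removed because it appears in all four documents, and `old` is removed because it appears in only one.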

In [49]:
lda = LatentDirichletAllocation(n_components=11, max_iter=5,
                                learning_method = 'online',
                                learning_offset = 50.,
                                random_state = 0)

lda.fit(tf)

LatentDirichletAllocation(learning_method='online', learning_offset=50.0,
                          max_iter=5, n_components=11, random_state=0)

In [50]:
def print_top_words(model, feature_names, n_top_words):
    for index, topic in enumerate(model.components_):
        message = "\nTopic #{}:".format(index)
        message += " ".join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1 :-1]])
        print(message)
        print("="*70)
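
The slice `topic.argsort()[:-n_top_words - 1:-1]` reads the ascending sort order from the end, which yields the indices of the `n_top_words` largest weights in descending order. Here is a plain-Python sketch of the same ranking, using made-up weights and words; `top_words` is a hypothetical helper.

```python
def top_words(weights, feature_names, n_top_words):
    # Indices of the weights, largest weight first
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return [feature_names[i] for i in ranked[:n_top_words]]

toy_weights = [0.1, 0.7, 0.2, 0.5]
toy_names = ["night", "death", "hand", "love"]
best = top_words(toy_weights, toy_names, 2)
print(best)
# ['death', 'love']
```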

In [51]:
print("\nTopics in LDA model: ")
tf_feature_names = tf_vectorizer.get_feature_names_out()
print_top_words(lda, tf_feature_names, 40)


Topics in LDA model: 

Topic #0:mean night fact young return great human looking wonder countenance difficulty greater wife finally set possessed regard struck perceived act society law health key fearful mr exceedingly evidence carried home write lady various recall accident force poet neck conduct investigation

Topic #1:death love raymond hope heart word child went time good man ground evil long misery replied filled passion bed till happiness memory heavy region year escape spirit grief visit doe story beauty die plague making influence thou letter appeared power

Topic #2:left let hand said took say little length body air secret gave right having great arm thousand character minute foot true self gentleman pleasure box clock discovered point sought pain nearly case best mere course manner balloon fear head going

Topic #3:called sense table suddenly sympathy machine sens unusual labour thrown mist solution suppose specie movement whispered urged frequent wine hour appears ring tu

And that's how you do it. I'm not entirely sure how it works yet either, but I am learning.