# Spooky Author Identification
_____

![cat eyes](http://www.sciencealert.com/images/articles/processed/catseyes_1024.jpg)

## Introduction
___
In this year's Kaggle Halloween playground competition, we are being challenged to predict the author of excerpts from horror stories by Edgar Allan Poe, Mary Shelley, and HP Lovecraft.

## Approach
___
Using libraries such as [string](https://docs.python.org/2/library/string.html), [re](https://docs.python.org/2/library/re.html) and [nltk](http://www.nltk.org), we begin by cleaning the corpus - removing punctuation marks and numbers, converting all letters to lowercase and removing stopwords. 

With the help of visualisation libraries such as [matplotlib](https://matplotlib.org), [seaborn](http://seaborn.pydata.org) and [wordcloud](https://github.com/amueller/word_cloud), we are able to identify the most frequent terms used by the authors in their writing.

We will also use [nltk's inbuilt Part-of-Speech Tagging function](http://www.nltk.org/book/ch05.html) to classify the words into their parts of speech. We do so because some authors might use certain tags relatively more often than their counterparts. If this is the case, the POS tag counts will help identify which author wrote a given sentence.

We use [sklearn's](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) `CountVectorizer` to convert the cleaned corpus into a matrix, with each row being a particular document, and each column a particular term. Each entry in the matrix is the number of times the term appears in that document. Next, we apply tf-idf weighting (see [sklearn's](http://scikit-learn.org/dev/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) `TfidfVectorizer`) to account for the importance of a particular term - intuitively, a word that occurs frequently in a document (think 'Messi' in a soccer article) but rarely in the rest of the corpus tends to be an important term. On the other hand, words that occur frequently in all documents (think 'and', 'if', 'the') are deemed to have relatively low importance.

Following the generation of the term frequency dataframe, we conduct Topic Modelling using [gensim](https://radimrehurek.com/gensim/index.html) to identify the 'hidden' topics within the corpus. We then allocate a topic to each document depending on the terms present in the document.

Finally, we pulled all our features together and fitted a simple [Logistic Regression model](https://en.wikipedia.org/wiki/Logistic_regression), using [sklearn's GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) to search for the best model by varying C, the inverse of the regularisation strength. On the holdout testing dataset, the preliminary model obtained a logarithmic loss of 0.445 (lower is better). We then re-fitted the model on the whole training dataset, using the best parameter we found (C=10).

## Evaluation Metric
___
Similar to Kaggle's evaluation metric, we will use the [multi-class logarithmic loss](https://www.kaggle.com/c/spooky-author-identification#evaluation).
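
To make the metric concrete, here is a minimal pure-Python sketch of multi-class logarithmic loss: the mean negative log-probability that the model assigns to the true author. The function name `multiclass_log_loss`, the toy predictions, and the clip-then-renormalise step are assumptions for illustration, loosely modelled on the competition's description rather than taken from its scoring code.

```python
import math

def multiclass_log_loss(y_true, y_prob, labels, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for true_label, probs in zip(y_true, y_prob):
        # Clip to avoid log(0), then renormalise so the row sums to 1
        clipped = [min(max(p, eps), 1 - eps) for p in probs]
        row_sum = sum(clipped)
        p_true = clipped[labels.index(true_label)] / row_sum
        total += -math.log(p_true)
    return total / len(y_true)

labels = ['EAP', 'HPL', 'MWS']
y_true = ['EAP', 'MWS']

# Toy predictions (hypothetical): a uniform guess and a confident, correct model
uniform = [[1 / 3, 1 / 3, 1 / 3], [1 / 3, 1 / 3, 1 / 3]]
confident = [[0.9, 0.05, 0.05], [0.1, 0.1, 0.8]]

uniform_loss = multiclass_log_loss(y_true, uniform, labels)
confident_loss = multiclass_log_loss(y_true, confident, labels)
print(uniform_loss)    # ln(3), roughly 1.0986
print(confident_loss)  # roughly 0.164
```

A uniform guess over three authors scores ln(3), so any useful model should land well below roughly 1.10.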


## Afternote
___
Our model scored 0.42827. The score was good enough to place us at 214 of 584 teams (37th percentile). As it turns out, a simple logistic regression model with good features can outperform a sophisticated machine learning algorithm with poor features.

If you found this useful, please visit the other kernels that were instrumental in helping me formulate hypotheses, generate new features and challenge some of my beliefs:

1. [Abhishek - Approaching (Almost) Any NLP Problem on Kaggle](https://www.kaggle.com/abhishek/approaching-almost-any-nlp-problem-on-kaggle)
2. [Anisotropic - Spooky NLP and Topic Modelling tutorial](https://www.kaggle.com/arthurtok/spooky-nlp-and-topic-modelling-tutorial)
3. [Heads or Tails - Treemap House of Horror: Spooky EDA/LDA/Features](https://www.kaggle.com/headsortails/treemap-house-of-horror-spooky-eda-lda-features)

### Importing key libraries and reading dataframe

To facilitate data processing and cleaning, we will import the following libraries:

* [pandas](http://pandas.pydata.org) - pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

* [numpy](http://www.numpy.org) - NumPy is the fundamental package for scientific computing with Python.

* [nltk](http://www.nltk.org) - NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.

* [re](https://docs.python.org/2/library/re.html) - This module provides regular expression matching operations similar to those found in Perl. Both patterns and strings to be searched can be Unicode strings as well as 8-bit strings.

* [Matplotlib](https://matplotlib.org) - Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shell, the jupyter notebook, web application servers, and four graphical user interface toolkits.

* [seaborn](http://seaborn.pydata.org) - Seaborn is a Python visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics.

* [wordcloud](https://github.com/amueller/word_cloud) - Wordcloud is a little word cloud generator in Python.

In [None]:
import pandas as pd
import numpy as np
import nltk
import re
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud

In [None]:
df_train = pd.read_csv('../input/train.csv')
df_test = pd.read_csv('../input/test.csv')

Let's combine the two dataframes.

In [None]:
combined = pd.concat([df_train, df_test]).reset_index(drop=True)

### Exploratory Data Analysis

After reading the dataframes, let's take a look at the first 5 rows of the combined dataframe!

In [None]:
combined.head()

In this instance, we have 3 columns:

* author - This is our target label: we are supposed to predict the author from the text.
* id - This is probably a unique identifier that has no correlation with the author.
* text - This is our independent variable/feature to predict the target label, author.

What is the dimension of the dataframe?

In [None]:
print('The training dataset has %d rows and %d columns' % (df_train.shape[0], df_train.shape[1]))
print('The testing dataset has %d rows and %d columns' % (df_test.shape[0], df_test.shape[1]))
print('The combined dataset has %d rows and %d columns' % (combined.shape[0], combined.shape[1]))

In [None]:
np.sum(pd.isnull(combined))

It appears that there are no NA values in the id and text columns. Also, we note that there are 8,392 missing author values, which matches the number of rows in the testing dataset. Nothing seems to be too alarming here.

How many unique authors are there in the dataset, and what does the author distribution look like?

In [None]:
combined.author.value_counts()

It turns out that there are 3 authors in our dataset, and the author distribution looks to be **quite** uniform.

The authors in the dataset are respectively:

1. **EAP** - Edgar Allan Poe
2. **MWS** - Mary Shelley
3. **HPL** - H.P. Lovecraft

#### Most Common Words

What are the most common words used by each author in this corpus? Let's find out.

In [None]:
import string
import operator
from collections import OrderedDict
sns.set(font_scale=1.25)

def top_20_words(author):
    # Strip punctuation character by character, lowercase, then split into lists of words
    common_words_df = (combined[combined['author'] == author].text
                       .apply(lambda x: ''.join([char for char in x if char not in string.punctuation]))
                       .str.lower()
                       .str.split(' '))
    
    # Returns a dictionary where key = words and values = word counts
    dict_of_word_count = {}
    for text in common_words_df:
        for word in text:
            dict_of_word_count[word] = dict_of_word_count.get(word, 0) + 1

    return sorted(dict_of_word_count.items(), key=operator.itemgetter(1), reverse=True)[:20]

def plot_top_20_words(author):
    plt.figure(figsize=(20, 12))
    topwords = top_20_words(author)
    
    words, freq = list(zip(*topwords))[0], list(zip(*topwords))[1]
    
    x_pos = np.arange(len(words)) 
    
    # Recent seaborn releases require x and y to be passed as keyword arguments
    sns.barplot(x=x_pos, y=list(freq))
    plt.xticks(x_pos, words)
    plt.title('Top 20 words of: ' + author)
    plt.show()

##### EAP

In [None]:
plot_top_20_words('EAP')

In [None]:
wordcloud = WordCloud().generate(str(combined[combined.author=='EAP'].text.tolist()))

plt.figure(figsize=(20, 15))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

##### HPL

In [None]:
plot_top_20_words('HPL')

In [None]:
wordcloud = WordCloud().generate(str(combined[combined.author=='HPL'].text.tolist()))

plt.figure(figsize=(20, 15))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

##### MWS

In [None]:
plot_top_20_words('MWS')

In [None]:
wordcloud = WordCloud().generate(str(combined[combined.author=='MWS'].text.tolist()))

plt.figure(figsize=(20, 15))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

Judging by the top 20 most common words used by the 3 authors, it appears that they are very similar. Words such as 'the', 'and', 'of', 'to' and 'a' tend to come up very frequently. It turns out that the most frequent words alone will not help in differentiating between the 3 authors.

### Feature Engineering with re

Before using the Natural Language ToolKit (NLTK), let's think about some features we can use to predict the author who wrote the text. Here are some examples of features we can generate:

1. Length of sentence - If we believe certain authors tend to be more verbose, then the length of sentence can help us identify these authors better.
2. Number of words - Similar to the length of sentence, this captures the author's verbosity.
3. Number of punctuation marks - If some authors are more likely to use exclamation marks (!) or apostrophes (') in general, then the number of punctuation marks will be a good feature that we can consider.
4. Number of capital letters - Some authors might use more capital letters in their writing.
5. Average word length - Some authors tend to use longer words compared to their counterparts.

In [None]:
combined['sent_length'] = combined.text.apply(lambda x: len(x))

In [None]:
def word_count(text):
    return len(''.join([word.lower() for word in text if word not in string.punctuation]).split(' '))

combined['word_length'] = combined.text.apply(word_count)

In [None]:
combined['punc_marks'] = (combined
                          .text
                          .apply(lambda x: 
                                 len(re.findall('[!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~]', x))) /
                          combined.sent_length)

In [None]:
combined['cap_letter'] = (combined
                          .text
                          .apply(lambda x: 
                                 len(re.findall('[A-Z]', x))) /
                          combined.sent_length)

In [None]:
combined['avg_word_length'] = combined.sent_length / combined.word_length

After the creation of these features, let's take a look at their distributions according to the 3 authors!

##### Sentence Length

In [None]:
plt.figure(figsize=(20, 12))

sns.distplot(combined[combined.author == 'EAP'].sent_length, color = 'salmon', 
             bins=np.linspace(0, 1000, 101), kde=False, norm_hist=True, label = 'EAP')
sns.distplot(combined[combined.author == 'HPL'].sent_length, color = 'steelblue', 
             bins=np.linspace(0, 1000, 101), kde=False, norm_hist=True, label = 'HPL')
sns.distplot(combined[combined.author == 'MWS'].sent_length, color = 'seagreen', 
             bins=np.linspace(0, 1000, 101), kde=False, norm_hist=True, label = 'MWS')

plt.title('Sentence Length')
plt.legend()
plt.show()

On average, the 3 authors tend to write sentences of similar length (about 200 - 300 characters). We do note that Lovecraft writes relatively long sentences, while Poe writes relatively short sentences.

##### Word Length

In [None]:
plt.figure(figsize=(20, 12))

sns.distplot(combined[combined.author == 'EAP'].word_length, color = 'salmon',
             bins=np.linspace(0, 200, 101), kde=False, norm_hist=True, label = 'EAP')
sns.distplot(combined[combined.author == 'HPL'].word_length, color = 'steelblue', 
             bins=np.linspace(0, 200, 101), kde=False, norm_hist=True, label = 'HPL')
sns.distplot(combined[combined.author == 'MWS'].word_length, color = 'seagreen', 
             bins=np.linspace(0, 200, 101), kde=False, norm_hist=True, label = 'MWS')

plt.title('Word Length')
plt.legend()
plt.show()

The word length feature (the number of words in each sentence) appears to be strongly correlated with the sentence length feature. This isn't surprising at all. (**Why?**)

##### Number of Punctuations

In [None]:
plt.figure(figsize=(20, 12))

sns.distplot(combined[combined.author == 'EAP'].punc_marks, color='salmon', label = 'EAP',
             bins = np.linspace(0, 0.2, 101), kde=False, norm_hist=True)
sns.distplot(combined[combined.author == 'HPL'].punc_marks, color='steelblue', label = 'HPL',
             bins = np.linspace(0, 0.2, 101), kde=False, norm_hist=True)
sns.distplot(combined[combined.author == 'MWS'].punc_marks, color='seagreen', label = 'MWS',
             bins = np.linspace(0, 0.2, 101), kde=False, norm_hist=True)
plt.title("Average Number of Punctuation Marks Used")
plt.legend()

plt.show()

As it turns out, the share of punctuation marks used appears to be quite good at separating the 3 authors. We note that Poe and Shelley tend to use more punctuation marks in their sentences (normalised by sentence length in characters), compared to Lovecraft.

##### Number of Capital Letters

In [None]:
plt.figure(figsize=(20, 12))

sns.distplot(combined[combined.author == 'EAP'].cap_letter, color='salmon', label = 'EAP',
             bins = np.linspace(0, 0.2, 101), kde=False, norm_hist=True)
sns.distplot(combined[combined.author == 'HPL'].cap_letter, color='steelblue', label = 'HPL',
             bins = np.linspace(0, 0.2, 101), kde=False, norm_hist=True)
sns.distplot(combined[combined.author == 'MWS'].cap_letter, color='seagreen', label = 'MWS',
             bins = np.linspace(0, 0.2, 101), kde=False, norm_hist=True)
plt.title("Average Number of Capital Letters Used")
plt.xlabel('Average Number of Capital Letters')
plt.legend()

plt.show()

The share of capital letters used in a sentence doesn't seem to be terribly informative of the target label: on average, all 3 authors use similar amounts of capital letters.

##### Average Word Length

In [None]:
plt.figure(figsize=(20, 12))

sns.distplot(combined[combined.author == 'EAP'].avg_word_length, color='salmon', label = 'EAP',
             bins = np.linspace(0, 12, 61), kde=False, norm_hist=True)
sns.distplot(combined[combined.author == 'HPL'].avg_word_length, color='steelblue', label = 'HPL',
             bins = np.linspace(0, 12, 61), kde=False, norm_hist=True)
sns.distplot(combined[combined.author == 'MWS'].avg_word_length, color='seagreen', label = 'MWS',
             bins = np.linspace(0, 12, 61), kde=False, norm_hist=True)
plt.title("Average Word Length")
plt.xlabel('Average Word Length')
plt.legend()

plt.show()

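We note that the average word length across the 3 authors doesn't seem to deviate too much. For most sentences, it falls between 4 and 8 characters. (As computed here, the figure is sentence length divided by word count, so it also counts spaces and punctuation.)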
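
The `avg_word_length` feature is a ratio of total characters to word count, so spaces and punctuation inflate it slightly. As a hedged alternative (not what this notebook uses), the sketch below averages the lengths of the tokens themselves. The helper name `mean_token_length` and the example sentence are made up for illustration.

```python
import string

def mean_token_length(text):
    """Average number of characters per word, ignoring punctuation and spaces."""
    # Strip punctuation characters, then split on any run of whitespace
    cleaned = ''.join(ch for ch in text if ch not in string.punctuation)
    tokens = cleaned.split()
    if not tokens:
        return 0.0
    return sum(len(tok) for tok in tokens) / len(tokens)

sentence = "It was a dark, stormy night."

# Ratio in the style of the notebook: all characters divided by number of words
ratio = len(sentence) / len(sentence.split(' '))
print(ratio)

# Token-based average: letters only
print(mean_token_length(sentence))  # 3.5
```

The two numbers differ by roughly one character per word, which is mostly the space that follows each word.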

### Feature Engineering with NLTK

Let's use the Natural Language Toolkit library in Python to generate more features!

#### Stopwords, Lemmatization and Stemming

First, we remove punctuation from the text column of our combined dataframe, and convert capital letters into small letters. Next, we remove stopwords from the cleaned text and *lemmatize* it using NLTK's inbuilt `WordNetLemmatizer`. In addition, we will also conduct word stemming using NLTK's inbuilt `PorterStemmer`.

In [None]:
# Cleaning text - removing punctuation, and converting capital letters to small letters
def list_of_words(text):
    return ''.join([word.lower() for word in text if word not in string.punctuation]).split(' ')

wordlist = combined.text.apply(list_of_words)

In [None]:
# Cleaning text - removing non-alphanumeric characters
def remove_spaces(text):
    return ' '.join([re.sub('[^a-zA-Z0-9]', ' ', word) for word in text]).split(' ')

wordlist = wordlist.apply(remove_spaces)

In [None]:
# Removing stopwords from the list of words
from nltk.corpus import stopwords

# Build the stopword set once; calling stopwords.words('english') for every word is very slow
english_stopwords = set(stopwords.words('english'))

def list_of_nonstopwords(text):
    return [word for word in text if word not in english_stopwords]

nonstopword_list = wordlist.apply(list_of_nonstopwords)

In [None]:
# Lemmatizing the list of non-stop-words
from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

def lemmatized_words(text):
    return ' '.join([lemmatizer.lemmatize(word) for word in text])

combined['lemmatized_words'] = nonstopword_list.apply(lemmatized_words)

In [None]:
# Stemming the list of non-stop-words
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

def stemmed_words(text):
    return ' '.join([stemmer.stem(word) for word in text])

combined['stemmed_words'] = nonstopword_list.apply(stemmed_words)

Let's take a look at the first 5 rows of our dataframe.

In [None]:
combined.head()

#### Most Common Lemmatized Words

After using WordNetLemmatizer to lemmatize the text in the combined dataframe, let's take a look at the most common lemmatized terms.

In [None]:
def top_20_words_lemmatized(author):
    # Split each lemmatized sentence back into a list of words
    author_lemmatized = (combined[combined.author == author].lemmatized_words
                         .str.split(' '))

    # Build a dictionary where key = words and values = word counts
    dict_of_word_count = {}
    for text in author_lemmatized:
        for word in text:
            dict_of_word_count[word] = dict_of_word_count.get(word, 0) + 1

    return sorted(dict_of_word_count.items(), key=operator.itemgetter(1), reverse=True)[:20]

def plot_top_20_words_lemmatized(author):
    plt.figure(figsize=(20, 12))
    topwords = top_20_words_lemmatized(author)
    
    words, freq = list(zip(*topwords))[0], list(zip(*topwords))[1]
    
    x_pos = np.arange(len(words)) 
    # Recent seaborn releases require x and y to be passed as keyword arguments
    sns.barplot(x=x_pos, y=list(freq))
    plt.xticks(x_pos, words)
    plt.title('Top 20 lemmatized words of: ' + author)
    plt.show()

In [None]:
# Plotting top lemmatized words for Edgar Allan Poe
plot_top_20_words_lemmatized('EAP')

In [None]:
wordcloud = WordCloud().generate(str(combined[combined.author=='EAP'].lemmatized_words.tolist()))

plt.figure(figsize=(20, 15))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

In [None]:
# Plotting top lemmatized words for H.P. Lovecraft
plot_top_20_words_lemmatized('HPL')

In [None]:
wordcloud = WordCloud().generate(str(combined[combined.author=='HPL'].lemmatized_words.tolist()))

plt.figure(figsize=(20, 15))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

In [None]:
# Plotting top lemmatized words for Mary Shelley
plot_top_20_words_lemmatized('MWS')

In [None]:
wordcloud = WordCloud().generate(str(combined[combined.author=='MWS'].lemmatized_words.tolist()))

plt.figure(figsize=(20, 15))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

After removing stopwords and lemmatizing the cleaned text, there are still many common words, such as 'one' and 'seemed'. As these words occur frequently across the corpus, they are not "good" indicators of the author.

#### Part-of-Speech Tagging

According to NLTK, part-of-speech tagging, or POS-tagging is the process of classifying words into their parts of speech and labeling them accordingly. Parts of speech are also known as word classes or lexical categories. The collection of tags used for a particular task is known as a tagset.

For our analysis, we can look at the number of nouns, verbs and adjectives in each sentence, and potentially use them as features. Intuitively, writing style differs from author to author. We would expect certain authors to use more nouns, verbs or adjectives relative to other authors. Let's make use of this intuition to create more features.
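
As a small illustration of turning tags into features, the sketch below counts tags for one sentence with `collections.Counter`. The `(word, tag)` pairs are hand-written in the shape that `nltk.pos_tag` returns; they are illustrative assumptions, not actual tagger output, and `count_tags` is a hypothetical helper name.

```python
from collections import Counter

# Hand-written (word, tag) pairs; the tags are illustrative, not real tagger output
tagged_sentence = [
    ('the', 'DT'), ('old', 'JJ'), ('house', 'NN'),
    ('stood', 'VBD'), ('silent', 'JJ'), ('.', '.'),
]

def count_tags(tagged):
    """Map each POS tag to the number of times it occurs in one sentence."""
    return Counter(tag for _, tag in tagged)

tag_counts = count_tags(tagged_sentence)

# A Counter returns 0 for tags that never occurred, which suits sparse features
print(tag_counts['JJ'], tag_counts['NN'], tag_counts['VB'])  # 2 1 0
```

Each sentence then becomes one row of tag counts, with missing tags treated as zero.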

In [None]:
# Conduct POS Tagging (takes a bit of time to run this code)
pos_tags = (combined.text.apply(lambda text: nltk.pos_tag(nltk.word_tokenize(text))))

In [None]:
def pos_tag_count(list_of_postag):
    # Return a dictionary with POS tags as keys and their counts as values
    dict_of_postags = {}
    for tag in list_of_postag:
        dict_of_postags[tag[1]] = dict_of_postags.get(tag[1], 0) + 1
    return dict_of_postags
        
postags_df = pd.DataFrame(pos_tags.apply(pos_tag_count).to_dict()).T

Let's remove the punctuation tags from the POS Tagging dataframe, as they should not be very informative of the target label.

In [None]:
pos_tag_col = [col for col in postags_df.columns if re.findall('[A-Z]+', col)]

postags_df_ = postags_df[pos_tag_col].fillna(0)

To check which tags are most indicative of the target label, we can use a correlation matrix to estimate their importance.
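
For intuition, each cell of a correlation matrix between a tag count and a one-hot author column is a Pearson correlation coefficient (the default for pandas' `DataFrame.corr`). The sketch below computes one such coefficient by hand; the counts and labels are hypothetical toy values, not figures from the dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float('nan')  # undefined when either sequence is constant
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data: adjective counts for six sentences, and a 0/1
# indicator marking which of them belong to one author
adjective_counts = [5, 4, 1, 0, 1, 2]
is_author = [1, 1, 0, 0, 0, 0]

r = pearson(adjective_counts, is_author)
print(r)
```

A value near +1 or -1 suggests the tag count moves with that author's indicator, while a value near 0 suggests little linear relationship.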

In [None]:
postags_df_['EAP'] = pd.get_dummies(df_train.author).EAP
postags_df_['HPL'] = pd.get_dummies(df_train.author).HPL
postags_df_['MWS'] = pd.get_dummies(df_train.author).MWS

sns.set(font_scale=1)
plt.figure(figsize=(20,12))
sns.heatmap(postags_df_.corr())
plt.show()

Let's include these features in our model!

In [None]:
del postags_df_['EAP']
del postags_df_['HPL']
del postags_df_['MWS']

combined = pd.merge(combined, postags_df_,
                    left_index=True, right_index=True)

### Feature Engineering with Scikit-Learn

After we have engineered features with re and NLTK, let's try to generate more features using Scikit-Learn! We can use the `CountVectorizer` to keep only terms that occur a minimum number of times in the corpus. In our case, we keep n-grams (unigrams up to 10-grams, matching `ngram_range=(1, 10)` in the code) that appear in at least 3 documents.
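
To show what a minimum-frequency threshold on n-grams does, here is a pure-Python sketch. It assumes whitespace tokenisation and counts *documents* containing each n-gram, which is how `min_df` is defined. The helper names and toy documents are hypothetical, and the real `CountVectorizer` differs in details (for example, its default tokeniser lowercases text and drops single-character tokens).

```python
from collections import Counter

def ngrams(tokens, n_min, n_max):
    """Yield every contiguous n-gram (as a string) with n_min <= n <= n_max."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield ' '.join(tokens[i:i + n])

def build_vocabulary(docs, n_min=1, n_max=2, min_df=2):
    """Keep n-grams that occur in at least `min_df` documents."""
    doc_freq = Counter()
    for doc in docs:
        # set(...) so an n-gram counts once per document, however often it repeats
        doc_freq.update(set(ngrams(doc.split(), n_min, n_max)))
    return sorted(term for term, df in doc_freq.items() if df >= min_df)

docs = ['dark night', 'dark night sky', 'bright sky']
vocabulary = build_vocabulary(docs)
print(vocabulary)  # ['dark', 'dark night', 'night', 'sky']
```

Terms such as 'bright' and 'night sky' are dropped because each appears in only one document.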

In [None]:
X_train = combined.iloc[:df_train.shape[0]]
X_test = combined.iloc[df_train.shape[0]:]

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Use CountVectorizer to build n-gram counts (n = 1 to 10), dropping tokens
# that don't appear in at least 3 documents
vect = CountVectorizer(min_df=3, ngram_range=(1, 10))

train_counts_transformed = vect.fit_transform(X_train.stemmed_words)
test_counts_transformed = vect.transform(X_test.stemmed_words)

Let's take a quick look at the dimensions of our sparse count matrix.

In [None]:
train_counts_transformed

After vectorizing the n-gram counts, let's use the TfidfTransformer to convert our bag of words! Basically, [tf-idf](https://en.wikipedia.org/wiki/Tf–idf) (term frequency-inverse document frequency) is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.

Intuitively, you would expect relatively important words to occur frequently within a specific text, but not across the corpus. Think of the words 'Sherlock' and 'Holmes' - these words shouldn't occur in many different texts! On the other hand, words such as 'and', 'the' and 'of' occur frequently in many different texts, and are unimportant.
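
The weighting can be worked through by hand on a tiny count matrix. The sketch below assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 row normalisation, which matches scikit-learn's documented defaults as far as I know. The function name `tfidf` and the toy counts are illustrative.

```python
import math

def tfidf(counts):
    """Apply smoothed tf-idf weighting and L2-normalise each row.

    counts: list of rows (documents), each a list of term counts.
    """
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: number of documents containing each term
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    # Smoothed inverse document frequency (assumed formula, see lead-in)
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    weighted = []
    for row in counts:
        raw = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in raw))
        weighted.append([v / norm if norm else 0.0 for v in raw])
    return weighted

# Columns: 'the', 'raven'; rows: three toy documents
counts = [[2, 1], [3, 0], [1, 0]]
weights = tfidf(counts)
print(weights[0])
```

In the first document, 'the' appears twice and 'raven' once, yet their weights come out at about 0.76 and 0.65, because 'raven' is rare across the corpus and is boosted accordingly.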

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer(use_idf=True)

train_tfidf = tfidf.fit_transform(train_counts_transformed)
test_tfidf = tfidf.transform(test_counts_transformed)

As it turns out, 14701 features fit the bill. Let's merge the dataframes together.

In [None]:
X_train_ = pd.merge(X_train, pd.DataFrame(train_tfidf.toarray()),
                    left_index=True, right_index=True)

X_test_ = pd.merge(X_test.reset_index(drop=True), pd.DataFrame(test_tfidf.toarray()), 
                   left_index=True, right_index=True)

### Topic Modelling with Gensim

After feature engineering, let's conduct topic modelling to identify potential topics in the corpus! Intuitively, certain authors tend to write about certain topics - things which are closer to their hearts. 

Hence, the identification of such topics could potentially help to improve our prediction rates.

In [None]:
import gensim
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(min_df=3, ngram_range=(1,5), stop_words='english')

# Fit and transform
text_train = vect.fit_transform(X_train.lemmatized_words)

# Convert sparse matrix to gensim corpus.
corpus = gensim.matutils.Sparse2Corpus(text_train, documents_columns=False)

# Mapping from word IDs to words (To be used in LdaModel's id2word parameter)
id_map = dict((v, k) for k, v in vect.vocabulary_.items())

In order to carry out Topic Modelling, we have to specify a parameter, $k$, which tells gensim the number of topics in the corpus. In our case, we settle on 6 different topics. In addition, let's set a random seed to ensure that the analysis is reproducible.

In [None]:
# Use the gensim.models.ldamodel.LdaModel constructor to estimate 
# LDA model parameters on the corpus, and save to the variable `ldamodel`

random_state = 9410

ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = 6,
                                           id2word = id_map, passes = 6,
                                           random_state = random_state)

What are these topics?

In [None]:
ldamodel.show_topics(num_topics=6)

Let's write a function that returns the most likely topic, given the text.

In [None]:
def most_probable_topic(text):
    
    # Transform text into Corpus
    X = vect.transform(text)
    corpus = gensim.matutils.Sparse2Corpus(X, documents_columns=False)
    
    # inference() returns (gamma, sstats); gamma holds one row of topic weights per document
    topic_dist = ldamodel.inference(corpus)[0]
    
    # Pick the index of the largest weight in each row
    topics = topic_dist.argmax(axis=1).tolist()
    
    return topics

Now that we have obtained the topic numbers, let's convert the topics found using One-Hot Encoding. We can use pandas' inbuilt function `pd.get_dummies()` to do so.
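
The two steps (pick the most probable topic, then one-hot encode it) can be sketched in plain Python. The topic weights below are hypothetical, and `most_probable` and `one_hot` are illustrative helper names. Unlike `pd.get_dummies`, which only creates columns for values that actually occur, this sketch always emits a fixed set of columns, one way to keep training and testing frames aligned when a topic is absent from one of them.

```python
def most_probable(topic_weights):
    """Index of the largest weight in each row (ties go to the first index)."""
    return [max(range(len(row)), key=row.__getitem__) for row in topic_weights]

def one_hot(indices, num_classes, prefix='topic'):
    """One dict per row with a fixed set of `prefix_k` indicator columns."""
    return [{f'{prefix}_{k}': int(k == idx) for k in range(num_classes)}
            for idx in indices]

# Hypothetical per-document topic weights for 3 documents and 4 topics
topic_weights = [
    [0.1, 0.6, 0.2, 0.1],
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.2, 0.5, 0.1],
]

topics = most_probable(topic_weights)
rows = one_hot(topics, num_classes=4)
print(topics)   # [1, 0, 2]
print(rows[0])  # {'topic_0': 0, 'topic_1': 1, 'topic_2': 0, 'topic_3': 0}
```

Note that `topic_3` still gets a column even though no document was assigned to it.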

In [None]:
train_topics = pd.get_dummies(most_probable_topic(X_train.lemmatized_words), prefix='topic')
test_topics = pd.get_dummies(most_probable_topic(X_test.lemmatized_words), prefix='topic')

In [None]:
train_topics.head()

In [None]:
X_train = pd.merge(X_train_, train_topics, left_index=True, right_index=True)
X_test = pd.merge(X_test_, test_topics, left_index=True, right_index=True)

What are the dimensions of our training and testing dataframe? Let's take a look.

In [None]:
print('Training Dimension: ', X_train.shape)
print('Testing Dimension: ', X_test.shape)

#### Distribution of Topic Numbers according to the different Authors

In [None]:
plt.figure(figsize=(20,12))

topics = pd.DataFrame(most_probable_topic(X_train.lemmatized_words))

sns.distplot(topics.iloc[combined[combined.author=='EAP'].index.values],
             bins=range(0, 7, 1), kde=False, norm_hist=True, color='steelblue', label='EAP')
sns.distplot(topics.iloc[combined[combined.author=='HPL'].index.values],
             bins=range(0, 7, 1), kde=False, norm_hist=True, color='seagreen', label='HPL')
sns.distplot(topics.iloc[combined[combined.author=='MWS'].index.values],
             bins=range(0, 7, 1), kde=False, norm_hist=True, color='salmon', label='MWS')

plt.title('Topic Distribution')
plt.legend()
plt.show()

### Model Fitting

Let's select the features which we will be using for our prediction.

In [None]:
from sklearn.model_selection import train_test_split

X_train.columns = [str(feat) for feat in X_train.columns.tolist()]
X_test.columns = [str(feat) for feat in X_test.columns.tolist()]
features = [feat for feat in X_train.columns.tolist() 
            if feat not in ['author', 'id', 'text', 'stemmed_words', 'lemmatized_words']]

X, y = X_train[features], X_train.author.values.ravel()
X_test = X_test[features]

X_subtrain, X_subtest, y_subtrain, y_subtest = train_test_split(X, y, test_size=0.2,
                                                                random_state=random_state)

#### Logistic Regression

In our case, we use the Logistic Regression Model as our baseline model.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

logregr = LogisticRegression(random_state=random_state)

param_grid = {'C': np.logspace(-2, 2, 5)}

clf = GridSearchCV(logregr, param_grid=param_grid, scoring='neg_log_loss', cv=3)
clf.fit(X_subtrain, y_subtrain)

How well does our model generalize to the hold-out testing dataset? Let's find out.

In [None]:
from sklearn.metrics import log_loss

log_loss(y_subtest, clf.predict_proba(X_subtest))

As it turns out, our model performed pretty well on our holdout testing set, with a logarithmic loss of 0.445. Next, let's fit our model using the whole training dataset (including the holdout dataset) before predicting on the testing dataset. Note that calling `clf.fit` again re-runs the grid search on the full data.

In [None]:
clf.fit(X, y)

#### Submitting our results

In [None]:
# Take the column order from the fitted model rather than hard-coding it
X_test_pred = pd.DataFrame(clf.predict_proba(X_test), 
                           columns = clf.classes_)

In [None]:
submission = pd.read_csv('submission.csv')

X_test_pred['id'] = submission['id']
(X_test_pred.set_index('id')
 .reset_index()
 .to_csv('submission_.csv', index=False))

As it turns out, our model was able to generalize pretty well to the test set, scoring a logarithmic loss of 0.42827 (lower is better). This was enough to place us at rank 214 out of 585 (the 37th percentile in the competition)! (As of 14 Nov 2017)