## Welcome to my Spooky Author Identification Notebook

The goal here is to use the provided dataset, which maps snippets of text to their authors, to build a model that can predict the author of any provided text.

## Strategy

In order to map text to an author, we're going to need to do some feature engineering to turn the text into useful features which will hopefully be correlated with author.  Here are some initial ideas:

1. Words per sentence
1. Vocabulary (unique words per N words)
1. Punctuation usage rates
1. Word frequency (maybe some authors like specific words)
1. Sentiment analysis
1. N-grams (analyze word groupings)

I'm sure there are many other options, but I think those will be a good place to start.

Let's begin by pulling in the training set and exploring the data a bit.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import nltk # natural language toolkit - http://www.nltk.org/

In [None]:
# We have a train.csv and test.csv available
df = pd.read_csv('../input/train.csv')

df.head()

To get a handle on this data, let's figure out some features that rely on individual word counts.

We'll need nltk to tokenize the plain text into words, then we'll count them and store the result in a new column, "word_count".  We'll also store the number of word characters per sentence ("sentence_length") and the full length of the raw text ("text_length").

In [None]:
# I want to remove punctuation from the text for word counting purposes
import string
no_punct_translator=str.maketrans('','',string.punctuation)

# tokenize each sentence and remove punctuation
df['words'] = df['text'].apply(lambda t: nltk.word_tokenize(t.translate(no_punct_translator).lower()))
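
To make the tokenization step concrete, here is a minimal sketch on a made-up sentence. It uses `str.split` as a stand-in for `nltk.word_tokenize` (an assumption, so the snippet runs without NLTK's tokenizer data); the helper name `simple_tokenize` is hypothetical too.

```python
import string

# Same kind of translation table as above: deletes every punctuation character
no_punct_translator = str.maketrans('', '', string.punctuation)

def simple_tokenize(text):
    """Strip punctuation, lowercase, and split on whitespace.

    str.split stands in for nltk.word_tokenize (an assumption for this sketch).
    """
    return text.translate(no_punct_translator).lower().split()

sample = "It was, however, a dark night; nothing stirred."  # made-up sentence
tokens = simple_tokenize(sample)
print(tokens)
print(len(tokens))  # this is what ends up in word_count
```

One side effect worth knowing about: because apostrophes are punctuation too, a contraction like "don't" becomes the single token "dont".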

In [None]:
# create a new column with the count of all words
df['word_count'] = df['words'].apply(len)

# for normalization, how many characters per sentence w/o punctuation
df['sentence_length'] = df['words'].apply(lambda w: sum(map(len, w)))

# for future calculations, let's keep around the full text length, including punctuation
df['text_length'] = df['text'].apply(len)

In [None]:
df.head()

Ok, now we have a list of words, and a count of words per sentence.  Let's do some graphing to see what the distribution of words looks like for each author.

In [None]:
# import some graphing libraries
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')

In [None]:
# Let's plot how many words per sentence each author uses
sns.boxplot(x = "author", y = "word_count", data=df, color = "red")

In [None]:
# Here's the same thing in text form.
# Not a huge distinction, but EAP seems to average fewer words per sentence
df.groupby(['author'])['word_count'].describe()

In [None]:
# here's the number of characters in each sentence, we do get a little separation here
df.groupby(['author'])['sentence_length'].describe()

Well, I was hoping we'd get more separation here.  There doesn't seem to be a huge difference in words per sentence or characters per sentence, but maybe there is enough distinction to make some decent guesses.

Next up, let's do some simple punctuation counts.  We'll use a ratio of punctuation per character to see how often the author uses non-word characters.

In [None]:
# `string.punctuation` holds all the ASCII punctuation chars;
# count every occurrence in the text, not just each distinct mark
punctuation_chars = set(string.punctuation)
df['punctuation_count'] = df['text'].apply(lambda t: sum(1 for c in t if c in punctuation_chars))

df['punctuation_per_char'] = df['punctuation_count'] / df['text_length'] 

In [None]:
df.groupby(['author'])['punctuation_per_char'].describe()
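
When counting punctuation, it matters whether we count every occurrence or only each distinct mark. A per-character rate wants the former. Here is a small sketch on a made-up sentence showing the difference (both function names are hypothetical):

```python
import string

def count_punctuation(text):
    """Count every punctuation character in text, repeats included."""
    return sum(1 for ch in text if ch in string.punctuation)

def count_distinct_punctuation(text):
    """Count how many different punctuation marks appear at least once."""
    return sum(1 for ch in string.punctuation if ch in text)

sample = "Wait, wait, wait... what?!"  # made-up sentence
total = count_punctuation(sample)              # , , . . . ? !
distinct = count_distinct_punctuation(sample)  # , . ? !
print(total, distinct)
print(total / len(sample))  # punctuation per character
```

The two counts diverge quickly for authors who lean on commas, so it's worth checking a few rows by hand.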

Now let's start working on vocabulary by figuring out the ratio of unique words to all words in a sentence.

In [None]:
def unique_words(words):
    word_count = len(words)
    unique_count = len(set(words)) # creating a set from the list 'words' removes duplicates
    return unique_count / word_count

df['unique_ratio'] = df['words'].apply(unique_words)
df.groupby(['author'])['unique_ratio'].describe()
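
One caveat, as far as I understand this kind of ratio: it tends to fall as a text gets longer, simply because common words start repeating, so it may partly act as a proxy for sentence length. A toy sketch (made-up sentences; returning 0.0 for an empty list is my own choice, not something from the dataset):

```python
def unique_ratio(words):
    """Ratio of distinct words to total words; 0.0 for an empty list (my choice)."""
    if not words:
        return 0.0
    return len(set(words)) / len(words)

short = "the raven sat".split()
longer = "the raven sat on the bust and the raven spoke".split()

print(unique_ratio(short))   # every word is distinct
print(unique_ratio(longer))  # 'the' and 'raven' repeat, so the ratio drops
```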

Seems like the usage of unique words is about the same across authors (roughly 90% of words are unique), though the distribution varies a bit.  Let's graph the distribution of unique words for each author.

In [None]:
authors = ['MWS', 'HPL', 'EAP']

for author in authors:
    sns.distplot(df[df['author'] == author]['unique_ratio'], label = author, hist=False)

plt.legend();

Pretty cool!

Now let's take a look at the size of the actual words used, in case some authors like to use longer words.

In [None]:
# add up the length of each word and divide by the total number of words
avg_length = lambda words: sum(map(len, words)) / len(words)

df['avg_word_length'] = df['words'].apply(avg_length)
df.groupby(['author'])['avg_word_length'].describe()

In [None]:
for author in authors:
    sns.distplot(df[df['author'] == author]['avg_word_length'], label = author, hist=False)

plt.legend();

## Where we are now

We have broken down the text into lots of different features, some of which might be useful to feed into a machine learning algorithm.  Let's see what we have:

1. word_count: number of words in each sentence
1. sentence_length: number of word characters in each sentence
1. text_length: total number of characters, including spaces and punctuation
1. punctuation_per_char: how often an author uses punctuation marks per character written
1. unique_ratio: ratio of unique words to total words
1. avg_word_length: average number of characters per word in each sentence
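
Since the same features will eventually have to be computed for test.csv as well, it might be handy to bundle them into one function. Below is a minimal sketch of that idea: `extract_features` is a hypothetical helper, `str.split` stands in for `nltk.word_tokenize`, and the zero fallbacks for empty text are my own assumption.

```python
import string

_NO_PUNCT = str.maketrans('', '', string.punctuation)

def extract_features(text):
    """Compute the count-based features listed above for a single text."""
    # str.split is a stand-in for nltk.word_tokenize (assumption)
    words = text.translate(_NO_PUNCT).lower().split()
    word_count = len(words)
    sentence_length = sum(map(len, words))  # word characters only
    text_length = len(text)                 # includes spaces and punctuation
    punctuation_count = sum(1 for ch in text if ch in string.punctuation)
    return {
        'word_count': word_count,
        'sentence_length': sentence_length,
        'text_length': text_length,
        'punctuation_per_char': punctuation_count / text_length if text_length else 0.0,
        'unique_ratio': len(set(words)) / word_count if word_count else 0.0,
        'avg_word_length': sentence_length / word_count if word_count else 0.0,
    }

features = extract_features("The night was dark, and the wind was cold.")  # made-up sentence
for name, value in features.items():
    print(name, value)
```

With pandas, something like `df['text'].apply(extract_features).apply(pd.Series)` should expand the dictionaries into columns, though I haven't timed it on the full training set.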

In [None]:
df.head(2)

## Sentiment Analysis

Now let's analyze the text itself for sentiment, hoping that some authors are more "positive", "neutral" or "negative" in their sentences than others.  NLTK makes this pretty easy with the VADER sentiment analyzer (http://www.nltk.org/api/nltk.sentiment.html#module-nltk.sentiment.vader).

In [None]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()

# Let's test how this works
print(sid.polarity_scores('Vader text analysis is my favorite thing ever'))
print(sid.polarity_scores('I hate vader and everything it stands for'))

So for any piece of analyzed text we get four outputs ("neg", "neu", "pos" and "compound"), all of which could be useful.  The "compound" score is normalized between -1 (most negative) and 1 (most positive), so we can use it as a single overall measure of sentiment.
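
For intuition about that normalization: as far as I can tell from VADER's source, the compound score is the sum of the word valences squashed with `x / sqrt(x^2 + alpha)`, with `alpha = 15` by default. Treat the sketch below as my reading of it rather than the library's exact code; `normalize_compound` is a hypothetical name.

```python
import math

def normalize_compound(raw_score, alpha=15):
    """Squash an unbounded summed valence score into the open interval (-1, 1).

    alpha=15 is the default I believe VADER uses (assumption).
    """
    return raw_score / math.sqrt(raw_score * raw_score + alpha)

for raw in (-8.0, -1.0, 0.0, 1.0, 8.0):
    print(raw, round(normalize_compound(raw), 3))
```

The practical upshot is that long, emotionally loaded sentences saturate toward -1 or 1, while a sentence with no scored words sits at exactly 0.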

Let's update our data with sentiment scores and see how they differ for each author.

In [None]:
df['sentiment'] = df['text'].apply(lambda t: sid.polarity_scores(t)['compound'])
df.groupby('author')['sentiment'].describe()

In [None]:
for author in authors:
    sns.distplot(df[df['author'] == author]['sentiment'], label = author, hist=False)

plt.legend();

Now this is really interesting: there is quite a bit of differentiation between the authors.  Everyone clusters around 0 (neutral sentence), but HP Lovecraft is significantly more negative and Mary Shelley is much more positive overall.

In [None]:
sns.boxplot(x="author", y="sentiment", data=df);

In [None]:
# # TODO: do this later after we have more data

# # let's create a correlation matrix
# corr = df.corr()

# # make 
# plt.subplots(figsize=(11, 9))

# # Generate a custom diverging colormap
# cmap = sns.diverging_palette(220, 10, as_cmap=True)

# # Draw the heatmap with the mask and correct aspect ratio
# sns.heatmap(corr, cmap=cmap, vmax=.3, center=0,
#             square=True, linewidths=.5, cbar_kws={"shrink": .5})

## Word Frequency

I'd like to try to figure out if some authors have "favorite" words, which could help us determine which author wrote some specific piece of text.

I'm not sure the best way to go about this, but I have a few ideas:

1. Find the top N most commonly used words (for each author?), then score each text based on how frequently that word appears.
   * issue: we would want to strip out common pronouns, conjunctions and other parts of speech
   * issue: the ratio would probably be really close to zero so we'd want to normalize
1. Pick a part of speech, say verbs or adverbs, and see how often certain authors utilize some array of top words
   * issue: where would these words come from, should they just come from the whole training set?
1. Instead of specific words, we could see how often authors use certain parts of speech using nltk
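
The first idea can be sketched in plain Python on toy data. Everything named here (`STOP_WORDS`, `top_words`, `top_word_rate`, the two example sentences) is made up for illustration; the real version would use the NLTK stop word list and the training set.

```python
from collections import Counter

STOP_WORDS = {'the', 'a', 'an', 'and', 'of', 'was', 'in', 'to'}  # tiny hypothetical stop list

def top_words(token_lists, n):
    """Return the n most frequent non-stop words across all token lists."""
    counts = Counter(w for tokens in token_lists for w in tokens if w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(n)]

def top_word_rate(tokens, vocabulary):
    """Fraction of a text's tokens that are in the vocabulary (0.0 for empty text)."""
    if not tokens:
        return 0.0
    vocab = set(vocabulary)
    return sum(1 for w in tokens if w in vocab) / len(tokens)

corpus = [
    "the old house was dark and the old door creaked".split(),
    "the dark sea and the old ship".split(),
]
vocab = top_words(corpus, 2)
rate = top_word_rate("an old dark road".split(), vocab)
print(vocab, rate)
```

Dividing by the token count is one simple way to handle the normalization issue noted above; whether it separates the authors well is something the data would have to show.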



In [None]:
# Let's start by figuring out the most common words in our word dataset

# iterate all rows and create a new dataframe with author->word (single word)
df_words = pd.concat([pd.DataFrame(data={'author': [row['author'] for _ in row['words']], 'word': row['words']})
           for _, row in df.iterrows()], ignore_index=True)

# use NLTK to remove all rows with simple stop words
df_words = df_words[~df_words['word'].isin(nltk.corpus.stopwords.words('english'))]

df_words.shape

In [None]:
# let's use wordclouds to see which words each author likes to use
from wordcloud import WordCloud, STOPWORDS

def authorWordcloud(author):
    # lower max_font_size
    wordcloud = WordCloud(max_font_size=40,background_color="black", max_words=10000).generate(" ".join(df_words[df_words['author'] == author]['word'].values))
    plt.figure(figsize=(11,13))
    plt.title(author, fontsize=16)
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()
    
authorWordcloud('HPL')
authorWordcloud('EAP')
authorWordcloud('MWS')

### Now we have wordclouds for each author

We can tell already that they tend to favor certain words, and they often intersect in interesting ways.  Now let's try to figure out the most commonly used words by each author

In [None]:
# function for a specific author to count occurrences of each word
def authorCommonWords(author, numWords):
    authorWords = df_words[df_words['author'] == author].groupby('word').size().reset_index().rename(columns={0:'count'})
    authorWords.sort_values('count', inplace=True)
    return authorWords[-numWords:]

# for example, here's how we get the 10 most common EAP words.
authorCommonWords('EAP', 10)

## Utilizing An Author's Top Words

Now we can get a list of an author's top words, and there are probably lots of ways to use them.  Our initial approach will be to make an indicator column (https://www.tensorflow.org/api_docs/python/tf/feature_column/indicator_column) for each possible top word and use it to indicate whether an example contains that word.  Then we'll let our ML algorithm work out the scoring relationship by treating those columns as categorical data.

In [None]:
# get all top words from our authors.
# this will represent our top words "vocabulary list"
authors_top_words = []
for author in authors:
    authors_top_words.extend(authorCommonWords(author, 10)['word'].values)

# use a set to remove duplicates
authors_top_words = list(set(authors_top_words))

In [None]:
# put all the top words used in each example into a new column    
df['top_words'] = df['words'].apply(lambda w: list(set(filter(set(w).__contains__, authors_top_words))))
df[['author','top_words', 'words']].head()

## TODO: Create feature columns for each top word
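
One possible way to tackle this, sketched outside of TensorFlow: turn each text's top words into a multi-hot vector, i.e. one 0/1 flag per vocabulary word. The helper `multi_hot`, the `has_` column prefix and the four-word vocabulary are all hypothetical; sorting the real `authors_top_words` first would keep the column order stable between runs.

```python
def multi_hot(words, vocabulary):
    """Return one 0/1 flag per vocabulary word: 1 if the text contains it."""
    present = set(words)
    return [1 if v in present else 0 for v in vocabulary]

vocabulary = sorted(['old', 'time', 'love', 'night'])  # hypothetical top-word list
tokens = "the night was old and the night was long".split()

row = multi_hot(tokens, vocabulary)
columns = {f'has_{v}': flag for v, flag in zip(vocabulary, row)}
print(columns)
```

Each `has_<word>` flag could then be fed to the model as an ordinary numeric column, which sidesteps the question of how to pass a variable-length list through the input function.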

## Making an ML Model
Now that we have a lot of features, let's try to make our first ML model and see how it does.  I'm going to use TensorFlow to make a Logistic Regression model using the estimator API.
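
For intuition, multi-class logistic regression computes one linear score per author and turns the scores into probabilities with a softmax. Here is a toy sketch with made-up weights and features; the real weights are learned during training, and I'm assuming this is roughly what a linear classifier with three classes does under the hood.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(features, weights, biases):
    """One linear score per class (w . x + b), then softmax."""
    logits = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    return softmax(logits)

authors = ['MWS', 'HPL', 'EAP']
features = [0.5, -1.0]                            # two made-up feature values
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 1.0]]   # made-up weights, one row per author
biases = [0.0, 0.0, 0.0]

probs = predict_proba(features, weights, biases)
best = authors[probs.index(max(probs))]
print(dict(zip(authors, probs)), best)
```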

## Defining our features for TensorFlow
We have both continuous value columns, like words per sentence, and some categorical columns like top_words.  Let's figure out which columns we want to use from our dataframe and then define the feature columns.


In [None]:
# First, let's just pull out the columns we need
# feature_columns = ['author', 'word_count', 'text_length', 'punctuation_per_char', 'unique_ratio', 'avg_word_length', 'sentiment', 'top_words']
# TODO: put back in top_words once we figure it out
feature_columns = ['author', 'word_count', 'text_length', 'punctuation_per_char', 'unique_ratio', 'avg_word_length', 'sentiment']
df_features = df[feature_columns]

# Now let's split into a train and dev set
# use random_state seed so we get the same split each time
df_train=df_features.sample(frac=0.8,random_state=1)
df_dev=df_features.drop(df_train.index)

df_train.head()

In [None]:
import tensorflow as tf

# continuous numeric features
feature_word_count = tf.feature_column.numeric_column("word_count")
feature_text_length = tf.feature_column.numeric_column("text_length")
feature_punctuation_per_char = tf.feature_column.numeric_column("punctuation_per_char")
feature_unique_ratio = tf.feature_column.numeric_column("unique_ratio")
feature_avg_word_length = tf.feature_column.numeric_column("avg_word_length")
feature_sentiment = tf.feature_column.numeric_column("sentiment")

# if we just used the single top word we could do it this way (one-hot)
# feature_top_words = tf.feature_column.categorical_column_with_vocabulary_list(
#    "top_words", vocabulary_list=authors_top_words)

# feature_top_words = tf.feature_column.indicator_column(tf.feature_column.categorical_column_with_vocabulary_list(
#     "top_words_test", vocabulary_list=authors_top_words))

base_columns = [
    feature_word_count, feature_text_length, feature_punctuation_per_char, feature_unique_ratio, feature_avg_word_length, feature_sentiment
]

In [None]:
import tempfile

model_dir = tempfile.mkdtemp() # base temp directory for running models

# our Y value labels, i.e. the thing we are classifying
labels_train = df_train['author']

# let's make a training function we can use with our estimators
train_fn = tf.estimator.inputs.pandas_input_fn(
    x=df_train,
    y=labels_train,
    batch_size=100,
    num_epochs=None, # unlimited
    shuffle=True, # shuffle the training data around
    num_threads=5)

# let's try a simple linear classifier
linear_model = tf.estimator.LinearClassifier(
    model_dir=model_dir, 
    feature_columns=base_columns,
    n_classes=len(authors),
    label_vocabulary=authors)

In [None]:
train_steps = 5000

# now let's train that model!
linear_model.train(input_fn=train_fn, steps=train_steps)

In [None]:
# let's see how well we did on our held-out dev set
dev_test_fn = tf.estimator.inputs.pandas_input_fn(
    x=df_dev,
    y=df_dev['author'],
    batch_size=100,
    num_epochs=1, # just one run
    shuffle=False, # don't shuffle test here
    num_threads=5)

linear_model.evaluate(input_fn=dev_test_fn)["accuracy"]

# Results so far

Well, now we have a model that can guess the right author about 43% of the time.  Better than the 33% of random chance, but really not so great :).  Then again, we were just using a simple logistic regression model (linear classifier), and we may be able to get more accurate predictions with a different classifier, since the relationships in the data are probably not linear anyway.  Also, we have yet to incorporate the authors' top words.
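
One caveat on the 33% figure: it assumes the three authors are equally represented. If they aren't, always guessing the most common author is a stronger baseline to compare against. Here is a quick sketch of how to check (the labels are made up and do not reflect the real distribution; `majority_baseline` is a hypothetical helper):

```python
from collections import Counter

def majority_baseline(labels):
    """Return the most common label and the accuracy of always guessing it."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)

toy_labels = ['EAP', 'EAP', 'HPL', 'MWS', 'EAP']  # made-up labels
label, accuracy = majority_baseline(toy_labels)
print(label, accuracy)
```

In the notebook this would be called as `majority_baseline(df_dev['author'])`.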


# Work In Progress

More to come soon, stay tuned