# Who wrote this?

"I ought to be thy Adam, but I am rather the fallen angel."

"Nothing is so painful to the human mind as a great and sudden change."

"It is the nature of truth in general, as of some ores in particular, to be richest when most superficial."

"The oldest and strongest emotion of mankind is fear, and the oldest and strongest kind of fear is fear of the unknown."

![alt text](img/shelley.jpg) ![alt text](img/poe.png) ![alt text](img/lovecraft.jpg)

Mary Shelley, Edgar Allan Poe, Howard Phillips Lovecraft

# A Study for Authorship Attribution on Sentences

Objective:
Identify the writer of a sentence

Data:
For each writer, a list of sentences from works such as Frankenstein, The Unparalleled Adventure of One Hans Pfaall, and The Call of Cthulhu

## Data Exploration

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from os import path
import nltk
from nltk.stem import WordNetLemmatizer
from functools import reduce, partial
from itertools import groupby
import operator

# Utilities
identity = lambda x: x

def compose(*functions):
    return reduce(lambda f, g: lambda x: f(g(x)),
                            functions,
                            identity)

In [2]:
# Import Data and Tokenize
d = pd.read_csv('data/train.csv')
#execute only once to download resources
#nltk.download('stopwords')
#nltk.download('punkt')
#nltk.download('wordnet')
d['tokens'] = d.text.apply(nltk.word_tokenize)
d.head()

Unnamed: 0,id,text,author,tokens
0,id26305,"This process, however, afforded me no means of...",EAP,"[This, process, ,, however, ,, afforded, me, n..."
1,id17569,It never once occurred to me that the fumbling...,HPL,"[It, never, once, occurred, to, me, that, the,..."
2,id11008,"In his left hand was a gold snuff box, from wh...",EAP,"[In, his, left, hand, was, a, gold, snuff, box..."
3,id27763,How lovely is spring As we looked from Windsor...,MWS,"[How, lovely, is, spring, As, we, looked, from..."
4,id12958,"Finding nothing else, not even gold, the Super...",HPL,"[Finding, nothing, else, ,, not, even, gold, ,..."


In [3]:
# Further provisions of token preprocessing

# Remove tokens with a leading apostrophe, as their occurrence is highly unbalanced among the writer corpora (might be due to publishing reasons)
filter_heading_apostrophes = compose(list,
                                     partial(filter, lambda token: "'" != token[0]))

transform_lower_case = compose(list,
                               partial(map, lambda token: token.lower()))

d.tokens = d.tokens.apply(compose(filter_heading_apostrophes,
                                  transform_lower_case))
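
Note that `compose` applies its functions from right to left, so the tokens are lower-cased first and filtered second. A minimal, self-contained sketch of the same two steps on a made-up token list (the sample tokens and the shorter helper names are assumptions, not taken from the corpus):

```python
from functools import partial, reduce

def compose(*functions):
    # compose(f, g)(x) == f(g(x)): the right-most function runs first
    return reduce(lambda f, g: lambda x: f(g(x)), functions, lambda x: x)

# The same two steps as in the cell above
drop_leading_apostrophes = compose(list, partial(filter, lambda token: token[0] != "'"))
lower_case = compose(list, partial(map, str.lower))
preprocess = compose(drop_leading_apostrophes, lower_case)

# Hypothetical tokens, including two that start with an apostrophe
sample = ["The", "Raven", "'s", "wing", "''"]
print(preprocess(sample))
```

Only the three word tokens survive, all in lower case.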

### Vocabulary in the Corpus

In [4]:
tokenList = partial(reduce, operator.concat)
frequencies = lambda l: sorted([(k, len(list(v))) for k,v in groupby(sorted(l))],
                               key=lambda x: x[1], reverse=True)

allTokens = {}
allTokens['corpus'] = tokenList(d.tokens)
vocabulary = {}
vocabulary['corpus'] = set(allTokens['corpus'])
tokenFreq = {}
tokenFreq['corpus'] = frequencies(allTokens['corpus'])
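
As an aside, `reduce(operator.concat, ...)` copies the growing list at every step, and the sort-then-`groupby` count can also be expressed with `collections.Counter`. A possible alternative, sketched on a toy column of token lists (the helper names `token_list` and `token_frequencies` are hypothetical); ties are broken alphabetically, which should match the sort-based version above:

```python
from collections import Counter
from itertools import chain

def token_list(token_lists):
    # Flatten in linear time instead of repeated list concatenation
    return list(chain.from_iterable(token_lists))

def token_frequencies(tokens):
    # (token, count) pairs, most frequent first; ties in alphabetical order
    return sorted(Counter(tokens).items(), key=lambda kv: (-kv[1], kv[0]))

toy = [["the", "cat", "and", "the", "dog"], ["and", "the", "end"]]
print(token_frequencies(token_list(toy)))
```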

#### Whole Corpus

In [5]:
print("Number of tokens", len(allTokens['corpus']), 
      ", vocabulary size", len(vocabulary['corpus']))
print("Most occurring tokens", tokenFreq['corpus'][:10])

Number of tokens 589810 , vocabulary size 25127
Most occurring tokens [(',', 38220), ('the', 35559), ('of', 20953), ('.', 19119), ('and', 17953), ('to', 12839), ('i', 10794), ('a', 10720), ('in', 9457), ('was', 6662)]


#### Vocabulary of Writers

In [6]:
writers = ['EAP', 'HPL', 'MWS']
for writer in writers:
    print('\nFor writer', writer)
    allTokens[writer] = tokenList(d[d.author==writer].tokens)
    print('  Number of tokens', len(allTokens[writer]))
    vocabulary[writer] = set(allTokens[writer])
    print('  Vocabulary size', len(vocabulary[writer]))
    tokenFreq[writer] = frequencies(allTokens[writer])
    print('  Most frequent words:', tokenFreq[writer][:30])


For writer EAP
  Number of tokens 229700
  Vocabulary size 15306
  Most frequent words: [(',', 17594), ('the', 14969), ('of', 8970), ('.', 7700), ('and', 5733), ('to', 4761), ('a', 4715), ('in', 4124), ('i', 3778), ('that', 2327), ('it', 2326), ('was', 2229), ('my', 1788), ('with', 1695), ('is', 1668), ('``', 1628), ('at', 1588), ('as', 1570), ('which', 1488), (';', 1354), ('not', 1347), ('for', 1343), ('had', 1318), ('he', 1302), ('this', 1296), ('his', 1278), ('by', 1206), ('but', 1200), ('be', 1097), ('have', 1055)]

For writer HPL
  Number of tokens 172378
  Vocabulary size 14522
  Most frequent words: [('the', 10933), (',', 8581), ('and', 6098), ('of', 5846), ('.', 5707), ('a', 3294), ('to', 3249), ('in', 2736), ('i', 2704), ('was', 2184), ('that', 2021), ('had', 1783), ('he', 1647), ('it', 1402), ('as', 1173), ('his', 1171), (';', 1143), ('with', 1122), ('for', 1020), ('but', 979), ('my', 971), ('at', 940), ('on', 933), ('which', 920), ('from', 910), ('not', 894), ('were', 708),

In [7]:
singletons = list(filter(lambda x: x[1] == 1, tokenFreq['corpus']))
print("Number of tokens occurring only once", len(singletons))
print("Samples of rare words", list(map(lambda x: x[0], singletons[:10])))

Number of tokens occurring only once 9307
Samples of rare words ['a.d', 'a.d.', 'a.m', 'aaem', 'ab', 'abaft', 'abased', 'abasement', 'abashed', 'abashment']


## Simple model on most frequent words

In [8]:
from sklearn.feature_extraction import DictVectorizer
dv = DictVectorizer()
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score, cross_validate
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.preprocessing import LabelBinarizer
from sklearn import preprocessing

In [9]:
# Vectorize
def first_n_most_frequent_tokens(n):
    return list(map(lambda x: x[0], tokenFreq['corpus'][:n]))

n = 100
print("The", n, "most frequent tokens are", first_n_most_frequent_tokens(n))
print("  with a coverage of", reduce(lambda x, y: x + y[1],
                                         tokenFreq['corpus'][:n], 0))

filter_by_set = lambda coll: compose(list,
                                     partial(filter, lambda x: x in coll))

def frequencies_in_tokens(tokens, vocabulary):
    "Returns the frequencies of the vocs in the tokens as sparse matrix"
    return dv.fit_transform(map(compose(dict,
                                     frequencies,
                                     filter_by_set(vocabulary)),
                            tokens))

The 100 most frequent tokens are [',', 'the', 'of', '.', 'and', 'to', 'i', 'a', 'in', 'was', 'that', 'my', ';', 'it', 'he', 'had', 'with', 'his', 'as', 'for', 'not', 'which', 'but', 'at', 'me', 'from', 'by', '``', 'is', 'this', 'on', 'be', 'her', 'were', 'have', 'all', 'you', 'an', 'we', 'or', 'no', 'one', 'so', 'him', 'when', 'they', 'been', 'upon', 'there', 'could', 'she', 'its', 'would', 'more', 'now', 'their', '?', 'what', 'some', 'our', 'are', 'into', 'than', 'will', 'very', 'who', 'if', 'them', 'only', 'then', 'up', 'these', 'before', 'man', 'about', 'any', 'time', 'did', 'yet', 'out', 'said', 'even', 'your', 'might', 'after', 'do', 'old', 'like', 'can', 'first', 'must', 'us', 'most', 'through', 'over', 'never', 'life', 'night', 'made', 'other']
  with a coverage of 335389
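
The coverage is easier to judge relative to the corpus size. A quick calculation using only the two numbers printed above (589810 tokens in total, 335389 of them covered by the 100 most frequent tokens):

```python
# Numbers copied from the outputs above
total_tokens = 589810
covered_by_top_100 = 335389

share = covered_by_top_100 / total_tokens
print(f"The 100 most frequent tokens cover {share:.1%} of all tokens")
```

So a vocabulary of only 100 tokens already accounts for somewhat more than half of the corpus.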


In [10]:
# Share of correct predictions (strictly speaking accuracy; the name "precision" is kept to match the outputs below)
precision = lambda x, y: sum(x == y) / len(x)

def classifier_assessment(model_class, X, Y) -> dict:
    return {'in-sample-precision': precision(Y, model_class.fit(X,Y).predict(X)),
            'out-of-sample-precision': np.mean(cross_val_score(model_class, X, Y, cv=8))}

In [11]:
Y = d.author
X = frequencies_in_tokens(d.tokens,
                          first_n_most_frequent_tokens(200))
model = MultinomialNB().fit(X,Y)
Y_pred = model.predict(X)
print(classification_report(Y, Y_pred))
print('Confusion Matrix\n', writers, '\n', confusion_matrix(Y, Y_pred, labels=writers))

             precision    recall  f1-score   support

        EAP       0.67      0.68      0.67      7900
        HPL       0.59      0.66      0.62      5635
        MWS       0.67      0.59      0.63      6044

avg / total       0.65      0.65      0.65     19579

Confusion Matrix
 ['EAP', 'HPL', 'MWS'] 
 [[5350 1414 1136]
 [1307 3701  627]
 [1325 1135 3584]]
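
The report can be reproduced from the confusion matrix alone, which also shows that the `precision` helper defined earlier measures what is usually called accuracy (the share of correct predictions). A small sketch using only the matrix printed above, with rows as true writers and columns as predicted writers:

```python
writers = ['EAP', 'HPL', 'MWS']
# Rows: true writer, columns: predicted writer (copied from the output above)
cm = [[5350, 1414, 1136],
      [1307, 3701,  627],
      [1325, 1135, 3584]]

total = sum(map(sum, cm))
correct = sum(cm[i][i] for i in range(len(writers)))
accuracy = correct / total

# Precision: correct predictions of a writer / all predictions of that writer (column sum)
precision = [cm[i][i] / sum(row[i] for row in cm) for i in range(len(writers))]
# Recall: correct predictions of a writer / all sentences by that writer (row sum)
recall = [cm[i][i] / sum(cm[i]) for i in range(len(writers))]

print("accuracy", round(accuracy, 4))
for w, p, r in zip(writers, precision, recall):
    print(w, "precision", round(p, 2), "recall", round(r, 2))
```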


### Sensitivity to Vocabulary Size

In [12]:
for n in [5, 10, 50, 100, 200, 400, 600, 1000, 1500, 2000, 5000, 10000]:
    X = frequencies_in_tokens(d.tokens, first_n_most_frequent_tokens(n))
    print(n, classifier_assessment(MultinomialNB(), X, Y))

5 {'in-sample-precision': 0.45559017314469585, 'out-of-sample-precision': 0.45605126059258444}
10 {'in-sample-precision': 0.4691250829970887, 'out-of-sample-precision': 0.4692259731151678}
50 {'in-sample-precision': 0.5732672761632361, 'out-of-sample-precision': 0.5711207508283209}
100 {'in-sample-precision': 0.5990091424485418, 'out-of-sample-precision': 0.5963534444116483}
200 {'in-sample-precision': 0.6453342867357883, 'out-of-sample-precision': 0.6388973395539028}
400 {'in-sample-precision': 0.6934981357576996, 'out-of-sample-precision': 0.6860909748901042}
600 {'in-sample-precision': 0.7227641861177793, 'out-of-sample-precision': 0.7119357667044304}
1000 {'in-sample-precision': 0.7562183972623729, 'out-of-sample-precision': 0.7413032593343052}
1500 {'in-sample-precision': 0.7820113386791971, 'out-of-sample-precision': 0.7619378174824729}
2000 {'in-sample-precision': 0.8008069870779917, 'out-of-sample-precision': 0.7786915094976463}
5000 {'in-sample-precision': 0.8575514581950049, 

Use no more than 200 words: beyond that point, content-bearing words appear, e.g. character names like Raymond.
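
The printed values also show the gap between in-sample and out-of-sample precision widening as the vocabulary grows, while both numbers keep rising. A quick check on three of the printed rows (values rounded to four digits):

```python
# n -> (in-sample, out-of-sample), copied from the output above and rounded
results = {
    200:  (0.6453, 0.6389),
    1000: (0.7562, 0.7413),
    2000: (0.8008, 0.7787),
}

# Difference between in-sample and out-of-sample precision per vocabulary size
gaps = {n: round(ins - out, 4) for n, (ins, out) in results.items()}
for n, gap in gaps.items():
    print(n, gap)
```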

### Sensitivity towards Model Class

In [19]:
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier

n = 200
X = frequencies_in_tokens(d.tokens,
                          first_n_most_frequent_tokens(n))

model_classes = [MultinomialNB(),
                 LinearSVC(),
                 RandomForestClassifier(),
                 GradientBoostingClassifier(n_estimators= 100, max_depth = 5)]

for model_class in model_classes:
    print(model_class, '\n', classifier_assessment(model_class, X, Y))

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True) {'in-sample-precision': 0.6453342867357883, 'out-of-sample-precision': 0.6388973395539028}
LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0) {'in-sample-precision': 0.6688288472342816, 'out-of-sample-precision': 0.6583091404139636}
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False) {'in-sample-precision': 0.9852903621226825, 'out-of-sample-precision': 0.5929291106550424}
GradientBoostingClassifier(crite

In [23]:
# Distance-based classifiers on the same features (200 most frequent tokens)
from sklearn.neighbors import NearestCentroid, KNeighborsClassifier
for model_class in [NearestCentroid(),
                    KNeighborsClassifier(n_neighbors=10),
                    KNeighborsClassifier(n_neighbors=20)]:
    print(model_class, '\n', classifier_assessment(model_class, X, Y))

NearestCentroid(metric='euclidean', shrink_threshold=None) 
 {'in-sample-precision': 0.477858930486746, 'out-of-sample-precision': 0.47617730584202467}
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=10, p=2,
           weights='uniform') 
 {'in-sample-precision': 0.6090198682261607, 'out-of-sample-precision': 0.5193303149288042}
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=20, p=2,
           weights='uniform') 
 {'in-sample-precision': 0.5833290770723735, 'out-of-sample-precision': 0.5329662552610975}


### Features with most predictive power
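In a multinomial model, a sentence by writer \(y\) in which the common word \(x_j\) occurs \(n_j\) times has the probability
\[ P(\text{sentence} \mid y) \propto \prod_j p_j^{\,n_j} \]
where \(p_j\) is the probability with which writer \(y\) picks \(x_j\).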

In [46]:
model.feature_log_prob_.shape

(3, 200)
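
Each row of `feature_log_prob_` holds the smoothed log-probabilities of the 200 features for one writer, so features whose log-probabilities differ most between writers are natural candidates. The sketch below does not use the fitted model; it recomputes Laplace-smoothed (alpha = 1) log-probabilities for four tokens and two writers from the counts printed earlier. The restriction to these four tokens is purely illustrative.

```python
from math import log

# Token counts copied from the per-writer output above
counts = {
    'EAP': {',': 17594, 'the': 14969, 'of': 8970, 'and': 5733},
    'HPL': {',': 8581, 'the': 10933, 'of': 5846, 'and': 6098},
}
alpha = 1.0  # Laplace smoothing

def log_probs(token_counts):
    # log((count + alpha) / (total + alpha * number_of_features))
    total = sum(token_counts.values()) + alpha * len(token_counts)
    return {t: log((c + alpha) / total) for t, c in token_counts.items()}

lp = {writer: log_probs(c) for writer, c in counts.items()}

# Positive values favour EAP, negative values favour HPL
log_ratio = {t: lp['EAP'][t] - lp['HPL'][t] for t in counts['EAP']}
ranked = sorted(log_ratio, key=lambda t: abs(log_ratio[t]), reverse=True)
for t in ranked:
    print(repr(t), round(log_ratio[t], 2))
```

Among these four tokens, 'and' separates the two writers most (in favour of HPL), followed by the comma (in favour of EAP).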

### Enhance Features with Statistics
Add the following statistics to the above frequencies of common words:
 * sentence length
 
Sentence length on its own may have predictive power. Find a way to incorporate it into the model approaches above.

In [41]:
sentence_lengths = np.array(list(map(len, d.tokens))).reshape(-1, 1)
classifier_assessment(MultinomialNB(), sentence_lengths, Y)

{'in-sample-precision': 0.40349353899586293,
 'out-of-sample-precision': 0.4034935578860775}
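
Both numbers are, up to rounding, the share of the most frequent writer (EAP, 7900 of 19579 sentences), which suggests that the model simply predicts EAP for every sentence. This is expected for a multinomial model with one feature: the single feature has probability 1 for every writer, so the likelihood term vanishes and only the class prior is left. A small check, with the sentence counts taken from the classification report above:

```python
from math import log

support = {'EAP': 7900, 'HPL': 5635, 'MWS': 6044}  # sentences per writer
majority_share = max(support.values()) / sum(support.values())
print(round(majority_share, 4))

# With a single feature, the smoothed feature probability is
# (N + alpha) / (N + alpha * 1) = 1 for any total count N, so its log is 0
alpha = 1.0
single_feature_log_probs = [log((n + alpha) / (n + alpha * 1)) for n in (100, 229700)]
print(single_feature_log_probs)
```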

In [42]:
#X = frequencies_in_tokens(d.tokens, first_n_most_frequent_tokens(n))
type(X)

scipy.sparse.csr.csr_matrix
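
Since `X` is built by `DictVectorizer` from one dictionary per sentence, one possible way to add the sentence length is to put it into that dictionary as an extra entry before vectorizing. The sketch below only builds the dictionaries; the function `features_with_length` and the key `__length__` are hypothetical names, and whether a raw length is a sensible "count" for a multinomial model (as opposed to, say, a binned length) would still need to be checked.

```python
from collections import Counter

def features_with_length(tokens, vocabulary, length_key='__length__'):
    # Counts of the selected common words in one sentence ...
    features = dict(Counter(t for t in tokens if t in vocabulary))
    # ... plus the number of tokens in the sentence as an additional feature
    features[length_key] = len(tokens)
    return features

vocabulary = {'the', 'of', ','}
sentence = ['the', 'call', 'of', 'the', 'sea', ',', 'again']
print(features_with_length(sentence, vocabulary))
```

A list of such dictionaries could then be passed to `dv.fit_transform` in place of the dictionaries produced by `frequencies`.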

## Questions
 * Why does the usual tf-idf approach not work for author attribution?
 * Explain why the in-sample and out-of-sample precision of the random forest differ so strongly

## Outlook

Character Level
 * n-grams 
   * e.g. 3-grams |No_| |o_o| |_on| |one|
 * punctuation
   * e.g. frequency of commas, semicolons, periods, quotation marks
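
The character-level features can be sketched in a few lines. The helper names are hypothetical, and blanks are shown as underscores only to mirror the 3-gram example above:

```python
from collections import Counter

def char_ngrams(text, n=3):
    # Slide a window of n characters over the text, showing blanks as '_'
    text = text.replace(' ', '_')
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def punctuation_counts(text, marks=',;.?!"'):
    # Frequency of each punctuation mark that occurs in the text
    return Counter(ch for ch in text if ch in marks)

print(char_ngrams("No one"))
print(punctuation_counts("Yes; no, no, never."))
```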

Lexical Level
 * word occurrences
   * frequencies
   * indicators
 * word n-grams
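
Word n-grams and the two kinds of occurrence features (frequencies versus 0/1 indicators) might look as follows on a tokenized sentence; `word_ngrams` is a hypothetical helper and the sentence is made up:

```python
from collections import Counter

def word_ngrams(tokens, n=2):
    # Consecutive runs of n tokens, e.g. bigrams for n=2
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ['the', 'old', 'man', 'and', 'the', 'old', 'house']
bigram_frequencies = Counter(word_ngrams(tokens))          # how often each bigram occurs
bigram_indicators = {gram: 1 for gram in bigram_frequencies}  # whether it occurs at all

print(bigram_frequencies[('the', 'old')], bigram_indicators[('the', 'old')])
```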

Syntactical Level
 * part of speech
   word classes as lexical items: noun, verb, adjective, adverb, pronoun, preposition, conjunction, interjection, numeral, article, determiner
 * part of sentence
   constituents as lexical items: subject, predicate, direct/indirect object, modifier, adverbial
 * sentence 

Semantic
 *


In [None]:
# In the above multinomial model, the use of common words in a sentence happens in the following way:
# a writer y picks a common word x_j with probability p_j
#  
# https://github.com/scikit-learn/scikit-learn/blob/55bf5d9/sklearn/naive_bayes.py#L630
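
Put differently, the score of a writer for a sentence is the log prior plus, for every common word, its count times its log-probability, and the writer with the highest score is predicted. A toy version for two writers and four tokens, using counts from the outputs above; restricting the model to these tokens and writers is an assumption made only to keep the example small:

```python
from math import log

# Sentence and token counts copied from the outputs above
sentences = {'EAP': 7900, 'HPL': 5635}
counts = {
    'EAP': {',': 17594, 'the': 14969, 'of': 8970, 'and': 5733},
    'HPL': {',': 8581, 'the': 10933, 'of': 5846, 'and': 6098},
}
alpha = 1.0  # Laplace smoothing

def score(writer, sentence_counts):
    # log P(y) + sum_j n_j * log p_j, with p_j the smoothed share of token j
    total = sum(counts[writer].values()) + alpha * len(counts[writer])
    prior = log(sentences[writer] / sum(sentences.values()))
    return prior + sum(n * log((counts[writer][t] + alpha) / total)
                       for t, n in sentence_counts.items())

def predict(sentence_counts):
    # Writer with the highest score for the given token counts
    return max(counts, key=lambda writer: score(writer, sentence_counts))

print(predict({'and': 3, 'the': 1}))
print(predict({',': 3, 'of': 1}))
```

In this toy setting a sentence heavy on "and" is attributed to HPL, and one heavy on commas to EAP.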