# Intro

I've read an [article](http://rstudio-pubs-static.s3.amazonaws.com/79360_850b2a69980c4488b1db95987a24867a.html) about LDA ("Latent Dirichlet Allocation"). I'm sure I haven't understood all of it, but I want to try it on the Spooky Authors dataset.

## What is LDA in short?

*Latent Dirichlet allocation (LDA) is a topic model that generates topics based on word frequency from a set of documents. LDA is particularly useful for finding reasonably accurate mixtures of topics within a given document set.*

**It's used to find out what a text is about.** That's pretty much the takeaway from the article above. I want to try it because the authors may have common themes that they write about. Also, we've found that they lived in different periods, which may affect their topics as well.

## Steps to use the model

1. We need to split the text into a list of elements. That's the so-called `Tokenization`.
2. Once we have the tokens, we need to clean them. We will remove the stop-words as Lucho did last time.
3. `Stemming` - removes the endings of the words and preserves only their roots (*roughly*).
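
The three steps above can be sketched with the standard library only. This is a toy illustration, not what the notebook uses later: the stop-word list `TOY_STOP_WORDS`, the helper names, and the three-suffix "stemmer" are all hypothetical stand-ins for the real libraries.

```python
import re

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
TOY_STOP_WORDS = {"the", "a", "of", "and", "to", "in", "was", "is"}


def toy_tokenize(text):
    """Step 1: split text into word tokens (letters and apostrophes only)."""
    return re.findall(r"[A-Za-z']+", text)


def toy_remove_stop_words(tokens):
    """Step 2: drop tokens whose lowercase form is in the toy stop-word list."""
    return [t for t in tokens if t.lower() not in TOY_STOP_WORDS]


def toy_stem(token):
    """Step 3: crude suffix stripping; real stemmers apply many more rules."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


tokens = toy_tokenize("The walls of the dungeon seemed uniform")
cleaned = toy_remove_stop_words(tokens)
stems = [toy_stem(t.lower()) for t in cleaned]
print(stems)  # ['wall', 'dungeon', 'seem', 'uniform']
```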

## Let's dive in and see what will happen

In [3]:
import pandas as pd
train = pd.read_csv("data/train.zip", index_col=['id'])
test = pd.read_csv("data/test.zip", index_col=['id'])
sample_submission = pd.read_csv("data/sample_submission.zip", index_col=['id'])

print(train.shape, test.shape, sample_submission.shape)
print(set(train.columns) - set(test.columns))

(19579, 2) (8392, 1) (8392, 3)
{'author'}


### Let's tokenize the text first

In [4]:
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r"'?\b[0-9A-Za-z'-]+\b'?")  # matches words and words with apostrophes
# test the regex first as it wasn't a top match in SO and I've changed it a little bit...
tokens1 = tokenizer.tokenize(train.iloc[0]['text'])
tokens2 = tokenizer.tokenize("I'm Martin. It's going to be awesome data exploration")
tokens3 = tokenizer.tokenize("Let's try with some words with dashes like so-called")
print(tokens1)
print(tokens2)
print(tokens3)

['This', 'process', 'however', 'afforded', 'me', 'no', 'means', 'of', 'ascertaining', 'the', 'dimensions', 'of', 'my', 'dungeon', 'as', 'I', 'might', 'make', 'its', 'circuit', 'and', 'return', 'to', 'the', 'point', 'whence', 'I', 'set', 'out', 'without', 'being', 'aware', 'of', 'the', 'fact', 'so', 'perfectly', 'uniform', 'seemed', 'the', 'wall']
["I'm", 'Martin', "It's", 'going', 'to', 'be', 'awesome', 'data', 'exploration']
["Let's", 'try', 'with', 'some', 'words', 'with', 'dashes', 'like', 'so-called']
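
One thing to keep in mind about this pattern: the optional `'?` at both ends means that a word wrapped in single quotes keeps its quotes. Here is a small sketch, using plain `re.findall` with the same pattern, of how the quotes could be stripped afterwards. The helper name `tokenize_strip_quotes` is hypothetical, and note that stripping also removes a trailing possessive apostrophe.

```python
import re

# Same pattern as the RegexpTokenizer above.
PATTERN = r"'?\b[0-9A-Za-z'-]+\b'?"


def tokenize_strip_quotes(text):
    """Tokenize with the pattern, then drop apostrophes at the token edges."""
    raw_tokens = re.findall(PATTERN, text)
    stripped = [token.strip("'") for token in raw_tokens]
    # Guard against tokens that consisted of apostrophes only.
    return [token for token in stripped if token]


quoted = "say to yourself 'stereotomy' without"
print(re.findall(PATTERN, quoted))    # the quotes stay attached to the word
print(tokenize_strip_quotes(quoted))  # the quotes are gone
print(tokenize_strip_quotes("I'm Martin"))  # inner apostrophes are kept
```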


### The regex seems to work. Let's add a `tokens` column

I will make a function that takes a `DataFrame` and returns a copy with a new `tokens` column computed from `text`.

In [5]:
def prepare_df_for_LAD(df):
    def tokenize(text):
        return tokenizer.tokenize(text)
    
    new_df = df.copy()
    
    new_df['tokens'] = new_df.apply(lambda r: tokenize(r['text']), axis=1)
    
    return new_df

# test the new method
new_train = prepare_df_for_LAD(df=train)
new_train.head(10)

id,text,author,tokens
id26305,"This process, however, afforded me no means of...",EAP,"[This, process, however, afforded, me, no, mea..."
id17569,It never once occurred to me that the fumbling...,HPL,"[It, never, once, occurred, to, me, that, the,..."
id11008,"In his left hand was a gold snuff box, from wh...",EAP,"[In, his, left, hand, was, a, gold, snuff, box..."
id27763,How lovely is spring As we looked from Windsor...,MWS,"[How, lovely, is, spring, As, we, looked, from..."
id12958,"Finding nothing else, not even gold, the Super...",HPL,"[Finding, nothing, else, not, even, gold, the,..."
id22965,"A youth passed in solitude, my best years spen...",MWS,"[A, youth, passed, in, solitude, my, best, yea..."
id09674,"The astronomer, perhaps, at this point, took r...",EAP,"[The, astronomer, perhaps, at, this, point, to..."
id13515,The surcingle hung in ribands from my body.,EAP,"[The, surcingle, hung, in, ribands, from, my, ..."
id19322,I knew that you could not say to yourself 'ste...,EAP,"[I, knew, that, you, could, not, say, to, your..."
id00912,I confess that neither the structure of langua...,MWS,"[I, confess, that, neither, the, structure, of..."


### Seems good. Next step - remove the stop-words

I will use [this Python library](https://pypi.python.org/pypi/stop-words) and update our `prepare_df_for_LAD` function.

In [6]:
from stop_words import get_stop_words


def prepare_df_for_LAD(df):
    def remove_stop_words(tokens):
        english_stop_words = get_stop_words('en')

        return [token for token in tokens if token not in english_stop_words]

    def tokenize(text):
        initial_tokens = tokenizer.tokenize(text)
        no_stop_words_tokens = remove_stop_words(initial_tokens)
        
        return no_stop_words_tokens

    new_df = df.copy()
    
    new_df['tokens'] = new_df.apply(lambda r: tokenize(r['text']), axis=1)
    
    return new_df


new_train = prepare_df_for_LAD(df=train)
new_train.head(10)

id,text,author,tokens
id26305,"This process, however, afforded me no means of...",EAP,"[This, process, however, afforded, means, asce..."
id17569,It never once occurred to me that the fumbling...,HPL,"[It, never, occurred, fumbling, might, mere, m..."
id11008,"In his left hand was a gold snuff box, from wh...",EAP,"[In, left, hand, gold, snuff, box, capered, hi..."
id27763,How lovely is spring As we looked from Windsor...,MWS,"[How, lovely, spring, As, looked, Windsor, Ter..."
id12958,"Finding nothing else, not even gold, the Super...",HPL,"[Finding, nothing, else, even, gold, Superinte..."
id22965,"A youth passed in solitude, my best years spen...",MWS,"[A, youth, passed, solitude, best, years, spen..."
id09674,"The astronomer, perhaps, at this point, took r...",EAP,"[The, astronomer, perhaps, point, took, refuge..."
id13515,The surcingle hung in ribands from my body.,EAP,"[The, surcingle, hung, ribands, body]"
id19322,I knew that you could not say to yourself 'ste...,EAP,"[I, knew, say, 'stereotomy', without, brought,..."
id00912,I confess that neither the structure of langua...,MWS,"[I, confess, neither, structure, languages, co..."
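
Notice that capitalized tokens such as `This`, `It`, `In` and `The` survive the filter: the comparison is case-sensitive and the stop-word list is lowercase. A minimal sketch of a case-insensitive filter follows. `STOP_WORDS` is a hypothetical mini list standing in for `get_stop_words('en')`, and `remove_stop_words_ci` is a hypothetical helper name. Building the set once, outside the per-row function, also avoids rebuilding the list for every row.

```python
# Hypothetical mini stop-word list standing in for get_stop_words('en').
STOP_WORDS = frozenset({"this", "it", "in", "the", "a", "i", "me", "no", "of"})


def remove_stop_words_ci(tokens, stop_words=STOP_WORDS):
    """Keep only tokens whose lowercase form is not a stop word."""
    return [token for token in tokens if token.lower() not in stop_words]


tokens = ["This", "process", "however", "afforded", "me", "no", "means"]
filtered = remove_stop_words_ci(tokens)
print(filtered)  # ['process', 'however', 'afforded', 'means']
```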


### We are almost ready

What's left? As I said at the beginning, the next step would be `Stemming`. But since we already used that technique during [the lecture](http://fmi.machine-learning.bg/lectures/08-spooky-author-identification), I changed my mind!

Let's use `Lemmatization` instead! It's similar to `Stemming` but *on steroids*: it maps each word to its dictionary form, and it can tell *-ing* nouns from *-ing* verbs when it knows the part of speech.

In [7]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /home/martin056/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [8]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# I will try it first, as I'm using it for the first time
for token in new_train.iloc[0]['tokens']:
    print(lemmatizer.lemmatize(token))

This
process
however
afforded
mean
ascertaining
dimension
dungeon
I
might
make
circuit
return
point
whence
I
set
without
aware
fact
perfectly
uniform
seemed
wall


### Cool. We ran into some problems

1. We need to lowercase the tokens (`str.lower()`).
2. We need to tell the lemmatizer ourselves whether a word is a verb or a noun. Without that hint it treats every token as a noun, which is why `afforded` and `seemed` stayed unchanged above.

Well, not entirely "by ourselves". I've found something called `pos_tag` in **nltk**. Let's explore it.

In [9]:
import nltk

nltk.download('averaged_perceptron_tagger')

nltk.pos_tag(new_train.iloc[0]['tokens'])

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/martin056/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


[('This', 'DT'),
 ('process', 'NN'),
 ('however', 'RB'),
 ('afforded', 'VBD'),
 ('means', 'NNS'),
 ('ascertaining', 'VBG'),
 ('dimensions', 'NNS'),
 ('dungeon', 'VBP'),
 ('I', 'PRP'),
 ('might', 'MD'),
 ('make', 'VB'),
 ('circuit', 'NN'),
 ('return', 'NN'),
 ('point', 'NN'),
 ('whence', 'NN'),
 ('I', 'PRP'),
 ('set', 'VBP'),
 ('without', 'IN'),
 ('aware', 'JJ'),
 ('fact', 'NN'),
 ('perfectly', 'RB'),
 ('uniform', 'JJ'),
 ('seemed', 'VBD'),
 ('wall', 'NN')]

### Not that bad....

In [10]:
from nltk.corpus import wordnet

def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN
    
    
tokens_and_tags = nltk.pos_tag([t.lower() for t in new_train.iloc[0]['tokens']])

for token, tag in tokens_and_tags:
    print(lemmatizer.lemmatize(token, get_wordnet_pos(tag)))

this
process
however
afford
mean
ascertain
dimension
dungeon
i
might
make
circuit
return
point
whence
i
set
without
aware
fact
perfectly
uniform
seem
wall
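
For reference, the WordNet constants returned by `get_wordnet_pos` are plain one-letter strings (`'a'`, `'v'`, `'n'`, `'r'`), so the same mapping can be sketched as a lookup on the first letter of the Penn Treebank tag. This is only an alternative formulation with hypothetical names (`PENN_PREFIX_TO_WORDNET`, `penn_to_wordnet`). It needs no NLTK import and keeps the same "fall back to noun" behaviour.

```python
# First letter of the Penn Treebank tag -> WordNet part-of-speech character.
PENN_PREFIX_TO_WORDNET = {
    "J": "a",  # adjective (wordnet.ADJ)
    "V": "v",  # verb (wordnet.VERB)
    "N": "n",  # noun (wordnet.NOUN)
    "R": "r",  # adverb (wordnet.ADV)
}


def penn_to_wordnet(tag, default="n"):
    """Map a Penn Treebank tag to a WordNet POS, falling back to noun."""
    return PENN_PREFIX_TO_WORDNET.get(tag[:1], default)


tagged = [("afforded", "VBD"), ("means", "NNS"), ("perfectly", "RB"),
          ("uniform", "JJ"), ("might", "MD")]
mapped = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(mapped)
```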


### Seems good (to me)

Let's change `prepare_df_for_LAD`.

In [11]:
from stop_words import get_stop_words
import nltk


def prepare_df_for_LAD(df):
    def tokenize(text):
        return tokenizer.tokenize(text)
    
    def remove_stop_words(tokens):
        english_stop_words = get_stop_words('en')

        return [token for token in tokens if token not in english_stop_words]
    
    def lower(tokens):
        return [token.lower() for token in tokens]
    
    def lemmatize(tokens):
        lemmatizer = nltk.stem.WordNetLemmatizer()
        tokens_and_tags = nltk.pos_tag(tokens)
        
        lemmas = []

        for token, tag in tokens_and_tags:
            lemma = lemmatizer.lemmatize(token, get_wordnet_pos(tag))
            lemmas.append(lemma)
            
        return lemmas
    
    def prepare(text):
        initial_tokens = tokenize(text)
        no_stop_words_tokens = remove_stop_words(initial_tokens)
        lowered_tokens = lower(no_stop_words_tokens)
        lemmatized_tokens = lemmatize(lowered_tokens)
        
        return lemmatized_tokens

    new_df = df.copy()
    
    new_df['tokens'] = new_df.apply(lambda r: prepare(r['text']), axis=1)
    
    return new_df


new_train = prepare_df_for_LAD(df=train)
new_train.head(10)

id,text,author,tokens
id26305,"This process, however, afforded me no means of...",EAP,"[this, process, however, afford, mean, ascerta..."
id17569,It never once occurred to me that the fumbling...,HPL,"[it, never, occur, fumble, might, mere, mistake]"
id11008,"In his left hand was a gold snuff box, from wh...",EAP,"[in, left, hand, gold, snuff, box, caper, hill..."
id27763,How lovely is spring As we looked from Windsor...,MWS,"[how, lovely, spring, a, looked, windsor, terr..."
id12958,"Finding nothing else, not even gold, the Super...",HPL,"[find, nothing, else, even, gold, superintende..."
id22965,"A youth passed in solitude, my best years spen...",MWS,"[a, youth, pass, solitude, best, year, spend, ..."
id09674,"The astronomer, perhaps, at this point, took r...",EAP,"[the, astronomer, perhaps, point, take, refuge..."
id13515,The surcingle hung in ribands from my body.,EAP,"[the, surcingle, hung, ribands, body]"
id19322,I knew that you could not say to yourself 'ste...,EAP,"[i, know, say, 'stereotomy', without, bring, t..."
id00912,I confess that neither the structure of langua...,MWS,"[i, confess, neither, structure, languages, co..."


### It's time for Latent Dirichlet Allocation

Now we need to convert our new tokens to a `bag-of-words`: a matrix of per-document word counts. Its vocabulary will be the "dictionary" of our LDA model.
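
To make the idea concrete, here is a minimal standard-library sketch of a bag-of-words. The toy `docs` and the helper `to_count_vector` are hypothetical. In practice a vectorizer such as `CountVectorizer` does this work and returns a sparse matrix.

```python
from collections import Counter

# Toy, already-tokenized documents (hypothetical).
docs = [
    ["dungeon", "wall", "dungeon"],
    ["gold", "snuff", "box"],
    ["gold", "wall"],
]

# The "dictionary": every distinct token gets a column index (sorted for stability).
all_tokens = sorted({token for doc in docs for token in doc})
vocabulary = {token: index for index, token in enumerate(all_tokens)}


def to_count_vector(tokens, vocabulary):
    """Count how often each vocabulary entry occurs in one document."""
    counts = Counter(tokens)
    # Dicts keep insertion order, so columns follow the vocabulary indices.
    return [counts.get(token, 0) for token in vocabulary]


bag_of_words = [to_count_vector(doc, vocabulary) for doc in docs]
print(vocabulary)
print(bag_of_words)
```

Word order inside a document is lost. Only the counts remain, and that is all LDA needs.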

In [12]:
params_count_word = {"features__ngram_range": [(1,1), (1,2), (1,3)],
                      "features__analyzer": ['word'],
                      "features__max_df":[1.0, 0.9, 0.8, 0.7, 0.6, 0.5],
                      "features__min_df":[2, 3, 5, 10],
                      "features__lowercase": [False, True]}

In [13]:
import numpy as np


def report(results, n_top=5):
    for i in range(1, n_top + 1):
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
                  results['mean_test_score'][candidate],
                  results['std_test_score'][candidate]))
            print("Parameters: {0}".format(results['params'][candidate]))
            print("")

In [16]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import log_loss
from sklearn.pipeline import Pipeline, FeatureUnion

def random_search():
    params = {
        'lda__learning_method': ['online'],
        'lda__learning_offset': [20., 30., 40., 50., 60.]
    }

    params.update(params_count_word)
    
    # TODO:
    # FeatureUnion: LatentDirichletAllocation + CountVectorizer -> LogisticRegression / MultinomialNB

    # Interim pipeline so that the search below has something to fit.
    # The 'clf' step is an assumption (one of the classifiers named in the TODO):
    # 'neg_log_loss' scoring needs a final estimator with predict_proba, which LDA alone lacks.
    pipeline = Pipeline([
        ('features', CountVectorizer()),
        ('lda', LatentDirichletAllocation()),
        ('clf', LogisticRegression())
    ])

    random_search = RandomizedSearchCV(pipeline, param_distributions=params, 
                                       scoring='neg_log_loss',
                                       n_iter=20, cv=3, n_jobs=4)

    random_search.fit(new_train.text, new_train.author)
    report(random_search.cv_results_)

# random_search()
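
Note that the search above fits on the raw `text` column, so the `tokens` column we prepared is not used yet. Below is a minimal sketch of how pre-tokenized lists could be fed to `CountVectorizer` and then to LDA. It assumes toy documents in place of `new_train['tokens']`, a hypothetical `identity_analyzer` helper, and an arbitrary choice of two topics.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy, already-preprocessed documents (hypothetical stand-ins for new_train['tokens']).
token_lists = [
    ["dungeon", "wall", "circuit", "dungeon"],
    ["gold", "snuff", "box", "hand"],
    ["wall", "dungeon", "point"],
    ["gold", "hand", "box"],
]


def identity_analyzer(tokens):
    """Return the token list unchanged so CountVectorizer skips its own tokenization."""
    return tokens


vectorizer = CountVectorizer(analyzer=identity_analyzer)
counts = vectorizer.fit_transform(token_lists)  # sparse matrix: documents x vocabulary

# n_components=2 is an arbitrary choice for this toy example.
lda = LatentDirichletAllocation(n_components=2, learning_method="online", random_state=0)
doc_topics = lda.fit_transform(counts)  # dense array: documents x topics

print(counts.shape, doc_topics.shape)
print(doc_topics.sum(axis=1))  # each row is a topic mixture
```

Each row of `doc_topics` is the topic mixture of one document. Those rows are the features that the TODO above wants to combine with the plain word counts before classification.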