# **"Boost" your NLP skills for beginners**

This notebook walks through an end-to-end data science and natural language processing pipeline: starting with raw data, then preparing, modeling, visualizing, and analyzing it.

This notebook covers:
* A tour of the dataset
* Introduction to text processing with spaCy
* Automatic phrase modeling
* Topic modeling with LDA
* Visualizing topic models with pyLDAvis
* Word vector models with word2vec
* Visualizing word2vec with t-SNE

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import spacy
import itertools as it
from gensim.models import Phrases
from gensim.models.word2vec import LineSentence
from gensim.corpora import Dictionary, MmCorpus
from gensim.models.ldamulticore import LdaMulticore
import pyLDAvis
import pyLDAvis.gensim
import warnings
import _pickle as pickle
from gensim.models import Word2Vec
from sklearn.manifold import TSNE
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import HoverTool, ColumnDataSource, value
import matplotlib.pyplot as plt
%matplotlib inline
output_notebook()

from subprocess import check_output

## Spooky author dataset
**19579** sentences<br>
**3** authors
* Edgar Allan Poe (EAP)
* HP Lovecraft (HPL)
* Mary Wollstonecraft Shelley (MWS)

In [None]:
train = pd.read_csv('../input/train.csv')
train['author'].value_counts().plot.pie(autopct='%.2f', fontsize=20, figsize=(6, 6))
plt.title('Authorwise distribution')
None

In [None]:
train.isnull().any()  # sanity check for null values

## spaCy - Industrial strength NLP in Python
spaCy is an industrial-strength natural language processing (NLP) library for Python. spaCy's goal is to take recent advancements in natural language processing out of research papers and put them in the hands of users to build production software.
spaCy handles many tasks commonly associated with building an end-to-end natural language processing pipeline:

* Tokenization
* Text normalization, such as lowercasing, stemming/lemmatization
* Part-of-speech tagging
* Syntactic dependency parsing
* Sentence boundary detection
* Named entity recognition and annotation

spaCy contains built-in data and models which you can use out-of-the-box for processing general-purpose English language text:

* Large English vocabulary, including stopword lists
* Token "probabilities"
* Word vectors

spaCy is written in optimized Cython, which means it's fast. According to a few independent sources, it's the fastest syntactic parser available in any language. Key pieces of the spaCy parsing pipeline are written in pure C, enabling efficient multithreading (i.e., spaCy can release the GIL).

In [None]:
nlp = spacy.load('en')

Let's grab a sample sentence to play with.

In [None]:
sample_sent = train.loc[1000, 'text']
print(sample_sent)

In [None]:
parsed_sent = nlp(sample_sent)
print(parsed_sent)

spaCy returns a parsed document object from which we can read sentences, named entities, tokens, and their attributes.

In [None]:
for num, sentence in enumerate(parsed_sent.sents):
    print('Sentence {}:'.format(num + 1))
    print(sentence)
    print()

Named entity detection

In [None]:
for num, entity in enumerate(parsed_sent.ents):
    print('Entity {}:'.format(num + 1), entity, '-', entity.label_)
    print()

Part of speech tagging

In [None]:
token_text = [token.orth_ for token in parsed_sent]
token_pos = [token.pos_ for token in parsed_sent]

pd.DataFrame({'token_text': token_text, 'part_of_speech': token_pos})

What about text normalization, like stemming/lemmatization and shape analysis?

In [None]:
token_lemma = [token.lemma_ for token in parsed_sent]
token_shape = [token.shape_ for token in parsed_sent]

pd.DataFrame({'token_text': token_text, 'token_lemma': token_lemma, 'token_shape': token_shape})

What about token-level entity analysis?

In [None]:
token_entity_type = [token.ent_type_ for token in parsed_sent]
token_entity_iob = [token.ent_iob_ for token in parsed_sent]

pd.DataFrame({'token_text': token_text, 'token_entity_type': token_entity_type,
              'token_entity_iob': token_entity_iob})

In the above dataframe, B, I, and O stand for Begin, Inside, and Outside respectively. For example, in the entity 'New York', 'New' is tagged B because it begins the entity and 'York' is tagged I because it is inside it. Tokens that are not part of any entity are tagged O.
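
To make the B/I/O scheme concrete, here is a minimal sketch that stitches token-level tags back into entity spans. The tokens and tags are hand-written for illustration (they are not real spaCy output), and `group_iob` is a hypothetical helper, not part of spaCy.

```python
def group_iob(tokens, iob_tags, ent_types):
    """Collect (entity_text, entity_type) pairs from token-level IOB tags."""
    entities = []
    current_words, current_type = [], None
    for text, iob, ent_type in zip(tokens, iob_tags, ent_types):
        if iob == 'B':
            # a new entity starts here; flush the previous one, if any
            if current_words:
                entities.append((' '.join(current_words), current_type))
            current_words, current_type = [text], ent_type
        elif iob == 'I' and current_words:
            # continue the entity that is currently open
            current_words.append(text)
        else:
            # 'O' (or a stray 'I'): close any open entity
            if current_words:
                entities.append((' '.join(current_words), current_type))
            current_words, current_type = [], None
    if current_words:
        entities.append((' '.join(current_words), current_type))
    return entities

# hand-written tags for illustration, not real spaCy output
tokens = ['I', 'visited', 'New', 'York', 'with', 'Mary']
iob_tags = ['O', 'O', 'B', 'I', 'O', 'B']
ent_types = ['', '', 'GPE', 'GPE', '', 'PERSON']
entities = group_iob(tokens, iob_tags, ent_types)
print(entities)
```

In practice `parsed_sent.ents` already gives you these spans; the sketch only shows how the token-level tags encode them.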

What about a variety of other token-level attributes, such as the relative frequency of tokens, and whether or not a token matches any of these categories?
* stopword
* punctuation
* whitespace
* represents a number
* whether or not the token is included in spaCy's default vocabulary?

In [None]:
token_attributes = [(token.orth_,
                     token.prob,
                     token.is_stop,
                     token.is_punct,
                     token.is_space,
                     token.like_num,
                     token.is_oov)
                    for token in parsed_sent]

df = pd.DataFrame(token_attributes,
                  columns=['text',
                           'log_probability',
                           'stop?',
                           'punctuation?',
                           'whitespace?',
                           'number?',
                           'out of vocab.?'])

df.loc[:, 'stop?':'out of vocab.?'] = (df.loc[:, 'stop?':'out of vocab.?']
                                       .applymap(lambda x: u'Yes' if x else u''))
                                               
df

## Phrase Modeling
Phrase modeling is an approach to learning combinations of tokens that together represent meaningful multi-word concepts. We can develop phrase models by looping over the words in our sentences and looking for words that co-occur (i.e., appear one after another) much more frequently than you would expect them to by random chance.

Once our phrase model has been trained on our corpus, we can apply it to new text. When our model encounters two tokens in new text that it identifies as a phrase, it will merge the two into a single new token.

Phrase modeling is superficially similar to named entity detection in that you would expect named entities to become phrases in the model (so new york would become new_york). But you would also expect multi-word expressions that represent common concepts, but aren't specifically named entities (such as happy hour) to also become phrases in the model.
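
The "more often than chance" idea can be sketched in plain Python. The scoring below is a simplified illustration in the spirit of the original word2vec phrase-detection heuristic; it is an assumption for teaching purposes and not necessarily the exact scoring or thresholding that gensim's Phrases class applies. The toy sentences and the `score_bigrams` helper are made up.

```python
from collections import Counter

def score_bigrams(sentences, min_count=1):
    """Score adjacent word pairs; a higher score means 'more phrase-like'."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    vocab_size = len(unigrams)
    scores = {}
    for (a, b), n_ab in bigrams.items():
        # pairs seen together often, relative to how common each word
        # is on its own, receive a high score
        scores[(a, b)] = (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
    return scores

# toy, already-lemmatized sentences (made up for illustration)
toy_sentences = [
    'i live in new york'.split(),
    'new york be a city'.split(),
    'she move to new york'.split(),
    'the city be big'.split(),
]
scores = score_bigrams(toy_sentences)
best_pair = max(scores, key=scores.get)
print(best_pair, round(scores[best_pair], 2))
```

A real phrase model would then merge every pair whose score exceeds a chosen threshold into a single token such as new_york.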

We turn to the indispensable gensim library to help us with phrase modeling, the Phrases class in particular.

As we're performing phrase modeling, we'll be doing some iterative data transformation at the same time. Our roadmap for data preparation includes:
* Normalize text
* First-order phrase modeling → apply first-order phrase model to transform sentences
* Second-order phrase modeling → apply second-order phrase model to transform sentences
* Apply text normalization and second-order phrase model to text

We'll use this transformed data as the input for some higher-level modeling approaches in the following sections.

First, let's define a few helper functions that we'll use for text normalization. In particular, the lemmatized_sentence function will use spaCy to:
* Iterate over all the sentences 
* Remove punctuation and excess whitespace
* Lemmatize the text

In [None]:
def punct_space(token):
    """
    helper function to eliminate tokens
    that are pure punctuation or whitespace
    """
    
    return token.is_punct or token.is_space

def lemmatized_sentence(sent):
    """
    helper function to use spaCy to parse sentences,
    lemmatize the text
    """
    return u' '.join([token.lemma_ for token in nlp(sent)
                             if not punct_space(token)])

In [None]:
train['unigram_text'] = train['text'].map(lambda x: lemmatized_sentence(x))

In [None]:
print(train.loc[1000, 'unigram_text'])

Next, we'll learn a phrase model that will link individual words into two-word phrases. We'd expect words that together represent a specific concept, like "new york", to be linked together to form a new, single token: "new_york".

In [None]:
# Phrases expects an iterable of token lists, so split each sentence first
bigram_model = Phrases(train['unigram_text'].map(str.split))

In [None]:
# apply the model to the token list, then re-join the tokens with spaces
train['bigram_text'] = train['unigram_text'].map(lambda x: u' '.join(bigram_model[x.split()]))

In [None]:
print(train.loc[1000, 'bigram_text'])

The corpus is small, so the model rarely sees pairs such as new york and cannot learn many phrases from it. Phrase modeling is not strictly necessary for this problem, but it can be useful in more complex applications. [This link provides good examples.](http://nbviewer.jupyter.org/github/naveenrc/YelpChallenge/blob/e4813a131c788242c233b9e9290de0126544f3e0/Modern_NLP.ipynb#Phrase-Modeling)

## Topic Modeling with Latent Dirichlet Allocation (LDA)
Topic modeling is a family of techniques that can be used to describe and summarize the documents in a corpus according to a set of latent "topics".

In many conventional NLP applications, documents are represented as a mixture of the individual tokens (words and phrases) they contain. In other words, a document is represented as a vector of token counts. There are two layers in this model, documents and tokens, and the size or dimensionality of the document vectors is the number of tokens in the corpus vocabulary. This approach has a number of disadvantages:

* Document vectors tend to be large (one dimension for each token ⇒ lots of dimensions)
* They also tend to be very sparse. Any given document only contains a small fraction of all tokens in the vocabulary, so most values in the document's token vector are 0.
* The dimensions are fully independent from each other: there's no sense of connection between related tokens, such as knife and fork.

LDA injects a third layer into this conceptual model. Documents are represented as a mixture of a pre-defined number of topics, and the topics are represented as a mixture of the individual tokens in the vocabulary. The number of topics is a model hyperparameter selected by the practitioner. LDA makes a prior assumption that the (document, topic) and (topic, token) mixtures follow Dirichlet probability distributions. This assumption encourages documents to consist mostly of a handful of topics, and topics to consist mostly of a modest set of the tokens.
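
The three-layer picture can be illustrated by running LDA's generative story forward. This is only a sketch of the modeling assumption, using the standard library and a made-up six-word vocabulary; it does not fit a model, and the `sample_dirichlet` helper and the concentration value 0.5 are arbitrary choices for illustration.

```python
import random

def sample_dirichlet(alpha, size, rng):
    """Draw one sample from a symmetric Dirichlet via normalized gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(0)
vocabulary = ['ghost', 'night', 'fear', 'love', 'heart', 'sea']  # toy vocabulary
num_topics = 2

# (topic, token) mixtures: one distribution over the vocabulary per topic
topic_word = [sample_dirichlet(0.5, len(vocabulary), rng) for _ in range(num_topics)]

# (document, topic) mixture for a single document
doc_topic = sample_dirichlet(0.5, num_topics, rng)

# generate a document: pick a topic for each position, then a word from that topic
document = []
for _ in range(8):
    topic = rng.choices(range(num_topics), weights=doc_topic)[0]
    word = rng.choices(vocabulary, weights=topic_word[topic])[0]
    document.append(word)

print(doc_topic)
print(document)
```

Training LDA works in the opposite direction: given only the documents, it searches for topic and document mixtures that make the observed words likely.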

LDA is fully unsupervised. The topics are "discovered" automatically from the data by trying to maximize the likelihood of observing the documents in your corpus, given the modeling assumptions. They are expected to capture some latent structure and organization within the documents, and often have a meaningful human interpretation for people familiar with the subject material.

We'll again turn to gensim to assist with data preparation and modeling. In particular, gensim offers a high-performance parallelized implementation of LDA with its LdaMulticore class.

The first step to creating an LDA model is to learn the full vocabulary of the corpus to be modeled. We'll use gensim's Dictionary class for this.

In [None]:
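# learn the vocabulary from every sentence at once
# (building a new Dictionary inside a loop would keep only the last sentence)
bigram_dictionary = Dictionary(sent.split() for sent in train['bigram_text'])

bigram_dictionary.compactify()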

Like many NLP techniques, LDA uses a simplifying assumption known as the bag-of-words model. In the bag-of-words model, a document is represented by the counts of distinct terms that occur within it. Additional information, such as word order, is discarded.
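
As a rough illustration of the bag-of-words idea, the sketch below builds a token-to-id mapping and turns one document into (token_id, count) pairs, which is the same general shape gensim's `doc2bow` returns. The helper names and the two toy documents are made up for this example.

```python
from collections import Counter

def build_vocabulary(documents):
    """Map each distinct token to an integer id, in order of first appearance."""
    token2id = {}
    for tokens in documents:
        for token in tokens:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs; unknown tokens are skipped."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

docs = [
    'the raven sat above the door'.split(),
    'the door opened'.split(),
]
token2id = build_vocabulary(docs)
bow = to_bow(docs[0], token2id)
print(bow)
```

Note that the pair for "the" carries a count of 2, while the order in which the words appeared is gone.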

We'll use the gensim Dictionary we just learned to generate a bag-of-words representation for each sentence, and store the result in `bigram_bow`.

In the following code, "bag-of-words" is abbreviated as bow.

In [None]:
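# doc2bow converts one tokenized document at a time, so build one bow per sentence
bigram_bow = [bigram_dictionary.doc2bow(sent.split()) for sent in train['bigram_text']]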

In [None]:
lda = LdaMulticore(bigram_bow,
                   num_topics=3,
                   id2word=bigram_dictionary)

In [None]:
def explore_topic(topic_number, topn=25):
    """
    accept a user-supplied topic number and
    print out a formatted list of the top terms
    """
        
    print(u'{:20} {}'.format(u'term', u'frequency') + u'\n')

    for term, frequency in lda.show_topic(topic_number, topn=topn):
        print(u'{:20} {:.3f}'.format(term, round(frequency, 3)))

In [None]:
explore_topic(2)

In [None]:
LDAvis_prepared = pyLDAvis.gensim.prepare(lda, bigram_bow, bigram_dictionary)
pyLDAvis.display(LDAvis_prepared)

### Wait, what am I looking at again?
There are a lot of moving parts in the visualization. Here's a brief summary:

* On the left, there is a plot of the "distance" between all of the topics (labeled as the Intertopic Distance Map)

The plot is rendered in two dimensions according to a multidimensional scaling (MDS) algorithm. Topics that are generally similar should appear close together on the plot, while dissimilar topics should appear far apart.

The relative size of a topic's circle in the plot corresponds to the relative frequency of the topic in the corpus.

An individual topic may be selected for closer scrutiny by clicking on its circle, or entering its number in the "selected topic" box in the upper-left.

* On the right, there is a bar chart showing top terms.

When no topic is selected in the plot on the left, the bar chart shows the top most "salient" terms in the corpus. A term's saliency is a measure of both how frequent the term is in the corpus and how "distinctive" it is in distinguishing between different topics.

When a particular topic is selected, the bar chart changes to show the top most "relevant" terms for the selected topic. The relevance metric is controlled by the parameter λ, which can be adjusted with a slider above the bar chart.

Setting the λ parameter close to 1.0 (the default) will rank the terms solely according to their probability within the topic.

Setting λ close to 0.0 will rank the terms solely according to their "distinctiveness" or "exclusivity" within the topic, i.e., terms that occur only in this topic and do not occur in other topics.

Setting λ to values between 0.0 and 1.0 will result in an intermediate ranking, weighting term probability and exclusivity accordingly.

Rolling the mouse over a term in the bar chart on the right will cause the topic circles to resize in the plot on the left, to show the strength of the relationship between the topics and the selected term.
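
The effect of the λ slider can be sketched with the relevance formula from the LDAvis paper by Sievert and Shirley: relevance = λ · log p(term | topic) + (1 − λ) · log(p(term | topic) / p(term)). The probabilities below are invented for illustration, and pyLDAvis's own implementation may differ in detail.

```python
import math

def relevance(p_term_given_topic, p_term, lam):
    """lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w))"""
    return (lam * math.log(p_term_given_topic)
            + (1 - lam) * math.log(p_term_given_topic / p_term))

# made-up probabilities: (p(term | topic), p(term) in the whole corpus)
terms = {
    'the':     (0.050, 0.060),  # frequent everywhere
    'cthulhu': (0.010, 0.002),  # rarer overall, but concentrated in this topic
}

def rank(lam):
    """Return the terms sorted from most to least relevant for a given lambda."""
    return sorted(terms, key=lambda t: relevance(*terms[t], lam), reverse=True)

print(rank(1.0))  # ranks by probability within the topic
print(rank(0.0))  # ranks by how exclusive the term is to the topic
```

With these toy numbers the common word leads at λ = 1.0 and the topic-specific word leads at λ = 0.0, which mirrors what happens when you drag the slider.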

A more detailed explanation of the pyLDAvis visualization can be found here. Unfortunately, though the data used by gensim and pyLDAvis are the same, they don't use the same ID numbers for topics. If you need to match up topics in gensim's LdaMulticore object and pyLDAvis' visualization, you have to dig through the terms manually.

### Analyzing our LDA model
The interactive visualization pyLDAvis produces is helpful for both:

1. Better understanding and interpreting individual topics, and
2. Better understanding the relationships between the topics.

For (1), you can manually select each topic to view its most frequent and/or "relevant" terms, using different values of the λ parameter. This can help when you're trying to assign a human interpretable name or "meaning" to each topic.

For (2), exploring the Intertopic Distance Plot can help you learn about how topics relate to each other, including potential higher-level structure between groups of topics.

### Describing text with LDA
Beyond data exploration, one of the key uses for an LDA model is providing a compact, quantitative description of natural language text. Once an LDA model has been trained, it can be used to represent free text as a mixture of the topics the model learned from the original corpus. This mixture can be interpreted as a probability distribution across the topics, so the LDA representation of a paragraph of text might look like 50% Topic A, 20% Topic B, 20% Topic C, and 10% Topic D.

To use an LDA model to generate a vector representation of new text, you'll need to apply any text preprocessing steps you used on the model's training corpus to the new text, too. For our model, the preprocessing steps we used include:

* Using spaCy to remove punctuation and lemmatize the text
* Applying our first-order phrase model to join word pairs
* Applying our second-order phrase model to join longer phrases
* Removing stopwords
* Creating a bag-of-words representation

Once you've applied these preprocessing steps to the new text, it's ready to pass directly to the model to create an LDA representation. The lda_description(...) function will perform all these steps for us, including printing the resulting topical description of the input text.

In [None]:
topic_names={0: 'EAP',
             1: 'MWS',
             2: 'HPL'}
def lda_description(text, min_topic_freq=0.08):
    """
    accept the original text of a sentence and (1) parse it with spaCy,
    (2) apply text pre-processing steps, (3) create a bag-of-words
    representation, (4) create an LDA representation, and
    (5) print a sorted list of the top topics in the LDA representation
    """
    
    # parse the text with spaCy
    parsed_sent = nlp(text)
    
    # lemmatize the text and remove punctuation and whitespace
    unigram_sent = [token.lemma_ for token in parsed_sent
                      if not punct_space(token)]
    
    # apply the first-order models
    bigram_sent = bigram_model[unigram_sent]
    
    # remove any remaining stopwords
    # (nlp.Defaults.stop_words is the stop word set of the loaded pipeline)
    bigram_sent = [term for term in bigram_sent
                      if term not in nlp.Defaults.stop_words]
    
    # create a bag-of-words representation
    sent_bow = bigram_dictionary.doc2bow(bigram_sent)
    
    # create an LDA representation
    sent_lda = lda[sent_bow]
    
    # sort with the most highly related topics first
    sent_lda = sorted(sent_lda, key=lambda x: -x[1])
    
    for topic_number, freq in sent_lda:
        if freq < min_topic_freq:
            break
            
        # print the most highly related topic names and frequencies
        print('{:25} {}'.format(topic_names[topic_number],
                                round(freq, 3)))

In [None]:
print('Probabilities:')
lda_description(train.loc[5, 'text'])  # prints its own output and returns None
print('Actual:', train.loc[5, 'author'])

In [None]:
print('Probabilities:')
lda_description(train.loc[1000, 'text'])  # prints its own output and returns None
print('Actual:', train.loc[1000, 'author'])

As you can see above, the model assigns topic probabilities to each sentence. This is an example of how LDA works. The topics are not exactly what we need for author identification; I have only mapped author names onto them to show how the output is used and can be interpreted.

## Word Vector Embedding with Word2Vec
The goal of word vector embedding models, or word vector models for short, is to learn dense, numerical vector representations for each term in a corpus vocabulary. If the model is successful, the vectors it learns about each term should encode some information about the meaning or concept the term represents, and the relationship between it and other terms in the vocabulary. Word vector models are also fully unsupervised — they learn all of these meanings and relationships solely by analyzing the text of the corpus, without any advance knowledge provided.

Perhaps the best-known word vector model is word2vec, originally proposed in 2013. The general idea of word2vec is, for a given focus word, to use the context of the word — i.e., the other words immediately before and after it — to provide hints about what the focus word might mean. To do this, word2vec uses a sliding window technique, where it considers snippets of text only a few tokens long at a time.

At the start of the learning process, the model initializes random vectors for all terms in the corpus vocabulary. The model then slides the window across every snippet of text in the corpus, with each word taking turns as the focus word. Each time the model considers a new snippet, it tries to learn some information about the focus word based on the surrounding context, and it "nudges" the words' vector representations accordingly. One complete pass sliding the window across all of the corpus text is known as a training epoch. It's common to train a word2vec model for multiple passes/epochs over the corpus. Over time, the model rearranges the terms' vector representations such that terms that frequently appear in similar contexts have vector representations that are close to each other in vector space.
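
The sliding window itself is easy to sketch. The snippet below only enumerates each focus word together with its context words; it does not train anything, and `context_pairs` is a hypothetical helper rather than part of gensim.

```python
def context_pairs(tokens, window=2):
    """Yield (focus_word, context_words) for each position in the sentence."""
    for i, focus in enumerate(tokens):
        left = tokens[max(0, i - window):i]      # up to `window` words before
        right = tokens[i + 1:i + 1 + window]     # up to `window` words after
        yield focus, left + right

sentence = 'the old house stood silent'.split()
pairs = list(context_pairs(sentence, window=2))
for focus, context in pairs:
    print(focus, '<-', context)
```

Each (focus, context) pair is one small training example: the model adjusts the vectors so that a word and the words around it become easier to predict from one another.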

For a deeper dive into word2vec's machine learning process, see here.

Word2vec has a number of user-defined hyperparameters, including:

* The dimensionality of the vectors. Typical choices include a few dozen to several hundred.
* The width of the sliding window, in tokens. Five is a common default choice, but narrower and wider windows are possible.
* The number of training epochs.

For using word2vec in Python, gensim comes to the rescue again! It offers a highly-optimized, parallelized implementation of the word2vec algorithm with its Word2Vec class.

In [None]:
train['vec_inp'] = train['bigram_text'].map(lambda x: x.split(' '))

In [None]:
import sys

word2vec = Word2Vec(train['vec_inp'], size=20, window=5,
                        min_count=5, sg=0)

# continue training: 199 more calls, each running word2vec.iter epochs
for i in range(1,200):
    sys.stderr.write('\rOn {}'.format(i))
    word2vec.train(train['vec_inp'], total_examples=word2vec.corpus_count, 
                   epochs=word2vec.iter)

In [None]:
print(u'{:,} terms in the word2vec vocabulary.'.format(len(word2vec.wv.vocab)))

In [None]:
# build a list of the terms, integer indices,
# and term counts from the word2vec model vocabulary
ordered_vocab = [(term, voc.index, voc.count)
                 for term, voc in word2vec.wv.vocab.items()]

# sort by the term counts, so the most common terms appear first
ordered_vocab = sorted(ordered_vocab, key=lambda x: -x[2])

# unzip the terms, integer indices, and counts into separate lists
ordered_terms, term_indices, term_counts = zip(*ordered_vocab)

# create a DataFrame with the word2vec vectors as data,
# and the terms as row labels; select rows by term_indices
# so that each vector lines up with its term
word_vectors = pd.DataFrame(word2vec.wv.syn0[list(term_indices), :],
                            index=ordered_terms)

word_vectors

In [None]:
def get_related_terms(token, topn=10):
    """
    look up the topn most similar terms to token
    and print them as a formatted list
    """

    for word, similarity in word2vec.wv.most_similar(positive=[token], topn=topn):

        print(u'{:20} {}'.format(word, round(similarity, 3)))

In [None]:
get_related_terms(u'owl')

In [None]:
get_related_terms(u'fear')

In [None]:
get_related_terms(u'blood')

In [None]:
get_related_terms(u'jealous')

### Word algebra!
The core idea is that once words are represented as numerical vectors, you can do math with them. The mathematical procedure goes like this:

* Provide a set of words or phrases that you'd like to add or subtract.
* Look up the vectors that represent those terms in the word vector model.
* Add and subtract those vectors to produce a new, combined vector.
* Look up the most similar vector(s) to this new, combined vector via cosine similarity.
* Return the word(s) associated with the similar vector(s).
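
The steps above can be sketched with hand-made two-dimensional vectors and plain Python. The vectors are invented for illustration, `toy_word_algebra` is a hypothetical helper, and the sketch skips details that a library implementation may add, such as normalizing the vectors before combining them.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def toy_word_algebra(vectors, add, subtract=()):
    """Add and subtract word vectors, then return the nearest remaining word."""
    dims = len(next(iter(vectors.values())))
    combined = [0.0] * dims
    for word in add:
        combined = [c + x for c, x in zip(combined, vectors[word])]
    for word in subtract:
        combined = [c - x for c, x in zip(combined, vectors[word])]
    # exclude the input words themselves, then rank by cosine similarity
    candidates = [w for w in vectors if w not in add and w not in subtract]
    return max(candidates, key=lambda w: cosine(combined, vectors[w]))

# hand-made 2-d vectors, purely for illustration
toy_vectors = {
    'night':  [1.0, 0.0],
    'fear':   [0.0, 1.0],
    'terror': [0.7, 0.7],
    'day':    [-1.0, 0.1],
}
result = toy_word_algebra(toy_vectors, add=['night', 'fear'])
print(result)
```

Here night + fear points in the same direction as the toy vector for terror, so that word is returned; with real embeddings the answers depend entirely on what the model learned from the corpus.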

But more generally, you can think of the vectors that represent each word as encoding some information about the meaning or concepts of the word. What happens when you ask the model to combine the meaning and concepts of words in new ways? Let's see.

In [None]:
def word_algebra(add=[], subtract=[], topn=1):
    """
    combine the vectors associated with the words provided
    in add= and subtract=, look up the topn most similar
    terms to the combined vector, and print the result(s)
    """
    answers = word2vec.wv.most_similar(positive=add, negative=subtract, topn=topn)
    
    for term, similarity in answers:
        print(term)

### NIGHT + FEAR = ?

In [None]:
word_algebra(add=[u'night', u'fear'])

### FEAR - NIGHT = ? (something negative)

In [None]:
word_algebra(add=[u'fear'], subtract=[u'night'])

### NIGHT - FEAR = ? (something pleasant)

In [None]:
word_algebra(add=[u'night'], subtract=[u'fear'])

## Word Vector Visualization with t-SNE
t-Distributed Stochastic Neighbor Embedding, or t-SNE for short, is a dimensionality reduction technique to assist with visualizing high-dimensional datasets. It attempts to map high-dimensional data onto a low two- or three-dimensional representation such that the relative distances between points are preserved as closely as possible in both high-dimensional and low-dimensional space.

scikit-learn provides a convenient implementation of the t-SNE algorithm with its TSNE class.

In [None]:
# drop stop words before plotting; labels missing from the index are ignored
tsne_input = word_vectors.drop(list(nlp.Defaults.stop_words), errors=u'ignore')

In [None]:
tsne = TSNE()
tsne_vectors = tsne.fit_transform(tsne_input.values)

Now we have a two-dimensional representation of our data! Let's take a look.

In [None]:
tsne_vectors = pd.DataFrame(tsne_vectors,
                            index=pd.Index(tsne_input.index),
                            columns=[u'x_coord', u'y_coord'])
tsne_vectors.head()

In [None]:
tsne_vectors[u'word'] = tsne_vectors.index

In [None]:
# add our DataFrame as a ColumnDataSource for Bokeh
plot_data = ColumnDataSource(tsne_vectors)

# create the plot and configure the
# title, dimensions, and tools
tsne_plot = figure(title=u't-SNE Word Embeddings',
                   plot_width = 800,
                   plot_height = 800,
                   tools= (u'pan, wheel_zoom, box_zoom,'
                           u'box_select, resize, reset'),
                   active_scroll=u'wheel_zoom')

# add a hover tool to display words on roll-over
tsne_plot.add_tools( HoverTool(tooltips = u'@word') )

# draw the words as circles on the plot
tsne_plot.circle(u'x_coord', u'y_coord', source=plot_data,
                 color=u'blue', line_alpha=0.2, fill_alpha=0.1,
                 size=10, hover_line_color=u'black')

# configure visual elements of the plot
tsne_plot.title.text_font_size = value(u'16pt')
tsne_plot.xaxis.visible = False
tsne_plot.yaxis.visible = False
tsne_plot.grid.grid_line_color = None
tsne_plot.outline_line_color = None

# engage!
show(tsne_plot);

## Conclusion
Let's round up the major components that we've seen:

* Text processing with spaCy
* Automated phrase modeling
* Topic modeling with LDA → visualization with pyLDAvis
* Word vector modeling with word2vec → visualization with t-SNE

Why use these models?<br>
Dense vector representations for text like LDA and word2vec can greatly improve performance for a number of common, text-heavy problems like:

* Text classification
* Search
* Recommendations
* Question answering

...and more generally are a powerful way machines can help humans make sense of what's in a giant pile of text. They're also often useful as a pre-processing step for many other downstream machine learning applications.