<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">
 
# Continuing with Natural Language Processing 

### Learning Objectives
- Remember the techniques from the previous lesson on word vectorization
- Investigate additional approaches to expand NLP context
- Demonstrate an understanding of word-embedding
- Understand the purpose of latent variable models

### Please install the following libraries before starting this lesson.

**spacy**  - `pip install spacy` followed by `python -m spacy download en_core_web_sm`

**gensim** - `pip install gensim`

**pyldavis** - `pip install pyldavis`

In [None]:
import pandas as pd
import numpy as np

import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline

import spacy
from spacy import displacy

import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
#nltk.download('wordnet')
#nltk.download('stopwords')

from gensim import matutils, corpora
from gensim.models import Word2Vec
from gensim.corpora import MmCorpus, dictionary
from gensim.models.ldamodel import LdaModel
from gensim.models import CoherenceModel
from gensim.models import KeyedVectors


#import pyLDAvis.gensim_models

pd.options.display.float_format = '{:,.8f}'.format
%matplotlib inline

# Applications vs Mathematics and Design


As you progress further into the world of Machine Learning and Natural Language Processing, it becomes more and more imperative to understand the underlying mathematics. Today's lesson focuses on overviews and applications. For those wishing to go further, the articles and references below help explain the underlying maths and principles behind the techniques and libraries discussed here:


- [Original paper on Word2vec](https://arxiv.org/abs/1301.3781)
- [Topic Modeling high level maths](https://medium.com/nanonets/topic-modeling-with-lsa-psla-lda-and-lda2vec-555ff65b0b05)
- [Designing extraction for biomedical topics](https://arxiv.org/abs/2010.00074)
- [Evaluating an LDA topic model](https://towardsdatascience.com/evaluate-topic-model-in-python-latent-dirichlet-allocation-lda-7d57484bb5d0)

#### The field is growing rapidly, with new techniques continually emerging

- [BERT - Bidirectional Encoder Representations from Transformers](https://www.geeksforgeeks.org/explanation-of-bert-model-nlp/)
- [Autoencoders](https://mc.ai/introduction-to-autoencoders/)
- [LSTM - Long Short Term Memory Models](https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21)

#### As these models continue to emerge it's helpful to review
- Sites with over 1,400 models supporting inference across Sequence Classification, Text Generation, Question Answering, Token Classification with NER/POS, Summarization, NL inference, Conversational AI, Machine Translation, Text-to-speech and Commonsense Reasoning
    - [NLP model forge overview](https://medium.com/towards-artificial-intelligence/the-nlp-model-forge-a46faac7b5b0)
    - [Models](https://models.quantumstat.com/)
- Follow key individuals like [Philip Vollet](https://www.linkedin.com/in/philipvollet/) to stay apprised of updates in the space

All the above are pieces to a larger puzzle you are trying to solve with NLP. While we can overview the techniques today - every machine learning problem you encounter will require extensive design spanning from preprocessing strategies to how you extract and either apply or persist your findings.

## Building on our understanding of NLP

In the last lesson you learned the basics of mining unstructured data, including preprocessing, building linguistic rules to uncover patterns and creating a basic classifier on review data. In addition, you saw the built-in power of NLP libraries, with basics such as sentiment analysis and objectivity analysis and additional features such as spell check.

The challenge in all NLP problems begins with the transformation of unstructured words into numbers. Those numbers must retain some meaning to allow the analysis to continue. With a basic approach to vectorizing words, you learned the power of word presence within a document using a bag-of-words method (`CountVectorizer`), and how to capture the importance of a word through its frequency (Term Frequency-Inverse Document Frequency). Both are helpful for simple tasks.
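
As a quick refresher, both vectorizers can be sketched in plain Python. This is only a minimal illustration: the three-document corpus and the helper names `bag_of_words` and `tf_idf` are made up for this example, and the weighting is the textbook TF-IDF formula rather than the smoothed variant scikit-learn applies.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration only
docs = [
    "the food was good",
    "the service was bad",
    "good food good service",
]

def bag_of_words(doc):
    """Count how often each word appears in one document."""
    return Counter(doc.split())

def tf_idf(docs):
    """Textbook TF-IDF: term frequency times log(N / document frequency)."""
    counts = [bag_of_words(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word
    doc_freq = Counter(word for c in counts for word in c)
    weights = []
    for c in counts:
        total = sum(c.values())
        weights.append({
            word: (n / total) * math.log(n_docs / doc_freq[word])
            for word, n in c.items()
        })
    return weights

counts = bag_of_words(docs[2])
weights = tf_idf(docs)
print(counts)
print(weights[1])
```

In the second document, "bad" appears in no other document, so it receives a higher weight than "the", which is shared with the first document. Neither representation says anything about the order or context in which the words were used.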

The challenge arises when we work on more complex tasks.

# It's all about context

**Context** is the hardest component of machine understanding of human language. While the world is inundated with chatbots, voice assistants and mining algorithms, the field of NLP is only about 30% of the way toward replicating human linguistic intelligence.


And it's no surprise - it's hard! Language is imperfect, filled with personal contexts, idiomatic expressions and regional colloquialisms. Machines can excel at the common things; as for the uncommon... well, we're getting better but we're not there yet. At the end of the day:

![context is king](assets/context-is-king.jpg)

#### Consider the below.

- Shut up. You are being rude.

- OMG Shut up!

**What do you think is the intended meaning of "shut up" in each sentence?**

It can even happen at the level of a single word, simply by adding a piece of punctuation:

- There

- There!

**How might these two hold different meaning?**


As you can imagine, the way data is designed to capture this nuance is incredibly important if we wish to hold onto the context of what someone is trying to communicate. While the prevalence or frequency of words can be important, communication is often concerned with context. It's these nuances in words, both spoken and written, that can carry completely separate meanings.


### How can we build on context?

Previously, with our bag-of-words models, we couldn't really address context, only the prevalence of words. One way we can build on our understanding is through **Part of Speech Tagging**.

Consider the word **bank** - How many meanings do you associate it with? 


#### **What if I said limit it to a noun**?


#### **How about a verb**?

### Part of Speech (POS) Tags

POS tagging is the process of marking up each word in a corpus with its corresponding part-of-speech tag, based on its context and definition. This task is not straightforward, as a particular word may have a different part of speech depending on the context in which it is used.


There are different techniques for POS Tagging:
1. **Lexical Based Methods** - Assign the POS tag that most frequently occurs with a word in the training corpus.

2. **Rule-Based Methods** - Assign POS tags based on rules. For example, we can have a rule that says words ending with “ed” or “ing” must be tagged as verbs. Rule-based techniques can be used along with lexical-based approaches to allow POS tagging of words that are not present in the training corpus but do appear in the testing data.

3. **Probabilistic Methods** - Assign POS tags based on the probability of a particular tag sequence occurring. Conditional Random Fields (CRFs) and Hidden Markov Models (HMMs) are probabilistic approaches to assigning a POS tag.

4. **Deep Learning Methods** - Recurrent Neural Networks can also be used for POS tagging.

By understanding the word's POS - we can better understand the intended meaning. Furthermore by building "parse trees" we are able to build other contextual items like Named Entity Recognition Systems and Entity Linking capabilities.
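
To make the first two techniques concrete, here is a minimal sketch of a lexical tagger with a rule-based fallback. The training pairs, the tag names and the helper `tag_word` are invented for illustration, and the default `NOUN` guess is an assumption of this sketch rather than a standard rule.

```python
from collections import Counter, defaultdict

# Tiny hand-labeled training corpus, invented for illustration
training = [
    ("the", "DET"), ("bank", "NOUN"), ("opened", "VERB"),
    ("they", "PRON"), ("bank", "VERB"), ("online", "ADV"),
    ("the", "DET"), ("river", "NOUN"), ("bank", "NOUN"),
]

# Lexical method: remember the most frequent tag seen for each word
tag_counts = defaultdict(Counter)
for word, tag in training:
    tag_counts[word][tag] += 1
lexicon = {word: tags.most_common(1)[0][0] for word, tags in tag_counts.items()}

def tag_word(word):
    """Look the word up in the lexicon, else fall back to suffix rules."""
    word = word.lower()
    if word in lexicon:
        return lexicon[word]
    # Rule-based fallback for words missing from the training corpus
    if word.endswith("ed") or word.endswith("ing"):
        return "VERB"
    return "NOUN"  # crude default guess, an assumption of this sketch

tags = [(w, tag_word(w)) for w in "the bank closed".split()]
print(tags)
```

Note that this tagger labels "bank" as a noun every time, even in "they bank online". That blind spot is what the probabilistic and deep learning methods address by looking at the surrounding tags and words.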

<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>

<center><img src='assets/spacy.png' alt='sPacy' width='300'></center>


[spaCy](https://spacy.io/) is an open-source software library for advanced natural language processing, written in Python and Cython. Unlike NLTK, which is widely used for teaching and research, spaCy focuses on providing software for production usage. Key abilities include:

- Non-destructive tokenization
- Named entity recognition
- Support for 59+ languages
- 46 statistical models for 16 languages
- Pretrained word vectors
- State-of-the-art speed
- Easy deep learning integration
- Part-of-speech tagging
- Labelled dependency parsing
- Syntax-driven sentence segmentation
- Built in visualizers for syntax and NER
- Convenient string-to-hash mapping
- Export to numpy data arrays
- Efficient binary serialization
- Easy model packaging and deployment
- Robust, rigorously evaluated accuracy


Below we will review it for POS tagging, named entity recognition and named entity linking. However, those interested in NLP should review its offerings in greater depth for its range of applications. You can see a list of [Universal POS tags here](https://universaldependencies.org/docs/u/pos/).


### POS Tagging
Below we take a boilerplate spaCy example to show how it parses sentences into their component parts. This example is limited to displaying each word, its associated POS and any additional detail contained in the tag.

In [None]:
# Loading in the Spacy Model

nlp = spacy.load("en_core_web_sm")

# Pass the model a sentence to review
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion")

#Show the ability to parse out each word. 
for token in doc:
    print(f'"{token.text}" is a "{token.pos_}" tagged as a "{token.tag_}"')
    print()

### spaCy will also natively map the relationships between words to show associations built on traditional grammar rules and conventions

In [None]:
# We can take this further by using displacy to render how words are tied together
displacy.render(doc, style="dep")

In [None]:
# There are ranges of other features including the ability to capture, dissect and retain structure for noun chunks

doc = nlp("Autonomous cars shift insurance liability toward manufacturers")
for chunk in doc.noun_chunks:
    print(f'{chunk.text}  | {chunk.root.text}   | {chunk.root.dep_}   |{chunk.root.head.text}')

## Named entity recognition

Named Entity Recognition (NER) - is an information extraction technique that automatically identifies named entities in a text and classifies them into predefined categories. Entities can be names of people, organizations, locations, times, quantities, monetary values, percentages, and more.

With named entity recognition, you can obtain key information to understand what a text is about, making it a great starting point for all kinds of text analysis and data organization.

### NER falls into four primary approaches:

1. **Lexicon Approach** - Relies on a knowledge base containing all the words or terms related to a particular topic, grouped into categories. While this can be a strong approach, it is solely dependent on the data/dictionaries you have on hand. The obvious downside is that it does not work for terms that are not in the lexicon.


2. **Rule-based systems** - Employ a series of grammatical rules hand-crafted by computational linguists. Rules work well to extract entities like street names, phone numbers, social security numbers, or any other type of data that follows specific patterns.

    With rule-based systems, you can get results of high precision but low recall. This means that, while most of the predictions for predefined categories are true positives (e.g, the majority of the words that a model tags as “company name” are actually companies), the ability of a model to identify all relevant instances a company is mentioned is low. Defining rules and patterns takes time and they can’t be adapted to new domains; they only work well for the purpose they’ve been created, and it’s hard to modify them.
    
    
3. **Machine Learning Based Systems** - Learn to recognize entities in text based on previous examples. To build an entity extractor, you must begin with a large volume of labeled training data, complete with both positives and negatives, so that it can _learn_ what an entity is.


4. **Hybrid Approach** - Traditionally a hybrid approach builds on a rules-based approach by adding a machine learning approach. The combination will allow you to extract entities with a high level of precision.
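
The first two approaches are simple enough to sketch by hand. The example below is a minimal illustration only: the `LEXICON` entries, the regular expressions in `RULES` and the function `find_entities` are all hypothetical, and the label names are chosen to resemble the ones spaCy displays.

```python
import re

# Hypothetical lexicon, for illustration only
LEXICON = {"apple": "ORG", "google": "ORG", "london": "GPE"}

# Hand-written patterns for entities that follow a fixed format
RULES = [
    ("MONEY", re.compile(r"\$\d+(?:\.\d+)?(?: (?:million|billion))?")),
    ("PHONE", re.compile(r"\b\d{3}-\d{3}-\d{4}\b")),
]

def find_entities(text):
    """Return (entity text, label) pairs found by rules, then by lexicon lookup."""
    found = []
    for label, pattern in RULES:
        found.extend((m.group(), label) for m in pattern.finditer(text))
    for token in re.findall(r"[A-Za-z]+", text):
        label = LEXICON.get(token.lower())
        if label:
            found.append((token, label))
    return found

entities = find_entities("Apple is buying a London startup for $1 billion. Call 555-123-4567.")
print(entities)
```

The weaknesses described above show up quickly: `find_entities("I want a red apple.")` also labels the fruit as an organization, and any company missing from the lexicon is never found. A machine learning or hybrid system uses the surrounding words to avoid both problems.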


### Real-life use cases

1. **Article Tagging** - designing an internal search algorithm for an online publisher that has millions of articles. If for every search query the algorithm ends up searching all the words in millions of articles, the process will take a lot of time. Instead, if Named Entity Recognition can be run once on all the articles and the relevant entities (tags) associated with each of those articles are stored separately, this could speed up the search process considerably. With this approach, a search term will be matched with only the small list of entities discussed in each article leading to faster search execution.

2. **Recommendation Systems** - An expanding use case is adding meta-data to recommendation systems. NER can extract entities from news articles or summary documents and then recommend similar entities based on a user's choices.

3. **Ticket Categorization** - Automate repetitive tasks by leveraging entity extraction to pull relevant pieces of data from your incoming tickets, like company names, product names, or serial numbers, making it easier to route tickets to the most suitable agent or team for handling that issue.

4. **Customer Feedback** - Enhance survey responses by using NER to extract locations and products to filter/route feedback appropriately

5. **Resume Review** - Quickly parse out contact information, certifications and relevant experience
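
The article tagging use case boils down to building an inverted index once and searching it many times. A minimal sketch, assuming NER has already been run and its output stored; the article ids, the entity sets and the `search` helper are hypothetical:

```python
from collections import defaultdict

# Hypothetical output of running NER once over each article
article_entities = {
    "article-1": {"Apple", "U.K."},
    "article-2": {"Springfield", "Apple"},
    "article-3": {"Google"},
}

# Build an inverted index from each entity to the articles that mention it
entity_index = defaultdict(set)
for article_id, entities in article_entities.items():
    for entity in entities:
        entity_index[entity.lower()].add(article_id)

def search(term):
    """Match a search term against the stored entities only."""
    return sorted(entity_index.get(term.lower(), set()))

print(search("apple"))
```

Each query is now a dictionary lookup against a short list of entities rather than a scan of every word in every article.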


#### Below we'll leverage spaCy to explore NER

In [None]:
text = "Apple is looking at buying a U.K. startup for $1 billion. I want a red apple."

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
displacy.render(doc, style="ent")

#### Try it yourself

In [None]:
# Run the following and input your own sentence.

text = input()

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
displacy.render(doc, style="ent")

# Designing a system with POS and NER

Now that you know how to capture a POS tag or a named entity, what's next? Being able to extract this context is fundamental to preprocessing for larger models. Including these features adds context that allows a follow-on model to be more generative or discriminative, depending on its purpose.


### Where could these help with the models we've discussed in class so far?


## Latent Variable models

With the focus on **context**, the goal of today's lesson is to continue with methods that refine the extraction of structure, organization and/or context in the text. Some of these methods come from having larger dictionaries, more data or expanded methods to build toward the underpinning of language and its underlying meaning.

Latent variables are variables that are not directly observed but are rather inferred (through a mathematical model) from other variables that are observed (directly measured).


Unlike our previous methods - **Latent Variable Models** are different in that they try to understand language based on **how** the words are used. We assume the data we are observing has some hidden, underlying structure that we can’t see, and which we’d like to learn.

These hidden, underlying structures are the latent (i.e. hidden) variables we want our model to understand.


In the previous lesson we learned we could equate 'bad' and 'badly' as they share the same root. Today we'll determine they are related because they are often used in the same way or near similar words.


|Traditional NLP Models|Latent Variable Models|
|------|------|
|Focused on theoretical understanding of language  |Focused on how the language is actually used in practice    |
|Tries to learn the rules of a particular language |Infers meaning from how words are used together    |
|Preprogrammed set of rules |Uses unsupervised learning to discover patterns or structure   |


One problem needs to be addressed in all the latent models we'll explore below: the size of the matrix. By trying to add in context, we run the risk of creating a very large, noisy and sparse matrix. Thus enters the need for dimensionality reduction.

There are many techniques to perform dimensionality reduction automatically, and most follow a very similar approach:

1. Identify correlated columns
2. Replace them with a new column that encapsulates the others

![dimensionalityreduction](./assets/dimreduce.png)
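
The two steps can be illustrated on a toy matrix. This is a deliberately simplified sketch: the counts are made up, and averaging the two correlated columns stands in for the weighted combinations that a real technique such as `TruncatedSVD` would learn.

```python
import math

# Toy document-term counts (made up): rows are documents, columns are terms
terms = ["car", "automobile", "banana"]
rows = [
    [2, 2, 0],
    [4, 3, 0],
    [0, 0, 3],
    [1, 1, 1],
]

def column(matrix, j):
    """Extract column j of a list-of-lists matrix."""
    return [row[j] for row in matrix]

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Step 1: identify correlated columns ("car" and "automobile")
corr = pearson(column(rows, 0), column(rows, 1))

# Step 2: replace them with one new column (here, simply their average)
reduced = [[(r[0] + r[1]) / 2, r[2]] for r in rows]

print(round(corr, 3))
print(reduced)
```

The matrix shrinks from three columns to two while keeping most of the information, because "car" and "automobile" rise and fall together across the documents.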

### Word Embeddings - Flipping our understanding of the Document Term Matrix

The core concept of word embeddings is that every word used in a language can be represented by a set of real numbers (a vector). If we take our usual document-word matrix and take its transpose, instead of talking about words as being features of a document, we can talk about documents as being features of a specific word.
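
A minimal sketch of that transpose, using a made-up document-term matrix (the vocabulary and counts are invented for illustration):

```python
# Toy document-term matrix (made up): rows are documents, columns are words
vocab = ["homer", "donut", "school"]
doc_term = [
    [3, 2, 0],  # doc 0
    [1, 0, 0],  # doc 1
    [0, 0, 4],  # doc 2
]

# Transpose so that each row describes a word by the documents it appears in
term_doc = [list(col) for col in zip(*doc_term)]
word_vectors = dict(zip(vocab, term_doc))

print(word_vectors["homer"])  # counts of "homer" in doc 0, doc 1, doc 2
```

Each word is now represented by a vector with one entry per document. With a real corpus these vectors are long and sparse, which is why the embedding methods below compress them into a few hundred dense dimensions.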

#### Word embeddings open a variety of possible applications:

- Automatic summarization
- Machine translation 
- Named entity resolution
- Sentiment analysis
- Information retrieval
- Speech recognition
- Question answering
- Music/Video Recommendation


All of these are powered by finding context and determining how similar an item is to another. As the goal is to transform words into vectors - one often used method is **cosine similarity** - a measurement of similarity between two non-zero vectors of an inner product space.


![CosineSimilarity](assets/cosine.png)

<div style="text-align: right"><a href="https://dataaspirant.com/five-most-popular-similarity-measures-implementation-in-python/" title="Cosine Similarity">Source</a></div>
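
Cosine similarity is the dot product of two vectors divided by the product of their lengths, $\cos\theta = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$. A minimal pure-Python sketch, with the function name `cosine_similarity` and the example vectors chosen for illustration:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1, 2], [2, 4])
orthogonal = cosine_similarity([1, 0], [0, 1])
opposite = cosine_similarity([1, 2], [-1, -2])
print(same_direction, orthogonal, opposite)
```

Vectors pointing the same way score close to 1, unrelated (orthogonal) vectors score 0, and opposite vectors score close to -1. The length of the vectors does not matter, only their direction.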

## Word2Vec - Exploring word embeddings


<center><img src='assets/word2vecmodel.png' alt='Word2vec' width='750'></center>

Word2Vec is one of the most common techniques for discovering and leveraging word embeddings, and it is available in Python through libraries such as gensim. A word is defined by the company it keeps. That's the premise behind Word2Vec as a method of converting words to numbers and representing them in a multi-dimensional space. Words frequently found close together in a collection of documents (corpus) will also appear close together in this space. They are said to be related contextually, which lets us capture the much-needed context.

Word2Vec is not a single algorithm but a combination of two techniques:

   - **Continuous bag-of-words (CBOW)** - The model predicts the current word from the surrounding window of context words. The order of context words does not influence prediction (bag-of-words assumption).
   - **Continuous skip-gram** - The model uses the current word to predict the surrounding window of context words, and weighs nearby context words more heavily than more distant ones. While order still is not captured, each context vector is weighed and compared independently, whereas CBOW weighs against the average context.
    
Both of these are shallow neural networks which map one or more words to a target that is also one or more words. Both techniques learn weights which act as word vector representations. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence.
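
The difference between the two architectures is easiest to see in the training examples each one builds from a sentence. The sketch below only generates those (input, target) examples and does not train a network; the helper names and the sample sentence are assumptions for illustration.

```python
def context_window(tokens, i, window):
    """Words within `window` positions of tokens[i], excluding the word itself."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right

def cbow_examples(tokens, window=2):
    """CBOW: the whole context (order ignored) predicts the current word."""
    return [(context_window(tokens, i, window), tokens[i]) for i in range(len(tokens))]

def skipgram_examples(tokens, window=2):
    """Skip-gram: the current word predicts each context word separately."""
    return [
        (tokens[i], ctx)
        for i in range(len(tokens))
        for ctx in context_window(tokens, i, window)
    ]

tokens = "homer eats a donut".split()
cbow = cbow_examples(tokens, window=1)
skipgram = skipgram_examples(tokens, window=1)
print(cbow)
print(skipgram)
```

CBOW produces one example per word, with the context bundled together as the input. Skip-gram produces one example per (word, context word) pair, so each context word is scored independently.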

#### Example
The most famous example of the context learned through Word2Vec is shown below. Here we can see a few common word vectors based on a set of pretrained data.

<center><img src='assets/word2vec1.png' alt='Word2vec' width='600'></center>


Through turning these words to vectors the following equation was possible.

### (king – man) + woman = ?

![Word2vec](assets/word2vec2.png)
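
The arithmetic itself can be sketched with hand-made vectors. The two-dimensional values below are invented purely for illustration (think of the axes as "royalty" and "gender"); real Word2Vec vectors are learned from data and have hundreds of dimensions.

```python
import math

# Hand-made 2-D vectors (royalty, gender), invented purely for illustration
vectors = {
    "king":  [0.9, 0.9],
    "queen": [0.9, -0.9],
    "man":   [0.1, 0.9],
    "woman": [0.1, -0.9],
    "apple": [-0.8, 0.0],
}

def cosine(a, b):
    """Cosine similarity between two non-zero 2-D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# (king - man) + woman, computed element by element
target = [k - m + w for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])]

# Rank the remaining words by cosine similarity to the resulting vector
candidates = [w for w in vectors if w not in ("king", "man", "woman")]
best = max(candidates, key=lambda w: cosine(vectors[w], target))
print(best)
```

Subtracting "man" removes the gender component from "king", adding "woman" puts the opposite one back, and the nearest remaining vector is "queen".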

You can reproduce this result using the code below and this [file](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit). As this is based on a neural network, it needs quite a bit of data to understand nuance/context. Warning: this file is 3.4 GB unzipped.



In [None]:
#Example - Do not run

#from gensim.models import KeyedVectors

# load the google word2vec model
#filename = 'GoogleNews-vectors-negative300.bin'
#model = KeyedVectors.load_word2vec_format(filename, binary=True)

# calculate: (king - man) + woman = ?
#result = model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
#print(result)

### Trialing Word2Vec

Let's see what we can do with less data. Below we'll work through an example with data from the longest-running cartoon on the planet - [The Simpsons](https://www.kaggle.com/pierremegret/dialogue-lines-of-the-simpsons?select=simpsons_dataset.csv)

We'll build the model in three steps:

1. Build our model with parameters to allow it to parse through our data.
2. Build the vocabulary from a sequence of sentences
3. Train the model

In [None]:
Simpsons=pd.read_csv('./data/simpsons.csv')

### Step 1: Set Parameters:

- **min_count** = int - Ignores all words with total absolute frequency lower than this - (2, 100)
- **window** = int - The maximum distance between the current and predicted word within a sentence. E.g. `window` words on the left and `window` words on the right of our target - (2, 10)
- **vector_size** = int - Dimensionality of the feature vectors (named `size` in gensim versions before 4.0) - (50, 300)
- **sample** = float - The threshold for configuring which higher-frequency words are randomly downsampled. Highly influential. - (0, 1e-5)
- **alpha** = float - The initial learning rate - (0.01, 0.05)
- **min_alpha** = float - Learning rate will linearly drop to min_alpha as training progresses. To set it: alpha - (min_alpha * epochs) ~ 0.00
- **negative** = int - If > 0, negative sampling will be used; the int for negative specifies how many "noise words" should be drawn. If set to 0, no negative sampling is used. - (5, 20)
- **workers** = int - Use these many worker threads to train the model (=faster training with multicore machines)
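
To get a feel for how `alpha` and `min_alpha` interact, here is a simplified sketch of a linear learning-rate schedule. It steps once per epoch for readability, which is an assumption of this example; gensim adjusts the rate progressively as training proceeds. The function `learning_rate_schedule` is hypothetical and not part of gensim.

```python
def learning_rate_schedule(alpha, min_alpha, epochs):
    """Linearly interpolate from alpha (first epoch) to min_alpha (last epoch)."""
    if epochs == 1:
        return [alpha]
    step = (alpha - min_alpha) / (epochs - 1)
    return [alpha - step * e for e in range(epochs)]

schedule = learning_rate_schedule(alpha=0.03, min_alpha=0.0007, epochs=30)
print(schedule[0], schedule[-1])
```

Large early steps move the vectors quickly toward a reasonable layout, and the small late steps fine-tune them without undoing earlier progress.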

In [None]:
model = Word2Vec(min_count=20,
                     window=2,
                     vector_size=300,
                     sample=6e-5, 
                     alpha=0.03, 
                     min_alpha=0.0007, 
                     negative=20)

### Preparing the Input
Starting from the beginning, gensim's Word2Vec expects a sequence of sentences as its input. Each sentence is a list of words (UTF-8 strings):

In [None]:
sentences = Simpsons.sentences.map(lambda sentences: sentences.split())

In [None]:
#Visualizing the split
sentences

### Step 2: Building the Vocabulary 
Word2Vec requires us to build the vocabulary table (simply digesting all the words and filtering out the unique words, and doing some basic counts on them):

In [None]:
model.build_vocab(sentences, progress_per=10000)
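
Conceptually, this step counts every word and drops the rare ones. The sketch below mimics that idea in plain Python; it is not gensim's implementation, and the `build_vocab` function and toy sentences here are invented for illustration.

```python
from collections import Counter

def build_vocab(sentences, min_count=1):
    """Count every word, then keep only words seen at least `min_count` times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}

# Tiny tokenized corpus, invented for illustration
toy_sentences = [
    ["homer", "eats", "donut"],
    ["homer", "loves", "donut"],
    ["bart", "skips", "school"],
]
vocab = build_vocab(toy_sentences, min_count=2)
print(vocab)
```

Words below the threshold never receive a vector, which is why asking the trained model about a rare word raises a `KeyError`.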

### Step 3: Training of the model:
Parameters of the training:
- total_examples = int - Count of sentences;
- epochs = int - Number of iterations (epochs) over the corpus - [10, 20, 30]

In [None]:
# Training will take a few seconds
model.train(sentences, total_examples=model.corpus_count, epochs=30, report_delay=1)

#### Let's use our knowledge of the world's longest running cartoon to play around with some similarities. What is most similar to homer?

In [None]:
model.wv.most_similar(positive=["homer"])

#### What is most similar to Principal Skinner?

In [None]:
model.wv.most_similar(positive=["skinner"])

#### How about looking at similarities between words within our corpus?

In [None]:
model.wv.similarity("moe", 'tavern')

In [None]:
model.wv.similarity('maggie', 'baby')

#### Who is the odd one out in the below?

In [None]:
model.wv.doesnt_match(["nelson", "bart", "milhouse"])
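
Roughly speaking, the odd one out is the word whose vector is least aligned with the average of the group. The sketch below illustrates that idea with invented two-dimensional vectors; it is a simplified stand-in rather than gensim's code, and `odd_one_out` and `toy_vectors` are hypothetical names.

```python
import math

# Toy 2-D vectors, invented for illustration (not learned from the corpus)
toy_vectors = {
    "bart": [0.9, 0.1],
    "milhouse": [0.8, 0.2],
    "tavern": [0.1, 0.9],
}

def unit(v):
    """Scale a 2-D vector to length 1."""
    norm = math.hypot(*v)
    return [x / norm for x in v]

def odd_one_out(words, vectors):
    """Return the word whose unit vector is least aligned with the group mean."""
    units = {w: unit(vectors[w]) for w in words}
    dims = len(next(iter(units.values())))
    mean = unit([sum(u[d] for u in units.values()) / len(units) for d in range(dims)])
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))

odd = odd_one_out(["bart", "milhouse", "tavern"], toy_vectors)
print(odd)
```

With three schoolchildren, as in the cell above, the vectors are much closer together, so the answer depends on subtler differences in how each character's name is used in the dialogue.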

#### Now back to the (king – man) + woman = ? example. Which word is to "girl" as "bart" is to "boy"?

In [None]:
model.wv.most_similar(positive=["girl", "bart"], negative=["boy"], topn=5)

### Key Takeaways:
    
1. Word embeddings allow us to build on our prior tokenization methods by adding "context" to the analysis
2. Word2Vec allows us to flip our DTM transposing the focus from the document to the term
3. Word2Vec has a built in dimensionality reduction model bringing similar contexts together

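Note - [GloVe](https://nlp.stanford.edu/projects/glove/) is another word-embedding method; it learns its vectors from global word co-occurrence statistics across the whole corpus, adding a global context to the local windows Word2Vec relies on.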