# NLP pre model activities - Distillation

### Natural Language Processing (NLP)
NLP is the process of extracting information from text data.
### Distillation
Distillation condenses a structured or unstructured data source into an extracted representation, with weights that indicate the importance and relevance of each element.

## 1. Stemming
Stemming is the process of reducing words (generally inflected or derived forms) to their word stem or root form. The objective of stemming is to reduce related words to the same stem, even if the stem is not a dictionary word.
For example, in English:

* beautiful and beautifully are stemmed to beauti 
* good, better and best are stemmed to good, better and best respectively

In [5]:
#!pip install stemming
from stemming.porter2 import stem
stem("beautifully")

'beauti'
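
To make the idea of a non-dictionary stem concrete, here is a minimal sketch of suffix stripping. It is deliberately naive and is not the Porter2 algorithm used above; the suffix list, the `naive_stem` name and the minimum stem length are all assumptions chosen for illustration.

```python
# A deliberately naive suffix stripper; NOT the Porter2 algorithm.
# The suffix list and minimum stem length are illustrative assumptions.
SUFFIXES = ("fully", "ful", "ing", "ly", "ed", "es", "s")  # longest first


def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word  # no suffix matched, so the word is its own stem


for w in ["beautiful", "beautifully", "good", "better", "best"]:
    print(w, "->", naive_stem(w))
```

Both "beautiful" and "beautifully" collapse to the same stem, while "good", "better" and "best" stay distinct because no rule relates them. A real stemmer applies many more rules and conditions than this.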

## 2. Lemmatisation
Lemmatisation is the process of reducing a group of words to their lemma, or dictionary form. Before reducing a word to its lemma, it takes into account things like the part of speech (POS), the meaning of the word in the sentence and the meaning of the word in nearby sentences. For example, in English:

* beautiful and beautifully are lemmatised to beautiful and beautifully respectively.
* good, better and best are lemmatised to good, good and good respectively.


In [10]:
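#!pip install spacy
#!python -m spacy download en
# Note: the "en" shortcut works in spaCy 2.x; spaCy 3+ requires the full
# package name, e.g. spacy.load("en_core_web_sm").
import spacy
nlp = spacy.load("en")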
doc="good better best"

for token in nlp(doc):
    print(token,token.lemma_)

good good
better better
best good
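
The output above leaves "better" unchanged, which may be because the lemma of a word depends on the part of speech assigned to it. The sketch below illustrates that dependency with a tiny hand-written lookup table. The table and the `lemmatise` helper are hypothetical, and real lemmatisers combine large lexicons with morphological rules.

```python
# Tiny hand-written lemma table keyed by (word, POS); purely illustrative.
LEMMA_TABLE = {
    ("better", "ADJ"): "good",
    ("best", "ADJ"): "good",
    ("better", "ADV"): "well",
    ("best", "ADV"): "well",
    ("better", "VERB"): "better",  # as in "to better oneself"
}


def lemmatise(word, pos):
    """Return the lemma for (word, pos), falling back to the lower-cased word."""
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())


for word, pos in [("better", "ADJ"), ("better", "VERB"), ("beautifully", "ADV")]:
    print(word, pos, "->", lemmatise(word, pos))
```

The same surface form "better" maps to different lemmas depending on its tag, which is why a lemmatiser needs POS information and a stemmer does not.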


## 3. Word Embeddings
Word embeddings are techniques for representing natural language as vectors of real numbers. They are useful because computers cannot operate on raw text directly; embeddings capture the meaning of words and the relationships between them numerically. Each word or phrase is represented as a vector of fixed dimension, for example 100.

For example, the word “man” might be represented as a 5-dimensional vector:
<img src="images/word-vector.png" alt="Word Vector" />
where each number is the magnitude of the word along a particular direction.
<img src="images/Word-Vectors-direction.png" alt="Word Vector" />
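
Because each word is a vector, relationships between words can be measured geometrically. The following sketch compares toy 5-dimensional vectors with cosine similarity. The numbers are invented for illustration and do not come from any trained model.

```python
import math

# Hypothetical 5-dimensional vectors; the values are made up for illustration.
vectors = {
    "man":   [0.9, 0.1, 0.8, 0.2, 0.1],
    "woman": [0.9, 0.9, 0.8, 0.2, 0.1],
    "apple": [0.0, 0.1, 0.1, 0.9, 0.8],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print("man vs woman:", round(cosine_similarity(vectors["man"], vectors["woman"]), 3))
print("man vs apple:", round(cosine_similarity(vectors["man"], vectors["apple"]), 3))
```

With these toy values, "man" is closer to "woman" than to "apple". Trained embeddings are built so that this kind of comparison reflects how words are actually used.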

**Implementation:** Here is how you can obtain the pre-trained word vector of a word using the gensim package.

Download the Google News pre-trained word vectors from [here](https://drive.google.com/uc?id=0B7XkCwpI5KDYNlNUTTlSS21pQmM&export=download).

In [16]:
#!pip install gensim
from gensim.models.keyedvectors import KeyedVectors
word_vectors=KeyedVectors.load_word2vec_format('dataset/GoogleNews-vectors-negative300.bin',binary=True)
word_vectors['human']

array([ 5.59082031e-02,  9.22851562e-02,  1.07910156e-01,  2.83203125e-01,
       -2.43164062e-01,  1.90429688e-02,  4.08203125e-01, -3.17382812e-02,
       -4.78515625e-02,  6.34765625e-02, -9.32617188e-02, -4.46777344e-02,
       -2.41210938e-01, -1.58203125e-01, -5.83496094e-02,  2.51953125e-01,
       -3.24707031e-02,  1.00097656e-01, -4.56542969e-02,  1.35742188e-01,
       -2.07031250e-01, -3.73046875e-01,  4.39453125e-02,  4.24804688e-02,
        6.93359375e-02, -2.42187500e-01, -2.75390625e-01,  1.95312500e-01,
        2.26562500e-01, -1.90429688e-01, -2.35351562e-01, -5.56640625e-02,
       -1.25000000e-01, -8.78906250e-02, -2.33398438e-01,  9.61914062e-02,
       -4.83398438e-02,  4.54101562e-02,  9.81445312e-02,  5.76171875e-02,
       -4.17480469e-02,  2.02148438e-01, -9.03320312e-02,  2.75390625e-01,
       -6.34765625e-02,  4.93164062e-02,  2.92968750e-02,  2.57812500e-01,
        1.32812500e-01,  7.42187500e-02,  6.64062500e-02, -1.37695312e-01,
       -1.73828125e-01,  

**Implementation:** Here is how you can train your own word vectors using gensim

In [14]:
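import gensim
# Toy corpus: a list of tokenised sentences.
sentences = [['first', 'sentence'], ['second', 'sentence']]
# `size` is the gensim 3.x parameter name; gensim 4+ renamed it to `vector_size`.
model = gensim.models.Word2Vec(sentences, min_count=1, size=300, workers=4)
# Look up vectors through `model.wv`; indexing the model directly was removed in gensim 4.
model.wv['sentence']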

## 4. Part-Of-Speech Tagging
In simple terms, part-of-speech (POS) tagging is the process of marking each word in a sentence as a noun, verb, adjective, adverb and so on. For example, the sentence below is tagged first with spaCy and then with NLTK:

In [17]:
# POS using Spacy
#!pip install spacy
#!python -m spacy download en 
nlp=spacy.load('en')
sentence="A look at what lies ahead for a Trump National Golf Club housekeeper who disclosed her status as an undocumented immigrant."
for token in nlp(sentence):
   print(token,token.pos_)

A DET
look NOUN
at ADP
what NOUN
lies VERB
ahead ADV
for ADP
a DET
Trump PROPN
National PROPN
Golf PROPN
Club PROPN
housekeeper NOUN
who NOUN
disclosed VERB
her ADJ
status NOUN
as ADP
an DET
undocumented ADJ
immigrant NOUN
. PUNCT
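
Once tokens are tagged, a common next step is to select words by tag. Here is a minimal sketch that groups the (token, tag) pairs from the spaCy output above. The pairs are hard-coded so that the example runs without spaCy installed.

```python
from collections import defaultdict

# (token, coarse POS tag) pairs copied from the spaCy output above.
tagged = [
    ("A", "DET"), ("look", "NOUN"), ("at", "ADP"), ("what", "NOUN"),
    ("lies", "VERB"), ("ahead", "ADV"), ("for", "ADP"), ("a", "DET"),
    ("Trump", "PROPN"), ("National", "PROPN"), ("Golf", "PROPN"),
    ("Club", "PROPN"), ("housekeeper", "NOUN"), ("who", "NOUN"),
    ("disclosed", "VERB"), ("her", "ADJ"), ("status", "NOUN"),
    ("as", "ADP"), ("an", "DET"), ("undocumented", "ADJ"),
    ("immigrant", "NOUN"), (".", "PUNCT"),
]

# Group tokens by tag, preserving sentence order within each group.
by_tag = defaultdict(list)
for token, tag in tagged:
    by_tag[tag].append(token)

print(by_tag["PROPN"])  # ['Trump', 'National', 'Golf', 'Club']
print(by_tag["VERB"])   # ['lies', 'disclosed']
```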


In [18]:
# POS using NLTK
import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
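# Tokenise, then tag each token with its Penn Treebank POS tag.
# `sentence` is whichever string was assigned last in the notebook; the output
# shown below was produced from the summarisation text in section 8.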
tokens = nltk.word_tokenize(sentence)
nltk.pos_tag(tokens)


[('Automatic', 'JJ'),
 ('summarization', 'NN'),
 ('is', 'VBZ'),
 ('the', 'DT'),
 ('process', 'NN'),
 ('of', 'IN'),
 ('shortening', 'VBG'),
 ('a', 'DT'),
 ('text', 'NN'),
 ('document', 'NN'),
 ('with', 'IN'),
 ('software', 'NN'),
 (',', ','),
 ('in', 'IN'),
 ('order', 'NN'),
 ('to', 'TO'),
 ('create', 'VB'),
 ('a', 'DT'),
 ('summary', 'JJ'),
 ('with', 'IN'),
 ('the', 'DT'),
 ('major', 'JJ'),
 ('points', 'NNS'),
 ('of', 'IN'),
 ('the', 'DT'),
 ('original', 'JJ'),
 ('document', 'NN'),
 ('.', '.'),
 ('Technologies', 'NNS'),
 ('that', 'WDT'),
 ('can', 'MD'),
 ('make', 'VB'),
 ('a', 'DT'),
 ('coherent', 'NN'),
 ('summary', 'JJ'),
 ('take', 'NN'),
 ('into', 'IN'),
 ('account', 'NN'),
 ('variables', 'NNS'),
 ('such', 'JJ'),
 ('as', 'IN'),
 ('length', 'NN'),
 (',', ','),
 ('writing', 'VBG'),
 ('style', 'NN'),
 ('and', 'CC'),
 ('syntax.Automatic', 'JJ'),
 ('data', 'NNS'),
 ('summarization', 'NN'),
 ('is', 'VBZ'),
 ('part', 'NN'),
 ('of', 'IN'),
 ('machine', 'NN'),
 ('learning', 'NN'),
 ('and', '

## 5. Named Entity Recognition
Named Entity Recognition (NER) is the task of identifying entities in a sentence and classifying them into categories such as person, organisation, date, location and time. For example, an NER system processes a sentence as follows:

In [19]:
import spacy
nlp=spacy.load('en')
sentence="Ram of Apple Inc. travelled to Sydney on 5th October 2017"
for token in nlp(sentence):
   print(token, token.ent_type_)

Ram 
of 
Apple ORG
Inc. ORG
travelled 
to 
Sydney GPE
on 
5th DATE
October DATE
2017 DATE
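
The output above labels individual tokens, but an entity such as "Apple Inc." spans several of them. The sketch below merges adjacent tokens that share an entity type into one span, using the token/label pairs from the output above. The `merge_entities` helper is hypothetical; spaCy exposes ready-made spans through `doc.ents`, and this simple grouping would wrongly merge two adjacent entities of the same type.

```python
from itertools import groupby

# (token, entity type) pairs copied from the output above; "" means no entity.
tokens = [
    ("Ram", ""), ("of", ""), ("Apple", "ORG"), ("Inc.", "ORG"),
    ("travelled", ""), ("to", ""), ("Sydney", "GPE"), ("on", ""),
    ("5th", "DATE"), ("October", "DATE"), ("2017", "DATE"),
]


def merge_entities(tagged_tokens):
    """Join runs of adjacent tokens that share a non-empty entity type."""
    entities = []
    for ent_type, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if ent_type:  # skip runs of tokens that are not part of an entity
            text = " ".join(tok for tok, _ in group)
            entities.append((text, ent_type))
    return entities


print(merge_entities(tokens))
# [('Apple Inc.', 'ORG'), ('Sydney', 'GPE'), ('5th October 2017', 'DATE')]
```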


## 6. Sentiment Analysis
Sentiment analysis covers a broad range of subjective-analysis tasks that use NLP techniques, such as identifying the sentiment of a customer review, detecting positive or negative feeling in a sentence, or judging mood from voice or written text.

In [9]:
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sentences = ["VADER is smart, handsome, and funny.", # positive sentence example
"VADER is smart, handsome, and funny!"] # punctuation emphasis handled correctly (sentiment intensity adjusted)
paragraph = "It was one of the worst movies I've seen, despite good reviews. \
 Unbelievably bad acting!! Poor direction. VERY poor production. \
 The movie was bad. Very bad movie. VERY bad movie. VERY BAD movie. VERY BAD movie!"
from nltk import tokenize
lines_list = tokenize.sent_tokenize(paragraph)
sentences.extend(lines_list)
sentences


['VADER is smart, handsome, and funny.',
 'VADER is smart, handsome, and funny!',
 "It was one of the worst movies I've seen, despite good reviews.",
 'Unbelievably bad acting!!',
 'Poor direction.',
 'VERY poor production.',
 'The movie was bad.',
 'Very bad movie.',
 'VERY bad movie.',
 'VERY BAD movie.',
 'VERY BAD movie!']

In [16]:
sid = SentimentIntensityAnalyzer()
for sentence in sentences:
    print(sentence)
    # polarity_scores returns a dict with 'neg', 'neu', 'pos' and 'compound' scores
    ss = sid.polarity_scores(sentence)
    for k in sorted(ss):
        print('{0}: {1}, '.format(k, ss[k]), end='')
    print()  # end the line of scores

VADER is smart, handsome, and funny.
compound: 0.8316, neg: 0.0, neu: 0.254, pos: 0.746, 
VADER is smart, handsome, and funny!
compound: 0.8439, neg: 0.0, neu: 0.248, pos: 0.752, 
It was one of the worst movies I've seen, despite good reviews.
compound: -0.7584, neg: 0.394, neu: 0.606, pos: 0.0, 
Unbelievably bad acting!!
compound: -0.6572, neg: 0.686, neu: 0.314, pos: 0.0, 
Poor direction.
compound: -0.4767, neg: 0.756, neu: 0.244, pos: 0.0, 
VERY poor production.
compound: -0.6281, neg: 0.674, neu: 0.326, pos: 0.0, 
The movie was bad.
compound: -0.5423, neg: 0.538, neu: 0.462, pos: 0.0, 
Very bad movie.
compound: -0.5849, neg: 0.655, neu: 0.345, pos: 0.0, 
VERY bad movie.
compound: -0.6732, neg: 0.694, neu: 0.306, pos: 0.0, 
VERY BAD movie.
compound: -0.7398, neg: 0.724, neu: 0.276, pos: 0.0, 
VERY BAD movie!
compound: -0.7616, neg: 0.735, neu: 0.265, pos: 0.0, 
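
The `compound` value is a single normalised score between -1 and 1, so it is often turned into a coarse label. Here is a minimal sketch, assuming the ±0.05 cut-offs conventionally suggested for VADER. The `label_from_compound` helper is hypothetical, and the threshold should be tuned for your own data.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Compound scores copied from the output above.
examples = {
    "VADER is smart, handsome, and funny!": 0.8439,
    "Poor direction.": -0.4767,
    "VERY BAD movie!": -0.7616,
}

for text, compound in examples.items():
    print(label_from_compound(compound), "|", text)
```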


## 7. Semantic Text Similarity

Semantic text similarity is the process of analysing the similarity between two pieces of text with respect to their meaning rather than their syntax. Note that similarity is different from relatedness.<br>
Words can be similar in two ways: **lexically** and **semantically**. Words are lexically similar if they have a *similar character sequence*. Words are semantically similar if they mean the *same thing*, are opposites of each other, are used in the same way or in the *same context*, or if one is a type of the other.
<br>
### > Lexical Similarity 
#### 1. String-Based Similarity 
String-based measures operate on string sequences and character composition. <br>
- **Character-Based Similarity -** An n-gram is a sub-sequence of n items from a given sequence of text. **N-gram** similarity algorithms compare the character or word n-grams of two strings. The score is computed by dividing the number of shared n-grams by the maximal number of n-grams.<br>
- **Term-Based Similarity -** The best-known measure is **cosine similarity**, which measures the similarity between two vectors of an inner product space as the cosine of the angle between them.
<br>
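
Both string-based measures described above can be sketched with the standard library alone. The function names are hypothetical, the n-gram version uses sets of character bigrams, and the cosine version uses whitespace-separated term counts. These are simplifying assumptions, not the only way to define either measure.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """Set of character n-grams in text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_similarity(a, b, n=2):
    """Shared n-grams divided by the larger of the two n-gram counts."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / max(len(grams_a), len(grams_b))


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the term-count vectors of two texts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    dot = sum(counts_a[term] * counts_b[term] for term in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


print(ngram_similarity("night", "nacht"))  # 0.25 (only "ht" is shared)
print(round(cosine_similarity("the cat sat", "the cat ran"), 3))
```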

### > Semantic Similarity 
#### 1. Corpus-Based
Corpus-based measures determine the similarity between words according to information gained from large corpora. A corpus is a large collection of written or spoken texts used for language research.<br>
- **Latent Semantic Analysis (LSA) -** the most popular corpus-based similarity technique. LSA assumes that words that are close in meaning occur in similar pieces of text. A matrix of word counts per paragraph (rows represent unique words and columns represent paragraphs) is constructed from a large piece of text, and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of columns while preserving the similarity structure among rows. Words are then compared by taking the cosine of the angle between the vectors formed by any two rows.
- **Probabilistic Latent Semantic Analysis (pLSA) -** also known as probabilistic latent semantic indexing, a statistical technique for the analysis of two-mode and co-occurrence data.
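
The LSA steps described above (count matrix, truncated SVD, cosine between word rows) can be sketched with NumPy. The four-word vocabulary and the counts are invented for illustration, and `k = 2` is an arbitrary choice of latent dimensions.

```python
import numpy as np

words = ["cat", "dog", "car", "road"]
# Rows are words, columns are paragraphs; the counts are made up.
counts = np.array([
    [2, 1, 0, 0],  # cat
    [1, 2, 0, 0],  # dog
    [0, 0, 2, 1],  # car
    [0, 0, 1, 2],  # road
], dtype=float)

# Full SVD, then keep only the k largest singular values (fewer columns).
U, S, Vt = np.linalg.svd(counts, full_matrices=False)
k = 2
word_vectors = U[:, :k] * S[:k]  # one k-dimensional row per word


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


cat, dog, car, road = word_vectors
print("cat vs dog:", round(cosine(cat, dog), 3))
print("cat vs car:", round(cosine(cat, car), 3))
```

In the raw counts, the cosine between "cat" and "dog" is 0.8. After the reduction to two dimensions the two words become almost identical, while "cat" and "car" stay unrelated.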

#### 2. Knowledge-Based 
Knowledge-based measures determine the degree of similarity between words using information derived from semantic networks.<br>
- **Vector -** this measure creates a co-occurrence matrix for each word used in the WordNet glosses from a given corpus, and then represents each gloss/concept with a vector that is the average of these co-occurrence vectors.

The most popular packages that cover knowledge-based similarity measures are **WordNet::Similarity** and the **Natural Language Toolkit (NLTK)**.

## 8. Text Summarisation
Text summarisation is the process of shortening a text by identifying its important points and building a summary from them. The goal is to retain as much information as possible in as little text as possible, without altering the meaning.<br>
Here is how you can quickly summarise your text using the gensim package.

In [17]:
# gensim.summarization was removed in gensim 4.0, so this cell needs gensim < 4.
from gensim.summarization import summarize
sentence="Automatic summarization is the process of shortening a text document with software, in order to create a summary with the major points of the original document. Technologies that can make a coherent summary take into account variables such as length, writing style and syntax.Automatic data summarization is part of machine learning and data mining. The main idea of summarization is to find a subset of data which contains the information of the entire set. Such techniques are widely used in industry today. Search engines are an example; others include summarization of documents, image collections and videos. Document summarization tries to create a representative summary or abstract of the entire document, by finding the most informative sentences, while in image summarization the system finds the most representative and important (i.e. salient) images. For surveillance videos, one might want to extract the important events from the uneventful context.There are two general approaches to automatic summarization: extraction and abstraction. Extractive methods work by selecting a subset of existing words, phrases, or sentences in the original text to form the summary. In contrast, abstractive methods build an internal semantic representation and then use natural language generation techniques to create a summary that is closer to what a human might express. Such a summary might include verbal innovations. Research to date has focused primarily on extractive methods, which are appropriate for image collection summarization and video summarization."
summarize(sentence)

'Document summarization tries to create a representative summary or abstract of the entire document, by finding the most informative sentences, while in image summarization the system finds the most representative and important (i.e. salient) images.\nExtractive methods work by selecting a subset of existing words, phrases, or sentences in the original text to form the summary.'
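
To show what extractive summarisation does in principle, here is a minimal sketch that scores each sentence by the average frequency of its words and keeps the top-scoring ones. This is a simple frequency heuristic rather than the TextRank-based method behind gensim's `summarize`; the stop-word list and the `summarise` helper are assumptions made for illustration.

```python
import re
from collections import Counter

# Small illustrative stop-word list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "are",
             "that", "it", "on", "for", "with"}


def content_words(text):
    """Lower-cased words in text, excluding stop words."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def summarise(text, n_sentences=1):
    """Return the n highest-scoring sentences, in their original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(content_words(text))

    def score(sentence):
        words = content_words(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    top = sorted(sentences, key=score, reverse=True)[:n_sentences]
    return " ".join(s for s in sentences if s in top)


text = "Cats chase mice. Cats and mice sleep a lot. The weather is nice today."
print(summarise(text, n_sentences=1))  # Cats chase mice.
```

Sentences that reuse the document's most frequent words score highest, so the off-topic sentence about the weather is dropped first.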

## References
* [A Survey of Text Similarity Approaches](https://pdfs.semanticscholar.org/5b5c/a878c534aee3882a038ef9e82f46e102131b.pdf)
* [How to solve 90% of NLP problems: a step-by-step guide](https://blog.insightdatascience.com/how-to-solve-90-of-nlp-problems-a-step-by-step-guide-fda605278e4e)