# Overview

In this notebook I'll illustrate how text analytics can be done using Python and publicly available models.

**There is a set of exercises to do by hand, followed by code. We love computers because they add faster than we do and never get tired, but it is essential that you do the hand exercises to fully understand what the computer is doing.**

The main work behind this is to convert text into numerical values so that a computer can perform tasks the way a human reader would.
These tasks include:

*  Finding similar words
*  Clustering documents
*  Natural Language Processing (NLP) 
    *  Sentiment analysis
    *  Language translation
    *  Photo captioning
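
As a minimal sketch of what "converting text to numerical values" can mean, the snippet below turns short texts into word-count vectors (a bag of words) and compares them with cosine similarity. It only illustrates the idea and is not how word2vec or GloVe work internally; the helper names `bag_of_words` and `cosine_similarity` and the sample phrases are my own choices for this example.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count lowercase word occurrences in a text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


doc_a = bag_of_words("a new nation conceived in liberty")
doc_b = bag_of_words("any nation so conceived and so dedicated")
doc_c = bag_of_words("the world will little note")

print(cosine_similarity(doc_a, doc_b))  # the texts share "nation" and "conceived"
print(cosine_similarity(doc_a, doc_c))  # the texts have no words in common
```

Once texts are vectors, "finding similar words" and "clustering documents" become questions about distances between numbers.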

NLP is a very broad topic and this is just an introduction. For more information, start [here](https://en.wikipedia.org/wiki/Natural_language_processing).

Many text models are based on [GloVe](https://nlp.stanford.edu/projects/glove/) and [word2vec](https://en.wikipedia.org/wiki/Word2vec).

## Word2vec

Word2vec was first published in 2013 by Tomas Mikolov and colleagues at Google, and the method was patented. You can build your own `word2vec` model on any corpus of text, but my recommendation is to use a pre-trained model. These are usually based on very large collections of text such as Newsgroups, Quora, or Wikipedia.
`gensim` is a very popular Python package for using prebuilt language models.



## GloVe

GloVe is a collection of models that were trained on different corpora. The most commonly used one was trained on Wikipedia (together with the Gigaword news corpus), about 6 billion tokens in total, and has a vocabulary of 400k words.
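
The published GloVe files are plain text: each line holds a word followed by its vector components, separated by spaces. Here is a small sketch that parses that layout and looks up the nearest word by cosine similarity. The three 3-dimensional vectors are invented for illustration (the real files use 50 to 300 dimensions), and `load_vectors`, `cosine`, and `most_similar` are hypothetical helper names, not part of any library.

```python
import io
import math

# Made-up 3-dimensional vectors in the GloVe text layout: word, then floats.
TOY_GLOVE = """\
king 0.9 0.8 0.1
queen 0.85 0.82 0.15
apple 0.1 0.2 0.9
"""


def load_vectors(handle):
    """Parse 'word v1 v2 ...' lines into a dict of word -> list of floats."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors):
    """Return the other word whose vector is closest by cosine similarity."""
    return max((w for w in vectors if w != word),
               key=lambda w: cosine(vectors[word], vectors[w]))


vectors = load_vectors(io.StringIO(TOY_GLOVE))
print(most_similar("king", vectors))
```

To read a downloaded GloVe file instead, the same function could be given a handle from `open(path, encoding="utf-8")`.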



We will follow this diagram from [Adam Geitgey](https://medium.com/@ageitgey/natural-language-processing-is-fun-9a0bff37854e):


![NLP Pipeline](./NLP_pipeline.png)

# Setup

Here are the Python modules needed to run this code.

This example uses the `nltk` package for the text tasks.

In [None]:
from pprint import pprint
import nltk 
from nltk.stem import PorterStemmer 
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer 
from nltk.chunk import conlltags2tree, tree2conlltags

nltk.download('punkt') 
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('maxent_ne_chunker')
nltk.download('words')

# The text for this example is Abraham Lincoln's famous [Gettysburg Address](https://en.wikipedia.org/wiki/Gettysburg_Address)

The text is written below and assigned to a variable named `gettysburg_address`

In [None]:
gettysburg_address = """Four score and seven years ago our fathers brought forth, upon this continent, a new nation, conceived in liberty, and dedicated to the proposition that "all men are created equal".

Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived, and so dedicated, can long endure. We are met on a great battle field of that war. We have come to dedicate a portion of it, as a final resting place for those who died here, that the nation might live. This we may, in all propriety do. But, in a larger sense, we can not dedicate -- we can not consecrate -- we can not hallow, this ground-- The brave men, living and dead, who struggled here, have hallowed it, far above our poor power to add or detract. The world will little note, nor long remember what we say here; while it can never forget what they did here.

It is rather for us, the living, to stand here, we here be dedicated to the great task remaining before us -- that, from these honored dead we take increased devotion to that cause for which they here, gave the last full measure of devotion -- that we here highly resolve these dead shall not have died in vain; that the nation, shall have a new birth of freedom, and that government of the people by the people for the people, shall not perish from the earth."""


# Break the text into sentences.

The first step is to break the text into sentences. Below is the first sentence; please add the rest to the cell. There are 8 sentences in total.

* Four score and seven years ago our fathers brought forth, upon this continent, a new nation, conceived in liberty, and dedicated to the proposition that "all men are created equal".
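
To get a feel for what a sentence splitter has to decide, here is a deliberately naive sketch that splits after `.`, `!`, or `?` whenever whitespace follows. It is not the algorithm `nltk` uses; `naive_sentences` is a hypothetical helper, and the last sample sentence (with "Dr.") is invented to show where such a simple rule goes wrong.

```python
import re


def naive_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


sample = ("We are met on a great battle field of that war. "
          "This we may, in all propriety do. "
          "Dr. Smith would break this rule.")

for sentence in naive_sentences(sample):
    print(sentence)
```

The abbreviation "Dr." is cut off as its own sentence, which is the kind of case a trained tokenizer is designed to handle.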


In [None]:
gettysburg_sentences = nltk.sent_tokenize(gettysburg_address)
gettysburg_sentences

# Break the text into tokens.

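Now we need to break each sentence into its individual tokens (words and punctuation marks).

Here is the first sentence: 

['Four', 'score', 'and', 'seven', 'years', 'ago', 'our', 'fathers', 'brought', 'forth', ',', 'upon', 'this', 'continent', ',', 'a', 'new', 'nation', ',', 'conceived', 'in', 'liberty', ',', 'and', 'dedicated', 'to', 'the', 'proposition', 'that', '"', 'all', 'men', 'are', 'created', 'equal', '"', '.']

Each punctuation mark is its own token. (`nltk` also rewrites the opening and closing double quotes as two backticks and two apostrophes, as the tagged output later in this notebook shows.)

You will need to do this for sentences 2 & 3.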
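
For comparison with the hand exercise, a rough tokenizer can be sketched with one regular expression: runs of word characters become tokens, and every other non-space character becomes a token of its own. This is a simplification and `simple_tokenize` is a hypothetical helper; `nltk` treats cases such as `--` and quotation marks differently.

```python
import re


def simple_tokenize(sentence):
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)


tokens = simple_tokenize("We are met on a great battle field of that war.")
print(tokens)
print(len(tokens))
```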



In [None]:
gettysburg_word_tokens = nltk.tokenize.word_tokenize(gettysburg_address)

# Show the first 15 words
gettysburg_word_tokens[:15]

# Part of Speech Tagging

To use NLP effectively, it is important to know each word's part of speech. This might seem like going back to grammar school and diagramming sentences, because it is :)

Here are the parts of speech that `NLTK` identifies:
```
CC coordinating conjunction
CD cardinal number
DT determiner
EX existential there (like: “there is” … think of it like “there exists”)
FW foreign word
IN preposition/subordinating conjunction
JJ adjective ‘big’
JJR adjective, comparative ‘bigger’
JJS adjective, superlative ‘biggest’
LS list marker 1)
MD modal could, will
NN noun, singular ‘desk’
NNS noun plural ‘desks’
NNP proper noun, singular ‘Harrison’
NNPS proper noun, plural ‘Americans’
PDT predeterminer ‘all the kids’
POS possessive ending parent’s
PRP personal pronoun I, he, she
PRP$ possessive pronoun my, his, hers
RB adverb very, silently,
RBR adverb, comparative better
RBS adverb, superlative best
RP particle give up
TO to, as in go ‘to’ the store
UH interjection, errrrrrrrm
VB verb, base form take
VBD verb, past tense, took
VBG verb, gerund/present participle taking
VBN verb, past participle taken
VBP verb, non-3rd person sing. present take
VBZ verb, 3rd person sing. present takes
WDT wh-determiner which
WP wh-pronoun who, what
WP$ possessive wh-pronoun whose
WRB wh-adverb where, when
```

Here is the solution for the first sentence:
```
[('Four', 'CD'),
 ('score', 'NN'),
 ('and', 'CC'),
 ('seven', 'CD'),
 ('years', 'NNS'),
 ('ago', 'RB'),
 ('our', 'PRP$'),
 ('fathers', 'NNS'),
 ('brought', 'VBD'),
 ('forth', 'NN'),
 (',', ','),
 ('upon', 'IN'),
 ('this', 'DT'),
 ('continent', 'NN'),
 (',', ','),
 ('a', 'DT'),
 ('new', 'JJ'),
 ('nation', 'NN'),
 (',', ','),
 ('conceived', 'VBN'),
 ('in', 'IN'),
 ('liberty', 'NN'),
 (',', ','),
 ('and', 'CC'),
 ('dedicated', 'VBD'),
 ('to', 'TO'),
 ('the', 'DT'),
 ('proposition', 'NN'),
 ('that', 'IN'),
 ('``', '``'),
 ('all', 'DT'),
 ('men', 'NNS'),
 ('are', 'VBP'),
 ('created', 'VBN'),
 ('equal', 'JJ'),
 ("''", "''")]
```
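
Because the tagger returns plain `(token, tag)` tuples, ordinary Python is enough to work with the result. The sketch below uses a few pairs copied from the first-sentence solution above to count tags and pull out the nouns; the variable names are my own.

```python
from collections import Counter, defaultdict

# A few (token, tag) pairs copied from the first-sentence solution.
tagged = [
    ("Four", "CD"), ("score", "NN"), ("and", "CC"), ("seven", "CD"),
    ("years", "NNS"), ("ago", "RB"), ("our", "PRP$"), ("fathers", "NNS"),
    ("brought", "VBD"),
]

# How often does each tag occur?
tag_counts = Counter(tag for _, tag in tagged)

# Which words received each tag?
words_by_tag = defaultdict(list)
for word, tag in tagged:
    words_by_tag[tag].append(word)

# Noun tags all start with "NN" (NN, NNS, NNP, NNPS).
nouns = [word for word, tag in tagged if tag.startswith("NN")]

print(tag_counts)
print(nouns)
```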

Take a few minutes to identify the parts of speech for sentences 2 & 3:

In [None]:
nltk.pos_tag(nltk.tokenize.word_tokenize(gettysburg_address))[:36]

# Stemming
To improve the quality of search and clustering, words are stemmed to their root. Stemming makes fathers and father the same, since the only difference is the quantity. 

Here are the stems for the first sentence:
```
Four  :  four
score  :  score
and  :  and
seven  :  seven
years  :  year
ago  :  ago
our  :  our
fathers  :  father
brought  :  brought
forth  :  forth
,  :  ,
upon  :  upon
this  :  thi
continent  :  contin
,  :  ,
a  :  a
new  :  new
nation  :  nation
,  :  ,
conceived  :  conceiv
in  :  in
liberty  :  liberti
,  :  ,
and  :  and
dedicated  :  dedic
to  :  to
the  :  the
proposition  :  proposit
that  :  that
``  :  ``
all  :  all
men  :  men
are  :  are
created  :  creat
equal  :  equal
''  :  ''
```

Stem the words for sentences 2 & 3. Some of the stems are shorter than you might expect (created => creat), so just do your best.
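
To show the flavor of rule-based stemming, here is a toy stemmer that lowercases a word and strips a single common suffix. It is far simpler than the Porter algorithm and will disagree with it on many words (for example, Porter turns liberty into liberti and dedicated into dedic, while this sketch gives liberty and dedicat); `toy_stem` is a hypothetical helper.

```python
def toy_stem(word):
    """Lowercase a word and strip one common suffix (not the Porter algorithm)."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ["years", "fathers", "created", "testing", "war"]:
    print(w, " : ", toy_stem(w))
```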


In [None]:
ps = PorterStemmer() 
   
for w in nltk.tokenize.word_tokenize(gettysburg_address)[:36]: 
    print(w, " : ", ps.stem(w)) 

# Lemmatization

Stemming and lemmatization appear very similar, and for many words they give identical results. Lemmatization is often preferred over stemming because it takes other information, such as part of speech, into account instead of just trimming suffixes (see [Morphology](https://en.wikipedia.org/wiki/Morphology_(linguistics)) ).
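
The role of part of speech can be sketched with a tiny lookup table keyed by word and part of speech. The table is hand-written for illustration and `toy_lemmatize` is a hypothetical helper; a real lemmatizer consults a full dictionary such as WordNet.

```python
# Hypothetical mini-dictionary keyed by (word, part of speech).
# "n" = noun, "v" = verb.
TOY_LEMMAS = {
    ("men", "n"): "man",
    ("fathers", "n"): "father",
    ("are", "v"): "be",
    ("met", "v"): "meet",
    ("brought", "v"): "bring",
}


def toy_lemmatize(word, pos="n"):
    """Look up the dictionary form; fall back to the lowercase word."""
    return TOY_LEMMAS.get((word.lower(), pos), word.lower())


print(toy_lemmatize("men"))           # found as a noun
print(toy_lemmatize("met"))           # not listed as a noun, returned unchanged
print(toy_lemmatize("met", pos="v"))  # found as a verb
```

`WordNetLemmatizer.lemmatize` works in a similar spirit: it accepts a `pos` argument and treats the word as a noun when none is given, so verbs such as "are" or "met" come back unchanged unless `pos="v"` is passed.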

Run the code below to see the lemmatization of sentences 2 & 3, then compare it to your solution.

In [None]:
lemmatizer = WordNetLemmatizer() 
for w in nltk.tokenize.word_tokenize(gettysburg_address)[37:78]: 
    print(w, " : ", lemmatizer.lemmatize(w)) 

# Remove stop words

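Stop words are very common words (such as "the", "and", and "of") that carry little meaning on their own, so they are usually removed before search or clustering.

Below is the first sentence with stop words removed. The sentence originally had 36 tokens; with the stop words removed it has 25.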

```
['Four',
 'score',
 'seven',
 'years',
 'ago',
 'fathers',
 'brought',
 'forth',
 ',',
 'upon',
 'continent',
 ',',
 'new',
 'nation',
 ',',
 'conceived',
 'liberty',
 ',',
 'dedicated',
 'proposition',
 '``',
 'men',
 'created',
 'equal',
 "''"]
```

To see the list of stop words in `nltk`, run the code below (there are 179). Then remove any word that is in sentences 2 & 3 AND in the stop word list (the list is alphabetic). 

The correct answer for sentences 2 & 3 has 25 tokens.
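
One detail to keep in mind for the hand exercise is that a plain membership test is case-sensitive. The sketch below uses a tiny hand-picked stop list (the `nltk` list is much longer) on the tokens of sentence 3 to show the difference between exact and case-insensitive matching.

```python
# A tiny hand-picked stop list for illustration; nltk's English list is much longer.
toy_stop_words = {"we", "are", "on", "a", "of", "that"}

tokens = ["We", "are", "met", "on", "a", "great", "battle",
          "field", "of", "that", "war", "."]

# Exact match: the capitalized "We" is kept because only "we" is in the set.
kept_exact = [t for t in tokens if t not in toy_stop_words]

# Case-insensitive match: lowercase each token before the lookup.
kept_lower = [t for t in tokens if t.lower() not in toy_stop_words]

print(len(tokens), len(kept_exact), len(kept_lower))
print(kept_exact)
```

The code cells in this notebook use the exact match, so capitalized words at the start of a sentence stay in the result.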

In [None]:
stop_words = set(stopwords.words('english')) 
print(sorted(stop_words))

In [None]:
stop_words_removed = [w for w in nltk.tokenize.word_tokenize(gettysburg_address)[:36] if w not in stop_words]
stop_words_removed

# Named Entity Recognition (NER)

This task attempts to take the tokens of the text and categorize them into pre-defined groups such as names of people, organizations, locations, times, monetary values, and so on.
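
The NER code in this notebook converts its result with `tree2conlltags`, which yields `(token, part of speech, IOB tag)` triples: `B-` marks the beginning of an entity, `I-` a continuation, and `O` a token outside any entity. Here is a sketch of how such triples can be grouped back into whole entities. The sample triples are hand-written, not real tagger output, and `group_entities` is a hypothetical helper.

```python
def group_entities(iob_triples):
    """Collect (entity text, label) pairs from (token, pos, iob) triples."""
    entities = []
    current_tokens, current_label = [], None
    for token, _pos, iob in iob_triples:
        if iob.startswith("B-") or (iob.startswith("I-") and current_label != iob[2:]):
            # A new entity starts: flush the previous one first.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], iob[2:]
        elif iob.startswith("I-"):
            current_tokens.append(token)
        else:  # "O" = outside any entity
            if current_tokens:
                entities.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_label))
    return entities


# Hand-written sample, not real tagger output.
sample = [
    ("Abraham", "NNP", "B-PERSON"),
    ("Lincoln", "NNP", "I-PERSON"),
    ("spoke", "VBD", "O"),
    ("at", "IN", "O"),
    ("Gettysburg", "NNP", "B-GPE"),
    (".", ".", "O"),
]

print(group_entities(sample))
```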

Run the code below to see the NER for the speech

## First sentence

In [None]:
ne_tree = nltk.ne_chunk(nltk.pos_tag(nltk.tokenize.word_tokenize(gettysburg_address))[:36])
iob_tagged = tree2conlltags(ne_tree)
pprint(iob_tagged)

## Sentences 2 & 3

In [None]:
ne_tree = nltk.ne_chunk(nltk.pos_tag(nltk.tokenize.word_tokenize(gettysburg_address))[37:78])
iob_tagged = tree2conlltags(ne_tree)
pprint(iob_tagged)

# spaCy
Another popular package, in addition to `nltk`, is `spacy`. In my opinion its rendered display of named entity recognition is much better, but with the default options it found only two named entities.

Here is a screenshot of the analysis:

![spacy](./spacy_gburg.png)
