# Overview

In this notebook I'll illustrate how text analytics can be done using Python and public models.

**There is a set of exercises to do by hand, followed by code. We love computers because they add faster than we do and never get tired, but it is essential that you do the hand exercises to fully understand what the computer is doing.**

The main work behind this is to convert text into numerical values so that a computer can perform tasks that humans do with language.
These tasks include:

*  Finding similar words
*  Clustering documents
*  Natural Language Processing (NLP) 
    *  Sentiment analysis
    *  Language translation
    *  Photo captioning
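To make "finding similar words" concrete, here is a minimal sketch of how similarity works once words are numbers. The three-dimensional vectors below are made up for illustration (they are an assumption, not values from any real model); the comparison itself uses cosine similarity, a common choice for word vectors.

```python
import math

# Hypothetical 3-dimensional word vectors, invented for this example.
# Real models such as word2vec or GloVe learn vectors with many more dimensions.
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors):
    """Return the other word whose vector points in the closest direction."""
    return max(
        (other for other in vectors if other != word),
        key=lambda other: cosine_similarity(vectors[word], vectors[other]),
    )


print(most_similar("king", toy_vectors))  # queen
```

With these toy numbers "king" is much closer to "queen" than to "apple", which is the behaviour a trained model aims for on real text.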

NLP is a very broad topic and this is just an introduction. For more information, start [here](https://en.wikipedia.org/wiki/Natural_language_processing).

Many text models are based on [GloVe](https://nlp.stanford.edu/projects/glove/) and [word2vec](https://en.wikipedia.org/wiki/Word2vec).

## Word2vec

Word2vec was originally published in 2013 by a team led by Tomas Mikolov and was patented while he was working at Google. You can build your own `word2vec` model on any corpus of text, but my recommendation is to use a pretrained model. These are usually based on very large collections of text such as newsgroups, Quora, or Wikipedia.
`gensim` is a very popular Python package for using prebuilt language models.



## GloVe

GloVe is a collection of models that were trained on different corpora. One of the most widely used was trained on Wikipedia (together with the Gigaword news corpus) and covers 6 billion tokens with a vocabulary of 400k words.
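As a rough sketch of what using a pretrained model looks like, the function below loads GloVe vectors through `gensim`'s downloader and asks for the nearest neighbours of a word. The helper name `similar_words` and the choice of the 50-dimensional `glove-wiki-gigaword-50` set are my assumptions for illustration; the call needs `gensim` installed and an internet connection the first time it runs.

```python
# Assumption: "glove-wiki-gigaword-50" is the pretrained set we want; it is one
# of the models distributed through gensim's downloader and is cached after
# the first download.
PRETRAINED_NAME = "glove-wiki-gigaword-50"


def similar_words(word, topn=5, name=PRETRAINED_NAME):
    """Return the `topn` nearest neighbours of `word` as (word, score) pairs.

    Hypothetical helper for this notebook, not part of gensim itself.
    """
    # Imported inside the function so that defining it triggers no download.
    import gensim.downloader as api

    vectors = api.load(name)  # a KeyedVectors object
    return vectors.most_similar(word, topn=topn)


# Example call (downloads the vectors on first use):
# similar_words("computer")
```

These GloVe vectors are lowercase only, so query with `"computer"` rather than `"Computer"`.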



Here is a diagram from [Adam Geitgey](https://medium.com/@ageitgey/natural-language-processing-is-fun-9a0bff37854e) that illustrates the steps of NLP. Many people break the steps down in slightly different ways.


![NLP Pipeline](./NLP_pipeline.png)

# Bag of Words

This step takes a document (sentence in our case) and converts it to a numerical form as well as counts the word frequency.

Here is a paragraph about [Geoffrey Hinton's education](https://en.wikipedia.org/wiki/Geoffrey_Hinton) from Wikipedia:

"Hinton was educated at King's College, Cambridge, graduating in 1970 with a Bachelor of Arts in experimental psychology. He continued his study at the University of Edinburgh where he was awarded a PhD in artificial intelligence in 1978 for research supervised by Christopher Longuet-Higgins."


1. Break the paragraph into sentences.
    *  Hinton was educated at King's College, Cambridge, graduating in 1970 with a Bachelor of Arts in experimental psychology.
    *  He continued his study at the University of Edinburgh where he was awarded a PhD in artificial intelligence in 1978 for research supervised by Christopher Longuet-Higgins.

**I'll show you with the first sentence, and you need to do the second sentence.**

1. Take the sentence and break it into word tokens:
    "Hinton", "was", "educated", "at", "King's", "College", "Cambridge", "graduating", "in", "1970", "with", "a", "Bachelor", "of", "Arts", "in", "experimental", "psychology", "."

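1. Count the frequency of each word (token)
    bag_of_words1 = {'Hinton':1, 'was':1, 'educated':1, 'at':1, "King's":1, 'College':1, 'Cambridge':1, 'graduating':1, 'in':2, '1970':1, 'with':1, 'a':1, 'Bachelor':1, 'of':1, 'Arts':1, 'experimental':1, 'psychology':1, '.':1}

    bag_of_words2 = {}

1. Now combine the two sentences **hint:** 'Hinton':1 & 'was':2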

    `bag_of_words1 + bag_of_words2 = bag_of_words3`

    bag_of_words3 = {}
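Once you have done the counting by hand, you can check your work with Python's built-in `collections.Counter`. This is a minimal sketch using the token list from the first sentence; the two small bags at the end (`bag_a`, `bag_b`) are made-up examples that only show how adding bags works, so the exercise answer is still yours to find.

```python
from collections import Counter

# Tokens of the first sentence, as listed in the hand exercise
tokens1 = ["Hinton", "was", "educated", "at", "King's", "College", "Cambridge",
           "graduating", "in", "1970", "with", "a", "Bachelor", "of", "Arts",
           "in", "experimental", "psychology", "."]

# A Counter is a dictionary that maps each token to how often it occurs
bag_of_words1 = Counter(tokens1)
print(bag_of_words1["in"])      # 2, because "in" appears twice
print(bag_of_words1["Hinton"])  # 1

# Adding two Counters sums the counts word by word.
# bag_a and bag_b are hypothetical bags, unrelated to the exercise text.
bag_a = Counter({"deep": 1, "learning": 1})
bag_b = Counter({"deep": 2, "networks": 1})
combined = bag_a + bag_b
print(combined["deep"])         # 3
```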

## What is the value (answer) of `bag_of_words3`?

# Setup

In [32]:
import nltk
# Download the tokenizer and part-of-speech tagger data used later in the notebook
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from gensim.models import Word2Vec
import gensim
from gensim import corpora
from pprint import pprint


[nltk_data] Downloading package punkt to /Users/jadean/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/jadean/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


## Break the text into sentences and then tokens

In [10]:
paragraph = ["Hinton was educated at King's College, Cambridge, graduating in 1970 with a Bachelor of Arts in experimental psychology. He continued his study at the University of Edinburgh where he was awarded a PhD in artificial intelligence in 1978 for research supervised by Christopher Longuet-Higgins."]

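# Split each document on whitespace; punctuation stays attached to words (e.g. 'Cambridge,')
texts = [doc.split() for doc in paragraph]
# Map every unique token to an integer id
word_dict = corpora.Dictionary(texts)
print(word_dict)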

Dictionary(37 unique tokens: ['1970', '1978', 'Arts', 'Bachelor', 'Cambridge,']...)


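Here are the unique IDs for each token (word) in the text.

You can see that the dictionary is sorted alphabetically, with digits and capitalized words ahead of lowercase ones. Every unique token is in here, and because matching is case-sensitive, 'He' and 'he' get separate IDs. Punctuation stays attached too, as in 'Cambridge,'.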

In [11]:
print(word_dict.token2id)

{'1970': 0, '1978': 1, 'Arts': 2, 'Bachelor': 3, 'Cambridge,': 4, 'Christopher': 5, 'College,': 6, 'Edinburgh': 7, 'He': 8, 'Hinton': 9, "King's": 10, 'Longuet-Higgins.': 11, 'PhD': 12, 'University': 13, 'a': 14, 'artificial': 15, 'at': 16, 'awarded': 17, 'by': 18, 'continued': 19, 'educated': 20, 'experimental': 21, 'for': 22, 'graduating': 23, 'he': 24, 'his': 25, 'in': 26, 'intelligence': 27, 'of': 28, 'psychology.': 29, 'research': 30, 'study': 31, 'supervised': 32, 'the': 33, 'was': 34, 'where': 35, 'with': 36}
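To see where these IDs come from, here is a plain-Python sketch that numbers the sorted unique tokens and then shows what a little cleanup does to the vocabulary. It imitates the ordering printed above rather than reproducing `gensim`'s internals, and the `normalize` helper (lowercasing plus stripping punctuation from the ends of a token) is my own assumption about a reasonable cleanup step.

```python
import string

paragraph = (
    "Hinton was educated at King's College, Cambridge, graduating in 1970 "
    "with a Bachelor of Arts in experimental psychology. He continued his "
    "study at the University of Edinburgh where he was awarded a PhD in "
    "artificial intelligence in 1978 for research supervised by "
    "Christopher Longuet-Higgins."
)
tokens = paragraph.split()

# Number the unique tokens in sorted order (digits, then uppercase, then lowercase)
token2id = {token: idx for idx, token in enumerate(sorted(set(tokens)))}
print(len(token2id))                    # 37
print(token2id["He"], token2id["he"])   # 8 24


def normalize(token):
    """Hypothetical cleanup: lowercase and strip punctuation from both ends."""
    return token.lower().strip(string.punctuation)


normalized_vocab = {normalize(token) for token in tokens}
print(len(normalized_vocab))            # 36, because 'He' and 'he' now merge
```

Whether to normalize depends on the task: merging 'He' and 'he' shrinks the vocabulary, but lowercasing also throws away the hint that 'Hinton' is a name.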


## Apply a pretrained word2vec model on the paragraph

In [25]:
gettysburg_address = """Four score and seven years ago our fathers brought forth, upon this continent, a new nation, conceived in liberty, and dedicated to the proposition that "all men are created equal".

Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived, and so dedicated, can long endure. We are met on a great battle field of that war. We have come to dedicate a portion of it, as a final resting place for those who died here, that the nation might live. This we may, in all propriety do. But, in a larger sense, we can not dedicate -- we can not consecrate -- we can not hallow, this ground-- The brave men, living and dead, who struggled here, have hallowed it, far above our poor power to add or detract. The world will little note, nor long remember what we say here; while it can never forget what they did here.

It is rather for us, the living, to stand here, we here be dedicated to the great task remaining before us -- that, from these honored dead we take increased devotion to that cause for which they here, gave the last full measure of devotion -- that we here highly resolve these dead shall not have died in vain; that the nation, shall have a new birth of freedom, and that government of the people by the people for the people, shall not perish from the earth."""

Break the speech into sentences

In [27]:
gettysburg_sentences = nltk.sent_tokenize(gettysburg_address)
gettysburg_sentences

Tokenize the speech into words

In [35]:
gettysburg_word_tokens = nltk.tokenize.word_tokenize(gettysburg_address)

# Show the first 10 words
gettysburg_word_tokens[:10]

['Four',
 'score',
 'and',
 'seven',
 'years',
 'ago',
 'our',
 'fathers',
 'brought',
 'forth']

Part of Speech Tagging

Here is a list of the codes:
```
CC coordinating conjunction
CD cardinal digit
DT determiner
EX existential there (like: “there is” … think of it like “there exists”)
FW foreign word
IN preposition/subordinating conjunction
JJ adjective ‘big’
JJR adjective, comparative ‘bigger’
JJS adjective, superlative ‘biggest’
LS list marker 1)
MD modal could, will
NN noun, singular ‘desk’
NNS noun plural ‘desks’
NNP proper noun, singular ‘Harrison’
NNPS proper noun, plural ‘Americans’
PDT predeterminer ‘all the kids’
POS possessive ending parent’s
PRP personal pronoun I, he, she
PRP$ possessive pronoun my, his, hers
RB adverb very, silently,
RBR adverb, comparative better
RBS adverb, superlative best
RP particle give up
TO to go ‘to’ the store.
UH interjection, errrrrrrrm
VB verb, base form take
VBD verb, past tense, took
VBG verb, gerund/present participle taking
VBN verb, past participle is taken
VBP verb, non-3rd person sing. present take
VBZ verb, 3rd person sing. present takes
WDT wh-determiner which
WP wh-pronoun who, what
WP$ possessive wh-pronoun whose
WRB wh-adverb where, when
```

In [33]:
nltk.pos_tag(nltk.tokenize.word_tokenize(gettysburg_address))

[('Four', 'CD'),
 ('score', 'NN'),
 ('and', 'CC'),
 ('seven', 'CD'),
 ('years', 'NNS'),
 ('ago', 'RB'),
 ('our', 'PRP$'),
 ('fathers', 'NNS'),
 ('brought', 'VBD'),
 ('forth', 'NN'),
 (',', ','),
 ('upon', 'IN'),
 ('this', 'DT'),
 ('continent', 'NN'),
 (',', ','),
 ('a', 'DT'),
 ('new', 'JJ'),
 ('nation', 'NN'),
 (',', ','),
 ('conceived', 'VBN'),
 ('in', 'IN'),
 ('liberty', 'NN'),
 (',', ','),
 ('and', 'CC'),
 ('dedicated', 'VBD'),
 ('to', 'TO'),
 ('the', 'DT'),
 ('proposition', 'NN'),
 ('that', 'IN'),
 ('``', '``'),
 ('all', 'DT'),
 ('men', 'NNS'),
 ('are', 'VBP'),
 ('created', 'VBN'),
 ('equal', 'JJ'),
 ("''", "''"),
 ('.', '.'),
 ('Now', 'RB'),
 ('we', 'PRP'),
 ('are', 'VBP'),
 ('engaged', 'VBN'),
 ('in', 'IN'),
 ('a', 'DT'),
 ('great', 'JJ'),
 ('civil', 'JJ'),
 ('war', 'NN'),
 (',', ','),
 ('testing', 'VBG'),
 ('whether', 'IN'),
 ('that', 'DT'),
 ('nation', 'NN'),
 (',', ','),
 ('or', 'CC'),
 ('any', 'DT'),
 ('nation', 'NN'),
 ('so', 'RB'),
 ('conceived', 'JJ'),
 (',', ','),
 ('and
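A tagged list like this is easy to filter. The sketch below copies the first few (word, tag) pairs from the output above and pulls out the nouns and verbs by looking at the tag prefix; the variable names are mine and the list is hard-coded so the example runs without NLTK.

```python
# First (word, tag) pairs copied from the tagger output above
tagged = [
    ("Four", "CD"), ("score", "NN"), ("and", "CC"), ("seven", "CD"),
    ("years", "NNS"), ("ago", "RB"), ("our", "PRP$"), ("fathers", "NNS"),
    ("brought", "VBD"), ("forth", "NN"), (",", ","), ("upon", "IN"),
    ("this", "DT"), ("continent", "NN"), (",", ","), ("a", "DT"),
    ("new", "JJ"), ("nation", "NN"),
]

# All noun tags start with "NN" (NN, NNS, NNP, NNPS); verb tags start with "VB"
nouns = [word for word, tag in tagged if tag.startswith("NN")]
verbs = [word for word, tag in tagged if tag.startswith("VB")]

print(nouns)  # ['score', 'years', 'fathers', 'forth', 'continent', 'nation']
print(verbs)  # ['brought']
```

Notice that 'forth' comes out as a noun even though it reads as an adverb in "brought forth". Taggers are statistical models and make mistakes, so it is worth spot-checking the output.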

In [14]:

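# train model
# Word2Vec expects an iterable of tokenized sentences (lists of strings).
# Passing the Dictionary `word_dict` raised
# "TypeError: 'int' object is not iterable", so pass `texts` instead.
model = Word2Vec(texts, min_count=1)
# summarize the trained model
print(model)
# summarize vocabulary (gensim 4.x; in gensim 3.x use model.wv.vocab)
words = list(model.wv.key_to_index)
print(words)
# access vector for one word that occurs in the paragraph
print(model.wv['Hinton'])
# save model
model.save('model.bin')
# load model
new_model = Word2Vec.load('model.bin')
print(new_model)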