# Overview

In this notebook I'll illustrate how text analytics can be done using Python and public models.

**There is a set of exercises to do by hand, followed by code. We love computers because they add faster than we do and never get tired, but it is essential that you do the hand exercises to fully understand what the computer is doing.**

The core idea is to convert text into numerical values so that a computer can perform language tasks much as a human would.
These tasks include:

*  Finding similar words
*  Clustering documents
*  Natural Language Processing (NLP) 
    *  Sentiment analysis
    *  Language translation
    *  Photo captioning

NLP is a very broad topic and this is just an introduction. For more information, start [here](https://en.wikipedia.org/wiki/Natural_language_processing).

Many text models are based on [GloVe](https://nlp.stanford.edu/projects/glove/) and [word2vec](https://en.wikipedia.org/wiki/Word2vec).

## Word2vec

Word2vec was originally published in 2013 by Tomas Mikolov and patented while he was working at Google. You can build your own `word2vec` model on any corpus of text, but my recommendation is to use a pretrained model. These are usually based on very large collections of text such as newsgroups, Quora, or Wikipedia.
`gensim` is a very popular Python package for using prebuilt language models.



## GloVe

GloVe is a collection of models that were trained on different corpora. The most common one was trained on Wikipedia (2014) plus the Gigaword 5 news corpus; that corpus has 6 billion tokens and the model has a vocabulary of 400k words.
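
To make "finding similar words" concrete, here is a minimal sketch of how similarity between word vectors is usually measured, using cosine similarity. The three-dimensional vectors are invented for illustration only; real GloVe or word2vec vectors typically have 50 to 300 dimensions and come from a trained model. The names `toy_vectors`, `cosine_similarity`, and `most_similar` are ours.

```python
import math

# Toy 3-dimensional "embeddings", invented for illustration only.
# Real pretrained vectors would be loaded from a GloVe or word2vec model.
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors):
    """Return the other word whose vector points in the closest direction."""
    others = (w for w in vectors if w != word)
    return max(others, key=lambda w: cosine_similarity(vectors[word], vectors[w]))


print(most_similar("king", toy_vectors))
```

Cosine similarity ignores vector length and compares direction only, which is why it is a common choice for comparing word vectors.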



We will follow this diagram from [Adam Geitgey](https://medium.com/@ageitgey/natural-language-processing-is-fun-9a0bff37854e):


![NLP Pipeline](./NLP_pipeline.png)

# Bag of Words

This step takes a document (a sentence in our case) and converts it to numerical form by counting how often each word occurs.

Here is a paragraph about [Geoffrey Hinton's education](https://en.wikipedia.org/wiki/Geoffrey_Hinton) from Wikipedia:

"Hinton was educated at King's College, Cambridge, graduating in 1970 with a Bachelor of Arts in experimental psychology. He continued his study at the University of Edinburgh where he was awarded a PhD in artificial intelligence in 1978 for research supervised by Christopher Longuet-Higgins."


1. Break the paragraph into sentences.
    *  Hinton was educated at King's College, Cambridge, graduating in 1970 with a Bachelor of Arts in experimental psychology.
    *  He continued his study at the University of Edinburgh where he was awarded a PhD in artificial intelligence in 1978 for research supervised by Christopher Longuet-Higgins.

**I'll show you with the first sentence and you need to do the second sentence.**

1. Take the sentence and break it into word tokens:
    "Hinton", "was", "educated", "at", "King's", "College", "Cambridge", "graduating", "in", "1970", "with", "a", "Bachelor", "of", "Arts", "in", "experimental", "psychology", "."

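1. Count the frequency of each word (token)
    bag_of_words1 = {'Hinton':1, 'was':1, 'educated':1, 'at':1, "King's":1, 'College':1, 'Cambridge':1, 'graduating':1, 'in':2, '1970':1, 'with':1, 'a':1, 'Bachelor':1, 'of':1, 'Arts':1, 'experimental':1, 'psychology':1, '.':1}

    bag_of_words2 = {}

1. Now combine the two sentences **hint:** 'Hinton':1 & 'was':2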

    `bag_of_words1 + bag_of_words2 = bag_of_words3`

    bag_of_words3 = {}

## What is the value (answer) of `bag_of_words3`?
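
One way to check your hand count is with `collections.Counter` from the standard library. This is a minimal sketch: the token lists are typed in by hand (commas dropped, final period kept, following the tokenization shown above), and the names `sentence1_tokens` and `sentence2_tokens` are ours.

```python
from collections import Counter

# Tokens of the two Hinton sentences, typed in by hand
sentence1_tokens = ["Hinton", "was", "educated", "at", "King's", "College",
                    "Cambridge", "graduating", "in", "1970", "with", "a",
                    "Bachelor", "of", "Arts", "in", "experimental",
                    "psychology", "."]
sentence2_tokens = ["He", "continued", "his", "study", "at", "the",
                    "University", "of", "Edinburgh", "where", "he", "was",
                    "awarded", "a", "PhD", "in", "artificial", "intelligence",
                    "in", "1978", "for", "research", "supervised", "by",
                    "Christopher", "Longuet-Higgins", "."]

# A Counter is a dict that maps each token to how often it occurs
bag_of_words1 = Counter(sentence1_tokens)
bag_of_words2 = Counter(sentence2_tokens)

# Adding two Counters sums the counts token by token
bag_of_words3 = bag_of_words1 + bag_of_words2

print(bag_of_words3.most_common(5))
```

Note that `He` and `he` are counted separately because the comparison is case-sensitive.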

# Setup

Here are the Python modules needed to run this code:

In [53]:
from pprint import pprint
import nltk 
from nltk.stem import PorterStemmer 
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer 
from nltk.chunk import conlltags2tree, tree2conlltags

nltk.download('punkt') 
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('maxent_ne_chunker')
nltk.download('words')

# gensim is used by the Dictionary and Word2Vec cells near the end of the notebook
from gensim.models import Word2Vec
from gensim import corpora


[nltk_data] Downloading package punkt to /Users/jadean/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/jadean/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/jadean/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/jadean/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /Users/jadean/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /Users/jadean/nltk_data...
[nltk_data]   Package words is already up-to-date!


True

# The text for this example is Abraham Lincoln's famous [Gettysburg Address](https://en.wikipedia.org/wiki/Gettysburg_Address)

The text is written below and assigned to a variable named `gettysburg_address`.

In [4]:
gettysburg_address = """Four score and seven years ago our fathers brought forth, upon this continent, a new nation, conceived in liberty, and dedicated to the proposition that "all men are created equal".

Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived, and so dedicated, can long endure. We are met on a great battle field of that war. We have come to dedicate a portion of it, as a final resting place for those who died here, that the nation might live. This we may, in all propriety do. But, in a larger sense, we can not dedicate -- we can not consecrate -- we can not hallow, this ground-- The brave men, living and dead, who struggled here, have hallowed it, far above our poor power to add or detract. The world will little note, nor long remember what we say here; while it can never forget what they did here.

It is rather for us, the living, to stand here, we here be dedica-ted to the great task remaining before us -- that, from these honored dead we take increased devotion to that cause for which they here, gave the last full measure of devotion -- that we here highly resolve these dead shall not have died in vain; that the nation, shall have a new birth of freedom, and that government of the people by the people for the people, shall not perish from the earth."""


## Break the text into sentences.

The first step is to break the text into sentences. Below is the first sentence; please add the rest to the cell. There are a total of 8 sentences.

* Four score and seven years ago our fathers brought forth, upon this continent, a new nation, conceived in liberty, and dedicated to the proposition that "all men are created equal".


# Solution

* Four score and seven years ago our fathers brought forth, upon this continent, a new nation, conceived in liberty, and dedicated to the proposition that "all men are created equal".
* Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived, and so dedicated, can long endure.
* We are met on a great battle field of that war.
* We have come to dedicate a portion of it, as a final resting place for those who died here, that the nation might live.
* This we may, in all propriety do.
* But, in a larger sense, we can not dedicate -- we can not consecrate -- we can not hallow, this ground-- The brave men, living and dead, who struggled here, have hallowed it, far above our poor power to add or detract.
* The world will little note, nor long remember what we say here; while it can never forget what they did here.
* It is rather for us, the living, to stand here, we here be dedicated to the great task remaining before us -- that, from these honored dead we take increased devotion to that cause for which they here, gave the last full measure of devotion -- that we here highly resolve these dead shall not have died in vain; that the nation, shall have a new birth of freedom, and that government of the people by the people for the people, shall not perish from the earth.
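
The hand split above can be roughly approximated with a regular expression. This is a naive sketch, not the algorithm NLTK uses (NLTK's `sent_tokenize` relies on the trained Punkt model), and the function name `naive_sentence_split` is ours. It would wrongly split after abbreviations such as "Dr.", which is one reason a trained tokenizer is preferable.

```python
import re


def naive_sentence_split(text):
    """Split wherever '.', '!' or '?' is followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


sample = ("Now we are engaged in a great civil war, testing whether that "
          "nation, or any nation so conceived, and so dedicated, can long "
          "endure. We are met on a great battle field of that war. "
          "This we may, in all propriety do.")

for sentence in naive_sentence_split(sample):
    print("*", sentence)
```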

In [5]:
gettysburg_sentences = nltk.sent_tokenize(gettysburg_address)
gettysburg_sentences

['Four score and seven years ago our fathers brought forth, upon this continent, a new nation, conceived in liberty, and dedicated to the proposition that "all men are created equal".',
 'Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived, and so dedicated, can long endure.',
 'We are met on a great battle field of that war.',
 'We have come to dedicate a portion of it, as a final resting place for those who died here, that the nation might live.',
 'This we may, in all propriety do.',
 'But, in a larger sense, we can not dedicate -- we can not consecrate -- we can not hallow, this ground-- The brave men, living and dead, who struggled here, have hallowed it, far above our poor power to add or detract.',
 'The world will little note, nor long remember what we say here; while it can never forget what they did here.',
 'It is rather for us, the living, to stand here, we here be dedica-ted to the great task remaining before us -- that, from th

## Break the text into tokens.

Now we need to break each sentence into its individual tokens (words and punctuation marks).

Here is the first sentence: 

['Four', 'score', 'and', 'seven', 'years', 'ago', 'our', 'fathers', 'brought', 'forth,', 'upon', 'this', 'continent,', 'a', 'new', 'nation,', 'conceived', 'in', 'liberty,', 'and', 'dedicated', 'to', 'the', 'proposition', 'that', '"', 'all', 'men', 'are', 'created', 'equal', '"', '.']

You will need to do this for sentences 2 & 3



## Solution
['Now', 'we', 'are', 'engaged', 'in', 'a', 'great', 'civil', 'war,', 'testing', 'whether', 'that', 'nation,', 'or', 'any', 'nation', 'so', 'conceived,', 'and', 'so', 'dedicated,', 'can', 'long', 'endure', '.', 'We', 'are', 'met', 'on', 'a', 'great', 'battle', 'field', 'of', 'that', 'war', '.']

In [8]:
gettysburg_word_tokens = nltk.tokenize.word_tokenize(gettysburg_address)

# Show the first 15 words
gettysburg_word_tokens[:15]

['Four',
 'score',
 'and',
 'seven',
 'years',
 'ago',
 'our',
 'fathers',
 'brought',
 'forth',
 ',',
 'upon',
 'this',
 'continent',
 ',']
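
The output shows that punctuation becomes its own token. A minimal regular-expression sketch of that behavior is below. It is not NLTK's tokenizer (which, for example, rewrites double quotes as `` and '' and handles contractions and possessives differently); the function name `simple_word_tokenize` is ours.

```python
import re


def simple_word_tokenize(text):
    # \w+ matches a run of letters or digits;
    # [^\w\s] matches a single punctuation character
    return re.findall(r"\w+|[^\w\s]", text)


opening = ("Four score and seven years ago our fathers brought forth, "
           "upon this continent, a new nation")
tokens = simple_word_tokenize(opening)

# Show the first 15 tokens
print(tokens[:15])
```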

## Part of Speech Tagging

Knowing each word's part of speech is important for using NLP effectively. This might seem like going back to grammar school and diagramming sentences, because it is :)

Here are the parts of speech that `NLTK` identifies:
```
CC coordinating conjunction
CD cardinal number
DT determiner
EX existential there (like: “there is” … think of it like “there exists”)
FW foreign word
IN preposition/subordinating conjunction
JJ adjective ‘big’
JJR adjective, comparative ‘bigger’
JJS adjective, superlative ‘biggest’
LS list marker 1)
MD modal could, will
NN noun, singular ‘desk’
NNS noun plural ‘desks’
NNP proper noun, singular ‘Harrison’
NNPS proper noun, plural ‘Americans’
PDT predeterminer ‘all the kids’
POS possessive ending parent’s
PRP personal pronoun I, he, she
PRP$ possessive pronoun my, his, hers
RB adverb very, silently,
RBR adverb, comparative better
RBS adverb, superlative best
RP particle give up
TO to, go ‘to’ the store
UH interjection, errrrrrrrm
VB verb, base form take
VBD verb, past tense, took
VBG verb, gerund/present participle taking
VBN verb, past participle is taken
VBP verb, non-3rd person sing. present take
VBZ verb, 3rd person sing. present takes
WDT wh-determiner which
WP wh-pronoun who, what
WP$ possessive wh-pronoun whose
WRB wh-adverb where, when
```

Here is the solution for the first sentence:
```
[('Four', 'CD'),
 ('score', 'NN'),
 ('and', 'CC'),
 ('seven', 'CD'),
 ('years', 'NNS'),
 ('ago', 'RB'),
 ('our', 'PRP$'),
 ('fathers', 'NNS'),
 ('brought', 'VBD'),
 ('forth', 'NN'),
 (',', ','),
 ('upon', 'IN'),
 ('this', 'DT'),
 ('continent', 'NN'),
 (',', ','),
 ('a', 'DT'),
 ('new', 'JJ'),
 ('nation', 'NN'),
 (',', ','),
 ('conceived', 'VBN'),
 ('in', 'IN'),
 ('liberty', 'NN'),
 (',', ','),
 ('and', 'CC'),
 ('dedicated', 'VBD'),
 ('to', 'TO'),
 ('the', 'DT'),
 ('proposition', 'NN'),
 ('that', 'IN'),
 ('``', '``'),
 ('all', 'DT'),
 ('men', 'NNS'),
 ('are', 'VBP'),
 ('created', 'VBN'),
 ('equal', 'JJ'),
 ("''", "''")]
```

Take a few minutes to identify the parts of speech for sentences 2 & 3:

## Solution

In [15]:
nltk.pos_tag(nltk.tokenize.word_tokenize(gettysburg_address))[37:78]

[('Now', 'RB'),
 ('we', 'PRP'),
 ('are', 'VBP'),
 ('engaged', 'VBN'),
 ('in', 'IN'),
 ('a', 'DT'),
 ('great', 'JJ'),
 ('civil', 'JJ'),
 ('war', 'NN'),
 (',', ','),
 ('testing', 'VBG'),
 ('whether', 'IN'),
 ('that', 'DT'),
 ('nation', 'NN'),
 (',', ','),
 ('or', 'CC'),
 ('any', 'DT'),
 ('nation', 'NN'),
 ('so', 'RB'),
 ('conceived', 'JJ'),
 (',', ','),
 ('and', 'CC'),
 ('so', 'RB'),
 ('dedicated', 'JJ'),
 (',', ','),
 ('can', 'MD'),
 ('long', 'VB'),
 ('endure', 'NN'),
 ('.', '.'),
 ('We', 'PRP'),
 ('are', 'VBP'),
 ('met', 'VBN'),
 ('on', 'IN'),
 ('a', 'DT'),
 ('great', 'JJ'),
 ('battle', 'NN'),
 ('field', 'NN'),
 ('of', 'IN'),
 ('that', 'DT'),
 ('war', 'NN'),
 ('.', '.')]
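
Once tokens are tagged, the tags can be used to filter or count words. The sketch below works on the tagged tokens of sentence 3, copied by hand from the output above; the variable names are ours. Keep in mind that the tagger is statistical and not always right: in the output above it tags `long` as a verb (`VB`) and `endure` as a noun (`NN`), where a human reader would choose adverb and verb.

```python
from collections import Counter

# (word, tag) pairs for sentence 3, copied from the NLTK output above
tagged_sentence3 = [
    ("We", "PRP"), ("are", "VBP"), ("met", "VBN"), ("on", "IN"),
    ("a", "DT"), ("great", "JJ"), ("battle", "NN"), ("field", "NN"),
    ("of", "IN"), ("that", "DT"), ("war", "NN"), (".", "."),
]

# How often does each tag occur?
tag_counts = Counter(tag for _, tag in tagged_sentence3)

# All noun tags (NN, NNS, NNP, NNPS) start with "NN"
nouns = [word for word, tag in tagged_sentence3 if tag.startswith("NN")]

print(tag_counts.most_common(3))
print(nouns)
```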

## Stemming
To improve the quality of search and clustering, words are stemmed to their root. Stemming makes "years" and "year" the same, since the only difference is number. Irregular plurals are not handled: in the list below "men" stays "men".

Here are the stems for the first sentence:
```
Four  :  four
score  :  score
and  :  and
seven  :  seven
years  :  year
ago  :  ago
our  :  our
fathers  :  father
brought  :  brought
forth  :  forth
,  :  ,
upon  :  upon
this  :  thi
continent  :  contin
,  :  ,
a  :  a
new  :  new
nation  :  nation
,  :  ,
conceived  :  conceiv
in  :  in
liberty  :  liberti
,  :  ,
and  :  and
dedicated  :  dedic
to  :  to
the  :  the
proposition  :  proposit
that  :  that
``  :  ``
all  :  all
men  :  men
are  :  are
created  :  creat
equal  :  equal
''  :  ''
```

Stem the words for sentences 2 & 3. Some of the stems are shorter than you might expect (created => creat); just do your best.


In [20]:
# Solution
ps = PorterStemmer() 
   
for w in nltk.tokenize.word_tokenize(gettysburg_address)[37:78]: 
    print(w, " : ", ps.stem(w))

Now  :  now
we  :  we
are  :  are
engaged  :  engag
in  :  in
a  :  a
great  :  great
civil  :  civil
war  :  war
,  :  ,
testing  :  test
whether  :  whether
that  :  that
nation  :  nation
,  :  ,
or  :  or
any  :  ani
nation  :  nation
so  :  so
conceived  :  conceiv
,  :  ,
and  :  and
so  :  so
dedicated  :  dedic
,  :  ,
can  :  can
long  :  long
endure  :  endur
.  :  .
We  :  We
are  :  are
met  :  met
on  :  on
a  :  a
great  :  great
battle  :  battl
field  :  field
of  :  of
that  :  that
war  :  war
.  :  .
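
To see why stems can look odd, here is a toy suffix-stripping function. It is NOT the Porter algorithm, which applies several ordered rule sets with conditions on the remaining stem; it only illustrates the general idea of rule-based suffix removal. The function name `toy_stem` and its three rules are ours.

```python
def toy_stem(word):
    """Strip one common English suffix; a toy, not the Porter stemmer."""
    w = word.lower()
    for suffix in ("ing", "ed", "s"):
        if not w.endswith(suffix):
            continue
        if suffix == "s" and w.endswith("ss"):
            continue  # keep words like "class" intact
        if len(w) - len(suffix) < 3:
            continue  # do not strip very short words down to nothing
        return w[: -len(suffix)]
    return w


for w in ["years", "fathers", "testing", "created", "this", "men"]:
    print(w, " : ", toy_stem(w))
```

Because the rules look only at spelling, "this" loses its final "s" just as it does in the Porter output above.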


# Lemmatization

Stemming and lemmatization appear very similar, and for many words they give identical results. Lemmatization is often preferred over stemming because it takes into account other information, such as part of speech, and returns a real dictionary word rather than a truncated stem (see [Morphology](https://en.wikipedia.org/wiki/Morphology_(linguistics)) ).

Run the code below to see the lemmatization of sentences 2 & 3, then compare the results to your solution.

In [40]:
lemmatizer = WordNetLemmatizer() 
for w in nltk.tokenize.word_tokenize(gettysburg_address)[37:78]: 
    print(w, " : ", lemmatizer.lemmatize(w)) 

Now  :  Now
we  :  we
are  :  are
engaged  :  engaged
in  :  in
a  :  a
great  :  great
civil  :  civil
war  :  war
,  :  ,
testing  :  testing
whether  :  whether
that  :  that
nation  :  nation
,  :  ,
or  :  or
any  :  any
nation  :  nation
so  :  so
conceived  :  conceived
,  :  ,
and  :  and
so  :  so
dedicated  :  dedicated
,  :  ,
can  :  can
long  :  long
endure  :  endure
.  :  .
We  :  We
are  :  are
met  :  met
on  :  on
a  :  a
great  :  great
battle  :  battle
field  :  field
of  :  of
that  :  that
war  :  war
.  :  .
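
The output above is unchanged from the input because `WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is given. The sketch below shows one way to pass the part of speech along: it maps the Penn Treebank tags returned by `nltk.pos_tag` to the single-letter codes WordNet uses (`n`, `v`, `a`, `r`). The helper names `penn_to_wordnet` and `lemmatize_tagged` are ours, and the mapping is a simplification.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet part-of-speech letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun; also the lemmatizer's default


def lemmatize_tagged(tagged_tokens, lemmatizer):
    """Lemmatize (word, tag) pairs with any object offering lemmatize(word, pos=...)."""
    return [lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
            for word, tag in tagged_tokens]


# Show which WordNet letter each token of "We are met" would be looked up with
for word, tag in [("We", "PRP"), ("are", "VBP"), ("met", "VBN")]:
    print(word, tag, penn_to_wordnet(tag))
```

With the objects defined in this notebook, the call would look like `lemmatize_tagged(nltk.pos_tag(tokens), lemmatizer)`, where `tokens` is a list of word tokens.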


# Remove stop words

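Stop words are very common words (such as "the", "and", and "of") that carry little meaning on their own, so they are usually removed before analysis.

Below is the first sentence with stop words removed. The sentence originally had 36 tokens; with the stop words removed it has 25.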

```
['Four',
 'score',
 'seven',
 'years',
 'ago',
 'fathers',
 'brought',
 'forth',
 ',',
 'upon',
 'continent',
 ',',
 'new',
 'nation',
 ',',
 'conceived',
 'liberty',
 ',',
 'dedicated',
 'proposition',
 '``',
 'men',
 'created',
 'equal',
 "''"]
```

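To see the list of stop words in `nltk`, run the code below (there are 179). Then remove any word that is in sentences 2 & 3 AND in the stop word list (the list is alphabetical). The match is case-sensitive and the list is all lowercase, so capitalized tokens such as 'Now' and 'We' are kept.

The correct answer has 25 tokens.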

In [32]:
stop_words = set(stopwords.words('english')) 
print(sorted(stop_words))

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', 'her', 'here', 'hers', 'herself', 'him', 'himself', 'his', 'how', 'i', 'if', 'in', 'into', 'is', 'isn', "isn't", 'it', "it's", 'its', 'itself', 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she', "she's", 'should', "should've", 'shouldn', "shouldn't", 'so', 'some',

In [30]:
stop_words_removed = [w for w in nltk.tokenize.word_tokenize(gettysburg_address)[:36] if w not in stop_words]
stop_words_removed

['Four',
 'score',
 'seven',
 'years',
 'ago',
 'fathers',
 'brought',
 'forth',
 ',',
 'upon',
 'continent',
 ',',
 'new',
 'nation',
 ',',
 'conceived',
 'liberty',
 ',',
 'dedicated',
 'proposition',
 '``',
 'men',
 'created',
 'equal',
 "''"]

In [41]:
# Solution

stop_words_removed = [w for w in nltk.tokenize.word_tokenize(gettysburg_address)[37:78] if w not in stop_words]
len(stop_words_removed)
stop_words_removed

['Now',
 'engaged',
 'great',
 'civil',
 'war',
 ',',
 'testing',
 'whether',
 'nation',
 ',',
 'nation',
 'conceived',
 ',',
 'dedicated',
 ',',
 'long',
 'endure',
 '.',
 'We',
 'met',
 'great',
 'battle',
 'field',
 'war',
 '.']
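
Notice that 'Now' and 'We' survive in the result above even though 'now' and 'we' are stop words: the list is all lowercase and the comparison is case-sensitive. A minimal sketch of the difference is below. It uses a small hand-picked subset of the stop word list so that it runs without `nltk`; the variable names are ours.

```python
# A small subset of NLTK's English stop words, enough for this sentence
stop_words_subset = {"a", "are", "of", "on", "that", "we"}

sentence3 = ["We", "are", "met", "on", "a", "great", "battle", "field",
             "of", "that", "war", "."]

# Case-sensitive test, as in the cells above: "We" is kept
case_sensitive = [w for w in sentence3 if w not in stop_words_subset]

# Lowercase each token before the test: "We" is removed as well
case_insensitive = [w for w in sentence3 if w.lower() not in stop_words_subset]

print(case_sensitive)
print(case_insensitive)
```

Whether to lowercase first depends on the task; capitalization can carry information, for example in named entity recognition.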

In [55]:
ne_tree = nltk.ne_chunk(nltk.pos_tag(nltk.tokenize.word_tokenize(gettysburg_address))[:36])
iob_tagged = tree2conlltags(ne_tree)
pprint(iob_tagged)

[('Four', 'CD', 'O'),
 ('score', 'NN', 'O'),
 ('and', 'CC', 'O'),
 ('seven', 'CD', 'O'),
 ('years', 'NNS', 'O'),
 ('ago', 'RB', 'O'),
 ('our', 'PRP$', 'O'),
 ('fathers', 'NNS', 'O'),
 ('brought', 'VBD', 'O'),
 ('forth', 'NN', 'O'),
 (',', ',', 'O'),
 ('upon', 'IN', 'O'),
 ('this', 'DT', 'O'),
 ('continent', 'NN', 'O'),
 (',', ',', 'O'),
 ('a', 'DT', 'O'),
 ('new', 'JJ', 'O'),
 ('nation', 'NN', 'O'),
 (',', ',', 'O'),
 ('conceived', 'VBN', 'O'),
 ('in', 'IN', 'O'),
 ('liberty', 'NN', 'O'),
 (',', ',', 'O'),
 ('and', 'CC', 'O'),
 ('dedicated', 'VBD', 'O'),
 ('to', 'TO', 'O'),
 ('the', 'DT', 'O'),
 ('proposition', 'NN', 'O'),
 ('that', 'IN', 'O'),
 ('``', '``', 'O'),
 ('all', 'DT', 'O'),
 ('men', 'NNS', 'O'),
 ('are', 'VBP', 'O'),
 ('created', 'VBN', 'O'),
 ('equal', 'JJ', 'O'),
 ("''", "''", 'O')]


In [51]:
print(ne_tree)

(S
  Four/CD
  score/NN
  and/CC
  seven/CD
  years/NNS
  ago/RB
  our/PRP$
  fathers/NNS
  brought/VBD
  forth/NN
  ,/,
  upon/IN
  this/DT
  continent/NN
  ,/,
  a/DT
  new/JJ
  nation/NN
  ,/,
  conceived/VBN
  in/IN
  liberty/NN
  ,/,
  and/CC
  dedicated/VBD
  to/TO
  the/DT
  proposition/NN
  that/IN
  ``/``
  all/DT
  men/NNS
  are/VBP
  created/VBN
  equal/JJ
  ''/'')
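
In the IOB-tagged output above, the third element of each triple says whether the token begins a named entity (`B-TYPE`), is inside one (`I-TYPE`), or is outside any entity (`O`). Every token in the first sentence is tagged `O`, meaning the chunker found no named entities there. The sketch below shows how such triples can be grouped into entities. The example input is hand-written to show the format and is NOT actual `nltk` output; the function name `extract_entities` is ours.

```python
def extract_entities(iob_tagged):
    """Group (word, pos, iob) triples into (entity_text, entity_type) pairs."""
    entities = []
    current_words, current_type = [], None
    for word, _pos, iob in iob_tagged:
        starts_new = iob.startswith("B-") or (
            iob.startswith("I-") and current_type != iob[2:])
        if starts_new:
            # A new entity begins; close any entity that is still open
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [word], iob[2:]
        elif iob.startswith("I-"):
            current_words.append(word)
        else:
            # "O": outside any entity, so close the open one if there is one
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
    if current_words:
        entities.append((" ".join(current_words), current_type))
    return entities


# Hand-written example (not real chunker output) to illustrate the format
example = [
    ("Hinton", "NNP", "B-PERSON"),
    ("studied", "VBD", "O"),
    ("at", "IN", "O"),
    ("the", "DT", "O"),
    ("University", "NNP", "B-ORGANIZATION"),
    ("of", "IN", "I-ORGANIZATION"),
    ("Edinburgh", "NNP", "I-ORGANIZATION"),
    (".", ".", "O"),
]
print(extract_entities(example))
```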



Another popular package, in addition to `nltk`, is `spacy`. In my opinion its rendered display of named entity recognition is much better, but it found only two named entities with the default options.

Here is a screenshot of the analysis:

![spacy](./spacy_gburg.png)


In [10]:
paragraph = ["Hinton was educated at King's College, Cambridge, graduating in 1970 with a Bachelor of Arts in experimental psychology. He continued his study at the University of Edinburgh where he was awarded a PhD in artificial intelligence in 1978 for research supervised by Christopher Longuet-Higgins."]

# Split each document on whitespace; punctuation stays attached to the words
texts = [doc.split() for doc in paragraph]
word_dict = corpora.Dictionary(texts)
print(word_dict)

Dictionary(37 unique tokens: ['1970', '1978', 'Arts', 'Bachelor', 'Cambridge,']...)


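Here are the unique IDs for each token (word) in the text.

You can see that the tokens are sorted alphabetically, with digits and uppercase letters before lowercase ones. Every unique string is in here, including both 'He' and 'he'. Because the text was split on whitespace only, punctuation stays attached to some tokens (for example 'Cambridge,' and 'psychology.').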

In [11]:
print(word_dict.token2id)

{'1970': 0, '1978': 1, 'Arts': 2, 'Bachelor': 3, 'Cambridge,': 4, 'Christopher': 5, 'College,': 6, 'Edinburgh': 7, 'He': 8, 'Hinton': 9, "King's": 10, 'Longuet-Higgins.': 11, 'PhD': 12, 'University': 13, 'a': 14, 'artificial': 15, 'at': 16, 'awarded': 17, 'by': 18, 'continued': 19, 'educated': 20, 'experimental': 21, 'for': 22, 'graduating': 23, 'he': 24, 'his': 25, 'in': 26, 'intelligence': 27, 'of': 28, 'psychology.': 29, 'research': 30, 'study': 31, 'supervised': 32, 'the': 33, 'was': 34, 'where': 35, 'with': 36}
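
The mapping above can be reproduced for this paragraph in plain Python, which also shows how token ids lead to a bag-of-words vector. This is a sketch of the idea only, not how `gensim` is implemented; with more than one document, `gensim` ids need not come out in alphabetical order. The variable names are ours.

```python
from collections import Counter

paragraph_text = (
    "Hinton was educated at King's College, Cambridge, graduating in 1970 "
    "with a Bachelor of Arts in experimental psychology. He continued his "
    "study at the University of Edinburgh where he was awarded a PhD in "
    "artificial intelligence in 1978 for research supervised by "
    "Christopher Longuet-Higgins."
)

# Split on whitespace, exactly as in the cell above
tokens = paragraph_text.split()

# Give every unique token an integer id, in sorted order
token2id = {token: idx for idx, token in enumerate(sorted(set(tokens)))}

# Bag of words as (token_id, count) pairs
counts = Counter(tokens)
bow = sorted((token2id[token], n) for token, n in counts.items())

print(len(token2id))
print(bow[:5])
```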


## Apply a pretrained word2vec model on the paragraph

In [25]:
gettysburg_address = """Four score and seven years ago our fathers brought forth, upon this continent, a new nation, conceived in liberty, and dedicated to the proposition that "all men are created equal".

Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived, and so dedicated, can long endure. We are met on a great battle field of that war. We have come to dedicate a portion of it, as a final resting place for those who died here, that the nation might live. This we may, in all propriety do. But, in a larger sense, we can not dedicate -- we can not consecrate -- we can not hallow, this ground-- The brave men, living and dead, who struggled here, have hallowed it, far above our poor power to add or detract. The world will little note, nor long remember what we say here; while it can never forget what they did here.

It is rather for us, the living, to stand here, we here be dedica-ted to the great task remaining before us -- that, from these honored dead we take increased devotion to that cause for which they here, gave the last full measure of devotion -- that we here highly resolve these dead shall not have died in vain; that the nation, shall have a new birth of freedom, and that government of the people by the people for the people, shall not perish from the earth."""

Break the speech into sentences

In [27]:
gettysburg_sentences = nltk.sent_tokenize(gettysburg_address)
gettysburg_sentences

Tokenize the speech into words

In [30]:
gettysburg_word_tokens = nltk.tokenize.word_tokenize(gettysburg_address)
gettysburg_word_tokens

['Four',
 'score',
 'and',
 'seven',
 'years',
 'ago',
 'our',
 'fathers',
 'brought',
 'forth',
 ',',
 'upon',
 'this',
 'continent',
 ',',
 'a',
 'new',
 'nation',
 ',',
 'conceived',
 'in',
 'liberty',
 ',',
 'and',
 'dedicated',
 'to',
 'the',
 'proposition',
 'that',
 '``',
 'all',
 'men',
 'are',
 'created',
 'equal',
 "''",
 '.',
 'Now',
 'we',
 'are',
 'engaged',
 'in',
 'a',
 'great',
 'civil',
 'war',
 ',',
 'testing',
 'whether',
 'that',
 'nation',
 ',',
 'or',
 'any',
 'nation',
 'so',
 'conceived',
 ',',
 'and',
 'so',
 'dedicated',
 ',',
 'can',
 'long',
 'endure',
 '.',
 'We',
 'are',
 'met',
 'on',
 'a',
 'great',
 'battle',
 'field',
 'of',
 'that',
 'war',
 '.',
 'We',
 'have',
 'come',
 'to',
 'dedicate',
 'a',
 'portion',
 'of',
 'it',
 ',',
 'as',
 'a',
 'final',
 'resting',
 'place',
 'for',
 'those',
 'who',
 'died',
 'here',
 ',',
 'that',
 'the',
 'nation',
 'might',
 'live',
 '.',
 'This',
 'we',
 'may',
 ',',
 'in',
 'all',
 'propriety',
 'do',
 '.',
 'But',

Part of Speech Tagging

In [33]:
nltk.pos_tag(nltk.tokenize.word_tokenize(gettysburg_address))

[('Four', 'CD'),
 ('score', 'NN'),
 ('and', 'CC'),
 ('seven', 'CD'),
 ('years', 'NNS'),
 ('ago', 'RB'),
 ('our', 'PRP$'),
 ('fathers', 'NNS'),
 ('brought', 'VBD'),
 ('forth', 'NN'),
 (',', ','),
 ('upon', 'IN'),
 ('this', 'DT'),
 ('continent', 'NN'),
 (',', ','),
 ('a', 'DT'),
 ('new', 'JJ'),
 ('nation', 'NN'),
 (',', ','),
 ('conceived', 'VBN'),
 ('in', 'IN'),
 ('liberty', 'NN'),
 (',', ','),
 ('and', 'CC'),
 ('dedicated', 'VBD'),
 ('to', 'TO'),
 ('the', 'DT'),
 ('proposition', 'NN'),
 ('that', 'IN'),
 ('``', '``'),
 ('all', 'DT'),
 ('men', 'NNS'),
 ('are', 'VBP'),
 ('created', 'VBN'),
 ('equal', 'JJ'),
 ("''", "''"),
 ('.', '.'),
 ('Now', 'RB'),
 ('we', 'PRP'),
 ('are', 'VBP'),
 ('engaged', 'VBN'),
 ('in', 'IN'),
 ('a', 'DT'),
 ('great', 'JJ'),
 ('civil', 'JJ'),
 ('war', 'NN'),
 (',', ','),
 ('testing', 'VBG'),
 ('whether', 'IN'),
 ('that', 'DT'),
 ('nation', 'NN'),
 (',', ','),
 ('or', 'CC'),
 ('any', 'DT'),
 ('nation', 'NN'),
 ('so', 'RB'),
 ('conceived', 'JJ'),
 (',', ','),
 ('and

In [14]:

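from gensim.models import Word2Vec

# train model on the tokenized text (a list of token lists), not on the Dictionary;
# passing word_dict here is what raised "TypeError: 'int' object is not iterable"
model = Word2Vec(texts, min_count=1)
# summarize the trained model
print(model)
# summarize vocabulary (gensim 4.x; on gensim 3.x use model.wv.vocab instead)
words = list(model.wv.key_to_index)
print(words)
# access the vector for one word that is in the vocabulary
print(model.wv['Hinton'])
# save model
model.save('model.bin')
# load model
new_model = Word2Vec.load('model.bin')
print(new_model)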