### Exercise to compare different types of sentences in Geological and Historical documents

#### Concept of the Notebook
from: https://medium.com/@Intellica.AI/comparison-of-different-word-embeddings-on-text-similarity-a-use-case-in-nlp-e83e08469c1c

The concept is to bring in a history document (the history of Monson, Maine) and train a machine to identify pertinent genealogical relationships in the text using similarity scoring. The scored sentences could then be used in another notebook to extract the exact relationships.

##### Prerequisite Setup Instructions
1. A suitable text you wish to train the machine on.
2. Create a folder for the project where you can find it later.  I called mine Capstone.
3. Next, create a data folder inside the project folder where you will store the original text and any transformations you need.
4. Python (I have used a Jupyter notebook for this exercise).
5. Install the NLTK library and the supporting packages by uncommenting (removing the number sign from) the lines of code below.

In [4]:
# pip install nltk
# pip install Unidecode
# pip install gensim

#### Libraries

In [33]:
# directory navigation and system commands 
import os
import re

#string manipulations
import string

# natural language processing
import nltk
from nltk import word_tokenize, sent_tokenize
from unidecode import unidecode
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

#Working with dataframes
import pandas as pd

#Vectorizing text
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

#GenSim
from gensim.models import Word2Vec
import numpy as np
import gensim
from gensim import corpora
from pprint import pprint

##### NLTK Corpus Explained

In linguistics, a corpus (plural corpora) or text corpus is a large and structured set of texts (nowadays usually electronically stored and processed). In NLTK, the corpus package defines a collection of corpus reader classes, which can be used to access the contents of a diverse set of corpora in English and other languages. The list of available corpora is given at: http://www.nltk.org/nltk_data/ Each corpus reader class is specialized to handle a specific corpus format. In addition, the corpus reader functions can be given lists of item names; they then return the concatenation of the corresponding items in one of these forms:
    - words(): list of str
    - sents(): list of (list of str)
    - paras(): list of (list of (list of str))
    - tagged_words(): list of (str,str) tuple
    - tagged_sents(): list of (list of (str,str))
    - tagged_paras(): list of (list of (list of (str,str)))
    - chunked_sents(): list of (Tree w/ (str,str) leaves)
    - parsed_sents(): list of (Tree with str leaves)
    - parsed_paras(): list of (list of (Tree with str leaves))
    - xml(): A single xml ElementTree
    - raw(): unprocessed corpus contents


##### Procedural Overview of Data Preparations

1.  <u>Data Preparation</u> -  See the separate notebook in the root folder named <b>Exploration and Sentence Marking.ipynb</b>.  In that notebook I convert a PDF document in the data folder into sentences and export them to a CSV for marking; the marked file is saved back as <b>data/markedsentences.csv</b>. Below I reuse the code:

    <u>Load file(s) from specified location</u> - <b>data/MonsonMaineHistory</b>

        * set the path
        * read first file
        * lowercase entire text
        * clean extraneous marks
        * save to textfile
        * Loop

In [6]:
path = 'data/'
reclean = re.compile('<.*?>')  # matches leftover HTML/XML-style tags

for filename in os.listdir(path):
    if filename.endswith('.txt'):
        xpath = os.path.join(path, filename)
        with open(xpath, 'r') as f:
            text = f.read()
        text = unidecode(text.lower())        # lowercase, then transliterate to plain ASCII
        clean = reclean.sub('', text)         # clean extraneous marks
        cleantext = " ".join(clean.split())   # collapse runs of whitespace
        # note: the cleaned text is written back over the file it was read from
        with open(xpath, 'w') as outFile:
            outFile.write(cleantext)

2.  <u>Verify my set of stopwords</u> from the English corpus. Stopwords are the English words which do not add clarity or value to a sentence. They can safely be ignored without sacrificing the meaning of the sentence -- most of the time.  Be cautious though, because "married" versus "<b>not</b> married" may lead to a different outcome for a person. Examples of linguistic noise include words like "the", "he", "have" etc. Such words are already captured in the NLTK corpus named <b>stopwords</b>.

In [7]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Lucky\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [8]:
#Look at and verify the English language stopwords
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

3.  <u>Preprocess</u> - <b>Tokenization and removal of Stop Words</b>. Tokenization is the process of splitting a string of text into a list of tokens. One can think of a token as a part of a larger unit: a word is a token in a sentence, and a sentence is a token in a paragraph.  In the code below I tokenize the text into words with NLTK, remove the stop words, and split the filtered text into sentences; an alternative sentence splitter based on regular expressions follows:

In [18]:
# Remove "not" from the stopwords so that negations survive the filtering
nltk_stopwords = stopwords.words('english')
nltk_stopwords.remove('not')

word_tokens = word_tokenize(cleantext)
word_tokens_without_stopwords = [word for word in word_tokens if word not in nltk_stopwords]

filtered_sentences = (" ").join(word_tokens_without_stopwords)
sentence_tokens = sent_tokenize(filtered_sentences)
print(filtered_sentences)

monson . maine history 1822 - 1972 history monson , maine foreword following data gathered sembled much accuracy possible . history could not compiled without many hours help freely given citizens monson . main sources information book listed back , convenience reader . may history bring pleasure readers future , especially memories past . jeanne brown reed althea haggstrom french elizabeth emanuelson davis history monson , maine white house washington june 19 , 1972 people monson , maine observance one hundred fiftieth anniversary occasion deep pride well nation . high purpose vital communityspirit reflected eventful history best tradition american way life . armedwith qualities years ahead , knowthat strive vanguard constructive civic accomplish ment . welcomeyour full partnership demanding tasks face nation , good promises come united efforts . history monson , maine kenneth curtis may 3 , 1972 citizens monson town manager 's office monson , maine dear citizens : governor state main

In [21]:
# Alternative code using re for breaking a blob of text up into sentences
import re
alphabets= "([A-Za-z])"
prefixes = "(Mr|St|Mrs|Ms|Dr)[.]"
suffixes = "(Inc|Ltd|Jr|Sr|Co)"
# raw string so that \s reaches the regex engine unchanged
starters = r"(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = "([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = "[.](com|net|org|io|gov)"

def split_into_sentences(text):
    text = " " + text + "  "
    text = text.replace("\n"," ")
    text = re.sub(prefixes,"\\1<prd>",text)
    text = re.sub(websites,"<prd>\\1",text)
    if "Ph.D" in text: text = text.replace("Ph.D.","Ph<prd>D<prd>")
    text = re.sub(r"\s" + alphabets + "[.] "," \\1<prd> ",text)
    text = re.sub(acronyms+" "+starters,"\\1<stop> \\2",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>\\3<prd>",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>",text)
    text = re.sub(" "+suffixes+"[.] "+starters," \\1<stop> \\2",text)
    text = re.sub(" "+suffixes+"[.]"," \\1<prd>",text)
    text = re.sub(" " + alphabets + "[.]"," \\1<prd>",text)
    if "”" in text: text = text.replace(".”","”.")
    if "\"" in text: text = text.replace(".\"","\".")
    if "!" in text: text = text.replace("!\"","\"!")
    if "?" in text: text = text.replace("?\"","\"?")
    text = text.replace(".",".<stop>")
    text = text.replace("?","?<stop>")
    text = text.replace("!","!<stop>")
    text = text.replace("<prd>",".")
    sentences = text.split("<stop>")
    sentences = sentences[:-1]
    sentences = [s.strip() for s in sentences]
    return sentences

clean_sentences = split_into_sentences(cleantext)
clean_sentences

['monson.',
 'maine history 1822 - 1972 history of monson, maine foreword the following data has been gathered and as sembled with as much accuracy as possible.',
 'this history could not have been compiled without the many hours of help freely given by the citizens of monson.',
 'the main sources of information for this book are listed at the back, for the convenience of the reader.',
 'may this history bring pleasure to the readers now and in the future, but especially to those with memories of the past.',
 'jeanne brown reed althea haggstrom french elizabeth emanuelson davis history of monson, maine the white house washington june 19, 1972 to the people of monson, maine the observance of your one hundred and fiftieth anniversary is an occasion of deep pride for you as well as for the nation.',
 'the high purpose and vital communityspirit that are reflected in your eventful history are in the best tradition of our american way of life.',
 'armedwith these same qualities in the years 

4. <u>Lemmatizing Sentences</u>, as modified from https://medium.com/@gaurav5430/using-nltk-for-lemmatizing-sentences-c1bfff963258 
    
    <b>Lemmatization</b> in linguistics is the process of grouping together the inflected forms of a word so they can be analyzed as a single item, identified by the word's lemma, or dictionary form.  For example: rocks -> rock, corpora -> corpus, better -> good.  Lemmatization converts the words in a sentence to their lemmas while respecting their context. For example, the sentence “You are not better than me” would become “You be not good than me”. This is useful in NLP preprocessing, for example before training doc2vec models.

    <b>WordNet</b> is a large lexical database of English that is available through Python's Natural Language Toolkit. It covers nouns, verbs, adjectives and adverbs, grouped into sets of cognitive synonyms called synsets: groups of words that share the same meaning.
    
    By default, the lemmatizer treats whatever word you pass in as a noun. It does take a POS tag into account when you supply one, but it doesn’t magically determine the tag itself.
        * nltk.stem.WordNetLemmatizer().lemmatize('loving') returns 'loving'
        * nltk.stem.WordNetLemmatizer().lemmatize('loving', 'v') returns 'love'
    
    To make the lemmatization better and context dependent, we need to find the POS tag and pass it on to the lemmatizer. We first find the POS tag for each token using NLTK, map it to the corresponding WordNet tag, and then lemmatize the token based on that tag. Here is what it would look like:


In [23]:
### Create lemmatization functions and the POS tag mapping

from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

lemmatizer = WordNetLemmatizer()

# function to convert nltk tag to wordnet tag
def nltk_tag_to_wordnet_tag(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:          
        return None

def lemmatize_sentence(sentence):
    #tokenize the sentence and find the POS tag for each token
    nltk_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))  
    #tuple of (token, wordnet_tag) using the function created above to apply a lambda
    wordnet_tagged = map(lambda x: (x[0], nltk_tag_to_wordnet_tag(x[1])), nltk_tagged)
    lemmatized_sentence = []
    for word, tag in wordnet_tagged:
        if tag is None:
            #if there is no available tag, append the token as is
            lemmatized_sentence.append(word)
        else:        
            #else use the tag to lemmatize the token
            lemmatized_sentence.append(lemmatizer.lemmatize(word, tag))
    return " ".join(lemmatized_sentence)   

#### Creating and storing global variables of the tokenized, tagged and lemmatized versions for analysis

In [29]:
#sent_lem  = lemmatizer.lemmatize(cleantext) #lemmatized words of the tokenized sentences blob
#sent_pos = lemmatizer.lemmatize(cleantext,"v") ##lemmatized POS of the tokenized sentences
sent_lem = lemmatize_sentence(filtered_sentences)

In [30]:
##  replace the variable name in the print command with one of the variables above to view it
print(sent_lem)

monson . maine history 1822 - 1972 history monson , maine foreword follow data gather sembled much accuracy possible . history could not compile without many hour help freely give citizen monson . main source information book list back , convenience reader . may history bring pleasure reader future , especially memory past . jeanne brown reed althea haggstrom french elizabeth emanuelson davis history monson , maine white house washington june 19 , 1972 people monson , maine observance one hundred fiftieth anniversary occasion deep pride well nation . high purpose vital communityspirit reflect eventful history best tradition american way life . armedwith quality year ahead , knowthat strive vanguard constructive civic accomplish ment . welcomeyour full partnership demand task face nation , good promise come united effort . history monson , maine kenneth curtis may 3 , 1972 citizen monson town manager 's office monson , maine dear citizen : governor state maine want express best wish cit

In [32]:
# sent_lem is one long string, so split it into sentences to get one row per sentence
sentencedf = pd.DataFrame(sent_tokenize(sent_lem), columns=['sentences'])

#### Extracting Features (VECTORIZING)

In NLP, features are words or sentences represented as a <b>numeric vector</b>. The vector for each word is a semantic description of how that word is used in context, so two words that are used similarly in text will get similar vector representations. Once you map words into vector space, you can use vector math to find words that have similar semantics (proximity). This notebook looks at <b>TF-IDF</b>, <b>Word2Vec</b> and <b>Smooth Inverse Frequency (SIF)</b>.

<u><b> TF-IDF </b></u> converts a collection of raw documents to a matrix of TF-IDF features.  With TF-IDF, each word in a document is represented by a single scalar number, its TF-IDF score. TF-IDF is the combination of TF (Term Frequency) and IDF (Inverse Document Frequency). TF gives the count of word t in document d; mathematically we write tf(t,d). IDF tells how common or rare the word is across all documents. It is the logarithmically scaled inverse fraction of the documents that contain the word.
    
    Mathematically,
    idf(t,D) = log(N / df_t), where N = |D| is the total number of documents and df_t is the number of documents in which the term t appears.
    TF-IDF(t, d, D) = tf(t,d) * idf(t,D)
    
    scikit-learn library provides easy implementation of TF-IDF. 
    

To start using TfidfTransformer, you first have to create a CountVectorizer to count the words (term frequency), limit the vocabulary size, apply stop words, and so on. The code below does just that.

In [25]:
docs = clean_sentences

#instantiate CountVectorizer()
cv=CountVectorizer()
 
# this step learns the vocabulary and generates word counts for the words in your docs
word_count_vector=cv.fit_transform(docs)
 
word_count_vector.shape


(1286, 3728)

Sweet, this is what I want! Now it’s time to compute the IDFs. 

#### Compute the IDF values

Now I need to compute the IDF values by calling tfidf_transformer.fit(word_count_vector) on the word counts above.

In [26]:
tfidf_transformer=TfidfTransformer(smooth_idf=True,use_idf=True)
tfidf_transformer.fit_transform(word_count_vector)

<1286x3728 sparse matrix of type '<class 'numpy.float64'>'
	with 16589 stored elements in Compressed Sparse Row format>

To get a glimpse of how the IDF values look, we are going to print them by placing the IDF values in a pandas DataFrame. The values will be sorted in ascending order.

In [27]:
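# print idf values
# (get_feature_names_out() replaces get_feature_names(), which newer scikit-learn releases removed)
df_idf = pd.DataFrame(tfidf_transformer.idf_, index=cv.get_feature_names_out(), columns=["idf_weights"])
 
# sort ascending
df_idf.sort_values(by=['idf_weights'])

# The higher the IDF weight, the rarer the word is across the documents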

| | idf_weights |
|---|---|
| the | 1.584993 |
| and | 1.945461 |
| of | 2.035386 |
| in | 2.163617 |
| was | 2.348928 |
| ... | ... |
| goldsmith | 7.466922 |
| gospel | 7.466922 |
| got | 7.466922 |
| incidentally | 7.466922 |


The lower the IDF value of a word, the less unique it is to any particular document.
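
The weights in the output above can be checked by hand. This is a minimal sketch, assuming scikit-learn's documented formula for `smooth_idf=True`, idf(t) = ln((1 + N) / (1 + df_t)) + 1. The function name `smoothed_idf` is my own and not part of any library. With the 1286 sentences from the count matrix, a word that occurs in a single sentence (such as 'goldsmith', judging by its weight) gets the largest value shown:

```python
import math

def smoothed_idf(n_documents, document_frequency):
    """Smoothed IDF as documented for TfidfTransformer(smooth_idf=True)."""
    return math.log((1 + n_documents) / (1 + document_frequency)) + 1

n_sentences = 1286  # number of rows in word_count_vector

# a word found in exactly one sentence
rare_idf = smoothed_idf(n_sentences, 1)
print(round(rare_idf, 6))

# a word found in every sentence bottoms out at 1.0
print(smoothed_idf(n_sentences, n_sentences))
```

The "+ 1" means no word is ever weighted down to zero, which is why even "the" keeps a weight above 1.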

#### Compute the TFIDF score

The first line below gets the word counts for the documents. I could have reused word_count_vector from above. In practice, however, I will be computing tf-idf scores on a set of new, unseen documents, so I first need to call cv.transform(new_docs) to generate the matrix of word counts.

Then, by invoking tfidf_transformer.transform(count_vector), I compute the tf-idf scores for the docs. Internally this computes the tf * idf multiplication, where the term frequency is weighted by its IDF value.

In [38]:
# count matrix
count_vector=cv.transform(docs)
 
# tf-idf scores
tf_idf_vector=tfidf_transformer.transform(count_vector)

# get_feature_names_out() replaces get_feature_names() in newer scikit-learn releases
feature_names = cv.get_feature_names_out()
 
#get tfidf vector for the first document
document_vector=tf_idf_vector[0]
 
#print the scores
df = pd.DataFrame(document_vector.T.todense(), index=feature_names, columns=["tfidf"])
df.sort_values(by=["tfidf"],ascending=False)

| | tfidf |
|---|---|
| monson | 1.0 |
| 00 | 0.0 |
| outlet | 0.0 |
| orrin | 0.0 |
| ors | 0.0 |
| ... | ... |
| eldridge | 0.0 |
| elect | 0.0 |
| elected | 0.0 |
| electric | 0.0 |
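
The 1.0 for "monson" follows from normalization. The first sentence is just 'monson.', and TfidfTransformer by default (`norm='l2'`) scales every document vector to unit length, so a one-word document always scores 1.0 for that word. Below is a hand-rolled sketch of the multiply-then-normalize step. The helper `tfidf_row` and the IDF weights are made up for illustration and are not taken from the notebook:

```python
import math

def tfidf_row(term_counts, idf):
    """Multiply raw counts by IDF, then scale the row to unit L2 length."""
    weighted = {term: count * idf[term] for term, count in term_counts.items()}
    length = math.sqrt(sum(value ** 2 for value in weighted.values()))
    return {term: value / length for term, value in weighted.items()}

# made-up IDF weights, for illustration only
idf = {"monson": 2.5, "history": 3.0, "maine": 2.8}

one_word = tfidf_row({"monson": 1}, idf)
two_words = tfidf_row({"monson": 1, "history": 1}, idf)
print(one_word)
print(two_words)
```

In the two-word case the rarer word (higher IDF) ends up with the larger share of the unit-length vector.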


<u>Word Embedding using Word2Vec</u>. Word embedding is a language modeling technique used for mapping words to vectors of real numbers. It represents words or phrases in a vector space with several dimensions. In the network, the input layer contains the current word and the output layer contains the context words. The result is a vector for each word, as in the image below:

![Title](vectors.png)


#### Hyperparameters

In [10]:
settings = {'window_size': 2,# context window +- center word
            'n': 10,# dimensions of word embeddings, also refer to size of hidden layer
            'epochs': 50,# number of training epochs
            'learning_rate': 0.01 # learning rate
}
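
To make `window_size` concrete: Word2Vec in its skip-gram form learns from (center word, context word) pairs drawn from a sliding window around each word. The sketch below only generates those pairs; it is not the training loop, and `skipgram_pairs` is a hypothetical helper name of mine.

```python
def skipgram_pairs(tokens, window_size):
    """Return (center, context) pairs for every token and its neighbours within the window."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window_size)
        stop = min(len(tokens), i + window_size + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = "history could not compiled without many hours".split()
pairs = skipgram_pairs(tokens, window_size=2)
print(pairs[:4])
print(len(pairs))
```

A larger window produces more pairs per word and pulls in looser, more topical context; a small window such as 2 keeps the context close to the word's immediate neighbours.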

In [45]:
from gensim.models import Word2Vec

In [49]:
# Initialize the model

from gensim.test.utils import common_texts
# common_texts is gensim's small built-in toy corpus, used here only to initialize the model
# vector_size was named size before gensim 4.0
model = Word2Vec(common_texts, vector_size=100, window=5, min_count=1, workers=4)
model.save("word2vec.model")

In [50]:
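# Word2Vec expects every sentence as a list of tokens. Passing the raw strings makes gensim
# iterate over single characters, which is most likely why the original call
# model.train(sentence_tokens, total_examples=1, epochs=1) returned (0, 87287):
# zero effective words trained.
tokenized_sentences = [sentence.split() for sentence in sentence_tokens]
model.build_vocab(tokenized_sentences, update=True)   # add the Monson vocabulary to the model
model.train(tokenized_sentences, total_examples=len(tokenized_sentences), epochs=model.epochs)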

### Word2Vec for Sentences

In [95]:
def get_sentences(input_file_pointer):
    # yield the file one line at a time so large files never have to fit in memory
    while True:
        line = input_file_pointer.readline()
        if not line:
            break

        yield line

Clean sentences by trimming leading and trailing spaces, lowercasing, removing punctuation and other unnecessary characters, and reducing duplicate spaces to a single space.

In [96]:
def clean_sentence(sentence):
    sentence = sentence.lower().strip()
    sentence = re.sub(r'[^a-z0-9\s]', ' ', sentence)
    return re.sub(r'\s{2,}', ' ', sentence)

Tokenize each line by a simple space delimiter (more advanced techniques for tokenization exist, but tokenizing by a simple space gave us good results and works well in practice), and remove stop-words. Removing stop-words is task dependent, and in some NLP tasks keeping the stop-words yields better results. One should evaluate both approaches. For this task, we used spaCy’s stop-word set.

In [97]:
from spacy.lang.en.stop_words import STOP_WORDS

In [98]:
def tokenize(sentence):
    return [token for token in sentence.split() if token not in STOP_WORDS]

In [101]:
from gensim.models.phrases import Phrases, Phraser
def build_phrases(sentences):
    phrases = Phrases(sentences,
                      min_count=5,
                      threshold=7,
                      progress_per=1000)
    return Phraser(phrases)
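
For intuition about `min_count` and `threshold`: Phrases scores each adjacent word pair and keeps the pair as a bi-gram when the score exceeds the threshold. The sketch below re-implements what I understand to be gensim's default scoring rule, (count(a b) - min_count) * vocabulary size / (count(a) * count(b)). Treat both the formula and the helper name `bigram_score` as assumptions to verify against the gensim documentation; the counts are invented.

```python
def bigram_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Score a candidate bi-gram; higher means the pair co-occurs more often than chance suggests."""
    return (count_ab - min_count) / count_a / count_b * vocab_size

# invented counts: 'white' 12 times, 'house' 20 times, 'white house' 9 times
score = bigram_score(count_a=12, count_b=20, count_ab=9, vocab_size=3728, min_count=5)
print(round(score, 2))
print(score > 7)   # compared against threshold=7 from build_phrases
```

Under this rule a pair seen fewer than `min_count` times gets a negative score, so raising `min_count` or `threshold` yields fewer, more reliable bi-grams.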

In [102]:
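# build the phrases model from the cleaned, tokenized sentences before saving it
phrases_model = build_phrases([tokenize(clean_sentence(s)) for s in clean_sentences])
phrases_model.save('data/phrases_model.txt')
phrases_model= Phraser.load('data/phrases_model.txt')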

Now that we have a phrases model, we can use it to extract bi-grams for a given sentence:

In [39]:
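def sentence_to_bi_grams(phrases_model, sentence):
    # helper that was missing from the notebook; this definition is an assumption:
    # apply the phrases model to one tokenized sentence and join the result back into a string
    return ' '.join(phrases_model[sentence])

def sentences_to_bi_grams(n_grams, input_file_name, output_file_name):
    with open(input_file_name, 'r') as input_file_pointer:
        with open(output_file_name, 'w+') as out_file:
            for sentence in get_sentences(input_file_pointer):
                cleaned_sentence = clean_sentence(sentence)
                tokenized_sentence = tokenize(cleaned_sentence)
                parsed_sentence = sentence_to_bi_grams(n_grams, tokenized_sentence)
                out_file.write(parsed_sentence + '\n')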

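##### Gensim LineSentence class

In [40]:
from gensim.models.word2vec import LineSentence
from gensim.test.utils import get_tmpfile

# LineSentence streams a file with one sentence per line, so it needs a file path, not the text itself.
# The temporary file name is only illustrative.
sentence_file = get_tmpfile("monson_sentences.txt")
with open(sentence_file, 'w') as f:
    f.write("\n".join(clean_sentences))

sentences = LineSentence(sentence_file)
for sentence in sentences:
    pass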

#### Cosine Similarity Approach

In [13]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from scipy.spatial.distance import pdist, squareform

sentences = sentence_tokens
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences)
# pdist returns cosine *distances* (1 - cosine similarity), so 0 means identical word counts
cs_title = squareform(pdist(X.toarray(), 'cosine'))

In [14]:
X

<989x3630 sparse matrix of type '<class 'numpy.int64'>'
	with 10994 stored elements in Compressed Sparse Row format>

In [15]:
X.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 1, 0, ..., 0, 0, 0]], dtype=int64)

In [16]:
# get_feature_names_out() replaces get_feature_names() in newer scikit-learn releases
vectorizer.get_feature_names_out().tolist()

['00',
 '000',
 '01',
 '010',
 '01010106',
 '0121photo',
 '014',
 '01o1o',
 '04426',
 '04464',
 '04986',
 '07',
 '0ioioeoloolozoioi010101010',
 '0nson',
 '0zoeoeoioectoeoeo',
 '0zoeoi0joz',
 '10',
 '100',
 '100th',
 '101',
 '102',
 '103',
 '104',
 '105',
 '106',
 '107',
 '108',
 '109',
 '11',
 '110',
 '111',
 '116',
 '119',
 '12',
 '13',
 '14',
 '149',
 '14th',
 '15',
 '150th',
 '15th',
 '16',
 '17',
 '171',
 '18',
 '1811',
 '1816',
 '1817',
 '1818',
 '1819',
 '1820',
 '1821',
 '1822',
 '1823',
 '1824',
 '1825',
 '1826',
 '1827',
 '1828',
 '1830',
 '1831',
 '1836',
 '1839',
 '1841',
 '1844',
 '1845',
 '1847',
 '1848',
 '1852',
 '1860',
 '1861',
 '1862',
 '1864',
 '1867',
 '1870',
 '1871',
 '1872',
 '1873',
 '1874',
 '1875',
 '1877',
 '1878',
 '1879',
 '1880',
 '1882',
 '1883',
 '1884',
 '1885',
 '1886',
 '1888',
 '1889',
 '1890',
 '1892',
 '1893',
 '1895',
 '1896',
 '1897',
 '1899',
 '18th',
 '19',
 '1900',
 '1901',
 '1902',
 '1903',
 '1904',
 '1905',
 '1906',
 '1907',
 '1908',
 '1909'

In [19]:
np.set_printoptions(precision=2)
print(cs_title)

[[0.   0.77 0.71 ... 0.42 1.   1.  ]
 [0.77 0.   0.8  ... 0.6  1.   1.  ]
 [0.71 0.8  0.   ... 0.78 1.   1.  ]
 ...
 [0.42 0.6  0.78 ... 0.   0.87 1.  ]
 [1.   1.   1.   ... 0.87 0.   0.9 ]
 [1.   1.   1.   ... 1.   0.9  0.  ]]
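
Note that these numbers are cosine distances, not similarities: the diagonal is 0 because every sentence is at distance 0 from itself, and 1 means two sentences share no words. A small pure-Python sketch of the same measure on hand-made count vectors (the vectors and the function name are illustrative):

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity, the quantity pdist(..., 'cosine') reports."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1 - dot / (norm_u * norm_v)

# word counts over a toy vocabulary: [monson, maine, history, slate]
sentence_a = [1, 1, 1, 0]
sentence_b = [1, 1, 0, 0]
sentence_c = [0, 0, 0, 2]

print(round(cosine_distance(sentence_a, sentence_b), 2))  # two shared words: small distance
print(cosine_distance(sentence_a, sentence_c))            # no shared words: distance 1.0
```

To rank sentences by similarity rather than distance, use `1 - cs_title`.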


#### SIF

    Smooth inverse frequency (SIF) embeddings were originally conceived by [1], and the corresponding paper was presented at ICLR 2017. The code for the original paper is available on GitHub. The authors present a nice probabilistic motivation for the inverse-frequency-weighted continuous bag-of-words model. We are not going into the technical details of the math, but rather into optimizing the algorithm that computes the SIF embeddings. If you have to compute SIF embeddings for millions of sentences, you need a routine that finishes the task within a lifetime. 
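
The core of SIF is a weighted average: each word vector is scaled by a / (a + p(w)), where p(w) is the word's relative frequency in the corpus and a is a small constant, so frequent words count for less. Below is a minimal sketch of that weighting step only. The paper's second step (removing the common component shared by all sentence vectors) is left out, and the toy vectors, the frequencies, the value a = 0.001 and the function name are all assumptions made for illustration:

```python
def sif_embedding(tokens, word_vectors, word_prob, a=0.001):
    """Average the word vectors of a sentence, weighting each word by a / (a + p(word))."""
    dims = len(next(iter(word_vectors.values())))
    total = [0.0] * dims
    known = [t for t in tokens if t in word_vectors]  # skip out-of-vocabulary words
    for token in known:
        weight = a / (a + word_prob[token])
        for d in range(dims):
            total[d] += weight * word_vectors[token][d]
    return [value / len(known) for value in total] if known else total

# toy 2-dimensional vectors and corpus frequencies, invented for this example
word_vectors = {"the": [1.0, 0.0], "wife": [0.0, 1.0]}
word_prob = {"the": 0.05, "wife": 0.0005}

embedding = sif_embedding(["the", "wife"], word_vectors, word_prob)
print(embedding)
```

The rare word "wife" dominates the sentence vector while "the" contributes almost nothing, which is the same effect stop-word removal aims for, but applied smoothly.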

#### USING GENSIM

In [31]:
text = sentence_tokens

tokens = [sentence.split() for sentence in text]
gensim_dictionary = corpora.Dictionary(tokens)

print("The dictionary has: " +str(len(gensim_dictionary)) + " tokens")

for k, v in gensim_dictionary.token2id.items():
    print(f'{k:{15}} {v:{10}}')

The dictionary has: 3841 tokens
.                        0
monson                   1
,                        2
-                        3
1822                     4
1972                     5
accuracy                 6
data                     7
following                8
foreword                 9
gathered                10
history                 11
maine                   12
much                    13
possible                14
sembled                 15
citizens                16
compiled                17
could                   18
freely                  19
given                   20
help                    21
hours                   22
many                    23
not                     24
without                 25
back                    26
book                    27
convenience             28
information             29
listed                  30
main                    31
reader                  32
sources                 33
bring                   34
especially             

buy                   1379
chairs                1380
curtains              1381
earn                  1382
eighth                1383
fitting               1384
gattrell              1385
last                  1386
money                 1387
necessities           1388
pete                  1389
recess                1390
refreshments          1391
showers               1392
stage                 1393
supervision           1394
26                    1395
baptist               1396
congregational        1397
distict               1398
lutheran              1399
methodist             1400
pleasant              1401
rev                   1402
swedish               1403
wilkins               1404
braydon               1405
churches              1406
colony                1407
douglas               1408
former                1409
parsonage             1410
character             1411
influence             1412
maintain              1413
provided              1414
sparcely-settled       1415


14th                  2872
alert                 2873
elaborate             2874
evening               2875
hills                 2876
japan                 2877
longing               2878
mountains             2879
nestled               2880
news                  2881
nights                2882
programs              2883
quiet                 2884
surrender             2885
tues.                 2886
wed.                  2887
welcome               2888
announcement          2889
broadcasts            2890
get                   2891
safe                  2892
say                   2893
tuned                 2894
68                    2895
hearts                2896
individual            2897
joy                   2898
loose                 2899
scheduled             2900
ways                  2901
blew                  2902
dead                  2903
ear                   2904
siren                 2905
sweet                 2906
sweeter               2907
automobile            2908
d

In [38]:
# find the dictionary id of "wife"
print(gensim_dictionary.token2id["wife"])

455


In [39]:
# find the word from an id (the Dictionary supports lookup by id directly)
print(gensim_dictionary[455])

wife
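
The dictionary is a two-way mapping between tokens and integer ids. This sketch uses a plain dict standing in for `gensim_dictionary.token2id` (the first entries are copied from the printout above, and 'wife': 455 comes from the lookup cell) to show how the reverse mapping can be built once and reused:

```python
# a plain dict standing in for gensim_dictionary.token2id
token2id = {".": 0, "monson": 1, ",": 2, "-": 3, "1822": 4, "wife": 455}

# build the reverse mapping once
id2token = {token_id: token for token, token_id in token2id.items()}

print(token2id["wife"])
print(id2token[455])
```

Building the reverse dict once keeps each later lookup constant-time, which matters when translating many ids back to words.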


In [41]:
# Print tokens and ids
print(gensim_dictionary.token2id)

{'.': 0, 'monson': 1, ',': 2, '-': 3, '1822': 4, '1972': 5, 'accuracy': 6, 'data': 7, 'following': 8, 'foreword': 9, 'gathered': 10, 'history': 11, 'maine': 12, 'much': 13, 'possible': 14, 'sembled': 15, 'citizens': 16, 'compiled': 17, 'could': 18, 'freely': 19, 'given': 20, 'help': 21, 'hours': 22, 'many': 23, 'not': 24, 'without': 25, 'back': 26, 'book': 27, 'convenience': 28, 'information': 29, 'listed': 30, 'main': 31, 'reader': 32, 'sources': 33, 'bring': 34, 'especially': 35, 'future': 36, 'may': 37, 'memories': 38, 'past': 39, 'pleasure': 40, 'readers': 41, '19': 42, 'althea': 43, 'anniversary': 44, 'brown': 45, 'davis': 46, 'deep': 47, 'elizabeth': 48, 'emanuelson': 49, 'fiftieth': 50, 'french': 51, 'haggstrom': 52, 'house': 53, 'hundred': 54, 'jeanne': 55, 'june': 56, 'nation': 57, 'observance': 58, 'occasion': 59, 'one': 60, 'people': 61, 'pride': 62, 'reed': 63, 'washington': 64, 'well': 65, 'white': 66, 'american': 67, 'best': 68, 'communityspirit': 69, 'eventful': 70, 'hig

In [None]:
##  Insert more tokens and ids from other text sources
text = ['''Colloquially, the term "artificial intelligence" is used to
           describe machines that mimic "cognitive" functions that humans
           associate with other human minds, such as "learning" and "problem solving"''']

tokens = [sentence.split() for sentence in text]
gensim_dictionary.add_documents(tokens)

print("The dictionary has: " + str(len(gensim_dictionary)) + " tokens")
#print(gensim_dictionary.token2id)

In [34]:
import gensim.downloader as api
# load pretrained 100-dimensional GloVe vectors (downloaded on first use)
w2v_embedding = api.load("glove-wiki-gigaword-100")