In this mini-lecture, we systematically study the building blocks of the Gensim library. 

In the world of NLP, NLTK is the most dominant foundational package, but there are other important packages. Gensim is another powerful library that can handle a variety of tasks, such as latent semantic indexing and other topic-extraction methods. It can also employ models based on deep learning techniques, such as word embeddings. We have seen the functionalities of NLTK before; in this mini-lecture we explore the basic concepts of the Gensim library. Most of the material comes from the official Gensim documentation.

In this tutorial, we use Gensim 3.8.3, a stable release. Gensim is going through major changes in the 4.x releases. The new version is more powerful but, at the time of writing, less stable, so we stay with 3.8.3 throughout. Details of the upgrade can be found here:

   - https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4 

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import os
import string
import nltk
import re
import gensim 
import pprint
import logging

from collections import defaultdict
from wordcloud import STOPWORDS

from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize

%matplotlib inline

In [2]:
warnings.filterwarnings("ignore")

path="C:\\Users\\GAO\\python workspace\\GAO_Jupyter_Notebook\\Datasets"
os.chdir(path)

print('Gensim version: ', gensim.__version__)

# path="C:\\Users\\pgao\\Documents\\PGZ Documents\\Programming Workshop\\PYTHON\\Open Courses on Python\\Udemy Course on Python\Introduction to Data Science Using Python\\datasets"
# os.chdir(path)

# logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

Gensim version:  3.8.3


### I. Core Concepts in Gensim

We will first go over some of the core concepts of this library before moving on to word embedding methods.

The core concepts of Gensim library are:

   1. **(Gensim) document**: some text.
   2. **(Gensim) corpus**: a collection of documents.
   3. **(Gensim) vector**: a mathematically convenient representation of a document.
   4. **(Gensim) model**: an algorithm for transforming vectors from one representation to another.

A (Gensim) document is an object of the text sequence type (commonly known as 'str' in Python 3). A document can be anything from a short 140-character tweet or a single paragraph (e.g., a journal article abstract) to a news article or a book. Below is an example:

In [4]:
document="Human machine interface for lab and computer applications."

A (Gensim) corpus is a collection of document objects. The distinction between a document and a vector is that the former is text, and the latter is a mathematically convenient representation of the text (say using a list or a dictionary etc.). Corpora serve two roles in Gensim:

   1. Input for training a (Gensim) model object: during training, the models use this training corpus to look for common themes and topics, initializing their internal model parameters. Gensim focuses on unsupervised models so that no human intervention, such as costly annotations or tagging documents by hand, is required.
   2. (Gensim) documents to organize: after training, a topic model can be used to extract topics from new documents (documents not seen in the training corpus). Such corpora can be indexed for an operation in Gensim called **(Gensim) similarity queries**.

Here is an example of a corpus. It consists of 9 documents, where each document is a string consisting of a single sentence:

In [5]:
text_corpus = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",]

This is a particularly small corpus, used for illustration purposes. Other examples could be a list of all the plays written by Shakespeare, a list of all Wikipedia articles, or all tweets by a particular person of interest.

After collecting our corpus, there are typically a number of preprocessing steps we want to undertake. For now, we will keep it simple and just remove some commonly used English words (such as 'the') and words that occur only once in the corpus. In the process of doing so, we’ll tokenize our data. Tokenization breaks up the documents into words (in this case using space as a delimiter):

In [6]:
# a small set of frequent English words (a subset of NLTK's stopword list)
mini_stoplist = set('for a of the and to in'.split(' '))

# lowercase each document, split it on whitespace, and drop the words in mini_stoplist
texts = [[word for word in document.lower().split() if word not in mini_stoplist] for document in text_corpus]
print("texts:\n ", texts, "\n")

frequency = defaultdict(int) # counting word frequencies
for t in texts:
    for token in t:
        frequency[token] += 1 

processed_corpus = [[token for token in t if frequency[token] > 1] for t in texts] # only keeping words that appear more than once
pprint.pprint(processed_corpus)

texts:
  [['human', 'machine', 'interface', 'lab', 'abc', 'computer', 'applications'], ['survey', 'user', 'opinion', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'management', 'system'], ['system', 'human', 'system', 'engineering', 'testing', 'eps'], ['relation', 'user', 'perceived', 'response', 'time', 'error', 'measurement'], ['generation', 'random', 'binary', 'unordered', 'trees'], ['intersection', 'graph', 'paths', 'trees'], ['graph', 'minors', 'iv', 'widths', 'trees', 'well', 'quasi', 'ordering'], ['graph', 'minors', 'survey']] 

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]


Before proceeding, we want to associate each word in the corpus with a unique integer ID. We can do this using the gensim.corpora.Dictionary class, which defines the vocabulary of all words that our processing knows about.

In [7]:
dictionary = gensim.corpora.Dictionary(processed_corpus)
print(dictionary)
print("first element of the dictionary object: ", dictionary[0])
type(dictionary)

Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...)
first element of the dictionary object:  computer


gensim.corpora.dictionary.Dictionary

To infer the latent structure in our corpus we need a way to represent documents that we can manipulate mathematically. One approach is to represent each document as a vector of features. For example, a single feature may be thought of as a question-answer pair:

   - How many times does the word 'splonge' appear in the document? Zero.
   - How many paragraphs does the document consist of? Two.
   - How many fonts does the document use? Five.

Each question is usually represented only by its integer ID (here 1, 2 and 3). The representation of this document then becomes a series of pairs: (1, 0.0), (2, 2.0), (3, 5.0). This is known as a **(Gensim) dense vector**, because it contains an explicit answer to each of the above questions. If we know all the questions in advance, we may leave them implicit and simply represent the document as (0, 2, 5). This sequence of answers is the vector for our document (in this case a 3-dimensional dense vector). For practical purposes, only questions to which the answer is (or can be converted to) a single floating-point number are allowed in Gensim.

In practice, vectors often consist of many zero values. To save memory, Gensim omits all vector elements with value 0.0. The above example thus becomes (2, 2.0), (3, 5.0). This is known as a **(Gensim) sparse vector** or **(Gensim) bag-of-words vector**. The values of all missing features in this sparse representation can be unambiguously resolved to zero.
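
As a minimal sketch of the dense-versus-sparse distinction, the plain-Python snippet below drops the zero-valued entries of the example vector and then restores them. The helper names `to_sparse` and `to_dense` are hypothetical and are not part of Gensim; they only illustrate the idea.

```python
def to_sparse(pairs):
    """Keep only the (feature_id, value) pairs whose value is non-zero."""
    return [(fid, val) for fid, val in pairs if val != 0.0]


def to_dense(sparse_pairs, feature_ids):
    """Rebuild the full vector; any missing feature ID resolves to 0.0."""
    lookup = dict(sparse_pairs)
    return [(fid, lookup.get(fid, 0.0)) for fid in feature_ids]


dense_doc = [(1, 0.0), (2, 2.0), (3, 5.0)]  # the example document from the text
sparse_doc = to_sparse(dense_doc)
print(sparse_doc)                       # [(2, 2.0), (3, 5.0)]
print(to_dense(sparse_doc, [1, 2, 3]))  # [(1, 0.0), (2, 2.0), (3, 5.0)]
```

Nothing is lost in the round trip, which is why the sparse form is a safe way to save memory.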

Conceptually, assuming the questions are the same, we can compare the vectors of two different documents to each other. For example, assume we are given two vectors (0.0, 2.0, 5.0) and (0.1, 1.9, 4.9). Because the vectors are very similar to each other, we can conclude that the documents corresponding to those vectors are similar, too. Of course, the correctness of that conclusion depends on how well we picked the questions in the first place. Measuring the (dis)similarity between two bodies of text is a hard problem in general, but for now this intuition based on vector values is good enough for many situations.

Another approach to represent a document as a vector is the bag-of-words (BOW) model. Under the BOW model, each document is represented by a vector containing the frequency counts of each word in the dictionary. For example, assume we have a dictionary containing the words ['coffee', 'milk', 'sugar', 'spoon']. A document consisting of the string 'coffee milk coffee' would then be represented by the vector [2, 1, 0, 0] where the entries of the vector are (in order) the occurrences of 'coffee', 'milk', 'sugar' and 'spoon' in the document. The length of the vector is the number of entries in the dictionary. 
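
The coffee example can be reproduced in a few lines of plain Python. This is only a sketch of the idea, not how Gensim stores vectors; the function name `dense_bow` is an assumption made for illustration.

```python
from collections import Counter

vocabulary = ['coffee', 'milk', 'sugar', 'spoon']  # word order fixes the vector positions


def dense_bow(text, vocab):
    """Count how often each vocabulary word occurs in a whitespace-tokenized text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]  # a Counter returns 0 for absent words


bow_vector = dense_bow('coffee milk coffee', vocabulary)
print(bow_vector)  # [2, 1, 0, 0]
```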

In our example, our processed corpus has 12 unique words in it, which means that each document will be represented by a 12-dimensional vector under the bag-of-words model. We can use the dictionary to turn tokenized documents into these 12-dimensional vectors. We can see what these IDs correspond to:

In [8]:
pprint.pprint(dictionary.token2id)

{'computer': 0,
 'eps': 8,
 'graph': 10,
 'human': 1,
 'interface': 2,
 'minors': 11,
 'response': 3,
 'survey': 4,
 'system': 5,
 'time': 6,
 'trees': 9,
 'user': 7}


Suppose we want to vectorize a new document that was not in our original corpus (the string 'new_doc' below). We can create the bag-of-words representation of a document using the doc2bow() method of the dictionary, which counts the number of occurrences of each distinct word, converts each word to its integer ID, and returns the result as a sparse vector. The first entry in each tuple is the ID of the token in the dictionary; the second is the count of this token:

In [9]:
new_doc = "Human vs computer and computer vs computer and human vs human interaction are all confusing. Users will not like this idea, but this one particular user figured out everything!"
new_vec = dictionary.doc2bow(new_doc.lower().split())
print(new_vec)

[(0, 3), (1, 3), (7, 1)]


Recall the dictionary defined before. Three of its words occur in the new text 'new_doc': computer (token ID 0), human (token ID 1), and user (token ID 7). The object 'new_vec' gives their frequencies in 'new_doc'. Note that words such as 'vs' did not occur in the original corpus, so they are not included in the vectorization. The same holds for 'users', which is a different token from 'user' because we only lowercase and split on whitespace. Also note that this vector only contains entries for words that actually appear in the document. Because any given document contains only a few of the many words in the dictionary, words that do not appear are left out and implicitly treated as zero to save space.
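
To make the counting step concrete, here is a rough plain-Python imitation of what doc2bow() does with this dictionary. The function `simple_doc2bow` is hypothetical and simplified (it is not Gensim's implementation); the token IDs are copied from the token2id output shown earlier.

```python
from collections import Counter

# Token IDs copied from the dictionary.token2id output shown earlier.
token2id = {'computer': 0, 'human': 1, 'interface': 2, 'response': 3,
            'survey': 4, 'system': 5, 'time': 6, 'user': 7,
            'eps': 8, 'trees': 9, 'graph': 10, 'minors': 11}


def simple_doc2bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, ignoring out-of-vocabulary tokens."""
    counts = Counter(token2id[tok] for tok in tokens if tok in token2id)
    return sorted(counts.items())


new_doc = ("Human vs computer and computer vs computer and human vs human "
           "interaction are all confusing. Users will not like this idea, "
           "but this one particular user figured out everything!")
sketch_vec = simple_doc2bow(new_doc.lower().split(), token2id)
print(sketch_vec)  # [(0, 3), (1, 3), (7, 1)]
```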

We can convert our entire original corpus to a list of vectors:

In [10]:
bow_corpus = [dictionary.doc2bow(txt) for txt in processed_corpus]
pprint.pprint(bow_corpus)

[[(0, 1), (1, 1), (2, 1)],
 [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
 [(2, 1), (5, 1), (7, 1), (8, 1)],
 [(1, 1), (5, 2), (8, 1)],
 [(3, 1), (6, 1), (7, 1)],
 [(9, 1)],
 [(9, 1), (10, 1)],
 [(9, 1), (10, 1), (11, 1)],
 [(4, 1), (10, 1), (11, 1)]]


Now that we have vectorized our corpus we can begin to transform it using a **(Gensim) model**, which is used as an abstract term referring to a transformation from one document representation to another. In Gensim, documents are represented as vectors so a model can be thought of as a transformation between two vector spaces. The model learns the details of this transformation during training, when it reads the training corpus.

One simple example of a model is TF-IDF. The TF-IDF model transforms vectors from the BOW representation to a vector space where the frequency counts are weighted according to the relative rarity of each word in the corpus.

Here’s a simple example. Let’s initialize the TF-IDF model, train it on our corpus, and transform the string 'system minors'. As shown below, the TF-IDF model again returns a list of tuples, where the first entry is the token ID and the second entry is the TF-IDF weighting. Note that the ID corresponding to 'system' (which occurs 4 times in the original corpus, across 3 documents) is weighted lower than the ID corresponding to 'minors' (which occurs only twice, in 2 documents).

In [11]:
tfidf = gensim.models.TfidfModel(bow_corpus)

words = "system minors".lower().split()
print(tfidf[dictionary.doc2bow(words)]) # transforming the "system minors" string

[(5, 0.5898341626740045), (11, 0.8075244024440723)]
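
The two weights can be reproduced by hand. The sketch below assumes the default weighting of TfidfModel as we understand it: the raw term count times $\log_2(N/\mathrm{df})$, where $N$ is the number of documents and $\mathrm{df}$ is the number of documents containing the word, followed by scaling the vector to unit Euclidean length. The document frequencies are read off the processed corpus above.

```python
import math

num_docs = 9                              # documents in processed_corpus
doc_freq = {'system': 3, 'minors': 2}     # number of documents containing each word
term_count = {'system': 1, 'minors': 1}   # counts in the query 'system minors'

# Assumed default scheme: term count times log2(N / document frequency).
raw = {w: term_count[w] * math.log2(num_docs / doc_freq[w]) for w in doc_freq}

# Scale the vector to unit Euclidean length.
norm = math.sqrt(sum(v * v for v in raw.values()))
weights = {w: v / norm for w, v in raw.items()}
print(weights)
```

Under this assumption the values agree with the output above (about 0.59 for 'system' and 0.81 for 'minors'): the rarer word gets the larger weight.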


Once we have created the model, we can do all sorts of cool stuff. For example, to transform the whole corpus via TF-IDF and index it, in preparation for operations called 'similarity queries':

In [12]:
index = gensim.similarities.SparseMatrixSimilarity(tfidf[bow_corpus], num_features=12)
print(type(index))

query_document = 'system engineering'.split()
query_bow = dictionary.doc2bow(query_document)
sims = index[tfidf[query_bow]] 
print(list(enumerate(sims))) # querying the similarity of 'query_document' against every document in the corpus


<class 'gensim.similarities.docsim.SparseMatrixSimilarity'>
[(0, 0.0), (1, 0.32448703), (2, 0.41707572), (3, 0.7184812), (4, 0.0), (5, 0.0), (6, 0.0), (7, 0.0), (8, 0.0)]


To interpret the result above: the document at index 3 (the fourth document, since indices start at 0) has a similarity score of 0.718, or about 72%, while the document at index 2 (the third document) has a score of about 42%, and so on. We can make this slightly more readable by sorting:

In [13]:
for doc_num, score in sorted(enumerate(sims), key=lambda x: x[1], reverse=True):
    print(doc_num, score)

3 0.7184812
2 0.41707572
1 0.32448703
0 0.0
4 0.0
5 0.0
6 0.0
7 0.0
8 0.0


### II. Corpus Streaming and Vector Space

Note that all the corpora above reside fully in memory, as a plain Python list. When the dataset is large, storing all of them in RAM won’t do. Instead, let’s assume the documents are stored in a file on disk, one document per line. Gensim only requires that a corpus be able to return one document vector at a time. To construct the dictionary without loading all texts into memory, we can do the following:

In [15]:
# build the dictionary by streaming the file one line (document) at a time
with open('Shakespeare sonnets.txt') as f:
    dictionary = gensim.corpora.Dictionary(line.lower().split() for line in f)

# IDs of the stopwords that are present in the dictionary
stop_ids = [dictionary.token2id[stopword] for stopword in mini_stoplist if stopword in dictionary.token2id]

# IDs of the words that appear in only one document
once_ids = [tokenid for tokenid, docfreq in dictionary.dfs.items() if docfreq == 1]

dictionary.filter_tokens(stop_ids + once_ids)  # removing the stopwords and the words that appear in only one document
dictionary.compactify()  # reassigning word IDs to close the gaps left by the removed words
print(dictionary) # this is a gensim.corpora.dictionary.Dictionary object
pprint.pprint(dictionary[1046])

Dictionary(1455 unique tokens: ['by', 'i', 'creatures', 'desire', 'fairest']...)
'breath'
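
The same streaming idea applies to the corpus itself: any object that yields one document vector per iteration will do. Below is a minimal, Gensim-free sketch under stated assumptions: the class name `StreamedCorpus`, the temporary two-line file, and the small token-to-ID mapping are all made up for illustration.

```python
import os
import tempfile


class StreamedCorpus:
    """Yield one bag-of-words vector per line without loading the whole file."""

    def __init__(self, path, token2id):
        self.path = path
        self.token2id = token2id

    def __iter__(self):
        with open(self.path, encoding='utf-8') as handle:
            for line in handle:  # only one line is held in memory at a time
                counts = {}
                for token in line.lower().split():
                    if token in self.token2id:
                        tid = self.token2id[token]
                        counts[tid] = counts.get(tid, 0) + 1
                yield sorted(counts.items())


# Hypothetical two-line file standing in for a large on-disk corpus.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write("Graph minors A survey\nThe intersection graph of paths in trees\n")
    corpus_path = tmp.name

streamed = StreamedCorpus(corpus_path, {'graph': 0, 'minors': 1, 'survey': 2, 'trees': 3})
vectors = list(streamed)  # materialized here only to display the result
print(vectors)            # [[(0, 1), (1, 1), (2, 1)], [(0, 1), (3, 1)]]
os.remove(corpus_path)
```

Because the file is reopened in `__iter__`, the corpus can be iterated several times, which matters for models that make more than one pass over the data.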


When we have a corpus in vector format, there are several file formats for serializing it to disk. Gensim implements them via the streaming corpus interface mentioned earlier: documents are read from disk lazily, one at a time, without the whole corpus being read into main memory at once. One of the more notable file formats is the Matrix Market format. To save a corpus in this format, we first create a toy corpus as a plain Python list:

In [16]:
toy_corpus = [[(1, 5),(2, 4), (3, 5)], []]  # make one document empty, just for the heck of it
gensim.corpora.MmCorpus.serialize('new_corpus.mm', toy_corpus) # mm stands for the 'Matrix Market' format

There are other formats as well (see the official Gensim documentation), and each format has its own class with methods for saving and loading in that format.

Corpus objects here are streams, so typically we won’t be able to print their contents directly. To load the data saved in the Matrix Market format, here is what we can do:

In [17]:
loaded_corpus = gensim.corpora.MmCorpus('new_corpus.mm')
print(loaded_corpus)
list(loaded_corpus)

MmCorpus(2 documents, 4 features, 3 non-zero entries)


[[(1, 5.0), (2, 4.0), (3, 5.0)], []]
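
To see where the summary "2 documents, 4 features, 3 non-zero entries" comes from, it helps to look at the shape of a Matrix Market coordinate file: a header line, a size line, and one line per non-zero entry. The sketch below renders the toy corpus in that layout. It is an illustration under assumptions, not Gensim's writer: the function `to_matrix_market` is hypothetical, and we assume rows are documents, columns are feature IDs, and both indices are 1-based. The exact bytes that Gensim writes may differ in detail.

```python
def to_matrix_market(corpus):
    """Render a list of sparse documents as Matrix Market coordinate text.

    Assumption: rows are documents, columns are feature IDs, and both
    indices are written 1-based, as the coordinate format requires.
    """
    entries = [(doc_no + 1, feat_id + 1, value)
               for doc_no, doc in enumerate(corpus)
               for feat_id, value in doc]
    num_docs = len(corpus)
    # Number of features is one more than the largest feature ID seen.
    num_features = 1 + max((feat_id for doc in corpus for feat_id, _ in doc),
                           default=-1)
    lines = ['%%MatrixMarket matrix coordinate real general',
             '{} {} {}'.format(num_docs, num_features, len(entries))]
    lines += ['{} {} {}'.format(*entry) for entry in entries]
    return '\n'.join(lines)


toy_corpus = [[(1, 5), (2, 4), (3, 5)], []]
mm_text = to_matrix_market(toy_corpus)
print(mm_text)
```

The size line reads `2 4 3`: feature IDs run from 0 to 3, so there are 4 features even though ID 0 never occurs, and the empty second document contributes no entries.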

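Gensim also contains efficient utility functions for converting from/to NumPy matrices. Pay attention to the way the matrix is interpreted below: each column of the 5-by-2 matrix is treated as one document, so the result is a corpus of 2 documents with 5 features each: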

In [18]:
numpy_matrix = np.random.randint(10, size=[5, 2])  # random matrix as an example
gensim_corpus = gensim.matutils.Dense2Corpus(numpy_matrix)
print(gensim_corpus)
print(numpy_matrix)
list(gensim_corpus)

<gensim.matutils.Dense2Corpus object at 0x0000028636F18E08>
[[4 3]
 [8 2]
 [8 1]
 [8 5]
 [6 8]]


[[(0, 4.0), (1, 8.0), (2, 8.0), (3, 8.0), (4, 6.0)],
 [(0, 3.0), (1, 2.0), (2, 1.0), (3, 5.0), (4, 8.0)]]
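
The column-as-document convention can be mimicked in plain Python. This is only a sketch of the layout, using the matrix printed above; dropping zero entries is our assumption, in line with the sparse-vector convention described in Section I.

```python
matrix = [[4, 3],
          [8, 2],
          [8, 1],
          [8, 5],
          [6, 8]]  # 5 rows (features) x 2 columns (documents)

# zip(*matrix) walks the matrix column by column, so each column becomes a document.
columns_as_docs = [
    [(feat_id, float(value)) for feat_id, value in enumerate(column) if value != 0]
    for column in zip(*matrix)
]
print(columns_as_docs)
```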

### III. Topics and Transformations

In this section we examine common ways of transforming text documents. Specifically, we want to bring out hidden structure in the corpus, discover relationships between words, and use them to describe the documents in a new and more semantic way. In addition, we want to make the document representation more compact. This improves both efficiency (the new representation consumes fewer resources) and efficacy (noise reduction).

We will use the wine review dataset again. 

In [19]:
filename="winemag-data-130k-v2.csv"
df = pd.read_csv(filename, index_col=0) # we don't read in row name (index) as a separated column
df.head(3)

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm


Let's collect the descriptions from all the reviews of only Chardonnay and Cabernet Sauvignon:

In [20]:
df1=df[df['variety'].isin(['Chardonnay', 'Cabernet Sauvignon'])]
both_reviews=" ".join(review for review in df1.description)
print("There are {} characters in the combination of all reviews.".format(len(both_reviews))) # len() of a string counts characters, not words

There are 5113176 characters in the combination of all reviews.


In [23]:
stopwords_list=list(STOPWORDS)
rm_punkt_list=[l for l in string.punctuation]

tokenized_text1=sent_tokenize(both_reviews)
tokenized_text2=[st.strip() for st in tokenized_text1]

tokenized_text3=[]
for j in tokenized_text2:
    tokens=word_tokenize(j)
    tokens2=[l.lower() for l in tokens if (l not in rm_punkt_list) and (l not in stopwords_list)] # dropping punctuation and stopwords (checked before lowercasing), then lowercasing what remains
    tokenized_text3.append(tokens2)
    
print(tokenized_text3[0:3])

dictionary = gensim.corpora.Dictionary(tokenized_text3)
print('A particular dictionary element: ', dictionary[405])
corpus = [dictionary.doc2bow(t) for t in tokenized_text3] # this is a list

[['soft', 'supple', 'plum', 'envelopes', 'oaky', 'structure', 'cabernet', 'supported', '15', 'merlot'], ['coffee', 'chocolate', 'complete', 'picture', 'finishing', 'strong', 'end', 'resulting', 'value-priced', 'wine', 'attractive', 'flavor', 'immediate', 'accessibility'], ['slightly', 'reduced', 'wine', 'offers', 'chalky', 'tannic', 'backbone', 'juicy', 'explosion', 'rich', 'black', 'cherry', 'whole', 'accented', 'throughout', 'firm', 'oak', 'cigar', 'box']]
A particular dictionary element:  lifted


Now transformations can be done. Gensim implements several popular vector space model algorithms:

   - The basic BOW TF-IDF model.
   - Latent semantic indexing model (LSI)
   - Random projection model (RP)
   - Latent Dirichlet allocation model (LDA)

We are familiar with LSI and LDA models. Other methods can be learned through reading the official documentation. We won't elaborate on the detail here. 

### IV. Gensim Similarity Queries

In previous sections, we covered what it means to create a corpus in the vector space model and how to transform it between different vector spaces in Gensim. A common reason for such a charade is that we want to determine similarity between pairs of documents, or the similarity between a specific document and a set of other documents (such as a user query vs. indexed documents).

To show how this can be done in Gensim, let us consider the same corpus as in the first section:

In [24]:
documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

stoplist = set('for a of the and to in'.split()) # removing common words and tokenize
texts = [
    [word for word in document.lower().split() if word not in stoplist]
    for document in documents
]

frequency = defaultdict(int) # removing words that appear only once
for text in texts:
    for token in text:
        frequency[token] += 1

tokenized_texts = [
    [token for token in text if frequency[token] > 1]
    for text in texts
]

dictionary = gensim.corpora.Dictionary(tokenized_texts)
corpus = [dictionary.doc2bow(text) for text in tokenized_texts]

Let's first use LSI to transform the data:

In [25]:
K=2
lsi = gensim.models.LsiModel(corpus, id2word=dictionary, num_topics=K)

For the purposes of this tutorial, there are only two things we need to know about LSI. First, it’s just another transformation: it transforms vectors from one space to another. Second, the benefit of LSI is that it enables us to identify patterns and relationships between terms (in our case, words in a document) and topics. Our LSI space is $K$-dimensional (_num\_topics_ = $K$), so with $K=2$ there are two topics in this example.

Now suppose a user typed in the query "Shall I compare thee to a summer's day. Thou art more lovely and more temperate. Computer human INTERACTION!!!". We would like to sort our corpus documents in decreasing order of relevance to this query. Unlike modern search engines, here we concentrate on only a single aspect of possible similarities: the apparent semantic relatedness of the texts (words). We will use the cosine similarity measure, which looks at the angle between two vectors. If $a$ and $b$ are two vectors and $\theta$ is the angle between them, then $\cos(\theta)=\frac{a \cdot b}{\|a\| \, \|b\|}$. We start by converting the query to the LSI space:

In [26]:
doc="Shall I compare thee to a summer's day. Thou art more lovely and more temperate. Computer human INTERACTION!!!"
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lsi = lsi[vec_bow]  # convert the query to LSI space
print(vec_lsi)

[(0, -0.4618210045327156), (1, -0.07002766527899836)]
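
The cosine formula above is easy to check by hand on small vectors. The sketch below is plain Python, independent of Gensim, and the function name `cosine_similarity` is ours; the last call reuses the two example vectors from Section I.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # same direction: 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))   # orthogonal: 0.0
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction: -1.0
print(cosine_similarity([0.0, 2.0, 5.0], [0.1, 1.9, 4.9]))  # Section I example, close to 1
```

Only the direction of the vectors matters, not their length, which is why a short query can still be compared fairly against longer documents.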


To prepare for similarity queries, we need to index all the documents that we want to compare against subsequent queries. In our case, they are the same nine documents used for training LSI, converted to the 2-D LSI space.

In [27]:
index = gensim.similarities.MatrixSimilarity(lsi[corpus])

Cosine measure returns similarities in the range \[-1, 1\] (the greater, the more similar). To obtain similarities of our query document against the nine indexed documents, we can do the following:

In [28]:
sims = index[vec_lsi]  # performing a similarity query against the corpus
print(list(enumerate(sims))[0:3], "\n") # sims is an array of scores; enumerate pairs each score with its document index

sims = sorted(enumerate(sims), key=lambda item: -item[1])
for doc_position, doc_score in sims:
    print(doc_score, "(similarity score): ", documents[doc_position]) # printing each score next to the original document text

[(0, 0.998093), (1, 0.93748635), (2, 0.9984453)] 

0.9984453 (similarity score):  The EPS user interface management system
0.998093 (similarity score):  Human machine interface for lab abc computer applications
0.9865886 (similarity score):  System and human system engineering testing of EPS
0.93748635 (similarity score):  A survey of user opinion of computer system response time
0.90755945 (similarity score):  Relation of user perceived response time to error measurement
0.050041765 (similarity score):  Graph minors A survey
-0.09879464 (similarity score):  Graph minors IV Widths of trees and well quasi ordering
-0.10639259 (similarity score):  The intersection graph of paths in trees
-0.12416792 (similarity score):  The generation of random binary unordered trees


The thing to note here is that the documents "The EPS user interface management system" and "Relation of user perceived response time to error measurement" would never be returned by a standard boolean full-text search, because they share no words with the part of the query that is in our vocabulary ('computer' and 'human'). However, after applying LSI, both of them receive quite high similarity scores (the former is even ranked first), which corresponds better to our intuition that they share a 'computer-human' topic with the query. In fact, this semantic generalization is the reason why we apply transformations and do topic modelling in the first place.

### References: 

   - https://machinelearningmastery.com/what-are-word-embeddings/
   - https://machinelearningmastery.com/develop-word-embeddings-python-gensim/
   - https://radimrehurek.com/gensim/auto_examples/index.html#documentation
   - https://radimrehurek.com/gensim/auto_examples/core/run_core_concepts.html
   - https://radimrehurek.com/gensim/auto_examples/core/run_corpora_and_vector_spaces.html 
   - https://radimrehurek.com/gensim/auto_examples/core/run_similarity_queries.html
   - https://math.nist.gov/MatrixMarket/formats.html
   - https://www.coursera.org/lecture/probabilistic-models-in-nlp/continuous-bag-of-words-model-hW72r
   - https://www.kaggle.com/jannesklaas/17-nlp-and-word-embeddings
   - https://www.shanelynn.ie/get-busy-with-word-embeddings-introduction/ 
   - https://www.tensorflow.org/tutorials/text/word2vec
   - http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
   - https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/ 
   - https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4 
