## Word Embeddings in Python with Gensim

https://machinelearningmastery.com/develop-word-embeddings-python-gensim/

## Word2vec by Google and GloVe by Stanford

Word embeddings are an **improvement over simpler bag-of-words encoding schemes** such as word counts and frequencies. Those schemes produce large, sparse vectors (mostly 0 values) that describe documents but not the meaning of the words. A word embedding instead provides a **dense vector representation of words that captures something about their meaning**.
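
To make the sparsity concrete, here is a minimal sketch of a bag-of-words count vector in plain Python. The two-document corpus and the helper name `count_vector` are made up for illustration; with a realistic vocabulary of tens of thousands of words, nearly every entry would be zero.

```python
# Toy corpus (hypothetical) used only to illustrate sparse count vectors.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the ball in the park".split(),
]

# The vocabulary is every distinct word in the corpus, in sorted order.
vocab = sorted({word for doc in docs for word in doc})

def count_vector(doc, vocab):
    """Return one count per vocabulary word: a bag-of-words vector."""
    return [doc.count(word) for word in vocab]

for doc in docs:
    vec = count_vector(doc, vocab)
    zeros = vec.count(0)
    print(vec, f"-> {zeros} of {len(vec)} entries are zero")
```

Each vector is as long as the vocabulary and says nothing about which words are related, which is the gap a dense embedding is meant to fill.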

It is **defining a word by the company that it keeps** that allows the word embedding to learn something about the meaning of words. The vector space representation of the words provides a projection where words with similar meanings are locally clustered within the space.
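
The idea of "the company a word keeps" can be sketched by counting which words appear near each other. The sentences and the helper `context_counts` below are hypothetical, and word2vec does not build such a table explicitly; the sketch only illustrates the kind of signal it learns from.

```python
from collections import Counter, defaultdict

# Made-up sentences in which "coffee" and "tea" appear in similar contexts.
sentences_demo = [
    "i drink hot coffee every morning".split(),
    "i drink hot tea every evening".split(),
]

def context_counts(sentences, window=2):
    """Count, for each word, the words that appear within `window` positions of it."""
    contexts = defaultdict(Counter)
    for sentence in sentences:
        for i, word in enumerate(sentence):
            start = max(0, i - window)
            stop = min(len(sentence), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    contexts[word][sentence[j]] += 1
    return contexts

contexts = context_counts(sentences_demo)
# Words that keep the same company end up with overlapping context sets.
shared = set(contexts["coffee"]) & set(contexts["tea"])
print(sorted(shared))
```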

We are going to look at how to use two different word embedding methods called **word2vec by researchers at Google and GloVe by researchers at Stanford**.

## Gensim library

Gensim is a mature, focused, and efficient suite of NLP tools for topic modeling:
1. It provides an implementation of the Word2Vec word embedding for **learning new word vectors** from text.
2. It also provides tools for **loading pre-trained word embeddings** in a few formats and for using and querying a loaded embedding.

In [1]:
!pip install nltk
!pip install gensim


In [2]:
from gensim.models import Word2Vec

In [3]:
# define training data
sentences = [['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
            ['this', 'is', 'the', 'second', 'sentence'],
            ['yet', 'another', 'sentence'],
            ['one', 'more', 'sentence'],
            ['and', 'the', 'final', 'sentence']]

# train model
model = Word2Vec(sentences, min_count=1)

There are many parameters on this constructor for **Word2Vec()**; a few noteworthy arguments you may wish to configure are:
* **vector_size**: (default 100) The number of dimensions of the embedding, i.e. the length of the dense vector that represents each token (word). This argument was named `size` before Gensim 4.0.
* **window**: (default 5) The maximum distance between a target word and the words around it.
* **min_count**: (default 5) The minimum count of words to consider when training the model; words that occur fewer times than this will be ignored.
* **workers**: (default 3) The number of threads to use while training.
* **sg**: (default 0 or CBOW) The training algorithm, either CBOW (0) or skip gram (1).

In [4]:
# summarize the trained model
print(model)

Word2Vec<vocab=14, vector_size=100, alpha=0.025>


After the model is trained, the learned word vectors are accessible via the “wv” attribute. This is the actual word vector model against which queries can be made.

In [10]:
model.wv.key_to_index

{'sentence': 0,
 'the': 1,
 'is': 2,
 'this': 3,
 'final': 4,
 'and': 5,
 'more': 6,
 'one': 7,
 'another': 8,
 'yet': 9,
 'second': 10,
 'word2vec': 11,
 'for': 12,
 'first': 13}

In [11]:
# summarize vocabulary
words = list(model.wv.key_to_index)
# before Gensim 4.0: words = list(model.wv.vocab)
print(words)

['sentence', 'the', 'is', 'this', 'final', 'and', 'more', 'one', 'another', 'yet', 'second', 'word2vec', 'for', 'first']


In [13]:
# access vector for one word
print(model.wv['another'])

[-9.5785465e-03  8.9431154e-03  4.1650687e-03  9.2347348e-03
  6.6435025e-03  2.9247368e-03  9.8040197e-03 -4.4246409e-03
 -6.8033109e-03  4.2273807e-03  3.7290000e-03 -5.6646108e-03
  9.7047603e-03 -3.5583067e-03  9.5494064e-03  8.3472609e-04
 -6.3384566e-03 -1.9771170e-03 -7.3770545e-03 -2.9795230e-03
  1.0416972e-03  9.4826873e-03  9.3558477e-03 -6.5958775e-03
  3.4751510e-03  2.2755705e-03 -2.4893521e-03 -9.2291720e-03
  1.0271263e-03 -8.1657059e-03  6.3201892e-03 -5.8000805e-03
  5.5354391e-03  9.8337233e-03 -1.6000033e-04  4.5284927e-03
 -1.8094003e-03  7.3607611e-03  3.9400971e-03 -9.0103243e-03
 -2.3985039e-03  3.6287690e-03 -9.9568366e-05 -1.2012708e-03
 -1.0554385e-03 -1.6716016e-03  6.0495257e-04  4.1650953e-03
 -4.2527914e-03 -3.8336217e-03 -5.2816868e-05  2.6935578e-04
 -1.6880632e-04 -4.7855065e-03  4.3134023e-03 -2.1719194e-03
  2.1035396e-03  6.6652300e-04  5.9696771e-03 -6.8423809e-03
 -6.8157101e-03 -4.4762576e-03  9.4358288e-03 -1.5918827e-03
 -9.4292425e-03 -5.45041

In [14]:
# save model
model.save('model.bin')
# load model
new_model = Word2Vec.load('model.bin')
print(new_model)

Word2Vec<vocab=14, vector_size=100, alpha=0.025>


## Load Google’s Word2Vec Embedding
Training your own word vectors may be the best approach for a given NLP problem. However, it can require a long time, a fast computer with a lot of RAM and disk space, and some expertise in finessing the input data and training algorithm.

An alternative is to simply use an existing pre-trained word embedding. Along with the paper and code for word2vec, Google also published a pre-trained word2vec model on the <a href='https://code.google.com/archive/p/word2vec/'>Word2Vec Google Code Project</a>.

A pre-trained model is nothing more than a file containing tokens and their associated word vectors. The pre-trained Google word2vec model was trained on Google News data (about 100 billion words); it contains 3 million words and phrases and was fit using 300-dimensional word vectors. The compressed file is 1.53 gigabytes and can be downloaded here: <a href='https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing'>GoogleNews-vectors-negative300.bin.gz</a>. Unzipped, the binary file (GoogleNews-vectors-negative300.bin) is 3.4 gigabytes.
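
The unzipped size is roughly what the raw vectors alone would occupy. The back-of-the-envelope calculation below assumes each value is stored as a 4-byte (32-bit) float, which is an assumption not stated above; the stored tokens add a little on top.

```python
num_vectors = 3_000_000   # words and phrases in the model (from the text above)
dimensions = 300          # length of each vector (from the text above)
bytes_per_value = 4       # assumption: 32-bit floats

vector_bytes = num_vectors * dimensions * bytes_per_value
print(f"{vector_bytes / 1e9:.2f} GB (decimal), {vector_bytes / 2**30:.2f} GiB (binary)")
```

This also explains why loading the full model needs several gigabytes of free RAM.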

The Gensim library provides tools to load this file. Specifically, you can call the **KeyedVectors.load_word2vec_format()** function to load this model into memory, for example:

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')
!ln -s /content/gdrive/My\ Drive/ /mydrive


In [None]:
#!ls /mydrive/NITW-NLP

In [None]:
%cd /content
!ls -l
# One-time setup, left commented out after the first run:
# !wget https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing
# !gunzip /mydrive/NITW-NLP/GoogleNews-vectors-negative300.bin.gz
# !cp /content/GoogleNews-vectors-negative300.bin /mydrive/NITW-NLP

In [None]:
from gensim.models import KeyedVectors
filename = '/mydrive/NITW-NLP/GoogleNews-vectors-negative300.bin'
model = KeyedVectors.load_word2vec_format(filename, binary=True)
# It takes about a minute to load this

An interesting thing you can do is a little linear algebra with words. A popular example described in lectures and introductory papers is: queen = (king - man) + woman

That is, the word queen is the closest word to the result of subtracting the notion of man from king and adding the notion of woman. The “man-ness” in king is replaced with “woman-ness” to give us queen. A very cool concept.
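
The arithmetic can be sketched with NumPy and a handful of hand-made vectors. Everything below is hypothetical: the 2-dimensional vectors, the axis meanings, and the helper names `cosine` and `analogy`. It is also simplified, since Gensim normalizes the vectors before combining them, whereas this sketch adds and subtracts them directly.

```python
import numpy as np

# Hand-made 2-d toy vectors (hypothetical): axis 0 ~ "royalty", axis 1 ~ "gender".
vectors = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.2]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative), skipping the inputs."""
    target = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)
    exclude = set(positive) | set(negative)
    scores = {w: cosine(target, v) for w, v in vectors.items() if w not in exclude}
    return max(scores, key=scores.get)

best = analogy(["king", "woman"], ["man"], vectors)
print(best)
```

Excluding the input words matters: without it, the nearest neighbour of the result is often one of the query words themselves.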

Gensim provides an interface for performing these types of operations in the **most_similar()** function on the trained or loaded model. For example:

In [None]:
model['queen']

In [None]:
# calculate: (king - man) + woman = ?
result = model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
print(result)

## Load Stanford’s GloVe Embedding
Stanford researchers also have their own word embedding algorithm, similar to word2vec, called Global Vectors for Word Representation, or GloVe for short. At the time of writing, NLP practitioners generally seemed to prefer GloVe over word2vec based on results.

Like word2vec, the GloVe researchers also provide pre-trained word vectors, in this case, a great selection to choose from.

You can download the GloVe pre-trained word vectors and load them easily with gensim.

The first step is to convert the GloVe file format to the word2vec file format. The only difference is the addition of a small header line. This can be done by calling the glove2word2vec() function. For example:

In [None]:
# !wget http://nlp.stanford.edu/data/glove.6B.zip
# !cp glove.6B.zip /mydrive/NITW-NLP/

In [None]:
!ls -l /mydrive/NITW-NLP/

In [None]:
%cd /content
!unzip /mydrive/NITW-NLP/glove.6B.zip
#!gunzip /mydrive/NITW-NLP/GoogleNews-vectors-negative300.bin.gz

In [None]:
from gensim.scripts.glove2word2vec import glove2word2vec
glove_input_file = 'glove.6B.100d.txt'
word2vec_output_file = 'word2vec.txt'
glove2word2vec(glove_input_file, word2vec_output_file)
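
To show what the conversion amounts to, the sketch below adds the header line by hand in plain Python. The three vectors, the file names, and the temporary directory are made up for illustration; this is not Gensim's implementation, only the same idea on a toy file.

```python
import os
import tempfile

# Three made-up 3-dimensional vectors in GloVe's text layout: "token v1 v2 v3".
glove_lines = [
    "king 0.50 0.70 -0.10",
    "queen 0.52 0.68 0.35",
    "apple -0.40 0.10 0.05",
]

workdir = tempfile.mkdtemp()
glove_path = os.path.join(workdir, "toy_glove.txt")
w2v_path = os.path.join(workdir, "toy_word2vec.txt")
with open(glove_path, "w", encoding="utf8") as f:
    f.write("\n".join(glove_lines) + "\n")

# Work out the header values: number of vectors and their dimensionality.
with open(glove_path, encoding="utf8") as f:
    rows = [line.rstrip("\n") for line in f if line.strip()]
num_vectors = len(rows)
dimensions = len(rows[0].split()) - 1

# word2vec text format = "<num_vectors> <dimensions>" header + the same rows.
with open(w2v_path, "w", encoding="utf8") as f:
    f.write(f"{num_vectors} {dimensions}\n")
    f.write("\n".join(rows) + "\n")

with open(w2v_path, encoding="utf8") as f:
    print(f.readline().strip())
```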

In [None]:
from gensim.models import KeyedVectors
# load the Stanford GloVe model
filename = 'word2vec.txt'
model = KeyedVectors.load_word2vec_format(filename, binary=False)
# calculate: (king - man) + woman = ?
result = model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
print(result)

In [None]:
result = model.most_similar(positive=['sad', 'fun'], negative=['negative'], topn=1)
print(result)


## Visualize Word Embedding
After you learn a word embedding for your text data, it can be nice to explore it with visualization. You can use classical projection methods to reduce the high-dimensional word vectors to two dimensions and plot them on a graph. The visualizations can provide a qualitative diagnostic for your learned model.

We can retrieve all of the vectors from a trained model as follows:

In [None]:
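# `model` must be the toy Word2Vec model trained earlier; retrain it if `model` now holds the loaded GloVe vectors
X = model.wv[model.wv.index_to_key]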

In [None]:
X.shape

We can then train a projection method on the vectors, such as those methods offered in scikit-learn, then use matplotlib to plot the projection as a scatter plot. Let’s look at an example with Principal Component Analysis or PCA.

### Plot Word Vectors Using PCA
We can create a 2-dimensional PCA model of the word vectors using the scikit-learn PCA class as follows.

In [None]:
from gensim.models import Word2Vec
from sklearn.decomposition import PCA
from matplotlib import pyplot

In [None]:
pca = PCA(n_components=2)
result = pca.fit_transform(X)
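
What a 2-component PCA computes can be sketched with NumPy alone: center the data, take a singular value decomposition, and project onto the first two directions. The random matrix `X_demo` below is a stand-in for real word vectors, and the signs of the components may differ from what scikit-learn returns.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(14, 100))  # stand-in for 14 word vectors of length 100

# PCA works on mean-centered data.
X_centered = X_demo - X_demo.mean(axis=0)

# Rows of Vt are the principal directions, ordered by decreasing singular value.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Project every vector onto the first two principal directions.
projected = X_centered @ Vt[:2].T
print(projected.shape)
```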

The resulting projection can be plotted using matplotlib as follows, pulling out the two dimensions as x and y coordinates:

In [None]:
pyplot.scatter(result[:, 0], result[:, 1])

We can go one step further and annotate the points on the graph with the words themselves. A crude version without any nice offsets looks as follows:

In [None]:
pyplot.scatter(result[:, 0], result[:, 1])
words = list(model.wv.index_to_key)
for i, word in enumerate(words):
    pyplot.annotate(word, xy=(result[i, 0], result[i, 1]))
pyplot.show()
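
A slightly less crude version nudges each label away from its point. The sketch below uses the `xytext` and `textcoords` arguments of `annotate()`; the words and coordinates are made up to stand in for the PCA projection, and the 5-point offset is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
from matplotlib import pyplot
import numpy as np

# Made-up 2-d coordinates standing in for the PCA projection.
demo_words = ["sentence", "the", "is", "this"]
demo_points = np.array([[0.0, 0.0], [1.0, 0.5], [-0.5, 1.0], [0.8, -0.7]])

fig, ax = pyplot.subplots()
ax.scatter(demo_points[:, 0], demo_points[:, 1])
for word, (x, y) in zip(demo_words, demo_points):
    # xytext is an offset, in points, from the data location given by xy.
    ax.annotate(word, xy=(x, y), xytext=(5, 5), textcoords="offset points")
fig.canvas.draw()
```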

It is hard to pull much meaning out of the graph, given that such a tiny corpus was used to fit the model.