In [None]:
pip install --upgrade gensim

In [None]:
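from gensim.models import Word2Vec
# sentences: an iterable of tokenized sentences (lists of string tokens)
sentences = ...
model = Word2Vec(sentences)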

Word2vec is one algorithm for learning a word embedding from a text corpus.

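There are two main training algorithms that can be used to learn the embedding from text: continuous bag of words (CBOW), which predicts a target word from its surrounding context words, and skip-gram, which predicts the context words from the target word.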
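To make the difference concrete, here is a minimal sketch of how the two algorithms frame their training examples. It only builds the (input, output) pairs and does not train anything; the helper names and the `window` argument are hypothetical, and the real word2vec implementation adds details such as subsampling and a randomly shrunk window.

```python
def context_window(tokens, i, window):
    """Return the tokens within `window` positions of index i, excluding i."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right

def cbow_examples(tokens, window=2):
    """CBOW: the context words are the input, the target word is the output."""
    return [(context_window(tokens, i, window), target)
            for i, target in enumerate(tokens)]

def skipgram_examples(tokens, window=2):
    """Skip-gram: the target word is the input, each context word is an output."""
    return [(target, context)
            for i, target in enumerate(tokens)
            for context in context_window(tokens, i, window)]

tokens = ['yet', 'another', 'sentence']
print(cbow_examples(tokens, window=1))
print(skipgram_examples(tokens, window=1))
```

CBOW produces one example per position, while skip-gram produces one example per (target, context) pair, so skip-gram sees more examples from the same text.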

The Word2Vec constructor takes many parameters; a few noteworthy arguments you may wish to configure are:

vector_size: (default 100; named size before Gensim 4.0) The number of dimensions of the embedding, i.e. the length of the dense vector that represents each token (word).

window: (default 5) The maximum distance between a target word and words around the target word.

min_count: (default 5) The minimum count of words to consider when training the model; words that occur fewer times than this are ignored.

workers: (default 3) The number of threads to use while training.

sg: (default 0) The training algorithm, either CBOW (0) or skip-gram (1).
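The effect of min_count is easy to underestimate on small corpora. The sketch below imitates the pruning step in plain Python; `build_vocab` is a hypothetical helper written for illustration, not a Gensim function.

```python
from collections import Counter

def build_vocab(sentences, min_count=5):
    """Count every token and keep only those seen at least min_count times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}

sentences = [['this', 'is', 'the', 'first', 'sentence'],
             ['this', 'is', 'the', 'second', 'sentence'],
             ['and', 'the', 'final', 'sentence']]

# With a low threshold, the repeated words survive.
print(build_vocab(sentences, min_count=2))
# With the default threshold of 5, nothing in this tiny corpus survives.
print(build_vocab(sentences, min_count=5))
```

On a toy corpus like this one, the default threshold would leave an empty vocabulary, so a lower value such as min_count=1 is needed.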

After the model is trained, the learned vectors are accessible via the "wv" attribute. This is the actual word vector model against which queries can be made.

For example, you can print the learned vocabulary of tokens (words) as follows:

In [None]:
# Gensim 4.0+; older releases exposed this as model.wv.vocab
words = list(model.wv.key_to_index)
print(words)

You can review the embedded vector for a specific token as follows:

In [None]:
print(model.wv['word'])

Finally, the trained word vectors can be saved to file by calling the save_word2vec_format() function on the word vector model.

The function writes a plain-text file by default; pass binary=True to save in a binary format, which saves space. For example:

In [None]:
model.wv.save_word2vec_format('model.bin', binary=True)

When getting started, you can save the learned model in ASCII format and review the contents.

You can do this by setting binary=False when calling the save_word2vec_format() function, for example:

In [None]:
model.wv.save_word2vec_format('model.txt', binary=False)
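The text format is simple enough to read by hand: a header line with the vocabulary size and the vector length, then one line per word holding the token followed by its values. The sketch below parses a tiny file in that layout; the vectors are made up for illustration, and `read_word2vec_text` is a hypothetical helper, not part of Gensim.

```python
import os
import tempfile

def read_word2vec_text(path):
    """Parse a word2vec text file into a {word: [floats]} dictionary."""
    with open(path, encoding='utf-8') as f:
        # Header line: "<vocabulary size> <vector length>"
        vocab_size, dims = (int(x) for x in f.readline().split())
        vectors = {}
        for line in f:
            parts = line.rstrip().split(' ')
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    if len(vectors) != vocab_size:
        raise ValueError('header vocabulary size does not match the file')
    if any(len(v) != dims for v in vectors.values()):
        raise ValueError('header vector length does not match the file')
    return vectors

# Made-up 3-dimensional vectors, only to show the layout.
sample = "2 3\nsentence 0.1 -0.2 0.3\nthe 0.4 0.5 -0.6\n"
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_model.txt')
    with open(path, 'w', encoding='utf-8') as f:
        f.write(sample)
    vectors = read_word2vec_text(path)

print(vectors['sentence'])
```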

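Vectors saved this way can be loaded again by calling the KeyedVectors.load_word2vec_format() function. (Word2Vec.load() only reads full models written with model.save().) For example:

In [None]:
from gensim.models import KeyedVectors
word_vectors = KeyedVectors.load_word2vec_format('model.txt', binary=False)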

In [None]:
from gensim.models import Word2Vec
# define training data
sentences = [['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
             ['this', 'is', 'the', 'second', 'sentence'],
             ['yet', 'another', 'sentence'],
             ['one', 'more', 'sentence'],
             ['and', 'the', 'final', 'sentence']]
# train model
model = Word2Vec(sentences, min_count=1)
# summarize the loaded model
print(model)
# summarize vocabulary
words = list(model.wv.key_to_index)  # model.wv.vocab before Gensim 4.0
print(words)
# access vector for one word
print(model.wv['sentence'])
# save model
model.save('model.bin')
# load model
new_model = Word2Vec.load('model.bin')
print(new_model)

Along with the paper and code for word2vec, Google also published a pre-trained word2vec model on the Word2Vec Google Code Project.

# Loading Google's Pre-trained Model

A pre-trained model is nothing more than a file containing tokens and their associated word vectors. The pre-trained Google word2vec model was trained on Google News data (about 100 billion words); it contains 3 million words and phrases and was fit using 300-dimensional word vectors.

It is a 1.53 gigabyte file named GoogleNews-vectors-negative300.bin.gz. You can download it from the word2vec Google Code archive:

https://code.google.com/archive/p/word2vec/

Unzipped, the binary file (GoogleNews-vectors-negative300.bin) is 3.4 gigabytes.

The Gensim library provides tools to load this file. Specifically, you can call the KeyedVectors.load_word2vec_format() function to load this model into memory, for example:

In [None]:
from gensim.models import KeyedVectors
filename = 'GoogleNews-vectors-negative300.bin'
model = KeyedVectors.load_word2vec_format(filename, binary=True)

In [None]:
from gensim.models import KeyedVectors
# load the google word2vec model
filename = 'GoogleNews-vectors-negative300.bin'
model = KeyedVectors.load_word2vec_format(filename, binary=True)
# calculate: (king - man) + woman = ?
result = model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
print(result)

In [None]:
#expected answer
[('queen', 0.7118192315101624)]
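The query can be read as vector arithmetic followed by a nearest-neighbour search by cosine similarity. The sketch below reproduces that idea on made-up 4-dimensional toy vectors; it is a simplified illustration under those assumptions, not Gensim's implementation, and the scores it produces have no relation to the Google News model.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(unit(a), unit(b)))

def most_similar(vectors, positive, negative, topn=1):
    """Add the positive vectors, subtract the negative ones, rank the rest."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    # The input words themselves are excluded from the candidates.
    inputs = set(positive) | set(negative)
    scored = [(word, cosine(query, vec))
              for word, vec in vectors.items() if word not in inputs]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical toy vectors: dimensions loosely mean royalty, male, female, fruit.
toy_vectors = {
    'king':  [1.0, 1.0, 0.0, 0.0],
    'queen': [1.0, 0.0, 1.0, 0.0],
    'man':   [0.0, 1.0, 0.0, 0.0],
    'woman': [0.0, 0.0, 1.0, 0.0],
    'apple': [0.0, 0.0, 0.0, 1.0],
}

result = most_similar(toy_vectors, positive=['woman', 'king'], negative=['man'])
print(result)
```

With these toy vectors, 'queen' ranks first because it is the remaining word whose direction is closest to king - man + woman.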

# Load Stanford’s GloVe Embedding

Gensim can also load pre-trained GloVe vectors. The first step is to convert the GloVe file format to the word2vec file format. The only difference is the addition of a small header line. This can be done by calling the glove2word2vec() function. For example:

In [None]:
from gensim.scripts.glove2word2vec import glove2word2vec
glove_input_file = 'glove.txt'
word2vec_output_file = 'word2vec.txt'
glove2word2vec(glove_input_file, word2vec_output_file)
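To show what the conversion amounts to, the sketch below prepends the header line (vector count and vector length) to a GloVe-style text file in plain Python. `add_word2vec_header` is a hypothetical helper and the two toy vectors are made up; it assumes a non-empty, space-separated input file. Recent Gensim releases (4.0+) also document a no_header=True option on load_word2vec_format() for reading such files directly, which is worth checking against your installed version.

```python
import os
import shutil
import tempfile

def add_word2vec_header(glove_path, output_path):
    """Copy a GloVe text file, prefixed with a '<count> <dims>' header line."""
    # First pass: count the vectors and infer their length from line one.
    with open(glove_path, encoding='utf-8') as src:
        first = src.readline()
        num_dims = len(first.split()) - 1  # first field is the token
        num_vectors = 1 + sum(1 for _ in src)
    # Second pass: write the header, then stream the original content.
    with open(glove_path, encoding='utf-8') as src, \
            open(output_path, 'w', encoding='utf-8') as dst:
        dst.write(f"{num_vectors} {num_dims}\n")
        shutil.copyfileobj(src, dst)
    return num_vectors, num_dims

with tempfile.TemporaryDirectory() as tmp:
    glove_path = os.path.join(tmp, 'toy_glove.txt')
    output_path = os.path.join(tmp, 'toy_word2vec.txt')
    with open(glove_path, 'w', encoding='utf-8') as f:
        f.write("the 0.1 0.2\ncat 0.3 0.4\n")
    shape = add_word2vec_header(glove_path, output_path)
    with open(output_path, encoding='utf-8') as f:
        converted = f.read()

print(shape)
print(converted)
```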

You can download the smallest GloVe pre-trained model from the GloVe website. It is an 822 megabyte zip file with 4 different models (50, 100, 200 and 300-dimensional vectors) trained on Wikipedia data with 6 billion tokens and a 400,000 word vocabulary.
Other pre-trained GloVe models are also available, trained on 42 billion and 840 billion tokens with 300-dimensional vectors.
https://nlp.stanford.edu/projects/glove/

In [None]:
from gensim.scripts.glove2word2vec import glove2word2vec
glove_input_file = 'glove.6B.100d.txt'
word2vec_output_file = 'glove.6B.100d.txt.word2vec'
glove2word2vec(glove_input_file, word2vec_output_file)

In [None]:
from gensim.models import KeyedVectors
# load the Stanford GloVe model
filename = 'glove.6B.100d.txt.word2vec'
model = KeyedVectors.load_word2vec_format(filename, binary=False)
# calculate: (king - man) + woman = ?
result = model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
print(result)