# Using Sent2Vec via Gensim

This tutorial shows how to use the Sent2Vec model in Gensim. We'll train sentence-embedding models, save and load them, and perform similarity operations.

# What is Sent2Vec?

Sent2Vec produces numerical representations (features) for short texts or sentences, which can be used as input to any downstream machine learning task. Think of it as an unsupervised version of FastText and an extension of word2vec (CBOW) to sentences. The method trains distributed representations of sentences with a simple but efficient unsupervised objective. The authors report that it outperforms state-of-the-art unsupervised models on most benchmark tasks and even beats supervised models on many tasks, which highlights the robustness of the resulting sentence embeddings. See the [paper](https://arxiv.org/abs/1703.02507) for more details.

The sentence embedding is defined as the average of the source word embeddings of its constituent words. The model additionally learns source embeddings for the word n-grams present in each sentence, not only for unigrams, and averages the n-gram embeddings together with the word embeddings.
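
The averaging described above can be sketched in a few lines of plain Python. This is a toy illustration, not the Gensim implementation: the embedding table `toy_embeddings` and the helpers `word_ngrams` and `sentence_embedding` are made up for the example, and n-grams are looked up by their string form, which is a simplification.

```python
# Toy illustration of a Sent2Vec-style sentence embedding: the average of the
# source embeddings of all unigrams and word n-grams found in the sentence.
# The 2-dimensional vectors below are invented for the example.

def word_ngrams(tokens, max_n=2):
    """Return all contiguous word n-grams of length 1..max_n, joined by a space."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[i:i + n]))
    return grams


def sentence_embedding(tokens, embeddings, max_n=2):
    """Average the embeddings of every known unigram and n-gram in `tokens`."""
    vectors = [embeddings[g] for g in word_ngrams(tokens, max_n) if g in embeddings]
    if not vectors:
        return None  # no known unigram or n-gram in the sentence
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


toy_embeddings = {
    'strong': [1.0, 0.0],
    'winds': [0.0, 1.0],
    'strong winds': [2.0, 2.0],
}

print(sentence_embedding(['strong', 'winds'], toy_embeddings))  # [1.0, 1.0]
```

The bigram `'strong winds'` contributes its own vector to the average, so the result differs from the plain mean of the two word vectors.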

# Training model

For this tutorial, we'll be training our model using the Lee Background Corpus included in gensim. This corpus contains 314 documents selected from the Australian Broadcasting Corporation’s news mail service, which provides text e-mails of headline stories and covers a number of broad topics.

In [1]:
import gensim
import os
from gensim.models.word2vec import LineSentence
from gensim.models.sent2vec import Sent2Vec
from gensim.utils import tokenize
import re
import time
import smart_open
import logging
import sys
logging.basicConfig(stream=sys.stdout, level=logging.INFO)

# Prepare training data

In [2]:
from gensim.test.utils import datapath

lee_train_file = datapath('lee_background.cor')
lee_data = []
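with smart_open.smart_open(lee_train_file) as f1:
    for line in f1:
        # Skip blank lines
        if line not in ['\n', '\r\n']:
            # Split each document into sentences on '.', '?' and newlines
            sentences = re.split(r'\.|\?|\n', line.strip())
            for sentence in sentences:
                if len(sentence) > 1:
                    # Tokenize and store the sentence as a list of tokens
                    lee_data.append(list(tokenize(sentence)))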

INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(12 unique tokens: [u'minors', u'graph', u'system', u'trees', u'eps']...) from 9 documents (total 29 corpus positions)
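
The cell above loads every sentence into memory, which is fine for a corpus of this size. For larger corpora, the same preprocessing could be wrapped in an iterable that reads from disk on each pass. The sketch below uses only the standard library; `SentenceStream` is a hypothetical helper, and its regex tokenizer is a simplified stand-in for `gensim.utils.tokenize`, so token boundaries may differ slightly from the output shown in this tutorial.

```python
import io
import os
import re
import tempfile


class SentenceStream(object):
    """Hypothetical helper: lazily yield tokenized sentences from a text file.

    Each line is split on '.', '?' and newlines, as in the cell above, and
    runs of letters are kept as tokens (digits and punctuation are dropped).
    """

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Re-open the file on every pass so the corpus can be iterated
        # repeatedly, for example once per training epoch.
        with io.open(self.path, encoding='utf-8') as handle:
            for line in handle:
                for sentence in re.split(r'\.|\?|\n', line.strip()):
                    tokens = re.findall(r'[^\W\d_]+', sentence, re.UNICODE)
                    if tokens:
                        yield tokens


# Small demonstration on a temporary file.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('Hundreds of people left. Why? The Hume Highway closed at 3 pm.\n')

for tokens in SentenceStream(tmp.name):
    print(tokens)

os.remove(tmp.name)
```

Because `__iter__` opens the file anew each time, the object can be consumed more than once, unlike a plain generator.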


In [3]:
# Print sample training data
for sentence in lee_data[:5]:
    print sentence,'\n'

[u'Hundreds', u'of', u'people', u'have', u'been', u'forced', u'to', u'vacate', u'their', u'homes', u'in', u'the', u'Southern', u'Highlands', u'of', u'New', u'South', u'Wales', u'as', u'strong', u'winds', u'today', u'pushed', u'a', u'huge', u'bushfire', u'towards', u'the', u'town', u'of', u'Hill', u'Top'] 

[u'A', u'new', u'blaze', u'near', u'Goulburn', u'south', u'west', u'of', u'Sydney', u'has', u'forced', u'the', u'closure', u'of', u'the', u'Hume', u'Highway'] 

[u'At', u'about', u'pm', u'AEDT', u'a', u'marked', u'deterioration', u'in', u'the', u'weather', u'as', u'a', u'storm', u'cell', u'moved', u'east', u'across', u'the', u'Blue', u'Mountains', u'forced', u'authorities', u'to', u'make', u'a', u'decision', u'to', u'evacuate', u'people', u'from', u'homes', u'in', u'outlying', u'streets', u'at', u'Hill', u'Top', u'in', u'the', u'New', u'South', u'Wales', u'southern', u'highlands'] 

[u'An', u'estimated', u'residents', u'have', u'left', u'their', u'homes', u'for', u'nearby', u'Mittago

# Using the Gensim implementation of Sent2Vec

In [4]:
# Train new sent2vec model
%time sent2vec_model = Sent2Vec(lee_data, size=100, epochs=20, seed=42, workers=4, compute_loss=True)

INFO:gensim.models.sent2vec:Creating dictionary...
INFO:gensim.models.sent2vec:Read 0.06 M words
INFO:gensim.models.sent2vec:Dictionary created, dictionary size: 1531, tokens read: 60302
INFO:gensim.models.sent2vec:training model with 4 workers on 1531 vocabulary and 100 features, using sample=0.0001 negative=10
INFO:gensim.models.base_any2vec:worker thread finished; awaiting finish of 3 more threads
INFO:gensim.models.base_any2vec:worker thread finished; awaiting finish of 2 more threads
INFO:gensim.models.base_any2vec:worker thread finished; awaiting finish of 1 more threads
INFO:gensim.models.base_any2vec:worker thread finished; awaiting finish of 0 more threads
INFO:gensim.models.base_any2vec:EPOCH - 1 : training on 49186 raw words (12893 effective words) took 0.6s, 20504 effective words/s
INFO:gensim.models.base_any2vec:worker thread finished; awaiting finish of 3 more threads
INFO:gensim.models.base_any2vec:worker thread finished; awaiting finish of 2 more threads
INFO:gensim.mod

INFO:gensim.models.base_any2vec:worker thread finished; awaiting finish of 2 more threads
INFO:gensim.models.base_any2vec:worker thread finished; awaiting finish of 1 more threads
INFO:gensim.models.base_any2vec:worker thread finished; awaiting finish of 0 more threads
INFO:gensim.models.base_any2vec:EPOCH - 17 : training on 49186 raw words (16707 effective words) took 0.6s, 27611 effective words/s
INFO:gensim.models.base_any2vec:worker thread finished; awaiting finish of 3 more threads
INFO:gensim.models.base_any2vec:worker thread finished; awaiting finish of 2 more threads
INFO:gensim.models.base_any2vec:worker thread finished; awaiting finish of 1 more threads
INFO:gensim.models.base_any2vec:worker thread finished; awaiting finish of 0 more threads
INFO:gensim.models.base_any2vec:EPOCH - 18 : training on 49186 raw words (26538 effective words) took 0.6s, 42971 effective words/s
INFO:gensim.models.base_any2vec:worker thread finished; awaiting finish of 3 more threads
INFO:gensim.mode

# Training hyperparameters

Sent2Vec supports the following parameters:

- sentences : Iterable of tokenized sentences, each a list of tokens. For larger corpora (like the Toronto corpus), consider an iterable that streams the sentences directly from disk/network.
- size : Dimensionality of the feature vectors. Default is 100.
- lr : Initial learning rate. Default is 0.2.
- seed : Seed for the random number generator, for reproducibility. Default is 42.
- min_count : Ignore all words with total frequency lower than this. Default is 5.
- max_vocab_size : Limit RAM during vocabulary building; if there are more unique words than this, then prune the infrequent ones. Every 10 million word types need about 1GB of RAM. Default is 30000000.
- t : Threshold for configuring which higher-frequency words are randomly downsampled; default is 1e-4 (reported as sample=0.0001 in the training log), useful range is (0, 1e-5).
- loss_type : Default is 'ns', negative sampling will be used.
- neg : Specifies how many "noise words" should be drawn (usually between 5-20). Default is 10.
- epochs : Number of iterations (epochs) over the corpus. Default is 5.
- lr_update_rate : Change the rate of updates for the learning rate. Default is 100.
- word_ngrams : Max length of word ngram. Default is 2.
- bucket : Number of hash buckets for vocabulary. Default is 2000000.
- minn : Min length of char ngrams. Default is 3.
- maxn : Max length of char ngrams. Default is 6.
- dropout_k : Number of ngrams dropped when training a sent2vec model. Default is 2.
- batch_words : Target size (in words) for batches of examples passed to worker threads (and thus cython routines). Default is 10000. (Larger batches will be passed if individual texts are longer than 10000 words, but the standard cython code truncates to that maximum.)
- workers : Use this many worker threads to train the model (=faster training with multicore machines). Default is 3.
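
To get a feel for the `t` threshold, the sketch below computes how likely a single occurrence of a word is to survive downsampling. It assumes the word2vec/fastText-style rule, in which a word with relative frequency `f` is kept with probability `sqrt(t / f) + t / f`, capped at 1. Whether this implementation uses exactly that rule is an assumption, and the word counts are invented for illustration. Only the total of 60302 tokens comes from the training log.

```python
import math


def keep_probability(word_count, total_count, t=1e-4):
    """Probability of keeping one occurrence of a word during training.

    Assumes the word2vec/fastText-style rule sqrt(t / f) + t / f, capped at 1,
    where f is the word's relative frequency in the corpus.
    """
    f = float(word_count) / total_count
    return min(1.0, math.sqrt(t / f) + t / f)


total = 60302  # tokens read, as reported in the training log
# Hypothetical counts for a very frequent, a moderately frequent and a rare word.
for word, count in [('the', 3000), ('bushfire', 30), ('Goulburn', 5)]:
    print('%-10s keep probability %.3f' % (word, keep_probability(count, total)))
```

Under this rule, lowering `t` discards more occurrences of frequent words, while words whose relative frequency is at or below `t` are always kept.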

In [5]:
# Print sentence vector
sent2vec_model[['This', 'is', 'an', 'awesome', 'gift']]

array([ 0.64343165,  0.53788457,  0.45045869,  0.48838587,  0.29893824,
        0.45747082,  0.63852038,  0.65869833,  0.52699917,  0.5857261 ,
        0.47132542,  0.58043349,  0.47777968,  0.60814609,  0.49833449,
        0.38110881,  0.69858303,  0.66416424,  0.81931443,  0.4155114 ,
        0.57073023,  0.55380497,  0.28195541,  0.62358099,  0.52095577,
        0.51947797,  0.56886435,  0.58877376,  0.57949602,  0.47210854,
        0.4432609 ,  0.57589434,  0.4530839 ,  0.41110785,  0.48628275,
        0.54942964,  0.64731411,  0.42769373,  0.29451166,  0.64898617,
        0.68523008,  0.45938408,  0.63909112,  0.21243462,  0.29584199,
        0.73011848,  0.40396956,  0.54390835,  0.30489099,  0.35600193,
        0.46076595,  0.3620182 ,  0.39831532,  0.45431657,  0.81282109,
        0.49062278,  0.45564559,  0.6451843 ,  0.45339406,  0.54580863,
        0.30856791,  0.34208836,  0.47415894,  0.47500173,  0.43383262,
        0.5011096 ,  0.52924484,  0.38580272,  0.42339968,  0.43

In [6]:
# Print cosine similarity between two sentences
sent2vec_model.similarity(['This', 'is', 'an', 'awesome', 'gift'], ['This', 'present', 'is', 'great'])

0.96293012708085046
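
The similarity score is the cosine of the angle between the two sentence vectors. A minimal standalone sketch of that computation follows. The `cosine_similarity` helper is written for this example and is not the library's own code.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError('cosine similarity is undefined for a zero vector')
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors: 0.0
print(cosine_similarity([3.0, 4.0], [6.0, 8.0]))  # same direction: 1.0
```

The score depends only on the direction of the vectors, not on their length, so scaling a sentence vector leaves its similarity to other sentences unchanged.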

# Saving and loading models

Models can be saved and loaded via the load and save methods.

In [7]:
# Save trained sent2vec model
sent2vec_model.save('s2v1')

INFO:gensim.utils:saving Sent2Vec object under s2v1, separately None
INFO:gensim.utils:storing np array 'wi' to s2v1.wi.npy
INFO:gensim.utils:saved s2v1


In [8]:
# Load pretrained sent2vec model
loaded_model = Sent2Vec.load('s2v1')

INFO:gensim.utils:loading Sent2Vec object from s2v1
INFO:gensim.utils:loading wi from s2v1.wi.npy with mmap=None
INFO:gensim.utils:loaded s2v1
