# Using Sent2Vec via Gensim

This tutorial shows how to use the sent2vec model in Gensim. We'll train sentence-embedding models with the sent2vec library, save and load them, and perform similarity operations. The notebook also compares the Gensim implementation with the [original C++ implementation](https://github.com/epfml/sent2vec).

# What is Sent2Vec?

Sent2Vec delivers numerical representations (features) for short texts or sentences, which can be used as input to any downstream machine learning task. Think of it as an unsupervised version of FastText, and an extension of word2vec (CBOW) to sentences. The method uses a simple but efficient unsupervised objective to train distributed representations of sentences. According to the [paper](https://arxiv.org/abs/1703.02507), the algorithm outperforms state-of-the-art unsupervised models on most benchmark tasks and even beats supervised models on many tasks, which highlights the robustness of the produced sentence embeddings.

The sentence embedding is defined as the average of the source word embeddings of its constituent words. The model is further augmented by learning source embeddings not only for unigrams but also for the n-grams present in each sentence, and averaging the n-gram embeddings along with the word embeddings.
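
The averaging step described above can be sketched with toy data. This is only an illustration of the definition, not the Gensim or C++ code: the helper names `word_ngrams` and `sentence_embedding` and the 2-dimensional embeddings are invented here, and the real model learns its embeddings during training rather than looking them up in a hand-written dictionary.

```python
import numpy as np

def word_ngrams(tokens, max_n=2):
    """Return all contiguous word n-grams of length 1..max_n, joined by spaces."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[i:i + n]))
    return grams

def sentence_embedding(tokens, source_embeddings, max_n=2):
    """Average the source embeddings of every known unigram and n-gram."""
    grams = [g for g in word_ngrams(tokens, max_n) if g in source_embeddings]
    if not grams:
        raise ValueError("no known unigrams or n-grams in sentence")
    return np.mean([source_embeddings[g] for g in grams], axis=0)

# Toy 2-dimensional source embeddings, invented for illustration only.
toy_embeddings = {
    'blue': np.array([1.0, 0.0]),
    'sky': np.array([0.0, 1.0]),
    'blue sky': np.array([2.0, 2.0]),
}
vector = sentence_embedding(['blue', 'sky'], toy_embeddings)
print(vector)
```

Here the two unigrams and the single bigram contribute equally, so the sentence vector is the mean of three embeddings.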

In [1]:
import gensim
import os
from gensim.models.word2vec import LineSentence
from gensim.models.sent2vec import Sent2Vec as s2v
from gensim.utils import tokenize
import scipy.stats  # pearsonr is in the stats submodule, which a bare "import scipy" may not load
import re
from numpy import dot
from gensim import matutils
import time
import numpy as np

Using TensorFlow backend.


# Training models

For the following examples, we'll use the Lee Corpus (which you already have if you've installed gensim) for training our model.

In [2]:
# Prepare training data
data_dir = os.path.join(gensim.__path__[0], 'test', 'test_data')
lee_train_file = os.path.join(data_dir, 'lee_background.cor')
lee_data = []
with open(lee_train_file) as f1, open("./input.txt",'w') as f2:
    for line in f1:
        if line not in ['\n', '\r\n']:
            # Split the line into sentences at '.', '?' and newlines
            line = re.split(r'\.|\?|\n', line.strip())
            for sentence in line:
                if len(sentence) > 1:
                    sentence = tokenize(sentence)
                    lee_data.append(list(sentence))
                    f2.write(' '.join(lee_data[-1]) + '\n')

In [3]:
# Print sample training data
for sentence in lee_data[:5]:
    print(str(sentence) + '\n')

[u'Hundreds', u'of', u'people', u'have', u'been', u'forced', u'to', u'vacate', u'their', u'homes', u'in', u'the', u'Southern', u'Highlands', u'of', u'New', u'South', u'Wales', u'as', u'strong', u'winds', u'today', u'pushed', u'a', u'huge', u'bushfire', u'towards', u'the', u'town', u'of', u'Hill', u'Top'] 

[u'A', u'new', u'blaze', u'near', u'Goulburn', u'south', u'west', u'of', u'Sydney', u'has', u'forced', u'the', u'closure', u'of', u'the', u'Hume', u'Highway'] 

[u'At', u'about', u'pm', u'AEDT', u'a', u'marked', u'deterioration', u'in', u'the', u'weather', u'as', u'a', u'storm', u'cell', u'moved', u'east', u'across', u'the', u'Blue', u'Mountains', u'forced', u'authorities', u'to', u'make', u'a', u'decision', u'to', u'evacuate', u'people', u'from', u'homes', u'in', u'outlying', u'streets', u'at', u'Hill', u'Top', u'in', u'the', u'New', u'South', u'Wales', u'southern', u'highlands'] 

[u'An', u'estimated', u'residents', u'have', u'left', u'their', u'homes', u'for', u'nearby', u'Mittago

# Using gensim implementation of sent2vec

In [4]:
# Train new sent2vec model
sent2vec_model = s2v(vector_size=100, epochs=20)
sent2vec_model.train(lee_data)

Creating dictionary...
Read 0.060302M words
Dictionary created, dictionary size: 1307 ,tokens read: 60302
Initializing model...
Training...
Begin epoch 0 :
Progress:  3.96330138304 % lr:  0.192073397234  loss:  3.79220661775
Begin epoch 1 :
Progress:  7.93016815363 % lr:  0.184139663693  loss:  3.6578121557
Begin epoch 2 :
Progress:  11.8970349242 % lr:  0.176205930152  loss:  3.55657217233
Begin epoch 3 :
Progress:  15.8639016948 % lr:  0.16827219661  loss:  3.46168299212
Begin epoch 4 :
Progress:  19.8307684654 % lr:  0.160338463069  loss:  3.37070538195
Begin epoch 5 :
Progress:  23.797635236 % lr:  0.152404729528  loss:  3.28155531049
Begin epoch 6 :
Progress:  27.7645020066 % lr:  0.144470995987  loss:  3.19723932184
Begin epoch 7 :
Progress:  31.7313687772 % lr:  0.136537262446  loss:  3.11839643823
Begin epoch 8 :
Progress:  35.6982355477 % lr:  0.128603528905  loss:  3.04488396391
Begin epoch 9 :
Progress:  39.6651023183 % lr:  0.120669795363  loss:  2.97539395926
Begin epoch 1

# Training hyperparameters

Sent2Vec supports the following parameters:

 - vector_size: Size of embeddings to be learnt (Default 100)
 - alpha: Initial learning rate (Default 0.2)
 - min_count: Ignore words with number of occurrences below this (Default 5)
 - loss: Training objective. Allowed values: `ns` (Default `ns`)
 - negative: Number of negative words to sample, for `ns` (Default 10)
 - epochs: Number of epochs (Default 5)
 - bucket: Number of hash buckets for vocabulary (Default 2000000)
 - lr_update_rate: Change the rate of updates for the learning rate (Default 100)
 - t: Sampling threshold (Default 0.0001)
 - dropoutk: Number of ngrams dropped when training a sent2vec model (Default 2)
 - word_ngrams: Max length of word ngram (Default 2)
 - min_n: Min length of char ngrams (Default 3)
 - max_n: Max length of char ngrams (Default 6)
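
The `lr` values printed in the training log above are consistent with a learning rate that decays linearly from `alpha` towards zero as training progresses. That is an inference from the logged numbers, not a statement about the implementation, and `decayed_lr` is a hypothetical helper written only to check it:

```python
def decayed_lr(alpha, progress_percent):
    """Linearly decay the learning rate from alpha towards 0 over training."""
    return alpha * (1.0 - progress_percent / 100.0)

# (progress %, logged lr) pairs copied from the training log above.
logged = [
    (3.96330138304, 0.192073397234),
    (7.93016815363, 0.184139663693),
    (39.6651023183, 0.120669795363),
]
for progress, lr in logged:
    # The two columns should agree if the decay is linear.
    print("%.9f %.9f" % (decayed_lr(0.2, progress), lr))
```

With the default `alpha` of 0.2, the computed and logged values agree to the printed precision.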

In [5]:
# Print sentence vector
sent2vec_model.sentence_vectors("This is an awesome gift.")

array([ 0.68231279,  0.27833666,  0.16755685, -0.42549644,  0.44356017,
        0.03805984,  0.11425159, -0.18843327, -0.35050581,  0.37883625,
       -0.04714561,  0.12688964,  0.03892192,  0.50464108,  0.41076339,
        0.46617015,  0.43055977,  0.63853766,  0.29246627, -0.41670265,
       -0.20256535,  0.2954225 , -0.59524875,  0.3225076 ,  0.83679135,
       -0.44026166,  0.53004422,  0.36827652,  0.46622952, -0.41337579,
        0.23997068,  0.0134885 ,  0.19703447,  0.16905064,  0.40177966,
        0.41733581, -0.08684079, -0.07431549,  0.56583327,  0.3994801 ,
       -0.19969126,  0.39581204,  0.14362896, -0.75979071,  0.38220697,
       -0.08246933,  0.65596725,  1.07621031, -0.971896  ,  0.03194186,
        0.25659506,  0.18591891,  0.22224946,  0.02544531,  0.99685255,
        0.12317446,  0.53159963,  0.33656909, -0.11080538,  0.44820802,
        0.28795333,  0.19592373,  0.26167464,  0.29693983,  0.98503746,
        0.20594575, -0.51032157,  0.25811482,  0.39786814,  0.77

In [7]:
# Print cosine similarity between two sentences
print(sent2vec_model.similarity("The sky is blue.", "I am going to a party."))
print(sent2vec_model.similarity("This is an awesome gift.", "This present is great."))

0.0819638686729
0.792567220458
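
The scores above are cosine similarities between two sentence vectors. As a rough sketch of that measure only (not the library's code), the following uses plain NumPy; `cosine_similarity` is a hypothetical helper name, and returning 0.0 for a zero-length vector is a choice made here for convenience.

```python
import numpy as np

def cosine_similarity(vec1, vec2):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    vec1 = np.asarray(vec1, dtype=float)
    vec2 = np.asarray(vec2, dtype=float)
    norm_product = np.linalg.norm(vec1) * np.linalg.norm(vec2)
    if norm_product == 0.0:
        return 0.0
    return float(np.dot(vec1, vec2) / norm_product)

print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))   # parallel vectors
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal vectors
```

Values near 1 mean the vectors point in nearly the same direction, which is why the two sentences about a gift score much higher than the unrelated pair.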


# Saving and loading models

Models can be saved and loaded via the `save` and `load` methods.

In [8]:
# Save trained sent2vec model
sent2vec_model.save('sent2vec1')

In [5]:
# Load pretrained sent2vec model
loaded_model = s2v.load('sent2vec1')

# Unsupervised similarity evaluation

Unsupervised evaluation of the learnt sentence embeddings is performed by computing the cosine similarity of sentence pairs from the [SICK 2014](http://clic.cimec.unitn.it/composes/sick.html) dataset. These similarity scores are compared to the gold-standard human judgements using [Pearson’s correlation](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) scores. The SICK dataset consists of about 10,000 sentence pairs, each with a relatedness score.
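
To make the metric concrete, here is a minimal pure-Python sketch of Pearson's correlation. The notebook itself relies on `scipy.stats.pearsonr`; the `pearson` helper and the toy numbers below are invented for illustration and are not SICK data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Toy cosine similarities and toy gold relatedness scores (illustrative only).
cosines = [0.9, 0.4, 0.7, 0.1]
gold = [4.5, 3.2, 4.7, 1.0]
print(pearson(cosines, gold))
```

Because the coefficient is unchanged by shifting or positively rescaling either sequence, cosine similarities can be compared directly with relatedness scores that live on a different scale.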

In [6]:
# Prepare evaluation data
train_sick = []
test_sick = []
trial_sick = []
with open("./SICK/SICK.txt") as f, open('./train_sick.txt', 'w') as f1, open('./test_sick.txt', 'w') as f2, open('./trial_sick.txt', 'w') as f3:
    for line in f:
        tokens = line.strip().split('\t')
        if tokens[0].isdigit():
            if tokens[11] == 'TRAIN':
                train_sick.append((tokens[1], tokens[2], tokens[4]))
                f1.write(tokens[1] + '\n' + tokens[2] + '\n')
            elif tokens[11] == 'TEST':
                test_sick.append((tokens[1], tokens[2], tokens[4]))
                f2.write(tokens[1] + '\n' + tokens[2] + '\n')
            else:
                trial_sick.append((tokens[1], tokens[2], tokens[4]))
                f3.write(tokens[1] + '\n' + tokens[2] + '\n')

In [7]:
# Print sample evaluation data
# Evaluation data is of the form: (sentence1, sentence2, similarity_score)
for example in train_sick[:5]:
    print(example)

('A group of kids is playing in a yard and an old man is standing in the background', 'A group of boys in a yard is playing and a man is standing in the background', '4.5')
('A group of children is playing in the house and there is no man standing in the background', 'A group of kids is playing in a yard and an old man is standing in the background', '3.2')
('The young boys are playing outdoors and the man is smiling nearby', 'The kids are playing outdoors near a man with a smile', '4.7')
('The kids are playing outdoors near a man with a smile', 'A group of kids is playing in a yard and an old man is standing in the background', '3.4')
('The young boys are playing outdoors and the man is smiling nearby', 'A group of kids is playing in a yard and an old man is standing in the background', '3.7')


In [8]:
# Calculate the Pearson correlation score for the gensim implementation of sent2vec
def pearson_score_gensim(input_):
    input_cosine = []
    input_sick = []
    for example in input_:
        input_cosine.append(loaded_model.similarity(example[0], example[1]))
        input_sick.append(float(example[2]))
    return scipy.stats.pearsonr(input_cosine, input_sick), input_sick, input_cosine

In [9]:
train_score, train_sick_score, train_cosine = pearson_score_gensim(train_sick)
test_score, test_sick_score, test_cosine = pearson_score_gensim(test_sick)
trial_score, trial_sick_score, trial_cosine = pearson_score_gensim(trial_sick)
print("%s %s %s" % (train_score[0], test_score[0], trial_score[0]))

0.374514112341 0.385498607302 0.388873105103


# Comparison with original C++ implementation of sent2vec

To build and train the C++ implementation of sent2vec, use the following commands. The build produces object files for all the classes as well as the main binary `fasttext`, which the commands below invoke.

In [9]:
# Train model using original c++ implementation of sent2vec
! git clone https://github.com/epfml/sent2vec.git
%cd sent2vec
! make
start_time = time.time()
! ./fasttext sent2vec -input ../input.txt -output my_model -minCount 5 -dim 100 -epoch 20 -lr 0.2 -wordNgrams 2 -loss ns -neg 10 -thread 20 -t 0.0001 -dropoutK 2 -bucket 2000000
print("\n\nTotal training time: %s seconds" % (time.time() - start_time))

Read 0M words
Number of words:  1837
Number of labels: 0
Progress: 100.0%  words/sec/thread: 27019  lr: 0.000000  loss: 3.113516  eta: 0h0m


Total training time: 26.651419878 seconds


In [11]:
# Get sentence vectors from trained sent2vec model
! ./fasttext print-sentence-vectors my_model.bin < ../train_sick.txt > train_output.txt
! ./fasttext print-sentence-vectors my_model.bin < ../test_sick.txt > test_output.txt
! ./fasttext print-sentence-vectors my_model.bin < ../trial_sick.txt > trial_output.txt

In [10]:
def similarity(sent1, sent2):
    return dot(matutils.unitvec(sent1), matutils.unitvec(sent2))

In [11]:
# Calculate the Pearson correlation score for the original C++ implementation of sent2vec
def pearson_score_original(filename, input_sick):
    input_cosine = []
    with open(filename) as f:
        input_ = []
        for i, line in enumerate(f):
            line = line.strip().split()
            input_.append([float(j) for j in line])
            if i % 2 != 0:
                input_cosine.append(similarity(np.array(input_[i]), np.array(input_[i-1])))
    return scipy.stats.pearsonr(input_cosine, input_sick), input_cosine

In [14]:
train_score, train_cosine = pearson_score_original('train_output.txt', train_sick_score)
test_score, test_cosine = pearson_score_original('test_output.txt', test_sick_score)
trial_score, trial_cosine = pearson_score_original('trial_output.txt', trial_sick_score)
print("%s %s %s" % (train_score[0], test_score[0], trial_score[0]))

0.310250645609 0.292698006351 0.319031428662


# Evaluation Results

| S.No. | Model             | Training Time (in seconds) | Pearson score on SICK train set | Pearson score on SICK test set | Pearson score on SICK trial set |
|-------|-------------------|----------------------------|---------------------------------|--------------------------------|---------------------------------|
| 1.    | Gensim sent2vec   | 276.94                     | 0.37                            | 0.39                           | 0.39                            |
| 2.    | Original sent2vec | 26.65                      | 0.31                            | 0.29                           | 0.32                            |