# STS Benchmark Datasets

### Preparation

Set up all required libraries.

In [1]:
import logging
import re
import sys

import numpy as np
import pandas as pd

from gensim.models.keyedvectors import KeyedVectors, FastTextKeyedVectors

from fse.models import Average, SIF, uSIF, MaxPooling
from fse import CSplitIndexedList

from re import sub

from scipy.stats import pearsonr

from nltk import word_tokenize

logging.basicConfig(format='%(asctime)s : %(threadName)s : %(levelname)s : %(message)s',
                    level=logging.INFO)

Next, we require the sentences from the STS benchmark dataset.

In [2]:
file= "../fse/eval/sts-test.csv"
similarities, sent_a, sent_b = [], [], []
with open(file, "r") as f:
    for l in f:
        line = l.rstrip().split("\t")
        similarities.append(float(line[4]))
        sent_a.append(line[5])
        sent_b.append(line[6])
similarities = np.array(similarities)
assert len(similarities) == len(sent_a) == len(sent_b)
task_length = len(similarities)

for i, obj in enumerate(zip(similarities, sent_a, sent_b)):
    print(f"{i}\tSim: {obj[0].round(3):.1f}\t{obj[1]:40s}\t{obj[2]:40s}\t")
    if i == 4:
        break

0	Sim: 2.5	A girl is styling her hair.             	A girl is brushing her hair.            	
1	Sim: 3.6	A group of men play soccer on the beach.	A group of boys are playing soccer on the beach.	
2	Sim: 5.0	One woman is measuring another woman's ankle.	A woman measures another woman's ankle. 	
3	Sim: 4.2	A man is cutting up a cucumber.         	A man is slicing a cucumber.            	
4	Sim: 1.5	A man is playing a harp.                	A man is playing a keyboard.            	


Each of these sentences requires some preparation (i.e. tokenization) before it can be used in the core input formats.
To reproduce the results from the uSIF paper, this part is taken from https://github.com/kawine/usif/blob/master/usif.py

In [3]:
not_punc = re.compile('.*[A-Za-z0-9].*')

def prep_token(token):
    t = token.lower().strip("';.:()").strip('"')
    t = 'not' if t == "n't" else t
    return re.split(r'[-]', t)

def prep_sentence(sentence):
    tokens = []
    for token in word_tokenize(sentence):
        if not_punc.match(token):
            tokens = tokens + prep_token(token)
    return tokens
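To make the token rules above concrete, here is a small sketch that repeats prep_token on its own (without the nltk tokenizer) and applies it to a few hand-picked tokens. The example tokens are my own and are not taken from the dataset.

```python
import re

# Same pattern as above: a token counts as "not punctuation" if it
# contains at least one letter or digit.
not_punc = re.compile('.*[A-Za-z0-9].*')

def prep_token(token):
    # Lowercase, strip surrounding punctuation and quotes,
    # map the contraction "n't" to "not", and split on hyphens.
    t = token.lower().strip("';.:()").strip('"')
    t = 'not' if t == "n't" else t
    return re.split(r'[-]', t)

examples = ["n't", "Well-known", "(Hello)"]
prepared = [prep_token(tok) for tok in examples]
print(prepared)

# Pure punctuation is filtered out before prep_token is ever called.
keeps_word = bool(not_punc.match("hair"))
keeps_punct = bool(not_punc.match("..."))
print(keeps_word, keeps_punct)
```

Note that a hyphenated token yields several entries, which is why prep_sentence concatenates lists instead of appending single strings.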

Next, we define the IndexedList object. The IndexedList joins the previously constructed sent_a and sent_b lists into a single list. We additionally provide a custom function, prep_sentence, which performs all the preprocessing for a single sentence. We therefore need the extension **CSplitIndexedList**, which lets you supply a custom split function.

In [4]:
sentences = CSplitIndexedList(sent_a, sent_b, custom_split=prep_sentence)

The IndexedList returns the core object required for fse to train a sentence embedding: a tuple. This tuple consists of the words (a list of strings) and the corresponding index. The latter is important if multiple cores access the input queue simultaneously, so it must always be provided. The index represents the row in the matrix where the sentence vector can later be found.

In [5]:
sentences[0]

(['a', 'girl', 'is', 'styling', 'her', 'hair'], 0)

Note that IndexedList does not convert the sentences in place. A sentence is only turned into a tuple when the `__getitem__` method is called. You can access the original sentence using

In [6]:
sentences.items[0]

'A girl is styling her hair.'
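The behaviour described above (lists joined into one, lazy splitting, and a running index) can be sketched in a few lines of plain Python. MiniIndexedList is a hypothetical, simplified stand-in written for illustration; it is not the fse implementation, and the lambda used as split function is far cruder than prep_sentence.

```python
class MiniIndexedList:
    """Hypothetical, simplified stand-in; NOT the fse implementation."""

    def __init__(self, *lists, custom_split):
        # Join all input lists: the first list occupies the first rows,
        # the second list follows directly after it.
        self.items = [s for lst in lists for s in lst]
        self.custom_split = custom_split

    def __len__(self):
        return len(self.items)

    def __getitem__(self, i):
        # The split happens lazily on access; self.items stays untouched.
        return (self.custom_split(self.items[i]), i)


demo = MiniIndexedList(
    ["A girl is styling her hair."],
    ["A girl is brushing her hair."],
    custom_split=lambda s: s.lower().rstrip(".").split(),
)
print(demo[1])        # tuple of (tokens, index)
print(demo.items[1])  # original, unsplit sentence
```

With n sentence pairs, the first sentence of pair i sits at index i and the second at index i + n, which is the layout the evaluation later relies on.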

### Loading the models

We need to load the models as BaseKeyedVectors or as a BaseWordEmbeddingsModel. For this notebook, I already converted the models to BaseKeyedVectors instances and saved them on my external hard drive. You have to replicate these steps yourself, because getting all the files can be a bit difficult, as the total file size is around 15 GB.

In [7]:
path_to_models = "/Volumes/Ext_HDD/Models/Static/"
models, results = {}, {}

The following code performs a disk-to-RAM training. Passing a path to __wv_mapfile_path__ will store the corresponding word vectors (wv) as a numpy memmap array. This is required because loading all vectors into RAM would take up a lot of memory unnecessarily. The wv.vectors file will be replaced by its memmap representation, which is why the next models do not require the wv_mapfile_path argument, as they access the same memmap object. In the cell below, that call is commented out and the vectors are instead loaded with mmap="r".
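As a rough illustration of what memory mapping means at the numpy level, the sketch below saves a tiny array to disk and opens it again with mmap_mode="r". The array, file name, and temporary directory are made up for the example; it does not reproduce what fse or gensim do internally.

```python
import os
import tempfile

import numpy as np

# Stand-in for a large word-vector matrix (hypothetical shape and values).
vectors = np.arange(12, dtype=np.float32).reshape(4, 3)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vectors.npy")
    np.save(path, vectors)

    # mmap_mode="r" maps the file read-only; rows are read from disk
    # on access instead of loading the whole matrix into RAM.
    mapped = np.load(path, mmap_mode="r")
    is_memmap = isinstance(mapped, np.memmap)
    same_content = bool((mapped == vectors).all())
    first_row = mapped[0].tolist()

    # Release the mapping before the temporary directory is removed.
    del mapped

print(is_memmap, same_content, first_row)
```

Several models can share one such read-only mapping, which is the reason the same vectors do not have to be stored once per model.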

Setting lang_freq="en" induces the word frequencies from the wordfreq package. This allows you to work with pre-trained embeddings that do not come with frequency information. The method overwrites the counts in the glove.wv.vocab class, so that all further models also benefit from this induction.
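To see why frequency information matters, consider a frequency-based weighting such as the one used by SIF. The sketch below assumes the weight a / (a + p(w)) from the SIF paper with a = 1e-3; the counts are invented for illustration, and the exact computation inside fse may differ.

```python
# Hypothetical word counts; in the notebook they would come from the
# vocabulary of the loaded model (or be induced via lang_freq).
counts = {"the": 1000, "girl": 50, "styling": 2}
total = sum(counts.values())

a = 1e-3  # smoothing constant (assumed value)

def sif_weight(word):
    # p(w) is the unigram probability of the word.
    p = counts[word] / total
    # Frequent words get a small weight, rare words a weight close to 1.
    return a / (a + p)

weights = {w: sif_weight(w) for w in counts}
for word, weight in weights.items():
    print(f"{word:10s}{weight:.4f}")
```

If every word carried the same count, all weights would be identical and the weighted average would collapse to a plain average, which is why valid frequencies are needed.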

In [25]:
glove = KeyedVectors.load(path_to_models+"glove.840B.300d.model", mmap="r")

#print(f"Before memmap {sys.getsizeof(glove.vectors)}")

#models[f"CBOW-Glove"] = Average(glove, wv_mapfile_path="data/glove", lang_freq="en")

#print(f"After memmap {sys.getsizeof(glove.vectors)}")

models[f"CBOW-Glove"] = Average(glove, lang_freq="en")
models[f"SIF-Glove"] = SIF(glove, components=15)
models[f"uSIF-Glove"] = uSIF(glove,length=11)

models[f"Max-Glove"] = MaxPooling(glove)
models[f"hMax-Glove"] = MaxPooling(glove, hierarchical=True)

2020-02-16 11:58:42,025 : MainThread : INFO : loading Word2VecKeyedVectors object from /Volumes/Ext_HDD/Models/Static/glove.840B.300d.model
2020-02-16 11:58:46,648 : MainThread : INFO : loading vectors from /Volumes/Ext_HDD/Models/Static/glove.840B.300d.model.vectors.npy with mmap=r
2020-02-16 11:58:46,653 : MainThread : INFO : setting ignored attribute vectors_norm to None
2020-02-16 11:58:46,654 : MainThread : INFO : loaded /Volumes/Ext_HDD/Models/Static/glove.840B.300d.model
2020-02-16 11:58:46,657 : MainThread : INFO : no frequency mode: using wordfreq for estimation of frequency for language: en
2020-02-16 11:58:48,470 : MainThread : INFO : make sure you are using a model with valid word-frequency information. Otherwise use lang_freq argument.
2020-02-16 11:58:48,479 : MainThread : INFO : make sure you are using a model with valid word-frequency information. Otherwise use lang_freq argument.
  "C extension not loaded, training/inferring will be slow. "


In [26]:
# Do all the vectors contain the same content?
(models[f"SIF-Glove"].wv.vectors == models[f"uSIF-Glove"].wv.vectors).all()

True

Another option is to load the KeyedVectors using the kwarg mmap="r"

In [8]:
w2v = KeyedVectors.load(path_to_models+"google_news.model", mmap="r")

models[f"CBOW-W2V"] = Average(w2v, lang_freq="en")
models[f"SIF-W2V"] = SIF(w2v, components=10)
models[f"uSIF-W2V"] = uSIF(w2v, length=11)

models[f"Max-W2V"] = MaxPooling(w2v)
models[f"hMax-W2V"] = MaxPooling(w2v, hierarchical=True)

2020-02-16 11:48:12,816 : MainThread : INFO : loading Word2VecKeyedVectors object from /Volumes/Ext_HDD/Models/Static/google_news.model
2020-02-16 11:48:18,748 : MainThread : INFO : loading vectors from /Volumes/Ext_HDD/Models/Static/google_news.model.vectors.npy with mmap=r
2020-02-16 11:48:18,756 : MainThread : INFO : setting ignored attribute vectors_norm to None
2020-02-16 11:48:18,757 : MainThread : INFO : loaded /Volumes/Ext_HDD/Models/Static/google_news.model
2020-02-16 11:48:18,760 : MainThread : INFO : no frequency mode: using wordfreq for estimation of frequency for language: en
2020-02-16 11:48:21,162 : MainThread : INFO : make sure you are using a model with valid word-frequency information. Otherwise use lang_freq argument.
2020-02-16 11:48:21,165 : MainThread : INFO : make sure you are using a model with valid word-frequency information. Otherwise use lang_freq argument.


In [13]:
ft = FastTextKeyedVectors.load(path_to_models+"ft_crawl_300d_2m.model", mmap="r")
models[f"CBOW-FT"] = Average(ft, lang_freq="en")
models[f"SIF-FT"] = SIF(ft, components=10)
models[f"uSIF-FT"] = uSIF(ft, length=11)

models[f"Max-FT"] = MaxPooling(ft)
models[f"hMax-FT"] = MaxPooling(ft, hierarchical=True)

2020-02-16 11:49:19,927 : MainThread : INFO : loading FastTextKeyedVectors object from /Volumes/Ext_HDD/Models/Static/ft_crawl_300d_2m.model
2020-02-16 11:49:24,020 : MainThread : INFO : loading vectors from /Volumes/Ext_HDD/Models/Static/ft_crawl_300d_2m.model.vectors.npy with mmap=r
2020-02-16 11:49:24,030 : MainThread : INFO : loading vectors_vocab from /Volumes/Ext_HDD/Models/Static/ft_crawl_300d_2m.model.vectors_vocab.npy with mmap=r
2020-02-16 11:49:24,036 : MainThread : INFO : loading vectors_ngrams from /Volumes/Ext_HDD/Models/Static/ft_crawl_300d_2m.model.vectors_ngrams.npy with mmap=r
2020-02-16 11:49:24,041 : MainThread : INFO : setting ignored attribute vectors_norm to None
2020-02-16 11:49:24,042 : MainThread : INFO : setting ignored attribute vectors_vocab_norm to None
2020-02-16 11:49:24,043 : MainThread : INFO : setting ignored attribute vectors_ngrams_norm to None
2020-02-16 11:49:24,043 : MainThread : INFO : setting ignored attribute buckets_word to None
2020-02-16 11

In [14]:
paranmt = KeyedVectors.load(path_to_models+"paranmt.model", mmap="r")

models[f"CBOW-Paranmt"] = Average(paranmt, lang_freq="en")
models[f"SIF-Paranmt"] = SIF(paranmt, components=10)
models[f"uSIF-Paranmt"] = uSIF(paranmt, length=11)

models[f"Max-Paranmt"] = MaxPooling(paranmt)
models[f"hMax-Paranmt"] = MaxPooling(paranmt, hierarchical=True)

2020-02-16 11:49:42,202 : MainThread : INFO : loading Word2VecKeyedVectors object from /Volumes/Ext_HDD/Models/Static/paranmt.model
2020-02-16 11:49:42,390 : MainThread : INFO : loading vectors from /Volumes/Ext_HDD/Models/Static/paranmt.model.vectors.npy with mmap=r
2020-02-16 11:49:42,396 : MainThread : INFO : setting ignored attribute vectors_norm to None
2020-02-16 11:49:42,396 : MainThread : INFO : loaded /Volumes/Ext_HDD/Models/Static/paranmt.model
2020-02-16 11:49:42,399 : MainThread : INFO : no frequency mode: using wordfreq for estimation of frequency for language: en
2020-02-16 11:49:42,477 : MainThread : INFO : make sure you are using a model with valid word-frequency information. Otherwise use lang_freq argument.
2020-02-16 11:49:42,478 : MainThread : INFO : make sure you are using a model with valid word-frequency information. Otherwise use lang_freq argument.
  "C extension not loaded, training/inferring will be slow. "


In [15]:
paragram = KeyedVectors.load(path_to_models+"paragram_sl999_czeng.model", mmap="r")

models[f"CBOW-Paragram"] = Average(paragram, lang_freq="en")
models[f"SIF-Paragram"] = SIF(paragram, components=10)
models[f"uSIF-Paragram"] = uSIF(paragram, length=11)

models[f"Max-Paragram"] = MaxPooling(paragram)
models[f"hMax-Paragram"] = MaxPooling(paragram, hierarchical=True)

2020-02-16 11:50:11,887 : MainThread : INFO : loading Word2VecKeyedVectors object from /Volumes/Ext_HDD/Models/Static/paragram_sl999_czeng.model
2020-02-16 11:50:11,990 : MainThread : INFO : loading vectors from /Volumes/Ext_HDD/Models/Static/paragram_sl999_czeng.model.vectors.npy with mmap=r
2020-02-16 11:50:11,994 : MainThread : INFO : setting ignored attribute vectors_norm to None
2020-02-16 11:50:11,995 : MainThread : INFO : loaded /Volumes/Ext_HDD/Models/Static/paragram_sl999_czeng.model
2020-02-16 11:50:11,998 : MainThread : INFO : no frequency mode: using wordfreq for estimation of frequency for language: en
2020-02-16 11:50:12,069 : MainThread : INFO : make sure you are using a model with valid word-frequency information. Otherwise use lang_freq argument.
2020-02-16 11:50:12,070 : MainThread : INFO : make sure you are using a model with valid word-frequency information. Otherwise use lang_freq argument.
  "C extension not loaded, training/inferring will be slow. "


## Computation of the results for the STS benchmark

We are finally able to compute the STS benchmark values for all models.

In [16]:
# This function is used to compute the similarities between two sentences.
# Task length is the length of the sts dataset.
def compute_similarities(task_length, model):
    sims = []
    for i, j in zip(range(task_length), range(task_length, 2*task_length)):
        sims.append(model.sv.similarity(i,j))
    return sims
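The pairing logic above relies on the index layout of the IndexedList: row i holds the first sentence of pair i, and row i + task_length holds the second. The sketch below replays that logic in plain Python on toy vectors, assuming that the similarity is a cosine similarity and that the final score is the Pearson correlation scaled by 100. The helper names and the data are hypothetical.

```python
import math

def cosine(u, v):
    # Cosine similarity between two plain Python vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def paired_similarities(vectors, task_length):
    # Row i is compared with row i + task_length.
    return [cosine(vectors[i], vectors[i + task_length])
            for i in range(task_length)]

def pearson(x, y):
    # Pearson correlation coefficient of two equally long sequences.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Toy "sentence vectors": three first sentences, then three second sentences.
vectors = [
    [1.0, 0.0], [0.0, 1.0], [1.0, 1.0],
    [1.0, 0.0], [1.0, 1.0], [-1.0, 1.0],
]
gold = [5.0, 3.0, 1.0]  # made-up human similarity ratings

sims = paired_similarities(vectors, task_length=3)
score = pearson(gold, sims) * 100
print(sims, round(score, 2))
```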

In [17]:
for k, m in models.items():
    m_type  = k.split("-")[0]
    emb_type = k.split("-")[1]
    m.train(sentences)
    r = pearsonr(similarities, compute_similarities(task_length, m))[0].round(4) * 100
    results[f"{m_type}-{emb_type}"] = r
    print(k, f"{r:2.2f}")

2020-02-16 11:50:21,575 : MainThread : INFO : scanning all indexed sentences and their word counts
2020-02-16 11:50:21,945 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2020-02-16 11:50:22,962 : MainThread : INFO : estimated memory for 2758 sentences with 300 dimensions and 3000000 vocabulary: 3447 MB (3 GB)
2020-02-16 11:50:22,963 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2020-02-16 11:50:23,020 : MainThread : INFO : begin training
2020-02-16 11:50:23,651 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-16 11:50:23,652 : MainThread : INFO : training on 2758 effective sentences with 23116 effective words took 0s with 4367 sentences/s
2020-02-16 11:50:23,683 : MainThread : INFO : scanning all indexed sentences and their word counts


CBOW-W2V 61.54


2020-02-16 11:50:24,051 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2020-02-16 11:50:25,067 : MainThread : INFO : estimated memory for 2758 sentences with 300 dimensions and 3000000 vocabulary: 3447 MB (3 GB)
2020-02-16 11:50:25,067 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2020-02-16 11:50:25,081 : MainThread : INFO : pre-computing SIF weights for 3000000 words
2020-02-16 11:50:27,027 : MainThread : INFO : begin training
2020-02-16 11:50:27,381 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-16 11:50:27,420 : MainThread : INFO : computing 10 principal components took 0s
2020-02-16 11:50:27,422 : MainThread : INFO : removing 10 principal components took 0s
2020-02-16 11:50:27,423 : MainThread : INFO : training on 2758 effective sentences with 23116 effective words took 0s with 7763 sentences/s
2020-02-16 11:50:27,460 : MainThread : INFO : scanning all indexe

SIF-W2V 71.12


2020-02-16 11:50:27,821 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2020-02-16 11:50:28,819 : MainThread : INFO : estimated memory for 2758 sentences with 300 dimensions and 3000000 vocabulary: 3447 MB (3 GB)
2020-02-16 11:50:28,820 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2020-02-16 11:50:28,832 : MainThread : INFO : pre-computing uSIF weights for 3000000 words
2020-02-16 11:50:37,657 : MainThread : INFO : begin training
2020-02-16 11:50:38,012 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-16 11:50:38,036 : MainThread : INFO : computing 5 principal components took 0s
2020-02-16 11:50:38,038 : MainThread : INFO : removing 5 principal components took 0s
2020-02-16 11:50:38,039 : MainThread : INFO : training on 2758 effective sentences with 23116 effective words took 0s with 7741 sentences/s
2020-02-16 11:50:38,078 : MainThread : INFO : scanning all indexed

uSIF-W2V 66.99


2020-02-16 11:50:38,522 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2020-02-16 11:50:39,531 : MainThread : INFO : estimated memory for 2758 sentences with 300 dimensions and 3000000 vocabulary: 3447 MB (3 GB)
2020-02-16 11:50:39,532 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2020-02-16 11:50:39,587 : MainThread : INFO : begin training
2020-02-16 11:50:40,040 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-16 11:50:40,041 : MainThread : INFO : training on 2758 effective sentences with 23116 effective words took 0s with 6083 sentences/s
2020-02-16 11:50:40,066 : MainThread : INFO : scanning all indexed sentences and their word counts


Max-W2V 66.52


2020-02-16 11:50:40,443 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2020-02-16 11:50:41,423 : MainThread : INFO : estimated memory for 2758 sentences with 300 dimensions and 3000000 vocabulary: 3447 MB (3 GB)
2020-02-16 11:50:41,424 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2020-02-16 11:50:41,482 : MainThread : INFO : begin training
2020-02-16 11:50:42,563 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-16 11:50:42,564 : MainThread : INFO : training on 2758 effective sentences with 23116 effective words took 1s with 2549 sentences/s
2020-02-16 11:50:42,588 : MainThread : INFO : scanning all indexed sentences and their word counts


hMax-W2V 52.19


2020-02-16 11:50:42,961 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2020-02-16 11:50:43,740 : MainThread : INFO : estimated memory for 2758 sentences with 300 dimensions and 2000000 vocabulary: 6877 MB (6 GB)
2020-02-16 11:50:43,741 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2020-02-16 11:50:43,786 : MainThread : INFO : begin training
2020-02-16 11:50:44,844 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-16 11:50:44,845 : MainThread : INFO : training on 2758 effective sentences with 27528 effective words took 1s with 2604 sentences/s
2020-02-16 11:50:44,875 : MainThread : INFO : scanning all indexed sentences and their word counts


CBOW-FT 48.49


2020-02-16 11:50:45,288 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2020-02-16 11:50:46,090 : MainThread : INFO : estimated memory for 2758 sentences with 300 dimensions and 2000000 vocabulary: 6877 MB (6 GB)
2020-02-16 11:50:46,091 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2020-02-16 11:50:46,106 : MainThread : INFO : pre-computing SIF weights for 2000000 words
2020-02-16 11:50:47,492 : MainThread : INFO : begin training
2020-02-16 11:50:47,868 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-16 11:50:47,904 : MainThread : INFO : computing 10 principal components took 0s
2020-02-16 11:50:47,906 : MainThread : INFO : removing 10 principal components took 0s
2020-02-16 11:50:47,907 : MainThread : INFO : training on 2758 effective sentences with 27528 effective words took 0s with 7314 sentences/s
2020-02-16 11:50:47,944 : MainThread : INFO : scanning all indexe

SIF-FT 73.38


2020-02-16 11:50:48,337 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2020-02-16 11:50:49,104 : MainThread : INFO : estimated memory for 2758 sentences with 300 dimensions and 2000000 vocabulary: 6877 MB (6 GB)
2020-02-16 11:50:49,105 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2020-02-16 11:50:49,121 : MainThread : INFO : pre-computing uSIF weights for 2000000 words
2020-02-16 11:50:55,026 : MainThread : INFO : begin training
2020-02-16 11:50:55,385 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-16 11:50:55,415 : MainThread : INFO : computing 5 principal components took 0s
2020-02-16 11:50:55,418 : MainThread : INFO : removing 5 principal components took 0s
2020-02-16 11:50:55,419 : MainThread : INFO : training on 2758 effective sentences with 27528 effective words took 0s with 7664 sentences/s
2020-02-16 11:50:55,464 : MainThread : INFO : scanning all indexed

uSIF-FT 69.40


2020-02-16 11:50:55,835 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2020-02-16 11:50:56,624 : MainThread : INFO : estimated memory for 2758 sentences with 300 dimensions and 2000000 vocabulary: 6877 MB (6 GB)
2020-02-16 11:50:56,625 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2020-02-16 11:50:56,669 : MainThread : INFO : begin training
2020-02-16 11:50:57,421 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-16 11:50:57,422 : MainThread : INFO : training on 2758 effective sentences with 27528 effective words took 0s with 3664 sentences/s
2020-02-16 11:50:57,446 : MainThread : INFO : scanning all indexed sentences and their word counts


Max-FT 57.10


2020-02-16 11:50:57,802 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2020-02-16 11:50:58,587 : MainThread : INFO : estimated memory for 2758 sentences with 300 dimensions and 2000000 vocabulary: 6877 MB (6 GB)
2020-02-16 11:50:58,589 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2020-02-16 11:50:58,634 : MainThread : INFO : begin training
2020-02-16 11:51:00,569 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-16 11:51:00,570 : MainThread : INFO : training on 2758 effective sentences with 27528 effective words took 1s with 1424 sentences/s
2020-02-16 11:51:00,593 : MainThread : INFO : scanning all indexed sentences and their word counts


hMax-FT 52.97


2020-02-16 11:51:00,947 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2020-02-16 11:51:00,975 : MainThread : INFO : estimated memory for 2758 sentences with 300 dimensions and 77224 vocabulary: 91 MB (0 GB)
2020-02-16 11:51:00,976 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2020-02-16 11:51:00,992 : MainThread : INFO : begin training
2020-02-16 11:51:01,621 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-16 11:51:01,621 : MainThread : INFO : training on 2758 effective sentences with 27441 effective words took 0s with 4379 sentences/s
2020-02-16 11:51:01,646 : MainThread : INFO : scanning all indexed sentences and their word counts


CBOW-Paranmt 79.85


2020-02-16 11:51:02,004 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2020-02-16 11:51:02,029 : MainThread : INFO : estimated memory for 2758 sentences with 300 dimensions and 77224 vocabulary: 91 MB (0 GB)
2020-02-16 11:51:02,030 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2020-02-16 11:51:02,043 : MainThread : INFO : pre-computing SIF weights for 77224 words
2020-02-16 11:51:02,090 : MainThread : INFO : begin training
2020-02-16 11:51:02,456 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-16 11:51:02,487 : MainThread : INFO : computing 10 principal components took 0s
2020-02-16 11:51:02,489 : MainThread : INFO : removing 10 principal components took 0s
2020-02-16 11:51:02,490 : MainThread : INFO : training on 2758 effective sentences with 27441 effective words took 0s with 7522 sentences/s
2020-02-16 11:51:02,525 : MainThread : INFO : scanning all indexed sent

SIF-Paranmt 76.75


2020-02-16 11:51:02,892 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2020-02-16 11:51:02,919 : MainThread : INFO : estimated memory for 2758 sentences with 300 dimensions and 77224 vocabulary: 91 MB (0 GB)
2020-02-16 11:51:02,920 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2020-02-16 11:51:02,933 : MainThread : INFO : pre-computing uSIF weights for 77224 words
2020-02-16 11:51:03,162 : MainThread : INFO : begin training
2020-02-16 11:51:03,526 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-16 11:51:03,552 : MainThread : INFO : computing 5 principal components took 0s
2020-02-16 11:51:03,555 : MainThread : INFO : removing 5 principal components took 0s
2020-02-16 11:51:03,556 : MainThread : INFO : training on 2758 effective sentences with 27441 effective words took 0s with 7553 sentences/s
2020-02-16 11:51:03,594 : MainThread : INFO : scanning all indexed sente

uSIF-Paranmt 79.02


2020-02-16 11:51:03,959 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2020-02-16 11:51:03,986 : MainThread : INFO : estimated memory for 2758 sentences with 300 dimensions and 77224 vocabulary: 91 MB (0 GB)
2020-02-16 11:51:03,987 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2020-02-16 11:51:04,000 : MainThread : INFO : begin training
2020-02-16 11:51:04,479 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-16 11:51:04,480 : MainThread : INFO : training on 2758 effective sentences with 27441 effective words took 0s with 5747 sentences/s
2020-02-16 11:51:04,506 : MainThread : INFO : scanning all indexed sentences and their word counts


Max-Paranmt 71.57


2020-02-16 11:51:04,870 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2020-02-16 11:51:04,902 : MainThread : INFO : estimated memory for 2758 sentences with 300 dimensions and 77224 vocabulary: 91 MB (0 GB)
2020-02-16 11:51:04,902 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2020-02-16 11:51:04,921 : MainThread : INFO : begin training
2020-02-16 11:51:06,129 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-16 11:51:06,130 : MainThread : INFO : training on 2758 effective sentences with 27441 effective words took 1s with 2280 sentences/s
2020-02-16 11:51:06,154 : MainThread : INFO : scanning all indexed sentences and their word counts


hMax-Paranmt 54.62


2020-02-16 11:51:06,520 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2020-02-16 11:51:06,548 : MainThread : INFO : estimated memory for 2758 sentences with 300 dimensions and 77224 vocabulary: 91 MB (0 GB)
2020-02-16 11:51:06,548 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2020-02-16 11:51:06,564 : MainThread : INFO : begin training
2020-02-16 11:51:07,058 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-16 11:51:07,059 : MainThread : INFO : training on 2758 effective sentences with 27441 effective words took 0s with 5575 sentences/s
2020-02-16 11:51:07,083 : MainThread : INFO : scanning all indexed sentences and their word counts


CBOW-Paragram 50.38


2020-02-16 11:51:07,449 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2020-02-16 11:51:07,475 : MainThread : INFO : estimated memory for 2758 sentences with 300 dimensions and 77224 vocabulary: 91 MB (0 GB)
2020-02-16 11:51:07,476 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2020-02-16 11:51:07,489 : MainThread : INFO : pre-computing SIF weights for 77224 words
2020-02-16 11:51:07,539 : MainThread : INFO : begin training
2020-02-16 11:51:07,912 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-16 11:51:07,942 : MainThread : INFO : computing 10 principal components took 0s
2020-02-16 11:51:07,945 : MainThread : INFO : removing 10 principal components took 0s
2020-02-16 11:51:07,946 : MainThread : INFO : training on 2758 effective sentences with 27441 effective words took 0s with 7367 sentences/s
2020-02-16 11:51:07,982 : MainThread : INFO : scanning all indexed sent

SIF-Paragram 73.86


2020-02-16 11:51:08,355 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2020-02-16 11:51:08,381 : MainThread : INFO : estimated memory for 2758 sentences with 300 dimensions and 77224 vocabulary: 91 MB (0 GB)
2020-02-16 11:51:08,382 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2020-02-16 11:51:08,393 : MainThread : INFO : pre-computing uSIF weights for 77224 words
2020-02-16 11:51:08,621 : MainThread : INFO : begin training
2020-02-16 11:51:08,979 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-16 11:51:09,006 : MainThread : INFO : computing 5 principal components took 0s
2020-02-16 11:51:09,008 : MainThread : INFO : removing 5 principal components took 0s
2020-02-16 11:51:09,009 : MainThread : INFO : training on 2758 effective sentences with 27441 effective words took 0s with 7692 sentences/s
2020-02-16 11:51:09,049 : MainThread : INFO : scanning all indexed sente

uSIF-Paragram 73.64


2020-02-16 11:51:09,429 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2020-02-16 11:51:09,458 : MainThread : INFO : estimated memory for 2758 sentences with 300 dimensions and 77224 vocabulary: 91 MB (0 GB)
2020-02-16 11:51:09,459 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2020-02-16 11:51:09,475 : MainThread : INFO : begin training
2020-02-16 11:51:09,952 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-16 11:51:09,953 : MainThread : INFO : training on 2758 effective sentences with 27441 effective words took 0s with 5772 sentences/s
2020-02-16 11:51:09,978 : MainThread : INFO : scanning all indexed sentences and their word counts


Max-Paragram 59.82


2020-02-16 11:51:10,344 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2020-02-16 11:51:10,371 : MainThread : INFO : estimated memory for 2758 sentences with 300 dimensions and 77224 vocabulary: 91 MB (0 GB)
2020-02-16 11:51:10,372 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2020-02-16 11:51:10,388 : MainThread : INFO : begin training
2020-02-16 11:51:11,659 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2020-02-16 11:51:11,660 : MainThread : INFO : training on 2758 effective sentences with 27441 effective words took 1s with 2167 sentences/s


hMax-Paragram 50.55


In [18]:
pd.DataFrame.from_dict(results, orient="index", columns=["Pearson"])

| Model | Pearson |
|---|---|
| CBOW-W2V | 61.54 |
| SIF-W2V | 71.12 |
| uSIF-W2V | 66.99 |
| Max-W2V | 66.52 |
| hMax-W2V | 52.19 |
| CBOW-FT | 48.49 |
| SIF-FT | 73.38 |
| uSIF-FT | 69.40 |
| Max-FT | 57.10 |
| hMax-FT | 52.97 |


If you study the values above closely, you will find that:
- SIF-Glove is almost equivalent to the values reported in http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark
- CBOW-Paranmt is a little better than ParaNMT Word Avg. in https://www.aclweb.org/anthology/W18-3012
- uSIF-Paranmt is a little worse than ParaNMT+UP in https://www.aclweb.org/anthology/W18-3012
- uSIF-Paragram is a little worse than PSL+UP in https://www.aclweb.org/anthology/W18-3012

I suspect these differences arise from differences in preprocessing. Too bad we didn't hit 80. If you have any ideas why the values don't match exactly, feel free to contact me anytime.