# STS Benchmark Datasets

### Preparation

Set up all required libraries.

In [1]:
import logging
import re
import sys

import numpy as np
import pandas as pd

from fse import Average, SIF, uSIF, Vectors, FTVectors, CSplitIndexedList

from scipy.stats import pearsonr

from nltk import word_tokenize

logging.basicConfig(format='%(asctime)s : %(threadName)s : %(levelname)s : %(message)s', level=logging.INFO)

Next, we require the sentences from the STS benchmark dataset.

In [2]:
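file = "../evaluation/sts-test.csv"
similarities, sent_a, sent_b = [], [], []
with open(file, "r", encoding="utf-8") as f:
    for row in f:
        # Tab-separated columns: index 4 is the gold score, 5 and 6 are the sentence pair.
        line = row.rstrip().split("\t")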
        similarities.append(float(line[4]))
        sent_a.append(line[5])
        sent_b.append(line[6])
similarities = np.array(similarities)
assert len(similarities) == len(sent_a) == len(sent_b)
task_length = len(similarities)

for i, obj in enumerate(zip(similarities, sent_a, sent_b)):
    print(f"{i}\tSim: {obj[0].round(3):.1f}\t{obj[1]:40s}\t{obj[2]:40s}\t")
    if i == 4:
        break

0	Sim: 2.5	A girl is styling her hair.             	A girl is brushing her hair.            	
1	Sim: 3.6	A group of men play soccer on the beach.	A group of boys are playing soccer on the beach.	
2	Sim: 5.0	One woman is measuring another woman's ankle.	A woman measures another woman's ankle. 	
3	Sim: 4.2	A man is cutting up a cucumber.         	A man is slicing a cucumber.            	
4	Sim: 1.5	A man is playing a harp.                	A man is playing a keyboard.            	
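
The column layout used by the loader above can be checked on a small in-memory sample. The following is a minimal sketch, assuming the seven-column, tab-separated layout implied by the indices 4, 5 and 6; the metadata values in the first four columns and the helper name `read_sts` are placeholders for illustration only.

```python
import io

# Two illustrative rows: four metadata columns (placeholders), then score, sentence A, sentence B.
sample = (
    "genre\tfile\tyear\t0001\t2.500\tA girl is styling her hair.\tA girl is brushing her hair.\n"
    "genre\tfile\tyear\t0002\t3.600\tA group of men play soccer on the beach.\t"
    "A group of boys are playing soccer on the beach.\n"
)

def read_sts(handle):
    """Hypothetical helper: collect scores and both sentence columns from a file-like object."""
    scores, first, second = [], [], []
    for raw in handle:
        cols = raw.rstrip("\n").split("\t")
        if len(cols) < 7:
            # Skip malformed rows instead of raising an IndexError.
            continue
        scores.append(float(cols[4]))
        first.append(cols[5])
        second.append(cols[6])
    return scores, first, second

scores, first, second = read_sts(io.StringIO(sample))
print(scores)
print(first[0], "|", second[0])
```

Reading from a file-like object keeps the parsing logic testable without the benchmark file being present on disk.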


Each of these sentences requires some preparation (i.e., tokenization) before it can be used in the core input formats.
To reproduce the results from the uSIF paper, this part is taken from https://github.com/kawine/usif/blob/master/usif.py

In [3]:
not_punc = re.compile('.*[A-Za-z0-9].*')

def prep_token(token):
    t = token.lower().strip("';.:()").strip('"')
    t = 'not' if t == "n't" else t
    return re.split(r'[-]', t)

def prep_sentence(sentence):
    tokens = []
    for token in word_tokenize(sentence):
        if not_punc.match(token):
            tokens = tokens + prep_token(token)
    return tokens
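
To see what the token rules do, here is a small sketch that applies them to a hand-written token list. The list `raw_tokens` is a hypothetical stand-in for the output of `word_tokenize`, so the example runs without the NLTK tokenizer data.

```python
import re

# Same rules as in the cell above, repeated so the example runs on its own.
not_punc = re.compile(".*[A-Za-z0-9].*")

def prep_token(token):
    t = token.lower().strip("';.:()").strip('"')
    t = "not" if t == "n't" else t
    return re.split(r"[-]", t)

# Hypothetical pre-tokenized input (what a Treebank-style tokenizer might produce).
raw_tokens = ["He", "did", "n't", "like", "the", "well-known", "song", "."]

tokens = []
for token in raw_tokens:
    # Tokens without any letter or digit (pure punctuation) are dropped.
    if not_punc.match(token):
        tokens.extend(prep_token(token))

print(tokens)
# ['he', 'did', 'not', 'like', 'the', 'well', 'known', 'song']
```

The contraction "n't" is mapped to "not", hyphenated words are split into their parts, and the final period is removed.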

Next, we define the IndexedList object. It combines the previously constructed sent_a and sent_b lists into a single list. We additionally provide a custom function, "prep_sentence", which performs all the preprocessing for a single sentence. We therefore need the extension **CSplitIndexedList**, which accepts a custom split function.

In [4]:
sentences = CSplitIndexedList(sent_a, sent_b, custom_split=prep_sentence)

The IndexedList returns the core object that fse requires to train a sentence embedding: a tuple. This tuple consists of the words (a list of strings) and the corresponding sentence index. The index is important when multiple cores access the input queue simultaneously, so it must always be provided. It denotes the row of the sentence-vector matrix where the embedding can later be found.

In [5]:
sentences[0]

(['a', 'girl', 'is', 'styling', 'her', 'hair'], 0)

Note that IndexedList does not convert the sentences in place. A sentence is only turned into a tuple when the __getitem__ method is called. You can access the original sentence using

In [6]:
sentences.items[0]

'A girl is styling her hair.'
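
The lazy behaviour described above can be illustrated with a small stand-in class. This is a sketch only: `LazySplitList` is a hypothetical name and not fse's implementation. It merely mimics the two properties discussed here, namely that the raw strings stay untouched in `items` and that splitting happens on item access.

```python
class LazySplitList:
    """Hypothetical stand-in for an indexed sentence list; not fse's implementation."""

    def __init__(self, *lists, split=str.split):
        # Concatenate all input lists; the raw strings are stored unchanged.
        self.items = [sentence for lst in lists for sentence in lst]
        self.split = split

    def __len__(self):
        return len(self.items)

    def __getitem__(self, i):
        # Splitting happens only here, on access, and the index travels with the words.
        return self.split(self.items[i]), i

demo = LazySplitList(
    ["A girl is styling her hair."],
    ["A girl is brushing her hair."],
)
print(demo[1])
print(demo.items[1])
```

Because the lists are concatenated in this sketch, the first sentence of the second list receives the index `len(first_list)`, which is how a pair of sentences can later be matched up again by row.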

### Loading the models

The models must be loaded as a BaseKeyedVectors or a BaseWordEmbeddingsModel instance. For this notebook, I already converted the models to BaseKeyedVectors instances and saved them on my external hard drive. You have to replicate these steps yourself, because getting all the files can be a bit difficult: the total file size is around 15 GB.

The following code performs disk-to-RAM training. Passing a path to __wv_mapfile_path__ stores the corresponding word vectors (wv) as a numpy memmap array. This is required because loading all vectors into RAM would take up a lot of memory unnecessarily. The wv.vectors file will be replaced by its memmap representation, which is why the subsequent models do not require the wv_mapfile_path argument: they access the same memmap object.
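
As a rough illustration of the memory-mapping idea, the sketch below saves a toy matrix to disk and reopens it read-only with NumPy. The file name and the toy matrix are assumptions for the example; it shows the general NumPy mechanism, not how fse handles the mapping internally.

```python
import os
import tempfile

import numpy as np

# Toy "word vectors": 4 words with 3 dimensions each.
vectors = np.arange(12, dtype=np.float32).reshape(4, 3)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_vectors.npy")  # hypothetical file name
    np.save(path, vectors)

    # mmap_mode="r" maps the file read-only; rows are read from disk on access.
    mapped = np.load(path, mmap_mode="r")
    is_memmap = isinstance(mapped, np.memmap)

    # Copy a single row into RAM; the rest of the matrix stays on disk.
    row = np.array(mapped[2])

    # Release the mapping before the temporary directory is removed.
    del mapped

print(is_memmap, row)
```

Several objects can open the same file this way, so the vectors do not have to be duplicated in memory for each model.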

Setting lang_freq="en" induces the word frequencies from the wordfreq package. This allows you to work with pre-trained embeddings that don't come with frequency information. The method overwrites the counts in the glove.wv.vocab class, so that all further models also benefit from this induction.
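
Why frequency information matters can be sketched in a few lines. The example below turns relative word frequencies into pseudo-counts and then computes SIF-style weights of the form a / (a + p(w)). The relative frequencies are made up for illustration (in practice they would come from the wordfreq package), and the function names and the scaling constant are hypothetical, not fse's internals.

```python
def induce_counts(vocab, rel_freq, corpus_size=1_000_000):
    """Turn relative frequencies into pseudo-counts; unknown words get a count of 1."""
    return {w: max(1, round(rel_freq.get(w, 0.0) * corpus_size)) for w in vocab}

def sif_weights(counts, a=1e-3):
    """SIF-style weight a / (a + p(w)), with p(w) estimated from the counts."""
    total = sum(counts.values())
    return {w: a / (a + c / total) for w, c in counts.items()}

vocab = ["the", "girl", "harp"]
rel_freq = {"the": 0.05, "girl": 0.0002}  # made-up values; "harp" is deliberately missing

counts = induce_counts(vocab, rel_freq)
weights = sif_weights(counts)

for word in vocab:
    print(f"{word:5s} count={counts[word]:6d} weight={weights[word]:.4f}")
```

Frequent words such as "the" receive a small weight and rare words a weight close to 1, which is only possible when the vocabulary carries counts.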

In [7]:
models = {}
model_names = [
    "fasttext-wiki-news-subwords-300",
    "glove-twitter-100",
    "glove-twitter-200",
    "glove-twitter-25",
    "glove-twitter-50",
    "glove-wiki-gigaword-100",
    "glove-wiki-gigaword-200",
    "glove-wiki-gigaword-300",
    "glove-wiki-gigaword-50",
    "paragram-25",
    "paragram-300-sl999",
    "paragram-300-ws353",
    "paranmt-300",
    "word2vec-google-news-300",
]

vectors = {}

for name in model_names:
    vectors[name] = Vectors.from_pretrained(name, mmap="r")
    
ft = "fasttext-crawl-subwords-300"
vectors[ft] = FTVectors.from_pretrained(ft, mmap="r")

2022-04-10 20:57:47,250 : MainThread : INFO : Lock 23217094427696 acquired on /home/oborchers/.cache/huggingface/hub/fse--fasttext-wiki-news-subwords-300.main.ef21870476cd93435d83140fcf6e7171b517e337/.gitattributes.lock


2022-04-10 20:57:47,677 : MainThread : INFO : Lock 23217094427696 released on /home/oborchers/.cache/huggingface/hub/fse--fasttext-wiki-news-subwords-300.main.ef21870476cd93435d83140fcf6e7171b517e337/.gitattributes.lock





2022-04-10 20:57:48,088 : MainThread : INFO : Lock 23217094427936 acquired on /home/oborchers/.cache/huggingface/hub/fse--fasttext-wiki-news-subwords-300.main.ef21870476cd93435d83140fcf6e7171b517e337/README.md.lock


2022-04-10 20:57:48,523 : MainThread : INFO : Lock 23217094427936 released on /home/oborchers/.cache/huggingface/hub/fse--fasttext-wiki-news-subwords-300.main.ef21870476cd93435d83140fcf6e7171b517e337/README.md.lock





2022-04-10 20:57:48,935 : MainThread : INFO : Lock 23217094427264 acquired on /home/oborchers/.cache/huggingface/hub/fse--fasttext-wiki-news-subwords-300.main.ef21870476cd93435d83140fcf6e7171b517e337/fasttext-wiki-news-subwords-300.model.lock


2022-04-10 20:57:51,756 : MainThread : INFO : Lock 23217094427264 released on /home/oborchers/.cache/huggingface/hub/fse--fasttext-wiki-news-subwords-300.main.ef21870476cd93435d83140fcf6e7171b517e337/fasttext-wiki-news-subwords-300.model.lock





2022-04-10 20:57:52,156 : MainThread : INFO : Lock 23217094556688 acquired on /home/oborchers/.cache/huggingface/hub/fse--fasttext-wiki-news-subwords-300.main.ef21870476cd93435d83140fcf6e7171b517e337/fasttext-wiki-news-subwords-300.model.vectors.npy.lock


2022-04-10 20:58:46,239 : MainThread : INFO : Lock 23217094556688 released on /home/oborchers/.cache/huggingface/hub/fse--fasttext-wiki-news-subwords-300.main.ef21870476cd93435d83140fcf6e7171b517e337/fasttext-wiki-news-subwords-300.model.vectors.npy.lock
2022-04-10 20:58:46,240 : MainThread : INFO : loading Vectors object from /home/oborchers/.cache/huggingface/hub/fse--fasttext-wiki-news-subwords-300.main.ef21870476cd93435d83140fcf6e7171b517e337/fasttext-wiki-news-subwords-300.model





2022-04-10 20:58:48,458 : MainThread : INFO : loading vectors from /home/oborchers/.cache/huggingface/hub/fse--fasttext-wiki-news-subwords-300.main.ef21870476cd93435d83140fcf6e7171b517e337/fasttext-wiki-news-subwords-300.model.vectors.npy with mmap=r
2022-04-10 20:58:48,460 : MainThread : INFO : setting ignored attribute vectors_norm to None
2022-04-10 20:58:54,473 : MainThread : INFO : KeyedVectors lifecycle event {'fname': '/home/oborchers/.cache/huggingface/hub/fse--fasttext-wiki-news-subwords-300.main.ef21870476cd93435d83140fcf6e7171b517e337/fasttext-wiki-news-subwords-300.model', 'datetime': '2022-04-10T20:58:54.447613', 'gensim': '4.0.0', 'python': '3.8.5 (default, Sep  4 2020, 07:30:14) \n[GCC 7.3.0]', 'platform': 'Linux-4.15.0-173-generic-x86_64-with-glibc2.10', 'event': 'loaded'}
2022-04-10 20:58:55,333 : MainThread : INFO : Lock 23216641444928 acquired on /home/oborchers/.cache/huggingface/hub/fse--glove-twitter-100.main.7ef1c21d9bf90598a0c618c041d9817e50250183/.gitattributes

2022-04-10 20:58:55,799 : MainThread : INFO : Lock 23216641444928 released on /home/oborchers/.cache/huggingface/hub/fse--glove-twitter-100.main.7ef1c21d9bf90598a0c618c041d9817e50250183/.gitattributes.lock





2022-04-10 20:58:56,225 : MainThread : INFO : Lock 23216641442000 acquired on /home/oborchers/.cache/huggingface/hub/fse--glove-twitter-100.main.7ef1c21d9bf90598a0c618c041d9817e50250183/README.md.lock


2022-04-10 20:58:56,653 : MainThread : INFO : Lock 23216641442000 released on /home/oborchers/.cache/huggingface/hub/fse--glove-twitter-100.main.7ef1c21d9bf90598a0c618c041d9817e50250183/README.md.lock





2022-04-10 20:58:57,049 : MainThread : INFO : Lock 23216641444016 acquired on /home/oborchers/.cache/huggingface/hub/fse--glove-twitter-100.main.7ef1c21d9bf90598a0c618c041d9817e50250183/glove-twitter-100.model.lock


2022-04-10 20:59:00,003 : MainThread : INFO : Lock 23216641444016 released on /home/oborchers/.cache/huggingface/hub/fse--glove-twitter-100.main.7ef1c21d9bf90598a0c618c041d9817e50250183/glove-twitter-100.model.lock





2022-04-10 20:59:00,416 : MainThread : INFO : Lock 23216641443584 acquired on /home/oborchers/.cache/huggingface/hub/fse--glove-twitter-100.main.7ef1c21d9bf90598a0c618c041d9817e50250183/glove-twitter-100.model.vectors.npy.lock


2022-04-10 20:59:14,692 : MainThread : INFO : Lock 23216641443584 released on /home/oborchers/.cache/huggingface/hub/fse--glove-twitter-100.main.7ef1c21d9bf90598a0c618c041d9817e50250183/glove-twitter-100.model.vectors.npy.lock
2022-04-10 20:59:14,693 : MainThread : INFO : loading Vectors object from /home/oborchers/.cache/huggingface/hub/fse--glove-twitter-100.main.7ef1c21d9bf90598a0c618c041d9817e50250183/glove-twitter-100.model





2022-04-10 20:59:17,633 : MainThread : INFO : loading vectors from /home/oborchers/.cache/huggingface/hub/fse--glove-twitter-100.main.7ef1c21d9bf90598a0c618c041d9817e50250183/glove-twitter-100.model.vectors.npy with mmap=r
2022-04-10 20:59:17,634 : MainThread : INFO : setting ignored attribute vectors_norm to None
2022-04-10 20:59:25,072 : MainThread : INFO : KeyedVectors lifecycle event {'fname': '/home/oborchers/.cache/huggingface/hub/fse--glove-twitter-100.main.7ef1c21d9bf90598a0c618c041d9817e50250183/glove-twitter-100.model', 'datetime': '2022-04-10T20:59:25.072106', 'gensim': '4.0.0', 'python': '3.8.5 (default, Sep  4 2020, 07:30:14) \n[GCC 7.3.0]', 'platform': 'Linux-4.15.0-173-generic-x86_64-with-glibc2.10', 'event': 'loaded'}
2022-04-10 20:59:25,890 : MainThread : INFO : Lock 23216623418336 acquired on /home/oborchers/.cache/huggingface/hub/fse--glove-twitter-200.main.72f480c107aaa58b9474ddaf45d13db2e34fa166/.gitattributes.lock


2022-04-10 20:59:26,350 : MainThread : INFO : Lock 23216623418336 released on /home/oborchers/.cache/huggingface/hub/fse--glove-twitter-200.main.72f480c107aaa58b9474ddaf45d13db2e34fa166/.gitattributes.lock





2022-04-10 20:59:26,761 : MainThread : INFO : Lock 23216623415408 acquired on /home/oborchers/.cache/huggingface/hub/fse--glove-twitter-200.main.72f480c107aaa58b9474ddaf45d13db2e34fa166/README.md.lock


2022-04-10 20:59:27,193 : MainThread : INFO : Lock 23216623415408 released on /home/oborchers/.cache/huggingface/hub/fse--glove-twitter-200.main.72f480c107aaa58b9474ddaf45d13db2e34fa166/README.md.lock





2022-04-10 20:59:27,604 : MainThread : INFO : Lock 23216623416752 acquired on /home/oborchers/.cache/huggingface/hub/fse--glove-twitter-200.main.72f480c107aaa58b9474ddaf45d13db2e34fa166/glove-twitter-200.model.lock


2022-04-10 20:59:30,800 : MainThread : INFO : Lock 23216623416752 released on /home/oborchers/.cache/huggingface/hub/fse--glove-twitter-200.main.72f480c107aaa58b9474ddaf45d13db2e34fa166/glove-twitter-200.model.lock





2022-04-10 20:59:31,213 : MainThread : INFO : Lock 23216472298736 acquired on /home/oborchers/.cache/huggingface/hub/fse--glove-twitter-200.main.72f480c107aaa58b9474ddaf45d13db2e34fa166/glove-twitter-200.model.vectors.npy.lock


2022-04-10 21:00:04,121 : MainThread : INFO : Lock 23216472298736 released on /home/oborchers/.cache/huggingface/hub/fse--glove-twitter-200.main.72f480c107aaa58b9474ddaf45d13db2e34fa166/glove-twitter-200.model.vectors.npy.lock
2022-04-10 21:00:04,123 : MainThread : INFO : loading Vectors object from /home/oborchers/.cache/huggingface/hub/fse--glove-twitter-200.main.72f480c107aaa58b9474ddaf45d13db2e34fa166/glove-twitter-200.model





2022-04-10 21:00:06,943 : MainThread : INFO : loading vectors from /home/oborchers/.cache/huggingface/hub/fse--glove-twitter-200.main.72f480c107aaa58b9474ddaf45d13db2e34fa166/glove-twitter-200.model.vectors.npy with mmap=r
2022-04-10 21:00:06,945 : MainThread : INFO : setting ignored attribute vectors_norm to None
2022-04-10 21:00:14,454 : MainThread : INFO : KeyedVectors lifecycle event {'fname': '/home/oborchers/.cache/huggingface/hub/fse--glove-twitter-200.main.72f480c107aaa58b9474ddaf45d13db2e34fa166/glove-twitter-200.model', 'datetime': '2022-04-10T21:00:14.454541', 'gensim': '4.0.0', 'python': '3.8.5 (default, Sep  4 2020, 07:30:14) \n[GCC 7.3.0]', 'platform': 'Linux-4.15.0-173-generic-x86_64-with-glibc2.10', 'event': 'loaded'}
2022-04-10 21:00:15,279 : MainThread : INFO : Lock 23214945516512 acquired on /home/oborchers/.cache/huggingface/hub/fse--glove-twitter-25.main.5ec1e20fb42502d60c4676070bad354ec71aa9aa/.gitattributes.lock


2022-04-10 21:00:15,756 : MainThread : INFO : Lock 23214945516512 released on /home/oborchers/.cache/huggingface/hub/fse--glove-twitter-25.main.5ec1e20fb42502d60c4676070bad354ec71aa9aa/.gitattributes.lock





2022-04-10 21:00:16,167 : MainThread : INFO : Lock 23216668016496 acquired on /home/oborchers/.cache/huggingface/hub/fse--glove-twitter-25.main.5ec1e20fb42502d60c4676070bad354ec71aa9aa/README.md.lock


2022-04-10 21:00:16,600 : MainThread : INFO : Lock 23216668016496 released on /home/oborchers/.cache/huggingface/hub/fse--glove-twitter-25.main.5ec1e20fb42502d60c4676070bad354ec71aa9aa/README.md.lock





2022-04-10 21:00:17,005 : MainThread : INFO : Lock 23216668014432 acquired on /home/oborchers/.cache/huggingface/hub/fse--glove-twitter-25.main.5ec1e20fb42502d60c4676070bad354ec71aa9aa/glove-twitter-25.model.lock


2022-04-10 21:00:20,281 : MainThread : INFO : Lock 23216668014432 released on /home/oborchers/.cache/huggingface/hub/fse--glove-twitter-25.main.5ec1e20fb42502d60c4676070bad354ec71aa9aa/glove-twitter-25.model.lock





2022-04-10 21:00:20,688 : MainThread : INFO : Lock 23216664898048 acquired on /home/oborchers/.cache/huggingface/hub/fse--glove-twitter-25.main.5ec1e20fb42502d60c4676070bad354ec71aa9aa/glove-twitter-25.model.vectors.npy.lock


2022-04-10 21:00:24,281 : MainThread : INFO : Lock 23216664898048 released on /home/oborchers/.cache/huggingface/hub/fse--glove-twitter-25.main.5ec1e20fb42502d60c4676070bad354ec71aa9aa/glove-twitter-25.model.vectors.npy.lock
2022-04-10 21:00:24,282 : MainThread : INFO : loading Vectors object from /home/oborchers/.cache/huggingface/hub/fse--glove-twitter-25.main.5ec1e20fb42502d60c4676070bad354ec71aa9aa/glove-twitter-25.model





2022-04-10 21:00:27,355 : MainThread : INFO : loading vectors from /home/oborchers/.cache/huggingface/hub/fse--glove-twitter-25.main.5ec1e20fb42502d60c4676070bad354ec71aa9aa/glove-twitter-25.model.vectors.npy with mmap=r
2022-04-10 21:00:27,357 : MainThread : INFO : setting ignored attribute vectors_norm to None
2022-04-10 21:00:34,868 : MainThread : INFO : KeyedVectors lifecycle event {'fname': '/home/oborchers/.cache/huggingface/hub/fse--glove-twitter-25.main.5ec1e20fb42502d60c4676070bad354ec71aa9aa/glove-twitter-25.model', 'datetime': '2022-04-10T21:00:34.868101', 'gensim': '4.0.0', 'python': '3.8.5 (default, Sep  4 2020, 07:30:14) \n[GCC 7.3.0]', 'platform': 'Linux-4.15.0-173-generic-x86_64-with-glibc2.10', 'event': 'loaded'}
2022-04-10 21:00:35,704 : MainThread : INFO : Lock 23214286037584 acquired on /home/oborchers/.cache/huggingface/hub/fse--glove-twitter-50.main.38339c079845641fe59690e5a147fab348a2eb29/.gitattributes.lock


2022-04-10 21:00:36,160 : MainThread : INFO : Lock 23214286037584 released on /home/oborchers/.cache/huggingface/hub/fse--glove-twitter-50.main.38339c079845641fe59690e5a147fab348a2eb29/.gitattributes.lock





2022-04-10 21:00:36,561 : MainThread : INFO : Lock 23214286039216 acquired on /home/oborchers/.cache/huggingface/hub/fse--glove-twitter-50.main.38339c079845641fe59690e5a147fab348a2eb29/README.md.lock


2022-04-10 21:00:37,020 : MainThread : INFO : Lock 23214286039216 released on /home/oborchers/.cache/huggingface/hub/fse--glove-twitter-50.main.38339c079845641fe59690e5a147fab348a2eb29/README.md.lock





2022-04-10 21:00:37,422 : MainThread : INFO : Lock 23214286039552 acquired on /home/oborchers/.cache/huggingface/hub/fse--glove-twitter-50.main.38339c079845641fe59690e5a147fab348a2eb29/glove-twitter-50.model.lock


2022-04-10 21:00:40,656 : MainThread : INFO : Lock 23214286039552 released on /home/oborchers/.cache/huggingface/hub/fse--glove-twitter-50.main.38339c079845641fe59690e5a147fab348a2eb29/glove-twitter-50.model.lock





2022-04-10 21:00:41,055 : MainThread : INFO : Lock 23216412122656 acquired on /home/oborchers/.cache/huggingface/hub/fse--glove-twitter-50.main.38339c079845641fe59690e5a147fab348a2eb29/glove-twitter-50.model.vectors.npy.lock


2022-04-10 21:00:52,641 : MainThread : INFO : Lock 23216412122656 released on /home/oborchers/.cache/huggingface/hub/fse--glove-twitter-50.main.38339c079845641fe59690e5a147fab348a2eb29/glove-twitter-50.model.vectors.npy.lock
2022-04-10 21:00:52,643 : MainThread : INFO : loading Vectors object from /home/oborchers/.cache/huggingface/hub/fse--glove-twitter-50.main.38339c079845641fe59690e5a147fab348a2eb29/glove-twitter-50.model





2022-04-10 21:00:55,986 : MainThread : INFO : loading vectors from /home/oborchers/.cache/huggingface/hub/fse--glove-twitter-50.main.38339c079845641fe59690e5a147fab348a2eb29/glove-twitter-50.model.vectors.npy with mmap=r
2022-04-10 21:00:55,987 : MainThread : INFO : setting ignored attribute vectors_norm to None
2022-04-10 21:01:03,683 : MainThread : INFO : KeyedVectors lifecycle event {'fname': '/home/oborchers/.cache/huggingface/hub/fse--glove-twitter-50.main.38339c079845641fe59690e5a147fab348a2eb29/glove-twitter-50.model', 'datetime': '2022-04-10T21:01:03.683141', 'gensim': '4.0.0', 'python': '3.8.5 (default, Sep  4 2020, 07:30:14) \n[GCC 7.3.0]', 'platform': 'Linux-4.15.0-173-generic-x86_64-with-glibc2.10', 'event': 'loaded'}
2022-04-10 21:01:05,705 : MainThread : INFO : loading Vectors object from /home/oborchers/.cache/huggingface/hub/fse--glove-wiki-gigaword-100.main.3282d5e7c5e979c2411ba9703d63a46243a2047e/glove-wiki-gigaword-100.model
2022-04-10 21:01:06,823 : MainThread : INF

2022-04-10 21:01:10,679 : MainThread : INFO : Lock 23216694088704 released on /home/oborchers/.cache/huggingface/hub/fse--glove-wiki-gigaword-200.main.96a689f1f194ddd2615e41c852396c1fb50e5882/.gitattributes.lock





2022-04-10 21:01:11,078 : MainThread : INFO : Lock 23216694091584 acquired on /home/oborchers/.cache/huggingface/hub/fse--glove-wiki-gigaword-200.main.96a689f1f194ddd2615e41c852396c1fb50e5882/README.md.lock


2022-04-10 21:01:11,519 : MainThread : INFO : Lock 23216694091584 released on /home/oborchers/.cache/huggingface/hub/fse--glove-wiki-gigaword-200.main.96a689f1f194ddd2615e41c852396c1fb50e5882/README.md.lock





2022-04-10 21:01:11,935 : MainThread : INFO : Lock 23216412120064 acquired on /home/oborchers/.cache/huggingface/hub/fse--glove-wiki-gigaword-200.main.96a689f1f194ddd2615e41c852396c1fb50e5882/glove-wiki-gigaword-200.model.lock


2022-04-10 21:01:13,663 : MainThread : INFO : Lock 23216412120064 released on /home/oborchers/.cache/huggingface/hub/fse--glove-wiki-gigaword-200.main.96a689f1f194ddd2615e41c852396c1fb50e5882/glove-wiki-gigaword-200.model.lock





2022-04-10 21:01:14,063 : MainThread : INFO : Lock 23217094556640 acquired on /home/oborchers/.cache/huggingface/hub/fse--glove-wiki-gigaword-200.main.96a689f1f194ddd2615e41c852396c1fb50e5882/glove-wiki-gigaword-200.model.vectors.npy.lock


2022-04-10 21:01:24,311 : MainThread : INFO : Lock 23217094556640 released on /home/oborchers/.cache/huggingface/hub/fse--glove-wiki-gigaword-200.main.96a689f1f194ddd2615e41c852396c1fb50e5882/glove-wiki-gigaword-200.model.vectors.npy.lock
2022-04-10 21:01:24,312 : MainThread : INFO : loading Vectors object from /home/oborchers/.cache/huggingface/hub/fse--glove-wiki-gigaword-200.main.96a689f1f194ddd2615e41c852396c1fb50e5882/glove-wiki-gigaword-200.model





2022-04-10 21:01:25,958 : MainThread : INFO : loading vectors from /home/oborchers/.cache/huggingface/hub/fse--glove-wiki-gigaword-200.main.96a689f1f194ddd2615e41c852396c1fb50e5882/glove-wiki-gigaword-200.model.vectors.npy with mmap=r
2022-04-10 21:01:25,960 : MainThread : INFO : setting ignored attribute vectors_norm to None
2022-04-10 21:01:28,568 : MainThread : INFO : KeyedVectors lifecycle event {'fname': '/home/oborchers/.cache/huggingface/hub/fse--glove-wiki-gigaword-200.main.96a689f1f194ddd2615e41c852396c1fb50e5882/glove-wiki-gigaword-200.model', 'datetime': '2022-04-10T21:01:28.568647', 'gensim': '4.0.0', 'python': '3.8.5 (default, Sep  4 2020, 07:30:14) \n[GCC 7.3.0]', 'platform': 'Linux-4.15.0-173-generic-x86_64-with-glibc2.10', 'event': 'loaded'}
2022-04-10 21:01:29,401 : MainThread : INFO : Lock 23214988651728 acquired on /home/oborchers/.cache/huggingface/hub/fse--glove-wiki-gigaword-300.main.242f9d6f62200e8b1a2aedfc22e4d673c0549add/.gitattributes.lock


2022-04-10 21:01:29,852 : MainThread : INFO : Lock 23214988651728 released on /home/oborchers/.cache/huggingface/hub/fse--glove-wiki-gigaword-300.main.242f9d6f62200e8b1a2aedfc22e4d673c0549add/.gitattributes.lock





2022-04-10 21:01:30,258 : MainThread : INFO : Lock 23214988650576 acquired on /home/oborchers/.cache/huggingface/hub/fse--glove-wiki-gigaword-300.main.242f9d6f62200e8b1a2aedfc22e4d673c0549add/README.md.lock


2022-04-10 21:01:30,719 : MainThread : INFO : Lock 23214988650576 released on /home/oborchers/.cache/huggingface/hub/fse--glove-wiki-gigaword-300.main.242f9d6f62200e8b1a2aedfc22e4d673c0549add/README.md.lock





2022-04-10 21:01:31,125 : MainThread : INFO : Lock 23214988649568 acquired on /home/oborchers/.cache/huggingface/hub/fse--glove-wiki-gigaword-300.main.242f9d6f62200e8b1a2aedfc22e4d673c0549add/glove-wiki-gigaword-300.model.lock


2022-04-10 21:01:32,894 : MainThread : INFO : Lock 23214988649568 released on /home/oborchers/.cache/huggingface/hub/fse--glove-wiki-gigaword-300.main.242f9d6f62200e8b1a2aedfc22e4d673c0549add/glove-wiki-gigaword-300.model.lock





2022-04-10 21:01:33,307 : MainThread : INFO : Lock 23214988649760 acquired on /home/oborchers/.cache/huggingface/hub/fse--glove-wiki-gigaword-300.main.242f9d6f62200e8b1a2aedfc22e4d673c0549add/glove-wiki-gigaword-300.model.vectors.npy.lock


2022-04-10 21:01:44,154 : MainThread : INFO : Lock 23214988649760 released on /home/oborchers/.cache/huggingface/hub/fse--glove-wiki-gigaword-300.main.242f9d6f62200e8b1a2aedfc22e4d673c0549add/glove-wiki-gigaword-300.model.vectors.npy.lock
2022-04-10 21:01:44,155 : MainThread : INFO : loading Vectors object from /home/oborchers/.cache/huggingface/hub/fse--glove-wiki-gigaword-300.main.242f9d6f62200e8b1a2aedfc22e4d673c0549add/glove-wiki-gigaword-300.model





2022-04-10 21:01:45,835 : MainThread : INFO : loading vectors from /home/oborchers/.cache/huggingface/hub/fse--glove-wiki-gigaword-300.main.242f9d6f62200e8b1a2aedfc22e4d673c0549add/glove-wiki-gigaword-300.model.vectors.npy with mmap=r
2022-04-10 21:01:45,836 : MainThread : INFO : setting ignored attribute vectors_norm to None
2022-04-10 21:01:48,397 : MainThread : INFO : KeyedVectors lifecycle event {'fname': '/home/oborchers/.cache/huggingface/hub/fse--glove-wiki-gigaword-300.main.242f9d6f62200e8b1a2aedfc22e4d673c0549add/glove-wiki-gigaword-300.model', 'datetime': '2022-04-10T21:01:48.397775', 'gensim': '4.0.0', 'python': '3.8.5 (default, Sep  4 2020, 07:30:14) \n[GCC 7.3.0]', 'platform': 'Linux-4.15.0-173-generic-x86_64-with-glibc2.10', 'event': 'loaded'}
2022-04-10 21:01:50,442 : MainThread : INFO : loading Vectors object from /home/oborchers/.cache/huggingface/hub/fse--glove-wiki-gigaword-50.main.d2d3bc131d1c28de59b055d6724c742bda902bcf/glove-wiki-gigaword-50.model
2022-04-10 21:01

2022-04-10 21:01:56,256 : MainThread : INFO : Lock 23216929418352 released on /home/oborchers/.cache/huggingface/hub/fse--paragram-25.main.d27454408fa98c7bf128e58602f66775d23d532c/.gitattributes.lock





2022-04-10 21:01:56,669 : MainThread : INFO : Lock 23216664896272 acquired on /home/oborchers/.cache/huggingface/hub/fse--paragram-25.main.d27454408fa98c7bf128e58602f66775d23d532c/.ipynb_checkpoints/README-checkpoint.md.lock


2022-04-10 21:01:57,105 : MainThread : INFO : Lock 23216664896272 released on /home/oborchers/.cache/huggingface/hub/fse--paragram-25.main.d27454408fa98c7bf128e58602f66775d23d532c/.ipynb_checkpoints/README-checkpoint.md.lock





2022-04-10 21:01:57,501 : MainThread : INFO : Lock 23216664898384 acquired on /home/oborchers/.cache/huggingface/hub/fse--paragram-25.main.d27454408fa98c7bf128e58602f66775d23d532c/README.md.lock


2022-04-10 21:01:57,949 : MainThread : INFO : Lock 23216664898384 released on /home/oborchers/.cache/huggingface/hub/fse--paragram-25.main.d27454408fa98c7bf128e58602f66775d23d532c/README.md.lock





2022-04-10 21:01:58,357 : MainThread : INFO : Lock 23216664899296 acquired on /home/oborchers/.cache/huggingface/hub/fse--paragram-25.main.d27454408fa98c7bf128e58602f66775d23d532c/paragram-25.model.lock


2022-04-10 21:02:00,634 : MainThread : INFO : Lock 23216664899296 released on /home/oborchers/.cache/huggingface/hub/fse--paragram-25.main.d27454408fa98c7bf128e58602f66775d23d532c/paragram-25.model.lock
2022-04-10 21:02:00,636 : MainThread : INFO : loading Vectors object from /home/oborchers/.cache/huggingface/hub/fse--paragram-25.main.d27454408fa98c7bf128e58602f66775d23d532c/paragram-25.model
2022-04-10 21:02:01,210 : MainThread : INFO : setting ignored attribute vectors_norm to None





2022-04-10 21:02:01,849 : MainThread : INFO : KeyedVectors lifecycle event {'fname': '/home/oborchers/.cache/huggingface/hub/fse--paragram-25.main.d27454408fa98c7bf128e58602f66775d23d532c/paragram-25.model', 'datetime': '2022-04-10T21:02:01.849070', 'gensim': '4.0.0', 'python': '3.8.5 (default, Sep  4 2020, 07:30:14) \n[GCC 7.3.0]', 'platform': 'Linux-4.15.0-173-generic-x86_64-with-glibc2.10', 'event': 'loaded'}
2022-04-10 21:02:02,660 : MainThread : INFO : Lock 23216850738144 acquired on /home/oborchers/.cache/huggingface/hub/fse--paragram-300-sl999.main.d16350f324c00fadb0f7ed05f8a9df130d950aab/.gitattributes.lock


2022-04-10 21:02:03,115 : MainThread : INFO : Lock 23216850738144 released on /home/oborchers/.cache/huggingface/hub/fse--paragram-300-sl999.main.d16350f324c00fadb0f7ed05f8a9df130d950aab/.gitattributes.lock





2022-04-10 21:02:03,523 : MainThread : INFO : Lock 23216850738912 acquired on /home/oborchers/.cache/huggingface/hub/fse--paragram-300-sl999.main.d16350f324c00fadb0f7ed05f8a9df130d950aab/README.md.lock


2022-04-10 21:02:03,953 : MainThread : INFO : Lock 23216850738912 released on /home/oborchers/.cache/huggingface/hub/fse--paragram-300-sl999.main.d16350f324c00fadb0f7ed05f8a9df130d950aab/README.md.lock





2022-04-10 21:02:04,349 : MainThread : INFO : Lock 23216850737424 acquired on /home/oborchers/.cache/huggingface/hub/fse--paragram-300-sl999.main.d16350f324c00fadb0f7ed05f8a9df130d950aab/paragram-300-sl999.model.lock


2022-04-10 21:02:08,591 : MainThread : INFO : Lock 23216850737424 released on /home/oborchers/.cache/huggingface/hub/fse--paragram-300-sl999.main.d16350f324c00fadb0f7ed05f8a9df130d950aab/paragram-300-sl999.model.lock





2022-04-10 21:02:09,005 : MainThread : INFO : Lock 23216947652304 acquired on /home/oborchers/.cache/huggingface/hub/fse--paragram-300-sl999.main.d16350f324c00fadb0f7ed05f8a9df130d950aab/paragram-300-sl999.model.vectors.npy.lock


2022-04-10 21:02:57,092 : MainThread : INFO : Lock 23216947652304 released on /home/oborchers/.cache/huggingface/hub/fse--paragram-300-sl999.main.d16350f324c00fadb0f7ed05f8a9df130d950aab/paragram-300-sl999.model.vectors.npy.lock
2022-04-10 21:02:57,094 : MainThread : INFO : loading Vectors object from /home/oborchers/.cache/huggingface/hub/fse--paragram-300-sl999.main.d16350f324c00fadb0f7ed05f8a9df130d950aab/paragram-300-sl999.model





2022-04-10 21:03:02,493 : MainThread : INFO : loading vectors from /home/oborchers/.cache/huggingface/hub/fse--paragram-300-sl999.main.d16350f324c00fadb0f7ed05f8a9df130d950aab/paragram-300-sl999.model.vectors.npy with mmap=r
2022-04-10 21:03:02,496 : MainThread : INFO : setting ignored attribute vectors_norm to None
2022-04-10 21:03:13,183 : MainThread : INFO : KeyedVectors lifecycle event {'fname': '/home/oborchers/.cache/huggingface/hub/fse--paragram-300-sl999.main.d16350f324c00fadb0f7ed05f8a9df130d950aab/paragram-300-sl999.model', 'datetime': '2022-04-10T21:03:13.183333', 'gensim': '4.0.0', 'python': '3.8.5 (default, Sep  4 2020, 07:30:14) \n[GCC 7.3.0]', 'platform': 'Linux-4.15.0-173-generic-x86_64-with-glibc2.10', 'event': 'loaded'}
2022-04-10 21:03:14,051 : MainThread : INFO : Lock 23214977322000 acquired on /home/oborchers/.cache/huggingface/hub/fse--paragram-300-ws353.main.cda1e084a44a24a769d595e5b6caf7ee7a8500b5/.gitattributes.lock


2022-04-10 21:03:14,525 : MainThread : INFO : Lock 23214977322000 released on /home/oborchers/.cache/huggingface/hub/fse--paragram-300-ws353.main.cda1e084a44a24a769d595e5b6caf7ee7a8500b5/.gitattributes.lock





2022-04-10 21:03:14,935 : MainThread : INFO : Lock 23214977320608 acquired on /home/oborchers/.cache/huggingface/hub/fse--paragram-300-ws353.main.cda1e084a44a24a769d595e5b6caf7ee7a8500b5/README.md.lock


2022-04-10 21:03:15,367 : MainThread : INFO : Lock 23214977320608 released on /home/oborchers/.cache/huggingface/hub/fse--paragram-300-ws353.main.cda1e084a44a24a769d595e5b6caf7ee7a8500b5/README.md.lock





2022-04-10 21:03:15,771 : MainThread : INFO : Lock 23214977321376 acquired on /home/oborchers/.cache/huggingface/hub/fse--paragram-300-ws353.main.cda1e084a44a24a769d595e5b6caf7ee7a8500b5/paragram-300-ws353.model.lock


2022-04-10 21:03:19,348 : MainThread : INFO : Lock 23214977321376 released on /home/oborchers/.cache/huggingface/hub/fse--paragram-300-ws353.main.cda1e084a44a24a769d595e5b6caf7ee7a8500b5/paragram-300-ws353.model.lock





2022-04-10 21:03:19,763 : MainThread : INFO : Lock 23216828244176 acquired on /home/oborchers/.cache/huggingface/hub/fse--paragram-300-ws353.main.cda1e084a44a24a769d595e5b6caf7ee7a8500b5/paragram-300-ws353.model.vectors.npy.lock


2022-04-10 21:04:38,000 : MainThread : INFO : Lock 23216828244176 released on /home/oborchers/.cache/huggingface/hub/fse--paragram-300-ws353.main.cda1e084a44a24a769d595e5b6caf7ee7a8500b5/paragram-300-ws353.model.vectors.npy.lock
2022-04-10 21:04:38,002 : MainThread : INFO : loading Vectors object from /home/oborchers/.cache/huggingface/hub/fse--paragram-300-ws353.main.cda1e084a44a24a769d595e5b6caf7ee7a8500b5/paragram-300-ws353.model





2022-04-10 21:04:43,718 : MainThread : INFO : loading vectors from /home/oborchers/.cache/huggingface/hub/fse--paragram-300-ws353.main.cda1e084a44a24a769d595e5b6caf7ee7a8500b5/paragram-300-ws353.model.vectors.npy with mmap=r
2022-04-10 21:04:43,720 : MainThread : INFO : setting ignored attribute vectors_norm to None
2022-04-10 21:04:54,408 : MainThread : INFO : KeyedVectors lifecycle event {'fname': '/home/oborchers/.cache/huggingface/hub/fse--paragram-300-ws353.main.cda1e084a44a24a769d595e5b6caf7ee7a8500b5/paragram-300-ws353.model', 'datetime': '2022-04-10T21:04:54.408109', 'gensim': '4.0.0', 'python': '3.8.5 (default, Sep  4 2020, 07:30:14) \n[GCC 7.3.0]', 'platform': 'Linux-4.15.0-173-generic-x86_64-with-glibc2.10', 'event': 'loaded'}
2022-04-10 21:04:55,242 : MainThread : INFO : Lock 23210948017504 acquired on /home/oborchers/.cache/huggingface/hub/fse--paranmt-300.main.ae7612b27a2516b44a42bbc148f9936332d30847/.gitattributes.lock


2022-04-10 21:04:55,735 : MainThread : INFO : Lock 23210948017504 released on /home/oborchers/.cache/huggingface/hub/fse--paranmt-300.main.ae7612b27a2516b44a42bbc148f9936332d30847/.gitattributes.lock





2022-04-10 21:04:56,133 : MainThread : INFO : Lock 23216947650816 acquired on /home/oborchers/.cache/huggingface/hub/fse--paranmt-300.main.ae7612b27a2516b44a42bbc148f9936332d30847/README.md.lock


2022-04-10 21:04:56,573 : MainThread : INFO : Lock 23216947650816 released on /home/oborchers/.cache/huggingface/hub/fse--paranmt-300.main.ae7612b27a2516b44a42bbc148f9936332d30847/README.md.lock





2022-04-10 21:04:56,987 : MainThread : INFO : Lock 23216947652304 acquired on /home/oborchers/.cache/huggingface/hub/fse--paranmt-300.main.ae7612b27a2516b44a42bbc148f9936332d30847/paranmt-300.model.lock


2022-04-10 21:04:58,204 : MainThread : INFO : Lock 23216947652304 released on /home/oborchers/.cache/huggingface/hub/fse--paranmt-300.main.ae7612b27a2516b44a42bbc148f9936332d30847/paranmt-300.model.lock





2022-04-10 21:04:58,601 : MainThread : INFO : Lock 23216517341584 acquired on /home/oborchers/.cache/huggingface/hub/fse--paranmt-300.main.ae7612b27a2516b44a42bbc148f9936332d30847/paranmt-300.model.vectors.npy.lock


2022-04-10 21:05:02,117 : MainThread : INFO : Lock 23216517341584 released on /home/oborchers/.cache/huggingface/hub/fse--paranmt-300.main.ae7612b27a2516b44a42bbc148f9936332d30847/paranmt-300.model.vectors.npy.lock
2022-04-10 21:05:02,119 : MainThread : INFO : loading Vectors object from /home/oborchers/.cache/huggingface/hub/fse--paranmt-300.main.ae7612b27a2516b44a42bbc148f9936332d30847/paranmt-300.model
2022-04-10 21:05:02,277 : MainThread : INFO : loading vectors from /home/oborchers/.cache/huggingface/hub/fse--paranmt-300.main.ae7612b27a2516b44a42bbc148f9936332d30847/paranmt-300.model.vectors.npy with mmap=r
2022-04-10 21:05:02,279 : MainThread : INFO : setting ignored attribute vectors_norm to None





2022-04-10 21:05:02,744 : MainThread : INFO : KeyedVectors lifecycle event {'fname': '/home/oborchers/.cache/huggingface/hub/fse--paranmt-300.main.ae7612b27a2516b44a42bbc148f9936332d30847/paranmt-300.model', 'datetime': '2022-04-10T21:05:02.744504', 'gensim': '4.0.0', 'python': '3.8.5 (default, Sep  4 2020, 07:30:14) \n[GCC 7.3.0]', 'platform': 'Linux-4.15.0-173-generic-x86_64-with-glibc2.10', 'event': 'loaded'}
2022-04-10 21:05:03,577 : MainThread : INFO : Lock 23208651680784 acquired on /home/oborchers/.cache/huggingface/hub/fse--word2vec-google-news-300.main.528f381952a0b7d777bb4a611c4a43f588d48994/.gitattributes.lock


2022-04-10 21:05:04,013 : MainThread : INFO : Lock 23208651680784 released on /home/oborchers/.cache/huggingface/hub/fse--word2vec-google-news-300.main.528f381952a0b7d777bb4a611c4a43f588d48994/.gitattributes.lock





2022-04-10 21:05:04,413 : MainThread : INFO : Lock 23208651681552 acquired on /home/oborchers/.cache/huggingface/hub/fse--word2vec-google-news-300.main.528f381952a0b7d777bb4a611c4a43f588d48994/README.md.lock


2022-04-10 21:05:04,859 : MainThread : INFO : Lock 23208651681552 released on /home/oborchers/.cache/huggingface/hub/fse--word2vec-google-news-300.main.528f381952a0b7d777bb4a611c4a43f588d48994/README.md.lock





2022-04-10 21:05:05,272 : MainThread : INFO : Lock 23216466616128 acquired on /home/oborchers/.cache/huggingface/hub/fse--word2vec-google-news-300.main.528f381952a0b7d777bb4a611c4a43f588d48994/word2vec-google-news-300.model.lock


2022-04-10 21:05:10,440 : MainThread : INFO : Lock 23216466616128 released on /home/oborchers/.cache/huggingface/hub/fse--word2vec-google-news-300.main.528f381952a0b7d777bb4a611c4a43f588d48994/word2vec-google-news-300.model.lock





2022-04-10 21:05:10,839 : MainThread : INFO : Lock 23216828246816 acquired on /home/oborchers/.cache/huggingface/hub/fse--word2vec-google-news-300.main.528f381952a0b7d777bb4a611c4a43f588d48994/word2vec-google-news-300.model.vectors.npy.lock


2022-04-10 21:06:32,703 : MainThread : INFO : Lock 23216828246816 released on /home/oborchers/.cache/huggingface/hub/fse--word2vec-google-news-300.main.528f381952a0b7d777bb4a611c4a43f588d48994/word2vec-google-news-300.model.vectors.npy.lock
2022-04-10 21:06:32,705 : MainThread : INFO : loading Vectors object from /home/oborchers/.cache/huggingface/hub/fse--word2vec-google-news-300.main.528f381952a0b7d777bb4a611c4a43f588d48994/word2vec-google-news-300.model





2022-04-10 21:06:42,794 : MainThread : INFO : loading vectors from /home/oborchers/.cache/huggingface/hub/fse--word2vec-google-news-300.main.528f381952a0b7d777bb4a611c4a43f588d48994/word2vec-google-news-300.model.vectors.npy with mmap=r
2022-04-10 21:06:42,795 : MainThread : INFO : setting ignored attribute vectors_norm to None
2022-04-10 21:07:01,547 : MainThread : INFO : KeyedVectors lifecycle event {'fname': '/home/oborchers/.cache/huggingface/hub/fse--word2vec-google-news-300.main.528f381952a0b7d777bb4a611c4a43f588d48994/word2vec-google-news-300.model', 'datetime': '2022-04-10T21:07:01.547760', 'gensim': '4.0.0', 'python': '3.8.5 (default, Sep  4 2020, 07:30:14) \n[GCC 7.3.0]', 'platform': 'Linux-4.15.0-173-generic-x86_64-with-glibc2.10', 'event': 'loaded'}
2022-04-10 21:07:02,386 : MainThread : INFO : Lock 23208726395920 acquired on /home/oborchers/.cache/huggingface/hub/fse--fasttext-crawl-subwords-300.main.5db65694a7b3fde5a4f1a4c72ce96a25b931692d/.gitattributes.lock


2022-04-10 21:07:02,918 : MainThread : INFO : Lock 23208726395920 released on /home/oborchers/.cache/huggingface/hub/fse--fasttext-crawl-subwords-300.main.5db65694a7b3fde5a4f1a4c72ce96a25b931692d/.gitattributes.lock





2022-04-10 21:07:03,315 : MainThread : INFO : Lock 23208726394096 acquired on /home/oborchers/.cache/huggingface/hub/fse--fasttext-crawl-subwords-300.main.5db65694a7b3fde5a4f1a4c72ce96a25b931692d/README.md.lock


2022-04-10 21:07:03,742 : MainThread : INFO : Lock 23208726394096 released on /home/oborchers/.cache/huggingface/hub/fse--fasttext-crawl-subwords-300.main.5db65694a7b3fde5a4f1a4c72ce96a25b931692d/README.md.lock





2022-04-10 21:07:04,156 : MainThread : INFO : Lock 23216517344896 acquired on /home/oborchers/.cache/huggingface/hub/fse--fasttext-crawl-subwords-300.main.5db65694a7b3fde5a4f1a4c72ce96a25b931692d/fasttext-crawl-subwords-300.model.lock


2022-04-10 21:07:07,990 : MainThread : INFO : Lock 23216517344896 released on /home/oborchers/.cache/huggingface/hub/fse--fasttext-crawl-subwords-300.main.5db65694a7b3fde5a4f1a4c72ce96a25b931692d/fasttext-crawl-subwords-300.model.lock





2022-04-10 21:07:08,405 : MainThread : INFO : Lock 23216517343600 acquired on /home/oborchers/.cache/huggingface/hub/fse--fasttext-crawl-subwords-300.main.5db65694a7b3fde5a4f1a4c72ce96a25b931692d/fasttext-crawl-subwords-300.model.vectors.npy.lock


2022-04-10 21:08:24,510 : MainThread : INFO : Lock 23216517343600 released on /home/oborchers/.cache/huggingface/hub/fse--fasttext-crawl-subwords-300.main.5db65694a7b3fde5a4f1a4c72ce96a25b931692d/fasttext-crawl-subwords-300.model.vectors.npy.lock





2022-04-10 21:08:24,947 : MainThread : INFO : Lock 23217094152880 acquired on /home/oborchers/.cache/huggingface/hub/fse--fasttext-crawl-subwords-300.main.5db65694a7b3fde5a4f1a4c72ce96a25b931692d/fasttext-crawl-subwords-300.model.vectors_ngrams.npy.lock


2022-04-10 21:09:57,606 : MainThread : INFO : Lock 23217094152880 released on /home/oborchers/.cache/huggingface/hub/fse--fasttext-crawl-subwords-300.main.5db65694a7b3fde5a4f1a4c72ce96a25b931692d/fasttext-crawl-subwords-300.model.vectors_ngrams.npy.lock





2022-04-10 21:09:58,069 : MainThread : INFO : Lock 23216517341584 acquired on /home/oborchers/.cache/huggingface/hub/fse--fasttext-crawl-subwords-300.main.5db65694a7b3fde5a4f1a4c72ce96a25b931692d/fasttext-crawl-subwords-300.model.vectors_vocab.npy.lock


2022-04-10 21:11:22,020 : MainThread : INFO : Lock 23216517341584 released on /home/oborchers/.cache/huggingface/hub/fse--fasttext-crawl-subwords-300.main.5db65694a7b3fde5a4f1a4c72ce96a25b931692d/fasttext-crawl-subwords-300.model.vectors_vocab.npy.lock
2022-04-10 21:11:22,022 : MainThread : INFO : loading FTVectors object from /home/oborchers/.cache/huggingface/hub/fse--fasttext-crawl-subwords-300.main.5db65694a7b3fde5a4f1a4c72ce96a25b931692d/fasttext-crawl-subwords-300.model





2022-04-10 21:11:29,456 : MainThread : INFO : loading vectors from /home/oborchers/.cache/huggingface/hub/fse--fasttext-crawl-subwords-300.main.5db65694a7b3fde5a4f1a4c72ce96a25b931692d/fasttext-crawl-subwords-300.model.vectors.npy with mmap=r
2022-04-10 21:11:29,457 : MainThread : INFO : loading vectors_vocab from /home/oborchers/.cache/huggingface/hub/fse--fasttext-crawl-subwords-300.main.5db65694a7b3fde5a4f1a4c72ce96a25b931692d/fasttext-crawl-subwords-300.model.vectors_vocab.npy with mmap=r
2022-04-10 21:11:29,459 : MainThread : INFO : loading vectors_ngrams from /home/oborchers/.cache/huggingface/hub/fse--fasttext-crawl-subwords-300.main.5db65694a7b3fde5a4f1a4c72ce96a25b931692d/fasttext-crawl-subwords-300.model.vectors_ngrams.npy with mmap=r
2022-04-10 21:11:29,460 : MainThread : INFO : setting ignored attribute vectors_norm to None
2022-04-10 21:11:29,461 : MainThread : INFO : setting ignored attribute vectors_vocab_norm to None
2022-04-10 21:11:29,461 : MainThread : INFO : setting

In [8]:
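import itertools

# Each entry maps a label to (model class, extra keyword arguments).
algos = {
    "CBOW": (Average, {}),
    "SIF-10": (SIF, {"components": 10}),
    "uSIF": (uSIF, {"length": 11}),
}
# Every (algorithm, vectors) pair that will be evaluated.
combinations = list(itertools.product(algos.keys(), vectors.keys()))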
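
To make the evaluation grid concrete, here is a small sketch of how `itertools.product` pairs every algorithm label with every vectors name. The two short lists are stand-ins for `algos.keys()` and `vectors.keys()`, not the full set used in this notebook.

```python
import itertools

# Stand-in labels; the notebook uses algos.keys() and vectors.keys() instead.
algo_names = ["CBOW", "SIF-10", "uSIF"]
vector_names = ["glove-twitter-25", "paranmt-300"]

# product() varies the last iterable fastest, so all vectors are visited
# for one algorithm before moving on to the next algorithm.
grid = list(itertools.product(algo_names, vector_names))

for algo, vecs in grid:
    print(algo, vecs)
```

This ordering is why the training logs list every CBOW run first, followed by the SIF-10 runs.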

## Computation of the results for the STS benchmark

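We can now compute the STS benchmark score for every combination of algorithm and word vectors. The score is the Pearson correlation between the gold similarity ratings and the model's sentence similarities, scaled by 100.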
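
As a rough illustration of what `pearsonr(...)[0]` returns in the cells that follow, the sketch below computes the same coefficient by hand. The `pearson` helper and the predicted values are hypothetical and serve only as an example; the notebook itself relies on `scipy.stats.pearsonr`.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Gold ratings on the 0-5 STS scale and made-up model similarities.
gold = [2.5, 3.6, 5.0, 4.2, 1.5]
predicted = [0.61, 0.70, 0.95, 0.88, 0.40]

# Scale to a percentage, as done for the benchmark scores.
score = round(pearson(gold, predicted) * 100, 2)
print(f"{score:2.2f}")
```

Because the coefficient is invariant to shifting and positive rescaling, similarities in [-1, 1] can be compared directly with ratings on the 0-5 scale.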

In [9]:
# Compute the similarity of every sentence pair in the STS dataset.
# task_length is the number of pairs: sentence i is compared with
# sentence i + task_length, its counterpart in the second half of the index.
def compute_similarities(task_length, model):
    sims = []
    for i, j in zip(range(task_length), range(task_length, 2 * task_length)):
        sims.append(model.sv.similarity(i, j))
    return sims
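
The index arithmetic in `compute_similarities` assumes that the indexed sentence list holds all first sentences followed by all second sentences, so pair `i` consists of items `i` and `i + task_length`. The sketch below illustrates that layout with toy vectors and a hand-written cosine similarity. The vectors and the `cosine` helper are hypothetical; they only stand in for what `model.sv.similarity` is assumed to compute.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (hypothetical helper)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy "sentence vectors": first sentences of each pair, then second sentences.
first = [[1.0, 0.0], [0.0, 2.0]]
second = [[2.0, 0.0], [3.0, 0.0]]
stacked = first + second
task_length = len(first)

# Pair i consists of stacked[i] and stacked[i + task_length].
sims = [cosine(stacked[i], stacked[i + task_length]) for i in range(task_length)]
print(sims)
```

The first toy pair points in the same direction and the second pair is orthogonal, which makes the pairing easy to check by eye.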

In [10]:
from collections import defaultdict
results = defaultdict(list)

for algo, name in combinations:
    class_obj, args = algos[algo]
    vec = vectors[name]
    m = class_obj(vec, lang_freq="en", **args)
    m.train(sentences)
    # Pearson correlation with the gold ratings, rounded to four decimals and scaled to percent.
    r = pearsonr(similarities, compute_similarities(task_length, m))[0].round(4) * 100

    results["algo"].append(algo)
    results["vecs"].append(name)

    # Record the first hyperparameter, if any. Separate variable names keep
    # `name` (the vectors key) from being overwritten before it is printed.
    if len(args) > 0:
        param = list(args.keys())[0]
        value = args[param]
        results["params"].append(f"{param}={value}")
    else:
        results["params"].append("")

    results["score"].append(r)
    print(algo, name, f"{r:2.2f}")

2022-04-10 21:11:55,679 : MainThread : INFO : no frequency mode: using wordfreq for estimation of frequency for language: en
2022-04-10 21:11:58,635 : MainThread : INFO : scanning all indexed sentences and their word counts
2022-04-10 21:11:58,960 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2022-04-10 21:11:59,761 : MainThread : INFO : estimated memory for 2758 sentences with 300 dimensions and 999999 vocabulary: 1151 MB (1 GB)
2022-04-10 21:11:59,762 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2022-04-10 21:11:59,799 : MainThread : INFO : begin training
2022-04-10 21:12:00,652 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2022-04-10 21:12:00,653 : MainThread : INFO : training on 2758 effective sentences with 27172 effective words took 0s with 3230 sentences/s
2022-04-10 21:12:00,692 : MainThread : INFO : no frequency mode: using wordfreq for estimation of frequency

CBOW fasttext-wiki-news-subwords-300 26.08


2022-04-10 21:12:04,126 : MainThread : INFO : scanning all indexed sentences and their word counts
2022-04-10 21:12:04,442 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2022-04-10 21:12:05,502 : MainThread : INFO : estimated memory for 2758 sentences with 100 dimensions and 1193514 vocabulary: 460 MB (0 GB)
2022-04-10 21:12:05,504 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2022-04-10 21:12:05,542 : MainThread : INFO : begin training
2022-04-10 21:12:05,995 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2022-04-10 21:12:05,996 : MainThread : INFO : training on 2758 effective sentences with 26828 effective words took 0s with 6079 sentences/s
2022-04-10 21:12:06,025 : MainThread : INFO : no frequency mode: using wordfreq for estimation of frequency for language: en


CBOW glove-twitter-100 33.81


2022-04-10 21:12:09,554 : MainThread : INFO : scanning all indexed sentences and their word counts
2022-04-10 21:12:09,878 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2022-04-10 21:12:10,907 : MainThread : INFO : estimated memory for 2758 sentences with 200 dimensions and 1193514 vocabulary: 917 MB (0 GB)
2022-04-10 21:12:10,909 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2022-04-10 21:12:10,947 : MainThread : INFO : begin training
2022-04-10 21:12:11,401 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2022-04-10 21:12:11,403 : MainThread : INFO : training on 2758 effective sentences with 26828 effective words took 0s with 6055 sentences/s
2022-04-10 21:12:11,434 : MainThread : INFO : no frequency mode: using wordfreq for estimation of frequency for language: en


CBOW glove-twitter-200 34.94


2022-04-10 21:12:14,870 : MainThread : INFO : scanning all indexed sentences and their word counts
2022-04-10 21:12:15,278 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2022-04-10 21:12:16,312 : MainThread : INFO : estimated memory for 2758 sentences with 25 dimensions and 1193514 vocabulary: 118 MB (0 GB)
2022-04-10 21:12:16,314 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2022-04-10 21:12:16,349 : MainThread : INFO : begin training
2022-04-10 21:12:16,785 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2022-04-10 21:12:16,786 : MainThread : INFO : training on 2758 effective sentences with 26828 effective words took 0s with 6306 sentences/s
2022-04-10 21:12:16,816 : MainThread : INFO : no frequency mode: using wordfreq for estimation of frequency for language: en


CBOW glove-twitter-25 26.15


2022-04-10 21:12:20,380 : MainThread : INFO : scanning all indexed sentences and their word counts
2022-04-10 21:12:20,711 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2022-04-10 21:12:21,727 : MainThread : INFO : estimated memory for 2758 sentences with 50 dimensions and 1193514 vocabulary: 232 MB (0 GB)
2022-04-10 21:12:21,728 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2022-04-10 21:12:21,766 : MainThread : INFO : begin training
2022-04-10 21:12:22,207 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2022-04-10 21:12:22,208 : MainThread : INFO : training on 2758 effective sentences with 26828 effective words took 0s with 6236 sentences/s
2022-04-10 21:12:22,239 : MainThread : INFO : no frequency mode: using wordfreq for estimation of frequency for language: en


CBOW glove-twitter-50 30.78


2022-04-10 21:12:23,433 : MainThread : INFO : scanning all indexed sentences and their word counts
2022-04-10 21:12:23,770 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2022-04-10 21:12:24,111 : MainThread : INFO : estimated memory for 2758 sentences with 100 dimensions and 400000 vocabulary: 155 MB (0 GB)
2022-04-10 21:12:24,112 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2022-04-10 21:12:24,132 : MainThread : INFO : begin training
2022-04-10 21:12:24,583 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2022-04-10 21:12:24,584 : MainThread : INFO : training on 2758 effective sentences with 27410 effective words took 0s with 6097 sentences/s
2022-04-10 21:12:24,613 : MainThread : INFO : no frequency mode: using wordfreq for estimation of frequency for language: en


CBOW glove-wiki-gigaword-100 38.12


2022-04-10 21:12:25,798 : MainThread : INFO : scanning all indexed sentences and their word counts
2022-04-10 21:12:26,196 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2022-04-10 21:12:26,554 : MainThread : INFO : estimated memory for 2758 sentences with 200 dimensions and 400000 vocabulary: 308 MB (0 GB)
2022-04-10 21:12:26,556 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2022-04-10 21:12:26,583 : MainThread : INFO : begin training
2022-04-10 21:12:26,953 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2022-04-10 21:12:26,954 : MainThread : INFO : training on 2758 effective sentences with 27410 effective words took 0s with 7427 sentences/s
2022-04-10 21:12:26,986 : MainThread : INFO : no frequency mode: using wordfreq for estimation of frequency for language: en


CBOW glove-wiki-gigaword-200 42.40


2022-04-10 21:12:28,279 : MainThread : INFO : scanning all indexed sentences and their word counts
2022-04-10 21:12:28,704 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2022-04-10 21:12:29,062 : MainThread : INFO : estimated memory for 2758 sentences with 300 dimensions and 400000 vocabulary: 462 MB (0 GB)
2022-04-10 21:12:29,063 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2022-04-10 21:12:29,084 : MainThread : INFO : begin training
2022-04-10 21:12:29,535 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2022-04-10 21:12:29,536 : MainThread : INFO : training on 2758 effective sentences with 27410 effective words took 0s with 6099 sentences/s
2022-04-10 21:12:29,570 : MainThread : INFO : no frequency mode: using wordfreq for estimation of frequency for language: en


CBOW glove-wiki-gigaword-300 44.46


2022-04-10 21:12:30,841 : MainThread : INFO : scanning all indexed sentences and their word counts
2022-04-10 21:12:31,168 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2022-04-10 21:12:31,534 : MainThread : INFO : estimated memory for 2758 sentences with 50 dimensions and 400000 vocabulary: 78 MB (0 GB)
2022-04-10 21:12:31,535 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2022-04-10 21:12:31,561 : MainThread : INFO : begin training
2022-04-10 21:12:32,037 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2022-04-10 21:12:32,039 : MainThread : INFO : training on 2758 effective sentences with 27410 effective words took 0s with 5781 sentences/s
2022-04-10 21:12:32,080 : MainThread : INFO : no frequency mode: using wordfreq for estimation of frequency for language: en


CBOW glove-wiki-gigaword-50 37.47


2022-04-10 21:12:32,465 : MainThread : INFO : scanning all indexed sentences and their word counts
2022-04-10 21:12:32,825 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2022-04-10 21:12:32,912 : MainThread : INFO : estimated memory for 2758 sentences with 25 dimensions and 99958 vocabulary: 10 MB (0 GB)
2022-04-10 21:12:32,914 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2022-04-10 21:12:32,927 : MainThread : INFO : begin training
2022-04-10 21:12:33,386 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2022-04-10 21:12:33,387 : MainThread : INFO : training on 2758 effective sentences with 27038 effective words took 0s with 5991 sentences/s
2022-04-10 21:12:33,419 : MainThread : INFO : no frequency mode: using wordfreq for estimation of frequency for language: en


CBOW paragram-25 40.13


2022-04-10 21:12:38,348 : MainThread : INFO : scanning all indexed sentences and their word counts
2022-04-10 21:12:38,680 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2022-04-10 21:12:40,139 : MainThread : INFO : estimated memory for 2758 sentences with 300 dimensions and 1703756 vocabulary: 1959 MB (1 GB)
2022-04-10 21:12:40,140 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2022-04-10 21:12:40,185 : MainThread : INFO : begin training
2022-04-10 21:12:40,527 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2022-04-10 21:12:40,528 : MainThread : INFO : training on 2758 effective sentences with 27448 effective words took 0s with 8052 sentences/s
2022-04-10 21:12:40,564 : MainThread : INFO : no frequency mode: using wordfreq for estimation of frequency for language: en


CBOW paragram-300-sl999 51.46


2022-04-10 21:12:45,392 : MainThread : INFO : scanning all indexed sentences and their word counts
2022-04-10 21:12:45,808 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2022-04-10 21:12:47,366 : MainThread : INFO : estimated memory for 2758 sentences with 300 dimensions and 1703756 vocabulary: 1959 MB (1 GB)
2022-04-10 21:12:47,368 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2022-04-10 21:12:47,429 : MainThread : INFO : begin training
2022-04-10 21:12:47,816 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2022-04-10 21:12:47,817 : MainThread : INFO : training on 2758 effective sentences with 27448 effective words took 0s with 7100 sentences/s
2022-04-10 21:12:47,853 : MainThread : INFO : no frequency mode: using wordfreq for estimation of frequency for language: en


CBOW paragram-300-ws353 54.72


2022-04-10 21:12:48,135 : MainThread : INFO : scanning all indexed sentences and their word counts
2022-04-10 21:12:48,482 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2022-04-10 21:12:48,548 : MainThread : INFO : estimated memory for 2758 sentences with 300 dimensions and 77224 vocabulary: 91 MB (0 GB)
2022-04-10 21:12:48,550 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2022-04-10 21:12:48,562 : MainThread : INFO : begin training
2022-04-10 21:12:49,009 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2022-04-10 21:12:49,010 : MainThread : INFO : training on 2758 effective sentences with 27439 effective words took 0s with 6163 sentences/s
2022-04-10 21:12:49,047 : MainThread : INFO : no frequency mode: using wordfreq for estimation of frequency for language: en


CBOW paranmt-300 79.82


2022-04-10 21:12:57,961 : MainThread : INFO : scanning all indexed sentences and their word counts
2022-04-10 21:12:58,290 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2022-04-10 21:13:00,987 : MainThread : INFO : estimated memory for 2758 sentences with 300 dimensions and 3000000 vocabulary: 3447 MB (3 GB)
2022-04-10 21:13:00,988 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2022-04-10 21:13:01,078 : MainThread : INFO : begin training
2022-04-10 21:13:01,522 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2022-04-10 21:13:01,523 : MainThread : INFO : training on 2758 effective sentences with 23116 effective words took 0s with 6202 sentences/s
2022-04-10 21:13:01,562 : MainThread : INFO : no frequency mode: using wordfreq for estimation of frequency for language: en


CBOW word2vec-google-news-300 61.54


2022-04-10 21:13:07,465 : MainThread : INFO : scanning all indexed sentences and their word counts
2022-04-10 21:13:07,807 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2022-04-10 21:13:09,600 : MainThread : INFO : estimated memory for 2758 sentences with 300 dimensions and 2000000 vocabulary: 6877 MB (6 GB)
2022-04-10 21:13:09,601 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2022-04-10 21:13:09,694 : MainThread : INFO : begin training
2022-04-10 21:13:10,149 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2022-04-10 21:13:10,150 : MainThread : INFO : training on 2758 effective sentences with 27528 effective words took 0s with 6044 sentences/s
2022-04-10 21:13:10,185 : MainThread : INFO : no frequency mode: using wordfreq for estimation of frequency for language: en


CBOW fasttext-crawl-subwords-300 48.49


2022-04-10 21:13:13,168 : MainThread : INFO : scanning all indexed sentences and their word counts
2022-04-10 21:13:13,491 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2022-04-10 21:13:14,321 : MainThread : INFO : estimated memory for 2758 sentences with 300 dimensions and 999999 vocabulary: 1151 MB (1 GB)
2022-04-10 21:13:14,322 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2022-04-10 21:13:14,335 : MainThread : INFO : pre-computing SIF weights for 999999 words
2022-04-10 21:13:15,743 : MainThread : INFO : begin training
2022-04-10 21:13:16,200 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2022-04-10 21:13:16,271 : MainThread : INFO : computing 10 principal components took 0s
2022-04-10 21:13:16,281 : MainThread : INFO : removing 10 principal components took 0s
2022-04-10 21:13:16,283 : MainThread : INFO : training on 2758 effective sentences with 27172 effective word

SIF-10 fasttext-wiki-news-subwords-300 72.29


2022-04-10 21:13:19,886 : MainThread : INFO : scanning all indexed sentences and their word counts
2022-04-10 21:13:20,221 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2022-04-10 21:13:21,296 : MainThread : INFO : estimated memory for 2758 sentences with 100 dimensions and 1193514 vocabulary: 460 MB (0 GB)
2022-04-10 21:13:21,298 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2022-04-10 21:13:21,308 : MainThread : INFO : pre-computing SIF weights for 1193514 words
2022-04-10 21:13:22,885 : MainThread : INFO : begin training
2022-04-10 21:13:23,321 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2022-04-10 21:13:23,376 : MainThread : INFO : computing 10 principal components took 0s
2022-04-10 21:13:23,382 : MainThread : INFO : removing 10 principal components took 0s
2022-04-10 21:13:23,383 : MainThread : INFO : training on 2758 effective sentences with 26828 effective wor

SIF-10 components 69.65


2022-04-10 21:13:27,048 : MainThread : INFO : scanning all indexed sentences and their word counts
2022-04-10 21:13:27,374 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2022-04-10 21:13:28,484 : MainThread : INFO : estimated memory for 2758 sentences with 200 dimensions and 1193514 vocabulary: 917 MB (0 GB)
2022-04-10 21:13:28,486 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2022-04-10 21:13:28,500 : MainThread : INFO : pre-computing SIF weights for 1193514 words
2022-04-10 21:13:30,148 : MainThread : INFO : begin training
2022-04-10 21:13:30,587 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2022-04-10 21:13:30,645 : MainThread : INFO : computing 10 principal components took 0s
2022-04-10 21:13:30,652 : MainThread : INFO : removing 10 principal components took 0s
2022-04-10 21:13:30,653 : MainThread : INFO : training on 2758 effective sentences with 26828 effective wor

SIF-10 components 71.62


2022-04-10 21:13:34,166 : MainThread : INFO : scanning all indexed sentences and their word counts
2022-04-10 21:13:34,551 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2022-04-10 21:13:35,585 : MainThread : INFO : estimated memory for 2758 sentences with 25 dimensions and 1193514 vocabulary: 118 MB (0 GB)
2022-04-10 21:13:35,586 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2022-04-10 21:13:35,598 : MainThread : INFO : pre-computing SIF weights for 1193514 words
2022-04-10 21:13:37,202 : MainThread : INFO : begin training
2022-04-10 21:13:37,670 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2022-04-10 21:13:37,702 : MainThread : INFO : computing 10 principal components took 0s
2022-04-10 21:13:37,704 : MainThread : INFO : removing 10 principal components took 0s
2022-04-10 21:13:37,705 : MainThread : INFO : training on 2758 effective sentences with 26828 effective word

SIF-10 components 54.16


2022-04-10 21:13:41,419 : MainThread : INFO : scanning all indexed sentences and their word counts
2022-04-10 21:13:41,858 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2022-04-10 21:13:42,942 : MainThread : INFO : estimated memory for 2758 sentences with 50 dimensions and 1193514 vocabulary: 232 MB (0 GB)
2022-04-10 21:13:42,943 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2022-04-10 21:13:42,957 : MainThread : INFO : pre-computing SIF weights for 1193514 words
2022-04-10 21:13:44,559 : MainThread : INFO : begin training
2022-04-10 21:13:44,993 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2022-04-10 21:13:45,032 : MainThread : INFO : computing 10 principal components took 0s
2022-04-10 21:13:45,035 : MainThread : INFO : removing 10 principal components took 0s
2022-04-10 21:13:45,036 : MainThread : INFO : training on 2758 effective sentences with 26828 effective word

SIF-10 components 65.52


2022-04-10 21:13:46,444 : MainThread : INFO : scanning all indexed sentences and their word counts
2022-04-10 21:13:46,859 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2022-04-10 21:13:47,256 : MainThread : INFO : estimated memory for 2758 sentences with 100 dimensions and 400000 vocabulary: 155 MB (0 GB)
2022-04-10 21:13:47,257 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2022-04-10 21:13:47,275 : MainThread : INFO : pre-computing SIF weights for 400000 words
2022-04-10 21:13:47,865 : MainThread : INFO : begin training
2022-04-10 21:13:48,261 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2022-04-10 21:13:48,301 : MainThread : INFO : computing 10 principal components took 0s
2022-04-10 21:13:48,304 : MainThread : INFO : removing 10 principal components took 0s
2022-04-10 21:13:48,306 : MainThread : INFO : training on 2758 effective sentences with 27410 effective words

SIF-10 components 68.34


2022-04-10 21:13:49,669 : MainThread : INFO : scanning all indexed sentences and their word counts
2022-04-10 21:13:50,006 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2022-04-10 21:13:50,414 : MainThread : INFO : estimated memory for 2758 sentences with 200 dimensions and 400000 vocabulary: 308 MB (0 GB)
2022-04-10 21:13:50,415 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2022-04-10 21:13:50,429 : MainThread : INFO : pre-computing SIF weights for 400000 words
2022-04-10 21:13:50,979 : MainThread : INFO : begin training
2022-04-10 21:13:51,412 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2022-04-10 21:13:51,470 : MainThread : INFO : computing 10 principal components took 0s
2022-04-10 21:13:51,475 : MainThread : INFO : removing 10 principal components took 0s
2022-04-10 21:13:51,476 : MainThread : INFO : training on 2758 effective sentences with 27410 effective words

SIF-10 components 70.62


2022-04-10 21:13:52,767 : MainThread : INFO : scanning all indexed sentences and their word counts
2022-04-10 21:13:53,132 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2022-04-10 21:13:53,483 : MainThread : INFO : estimated memory for 2758 sentences with 300 dimensions and 400000 vocabulary: 462 MB (0 GB)
2022-04-10 21:13:53,484 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2022-04-10 21:13:53,497 : MainThread : INFO : pre-computing SIF weights for 400000 words
2022-04-10 21:13:54,037 : MainThread : INFO : begin training
2022-04-10 21:13:54,464 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2022-04-10 21:13:54,522 : MainThread : INFO : computing 10 principal components took 0s
2022-04-10 21:13:54,528 : MainThread : INFO : removing 10 principal components took 0s
2022-04-10 21:13:54,530 : MainThread : INFO : training on 2758 effective sentences with 27410 effective words

SIF-10 components 71.35


2022-04-10 21:13:55,809 : MainThread : INFO : scanning all indexed sentences and their word counts
2022-04-10 21:13:56,212 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2022-04-10 21:13:56,551 : MainThread : INFO : estimated memory for 2758 sentences with 50 dimensions and 400000 vocabulary: 78 MB (0 GB)
2022-04-10 21:13:56,552 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2022-04-10 21:13:56,566 : MainThread : INFO : pre-computing SIF weights for 400000 words
2022-04-10 21:13:57,101 : MainThread : INFO : begin training
2022-04-10 21:13:57,539 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2022-04-10 21:13:57,585 : MainThread : INFO : computing 10 principal components took 0s
2022-04-10 21:13:57,590 : MainThread : INFO : removing 10 principal components took 0s
2022-04-10 21:13:57,591 : MainThread : INFO : training on 2758 effective sentences with 27410 effective words t

SIF-10 components 64.11


2022-04-10 21:13:58,056 : MainThread : INFO : scanning all indexed sentences and their word counts
2022-04-10 21:13:58,385 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2022-04-10 21:13:58,483 : MainThread : INFO : estimated memory for 2758 sentences with 25 dimensions and 99958 vocabulary: 10 MB (0 GB)
2022-04-10 21:13:58,484 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2022-04-10 21:13:58,495 : MainThread : INFO : pre-computing SIF weights for 99958 words
2022-04-10 21:13:58,658 : MainThread : INFO : begin training
2022-04-10 21:13:59,121 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2022-04-10 21:13:59,173 : MainThread : INFO : computing 10 principal components took 0s
2022-04-10 21:13:59,179 : MainThread : INFO : removing 10 principal components took 0s
2022-04-10 21:13:59,180 : MainThread : INFO : training on 2758 effective sentences with 27038 effective words too

SIF-10 components 59.07


2022-04-10 21:14:04,259 : MainThread : INFO : scanning all indexed sentences and their word counts
2022-04-10 21:14:04,598 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2022-04-10 21:14:06,100 : MainThread : INFO : estimated memory for 2758 sentences with 300 dimensions and 1703756 vocabulary: 1959 MB (1 GB)
2022-04-10 21:14:06,101 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2022-04-10 21:14:06,114 : MainThread : INFO : pre-computing SIF weights for 1703756 words
2022-04-10 21:14:08,290 : MainThread : INFO : begin training
2022-04-10 21:14:08,733 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2022-04-10 21:14:08,787 : MainThread : INFO : computing 10 principal components took 0s
2022-04-10 21:14:08,792 : MainThread : INFO : removing 10 principal components took 0s
2022-04-10 21:14:08,793 : MainThread : INFO : training on 2758 effective sentences with 27448 effective wo

SIF-10 components 74.21


2022-04-10 21:14:13,811 : MainThread : INFO : scanning all indexed sentences and their word counts
2022-04-10 21:14:14,145 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2022-04-10 21:14:15,566 : MainThread : INFO : estimated memory for 2758 sentences with 300 dimensions and 1703756 vocabulary: 1959 MB (1 GB)
2022-04-10 21:14:15,568 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2022-04-10 21:14:15,590 : MainThread : INFO : pre-computing SIF weights for 1703756 words
2022-04-10 21:14:17,860 : MainThread : INFO : begin training
2022-04-10 21:14:18,305 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2022-04-10 21:14:18,339 : MainThread : INFO : computing 10 principal components took 0s
2022-04-10 21:14:18,346 : MainThread : INFO : removing 10 principal components took 0s
2022-04-10 21:14:18,347 : MainThread : INFO : training on 2758 effective sentences with 27448 effective wo

SIF-10 components 74.03


2022-04-10 21:14:18,742 : MainThread : INFO : scanning all indexed sentences and their word counts
2022-04-10 21:14:19,121 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2022-04-10 21:14:19,190 : MainThread : INFO : estimated memory for 2758 sentences with 300 dimensions and 77224 vocabulary: 91 MB (0 GB)
2022-04-10 21:14:19,191 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2022-04-10 21:14:19,202 : MainThread : INFO : pre-computing SIF weights for 77224 words
2022-04-10 21:14:19,305 : MainThread : INFO : begin training
2022-04-10 21:14:19,761 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2022-04-10 21:14:19,807 : MainThread : INFO : computing 10 principal components took 0s
2022-04-10 21:14:19,812 : MainThread : INFO : removing 10 principal components took 0s
2022-04-10 21:14:19,813 : MainThread : INFO : training on 2758 effective sentences with 27439 effective words to

SIF-10 components 76.72


2022-04-10 21:14:28,708 : MainThread : INFO : scanning all indexed sentences and their word counts
2022-04-10 21:14:29,029 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2022-04-10 21:14:31,697 : MainThread : INFO : estimated memory for 2758 sentences with 300 dimensions and 3000000 vocabulary: 3447 MB (3 GB)
2022-04-10 21:14:31,698 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2022-04-10 21:14:31,711 : MainThread : INFO : pre-computing SIF weights for 3000000 words
2022-04-10 21:14:35,675 : MainThread : INFO : begin training
2022-04-10 21:14:36,114 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2022-04-10 21:14:36,165 : MainThread : INFO : computing 10 principal components took 0s
2022-04-10 21:14:36,169 : MainThread : INFO : removing 10 principal components took 0s
2022-04-10 21:14:36,170 : MainThread : INFO : training on 2758 effective sentences with 23116 effective wo

SIF-10 components 71.12


2022-04-10 21:14:42,225 : MainThread : INFO : scanning all indexed sentences and their word counts
2022-04-10 21:14:42,659 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2022-04-10 21:14:44,479 : MainThread : INFO : estimated memory for 2758 sentences with 300 dimensions and 2000000 vocabulary: 6877 MB (6 GB)
2022-04-10 21:14:44,481 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2022-04-10 21:14:44,495 : MainThread : INFO : pre-computing SIF weights for 2000000 words
2022-04-10 21:14:47,243 : MainThread : INFO : begin training
2022-04-10 21:14:47,689 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2022-04-10 21:14:47,750 : MainThread : INFO : computing 10 principal components took 0s
2022-04-10 21:14:47,758 : MainThread : INFO : removing 10 principal components took 0s
2022-04-10 21:14:47,761 : MainThread : INFO : training on 2758 effective sentences with 27528 effective wo

SIF-10 components 73.38


2022-04-10 21:14:50,778 : MainThread : INFO : scanning all indexed sentences and their word counts
2022-04-10 21:14:51,104 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2022-04-10 21:14:51,910 : MainThread : INFO : estimated memory for 2758 sentences with 300 dimensions and 999999 vocabulary: 1151 MB (1 GB)
2022-04-10 21:14:51,911 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2022-04-10 21:14:51,926 : MainThread : INFO : pre-computing uSIF weights for 999999 words
2022-04-10 21:14:55,212 : MainThread : INFO : begin training
2022-04-10 21:14:55,686 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2022-04-10 21:14:55,770 : MainThread : INFO : computing 5 principal components took 0s
2022-04-10 21:14:55,774 : MainThread : INFO : removing 5 principal components took 0s
2022-04-10 21:14:55,777 : MainThread : INFO : training on 2758 effective sentences with 27172 effective words

uSIF length 68.63


2022-04-10 21:14:59,355 : MainThread : INFO : scanning all indexed sentences and their word counts
2022-04-10 21:14:59,700 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2022-04-10 21:15:00,811 : MainThread : INFO : estimated memory for 2758 sentences with 100 dimensions and 1193514 vocabulary: 460 MB (0 GB)
2022-04-10 21:15:00,812 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2022-04-10 21:15:00,826 : MainThread : INFO : pre-computing uSIF weights for 1193514 words
2022-04-10 21:15:04,828 : MainThread : INFO : begin training
2022-04-10 21:15:05,214 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2022-04-10 21:15:05,267 : MainThread : INFO : computing 5 principal components took 0s
2022-04-10 21:15:05,274 : MainThread : INFO : removing 5 principal components took 0s
2022-04-10 21:15:05,276 : MainThread : INFO : training on 2758 effective sentences with 26828 effective word

uSIF length 64.13


2022-04-10 21:15:09,049 : MainThread : INFO : scanning all indexed sentences and their word counts
2022-04-10 21:15:09,377 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2022-04-10 21:15:10,437 : MainThread : INFO : estimated memory for 2758 sentences with 200 dimensions and 1193514 vocabulary: 917 MB (0 GB)
2022-04-10 21:15:10,439 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2022-04-10 21:15:10,457 : MainThread : INFO : pre-computing uSIF weights for 1193514 words
2022-04-10 21:15:14,583 : MainThread : INFO : begin training
2022-04-10 21:15:15,070 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2022-04-10 21:15:15,118 : MainThread : INFO : computing 5 principal components took 0s
2022-04-10 21:15:15,122 : MainThread : INFO : removing 5 principal components took 0s
2022-04-10 21:15:15,123 : MainThread : INFO : training on 2758 effective sentences with 26828 effective word

uSIF length 66.67


2022-04-10 21:15:18,756 : MainThread : INFO : scanning all indexed sentences and their word counts
2022-04-10 21:15:19,148 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2022-04-10 21:15:20,235 : MainThread : INFO : estimated memory for 2758 sentences with 25 dimensions and 1193514 vocabulary: 118 MB (0 GB)
2022-04-10 21:15:20,237 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2022-04-10 21:15:20,249 : MainThread : INFO : pre-computing uSIF weights for 1193514 words
2022-04-10 21:15:24,258 : MainThread : INFO : begin training
2022-04-10 21:15:24,723 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2022-04-10 21:15:24,774 : MainThread : INFO : computing 5 principal components took 0s
2022-04-10 21:15:24,779 : MainThread : INFO : removing 5 principal components took 0s
2022-04-10 21:15:24,780 : MainThread : INFO : training on 2758 effective sentences with 26828 effective words

uSIF length 55.06


2022-04-10 21:15:28,479 : MainThread : INFO : scanning all indexed sentences and their word counts
2022-04-10 21:15:28,842 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2022-04-10 21:15:29,904 : MainThread : INFO : estimated memory for 2758 sentences with 50 dimensions and 1193514 vocabulary: 232 MB (0 GB)
2022-04-10 21:15:29,906 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2022-04-10 21:15:29,917 : MainThread : INFO : pre-computing uSIF weights for 1193514 words
2022-04-10 21:15:33,992 : MainThread : INFO : begin training
2022-04-10 21:15:34,462 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2022-04-10 21:15:34,497 : MainThread : INFO : computing 5 principal components took 0s
2022-04-10 21:15:34,500 : MainThread : INFO : removing 5 principal components took 0s
2022-04-10 21:15:34,501 : MainThread : INFO : training on 2758 effective sentences with 26828 effective words

uSIF length 60.41


2022-04-10 21:15:35,796 : MainThread : INFO : scanning all indexed sentences and their word counts
2022-04-10 21:15:36,138 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2022-04-10 21:15:36,493 : MainThread : INFO : estimated memory for 2758 sentences with 100 dimensions and 400000 vocabulary: 155 MB (0 GB)
2022-04-10 21:15:36,495 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2022-04-10 21:15:36,505 : MainThread : INFO : pre-computing uSIF weights for 400000 words
2022-04-10 21:15:37,871 : MainThread : INFO : begin training
2022-04-10 21:15:38,356 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2022-04-10 21:15:38,389 : MainThread : INFO : computing 5 principal components took 0s
2022-04-10 21:15:38,394 : MainThread : INFO : removing 5 principal components took 0s
2022-04-10 21:15:38,395 : MainThread : INFO : training on 2758 effective sentences with 27410 effective words 

uSIF length 65.33


2022-04-10 21:15:39,727 : MainThread : INFO : scanning all indexed sentences and their word counts
2022-04-10 21:15:40,069 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2022-04-10 21:15:40,440 : MainThread : INFO : estimated memory for 2758 sentences with 200 dimensions and 400000 vocabulary: 308 MB (0 GB)
2022-04-10 21:15:40,441 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2022-04-10 21:15:40,461 : MainThread : INFO : pre-computing uSIF weights for 400000 words
2022-04-10 21:15:41,928 : MainThread : INFO : begin training
2022-04-10 21:15:42,292 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2022-04-10 21:15:42,330 : MainThread : INFO : computing 5 principal components took 0s
2022-04-10 21:15:42,333 : MainThread : INFO : removing 5 principal components took 0s
2022-04-10 21:15:42,334 : MainThread : INFO : training on 2758 effective sentences with 27410 effective words 

uSIF length 67.11


2022-04-10 21:15:43,661 : MainThread : INFO : scanning all indexed sentences and their word counts
2022-04-10 21:15:43,980 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2022-04-10 21:15:44,352 : MainThread : INFO : estimated memory for 2758 sentences with 300 dimensions and 400000 vocabulary: 462 MB (0 GB)
2022-04-10 21:15:44,353 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2022-04-10 21:15:44,364 : MainThread : INFO : pre-computing uSIF weights for 400000 words
2022-04-10 21:15:45,705 : MainThread : INFO : begin training
2022-04-10 21:15:46,146 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2022-04-10 21:15:46,178 : MainThread : INFO : computing 5 principal components took 0s
2022-04-10 21:15:46,185 : MainThread : INFO : removing 5 principal components took 0s
2022-04-10 21:15:46,186 : MainThread : INFO : training on 2758 effective sentences with 27410 effective words 

uSIF length 67.60


2022-04-10 21:15:47,539 : MainThread : INFO : scanning all indexed sentences and their word counts
2022-04-10 21:15:47,892 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2022-04-10 21:15:48,235 : MainThread : INFO : estimated memory for 2758 sentences with 50 dimensions and 400000 vocabulary: 78 MB (0 GB)
2022-04-10 21:15:48,236 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2022-04-10 21:15:48,251 : MainThread : INFO : pre-computing uSIF weights for 400000 words
2022-04-10 21:15:49,671 : MainThread : INFO : begin training
2022-04-10 21:15:50,120 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2022-04-10 21:15:50,158 : MainThread : INFO : computing 5 principal components took 0s
2022-04-10 21:15:50,162 : MainThread : INFO : removing 5 principal components took 0s
2022-04-10 21:15:50,163 : MainThread : INFO : training on 2758 effective sentences with 27410 effective words to

uSIF length 62.06


2022-04-10 21:15:50,576 : MainThread : INFO : scanning all indexed sentences and their word counts
2022-04-10 21:15:50,963 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2022-04-10 21:15:51,054 : MainThread : INFO : estimated memory for 2758 sentences with 25 dimensions and 99958 vocabulary: 10 MB (0 GB)
2022-04-10 21:15:51,055 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2022-04-10 21:15:51,067 : MainThread : INFO : pre-computing uSIF weights for 99958 words
2022-04-10 21:15:51,393 : MainThread : INFO : begin training
2022-04-10 21:15:51,823 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2022-04-10 21:15:51,843 : MainThread : INFO : computing 5 principal components took 0s
2022-04-10 21:15:51,845 : MainThread : INFO : removing 5 principal components took 0s
2022-04-10 21:15:51,846 : MainThread : INFO : training on 2758 effective sentences with 27038 effective words took

uSIF length 64.22


2022-04-10 21:15:56,885 : MainThread : INFO : scanning all indexed sentences and their word counts
2022-04-10 21:15:57,202 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2022-04-10 21:15:58,675 : MainThread : INFO : estimated memory for 2758 sentences with 300 dimensions and 1703756 vocabulary: 1959 MB (1 GB)
2022-04-10 21:15:58,676 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2022-04-10 21:15:58,691 : MainThread : INFO : pre-computing uSIF weights for 1703756 words
2022-04-10 21:16:04,254 : MainThread : INFO : begin training
2022-04-10 21:16:04,722 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2022-04-10 21:16:04,789 : MainThread : INFO : computing 5 principal components took 0s
2022-04-10 21:16:04,798 : MainThread : INFO : removing 5 principal components took 0s
2022-04-10 21:16:04,799 : MainThread : INFO : training on 2758 effective sentences with 27448 effective wor

uSIF length 73.04


2022-04-10 21:16:09,746 : MainThread : INFO : scanning all indexed sentences and their word counts
2022-04-10 21:16:10,075 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2022-04-10 21:16:11,520 : MainThread : INFO : estimated memory for 2758 sentences with 300 dimensions and 1703756 vocabulary: 1959 MB (1 GB)
2022-04-10 21:16:11,522 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2022-04-10 21:16:11,539 : MainThread : INFO : pre-computing uSIF weights for 1703756 words
2022-04-10 21:16:17,121 : MainThread : INFO : begin training
2022-04-10 21:16:17,566 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2022-04-10 21:16:17,606 : MainThread : INFO : computing 5 principal components took 0s
2022-04-10 21:16:17,608 : MainThread : INFO : removing 5 principal components took 0s
2022-04-10 21:16:17,609 : MainThread : INFO : training on 2758 effective sentences with 27448 effective wor

uSIF length 71.84


2022-04-10 21:16:17,974 : MainThread : INFO : scanning all indexed sentences and their word counts
2022-04-10 21:16:18,362 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2022-04-10 21:16:18,432 : MainThread : INFO : estimated memory for 2758 sentences with 300 dimensions and 77224 vocabulary: 91 MB (0 GB)
2022-04-10 21:16:18,433 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2022-04-10 21:16:18,445 : MainThread : INFO : pre-computing uSIF weights for 77224 words
2022-04-10 21:16:18,706 : MainThread : INFO : begin training
2022-04-10 21:16:19,129 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2022-04-10 21:16:19,172 : MainThread : INFO : computing 5 principal components took 0s
2022-04-10 21:16:19,179 : MainThread : INFO : removing 5 principal components took 0s
2022-04-10 21:16:19,180 : MainThread : INFO : training on 2758 effective sentences with 27439 effective words too

uSIF length 79.00


2022-04-10 21:16:28,027 : MainThread : INFO : scanning all indexed sentences and their word counts
2022-04-10 21:16:28,352 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2022-04-10 21:16:31,060 : MainThread : INFO : estimated memory for 2758 sentences with 300 dimensions and 3000000 vocabulary: 3447 MB (3 GB)
2022-04-10 21:16:31,061 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2022-04-10 21:16:31,074 : MainThread : INFO : pre-computing uSIF weights for 3000000 words
2022-04-10 21:16:41,139 : MainThread : INFO : begin training
2022-04-10 21:16:41,586 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2022-04-10 21:16:41,640 : MainThread : INFO : computing 5 principal components took 0s
2022-04-10 21:16:41,645 : MainThread : INFO : removing 5 principal components took 0s
2022-04-10 21:16:41,646 : MainThread : INFO : training on 2758 effective sentences with 23116 effective wor

uSIF length 66.99


2022-04-10 21:16:47,738 : MainThread : INFO : scanning all indexed sentences and their word counts
2022-04-10 21:16:48,060 : MainThread : INFO : finished scanning 2758 sentences with an average length of 9 and 27528 total words
2022-04-10 21:16:49,907 : MainThread : INFO : estimated memory for 2758 sentences with 300 dimensions and 2000000 vocabulary: 6877 MB (6 GB)
2022-04-10 21:16:49,909 : MainThread : INFO : initializing sentence vectors for 2758 sentences
2022-04-10 21:16:49,931 : MainThread : INFO : pre-computing uSIF weights for 2000000 words
2022-04-10 21:16:56,856 : MainThread : INFO : begin training
2022-04-10 21:16:57,322 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2022-04-10 21:16:57,374 : MainThread : INFO : computing 5 principal components took 0s
2022-04-10 21:16:57,384 : MainThread : INFO : removing 5 principal components took 0s
2022-04-10 21:16:57,386 : MainThread : INFO : training on 2758 effective sentences with 27528 effective wor

uSIF length 69.40


In [11]:
df = pd.DataFrame.from_dict(results, orient="columns").sort_values("score", ascending=False)
df.head()

Unnamed: 0,algo,vecs,params,score
12,CBOW,paranmt-300,,79.82
42,uSIF,paranmt-300,length=11,79.0
27,SIF-10,paranmt-300,components=10,76.72
25,SIF-10,paragram-300-sl999,components=10,74.21
26,SIF-10,paragram-300-ws353,components=10,74.03
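
The `score` column is on a 0 to 100 scale. The evaluation function itself is not shown in this part of the notebook, so the following is only a minimal sketch. It assumes the score is the Pearson correlation between the cosine similarity of each sentence-vector pair and the gold similarity, multiplied by 100. The helper names (`cosine`, `pearson`, `sts_score`) and the toy vectors are hypothetical and not part of fse.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def sts_score(vecs_a, vecs_b, gold):
    """Assumed scoring: 100 * Pearson r between predicted and gold similarities."""
    predicted = [cosine(u, v) for u, v in zip(vecs_a, vecs_b)]
    return 100.0 * pearson(predicted, gold)

# Toy 2-d "sentence vectors" (hypothetical data, for illustration only).
toy_a = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
toy_b = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
toy_gold = [5.0, 3.0, 0.0]
score = sts_score(toy_a, toy_b, toy_gold)
print(f"toy score: {score:2.2f}")
```

In the notebook itself, `scipy.stats.pearsonr` (imported in the first cell) would replace the hand-written `pearson` helper. It returns the correlation coefficient together with a p-value.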


A close look at the values above shows that:
- SIF-GloVe is almost equivalent to the values reported at http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark
- CBOW-ParaNMT is slightly better than "ParaNMT Word Avg." in https://www.aclweb.org/anthology/W18-3012
- uSIF-ParaNMT is slightly worse than "ParaNMT+UP" in https://www.aclweb.org/anthology/W18-3012
- uSIF-Paragram is slightly worse than "PSL+UP" in https://www.aclweb.org/anthology/W18-3012

However, I suspect that these differences stem from differences in preprocessing. Too bad we didn't hit 80. If you have any idea why the values don't match exactly, feel free to contact me anytime.
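
To illustrate why preprocessing can matter, here is a small sketch. The logs above report fewer "effective words" than total words (for example 23116 versus 27528 for the 3000000-word vocabulary), which I read as tokens that were found in the vocabulary. The two tokenizers and the toy vocabulary below are hypothetical and only show how the tokenization choice changes that count; they are not the preprocessing used in this notebook.

```python
import re

def tokenize_raw(sentence):
    """Whitespace split only: case and punctuation stay attached to tokens."""
    return sentence.split()

def tokenize_normalized(sentence):
    """Lowercase the sentence and keep only runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", sentence.lower())

def effective_words(tokens, vocab):
    """Count the tokens that have an entry in the vocabulary."""
    return sum(1 for token in tokens if token in vocab)

# Toy lowercase vocabulary (hypothetical, for illustration only).
toy_vocab = {"a", "girl", "is", "styling", "her", "hair"}
sentence = "A girl is styling her hair."

raw_hits = effective_words(tokenize_raw(sentence), toy_vocab)
norm_hits = effective_words(tokenize_normalized(sentence), toy_vocab)
print(f"raw: {raw_hits} of 6 tokens found, normalized: {norm_hits} of 6 tokens found")
```

With whitespace splitting, `A` and `hair.` miss the lowercase vocabulary, so those words contribute nothing to the sentence vector. Small differences of this kind between two pipelines could plausibly shift the final correlation by a fraction of a point.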

In [12]:
for _, row in df.iterrows():
    # Print one pipe-separated table row per result.
    print(
        f"`{row.algo}` | `{row.vecs}` | {row.params} | {row.score:2.2f}"
    )

`CBOW` | `paranmt-300` |  | 79.82
`uSIF` | `paranmt-300` | length=11 | 79.00
`SIF-10` | `paranmt-300` | components=10 | 76.72
`SIF-10` | `paragram-300-sl999` | components=10 | 74.21
`SIF-10` | `paragram-300-ws353` | components=10 | 74.03
`SIF-10` | `fasttext-crawl-subwords-300` | components=10 | 73.38
`uSIF` | `paragram-300-sl999` | length=11 | 73.04
`SIF-10` | `fasttext-wiki-news-subwords-300` | components=10 | 72.29
`uSIF` | `paragram-300-ws353` | length=11 | 71.84
`SIF-10` | `glove-twitter-200` | components=10 | 71.62
`SIF-10` | `glove-wiki-gigaword-300` | components=10 | 71.35
`SIF-10` | `word2vec-google-news-300` | components=10 | 71.12
`SIF-10` | `glove-wiki-gigaword-200` | components=10 | 70.62
`SIF-10` | `glove-twitter-100` | components=10 | 69.65
`uSIF` | `fasttext-crawl-subwords-300` | length=11 | 69.40
`uSIF` | `fasttext-wiki-news-subwords-300` | length=11 | 68.63
`SIF-10` | `glove-wiki-gigaword-100` | components=10 | 68.34
`uSIF` | `glove-wiki-gigaword-300` | length=11 | 67