# Finding similar documents with Word2Vec and Soft Cosine Measure 

Soft Cosine Measure (SCM) [1, 4] is a promising new tool in machine learning that allows us to submit a query and return the most relevant documents. In **part 1**, we will show how you can compute SCM between two documents using the `inner_product` method. In **part 2**, we will use `SoftCosineSimilarity` to retrieve documents most similar to a query and compare the performance against other similarity measures.

First, however, we go through the basics of what Soft Cosine Measure is.

## Soft Cosine Measure basics

Soft Cosine Measure (SCM) is a method that allows us to assess the similarity between two documents in a meaningful way, even when they have no words in common. It uses a measure of similarity between words, which can be derived [2] using [word2vec][] [3] vector embeddings of words. It has been shown to outperform many of the state-of-the-art methods in the semantic text similarity task in the context of community question answering [2].

[word2vec]: https://radimrehurek.com/gensim/models/word2vec.html

SCM is illustrated below for two very similar sentences. The sentences have no words in common, but by modeling synonymy, SCM is able to accurately measure the similarity between the two sentences. The method also uses the bag-of-words vector representation of the documents (simply put, the word frequencies in the documents). The intuition behind the method is that we compute standard cosine similarity assuming that the document vectors are expressed in a non-orthogonal basis, where the angle between two basis vectors is derived from the angle between the word2vec embeddings of the corresponding words.

![Soft Cosine Measure](soft_cosine_tutorial.png)
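To make the non-orthogonal-basis intuition concrete, here is a minimal pure-Python sketch. It assumes dense bag-of-words vectors and a small term similarity matrix `S` whose values are invented for illustration; the helper names `bilinear` and `soft_cosine` are hypothetical and are not part of Gensim.

```python
from math import sqrt

def bilinear(x, S, y):
    """Return x^T S y for dense vectors x, y and a square matrix S."""
    return sum(x[i] * S[i][j] * y[j] for i in range(len(x)) for j in range(len(y)))

def soft_cosine(x, y, S):
    """Cosine similarity of x and y in the basis described by S."""
    return bilinear(x, S, y) / (sqrt(bilinear(x, S, x)) * sqrt(bilinear(y, S, y)))

# Hypothetical vocabulary: ["media", "press", "juice"].
# S[i][j] is an assumed similarity between term i and term j.
S = [
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

doc_media = [1.0, 0.0, 0.0]  # contains only "media"
doc_press = [0.0, 1.0, 0.0]  # contains only "press"

# With the identity matrix we recover standard cosine similarity,
# which is zero because the documents share no words.
hard = soft_cosine(doc_media, doc_press, identity)
# With S, the related terms "media" and "press" contribute to the score.
soft = soft_cosine(doc_media, doc_press, S)
print(hard, soft)
```

With the identity matrix the two one-word documents are orthogonal, whereas the assumed similarity of 0.8 between "media" and "press" carries over directly into the soft score.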

This method was perhaps first introduced in the article “Soft Similarity and Soft Cosine Measure: Similarity of Features in Vector Space Model” by Grigori Sidorov, Alexander Gelbukh, Helena Gomez-Adorno, and David Pinto ([link to PDF](http://www.scielo.org.mx/pdf/cys/v18n3/v18n3a7.pdf)).

In this tutorial, we will learn how to use Gensim's SCM functionality, which consists of the `inner_product` method for one-off computation, and the `SoftCosineSimilarity` class for corpus-based similarity queries.

> **Note**:
>
> If you use this software, please consider citing [1] and [2].
>

## Running this notebook
You can download this [Jupyter notebook](http://jupyter.org/) and run it on your own computer, provided you have installed the `gensim`, `jupyter`, `nltk`, `sklearn`, `pyemd`, and `wmd` Python packages.

The notebook was run on an Ubuntu machine with an Intel core i7-6700HQ CPU 3.10GHz (4 cores) and 16 GB memory. Assuming all resources required by the notebook have already been downloaded, running the entire notebook on this machine takes about 30 minutes.

In [1]:
# Initialize logging.
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

## Part 1: Computing the Soft Cosine Measure

To use SCM, we first need some word embeddings. You could train a [word2vec][] model on some corpus (see the tutorial [here](http://rare-technologies.com/word2vec-tutorial/)), but we will use pre-trained embeddings instead: the `glove-wiki-gigaword-50` vectors available through the Gensim downloader.

[word2vec]: https://radimrehurek.com/gensim/models/word2vec.html

Let's create some sentences to compare.

In [2]:
sentence_obama = 'Obama speaks to the media in Illinois'.lower().split()
sentence_president = 'The president greets the press in Chicago'.lower().split()
sentence_orange = 'Having a tough time finding an orange juice press machine?'.lower().split()

The first two sentences have very similar content, and as such the SCM should be large. Before we compute the SCM, we want to remove stopwords ("the", "to", etc.), as these do not contribute a lot to the information in the sentences.

In [3]:
# Import and download stopwords from NLTK.
from nltk.corpus import stopwords
from nltk import download
download('stopwords')  # Download stopwords list.

# Remove stopwords.
stop_words = stopwords.words('english')
sentence_obama = [w for w in sentence_obama if w not in stop_words]
sentence_president = [w for w in sentence_president if w not in stop_words]
sentence_orange = [w for w in sentence_orange if w not in stop_words]

# Prepare a dictionary and a corpus.
from gensim import corpora
documents = [sentence_obama, sentence_president, sentence_orange]
dictionary = corpora.Dictionary(documents)

# Convert the sentences into bag-of-words vectors.
sentence_obama = dictionary.doc2bow(sentence_obama)
sentence_president = dictionary.doc2bow(sentence_president)
sentence_orange = dictionary.doc2bow(sentence_orange)

[nltk_data] Downloading package stopwords to /home/misha/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
2019-06-17 10:46:35,031 : INFO : 'pattern' package not found; tag filters are not available for English
2019-06-17 10:46:35,036 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 10:46:35,037 : INFO : built Dictionary(14 unique tokens: ['illinois', 'media', 'obama', 'speaks', 'chicago']...) from 3 documents (total 15 corpus positions)
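The bag-of-words conversion above can be pictured with a small pure-Python sketch. It only mimics the idea behind `Dictionary` and `doc2bow`: the helpers `build_vocabulary` and `to_bow` are hypothetical, and the ids they assign need not match the ones Gensim would choose.

```python
from collections import Counter

def build_vocabulary(documents):
    """Assign an integer id to every distinct token, in order of first appearance."""
    vocab = {}
    for doc in documents:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs, skipping tokens outside the vocabulary."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())

docs = [
    ["obama", "speaks", "media", "illinois"],
    ["president", "greets", "press", "chicago"],
]
vocab = build_vocabulary(docs)

# "press" occurs twice, "obama" once, and "unknown" is not in the vocabulary.
bow = to_bow(["press", "press", "obama", "unknown"], vocab)
print(bow)
```

Each document becomes a sparse list of `(token_id, count)` pairs, which is the shape of input that `inner_product` receives in the cells below.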


Now, as we mentioned earlier, we will be using some downloaded pre-trained embeddings. Note that loading them takes a while and keeps the full embedding matrix (400,000 vectors of 50 dimensions here) in memory. We will use the embeddings to construct a term similarity matrix that will be used by the `inner_product` method.

In [4]:
%%time
import gensim.downloader as api
from gensim.models import WordEmbeddingSimilarityIndex
from gensim.similarities import SparseTermSimilarityMatrix

w2v_model = api.load("glove-wiki-gigaword-50")
similarity_index = WordEmbeddingSimilarityIndex(w2v_model)
similarity_matrix = SparseTermSimilarityMatrix(similarity_index, dictionary)

2019-06-17 10:46:35,428 : INFO : loading projection weights from /home/misha/gensim-data/glove-wiki-gigaword-50/glove-wiki-gigaword-50.gz
2019-06-17 10:47:07,511 : INFO : loaded (400000, 50) matrix from /home/misha/gensim-data/glove-wiki-gigaword-50/glove-wiki-gigaword-50.gz
2019-06-17 10:47:07,512 : INFO : constructing a sparse term similarity matrix using <gensim.models.keyedvectors.WordEmbeddingSimilarityIndex object at 0x7fa0c2347d68>
2019-06-17 10:47:07,512 : INFO : iterating over columns in dictionary order
2019-06-17 10:47:07,514 : INFO : PROGRESS: at 7.14% columns (1 / 14, 7.142857% density, 7.142857% projected density)
2019-06-17 10:47:07,516 : INFO : precomputing L2-norms of word weight vectors
2019-06-17 10:47:08,055 : INFO : constructed a sparse term similarity matrix with 11.224490% density


CPU times: user 30.4 s, sys: 1.27 s, total: 31.7 s
Wall time: 33 s
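As a rough picture of how a term similarity matrix can be derived from word embeddings, the sketch below computes cosine similarities between toy vectors, drops values at or below a threshold, and raises the rest to an exponent. The clip-and-exponentiate rule follows the spirit of [2] and is an assumption here, not a transcript of Gensim's implementation; the vectors and the helper names `cosine` and `term_similarity` are invented for illustration.

```python
from math import sqrt

def cosine(u, v):
    """Standard cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def term_similarity(u, v, threshold=0.0, exponent=2.0):
    """Assumed rule: keep similarities above the threshold, raised to the exponent."""
    sim = cosine(u, v)
    return sim ** exponent if sim > threshold else 0.0

# Toy 2-dimensional embeddings, invented for illustration only.
toy_vectors = {
    "media": [1.0, 0.0],
    "press": [0.6, 0.8],
    "juice": [-1.0, 0.0],
}

terms = sorted(toy_vectors)  # ["juice", "media", "press"]
matrix = [
    [term_similarity(toy_vectors[a], toy_vectors[b]) for b in terms]
    for a in terms
]
for term, row in zip(terms, matrix):
    print(term, [round(value, 2) for value in row])
```

Dissimilar pairs end up as exact zeros, which is what keeps the resulting matrix sparse when the vocabulary is large.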


Let's compute SCM using the `inner_product` method.

In [5]:
similarity = similarity_matrix.inner_product(sentence_obama, sentence_president, normalized=True)
print('similarity = %.4f' % similarity)

similarity = 0.3790


Let's try the same thing with two completely unrelated sentences. Notice that the similarity is smaller.

In [6]:
similarity = similarity_matrix.inner_product(sentence_obama, sentence_orange, normalized=True)
print('similarity = %.4f' % similarity)

similarity = 0.1108


## Part 2: Similarity queries using `SoftCosineSimilarity`
You can use SCM to get the most similar documents to a query, using the `SoftCosineSimilarity` class. Its interface is similar to what is described in the [Similarity Queries](https://radimrehurek.com/gensim/tut3.html) Gensim tutorial.

### Qatar Living unannotated dataset
Contestants solving the community question answering task in the [SemEval 2016][semeval16] and [2017][semeval17] competitions had an unannotated dataset of 189,941 questions and 1,894,456 comments from the [Qatar Living][ql] discussion forums. As our first step, we will use the same dataset to build a corpus.

[semeval16]: http://alt.qcri.org/semeval2016/task3/
[semeval17]: http://alt.qcri.org/semeval2017/task3/
[ql]: http://www.qatarliving.com/forum

In [7]:
%%time
from itertools import chain
import json
from re import sub
from os.path import isfile

import gensim.downloader as api
from gensim.utils import simple_preprocess
from nltk.corpus import stopwords
from nltk import download


download("stopwords")  # Download stopwords list.
stopwords = set(stopwords.words("english"))

def preprocess(doc):
    doc = sub(r'<img[^<>]+(>|$)', " image_token ", doc)
    doc = sub(r'<[^<>]+(>|$)', " ", doc)
    doc = sub(r'\[img_assist[^]]*?\]', " ", doc)
    doc = sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', " url_token ", doc)
    return [token for token in simple_preprocess(doc, min_len=0, max_len=float("inf")) if token not in stopwords]

corpus = list(chain(*[
    chain(
        [preprocess(thread["RelQuestion"]["RelQSubject"]), preprocess(thread["RelQuestion"]["RelQBody"])],
        [preprocess(relcomment["RelCText"]) for relcomment in thread["RelComments"]])
    for thread in api.load("semeval-2016-2017-task3-subtaskA-unannotated")]))

print("Number of documents: %d" % len(corpus))

[nltk_data] Downloading package stopwords to /home/misha/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


CPU times: user 2min 46s, sys: 2.34 s, total: 2min 48s
Wall time: 2min 52s
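To see what the regular-expression substitutions in `preprocess` do before tokenization, here is a small sketch that applies only those four substitutions to a made-up forum post. The helper name `strip_markup` and the sample text are hypothetical; the tokenization and stopword filtering done by `simple_preprocess` are left out so that the example needs nothing beyond the standard library.

```python
from re import sub

def strip_markup(doc):
    """Apply the same four substitutions as `preprocess`, without tokenizing."""
    doc = sub(r'<img[^<>]+(>|$)', " image_token ", doc)  # images become a placeholder
    doc = sub(r'<[^<>]+(>|$)', " ", doc)                 # remaining HTML tags are dropped
    doc = sub(r'\[img_assist[^]]*?\]', " ", doc)         # forum image macros are dropped
    doc = sub(
        r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+',
        " url_token ", doc)                              # URLs become a placeholder
    return doc

# A made-up post mixing HTML, a URL, an image tag, and an image macro.
sample = 'Check <b>this</b> out: https://example.com/page?id=1 <img src="cat.png"> [img_assist|nid=42]'
cleaned = strip_markup(sample)
print(cleaned.split())
```

Replacing images and URLs with placeholder tokens keeps the information that a post contained one, without letting every distinct address become its own vocabulary entry.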


Using the corpus we have just built, we will now construct a [dictionary][], a [TF-IDF model][tfidf], a [word2vec model][word2vec], and a term similarity matrix.

[dictionary]: https://radimrehurek.com/gensim/corpora/dictionary.html
[tfidf]: https://radimrehurek.com/gensim/models/tfidfmodel.html
[word2vec]: https://radimrehurek.com/gensim/models/word2vec.html

In [8]:
%%time
from multiprocessing import cpu_count

from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.models import Word2Vec
from gensim.models import WordEmbeddingSimilarityIndex
from gensim.similarities import SparseTermSimilarityMatrix

dictionary = Dictionary(corpus)
tfidf = TfidfModel(dictionary=dictionary)
w2v_model = Word2Vec(corpus, workers=cpu_count(), min_count=5, size=300, seed=12345)
similarity_index = WordEmbeddingSimilarityIndex(w2v_model.wv)
similarity_matrix = SparseTermSimilarityMatrix(similarity_index, dictionary, tfidf, nonzero_limit=100)

2019-06-17 10:50:00,986 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 10:50:01,293 : INFO : adding document #10000 to Dictionary(20088 unique tokens: ['blocks', 'cnn', 'facebook', 'minsitry', 'thailand']...)
...
2019-06-17 10:51:26,053 : INFO : adding document #2230000 to Dictionary(457752 unique tokens: ['blocks', 'cnn', 'facebook', 'minsitry', 'thailand']...)
...
2019-06-17 10:51:31,385 : INFO : PROGRESS: at sentence #570000, processed 10082223 words, keeping 200145 word types
...
2019-06-17 10:51:39,399 : INFO : PROGRESS: at sentence #2060000, processed 36317031 words, keeping 437115 word types
...
2019-06-17 10:52:19,198 : INFO : EPOCH 1 - PROGRESS: at 65.63% examples, 713917 words/s, in_qsize 13, out_qsize 2
...
2019-06-17 10:53:32,283 : INFO : EPOCH 2 - PROGRESS: at 97.30% examples, 696781 words/s, in_qsize 16, out_qsize 0
...
2019-06-17 10:54:40,054 : INFO : EPOCH 3 - PROGRESS: at 79.66% examples, 463130 words/s, in_qsize 16, out_qsize 1
...
2019-06-17 10:56:56,835 : INFO : EPOCH 4 - PROGRESS: at 99.18% examples, 336150 words/s, in_qsize 14, out_qsize 1
2019-06-17 10:56:57,231 : INFO : worker thread finished; awaiting finish of 7 more threads
...
2019-06-17 10:56:57,402 : INFO : worker thread finished; awaiting finish of 0 more threads
...
2019-06-17 10:58:11,527 : INFO : EPOCH 5 - PROGRESS: at 65.38% examples, 340404 words/s, in_qsize 13, out_qsize 2
...
2019-06-17 10:58:59,536 : INFO : PROGRESS: at 3.24% columns (15001 / 462807, 0.000221% density, 0.000354% projected density)
...
2019-06-17 11:00:02,584 : INFO : PROGRESS: at 60.50% columns (280001 / 462807, 0.000327% density, 0.000399% projected density)


TypeError: cannot unpack non-iterable numpy.float32 object

### Evaluation
Next, we will load the validation and test datasets that were used by the SemEval 2016 and 2017 contestants. The datasets contain 208 original questions posted by the forum members. For each question, there is a list of 10 threads with a human annotation denoting whether or not the thread is relevant to the original question. Our task will be to order the threads so that relevant threads rank above irrelevant threads.

In [9]:
datasets = api.load("semeval-2016-2017-task3-subtaskBC")



Finally, we will perform an evaluation to compare three unsupervised similarity measures – the Soft Cosine Measure, two different implementations of the [Word Mover's Distance][wmd], and standard cosine similarity. We will use the [Mean Average Precision (MAP)][map] as an evaluation measure and 10-fold cross-validation to get an estimate of the variance of MAP for each similarity measure.

[wmd]: http://vene.ro/blog/word-movers-distance-in-python.html
[map]: https://medium.com/@pds.bangalore/mean-average-precision-abd77d0b9a7e
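As a reminder of how Mean Average Precision is computed, here is a minimal sketch on made-up similarity scores and relevance labels. The helper names `average_precision` and `mean_average_precision` and the toy data are hypothetical; the ranking logic is meant to mirror the precision computation used in the evaluation code of this notebook.

```python
def average_precision(similarities, relevance):
    """Average the precision at each rank where a relevant document appears."""
    ranked = sorted(zip(similarities, relevance), reverse=True)
    hits = 0
    precisions = []
    for rank, (_, relevant) in enumerate(ranked, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(queries):
    """Mean of the per-query average precisions."""
    return sum(average_precision(s, r) for s, r in queries) / len(queries)

# Made-up data: each query has similarity scores and relevance flags per document.
queries = [
    ([0.9, 0.7, 0.4, 0.1], [True, False, True, False]),  # relevant at ranks 1 and 3
    ([0.2, 0.8, 0.5], [True, False, False]),             # relevant at rank 3 only
]

for scores, labels in queries:
    print("AP = %.4f" % average_precision(scores, labels))
print("MAP = %.4f" % mean_average_precision(queries))
```

The first query scores (1/1 + 2/3) / 2 and the second 1/3, so a measure is rewarded for placing relevant threads near the top of the ranking.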

In [10]:
!pip install wmd

You are using pip version 19.0.1, however version 19.1.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


In [None]:
from math import isnan
from time import time

from gensim.similarities import MatrixSimilarity, WmdSimilarity, SoftCosineSimilarity
import numpy as np
from sklearn.model_selection import KFold
from wmd import WMD

def produce_test_data(dataset):
    for orgquestion in datasets[dataset]:
        query = preprocess(orgquestion["OrgQSubject"]) + preprocess(orgquestion["OrgQBody"])
        documents = [
            preprocess(thread["RelQuestion"]["RelQSubject"]) + preprocess(thread["RelQuestion"]["RelQBody"])
            for thread in orgquestion["Threads"]]
        relevance = [
            thread["RelQuestion"]["RELQ_RELEVANCE2ORGQ"] in ("PerfectMatch", "Relevant")
            for thread in orgquestion["Threads"]]
        yield query, documents, relevance

def cossim(query, documents):
    # Compute cosine similarity between the query and the documents.
    query = tfidf[dictionary.doc2bow(query)]
    index = MatrixSimilarity(
        tfidf[[dictionary.doc2bow(document) for document in documents]],
        num_features=len(dictionary))
    similarities = index[query]
    return similarities

def softcossim(query, documents):
    # Compute Soft Cosine Measure between the query and the documents.
    query = tfidf[dictionary.doc2bow(query)]
    index = SoftCosineSimilarity(
        tfidf[[dictionary.doc2bow(document) for document in documents]],
        similarity_matrix)
    similarities = index[query]
    return similarities

def wmd_gensim(query, documents):
    # Compute Word Mover's Distance as implemented in PyEMD by William Mayner
    # between the query and the documents.
    index = WmdSimilarity(documents, w2v_model)
    similarities = index[query]
    return similarities

def wmd_relax(query, documents):
    # Compute Word Mover's Distance as implemented in WMD by Source{d}
    # between the query and the documents.
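    # Pair every word with its own dictionary id; doc2bow sorts its output by id,
    # so zipping it with an unordered word list would misalign ids and words.
    words = [
        word for word in set(chain(query, *documents))
        if word in w2v_model.wv and word in dictionary.token2id]
    indices, words = zip(*sorted((dictionary.token2id[word], word) for word in words))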
    query = dict(tfidf[dictionary.doc2bow(query)])
    query = [
        (new_index, query[dict_index])
        for new_index, dict_index in enumerate(indices)
        if dict_index in query]
    documents = [dict(tfidf[dictionary.doc2bow(document)]) for document in documents]
    documents = [[
        (new_index, document[dict_index])
        for new_index, dict_index in enumerate(indices)
        if dict_index in document] for document in documents]
    embeddings = np.array([w2v_model.wv[word] for word in words], dtype=np.float32)
    nbow = dict(((index, list(chain([None], zip(*document)))) for index, document in enumerate(documents)))
    nbow["query"] = tuple([None] + list(zip(*query)))
    distances = WMD(embeddings, nbow, vocabulary_min=1).nearest_neighbors("query")
    similarities = [-distance for _, distance in sorted(distances)]
    return similarities

strategies = {
    "cossim" : cossim,
    "softcossim": softcossim,
    "wmd-gensim": wmd_gensim,
    "wmd-relax": wmd_relax}

def evaluate(split, strategy):
    # Perform a single round of evaluation.
    results = []
    start_time = time()
    for query, documents, relevance in split:
        similarities = strategies[strategy](query, documents)
        assert len(similarities) == len(documents)
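        # Precision at every rank that holds a relevant document. The inner
        # enumerate yields 0-based ranks, the outer one counts relevant hits so far.
        precision = [
            (num_correct + 1) / (num_total + 1) for num_correct, num_total in enumerate(
                num_total for num_total, (_, relevant) in enumerate(
                    sorted(zip(similarities, relevance), reverse=True)) if relevant)]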
        average_precision = np.mean(precision) if precision else 0.0
        results.append(average_precision)
    return (np.mean(results) * 100, time() - start_time)

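def crossvalidate(args):
    # Perform a 10-fold cross-validation and return the mean and the standard
    # deviation of (MAP score, duration) over the folds.
    dataset, strategy = args
    # Keep the test data in a plain list: the (query, documents, relevance)
    # triples are ragged, and recent NumPy versions refuse to pack them into an array.
    test_data = list(produce_test_data(dataset))
    kf = KFold(n_splits=10)
    samples = []
    for _, test_index in kf.split(test_data):
        split = [test_data[index] for index in test_index]
        samples.append(evaluate(split, strategy))
    return (np.mean(samples, axis=0), np.std(samples, axis=0))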
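
The nested comprehension in `evaluate` computes the average precision of each ranking, and the mean over all queries is then reported as a percentage. As a reading aid, here is a minimal standalone sketch of the same computation written as an explicit loop. The function name `average_precision` and the toy scores are illustrative assumptions and are not part of the notebook.

```python
def average_precision(similarities, relevance):
    """Average precision of the ranking induced by the similarity scores."""
    # Rank documents from most to least similar, as evaluate() does.
    ranked = sorted(zip(similarities, relevance), reverse=True)
    precisions = []
    num_relevant = 0
    for rank, (_, relevant) in enumerate(ranked, start=1):
        if relevant:
            num_relevant += 1
            # Precision at this rank: relevant documents seen so far / rank.
            precisions.append(num_relevant / rank)
    # A query without any relevant document scores zero, as in evaluate().
    return sum(precisions) / len(precisions) if precisions else 0.0


# Toy example (made-up scores): relevant documents end up at ranks 1 and 3.
toy_score = average_precision([0.9, 0.2, 0.7, 0.4], [True, False, False, True])
print(toy_score)
```

In the toy example the precisions at the two relevant ranks are 1/1 and 2/3, so the average precision is their mean, 5/6. Note that, like `evaluate`, the sketch sorts the `(similarity, relevance)` pairs as tuples, so ties in similarity are broken in favour of relevant documents.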

In [None]:
%%time
from multiprocessing import Pool

args_list = [
    (dataset, technique)
    for dataset in ("2016-test", "2017-test")
    for technique in ("softcossim", "wmd-gensim", "wmd-relax", "cossim")]
with Pool() as pool:
    results = pool.map(crossvalidate, args_list)

2019-06-17 11:00:13,689 : INFO : creating matrix with 10 documents and 462807 features
2019-06-17 11:00:13,822 : INFO : precomputing L2-norms of word weight vectors
2019-06-17 11:00:14,044 : INFO : creating matrix with 10 documents and 462807 features
2019-06-17 11:00:14,270 : INFO : precomputing L2-norms of word weight vectors
2019-06-17 11:00:14,795 : INFO : creating matrix with 10 documents and 462807 features
2019-06-17 11:00:14,757 : INFO : Vocabulary size: 35 500
2019-06-17 11:00:14,907 : INFO : WCD
2019-06-17 11:00:14,927 : INFO : 0.0
2019-06-17 11:00:14,919 : INFO : creating matrix with 10 documents and 462807 features
2019-06-17 11:00:14,950 : INFO : First K WMD
2019-06-17 11:00:15,013 : INFO : creating matrix with 10 documents and 462807 features
2019-06-17 11:00:15,186 : INFO : Vocabulary size: 19 500
2019-06-17 11:00:15,209 : INFO : WCD
2019-06-17 11:00:53,785 : INFO : built Dictionary(40 unique tokens: ['cancel', 'cancellation', 'even', 'guys', 'passport']...) from 2 documents (total 55 corpus positions)
2019-06-17 11:00:53,830 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:00:53,849 : INFO : built Dictionary(59 unique tokens: ['allow', 'anyone', 'company', 'contract', 'court']...) from 2 documents (total 84 corpus 

2019-06-17 11:00:55,434 : INFO : built Dictionary(13 unique tokens: ['freelance', 'visa', 'anyone', 'appreciated', 'female']...) from 2 documents (total 20 corpus positions)
2019-06-17 11:00:55,443 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:00:55,460 : INFO : built Dictionary(23 unique tokens: ['anyone', 'center', 'enough', 'filipino', 'getting']...) from 2 documents (total 33 corpus positions)
2019-06-17 11:00:55,472 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:00:55,484 : INFO : built Dictionary(34 unique tokens: ['accommodation', 'allowed', 'another', 'everyone', 'female']...) from 2 documents (total 40 corpus positions)
2019-06-17 11:00:55,534 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:00:55,544 : INFO : built Dictionary(37 unique tokens: ['ahli', 'al', 'anyone', 'clear', 'coverage']...) from 2 documents (total 50 corpus positions)
2019-06-17 11:00:55,582 : INFO : precomputing L2-nor

2019-06-17 11:00:57,315 : INFO : built Dictionary(41 unique tokens: ['accommodation', 'allowance', 'allowances', 'allownaces', 'applied']...) from 2 documents (total 55 corpus positions)
2019-06-17 11:00:57,355 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:00:57,363 : INFO : built Dictionary(28 unique tokens: ['account', 'anyone', 'bank', 'directly', 'employee']...) from 2 documents (total 38 corpus positions)
2019-06-17 11:00:57,400 : INFO : Removed 1 and 0 OOV words from document 1 and 2 (respectively).
2019-06-17 11:00:57,413 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:00:57,439 : INFO : built Dictionary(23 unique tokens: ['available', 'cameroonian', 'currently', 'everyone', 'hello']...) from 2 documents (total 32 corpus positions)
2019-06-17 11:00:57,460 : INFO : Removed 1 and 0 OOV words from document 1 and 2 (respectively).
2019-06-17 11:00:57,517 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06

2019-06-17 11:00:59,440 : INFO : precomputing L2-norms of word weight vectors
2019-06-17 11:00:59,467 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:00:59,483 : INFO : built Dictionary(27 unique tokens: ['anybody', 'authenticated', 'business', 'company', 'friend']...) from 2 documents (total 34 corpus positions)
2019-06-17 11:00:59,512 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:00:59,528 : INFO : built Dictionary(19 unique tokens: ['coming', 'enter', 'family', 'january', 'kindly']...) from 2 documents (total 34 corpus positions)
2019-06-17 11:00:59,546 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:00:59,561 : INFO : built Dictionary(25 unique tokens: ['also', 'document', 'documents', 'entry', 'find']...) from 2 documents (total 36 corpus positions)
2019-06-17 11:00:59,585 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:00:59,594 : INFO : built Dictionary(22 unique

2019-06-17 11:01:01,311 : INFO : built Dictionary(39 unique tokens: ['basis', 'car', 'classifieds', 'collect', 'find']...) from 2 documents (total 50 corpus positions)
2019-06-17 11:01:01,386 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:01:01,409 : INFO : built Dictionary(44 unique tokens: ['accomodation', 'ago', 'also', 'dear', 'details']...) from 2 documents (total 61 corpus positions)
2019-06-17 11:01:01,472 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:01:01,515 : INFO : built Dictionary(45 unique tokens: ['advance', 'amount', 'company', 'dear', 'excluding']...) from 2 documents (total 53 corpus positions)
2019-06-17 11:01:01,623 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:01:01,632 : INFO : built Dictionary(47 unique tokens: ['ask', 'attend', 'clinics', 'days', 'doha']...) from 2 documents (total 63 corpus positions)
2019-06-17 11:01:01,694 : INFO : adding document #0 to Dictionary(0 un

2019-06-17 11:01:03,217 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:01:03,245 : INFO : built Dictionary(47 unique tokens: ['agreement', 'back', 'call', 'cancelled', 'chance']...) from 2 documents (total 53 corpus positions)
2019-06-17 11:01:03,311 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:01:03,334 : INFO : built Dictionary(23 unique tokens: ['answer', 'back', 'ban', 'come', 'contract']...) from 2 documents (total 33 corpus positions)
2019-06-17 11:01:03,361 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:01:03,392 : INFO : built Dictionary(25 unique tokens: ['alternative', 'banned', 'getting', 'guys', 'hello']...) from 2 documents (total 34 corpus positions)
2019-06-17 11:01:03,412 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:01:03,418 : INFO : built Dictionary(30 unique tokens: ['ago', 'back', 'canceled', 'cancelled', 'company']...) from 2 documents (total 

2019-06-17 11:01:04,957 : INFO : precomputing L2-norms of word weight vectors
2019-06-17 11:01:05,178 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:01:05,203 : INFO : built Dictionary(19 unique tokens: ['advance', 'ago', 'anyone', 'bottle', 'cheers']...) from 2 documents (total 21 corpus positions)
2019-06-17 11:01:05,274 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:01:05,293 : INFO : built Dictionary(19 unique tokens: ['cost', 'etc', 'getting', 'give', 'hi']...) from 2 documents (total 24 corpus positions)
2019-06-17 11:01:05,321 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:01:05,345 : INFO : built Dictionary(19 unique tokens: ['al', 'alcohol', 'beer', 'clubs', 'cold']...) from 2 documents (total 25 corpus positions)
2019-06-17 11:01:05,369 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:01:05,392 : INFO : built Dictionary(16 unique tokens: ['events', 'light', 'p

2019-06-17 11:01:06,973 : INFO : Removed 1 and 0 OOV words from document 1 and 2 (respectively).
2019-06-17 11:01:06,982 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:01:06,986 : INFO : built Dictionary(25 unique tokens: ['advice', 'airport', 'arrive', 'bring', 'card']...) from 2 documents (total 30 corpus positions)
2019-06-17 11:01:07,024 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:01:07,047 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:01:07,052 : INFO : built Dictionary(22 unique tokens: ['baby', 'born', 'bring', 'dear', 'help']...) from 2 documents (total 39 corpus positions)
2019-06-17 11:01:07,041 : INFO : built Dictionary(23 unique tokens: ['business', 'certain', 'doha', 'dress', 'enquiring']...) from 2 documents (total 28 corpus positions)
2019-06-17 11:01:07,064 : INFO : Removed 1 and 0 OOV words from document 1 and 2 (respectively).
2019-06-17 11:01:07,071 : INFO : adding document 

2019-06-17 11:01:08,480 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:01:08,495 : INFO : built Dictionary(51 unique tokens: ['anyone', 'anything', 'anywhere', 'away', 'awesome']...) from 2 documents (total 55 corpus positions)
2019-06-17 11:01:08,554 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:01:08,569 : INFO : built Dictionary(28 unique tokens: ['admin', 'cut', 'delete', 'found', 'least']...) from 2 documents (total 31 corpus positions)
2019-06-17 11:01:08,595 : INFO : Removed 2 and 0 OOV words from document 1 and 2 (respectively).
2019-06-17 11:01:08,629 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:01:08,633 : INFO : built Dictionary(28 unique tokens: ['american', 'anyone', 'chiropractor', 'clinic', 'closed']...) from 2 documents (total 32 corpus positions)
2019-06-17 11:01:08,633 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:01:08,640 : INFO : built Diction

2019-06-17 11:01:09,882 : INFO : built Dictionary(50 unique tokens: ['ask', 'attend', 'clinics', 'days', 'doha']...) from 2 documents (total 61 corpus positions)
2019-06-17 11:01:09,881 : INFO : built Dictionary(32 unique tokens: ['africa', 'anyone', 'beleive', 'cant', 'come']...) from 2 documents (total 47 corpus positions)
2019-06-17 11:01:09,894 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:01:09,895 : INFO : built Dictionary(43 unique tokens: ['account', 'advice', 'already', 'anyone', 'applying']...) from 2 documents (total 62 corpus positions)
2019-06-17 11:01:09,907 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:01:09,910 : INFO : built Dictionary(50 unique tokens: ['applying', 'b', 'basically', 'c', 'checked']...) from 2 documents (total 63 corpus positions)
2019-06-17 11:01:09,915 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:01:09,921 : INFO : built Dictionary(36 unique tokens: ['approa

2019-06-17 11:01:10,530 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:01:10,537 : INFO : built Dictionary(38 unique tokens: ['answer', 'anymore', 'apply', 'appreciated', 'company']...) from 2 documents (total 59 corpus positions)
2019-06-17 11:01:10,541 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:01:10,565 : INFO : built Dictionary(33 unique tokens: ['advisable', 'application', 'appreciate', 'canada', 'consultants']...) from 2 documents (total 39 corpus positions)
2019-06-17 11:01:10,567 : INFO : precomputing L2-norms of word weight vectors
2019-06-17 11:01:10,589 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:01:10,601 : INFO : built Dictionary(36 unique tokens: ['appreciated', 'august', 'company', 'doha', 'embassy']...) from 2 documents (total 49 corpus positions)
2019-06-17 11:01:10,628 : INFO : Removed 1 and 0 OOV words from document 1 and 2 (respectively).
2019-06-17 11:01:10,645 : INFO :

2019-06-17 11:01:11,204 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:01:11,210 : INFO : built Dictionary(34 unique tokens: ['application', 'credential', 'hmc', 'right', 'stage']...) from 2 documents (total 41 corpus positions)
2019-06-17 11:01:11,241 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:01:11,253 : INFO : built Dictionary(55 unique tokens: ['accommodation', 'allowance', 'allowances', 'allownaces', 'applied']...) from 2 documents (total 74 corpus positions)
2019-06-17 11:01:11,316 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:01:11,344 : INFO : built Dictionary(36 unique tokens: ['anyone', 'apply', 'complicated', 'confuses', 'french']...) from 2 documents (total 45 corpus positions)
2019-06-17 11:01:11,371 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:01:11,399 : INFO : built Dictionary(42 unique tokens: ['baby', 'best', 'childbirth', 'deliver', 'doctors'

2019-06-17 11:01:12,710 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:01:12,736 : INFO : built Dictionary(31 unique tokens: ['birth', 'child', 'give', 'gots', 'happens']...) from 2 documents (total 40 corpus positions)
2019-06-17 11:01:12,764 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:01:12,734 : INFO : built Dictionary(23 unique tokens: ['badminton', 'club', 'details', 'group', 'grp']...) from 2 documents (total 30 corpus positions)
2019-06-17 11:01:12,788 : INFO : built Dictionary(47 unique tokens: ['american', 'answers', 'considering', 'currently', 'dating']...) from 2 documents (total 55 corpus positions)
2019-06-17 11:01:12,786 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:01:12,805 : INFO : built Dictionary(37 unique tokens: ['anybody', 'badminton', 'boring', 'crowded', 'doha']...) from 2 documents (total 50 corpus positions)
2019-06-17 11:01:12,830 : INFO : adding document #0 to Dicti

2019-06-17 11:01:15,004 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:01:15,023 : INFO : built Dictionary(32 unique tokens: ['buy', 'carrefour', 'cook', 'enough', 'etc']...) from 2 documents (total 46 corpus positions)
2019-06-17 11:01:15,067 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:01:15,076 : INFO : built Dictionary(41 unique tokens: ['add', 'also', 'answers', 'anyone', 'away']...) from 2 documents (total 58 corpus positions)
2019-06-17 11:01:15,163 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:01:15,181 : INFO : built Dictionary(24 unique tokens: ['anyone', 'average', 'breakfast', 'cost', 'could']...) from 2 documents (total 29 corpus positions)
2019-06-17 11:01:15,207 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:01:15,223 : INFO : built Dictionary(27 unique tokens: ['anyone', 'bring', 'case', 'christmas', 'coming']...) from 2 documents (total 36 corpus p

2019-06-17 11:01:17,619 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:01:17,653 : INFO : built Dictionary(55 unique tokens: ['american', 'bachelor', 'bayt', 'com', 'cv']...) from 2 documents (total 78 corpus positions)
2019-06-17 11:01:17,754 : INFO : Removed 1 and 0 OOV words from document 1 and 2 (respectively).
2019-06-17 11:01:17,769 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:01:17,781 : INFO : built Dictionary(53 unique tokens: ['car', 'charger', 'clear', 'comments', 'cost']...) from 2 documents (total 70 corpus positions)
2019-06-17 11:01:17,832 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:01:17,861 : INFO : built Dictionary(34 unique tokens: ['doha', 'find', 'hard', 'job', 'nowadays']...) from 2 documents (total 48 corpus positions)
2019-06-17 11:01:17,892 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:01:17,915 : INFO : built Dictionary(37 unique tokens

2019-06-17 11:01:19,683 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:01:19,716 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:01:19,722 : INFO : built Dictionary(28 unique tokens: ['anyone', 'center', 'curves', 'help', 'hi']...) from 2 documents (total 37 corpus positions)
2019-06-17 11:01:19,737 : INFO : built Dictionary(45 unique tokens: ['ago', 'anyone', 'best', 'counseling', 'difficult']...) from 2 documents (total 51 corpus positions)
2019-06-17 11:01:19,751 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:01:19,786 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:01:19,785 : INFO : built Dictionary(39 unique tokens: ['ago', 'anyone', 'arrived', 'doha', 'everyone']...) from 2 documents (total 48 corpus positions)
2019-06-17 11:01:19,801 : INFO : built Dictionary(40 unique tokens: ['accommodation', 'allowance', 'anxious', 'contract', 'corp']...) from 2 documents (to

2019-06-17 11:01:21,815 : INFO : built Dictionary(44 unique tokens: ['already', 'arabic', 'chef', 'doha', 'end']...) from 2 documents (total 51 corpus positions)
2019-06-17 11:01:21,917 : INFO : Removed 1 and 0 OOV words from document 1 and 2 (respectively).
2019-06-17 11:01:21,926 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:01:21,948 : INFO : built Dictionary(35 unique tokens: ['available', 'could', 'daily', 'evening', 'find']...) from 2 documents (total 42 corpus positions)
2019-06-17 11:01:21,973 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:01:21,989 : INFO : built Dictionary(36 unique tokens: ['agency', 'application', 'apply', 'applying', 'bayt']...) from 2 documents (total 50 corpus positions)
2019-06-17 11:01:22,030 : INFO : Removed 1 and 0 OOV words from document 1 and 2 (respectively).
2019-06-17 11:01:22,036 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:01:22,051 : INFO : built Dict

2019-06-17 11:01:23,637 : INFO : built Dictionary(47 unique tokens: ['admissions', 'admit', 'advance', 'back', 'birla']...) from 2 documents (total 61 corpus positions)
2019-06-17 11:01:23,695 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:01:23,733 : INFO : built Dictionary(27 unique tokens: ['companies', 'currently', 'daycare', 'employers', 'facility']...) from 2 documents (total 34 corpus positions)
2019-06-17 11:01:23,761 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:01:23,787 : INFO : built Dictionary(47 unique tokens: ['advise', 'airways', 'anybody', 'british', 'children']...) from 2 documents (total 61 corpus positions)
2019-06-17 11:01:23,821 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:01:23,840 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:01:23,847 : INFO : built Dictionary(48 unique tokens: ['administration', 'agent', 'body', 'business', 'change']...) 

2019-06-17 11:01:25,293 : INFO : Removed 2 and 0 OOV words from document 1 and 2 (respectively).
2019-06-17 11:01:25,308 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:01:25,325 : INFO : built Dictionary(31 unique tokens: ['american', 'anyone', 'chiropractor', 'clinic', 'closed']...) from 2 documents (total 33 corpus positions)
2019-06-17 11:01:25,363 : INFO : Removed 1 and 0 OOV words from document 1 and 2 (respectively).
2019-06-17 11:01:25,384 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:01:25,401 : INFO : built Dictionary(38 unique tokens: ['advance', 'baby', 'bcg', 'better', 'child']...) from 2 documents (total 45 corpus positions)
2019-06-17 11:01:25,438 : INFO : precomputing L2-norms of word weight vectors
2019-06-17 11:01:25,632 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:01:25,650 : INFO : built Dictionary(17 unique tokens: ['anyone', 'car', 'company', 'departure', 'drive']...) from 

2019-06-17 11:01:27,354 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:01:27,388 : INFO : built Dictionary(51 unique tokens: ['anyone', 'apart', 'bring', 'customs', 'expect']...) from 2 documents (total 61 corpus positions)
2019-06-17 11:01:27,396 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:01:27,425 : INFO : built Dictionary(30 unique tokens: ['appreciate', 'bringing', 'family', 'help', 'limitations']...) from 2 documents (total 43 corpus positions)
2019-06-17 11:01:27,469 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:01:27,490 : INFO : built Dictionary(39 unique tokens: ['anyone', 'apply', 'completed', 'country', 'exit']...) from 2 documents (total 60 corpus positions)
2019-06-17 11:01:27,500 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:01:27,502 : INFO : built Dictionary(48 unique tokens: ['allowed', 'come', 'doha', 'dresses', 'formals']...) from 2 documents 

2019-06-17 11:01:29,304 : INFO : built Dictionary(18 unique tokens: ['home', 'room', 'spend', 'time', 'arrange']...) from 2 documents (total 25 corpus positions)
2019-06-17 11:01:29,329 : INFO : built Dictionary(44 unique tokens: ['abaya', 'arab', 'curious', 'dying', 'hey']...) from 2 documents (total 56 corpus positions)
2019-06-17 11:01:29,323 : INFO : Removed 1 and 1 OOV words from document 1 and 2 (respectively).
2019-06-17 11:01:29,352 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:01:29,356 : INFO : built Dictionary(20 unique tokens: ['compound', 'gardens', 'information', 'someone', 'tell']...) from 2 documents (total 23 corpus positions)
2019-06-17 11:01:29,377 : INFO : Removed 1 and 1 OOV words from document 1 and 2 (respectively).
2019-06-17 11:01:29,384 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:01:29,380 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:01:29,394 : INFO : built Diction

2019-06-17 11:01:30,914 : INFO : Removed 1 and 0 OOV words from document 1 and 2 (respectively).
2019-06-17 11:01:30,934 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:01:30,953 : INFO : built Dictionary(49 unique tokens: ['account', 'across', 'aka', 'axis', 'bank']...) from 2 documents (total 61 corpus positions)
2019-06-17 11:01:31,029 : INFO : Removed 1 and 0 OOV words from document 1 and 2 (respectively).
2019-06-17 11:01:31,061 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:01:31,076 : INFO : built Dictionary(49 unique tokens: ['account', 'appeared', 'bank', 'bussiness', 'call']...) from 2 documents (total 58 corpus positions)
2019-06-17 11:01:31,136 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-06-17 11:01:31,165 : INFO : built Dictionary(32 unique tokens: ['accomodation', 'day', 'doha', 'free', 'give']...) from 2 documents (total 37 corpus positions)
2019-06-17 11:01:31,196 : INFO : adding document

The table below shows point estimates of the mean and standard deviation of the MAP scores and elapsed times. Baselines and winners for each year are displayed in bold. The Soft Cosine Measure performs strongly on both the 2016 and the 2017 datasets.

In [None]:
from itertools import chain  # used below to merge our results with the published baselines

from IPython.display import display, Markdown

output = []
baselines = [
    (("2016-test", "**Winner (UH-PRHLT-primary)**"), ((76.70, 0), (0, 0))),
    (("2016-test", "**Baseline 1 (IR)**"), ((74.75, 0), (0, 0))),
    (("2016-test", "**Baseline 2 (random)**"), ((46.98, 0), (0, 0))),
    (("2017-test", "**Winner (SimBow-primary)**"), ((47.22, 0), (0, 0))),
    (("2017-test", "**Baseline 1 (IR)**"), ((41.85, 0), (0, 0))),
    (("2017-test", "**Baseline 2 (random)**"), ((29.81, 0), (0, 0)))]
table_header = ["Dataset | Strategy | MAP score | Elapsed time (sec)", ":---|:---|:---|---:"]
for row, ((dataset, technique), ((mean_map_score, mean_duration), (std_map_score, std_duration))) \
        in enumerate(sorted(chain(zip(args_list, results), baselines), key=lambda x: (x[0][0], -x[1][0][0]))):
    if row % (len(strategies) + 3) == 0:
        output.extend(chain(["\n"], table_header))
    map_score = "%.02f ±%.02f" % (mean_map_score, std_map_score)
    duration = "%.02f ±%.02f" % (mean_duration, std_duration) if mean_duration else ""
    output.append("%s|%s|%s|%s" % (dataset, technique, map_score, duration))

display(Markdown('\n'.join(output)))
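
The MAP scores and their spread reported in the table can be reproduced in miniature. The sketch below is not the notebook's evaluation code: it assumes binary relevance labels, a population standard deviation over repeated runs, and made-up data, and the names `average_precision`, `mean_average_precision`, and `runs` are hypothetical.

```python
import numpy as np


def average_precision(relevance):
    """Average precision of one ranked list of binary relevance labels."""
    relevance = np.asarray(relevance, dtype=bool)
    if not relevance.any():
        return 0.0  # no relevant document was retrieved
    # Precision at rank k = (relevant documents in the top k) / k
    precision_at_k = np.cumsum(relevance) / np.arange(1, len(relevance) + 1)
    # Average the precision over the ranks that hold a relevant document
    return float(precision_at_k[relevance].mean())


def mean_average_precision(rankings):
    """Mean of the average precision over all queries."""
    return float(np.mean([average_precision(r) for r in rankings]))


# Hypothetical data: two runs, each holding the ranked relevance labels
# (1 = relevant, 0 = not relevant) for two queries.
runs = [
    [[1, 0, 1], [0, 1]],
    [[1, 1, 0], [1, 0]],
]
run_scores = [mean_average_precision(run) for run in runs]

# Point estimates in the same "mean ±std" format as the table, as percentages
mean_map_score = 100 * np.mean(run_scores)
std_map_score = 100 * np.std(run_scores)
print("%.02f ±%.02f" % (mean_map_score, std_map_score))
```

The published winners and baselines are single numbers with no repeated runs, which is why their standard deviation is entered as zero in the cell above.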

## References

1. Grigori Sidorov et al. *Soft Similarity and Soft Cosine Measure: Similarity of Features in Vector Space Model*, 2014. ([link to PDF](http://www.scielo.org.mx/pdf/cys/v18n3/v18n3a7.pdf))
2. Delphine Charlet and Geraldine Damnati. *SimBow at SemEval-2017 Task 3: Soft-Cosine Semantic Similarity between Questions for Community Question Answering*, 2017. ([link to PDF](http://www.aclweb.org/anthology/S17-2051))
3. Tomas Mikolov et al. *Efficient Estimation of Word Representations in Vector Space*, 2013. ([link to PDF](https://arxiv.org/pdf/1301.3781.pdf))
4. Vít Novotný. *Implementation Notes for the Soft Cosine Measure*, 2018. ([link to PDF](https://arxiv.org/pdf/1808.09407))