# Finding similar documents with Word2Vec and Soft Cosine Measure 

Soft Cosine Measure (SCM) is a promising new tool in machine learning that allows us to submit a query and return the most relevant documents. In **part 1**, we will show how you can compute SCM between two documents using the `inner_product` method. In **part 2**, we will use `SoftCosineSimilarity` to retrieve documents most similar to a query and compare the performance against other similarity measures.

First, however, we go through the basics of what Soft Cosine Measure is.

## Soft Cosine Measure basics

Soft Cosine Measure (SCM) is a method that allows us to assess the similarity between two documents in a meaningful way, even when they have no words in common. It uses a measure of similarity between words, which can be derived [2] using [word2vec][] [3] vector embeddings of words. It has been shown to outperform many of the state-of-the-art methods in the semantic text similarity task in the context of community question answering [2].

[word2vec]: https://radimrehurek.com/gensim/models/word2vec.html

SCM is illustrated below for two very similar sentences. The sentences have no words in common, but by modeling synonymy, SCM is able to accurately measure the similarity between the two sentences. The method also uses the bag-of-words vector representation of the documents (simply put, the word frequencies in the documents). The intuition behind the method is that we compute standard cosine similarity assuming that the document vectors are expressed in a non-orthogonal basis, where the angle between two basis vectors is derived from the angle between the word2vec embeddings of the corresponding words.

![Soft Cosine Measure](soft_cosine_tutorial.png)
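
To make the intuition concrete, here is a minimal pure-Python sketch of the computation, not Gensim's implementation. It assumes dense bag-of-words vectors and a small term similarity matrix `S`; the function names, the three-word vocabulary, and the 0.8 similarity between "obama" and "president" are made up for illustration.

```python
from math import sqrt

def soft_inner(x, y, S):
    """Return x^T S y for dense vectors x, y and a term similarity matrix S."""
    return sum(x[i] * S[i][j] * y[j] for i in range(len(x)) for j in range(len(y)))

def soft_cosine(x, y, S):
    """Soft cosine: x^T S y divided by the soft norms of x and y."""
    denom = sqrt(soft_inner(x, x, S)) * sqrt(soft_inner(y, y, S))
    return soft_inner(x, y, S) / denom if denom else 0.0

# Hypothetical vocabulary: ["obama", "president", "juice"].
doc_a = [1, 0, 0]  # contains only "obama"
doc_b = [0, 1, 0]  # contains only "president"

# With the identity matrix, SCM reduces to standard cosine similarity.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Made-up similarities: "obama" and "president" are treated as related.
related = [[1.0, 0.8, 0.0], [0.8, 1.0, 0.0], [0.0, 0.0, 1.0]]

plain = soft_cosine(doc_a, doc_b, identity)
soft = soft_cosine(doc_a, doc_b, related)
print(plain, soft)
```

With the identity matrix the two documents share no terms and score zero; once the matrix says the two words are related, the score becomes positive even though the documents have no word in common.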

This method was perhaps first introduced in the article “Soft Similarity and Soft Cosine Measure: Similarity of Features in Vector Space Model” by Grigori Sidorov, Alexander Gelbukh, Helena Gomez-Adorno, and David Pinto ([link to PDF](http://www.scielo.org.mx/pdf/cys/v18n3/v18n3a7.pdf)).

In this tutorial, we will learn how to use Gensim's SCM functionality, which consists of the `inner_product` method for one-off computation, and the `SoftCosineSimilarity` class for corpus-based similarity queries.

> **Note**:
>
> If you use this software, please consider citing [1] and [2].
>

## Running this notebook
You can download this [Jupyter notebook](http://jupyter.org/) and run it on your own computer, provided you have installed the `gensim`, `jupyter`, `nltk`, `sklearn`, `pyemd`, and `wmd` Python packages.

The notebook was run on an Ubuntu machine with an Intel core i7-6700HQ CPU 3.10GHz (4 cores) and 16 GB memory. Assuming all resources required by the notebook have already been downloaded, running the entire notebook on this machine takes about 30 minutes.

In [1]:
# Initialize logging.
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

## Part 1: Computing the Soft Cosine Measure

To use SCM, we first need word embeddings. You could train a [word2vec][] model on some corpus (see the tutorial [here](http://rare-technologies.com/word2vec-tutorial/)), but we will use pre-trained GloVe embeddings (`glove-wiki-gigaword-50`) loaded through `gensim.downloader`.

[word2vec]: https://radimrehurek.com/gensim/models/word2vec.html

Let's create some sentences to compare.

In [2]:
sentence_obama = 'Obama speaks to the media in Illinois'.lower().split()
sentence_president = 'The president greets the press in Chicago'.lower().split()
sentence_orange = 'Having a tough time finding an orange juice press machine?'.lower().split()

The first two sentences have very similar content, and as such the SCM should be large. Before we compute the SCM, we want to remove stopwords ("the", "to", etc.), as these do not contribute a lot to the information in the sentences.

In [3]:
# Import and download stopwords from NLTK.
from nltk.corpus import stopwords
from nltk import download
download('stopwords')  # Download stopwords list.

# Remove stopwords.
stop_words = stopwords.words('english')
sentence_obama = [w for w in sentence_obama if w not in stop_words]
sentence_president = [w for w in sentence_president if w not in stop_words]
sentence_orange = [w for w in sentence_orange if w not in stop_words]

# Prepare a dictionary and a corpus.
from gensim import corpora
documents = [sentence_obama, sentence_president, sentence_orange]
dictionary = corpora.Dictionary(documents)

# Convert the sentences into bag-of-words vectors.
sentence_obama = dictionary.doc2bow(sentence_obama)
sentence_president = dictionary.doc2bow(sentence_president)
sentence_orange = dictionary.doc2bow(sentence_orange)


[nltk_data] Downloading package stopwords to
[nltk_data]     /home/novotny/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


2018-09-11 22:02:01,041 : INFO : 'pattern' package not found; tag filters are not available for English
2018-09-11 22:02:01,044 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:02:01,045 : INFO : built Dictionary(14 unique tokens: ['speaks', 'illinois', 'greets', 'juice', 'chicago']...) from 3 documents (total 15 corpus positions)
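
To see what the dictionary and bag-of-words steps above amount to, here is a small plain-Python sketch. The helpers `build_token_ids` and `to_bow` are hypothetical stand-ins, and the ids they assign will not necessarily match those assigned by Gensim's `Dictionary`.

```python
from collections import Counter

def build_token_ids(documents):
    """Assign each distinct token an integer id in order of first appearance."""
    token2id = {}
    for document in documents:
        for token in document:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(document, token2id):
    """Return sorted (token_id, count) pairs, ignoring unknown tokens."""
    counts = Counter(token2id[t] for t in document if t in token2id)
    return sorted(counts.items())

docs = [
    ["obama", "speaks", "media", "illinois"],
    ["president", "greets", "press", "chicago"],
]
token2id = build_token_ids(docs)
bow = to_bow(["press", "press", "obama", "unseen"], token2id)
print(bow)
```

Each document becomes a sparse list of `(id, count)` pairs, which is the form that the similarity computations below operate on.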


Now, as we mentioned earlier, we will be using some downloaded pre-trained embeddings. Note that the embeddings we have chosen here require a lot of memory. We will use the embeddings to construct a term similarity matrix that will be used by the `inner_product` method.

In [4]:
%%time
import gensim.downloader as api
from gensim.models import WordEmbeddingSimilarityIndex
from gensim.similarities import SparseTermSimilarityMatrix

w2v_model = api.load("glove-wiki-gigaword-50")
similarity_index = WordEmbeddingSimilarityIndex(w2v_model)
similarity_matrix = SparseTermSimilarityMatrix(similarity_index, dictionary)

2018-09-11 22:02:01,236 : INFO : loading projection weights from /home/novotny/gensim-data/glove-wiki-gigaword-50/glove-wiki-gigaword-50.gz
2018-09-11 22:02:26,984 : INFO : loaded (400000, 50) matrix from /home/novotny/gensim-data/glove-wiki-gigaword-50/glove-wiki-gigaword-50.gz
2018-09-11 22:02:26,985 : INFO : constructing a sparse term similarity matrix using <gensim.models.keyedvectors.WordEmbeddingSimilarityIndex object at 0x7f8d6e8615c0>
2018-09-11 22:02:26,986 : INFO : iterating over columns in dictionary order
2018-09-11 22:02:26,987 : INFO : PROGRESS: at 7.14% columns (1 / 14, 7.142857% density, 7.142857% projected density)
2018-09-11 22:02:26,988 : INFO : precomputing L2-norms of word weight vectors
2018-09-11 22:02:27,273 : INFO : constructed a sparse term similarity matrix with 11.224490% density


CPU times: user 27.8 s, sys: 2.43 s, total: 30.3 s
Wall time: 26.2 s
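
The term similarity matrix built above is derived from the angles between word embeddings. As a rough sketch of one such construction, we can take the cosine similarity of two word vectors, drop values at or below a threshold, and raise the rest to a power. The threshold of 0 and exponent of 2 used here are assumptions for illustration, as are the two-dimensional toy vectors; consult the `WordEmbeddingSimilarityIndex` documentation for the behaviour of your Gensim version.

```python
from math import sqrt

def cosine(u, v):
    """Standard cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def term_similarity(u, v, threshold=0.0, exponent=2.0):
    """Keep only similarities above the threshold, then sharpen them."""
    sim = cosine(u, v)
    return sim ** exponent if sim > threshold else 0.0

# Made-up 2-D embeddings, for illustration only.
toy_vectors = {"media": [1.0, 0.0], "press": [0.6, 0.8], "juice": [-1.0, 0.0]}
words = sorted(toy_vectors)  # ["juice", "media", "press"]

# Each word is fully similar to itself; other entries come from the embeddings.
matrix = [
    [1.0 if a == b else term_similarity(toy_vectors[a], toy_vectors[b]) for b in words]
    for a in words
]
for word, row in zip(words, matrix):
    print(word, row)
```

Most word pairs end up with a similarity of zero, which is why a sparse matrix is a natural fit for a large vocabulary.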


Let's compute SCM using the `inner_product` method.

In [5]:
similarity = similarity_matrix.inner_product(sentence_obama, sentence_president, normalized=True)
print('similarity = %.4f' % similarity)

similarity = 0.3790


Let's try the same thing with two completely unrelated sentences. Notice that the similarity is smaller.

In [6]:
similarity = similarity_matrix.inner_product(sentence_obama, sentence_orange, normalized=True)
print('similarity = %.4f' % similarity)

similarity = 0.1108


## Part 2: Similarity queries using `SoftCosineSimilarity`
You can use SCM to get the most similar documents to a query, using the `SoftCosineSimilarity` class. Its interface is similar to what is described in the [Similarity Queries](https://radimrehurek.com/gensim/tut3.html) Gensim tutorial.

### Qatar Living unannotated dataset
Contestants solving the community question answering task in the [SemEval 2016][semeval16] and [2017][semeval17] competitions had an unannotated dataset of 189,941 questions and 1,894,456 comments from the [Qatar Living][ql] discussion forums. As our first step, we will use the same dataset to build a corpus.

[semeval16]: http://alt.qcri.org/semeval2016/task3/
[semeval17]: http://alt.qcri.org/semeval2017/task3/
[ql]: http://www.qatarliving.com/forum

In [7]:
%%time
from itertools import chain
import json
from re import sub
from os.path import isfile

import gensim.downloader as api
from gensim.utils import simple_preprocess
from nltk.corpus import stopwords
from nltk import download


download("stopwords")  # Download stopwords list.
stopwords = set(stopwords.words("english"))

def preprocess(doc):
    doc = sub(r'<img[^<>]+(>|$)', " image_token ", doc)
    doc = sub(r'<[^<>]+(>|$)', " ", doc)
    doc = sub(r'\[img_assist[^]]*?\]', " ", doc)
    doc = sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', " url_token ", doc)
    return [token for token in simple_preprocess(doc, min_len=0, max_len=float("inf")) if token not in stopwords]

corpus = list(chain(*[
    chain(
        [preprocess(thread["RelQuestion"]["RelQSubject"]), preprocess(thread["RelQuestion"]["RelQBody"])],
        [preprocess(relcomment["RelCText"]) for relcomment in thread["RelComments"]])
    for thread in api.load("semeval-2016-2017-task3-subtaskA-unannotated")]))

print("Number of documents: %d" % len(corpus))

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/novotny/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
CPU times: user 2min 37s, sys: 1.62 s, total: 2min 39s
Wall time: 2min 39s
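
To see what the regular expressions in `preprocess` do to a single forum post, here is a standalone sketch that uses only the standard library. The sample post and the one-word stopword set are made up, and the lowercase regex tokenizer is a simplified stand-in for Gensim's `simple_preprocess`, so the tokens may differ from the real function on other inputs.

```python
from re import findall, sub

TOY_STOPWORDS = {"this"}  # hypothetical; the notebook uses NLTK's English list

def preprocess_sketch(doc):
    # Replace images with a placeholder token, then drop remaining HTML tags.
    doc = sub(r'<img[^<>]+(>|$)', " image_token ", doc)
    doc = sub(r'<[^<>]+(>|$)', " ", doc)
    # Drop forum-specific image markup.
    doc = sub(r'\[img_assist[^]]*?\]', " ", doc)
    # Replace URLs with a placeholder token.
    doc = sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', " url_token ", doc)
    # Simplified tokenizer: lowercase runs of letters and underscores.
    tokens = findall(r"[^\W\d]+", doc.lower())
    return [token for token in tokens if token not in TOY_STOPWORDS]

sample = 'Check <b>this</b> out: <img src="a.png"> [img_assist|nid=1] http://example.com/page now'
tokens = preprocess_sketch(sample)
print(tokens)
```

Replacing images and URLs with placeholder tokens keeps the fact that a post contained them without adding thousands of unique strings to the vocabulary.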


Using the corpus we have just built, we will now construct a [dictionary][], a [TF-IDF model][tfidf], a [word2vec model][word2vec], and a term similarity matrix.

[dictionary]: https://radimrehurek.com/gensim/corpora/dictionary.html
[tfidf]: https://radimrehurek.com/gensim/models/tfidfmodel.html
[word2vec]: https://radimrehurek.com/gensim/models/word2vec.html

In [8]:
%%time
from multiprocessing import cpu_count

from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.models import Word2Vec
from gensim.models import WordEmbeddingSimilarityIndex
from gensim.similarities import SparseTermSimilarityMatrix

dictionary = Dictionary(corpus)
tfidf = TfidfModel(dictionary=dictionary)
w2v_model = Word2Vec(corpus, workers=cpu_count(), min_count=5, size=300, seed=12345)
similarity_index = WordEmbeddingSimilarityIndex(w2v_model.wv)
similarity_matrix = SparseTermSimilarityMatrix(similarity_index, dictionary, tfidf, nonzero_limit=100)

2018-09-11 22:05:07,212 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:05:07,485 : INFO : adding document #10000 to Dictionary(20088 unique tokens: ['authours', 'chimney', 'bombard', 'euro', 'lies']...)
2018-09-11 22:05:07,751 : INFO : adding document #20000 to Dictionary(29692 unique tokens: ['biscit', 'pharaonic', 'authours', 'chimney', 'unanimous']...)
2018-09-11 22:05:08,036 : INFO : adding document #30000 to Dictionary(37971 unique tokens: ['biscit', 'pharaonic', 'authours', 'chimney', 'unanimous']...)
2018-09-11 22:05:08,293 : INFO : adding document #40000 to Dictionary(43930 unique tokens: ['biscit', 'chimney', 'strangers', 'untruths', 'apes']...)
2018-09-11 22:05:08,550 : INFO : adding document #50000 to Dictionary(49340 unique tokens: ['biscit', 'chimney', 'strangers', 'zimbabawe', 'untruths']...)
2018-09-11 22:05:08,825 : INFO : adding document #60000 to Dictionary(54734 unique tokens: ['biscit', 'chimney', 'strangers', 'zimbabawe', 'smartness'].

2018-09-11 22:05:22,111 : INFO : adding document #550000 to Dictionary(195951 unique tokens: ['bladder', 'appreciat', 'pples', 'adib', 'strangers']...)
2018-09-11 22:05:22,389 : INFO : adding document #560000 to Dictionary(197956 unique tokens: ['bladder', 'appreciat', 'pples', 'adib', 'strangers']...)
2018-09-11 22:05:22,669 : INFO : adding document #570000 to Dictionary(200145 unique tokens: ['bladder', 'appreciat', 'pples', 'adib', 'strangers']...)
2018-09-11 22:05:22,926 : INFO : adding document #580000 to Dictionary(201859 unique tokens: ['bladder', 'appreciat', 'pples', 'adib', 'strangers']...)
2018-09-11 22:05:23,178 : INFO : adding document #590000 to Dictionary(203724 unique tokens: ['bladder', 'appreciat', 'pples', 'adib', 'strangers']...)
2018-09-11 22:05:23,438 : INFO : adding document #600000 to Dictionary(205607 unique tokens: ['bladder', 'appreciat', 'pples', 'adib', 'strangers']...)
2018-09-11 22:05:23,703 : INFO : adding document #610000 to Dictionary(207387 unique tok

2018-09-11 22:05:36,423 : INFO : adding document #1090000 to Dictionary(291139 unique tokens: ['bladder', 'appreciat', 'pples', 'adib', 'strangers']...)
2018-09-11 22:05:36,695 : INFO : adding document #1100000 to Dictionary(293838 unique tokens: ['bladder', 'appreciat', 'pples', 'adib', 'strangers']...)
2018-09-11 22:05:36,955 : INFO : adding document #1110000 to Dictionary(295273 unique tokens: ['bladder', 'appreciat', 'pples', 'adib', 'strangers']...)
2018-09-11 22:05:37,219 : INFO : adding document #1120000 to Dictionary(296816 unique tokens: ['bladder', 'appreciat', 'pples', 'adib', 'strangers']...)
2018-09-11 22:05:37,478 : INFO : adding document #1130000 to Dictionary(298552 unique tokens: ['bladder', 'appreciat', 'pples', 'adib', 'strangers']...)
2018-09-11 22:05:37,692 : INFO : adding document #1140000 to Dictionary(299628 unique tokens: ['bladder', 'appreciat', 'pples', 'adib', 'strangers']...)
2018-09-11 22:05:37,957 : INFO : adding document #1150000 to Dictionary(301139 uni

2018-09-11 22:05:50,846 : INFO : adding document #1630000 to Dictionary(372956 unique tokens: ['pples', 'adib', 'strangers', 'kolayaalee', 'softpoint']...)
2018-09-11 22:05:51,107 : INFO : adding document #1640000 to Dictionary(374282 unique tokens: ['pples', 'adib', 'strangers', 'kolayaalee', 'softpoint']...)
2018-09-11 22:05:51,382 : INFO : adding document #1650000 to Dictionary(375746 unique tokens: ['pples', 'adib', 'strangers', 'kolayaalee', 'softpoint']...)
2018-09-11 22:05:51,646 : INFO : adding document #1660000 to Dictionary(377073 unique tokens: ['pples', 'adib', 'strangers', 'kolayaalee', 'softpoint']...)
2018-09-11 22:05:51,910 : INFO : adding document #1670000 to Dictionary(378393 unique tokens: ['pples', 'adib', 'strangers', 'kolayaalee', 'softpoint']...)
2018-09-11 22:05:52,170 : INFO : adding document #1680000 to Dictionary(379812 unique tokens: ['pples', 'adib', 'strangers', 'kolayaalee', 'softpoint']...)
2018-09-11 22:05:52,433 : INFO : adding document #1690000 to Dic

2018-09-11 22:06:04,891 : INFO : adding document #2160000 to Dictionary(449397 unique tokens: ['pples', 'adib', 'strangers', 'kolayaalee', 'softpoint']...)
2018-09-11 22:06:05,175 : INFO : adding document #2170000 to Dictionary(450649 unique tokens: ['pples', 'adib', 'strangers', 'kolayaalee', 'softpoint']...)
2018-09-11 22:06:05,438 : INFO : adding document #2180000 to Dictionary(451840 unique tokens: ['pples', 'adib', 'strangers', 'kolayaalee', 'softpoint']...)
2018-09-11 22:06:05,714 : INFO : adding document #2190000 to Dictionary(453020 unique tokens: ['pples', 'adib', 'strangers', 'kolayaalee', 'softpoint']...)
2018-09-11 22:06:05,981 : INFO : adding document #2200000 to Dictionary(454160 unique tokens: ['pples', 'adib', 'strangers', 'kolayaalee', 'softpoint']...)
2018-09-11 22:06:06,248 : INFO : adding document #2210000 to Dictionary(455302 unique tokens: ['pples', 'adib', 'strangers', 'kolayaalee', 'softpoint']...)
2018-09-11 22:06:06,532 : INFO : adding document #2220000 to Dic

2018-09-11 22:06:11,327 : INFO : PROGRESS: at sentence #540000, processed 9513704 words, keeping 193182 word types
2018-09-11 22:06:11,364 : INFO : PROGRESS: at sentence #550000, processed 9700882 words, keeping 195951 word types
2018-09-11 22:06:11,401 : INFO : PROGRESS: at sentence #560000, processed 9892043 words, keeping 197956 word types
2018-09-11 22:06:11,439 : INFO : PROGRESS: at sentence #570000, processed 10082223 words, keeping 200145 word types
2018-09-11 22:06:11,474 : INFO : PROGRESS: at sentence #580000, processed 10249508 words, keeping 201859 word types
2018-09-11 22:06:11,507 : INFO : PROGRESS: at sentence #590000, processed 10413550 words, keeping 203724 word types
2018-09-11 22:06:11,541 : INFO : PROGRESS: at sentence #600000, processed 10583886 words, keeping 205607 word types
2018-09-11 22:06:11,577 : INFO : PROGRESS: at sentence #610000, processed 10761502 words, keeping 207387 word types
2018-09-11 22:06:11,612 : INFO : PROGRESS: at sentence #620000, processed 1

2018-09-11 22:06:13,847 : INFO : PROGRESS: at sentence #1250000, processed 22068690 words, keeping 317188 word types
2018-09-11 22:06:13,882 : INFO : PROGRESS: at sentence #1260000, processed 22244101 words, keeping 318577 word types
2018-09-11 22:06:13,915 : INFO : PROGRESS: at sentence #1270000, processed 22407248 words, keeping 320245 word types
2018-09-11 22:06:13,953 : INFO : PROGRESS: at sentence #1280000, processed 22594585 words, keeping 321715 word types
2018-09-11 22:06:13,989 : INFO : PROGRESS: at sentence #1290000, processed 22771530 words, keeping 323216 word types
2018-09-11 22:06:14,027 : INFO : PROGRESS: at sentence #1300000, processed 22963365 words, keeping 324767 word types
2018-09-11 22:06:14,061 : INFO : PROGRESS: at sentence #1310000, processed 23129072 words, keeping 326386 word types
2018-09-11 22:06:14,107 : INFO : PROGRESS: at sentence #1320000, processed 23362428 words, keeping 329383 word types
2018-09-11 22:06:14,140 : INFO : PROGRESS: at sentence #1330000,

2018-09-11 22:06:16,419 : INFO : PROGRESS: at sentence #1960000, processed 34532003 words, keeping 424191 word types
2018-09-11 22:06:16,455 : INFO : PROGRESS: at sentence #1970000, processed 34704604 words, keeping 425372 word types
2018-09-11 22:06:16,493 : INFO : PROGRESS: at sentence #1980000, processed 34886895 words, keeping 426641 word types
2018-09-11 22:06:16,528 : INFO : PROGRESS: at sentence #1990000, processed 35059892 words, keeping 427732 word types
2018-09-11 22:06:16,565 : INFO : PROGRESS: at sentence #2000000, processed 35237154 words, keeping 428904 word types
2018-09-11 22:06:16,601 : INFO : PROGRESS: at sentence #2010000, processed 35409658 words, keeping 429960 word types
2018-09-11 22:06:16,639 : INFO : PROGRESS: at sentence #2020000, processed 35599655 words, keeping 431271 word types
2018-09-11 22:06:16,677 : INFO : PROGRESS: at sentence #2030000, processed 35788909 words, keeping 432825 word types
2018-09-11 22:06:16,713 : INFO : PROGRESS: at sentence #2040000,

2018-09-11 22:06:52,466 : INFO : EPOCH 1 - PROGRESS: at 95.26% examples, 1169387 words/s, in_qsize 59, out_qsize 4
2018-09-11 22:06:53,495 : INFO : EPOCH 1 - PROGRESS: at 98.42% examples, 1170643 words/s, in_qsize 61, out_qsize 2
2018-09-11 22:06:53,702 : INFO : worker thread finished; awaiting finish of 31 more threads
2018-09-11 22:06:53,709 : INFO : worker thread finished; awaiting finish of 30 more threads
2018-09-11 22:06:53,734 : INFO : worker thread finished; awaiting finish of 29 more threads
2018-09-11 22:06:53,747 : INFO : worker thread finished; awaiting finish of 28 more threads
2018-09-11 22:06:53,763 : INFO : worker thread finished; awaiting finish of 27 more threads
2018-09-11 22:06:53,772 : INFO : worker thread finished; awaiting finish of 26 more threads
2018-09-11 22:06:53,790 : INFO : worker thread finished; awaiting finish of 25 more threads
2018-09-11 22:06:53,801 : INFO : worker thread finished; awaiting finish of 24 more threads
2018-09-11 22:06:53,818 : INFO : w

2018-09-11 22:07:27,078 : INFO : worker thread finished; awaiting finish of 17 more threads
2018-09-11 22:07:27,081 : INFO : worker thread finished; awaiting finish of 16 more threads
2018-09-11 22:07:27,082 : INFO : worker thread finished; awaiting finish of 15 more threads
2018-09-11 22:07:27,083 : INFO : worker thread finished; awaiting finish of 14 more threads
2018-09-11 22:07:27,084 : INFO : worker thread finished; awaiting finish of 13 more threads
2018-09-11 22:07:27,085 : INFO : worker thread finished; awaiting finish of 12 more threads
2018-09-11 22:07:27,086 : INFO : worker thread finished; awaiting finish of 11 more threads
2018-09-11 22:07:27,087 : INFO : worker thread finished; awaiting finish of 10 more threads
2018-09-11 22:07:27,090 : INFO : worker thread finished; awaiting finish of 9 more threads
2018-09-11 22:07:27,091 : INFO : worker thread finished; awaiting finish of 8 more threads
2018-09-11 22:07:27,093 : INFO : worker thread finished; awaiting finish of 7 more

2018-09-11 22:08:00,120 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-09-11 22:08:00,121 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-09-11 22:08:00,122 : INFO : EPOCH - 3 : training on 40096354 raw words (38514587 effective words) took 33.0s, 1167509 effective words/s
2018-09-11 22:08:01,143 : INFO : EPOCH 4 - PROGRESS: at 2.68% examples, 1023881 words/s, in_qsize 60, out_qsize 3
2018-09-11 22:08:02,151 : INFO : EPOCH 4 - PROGRESS: at 5.81% examples, 1119736 words/s, in_qsize 64, out_qsize 0
2018-09-11 22:08:03,166 : INFO : EPOCH 4 - PROGRESS: at 8.66% examples, 1105341 words/s, in_qsize 59, out_qsize 4
2018-09-11 22:08:04,183 : INFO : EPOCH 4 - PROGRESS: at 11.91% examples, 1132745 words/s, in_qsize 64, out_qsize 0
2018-09-11 22:08:05,197 : INFO : EPOCH 4 - PROGRESS: at 14.98% examples, 1136795 words/s, in_qsize 61, out_qsize 2
2018-09-11 22:08:06,200 : INFO : EPOCH 4 - PROGRESS: at 18.05% examples, 1150014 words/s, in_qsize 64, ou

2018-09-11 22:08:44,141 : INFO : EPOCH 5 - PROGRESS: at 32.88% examples, 1141240 words/s, in_qsize 64, out_qsize 0
2018-09-11 22:08:45,160 : INFO : EPOCH 5 - PROGRESS: at 35.69% examples, 1134927 words/s, in_qsize 59, out_qsize 4
2018-09-11 22:08:46,160 : INFO : EPOCH 5 - PROGRESS: at 38.80% examples, 1140299 words/s, in_qsize 59, out_qsize 4
2018-09-11 22:08:47,164 : INFO : EPOCH 5 - PROGRESS: at 41.93% examples, 1143375 words/s, in_qsize 62, out_qsize 1
2018-09-11 22:08:48,174 : INFO : EPOCH 5 - PROGRESS: at 44.94% examples, 1144228 words/s, in_qsize 60, out_qsize 3
2018-09-11 22:08:49,178 : INFO : EPOCH 5 - PROGRESS: at 47.99% examples, 1145502 words/s, in_qsize 62, out_qsize 1
2018-09-11 22:08:50,178 : INFO : EPOCH 5 - PROGRESS: at 51.25% examples, 1148862 words/s, in_qsize 64, out_qsize 0
2018-09-11 22:08:51,196 : INFO : EPOCH 5 - PROGRESS: at 54.17% examples, 1147181 words/s, in_qsize 63, out_qsize 0
2018-09-11 22:08:52,216 : INFO : EPOCH 5 - PROGRESS: at 57.12% examples, 1146640

2018-09-11 22:09:09,583 : INFO : PROGRESS: at 3.24% columns (15001 / 462807, 0.000221% density, 0.000353% projected density)
2018-09-11 22:09:09,610 : INFO : PROGRESS: at 3.46% columns (16001 / 462807, 0.000221% density, 0.000345% projected density)
2018-09-11 22:09:09,771 : INFO : PROGRESS: at 3.67% columns (17001 / 462807, 0.000221% density, 0.000358% projected density)
2018-09-11 22:09:09,846 : INFO : PROGRESS: at 3.89% columns (18001 / 462807, 0.000222% density, 0.000358% projected density)
2018-09-11 22:09:09,889 : INFO : PROGRESS: at 4.11% columns (19001 / 462807, 0.000222% density, 0.000352% projected density)
2018-09-11 22:09:09,964 : INFO : PROGRESS: at 4.32% columns (20001 / 462807, 0.000222% density, 0.000352% projected density)
2018-09-11 22:09:10,039 : INFO : PROGRESS: at 4.54% columns (21001 / 462807, 0.000222% density, 0.000352% projected density)
2018-09-11 22:09:10,113 : INFO : PROGRESS: at 4.75% columns (22001 / 462807, 0.000223% density, 0.000351% projected density)


2018-09-11 22:09:15,378 : INFO : PROGRESS: at 17.50% columns (81001 / 462807, 0.000243% density, 0.000372% projected density)
2018-09-11 22:09:15,451 : INFO : PROGRESS: at 17.72% columns (82001 / 462807, 0.000244% density, 0.000371% projected density)
2018-09-11 22:09:15,494 : INFO : PROGRESS: at 17.93% columns (83001 / 462807, 0.000244% density, 0.000370% projected density)
2018-09-11 22:09:15,522 : INFO : PROGRESS: at 18.15% columns (84001 / 462807, 0.000244% density, 0.000368% projected density)
2018-09-11 22:09:15,625 : INFO : PROGRESS: at 18.37% columns (85001 / 462807, 0.000244% density, 0.000369% projected density)
2018-09-11 22:09:15,745 : INFO : PROGRESS: at 18.58% columns (86001 / 462807, 0.000245% density, 0.000370% projected density)
2018-09-11 22:09:15,788 : INFO : PROGRESS: at 18.80% columns (87001 / 462807, 0.000245% density, 0.000368% projected density)
2018-09-11 22:09:15,832 : INFO : PROGRESS: at 19.01% columns (88001 / 462807, 0.000245% density, 0.000367% projected d

2018-09-11 22:09:20,523 : INFO : PROGRESS: at 31.55% columns (146001 / 462807, 0.000263% density, 0.000363% projected density)
2018-09-11 22:09:20,583 : INFO : PROGRESS: at 31.76% columns (147001 / 462807, 0.000263% density, 0.000363% projected density)
2018-09-11 22:09:20,656 : INFO : PROGRESS: at 31.98% columns (148001 / 462807, 0.000263% density, 0.000363% projected density)
2018-09-11 22:09:20,731 : INFO : PROGRESS: at 32.20% columns (149001 / 462807, 0.000263% density, 0.000363% projected density)
2018-09-11 22:09:20,883 : INFO : PROGRESS: at 32.41% columns (150001 / 462807, 0.000264% density, 0.000364% projected density)
2018-09-11 22:09:20,956 : INFO : PROGRESS: at 32.63% columns (151001 / 462807, 0.000264% density, 0.000364% projected density)
2018-09-11 22:09:21,015 : INFO : PROGRESS: at 32.84% columns (152001 / 462807, 0.000264% density, 0.000363% projected density)
2018-09-11 22:09:21,150 : INFO : PROGRESS: at 33.06% columns (153001 / 462807, 0.000265% density, 0.000364% pro

2018-09-11 22:09:25,990 : INFO : PROGRESS: at 45.59% columns (211001 / 462807, 0.000283% density, 0.000364% projected density)
2018-09-11 22:09:26,033 : INFO : PROGRESS: at 45.81% columns (212001 / 462807, 0.000284% density, 0.000363% projected density)
2018-09-11 22:09:26,075 : INFO : PROGRESS: at 46.02% columns (213001 / 462807, 0.000284% density, 0.000363% projected density)
2018-09-11 22:09:26,147 : INFO : PROGRESS: at 46.24% columns (214001 / 462807, 0.000284% density, 0.000363% projected density)
2018-09-11 22:09:26,236 : INFO : PROGRESS: at 46.46% columns (215001 / 462807, 0.000284% density, 0.000363% projected density)
2018-09-11 22:09:26,292 : INFO : PROGRESS: at 46.67% columns (216001 / 462807, 0.000284% density, 0.000362% projected density)
2018-09-11 22:09:26,422 : INFO : PROGRESS: at 46.89% columns (217001 / 462807, 0.000285% density, 0.000363% projected density)
2018-09-11 22:09:26,465 : INFO : PROGRESS: at 47.10% columns (218001 / 462807, 0.000285% density, 0.000363% pro

2018-09-11 22:09:34,785 : INFO : PROGRESS: at 59.64% columns (276001 / 462807, 0.000322% density, 0.000393% projected density)
2018-09-11 22:09:34,977 : INFO : PROGRESS: at 59.85% columns (277001 / 462807, 0.000323% density, 0.000394% projected density)
2018-09-11 22:09:35,220 : INFO : PROGRESS: at 60.07% columns (278001 / 462807, 0.000324% density, 0.000395% projected density)
2018-09-11 22:09:35,501 : INFO : PROGRESS: at 60.28% columns (279001 / 462807, 0.000325% density, 0.000397% projected density)
2018-09-11 22:09:35,960 : INFO : PROGRESS: at 60.50% columns (280001 / 462807, 0.000327% density, 0.000400% projected density)
2018-09-11 22:09:36,200 : INFO : PROGRESS: at 60.72% columns (281001 / 462807, 0.000328% density, 0.000401% projected density)
2018-09-11 22:09:36,493 : INFO : PROGRESS: at 60.93% columns (282001 / 462807, 0.000330% density, 0.000403% projected density)
2018-09-11 22:09:36,720 : INFO : PROGRESS: at 61.15% columns (283001 / 462807, 0.000331% density, 0.000404% pro

2018-09-11 22:10:01,510 : INFO : PROGRESS: at 73.68% columns (341001 / 462807, 0.000448% density, 0.000531% projected density)
2018-09-11 22:10:02,327 : INFO : PROGRESS: at 73.90% columns (342001 / 462807, 0.000452% density, 0.000535% projected density)
2018-09-11 22:10:03,267 : INFO : PROGRESS: at 74.11% columns (343001 / 462807, 0.000456% density, 0.000540% projected density)
2018-09-11 22:10:03,968 : INFO : PROGRESS: at 74.33% columns (344001 / 462807, 0.000459% density, 0.000543% projected density)
2018-09-11 22:10:04,741 : INFO : PROGRESS: at 74.55% columns (345001 / 462807, 0.000463% density, 0.000547% projected density)
2018-09-11 22:10:05,494 : INFO : PROGRESS: at 74.76% columns (346001 / 462807, 0.000466% density, 0.000551% projected density)
2018-09-11 22:10:06,487 : INFO : PROGRESS: at 74.98% columns (347001 / 462807, 0.000471% density, 0.000556% projected density)
2018-09-11 22:10:07,188 : INFO : PROGRESS: at 75.19% columns (348001 / 462807, 0.000474% density, 0.000559% pro

2018-09-11 22:17:04,903 : INFO : PROGRESS: at 87.73% columns (406001 / 462807, 0.001926% density, 0.002166% projected density)
2018-09-11 22:17:13,755 : INFO : PROGRESS: at 87.94% columns (407001 / 462807, 0.001954% density, 0.002193% projected density)
2018-09-11 22:17:22,668 : INFO : PROGRESS: at 88.16% columns (408001 / 462807, 0.001983% density, 0.002220% projected density)
2018-09-11 22:17:31,544 : INFO : PROGRESS: at 88.37% columns (409001 / 462807, 0.002011% density, 0.002247% projected density)
2018-09-11 22:17:40,258 : INFO : PROGRESS: at 88.59% columns (410001 / 462807, 0.002037% density, 0.002271% projected density)
2018-09-11 22:17:49,178 : INFO : PROGRESS: at 88.81% columns (411001 / 462807, 0.002065% density, 0.002298% projected density)
2018-09-11 22:17:57,836 : INFO : PROGRESS: at 89.02% columns (412001 / 462807, 0.002090% density, 0.002321% projected density)
2018-09-11 22:18:06,604 : INFO : PROGRESS: at 89.24% columns (413001 / 462807, 0.002117% density, 0.002346% pro

CPU times: user 4h 38min 32s, sys: 4h 24min 33s, total: 9h 3min 5s
Wall time: 20min 43s


### Evaluation
Next, we will load the validation and test datasets that were used by the SemEval 2016 and 2017 contestants. The datasets contain 208 original questions posted by the forum members. For each question, there is a list of 10 threads with a human annotation denoting whether or not the thread is relevant to the original question. Our task will be to order the threads so that relevant threads rank above irrelevant threads.

In [9]:
datasets = api.load("semeval-2016-2017-task3-subtaskBC")

Finally, we will perform an evaluation to compare three unsupervised similarity measures – the Soft Cosine Measure, two different implementations of the [Word Mover's Distance][wmd], and standard cosine similarity. We will use the [Mean Average Precision (MAP)][map] as an evaluation measure and 10-fold cross-validation to get an estimate of the variance of MAP for each similarity measure.

[wmd]: http://vene.ro/blog/word-movers-distance-in-python.html
[map]: https://medium.com/@pds.bangalore/mean-average-precision-abd77d0b9a7e
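
As a reminder of how MAP is computed, here is a minimal pure-Python sketch. The function names and the toy similarities are hypothetical, and returning 0.0 for a query with no relevant documents is an assumption of this sketch rather than something the evaluation code is known to do.

```python
def rank_by_similarity(similarities, relevance):
    """Order the relevance labels from most to least similar document."""
    order = sorted(range(len(similarities)), key=lambda i: similarities[i], reverse=True)
    return [relevance[i] for i in order]

def average_precision(ranked_relevance):
    """Mean of precision@k taken at the rank k of each relevant document."""
    hits = 0
    precisions = []
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / rank)
    # Assumption: a query with no relevant documents contributes 0.0.
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(rankings):
    """Average the per-query average precision over all queries."""
    return sum(average_precision(r) for r in rankings) / len(rankings)

# Two toy queries, three candidate threads each.
ranking_1 = rank_by_similarity([0.9, 0.2, 0.5], [True, False, True])
ranking_2 = rank_by_similarity([0.9, 0.2, 0.5], [False, True, True])
map_score = mean_average_precision([ranking_1, ranking_2])
print(ranking_1, ranking_2, map_score)
```

A similarity measure scores well when it places the relevant threads near the top of each ranking, which is the behaviour the evaluation below is designed to reward.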

In [10]:
from math import isnan
from time import time

from gensim.similarities import MatrixSimilarity, WmdSimilarity, SoftCosineSimilarity
import numpy as np
from sklearn.model_selection import KFold
from wmd import WMD

def produce_test_data(dataset):
    for orgquestion in datasets[dataset]:
        query = preprocess(orgquestion["OrgQSubject"]) + preprocess(orgquestion["OrgQBody"])
        documents = [
            preprocess(thread["RelQuestion"]["RelQSubject"]) + preprocess(thread["RelQuestion"]["RelQBody"])
            for thread in orgquestion["Threads"]]
        relevance = [
            thread["RelQuestion"]["RELQ_RELEVANCE2ORGQ"] in ("PerfectMatch", "Relevant")
            for thread in orgquestion["Threads"]]
        yield query, documents, relevance

def cossim(query, documents):
    # Compute cosine similarity between the query and the documents.
    query = tfidf[dictionary.doc2bow(query)]
    index = MatrixSimilarity(
        tfidf[[dictionary.doc2bow(document) for document in documents]],
        num_features=len(dictionary))
    similarities = index[query]
    return similarities

def softcossim(query, documents):
    # Compute Soft Cosine Measure between the query and the documents.
    query = tfidf[dictionary.doc2bow(query)]
    index = SoftCosineSimilarity(
        tfidf[[dictionary.doc2bow(document) for document in documents]],
        similarity_matrix)
    similarities = index[query]
    return similarities

def wmd_gensim(query, documents):
    # Compute Word Mover's Distance as implemented in PyEMD by William Mayner
    # between the query and the documents.
    index = WmdSimilarity(documents, w2v_model)
    similarities = index[query]
    return similarities

def wmd_relax(query, documents):
    # Compute Word Mover's Distance as implemented in WMD by Source{d}
    # between the query and the documents.
    words = [word for word in set(chain(query, *documents)) if word in w2v_model.wv]
    indices, words = zip(*sorted((
        (index, word) for (index, _), word in zip(dictionary.doc2bow(words), words))))
    query = dict(tfidf[dictionary.doc2bow(query)])
    query = [
        (new_index, query[dict_index])
        for new_index, dict_index in enumerate(indices)
        if dict_index in query]
    documents = [dict(tfidf[dictionary.doc2bow(document)]) for document in documents]
    documents = [[
        (new_index, document[dict_index])
        for new_index, dict_index in enumerate(indices)
        if dict_index in document] for document in documents]
    embeddings = np.array([w2v_model.wv[word] for word in words], dtype=np.float32)
    nbow = dict(((index, list(chain([None], zip(*document)))) for index, document in enumerate(documents)))
    nbow["query"] = tuple([None] + list(zip(*query)))
    distances = WMD(embeddings, nbow, vocabulary_min=1).nearest_neighbors("query")
    similarities = [-distance for _, distance in sorted(distances)]
    return similarities

strategies = {
    "cossim": cossim,
    "softcossim": softcossim,
    "wmd-gensim": wmd_gensim,
    "wmd-relax": wmd_relax}

def evaluate(split, strategy):
    # Perform a single round of evaluation. Return the MAP (in percent)
    # over the queries in the split and the elapsed time in seconds.
    results = []
    start_time = time()
    for query, documents, relevance in split:
        similarities = strategies[strategy](query, documents)
        assert len(similarities) == len(documents)
        # Rank the documents by decreasing similarity and, for every relevant
        # document, record the precision at its position in the ranking.
        precision = [
            (num_correct + 1) / (num_total + 1) for num_correct, num_total in enumerate(
                num_total for num_total, (_, relevant) in enumerate(
                    sorted(zip(similarities, relevance), reverse=True)) if relevant)]
        average_precision = np.mean(precision) if precision else 0.0
        results.append(average_precision)
    return (np.mean(results) * 100, time() - start_time)

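def crossvalidate(args):
    # Perform a 10-fold cross-validation. Return the mean and the standard
    # deviation of (MAP, elapsed time) over the folds.
    dataset, strategy = args
    # Keep the test data in a plain list: the samples are ragged
    # and do not form a regular NumPy array.
    test_data = list(produce_test_data(dataset))
    kf = KFold(n_splits=10)
    samples = []
    for _, test_index in kf.split(test_data):
        samples.append(evaluate([test_data[index] for index in test_index], strategy))
    return (np.mean(samples, axis=0), np.std(samples, axis=0))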
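
The aggregation performed by `crossvalidate` can be illustrated with a small pure-Python sketch. It is a stand-in rather than the code used above: `contiguous_folds` mimics how `KFold(n_splits=10)` without shuffling assigns consecutive samples to test folds (the first `n_samples % n_splits` folds receive one extra sample), and `statistics.pstdev` plays the role of `np.std`, which computes the population standard deviation by default. The function names and the per-query scores are hypothetical.

```python
from statistics import mean, pstdev


def contiguous_folds(num_samples, num_folds):
    """Split range(num_samples) into num_folds contiguous test folds."""
    base, extra = divmod(num_samples, num_folds)
    folds, start = [], 0
    for fold in range(num_folds):
        # The first `extra` folds receive one additional sample.
        size = base + (1 if fold < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def crossvalidate_scores(scores, num_folds=10):
    """Mean and population standard deviation of the per-fold mean score."""
    fold_means = [
        mean(scores[index] for index in fold)
        for fold in contiguous_folds(len(scores), num_folds)]
    return mean(fold_means), pstdev(fold_means)


# Hypothetical average precisions of 20 queries.
scores = [index / 20 for index in range(20)]
map_mean, map_std = crossvalidate_scores(scores)
print(map_mean, map_std)
```

Because only the test folds are evaluated and nothing is trained, the cross-validation here serves purely to estimate how much MAP varies across subsets of the queries.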

In [11]:
%%time
from multiprocessing import Pool

args_list = [
    (dataset, technique)
    for dataset in ("2016-test", "2017-test")
    for technique in ("softcossim", "wmd-gensim", "wmd-relax", "cossim")]
with Pool() as pool:
    results = pool.map(crossvalidate, args_list)

2018-09-11 22:25:55,395 : INFO : precomputing L2-norms of word weight vectors
2018-09-11 22:25:55,423 : INFO : creating matrix with 10 documents and 462807 features
2018-09-11 22:25:55,444 : INFO : Vocabulary size: 35 500
2018-09-11 22:25:55,447 : INFO : WCD
2018-09-11 22:25:55,450 : INFO : 0.0
2018-09-11 22:25:55,451 : INFO : First K WMD
2018-09-11 22:25:55,457 : INFO : creating matrix with 10 documents and 462807 features
  if np.issubdtype(vec.dtype, np.int):
2018-09-11 22:25:55,473 : INFO : creating matrix with 10 documents and 462807 features
2018-09-11 22:25:55,472 : INFO : precomputing L2-norms of word weight vectors
2018-09-11 22:25:55,523 : INFO : creating matrix with 10 documents and 462807 features
2018-09-11 22:25:55,519 : INFO : [(-18.3532657623291, 6), (-17.696285247802734, 8), (-18.018333435058594, 5), (-17.275238037109375, 2), (-17.690784454345703, 4), (-17.200286865234375, 9), (-17.388216018676758, 7), (-0.0, 1), (-16.468210220336

2018-09-11 22:25:55,803 : INFO : Vocabulary size: 3 500
2018-09-11 22:25:55,804 : INFO : WCD
2018-09-11 22:25:55,807 : INFO : 0.0
2018-09-11 22:25:55,808 : INFO : First K WMD
2018-09-11 22:25:55,813 : INFO : creating matrix with 10 documents and 462807 features
2018-09-11 22:25:55,815 : INFO : [(-23.323942184448242, 1), (-23.062299728393555, 2), (-23.259197235107422, 4), (-22.928569793701172, 8), (-22.76712989807129, 7), (-22.777070999145508, 6), (-22.7784423828125, 3), (-21.306692123413086, 0), (-22.706424713134766, 5), (-20.88772201538086, 9)]
2018-09-11 22:25:55,817 : INFO : 0.0
2018-09-11 22:25:55,818 : INFO : P&P
2018-09-11 22:25:55,819 : INFO : creating matrix with 10 documents and 462807 features
2018-09-11 22:25:55,820 : INFO : stopped by early_stop condition
2018-09-11 22:25:55,822 : INFO : Vocabulary size: 21 500
2018-09-11 22:25:55,824 : INFO : WCD
2018-09-11 22:25:55,826 : INFO : 0.0
2018-09-11 22:25:55,828 : INFO : First K WMD
2018-09-11 22:25:55,841 : INFO : [(-21.3825092

2018-09-11 22:25:56,066 : INFO : [(-17.6803035736084, 9), (-16.80386734008789, 7), (-15.549836158752441, 0), (-15.82524299621582, 3), (-16.073200225830078, 6), (-14.176229476928711, 1), (-15.330875396728516, 4), (-15.516229629516602, 2), (-15.30439567565918, 5), (-15.97109317779541, 8)]
2018-09-11 22:25:56,067 : INFO : 0.0
2018-09-11 22:25:56,068 : INFO : P&P
2018-09-11 22:25:56,069 : INFO : stopped by early_stop condition
2018-09-11 22:25:56,075 : INFO : creating matrix with 10 documents and 462807 features
2018-09-11 22:25:56,079 : INFO : Vocabulary size: 12 500
2018-09-11 22:25:56,081 : INFO : WCD
2018-09-11 22:25:56,084 : INFO : 0.0
2018-09-11 22:25:56,085 : INFO : First K WMD
2018-09-11 22:25:56,091 : INFO : Vocabulary size: 7 500
2018-09-11 22:25:56,093 : INFO : WCD
2018-09-11 22:25:56,095 : INFO : 0.0
2018-09-11 22:25:56,095 : INFO : [(-21.149734497070312, 3), (-21.115989685058594, 9), (-20.278905868530273, 4), (-20.617753982543945, 7), (-21.0872859954834, 1), (-17.8920364379882

2018-09-11 22:25:56,310 : INFO : 0.0
2018-09-11 22:25:56,311 : INFO : P&P
2018-09-11 22:25:56,312 : INFO : stopped by early_stop condition
2018-09-11 22:25:56,313 : INFO : creating matrix with 10 documents and 462807 features
2018-09-11 22:25:56,316 : INFO : Vocabulary size: 18 500
2018-09-11 22:25:56,318 : INFO : WCD
2018-09-11 22:25:56,320 : INFO : 0.0
2018-09-11 22:25:56,321 : INFO : First K WMD
2018-09-11 22:25:56,332 : INFO : [(-20.602643966674805, 4), (-18.422197341918945, 7), (-19.27498435974121, 5), (-17.853513717651367, 1), (-18.41405487060547, 2), (-18.674968719482422, 9), (-17.420278549194336, 3), (-16.40138053894043, 6), (-0.0, 0), (-16.361116409301758, 8)]
2018-09-11 22:25:56,333 : INFO : 0.0
2018-09-11 22:25:56,335 : INFO : P&P
2018-09-11 22:25:56,335 : INFO : Vocabulary size: 12 500
2018-09-11 22:25:56,336 : INFO : stopped by early_stop condition
2018-09-11 22:25:56,336 : INFO : WCD
2018-09-11 22:25:56,338 : INFO : 0.0
2018-09-11 22:25:56,339 : INFO : First K WMD
2018-09

2018-09-11 22:25:56,630 : INFO : P&P
2018-09-11 22:25:56,631 : INFO : stopped by early_stop condition
2018-09-11 22:25:56,633 : INFO : creating matrix with 10 documents and 462807 features
2018-09-11 22:25:56,642 : INFO : [(-20.412126541137695, 1), (-19.901622772216797, 6), (-18.32648468017578, 9), (-17.38874053955078, 3), (-15.66531753540039, 0), (-15.41468334197998, 8), (-16.87537384033203, 5), (-16.159189224243164, 7), (-16.56991195678711, 2), (-0.0, 4)]
2018-09-11 22:25:56,643 : INFO : 0.0
2018-09-11 22:25:56,644 : INFO : P&P
2018-09-11 22:25:56,645 : INFO : stopped by early_stop condition
2018-09-11 22:25:56,652 : INFO : Vocabulary size: 10 500
2018-09-11 22:25:56,653 : INFO : creating matrix with 10 documents and 462807 features
2018-09-11 22:25:56,653 : INFO : WCD
2018-09-11 22:25:56,655 : INFO : 0.0
2018-09-11 22:25:56,656 : INFO : First K WMD
2018-09-11 22:25:56,662 : INFO : creating matrix with 10 documents and 462807 features
2018-09-11 22:25:56,664 : INFO : [(-20.3319644927

2018-09-11 22:25:56,938 : INFO : 0.0
2018-09-11 22:25:56,938 : INFO : creating matrix with 10 documents and 462807 features
2018-09-11 22:25:56,939 : INFO : P&P
2018-09-11 22:25:56,940 : INFO : stopped by early_stop condition
2018-09-11 22:25:56,942 : INFO : creating matrix with 10 documents and 462807 features
2018-09-11 22:25:56,952 : INFO : Vocabulary size: 34 500
2018-09-11 22:25:56,953 : INFO : WCD
2018-09-11 22:25:56,956 : INFO : 0.0
2018-09-11 22:25:56,957 : INFO : First K WMD
2018-09-11 22:25:56,962 : INFO : Vocabulary size: 5 500
2018-09-11 22:25:56,963 : INFO : WCD
2018-09-11 22:25:56,966 : INFO : 0.0
2018-09-11 22:25:56,967 : INFO : First K WMD
2018-09-11 22:25:56,967 : INFO : creating matrix with 10 documents and 462807 features
2018-09-11 22:25:56,973 : INFO : [(-18.67586326599121, 9), (-18.339338302612305, 4), (-16.85841941833496, 2), (-17.52264404296875, 5), (-17.893905639648438, 0), (-15.966352462768555, 1), (-16.120573043823242, 7), (-15.883320808410645, 8), (-16.41198

2018-09-11 22:25:57,176 : INFO : WCD
2018-09-11 22:25:57,177 : INFO : WCD
2018-09-11 22:25:57,179 : INFO : 0.0
2018-09-11 22:25:57,179 : INFO : 0.0
2018-09-11 22:25:57,179 : INFO : creating matrix with 10 documents and 462807 features
2018-09-11 22:25:57,180 : INFO : First K WMD
2018-09-11 22:25:57,180 : INFO : First K WMD
2018-09-11 22:25:57,186 : INFO : [(-21.144359588623047, 7), (-20.713117599487305, 5), (-20.755130767822266, 9), (-19.895933151245117, 4), (-17.951595306396484, 1), (-18.24154281616211, 0), (-20.379283905029297, 8), (-19.393970489501953, 3), (-18.360334396362305, 6), (-0.0, 2)]
2018-09-11 22:25:57,187 : INFO : 0.0
2018-09-11 22:25:57,188 : INFO : [(-18.68290901184082, 7), (-17.395431518554688, 5), (-17.43307876586914, 6), (-17.24639892578125, 8), (-17.380598068237305, 2), (-17.050798416137695, 9), (-16.874797821044922, 0), (-16.620691299438477, 4), (-13.969255447387695, 3), (-15.286347389221191, 1)]
2018-09-11 22:25:57,188 : INFO : P&P
2018-09-11 22:25:57,189 : INFO :

2018-09-11 22:25:57,415 : INFO : 0.0
2018-09-11 22:25:57,415 : INFO : [(-18.13376235961914, 8), (-16.11659812927246, 6), (-15.611549377441406, 4), (-15.593467712402344, 7), (-14.747987747192383, 3), (-15.354251861572266, 0), (-14.523130416870117, 1), (-14.768147468566895, 5), (-12.155661582946777, 2), (-14.562323570251465, 9)]
2018-09-11 22:25:57,415 : INFO : creating matrix with 10 documents and 462807 features
2018-09-11 22:25:57,416 : INFO : P&P
2018-09-11 22:25:57,417 : INFO : 0.0
2018-09-11 22:25:57,417 : INFO : stopped by early_stop condition
2018-09-11 22:25:57,418 : INFO : P&P
2018-09-11 22:25:57,418 : INFO : stopped by early_stop condition
2018-09-11 22:25:57,427 : INFO : creating matrix with 10 documents and 462807 features
2018-09-11 22:25:57,446 : INFO : Vocabulary size: 29 500
2018-09-11 22:25:57,450 : INFO : creating matrix with 10 documents and 462807 features
2018-09-11 22:25:57,455 : INFO : Vocabulary size: 11 500
2018-09-11 22:25:57,455 : INFO : WCD
2018-09-11 22:25:5

2018-09-11 22:25:57,755 : INFO : WCD
2018-09-11 22:25:57,770 : INFO : Vocabulary size: 11 500
2018-09-11 22:25:57,772 : INFO : WCD
2018-09-11 22:25:57,771 : INFO : creating matrix with 10 documents and 462807 features
2018-09-11 22:25:57,773 : INFO : creating matrix with 10 documents and 462807 features
2018-09-11 22:25:57,776 : INFO : 0.0
2018-09-11 22:25:57,773 : INFO : 0.0
2018-09-11 22:25:57,777 : INFO : First K WMD
2018-09-11 22:25:57,777 : INFO : First K WMD
2018-09-11 22:25:57,788 : INFO : [(-21.864727020263672, 7), (-21.44795799255371, 1), (-19.778404235839844, 8), (-20.00682258605957, 6), (-18.828405380249023, 9), (-17.995689392089844, 2), (-18.10761070251465, 4), (-19.479032516479492, 3), (-18.34626007080078, 5), (-15.891501426696777, 0)]
2018-09-11 22:25:57,788 : INFO : [(-20.17798614501953, 9), (-19.862642288208008, 7), (-19.94223976135254, 8), (-19.309282302856445, 5), (-18.08249855041504, 3), (-19.61968421936035, 6), (-0.0, 0), (-18.393888473510742, 1), (-19.3023548126220

2018-09-11 22:25:57,995 : INFO : 0.0
2018-09-11 22:25:57,996 : INFO : P&P
2018-09-11 22:25:57,997 : INFO : stopped by early_stop condition
2018-09-11 22:25:57,998 : INFO : Vocabulary size: 17 500
2018-09-11 22:25:57,998 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:25:57,999 : INFO : WCD
2018-09-11 22:25:58,000 : INFO : built Dictionary(58 unique tokens: ['bayt', 'tell', 'almost', 'cv', 'respond']...) from 2 documents (total 74 corpus positions)
2018-09-11 22:25:58,001 : INFO : 0.0
2018-09-11 22:25:58,002 : INFO : First K WMD
2018-09-11 22:25:58,008 : INFO : creating matrix with 10 documents and 462807 features
2018-09-11 22:25:58,009 : INFO : [(-23.84600830078125, 3), (-21.9279727935791, 7), (-23.31334686279297, 8), (-21.53450584411621, 6), (-20.583724975585938, 5), (-20.156784057617188, 9), (-19.488155364990234, 1), (-20.923620223999023, 4), (-0.0, 0), (-19.14923667907715, 2)]
2018-09-11 22:25:58,011 : INFO : 0.0
2018-09-11 22:25:58,011 : INFO : P&P
201

2018-09-11 22:25:58,159 : INFO : Vocabulary size: 14 500
2018-09-11 22:25:58,159 : INFO : stopped by early_stop condition
2018-09-11 22:25:58,160 : INFO : built Dictionary(40 unique tokens: ['getting', 'get', 'also', 'bayt', 'back']...) from 2 documents (total 49 corpus positions)
2018-09-11 22:25:58,160 : INFO : WCD
2018-09-11 22:25:58,162 : INFO : 0.0
2018-09-11 22:25:58,163 : INFO : First K WMD
2018-09-11 22:25:58,166 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:25:58,167 : INFO : creating matrix with 10 documents and 462807 features
2018-09-11 22:25:58,168 : INFO : built Dictionary(35 unique tokens: ['qualify', 'someone', 'qatar', 'explain', 'vacation']...) from 2 documents (total 44 corpus positions)
2018-09-11 22:25:58,170 : INFO : [(-22.46072006225586, 7), (-19.800731658935547, 9), (-18.909976959228516, 6), (-19.044780731201172, 8), (-17.199432373046875, 5), (-12.679047584533691, 3), (-18.82349967956543, 4), (-15.129741668701172, 2), (-15.36417961

2018-09-11 22:25:58,304 : INFO : 0.0
2018-09-11 22:25:58,305 : INFO : P&P
2018-09-11 22:25:58,305 : INFO : [(-21.058446884155273, 6), (-20.922012329101562, 1), (-19.987409591674805, 3), (-18.932270050048828, 9), (-17.863933563232422, 5), (-18.062917709350586, 2), (-19.771121978759766, 7), (-18.137680053710938, 8), (-18.860422134399414, 4), (-17.472457885742188, 0)]
2018-09-11 22:25:58,306 : INFO : stopped by early_stop condition
2018-09-11 22:25:58,306 : INFO : 0.0
2018-09-11 22:25:58,307 : INFO : P&P
2018-09-11 22:25:58,307 : INFO : stopped by early_stop condition
2018-09-11 22:25:58,327 : INFO : Vocabulary size: 17 500
2018-09-11 22:25:58,327 : INFO : Vocabulary size: 12 500
2018-09-11 22:25:58,327 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:25:58,328 : INFO : WCD
2018-09-11 22:25:58,328 : INFO : WCD
2018-09-11 22:25:58,328 : INFO : creating matrix with 10 documents and 462807 features
2018-09-11 22:25:58,329 : INFO : built Dictionary(57 unique tokens

2018-09-11 22:25:58,526 : INFO : 0.0
2018-09-11 22:25:58,529 : INFO : P&P
2018-09-11 22:25:58,534 : INFO : stopped by early_stop condition
2018-09-11 22:25:58,550 : INFO : Vocabulary size: 15 500
2018-09-11 22:25:58,570 : INFO : Vocabulary size: 8 500
2018-09-11 22:25:58,572 : INFO : WCD
2018-09-11 22:25:58,575 : INFO : WCD
2018-09-11 22:25:58,578 : INFO : 0.0
2018-09-11 22:25:58,583 : INFO : First K WMD
2018-09-11 22:25:58,582 : INFO : 0.0
2018-09-11 22:25:58,587 : INFO : First K WMD
2018-09-11 22:25:58,593 : INFO : [(-17.911226272583008, 9), (-17.565181732177734, 3), (-17.15479850769043, 7), (-15.223217964172363, 5), (-16.388784408569336, 2), (-16.640281677246094, 4), (-14.70413589477539, 8), (-15.15954875946045, 1), (-0.0, 0), (-14.843816757202148, 6)]
2018-09-11 22:25:58,598 : INFO : 0.0
2018-09-11 22:25:58,599 : INFO : P&P
2018-09-11 22:25:58,601 : INFO : stopped by early_stop condition
2018-09-11 22:25:58,598 : INFO : [(-19.13587760925293, 9), (-18.659883499145508, 8), (-16.65754

2018-09-11 22:25:58,845 : INFO : 0.0
2018-09-11 22:25:58,846 : INFO : P&P
2018-09-11 22:25:58,847 : INFO : stopped by early_stop condition
2018-09-11 22:25:58,866 : INFO : Vocabulary size: 9 500
2018-09-11 22:25:58,868 : INFO : WCD
2018-09-11 22:25:58,870 : INFO : 0.0
2018-09-11 22:25:58,871 : INFO : First K WMD
2018-09-11 22:25:58,878 : INFO : [(-19.23527717590332, 3), (-19.186975479125977, 7), (-17.86774253845215, 8), (-18.0268497467041, 1), (-16.744600296020508, 4), (-16.31690788269043, 9), (-17.59482192993164, 6), (-15.280181884765625, 2), (-16.7364559173584, 0), (-13.971843719482422, 5)]
2018-09-11 22:25:58,879 : INFO : 0.0
2018-09-11 22:25:58,880 : INFO : P&P
2018-09-11 22:25:58,880 : INFO : stopped by early_stop condition
2018-09-11 22:25:58,899 : INFO : Vocabulary size: 6 500
2018-09-11 22:25:58,901 : INFO : WCD
2018-09-11 22:25:58,903 : INFO : 0.0
2018-09-11 22:25:58,904 : INFO : First K WMD
2018-09-11 22:25:58,909 : INFO : [(-23.60029411315918, 9), (-23.017337799072266, 6), (

2018-09-11 22:26:00,395 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:00,397 : INFO : built Dictionary(40 unique tokens: ['wife', 'counter', 'trying', 'submission', 'rent']...) from 2 documents (total 52 corpus positions)
2018-09-11 22:26:00,413 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:00,415 : INFO : built Dictionary(41 unique tokens: ['reply', 'approve', 'review', 'showing', 'trying']...) from 2 documents (total 53 corpus positions)
2018-09-11 22:26:00,432 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:00,433 : INFO : built Dictionary(34 unique tokens: ['ask', 'wife', 'get', 'civil', 'profession']...) from 2 documents (total 56 corpus positions)
2018-09-11 22:26:00,446 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:00,447 : INFO : built Dictionary(37 unique tokens: ['wife', 'qatar', 'civil', 'trying', 'even']...) from 2 documents (total 54 corpus 

2018-09-11 22:26:02,842 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:02,843 : INFO : built Dictionary(63 unique tokens: ['cover', 'air', 'ticket', 'tell', 'decent']...) from 2 documents (total 82 corpus positions)
2018-09-11 22:26:02,890 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:02,892 : INFO : built Dictionary(61 unique tokens: ['pharma', 'air', 'ticket', 'manager', 'years']...) from 2 documents (total 76 corpus positions)
2018-09-11 22:26:02,938 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:02,940 : INFO : built Dictionary(58 unique tokens: ['air', 'say', 'travel', 'qar', 'ticket']...) from 2 documents (total 82 corpus positions)
2018-09-11 22:26:02,997 : INFO : Removed 1 and 0 OOV words from document 1 and 2 (respectively).
2018-09-11 22:26:03,007 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:03,009 : INFO : built Dictionary(44 unique tokens: [

2018-09-11 22:26:07,150 : INFO : built Dictionary(33 unique tokens: ['shop', 'within', 'knows', 'area', 'sew']...) from 2 documents (total 36 corpus positions)
2018-09-11 22:26:07,159 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:07,160 : INFO : built Dictionary(42 unique tokens: ['shop', 'get', 'would', 'area', 'selling']...) from 2 documents (total 51 corpus positions)
2018-09-11 22:26:07,180 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:07,181 : INFO : built Dictionary(53 unique tokens: ['shop', 'would', 'either', 'selection', 'trousers']...) from 2 documents (total 59 corpus positions)
2018-09-11 22:26:07,208 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:07,209 : INFO : built Dictionary(46 unique tokens: ['work', 'would', 'garage', 'trousers', 'upcoming']...) from 2 documents (total 53 corpus positions)
2018-09-11 22:26:07,231 : INFO : adding document #0 to Dictionary(0 unique toke

2018-09-11 22:26:09,445 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:09,446 : INFO : built Dictionary(27 unique tokens: ['rewarding', 'countries', 'hope', 'started', 'label']...) from 2 documents (total 31 corpus positions)
2018-09-11 22:26:09,451 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:09,451 : INFO : built Dictionary(22 unique tokens: ['picture', 'idolize', 'even', 'small', 'one']...) from 2 documents (total 24 corpus positions)
2018-09-11 22:26:09,455 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:09,456 : INFO : built Dictionary(46 unique tokens: ['actually', 'shall', 'say', 'day', 'good']...) from 2 documents (total 51 corpus positions)
2018-09-11 22:26:09,459 : INFO : Removed 0 and 3 OOV words from document 1 and 2 (respectively).
2018-09-11 22:26:09,460 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:09,461 : INFO : built Dictionary(32 uniqu

2018-09-11 22:26:11,815 : INFO : built Dictionary(27 unique tokens: ['ask', 'qatar', 'experienced', 'could', 'nagpur']...) from 2 documents (total 39 corpus positions)
2018-09-11 22:26:11,824 : INFO : precomputing L2-norms of word weight vectors
2018-09-11 22:26:13,831 : INFO : Removed 0 and 1 OOV words from document 1 and 2 (respectively).
2018-09-11 22:26:13,832 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:13,833 : INFO : built Dictionary(18 unique tokens: ['please', 'near', 'thanks', 'duhail', 'house']...) from 2 documents (total 22 corpus positions)
2018-09-11 22:26:13,837 : INFO : Removed 0 and 1 OOV words from document 1 and 2 (respectively).
2018-09-11 22:26:13,838 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:13,839 : INFO : built Dictionary(39 unique tokens: ['friend', 'problem', 'slender', 'woman', 'seem']...) from 2 documents (total 44 corpus positions)
2018-09-11 22:26:13,841 : INFO : adding document #0 to

2018-09-11 22:26:16,203 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:16,204 : INFO : built Dictionary(38 unique tokens: ['work', 'problem', 'parcel', 'travelling', 'pay']...) from 2 documents (total 50 corpus positions)
2018-09-11 22:26:16,217 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:16,218 : INFO : built Dictionary(31 unique tokens: ['permit', 'drive', 'parcel', 'residence', 'rp']...) from 2 documents (total 46 corpus positions)
2018-09-11 22:26:16,218 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:16,219 : INFO : built Dictionary(46 unique tokens: ['projects', 'usd', 'work', 'participated', 'please']...) from 2 documents (total 61 corpus positions)
2018-09-11 22:26:16,229 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:16,230 : INFO : built Dictionary(50 unique tokens: ['info', 'said', 'renewal', 'need', 'valid']...) from 2 documents (total 62 cor

2018-09-11 22:26:18,521 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:18,522 : INFO : built Dictionary(38 unique tokens: ['dies', 'dozed', 'anybody', 'afternoon', 'bus']...) from 2 documents (total 52 corpus positions)
2018-09-11 22:26:18,534 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:18,535 : INFO : built Dictionary(31 unique tokens: ['living', 'woman', 'thanks', 'conference', 'lives']...) from 2 documents (total 37 corpus positions)
2018-09-11 22:26:18,544 : INFO : precomputing L2-norms of word weight vectors
2018-09-11 22:26:20,198 : INFO : Removed 1 and 1 OOV words from document 1 and 2 (respectively).
2018-09-11 22:26:20,199 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:20,201 : INFO : built Dictionary(41 unique tokens: ['fourth', 'wife', 'website', 'review', 'means']...) from 2 documents (total 56 corpus positions)
2018-09-11 22:26:20,219 : INFO : Removed 0 and 1 OOV words fro

2018-09-11 22:26:22,461 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:22,462 : INFO : built Dictionary(22 unique tokens: ['wife', 'qatar', 'soon', 'please', 'dear']...) from 2 documents (total 31 corpus positions)
2018-09-11 22:26:22,467 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:22,468 : INFO : built Dictionary(20 unique tokens: ['qatar', 'visa', 'required', 'medical', 'recent']...) from 2 documents (total 26 corpus positions)
2018-09-11 22:26:22,473 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:22,474 : INFO : built Dictionary(21 unique tokens: ['recommendations', 'qatar', 'required', 'deliver', 'month']...) from 2 documents (total 23 corpus positions)
2018-09-11 22:26:22,479 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:22,480 : INFO : built Dictionary(16 unique tokens: ['qatar', 'extend', 'visitor', 'soon', 'month']...) from 2 documents (total 2

2018-09-11 22:26:25,113 : INFO : built Dictionary(56 unique tokens: ['ask', 'would', 'experienced', 'connection', 'mbps']...) from 2 documents (total 73 corpus positions)
2018-09-11 22:26:25,143 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:25,145 : INFO : built Dictionary(54 unique tokens: ['info', 'signs', 'pack', 'value', 'else']...) from 2 documents (total 68 corpus positions)
2018-09-11 22:26:25,175 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:25,177 : INFO : built Dictionary(31 unique tokens: ['website', 'getting', 'get', 'shed', 'thought']...) from 2 documents (total 43 corpus positions)
2018-09-11 22:26:25,185 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:25,186 : INFO : built Dictionary(42 unique tokens: ['getting', 'moving', 'light', 'thought', 'qtel']...) from 2 documents (total 53 corpus positions)
2018-09-11 22:26:25,203 : INFO : adding document #0 to Dictionary(0 unique 

2018-09-11 22:26:28,962 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:28,962 : INFO : built Dictionary(26 unique tokens: ['book', 'getting', 'qatar', 'wife', 'thanks']...) from 2 documents (total 41 corpus positions)
2018-09-11 22:26:28,971 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:28,972 : INFO : built Dictionary(28 unique tokens: ['permit', 'qatar', 'woman', 'relocate', 'month']...) from 2 documents (total 44 corpus positions)
2018-09-11 22:26:28,980 : INFO : Removed 1 and 0 OOV words from document 1 and 2 (respectively).
2018-09-11 22:26:28,981 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:28,982 : INFO : built Dictionary(23 unique tokens: ['work', 'get', 'pls', 'follow', 'need']...) from 2 documents (total 33 corpus positions)
2018-09-11 22:26:28,989 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:28,990 : INFO : built Dictionary(20 unique tokens

2018-09-11 22:26:31,823 : INFO : built Dictionary(56 unique tokens: ['revising', 'spouse', 'almost', 'masters', 'good']...) from 2 documents (total 68 corpus positions)
2018-09-11 22:26:31,853 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:31,855 : INFO : built Dictionary(48 unique tokens: ['suzuki', 'okay', 'spouse', 'full', 'qar']...) from 2 documents (total 57 corpus positions)
2018-09-11 22:26:31,880 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:31,881 : INFO : built Dictionary(39 unique tokens: ['get', 'okay', 'spouse', 'thanks', 'say']...) from 2 documents (total 45 corpus positions)
2018-09-11 22:26:31,896 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:31,898 : INFO : built Dictionary(46 unique tokens: ['drive', 'okay', 'spouse', 'cost', 'good']...) from 2 documents (total 58 corpus positions)
2018-09-11 22:26:31,919 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2

2018-09-11 22:26:35,595 : INFO : built Dictionary(48 unique tokens: ['someone', 'know', 'following', 'garage', 'please']...) from 2 documents (total 77 corpus positions)
2018-09-11 22:26:35,613 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:35,615 : INFO : built Dictionary(24 unique tokens: ['received', 'thanks', 'soon', 'qar', 'moving']...) from 2 documents (total 32 corpus positions)
2018-09-11 22:26:35,622 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:35,623 : INFO : built Dictionary(33 unique tokens: ['ask', 'wanna', 'qatar', 'ok', 'thanks']...) from 2 documents (total 40 corpus positions)
2018-09-11 22:26:35,634 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:35,635 : INFO : built Dictionary(31 unique tokens: ['taken', 'admin', 'mean', 'thanks', 'care']...) from 2 documents (total 33 corpus positions)
2018-09-11 22:26:35,645 : INFO : precomputing L2-norms of word weight vectors
2018-

2018-09-11 22:26:38,554 : INFO : Removed 1 and 0 OOV words from document 1 and 2 (respectively).
2018-09-11 22:26:38,555 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:38,556 : INFO : built Dictionary(29 unique tokens: ['wife', 'moving', 'actually', 'doctor', 'full']...) from 2 documents (total 36 corpus positions)
2018-09-11 22:26:38,564 : INFO : Removed 1 and 0 OOV words from document 1 and 2 (respectively).
2018-09-11 22:26:38,565 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:38,566 : INFO : built Dictionary(48 unique tokens: ['make', 'group', 'full', 'valid', 'comming']...) from 2 documents (total 57 corpus positions)
2018-09-11 22:26:38,588 : INFO : Removed 1 and 0 OOV words from document 1 and 2 (respectively).
2018-09-11 22:26:38,590 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:38,591 : INFO : built Dictionary(40 unique tokens: ['thanks', 'wife', 'moving', 'full', 'darling']...)

2018-09-11 22:26:41,731 : INFO : built Dictionary(44 unique tokens: ['changing', 'manager', 'agent', 'went', 'please']...) from 2 documents (total 60 corpus positions)
2018-09-11 22:26:41,749 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:41,750 : INFO : built Dictionary(43 unique tokens: ['info', 'getting', 'qatar', 'days', 'tell']...) from 2 documents (total 56 corpus positions)
2018-09-11 22:26:41,769 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:41,771 : INFO : built Dictionary(38 unique tokens: ['getting', 'get', 'days', 'passport', 'country']...) from 2 documents (total 56 corpus positions)
2018-09-11 22:26:41,785 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:41,786 : INFO : built Dictionary(48 unique tokens: ['ask', 'god', 'soon', 'went', 'please']...) from 2 documents (total 61 corpus positions)
2018-09-11 22:26:41,809 : INFO : adding document #0 to Dictionary(0 unique tokens: [

2018-09-11 22:26:44,884 : INFO : built Dictionary(63 unique tokens: ['work', 'drive', 'labour', 'clean', 'shop']...) from 2 documents (total 74 corpus positions)
2018-09-11 22:26:44,924 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:44,925 : INFO : built Dictionary(31 unique tokens: ['shop', 'drive', 'dhabi', 'labour', 'laws']...) from 2 documents (total 70 corpus positions)
2018-09-11 22:26:44,952 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:44,953 : INFO : built Dictionary(49 unique tokens: ['work', 'drive', 'better', 'shop', 'benefits']...) from 2 documents (total 61 corpus positions)
2018-09-11 22:26:44,977 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:44,978 : INFO : built Dictionary(40 unique tokens: ['shop', 'drive', 'dhabi', 'located', 'tell']...) from 2 documents (total 51 corpus positions)
2018-09-11 22:26:44,990 : INFO : adding document #0 to Dictionary(0 unique tokens: [])

2018-09-11 22:26:47,431 : INFO : built Dictionary(38 unique tokens: ['reply', 'approve', 'review', 'showing', 'daughter']...) from 2 documents (total 50 corpus positions)
2018-09-11 22:26:47,443 : INFO : Removed 1 and 0 OOV words from document 1 and 2 (respectively).
2018-09-11 22:26:47,444 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:47,445 : INFO : built Dictionary(34 unique tokens: ['accepted', 'pls', 'review', 'ok', 'showing']...) from 2 documents (total 46 corpus positions)
2018-09-11 22:26:47,455 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:47,456 : INFO : built Dictionary(37 unique tokens: ['wife', 'counter', 'qar', 'submission', 'rp']...) from 2 documents (total 49 corpus positions)
2018-09-11 22:26:47,467 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:47,468 : INFO : built Dictionary(43 unique tokens: ['ask', 'get', 'went', 'thanks', 'meet']...) from 2 documents (total 57 cor

2018-09-11 22:26:49,282 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:49,283 : INFO : built Dictionary(52 unique tokens: ['ask', 'nurse', 'please', 'good', 'may']...) from 2 documents (total 64 corpus positions)
2018-09-11 22:26:49,304 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:49,305 : INFO : built Dictionary(48 unique tokens: ['formalities', 'rules', 'card', 'appreciated', 'required']...) from 2 documents (total 62 corpus positions)
2018-09-11 22:26:49,325 : INFO : precomputing L2-norms of word weight vectors
2018-09-11 22:26:50,417 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:50,418 : INFO : built Dictionary(32 unique tokens: ['ask', 'book', 'get', 'said', 'elsewhere']...) from 2 documents (total 86 corpus positions)
2018-09-11 22:26:50,448 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:50,449 : INFO : built Dictionary(54 unique tokens: ['ask', '

2018-09-11 22:26:52,534 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:52,535 : INFO : built Dictionary(32 unique tokens: ['info', 'living', 'website', 'thanks', 'aussie']...) from 2 documents (total 44 corpus positions)
2018-09-11 22:26:52,546 : INFO : precomputing L2-norms of word weight vectors
2018-09-11 22:26:52,843 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:52,844 : INFO : built Dictionary(22 unique tokens: ['body', 'associates', 'p', 'qatar', 'somebody']...) from 2 documents (total 32 corpus positions)
2018-09-11 22:26:52,849 : INFO : Removed 1 and 0 OOV words from document 1 and 2 (respectively).
2018-09-11 22:26:52,850 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:52,851 : INFO : built Dictionary(31 unique tokens: ['plz', 'qatar', 'days', 'associates', 'p']...) from 2 documents (total 39 corpus positions)
2018-09-11 22:26:52,861 : INFO : adding document #0 to Dictionary(0 un

2018-09-11 22:26:54,676 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:54,677 : INFO : built Dictionary(37 unique tokens: ['hospitals', 'qatar', 'also', 'insurance', 'thanks']...) from 2 documents (total 44 corpus positions)
2018-09-11 22:26:54,688 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:54,689 : INFO : built Dictionary(39 unique tokens: ['anxious', 'qatar', 'pls', 'nurse', 'per']...) from 2 documents (total 45 corpus positions)
2018-09-11 22:26:54,701 : INFO : precomputing L2-norms of word weight vectors
2018-09-11 22:26:56,147 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:56,149 : INFO : built Dictionary(34 unique tokens: ['perm', 'matt', 'moving', 'wife', 'thanks']...) from 2 documents (total 72 corpus positions)
2018-09-11 22:26:56,182 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:56,183 : INFO : built Dictionary(47 unique tokens: ['perm', 'wo

2018-09-11 22:26:58,198 : INFO : built Dictionary(27 unique tokens: ['mandoob', 'get', 'baldiya', 'visit', 'even']...) from 2 documents (total 44 corpus positions)
2018-09-11 22:26:58,200 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:58,201 : INFO : built Dictionary(14 unique tokens: ['reasonable', 'stores', 'center', 'thank', 'mall']...) from 2 documents (total 28 corpus positions)
2018-09-11 22:26:58,204 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:58,205 : INFO : built Dictionary(25 unique tokens: ['get', 'thanks', 'mu', 'daughter', 'also']...) from 2 documents (total 39 corpus positions)
2018-09-11 22:26:58,205 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:26:58,206 : INFO : built Dictionary(19 unique tokens: ['best', 'showrooms', 'stores', 'lot', 'thanks']...) from 2 documents (total 31 corpus positions)
2018-09-11 22:26:58,210 : INFO : Removed 11 and 0 OOV words from document 1 and

2018-09-11 22:27:00,105 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:00,106 : INFO : built Dictionary(39 unique tokens: ['get', 'know', 'apartments', 'ok', 'seem']...) from 2 documents (total 50 corpus positions)
2018-09-11 22:27:00,120 : INFO : precomputing L2-norms of word weight vectors
2018-09-11 22:27:01,673 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:01,675 : INFO : built Dictionary(35 unique tokens: ['qatar', 'suggestions', 'receive', 'nman', 'hello']...) from 2 documents (total 51 corpus positions)
2018-09-11 22:27:01,684 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:01,685 : INFO : built Dictionary(25 unique tokens: ['charges', 'warm', 'say', 'hav', 'india']...) from 2 documents (total 39 corpus positions)
2018-09-11 22:27:01,691 : INFO : Removed 1 and 0 OOV words from document 1 and 2 (respectively).
2018-09-11 22:27:01,692 : INFO : adding document #0 to Dictionary(0 uniqu

2018-09-11 22:27:03,495 : INFO : built Dictionary(19 unique tokens: ['got', 'application', 'bringing', 'schedule', 'meet']...) from 2 documents (total 30 corpus positions)
2018-09-11 22:27:03,499 : INFO : precomputing L2-norms of word weight vectors
2018-09-11 22:27:03,527 : INFO : Removed 0 and 1 OOV words from document 1 and 2 (respectively).
2018-09-11 22:27:03,528 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:03,529 : INFO : built Dictionary(30 unique tokens: ['nationals', 'qatar', 'admin', 'could', 'benefits']...) from 2 documents (total 42 corpus positions)
2018-09-11 22:27:03,540 : INFO : Removed 0 and 1 OOV words from document 1 and 2 (respectively).
2018-09-11 22:27:03,541 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:03,542 : INFO : built Dictionary(26 unique tokens: ['current', 'qatar', 'admin', 'could', 'junior']...) from 2 documents (total 42 corpus positions)
2018-09-11 22:27:03,551 : INFO : Removed 0 and

2018-09-11 22:27:05,403 : INFO : Removed 4 and 0 OOV words from document 1 and 2 (respectively).
2018-09-11 22:27:05,404 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:05,405 : INFO : built Dictionary(41 unique tokens: ['moving', 'pls', 'area', 'comments', 'hello']...) from 2 documents (total 52 corpus positions)
2018-09-11 22:27:05,415 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:05,415 : INFO : built Dictionary(14 unique tokens: ['get', 'people', 'lot', 'must', 'plan']...) from 2 documents (total 14 corpus positions)
2018-09-11 22:27:05,418 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:05,419 : INFO : built Dictionary(34 unique tokens: ['getting', 'mumbai', 'night', 'anybody', 'n']...) from 2 documents (total 37 corpus positions)
2018-09-11 22:27:05,427 : INFO : precomputing L2-norms of word weight vectors
2018-09-11 22:27:06,969 : INFO : adding document #0 to Dictionary(0 unique tok

2018-09-11 22:27:08,831 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:08,832 : INFO : built Dictionary(25 unique tokens: ['alternative', 'qatar', 'thanks', 'getting', 'long']...) from 2 documents (total 34 corpus positions)
2018-09-11 22:27:08,839 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:08,840 : INFO : built Dictionary(30 unique tokens: ['cancelled', 'work', 'qatar', 'wait', 'long']...) from 2 documents (total 40 corpus positions)
2018-09-11 22:27:08,849 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:08,850 : INFO : built Dictionary(41 unique tokens: ['ask', 'work', 'entry', 'went', 'vacation']...) from 2 documents (total 54 corpus positions)
2018-09-11 22:27:08,865 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:08,866 : INFO : built Dictionary(22 unique tokens: ['work', 'get', 'employment', 'long', 'hello']...) from 2 documents (total 29 corpus po

2018-09-11 22:27:10,757 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:10,758 : INFO : built Dictionary(43 unique tokens: ['difference', 'received', 'qatar', 'capacity', 'need']...) from 2 documents (total 61 corpus positions)
2018-09-11 22:27:10,775 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:10,776 : INFO : built Dictionary(41 unique tokens: ['received', 'qatar', 'would', 'means', 'iphone']...) from 2 documents (total 54 corpus positions)
2018-09-11 22:27:10,793 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:10,794 : INFO : built Dictionary(50 unique tokens: ['barely', 'round', 'would', 'better', 'p']...) from 2 documents (total 59 corpus positions)
2018-09-11 22:27:10,818 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:10,819 : INFO : built Dictionary(47 unique tokens: ['round', 'recently', 'men', 'received', 'wearing']...) from 2 documents (total 63 

2018-09-11 22:27:14,199 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:14,200 : INFO : built Dictionary(50 unique tokens: ['ask', 'nurse', 'good', 'months', 'years']...) from 2 documents (total 63 corpus positions)
2018-09-11 22:27:14,218 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:14,219 : INFO : built Dictionary(37 unique tokens: ['qatar', 'new', 'old', 'cost', 'one']...) from 2 documents (total 49 corpus positions)
2018-09-11 22:27:14,230 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:14,231 : INFO : built Dictionary(35 unique tokens: ['work', 'qatar', 'could', 'government', 'soon']...) from 2 documents (total 43 corpus positions)
2018-09-11 22:27:14,242 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:14,243 : INFO : built Dictionary(35 unique tokens: ['wife', 'work', 'qatar', 'weeks', 'thanks']...) from 2 documents (total 45 corpus positions)
2018-09

2018-09-11 22:27:16,281 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:16,282 : INFO : built Dictionary(30 unique tokens: ['flavours', 'qatar', 'isnt', 'store', 'seem']...) from 2 documents (total 37 corpus positions)
2018-09-11 22:27:16,288 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:16,289 : INFO : built Dictionary(12 unique tokens: ['shoe', 'qatar', 'isnt', 'clothes', 'apple']...) from 2 documents (total 21 corpus positions)
2018-09-11 22:27:16,291 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:16,292 : INFO : built Dictionary(24 unique tokens: ['deserves', 'get', 'isnt', 'thanks', 'apple']...) from 2 documents (total 31 corpus positions)
2018-09-11 22:27:16,296 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:16,297 : INFO : built Dictionary(15 unique tokens: ['isnt', 'moving', 'stores', 'thanks', 'apple']...) from 2 documents (total 22 corpus positio

2018-09-11 22:27:19,588 : INFO : built Dictionary(36 unique tokens: ['wife', 'qatar', 'days', 'visit', 'shall']...) from 2 documents (total 50 corpus positions)
2018-09-11 22:27:19,599 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:19,600 : INFO : built Dictionary(37 unique tokens: ['days', 'get', 'validity', 'except', 'shut']...) from 2 documents (total 57 corpus positions)
2018-09-11 22:27:19,612 : INFO : Removed 1 and 0 OOV words from document 1 and 2 (respectively).
2018-09-11 22:27:19,613 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:19,614 : INFO : built Dictionary(45 unique tokens: ['cancelled', 'release', 'manager', 'shall', 'rp']...) from 2 documents (total 57 corpus positions)
2018-09-11 22:27:19,630 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:19,631 : INFO : built Dictionary(20 unique tokens: ['qatar', 'extend', 'expiry', 'penalty', 'shall']...) from 2 documents (total 30 c

2018-09-11 22:27:21,764 : INFO : Removed 1 and 0 OOV words from document 1 and 2 (respectively).
2018-09-11 22:27:21,765 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:21,766 : INFO : built Dictionary(34 unique tokens: ['trap', 'qatar', 'crazy', 'intend', 'weeks']...) from 2 documents (total 44 corpus positions)
2018-09-11 22:27:21,775 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:21,776 : INFO : built Dictionary(38 unique tokens: ['flying', 'drive', 'travel', 'bodybuilding', 'supplements']...) from 2 documents (total 48 corpus positions)
2018-09-11 22:27:21,789 : INFO : Removed 1 and 0 OOV words from document 1 and 2 (respectively).
2018-09-11 22:27:21,790 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:21,790 : INFO : built Dictionary(44 unique tokens: ['intend', 'engagement', 'back', 'travel', 'drive']...) from 2 documents (total 54 corpus positions)
2018-09-11 22:27:21,807 : INFO : ad

2018-09-11 22:27:24,910 : INFO : built Dictionary(52 unique tokens: ['best', 'plz', 'open', 'know', 'healthy']...) from 2 documents (total 63 corpus positions)
2018-09-11 22:27:24,933 : INFO : Removed 1 and 0 OOV words from document 1 and 2 (respectively).
2018-09-11 22:27:24,934 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:24,935 : INFO : built Dictionary(41 unique tokens: ['would', 'plz', 'get', 'suggessions', 'tab']...) from 2 documents (total 50 corpus positions)
2018-09-11 22:27:24,951 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:24,952 : INFO : built Dictionary(31 unique tokens: ['plz', 'qatar', 'open', 'know', 'lock']...) from 2 documents (total 40 corpus positions)
2018-09-11 22:27:24,962 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:24,963 : INFO : built Dictionary(46 unique tokens: ['plz', 'said', 'open', 'tab', 'pad']...) from 2 documents (total 65 corpus positions)
2018-0

2018-09-11 22:27:27,457 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:27,459 : INFO : built Dictionary(26 unique tokens: ['plz', 'shop', 'open', 'ive', 'foreign']...) from 2 documents (total 36 corpus positions)
2018-09-11 22:27:27,467 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:27,468 : INFO : built Dictionary(28 unique tokens: ['shop', 'get', 'ive', 'ok', 'thought']...) from 2 documents (total 39 corpus positions)
2018-09-11 22:27:27,477 : INFO : Removed 2 and 0 OOV words from document 1 and 2 (respectively).
2018-09-11 22:27:27,478 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:27,479 : INFO : built Dictionary(32 unique tokens: ['alternative', 'reply', 'ive', 'healthy', 'could']...) from 2 documents (total 41 corpus positions)
2018-09-11 22:27:27,490 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:27,491 : INFO : built Dictionary(28 unique tokens: ['

2018-09-11 22:27:30,223 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:30,225 : INFO : built Dictionary(26 unique tokens: ['mess', 'better', 'thank', 'one', 'food']...) from 2 documents (total 37 corpus positions)
2018-09-11 22:27:30,233 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:30,234 : INFO : built Dictionary(30 unique tokens: ['mess', 'work', 'drive', 'area', 'guys']...) from 2 documents (total 40 corpus positions)
2018-09-11 22:27:30,243 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:30,244 : INFO : built Dictionary(32 unique tokens: ['mess', 'korean', 'thanks', 'thank', 'please']...) from 2 documents (total 46 corpus positions)
2018-09-11 22:27:30,254 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:30,255 : INFO : built Dictionary(41 unique tokens: ['mess', 'shop', 'also', 'opening', 'let']...) from 2 documents (total 58 corpus positions)
2018-09-

2018-09-11 22:27:32,112 : INFO : precomputing L2-norms of word weight vectors
2018-09-11 22:27:32,872 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:32,874 : INFO : built Dictionary(31 unique tokens: ['ask', 'permit', 'qatar', 'would', 'wife']...) from 2 documents (total 44 corpus positions)
2018-09-11 22:27:32,882 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:32,882 : INFO : built Dictionary(32 unique tokens: ['wife', 'work', 'qatar', 'steps', 'n']...) from 2 documents (total 56 corpus positions)
2018-09-11 22:27:32,894 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:32,895 : INFO : built Dictionary(42 unique tokens: ['ask', 'work', 'get', 'would', 'wife']...) from 2 documents (total 59 corpus positions)
2018-09-11 22:27:32,914 : INFO : Removed 2 and 0 OOV words from document 1 and 2 (respectively).
2018-09-11 22:27:32,915 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
20

2018-09-11 22:27:34,857 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:34,858 : INFO : built Dictionary(46 unique tokens: ['aside', 'dance', 'wicked', 'cool', 'community']...) from 2 documents (total 56 corpus positions)
2018-09-11 22:27:34,875 : INFO : precomputing L2-norms of word weight vectors
2018-09-11 22:27:35,605 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:35,607 : INFO : built Dictionary(21 unique tokens: ['qatar', 'pls', 'psychologist', 'children', 'child']...) from 2 documents (total 27 corpus positions)
2018-09-11 22:27:35,612 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:35,613 : INFO : built Dictionary(50 unique tokens: ['childrens', 'spectrum', 'second', 'son', 'moving']...) from 2 documents (total 59 corpus positions)
2018-09-11 22:27:35,631 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:35,632 : INFO : built Dictionary(20 unique tokens

2018-09-11 22:27:37,527 : INFO : Removed 1 and 0 OOV words from document 1 and 2 (respectively).
2018-09-11 22:27:37,528 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:37,529 : INFO : built Dictionary(45 unique tokens: ['resign', 'resignation', 'please', 'contract', 'months']...) from 2 documents (total 54 corpus positions)
2018-09-11 22:27:37,546 : INFO : precomputing L2-norms of word weight vectors
2018-09-11 22:27:38,383 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:38,385 : INFO : built Dictionary(53 unique tokens: ['open', 'outlandish', 'food', 'offered', 'house']...) from 2 documents (total 65 corpus positions)
2018-09-11 22:27:38,412 : INFO : Removed 1 and 0 OOV words from document 1 and 2 (respectively).
2018-09-11 22:27:38,413 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:38,414 : INFO : built Dictionary(45 unique tokens: ['shape', 'full', 'qatar', 'food', 'house']...) from 2 d

2018-09-11 22:27:40,427 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:40,428 : INFO : built Dictionary(57 unique tokens: ['work', 'ok', 'pants', 'wear', 'qatar']...) from 2 documents (total 69 corpus positions)
2018-09-11 22:27:40,461 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:40,462 : INFO : built Dictionary(50 unique tokens: ['said', 'countries', 'ok', 'work', 'attitiude']...) from 2 documents (total 62 corpus positions)
2018-09-11 22:27:40,486 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:40,487 : INFO : built Dictionary(55 unique tokens: ['work', 'know', 'ok', 'qatari', 'wear']...) from 2 documents (total 68 corpus positions)
2018-09-11 22:27:40,517 : INFO : precomputing L2-norms of word weight vectors
2018-09-11 22:27:41,022 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:41,023 : INFO : built Dictionary(33 unique tokens: ['info', 'within', 'movi

2018-09-11 22:27:42,916 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:42,917 : INFO : built Dictionary(46 unique tokens: ['section', 'c', 'labour', 'deliver', 'care']...) from 2 documents (total 51 corpus positions)
2018-09-11 22:27:42,935 : INFO : Removed 2 and 0 OOV words from document 1 and 2 (respectively).
2018-09-11 22:27:42,936 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:42,937 : INFO : built Dictionary(31 unique tokens: ['make', 'suggestions', 'closed', 'insurance', 'prices']...) from 2 documents (total 33 corpus positions)
2018-09-11 22:27:42,946 : INFO : Removed 1 and 0 OOV words from document 1 and 2 (respectively).
2018-09-11 22:27:42,947 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:42,948 : INFO : built Dictionary(38 unique tokens: ['proceed', 'get', 'suggestions', 'better', 'thanks']...) from 2 documents (total 45 corpus positions)
2018-09-11 22:27:42,961 : INFO : prec

2018-09-11 22:27:45,927 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:45,928 : INFO : built Dictionary(49 unique tokens: ['shop', 'trace', 'small', 'please', 'india']...) from 2 documents (total 58 corpus positions)
2018-09-11 22:27:45,947 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:45,948 : INFO : built Dictionary(52 unique tokens: ['shop', 'kill', 'trace', 'small', 'police']...) from 2 documents (total 61 corpus positions)
2018-09-11 22:27:45,970 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:45,972 : INFO : built Dictionary(52 unique tokens: ['shop', 'drive', 'second', 'small', 'please']...) from 2 documents (total 59 corpus positions)
2018-09-11 22:27:45,994 : INFO : precomputing L2-norms of word weight vectors
2018-09-11 22:27:46,372 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:46,373 : INFO : built Dictionary(39 unique tokens: ['wife', 'qatar',

2018-09-11 22:27:48,250 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:48,251 : INFO : built Dictionary(24 unique tokens: ['group', 'c', 'thanks', 'lives', 'please']...) from 2 documents (total 30 corpus positions)
2018-09-11 22:27:48,258 : INFO : Removed 0 and 1 OOV words from document 1 and 2 (respectively).
2018-09-11 22:27:48,258 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:48,259 : INFO : built Dictionary(18 unique tokens: ['player', 'people', 'basket', 'knows', 'playin']...) from 2 documents (total 24 corpus positions)
2018-09-11 22:27:48,263 : INFO : Removed 0 and 1 OOV words from document 1 and 2 (respectively).
2018-09-11 22:27:48,264 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:48,265 : INFO : built Dictionary(35 unique tokens: ['group', 'back', 'could', 'sports', 'lives']...) from 2 documents (total 44 corpus positions)
2018-09-11 22:27:48,277 : INFO : Removed 0 and 1 OOV w

2018-09-11 22:27:51,376 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:51,377 : INFO : built Dictionary(40 unique tokens: ['wife', 'work', 'old', 'cost', 'resume']...) from 2 documents (total 51 corpus positions)
2018-09-11 22:27:51,393 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:51,394 : INFO : built Dictionary(55 unique tokens: ['wife', 'drive', 'course', 'work', 'cleaning']...) from 2 documents (total 60 corpus positions)
2018-09-11 22:27:51,419 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:51,421 : INFO : built Dictionary(44 unique tokens: ['intend', 'group', 'work', 'contact', 'options']...) from 2 documents (total 52 corpus positions)
2018-09-11 22:27:51,439 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:51,440 : INFO : built Dictionary(30 unique tokens: ['wife', 'work', 'know', 'thanks', 'english']...) from 2 documents (total 37 corpus positions

2018-09-11 22:27:53,613 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:53,614 : INFO : built Dictionary(37 unique tokens: ['work', 'problem', 'household', 'pay', 'house']...) from 2 documents (total 42 corpus positions)
2018-09-11 22:27:53,626 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:53,627 : INFO : built Dictionary(37 unique tokens: ['availabe', 'qatar', 'know', 'located', 'transport']...) from 2 documents (total 41 corpus positions)
2018-09-11 22:27:53,640 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:53,641 : INFO : built Dictionary(35 unique tokens: ['clear', 'goods', 'household', 'companies', 'thanks']...) from 2 documents (total 42 corpus positions)
2018-09-11 22:27:53,652 : INFO : Removed 1 and 0 OOV words from document 1 and 2 (respectively).
2018-09-11 22:27:53,653 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:53,654 : INFO : built Diction

2018-09-11 22:27:57,009 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:57,010 : INFO : built Dictionary(41 unique tokens: ['would', 'added', 'ql', 'starting', 'better']...) from 2 documents (total 49 corpus positions)
2018-09-11 22:27:57,023 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:57,024 : INFO : built Dictionary(24 unique tokens: ['qatar', 'starting', 'hobby', 'past', 'comments']...) from 2 documents (total 28 corpus positions)
2018-09-11 22:27:57,030 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:57,031 : INFO : built Dictionary(47 unique tokens: ['starting', 'climbing', 'soon', 'moving', 'good']...) from 2 documents (total 57 corpus positions)
2018-09-11 22:27:57,049 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:57,050 : INFO : built Dictionary(43 unique tokens: ['info', 'getting', 'problem', 'pack', 'settings']...) from 2 documents (total 52 co

2018-09-11 22:27:59,069 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:59,070 : INFO : built Dictionary(27 unique tokens: ['plz', 'except', 'thanks', 'one', 'mazda']...) from 2 documents (total 32 corpus positions)
2018-09-11 22:27:59,078 : INFO : Removed 0 and 1 OOV words from document 1 and 2 (respectively).
2018-09-11 22:27:59,079 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:59,079 : INFO : built Dictionary(25 unique tokens: ['get', 'cleaned', 'except', 'thanks', 'cleaning']...) from 2 documents (total 34 corpus positions)
2018-09-11 22:27:59,087 : INFO : Removed 0 and 1 OOV words from document 1 and 2 (respectively).
2018-09-11 22:27:59,088 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:27:59,089 : INFO : built Dictionary(25 unique tokens: ['qatar', 'except', 'thanks', 'costs', 'selection']...) from 2 documents (total 28 corpus positions)
2018-09-11 22:27:59,095 : INFO : Removed 0 and 

2018-09-11 22:28:02,351 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:28:02,352 : INFO : built Dictionary(10 unique tokens: ['ireland', 'anyone', 'thanks', 'recommend', 'hairdressers']...) from 2 documents (total 22 corpus positions)
2018-09-11 22:28:02,356 : INFO : Removed 1 and 0 OOV words from document 1 and 2 (respectively).
2018-09-11 22:28:02,357 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:28:02,358 : INFO : built Dictionary(17 unique tokens: ['hair', 'pay', 'fantastic', 'thanks', 'hairdressers']...) from 2 documents (total 23 corpus positions)
2018-09-11 22:28:02,362 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:28:02,363 : INFO : built Dictionary(28 unique tokens: ['wife', 'thanks', 'hairdressers', 'care', 'recommend']...) from 2 documents (total 32 corpus positions)
2018-09-11 22:28:02,371 : INFO : Removed 1 and 0 OOV words from document 1 and 2 (respectively).
2018-09-11 22:28:02,37

2018-09-11 22:28:04,460 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:28:04,462 : INFO : built Dictionary(37 unique tokens: ['enter', 'qatar', 'entry', 'passport', 'form']...) from 2 documents (total 53 corpus positions)
2018-09-11 22:28:04,477 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:28:04,478 : INFO : built Dictionary(27 unique tokens: ['entry', 'maximum', 'per', 'month', 'explain']...) from 2 documents (total 41 corpus positions)
2018-09-11 22:28:04,487 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-09-11 22:28:04,488 : INFO : built Dictionary(23 unique tokens: ['qatar', 'per', 'month', 'explain', 'one']...) from 2 documents (total 35 corpus positions)
2018-09-11 22:28:04,495 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
CPU times: user 2.14 s, sys: 5.08 s, total: 7.22 s
Wall time: 2min 51s


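The table below shows point estimates of the means and standard deviations of the MAP scores and elapsed times. Baselines and winners for each year are displayed in bold. On the 2016 dataset, the Soft Cosine Measure achieves the highest mean MAP score (77.15), slightly ahead of the competition winner (76.70). On the 2017 dataset, it scores 44.25, which is above the IR baseline (41.85) but below the winner (47.22). In both years, the differences between the evaluated strategies are small compared to the standard deviations of their MAP scores.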
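For readers unfamiliar with the metric, the following is a minimal sketch of how a mean average precision (MAP) score can be computed and then summarized as a "mean ±std" cell. It is not the evaluation code used in this tutorial: the helper names `average_precision` and `mean_average_precision`, the toy relevance labels, and the use of the population standard deviation (`statistics.pstdev`) are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def average_precision(relevance):
    """Average precision of one ranked result list of 0/1 relevance labels."""
    hits = 0
    precisions = []
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            # Precision at the rank of each relevant result.
            precisions.append(hits / rank)
    return mean(precisions) if precisions else 0.0


def mean_average_precision(rankings):
    """Mean of the average precisions over several ranked result lists."""
    return mean(average_precision(ranking) for ranking in rankings)


# Hypothetical toy data: two folds, each holding the ranked results of two queries.
folds = [
    [[1, 0, 1], [0, 1, 0]],
    [[1, 1, 0], [1, 0, 0]],
]

# MAP per fold, expressed in percent as in the table.
map_scores = [100 * mean_average_precision(fold) for fold in folds]

# Summarize the folds as "mean ±std" (population standard deviation assumed).
summary = "%.02f ±%.02f" % (mean(map_scores), pstdev(map_scores))
print(summary)
```

A large standard deviation in such a summary indicates that the score varies a lot from fold to fold, which is worth keeping in mind when comparing strategies whose means are close.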

In [12]:
from itertools import chain

from IPython.display import display, Markdown

output = []
# Reference rows (competition winners and baselines): only the MAP score is
# given, so the elapsed time and both standard deviations are zero placeholders.
baselines = [
    (("2016-test", "**Winner (UH-PRHLT-primary)**"), ((76.70, 0), (0, 0))),
    (("2016-test", "**Baseline 1 (IR)**"), ((74.75, 0), (0, 0))),
    (("2016-test", "**Baseline 2 (random)**"), ((46.98, 0), (0, 0))),
    (("2017-test", "**Winner (SimBow-primary)**"), ((47.22, 0), (0, 0))),
    (("2017-test", "**Baseline 1 (IR)**"), ((41.85, 0), (0, 0))),
    (("2017-test", "**Baseline 2 (random)**"), ((29.81, 0), (0, 0)))]
table_header = ["Dataset | Strategy | MAP score | Elapsed time (sec)", ":---|:---|:---|---:"]
# Sort the rows by dataset, then by descending mean MAP score.
for row, ((dataset, technique), ((mean_map_score, mean_duration), (std_map_score, std_duration))) \
        in enumerate(sorted(chain(zip(args_list, results), baselines), key=lambda x: (x[0][0], -x[1][0][0]))):
    # Each dataset has len(strategies) measured rows plus 3 reference rows,
    # so a new table starts after that many rows.
    if row % (len(strategies) + 3) == 0:
        output.extend(chain(["\n"], table_header))
    map_score = "%.02f ±%.02f" % (mean_map_score, std_map_score)
    # Reference rows have no measured duration; leave the cell empty.
    duration = "%.02f ±%.02f" % (mean_duration, std_duration) if mean_duration else ""
    output.append("%s|%s|%s|%s" % (dataset, technique, map_score, duration))

display(Markdown('\n'.join(output)))



Dataset | Strategy | MAP score | Elapsed time (sec)
:---|:---|:---|---:
2016-test|softcossim|77.15 ±10.83|4.48 ±0.56
2016-test|**Winner (UH-PRHLT-primary)**|76.70 ±0.00|
2016-test|cossim|76.45 ±10.40|0.25 ±0.04
2016-test|wmd-gensim|76.15 ±11.51|13.79 ±1.39
2016-test|**Baseline 1 (IR)**|74.75 ±0.00|
2016-test|wmd-relax|72.03 ±11.33|0.34 ±0.07
2016-test|**Baseline 2 (random)**|46.98 ±0.00|


Dataset | Strategy | MAP score | Elapsed time (sec)
:---|:---|:---|---:
2017-test|**Winner (SimBow-primary)**|47.22 ±0.00|
2017-test|wmd-relax|45.04 ±15.44|0.39 ±0.07
2017-test|cossim|44.38 ±14.71|0.29 ±0.05
2017-test|softcossim|44.25 ±15.68|4.89 ±0.80
2017-test|wmd-gensim|44.08 ±15.96|16.69 ±1.90
2017-test|**Baseline 1 (IR)**|41.85 ±0.00|
2017-test|**Baseline 2 (random)**|29.81 ±0.00|

## References

1. Grigori Sidorov et al. *Soft Similarity and Soft Cosine Measure: Similarity of Features in Vector Space Model*, 2014. ([link to PDF](http://www.scielo.org.mx/pdf/cys/v18n3/v18n3a7.pdf))
2. Delphine Charlet and Geraldine Damnati. *SimBow at SemEval-2017 Task 3: Soft-Cosine Semantic Similarity between Questions for Community Question Answering*, 2017. ([link to PDF](http://www.aclweb.org/anthology/S17-2051))
3. Tomas Mikolov et al. *Efficient Estimation of Word Representations in Vector Space*, 2013. ([link to PDF](https://arxiv.org/pdf/1301.3781.pdf))