
Add Sent2Vec model. Fix #1376 #1619

Closed
wants to merge 143 commits

Conversation

@prerna135 (Contributor) commented Oct 10, 2017

Rough initial code for sent2vec.

Fixes warnings in the .py files
@menshikh-iv Fixing warnings in the .py files according to the Google Code Style. Most of the warnings were due to indentation errors.
build succeeded, 21 warnings.
Getting there. :-)
Now I'm down to:
`build succeeded, 5 warnings.`

However, I'm in a bit of a fix. Changing `doc2vec.rst` and `word2vec.rst` to `.inc` files removed the duplicate warnings, but it also invalidates the references to these documents from my main toctree, and the following warnings are produced:

`apiref.rst:8: WARNING: toctree contains reference to nonexisting document u'models/doc2vec'`
`apiref.rst:8: WARNING: toctree contains reference to nonexisting document u'models/word2vec'`
Rough initial code for sent2vec and tests in jupyter notebook
@menshikh-iv menshikh-iv added the incubator project PR is RaRe incubator project label Oct 10, 2017
@menshikh-iv (Contributor) left a comment

Great start :+1:

What needs to be added:


logger = logging.getLogger(__name__)

MAX_WORDS_IN_BATCH = 10000
Contributor:

You can import this constant instead of defining it explicitly.
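For instance, a minimal sketch of the suggestion (MAX_WORDS_IN_BATCH is already defined in gensim.models.word2vec):

from gensim.models.word2vec import MAX_WORDS_IN_BATCH  # reuse the existing constant (10000) rather than redefining it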

self, sentences=None, sg=0, hs=0, size=100, alpha=0.2, window=5, min_count=5,
max_vocab_size=None, word_ngrams=2, loss='ns', sample=1e-3, seed=1, workers=3, min_alpha=0.0001,
negative=5, cbow_mean=1, hashfxn=hash, iter=5, null_word=0, min_n=3, max_n=6, sorted_vocab=1, bucket=2000000,
trim_rule=None, batch_words=MAX_WORDS_IN_BATCH, dropoutK=2):
Contributor:

All parameters should be in lowercase (dropoutK)

from numpy import dot
from gensim import utils, matutils

from gensim.models.word2vec import Word2Vec
Contributor:

Useless import

trim_rule=None, batch_words=MAX_WORDS_IN_BATCH, dropoutK=2):

# sent2vec specific params
#dropoutK is the number of ngrams dropped while training a sent2vec model
Contributor:

Missing space after # (here and everywhere).

trim_rule=trim_rule, sorted_vocab=sorted_vocab, batch_words=batch_words)

def scan_vocab(self, sentences, progress_per=10000, trim_rule=None):
"""Do an initial scan of all words appearing in sentences."""
Contributor:

Please add more description to the docstring: what is the difference between fastText and sent2vec?

line_size = len(sentence)
discard = [False] * line_size
while (num_discarded < self.dropoutK and line_size - num_discarded > 2):
token_to_discard = randint(0,line_size-1)
Contributor:

Need a space after `,` and spaces around `-`.

discard = [False] * line_size
while (num_discarded < self.dropoutK and line_size - num_discarded > 2):
token_to_discard = randint(0,line_size-1)
if discard[token_to_discard] == False:
Contributor:

if discard[token_to_discard] == False: -> if not discard[token_to_discard]:

def word_vec(self, word, use_norm=False):
return FastTextKeyedVectors.word_vec(self.wv, word, use_norm=use_norm)

def sent_vec(self, sentence):
Contributor:

This method is bad: there is no need to split a raw string into tokens here; it should accept, for example, a list of tokens.

word_count=0, queue_factor=2, report_delay=1.0):
super(Sent2Vec, self).train(sentences, total_examples=total_examples, epochs=epochs, start_alpha=start_alpha, end_alpha=end_alpha)

def __getitem__(self, word):
Contributor:

What's the difference between __getitem__ and word_vec?

@menshikh-iv (Contributor) left a comment

Only one question: why don't you re-use the same functionality from the current fastText implementation?

# TODO: add docstrings and tests


class Entry():
Contributor:

Please add inheritance from object explicitly (here and everywhere).

Contributor:

For Entry's purposes, you could use a namedtuple.

Contributor Author:

I can't use a tuple because it is immutable.
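A small sketch of the trade-off (names and fields are illustrative): a namedtuple cannot have its count field incremented in place, which is why a plain class is used here.

from collections import namedtuple

# immutable: entry.count += 1 would raise AttributeError
EntryTuple = namedtuple('EntryTuple', ['word', 'count', 'subwords'])

# mutable alternative, so counts can be updated during the vocabulary scan
class Entry(object):
    def __init__(self, word=None, count=0, subwords=None):
        self.word = word
        self.count = count
        self.subwords = subwords if subwords is not None else []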

self.subwords = subwords


class Dictionary():
Contributor:

We already have a Dictionary class in gensim.corpora; please rename this one to avoid confusion.

Contributor Author:

Done. Kindly verify in the current commit.

self.words[self.word2int[h]].count += 1

def read(self, sentences, min_count):
minThreshold = 1
Contributor:

No camelCase, only lowercase_with_underscores (here and everywhere).

Contributor Author:

Done. Kindly verify in the current commit.

self.threshold(minThreshold)

self.threshold(min_count)
self.initTableDiscard()
Contributor:

Same as for variables (no camelCase), here and everywhere.

Contributor Author:

Done. Kindly verify in the current commit.

return ntokens, hashes, words


class Model():
Contributor:

Do you really need this class (why isn't this part of Sent2Vec)?

Contributor Author:

Done. Merged with sent2vec now.


logger = logging.getLogger(__name__)
# TODO: add logger statements instead of print statements
# TODO: add docstrings and tests
Contributor:

It's time to resolve these TODOs; start with the logger and tests.

Contributor Author:

Done. Kindly verify in the current commit.
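A hedged sketch of what resolving the logger TODO might look like, using the print statements from the next excerpt as the example (format strings here are illustrative, not the PR's final code; assumes the module-level import logging / import time in sent2vec.py):

logger = logging.getLogger(__name__)

# instead of the Python 2 print statements below:
logger.info(
    "progress: %.2f%%, lr: %.6f, loss: %.6f",
    progress * 100, lr, self.model.loss / self.model.nexamples,
)
logger.info("total training time: %.2f seconds", time.time() - start_time)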

print "Progress: ", progress * 100, "% lr: ", lr, " loss: ", self.model.loss / self.model.nexamples
print "\n\nTotal training time: %s seconds" % (time.time() - start_time)

def sentence_vectors(self, sentence_string):
Contributor:

Useless method (no need to do tokenization inside the model).

Contributor Author:

Done. Kindly verify in the current commit. Now sentence is passed as a list of unicode strings.

sent_vec *= (1.0 / len(line))
return sent_vec

def similarity(self, sent1, sent2):
Contributor:

sent1 and sent2 should already be lists of tokens (no need to tokenize them).

Contributor Author:

Done. Kindly verify in the current commit.
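An illustrative call under that convention (the model instance and tokens here are made up):

# assuming `model` is a trained Sent2Vec instance; both arguments are
# pre-tokenized lists of unicode tokens, not raw strings
sim = model.similarity(['the', 'cat', 'sat'], ['a', 'dog', 'ran'])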

"""

def __init__(self, vector_size=100, lr=0.2, lr_update_rate=100, epochs=5,
min_count=5, neg=10, word_ngrams=2, loss_type='ns', bucket=2000000, t=0.0001,
Contributor:

Need a vertical indent (for the method definition only).

"""

def __init__(self, word=None, count=0, subwords=[]):
"""
Contributor:

Please use numpy-style docstrings, here and everywhere.

Contributor Author:

Done. Kindly verify in the current commit.
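A hedged sketch of the numpy-style layout for the Entry constructor shown above (the parameter descriptions are illustrative, not the PR's exact wording):

def __init__(self, word=None, count=0, subwords=[]):
    """
    Parameters
    ----------
    word : str, optional
        The word this entry represents.
    count : int
        Number of times the word occurs in the corpus.
    subwords : list of int
        Hashes of the character ngrams of `word`.
    """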

word and character ngrams.
"""

def __init__(self, t, bucket, minn, maxn, max_vocab_size=30000000, max_line_size=1024):
Contributor:

Please add docstrings everywhere (with parameter description + types)

Contributor Author:

Done. Kindly verify in the current commit.

`dropoutk` = Number of ngrams dropped when training a sent2vec model. Default is 2.
"""

random.seed(seed)
Contributor:

Incorrect: you pin the "global" random seed; please use

from gensim.utils import get_random_state
random_state = get_random_state(seed)

Contributor Author:

Done. Kindly verify in the current commit.

For Sent2Vec, each sentence must be a list of unicode strings.
"""

logger.info("Creating dictionary...")
Contributor:

Change this method to follow word2vec (train in __init__ if sentences are provided, and so on).

Contributor Author:

Done. Kindly verify in the current commit.
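A hedged sketch of the word2vec-style convention being asked for (train directly from __init__ when sentences are provided); names and details here are illustrative, not the PR's final code:

class Sent2VecSketch(object):
    """Illustrative only: mirrors Word2Vec's behaviour of training in __init__."""

    def __init__(self, sentences=None, epochs=5):
        self.epochs = epochs
        if sentences is not None:
            # build the vocabulary first, then train immediately,
            # exactly as Word2Vec does when `sentences` is passed
            self.build_vocab(sentences)
            self.train(sentences, epochs=self.epochs)

    def build_vocab(self, sentences):
        self.vocab = {word for sentence in sentences for word in sentence}

    def train(self, sentences, epochs):
        pass  # training loop elided in this sketch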

@@ -75,6 +75,25 @@ class ModelDictionary():
"""

def __init__(self, t, bucket, minn, maxn, max_vocab_size=30000000, max_line_size=1024):
"""
Initialize a sent2vec dictionary.
@menshikh-iv (Contributor), Nov 8, 2017:

Contributor Author:

Sorry, my bad! I used the word2vec code as a reference. I've updated the docstrings. Kindly verify in the latest commit.

Adding numpy docstrings, a function to read the corpus directly from disk, a link to the evaluation scripts in the notebook, and evaluation of the original C++ sent2vec in the final table
for j from i + 1 <= j < line_size:
if j >= i + n or discard[j] == 1:
break
h = h * 116049371 + line[j]
Contributor:

It's not at all obvious what the magic number 116049371 is.

Contributor:

This mimics the FB (fastText) implementation.

Owner:

A good idea to include that info in a comment -- so someone doesn't accidentally change the magic numbers in the future.
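One way to record that, as a sketch (the comment wording is illustrative; the fact that the constant mirrors the Facebook implementation comes from the discussion above):

# 116049371 is the multiplier used for ngram hashing in Facebook's
# reference fastText/sent2vec code; keep it in sync with that
# implementation and do not change it.
h = h * 116049371 + line[j]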

return ntokens


cdef void add_ngrams_train(vector[int] &line, int n, int k, int bucket, int size)nogil:
Contributor:

Shouldn't there be a space before nogil?

Contributor:

Not necessary as I remember

Contributor:

yeah, unnecessary, but r e a d a b i l i t y

Contributor:

I'll fix that myself of course

@menshikh-iv (Contributor):

Blocked by #2313 (it should be merged before we can continue with the current PR).

@menshikh-iv (Contributor) commented Jan 15, 2019

@prerna135 sent2vec built successfully (some issues with FT and python2, but unrelated)

  1. Bad performance (you can check it yourself; it may be related to the hash function, but I'm not sure).
  2. Looks like the model doesn't really learn, for example:
import logging
from gensim.models import Sent2Vec
import gensim.downloader as api
from gensim.utils import simple_preprocess
import numpy as np
from scipy.spatial.distance import cdist

logging.basicConfig(level=logging.INFO)

corpus = [simple_preprocess(_["data"]) for _ in api.load("20-newsgroups")]
model = Sent2Vec(corpus)

c_vectors = np.array([model[d] for d in corpus])
fst_vector = c_vectors[0]

similarities = (1 - cdist(fst_vector.reshape((1, fst_vector.shape[0])), c_vectors, metric='cosine')).reshape(-1)
print(similarities, similarities.mean())

# (array([0.99890759, 0.99885709, 0.99865863, ..., 0.99843101, 0.99866404,
#        0.99762219]), 0.998304)

I.e., all vectors from the corpus are super-near each other, which is very suspicious IMO. Even if I try model.similarity(random_words, different_random_words), it's also too high. Any ideas what's wrong here?

  3. The original sent2vec released several models that we can't load (an important feature, I think; see https://github.com/epfml/sent2vec#downloading-pre-trained-models).

Unfortunately, I can't merge the PR in its current state :( Not ready for 3.7.0.
@prerna135 when will you have time to resolve the mentioned problems?

@menshikh-iv menshikh-iv removed the 3.7.0 label Jan 15, 2019
if line not in ['\n', '\r\n']:
sentence = list(tokenize(line))
if sentence:
yield sentence
@horpto (Contributor), Jan 15, 2019:

If line contains only newline characters, then the previous sentence will be yielded twice.
Is that a bug or a feature?

Contributor:

hm, I guess that's a bug
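A possible fix is an indentation-only change (a sketch of one option, not necessarily what was eventually applied):

if line not in ['\n', '\r\n']:
    sentence = list(tokenize(line))
    if sentence:
        # yielding inside the same branch means a blank line can no
        # longer re-yield the previously built sentence
        yield sentence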

@prerna135 (Contributor Author) commented Jan 20, 2019

> (quoting @menshikh-iv's comment of Jan 15, 2019, in full)

Hi @menshikh-iv. The bad performance part is surprising since it outperformed doc2vec in all the evaluation tasks as you can see here. I'll try to check what could be causing the problem while calculating sentence similarities. Can't promise quick results though, as my semester begins next week.

@menshikh-iv (Contributor):

@prerna135 I think I know the reason for the bad performance (a problem in the hash function). @mpenkov will fix it soon, then I'll update your PR and ping you, OK?

@menshikh-iv (Contributor):

Possible problem (hash function):
#1261 (comment)
Will be fixed in #2340 (blocks the current PR).

@menshikh-iv (Contributor) commented Jan 24, 2019

@prerna135 I fixed the performance issue (it really was a hashing issue) and the build itself.
Unfortunately, the quality of the results is still low (point 2 from #1619 (comment)); feel free to investigate (btw, please also check my changes, in case I broke anything myself).

So, TODO for you

@menshikh-iv (Contributor):

@prerna135 also, please fix these calls (to avoid the deprecation warnings below):

gensim/test/test_sent2vec.py::TestSent2VecModel::test_online_learning
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:397: DeprecationWarning: Call to deprecated `min_count` (Attribute will be removed in 4.0.0, use self.vocabulary.min_count instead).
    self.min_count = min_count
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:400: DeprecationWarning: Call to deprecated `sample` (Attribute will be removed in 4.0.0, use self.vocabulary.sample instead).
    self.sample = sample
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:619: DeprecationWarning: Call to deprecated `min_count` (Attribute will be removed in 4.0.0, use self.vocabulary.min_count instead).
    self.corpus_count = self.vocabulary.read(sentences=sentences, min_count=self.min_count)
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:452: DeprecationWarning: Call to deprecated `sample` (Attribute will be removed in 4.0.0, use self.vocabulary.sample instead).
    self.workers, self.vocabulary.size, self.vector_size, self.sample, self.negative)
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:638: DeprecationWarning: Call to deprecated `min_count` (Attribute will be removed in 4.0.0, use self.vocabulary.min_count instead).
    self.corpus_count = self.vocabulary.read(sentences=sentences, min_count=self.min_count)
gensim/test/test_sent2vec.py::TestSent2VecModel::test_online_learning2
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:397: DeprecationWarning: Call to deprecated `min_count` (Attribute will be removed in 4.0.0, use self.vocabulary.min_count instead).
    self.min_count = min_count
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:400: DeprecationWarning: Call to deprecated `sample` (Attribute will be removed in 4.0.0, use self.vocabulary.sample instead).
    self.sample = sample
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:619: DeprecationWarning: Call to deprecated `min_count` (Attribute will be removed in 4.0.0, use self.vocabulary.min_count instead).
    self.corpus_count = self.vocabulary.read(sentences=sentences, min_count=self.min_count)
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:452: DeprecationWarning: Call to deprecated `sample` (Attribute will be removed in 4.0.0, use self.vocabulary.sample instead).
    self.workers, self.vocabulary.size, self.vector_size, self.sample, self.negative)
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:452: DeprecationWarning: Call to deprecated `sample` (Attribute will be removed in 4.0.0, use self.vocabulary.sample instead).
    self.workers, self.vocabulary.size, self.vector_size, self.sample, self.negative)
gensim/test/test_sent2vec.py::TestSent2VecModel::test_online_learning_after_save
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:397: DeprecationWarning: Call to deprecated `min_count` (Attribute will be removed in 4.0.0, use self.vocabulary.min_count instead).
    self.min_count = min_count
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:400: DeprecationWarning: Call to deprecated `sample` (Attribute will be removed in 4.0.0, use self.vocabulary.sample instead).
    self.sample = sample
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:619: DeprecationWarning: Call to deprecated `min_count` (Attribute will be removed in 4.0.0, use self.vocabulary.min_count instead).
    self.corpus_count = self.vocabulary.read(sentences=sentences, min_count=self.min_count)
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:452: DeprecationWarning: Call to deprecated `sample` (Attribute will be removed in 4.0.0, use self.vocabulary.sample instead).
    self.workers, self.vocabulary.size, self.vector_size, self.sample, self.negative)
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:452: DeprecationWarning: Call to deprecated `sample` (Attribute will be removed in 4.0.0, use self.vocabulary.sample instead).
    self.workers, self.vocabulary.size, self.vector_size, self.sample, self.negative)
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:638: DeprecationWarning: Call to deprecated `min_count` (Attribute will be removed in 4.0.0, use self.vocabulary.min_count instead).
    self.corpus_count = self.vocabulary.read(sentences=sentences, min_count=self.min_count)
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:452: DeprecationWarning: Call to deprecated `sample` (Attribute will be removed in 4.0.0, use self.vocabulary.sample instead).
    self.workers, self.vocabulary.size, self.vector_size, self.sample, self.negative)
gensim/test/test_sent2vec.py::TestSent2VecModel::test_persistence
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:397: DeprecationWarning: Call to deprecated `min_count` (Attribute will be removed in 4.0.0, use self.vocabulary.min_count instead).
    self.min_count = min_count
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:400: DeprecationWarning: Call to deprecated `sample` (Attribute will be removed in 4.0.0, use self.vocabulary.sample instead).
    self.sample = sample
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:619: DeprecationWarning: Call to deprecated `min_count` (Attribute will be removed in 4.0.0, use self.vocabulary.min_count instead).
    self.corpus_count = self.vocabulary.read(sentences=sentences, min_count=self.min_count)
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:452: DeprecationWarning: Call to deprecated `sample` (Attribute will be removed in 4.0.0, use self.vocabulary.sample instead).
    self.workers, self.vocabulary.size, self.vector_size, self.sample, self.negative)
gensim/test/test_sent2vec.py::TestSent2VecModel::test_sent2vec_for_document
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:397: DeprecationWarning: Call to deprecated `min_count` (Attribute will be removed in 4.0.0, use self.vocabulary.min_count instead).
    self.min_count = min_count
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:400: DeprecationWarning: Call to deprecated `sample` (Attribute will be removed in 4.0.0, use self.vocabulary.sample instead).
    self.sample = sample
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:619: DeprecationWarning: Call to deprecated `min_count` (Attribute will be removed in 4.0.0, use self.vocabulary.min_count instead).
    self.corpus_count = self.vocabulary.read(sentences=sentences, min_count=self.min_count)
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:452: DeprecationWarning: Call to deprecated `sample` (Attribute will be removed in 4.0.0, use self.vocabulary.sample instead).
    self.workers, self.vocabulary.size, self.vector_size, self.sample, self.negative)
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:397: DeprecationWarning: Call to deprecated `min_count` (Attribute will be removed in 4.0.0, use self.vocabulary.min_count instead).
    self.min_count = min_count
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:400: DeprecationWarning: Call to deprecated `sample` (Attribute will be removed in 4.0.0, use self.vocabulary.sample instead).
    self.sample = sample
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:619: DeprecationWarning: Call to deprecated `min_count` (Attribute will be removed in 4.0.0, use self.vocabulary.min_count instead).
    self.corpus_count = self.vocabulary.read(sentences=sentences, min_count=self.min_count)
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:452: DeprecationWarning: Call to deprecated `sample` (Attribute will be removed in 4.0.0, use self.vocabulary.sample instead).
    self.workers, self.vocabulary.size, self.vector_size, self.sample, self.negative)
gensim/test/test_sent2vec.py::TestSent2VecModel::test_training
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:397: DeprecationWarning: Call to deprecated `min_count` (Attribute will be removed in 4.0.0, use self.vocabulary.min_count instead).
    self.min_count = min_count
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:400: DeprecationWarning: Call to deprecated `sample` (Attribute will be removed in 4.0.0, use self.vocabulary.sample instead).
    self.sample = sample
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:619: DeprecationWarning: Call to deprecated `min_count` (Attribute will be removed in 4.0.0, use self.vocabulary.min_count instead).
    self.corpus_count = self.vocabulary.read(sentences=sentences, min_count=self.min_count)
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:452: DeprecationWarning: Call to deprecated `sample` (Attribute will be removed in 4.0.0, use self.vocabulary.sample instead).
    self.workers, self.vocabulary.size, self.vector_size, self.sample, self.negative)
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:397: DeprecationWarning: Call to deprecated `min_count` (Attribute will be removed in 4.0.0, use self.vocabulary.min_count instead).
    self.min_count = min_count
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:400: DeprecationWarning: Call to deprecated `sample` (Attribute will be removed in 4.0.0, use self.vocabulary.sample instead).
    self.sample = sample
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:619: DeprecationWarning: Call to deprecated `min_count` (Attribute will be removed in 4.0.0, use self.vocabulary.min_count instead).
    self.corpus_count = self.vocabulary.read(sentences=sentences, min_count=self.min_count)
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:452: DeprecationWarning: Call to deprecated `sample` (Attribute will be removed in 4.0.0, use self.vocabulary.sample instead).

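A hedged sketch of the kind of change the warnings point at (the attribute paths are taken from the warning text; the classes below are stand-ins, not the PR's actual code):

class VocabSketch(object):
    """Stand-in for the PR's vocabulary object (illustrative only)."""
    def __init__(self):
        self.min_count = None
        self.sample = None


class Sent2VecSketch(object):
    """Illustrative only: store min_count/sample on the vocabulary object,
    as the deprecation warnings recommend, instead of on the model."""
    def __init__(self, min_count=5, sample=1e-3):
        self.vocabulary = VocabSketch()
        self.vocabulary.min_count = min_count   # was: self.min_count = min_count
        self.vocabulary.sample = sample         # was: self.sample = sample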
@menshikh-iv (Contributor):

ping @prerna135, any updates?

@mpenkov (Collaborator) commented Apr 21, 2019

@prerna135 are you able to finish this PR?

@prerna135 (Contributor Author):

@menshikh-iv @mpenkov I have been very busy with grad school in the past year. Apologies for the delay. I'll try to look into this over the summer.

@piskvorky (Owner):

@prerna135 ping. This project has been under way for nearly 2 years.

@piskvorky (Owner) commented Aug 21, 2019

@prerna135 what's the status? Summer is nearly over.

@mpenkov unless Prerna finishes the PR, we'll have to kill it + her incubator blog post. It's getting ridiculous.

@prerna135 (Contributor Author):

@menshikh-iv @piskvorky I tried to look into the code over the summer (the hash function and distance computation issue). I'd have to go over the entire original C++ code to find the bug, refactor the code to avoid the deprecation warnings, retrain the models, and run the benchmarking experiments again. I'm afraid I won't be able to devote the time required to do this. Apologies for dragging this out. I tried to run the existing code to replicate the blog post results, but too many things were breaking for me to figure out the issue quickly.

@mpenkov mpenkov added this to Needs triage in PR triage Nov 3, 2019
@mpenkov mpenkov closed this Jun 10, 2020
PR triage automation moved this from Needs triage to Closed Jun 10, 2020
Labels
incubator project (PR is RaRe incubator project); interesting PR ⭐ (interesting PR topic, but not ready; needs much work to finish)