misc ways to improve infer_vector #515

Open
gojomo opened this issue Nov 9, 2015 · 18 comments
Labels: difficulty medium (Medium issue: requires good gensim understanding & python skills), wishlist (Feature request)

Comments

gojomo commented Nov 9, 2015

The default steps should probably be higher: perhaps 10, or the same as the training iter value.

If there are no known tokens in the supplied doc_words, the method will return the randomly-initialized, low-magnitude, untrained starting vector. (There were no target words to predict, so inference was a no-op apart from that random initialization.) This is probably not what the user wants, and the realistic-looking values may fool the user into thinking some work has happened. It would probably be better to return some constant (perhaps zero) vector in such cases, to make clear that all texts composed entirely of unknown words are equally opaque, as far as the model is concerned.
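A minimal user-side sketch of both ideas (more inference passes, and a zero vector for all-unknown texts), assuming an already-loaded Doc2Vec model; safe_infer_vector is a hypothetical helper, not gensim API, and the epochs keyword is named steps in older gensim releases:

```python
import numpy as np

def safe_infer_vector(model, doc_words, epochs=20):
    """Infer a doc-vector with more passes than the low default, but return a
    zero vector when none of the tokens are known to the model, instead of the
    random untrained vector infer_vector would otherwise hand back."""
    if not any(word in model.wv for word in doc_words):
        return np.zeros(model.vector_size, dtype=np.float32)
    return model.infer_vector(doc_words, epochs=epochs)
```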

The inference loop might be able to make use of #450 word/example batching (especially if it is extended to decay alpha), to turn all steps into one native call.

The method could take many examples, to infer a chunk of examples in one call. At the extreme, it could also make use of multithreading so such a large chunk finishes more quickly.

tmylk added the wishlist (Feature request) label on Jan 9, 2016

gojomo commented Jan 10, 2016

The infer_vector method could also allow a user-supplied starting vector, allowing someone to try an alternate policy like the mean-of-word-vectors (as suggested in #460) or even just the zero-vector (which might be just as good as a random starting vector, for this purpose).
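As a rough illustration of that mean-of-word-vectors policy (mean_word_vector is a hypothetical helper, not gensim API); today such a vector can only serve as a stand-in or comparison point, since infer_vector does not yet accept a starting vector:

```python
import numpy as np

def mean_word_vector(model, doc_words):
    """Plain average of the model's word vectors for the known tokens -- the
    kind of vector that could seed (or substitute for) an inferred doc-vector."""
    known = [model.wv[word] for word in doc_words if word in model.wv]
    if not known:
        return np.zeros(model.vector_size, dtype=np.float32)
    return np.mean(known, axis=0)
```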

gojomo commented Apr 13, 2016

As requested on the forum at https://groups.google.com/d/msg/gensim/IIumafg4WkA/Ua3FdeFCJQAJ, infer_vector() could also allow known doctags for the example text to be supplied. At least in some training modes (or if the training code in DBOW mode were slightly altered), this might lead to an inferred vector that better represents the 'residual' meaning not already captured in the other doctag(s).

tmylk commented Sep 22, 2016

Madhumathi raised the question of speeding up infer_vector on the forum.

The suggestion is simply to lock all the pre-trained doc- and word-vectors and then call model.train(), in order to take advantage of multiple workers and batching.

gojomo commented Sep 22, 2016

Multithreading likely wouldn't offer much benefit for the case the method currently supports: a single new document. (Maybe with a really long document, but I doubt it.) It could definitely offer a benefit if inferring in large batches, as mentioned above.

Moving the loop inside the cythonized methods might offer some benefit in the single-document case, especially with large documents and large steps.

(Also noted in my forum reply: that calling infer_vector() from multiple threads, or better yet multiple processes all loaded with the same model, would also offer speedups.)
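A rough sketch of the multiple-processes pattern, assuming a Doc2Vec model already saved to disk (the file name and helper names below are illustrative, not gensim API):

```python
import multiprocessing as mp
from gensim.models.doc2vec import Doc2Vec

_model = None  # each worker process loads its own copy of the model

def _load_model(model_path):
    global _model
    _model = Doc2Vec.load(model_path)  # mmap='r' may reduce per-process memory

def _infer(tokens):
    return _model.infer_vector(tokens)

if __name__ == "__main__":
    docs = [["first", "tokenized", "document"], ["second", "tokenized", "document"]]
    with mp.Pool(processes=4, initializer=_load_model,
                 initargs=("saved_doc2vec.model",)) as pool:
        vectors = pool.map(_infer, docs)
```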

tmylk added the difficulty medium (Medium issue: requires good gensim understanding & python skills) label on Sep 25, 2016

gerner commented Feb 21, 2020

Any progress on any of these suggestions? In particular, batching and/or support for the corpus_file format seem desirable. I saw a big performance speedup using a corpus file for training, and at this point bulk training is (potentially much) faster than bulk inference. I assume a lot of this is the lack of batching support and a fairly tight loop remaining in Python.

Note, I parallelized things out of process by loading a saved model into separate python processes, each responsible for 1/n of the dataset to infer. Obviously that helps a lot. So performance improvements to the core algo will have a big, multiplied effect!

Also, nice work on putting this together and supporting it. Thank you.

gojomo commented Feb 23, 2020

So far, this is just a "wishlist" of things that could be tackled to improve infer_vector() - there's no specific plan to do any of this, unless/until someone shows up with the need & interest to contribute it.

gerner commented Feb 24, 2020

so... you're saying someone should just make a pull request and send it over?

Also, FWIW, I took a (very cursory) look at fastText: https://github.com/facebookresearch/fastText/
It has comparable model performance (P/R, AUC) on my problem, with similar run-time performance (secs) for training and substantially faster inference (with similar levels of concurrency).

gojomo commented Feb 24, 2020

Not sure what you mean by FastText's "inference". Are you referring to its supervised-classification mode? (That's not the same algorithm as Doc2Vec, & gensim's FastText doesn't implement that mode at all.)

But yes, these improvements or others really just need a quality PR.

gerner commented Feb 24, 2020

Facebook Research's fastText implementation has a mode that trains word embeddings and then combines them (one of their papers suggests it's an average) to do sentence embeddings. Certainly a different algorithm. But there's some similar intuition (at least relative to word2vec), from some of the same people (Mikolov is a shared author).

I didn't realize gensim has an implementation of fasttext too.

gojomo commented Feb 25, 2020

Yes, I believe that's the mode activated with the -supervised flag, where the word-embeddings are trained to be better at predicting known-labels, when summed/averaged. Summing vectors will inherently be a lot faster than the iterative inference – simulated training – that's used in Doc2Vec (aka 'Paragraph Vectors').

(I'd like that mode to be considered for gensim's FT implementation – for clear parity with the original implementation, and given its high similarity to other modes. But gensim's focus has been truly unsupervised methods, so thus far @piskvorky hasn't considered that mode to be in-scope. So it'd require his approval & a competent Python/Cython implementation to appear.)

gerner commented Aug 20, 2020

I was searching for gensim perf stuff and found this bug again. This time I'm taking a closer look.

It looks to me like infer_vector() uses a code path that calls into train_document_* methods, similar to how _do_train_job does. The analog for corpus file is in _do_train_epoch which calls into d2v_train_epoch_* methods. It seems like infer_vector could be extended (or a parallel method like infer_corpus or something) to take a corpus file and call into those same d2v_train_epoch_* methods in an analogous way.

Seems like I'd need to set learn_doctags=False, learn_words=False, learn_hidden=False, and prepare some state similar to what's happening in https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/doc2vec.py#L613-L626 (initializing `doctag_vectors`, `doctags_lockf`, etc.).

work, neu1, and cython_vocab all look like they are handled a little differently, though. It seems like these could be handled as in word2vec's code path for _train_epoch_corpusfile.

I'm not super familiar with the code base, so before I get any deeper, does this make sense as an approach? That is, create a parallel method infer_corpus, follow the analogous path from word2vec's _train_epoch_corpusfile to initialize some state each epoch, and call into d2v_train_epoch_* with the learn params set to False?

gojomo commented Aug 21, 2020

@gerner Is your main goal parallelism for batches? In such a case, most important would be to mimic the classic iterable-interface (rather than corpus_file).

gerner commented Aug 21, 2020

@gojomo yes, parallelism is part of the goal. Are you suggesting trying to match what's happening in Doc2Vec._do_train_job, which gets called when you use corpus_iterable and not corpus_file? From what I've seen there isn't much parallelism when that happens (although I see all the threading code that should achieve that). I usually see my 8-core CPU max out around 150% when training in that path. When using corpus_file I see CPU utilization around 750%.

In addition to that, it seems like the corpus_file path is better optimized. All the data processing happens in cython or native code. It seems like if I want to be able to use this in production, or even just for fast research/development iterations, that's what I want to be using.

In the past I've spawned separate python processes to parallelize the inference. That works, but it's somewhat inconvenient, and it seems like corpus_file support for infer_vector won't be that hard to do, which I'm considering implementing.

gojomo commented Aug 21, 2020

Yes, since the corpus_iterable path is both older & more general, improving it will usually be a higher priority - rather than more special-format/special-purpose paths like the corpus_file approach.

The corpus_iterable path does still suffer from Python-GIL-related bottlenecks that prevent all-core-utilization. (I'm surprised you max at 1.5 cores utilized - it should be possible to get higher. But yes, corpus_file offers a simpler and more complete path to near-all-cores saturated.)

My hope, though, is that the corpus_iterable path could be raised to parity with corpus_file utilization, probably through an approach like that suggested in this comment (and surrounding related comments). It is overwhelmingly the avoidance of GIL-contention and waiting on a single IO thread, and not other optimizations, that give the corpus_file path its current advantage. But, that advantage comes at the cost of redundant training logic across the two paths, and maintaining/documenting/debugging both paths. (Mode-specific bugs, like the potential corpus_file issues in #2757 & #2693, and mode-specific limitations, like how corpus_file Doc2Vec training can only use plain-integer tags, one per document, are especially frustrating.)

mohzhang commented

Any update to speed up infer vector call?

gojomo commented Jul 30, 2021

> Any update to speed up infer vector call?

The code in the gensim-4.0.1 release is the same as the current develop trunk, and no optimization work is currently underway. A theoretical future update to the interface to allow batches of docs to be inferred together might give somewhat of a speedup, for users who have large batches of new documents.

Barring that work, calling .infer_vector() from multiple threads or processes (with shared-memory models) might help users achieve the highest throughput.
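For the multiple-threads option, a minimal sketch (infer_many is a hypothetical helper; the speedup depends on how much of the work the cythonized routines can do with the GIL released):

```python
from concurrent.futures import ThreadPoolExecutor

def infer_many(model, tokenized_docs, workers=4):
    """Infer vectors for a batch of tokenized documents using a thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(model.infer_vector, tokenized_docs))
```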

mohzhang commented

> The code in the gensim-4.0.1 release is the same as the current develop trunk, and no optimization work is currently underway. A theoretical future update to the interface to allow batches of docs to be inferred together might give somewhat of a speedup, for users who have large batches of new documents.
>
> Barring that work, calling .infer_vector() from multiple threads or processes (with shared-memory models) might help users achieve the highest throughput.

Batch processing is good for inferring multiple vectors. Is there any plan to improve inference for a single long document?

gojomo commented Aug 2, 2021

> Batch processing is good for inferring multiple vectors. Is there any plan to improve inference for a single long document?

Inference uses the same code as training, and I don't know of any pending, unimplemented ideas for making that code run faster at the level of individual texts. So there are no such plans currently, and any such plan would require a theory for how the code could be made faster.
