misc ways to improve infer_vector #515
Comments
As requested on the forum at https://groups.google.com/d/msg/gensim/IIumafg4WkA/Ua3FdeFCJQAJ
Madhumathi raised the question. He is suggesting simply locking all the pre-trained doc and word vectors and then calling model.train(), in order to take advantage of multiple workers and batching.
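For illustration, here is a rough sketch of that suggestion. The saved-model path, the new documents, and the lock-array attribute are assumptions (the lock-factor attribute has been renamed across gensim versions, and update=True vocab expansion for Doc2Vec has its own caveats), so treat this as the shape of the idea rather than a drop-in recipe:

```python
# Sketch: freeze the pre-trained word vectors so train() only moves the newly
# added doc-vectors, letting inference-like updates reuse train()'s multiple
# workers and batching. Attribute names (vectors_lockf, dv) and update=True
# behavior are assumptions that vary by gensim version; "pretrained.d2v" and
# the example documents are hypothetical.
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

model = Doc2Vec.load("pretrained.d2v")

new_docs = [
    TaggedDocument(words=["some", "new", "tokens"], tags=["new_0"]),
    TaggedDocument(words=["another", "unseen", "text"], tags=["new_1"]),
]

# Register the new tags without rebuilding the existing vocabulary.
model.build_vocab(new_docs, update=True)

# Zero lock-factors mean "do not update" for every existing word vector.
# Pre-existing doc-vectors are untouched anyway, since train() only sees new_docs.
model.wv.vectors_lockf = np.zeros(len(model.wv), dtype=np.float32)

model.train(new_docs, total_examples=len(new_docs), epochs=model.epochs)

print(model.dv["new_0"])  # the freshly trained vector (model.docvecs in old versions)
```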
Multithreading likely wouldn't offer much benefit for the case the method currently supports: a single new document. (Maybe with a really long document, but I doubt it.) It could definitely offer a benefit when inferring in large batches, as mentioned above. Moving the loop inside the cythonized methods might offer some benefit in the single-document case, especially with large documents and large … (Also noted in my forum reply: that calling …)
Any progress on any of these suggestions? In particular, batching and/or support for the corpus_file format seem desirable. I saw a big performance speedup using a corpus file for training, and at this point bulk training is (potentially much) faster than bulk inference. I assume a lot of this is due to the lack of batching support and the fairly tight loop left in Python. Note, I parallelized things out of process by loading a saved model into separate Python processes, each responsible for 1/n of the dataset to infer. Obviously that helps a lot. So performance improvements to the core algorithm will have a big, multiplied effect! Also, nice work on putting this together and supporting it. Thank you.
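For reference, a single-script variant of that out-of-process idea, sketched with multiprocessing; the model path, document list, worker count, and chunk size are all hypothetical:

```python
# Sketch: parallelize bulk inference by giving each worker process its own copy
# of the model and a slice of the documents. Paths and data are hypothetical.
from multiprocessing import Pool
from gensim.models.doc2vec import Doc2Vec

MODEL_PATH = "doc2vec.model"   # hypothetical saved model
_model = None                  # per-process model instance

def _init_worker():
    global _model
    _model = Doc2Vec.load(MODEL_PATH)

def _infer(tokens):
    # epochs= is the gensim 4.x name; older releases call this parameter steps=
    return _model.infer_vector(tokens, epochs=20)

if __name__ == "__main__":
    docs = [["tokens", "of", "document", str(i)] for i in range(1000)]
    with Pool(processes=4, initializer=_init_worker) as pool:
        vectors = pool.map(_infer, docs, chunksize=64)
    print(len(vectors), vectors[0].shape)
```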
So far, this is just a "wishlist" of things that could be tackled to improve infer_vector().
so... you're saying someone should just make a pull request and send it over? also, FWIW, I took a (very cursory) look at fastText: https://github.com/facebookresearch/fastText/
Not sure what you mean by FastText's "inference". Are you referring to its supervised-classification mode? (That's not the same algorithm as Doc2Vec.) But yes, these improvements, or others, really just need a quality PR.
Facebook Research's fastText implementation has a mode that trains word embeddings and then combines them (one of their papers suggests it's an average) to produce sentence embeddings. Certainly a different algorithm, but with some similar intuition (at least to word2vec), from similar people (at least Mikolov is a shared author). I didn't realize gensim has an implementation of fastText too.
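As a point of comparison, the averaging idea is roughly the following. This is a toy sketch with made-up data and gensim word vectors (not fastText's own binaries), just to show the intuition:

```python
# Sketch: the "average the word vectors" style of sentence embedding described
# in the fastText papers. The training data here is tiny and hypothetical; real
# use would load large pre-trained vectors instead.
import numpy as np
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat"], ["the", "dog", "ran"]]
model = Word2Vec(sentences, vector_size=50, min_count=1, epochs=10)

def average_vector(tokens, kv):
    known = [t for t in tokens if t in kv]           # skip out-of-vocabulary tokens
    if not known:
        return np.zeros(kv.vector_size, dtype=np.float32)
    return np.mean([kv[t] for t in known], axis=0)

print(average_vector(["the", "cat", "ran"], model.wv))
```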
Yes, I believe that's the mode activated with the … option. (I'd like that mode to be considered for gensim's FT implementation – for clear parity with the original implementation, and given its high similarity to the other modes. But gensim's focus has been on truly unsupervised methods, so thus far @piskvorky hasn't considered that mode to be in-scope. So it'd require his approval & a competent Python/Cython implementation to appear.)
I was searching for gensim perf stuff and found this bug again. This time I'm taking a closer look. It looks to me like … Seems like I'd need to set …
I'm not super familiar with the code base, so before I get any deeper, does this make sense as an approach? That is, create a parallel method …
@gerner Is your main goal parallelism for batches? In that case, the most important thing would be to mimic the classic iterable interface (rather than …).
@gojomo yes, parallelism is part of the goal. Are you suggesting trying to match what's happening in …? In addition to that, it seems like the corpus_file path is better optimized: all the data processing happens in Cython or native code. It seems like if I want to be able to use this in production, or even just for fast research/development iterations, that's what I want to be using. In the past I've spawned separate Python processes to parallelize the inference. That works, but it's somewhat inconvenient, and it seems like corpus_file support for infer_vector won't be that hard to do, which I'm considering implementing.
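For context, roughly what the better-optimized corpus_file path looks like for training today (inference has no corpus_file equivalent); the file name is hypothetical, with one whitespace-tokenized document per line:

```python
# Sketch: the corpus_file training path keeps the tight loop in Cython and
# scales across workers. "corpus.txt" is hypothetical: one document per line,
# tokens separated by spaces; documents get integer tags from their line number.
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec(
    corpus_file="corpus.txt",
    vector_size=100,
    workers=8,
    epochs=10,
)

# Inference still goes through the per-document Python path:
vector = model.infer_vector(["tokens", "from", "a", "new", "document"])
```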
Yes, since the … The … My hope, though, is that the …
Any update on speeding up the infer_vector call?
The code in the … Barring that work, calling …
Batch processing is good for inferring multiple vectors. Is there any plan to improve inference for a single long document?
Inference uses the same code as training, and I don't know of any pending unimplemented ideas for making that code run faster at the level of individual texts. So there are no such plans currently, and any such plans would require a theory for how the code might be made faster.
- The default steps should probably be higher: perhaps 10, or the same as the training iter value (see the usage sketch after this list).
- If there are no known tokens in the supplied doc_words, the method will return the randomly-initialized, low-magnitude, untrained starting vector. (There were no target words to predict, and thus inference was a no-op, except for that random initialization.) This is probably not what the user wants, and the realistic-looking values may fool the user into thinking some work has happened. It would probably be better to return some constant (perhaps zero) vector in such cases, to make clear that all such text runs with unknown words are similarly opaque, as far as the model is concerned.
- The inference loop might be able to make use of #450 word/example batching (especially if it is extended to decay alpha), to turn all steps into one native call.
- The method could take many examples, to infer a chunk of examples in one call. At the extreme, it could also make use of multithreading so such a large chunk finishes more quickly.
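A usage-level sketch of the first two points, under current-API assumptions: epochs= is the gensim 4.x spelling of what older releases call steps=, and key_to_index is likewise the 4.x vocabulary lookup. The model path and helper name are hypothetical:

```python
# Sketch: pass more inference epochs explicitly rather than relying on the low
# default, and detect all-unknown documents up front so the caller isn't fooled
# by a plausible-looking but untrained random vector.
import numpy as np
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec.load("pretrained.d2v")   # hypothetical saved model

def safe_infer(tokens, epochs=20):
    known = [t for t in tokens if t in model.wv.key_to_index]
    if not known:
        # No trainable target words: return an unmistakable constant vector.
        return np.zeros(model.vector_size, dtype=np.float32)
    return model.infer_vector(known, epochs=epochs)

print(safe_infer(["some", "example", "tokens"]))
```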