Training of new embeddings #32
No, sorry, this library is meant to be a utility focused on speed, efficiency, and robustness for loading embeddings only, not training. Training is an entirely different beast, and there are plenty of tools out there that let you do it fairly easily. Moreover, methods for training embeddings vary widely; it would be nearly impossible to add all of them to this library, and doing so would bloat the package. I'd consider it in the future if one method dominated and everyone were using it to train embeddings, but for now there seem to be many competing implementations. If you think I'm wrong about this and can think of an elegant way to add it to the library, shoot me an e-mail at ajay@plasticity.ai and I'll look into it. Once you have the embeddings, it should be easy to convert them to the Magnitude format using the converter, though!
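The converter mentioned above is invoked from the command line, as documented in the pymagnitude README (the file names here are placeholders):

```shell
# Convert an existing set of vectors (word2vec .txt/.bin, GloVe .txt,
# or fastText .vec format) into a .magnitude file.
# "vectors.txt" and "vectors.magnitude" are placeholder paths.
python -m pymagnitude.converter -i vectors.txt -o vectors.magnitude
```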
I think it would be very useful for this library to be able to train vectors. The reason is that for very large numbers of vectors (8-9 figures), loading them all at once takes a lot of memory. I don't think you have to implement the whole training setup; you just need a way to update the vectors in the Magnitude file with the gradients.

Here's a setup I thought of off the top of my head. You have a dummy embedding layer in Keras. Say you are doing skip-gram training with window size 3 and a negative sampling rate of 16, so the dummy embedding variable has a size of embedding_dim * (1 + 3*2 + 16). We fetch the target, context, and negative-sample vectors from the Magnitude object, copy those values into the Keras embedding, and perform the forward step. After the backward pass, we copy the updated Keras embedding values back to the vectors in the Magnitude object. Then, before the next forward pass, we fetch new values for the target, context, and negative samples and copy those into the Keras embedding layer. This was a bit tricky to explain, so let me know if any parts were confusing.

One major drawback is that we won't be able to use optimizers with parameter-specific momentum, like Adagrad/Adam. But as far as I know, this is the only vector library where the vectors are stored on disk (if not, please let me know!). If copying values into Keras turns out to be surprisingly difficult, it should definitely be easy with a pure NumPy training implementation. So all we need is a way to update individual vectors in the Magnitude object. From looking at the code, I don't see a way to do that, but I think it shouldn't be too tricky since it's just a SQLite database.
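A minimal NumPy sketch of the per-example update described above, as an illustration of what the "pure NumPy training implementation" route might look like. The in-memory arrays `W_in`/`W_out` stand in for rows that would be read from and written back to the Magnitude store; the read/write-back step itself is hypothetical, since Magnitude currently exposes no write API.

```python
import numpy as np

def sgns_step(W_in, W_out, target, context, negatives, lr=0.025):
    """One skip-gram negative-sampling SGD update.

    W_in / W_out are plain in-memory arrays here; in the proposed
    scheme their rows would be fetched from and written back to the
    on-disk Magnitude file around each step (hypothetical API).
    Returns the log-loss at the current parameters, for monitoring.
    """
    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    v_t = W_in[target]                        # target word vector
    ids = np.concatenate(([context], negatives))
    labels = np.zeros(len(ids))
    labels[0] = 1.0                           # context = positive example
    U = W_out[ids]                            # context + negative vectors

    scores = sigmoid(U @ v_t)                 # predicted P(label = 1)
    g = scores - labels                       # gradient of log-loss wrt logits
    grad_vt = g @ U                           # gradient for the target vector

    W_out[ids] -= lr * np.outer(g, v_t)       # update context/negative rows
    W_in[target] -= lr * grad_vt              # update target row

    # negative log-likelihood before this update
    return -np.log(scores[0] + 1e-12) - np.sum(np.log(1.0 - scores[1:] + 1e-12))
```

Repeated calls on the same (target, context, negatives) triple drive the loss down, which is a quick sanity check that the gradients are right. As noted above, this is plain SGD; per-parameter-state optimizers like Adam would need their momentum buffers persisted alongside the vectors.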
First, thanks for open sourcing this.
Am I correct that there is currently no support for training new embeddings on a corpus of text? This seems like a critical feature for any word2vec implementation. Is there a plan for this to be added?
Thanks.