Training of new embeddings #32

Closed
bevankoopman opened this issue Oct 30, 2018 · 2 comments

bevankoopman commented Oct 30, 2018

First, thanks for open sourcing this.

Am I correct that there is currently no support for training new embeddings on a corpus of text? This seems like a critical feature for any word2vec implementation. Is there a plan for this to be added?

Thanks.

@bevankoopman bevankoopman changed the title Training of new models Training of new embeddings Oct 30, 2018
@AjayP13 AjayP13 self-assigned this Oct 30, 2018
@AjayP13 AjayP13 added the enhancement New feature or request label Oct 30, 2018
AjayP13 (Contributor) commented Oct 30, 2018

No, sorry, this library is meant to be a utility focused on speed, efficiency, and robustness for loading embeddings only, not training them. Training is an entirely different beast, and there are plenty of tools out there that let you do it fairly easily. Moreover, methods for training embeddings vary widely; it would be nearly impossible to add all of them to this library, and doing so would bloat the package. I'd consider it in the future if one method came to dominate and everyone was using it to train embeddings, but for now there seem to be many competing implementations. If you think I'm wrong about this and can think of an elegant way to add it to the library, shoot me an e-mail at ajay@plasticity.ai and I'll look into it.

Once you have the embeddings, though, it should be easy to convert them to the Magnitude format using the converter!
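For example, if your training tool writes vectors to a standard .txt, .bin, .vec, or .hdf5 file, the bundled converter can be invoked along these lines (check the README for the full set of flags; the paths here are placeholders):

```
python -m pymagnitude.converter -i "./vectors.txt" -o "./vectors.magnitude"
```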


Santosh-Gupta commented Jul 26, 2019

I think it would be very useful for this library to be able to train vectors. The reason is that for a very large number of vectors (8-9 figures), loading them all into memory at once takes a huge amount of RAM.

I don't think you would have to implement the whole training setup; you just need a way to update the vectors in the Magnitude file with the gradients.

Here's a setup I thought of off the top of my head. You have a dummy embedding layer in Keras. Say you are doing skip-gram training with a window size of 3 and 16 negative samples per target. The dummy embedding variable then holds 1 + 3*2 + 16 = 23 rows of size embedding_dim: the target word, the context words on both sides, and the negative samples.

We fetch the target, context, and negative-sample vectors from the Magnitude object, copy those values into the Keras embedding, and then run the forward pass.

After the backward pass, we copy the updated Keras embedding values back to the corresponding vectors in the Magnitude object.

Then, before the next forward pass, we fetch the vectors for the new target, context, and negative samples and copy those into the Keras embedding layer.

This was a bit tricky to explain, so let me know if any parts were confusing.
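Roughly, one skip-gram-with-negative-sampling step would look like this in pure NumPy (an untested sketch of the idea above; `update_vector` is the hypothetical write primitive that Magnitude doesn't have yet, and for simplicity it shares a single vector table for both the "input" and "output" roles, whereas word2vec proper keeps two separate matrices):

```python
import numpy as np
from pymagnitude import Magnitude

vectors = Magnitude("vectors.magnitude")  # the existing read-only API

def update_vector(word, new_vector):
    """Hypothetical write primitive -- the one missing piece."""
    raise NotImplementedError

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(target, context_words, negative_words, lr=0.025):
    # 1. Copy the needed rows out of the Magnitude file (the "dummy
    #    embedding": 1 target + 2*window context words + k negative samples).
    t = vectors.query(target)                                   # (dim,)
    ctx = np.stack([vectors.query(w) for w in context_words])   # (2*window, dim)
    neg = np.stack([vectors.query(w) for w in negative_words])  # (k, dim)

    # 2. Forward pass: scores for positive and negative pairs.
    pos_scores = sigmoid(ctx @ t)  # want these near 1
    neg_scores = sigmoid(neg @ t)  # want these near 0

    # 3. Backward pass: standard SGNS gradients.
    grad_t = ((pos_scores - 1.0)[:, None] * ctx).sum(axis=0) \
           + (neg_scores[:, None] * neg).sum(axis=0)
    grad_ctx = (pos_scores - 1.0)[:, None] * t[None, :]
    grad_neg = neg_scores[:, None] * t[None, :]

    # 4. Copy the updated values back into the Magnitude file.
    update_vector(target, t - lr * grad_t)
    for w, c, g in zip(context_words, ctx, grad_ctx):
        update_vector(w, c - lr * g)
    for w, n, g in zip(negative_words, neg, grad_neg):
        update_vector(w, n - lr * g)
```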

One major drawback is that we won't be able to use optimizers that keep per-parameter state, like Adagrad or Adam, unless that state is stored alongside each vector as well. But as far as I know this is the only vector library where the vectors are stored on disk (if not, please let me know!).

If copying values into the Keras embedding turns out to be surprisingly difficult, the same approach should definitely be easy with a pure NumPy training implementation.

So all we really need is a way to update individual vectors in the Magnitude object. From looking at the code, I don't see a way to do that, but I think it shouldn't be too tricky, since the file is just a SQLite database.
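Something like this is roughly what I have in mind (untested, and I'm guessing at the on-disk layout; the real .magnitude table name, column names, integer quantization, and any approximate-NN index would need to be checked against the converter source first):

```python
import sqlite3

def update_vector(path, word, new_vector):
    # Assumed schema: a table named `magnitude` with a `key` column and one
    # `dim_i` column per dimension. This is an assumption for illustration,
    # not the documented format.
    conn = sqlite3.connect(path)
    assignments = ", ".join("dim_%d = ?" % i for i in range(len(new_vector)))
    conn.execute(
        "UPDATE magnitude SET %s WHERE key = ?" % assignments,
        tuple(float(x) for x in new_vector) + (word,),
    )
    conn.commit()
    conn.close()
```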
