Training of new embeddings #32
No, sorry, this library is meant to be a utility focused on speed, efficiency, and robustness for loading embeddings only, not training. Training is an entirely different beast, and there are plenty of tools out there that let you do it fairly easily. Moreover, methods for training embeddings vary widely; it would be nearly impossible to add all of them to this library, and doing so would bloat the package. I'd consider it in the future if one method dominated and everyone were using it to train embeddings, but for now there seem to be many competing implementations. If you think I'm wrong about this and can think of an elegant way to add it to the library, shoot me an e-mail at ajay@plasticity.ai and I'll look into it. Once you have the embeddings, it should be easy to convert them to the Magnitude format using the converter, though!
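The converter mentioned above is invoked from the command line, as documented in the pymagnitude README (the file names here are placeholders):

```shell
# Convert an existing set of vectors (word2vec .txt/.bin, GloVe .txt,
# or fastText .vec format) into a .magnitude file.
# "vectors.txt" and "vectors.magnitude" are placeholder paths.
python -m pymagnitude.converter -i vectors.txt -o vectors.magnitude
```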
I think it would be very useful for this library to be able to train vectors. The reason is that for very large numbers of vectors (8-9 figures), loading them all at once takes a lot of memory. I don't think you have to implement the whole training setup; you just need a way to update the vectors in the Magnitude file with the gradients.

Here's a setup I thought of off the top of my head. You have a dummy embedding layer in Keras. Say you are doing skip-gram training with window size 3 and a negative sampling rate of 16, so the dummy embedding variable has a size of embedding_dim * (1 + 3*2 + 16). We fetch the target, context, and negative-sample vectors from the Magnitude object, copy those values into the Keras embedding, and perform the forward step. After the backward pass, we copy the updated Keras embedding values back to the vectors in the Magnitude object. Then, before the next forward pass, we fetch new values for the target, context, and negative samples and copy those into the Keras embedding layer. This was a bit tricky to explain, so let me know if any parts were confusing.

One major drawback is that we won't be able to use optimizers with parameter-specific momentum, like Adagrad/Adam. But as far as I know, this is the only vector library where the vectors are stored on disk (if not, please let me know!). If copying values into Keras turns out to be surprisingly difficult, it should definitely be easy with a pure NumPy training implementation. So all we need is a way to update individual vectors in the Magnitude object. From looking at the code, I don't see a way to do that, but I think it shouldn't be too tricky since it's just a SQLite database.
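A minimal NumPy sketch of the per-example update described above, as an illustration of what the "pure NumPy training implementation" route might look like. The in-memory arrays `W_in`/`W_out` stand in for rows that would be read from and written back to the Magnitude store; the read/write-back step itself is hypothetical, since Magnitude currently exposes no write API.

```python
import numpy as np

def sgns_step(W_in, W_out, target, context, negatives, lr=0.025):
    """One skip-gram negative-sampling SGD update.

    W_in / W_out are plain in-memory arrays here; in the proposed
    scheme their rows would be fetched from and written back to the
    on-disk Magnitude file around each step (hypothetical API).
    Returns the log-loss at the current parameters, for monitoring.
    """
    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    v_t = W_in[target]                        # target word vector
    ids = np.concatenate(([context], negatives))
    labels = np.zeros(len(ids))
    labels[0] = 1.0                           # context = positive example
    U = W_out[ids]                            # context + negative vectors

    scores = sigmoid(U @ v_t)                 # predicted P(label = 1)
    g = scores - labels                       # gradient of log-loss wrt logits
    grad_vt = g @ U                           # gradient for the target vector

    W_out[ids] -= lr * np.outer(g, v_t)       # update context/negative rows
    W_in[target] -= lr * grad_vt              # update target row

    # negative log-likelihood before this update
    return -np.log(scores[0] + 1e-12) - np.sum(np.log(1.0 - scores[1:] + 1e-12))
```

Repeated calls on the same (target, context, negatives) triple drive the loss down, which is a quick sanity check that the gradients are right. As noted above, this is plain SGD; per-parameter-state optimizers like Adam would need their momentum buffers persisted alongside the vectors.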
First, thanks for open sourcing this.
Am I correct that there is currently no support for training new embeddings on a corpus of text? This seems like a critical feature for any word2vec implementation. Is there a plan for this to be added?
Thanks.