Should Doc2Vec.load_word2vec_format return a Doc2Vec instance? #322

Closed
frnsys opened this issue Apr 11, 2015 · 6 comments

Comments

@frnsys

frnsys commented Apr 11, 2015

Currently, Doc2Vec.load_word2vec_format returns a Word2Vec object. Shouldn't it return a Doc2Vec object?

@piskvorky
Owner

CC @temerick @gojomo @diloreto

@ghost ghost mentioned this issue Aug 2, 2015
@topinsky

topinsky commented Sep 3, 2015

Previously, Doc2Vec didn't have a separate container for document vectors (docvecs), which is probably why returning a Word2Vec object was not a big issue. But now, if it returns a Word2Vec object, all information about the docvecs is lost... =(

@gojomo
Collaborator

gojomo commented Sep 3, 2015

It's not clear to me what the most useful behavior would be in this case.

Do you want to influence a Doc2Vec session with reused word vectors? In such a case, you might be able to cobble together the desired effect using a multi-step process that at some point uses the intersect_word2vec_format() method to bring in some/all word vectors. (Since that just modifies an existing Doc2Vec model, you'd have the right kind of object at the end.)

Or is it that you saved a prior-version Doc2Vec model with save_word2vec_format(), so the file has doc vectors mixed in with word vectors, and you want to convert it forward? Since many conventions for naming the doc-vecs are possible, that'd require some user-specific coding, I think, but it still might be possible by leveraging intersect_word2vec_format() and then copying the vectors you know (by your own naming convention) to be doc vectors into the DocvecsArray component.
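
The "user-specific coding" step, separating doc vectors from word vectors by a naming convention, might look roughly like the sketch below. It rests on assumptions: the DOC_ prefix is a hypothetical convention (none is named in this thread), split_by_prefix is an illustrative helper rather than a gensim function, and the vectors are assumed to be already loaded into a plain string-keyed dict. Copying the results into a real model is left out.

```python
def split_by_prefix(vectors, doc_prefix="DOC_"):
    """Split a string-keyed vector mapping into word vectors and doc vectors.

    Keys starting with doc_prefix (a hypothetical, user-chosen convention)
    are treated as doc vectors and stored with the prefix stripped.
    """
    word_vecs, doc_vecs = {}, {}
    for key, vec in vectors.items():
        if key.startswith(doc_prefix):
            doc_vecs[key[len(doc_prefix):]] = vec
        else:
            word_vecs[key] = vec
    return word_vecs, doc_vecs


# Toy stand-in for vectors read from a word2vec.c-format file.
loaded = {
    "cat": [0.1, 0.2],
    "DOC_0": [0.3, 0.4],
    "dog": [0.5, 0.6],
    "DOC_1": [0.7, 0.8],
}
word_vecs, doc_vecs = split_by_prefix(loaded)
```

Here word_vecs holds only the entries for cat and dog, while doc_vecs holds the two doc vectors under the keys "0" and "1".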

@topinsky

topinsky commented Sep 3, 2015

I don't follow your answer; it's a bit unclear to me.
What is intersect_word2vec_format()?

My use case is simple.
I built a Doc2Vec model and wanted to save it in binary format.
I did that as before, using the save_word2vec_format method.

Now I want to load this binary file.
But if I use load_word2vec_format, I get a Word2Vec object.

How can I get a Doc2Vec object back?

@gojomo
Collaborator

gojomo commented Sep 3, 2015

I recommend you just use the plain (gensim-native) save() and load() methods. They'll save and load the full model.

(The word2vec.c format was only meant for string-keyed vectors – which the docvecs won't be if you're being maximally memory efficient. And it never saved all the model information. So I think you'd only want to use it if needing to maintain compatibility with other code.)

intersect_word2vec_format() is a method on Word2Vec that lets you load word2vec.c-format word vector values into an existing Word2Vec model, for only those words already in the model's vocabulary. (It replaces the model's vector values with those in the supplied file.) It's experimental but might support some of the reasons people would want to load_word2vec_format() into a Doc2Vec model.
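
The intersect behaviour described above can be sketched in plain Python, independent of gensim. This only illustrates the semantics (overwrite vectors for words already in the vocabulary, ignore everything else) and is not gensim's implementation. The function name intersect_text_vectors and the dict-based "model" are hypothetical, and the input is assumed to be the text variant of the word2vec.c format: a "count dim" header followed by one "word v1 v2 ..." line per vector.

```python
import io


def intersect_text_vectors(vocab_vectors, fileobj):
    """Overwrite vectors in vocab_vectors for words that appear in the file.

    vocab_vectors: dict mapping word -> list of floats (the existing model).
    fileobj: text stream in word2vec.c text format.
    Returns the number of vectors that were replaced.
    """
    header = fileobj.readline().split()
    count, dim = int(header[0]), int(header[1])
    replaced = 0
    for _ in range(count):
        parts = fileobj.readline().rstrip().split(" ")
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != dim:
            raise ValueError(f"expected {dim} values for {word!r}, got {len(values)}")
        if word in vocab_vectors:  # words outside the vocabulary are skipped
            vocab_vectors[word] = values
            replaced += 1
    return replaced


# Existing "model" with two known words; the file also contains an unknown one.
model_vectors = {"cat": [0.0, 0.0], "dog": [0.0, 0.0]}
pretrained = io.StringIO("3 2\ncat 0.1 0.2\nfish 0.3 0.4\ndog 0.5 0.6\n")
n_replaced = intersect_text_vectors(model_vectors, pretrained)
```

After the call, the vectors for cat and dog have been replaced with the file's values, and fish has not been added because it was never in the vocabulary.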

@topinsky

topinsky commented Sep 3, 2015

Thank You
