potential Doc2Vec feature: reverse inference, to synthesize doc/summary words #2459

Open
gojomo opened this issue Apr 21, 2019 · 5 comments
Labels: difficulty medium, feature, good first issue, Hacktoberfest, wishlist

Comments

gojomo (Collaborator) commented Apr 21, 2019

Motivated by the SO question: https://stackoverflow.com/questions/55768598/interpret-the-doc2vec-vectors-clusters-representation/55779049#55779049

Doc2Vec could plausibly have a reverse-inference function: take a doc-vector, return a (ranked) list of the words most predicted by that input vector. It'd work highly analogously to Word2Vec.predict_output_word(). Such a list of words might be useful as a sort of summary or label for a doc-vector.
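
A minimal sketch of what such a function might look like, assuming a negative-sampling model and gensim 4.x attribute names (`model.syn1neg` for the output-layer weights, `model.wv.index_to_key` for the vocabulary); the function name is invented for illustration, not an existing gensim API:

```python
import numpy as np

def predict_output_words_from_vector(model, doc_vector, topn=10):
    """Hypothetical reverse inference: rank vocabulary words by how
    strongly the trained output layer predicts them from one doc-vector.

    Mirrors Word2Vec.predict_output_word(), but takes a single vector
    instead of a list of context words. Assumes a negative-sampling
    model and gensim 4.x attribute names (syn1neg, wv.index_to_key).
    """
    # Treat the doc-vector as the hidden-layer activation and score
    # every vocabulary word against the output-layer weights.
    scores = np.dot(doc_vector, model.syn1neg.T)
    probs = np.exp(scores - scores.max())  # numerically stable softmax
    probs /= probs.sum()
    top = np.argsort(probs)[::-1][:topn]
    return [(model.wv.index_to_key[i], float(probs[i])) for i in top]
```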

saraswatmks (Contributor) commented

Hi @gojomo, is this yet to be done? Can I take it up?

gojomo (Collaborator, Author) commented May 1, 2019

@saraswatmks Yes, and further, you don't have to ask permission: a good PR submission will be reviewed/welcomed even without having declared your interest first. It's more important to "show working code" than "declare interest".

saraswatmks (Contributor) commented May 4, 2019

@gojomo Just to be sure, I have two implementations in mind:

Option 1 (sketched after this comment):

  1. Take a doc-vector (checking the vector length as a sanity check).
  2. Return a list of (word, score) tuples for the words most similar to the doc-vector.
  3. One thing to note: the candidate words would come from the whole vocabulary.

Option 2:

  1. Instead of an input vector, accept a doc-id.
  2. Get the vector for that doc-id.
  3. Filter to the words which occurred in that doc-id and get the vectors of all those words.
  4. Do the dot product only on the filtered word vectors. This way we ensure we return only words which actually occurred in that doc-id.

What do you think? Which one should we go with? Or if there's another way this can be done, please let me know.
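
For what it's worth, Option 1 as described maps closely onto an existing gensim call: KeyedVectors.similar_by_vector() ranks words by cosine similarity to an arbitrary vector. A minimal sketch, assuming a trained model saved at a hypothetical path:

```python
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec.load("doc2vec.model")  # hypothetical path
doc_vec = model.infer_vector("human computer interface survey".split())

# Option 1: words whose (input) vectors are closest to the doc-vector by
# cosine similarity, with candidates drawn from the whole vocabulary.
print(model.wv.similar_by_vector(doc_vec, topn=10))
```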

gojomo (Collaborator, Author) commented May 6, 2019

Re: Option 1 – it's not the "most similar" words that are needed here; rather, it's the "most predicted" ones. The logic and behavior should be highly analogous to Word2Vec.predict_output_word(), except taking a single vector rather than a list of words.

(Option 2 is impossible, as the model keeps no record of which words occurred in a given doc-id.)
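
To make the distinction concrete, under the same assumptions as the sketches above (and reusing the hypothetical predict_output_words_from_vector from the first one), the two calls answer different questions about the same vector:

```python
# "Most similar": nearest word vectors by cosine similarity; this is what
# Option 1 describes, and what similar_by_vector() already provides.
model.wv.similar_by_vector(doc_vec, topn=10)

# "Most predicted": words the trained output layer scores highest when
# fed doc_vec as the hidden-layer activation; what this issue asks for.
predict_output_words_from_vector(model, doc_vec, topn=10)
```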

gojomo (Collaborator, Author) commented May 17, 2022

See also similar #2152.

Simply refactoring predict_output_word into the part that looks up words and the part that turns a raw context vector into ranked predictions might create this Doc2Vec feature almost effortlessly (supply the doc-vector as the raw context vector).
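
A hedged skeleton of that split, under the same assumptions as the earlier sketches (negative sampling, gensim 4.x attribute names); all function names here are invented for illustration:

```python
import numpy as np

def _rank_predictions(model, raw_vector, topn=10):
    # Half 2: turn a raw context/doc vector into ranked (word, prob) pairs.
    scores = np.dot(raw_vector, model.syn1neg.T)
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    top = np.argsort(probs)[::-1][:topn]
    return [(model.wv.index_to_key[i], float(probs[i])) for i in top]

def _context_vector(model, context_words):
    # Half 1: look up and average the known context words (CBOW-mean style).
    indices = [model.wv.get_index(w) for w in context_words if w in model.wv]
    return np.mean(model.wv.vectors[indices], axis=0)

def predict_output_word(model, context_words, topn=10):
    # Existing Word2Vec behavior: half 1 feeding half 2.
    return _rank_predictions(model, _context_vector(model, context_words), topn)

def predict_doc_output_words(model, doc_vector, topn=10):
    # The proposed Doc2Vec feature: half 2 alone, fed a doc-vector directly.
    return _rank_predictions(model, doc_vector, topn)
```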
