potential Doc2Vec feature: reverse inference, to synthesize doc/summary words #2459

Open
gojomo opened this issue Apr 21, 2019 · 5 comments
Labels: difficulty medium, feature, good first issue, Hacktoberfest, wishlist

Comments

gojomo (Collaborator) commented Apr 21, 2019

Motivated by the SO question: https://stackoverflow.com/questions/55768598/interpret-the-doc2vec-vectors-clusters-representation/55779049#55779049

Doc2Vec could plausibly have a reverse-inference function: take a doc-vector, return a (ranked) list of the words most predicted by that input vector. It'd work highly analogously to Word2Vec.predict_output_word(). Such a list of words might be useful as a sort of summary or label for a doc-vector.
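
A minimal sketch of what such a function might look like, assuming a negative-sampling model and gensim 4.x attribute names (`model.syn1neg` for the output-layer weights, `model.wv.index_to_key` for the vocabulary); the function name is invented for illustration, not an existing gensim API:

```python
import numpy as np

def predict_output_words_from_vector(model, doc_vector, topn=10):
    """Hypothetical reverse inference: rank vocabulary words by how
    strongly the trained output layer predicts them from one doc-vector.

    Mirrors Word2Vec.predict_output_word(), but takes a single vector
    instead of a list of context words. Assumes a negative-sampling
    model and gensim 4.x attribute names (syn1neg, wv.index_to_key).
    """
    # Treat the doc-vector as the hidden-layer activation and score
    # every vocabulary word against the output-layer weights.
    scores = np.dot(doc_vector, model.syn1neg.T)
    probs = np.exp(scores - scores.max())  # numerically stable softmax
    probs /= probs.sum()
    top = np.argsort(probs)[::-1][:topn]
    return [(model.wv.index_to_key[i], float(probs[i])) for i in top]
```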

saraswatmks (Contributor) commented

Hi @gojomo, is this yet to be done? Can I take it up?

gojomo (Collaborator, Author) commented May 1, 2019

@saraswatmks Yes, and further, you don't have to ask permission: a good PR submission will be reviewed/welcomed even without having declared your interest first. It's more important to "show working code" than "declare interest".

saraswatmks (Contributor) commented May 4, 2019

@gojomo Just to be sure, I have two implementations in mind:

Option 1 (sketched after this comment):

  1. Take a doc-vector (checking the vector length as a sanity check).
  2. Return a list of (word, score) tuples for the words most similar to the doc-vector.
  3. One thing to note: the candidate words would come from the whole vocabulary.

Option 2:

  1. Instead of an input vector, accept a doc-id.
  2. Get the vector for that doc-id.
  3. Filter to the words which occurred in that doc-id and get the vectors of all those words.
  4. Do the dot product only on the filtered word vectors. This way we ensure we return only words which actually occurred in that doc-id.

What do you think? Which one should we go with? Or if there's another way this can be done, please let me know.
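
For what it's worth, Option 1 as described maps closely onto an existing gensim call: KeyedVectors.similar_by_vector() ranks words by cosine similarity to an arbitrary vector. A minimal sketch, assuming a trained model saved at a hypothetical path:

```python
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec.load("doc2vec.model")  # hypothetical path
doc_vec = model.infer_vector("human computer interface survey".split())

# Option 1: words whose (input) vectors are closest to the doc-vector by
# cosine similarity, with candidates drawn from the whole vocabulary.
print(model.wv.similar_by_vector(doc_vec, topn=10))
```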

gojomo (Collaborator, Author) commented May 6, 2019

Re: Option 1 – it's not the "most similar" words that are needed here; rather, it's the "most predicted" ones. The logic and behavior should be highly analogous to Word2Vec.predict_output_word(), except taking a single vector rather than a list of words.

(Option 2 is impossible, as the model keeps no record of which words occurred in a given doc-id.)
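
To make the distinction concrete, under the same assumptions as the sketches above (and reusing the hypothetical predict_output_words_from_vector from the first one), the two calls answer different questions about the same vector:

```python
# "Most similar": nearest word vectors by cosine similarity; this is what
# Option 1 describes, and what similar_by_vector() already provides.
model.wv.similar_by_vector(doc_vec, topn=10)

# "Most predicted": words the trained output layer scores highest when
# fed doc_vec as the hidden-layer activation; what this issue asks for.
predict_output_words_from_vector(model, doc_vec, topn=10)
```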

gojomo (Collaborator, Author) commented May 17, 2022

See also similar #2152.

Simply refactoring predict_output_word into the part that looks up words and the part that turns a raw context vector into ranked predictions might create this Doc2Vec feature almost effortlessly (supply the doc-vector as the raw context vector).
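
A hedged skeleton of that split, under the same assumptions as the earlier sketches (negative sampling, gensim 4.x attribute names); all function names here are invented for illustration:

```python
import numpy as np

def _rank_predictions(model, raw_vector, topn=10):
    # Half 2: turn a raw context/doc vector into ranked (word, prob) pairs.
    scores = np.dot(raw_vector, model.syn1neg.T)
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    top = np.argsort(probs)[::-1][:topn]
    return [(model.wv.index_to_key[i], float(probs[i])) for i in top]

def _context_vector(model, context_words):
    # Half 1: look up and average the known context words (CBOW-mean style).
    indices = [model.wv.get_index(w) for w in context_words if w in model.wv]
    return np.mean(model.wv.vectors[indices], axis=0)

def predict_output_word(model, context_words, topn=10):
    # Existing Word2Vec behavior: half 1 feeding half 2.
    return _rank_predictions(model, _context_vector(model, context_words), topn)

def predict_doc_output_words(model, doc_vector, topn=10):
    # The proposed Doc2Vec feature: half 2 alone, fed a doc-vector directly.
    return _rank_predictions(model, doc_vector, topn)
```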
