Add distributed document representations #204

Closed
dhammack opened this issue May 26, 2014 · 10 comments

@dhammack

The paper at http://arxiv.org/abs/1405.4053 shows how to learn distributed document representations, similar to word2vec, and demonstrates several state-of-the-art results. It would be a great addition to gensim, if feasible.

@piskvorky changed the title from "[wishlist] Add distributed document representations" to "Add distributed document representations" on May 26, 2014
@junwei-pan

Is anyone interested in working on this?

@seanlindsey

The following fork from ccri has a "doc2vec" implementation under gensim.models; it appears to label each document along the lines of the paragraph-vector approach from the paper. I'm playing with it myself at the moment: it runs on small sets, and I'm now crossing my fingers and running it on a larger one. Instead of sentences, it requires an iterator of LabeledText objects, each holding a sentence and its document ID(s), so I imagine it always looks at the label during the learning phase.

https://github.com/ccri/gensim

For an iterator, I made a CSV of tokenized documents with a corresponding ID for each; there is also another LabeledText iterator called DocSet in the code.

import ast, csv
from gensim.models.doc2vec import LabeledText  # import path assumed for the ccri fork

class LTIterator(object):
    def __init__(self, fname):
        self.fname = fname

    def __iter__(self):
        # Each CSV row holds a Python-literal list of tokens and a document ID.
        with open(self.fname) as f:
            for lt_row in csv.reader(f):
                yield LabeledText(ast.literal_eval(lt_row[0]), [lt_row[1]])

it = LTIterator(some_filename)
model = Doc2Vec(it, size=400, window=10, min_count=5, workers=11, sg=0) # sg=1 should be fine
your_docs_paragraph_vec = model[your_docs_label] # I imagine

I would love to hear any comments, input, or elaboration.

As for Cython/threading, I think that's what it's going for, but I'm not sure.

Haven't tested it, but it seems neat.

@piskvorky
Owner

Well, for comments, it would probably be best to CC its author: @temerick .

@temerick
Contributor

Yes, the doc2vec implementation is intended to model the algorithm from the Distributed Representations of Sentences and Documents paper. @seanlindsey has the right idea about how to use it. All you need is an iterator of LabeledText elements, which are made up of two lists: 1) the text of the sentence, as in the current gensim word2vec implementation and 2) a list of labels for the text. The goal is for you to be able to add as many or as few labels as you want, although I've mostly experimented with a single label, as in the paper. (The idea being, hopefully, to enable labels at multiple levels of granularity: ['doc1', 'para2', 'sent4'], for example.)
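
For instance, a minimal sketch of multi-granularity labels (assuming LabeledText takes a token list and a label list, as described above; the corpus, label names, and import path are invented for illustration):

from gensim.models.doc2vec import Doc2Vec, LabeledText  # import path assumed for the fork

texts = [
    LabeledText(['the', 'cat', 'sat'], ['doc1', 'para1', 'sent1']),
    LabeledText(['on', 'the', 'mat'], ['doc1', 'para1', 'sent2']),
    LabeledText(['dogs', 'bark', 'loudly'], ['doc1', 'para2', 'sent3']),
]
model = Doc2Vec(texts, size=100, min_count=1)  # every label gets its own vector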

One of the known differences between this code and the paper is that, in the paper, they pad short blocks of text with a null character to make them long enough, whereas here we throw them out of the vocabulary if their length is below the min_count threshold. This design choice makes sense for our context, but may not make sense for yours (if you're looking at twitter data, for instance).
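
A hedged sketch of the paper's alternative, pre-padding short token lists with a null token so they survive the length cut (the null token and target length here are illustrative assumptions, not the fork's API):

def pad_text(tokens, target_len, null_token='\0'):
    # Prepend null tokens until the text reaches the minimum usable length.
    if len(tokens) < target_len:
        tokens = [null_token] * (target_len - len(tokens)) + tokens
    return tokens

pad_text(['short', 'tweet'], 5)  # ['\0', '\0', '\0', 'short', 'tweet']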

Since it's largely copied and modified code from @piskvorky's excellent word2vec implementation, threading should go through without a hitch. The cython code should also work, if you want your jobs to get done before the heat death of the universe.

Depending on what you define as a large amount of data, this code may scale reasonably well to what you're looking for. I've successfully run it over a collection of more than 2 million paragraphs in less than 10 minutes. However, when I tried to run it on 20x that much data, my box ran out of RAM, since a new vector needs to be created for each paragraph.
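
As a back-of-the-envelope check (assuming 400-dimensional float32 vectors, matching the snippet earlier in the thread; real usage adds vocabulary and model overhead on top):

n_paragraphs = 40 * 1000 * 1000   # 20x the 2M-paragraph run mentioned above
dims, bytes_per_float = 400, 4    # float32
print('%.1f GB' % (n_paragraphs * dims * bytes_per_float / 1e9))  # ~64.0 GB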

Here's a picture I threw together from paragraphs extracted from some popular Project Gutenberg documents. It shows the distribution of paragraphs for each document after running a couple of epochs of training using CBOW (called PV-DM in the paper):

[image: sample3x3]

I hope that helps clear things up a little bit. Feel free to let me know if you have any comments or questions, or if you find any bugs.

@piskvorky
Owner

@temerick thanks for the info! What are your plans with this implementation -- did you consider a pull request?

(cc @gojomo )

@seanlindsey

@temerick
A couple of questions:
Would using the paper's PV-DM be achieved by initializing Doc2Vec with sg=0, and would setting sg=1 use the PV-DBOW method?
Do you suppose the paper uses hierarchical softmax as opposed to negative sampling?

To get the paper's sentence padding, what I'm trying (haven't tested it) is prepending a run of null characters to sentence.text while building the vocab, then resetting the sentence length so the vocab filter keeps the label later on; that way the vocab also won't lose those null characters. Then I repeat the process in the prepare_sentences function defined in Doc2Vec's train function, so we don't lose the null characters in the learning phase. Does that sound right?

@temerick
Contributor

@piskvorky I'd be happy to make a PR once I get the code cleaned up a little bit. At the moment there is too much duplicated code between my doc2vec file and your word2vec file, which is no good.

@seanlindsey

  1. Your assumption about the sg=0/1 toggle is correct (see the sketch after this list). It gets a bit confusing, since the term "bag of words" is used to refer to "many in, one out" in the original word2vec paper, but refers to "one in, many out" in the document/paragraph paper.
  2. My understanding is that the paper just uses the hierarchical softmax algorithm.
  3. Your implementation with padding also sounds correct; I'd be interested in hearing about differences in vector quality with vs. without padding.
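
For concreteness, a minimal sketch of the two modes (parameters as in the snippet earlier in the thread; the import path is assumed for the fork, and some_filename is a placeholder):

from gensim.models.doc2vec import Doc2Vec

texts = LTIterator(some_filename)  # the LabeledText iterator from earlier in the thread
dm_model = Doc2Vec(texts, size=400, window=10, min_count=5, sg=0)    # PV-DM (CBOW-style)
dbow_model = Doc2Vec(texts, size=400, window=10, min_count=5, sg=1)  # PV-DBOW (skip-gram-style)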

@piskvorky piskvorky removed the wishlist label Sep 5, 2014
@piskvorky
Owner

@temerick any progress on the PR? CC @gojomo

@temerick
Contributor

temerick commented Sep 5, 2014

@piskvorky Yes, I think I should be able to make it by the middle of next week at the latest.

@piskvorky
Owner

Brilliant, thanks. Closing this, to be continued in #231.
