Add distributed document representations #204

Closed
dhammack opened this issue May 26, 2014 · 10 comments

@dhammack

The paper at http://arxiv.org/abs/1405.4053 shows how to learn distributed document representations, similar to word2vec, and demonstrates several state-of-the-art results. It would be a great addition to gensim, if feasible.

@piskvorky changed the title from "[wishlist] Add distributed document representations" to "Add distributed document representations" on May 26, 2014
@junwei-pan

Is anyone interested in working on this?

@seanlindsey

The following fork from ccri has a "doc2vec" implementation under gensim.models; it appears to label each document along the lines of the paragraph-vector approach from the paper. I'm playing with it myself at the moment: it runs on small sets, and I'm now crossing my fingers and running it on a larger one. Instead of sentences, it requires an iterator of LabeledText objects, each holding a sentence and its document ID(s), so I imagine it always looks at the label during the learning phase.

https://github.com/ccri/gensim

For an iterator, I made a CSV of tokenized documents with a corresponding ID for each; there is also another LabeledText iterator called DocSet in the code.

import ast, csv
from gensim.models.doc2vec import LabeledText  # import path assumed for the ccri fork

class LTIterator(object):
    def __init__(self, fname):
        self.fname = fname

    def __iter__(self):
        # Each CSV row holds a Python-literal list of tokens and a document ID.
        with open(self.fname) as f:
            for lt_row in csv.reader(f):
                yield LabeledText(ast.literal_eval(lt_row[0]), [lt_row[1]])

it = LTIterator(some_filename)
model = Doc2Vec(it, size=400, window=10, min_count=5, workers=11, sg=0) # sg=1 should be fine
your_docs_paragraph_vec = model[your_docs_label] # I imagine

I would love to hear any comments, input, or elaboration.

As for Cython/threading, I think that's what it's going for, but I'm not sure.

Haven't tested it, but it seems neat.

@piskvorky
Owner

Well, for comments, it would probably be best to CC its author: @temerick .

@temerick
Contributor

Yes, the doc2vec implementation is intended to model the algorithm from the Distributed Representations of Sentences and Documents paper. @seanlindsey has the right idea about how to use it. All you need is an iterator of LabeledText elements, which are made up of two lists: 1) the text of the sentence, as in the current gensim word2vec implementation and 2) a list of labels for the text. The goal is for you to be able to add as many or as few labels as you want, although I've mostly experimented with a single label, as in the paper. (The idea being, hopefully, to enable labels at multiple levels of granularity: ['doc1', 'para2', 'sent4'], for example.)
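
For instance, a minimal sketch of multi-granularity labels (assuming LabeledText takes a token list and a label list, as described above; the corpus, label names, and import path are invented for illustration):

from gensim.models.doc2vec import Doc2Vec, LabeledText  # import path assumed for the fork

texts = [
    LabeledText(['the', 'cat', 'sat'], ['doc1', 'para1', 'sent1']),
    LabeledText(['on', 'the', 'mat'], ['doc1', 'para1', 'sent2']),
    LabeledText(['dogs', 'bark', 'loudly'], ['doc1', 'para2', 'sent3']),
]
model = Doc2Vec(texts, size=100, min_count=1)  # every label gets its own vector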

One of the known differences between this code and the paper is that, in the paper, they pad short blocks of text with a null character to make them long enough, whereas here we throw them out of the vocabulary if their length is below the min_count threshold. This design choice makes sense for our context, but may not make sense for yours (if you're looking at twitter data, for instance).
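
A hedged sketch of the paper's alternative, pre-padding short token lists with a null token so they survive the length cut (the null token and target length here are illustrative assumptions, not the fork's API):

def pad_text(tokens, target_len, null_token='\0'):
    # Prepend null tokens until the text reaches the minimum usable length.
    if len(tokens) < target_len:
        tokens = [null_token] * (target_len - len(tokens)) + tokens
    return tokens

pad_text(['short', 'tweet'], 5)  # ['\0', '\0', '\0', 'short', 'tweet']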

Since it's largely copied and modified code from @piskvorky's excellent word2vec implementation, threading should go through without a hitch. The cython code should also work, if you want your jobs to get done before the heat death of the universe.

Depending on what you define as a large amount of data, this code may scale reasonably well to what you're looking for. I've successfully run it over a collection of more than 2 million paragraphs in less than 10 minutes. However, when I tried to run it on 20x that much data, my box ran out of RAM, since a new vector needs to be created for each paragraph.
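
As a back-of-the-envelope check (assuming 400-dimensional float32 vectors, matching the snippet earlier in the thread; real usage adds vocabulary and model overhead on top):

n_paragraphs = 40 * 1000 * 1000   # 20x the 2M-paragraph run mentioned above
dims, bytes_per_float = 400, 4    # float32
print('%.1f GB' % (n_paragraphs * dims * bytes_per_float / 1e9))  # ~64.0 GB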

Here's a picture I threw together from paragraphs extracted from some popular Project Gutenberg documents. It shows the distribution of paragraphs for each document after running a couple of epochs of training using CBOW (called PV-DM in the paper):

[image: sample3x3]

I hope that helps clear things up a little bit. Feel free to let me know if you have any comments or questions, or if you find any bugs.

@piskvorky
Owner

@temerick thanks for the info! What are your plans with this implementation -- did you consider a pull request?

(cc @gojomo )

@seanlindsey

@temerick
A couple of questions:
Would using the paper's PV-DM be achieved by initializing Doc2Vec with sg=0, and would setting sg=1 use the PV-DBOW method?
Do you suppose the paper uses hierarchical softmax as opposed to negative sampling?

To get the paper's sentence padding, what I'm trying (haven't tested it) is prepending a run of null characters to sentence.text while building the vocab, then resetting the sentence length so the vocab filter keeps the label later on; that way the vocab also won't lose those null characters. Then I repeat the process in the prepare_sentences function defined in Doc2Vec's train function, so we don't lose the null characters in the learning phase. Does that sound right?

@temerick
Contributor

@piskvorky I'd be happy to make a PR once I get the code cleaned up a little bit. At the moment there is too much duplicated code between my doc2vec file and your word2vec file, which is no good.

@seanlindsey

  1. Your assumption about the sg=0/1 toggle is correct (see the sketch after this list). It gets a bit confusing, since the term "bag of words" is used to refer to "many in, one out" in the original word2vec paper, but refers to "one in, many out" in the document/paragraph paper.
  2. My understanding is that the paper just uses the hierarchical softmax algorithm.
  3. Your implementation with padding also sounds correct; I'd be interested in hearing about differences in vector quality with vs. without padding.
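
For concreteness, a minimal sketch of the two modes (parameters as in the snippet earlier in the thread; the import path is assumed for the fork, and some_filename is a placeholder):

from gensim.models.doc2vec import Doc2Vec

texts = LTIterator(some_filename)  # the LabeledText iterator from earlier in the thread
dm_model = Doc2Vec(texts, size=400, window=10, min_count=5, sg=0)    # PV-DM (CBOW-style)
dbow_model = Doc2Vec(texts, size=400, window=10, min_count=5, sg=1)  # PV-DBOW (skip-gram-style)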

@piskvorky piskvorky removed the wishlist label Sep 5, 2014
@piskvorky
Owner

@temerick any progress on the PR? CC @gojomo

@temerick
Contributor

temerick commented Sep 5, 2014

@piskvorky Yes, I think I should be able to make it by the middle of next week at the latest.

@piskvorky
Owner

Brilliant, thanks. Closing this, to be continued in #231.
