Added phrase support #135
Conversation
I was puzzled by a number of your changes. On several lines, you undo improvements contributed by Radim and by myself. After thinking about it, I am not sure it was intentional. Maybe there was a problem in how you merged?
That was unintentional. It looks like there was a problem with the merge. I will re-merge and submit the changes again.
last_word = word
last_word_count = word_count
new_sentences.append(new_sentence)
this is not going to work -- you are creating the entire corpus as a list in memory! that wouldn't scale very well :)
The bigram sentences must come out as a stream (one sentence at a time).
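To illustrate what "coming out as a stream" could look like, here is a hedged sketch (all names are hypothetical; it assumes the pair scores were already collected in an earlier counting pass):

```python
def bigram_stream(sentences, pair_scores, threshold, delimiter='_'):
    """Yield one transformed sentence at a time, joining adjacent
    words whose precomputed score exceeds the threshold.
    `pair_scores` is assumed to come from an earlier counting pass."""
    for sentence in sentences:
        out, i = [], 0
        while i < len(sentence):
            pair = tuple(sentence[i:i + 2])
            if len(pair) == 2 and pair_scores.get(pair, 0) > threshold:
                out.append(delimiter.join(pair))
                i += 2
            else:
                out.append(sentence[i])
                i += 1
        yield out
```

Because this is a generator, only one sentence lives in memory at a time, so it composes with any downstream consumer that iterates over sentences.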
I can use temporary files to store the sentences and then reload them. I was thinking of doing this, but it would be slower. As for the second case, I am not sure what a good solution is. We could do the pruning, but as you pointed out it's a hack.
For word combinations, we can do a two-pass approach. In the first pass we collect all the unigrams and their frequencies, and then only add combinations of words that are above the threshold. This would reduce the size of the dictionary.
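A minimal sketch of that two-pass idea (names are hypothetical; it assumes the sentence iterable can be restarted for the second pass, e.g. a list or a re-openable file wrapper):

```python
from collections import Counter

def two_pass_counts(sentences, min_count):
    # Pass 1: unigram frequencies over the whole corpus.
    unigrams = Counter(w for s in sentences for w in s)
    # Pass 2: only record bigrams whose constituent words are
    # frequent enough, which keeps the bigram dictionary much smaller.
    bigrams = Counter()
    for s in sentences:
        for a, b in zip(s, s[1:]):
            if unigrams[a] >= min_count and unigrams[b] >= min_count:
                bigrams[(a, b)] += 1
    return unigrams, bigrams
```

Filtering on the unigram threshold before ever storing a bigram is what reduces the dictionary size, at the cost of the extra pass.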
My idea was to stream the sentences from an iterator object, like I described in issue #130. Is there a reason this wouldn't work? Basically, separate the phrase building as corpus preprocessing, rather than part of word2vec training. About performance: unless something is super slow, the performance is not the primary objective. The primary objective is that it's easy to use and works over large corpora. If the user wants the fastest input possible, he can serialize the sentences to disk in LineSentence format and avoid any online transformations/processing.
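The preprocessing flow described here could look roughly like the following sketch (the class name, parameters, and scoring are all hypothetical simplifications, not an existing API):

```python
from collections import Counter

class PhraseTransformer:
    """Sketch of phrase detection as a standalone corpus
    transformation: collect stats in one pass over the corpus,
    then act as a lazy sentence-by-sentence transformer."""

    def __init__(self, sentences, threshold=2):
        # Single stats pass: count adjacent word pairs.
        self.counts = Counter(
            pair for s in sentences for pair in zip(s, s[1:]))
        self.threshold = threshold

    def __getitem__(self, sentences):
        # Lazy generator: one transformed sentence at a time,
        # so nothing forces the corpus into memory.
        return (self._join(s) for s in sentences)

    def _join(self, sentence):
        out, i = [], 0
        while i < len(sentence):
            pair = tuple(sentence[i:i + 2])
            if len(pair) == 2 and self.counts[pair] >= self.threshold:
                out.append('_'.join(pair))
                i += 2
            else:
                out.append(sentence[i])
                i += 1
        return out
```

Usage would then be something like `bigram = PhraseTransformer(corpus)` followed by passing `bigram[corpus]` to the word2vec trainer, keeping phrase detection entirely outside the training code.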
Streaming the sentences and outputting the combined words as you read each sentence won't work, because you don't know the bigram frequencies until you've read the whole corpus (imagine that "New York" occurs only twice - once at the beginning and once at the end). You need at least two passes: one to count the unigrams and bigrams, and one to keep the better-than-unigram bigrams. Storing the bigram counts in memory could be challenging. The Google n-gram dump reports a 23-fold increase from unigrams to bigrams. I'm working with a 2.7G-word corpus in which there are 2.8M terms, so a rough estimate in my case would be 64M bigrams - that works out to 128 bytes/bigram on an 8GiB machine. Probably enough. As an upper bound, their 1T-word corpus gave 314M bigrams. That gives you 26 bytes (3.25 ints) to store the average bigram on an 8GiB machine - probably not enough (pointer to object, pointer to string, and dict overhead leave you with only 2 bytes for the bigram text). To get all the bigrams in constant RAM, the best algorithm I can think of is:
Yes, I think N passes are needed, where N=number of phrase detection passes. N=1 for bigrams. Each pass must collect stats from its predecessor. I don't think that's a problem though. To avoid misunderstandings, this is the flow that I had in mind in #130:
I didn't look into this very deeply, but I think it should work.
re. RAM for count stats: out-of-core sorting is a proper, solid solution. But doing it the same way the C tool does it may be easier to start with, and perhaps good enough :) Incidentally, another of my open source tools,
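For the out-of-core sorting route, a rough constant-RAM sketch using only the standard library (the chunk size and tab-separated file format are arbitrary choices for illustration):

```python
import heapq
import itertools
import os
import tempfile

def external_bigram_counts(bigrams, chunk_size=1_000_000):
    """Sort fixed-size chunks of bigrams to temporary files, then
    merge the sorted streams and count runs of identical lines.
    Peak RAM stays bounded by chunk_size, independent of corpus size."""
    it = iter(bigrams)
    paths = []
    while True:
        # Sort one bounded chunk of "word_a\tword_b" lines.
        chunk = sorted(f'{a}\t{b}\n'
                       for a, b in itertools.islice(it, chunk_size))
        if not chunk:
            break
        with tempfile.NamedTemporaryFile('w', delete=False) as f:
            f.writelines(chunk)
            paths.append(f.name)
    files = [open(p) for p in paths]
    try:
        # heapq.merge streams the sorted files; groupby counts each run.
        for line, run in itertools.groupby(heapq.merge(*files)):
            a, b = line.rstrip('\n').split('\t')
            yield (a, b), sum(1 for _ in run)
    finally:
        for f in files:
            f.close()
        for p in paths:
            os.unlink(p)
```

This is essentially a merge-sort-based aggregation; the same pattern extends to trigrams and beyond by widening the line format.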
You're right that a decent approximation is all we probably need. The improvement in the model from detecting less frequent phrases as phrases is probably not very high. So, as long as the phrase detection gets the most important ones, it is probably good enough.
Hi, what is the status of this? Is someone currently working on it? I would like to give it a shot.
Not that I know of -- you're welcome @mfcabrera :) This pull request is unmergeable -- 1) seriously out of date; and 2) corpus preprocessing (incl. phrase detection) should really be independent of word2vec, as described above. |
@mfcabrera progress? Let me know if you need any help -- should be a straightforward job :) |
Due to time constraints I only started last Friday. I am currently working on it, writing a separate module for this. My code is based on the word2phrase.c implementation, although some functionality/logic will be repeated from the word2vec code, particularly learning the vocabulary. When I have something working I will create a pull request :). If I understood correctly, in the first pass I need to calculate the counts/stats for the n-grams, and then generate the bigrams on the fly from the 1-gram sentence iterator.
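For context, the decision rule in word2phrase.c boils down to the following score (a simplified sketch of the C code, not a drop-in replacement; a pair is joined when its score exceeds a user-supplied threshold):

```python
def word2phrase_score(bigram_count, count_a, count_b,
                      total_words, min_count=5):
    """Score a bigram the way word2phrase.c does:
    (count(ab) - min_count) / (count(a) * count(b)) * total_words.
    Discounting by min_count suppresses rare, noisy pairs; dividing
    by the unigram counts rewards pairs that co-occur more often
    than their individual frequencies would predict."""
    return (bigram_count - min_count) / (count_a * count_b) * total_words
```

Note that the score depends on the total corpus size, which is another reason the stats pass has to see the whole corpus before any sentence can be transformed.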
That's right. The point is for the result object to be usable as an on-the-fly transformation. I'm not sure what you mean by "repeat the learn vocabulary from word2vec" -- why would we repeat that? |
@piskvorky, I meant that to learn the stats, word2phrase learns the vocabulary (unigrams and bigrams) from the input/training file (LearnVocabFromTrainFile). The word2vec module (and the C version as well) does that, but without the bigrams. So some of the logic is repeated. I have a couple of questions: this should be a standalone module, right? (e.g. phrases.py) And where should it be located? (models?)
Yes, definitely standalone. This will be useful all around, not just in word2vec. Putting it in
Superseded by #258 , closing here. |