Added phrase support #135
Conversation
I was puzzled by a number of your changes. On several lines, you undo improvements contributed by Radim and by myself. After thinking about it, I am not sure it was intentional. Maybe there was a problem in how you merged?
That was unintentional. It looks like there was a problem with the merge. I will re-merge and submit the changes again.
last_word = word
last_word_count = word_count
new_sentences.append(new_sentence)
this is not going to work -- you are creating the entire corpus as a list in memory! that wouldn't scale very well :)
The bigram sentences must come out as a stream (one sentence at a time).
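To illustrate what "coming out as a stream" could look like, here is a hedged sketch (all names are hypothetical; it assumes the pair scores were already collected in an earlier counting pass):

```python
def bigram_stream(sentences, pair_scores, threshold, delimiter='_'):
    """Yield one transformed sentence at a time, joining adjacent
    words whose precomputed score exceeds the threshold.
    `pair_scores` is assumed to come from an earlier counting pass."""
    for sentence in sentences:
        out, i = [], 0
        while i < len(sentence):
            pair = tuple(sentence[i:i + 2])
            if len(pair) == 2 and pair_scores.get(pair, 0) > threshold:
                out.append(delimiter.join(pair))
                i += 2
            else:
                out.append(sentence[i])
                i += 1
        yield out
```

Because this is a generator, only one sentence lives in memory at a time, so it composes with any downstream consumer that iterates over sentences.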
I can use temporary files to store the sentences and then reload them. I was thinking of doing this, but it would be slower. As for the second case, I am not sure what a good solution is. We could do the pruning, but as you pointed out it's a hack.
For word combinations, we can do a two-pass approach. In the first pass we collect all the unigrams and their frequencies, and then only add combinations of words that are above the threshold. This would reduce the size of the dictionary.
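A minimal sketch of that two-pass idea (names are hypothetical; it assumes the sentence iterable can be restarted for the second pass, e.g. a list or a re-openable file wrapper):

```python
from collections import Counter

def two_pass_counts(sentences, min_count):
    # Pass 1: unigram frequencies over the whole corpus.
    unigrams = Counter(w for s in sentences for w in s)
    # Pass 2: only record bigrams whose constituent words are
    # frequent enough, which keeps the bigram dictionary much smaller.
    bigrams = Counter()
    for s in sentences:
        for a, b in zip(s, s[1:]):
            if unigrams[a] >= min_count and unigrams[b] >= min_count:
                bigrams[(a, b)] += 1
    return unigrams, bigrams
```

Filtering on the unigram threshold before ever storing a bigram is what reduces the dictionary size, at the cost of the extra pass.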
My idea was to stream the sentences from an iterator object, like I described in issue #130. Is there a reason this wouldn't work? Basically, separate the phrase building as corpus preprocessing, rather than part of word2vec training. About performance: unless something is super slow, the performance is not the primary objective. The primary objective is that it's easy to use and works over large corpora. If the user wants the fastest input possible, he can serialize the sentences to disk in LineSentence format and avoid any online transformations/processing.
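The preprocessing flow described here could look roughly like the following sketch (the class name, parameters, and scoring are all hypothetical simplifications, not an existing API):

```python
from collections import Counter

class PhraseTransformer:
    """Sketch of phrase detection as a standalone corpus
    transformation: collect stats in one pass over the corpus,
    then act as a lazy sentence-by-sentence transformer."""

    def __init__(self, sentences, threshold=2):
        # Single stats pass: count adjacent word pairs.
        self.counts = Counter(
            pair for s in sentences for pair in zip(s, s[1:]))
        self.threshold = threshold

    def __getitem__(self, sentences):
        # Lazy generator: one transformed sentence at a time,
        # so nothing forces the corpus into memory.
        return (self._join(s) for s in sentences)

    def _join(self, sentence):
        out, i = [], 0
        while i < len(sentence):
            pair = tuple(sentence[i:i + 2])
            if len(pair) == 2 and self.counts[pair] >= self.threshold:
                out.append('_'.join(pair))
                i += 2
            else:
                out.append(sentence[i])
                i += 1
        return out
```

Usage would then be something like `bigram = PhraseTransformer(corpus)` followed by passing `bigram[corpus]` to the word2vec trainer, keeping phrase detection entirely outside the training code.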
Streaming the sentences and outputting the combined words as you read each sentence won't work, because you don't know the bigram frequencies until you've read the whole corpus (imagine that "New York" occurs only twice - once at the beginning and once at the end). You need at least two passes: one to count the unigrams and bigrams, and one to keep the better-than-unigram bigrams. Storing the bigram counts in memory could be challenging. The Google n-gram dump reports a 23-fold increase from unigrams to bigrams. I'm working with a 2.7G-word corpus in which there are 2.8M terms, so a rough estimate in my case would be 64M bigrams - that works out to 128 bytes/bigram on an 8GiB machine. Probably enough. As an upper bound, their 1T-word corpus gave 314M bigrams. That gives you 26 bytes (3.25 ints) to store the average bigram on an 8GiB machine - probably not enough (pointer to object, pointer to string, and dict overhead leave you with only 2 bytes for the bigram text). To get all the bigrams in constant RAM, the best algorithm I can think of is:
Yes, I think N passes are needed, where N=number of phrase detection passes. N=1 for bigrams. Each pass must collect stats from its predecessor. I don't think that's a problem though. To avoid misunderstandings, this is the flow that I had in mind in #130:
I didn't look into this very deeply, but I think it should work.
re. RAM for count stats: out-of-core sorting is a proper, solid solution. But doing it the same way the C tool does it may be easier to start with, and perhaps good enough :) Incidentally, another of my open source tools,
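For the out-of-core sorting route, a rough constant-RAM sketch using only the standard library (the chunk size and tab-separated file format are arbitrary choices for illustration):

```python
import heapq
import itertools
import os
import tempfile

def external_bigram_counts(bigrams, chunk_size=1_000_000):
    """Sort fixed-size chunks of bigrams to temporary files, then
    merge the sorted streams and count runs of identical lines.
    Peak RAM stays bounded by chunk_size, independent of corpus size."""
    it = iter(bigrams)
    paths = []
    while True:
        # Sort one bounded chunk of "word_a\tword_b" lines.
        chunk = sorted(f'{a}\t{b}\n'
                       for a, b in itertools.islice(it, chunk_size))
        if not chunk:
            break
        with tempfile.NamedTemporaryFile('w', delete=False) as f:
            f.writelines(chunk)
            paths.append(f.name)
    files = [open(p) for p in paths]
    try:
        # heapq.merge streams the sorted files; groupby counts each run.
        for line, run in itertools.groupby(heapq.merge(*files)):
            a, b = line.rstrip('\n').split('\t')
            yield (a, b), sum(1 for _ in run)
    finally:
        for f in files:
            f.close()
        for p in paths:
            os.unlink(p)
```

This is essentially a merge-sort-based aggregation; the same pattern extends to trigrams and beyond by widening the line format.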
You're right that a decent approximation is all we probably need. The improvement in the model from detecting less frequent phrases as phrases is probably not very high. So, as long as the phrase detection gets the most important ones, it is probably good enough.
Hi, what is the status of this? Is someone currently working on it? I would like to give it a shot.
Not that I know of -- you're welcome @mfcabrera :) This pull request is unmergeable -- 1) seriously out of date; and 2) corpus preprocessing (incl. phrase detection) should really be independent of word2vec, as described above. |
@mfcabrera progress? Let me know if you need any help -- should be a straightforward job :) |
Due to time constraints I only started last Friday. I am currently working on it, writing a separate module for this. My code is based on the word2phrase.c implementation, although some functionality/logic will be repeated from the word2vec code, particularly learning the vocabulary. When I have something working I will create a pull request :). If I understood correctly, in the first pass I need to calculate the counts/stats for the n-grams, and then generate the bigrams on the fly from the 1-gram sentence iterator.
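For context, the decision rule in word2phrase.c boils down to the following score (a simplified sketch of the C code, not a drop-in replacement; a pair is joined when its score exceeds a user-supplied threshold):

```python
def word2phrase_score(bigram_count, count_a, count_b,
                      total_words, min_count=5):
    """Score a bigram the way word2phrase.c does:
    (count(ab) - min_count) / (count(a) * count(b)) * total_words.
    Discounting by min_count suppresses rare, noisy pairs; dividing
    by the unigram counts rewards pairs that co-occur more often
    than their individual frequencies would predict."""
    return (bigram_count - min_count) / (count_a * count_b) * total_words
```

Note that the score depends on the total corpus size, which is another reason the stats pass has to see the whole corpus before any sentence can be transformed.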
That's right. The point is for the result object to be usable as an on-the-fly transformation. I'm not sure what you mean by "repeat the learn vocabulary from word2vec" -- why would we repeat that? |
@piskvorky, I meant that to learn the stats, word2phrase learns the vocabulary (unigrams and bigrams) from the input/training file (LearnVocabFromTrainFile). The word2vec module (and the C version as well) does that, but without the bigrams. So some of the logic is repeated. I have a couple of questions: this should be a standalone module, right? (e.g. phrases.py) And where should it be located? (models?)
Yes, definitely standalone. This will be useful all around, not just in word2vec. Putting it in
Superseded by #258 , closing here. |