
Added phrase support #135

Closed · wants to merge 2 commits

Conversation

@sumitborar (Author)

No description provided.

@RadixSeven (Contributor)

I was puzzled by a number of your changes: on several lines you undo improvements contributed by Radim and by me. After thinking about it, I am not sure it was intentional. Maybe there was a problem in how you merged?

@sumitborar (Author)

That was unintentional. It looks like there was a problem with the merge. I will re-merge and submit the changes again.


last_word = word
last_word_count = word_count
new_sentences.append( new_sentence )
@piskvorky (Owner), on the diff above:

This is not going to work -- you are creating the entire corpus as a list in memory! That wouldn't scale very well :)

The bigram sentences must come out as a stream (one sentence at a time).
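A minimal sketch (not code from this PR) of the streaming approach being suggested: instead of appending every transformed sentence to a new_sentences list, a generator yields each one as it is produced, so memory use stays constant regardless of corpus size. The names phrase_counts and threshold are placeholders here for whatever statistics a first counting pass collected.

```python
def bigram_sentences(sentences, phrase_counts, threshold):
    """Hypothetical generator: stream sentences, joining detected phrases on the fly."""
    for sentence in sentences:
        new_sentence = []
        last_word = None
        for word in sentence:
            if last_word is not None and phrase_counts.get((last_word, word), 0) > threshold:
                new_sentence[-1] = last_word + '_' + word  # merge the pair into one token
                last_word = None                           # don't chain the merged token further
            else:
                new_sentence.append(word)
                last_word = word
        yield new_sentence  # one sentence at a time -- no giant in-memory list
```

A consumer can then iterate over bigram_sentences(...) directly; the full corpus never has to fit in RAM.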

@sumitborar (Author)

I can use temporary files to store the sentences and then reload them. I was thinking of doing this, but it would be slower.

As for the second case, I am not sure what a good solution would be. We could do the pruning, but as you pointed out, it's a hack.

@sumitborar (Author)

For word combinations, we can do a two-pass approach: in the first pass we collect all the unigrams and their frequencies, and then only add combinations of words that are above the threshold. This would reduce the size of the dictionary.
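A rough sketch of that two-pass idea (illustrative only; min_count and the helper name are made up here): the first pass counts unigrams, and the second pass only records bigrams whose constituent words are frequent enough, which keeps the bigram dictionary much smaller. It assumes sentences can be iterated more than once (e.g. a restartable iterator over a file, not a one-shot generator).

```python
from collections import Counter

def count_ngrams(sentences, min_count=5):
    """Two passes over a re-iterable corpus: unigram counts, then pruned bigram counts."""
    unigrams = Counter()
    for sentence in sentences:                 # pass 1: unigram frequencies
        unigrams.update(sentence)

    bigrams = Counter()
    for sentence in sentences:                 # pass 2: bigrams over frequent words only
        for a, b in zip(sentence, sentence[1:]):
            if unigrams[a] >= min_count and unigrams[b] >= min_count:
                bigrams[a, b] += 1
    return unigrams, bigrams
```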

@piskvorky (Owner)

My idea was to stream the sentences from an iterator object, like I described in issue #130. Is there a reason this wouldn't work?

Basically, make phrase building a separate corpus-preprocessing step, rather than part of word2vec training.

About performance: unless something is super slow, the performance is not the primary objective. The primary objective is that it's easy to use and works over large corpora. If the user wants the fastest input possible, he can serialize the sentences to disk in LineSentence format and avoid any online transformations/processing.
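For reference, the LineSentence format mentioned here is just a plain text file with one preprocessed sentence per line and tokens separated by whitespace, which gensim streams back one sentence at a time. A small illustration (the file path and contents are made up):

```python
from gensim.models.word2vec import LineSentence

# /where/ever.txt might contain lines like:
#   the quick brown fox jumps over the lazy dog
#   new york is a big city
sentences = LineSentence('/where/ever.txt')
for sentence in sentences:
    print(sentence)  # e.g. ['the', 'quick', 'brown', 'fox', ...]
```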

@RadixSeven (Contributor)

Streaming the sentences and outputting the combined words as you read each sentence won't work because you don't know the bigram frequencies until you've read the whole corpus (imagine that "New York" only occurs twice - once at the beginning and once at the end). You need at least two passes, one to count the unigrams and bigrams and one to keep the better-than-unigram bigrams.

Storing the bigram counts in memory could be challenging. The Google n-gram dump reports a 23-fold increase between unigrams and bigrams. I'm working with a 2.7G-word corpus in which there are 2.8M terms, so a rough estimate in my case would be 64M bigrams; that works out to 128 bytes/bigram on an 8GiB machine. Probably enough. As an upper bound, their 1T-word corpus gave 314M bigrams. That gives you 26 bytes (3.25 ints) to store the average bigram on an 8GiB machine, probably not enough (pointer to object, pointer to string, and pointer to dict overhead leaves you with only 2 bytes for the bigram text).
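For the record, the back-of-envelope figures above can be checked in a couple of lines (the exact numbers depend on whether 8 GB or 8 GiB is assumed; the order of magnitude is what matters):

```python
ram = 8 * 2**30      # 8 GiB in bytes
print(ram / 64e6)    # ~134 bytes per bigram for the 2.7G-word corpus
print(ram / 314e6)   # ~27 bytes per bigram for the 1T-word Google corpus
```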

To get all the bigrams in constant RAM, the best algorithm I can think of is (rough sketch below):

1. echo each bigram to a file, and also remember unigram probabilities in the same pass
2. do a merge-sort on the file, leaving a count field for duplicates (under Unix: sort tmpfile | uniq -c)
3. keep only those bigrams that exceed their unigram probabilities
4. process the corpus, outputting the kept bigrams wherever they appear
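A hypothetical sketch of that pipeline in Python (all names here are invented, and the Unix sort | uniq -c step is run via a subprocess as suggested above). Only the unigram counts live in memory; the much larger bigram stream goes through a temporary file:

```python
import subprocess
from collections import Counter

def write_bigrams(sentences, path):
    """Pass 1: dump every bigram to disk, counting unigrams in the same pass."""
    unigrams, total = Counter(), 0
    with open(path, 'w') as out:
        for sentence in sentences:
            unigrams.update(sentence)
            total += len(sentence)
            for a, b in zip(sentence, sentence[1:]):
                out.write('%s %s\n' % (a, b))
    return unigrams, total

def count_bigrams(path):
    """Pass 2: let 'sort | uniq -c' do the out-of-core aggregation, then stream the result."""
    proc = subprocess.Popen('sort %s | uniq -c' % path, shell=True,
                            stdout=subprocess.PIPE, universal_newlines=True)
    for line in proc.stdout:
        count, a, b = line.split()
        yield (a, b), int(count)
    proc.wait()

def good_bigrams(unigrams, total, bigram_counts):
    """Pass 3: keep a bigram only if it occurs more often than its unigrams predict."""
    for (a, b), count in bigram_counts:
        if count / total > (unigrams[a] / total) * (unigrams[b] / total):
            yield a, b
```

Step 4 then re-reads the corpus and joins the kept bigrams wherever they appear.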

@piskvorky (Owner)

Yes, I think N extra passes are needed, where N is the number of phrase-detection levels (N=1 for bigrams). Each pass must collect stats from its predecessor. I don't think that's a problem though.

To avoid misunderstandings, this is the flow that I had in mind in #130:

sentences = LineSentence('/where/ever.txt')  # for example

bigram = Phrases(sentences)  # collect the necessary count stats
model1 = Word2Vec(bigram[sentences], ...)  # transform sentences on the fly

trigram = Phrases(sentences, gram=3)  # same as trigram = Phrases(bigram[sentences])
model = Word2Vec(trigram[sentences], ...)

I didn't look into this very deeply, but I think this should work.

@piskvorky (Owner)

re. RAM for count stats: out-of-core sorting is a proper, solid solution. But doing it the same way the C tool does it may be easier to start with, and perhaps good enough :)

Incidentally, another of my open source tools, sqlitedict, is great for managing such out-of-core mappings. It uses sqlite as the backend but acts like a dict.
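For illustration, this is roughly what using sqlitedict as an out-of-core bigram counter looks like (the file path and the space-joined key format are arbitrary choices for this sketch; sqlitedict keys are strings, and values are pickled into an sqlite table on disk):

```python
from sqlitedict import SqliteDict

sentences = [['new', 'york', 'is', 'big'], ['i', 'love', 'new', 'york']]  # toy corpus

counts = SqliteDict('/tmp/bigram_counts.sqlite')  # dict-like, but backed by sqlite on disk
for sentence in sentences:
    for a, b in zip(sentence, sentence[1:]):
        key = a + ' ' + b
        counts[key] = counts.get(key, 0) + 1
counts.commit()                                   # flush pending writes to disk
print(counts.get('new york', 0))                  # -> 2
counts.close()
```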

@RadixSeven (Contributor)

You're right that a decent approximation is all we probably need. The improvement in the model from detecting less frequent phrases as phrases is probably not very high. So, as long as the phrase detection gets the most important ones, it is probably good enough.

@mfcabrera (Contributor)

Hi, what is the status of this? Is anyone currently working on it? I would like to give it a shot.

@piskvorky (Owner)

Not that I know of -- you're welcome @mfcabrera :)

This pull request is unmergeable -- 1) it is seriously out of date; and 2) corpus preprocessing (incl. phrase detection) should really be independent of word2vec, as described above.

@piskvorky (Owner)

@mfcabrera progress? Let me know if you need any help -- should be a straightforward job :)

@mfcabrera (Contributor)

Due to time constraints I only started last Friday. I am currently working on it, writing a separate module for this. My code is based on the word2phrase.c implementation, although some functionality/logic will be repeated from the word2vec code, particularly learning the vocabulary. When I have something working I will create a pull request :). If I understood correctly, in the first pass I need to calculate the counts/stats for the n-grams, and then generate the bigrams on the fly from the 1-gram sentence iterator.

@piskvorky (Owner)

That's right. The point is for the result object to be usable as an on-the-fly transformation.

I'm not sure what you mean by "repeat the learn vocabulary from word2vec" -- why would we repeat that?
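A minimal sketch of what that on-the-fly transformation interface could look like (entirely illustrative; class and attribute names are invented here, and this work was eventually superseded by #258): the trained object implements __getitem__ so that bigram[sentences] lazily yields transformed sentences and can be passed straight to Word2Vec.

```python
class PhraseTransformer(object):
    def __init__(self, good_bigrams):
        self.good_bigrams = good_bigrams       # set of (word_a, word_b) pairs to join

    def __getitem__(self, sentences):
        return self._transform(sentences)      # lazy: nothing is materialised here

    def _transform(self, sentences):
        for sentence in sentences:
            out, skip = [], False
            for a, b in zip(sentence, list(sentence[1:]) + [None]):
                if skip:
                    skip = False               # second half of a joined pair, already emitted
                elif (a, b) in self.good_bigrams:
                    out.append(a + '_' + b)
                    skip = True
                else:
                    out.append(a)
            yield out

# usage, mirroring the flow sketched earlier in this thread:
#   bigram = PhraseTransformer(good_bigrams={('new', 'york')})
#   model = Word2Vec(bigram[sentences], ...)
```

Note that Word2Vec iterates over its corpus more than once, so a real implementation would want bigram[sentences] to return a re-iterable wrapper rather than a one-shot generator.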

@mfcabrera (Contributor)

@piskvorky, I meant that to learn the stats, word2phrase learns the vocabulary (unigrams and bigrams) from the input/training file (LearnVocabFromTrainFile). The word2vec module (and the C version as well) does that too, just without the bigrams, so "some" of the logic is repeated. I have a couple of questions: this should be a standalone module, right? (e.g. phrases.py) And where should it be located? (models?)

@piskvorky (Owner)

Yes, definitely standalone. This will be useful all around, not just in word2vec. Putting it in models is fine.

@piskvorky (Owner)

Superseded by #258, closing here.

@piskvorky closed this on Nov 3, 2014