
trim_rule doesn't work if model initialization and vocabulary building are separated #824

Closed
vpolevoy opened this issue Aug 12, 2016 · 4 comments
Labels: difficulty easy (Easy issue: required small fix) · documentation (Current issue related to documentation)

vpolevoy commented Aug 12, 2016

Is this a bug or a feature?

```python
>>> import gensim, logging
>>> from gensim import utils
>>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
>>> full_sentence = [str(x) + 'times' for x in range(1, 5)]
>>> sentences = [full_sentence, full_sentence[1:], full_sentence[2:], full_sentence[3:]]
>>> print(sentences)
[['1times', '2times', '3times', '4times'], ['2times', '3times', '4times'], ['3times', '4times'], ['4times']]
>>> def print_vocab(model):
...     for x in model.index2word:
...         print(x)
...         print(model.vocab[x])
...
>>> # trim rule:
>>> def my_rule(word, count, min_count):
...     if word.startswith("1"):
...         return gensim.utils.RULE_KEEP
...     else:
...         return gensim.utils.RULE_DEFAULT
...
>>> model = gensim.models.Word2Vec(sentences, min_count=3, trim_rule=my_rule)
>>> # the trim rule works
>>> print_vocab(model)
```

```
2016-08-12 09:31:47,100 : INFO : collecting all words and their counts
2016-08-12 09:31:47,100 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2016-08-12 09:31:47,100 : INFO : collected 4 word types from a corpus of 10 raw words and 4 sentences
2016-08-12 09:31:47,100 : INFO : min_count=3 retains 3 unique words (drops 1)
2016-08-12 09:31:47,101 : INFO : min_count leaves 8 word corpus (80% of original 10)
2016-08-12 09:31:47,101 : INFO : deleting the raw counts dictionary of 4 items
2016-08-12 09:31:47,101 : INFO : sample=0.001 downsamples 3 most-common words
2016-08-12 09:31:47,101 : INFO : downsampling leaves estimated 0 word corpus (5.6% of prior 8)
2016-08-12 09:31:47,101 : INFO : estimated required memory for 3 words and 100 dimensions: 3900 bytes
2016-08-12 09:31:47,101 : INFO : resetting layer weights
2016-08-12 09:31:47,102 : INFO : training model with 3 workers on 3 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5
2016-08-12 09:31:47,102 : INFO : expecting 4 sentences, matching count from corpus used for vocabulary survey
4times
Vocab(count:4, index:0, sample_int:200666711)
3times
Vocab(count:3, index:1, sample_int:233244404)
1times
Vocab(count:1, index:2, sample_int:418513292)
2016-08-12 09:31:47,103 : INFO : worker thread finished; awaiting finish of 2 more threads
2016-08-12 09:31:47,103 : INFO : worker thread finished; awaiting finish of 1 more threads
2016-08-12 09:31:47,103 : INFO : worker thread finished; awaiting finish of 0 more threads
2016-08-12 09:31:47,104 : INFO : training on 50 raw words (2 effective words) took 0.0s, 2544 effective words/s
2016-08-12 09:31:47,104 : WARNING : under 10 jobs per worker: consider setting a smaller `batch_words' for smoother alpha decay
```
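
For context, a trim rule can return any of three constants from `gensim.utils`; `my_rule` above uses two of them. A small illustrative rule combining all three (the stopword set here is a made-up example, not part of the original report):

```python
import gensim

STOPWORDS = {"the", "a", "an"}  # illustrative only

def combined_rule(word, count, min_count):
    # RULE_DISCARD drops the word unconditionally,
    # RULE_KEEP retains it regardless of its count,
    # RULE_DEFAULT falls back to the usual min_count check
    if word in STOPWORDS:
        return gensim.utils.RULE_DISCARD
    if word.startswith("1"):
        return gensim.utils.RULE_KEEP
    return gensim.utils.RULE_DEFAULT
```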

```python
>>> # I want to separate initialization and vocabulary building
>>> model = gensim.models.Word2Vec(min_count=3, trim_rule=my_rule)
>>> model.build_vocab(sentences)
>>> # the trim rule doesn't work in this case
>>> print_vocab(model)
```

```
2016-08-12 09:31:59,707 : INFO : collecting all words and their counts
2016-08-12 09:31:59,708 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2016-08-12 09:31:59,708 : INFO : collected 4 word types from a corpus of 10 raw words and 4 sentences
2016-08-12 09:31:59,708 : INFO : min_count=3 retains 2 unique words (drops 2)
2016-08-12 09:31:59,708 : INFO : min_count leaves 7 word corpus (70% of original 10)
2016-08-12 09:31:59,708 : INFO : deleting the raw counts dictionary of 4 items
2016-08-12 09:31:59,708 : INFO : sample=0.001 downsamples 2 most-common words
2016-08-12 09:31:59,708 : INFO : downsampling leaves estimated 0 word corpus (4.7% of prior 7)
2016-08-12 09:31:59,708 : INFO : estimated required memory for 2 words and 100 dimensions: 2600 bytes
2016-08-12 09:31:59,708 : INFO : resetting layer weights
4times
Vocab(count:4, index:0, sample_int:187187565)
3times
Vocab(count:3, index:1, sample_int:217488221)
```

```python
>>> model.train(sentences)
>>> print_vocab(model)
```

```
2016-08-12 09:32:08,306 : INFO : training model with 3 workers on 2 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5
2016-08-12 09:32:08,306 : INFO : expecting 4 sentences, matching count from corpus used for vocabulary survey
2016-08-12 09:32:08,308 : INFO : worker thread finished; awaiting finish of 2 more threads
2016-08-12 09:32:08,308 : INFO : worker thread finished; awaiting finish of 1 more threads
2016-08-12 09:32:08,308 : INFO : worker thread finished; awaiting finish of 0 more threads
2016-08-12 09:32:08,308 : INFO : training on 50 raw words (0 effective words) took 0.0s, 0 effective words/s
2016-08-12 09:32:08,308 : WARNING : under 10 jobs per worker: consider setting a smaller `batch_words' for smoother alpha decay
4times
Vocab(count:4, index:0, sample_int:187187565)
3times
Vocab(count:3, index:1, sample_int:217488221)
```
vpolevoy (Author) commented:

I guess I got it: trim_rule should be passed to model.build_vocab(), not set during model initialization.
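
A minimal sketch of the workaround, assuming the gensim API of the era shown in the logs above (`build_vocab` accepts a `trim_rule` keyword):

```python
import gensim

def my_rule(word, count, min_count):
    # keep any word starting with "1"; otherwise apply the usual min_count check
    if word.startswith("1"):
        return gensim.utils.RULE_KEEP
    return gensim.utils.RULE_DEFAULT

sentences = [['1times', '2times', '3times', '4times'],
             ['2times', '3times', '4times'],
             ['3times', '4times'],
             ['4times']]

# initialize without a corpus, then build the vocabulary separately;
# the trim_rule goes to build_vocab(), not to the constructor
model = gensim.models.Word2Vec(min_count=3)
model.build_vocab(sentences, trim_rule=my_rule)
model.train(sentences)  # newer gensim versions also require total_examples and epochs
```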

piskvorky reopened this Aug 13, 2016
piskvorky (Owner) commented Aug 13, 2016

Looks like a bug / documentation confusion to me. -- re-opening issue.

At the very least, if the user passes a trim_rule in init but no corpus, we should log a warning that the trim_rule is being ignored, or even raise an exception.

I can see how the current API could be confusing. CC @gojomo
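
A rough sketch of the guard proposed here; the helper name and warning text are illustrative, not gensim's actual code (the real fix landed in the commit referenced below):

```python
import logging

logger = logging.getLogger(__name__)

def check_trim_rule_usage(sentences, trim_rule):
    # illustrative helper: called at the end of Word2Vec.__init__, it flags
    # a trim_rule that would be silently ignored because no corpus was given,
    # so vocabulary building (and thus trimming) is deferred to build_vocab()
    if sentences is None and trim_rule is not None:
        logger.warning(
            "trim_rule was supplied at initialization but no corpus was; "
            "it will be ignored -- pass trim_rule to build_vocab() instead"
        )
```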

piskvorky added the documentation label Aug 13, 2016
tmylk added the difficulty easy label Oct 5, 2016
glenn124f commented:

Hi! I'd like to work on this.

piskvorky (Owner) commented:

Hi @glenn124f, sorry, I guess @tmylk missed your response.

Are you still interested in working on this? CC @gojomo for API opinion.

tmylk closed this as completed in 000c02a on Mar 7, 2017
pranaydeeps pushed a commit to pranaydeeps/gensim that referenced this issue Mar 21, 2017 (…ky#1186):

* no corpus in init, but trim_rule in init: logged a warning that trim_rule is being ignored when model initialization and vocabulary building are separated
* log warning only when trim_rule is specified