
trim_rule doesn't work if model initialization and vocabulary building are separated #824

Closed
vpolevoy opened this issue Aug 12, 2016 · 4 comments
Labels: difficulty easy (Easy issue: required small fix) · documentation (Current issue related to documentation)

vpolevoy commented Aug 12, 2016

Is this a bug or a feature?

```python
>>> import gensim, logging
>>> from gensim import utils
>>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
>>> full_sentence = [str(x) + 'times' for x in range(1, 5)]
>>> sentences = [full_sentence, full_sentence[1:], full_sentence[2:], full_sentence[3:]]
>>> print(sentences)
[['1times', '2times', '3times', '4times'], ['2times', '3times', '4times'], ['3times', '4times'], ['4times']]
>>> def print_vocab(model):
...     for x in model.index2word:
...         print(x)
...         print(model.vocab[x])
...
>>> # trim rule:
>>> def my_rule(word, count, min_count):
...     if word.startswith("1"):
...         return gensim.utils.RULE_KEEP
...     else:
...         return gensim.utils.RULE_DEFAULT
...
>>> model = gensim.models.Word2Vec(sentences, min_count=3, trim_rule=my_rule)
>>> # the trim rule works
>>> print_vocab(model)
```

```
2016-08-12 09:31:47,100 : INFO : collecting all words and their counts
2016-08-12 09:31:47,100 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2016-08-12 09:31:47,100 : INFO : collected 4 word types from a corpus of 10 raw words and 4 sentences
2016-08-12 09:31:47,100 : INFO : min_count=3 retains 3 unique words (drops 1)
2016-08-12 09:31:47,101 : INFO : min_count leaves 8 word corpus (80% of original 10)
2016-08-12 09:31:47,101 : INFO : deleting the raw counts dictionary of 4 items
2016-08-12 09:31:47,101 : INFO : sample=0.001 downsamples 3 most-common words
2016-08-12 09:31:47,101 : INFO : downsampling leaves estimated 0 word corpus (5.6% of prior 8)
2016-08-12 09:31:47,101 : INFO : estimated required memory for 3 words and 100 dimensions: 3900 bytes
2016-08-12 09:31:47,101 : INFO : resetting layer weights
2016-08-12 09:31:47,102 : INFO : training model with 3 workers on 3 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5
2016-08-12 09:31:47,102 : INFO : expecting 4 sentences, matching count from corpus used for vocabulary survey
4times
Vocab(count:4, index:0, sample_int:200666711)
3times
Vocab(count:3, index:1, sample_int:233244404)
1times
Vocab(count:1, index:2, sample_int:418513292)
2016-08-12 09:31:47,103 : INFO : worker thread finished; awaiting finish of 2 more threads
2016-08-12 09:31:47,103 : INFO : worker thread finished; awaiting finish of 1 more threads
2016-08-12 09:31:47,103 : INFO : worker thread finished; awaiting finish of 0 more threads
2016-08-12 09:31:47,104 : INFO : training on 50 raw words (2 effective words) took 0.0s, 2544 effective words/s
2016-08-12 09:31:47,104 : WARNING : under 10 jobs per worker: consider setting a smaller `batch_words' for smoother alpha decay
```
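
For context, a trim rule can return any of three constants from `gensim.utils`; `my_rule` above uses two of them. A small illustrative rule combining all three (the stopword set here is a made-up example, not part of the original report):

```python
import gensim

STOPWORDS = {"the", "a", "an"}  # illustrative only

def combined_rule(word, count, min_count):
    # RULE_DISCARD drops the word unconditionally,
    # RULE_KEEP retains it regardless of its count,
    # RULE_DEFAULT falls back to the usual min_count check
    if word in STOPWORDS:
        return gensim.utils.RULE_DISCARD
    if word.startswith("1"):
        return gensim.utils.RULE_KEEP
    return gensim.utils.RULE_DEFAULT
```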

```python
>>> # I want to separate initialization and vocabulary building
>>> model = gensim.models.Word2Vec(min_count=3, trim_rule=my_rule)
>>> model.build_vocab(sentences)
>>> # the trim rule doesn't work in this case
>>> print_vocab(model)
```

```
2016-08-12 09:31:59,707 : INFO : collecting all words and their counts
2016-08-12 09:31:59,708 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2016-08-12 09:31:59,708 : INFO : collected 4 word types from a corpus of 10 raw words and 4 sentences
2016-08-12 09:31:59,708 : INFO : min_count=3 retains 2 unique words (drops 2)
2016-08-12 09:31:59,708 : INFO : min_count leaves 7 word corpus (70% of original 10)
2016-08-12 09:31:59,708 : INFO : deleting the raw counts dictionary of 4 items
2016-08-12 09:31:59,708 : INFO : sample=0.001 downsamples 2 most-common words
2016-08-12 09:31:59,708 : INFO : downsampling leaves estimated 0 word corpus (4.7% of prior 7)
2016-08-12 09:31:59,708 : INFO : estimated required memory for 2 words and 100 dimensions: 2600 bytes
2016-08-12 09:31:59,708 : INFO : resetting layer weights
4times
Vocab(count:4, index:0, sample_int:187187565)
3times
Vocab(count:3, index:1, sample_int:217488221)
```

```python
>>> model.train(sentences)
>>> print_vocab(model)
```

```
2016-08-12 09:32:08,306 : INFO : training model with 3 workers on 2 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5
2016-08-12 09:32:08,306 : INFO : expecting 4 sentences, matching count from corpus used for vocabulary survey
2016-08-12 09:32:08,308 : INFO : worker thread finished; awaiting finish of 2 more threads
2016-08-12 09:32:08,308 : INFO : worker thread finished; awaiting finish of 1 more threads
2016-08-12 09:32:08,308 : INFO : worker thread finished; awaiting finish of 0 more threads
2016-08-12 09:32:08,308 : INFO : training on 50 raw words (0 effective words) took 0.0s, 0 effective words/s
2016-08-12 09:32:08,308 : WARNING : under 10 jobs per worker: consider setting a smaller `batch_words' for smoother alpha decay
4times
Vocab(count:4, index:0, sample_int:187187565)
3times
Vocab(count:3, index:1, sample_int:217488221)
```
vpolevoy (Author) commented:

I guess I got it: trim_rule should be passed to model.build_vocab(), not set during model initialization.
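
A minimal sketch of the workaround, assuming the gensim API of the era shown in the logs above (`build_vocab` accepts a `trim_rule` keyword):

```python
import gensim

def my_rule(word, count, min_count):
    # keep any word starting with "1"; otherwise apply the usual min_count check
    if word.startswith("1"):
        return gensim.utils.RULE_KEEP
    return gensim.utils.RULE_DEFAULT

sentences = [['1times', '2times', '3times', '4times'],
             ['2times', '3times', '4times'],
             ['3times', '4times'],
             ['4times']]

# initialize without a corpus, then build the vocabulary separately;
# the trim_rule goes to build_vocab(), not to the constructor
model = gensim.models.Word2Vec(min_count=3)
model.build_vocab(sentences, trim_rule=my_rule)
model.train(sentences)  # newer gensim versions also require total_examples and epochs
```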

piskvorky reopened this Aug 13, 2016
piskvorky (Owner) commented Aug 13, 2016

Looks like a bug / documentation confusion to me. -- re-opening issue.

At the very least, if the user passes a trim_rule in init but no corpus, we should log a warning that the trim_rule is being ignored, or even raise an exception.

I can see how the current API could be confusing. CC @gojomo
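
A rough sketch of the guard proposed here; the helper name and warning text are illustrative, not gensim's actual code (the real fix landed in the commit referenced below):

```python
import logging

logger = logging.getLogger(__name__)

def check_trim_rule_usage(sentences, trim_rule):
    # illustrative helper: called at the end of Word2Vec.__init__, it flags
    # a trim_rule that would be silently ignored because no corpus was given,
    # so vocabulary building (and thus trimming) is deferred to build_vocab()
    if sentences is None and trim_rule is not None:
        logger.warning(
            "trim_rule was supplied at initialization but no corpus was; "
            "it will be ignored -- pass trim_rule to build_vocab() instead"
        )
```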

piskvorky added the documentation label Aug 13, 2016
tmylk added the difficulty easy label Oct 5, 2016
glenn124f commented:

Hi! I'd like to work on this.

piskvorky (Owner) commented:

Hi @glenn124f, sorry, I guess @tmylk missed your response.

Are you still interested in working on this? CC @gojomo for API opinion.

tmylk closed this as completed in 000c02a on Mar 7, 2017
pranaydeeps pushed a commit to pranaydeeps/gensim that referenced this issue Mar 21, 2017 (…ky#1186):

* no corpus in init, but trim_rule in init: logged a warning that trim_rule is being ignored when model initialization and vocabulary building are separated
* log warning only when trim_rule is specified