Error "mismatched vocab_size" loading supervised model with FastText.load_fasttext_format #1498

Closed
bittlingmayer opened this issue Jul 22, 2017 · 8 comments · Fixed by #1645
Labels
difficulty easy Easy issue: required small fix documentation Current issue related to documentation good first issue Issue for new contributors (not required gensim understanding + very simple)

Comments

@bittlingmayer

The issue occurs with the latest code in the develop branch. That branch is the only way to load a .bin without a .vec, and thus the only way to load a supervised model, because supervised training does not output a .vec.

Steps/Code/Corpus to Reproduce

pip install https://github.com/RaRe-Technologies/gensim/archive/develop.zip

Then, in Python:

from gensim.models.wrappers import FastText
m = FastText.load_fasttext_format('example.bin') # note: there is no example.vec

Expected Results

The model loads.

Actual Results

Error:

'mismatched vocab_size ({}) and nwords ({}), extra word "{}"'.format(vocab_size, nwords, word))
AssertionError: mismatched vocab_size (2338) and nwords (2336), extra word "__label__bad"

Of course, __label__bad is one of the labels in the model.
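
The two counts in the message differ by exactly two, which is what one would expect if the dictionary size in the .bin counts words and labels together while nwords counts only words. A minimal sketch of that arithmetic, assuming the model has two labels (only __label__bad is confirmed above; the second label is an assumption) and using a hypothetical helper name:

```python
def expected_dictionary_size(nwords, nlabels):
    """Total dictionary entries if words and labels share one table.

    Hypothetical helper for illustration; not a gensim or fastText function.
    """
    return nwords + nlabels


# Numbers taken from the AssertionError above; nlabels=2 is an assumption.
nwords = 2336
nlabels = 2
vocab_size = expected_dictionary_size(nwords, nlabels)
print(vocab_size)
```

With these inputs the total matches the vocab_size reported in the error, so every label in a supervised model would show up as an "extra word" to a loader that expects words only.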

Versions

>>> import platform; print(platform.platform())
Darwin-16.6.0-x86_64-i386-64bit
>>> import sys; print("Python", sys.version)
Python 3.5.2 (v3.5.2:4def2a2901a5, Jun 26 2016, 10:47:25) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
>>> import numpy; print("NumPy", numpy.__version__)
NumPy 1.13.1
>>> import scipy; print("SciPy", scipy.__version__)
SciPy 0.19.1
>>> import gensim; print("gensim", gensim.__version__)
gensim 2.2.0
>>> from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)
FAST_VERSION 0
@prakhar2b
Contributor

@bittlingmayer Gensim doesn't support the supervised version of fastText as of now.

@bittlingmayer
Author

@prakhar2b Thank you, ideally the error message should say that.

@menshikh-iv menshikh-iv added documentation Current issue related to documentation difficulty easy Easy issue: required small fix labels Oct 2, 2017
@ibrahimsharaf
Contributor

Hi @menshikh-iv, how can I start solving this?

@menshikh-iv
Contributor

Hi @ibrahimsharaf, you need to add information to the exception saying that only unsupervised fastText models are supported.

@ibrahimsharaf
Contributor

@menshikh-iv, so I should raise an exception inside the load_fasttext_format method in case the .vec file does not exist?

@menshikh-iv
Contributor

@ibrahimsharaf

  • Check popular supervised fastText implementations.
  • Find what's special about the "labels" (I think __label_*, but this needs checking).
  • If the mismatched words meet these criteria (like __label_*), it is probably a supervised model -> raise an exception similar to the one in the first message and add the info "we don't support supervised fasttext".

@menshikh-iv menshikh-iv added good first issue Issue for new contributors (not required gensim understanding + very simple) and removed test before incubator labels Oct 16, 2017
@ElSaico
Contributor

ElSaico commented Oct 23, 2017

Working on it!

BTW: the prefix for the labels is fully configurable in fastText (-label parameter, with __label__ being the default), so we cannot rely on this particular pattern, but need a more generic rule such as "all mismatched words have the same prefix".
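
A minimal sketch of the "all mismatched words have the same prefix" rule, not the actual gensim fix. The exception class, the function names, and the message wording are all hypothetical; only os.path.commonprefix is a real standard-library call.

```python
import os


class SupervisedModelError(ValueError):
    """Hypothetical exception for this sketch; not part of gensim."""


def shared_prefix(words):
    """Return the longest prefix common to all words ('' if there is none)."""
    return os.path.commonprefix(list(words))


def check_extra_words(extra_words, vocab_size, nwords):
    """Raise a descriptive error for words beyond the expected nwords.

    extra_words: the vocabulary entries that caused the mismatch.
    If they all share a non-empty prefix, assume they are labels.
    """
    extra_words = list(extra_words)
    if not extra_words:
        return  # nothing mismatched, nothing to report
    prefix = shared_prefix(extra_words)
    if prefix:
        raise SupervisedModelError(
            'mismatched vocab_size ({}) and nwords ({}): all extra words start '
            'with "{}", so this is probably a supervised fastText model, '
            'which is not supported'.format(vocab_size, nwords, prefix)
        )
    raise AssertionError(
        'mismatched vocab_size ({}) and nwords ({}), extra words {}'.format(
            vocab_size, nwords, extra_words)
    )
```

One caveat with this heuristic: when there is only a single extra word, the shared prefix is the whole word, so the check cannot tell a lone label from a stray vocabulary entry.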

@menshikh-iv
Contributor

Thanks for the information @ElSaico, I did not know about it.

5 participants