Loading c word2vec text models and encoding errors #496

Closed
nick-magnini opened this issue Oct 22, 2015 · 3 comments

Comments

@nick-magnini

Whenever I train a huge model with the C version of word2vec, I have to fix lines that either contain strange characters or are malformed. Only a few lines are affected, but since finding and replacing them takes forever, is there any way to just skip them in the load_word2vec_format function?

@gojomo
Collaborator

gojomo commented Oct 22, 2015

If you can work with the develop branch, load_word2vec_format() now takes an optional unicode_errors argument (see #466). The value is passed to Python's built-in unicode() function, and using 'ignore' or 'replace' should help most reads survive any mangling from the word2vec.c files...
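To illustrate what the 'ignore' and 'replace' policies do, here is a minimal sketch using only Python's standard `bytes.decode()`, which accepts the same error-handling values. It does not call gensim; the sample line and the `decode_line` helper are hypothetical, made up for illustration, and the assumption is that the model file is meant to be UTF-8.

```python
# Hypothetical sample: one line of a word2vec.c text file whose token
# ends in a truncated UTF-8 sequence (0xC3 with no continuation byte).
raw_line = b"caf\xc3 0.12 -0.34 0.56\n"


def decode_line(raw, errors="strict"):
    """Decode one raw line as UTF-8 with the given error policy.

    Illustrative helper only; it is not part of gensim.
    """
    return raw.decode("utf-8", errors=errors)


# With the default 'strict' policy the bad byte raises an exception,
# which is what aborts loading of the whole file.
try:
    decode_line(raw_line)
except UnicodeDecodeError:
    strict_failed = True
else:
    strict_failed = False

# 'ignore' silently drops the undecodable byte.
ignored = decode_line(raw_line, errors="ignore")

# 'replace' substitutes U+FFFD (the Unicode replacement character).
replaced = decode_line(raw_line, errors="replace")

print(strict_failed)
print(repr(ignored))
print(repr(replaced))
```

Note that neither policy skips the line: the word is still loaded, just with a slightly altered token. With 'ignore', two differently mangled tokens could in principle collapse to the same string, so 'replace' may be easier to spot afterwards.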

@nick-magnini
Author

Thanks for the suggestion.

@gojomo
Collaborator

gojomo commented Nov 11, 2015

The unicode_errors option is in the 0.12.3 release and so is available to anyone who hits this problem. Closing. (If a lot of people hit this problem for reasons outside their control, perhaps 'ignore' should be the new default. TBD.)

gojomo closed this as completed Nov 11, 2015