Loading c word2vec text models and encoding errors #496

nick-magnini · 2015-10-22T00:18:11Z

Whenever I have a huge model trained with the C version of word2vec, I have to fix lines that either contain strange characters or are malformed. There are only a few such lines. Since finding and replacing them takes forever, is there any way to just ignore and skip them in the load_word2vec_format function?

gojomo · 2015-10-22T00:54:29Z

If you can work with the develop branch, load_word2vec_format() now takes an optional unicode_errors argument (see #466). The value is passed to the native python unicode() function, and using 'ignore' or 'replace' should help most reads survive any mangling from the word2vec.c files...
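To illustrate what those two values do, here is a minimal stdlib-only sketch of the underlying decode behaviour. The sample bytes and the helper name `decode_line` are made up for illustration, and it uses Python 3's `bytes.decode` rather than the Python 2 `unicode()` call; it is not the gensim loader itself.

```python
# One line of a hypothetical word2vec text file: a token with an invalid
# UTF-8 byte (0xff), followed by two vector components.
raw_line = b"caf\xff 0.1 0.2"


def decode_line(raw, errors="strict"):
    """Decode one raw line as UTF-8 using the given error-handling mode."""
    return raw.decode("utf8", errors=errors)


# 'strict' (the usual default) raises on the bad byte.
try:
    decode_line(raw_line)
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# 'ignore' silently drops the undecodable byte.
ignored = decode_line(raw_line, errors="ignore")

# 'replace' substitutes U+FFFD (the Unicode replacement character).
replaced = decode_line(raw_line, errors="replace")

print(strict_failed, repr(ignored), repr(replaced))
```

With the loader, the same choice would be made by passing something like `unicode_errors='ignore'` to `load_word2vec_format()`. Note that 'ignore' can shorten a token, so two distinct mangled words may end up looking alike; 'replace' at least keeps a visible marker where the bad byte was.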

nick-magnini · 2015-10-22T15:05:14Z

Thanks for the suggestion.

gojomo · 2015-11-11T04:35:59Z

The unicode_errors option is in the 0.12.3 release and so is available to anyone who hits this problem. Closing. (If a lot of people hit this problem for reasons outside their control, perhaps 'ignore' should be the new default. TBD.)

gojomo closed this as completed Nov 11, 2015
