Allow file-based *2vec training from compressed files #2159
Labels: difficulty medium (requires good gensim understanding and Python skills), performance (hardware-level performance), wishlist (feature request)
All Gensim algorithms allow the use of smart_open to read their input data, meaning the data can be .gz, .bz2, live on S3, etc. However, the new code path for file-based training of *2vec models from #2127 only accepts .txt files. This is problematic, because the main purpose of file-based training is to run on very large datasets (where its superior speed actually matters). Keeping such large text files uncompressed is wasteful and sometimes even impossible.
Task: implement support for reading input from .gz compressed files, at least. Support for .bz2 would be nice too, but having each worker seek into the middle of the file may be technically problematic for that format.
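
To make the seeking concern concrete, here is a rough sketch of how each worker could read its own slice of a .gz file using only Python's standard gzip module. This is not Gensim's implementation; the helper names (uncompressed_size, worker_offsets, read_lines_from) are hypothetical and exist only for illustration. Note that seek() on a gzip stream opened for reading is emulated by decompressing from the start of the file, so every worker pays for the data before its offset. That cost is exactly what makes this non-trivial for very large inputs.

```python
import gzip


def uncompressed_size(path, chunk_size=1 << 20):
    """Count uncompressed bytes by streaming through the whole file once."""
    total = 0
    with gzip.open(path, "rb") as fin:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                return total
            total += len(chunk)


def worker_offsets(size, num_workers):
    """Return (start, end) byte ranges in the uncompressed stream, one per worker."""
    bounds = [size * i // num_workers for i in range(num_workers + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def read_lines_from(path, start, end):
    """Yield the lines that begin inside [start, end) of the uncompressed stream."""
    with gzip.open(path, "rb") as fin:
        if start > 0:
            # Emulated seek: gzip decompresses everything before this offset.
            fin.seek(start - 1)
            # Skip the rest of the line straddling the boundary; the previous
            # worker owns it. If byte start-1 is a newline, this consumes only
            # that newline and leaves us at the beginning of our first line.
            fin.readline()
        while fin.tell() < end:
            line = fin.readline()
            if not line:
                break
            yield line
```

Each line is read by exactly one worker: a worker keeps every line that starts inside its range, even when the line runs past the range end, and the next worker skips that overhang. A real solution would likely need an index of seek points, or a different chunking scheme, to avoid re-decompressing the same prefix in every worker.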