
Proposal: Replace word2vec-specific implementation w/ constrained subclass of FastText #2879

Open
gojomo opened this issue Jul 12, 2020 · 5 comments
Labels
housekeeping internal tasks and processes

Comments

@gojomo
Collaborator

gojomo commented Jul 12, 2020

I believe that everything Word2Vec does can also already be done equivalently via FastText, with constrained options (like turning off char-ngrams). So we could potentially eliminate a lot of duplicated algorithm code & future maintenance overhead by recharacterizing FastText as the root of our 2Vec hierarchy, instead of the original Word2Vec code.

We'd want to ensure FastText gracefully handles char-ngram disablement (by not making allocations or class choices that are only required when ngrams are enabled), and make Word2Vec a subclass of FastText that hides some options, rather than the other way around with FastText as a subclass that adds new options.

User-visible APIs might not change at all. Some new ngram-enabled Doc2Vec options might become possible if Doc2Vec derives from FastText (like an inference that works at least a little, in some modes, even with all OOV words).

(Relates to: #2852)
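To make the idea concrete, here's a rough, purely illustrative sketch of the kind of reorganization being proposed. This is not gensim's actual code, and it assumes FastText accepts max_n=0 / bucket=0 as the ngram-off configuration (itself one of the open questions below):

```python
# Hypothetical sketch only, not gensim's real implementation: Word2Vec becomes a
# thin, constrained subclass of FastText that forces the char-ngram machinery off
# and hides the FastText-only options from callers.
from gensim.models import FastText

class Word2Vec(FastText):
    def __init__(self, sentences=None, **kwargs):
        # Discard FastText-only knobs if a caller passes them, then force the
        # ngram-disabling settings so behavior matches classic word2vec.
        for ft_only in ('min_n', 'max_n', 'bucket'):
            kwargs.pop(ft_only, None)
        super().__init__(sentences=sentences, max_n=0, bucket=0, **kwargs)
```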

@piskvorky
Owner

Makes sense to me. I also can't think of anything – beyond performance and branding – that word2vec has over fasttext.

@gojomo
Collaborator Author

gojomo commented Jul 13, 2020

A quick check of Word2Vec (all defaults) & FastText (all defaults except n-grams disabled) has already shown a few issues that will need addressing:

  • It's a bit muddled how ngrams should be turned off: different comments/code imply either (1) max_n less than min_n; (2) char_ngrams=0; or (3) char_ngrams==0 and max_n==0 (in FastText.__init__()).
  • The train() step that takes 55s in plain Word2Vec takes 84s in FastText: a lot of overhead for a disabled feature.
  • An evaluate_word_analogies() with questions-words.txt fails everything on the FT-without-ngrams model, rather than roughly matching the should-be-equivalent Word2Vec; it's unclear why.
  • Even without ngrams, the FastText model still maintains two separate sets of per-word vectors (the raw whole-word vector, and the whole-word-plus-ngram-enrichment result), so RAM usage & on-disk storage are larger than they need to be.

Most of these probably have straightforward fixes, but the existing gensim FastText implementation isn't yet a simple drop-in replacement for Word2Vec as one might have hoped. (Potentially, rough parity in ngrams-disabled performance and quality-evaluations could have been part of the FT code's original review/testing.)
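For reference, a rough sketch of the kind of side-by-side check described above (not the exact benchmark run; it assumes a local text8 corpus file, and parameter names follow gensim 3.x):

```python
# Rough sketch of the Word2Vec-vs-FastText(no-ngrams) comparison; any iterable of
# tokenized sentences can stand in for the assumed local 'text8' file.
import time
from gensim.models import Word2Vec, FastText
from gensim.models.word2vec import Text8Corpus
from gensim.test.utils import datapath

corpus = Text8Corpus('text8')

t0 = time.time()
w2v = Word2Vec(corpus, workers=4)            # plain Word2Vec, all defaults
t1 = time.time()
ft = FastText(corpus, workers=4, max_n=0)    # FastText defaults, char-ngrams disabled
t2 = time.time()
print("Word2Vec train: %.0fs   FastText (no ngrams) train: %.0fs" % (t1 - t0, t2 - t1))

# With ngrams off, analogy scores should roughly match; per the report above, they didn't.
questions = datapath('questions-words.txt')
print("W2V analogies:", w2v.wv.evaluate_word_analogies(questions)[0])
print("FT  analogies:", ft.wv.evaluate_word_analogies(questions)[0])
```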

@piskvorky
Owner

Thanks for that check, that's a great start. Are the conclusions above still true after #2891?

Bullet points 2) and 3) are especially worrying. Unless it's something trivial, we're probably not aiming to resolve this for 4.0.0, although 4.0.0 would be the ideal place for a change like this.

@gojomo
Collaborator Author

gojomo commented Jul 27, 2020

The speed gap & problems with analogies are fixed by (ready-for-merge) #2891; the duplication bloat in RAM & serialization is fixed by (probably OK for merge but still intended as a place to fix a few other load/save issues) #2892. Other as-yet-undiagnosed bloat (both W2V & FT's 'main' pickle file are larger since #2698) probably has a quick fix when I get around to looking closer.

The exact recommended/best way to run without ngrams still needs a little more investigation & doc improvement. (It's probably max_n=0, but bucket=0 might be as good and should perhaps be equally supported; both are sketched below.)
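A minimal sketch of the two candidate "no-ngrams" configurations, using gensim's tiny common_texts corpus just so the snippet is self-contained; which of these ends up as the officially supported route is still open:

```python
from gensim.models import FastText
from gensim.test.utils import common_texts  # toy corpus, for illustration only

ft_max_n = FastText(common_texts, min_count=1, max_n=0)    # option 1: no char-ngram lengths
ft_bucket = FastText(common_texts, min_count=1, bucket=0)  # option 2: no ngram hash buckets
```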

After that, the biggest issue would likely be ensuring Doc2Vec could live as a FastText subclass (instead of a Word2Vec subclass), & ironing out any little oddities from Word2Vec & Doc2Vec relying on FT while hiding/suppressing some of its unneeded aspects. I don't think that'd be too big of a problem, but there may be gotchas yet to be revealed.

@piskvorky
Owner

Alright. I'll tentatively mark this as "4.0.0" but not really blocking.

@piskvorky piskvorky added this to the 4.0.0 milestone Jul 27, 2020
@piskvorky piskvorky added the housekeeping internal tasks and processes label Jul 27, 2020
@piskvorky piskvorky removed this from the 4.0.0 milestone Sep 13, 2020