[MRG] Wrapper for FastText #847
Conversation
Awesome! We want to finalize and merge the […]
What do you think would be the best way to handle OOV words with […]
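For context on the OOV question: FastText can produce a vector for an out-of-vocabulary word by combining the vectors of its character n-grams. A minimal sketch of that idea follows; all names (`oov_vector`, `ngram_vectors`, ...) are hypothetical, and the combination rule is simplified to a plain average:

```python
import numpy as np

def oov_vector(word, vocab, word_vectors, ngram_vectors, min_n=3, max_n=6):
    """Look up a word vector, falling back to character n-grams for OOV words."""
    if word in vocab:
        return word_vectors[vocab[word]]
    extended = '<' + word + '>'  # FastText wraps words in boundary markers
    ngrams = [extended[i:i + n]
              for n in range(min_n, max_n + 1)
              for i in range(len(extended) - n + 1)]
    known = [ngram_vectors[g] for g in ngrams if g in ngram_vectors]
    if not known:
        raise KeyError("word '%s' not in vocabulary and no known n-grams" % word)
    return np.mean(known, axis=0)
```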
This reverts commit 6e20834. Conflicts: gensim/test/test_word2vec.py
Conflicts: gensim/models/word2vec.py
…ugh word2vec instance
The build is passing now (finally). One last thing - the FastText tests which involve actually using the FastText binary do not run on the build system; I've run them locally. Is this fine, or is there a way to set up FastText on the build system? I didn't see anything similar for the Mallet LDA wrapper. Thoughts?
On second thoughts, I'm unsure if the FastText wrapper should return a […]
Pushed a new branch and created a PR #1078 which contains the update to […]
We don't test the wrappers for DTM, Mallet etc. in Travis, as it takes a long time to download/compile the binaries for them. We do test them prior to a release though. Suggest having the same process for FastText.
@tmylk thanks, that sounds good. @piskvorky I'm unable to request a review via GitHub, the option doesn't seem to show up under "Reviewers".
```
Example::

    >>> trained_model['office']
```
Misleading example (doesn't use this method at all).
```
else:
    raise KeyError("word '%s' not in vocabulary" % word)
mean.append(weight * self.word_vec(word, use_norm=True))
if word in self.vocab:
```
Dead code test, can never reach here (above line would throw a KeyError).
The `KeyError` has been removed.
No, it's still there, on line 66.
That line raises a `KeyError` in case `word in self.vocab` is `False`. So in case it's `True`, line 115 would be executed. Also, `word_vec` has been overridden in the `KeyedVectors` subclass for `FastText`.
Yes, my point is -- isn't it always `True`? How could it be `False`, when that would raise an exception at the line above? The test seems superfluous.
But if subclasses can make `word_vec()` behave differently (not raise for missing words), then it makes sense. Not sure what the general contract for `word_vec()` behaviour is.
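To illustrate the contract question, a toy sketch (hypothetical classes, not the PR's actual code) of a base lookup that raises for OOV words next to a subclass that does not:

```python
import numpy as np

class BaseKeyedVectors(object):
    """Base contract: word_vec raises KeyError for out-of-vocabulary words."""
    def __init__(self, vocab, vectors):
        self.vocab = vocab      # word -> row index
        self.vectors = vectors  # one row per vocabulary word

    def word_vec(self, word):
        if word in self.vocab:
            return self.vectors[self.vocab[word]]
        raise KeyError("word '%s' not in vocabulary" % word)

class FastTextLikeKeyedVectors(BaseKeyedVectors):
    """Relaxed contract: OOV words get a computed vector instead of an error."""
    def word_vec(self, word):
        if word in self.vocab:
            return self.vectors[self.vocab[word]]
        # stand-in: real FastText would combine character n-gram vectors here
        return np.zeros(self.vectors.shape[1])
```

Under the subclass, the call can succeed even when `word in self.vocab` is `False`, which is what makes a membership test after the call meaningful.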
```
if not positive:
    raise ValueError("cannot compute similarity with no input")

all_words = set([self.vocab[word].index for word in positive+negative if word in self.vocab])
```
What is the `all_words` created above for?
To remove the input words from the returned `most_similar` words.
Eh, never mind, the review snippet showed me the code for `all_words` from `most_similar` above, I thought it's the same function. Disregard my comment.
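For readers following along, a toy illustration (made-up data) of what `all_words` is for in `most_similar`:

```python
index2word = ['office', 'desk', 'chair', 'lamp']
all_words = {0}                 # index of the input word 'office'
best = [0, 2, 3]                # candidate indices, ranked by similarity
dists = [1.0, 0.3, 0.8, 0.7]    # similarity of each index to the query

# the query word itself (index 0) is filtered out of the results
result = [(index2word[i], dists[i]) for i in best if i not in all_words]
assert result == [('chair', 0.8), ('lamp', 0.7)]
```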
Square brackets `[ ]` not needed inside the `set()`.
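The point, illustrated with a toy vocabulary (here mapping words straight to indices, unlike gensim's richer vocab entries):

```python
vocab = {'office': 0, 'desk': 1}
positive, negative = ['office'], ['desk']

s1 = set([vocab[w] for w in positive + negative if w in vocab])  # builds a throwaway list
s2 = set(vocab[w] for w in positive + negative if w in vocab)    # generator, no list
s3 = {vocab[w] for w in positive + negative if w in vocab}       # set comprehension
assert s1 == s2 == s3 == {0, 1}
```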
```
if not words:
    raise ValueError("cannot select a word from an empty list")
vectors = vstack(self.syn0norm[self.vocab[word].index] for word in words).astype(REAL)
logger.debug("using words %s" % words)
```
Use lazy log formatting (pass params via comma rather than formatting directly, which is wasteful when `debug` output is disabled).
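The two forms side by side, for reference:

```python
import logging

logger = logging.getLogger(__name__)
words = ['office', 'desk']

# eager: the string is built even when DEBUG logging is disabled
logger.debug("using words %s" % words)

# lazy: formatting is deferred until the record is actually emitted
logger.debug("using words %s", words)
```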
Done
```
vectors = vstack(self.syn0norm[self.vocab[word].index] for word in words).astype(REAL)
logger.debug("using words %s" % words)
vectors = []
for word in words:
```
The line that initialized `words` was removed above, so how does this work?
The previous version seemed shorter and cleaner; what is the purpose of this explicit loop?
`words` was also the name of one of the params of the method, and the removed line was filtering out OOV words. Renamed and reworked things to make them a little clearer in the latest changes.
```
except KeyError:
    logger.debug("vector for word %s not present, ignoring the word", word)
if not vectors:
    raise ValueError("vector for all given words absent")
```
Confusing message, please rephrase. How about "cannot compute similarity with no input", like above, for consistency?
Also wondering whether we should throw an exception in case of missing words, rather than silently ignoring them (a DEBUG-level message is practically the same as ignoring).
Changed to a warning for now; that should be noticeable.
Not sure an exception would be ideal, since it's a change in behaviour, and if the intention is only to make things more transparent, a warning probably serves the purpose best.
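A minimal sketch of the behaviour settled on here (hypothetical helper name; gensim's real vocab entries carry an `index` attribute, emulated below with a namedtuple):

```python
import logging
from collections import namedtuple

logger = logging.getLogger(__name__)
Vocab = namedtuple('Vocab', 'index')  # stand-in for gensim's vocab entries

def collect_vectors(words, vocab, syn0norm):
    """Keep vectors for known words; warn, rather than raise, on missing ones."""
    vectors = []
    for word in words:
        if word in vocab:
            vectors.append(syn0norm[vocab[word].index])
        else:
            logger.warning("vector for word %s not present, ignoring the word", word)
    if not vectors:
        raise ValueError("cannot compute similarity with no input")
    return vectors
```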
OK, thanks.
```
@@ -469,6 +469,9 @@ def __init__(
        self.build_vocab(sentences, trim_rule=trim_rule)
        self.train(sentences)

    def initialize_word_vectors(self):
        self.wv = KeyedVectors()  # wv --> word vectors
```
Remove comment, adds nothing.
Done
```
def load_binary_data(self, model_binary_file):
    """Loads data from the output binary file created by FastText training"""
    with open(model_binary_file, 'rb') as f:
```
Use `smart_open` (everywhere).
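For reference, current versions of `smart_open` expose a drop-in replacement for the built-in `open` (older versions, around the time of this PR, used `from smart_open import smart_open` instead); the path below is a placeholder:

```python
from smart_open import open  # drop-in replacement for the built-in open

# the same call also handles s3://, http://, gzipped files, etc.
with open('model.bin', 'rb') as f:
    header = f.read(4)
```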
Done
```
@staticmethod
def compute_ngrams(word, min_n, max_n):
    ngram_indices = []
    BOW, EOW = ('<','>') # Used by FastText to attach to all words as prefix and suffix
```
PEP8: space after comma.
Done
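For readers unfamiliar with the scheme, a sketch of the n-gram extraction a method like this typically performs (simplified; the PR's actual code may differ in details):

```python
def compute_ngrams(word, min_n, max_n):
    """Character n-grams of a word padded with FastText's boundary markers."""
    BOW, EOW = '<', '>'  # with the space after the comma, per the review above
    extended_word = BOW + word + EOW
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(extended_word) - n + 1):
            ngrams.append(extended_word[i:i + n])
    return ngrams

print(compute_ngrams('office', 3, 4))  # ['<of', 'off', ..., 'ice>']
```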
@jayantj left a few comments in the review. Looking at the notebook: […]
Otherwise looks good 👍
Thanks for the review, much appreciated.
I've branched this off @droudy's `KeyedVectors` PR (#833). The changes I've made are all in `gensim/models/wrappers/fasttext.py`.
Just wondering if this is the correct approach to go with; if we can decide that, I'll properly document the code, write tests, and refactor it. Right now it's mostly a POC. The main things I'm concerned about are -
- Using `structs` to read off the binary file generated by FastText (I'm not completely sure about how portable this is across different compiler versions and architectures)

Steps to reproduce (you'll need to set up FastText first) -
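On the `struct` concern: pinning the byte order in the format string removes compiler-dependent layout from the reading side, though the sizes written by the C++ binary are still fixed by the writer. A sketch with a purely hypothetical field layout:

```python
import struct

# field names and layout are illustrative only; the real header is defined
# by the FastText C++ source
with open('model.bin', 'rb') as f:
    dim, = struct.unpack('<i', f.read(4))           # one little-endian int32
    bucket, minn = struct.unpack('<2i', f.read(8))  # two more int32 fields
```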
I've manually verified on a small set that it produces the same results as FastText
@gojomo @tmylk