[MRG] Wrapper for FastText #847
Conversation
Awesome! We want to finalize and merge the […]
What do you think would be the best way to handle OOV words with […]
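For context on the OOV question: FastText can produce a vector for an out-of-vocabulary word by combining the vectors of its character n-grams. A minimal sketch of that idea follows; all names (`oov_vector`, `ngram_vectors`, ...) are hypothetical, and the combination rule is simplified to a plain average:

```python
import numpy as np

def oov_vector(word, vocab, word_vectors, ngram_vectors, min_n=3, max_n=6):
    """Look up a word vector, falling back to character n-grams for OOV words."""
    if word in vocab:
        return word_vectors[vocab[word]]
    extended = '<' + word + '>'  # FastText wraps words in boundary markers
    ngrams = [extended[i:i + n]
              for n in range(min_n, max_n + 1)
              for i in range(len(extended) - n + 1)]
    known = [ngram_vectors[g] for g in ngrams if g in ngram_vectors]
    if not known:
        raise KeyError("word '%s' not in vocabulary and no known n-grams" % word)
    return np.mean(known, axis=0)
```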
This reverts commit 6e20834. Conflicts: gensim/test/test_word2vec.py
Conflicts: gensim/models/word2vec.py
…ugh word2vec instance
The build is passing now (finally). One last thing - the FastText tests which involve actually using the FastText binary do not run on the build system; I've run them locally. Is this fine, or is there a way to set up FastText on the build system? I didn't see anything similar for the Mallet LDA wrapper. Thoughts?
On second thoughts, I'm unsure if the FastText wrapper should return a […]
Pushed a new branch and created a PR #1078 which contains the update to […]
We don't test the wrappers for DTM, Mallet etc. in Travis, as it takes a long time to download/compile the binaries for them. We do test them prior to a release though. Suggest having the same process for FastText.
@tmylk thanks, that sounds good. @piskvorky I'm unable to request a review via GitHub, the option doesn't seem to show up under "Reviewers".
```
Example::

    >>> trained_model['office']
```
Misleading example (doesn't use this method at all).
```
else:
    raise KeyError("word '%s' not in vocabulary" % word)
mean.append(weight * self.word_vec(word, use_norm=True))
if word in self.vocab:
```
Dead code test, can never reach here (above line would throw a KeyError).
The `KeyError` has been removed.
No, it's still there, on line 66.
That line raises a `KeyError` in case `word in self.vocab` is `False`. So in case it's `True`, line 115 would be executed. Also, `word_vec` has been overridden in the `KeyedVectors` subclass for `FastText`.
Yes, my point is -- isn't it always `True`? How could it be `False`, when that would raise an exception at the line above? The test seems superfluous.
But if subclasses can make `word_vec()` behave differently (not raise for missing words), then it makes sense. Not sure what the general contract for `word_vec()` behaviour is.
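To illustrate the contract question, a toy sketch (hypothetical classes, not the PR's actual code) of a base lookup that raises for OOV words next to a subclass that does not:

```python
import numpy as np

class BaseKeyedVectors(object):
    """Base contract: word_vec raises KeyError for out-of-vocabulary words."""
    def __init__(self, vocab, vectors):
        self.vocab = vocab      # word -> row index
        self.vectors = vectors  # one row per vocabulary word

    def word_vec(self, word):
        if word in self.vocab:
            return self.vectors[self.vocab[word]]
        raise KeyError("word '%s' not in vocabulary" % word)

class FastTextLikeKeyedVectors(BaseKeyedVectors):
    """Relaxed contract: OOV words get a computed vector instead of an error."""
    def word_vec(self, word):
        if word in self.vocab:
            return self.vectors[self.vocab[word]]
        # stand-in: real FastText would combine character n-gram vectors here
        return np.zeros(self.vectors.shape[1])
```

Under the subclass, the call can succeed even when `word in self.vocab` is `False`, which is what makes a membership test after the call meaningful.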
```
if not positive:
    raise ValueError("cannot compute similarity with no input")

all_words = set([self.vocab[word].index for word in positive+negative if word in self.vocab])
```
What is the `all_words` created above for?
To remove the input words from the returned `most_similar` words.
Eh, never mind, the review snippet showed me the code for `all_words` from `most_similar` above, I thought it's the same function. Disregard my comment.
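For readers following along, a toy illustration (made-up data) of what `all_words` is for in `most_similar`:

```python
index2word = ['office', 'desk', 'chair', 'lamp']
all_words = {0}                 # index of the input word 'office'
best = [0, 2, 3]                # candidate indices, ranked by similarity
dists = [1.0, 0.3, 0.8, 0.7]    # similarity of each index to the query

# the query word itself (index 0) is filtered out of the results
result = [(index2word[i], dists[i]) for i in best if i not in all_words]
assert result == [('chair', 0.8), ('lamp', 0.7)]
```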
Square brackets `[ ]` not needed inside the `set()`.
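The point, illustrated with a toy vocabulary (here mapping words straight to indices, unlike gensim's richer vocab entries):

```python
vocab = {'office': 0, 'desk': 1}
positive, negative = ['office'], ['desk']

s1 = set([vocab[w] for w in positive + negative if w in vocab])  # builds a throwaway list
s2 = set(vocab[w] for w in positive + negative if w in vocab)    # generator, no list
s3 = {vocab[w] for w in positive + negative if w in vocab}       # set comprehension
assert s1 == s2 == s3 == {0, 1}
```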
```
if not words:
    raise ValueError("cannot select a word from an empty list")
vectors = vstack(self.syn0norm[self.vocab[word].index] for word in words).astype(REAL)
logger.debug("using words %s" % words)
```
Use lazy log formatting (pass params via comma rather than formatting directly, which is wasteful when `debug` output is disabled).
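The two forms side by side, for reference:

```python
import logging

logger = logging.getLogger(__name__)
words = ['office', 'desk']

# eager: the string is built even when DEBUG logging is disabled
logger.debug("using words %s" % words)

# lazy: formatting is deferred until the record is actually emitted
logger.debug("using words %s", words)
```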
Done
```
vectors = vstack(self.syn0norm[self.vocab[word].index] for word in words).astype(REAL)
logger.debug("using words %s" % words)
vectors = []
for word in words:
```
The line that initialized `words` was removed above, so how does this work?
The previous version seemed shorter and cleaner; what is the purpose of this explicit loop?
`words` was also the name of one of the params of the method, and the removed line was filtering out OOV words. Renamed and reworked things to make them a little clearer in the latest changes.
```
except KeyError:
    logger.debug("vector for word %s not present, ignoring the word", word)
if not vectors:
    raise ValueError("vector for all given words absent")
```
Confusing message, please rephrase. How about "cannot compute similarity with no input", like above, for consistency?
Also wondering whether we should throw an exception in case of missing words, rather than silently ignoring them (a DEBUG-level message is practically the same as ignoring).
Changed to a warning for now; that should be noticeable.
Not sure an exception would be ideal, since it's a change in behaviour, and if the intention is only to make things more transparent, a warning probably serves the purpose best.
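A minimal sketch of the behaviour settled on here (hypothetical helper name; gensim's real vocab entries carry an `index` attribute, emulated below with a namedtuple):

```python
import logging
from collections import namedtuple

logger = logging.getLogger(__name__)
Vocab = namedtuple('Vocab', 'index')  # stand-in for gensim's vocab entries

def collect_vectors(words, vocab, syn0norm):
    """Keep vectors for known words; warn, rather than raise, on missing ones."""
    vectors = []
    for word in words:
        if word in vocab:
            vectors.append(syn0norm[vocab[word].index])
        else:
            logger.warning("vector for word %s not present, ignoring the word", word)
    if not vectors:
        raise ValueError("cannot compute similarity with no input")
    return vectors
```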
OK, thanks.
```
@@ -469,6 +469,9 @@ def __init__(
        self.build_vocab(sentences, trim_rule=trim_rule)
        self.train(sentences)

    def initialize_word_vectors(self):
        self.wv = KeyedVectors()  # wv --> word vectors
```
Remove comment, adds nothing.
Done
```
def load_binary_data(self, model_binary_file):
    """Loads data from the output binary file created by FastText training"""
    with open(model_binary_file, 'rb') as f:
```
Use `smart_open` (everywhere).
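For reference, current versions of `smart_open` expose a drop-in replacement for the built-in `open` (older versions, around the time of this PR, used `from smart_open import smart_open` instead); the path below is a placeholder:

```python
from smart_open import open  # drop-in replacement for the built-in open

# the same call also handles s3://, http://, gzipped files, etc.
with open('model.bin', 'rb') as f:
    header = f.read(4)
```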
Done
```
@staticmethod
def compute_ngrams(word, min_n, max_n):
    ngram_indices = []
    BOW, EOW = ('<','>') # Used by FastText to attach to all words as prefix and suffix
```
PEP8: space after comma.
Done
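For readers unfamiliar with the scheme, a sketch of the n-gram extraction a method like this typically performs (simplified; the PR's actual code may differ in details):

```python
def compute_ngrams(word, min_n, max_n):
    """Character n-grams of a word padded with FastText's boundary markers."""
    BOW, EOW = '<', '>'  # with the space after the comma, per the review above
    extended_word = BOW + word + EOW
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(extended_word) - n + 1):
            ngrams.append(extended_word[i:i + n])
    return ngrams

print(compute_ngrams('office', 3, 4))  # ['<of', 'off', ..., 'ice>']
```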
@jayantj left a few comments in the review. Looking at the notebook: […]
Otherwise looks good 👍
Thanks for the review, much appreciated.
I've branched this off @droudy's `KeyedVectors` PR (#833). The changes I've made are all in `gensim/models/wrappers/fasttext.py`.
Just wondering if this is the correct approach to go with; if we can decide that, I'll properly document the code, write tests, and refactor it. Right now it's mostly a POC. The main things I'm concerned about are -
- Using `structs` to read off the binary file generated by FastText (I'm not completely sure about how portable this is across different compiler versions and architectures)

Steps to reproduce (you'll need to set up FastText first) -
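On the `struct` concern: pinning the byte order in the format string removes compiler-dependent layout from the reading side, though the sizes written by the C++ binary are still fixed by the writer. A sketch with a purely hypothetical field layout:

```python
import struct

# field names and layout are illustrative only; the real header is defined
# by the FastText C++ source
with open('model.bin', 'rb') as f:
    dim, = struct.unpack('<i', f.read(4))           # one little-endian int32
    bucket, minn = struct.unpack('<2i', f.read(8))  # two more int32 fields
```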
I've manually verified on a small set that it produces the same results as FastText
@gojomo @tmylk