Implement saving to Facebook format #2712

Merged
merged 47 commits on Jan 23, 2020
Changes from 40 commits
Commits
47 commits
1bd9b34
Add writing header for binary FB format (#2611)
lopusz Nov 29, 2019
1da932f
Adding writing vocabulary, vectors, output layer for FB format (#2611)
lopusz Dec 6, 2019
2e6ecbd
Clean up writing to binary FB format (#2611)
lopusz Dec 22, 2019
e3f104e
Adding tests for saving FastText models to binary FB format (#2611)
lopusz Dec 22, 2019
393ead3
Extending tests for saving FastText models to binary FB format (#2611)
lopusz Dec 22, 2019
a04e560
Clean up (flake8) writing to binary FB format (#2611)
lopusz Dec 22, 2019
f7ac2c6
Word count bug fix + including additional test (#2611)
lopusz Dec 27, 2019
aacab5e
Removing f-strings for Python 3.5 compatibility + clean-up(#2611)
lopusz Dec 27, 2019
c2882af
Clean up the comments (#2611)
lopusz Dec 27, 2019
08fe0f1
Removing forgotten f-string for Python 3.5 compatibility (#2611)
lopusz Dec 27, 2019
749ca6d
Correct tests failing @ CI (#2611)
lopusz Dec 27, 2019
aec5e88
Another attempt to correct tests failing @ CI (#2611)
lopusz Dec 27, 2019
bfc938d
Yet another attempt to correct tests failing @ CI (#2611)
lopusz Dec 27, 2019
88de72b
New attempt to correct tests failing @ CI (#2611)
lopusz Dec 27, 2019
100f25f
Fix accidentally broken test (#2611)
lopusz Dec 27, 2019
86d8c7b
Include Radim remarks to saving models in binary FB format (#2611)
lopusz Dec 28, 2019
94778d0
Correcting loss bug (#2611)
lopusz Dec 28, 2019
9e139c1
Completed correcting loss bug (#2611)
lopusz Dec 28, 2019
0684833
Correcting breaking doc building bug (#2611)
lopusz Dec 28, 2019
2ed3115
Include first batch of Michael remarks
lopusz Dec 30, 2019
8e8ca1e
Refactoring SaveFacebookFormatRoundtripModelToModelTest according to …
lopusz Jan 5, 2020
ebb2ac8
Refactoring remaining tests according to Michael remarks (#2611)
lopusz Jan 5, 2020
03a777c
Cleaning up the test refactoring (#2611)
lopusz Jan 5, 2020
d087074
Refactoring handling tuple result from struct.unpack (#2611)
lopusz Jan 5, 2020
6df1ff5
Removing unused import (#2611)
lopusz Jan 5, 2020
9c44e87
Refactoring variable name according to Michael review (#2611)
lopusz Jan 5, 2020
5b3eb2b
Removing redundant saving in test for Facebook binary saving (#2611)
lopusz Jan 7, 2020
d709f2f
Minimizing context manager blocks span (#2611)
lopusz Jan 8, 2020
d37fd4f
Remove obsolete comment (#2611)
lopusz Jan 8, 2020
821862e
Shortening method name (#2611)
lopusz Jan 8, 2020
8bfa866
Moving model parameters to _check_roundtrip function (#2611)
lopusz Jan 8, 2020
6bebcef
Finished moving model parameters to _check_roundtrip function (#2611)
lopusz Jan 8, 2020
1eaea78
Clean-up FT_HOME behaviour (#2611)
lopusz Jan 8, 2020
594ca6b
Simplifying vectors equality check (#2611)
lopusz Jan 8, 2020
971bfa6
Unifying testing method names (#2611)
lopusz Jan 8, 2020
f8a7d47
Refactoring _create_and_save_fb_model method name (#2611)
lopusz Jan 8, 2020
968eea0
Refactoring test names (#2611)
lopusz Jan 8, 2020
8c03bc5
Refactoring flake8 errors (#2611)
lopusz Jan 8, 2020
984ff2b
Correcting fasttext invocation handling (#2611)
lopusz Jan 8, 2020
47d45f4
Removing _parse_wordvectors function (#2611)
lopusz Jan 8, 2020
eacf8b6
Correcting whitespace and simplifying test assertion (#2611)
lopusz Jan 22, 2020
7eef237
Removing redundant anonymous variable (#2611)
lopusz Jan 22, 2020
6009f0d
Moving assertion outside of a context manager (#2611)
lopusz Jan 22, 2020
ea02091
Function rename (#2611)
lopusz Jan 22, 2020
c7c7aa8
Cleaning doc strings and comments in FB binary format saving function…
lopusz Jan 22, 2020
4745bb8
Cleaning doc strings in end-user API for FB binary format saving (#2611)
lopusz Jan 22, 2020
28f27f8
Correcting FT_CMD execution in SaveFacebookByteIdentityTest (#2611)
lopusz Jan 22, 2020
315 changes: 306 additions & 9 deletions gensim/models/_fasttext_bin.py
@@ -41,37 +41,53 @@

_END_OF_WORD_MARKER = b'\x00'

# FastText dictionary data structure holds elements of type `entry` which can have `entry_type`
# either `word` (0 :: int8) or `label` (1 :: int8). Here we deal with unsupervised case only
# so we want `word` type.
# See https://github.com/facebookresearch/fastText/blob/master/src/dictionary.h

_DICT_WORD_ENTRY_TYPE_MARKER = b'\x00'


logger = logging.getLogger(__name__)

_FASTTEXT_FILEFORMAT_MAGIC = 793712314
# Constants for FastText version and FastText file format magic (both int32)
# https://github.com/facebookresearch/fastText/blob/master/src/fasttext.cc#L25

_FASTTEXT_VERSION = np.int32(12)
_FASTTEXT_FILEFORMAT_MAGIC = np.int32(793712314)


# _NEW_HEADER_FORMAT is constructed on the basis of args::save method, see
# https://github.com/facebookresearch/fastText/blob/master/src/args.cc

_NEW_HEADER_FORMAT = [
('dim', 'i'),
('ws', 'i'),
('epoch', 'i'),
('min_count', 'i'),
('neg', 'i'),
('_', 'i'),
('word_ngrams', 'i'), # Unused in loading
('loss', 'i'),
('model', 'i'),
('bucket', 'i'),
('minn', 'i'),
('maxn', 'i'),
('_', 'i'),
('lr_update_rate', 'i'), # Unused in loading
('t', 'd'),
]

_OLD_HEADER_FORMAT = [
('epoch', 'i'),
('min_count', 'i'),
('neg', 'i'),
('_', 'i'),
('word_ngrams', 'i'), # Unused in loading
('loss', 'i'),
('model', 'i'),
('bucket', 'i'),
('minn', 'i'),
('maxn', 'i'),
('_', 'i'),
('lr_update_rate', 'i'), # Unused in loading
('t', 'd'),
]
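An aside, not part of the diff: each header spec above is an ordered list of (field name, struct format character) pairs, so a header can be packed and unpacked generically. A minimal sketch with hypothetical helper names, assuming unique field names, mirroring the per-field writes in _args_save below and the dict comprehension used in load:

    import struct

    def _pack_header(values, header_spec):
        # Serialize a dict of header values field by field, in spec order (no padding between fields).
        return b''.join(struct.pack('@' + fmt, values[name]) for name, fmt in header_spec)

    def _unpack_header(data, header_spec):
        # Inverse: walk the spec, consuming 4 bytes per 'i' field and 8 bytes per 'd' field.
        result, offset = {}, 0
        for name, fmt in header_spec:
            result[name], = struct.unpack_from('@' + fmt, data, offset)
            offset += struct.calcsize('@' + fmt)
        return result

Both directions walk the same spec, which keeps the save path symmetric with the existing load path.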

@@ -93,6 +109,7 @@ def _yield_field_names():
yield 'nwords'
yield 'vectors_ngrams'
yield 'hidden_output'
yield 'ntokens'


_FIELD_NAMES = sorted(set(_yield_field_names()))
@@ -168,6 +185,7 @@ def _load_vocab(fin, new_format, encoding='utf-8'):
The loaded vocabulary. Keys are words, values are counts.
The vocabulary size.
The number of words.
The number of tokens.
"""
vocab_size, nwords, nlabels = _struct_unpack(fin, '@3i')

@@ -176,7 +194,8 @@ def _load_vocab(fin, new_format, encoding='utf-8'):
raise NotImplementedError("Supervised fastText models are not supported")
logger.info("loading %s words for fastText model from %s", vocab_size, fin.name)

_struct_unpack(fin, '@1q') # number of tokens
ntokens = _struct_unpack(fin, '@q')[0] # number of tokens

if new_format:
pruneidx_size, = _struct_unpack(fin, '@q')

@@ -205,7 +224,7 @@ def _load_vocab(fin, new_format, encoding='utf-8'):
for j in range(pruneidx_size):
_struct_unpack(fin, '@2i')

return raw_vocab, vocab_size, nwords
return raw_vocab, vocab_size, nwords, ntokens


def _load_matrix(fin, new_format=True):
@@ -315,11 +334,12 @@ def load(fin, encoding='utf-8', full_model=True):

header_spec = _NEW_HEADER_FORMAT if new_format else _OLD_HEADER_FORMAT
model = {name: _struct_unpack(fin, fmt)[0] for (name, fmt) in header_spec}

if not new_format:
model.update(dim=magic, ws=version)

raw_vocab, vocab_size, nwords = _load_vocab(fin, new_format, encoding=encoding)
model.update(raw_vocab=raw_vocab, vocab_size=vocab_size, nwords=nwords)
raw_vocab, vocab_size, nwords, ntokens = _load_vocab(fin, new_format, encoding=encoding)
model.update(raw_vocab=raw_vocab, vocab_size=vocab_size, nwords=nwords, ntokens=ntokens)

vectors_ngrams = _load_matrix(fin, new_format=new_format)

@@ -366,5 +386,282 @@ def _backslashreplace_backport(ex):
return text, end


def _sign_model(fout):
"""
Writes the signature of a file in Facebook's native fastText `.bin` format
to the binary output stream `fout`. The signature consists of the magic bytes and the version.

Name mimics original C++ implementation, see
[FastText::signModel](https://github.com/facebookresearch/fastText/blob/master/src/fasttext.cc)

Parameters
----------
fout: writeable binary stream
"""
fout.write(_FASTTEXT_FILEFORMAT_MAGIC.tobytes())
fout.write(_FASTTEXT_VERSION.tobytes())


def _conv_field_to_bytes(field_value, field_type):
"""
Auxiliary function that converts `field_value` to bytes based on the requested `field_type`,
for saving to the binary file.

Parameters
----------
field_value: numerical
the numerical value of the field to be converted

field_type: str
currently supported `field_types` are `i` for 32-bit integer and `d` for 64-bit float
"""
if field_type == 'i':
return (np.int32(field_value).tobytes())
elif field_type == 'd':
return (np.float64(field_value).tobytes())
else:
raise NotImplementedError('Currently conversion to "%s" type is not implemented.' % field_type)


def _get_field_from_model(model, field):
"""
Extract `field` from `model`.

Parameters
----------
model: gensim.models.fasttext.FastText
model from which `field` is extracted
field: str
requested field name, fields are listed in the `_NEW_HEADER_FORMAT` list
"""
if field == 'bucket':
return model.trainables.bucket
elif field == 'dim':
return model.vector_size
elif field == 'epoch':
return model.epochs
elif field == 'loss':
# `loss` => hs: 1, ns: 2, softmax: 3, one-vs-all: 4
# ns = negative sampling loss (default)
# hs = hierarchical softmax loss
# softmax = softmax loss
# one-vs-all = one vs all loss (supervised)
if model.hs == 1:
return 1
elif model.hs == 0:
return 2
elif model.hs == 0 and model.negative == 0:
return 1
elif field == 'maxn':
return model.wv.max_n
elif field == 'minn':
return model.wv.min_n
elif field == 'min_count':
return model.vocabulary.min_count
elif field == 'model':
# `model` => cbow:1, sg:2, sup:3
# cbow = continuous bag of words (default)
# sg = skip-gram
# sup = supervised
return 2 if model.sg == 1 else 1
elif field == 'neg':
return model.negative
elif field == 't':
return model.vocabulary.sample
elif field == 'word_ngrams':
# This is skipped by gensim when loading, so use the default value from the FB C++ code
return 1
elif field == 'ws':
return model.window
elif field == 'lr_update_rate':
# This is skipped by gensim when loading, so use the default value from the FB C++ code
return 100
else:
msg = 'Extraction of header field "' + field + '" from Gensim FastText object not implemented.'
raise NotImplementedError(msg)


def _args_save(fout, model, fb_fasttext_parameters):
"""
Saves a header with the `model` parameters to the binary stream `fout` containing a model in Facebook's
native fastText `.bin` format.

Name mimics original C++ implementation, see
[Args::save](https://github.com/facebookresearch/fastText/blob/master/src/args.cc)

Parameters
----------
fout: writeable binary stream
stream to which model is saved
model: gensim.models.fasttext.FastText
saved model
fb_fasttext_parameters: dictionary
dictionary containing the `lr_update_rate` and `word_ngrams` parameters, which are
unused by the gensim implementation, so they have to be provided externally
"""
for field, field_type in _NEW_HEADER_FORMAT:
if field in fb_fasttext_parameters:
field_value = fb_fasttext_parameters[field]
else:
field_value = _get_field_from_model(model, field)
fout.write(_conv_field_to_bytes(field_value, field_type))


def _dict_save(fout, model, encoding):
Collaborator: The function name and comment are inconsistent. I think it's worth aligning them one way or another. My preference would be to rename the function _save_vocabulary to make it obvious what it's doing (dict is a bit ambiguous here).

"""
Saves vocabulary from `model` to the binary stream `fout` containing a model in Facebook's
Collaborator: For consistency with the function name.
Suggested change:
Saves vocabulary from `model` to the binary stream `fout` containing a model in Facebook's
Saves the dictionary from `model` to the binary stream `fout` containing a model in Facebook's
Contributor Author: Corrected.

native fastText `.bin` format.

Name mimics the original C++ implementation
[Dictionary::save](https://github.com/facebookresearch/fastText/blob/master/src/dictionary.cc)

Parameters
----------
fout: writeable binary stream
stream to which model is saved
model: gensim.models.fasttext.FastText
saved model
Collaborator: Is this what you mean?
Suggested change:
saved model
the model that contains the dictionary to save
Contributor Author: Corrected.

encoding: str
string encoding used in the output
"""
fout.write(np.int32(len(model.wv.vocab)).tobytes())

fout.write(np.int32(len(model.wv.vocab)).tobytes())
Collaborator: Why are these lines duplicated? If there's a good reason, then please put it in a code comment.
From the C++ source, it looks like you're writing size, nwords, nlabels in that order. If size==nwords, then the duplication makes sense, but it's worth documenting why that equality holds.
Collaborator: Ping on this.
Contributor Author: Comment added.

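# Note on the two identical writes above: Dictionary::save stores size_ followed by nwords_,
# and in the unsupervised case the dictionary contains no labels, so both fields equal len(model.wv.vocab).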

# nlabels=0: there are no labels, since we are in the unsupervised case
fout.write(np.int32(0).tobytes())

fout.write(np.int64(model.corpus_total_words).tobytes())

# pruneidx_size_=-1: a value of -1 denotes no pruning index (pruning is only supported in supervised mode)
fout.write(np.int64(-1).tobytes())

for word in model.wv.index2word:
word_count = model.wv.vocab[word].count
fout.write(word.encode(encoding))
fout.write(_END_OF_WORD_MARKER)
fout.write(np.int64(word_count).tobytes())
fout.write(_DICT_WORD_ENTRY_TYPE_MARKER)

# We are in the unsupervised case, therefore the pruning index is empty and we do not need to write anything else


def _input_save(fout, model):
"""
Saves word and ngram vectors from `model` to the binary stream `fout` containing a model in
Facebook's native fastText `.bin` format.

Corresponding C++ fastText code:
DenseMatrix::save[DenseMatrix::save](https://github.com/facebookresearch/fastText/blob/master/src/densematrix.cc)
Collaborator: Fix the link:
Suggested change:
DenseMatrix::save[DenseMatrix::save](https://github.com/facebookresearch/fastText/blob/master/src/densematrix.cc)
[DenseMatrix::save](https://github.com/facebookresearch/fastText/blob/master/src/densematrix.cc)
Contributor Author: Corrected.


Parameters
----------
fout: writeable binary stream
stream to which model is saved
Collaborator suggested change:
stream to which model is saved
stream to which the vectors are saved
Contributor Author: Corrected.

model: gensim.models.fasttext.FastText
saved model
Collaborator suggested change:
saved model
the model that contains the vectors to save
Contributor Author: Corrected.

"""
vocab_n, vocab_dim = model.wv.vectors_vocab.shape
ngrams_n, ngrams_dim = model.wv.vectors_ngrams.shape

assert vocab_dim == ngrams_dim
assert vocab_n == len(model.wv.vocab)
assert ngrams_n == model.wv.bucket

fout.write(struct.pack('@2q', vocab_n + ngrams_n, vocab_dim))
fout.write(model.wv.vectors_vocab.tobytes())
fout.write(model.wv.vectors_ngrams.tobytes())


def _output_save(fout, model):
"""
Saves the output layer of `model` to the binary stream `fout` containing a model in
Facebook's native fastText `.bin` format.

Corresponding C++ fastText code:
DenseMatrix::save[DenseMatrix::save](https://github.com/facebookresearch/fastText/blob/master/src/densematrix.cc)
Collaborator: Fix formatting.
Suggested change:
DenseMatrix::save[DenseMatrix::save](https://github.com/facebookresearch/fastText/blob/master/src/densematrix.cc)
[DenseMatrix::save](https://github.com/facebookresearch/fastText/blob/master/src/densematrix.cc)
Contributor Author: Corrected.


Parameters
----------
fout: writeable binary stream
stream to which model is saved
Collaborator suggested change:
stream to which model is saved
stream to which the output layer is saved
Contributor Author: Corrected.

model: gensim.models.fasttext.FastText
saved model
Collaborator suggested change:
saved model
the model that contains the output layer to save
Contributor Author: Corrected.

"""
if model.hs:
hidden_output = model.trainables.syn1
if model.negative:
hidden_output = model.trainables.syn1neg

hidden_n, hidden_dim = hidden_output.shape
fout.write(struct.pack('@2q', hidden_n, hidden_dim))
fout.write(hidden_output.tobytes())


def _save_to_stream(model, fout, fb_fasttext_parameters, encoding):
"""
Saves word embeddings to the binary stream `fout` using Facebook's native fastText `.bin` format.

Parameters
----------
fout: file name or writeable binary stream
stream to which model is saved
Collaborator suggested change:
stream to which model is saved
stream to which the word embeddings are saved
Contributor Author: Corrected.

model: gensim.models.fasttext.FastText
saved model
Collaborator suggested change:
saved model
the model that contains the word embeddings to save
Contributor Author: Corrected.

fb_fasttext_parameters: dictionary
dictionary containing the `lr_update_rate` and `word_ngrams` parameters, which are
unused by the gensim implementation, so they have to be provided externally
encoding: str
encoding used in the output file
"""

_sign_model(fout)
_args_save(fout, model, fb_fasttext_parameters)
_dict_save(fout, model, encoding)
fout.write(struct.pack('@?', False)) # Save 'quant_', which is False for unsupervised models

# Save word and ngram vectors
_input_save(fout, model)
fout.write(struct.pack('@?', False)) # Save 'qout', which is False for unsupervised models

# Save the output layer of the model
_output_save(fout, model)


def save(model, fout, fb_fasttext_parameters, encoding):
Collaborator: I think this is also an _internal method. People won't be calling it from outside this module, right?
Contributor Author: My understanding is as follows: _fasttext_bin.py is an internal module intended for use in fasttext.py.
For loading we have a bunch of _internal functions (_batched_generator, _load_matrix, etc.) and a top-level load used in fasttext.py.
I tried to do the same for saving, i.e. a bunch of _internal functions (_sign_model, _args_save, etc.) with a top-level save used in fasttext.py.
If this does not make sense, let's change it.
Collaborator: OK, if you intend for this function to be called from outside the module, then the name is fine. However, in that case, you must include a full docstring, including the parameters and their types. This is important information for people who will be using your module - they won't necessarily be familiar with the module's implementation details, so make the documentation as helpful as possible.
Also, if possible, assign default values to optional parameters, and save users the labor of working out what they mean. E.g. the encoding is usually utf8, and there's probably already a constant for that in that module somewhere.
I see that this function mostly wraps the internal _save_to_stream function, so perhaps move the docstring from there?

"""
Saves word embeddings in Facebook's native fastText `.bin` format.

Parameters
----------
fout: file name or writeable binary stream
stream to which model is saved
model: gensim.models.fasttext.FastText
saved model
fb_fasttext_parameters: dictionary
dictionary containing the `lr_update_rate` and `word_ngrams` parameters, which are
unused by the gensim implementation, so they have to be provided externally
encoding: str
encoding used in the output file

Notes
-----
Unfortunately, there is no documentation of Facebook's native fastText `.bin` format.

This is just a reimplementation of
[FastText::saveModel](https://github.com/facebookresearch/fastText/blob/master/src/fasttext.cc)

Based on v0.9.1, more precisely commit da2745fcccb848c7a225a7d558218ee4c64d5333

Code follows the original C++ code naming.
"""

if isinstance(fout, str):
with open(fout, "wb") as fout_stream:
_save_to_stream(model, fout_stream, fb_fasttext_parameters, encoding)
else:
_save_to_stream(model, fout, fb_fasttext_parameters, encoding)


if six.PY2:
codecs.register_error('backslashreplace', _backslashreplace_backport)
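
To close, a minimal roundtrip sketch, not part of the diff: it drives the module-level save() defined above directly and reloads the result with gensim's existing load_facebook_model. The toy corpus, the small bucket value, and the gensim 3.8-era keyword arguments are illustrative assumptions, not taken from this PR.

    from gensim.models import _fasttext_bin
    from gensim.models.fasttext import FastText, load_facebook_model

    sentences = [
        ["human", "interface", "computer"],
        ["survey", "user", "computer", "system", "response", "time"],
    ]

    # Train a tiny unsupervised model; the hyperparameters are illustrative only.
    model = FastText(size=10, window=3, min_count=1, bucket=100)
    model.build_vocab(sentences)
    model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)

    # lr_update_rate and word_ngrams are not tracked by gensim, so they are passed explicitly.
    _fasttext_bin.save(
        model, "toy_model.bin",
        fb_fasttext_parameters={"lr_update_rate": 100, "word_ngrams": 1},
        encoding="utf-8",
    )

    # Read the file back through the existing Facebook-format loader to confirm the roundtrip.
    reloaded = load_facebook_model("toy_model.bin")
    print(reloaded.wv["computer"][:5])

The end-user entry point in fasttext.py referred to in the review thread presumably wraps this same call, supplying the lr_update_rate and word_ngrams defaults.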