
Support streaming models split into multiple files from S3 / GCS #1851

Closed

JensMadsen opened this issue Jan 23, 2018 · 34 comments · Fixed by #3065
Labels: bug (Issue described a bug) · difficulty medium (Medium issue: required good gensim understanding & python skills) · impact MEDIUM (Big annoyance for affected users) · reach MEDIUM (Affects a significant number of users)
Milestone: 4.0.0

Comments

@JensMadsen

Streaming small d2v models from an S3 bucket works fine: simply pass the S3 address to model.load, e.g. model.load('s3://...'). However, when the model gets bigger and is split into multiple files, every file except the main model file fails to load, because those auxiliary files are opened by numpy rather than smart_open.

The essential part of my code is

from gensim.models import Doc2Vec
from gensim.utils import simple_preprocess


def load_model(model_file):
    return Doc2Vec.load(model_file)


# infer the most similar training documents for an input string
def infer_docs(input_string, model_file, inferred_docs=5):
    model = load_model(model_file)
    processed_str = simple_preprocess(input_string, min_len=2, max_len=35)
    inferred_vector = model.infer_vector(processed_str)
    return model.docvecs.most_similar([inferred_vector], topn=inferred_docs)

Trying to load a bigger model from S3 yields:

[INFO]  2018-01-21T20:44:59.613Z    f2689816-feeb-11e7-b397-b7ff2947dcec    testing keys in event dict
[INFO]  2018-01-21T20:44:59.614Z    f2689816-feeb-11e7-b397-b7ff2947dcec    loading model from s3://data-d2v/trained_models/model_law
[INFO]  2018-01-21T20:44:59.614Z    f2689816-feeb-11e7-b397-b7ff2947dcec    loading Doc2Vec object from s3://data-d2v/trained_models/model_law
[INFO]  2018-01-21T20:44:59.650Z    f2689816-feeb-11e7-b397-b7ff2947dcec    Found credentials in environment variables.
[INFO]  2018-01-21T20:44:59.707Z    f2689816-feeb-11e7-b397-b7ff2947dcec    Starting new HTTPS connection (1): s3.eu-west-1.amazonaws.com
[INFO]  2018-01-21T20:44:59.801Z    f2689816-feeb-11e7-b397-b7ff2947dcec    Starting new HTTPS connection (2): s3.eu-west-1.amazonaws.com
[INFO]  2018-01-21T20:45:35.830Z    f2689816-feeb-11e7-b397-b7ff2947dcec    loading wv recursively from s3://data-d2v/trained_models/model_law.wv.* with mmap=None
[INFO]  2018-01-21T20:45:35.830Z    f2689816-feeb-11e7-b397-b7ff2947dcec    loading syn0 from s3://data-d2v/trained_models/model_law.wv.syn0.npy with mmap=None
[Errno 2] No such file or directory: 's3://data-d2v/trained_models/model_law.wv.syn0.npy': FileNotFoundError
Traceback (most recent call last):
  File "/var/task/handler.py", line 20, in infer_handler
    event['input_text'], event['model_file'], inferred_docs=10)
  File "/var/task/infer_doc.py", line 26, in infer_docs
    model = load_model(model_file)
  File "/var/task/infer_doc.py", line 21, in load_model
    return Doc2Vec.load(model_file)
  File "/var/task/gensim/models/word2vec.py", line 1569, in load
    model = super(Word2Vec, cls).load(*args, **kwargs)
  File "/var/task/gensim/utils.py", line 282, in load
    obj._load_specials(fname, mmap, compress, subname)
  File "/var/task/gensim/models/word2vec.py", line 1593, in _load_specials
    super(Word2Vec, self)._load_specials(*args, **kwargs)
  File "/var/task/gensim/utils.py", line 301, in _load_specials
    getattr(self, attrib)._load_specials(cfname, mmap, compress, subname)
  File "/var/task/gensim/utils.py", line 312, in _load_specials
    val = np.load(subname(fname, attrib), mmap_mode=mmap)
  File "/var/task/numpy/lib/npyio.py", line 372, in load
    fid = open(file, "rb")
FileNotFoundError: [Errno 2] No such file or directory: 's3://data-d2v/trained_models/model_law.wv.syn0.npy'

My use case is to serve the models from an AWS Lambda. My current workaround is to download all model files to a local folder and then load the model from there, which is rather slow.
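
For reference, a minimal sketch of that slow workaround, assuming boto3 credentials are configured (bucket and key prefix are from the log above; the exact set of .npy files depends on the save):

import boto3
from gensim.models import Doc2Vec

s3 = boto3.client('s3')

# Download the main model file plus every auxiliary .npy array to local disk
# (on AWS Lambda only /tmp is writable, and it is capped at 512 MB).
for key in ('model_law', 'model_law.wv.syn0.npy'):  # plus any other .npy files the save produced
    s3.download_file('data-d2v', 'trained_models/' + key, '/tmp/' + key)

model = Doc2Vec.load('/tmp/model_law')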

@menshikh-iv added the "feature" and "difficulty medium" labels on Jan 23, 2018
@menshikh-iv
Contributor

menshikh-iv commented Jan 23, 2018

Sounds useful, thanks @JensMadsen, we know about this.
I added the "feature" label; this will be really good for gensim.

P.S. Workaround: save the model as one file, that should work fine.

@gojomo
Collaborator

gojomo commented Jan 23, 2018

@menshikh-iv IIRC, over a certain array-size (2GB or maybe 4GB), single-file pickling will break, even in 64-bit Python.

numpy's load() does take a file-like object, so it might be possible to smart_open() the file from S3 and pass the stream to np.load(); but the numpy docs also suggest the file-like object must support seek(), which might be an issue.
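
A minimal sketch of that idea, assuming a recent smart_open whose S3 reader supports seek() (the key below is the one from the traceback):

import numpy as np
import smart_open

# np.load() accepts any seekable binary file-like object, so an S3 stream
# opened with smart_open can be handed to it directly.
with smart_open.open('s3://data-d2v/trained_models/model_law.wv.syn0.npy', 'rb') as fin:
    syn0 = np.load(fin)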

@menshikh-iv
Contributor

@gojomo smart_open already supports the seek operation for S3.

@JensMadsen
Author

@menshikh-iv I tried that, but with no success. I passed separately=[] but still got two files. My model is less than 1 GB.

@menshikh-iv
Contributor

@JensMadsen can you show a code example plus the names of the created files?

@piskvorky
Owner

Ping @JensMadsen can you share the code? separately=[] shouldn't be creating two files IMO.

@JensMadsen
Author

JensMadsen commented Aug 9, 2018

Sorry @menshikh-iv and @piskvorky for the delayed response.

model = gensim.models.Doc2Vec.load(filename)
model.save('./tmpdata/reduced_' + model_name, separately=[])

generates four files:

<filename>.trainables.syn1neg.npy
<filename>.wv.vectors.npy
<filename>.docvecs.vectors_docs.npy
<filename>

@JensMadsen
Author

If I use sep_limit=1000000 * 1024**2 I get a single file, but its size is somewhat bigger than the summed size of the separate files.

@gojomo
Collaborator

gojomo commented Aug 9, 2018

That a single file is a little bigger than the separate files, due to pickle overhead, isn't by itself something to be concerned about. (Though as I noted previously, I believe single-file pickling breaks at some size around 2-4GB, even on 64-bit Pythons.)

That even separately=[] results in multiple files may be an issue with the refactorings into subsidiary objects not adopting any separately settings from the container - that might be an inadequacy in the refactoring work, or an inherent ambiguity in how it should be handled with recursive SaveLoad.save() operations.

@JensMadsen
Author

@gojomo thanks. As described in the original post, I serve the model on an AWS Lambda. I cannot stream the model from an S3 bucket when the model files are split, due to this bug, so I try to save the model in a single file. However, AWS Lambdas only have 512 MB of disk space, which is not sufficient for me. I have therefore tried to use delete_temporary_training_data and then save the model, but that leads to even bigger files. Is there another way to achieve smaller model files? I do not need to continue training, but I do need infer_vector. Best, Jens

@gojomo
Collaborator

gojomo commented Aug 13, 2018

delete_temporary_training_data() is kind of a confused method, which with its defaults barely saves anything... but it should never make a model larger. (So, if you're seeing that, it may have been caused by something else.)

If you're never going to look up the vectors for those doc-tags supplied during training (as would happen for any model.docvecs.most_similar() operation), and are just using re-inferred vectors somewhere else, then you might be able to delete the model.docvecs.vectors_docs property without ill effect. If there were a lot of unique doctags in the training set, that might make a noticeable dent in model size.

If you're using plain DBOW mode during training (dm=0, dbow_words=0), then the word-vectors inside the model aren't really used, in training or later - so you might be able to delete the model.wv.vectors property without ill effect. (Or maybe even delete model.wv entirely, though it might still be consulted for maintaining the output layer, especially in negative-sampling models). But there could be problems with these approaches - test carefully in your setup – as the code hasn't consistently been designed/tested with such post-training minimization in mind.
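
A hedged sketch of that kind of post-training slimming, under all of the caveats above (gensim 3.x attribute names, plain-DBOW model, untested):

from gensim.models import Doc2Vec

model = Doc2Vec.load('model_law')  # hypothetical local filename

# Only safe if you never call model.docvecs.most_similar() with training doc-tags:
model.docvecs.vectors_docs = None

# Only worth considering for plain DBOW (dm=0, dbow_words=0), where the input
# word-vectors aren't used in training or inference:
model.wv.vectors = None

model.save('model_law_slim')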

@piskvorky
Owner

@gojomo regarding "delete_temporary_training_data() is kind of a confused method, which with its defaults barely saves anything": can you outline what's wrong with it and what it SHOULD do instead?

@JensMadsen
Author

JensMadsen commented Aug 13, 2018

@gojomo To reproduce that delete_temporary_training_data() does not lead to reduced file sizes, I simply run:

import gensim 
model = gensim.models.Doc2Vec(<filename>)
model.delete_temporary_training_data()
model.save(<new filename>)

I just reproduced it with a freshly trained model (gensim 3.5.0) with dm=0, dbow_words=1, vector_size=400, window=8.

@gojomo
Collaborator

gojomo commented Aug 13, 2018

@JensMadsen I think you mean load(<filename>), as there's no load-from-file constructor. But, what were the file sizes before and after? And, what if you try re-saving to a new filename before delete_temporary_training_data(), and then a third filename after? (I suspect you may see the same expansion in the re-save, because it's something else that's causing it, perhaps a patching-up of an older/partial model upon load. And then the post-'delete' save would save a tiny amount of space, as would be expected from its defaults of not-deleting-hardly-anything.)

@piskvorky In my opinion the method shouldn't exist at all. This need isn't common-enough, or sufficiently well-supported, to justify a tempting public method. If some hatchety-with-lots-of-caveats tricks for shrinking models are important for some users, those could be documented-with-disclaimers elsewhere – maybe an 'advanced tricks' notebook, or other findable help resource.

OTOH, if such minimization is important enough to be a tested/supported feature of the models, then a larger, competent, refactoring would be justified, where the code/objects are cleanly split into the various parts needed for different steps/end-uses.

@piskvorky
Owner

piskvorky commented Aug 13, 2018

OK. @menshikh-iv let's deprecate delete_temporary_training_data, along with any other half-implemented and not-so-well-thought-out methods.

I don't know if we can reverse that unfortunate refactoring by @manneshiva at this point, but that may be another option. A little drastic, but maybe safer. Either way, a cleanup of the inheritance / functionality abstractions across *2vec will be necessary; it's leaking too much.

@JensMadsen
Author

@gojomo yes, sorry. I was typing from memory and missed a load(). Sizes before and after are very similar, but after is a bit larger, e.g. 32.1 MB vs ~32.4 MB for a small test model.

@gojomo
Collaborator

gojomo commented Aug 15, 2018

@JensMadsen And what of a load() then save() without even doing a delete_temporary_training_data() in-between?

@menshikh-iv
Contributor

@piskvorky

I don't know if can reverse that unfortunate refactoring by Shiva at this point,

Too late; there is no good way to do this if we want backward compatibility. I think a lot of work was already done, and the better way is to just fix the current refactoring, based on user feedback.

@nperera0

I would like to fix this issue.
@JensMadsen did you find a solution that works?
@menshikh-iv can you give some pointers to get started on this?

@hilaryp

hilaryp commented Sep 22, 2020

Wondering if there are any plans to add support for this? I'm facing the same issue trying to use GCS for model storage, with the added twist that LdaModel apparently enforces storing attributes in separate files.

@piskvorky changed the title from "support streaming d2v/w2v models split into more files from s3 buckets" to "Support streaming models split into multiple files from S3 / GCS" on Sep 23, 2020
@piskvorky added the "impact MEDIUM" and "reach MEDIUM" labels on Sep 23, 2020
@piskvorky
Owner

piskvorky commented Sep 23, 2020

@hilaryp yeah that looks weird. That code seems to have been added in this commit: e08af7b, something to do with loading Python2 models in Python3:
https://github.com/RaRe-Technologies/gensim/wiki/Recipes-&-FAQ#q9-how-do-i-load-a-model-in-python-3-that-was-trained-and-saved-using-python-2

I'd consider it a bug though; hardwiring separately is not the way to do it.

Will you be able to take this up & open a PR with a fix?

@piskvorky added the "bug" label and removed the "feature" label on Sep 23, 2020
@hilaryp

hilaryp commented Sep 25, 2020

@piskvorky Sure, I can try to take a look soon. Thanks for pointing me to that wiki. Anything else I should be aware of? And should I open a separate issue for this?

@piskvorky
Owner

No need for a new ticket at this point – we can always create one if it turns out to be a separate (pun intended) problem.

@gojomo
Collaborator

gojomo commented Sep 26, 2020

From a quick glance, it seems that the number of places where np.load() is used in the code is limited, and those could be adapted to use a file-like object that reads from S3 instead. So this might not be very hard to support.

@piskvorky
Owner

piskvorky commented Sep 26, 2020

Reading directly would be nice – although mmap is impossible, and that's the main reason we store numpy arrays separately and use np.load().

@gojomo
Collaborator

gojomo commented Sep 28, 2020

Historically, the separate storage was also necessary to work around pickle size limits at the 2GB/4GB thresholds. That may be gone if we move to the newer pickle protocol whenever possible, so that might be a suggested workaround for those wanting active models from S3.

@piskvorky
Owner

piskvorky commented Sep 28, 2020

What do you mean by "newer pickle"?

@gojomo
Collaborator

gojomo commented Sep 28, 2020

Gensim serialization utilities seem to prefer pickle protocol v2 (PEP307, 2003, introduced in Python 2.3):

https://github.com/RaRe-Technologies/gensim/blob/e210f73c42c5df5a511ca27166cbc7d10970eab2/gensim/utils.py#L512

It will cause errors if large arrays (but not very large by modern standards) aren't stored using the separately option. (Some existing saves will even compress these 'separate' files, which ruins the potential for memory-mapping but still works around the pickle v2 problems.)

If I'm reading https://docs.python.org/3/library/pickle.html#data-stream-format correctly, pickle v4 (PEP3154, 2011, available since Python 3.4 and becoming Python's default in Python 3.8) will support larger objects without error. It might make sense to make this the default for all new saves in gensim-4.0.0 and later - and thus a possible workaround for anyone needing a single-file save of a ginormous model would be to just tell them to disable separately, and everything else will work.

(Pickle v5 adds "out of band" support which might be a better way to handle the 'separately' functionality, but is only bundled in Python 3.8 and later. But, it does have a fully-functional backport to 3.5, 3.6 and 3.7 if the benefits were large enough.)
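
For anyone who just needs the single-file save today, a hedged sketch using the existing SaveLoad.save() parameters (filenames hypothetical; the oversized sep_limit is the same trick @JensMadsen used above):

from gensim.models import Doc2Vec

model = Doc2Vec.load('model_law')  # an existing, locally saved model (hypothetical name)

# Keep every array inside the pickle by raising sep_limit above the model size,
# and use pickle protocol 4 to avoid the 2-4 GB limits of protocol 2.
model.save('model_law_single', sep_limit=4 * 1024**3, pickle_protocol=4)

# After uploading it to S3, the single-file model can be streamed back directly via smart_open:
model = Doc2Vec.load('s3://data-d2v/trained_models/model_law_single')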

@piskvorky
Owner

piskvorky commented Sep 29, 2020

Oh yes, that makes good sense. I'm sure we used pickle v2 to support forward compatibility (a model stored in Python 2.7 or Python 3.5 can still be loaded in Python 2.4, etc).

With the Python changes in Gensim 4.0 (py3.6+) we don't have to worry about that as much. For storing we must support py3.6+; for loading we should still support older Pythons (at least py2.7 supported by Gensim 3.8.3).

So I'm +1 on changing the default pickle version to the highest protocol that py3.6 supports. And keeping it there for the 4.x series.

We could also store the current Gensim version within each pickle, to make our life easier determining whether the loaded model needs "upgrading" (rather than rely on a forest of hasattrs).

I don't know anything about "out-of-band" pickle v5, will look into it, thanks.

@gojomo
Collaborator

gojomo commented Oct 5, 2020

So I'm +1 on changing the default pickle version to the highest protocol that py3.6 supports. And keeping it there for the 4.x series.

We could add a GENSIM_DEFAULT_PICKLE_PROTOCOL constant in gensim.utils, code-commented to explain the reason for the current choice, and have SaveLoad (and any other stray uses) generally consult that as a default. Protocol 4 seems safest, as it's already bundled with Python 3.4+, but 5 would also be an option given the availability of the pickle5 backport package.
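
A rough sketch of that proposal (hypothetical names and placement, not existing gensim code):

import pickle as _pickle
import smart_open

# Protocol 4 (available since Python 3.4) lifts the ~4 GB object-size limit of protocol 2.
GENSIM_DEFAULT_PICKLE_PROTOCOL = 4

def pickle(obj, fname, protocol=GENSIM_DEFAULT_PICKLE_PROTOCOL):
    """Pickle obj to fname (a local path or URI), using the project-wide default protocol."""
    with smart_open.open(fname, 'wb') as fout:
        _pickle.dump(obj, fout, protocol=protocol)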

We could also store the current Gensim version within each pickle, to make our life easier determining whether the loaded model needs "upgrading" (rather than rely on a forest of hasattrs).

Tackling #2863 (a general lifecycle log inside models) would store that version on each save as one of its events. For our purposes, trusting precise versions may be OK, and will always be useful in understanding any anomalies/reports, though in many other contexts (like when web pages are customizing themselves for browsers) it's considered wiser to prefer attribute/behavior-probing over exact package/version testing, for flexibility around things like unforeseen backports/extensions/polyfills/etc.

@saranggupta94

saranggupta94 commented Oct 6, 2020

Wondering if there are any plans to add support for this? I'm facing the same issue trying to use GCS for model storage, with the added twist that LdaModel apparently enforces storing attributes in separate files.

A scrappy workaround for this would be to downgrade your gensim package to version 0.13.3 (the release before e08af7b was added), load the model back, and save it with the separately=[] parameter to get it into a single file. You can read it back using the latest version; this seemed to work fine for me.

@gojomo
Collaborator

gojomo commented Oct 6, 2020

@hilaryp More robustly, you could probably just pickle the model with pickle_protocol 4 or greater, then load it via unpickling. Unless you're either working around older-pickle size limits or need the arrays stored separately (perhaps for memmapping, which is irrelevant from S3), a giant single-file pickle may work just fine, so the built-in save()/load() is superfluous.
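
In practice that workaround could look like this sketch (bucket and key are hypothetical; model is an already-loaded gensim model):

import pickle
import smart_open

# Dump the whole model as one pickle; protocol >= 4 handles multi-GB objects.
with smart_open.open('s3://my-bucket/models/model_law.pkl', 'wb') as fout:
    pickle.dump(model, fout, protocol=4)

# Load it back by unpickling the S3 stream directly, no temp files needed.
with smart_open.open('s3://my-bucket/models/model_law.pkl', 'rb') as fin:
    model = pickle.load(fin)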

@piskvorky added this issue to the 4.0.0 milestone on Oct 6, 2020
@hilaryp

hilaryp commented Oct 21, 2020

Thanks for the suggestion @gojomo! Our workaround was to just copy everything into /tmp but I like this solution better. I was going to look into undoing the hardwiring of separately, but after reading this thread, I don't think I'm familiar enough with the backwards-compatibility issues to take it on.

@matthewcn56

matthewcn56 commented Aug 15, 2021

So to do this I should just pickle and unpickle the model instead of using the default gensim save/load methods?
EDIT: Yup I resolved it by just pickling/unpickling!
