Support streaming models split into multiple files from S3 / GCS #1851
Sounds useful, thanks @JensMadsen, we know about this. P.S. workaround: save the model as one file, this should work fine.
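For illustration, a minimal sketch of that single-file workaround (model path hypothetical). `sep_limit` is the `utils.SaveLoad.save` threshold above which large numpy arrays are split out into separate `.npy` files, so raising it keeps everything in one pickle:

```python
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec.load("my_model")  # hypothetical path to an existing model

# Raise the split threshold above the size of any array in the model
# (here ~4 GB), so nothing is stored as a separate .npy file:
model.save("my_model_single", sep_limit=4 * 1024**3)
```

Note that, as reported later in this thread, passing `separately=[]` alone still produced two files; raising `sep_limit` is what yielded a single file.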
@menshikh-iv IIRC, over a certain array-size (2GB or maybe 4GB), single-file pickling will break, even in 64-bit Python. The numpy …
@gojomo …
@menshikh-iv I tried that but with no success. What I did was `separately=[]`, but I got two files. My model is less than 1 GB.
@JensMadsen can you show a code example, plus the names of the created files?
Ping @JensMadsen, can you share the code?
Sorry @menshikh-iv and @piskvorky for the delayed response. Saving the model generates four files: `<filename>.trainables.syn1neg.npy`, …
If I use `sep_limit=1000000 * 1024**2`, I get a single file, but its size is somewhat bigger than the summed size of the separate files.
That a single file is a little bigger, with pickle overhead, than the separate files isn't alone something to be concerned about. (Though as I noted previously, I believe single-file pickling breaks at some size around 2-4GB, even on 64-bit Pythons.) That even …
@gojomo thanks. As described in the original post, I serve the model on an AWS Lambda. I cannot stream the model from an S3 bucket when the model files are split, due to a bug. Therefore I try to save the model in a single file. However, AWS Lambdas only have 512 MB of disk space, which is not sufficient for me. I have therefore tried to use `delete_temporary_training_data` and then save the model, but that leads to even bigger files. Is there another way to achieve smaller model files? I do not need to continue training, but I do need `infer_vector`. Best, Jens
If you're never going to look up the vectors for those doc-tags supplied during training (as would happen for any …). If you're using plain DBOW mode during training (…).
@gojomo regarding …
@gojomo To reproduce that `delete_temporary_training_data()` does not lead to reduced file sizes, I simply …
I just reproduced it with a freshly trained model (gensim 3.5.0) with `dm=0, dbow_words=1, vector_size=400, window=8`.
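For anyone wanting to reproduce the comparison, a hedged sketch (gensim 3.x, where `delete_temporary_training_data()` still exists; the bundled toy corpus keeps absolute sizes small, but the before/after comparison is still visible):

```python
import os
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.test.utils import common_texts

docs = [TaggedDocument(words, [i]) for i, words in enumerate(common_texts)]
model = Doc2Vec(docs, dm=0, dbow_words=1, vector_size=400, window=8, min_count=1)

# Force single-file saves so the size comparison is apples-to-apples:
model.save("before.model", sep_limit=4 * 1024**3)
model.delete_temporary_training_data(keep_doctags_vectors=True, keep_inference=True)
model.save("after.model", sep_limit=4 * 1024**3)

print(os.path.getsize("before.model"), os.path.getsize("after.model"))
```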
@JensMadsen I think you mean @piskvorky. In my opinion the method shouldn't exist at all. This need isn't common-enough, or sufficiently well-supported, to justify a tempting public method. If some hacky-with-lots-of-caveats tricks for shrinking models are important for some users, those could be documented-with-disclaimers elsewhere – maybe an 'advanced tricks' notebook, or other findable help resource. OTOH, if such minimization is important enough to be a tested/supported feature of the models, then a larger, competent refactoring would be justified, where the code/objects are cleanly split into the various parts needed for different steps/end-uses.
OK. @menshikh-iv let's deprecate `delete_temporary_training_data`. I don't know if we can reverse that unfortunate refactoring by @manneshiva at this point, but that may be another option. A little drastic, but maybe safer. Either way, a cleanup of the inheritance / functionality abstractions across *2vec will be necessary; it's leaking too much.
@gojomo yes, sorry. I was typing from memory; I missed a "load". Sizes before and after are very similar, but after is a bit larger, e.g. 32.1 MB vs ~32.4 MB for a small test model.
@JensMadsen And what of a …
Too late, there is no good way to do this (if we want backward compatibility). I think much work was already done; the better way is to just try to fix the current refactoring (based on user feedback).
I would like to fix this issue.
Wondering if there are any plans to add support for this? I'm facing the same issue trying to use GCS for model storage, with the added twist that …
@hilaryp yeah, that looks weird. That code seems to have been added in this commit: e08af7b, something to do with loading Python 2 models in Python 3. I'd consider it a bug though, hardwiring … Will you be able to take this up & open a PR with a fix?
@piskvorky Sure, I can try to take a look soon. Thanks for pointing me to that wiki. Anything else I should be aware of? And should I open a separate issue for this?
No need for a new ticket at this point – we can always create one if it turns out to be a separate (pun intended) problem.
From a quick glance, it seems that the number of places that …
Reading directly would be nice – although …
Historically, the separate storage was also necessary to work around pickle size limits at the 2GB/4GB thresholds. That may be gone if we move to the newer pickle whenever possible - so that might be a suggested workaround for those wanting active-models-from-S3.
What do you mean by "newer pickle"?
Gensim's serialization utilities seem to prefer pickle protocol v2 (PEP 307, 2003, introduced in Python 2.3). It will cause errors if large arrays (but not very large by modern standards) aren't stored using the `separately` mechanism.

If I'm reading https://docs.python.org/3/library/pickle.html#data-stream-format correctly, pickle v4 (PEP 3154, 2011, available since Python 3.4 and becoming Python's default in Python 3.8) will support larger objects without error. It might make sense to make this the default for all new saves in gensim-4.0.0 and later - and thus a possible workaround for anyone needing a single-file save of a ginormous model would be to just tell them to disable `separately` …

(Pickle v5 adds "out-of-band" support, which might be a better way to handle the 'separately' functionality, but is only bundled in Python 3.8 and later. But it does have a fully-functional backport to 3.5, 3.6 and 3.7, if the benefits were large enough.)
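To make those protocol limits concrete, here is a small stdlib-only sketch (not from the thread) contrasting the two protocols on one oversized object. Caution: running it really allocates a ~5 GB array, plus a similarly large serialized blob:

```python
import pickle
import numpy as np

big = np.zeros(5 * 1024**3, dtype=np.uint8)  # a ~5 GB array (needs that much RAM!)

try:
    blob = pickle.dumps(big, protocol=2)
except Exception as err:  # on CPython, protocols < 4 refuse single objects over ~4 GiB
    print("protocol 2 failed:", err)

blob = pickle.dumps(big, protocol=4)  # 8-byte length prefixes: no 4 GiB ceiling
print("protocol 4 produced", len(blob), "bytes")
```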
Oh yes, that makes good sense. I'm sure we used pickle v2 to support forward compatibility (a model stored in Python 2.7 or Python 3.5 can still be loaded in Python 2.4, etc). With the Python changes in Gensim 4.0 (py3.6+) we don't have to worry about that as much: for storing we must support py3.6+; for loading we should still support older Pythons (at least py2.7, supported by Gensim 3.8.3).

So I'm +1 on changing the default pickle version to the highest protocol that py3.6 supports, and keeping it there for the 4.x series.

We could also store the current Gensim version within each pickle, to make our life easier when determining whether a loaded model needs "upgrading" (rather than rely on a forest of …).

I don't know anything about "out-of-band" pickle v5, will look into it, thanks.
We could add a …
Tackling #2863 (a general lifecycle log inside models) would store that version on each save as one of its events. For our purposes, trusting precise versions may be OK, and will always be useful in understanding any anomalies/reports. Though in many other contexts (like when web pages are customizing themselves for browsers), it's considered wiser to prefer attribute/behavior-probing over exact package/version testing, for flexibility around things like unforeseen backports/extensions/polyfills/etc.
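As a tiny illustration of that probing style (all attribute and function names here are hypothetical, not gensim API):

```python
# Illustration only: attribute-probing instead of version-testing.
def upgrade_if_needed(model):
    # Version-testing style (brittle around backports/patched builds):
    #     if stored_gensim_version < (4, 0): ...
    # Probing style: check for the capability itself and backfill it.
    if not hasattr(model, "lifecycle_events"):
        model.lifecycle_events = []  # backfill for models saved before the log existed
    return model
```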
A scrappy workaround for this would be to downgrade your gensim package to version 0.13.3 (the release before e08af7b was added), load the model back, and use …
@hilaryp More robustly, you could probably just …
Thanks for the suggestion @gojomo! Our workaround was to just copy everything into …
So to do this, I should just pickle and unpickle the model instead of using the default gensim save/load methods?
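For illustration, the round trip being asked about would look like this minimal sketch (model path hypothetical; whether a given gensim version's models survive a plain-pickle round trip isn't guaranteed, and you lose gensim's `separately`/mmap handling):

```python
import pickle

from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec.load("my_model")  # hypothetical existing model

with open("model.pkl", "wb") as f:
    pickle.dump(model, f, protocol=4)  # protocol 4 lifts the ~4 GiB limit

with open("model.pkl", "rb") as f:
    model = pickle.load(f)
```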
Streaming small d2v models from an S3 bucket works fine: simply insert the S3 address into `model.load`, e.g. `model.load('s3://…')`. However, when the model gets bigger and is split into multiple files, all files except the main model file cannot be loaded. These other files are loaded by numpy and not smart_open.
The essential part of my code is: …
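(The original snippet did not survive extraction; a minimal sketch of such a direct-from-S3 load, with hypothetical bucket and key names, looks like this. gensim hands the URI to smart_open, which is why small single-file models work:)

```python
from gensim.models.doc2vec import Doc2Vec

# Works for small, single-file models; fails once sidecar .npy files exist:
model = Doc2Vec.load("s3://my-bucket/models/my_model")
vec = model.infer_vector(["some", "tokenized", "document"])
```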
Trying to load from S3 with a bigger model yields: …
My use case is to serve the models on an AWS Lambda. My current workaround is to download all model files to a local folder and then load the model from there, which is rather slow.
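For illustration, a sketch of that download-then-load workaround (assuming boto3; bucket, key, and sidecar file names are hypothetical). On AWS Lambda only `/tmp` is writable, and it is capped at the 512 MB mentioned earlier:

```python
import boto3
from gensim.models.doc2vec import Doc2Vec

s3 = boto3.client("s3")
for name in ("my_model",
             "my_model.trainables.syn1neg.npy",
             "my_model.wv.vectors.npy"):      # sidecar file names are illustrative
    s3.download_file("my-bucket", "models/" + name, "/tmp/" + name)

# Once all pieces are local, the normal load works:
model = Doc2Vec.load("/tmp/my_model")
```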