Different behaviour with `open` #207

menshikh-iv · 2018-07-05T16:03:10Z

Problem: smart_open==1.6.0 doesn't work correctly with numpy.frombuffer function (numpy < 1.13.0, preferable to test with numpy==1.11.3, python2.7).

Example:
env:

numpy==1.11.3
gensim==3.4.0
smart_open==1.6.0

from gensim.models import FastText
from gensim.test.utils import datapath

m = FastText.load_fasttext_format(datapath("lee_fasttext"))

Trace

IOError                                   Traceback (most recent call last)
<ipython-input-1-afdc56d6c4b5> in <module>()
      2 from gensim.test.utils import datapath
      3 
----> 4 m = FastText.load_fasttext_format(datapath("lee_fasttext"))

/home/ivan/release/gensim/gensim/models/fasttext.py in load_fasttext_format(cls, model_file, encoding)
    697             model_file += '.bin'
    698         model.file_name = model_file
--> 699         model.load_binary_data(encoding=encoding)
    700         return model
    701 

/home/ivan/release/gensim/gensim/models/fasttext.py in load_binary_data(self, encoding)
    712             self._load_model_params(f)
    713             self._load_dict(f, encoding=encoding)
--> 714             self._load_vectors(f)
    715 
    716     def _load_model_params(self, file_handle):

/home/ivan/release/gensim/gensim/models/fasttext.py in _load_vectors(self, file_handle)
    819 
    820         self.num_original_vectors = num_vectors
--> 821         self.wv.vectors_ngrams = np.fromfile(file_handle, dtype=dtype, count=num_vectors * dim)
    822         self.wv.vectors_ngrams = self.wv.vectors_ngrams.reshape((num_vectors, dim))
    823         assert self.wv.vectors_ngrams.shape == (

IOError: first argument must be an open file

If I replace

with utils.smart_open(self.file_name, 'rb') as f:

from https://github.com/RaRe-Technologies/gensim/blob/875b028ccd009fe2fa7665e177b8ab0b5e2dc40d/gensim/models/fasttext.py#L711
to

with open(self.file_name, 'rb') as f:

this will works correctly (also, it works correctly with smart_open==1.5.6, i.e. bug from 1.6.0).

The text was updated successfully, but these errors were encountered:

piskvorky · 2018-07-05T16:05:08Z

Let me add that the bug is the violation of the smart_open contract:

"For local uncompressed files, smart_open() is always a drop-in replacement for open(). There is never a reason not to use smart_open() instead of open()."

mpenkov · 2018-07-06T00:32:36Z

We have two important contracts: 1. Quack like open (as you mentioned above) 2. Support both Py2 and Py3 Unfortunately, these two contracts conflict because the open functions differ between the major Py versions. Py3 accepts many more parameters and support for encodings. How should we manage this conflict?

…

On Fri, Jul 6, 2018 at 1:07 Ivan Menshikh ***@***.***> wrote: Assigned #207 <#207> to @mpenkov <https://github.com/mpenkov>. — You are receiving this because you were assigned. Reply to this email directly, view it on GitHub <#207 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABDOVHT75SqPjFlhm5uj7cyLG8pwJJkIks5uDjmkgaJpZM4VEJr0> .

menshikh-iv · 2018-07-06T01:29:35Z

@mpenkov "quack" like a python3 always maybe (this still "contradiction", but looks like best solution)?

mpenkov · 2018-07-06T06:43:21Z

That’s going to be hard. Look at this example: open a local file with koi8-r encoding. Under Py3, you just call the built-in open and return a file descriptor. This satisfies the contract that Radim mentioned. Everyone is happy. Under Py2, the built-in open does not support encodings, so you have to deal with it yourself. You _could_ wrap it in a codec layer, but then you’d violate the contract: the return would not be a true file descriptor. What should we do in this particular case?

…

On Fri, Jul 6, 2018 at 10:29 Ivan Menshikh ***@***.***> wrote: @mpenkov <https://github.com/mpenkov> "quack" like a python3 always (this still "contradiction", but looks like best solution)? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#207 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABDOVEg3lvKeqxRH6cePs4IkAwbaMRiCks5uDr1_gaJpZM4VEJr0> .

…n until piskvorky/smart_open#207 will be fixed

piskvorky · 2018-07-06T07:46:45Z

smart_open should work like open for local uncompressed files (no matter which Python).

If Python 2 doesn't support some parameters, we don't have to support them either. We have no contract or ambition to provide some "compatibility layer" between python 2 and 3. We want minimum surprises.

menshikh-iv · 2018-07-06T10:03:15Z

@piskvorky open from python2 or python3? I agree with @mpenkov, this isn't clear: for example, we'll simply use standard python open - we'll have no encoding argument for python2, that's not our goal.

Probably, this should be affected only for trivial variants, i.e. if user pass parameters matched to
https://docs.python.org/2/library/functions.html#open (on python2) or
https://docs.python.org/3/library/functions.html#open (on python3) - we should return open(...), otherwise (non-trivial case) - use current "mechanism"

piskvorky · 2018-07-06T16:16:58Z

Not sure I understand. What I'm saying is that if native open doesn't support a parameter, neither do we.

I mean, if we support something extra, that's nice. But definitely not at the cost of breaking the "drop-in contract".

mpenkov · 2018-07-06T23:45:50Z

We _do_ support an encoding parameter, regardless of what the built-in open supports. That’s part of the problem.

…

On Sat, Jul 7, 2018 at 01:16 Radim Řehůřek ***@***.***> wrote: Not sure I understand. What I'm saying is that if native open doesn't support a parameter, neither do we. I mean, if we support something extra, that's nice. But definitely not at the cost of breaking the "drop-in contact". — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#207 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABDOVDX8hp-0Kq6nQnjWlOPLJ3BU4-0Kks5uD416gaJpZM4VEJr0> .

mpenkov · 2018-07-06T23:53:27Z

Going back to the example I gave previously, if we receive an encoding parameter when the native open doesn’t support one (Py2), we can: 1) perform our own decoding, violating the contract 2) ignore the parameter, but keep to the contract 3) explode with an error 4) ... anything else? How do you think we should handle the above situation?

…

On Sat, Jul 7, 2018 at 08:45 Misha Penkov ***@***.***> wrote: We _do_ support an encoding parameter, regardless of what the built-in open supports. That’s part of the problem. On Sat, Jul 7, 2018 at 01:16 Radim Řehůřek ***@***.***> wrote: > Not sure I understand. What I'm saying is that if native open doesn't > support a parameter, neither do we. > > I mean, if we support something extra, that's nice. But definitely not at > the cost of breaking the "drop-in contact". > > — > You are receiving this because you were mentioned. > > > Reply to this email directly, view it on GitHub > <#207 (comment)>, > or mute the thread > <https://github.com/notifications/unsubscribe-auth/ABDOVDX8hp-0Kq6nQnjWlOPLJ3BU4-0Kks5uD416gaJpZM4VEJr0> > . >

piskvorky · 2018-07-07T05:11:06Z

Both 1) and 3) are OK. 1) doesn't really violate the "drop-in" contract, since there is no original to violate (open wouldn't even work with the unknown parameter).

Note however that this is not the case that triggered this issue. There was no encoding parameter -- the original open worked fine.

mpenkov · 2018-07-07T05:50:01Z

(so27) ubuntu@LAPTOP-S7M90K9N:~/git/smart_open$ pip install numpy==1.11.3
Collecting numpy==1.11.3
  Downloading https://files.pythonhosted.org/packages/68/e8/d68e1c47a2bd3f03cb81f1e635a55b776e51e4746f7204d541f6e3e9ffab/numpy-1.11.3-cp27-cp27mu-manylinux1_x86_64.whl (15.4MB)
    100% |████████████████████████████████| 15.4MB 705kB/s
Installing collected packages: numpy
  Found existing installation: numpy 1.14.5
    Uninstalling numpy-1.14.5:
      Successfully uninstalled numpy-1.14.5
Successfully installed numpy-1.11.3
(so27) ubuntu@LAPTOP-S7M90K9N:~/git/smart_open$ python integration-tests/test_208.py
Traceback (most recent call last):
  File "integration-tests/test_208.py", line 25, in <module>
    loaded = np.fromfile(fin)
IOError: first argument must be an open file
(so27) ubuntu@LAPTOP-S7M90K9N:~/git/smart_open$ pip install numpy==1.14.5
Collecting numpy==1.14.5
  Using cached https://files.pythonhosted.org/packages/6a/a9/c01a2d5f7b045f508c8cefef3b079fe8c413d05498ca0ae877cffa230564/numpy-1.14.5-cp27-cp27mu-manylinux1_x86_64.whl
Installing collected packages: numpy
  Found existing installation: numpy 1.11.3
    Uninstalling numpy-1.11.3:
      Successfully uninstalled numpy-1.11.3
Successfully installed numpy-1.14.5
(so27) ubuntu@LAPTOP-S7M90K9N:~/git/smart_open$ python integration-tests/test_208.py
OK
(so27) ubuntu@LAPTOP-S7M90K9N:~/git/smart_open$

I can reproduce this problem with numpy-1.11.3. With newer numpy, it's no longer a problem.

@piskvorky

… of `min_count` for `gensim.models.Word2Vec`. Fix #465 (#1915) Andrey Kutuzov (2): Fix OOV pairs counter in `WordEmbeddingsKeyedVectors.evaluate_word_pairs` (#1934) Add `evaluate_word_analogies` (will replace `accuracy`) method for `gensim.models.KeyedVectors` (#1935) Aneesh Joshi (3): Add windows venv activate command to `CONTRIBUTING.md` (#1880) Fix deprecation warning from `inspect.getargspec`. Fix #1878 (#1887) Allow initialization with `max_final_vocab` in lieu of `min_count` for `gensim.models.Word2Vec`. Fix #465 (#1915) Chaitali Saini (1): Update rules for removing table markup from Wikipedia dumps. Fix #1710 (#1954) Dennis.Chen (1): Fix inheritance chain for `load_word2vec_format` (return correct class in case when you create an child class based on kv) (#1968) Dmitry (5): Refactor API reference `gensim.corpora`. Partial fix #1671 (#1835) Refactor documentation for `gensim.similarities.docsim` and `MmCorpus-related`. (#1910) Refactor documentation for `gensim.models.coherencemodel` (#1933) Refactor documentation for `gensim.models.phrases` (#1950) Fix format & links for `gensim.similarities.docsim` (#2030) Dmitry Persiyanov (1): Add `gensim.models.BaseKeyedVectors.add_entity` method for fill `KeyedVectors` in manual way. Fix #1942 (#1957) Fernando Camargo (1): Add `ns_exponent` parameter to control the negative sampling distribution for `*2vec` models. Fix #2090 (#2093) Gordon Mohr (1): Fix `Doc2Vec.infer_vector`, notebook cleanup (#2103) Gyanesh Malhotra (1): Fix docstrings for`gensim.models.hdpmodel`, `gensim.models.lda_worker` & `gensim.models.lda_dispatcher` (#1912) Ibrahim Sharaf ElDen (1): Store images from `README.md` directly in repository. Fix #1849 (#1861) Ivan Menshikh (2): Fix PEP8 in `HashDictionary` Disable google-style docstring support. Fix #1663 (#2106) Jayant Jain (1): Fix negative sampling floating-point error for `gensim.models.PoincareModel`. Fix #1917 (#1959) Johannes Baiter (1): Fix method `estimate_memory` from `gensim.models.FastText` & huge performance improvement. Fix #1824 (#1916) Jonathan Hourany (1): Fixed Typo and increased performance in analyze_sentence (#2070) Kento NOZAWA (1): Fix example block for `gensim.models.Word2Vec` (#1876) Kumar Akshay (1): Fix documentation for `gensim.models.wrappers` (#1859) Menshikh Ivan (5): Fix `test_similarities.py` (#1928) Add flag for skip network-related tests (#1930) Fix encoding in Lee corpus reader (#1931) Fix Keras version (avoid bug from `keras==2.1.5`) (#1963) Fix quoting that break `doc2vec-IMDB` notebook Mohit Rathore (1): Add Pivot Normalization for `gensim.models.TfidfModel`. Fix #220 (#1780) Mritunjay Mohitesh (1): Fix deprecated parameters in `D2VTransformer` and `W2VTransformer`. Fix #1937 (#1945) Nils Werner (1): Add license field to `setup.py` (#1909) Oliver Price (1): Fix return dtype for `matutils.unitvec` according to input dtype. Fix #1722 (#1992) Orion Montoya (1): Fix parameter description of `sg` parameter for `gensim.models.word2vec` (#1919) Pete Bleackley (1): Fix SMART from TfidfModel for case when `df == "n"`. Fix #2020 (#2021) Pushpankar Kumar Pushp (1): Fix datatype parameter for `KeyedVectors.load_word2vec_format`. Fix #1682 (#1819) Radim Řehůřek (8): fix logging formatting in downloader fixes to HashDictionary more fixes to broken formatting minor wording change Merge pull request #2073 from RaRe-Technologies/hashdictionary_docs Fix documentation for `*2vec` models (#2087) Fix documentation for various modules (#2096) Update non-API docs (about, intro, etc) (#2101) Rob Malouf (1): Fix `_is_single` from `Phrases` for case when corpus is numpy array (#1987) Samyak Jain (2): Fix empty output bug in `Phrases`. Fix #1401 (#1853) Fix file-like closing bug from `gensim.corpora.MmCorpus`. Fix #1869 (#1911) Sharan Yalburgi (3): Add anaconda-cloud badge. Partial fix #1901 (#1905) Add method that show base installation info of Gensim & related packages. Fix #1902 (#1903) Replace open() with smart_open() in notebooks. Fix #1789 (#1812) Shiva Manne (4): Add `wv` property to KeyedVectors (for backward compatibility). Fix #1882 (#1884) Adds `LabeledSentence` to `gensim.models.doc2vec` (for backward compatibility). Fix #1886 (#1891) Fix `Doc2Vec.infer_vector` after loading old `Doc2Vec` (`gensim<=3.2`). Fix #1952 (#1974) Fixes issues while loading `word2vec` and `doc2vec` models saved using old Gensim versions. Fix #2000, #1977 (#2012) Sourav Singh (1): Fix docstrings for `gensim.models.AuthorTopicModel` (#1907) Stamenov (1): Add inference for new unseen author for `gensim.models.AuthorTopicModel` (#1766) Stergiadis Manos (4): Fix docstrings for lsi-related code (#1892) Fix docstrings for `gensim.sklearn_api`. Fix #1667 (#1895) Document LDA-related models (#2026) Allow pass empty dictionary to `gensim.corpora.WikiCorpus`. Fix #2052 (#2042) TheFlash10 (1): Fix deprecated parameters in doc2vec-lee notebook (#1918) Umang Varma (1): Fix linear decay for learning rate in `Doc2Vec.infer_vector`. Fix #2061 (#2063) Utkarsh Mishra (1): Fix `D2VTransformer.fit_transform`. Fix #1834 (#1845) Vít Novotný (4): Implement Soft Cosine Measure (#1827) Fix misinformation in docstring of `gensim.models.KeyedVectors.similarity_matrix`. Fix #1960 (#1971) Fix `SoftCosineSimilarity.get_similarities` on corpora. Fix #1955 (#1972) Fix tests for `EuclideanKeyedVectors.similarity_matrix`. Fix #1961 (#1984) Yuri Isakov (3): Refactor docstrings for `gensim.scripts`. Partial fix #1665 (#1792) Fix docstrings for `gensim.test.utils` (#1904) Fix docstrings for `gensim.interfaces` (#1913) arlenk (3): Add Cython version of `MmReader` (#1825) Add cython version for "hot" functions from `gensim.models.LdaModel` (#1767) Fix OverflowError when loading a large term-document matrix in MatrixMarket format. Fix #1998 (#2001) bohea (1): Fix bug in `Similarity.query_shards` in multiprocessing case (#2044) darindf (3): Fix python 3 compatibility for `gensim.corpora.UciCorpus.save_corpus` (#1875) Remove duplication of class documentation for `IndexedCorpus` (#2033) Add `dtype` argument for `chunkize_serial`, fix `LdaModel` (#2027) ivan (12): Merge branch 'master' into develop bump version to 3.4.0 regenerate C files with cython==0.27 Merge branch 'release-3.4.0' Merge branch 'master' into develop apply fixes for distributed mode lda/lsi from @piskvorky #2102 remove smart_open limitation from setup.py, replace smart_open -> open until piskvorky/smart_open#207 will be fixed fix PEP8 issues bump version to 3.5.0 bump changelos to 3.5.0 + add missing changelog for 3.4.0 regenerated C files with Cython Merge branch 'release-3.5.0' numericlee (1): Fix `doc2vec-lee` notebook (#1870)

* added integration test * Resolving issue #207: use built-in open whenever possible * rename integration test using correct issue number * update unit tests * isolate failing test case * Updated unit tests Changed the mock target to keep in step with the shortcut_open function. Split tests into separate test cases. * respond to review comments - use tempfile instead of hard-coded file - integrate new test into travis.yml so that Travis runs it * added support for buffering parameter

menshikh-iv · 2018-08-13T03:30:21Z

Fixed by #208

menshikh-iv added the bug label Jul 5, 2018

menshikh-iv assigned mpenkov Jul 5, 2018

menshikh-iv added a commit to piskvorky/gensim that referenced this issue Jul 6, 2018

remove smart_open limitation from setup.py, replace smart_open -> ope…

abd76db

…n until piskvorky/smart_open#207 will be fixed

mpenkov added a commit that referenced this issue Jul 7, 2018

Resolving issue #207: use built-in open whenever possible

453b184

mpenkov mentioned this issue Jul 27, 2018

Use built-in open whenever possible #208

Merged

menshikh-iv closed this as completed Aug 13, 2018

rsdel2007 mentioned this issue Jan 15, 2019

Replace open() by smart_open() in gensim.models.fasttext._load_fasttext_format piskvorky/gensim#2335

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Different behaviour with `open` #207

Different behaviour with `open` #207

menshikh-iv commented Jul 5, 2018 •

edited

piskvorky commented Jul 5, 2018 •

edited

mpenkov commented Jul 6, 2018 via email

menshikh-iv commented Jul 6, 2018 •

edited

mpenkov commented Jul 6, 2018 via email

piskvorky commented Jul 6, 2018 •

edited

menshikh-iv commented Jul 6, 2018

piskvorky commented Jul 6, 2018 •

edited

mpenkov commented Jul 6, 2018 via email

mpenkov commented Jul 6, 2018 via email

piskvorky commented Jul 7, 2018 •

edited

mpenkov commented Jul 7, 2018

menshikh-iv commented Aug 13, 2018

Different behaviour with open #207

Different behaviour with open #207

Comments

menshikh-iv commented Jul 5, 2018 • edited

piskvorky commented Jul 5, 2018 • edited

mpenkov commented Jul 6, 2018 via email

menshikh-iv commented Jul 6, 2018 • edited

mpenkov commented Jul 6, 2018 via email

piskvorky commented Jul 6, 2018 • edited

menshikh-iv commented Jul 6, 2018

piskvorky commented Jul 6, 2018 • edited

mpenkov commented Jul 6, 2018 via email

mpenkov commented Jul 6, 2018 via email

piskvorky commented Jul 7, 2018 • edited

mpenkov commented Jul 7, 2018

menshikh-iv commented Aug 13, 2018

Different behaviour with `open` #207

Different behaviour with `open` #207

menshikh-iv commented Jul 5, 2018 •

edited

piskvorky commented Jul 5, 2018 •

edited

menshikh-iv commented Jul 6, 2018 •

edited

piskvorky commented Jul 6, 2018 •

edited

piskvorky commented Jul 6, 2018 •

edited

piskvorky commented Jul 7, 2018 •

edited