Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dealing with typeError: must be str, not bytes #2125

Closed
wants to merge 2 commits into from
Closed

Dealing with typeError: must be str, not bytes #2125

wants to merge 2 commits into from

Conversation

FMCumhaill
Copy link

Dealing with typeError: must be str, not bytes which occurred for me when running this file. https://stackoverflow.com/questions/5512811/builtins-typeerror-must-be-str-not-bytes

This may be a local requirement and therefore not sufficient for a pull request?

Required to correctly open the files on a window machine (I believe?)
@piskvorky
Copy link
Owner

piskvorky commented Jul 9, 2018

All I/O writing or reading should definitely be happening in binary mode (rb, wb). With explicit encoding conversions for text where necessary. This is true for both Python 2 and Python 3.

@loretoparisi was there a particular reason for the w+? Can we replace it with wb?

One more request: @FMCumhaill can you please replace all calls to open() with smart_open()? (from smart_open import smart_open)

The reason is that in Gensim, we're trying to abstract away from the storage, so people can save files to S3, HDFS, compressed etc transparently, using the smart_open library.

@FMCumhaill
Copy link
Author

@piskvorky done as requested! 👍

@@ -67,8 +68,8 @@ def word2vec2tensor(word2vec_model_path, tensor_filename, binary=False):
outfiletsv = tensor_filename + '_tensor.tsv'
outfiletsvmeta = tensor_filename + '_metadata.tsv'

with open(outfiletsv, 'w') as file_vector:
with open(outfiletsvmeta, 'wb') as file_metadata:
with smart_open(outfiletsv, 'w') as file_vector:
Copy link
Owner

@piskvorky piskvorky Jul 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No w -- always use binary I/O (rb, wb) please. Encode text to utf8 where necessary.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing that w to wb gives me the following error:

Traceback (most recent call last):
File "word2vec2tensor.py", line 94, in
word2vec2tensor(args.input, args.output, args.binary)
File "word2vec2tensor.py", line 76, in word2vec2tensor
file_vector.write(vector_row + '\n')
TypeError: a bytes-like object is required, not 'str'

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes -- please encode text to binary (utf8) where necessary.

@@ -44,6 +44,7 @@
import argparse

import gensim
from smart_open import smart_open
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add a test for word2vec2tensor function (for avoid this issue in future)

@@ -67,8 +68,8 @@ def word2vec2tensor(word2vec_model_path, tensor_filename, binary=False):
outfiletsv = tensor_filename + '_tensor.tsv'
outfiletsvmeta = tensor_filename + '_metadata.tsv'

with open(outfiletsv, 'w+') as file_vector:
with open(outfiletsvmeta, 'w+') as file_metadata:
with smart_open(outfiletsv, 'w') as file_vector:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

must be wb

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, good idea to make a code more "flat", like this

with smart_open(...) as file_vector, smart_open(...) as file_metadata:
    ...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@menshikh-iv sorry this escaped me! ATM I don't have time to finish it sorry. I have no idea how to add a test and don't have time to research it. My apologies!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@FMCumhaill That's sad, but anyway, thanks for the answer, good luck!

@menshikh-iv
Copy link
Contributor

@FMCumhaill thanks for PR, do you plan to finish it?

@menshikh-iv
Copy link
Contributor

Close by #2125 (comment) reason

@piskvorky
Copy link
Owner

piskvorky commented Aug 3, 2018

@menshikh-iv this is a bug, please don't close like this. We need to fix this either way (and the fix looks really trivial, including smart_open).

@loretoparisi was there a particular reason for the w+? Can we replace it with wb?

@menshikh-iv
Copy link
Contributor

@piskvorky we need fix bugs, I agree, but this is PR, not a issue, if contributor don't have time to complete (and isn't close to finish) - I close it without "almost complete" badge.

Also, we already have opened issue for this bug #1958

About "amost_complete" - I don't agree, because:

  • incorrect file modes
  • missing tests
  • 3 lines diff where correct only one (import), I don't think that we should mark it with badge

@piskvorky
Copy link
Owner

OK. Let me remove the label.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants