gensim.scripts.word2vec2tensor TypeError: write() argument must be str, not bytes #1958

Closed
ttpro1995 opened this issue Mar 7, 2018 · 7 comments · Fixed by #2147
Labels
bug Issue described a bug difficulty easy Easy issue: required small fix

Comments

@ttpro1995

ttpro1995 commented Mar 7, 2018

Python environment

Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19) 
[GCC 7.2.0] on linux

How I created article_body_w2v_300.txt

import gensim
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

sentences = LineSentence("./data/article_body_corpus.txt")

model = Word2Vec(sentences, size=300, window=5, min_count=5, workers=4)

model.wv.save_word2vec_format("article_body_w2v_300.txt", binary=False)

Command I used to run gensim.scripts.word2vec2tensor

python -m gensim.scripts.word2vec2tensor -i article_body_w2v_300.txt -o meow/

Console output

word_embedding  python -m gensim.scripts.word2vec2tensor -i article_body_w2v_300.txt -o meow/
2018-03-07 16:30:29,484 - word2vec2tensor - INFO - running /home/cpu11453local/anaconda3/envs/gensim/lib/python3.6/site-packages/gensim/scripts/word2vec2tensor.py -i article_body_w2v_300.txt -o meow/
2018-03-07 16:30:29,484 - utils_any2vec - INFO - loading projection weights from article_body_w2v_300.txt
2018-03-07 16:30:41,992 - utils_any2vec - INFO - loaded (56543, 300) matrix from article_body_w2v_300.txt
Traceback (most recent call last):
  File "/home/cpu11453local/anaconda3/envs/gensim/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/cpu11453local/anaconda3/envs/gensim/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/cpu11453local/anaconda3/envs/gensim/lib/python3.6/site-packages/gensim/scripts/word2vec2tensor.py", line 93, in <module>
    word2vec2tensor(args.input, args.output, args.binary)
  File "/home/cpu11453local/anaconda3/envs/gensim/lib/python3.6/site-packages/gensim/scripts/word2vec2tensor.py", line 73, in word2vec2tensor
    file_metadata.write(gensim.utils.to_utf8(word) + gensim.utils.to_utf8('\n'))
TypeError: write() argument must be str, not bytes
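
The failure can be reproduced without gensim. The following is a minimal sketch, assuming Python 3: a file opened in text mode ('w+') accepts only str, whereas gensim.utils.to_utf8 returns bytes. The temporary file and the sample word are illustrative stand-ins, not part of the original script.

```python
import tempfile

# Stands in for gensim.utils.to_utf8(word): a UTF-8 encoded bytes object.
word_bytes = u"caf\xe9".encode("utf-8")

# Text mode, like open(outfiletsvmeta, 'w+') in the script.
with tempfile.TemporaryFile("w+") as text_file:
    try:
        text_file.write(word_bytes)  # bytes passed to a text-mode file
        error_message = None
    except TypeError as exc:
        error_message = str(exc)

print(error_message)
```

On Python 2 the same write succeeds, because str and bytes are the same type there, which may explain why the script appeared to work before.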

@menshikh-iv
Contributor

Hello @ttpro1995, thanks for the report. Can you try running gensim.scripts.word2vec2tensor with Python 2? I have an idea of what is happening here.

@menshikh-iv menshikh-iv added bug Issue described a bug difficulty easy Easy issue: required small fix labels Mar 7, 2018
@ttpro1995
Author

On Python 2.7, it worked without needing any fix.

Python 2.7.14 |Anaconda, Inc.| (default, Dec  7 2017, 17:05:42) 
[GCC 7.2.0] on linux2

On Python 3.6, it needs a fix.

Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19) 
[GCC 7.2.0] on linux

gensim.utils.to_utf8(word) returns bytes, but write() needs a str. So I added decode("utf-8"):

    with open(outfiletsv, 'w+') as file_vector:
        with open(outfiletsvmeta, 'w+') as file_metadata:
            for word in model.index2word:
                file_metadata.write(gensim.utils.to_utf8(word).decode("utf-8") + gensim.utils.to_utf8('\n'))
                vector_row = '\t'.join(str(x) for x in model[word])
                file_vector.write(vector_row + '\n')

With this change, it works on Python 3.6. However, the fix does not work on Python 2.7.
Running on Python 2.7 with decode("utf-8") gives:

 python gensim2tensor.py -i article_body_w2v_300.txt -o meow/                 
2018-03-08 16:30:51,521 - gensim2tensor - INFO - running gensim2tensor.py -i article_body_w2v_300.txt -o meow/
2018-03-08 16:30:51,521 - utils_any2vec - INFO - loading projection weights from article_body_w2v_300.txt
2018-03-08 16:31:08,798 - utils_any2vec - INFO - loaded (56543, 300) matrix from article_body_w2v_300.txt
Traceback (most recent call last):
  File "gensim2tensor.py", line 74, in <module>
    word2vec2tensor(args.input, args.output, args.binary)
  File "gensim2tensor.py", line 54, in word2vec2tensor
    file_metadata.write(gensim.utils.to_utf8(word).decode("utf-8") + gensim.utils.to_utf8('\n'))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 1: ordinal not in range(128)
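
One way to get the same behaviour on Python 2.7 and 3.6 is to open both files in binary mode and encode every line explicitly, so that no implicit ASCII conversion takes place. The sketch below is an illustration under stated assumptions, not the patch that was eventually merged: `write_projector_tsv` is a hypothetical helper, the words are assumed to be unicode text, and plain lists stand in for `model.index2word` and the model's vectors.

```python
import io
import os
import tempfile


def write_projector_tsv(words, vectors, vector_path, metadata_path):
    """Write vectors and their labels as UTF-8 TSV files (hypothetical helper)."""
    # Binary mode: write() takes bytes on both Python 2 and Python 3.
    with io.open(vector_path, "wb") as file_vector, \
            io.open(metadata_path, "wb") as file_metadata:
        for word, vector in zip(words, vectors):
            # Encode explicitly instead of relying on the default codec.
            file_metadata.write((word + u"\n").encode("utf-8"))
            vector_row = u"\t".join(u"%s" % x for x in vector)
            file_vector.write((vector_row + u"\n").encode("utf-8"))


# Usage with made-up data, including the non-ASCII character from the traceback.
out_dir = tempfile.mkdtemp()
vector_path = os.path.join(out_dir, "tensor.tsv")
metadata_path = os.path.join(out_dir, "metadata.tsv")
write_projector_tsv([u"\xe0", u"word"], [[0.5, 1.0], [2.0, 3.0]],
                    vector_path, metadata_path)

with io.open(metadata_path, encoding="utf-8") as f:
    metadata_lines = f.read().splitlines()
with io.open(vector_path, encoding="utf-8") as f:
    vector_lines = f.read().splitlines()
```

Writing bytes keeps the encoding decision in one visible place. Opening the files in text mode with an explicit encoding="utf-8" via io.open would be an equally reasonable choice.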

@menshikh-iv
Contributor

@ttpro1995 aha, as expected. Big thanks, that's really a bug.

@menshikh-iv
Contributor

@vsocrates
Contributor

Hi, sorry, I deleted my comment when I saw that @AakaashRao had made a PR as well, but I will finish it up. I will submit a PR when it's done!

@menshikh-iv
Contributor

@vsocrates right now nobody is working on this fix, so again, feel free to submit a PR.

@vsocrates
Contributor

@menshikh-iv submitted. A quick note: I had to force the data type to be float64 in this line to pass the test with the test data we have now. Please review, thanks!

3 participants