Test fails straight out of the box #126

Open
JGCoelho opened this issue Apr 30, 2020 · 4 comments

@JGCoelho

I've cloned the repository and tried running the unit test test.test_itertext. This test doesn't require setting up the sherlock model. It reads the text files that come with the package and builds the models inside the test, so I didn't supply any input of my own. The error I keep getting is this:

(base) C:\Users\JGC\Desktop\Trabalhos\Python\markovify>python -m unittest test.test_itertext
EE.E
======================================================================
ERROR: test_from_json_without_retaining (test.test_itertext.MarkovifyTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Users\JGC\Desktop\Trabalhos\Python\markovify\test\test_itertext.py", line 25, in test_from_json_without_retaining 
    original_model = markovify.Text(f, retain_original=False)
  File "C:\Users\JGC\Desktop\Trabalhos\Python\markovify\markovify\text.py", line 53, in __init__
    parsed = parsed_sentences or self.generate_corpus(input_text)
  File "C:\Users\JGC\Desktop\Trabalhos\Python\markovify\markovify\text.py", line 152, in generate_corpus
    for line in text:
  File "C:\Users\JGC\anaconda3\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3552: character maps to <undefined>

======================================================================
ERROR: test_from_mult_files_without_retaining (test.test_itertext.MarkovifyTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Users\JGC\Desktop\Trabalhos\Python\markovify\test\test_itertext.py", line 37, in test_from_mult_files_without_retaining
    models.append(markovify.Text(f, retain_original=False))
  File "C:\Users\JGC\Desktop\Trabalhos\Python\markovify\markovify\text.py", line 53, in __init__
    parsed = parsed_sentences or self.generate_corpus(input_text)
  File "C:\Users\JGC\Desktop\Trabalhos\Python\markovify\markovify\text.py", line 152, in generate_corpus
    for line in text:
  File "C:\Users\JGC\anaconda3\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3552: character maps to <undefined>

======================================================================
ERROR: test_without_retaining (test.test_itertext.MarkovifyTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Users\JGC\Desktop\Trabalhos\Python\markovify\test\test_itertext.py", line 18, in test_without_retaining
    senate_model = markovify.Text(f, retain_original=False)
  File "C:\Users\JGC\Desktop\Trabalhos\Python\markovify\markovify\text.py", line 53, in __init__
    parsed = parsed_sentences or self.generate_corpus(input_text)
  File "C:\Users\JGC\Desktop\Trabalhos\Python\markovify\markovify\text.py", line 152, in generate_corpus
    for line in text:
  File "C:\Users\JGC\anaconda3\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3552: character maps to <undefined>

----------------------------------------------------------------------
Ran 4 tests in 0.725s

FAILED (errors=3)

Running a conda 3.7.6 environment on Windows 10.

@jsvine
Owner

jsvine commented Apr 30, 2020

Thanks for flagging @JGCoelho. Judging by the error messages, this seems to be an issue with character encoding — possibly tied to Windows and/or Anaconda, but it's hard to tell. If you run the tests with a standard Python installation, instead of Anaconda, do you get the same problem? And can anyone else replicate these errors?
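
One way to check whether the platform default is involved: when `open()` is called in text mode without an `encoding` argument, Python falls back to a locale-dependent default. The sketch below reports that default. The helper name `default_text_encoding` is made up for illustration and is not part of markovify.

```python
import locale


def default_text_encoding() -> str:
    """Return the locale-dependent encoding that text-mode open() is
    documented to fall back to when no encoding argument is passed."""
    return locale.getpreferredencoding(False)


# The tracebacks above go through encodings\cp1252.py, so on that machine
# this would presumably report cp1252. Many Linux and macOS setups report UTF-8.
print(default_text_encoding())
```

If this prints something other than UTF-8, any UTF-8 file opened without an explicit encoding is decoded with the wrong codec, whichever Python distribution is in use.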

@JGCoelho
Author

I tried cloning it again and running the unit test with a standard Python 3.8.2 installation. Same errors:

C:\Users\JGC\Desktop>git clone https://github.com/jsvine/markovify.git
Cloning into 'markovify'...
remote: Enumerating objects: 32, done.
remote: Counting objects: 100% (32/32), done.
remote: Compressing objects: 100% (30/30), done.
remote: Total 834 (delta 16), reused 10 (delta 2), pack-reused 802
Receiving objects: 100% (834/834), 461.29 KiB | 1.43 MiB/s, done.
Resolving deltas: 100% (495/495), done.

C:\Users\JGC\Desktop>cd markovify

C:\Users\JGC\Desktop\markovify>py --version
Python 3.8.2

C:\Users\JGC\Desktop\markovify>py -m unittest test.test_itertext
EE.E
======================================================================
ERROR: test_from_json_without_retaining (test.test_itertext.MarkovifyTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Users\JGC\Desktop\markovify\test\test_itertext.py", line 24, in test_from_json_without_retaining
    original_model = markovify.Text(f, retain_original=False)
  File "C:\Users\JGC\Desktop\markovify\markovify\text.py", line 53, in __init__
    parsed = parsed_sentences or self.generate_corpus(input_text)
  File "C:\Users\JGC\Desktop\markovify\markovify\text.py", line 152, in generate_corpus
    for line in text:
  File "C:\Users\JGC\AppData\Local\Programs\Python\Python38\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3552: character maps to <undefined>

======================================================================
ERROR: test_from_mult_files_without_retaining (test.test_itertext.MarkovifyTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Users\JGC\Desktop\markovify\test\test_itertext.py", line 36, in test_from_mult_files_without_retaining
    models.append(markovify.Text(f, retain_original=False))
  File "C:\Users\JGC\Desktop\markovify\markovify\text.py", line 53, in __init__
    parsed = parsed_sentences or self.generate_corpus(input_text)
  File "C:\Users\JGC\Desktop\markovify\markovify\text.py", line 152, in generate_corpus
    for line in text:
  File "C:\Users\JGC\AppData\Local\Programs\Python\Python38\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3552: character maps to <undefined>

======================================================================
ERROR: test_without_retaining (test.test_itertext.MarkovifyTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Users\JGC\Desktop\markovify\test\test_itertext.py", line 17, in test_without_retaining
    senate_model = markovify.Text(f, retain_original=False)
  File "C:\Users\JGC\Desktop\markovify\markovify\text.py", line 53, in __init__
    parsed = parsed_sentences or self.generate_corpus(input_text)
  File "C:\Users\JGC\Desktop\markovify\markovify\text.py", line 152, in generate_corpus
    for line in text:
  File "C:\Users\JGC\AppData\Local\Programs\Python\Python38\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3552: character maps to <undefined>

----------------------------------------------------------------------
Ran 4 tests in 0.515s

FAILED (errors=3)

Maybe a problem with codecs? Opening the files sherlock.txt and senate-bills.txt, I could see that they were encoded as UTF-8 without a BOM. I converted them to UTF-8 with a BOM and got the same error. I also converted them to ANSI and UCS-2, to no avail.
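
A possible workaround, sketched under assumptions: rather than converting the files, pass the encoding explicitly when opening them, so the result no longer depends on the platform default. The example does not import markovify. It writes a small UTF-8 file to a temporary location and iterates over it line by line, the same way the `for line in text:` loop in the traceback does. The helper `read_lines` and the sample sentence are hypothetical.

```python
import os
import tempfile


def read_lines(path, encoding):
    """Iterate over a text file line by line, decoding with `encoding`."""
    with open(path, encoding=encoding) as f:
        return [line for line in f]


# A UTF-8 file containing left and right double quotation marks (U+201C, U+201D)
sample = "\u201cElementary,\u201d said he.\n"
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(sample.encode("utf-8"))
    path = tmp.name

try:
    # An explicit UTF-8 decode works regardless of the platform default
    utf8_lines = read_lines(path, "utf-8")

    # Decoding the same bytes as cp1252 is what the failing tests do implicitly
    try:
        read_lines(path, "cp1252")
        cp1252_failed = False
    except UnicodeDecodeError:
        cp1252_failed = True
finally:
    os.remove(path)
```

If this is the cause, the file handle passed to `markovify.Text(f, retain_original=False)` in the tests would need to come from `open(..., encoding="utf-8")`. Whether the maintainers would prefer that change is not something this thread settles.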

@JGCoelho
Author

Also, the byte 0x9d appears to be the last byte of the UTF-8 encoding (0xE2 0x80 0x9D) of 'RIGHT DOUBLE QUOTATION MARK' (U+201D).

@Sylv-Lej
Contributor

Sylv-Lej commented Aug 25, 2020

0x9d is unmapped in Windows-1252, according to Wikipedia.
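
The two observations above fit together, and a short sketch using only the standard library codecs can reproduce the failure at the byte level. It assumes the offending character in the corpus is U+201D stored as UTF-8, which the thread suggests but does not confirm.

```python
quote = "\u201d"  # RIGHT DOUBLE QUOTATION MARK

# In UTF-8 the character takes three bytes; the last one is 0x9d
utf8_bytes = quote.encode("utf-8")

# Windows-1252 has its own single-byte code for the same character
cp1252_bytes = quote.encode("cp1252")

# Decoding the UTF-8 bytes as cp1252 fails on the unmapped byte
try:
    utf8_bytes.decode("cp1252")
    failing_byte = None
except UnicodeDecodeError as exc:
    failing_byte = exc.object[exc.start]

print(utf8_bytes, cp1252_bytes, hex(failing_byte))
```

So 0x9d is not the character itself in Windows-1252. It is the trailing byte of a multi-byte UTF-8 sequence that the cp1252 codec cannot map.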
