Encoding error (non ascii characters are not valid in gTTS()) #71

Closed
Fatallis opened this issue May 18, 2017 · 3 comments

@Fatallis

Fatallis commented May 18, 2017

Hello, I recently found this amazing new project. Congratulations!!! I found an error while using a file with Spanish text in it.

This is the error message:
'ascii' codec can't encode character u'\xbf' in position 0: ordinal not in range(128)
The text in the input file:
¿Cómo sabes que amas a alguien? Filosofía Martha Nussbaum Incomplegencia Teorema de la Verdad del Corazón, de Platón a Proust.

The command:
gtts-cli -o test.mp3 -f test.txt -l 'es'
I am not an expert with codecs and this stuff, but I added these lines to gtts-cli.py:

# encoding=utf8
import sys
reload(sys)  # Python 2 only: brings back sys.setdefaultencoding, which is removed at startup
sys.setdefaultencoding('utf8')

And it worked well, however I don't know if it's the optimal solution.
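For anyone reading later: a commonly suggested alternative to changing the interpreter-wide default is to decode the input file explicitly. The snippet below is only a minimal sketch of that idea, not what gtts-cli actually does; the temporary `test.txt` path and the variable names are made up for illustration. It first shows that encoding the Spanish text as ASCII is what fails, then reads the same text back with an explicit UTF-8 encoding via `io.open`, which behaves the same on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Sample text taken from the report above.
sample = u"¿Cómo sabes que amas a alguien?"

# Encoding to ASCII is what fails: "¿" (U+00BF) is outside range(128).
try:
    sample.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Hypothetical input file, written to a temp directory for this sketch.
path = os.path.join(tempfile.mkdtemp(), "test.txt")
with io.open(path, "w", encoding="utf-8") as f:
    f.write(sample)

# Read with an explicit encoding instead of relying on the default codec.
with io.open(path, "r", encoding="utf-8") as f:
    text = f.read()  # unicode text on both Python 2 and Python 3

# Encode explicitly wherever raw bytes are actually needed.
payload = text.encode("utf-8")
```

This keeps the encoding decision local to the file read, so other code that depends on the default codec is not affected.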

@antropophob

I can confirm this (or a very similar) issue with the Russian language.
Here is a stack trace:
Traceback (most recent call last):
  File "/home/parallels/Documents/talk.py", line 394, in <module>
    SendSpeech(FileNameTmp)
  File "/home/parallels/Documents/talk.py", line 207, in SendSpeech
    Say(choicyfication(result))
  File "/home/parallels/Documents/talk.py", line 146, in Say
    tts = gTTS(text, targetLanguage)
  File "/usr/local/lib/python2.7/dist-packages/gtts/tts.py", line 97, in __init__
    text_parts = self._tokenize(text, self.MAX_CHARS)
  File "/usr/local/lib/python2.7/dist-packages/gtts/tts.py", line 169, in _tokenize
    min_parts += self._minimize(p, " ", max_size)
  File "/usr/local/lib/python2.7/dist-packages/gtts/tts.py", line 176, in _minimize
    if self._len(thestring) > max_size:
  File "/usr/local/lib/python2.7/dist-packages/gtts/tts.py", line 154, in _len
    return len(text.decode('utf8'))
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xd0 in position 0: unexpected end of data

The reason for this behaviour is that gtts breaks a long string into small chunks without preserving the character encoding structure, so a chunk can end in the middle of a multi-byte character.
Example string where gtts fails:
Для продолжения обслуживания требуется переключение на оператора Контактного центра Банка. Вы согласны?
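The failure mode described above can be reproduced without gtts at all. The following is a rough sketch under the assumption that the chunking operates on UTF-8 bytes; `split_bytes`, `split_text` and `count_undecodable` are hypothetical helpers written for this illustration and are not part of the gtts API. Slicing the encoded bytes at fixed offsets cuts Cyrillic characters (two bytes each in UTF-8) in half, while slicing the decoded text never does.

```python
# -*- coding: utf-8 -*-

# The failing string from the report above.
text = (u"Для продолжения обслуживания требуется переключение "
        u"на оператора Контактного центра Банка. Вы согласны?")
data = text.encode("utf-8")


def split_bytes(raw, size):
    """Split a byte string every `size` bytes, ignoring character boundaries."""
    return [raw[i:i + size] for i in range(0, len(raw), size)]


def split_text(chars, size):
    """Split decoded text every `size` characters; boundaries are always safe."""
    return [chars[i:i + size] for i in range(0, len(chars), size)]


def count_undecodable(chunks):
    """Count how many byte chunks are not valid UTF-8 on their own."""
    bad = 0
    for chunk in chunks:
        try:
            chunk.decode("utf-8")
        except UnicodeDecodeError:
            bad += 1
    return bad


# Byte-level chunks: some of them end or start mid-character.
broken = count_undecodable(split_bytes(data, 7))

# Character-level chunks: each one encodes and decodes cleanly.
safe_chunks = [c.encode("utf-8") for c in split_text(text, 7)]
still_broken = count_undecodable(safe_chunks)
```

The first byte of the string on its own, `data[:1]`, is the lone lead byte 0xd0, the same byte named in the trace above.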

@XueWei

XueWei commented Jul 31, 2017

I encounter similar issues for zh-cn and ja when I input long text.

@pndurette
Owner

Hey! Thanks @Fatallis! Sorry it took so long to look at this, glad you had a working workaround.

Everyone: this should be fixed in gTTS v1.2.1 that was just released. I used all the examples above for testing as well. It was an issue with Python 2.7. Let me know how it goes.

github-actions bot locked this issue as resolved and limited conversation to collaborators on Feb 25, 2021