Windows doesn't automatically use UTF-8 encoding #9

MrKrzYch00 · 2019-06-08T15:54:47Z

Shouldn't all file operations pass encoding="utf-8" to make the code more portable to other systems like Windows? Unless there is some other global switch that could be applied at startup so that it doesn't crash with the message "[...]charmap' codec can't encode character[...]".
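A minimal sketch of what the explicit argument looks like, assuming the project reads and writes plain text files with Python's built-in `open`. The helper names `write_text` and `read_text` and the sample file are hypothetical, not taken from the repository.

```python
import os
import tempfile


def write_text(path, text, encoding="utf-8"):
    # Pass the encoding explicitly so the result does not depend on the
    # platform's locale default (a legacy code page on many Windows setups).
    with open(path, "w", encoding=encoding) as f:
        f.write(text)


def read_text(path, encoding="utf-8"):
    # Read back with the same explicit encoding.
    with open(path, "r", encoding=encoding) as f:
        return f.read()


# Round-trip a Polish sample through a temporary file.
sample = "zażółć gęślą jaźń"
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
write_text(path, sample)
roundtrip = read_text(path)
```

As for a global switch: Python 3.7 and later have a UTF-8 mode (`-X utf8` on the command line, or the `PYTHONUTF8=1` environment variable) that should make UTF-8 the default for `open`. It is untested against this project, so treat it as something to try rather than a confirmed fix.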

kinoc · 2019-06-09T07:47:38Z

I would like an optional encoding flag that defaults to "utf-8" but lets you specify others. I have to use "latin-1" in some cases.
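One way such a flag could look, sketched with `argparse`. The flag name `--encoding` and the functions `encoding_name` and `build_parser` are assumptions for illustration; the project's actual command-line interface may be built differently.

```python
import argparse
import codecs


def encoding_name(value):
    # Reject names Python has no codec for, so a typo fails at startup
    # instead of in the middle of reading a dataset.
    try:
        codecs.lookup(value)
    except LookupError:
        raise argparse.ArgumentTypeError("unknown encoding: %r" % value)
    return value


def build_parser():
    # Hypothetical parser; only the --encoding option matters here.
    parser = argparse.ArgumentParser(description="dataset loader (sketch)")
    parser.add_argument(
        "--encoding",
        type=encoding_name,
        default="utf-8",
        help="text encoding used to read dataset files (default: utf-8)",
    )
    return parser


default_args = build_parser().parse_args([])
latin_args = build_parser().parse_args(["--encoding", "latin-1"])
```

The parsed value would then be passed on as `open(path, encoding=args.encoding)` wherever files are read.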

MrKrzYch00 · 2019-06-09T11:50:40Z

Yeah, I'm not yet 100% sure myself whether it should be UTF-8, or whether one should keep the dataset in the system-default encoding and open it as such... I'm trying to train it on Polish text to see the results. Unfortunately it doesn't want to use Polish accented letters; for example, it replaces ł with a plain l in the samples. Maybe I'm missing something, or does it still need more training? (Although it does use ó, which usually exists in 1-byte encodings.)

EDIT: Never mind the above... It seems that the console output in my CMD is UTF-8, which simply doesn't work; it would need to be converted to ANSI using the Polish code page before output. So in my case UTF-8 is the most valid way to read datasets (without BOM!). The sample files look OK.
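A rough sketch of the two points in the edit above, under stated assumptions. The function names are hypothetical, and cp1250 is assumed to be the Polish ANSI code page meant here. The first helper decodes UTF-8 bytes whether or not a BOM is present; the second shows what a given console code page can actually represent.

```python
def decode_dataset(raw):
    # "utf-8-sig" decodes UTF-8 and drops a leading BOM if there is one,
    # so files saved with or without a BOM yield the same text.
    return raw.decode("utf-8-sig")


def console_safe(text, console_encoding):
    # Characters missing from the console code page become "?" instead of
    # raising UnicodeEncodeError ("charmap codec can't encode character").
    return text.encode(console_encoding, errors="replace").decode(console_encoding)


with_bom = b"\xef\xbb\xbf" + "ł".encode("utf-8")
without_bom = "ł".encode("utf-8")

same_text = decode_dataset(with_bom) == decode_dataset(without_bom)
polish_console = console_safe("ł", "cp1250")   # cp1250 contains ł
western_console = console_safe("ł", "cp1252")  # cp1252 does not
```

This would explain why the sample files look fine while the console output does not: the data itself is intact, and only the display step loses characters.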
