This repository has been archived by the owner on Oct 31, 2022. It is now read-only.

Windows doesn't automatically use UTF-8 encoding #9

Open
MrKrzYch00 opened this issue Jun 8, 2019 · 2 comments

Comments

@MrKrzYch00

Shouldn't all file operations have encoding="utf-8" added to make them more portable to other systems like Windows? Unless there is some other global switch that could be applied at startup so that it doesn't crash with a message like "[...]charmap' codec can't encode character[...]"
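To make the suggestion concrete, here is a minimal sketch of what passing an explicit encoding looks like. The helper names `write_text` and `read_text` and the sample sentence are made up for illustration and are not taken from this repository. On the "global switch" question: Python 3.7+ has a UTF-8 mode (`PYTHONUTF8=1` or `-X utf8`) that changes the default used by `open()`, though whether that fits this project is untested here.

```python
import os
import tempfile


def write_text(path, text, encoding="utf-8"):
    """Hypothetical helper: always pass an explicit encoding to open()."""
    with open(path, "w", encoding=encoding) as f:
        f.write(text)


def read_text(path, encoding="utf-8"):
    """Hypothetical helper: read back with the same explicit encoding."""
    with open(path, "r", encoding=encoding) as f:
        return f.read()


# Polish pangram containing letters that many 1-byte Windows code pages lack.
sample = "Zażółć gęślą jaźń"

path = os.path.join(tempfile.mkdtemp(), "sample.txt")
write_text(path, sample)
roundtrip = read_text(path)

# Without an explicit encoding, open() uses the locale default. With a
# Western European default such as cp1252, encoding this text fails:
try:
    sample.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False
```

With the explicit encoding the round trip is lossless regardless of the system locale, while the `cp1252` attempt raises the same kind of "can't encode character" error quoted above.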

@kinoc

kinoc commented Jun 9, 2019

I would like an optional encoding flag that defaults to "utf-8" but lets you specify others. I have to use "latin-1" in some cases.
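A rough sketch of such a flag, assuming the scripts use `argparse`. The `--encoding` option, `build_parser`, and `load_dataset` are hypothetical names for illustration, not something this repository defines.

```python
import argparse
import os
import tempfile


def build_parser():
    """Hypothetical CLI: --encoding defaults to utf-8 but can be overridden."""
    parser = argparse.ArgumentParser(description="Illustrative dataset loader")
    parser.add_argument(
        "--encoding",
        default="utf-8",
        help="text encoding used to read the dataset (default: utf-8)",
    )
    return parser


def load_dataset(path, encoding="utf-8"):
    """Read the whole dataset file using the requested encoding."""
    with open(path, "r", encoding=encoding) as f:
        return f.read()


# No flag given: fall back to UTF-8.
default_args = build_parser().parse_args([])

# Flag given: use whatever the caller asked for.
latin_args = build_parser().parse_args(["--encoding", "latin-1"])

# A file stored as latin-1 (0xE9 is "é") is read correctly with the override.
latin_path = os.path.join(tempfile.mkdtemp(), "latin1.txt")
with open(latin_path, "wb") as f:
    f.write(b"caf\xe9")
latin_text = load_dataset(latin_path, encoding=latin_args.encoding)
```

Defaulting to UTF-8 keeps behavior the same across platforms, and the override covers legacy datasets like the latin-1 case.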

@MrKrzYch00
Author

MrKrzYch00 commented Jun 9, 2019

Yeah, I'm not yet 100% sure myself whether it should be UTF-8, or whether one should keep datasets in the system-default encoding and open them as such... I'm trying to train it on Polish text to see the results. Unfortunately it doesn't want to use Polish accented letters; for example, it replaces ł with a plain l in the samples. Maybe I'm missing something, or does it still need more training? (Although it does use ó, which usually exists in 1-byte encodings.)

EDIT: Never mind the above... It seems that the console output is UTF-8, which my CMD simply doesn't display correctly; it would need to be converted to ANSI using the Polish code page before output. So in my case UTF-8 is the most valid way to read datasets (without BOM!). The sample files look OK.
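For anyone hitting the same two problems, a small sketch under stated assumptions: the `utf-8-sig` codec reads UTF-8 files whether or not they start with a BOM, and `cp1250` is assumed here to be the Polish ANSI code page (the console may actually use a different one). `read_dataset` and `to_console_bytes` are hypothetical helpers, not part of this repository.

```python
import codecs
import os
import tempfile


def read_dataset(path):
    """Decode UTF-8 and drop a leading BOM if one is present."""
    with open(path, "r", encoding="utf-8-sig") as f:
        return f.read()


def to_console_bytes(text, console_encoding="cp1250"):
    """Re-encode text for a legacy console code page (cp1250 is an assumption).

    Characters the code page cannot represent become "?" instead of raising.
    """
    return text.encode(console_encoding, errors="replace")


text = "łódź"
tmp = tempfile.mkdtemp()
with_bom = os.path.join(tmp, "with_bom.txt")
without_bom = os.path.join(tmp, "without_bom.txt")

# Same content, once with a UTF-8 BOM prepended and once without.
with open(with_bom, "wb") as f:
    f.write(codecs.BOM_UTF8 + text.encode("utf-8"))
with open(without_bom, "wb") as f:
    f.write(text.encode("utf-8"))

# Plain "utf-8" keeps the BOM as a stray U+FEFF character at the start.
with open(with_bom, "r", encoding="utf-8") as f:
    plain_utf8 = f.read()
```

Reading with plain `utf-8` leaves the BOM in the text as an extra character, which is presumably why datasets without a BOM behave better; `utf-8-sig` handles both variants.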


2 participants