OS-dependent filename encoding/decoding #119

Open
thelema opened this Issue Feb 22, 2011 · 7 comments

Comments

@thelema
Owner

thelema commented Feb 22, 2011

I suggest File.fs_encode/fs_decode for converting a UTF8.t into a string for use as a filename.
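Roughly, the pair could have signatures like these (just a sketch of the proposal; none of this exists in Batteries yet, and the names are only the ones suggested above):

```ocaml
(* Sketch of the proposed interface; hypothetical, not existing Batteries code. *)
val fs_encode : UTF8.t -> string
(** Encode a Unicode filename into the byte string that the OS
    file-handling functions expect on the current platform. *)

val fs_decode : string -> UTF8.t
(** Decode a filename obtained from the OS (e.g. from [Sys.readdir])
    back into Unicode text. *)
```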

@chaudhuri
Contributor

chaudhuri commented Feb 22, 2011

Since most functions use the scheme verb_noun, I think encode/decode_fs is better. Also, I would suggest a different verb pair entirely since "en/decode" are already used in the technical sense of character encoding. Alternatively, put them in (a newly created) BatFilename. Feel free to ignore the suggestion if you don't like it.

@thelema
Owner

thelema commented Feb 22, 2011

Actually, BatPathGen might be a good place for it. Maybe encode/decode_filename is better, as the value being recoded is the filename.

@ghost ghost assigned thelema Apr 18, 2011

@toton
Contributor

toton commented Dec 12, 2011

I think the clean solution is to make versions of the file-handling functions that accept UTF-8. The OS-dependent functions should be deprecated. The file-access functions for Win32 should talk UTF-16, but AFAIK they currently use a random legacy encoding. This is kind of broken and should be replaced.

@toton
Contributor

toton commented Dec 13, 2011

Second thought: it would make sense to settle on UTF-8 all over the library. AFAIK, other encodings matter only on Windows.
But, in principle, when writing an application for Windows, you have to be able to take data in UTF-8 (from the network and files), in legacy encodings (from files) and in UTF-16 (GUI, hopefully exposed to you as UTF-8 by some GUI layer). But the legacy encodings vary with the locale, so you usually have to work in Unicode anyway.
So, honestly, I can't see use cases where e.g. the file-opening function speaks anything other than UTF-8.
Also, Windows-style apps tend to use a GUI, and currently GTK is the viable option; it uses UTF-8 uniformly.
Otherwise, there are command-line tools, but they are likely to encounter UTF-8 data anyway (e.g. source code files).

And in the cases where a legacy encoding is enough, the program is most likely encoding-transparent, so no harm comes from UTF-8 there either. With the old Windows code pages you cannot rely on the fact that byte 200 means Č; it can just as well be И. It's just an opaque stream of bytes.
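To make that last point concrete, here is a small self-contained illustration (the code-page values are quoted from the standard Windows-1250/1251 tables; nothing here is Batteries code):

```ocaml
(* Byte 200 (0xC8) names a different character depending on the legacy
   code page, so legacy-encoded filenames are portable only as opaque bytes. *)
let legacy_byte = '\xC8'       (* the same single byte in both code pages *)
let in_cp1250  = "\xC4\x8C"    (* UTF-8 for U+010C, Č (Windows-1250) *)
let in_cp1251  = "\xD0\x98"    (* UTF-8 for U+0418, И (Windows-1251) *)

let () =
  Printf.printf "byte %d is %s under cp1250 but %s under cp1251\n"
    (Char.code legacy_byte) in_cp1250 in_cp1251
```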

@thelema
Owner

thelema commented Dec 13, 2011

I'm all for UTF-8. Your last point doesn't seem right, though: UTF-8 doesn't let you carry just an opaque stream of bytes; the stream has to be re-encoded before UTF-8 can carry it. A plain string, on the other hand, is just a stream of bytes. So supporting both UTF-8 and string should suffice for almost all uses.

@toton
Contributor

toton commented Dec 15, 2011

I meant that, since Windows code pages vary, applications can rely only on the meaning of bytes in the ASCII range. They should then work equally well if the same code is used to process UTF-8 strings - provided that proper transcoding is done at the boundaries. So moving to UTF-8-only functions means little work.

What do you think about changing all BatFile and BatUnix functions to take UTF-8 file names on all platforms?
This would require bindings to CreateFileW, _wstati64, _wmkdir, MoveFileExW, CreateHardLinkW, _wrmdir and some others.
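Since those wide-character APIs take UTF-16, each binding would need the filename transcoded at the boundary. A minimal sketch of that conversion (not existing Batteries code; it assumes the input is already well-formed UTF-8 and does no validation):

```ocaml
(* Convert a well-formed UTF-8 string to the NUL-terminated UTF-16LE
   buffer that CreateFileW, _wmkdir and friends expect.
   Assumption: the input is valid UTF-8; malformed input is not detected. *)
let utf16le_of_utf8 (s : string) : string =
  let buf = Buffer.create (2 * String.length s + 2) in
  let add_u16 u =
    Buffer.add_char buf (Char.chr (u land 0xFF));
    Buffer.add_char buf (Char.chr ((u lsr 8) land 0xFF))
  in
  let add_cp cp =
    if cp < 0x10000 then add_u16 cp
    else begin                                   (* encode as a surrogate pair *)
      let cp = cp - 0x10000 in
      add_u16 (0xD800 lor (cp lsr 10));
      add_u16 (0xDC00 lor (cp land 0x3FF))
    end
  in
  let n = String.length s in
  let byte i = Char.code s.[i] in
  let rec go i =
    if i < n then
      let c = byte i in
      if c < 0x80 then (add_cp c; go (i + 1))
      else if c < 0xE0 then begin                (* 2-byte sequence *)
        add_cp (((c land 0x1F) lsl 6) lor (byte (i + 1) land 0x3F));
        go (i + 2)
      end
      else if c < 0xF0 then begin                (* 3-byte sequence *)
        add_cp (((c land 0x0F) lsl 12)
                lor ((byte (i + 1) land 0x3F) lsl 6)
                lor (byte (i + 2) land 0x3F));
        go (i + 3)
      end
      else begin                                 (* 4-byte sequence *)
        add_cp (((c land 0x07) lsl 18)
                lor ((byte (i + 1) land 0x3F) lsl 12)
                lor ((byte (i + 2) land 0x3F) lsl 6)
                lor (byte (i + 3) land 0x3F));
        go (i + 4)
      end
  in
  go 0;
  add_u16 0;                                     (* trailing NUL for the Win32 API *)
  Buffer.contents buf
```

The reverse direction (UTF-16 names returned by the OS back to UTF-8) would be needed for functions that produce names, e.g. directory listing.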

@thelema
Owner

thelema commented Dec 15, 2011

Sounds good to me. You'll have to blaze the trail on C bindings in batteries and platform detection, but how hard could it be? :)
