Commit
(FAT) Relaxed restriction on Unicode codepoints to support glyphs up to index 255.
Showing 2 changed files with 14 additions and 12 deletions.
0df7cdd
I did this because somebody requested it. Some software, such as LaunchELF, was written to support codepoints up to 255, a range that includes some Spanish characters.
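To illustrate the relaxed limit, here is a minimal sketch of mapping one UCS-2 code unit to a single 8-bit character. It is not the code from this commit: the function name `ucs2_to_byte` and the `'_'` placeholder are assumptions made for the example. It relies on the fact that Unicode code points 0 to 255 coincide with ISO-8859-1 (Latin-1).

```c
#include <stdint.h>

/* Hypothetical helper, not taken from the commit: maps one UCS-2 code
 * unit read from a directory entry to a single 8-bit character. */
char ucs2_to_byte(uint16_t unit)
{
    /* Code points 0..255 match ISO-8859-1, so the low byte can be used
     * as-is, e.g. 0x00F1 for the Spanish n with tilde. */
    if (unit <= 0x00FF)
        return (char)(unsigned char)unit;

    /* Anything above 255 does not fit in one byte. The '_' placeholder
     * is an assumption; real code may substitute something else. */
    return '_';
}
```

With this sketch, `ucs2_to_byte(0x0041)` yields `'A'`, while a code unit such as `0x4E2D` falls back to the placeholder.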
I did make a new attempt at implementing proper Unicode conversion functions to convert UCS-2 (the Unicode encoding used in FAT) to UTF-8, but I remembered why I gave up previously: the LFN entries are stored back to front, and because UTF-8 characters vary in length, it is difficult to insert characters in a non-sequential order. As a result, filenames are jumbled up, and there doesn't seem to be an easy way to solve this, short of identifying and storing all LFN entries in memory before converting the whole filename to UTF-8.
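The "store everything first, convert afterwards" approach could look roughly like the sketch below. It is only an illustration under stated assumptions, not this driver's code: `lfn_store_entry` and `ucs2_to_utf8` are hypothetical names, the caller is assumed to have already extracted the 13 UCS-2 units from each LFN entry and masked the 0x40 "last entry" flag off the sequence number, and the buffer is assumed to be pre-filled with 0xFFFF padding. Because each entry lands at a fixed offset in a UCS-2 buffer, the order in which entries are read no longer matters, and the variable-length UTF-8 output is produced in a single forward pass.

```c
#include <stddef.h>
#include <stdint.h>

#define LFN_CHARS_PER_ENTRY 13   /* UCS-2 units held by one LFN entry */
#define LFN_MAX_ENTRIES     20   /* enough for a 255-character name */
#define LFN_MAX_CHARS       (LFN_CHARS_PER_ENTRY * LFN_MAX_ENTRIES)

/* Copy one entry's 13 units to their final position in the name buffer.
 * seq is the 1-based sequence number with the 0x40 flag masked off.
 * Returns 0 on success, -1 if seq is out of range. */
int lfn_store_entry(uint16_t *name, unsigned seq, const uint16_t *units)
{
    if (seq < 1 || seq > LFN_MAX_ENTRIES)
        return -1;
    for (size_t i = 0; i < LFN_CHARS_PER_ENTRY; i++)
        name[(seq - 1) * LFN_CHARS_PER_ENTRY + i] = units[i];
    return 0;
}

/* Convert up to n UCS-2 units to NUL-terminated UTF-8, stopping at the
 * 0x0000 terminator or 0xFFFF padding. Returns the number of bytes
 * written, not counting the NUL. Output is truncated if dst is too small. */
size_t ucs2_to_utf8(const uint16_t *src, size_t n, char *dst, size_t dst_size)
{
    size_t out = 0;

    if (dst_size == 0)
        return 0;
    for (size_t i = 0; i < n; i++) {
        uint16_t c = src[i];
        if (c == 0x0000 || c == 0xFFFF)
            break;
        size_t need = (c < 0x80) ? 1 : (c < 0x800) ? 2 : 3;
        if (out + need >= dst_size)   /* keep room for the NUL */
            break;
        if (c < 0x80) {
            dst[out++] = (char)c;
        } else if (c < 0x800) {
            dst[out++] = (char)(0xC0 | (c >> 6));
            dst[out++] = (char)(0x80 | (c & 0x3F));
        } else {
            dst[out++] = (char)(0xE0 | (c >> 12));
            dst[out++] = (char)(0x80 | ((c >> 6) & 0x3F));
            dst[out++] = (char)(0x80 | (c & 0x3F));
        }
    }
    dst[out] = '\0';
    return out;
}
```

The cost is the one mentioned above: a buffer of `LFN_MAX_CHARS` 16-bit units (520 bytes) has to be kept for each name being assembled, which may or may not be acceptable on a memory-constrained target.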
0df7cdd
In case somebody else wonders like I once did: the encoding used for the filenames was not specified, but it was likely UCS-2, since FAT32 came with Windows 95B and Microsoft only switched over from UCS-2 to UTF-16 with Windows 2000. UCS-2 is a fixed-width 16-bit encoding, which makes it simpler to deal with than UTF-16 since it has no concept of surrogate pairs. Although the encoding was "not specified", we're constrained to using what Microsoft used, for forward and backward compatibility. The first 128 Unicode code points are identical to ASCII, which is why we can simply ignore the upper byte of UCS-2 and things still work for ASCII filenames.
UTF-8, UTF-16 and UCS-2 are encodings, not character sets, so they can all be used to encode (represent) the same Unicode characters. UCS-2 cannot represent every character, however, as it only has 16 bits per character. UTF-16 is free from that limit because a surrogate pair (two 16-bit units) carries 20 bits, which lets it represent many more characters.
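As a small worked example of where the 20 bits come from, the sketch below decodes a UTF-16 surrogate pair: the high surrogate contributes 10 bits, the low surrogate another 10, and the result is offset by 0x10000. The function name `utf16_decode_pair` and the 0xFFFFFFFF error value are assumptions for illustration only. Nothing like this is needed for UCS-2, which is exactly why it is simpler to handle.

```c
#include <stdint.h>

/* Hypothetical helper: combines a high surrogate (0xD800..0xDBFF) and a
 * low surrogate (0xDC00..0xDFFF) into one code point. Returns 0xFFFFFFFF
 * (an assumed error value) if the inputs are not a valid pair. */
uint32_t utf16_decode_pair(uint16_t hi, uint16_t lo)
{
    if (hi < 0xD800 || hi > 0xDBFF || lo < 0xDC00 || lo > 0xDFFF)
        return 0xFFFFFFFFu;

    /* 10 bits from each surrogate give 20 bits, offset past the BMP. */
    return 0x10000u + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
}
```

For instance, the pair 0xD83D, 0xDE00 decodes to U+1F600, a code point that a single 16-bit UCS-2 unit cannot hold.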