This is noodling towards a new external format implementation in SBCL. Currently it lives right here, outside the SBCL source tree. I’m not working on this right now, so if someone wants to run with it, they should feel free. Patches that move things along are welcome.
- UTF-16 error handling
- UTF-16 surrogate policy: Allow writing valid surrogate pairs. Signal an error only for invalid pairs – but allow writing them with a restart. (Since we allow surrogates as lisp-characters, it would seem silly to complain about writing a valid pair of them. This means reading them back will canonicalize the representation, but that seems sane.)
- UTF-8 surrogate policy
- Missing encodings:
- CP 932, CP 936, CP 949, CP 950
- Document ENCODE-STRING
- Add :NULL-TERMINATE to ENCODE-STRING
- Implement ENCODE-STRING-INTO
- Implement DECODE-OCTETS-INTO
- DECODE-OCTETS to work on null-terminated buffers of unknown size
- Proper test suite: not sure if eg. DECODED- and ENCODED-LENGTH current work
- Sort out BOM API:
- Consider (OPEN … :EXTERNAL-FORMAT :UTF-16) which should check BOM immediately, but not later, but later reads should not.
- Consider decoding parts of a binary file – BOM might need to be checked multiple times.
- How to best support writing a BOM? Maybe just WRITE-BOM?
- Documentation strings for encodings should say what languages / areas they are appropriate for.
- Name external-format objects using the user-supplied name, and unless it is the canonical name print the canonical name in parentheses when describing the format.
- Naming convention: case-insensitive strings?
Octet sources and sinks can be either vectors or saps.
Character sources and sinks are strings. The API level is responsible for pulling out the storage vector.
Encoding objects contain the methods for computing the length, encoding, and decoding. Also length estimation?
External-format Objects contains an encoding, a selection of line ending style, and a replacement character.
Error functions may return either the character code of the replacement character, or NIL to indicate that it is to be skipped.
How to do more complex error handling? Ie. decoding invalid UTF-8 sequences into strings describing those sequences.
Seems like the handler would need to know the position where the failure occurred. With that and CONTINUE the final result can be amended as needed. Something along the lines of my original multi-char replacements might be more efficient, but this seems like a tangential issue.
Would also like to be able to ABORT at the position of the invalid char/octets.