Tasks

SB-EXTERNAL-FORMATS

This is noodling towards a new external format implementation in SBCL. Currently it lives right here, outside the SBCL source tree. I’m not working on this right now, so if someone wants to run with it, they should feel free. Patches that move things along are welcome.

Tasks

UTF-16 error handling
UTF-16 surrogate policy: Allow writing valid surrogate pairs. Signal an error only for invalid pairs – but allow writing them with a restart. (Since we allow surrogates as lisp-characters, it would seem silly to complain about writing a valid pair of them. This means reading them back will canonicalize the representation, but that seems sane.)
UTF-8 surrogate policy
Missing encodings:
- CP 932, CP 936, CP 949, CP 950
- EBCDIC-US
- EUC-JP
- GBK
- SHIFT_JIS
- UCS-2
- UCS-4
- UTF-32
- UTF-8b
- X-MAC-CYRILLIC
Document ENCODE-STRING
Add :NULL-TERMINATE to ENCODE-STRING
Implement ENCODE-STRING-INTO
Implement DECODE-OCTETS-INTO
DECODE-OCTETS to work on null-terminated buffers of unknown size
Proper test suite: not sure if eg. DECODED- and ENCODED-LENGTH current work
Sort out BOM API:
- Consider (OPEN … :EXTERNAL-FORMAT :UTF-16) which should check BOM immediately, but not later, but later reads should not.
- Consider decoding parts of a binary file – BOM might need to be checked multiple times.
- How to best support writing a BOM? Maybe just WRITE-BOM?
Documentation strings for encodings should say what languages / areas they are appropriate for.
Name external-format objects using the user-supplied name, and unless it is the canonical name print the canonical name in parentheses when describing the format.
Naming convention: case-insensitive strings?

Design

Octet sources and sinks can be either vectors or saps.

Character sources and sinks are strings. The API level is responsible for pulling out the storage vector.

Encoding objects contain the methods for computing the length, encoding, and decoding. Also length estimation?

External-format Objects contains an encoding, a selection of line ending style, and a replacement character.

Error functions may return either the character code of the replacement character, or NIL to indicate that it is to be skipped.

Open Questions

How to do more complex error handling? Ie. decoding invalid UTF-8 sequences into strings describing those sequences.

Seems like the handler would need to know the position where the failure occurred. With that and CONTINUE the final result can be amended as needed. Something along the lines of my original multi-char replacements might be more efficient, but this seems like a tangential issue.

Would also like to be able to ABORT at the position of the invalid char/octets.

Name		Name	Last commit message	Last commit date
Latest commit History 33 Commits
unicode		unicode
.gitignore		.gitignore
README.org		README.org
api.lisp		api.lisp
enc-ascii.lisp		enc-ascii.lisp
enc-cyr.lisp		enc-cyr.lisp
enc-dos.lisp		enc-dos.lisp
enc-iso.lisp		enc-iso.lisp
enc-utf16.lisp		enc-utf16.lisp
enc-utf8.lisp		enc-utf8.lisp
enc-win.lisp		enc-win.lisp
encoding.lisp		encoding.lisp
external-format.lisp		external-format.lisp
package.lisp		package.lisp
sb-external-formats.asd		sb-external-formats.asd
tests.lisp		tests.lisp
utils.lisp		utils.lisp

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Tasks

Design

Open Questions

About

Releases

Packages

Contributors 2

Languages

nikodemus/sb-external-formats

Folders and files

Latest commit

History

Repository files navigation

Tasks

Design

Open Questions

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages