Skip to content

StringRepresentations

Felix S. Klock II edited this page Jul 28, 2013 · 2 revisions

= Representation of Strings =

Implementations of R5RS and R6RS Scheme can choose from a wide variety of representations for strings, including:

flat1:: bytevector-like objects, with 1 byte/character; works for ASCII or Latin-1 but not for Unicode

flat4:: bytevector-like objects, with 4 bytes/character; works for Unicode and R6RS

record4:: records that encapsulate a flat4

record2:: records that encapsulate a bytevector-like object with 2 bytes/character plus a table for supplementary characters

record1:: records that encapsulate a flat1 plus a table for non-ASCII characters

utf16:: records that encapsulate a bytevector-like object with UTF-16 encoding

utf8:: records that encapsulate a bytevector-like object with UTF-8 encoding

The utf16 and utf8 representations do not work very well for Scheme, but are discussed here because they may seem like the most obvious representations to programmers who work with Unicode in other languages. Before proceeding to detailed descriptions of these representations, here is a quick comparison:


#!rst
================== ======== ============== =============== ============= ===========
  Representation   Supports ``string-ref`` ``string-set!`` Space (32-bit and 64-bit)
------------------ -------- -------------- --------------- -------------------------
flat1              Latin-1       O(1)           O(1)            n+4           n+8
flat4              Unicode       O(1)           O(1)          4*n+4         4*n+8    
record4            Unicode       O(1)           O(1)          4*n+20        4*n+32
record2            Unicode       O(1)           O(1)          2*n+20        2*n+40
record1            Unicode       O(1)           O(1)            n+20          n+40 
utf16              Unicode       O(n)           O(n)          2*n+20        2*n+32  
utf8               Unicode       O(n)           O(n)            n+20          n+32
================== ======== ============== =============== ============= ===========

The last two columns show the space that would be occupied by an n-character ASCII string in 32-bit and 64-bit implementations of Larceny. On both 32-bit and 64-bit systems, the space should be rounded up to the next multiple of 8.

flat1

This representation was used in Larceny through v0.93. In Larceny v0.94, the use of this representation became a build-time option for users who do not need Unicode.

flat4

A prototype of this representation was introduced in Larceny v0.93 and replaced the flat1 representation in v0.94, although the flat1 representation can still be specified at build time (for native Larceny) by changing one of the setup parameters. (Note that Larceny's flat4 representation is not UTF-32, because the constituent characters are boxed. See the discussion of record2 below.)

record4

The main advantage of this representation over flat4 is that the length of a string can change dynamically.

record2

The main advantage of this representation over record4 is that long strings take about half as much space. It may also improve performance when interoperating with external libraries that use UTF-16 encodings.

The auxiliary table is #f for strings that have never contained any supplementary characters. For all such strings, the main bytevector-like object contains the standard UTF-16 encoding of a string.

When string-set! is used to store a supplementary character into a record2 string whose auxiliary table is #f, a second bytevector-like object is allocated, just as long as the original, to serve as the auxiliary table that holds the second code units of surrogate pairs. The first code units of surrogate pairs go into the original bytevector-like object. Once the auxiliary table has been allocated, the main bytevector-like object is unlikely to contain a valid UTF-16 encoding.

In the future, we might consider using record2 in Common Larceny, where the flat4 representation is less efficient. (Common Larceny has no boxed immediates, so all of the representations described here require string-ref to box its result and string-set! to unbox its argument.) The record2 representation might result in faster conversions between Scheme and native .NET representations, although it appears that fast conversions can be obtained only by using a less compact representation that would make string-ref and string-set! run slower.

record1

The main advantage of this representation over record2 is that long strings of ASCII characters take about half as much space.

The auxiliary table is #f for strings that have never contained any non-ASCII characters. For all such strings, the main bytevector-like object contains the standard UTF-8 encoding of an ASCII string.

When string-set! is used to store a non-ASCII character into a record1 string whose auxiliary table is #f, a bytevector (twice as long as the main table) is allocated to serve as the auxiliary table that holds the low-order 16 bits of all non-ASCII scalar values. If a byte of the main bytevector-like object corresponds to a non-ASCII character, then the low-order 5 bits of the byte contain the scalar value's high-order bits, and the byte's sign bit is set to indicate that it is a non-ASCII character whose 16 low-order bits must be obtained from the auxiliary table. Once the auxiliary table has been allocated, the main bytevector-like object is unlikely to contain a valid UTF-8 encoding.

The record1 representation should become a build-time option for native Larceny and for Petit Larceny.

utf16

The main advantage of this representation is ease of interoperation with external libraries that expect Unicode strings to be encoded in UTF-16. The disadvantage is that string-ref and string-set! take O(n) time. Caching will provide O(1) time for common access patterns, but storing a supplementary character into the string is likely to take O(n) time even with caching.

For Larceny, the record2 representation is better than utf16 in almost every respect for almost all applications.

utf8

The main advantage of this representation is ease of interoperation with external libraries that expect Unicode strings to be encoded in UTF-8. The disadvantage is that string-ref and string-set! take O(n) time. Caching will provide O(1) time for common access patterns, but storing a non-ASCII character into the string is likely to take O(n) time even with caching.

For Larceny, the record1 representation is better than utf8 provided most long strings consist of English text.


Hash Tables

The auxiliary tables used by the record2 and record1 representations could be hash tables or association lists. That is an attractive option for interpreted systems, but would make string-ref so complex that compilers would be unlikely to generate inline code for it. As with generic arithmetic, compilers might generate inline code for the common case, trapping to out-of-line code for the uncommon case.

Multiprocessing

The string-ref operation does not involve locking to exclude other string-ref operations, but it may involve locking to exclude a concurrent string-set! or length-changing operation. On most architectures, with the flat4 representation, a string-ref should never have to lock. With the record2 and record1 representations, and strings of fixed size, a string-ref operation should not have to lock unless it accesses the auxiliary table. With the record4, utf16, and utf8 representations (and some careful programming), a string-ref operation should never have to lock, even if the size of a string can change dynamically.

With representations other than flat4, the string-set! procedure may have to acquire a lock, and some string-ref operations may have to wait until the lock is released. On architectures that support atomic update of single bytes, storing an ASCII character into a flat1 or record1 string can be done without locking. On architectures that support atomic update of 16-bit quantities aligned on double-byte boundaries, storing a non-supplementary character into a record2 string can be done without locking.

With the utf16 and utf8 representations, every string-set! operation will have to acquire a lock. Otherwise a concurrent string-set! might have to replace the bytevector-like object with one of a different length, causing any side effect to the original to be lost.

With the record2 representation, string-set! operations that store a supplementary character will have to acquire a lock. With the record1 representation, string-set! operations that store a non-ASCII character will have to acquire a lock. The auxiliary table field can be used as the lock field, and concurrent string-set! operations that do not involve locking may proceed while the auxiliary table is locked.

If the size of a string never changes, then fewer operations will have to wait. With the record4 representation, on 32-bit architectures, no operation should ever have to lock or wait.


Space

All of the Unicode-capable representations require more space than the Latin-1 strings that Larceny used through v0.93. The following table shows how many bytes it would take to represent the 10917 strings, with an average length of 14.476 characters, that reside in the initial heap image for Larceny v0.93.


              32-bit  64-bit
              ------  ------
flat1         239120  285000
flat4         697184  741760
record4       871872 1003792
record2       567016  786152
record1       413808  634376
utf16         567016  698808
utf8          413808  547032

In the initial heap image, the print names of symbols account for 190040 of the 239120 bytes occupied by strings. Storing those print names in compressed form, e.g. as a bytevector in UTF-8, would reduce the storage requirements to:


              32-bit  64-bit
              ------  ------
flat1         239120  285000
flat4         346568  391960
record4       373640  432568
record2       302064  374688
record1       266192  339144
utf16         302064  361152
utf8          266192  325608
Clone this wiki locally