GC word symbols and pave way for UTF-8 everywhere (major) #270

hostilefork · 2016-06-28T11:33:40Z

Previously, when a word was "interned" into the system, it would receive
a fixed index number from a table. This index number pointed into a
data area of UTF-8 strings, and was persistent--the system would never
reclaim a word's data from the table once it had been loaded. Every
spelling variation--even one fleetingly used--would add another permanent
record into the table.

The data was kept as UTF-8 because there was no need to decode it--the
words were immutable, and only needed to be hashed and compared.

This begins the unification of properties of strings and words, which
intends to use UTF-8 as the internal representation for strings
anyway. In anticipation of that change, it puts words into a kind of
REBSER series node called a REBSTR. The previous term "REBSYM" could
have meant either a canon form or arbitrary casing of a symbol, but
is repurposed to be a small integer for words known to the C code
(defined in %words.r).

Terminology was examined and standardized, so precise terms like
"interning" replace vaguer ones (Intern_Utf8 vs. Make_Word).
To help avoid mistakes in code in the future, the two forms of REBSTR
are extracted from words as VAL_WORD_CANON and VAL_WORD_SPELLING,
which makes the internal logic easier to follow. Also, because not
all words have one of the small numeric symbols, checks in the C++
build prevent coders from writing something like:

if (VAL_WORD_SYM(word1) == VAL_WORD_SYM(word2)) {...}

Since not all words have "SYMs", two words that are not in %words.r
would both return SYM_0 to indicate that fact, which would cause
a false belief that they were equal (at least, that's likely what the
code would have meant). The C build is unchecked for this, but the C++
build has trickery to notice it at compile time.
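
To make the pitfall concrete, here is a minimal sketch in plain C. The
DemoStr struct, the SYM_APPEND entry, and both helper functions are
hypothetical stand-ins invented for illustration, not the actual Ren-C
definitions. It assumes each distinct word has exactly one canon node,
so comparing canon pointers is the safe test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not the real Ren-C types. */
enum { SYM_0 = 0, SYM_APPEND = 1 };  /* SYM_APPEND is illustrative only */

typedef struct DemoStr {
    const struct DemoStr *canon;  /* a canon form points at itself here */
    int sym;                      /* SYM_0 when the word is not in %words.r */
    const char *utf8;             /* the spelling */
} DemoStr;

/* Unsafe: two unrelated words can both report SYM_0. */
static bool demo_same_word_by_sym(const DemoStr *a, const DemoStr *b) {
    return a->canon->sym == b->canon->sym;
}

/* Safe (under the one-canon-per-word assumption): compare canon nodes. */
static bool demo_same_word_by_canon(const DemoStr *a, const DemoStr *b) {
    return a->canon == b->canon;
}
```

With two words that are both absent from %words.r, the first helper
reports them as equal while the second does not.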

A major aspect of the change is that this removes the binding table.
Instead, space is reserved in the REBSTR node itself on the canon forms
of words--in the spot that non-canon forms use to point to the canon
form. This is where indices are stored during a bind.
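
As a rough sketch of the idea (not the code in this change), the
following C stores a temporary index directly in each canon node while
a bind is in progress and clears it afterward. DemoCanon, bind_index,
and demo_bind are hypothetical names, and the sketch assumes the keys
are unique and that nothing else is using the slot at the same time.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a canon node; not the Ren-C layout. */
typedef struct DemoCanon {
    const char *utf8;
    int bind_index;  /* 0 means "not part of the bind in progress" */
} DemoCanon;

/* Resolve each word in body to its 1-based position in keys, or 0
 * if the word is not one of the keys. */
static void demo_bind(DemoCanon **keys, size_t num_keys,
                      DemoCanon **body, size_t num_body, int *out) {
    size_t i;
    for (i = 0; i < num_keys; ++i) {
        assert(keys[i]->bind_index == 0);  /* slot must be free */
        keys[i]->bind_index = (int)(i + 1);
    }
    for (i = 0; i < num_body; ++i)
        out[i] = body[i]->bind_index;      /* no table lookup needed */
    for (i = 0; i < num_keys; ++i)
        keys[i]->bind_index = 0;           /* leave the nodes clean */
}
```

The lookup in the middle loop is a single field read per word, which
is the benefit of not going through a separate binding table.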

While not having a bind table and leaning on the existing hash structure
for lookups to words has benefits, the previous code had a bind table
per thread. While there was no larger threading story, Ren-C doesn't
want to have even less of one--so the technique is augmented with
a stack-based "Binder" that demonstrates how contentions could be
resolved over the space. This would ideally use some sort of lock-free
strategy that would dynamically build out as-needed from the contentious
REBSTR node that more than one thread (or recursion) wanted to write on.

This leverages some new-ish features, like the ability to embed small
payloads directly into REBSERs themselves. Short UTF-8 strings that
fit into 16 bytes (on 32-bit platforms) and 32 bytes (on 64-bit
platforms) do not need a separate storage allocation.
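
A minimal sketch of the small-payload idea, in plain C. DemoNode and
its helpers are hypothetical and do not mirror the real REBSER layout.
The capacity of four pointers' worth of bytes is chosen only because
it yields 16 bytes on 32-bit and 32 bytes on 64-bit platforms, and
counting the NUL terminator against that capacity is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* 16 bytes with 4-byte pointers, 32 bytes with 8-byte pointers. */
#define DEMO_INLINE_CAP (sizeof(void *) * 4)

/* Hypothetical node: the same space holds either the bytes
 * themselves or a pointer to a separate allocation. */
typedef struct DemoNode {
    bool is_inline;
    size_t len;
    union {
        char inline_bytes[DEMO_INLINE_CAP];
        char *dynamic;
    } content;
} DemoNode;

/* Copy utf8 into the node; returns false only if malloc fails. */
static bool demo_init(DemoNode *n, const char *utf8) {
    n->len = strlen(utf8);
    if (n->len + 1 <= DEMO_INLINE_CAP) {  /* assumed: terminator counts */
        n->is_inline = true;
        memcpy(n->content.inline_bytes, utf8, n->len + 1);
        return true;
    }
    n->is_inline = false;
    n->content.dynamic = malloc(n->len + 1);
    if (n->content.dynamic == NULL)
        return false;
    memcpy(n->content.dynamic, utf8, n->len + 1);
    return true;
}

static const char *demo_data(const DemoNode *n) {
    return n->is_inline ? n->content.inline_bytes : n->content.dynamic;
}

static void demo_free(DemoNode *n) {
    if (!n->is_inline)
        free(n->content.dynamic);
}
```

Most word spellings are short, so under this scheme the common case
would never touch the allocator.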

Likely the trickiest part of this change is the behavior when a canon
form of a word is GC'd. Because the accelerating hash table uses
linear probing, it's not easy to remove a hashed item: probe chains
starting from more than one initial hash may have collided, and
leaving an empty cell in the spot could cause a false miss of an
item that is still present. This is addressed with a special "deleted"
pointer value, whose slots are reclaimed on insertions and hash resizings.
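
The "deleted" marker technique can be sketched as follows. This is an
illustration with hypothetical names (DemoTable, DEMO_DELETED, and the
demo_* functions), a fixed-size table, and a deliberately weak hash
(first byte modulo the slot count) so collisions are easy to produce.
Reclaiming markers during a resize is not shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define DEMO_SLOTS 8

/* A unique address used as the "deleted" marker (a tombstone). */
static const char demo_deleted_marker = 0;
#define DEMO_DELETED (&demo_deleted_marker)

typedef struct DemoTable {
    const char *slots[DEMO_SLOTS];  /* NULL = never used */
} DemoTable;

/* Deliberately weak hash so collisions are easy to demonstrate. */
static size_t demo_hash(const char *s) {
    return (size_t)(unsigned char)s[0] % DEMO_SLOTS;
}

/* Returns the slot holding key, or -1. Probing stops at NULL but
 * continues past DEMO_DELETED, so removals cannot hide later items. */
static int demo_find(const DemoTable *t, const char *key) {
    size_t h = demo_hash(key), i;
    for (i = 0; i < DEMO_SLOTS; ++i) {
        size_t slot = (h + i) % DEMO_SLOTS;
        const char *p = t->slots[slot];
        if (p == NULL)
            return -1;
        if (p != DEMO_DELETED && strcmp(p, key) == 0)
            return (int)slot;
    }
    return -1;
}

/* Returns the slot used, or -1 if full. Reuses the first deleted
 * slot seen, but only after confirming key is not further along. */
static int demo_insert(DemoTable *t, const char *key) {
    size_t h = demo_hash(key), i;
    int first_deleted = -1;
    for (i = 0; i < DEMO_SLOTS; ++i) {
        size_t slot = (h + i) % DEMO_SLOTS;
        const char *p = t->slots[slot];
        if (p == NULL) {
            if (first_deleted < 0)
                first_deleted = (int)slot;
            break;
        }
        if (p == DEMO_DELETED) {
            if (first_deleted < 0)
                first_deleted = (int)slot;
        } else if (strcmp(p, key) == 0) {
            return (int)slot;  /* already present */
        }
    }
    if (first_deleted >= 0)
        t->slots[first_deleted] = key;
    return first_deleted;
}

/* Replace the item with the marker instead of emptying the cell. */
static bool demo_remove(DemoTable *t, const char *key) {
    int slot = demo_find(t, key);
    if (slot < 0)
        return false;
    t->slots[slot] = DEMO_DELETED;
    return true;
}
```

If "apple" and "igloo" collide and "apple" is removed, the marker left
in its slot lets a lookup for "igloo" keep probing instead of stopping
at an empty cell; a later insert into that chain takes the slot back.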


hostilefork merged commit 2f20e4b into metaeducation:master Jun 28, 2016

hostilefork deleted the gc-words branch July 10, 2016 11:27
