GC word symbols and pave way for UTF-8 everywhere (major) #270
Previously, when a word was "interned" into the system, it would receive
a fixed index number from a table. This index number pointed into a
data area of UTF-8 strings, and was persistent--the system would never
reclaim a word's data from the table once it had been loaded. Every
spelling variation--even one fleetingly used--would add another permanent
record into the table.
The data was kept as UTF-8 because there was no need to decode it--the
words were immutable, and only needed to be hashed and compared.
This begins the unification of properties of strings and words, which
intends to use UTF-8 as the internal representation for strings
anyway. In anticipation of that change, it puts words into a kind of
REBSER series node called a REBSTR. The previous term "REBSYM" could
have meant either a canon form or arbitrary casing of a symbol, but
is repurposed to be a small integer for words known to the C code
(defined in %words.r).
Terminology was standardized and examined, so terms like "interning"
are used instead of more vague things (Intern_Utf8 vs. Make_Word).
To help avoid mistakes in code in the future, the two forms of REBSTR
are extracted from words as VAL_WORD_CANON and VAL_WORD_SPELLING.
This makes the internal logic easier to follow. Also, because not all
words have one of the small numeric symbols, checks in the C++ build
prevent coders from comparing the SYMs of two words against each other.
Since not all words have "SYMs", two words that are not in %words.r
might return the same SYM_0 to indicate that fact, which would cause
a false belief these were equal (at least, that's likely what the code
would have meant). The C build is unchecked for this, but the C++
build has trickery to notice it at compile time.
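
One way such a compile-time check could work is sketched below. This is not the code from this change: `OptSym`, `SymId`, `SYM_APPEND`, `SYM_INSERT`, and `can_compare` are hypothetical names used for illustration, and only `SYM_0` comes from the description above. The idea is that comparing a word's symbol against a known constant stays legal, while comparing two possibly-absent symbols is a deleted operation.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

// Hypothetical ids; the real ones are generated from %words.r.
enum SymId { SYM_0 = 0, SYM_APPEND = 1, SYM_INSERT = 2 };

// Hypothetical wrapper for "the SYM of a word, which may be SYM_0".
struct OptSym {
    SymId id;

    // Testing against a known constant is always meaningful.
    bool operator==(SymId known) const { return id == known; }
    bool operator!=(SymId known) const { return id != known; }

    // Two words without ids would both hold SYM_0 and look "equal",
    // so comparing two OptSym values is rejected at compile time.
    bool operator==(const OptSym&) const = delete;
    bool operator!=(const OptSym&) const = delete;
};

// Detects at compile time whether `A == B` is a usable expression.
template <class A, class B>
auto can_compare(int)
    -> decltype(std::declval<A>() == std::declval<B>(), std::true_type{});
template <class A, class B>
std::false_type can_compare(...);

static_assert(decltype(can_compare<OptSym, SymId>(0))::value,
              "comparing against a known SYM constant is allowed");
static_assert(!decltype(can_compare<OptSym, OptSym>(0))::value,
              "comparing two optional SYMs must not compile");
```

In a plain C build the same value would presumably be an ordinary enum or integer, where the mistaken comparison compiles silently. Code that wants to know whether two words are the same would compare their canon forms instead.
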
A major aspect of the change is that this removes the binding table.
Instead, space is reserved in the REBSTR node itself on the canon forms
of words--in the spot that non-canon forms use to point to the canon
form. This is where indices are stored during a bind.
While not having a bind table and leaning on the existing hash structure
for lookups to words has benefits, the previous code had a bind table
per thread. While there was no larger threading story, Ren-C doesn't
want to have even less of one--so the technique is augmented with
a stack-based "Binder" that demonstrates how contentions could be
resolved over the space. This would ideally use some sort of lock-free
strategy that would dynamically build out as-needed from the contentious
REBSTR node that more than one thread (or recursion) wanted to write on.
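
The general shape of a stack-based binder can be sketched as follows. This is a single-threaded illustration under stated assumptions, not the implementation in this change: `Canon`, `bind_index`, `try_add`, and `get` are hypothetical names, and using 0 to mean "not bound" is an assumption. The sketch only detects contention over a node's slot (by refusing the write); it does not resolve it.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the canon REBSTR node of a word.
struct Canon {
    const char* spelling;
    int bind_index;  // 0 means "not bound right now" (an assumption)
};

// Lives on the stack for the duration of one bind operation. It records
// which canon nodes it wrote to so the slots can be cleared afterwards.
class Binder {
public:
    Binder() {}
    ~Binder() {
        for (Canon* c : touched_)
            c->bind_index = 0;  // release every slot this binder claimed
    }
    Binder(const Binder&) = delete;
    Binder& operator=(const Binder&) = delete;

    // Claims the node's slot for `index`. Returns false if the slot is
    // already in use, which is the contention case described above.
    bool try_add(Canon& canon, int index) {
        assert(index != 0);
        if (canon.bind_index != 0)
            return false;
        canon.bind_index = index;
        touched_.push_back(&canon);
        return true;
    }

    // Reads the index currently stored in the node (0 if none).
    int get(const Canon& canon) const { return canon.bind_index; }

private:
    std::vector<Canon*> touched_;
};
```

Tying the cleanup to the binder's lifetime means an early exit from a bind still leaves every canon node with an empty slot.
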
This leverages some new-ish features, like the ability to embed small
payloads directly into REBSERs themselves. Short UTF-8 strings that
fit into 16 bytes (on 32-bit platforms) or 32 bytes (on 64-bit
platforms) do not need a separate storage allocation.
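
The embedding idea can be illustrated with a small sketch. The layout below is hypothetical and is not the actual REBSER structure: `StrNode`, `INLINE_CAPACITY`, `make_str`, `str_data`, and `free_str` are names invented here, and counting the terminator against the inline capacity is an assumption. The capacity is expressed as four pointer widths, which gives 16 bytes on 32-bit and 32 bytes on 64-bit platforms.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Four pointer widths: 16 bytes on 32-bit, 32 bytes on 64-bit platforms.
constexpr std::size_t INLINE_CAPACITY = sizeof(void*) * 4;

// Hypothetical node: short content lives in the node, long content
// lives in a separately allocated buffer.
struct StrNode {
    std::size_t len;  // length in bytes, not counting the terminator
    bool is_inline;
    union {
        char inline_bytes[INLINE_CAPACITY];
        char* heap_bytes;
    } content;
};

// Copies `len` bytes of UTF-8 into a node, allocating only if the data
// (plus a terminator, by assumption) does not fit inline.
StrNode make_str(const char* utf8, std::size_t len) {
    StrNode n;
    n.len = len;
    n.is_inline = (len + 1 <= INLINE_CAPACITY);
    char* dest;
    if (n.is_inline) {
        dest = n.content.inline_bytes;
    } else {
        dest = new char[len + 1];
        n.content.heap_bytes = dest;
    }
    std::memcpy(dest, utf8, len);
    dest[len] = '\0';
    return n;
}

// Returns the bytes regardless of where they are stored.
const char* str_data(const StrNode& n) {
    return n.is_inline ? n.content.inline_bytes : n.content.heap_bytes;
}

// Releases the separate allocation, if there was one.
void free_str(StrNode& n) {
    if (!n.is_inline)
        delete[] n.content.heap_bytes;
}
```

Since most word spellings are short, a scheme like this would let the common case avoid a second allocation entirely.
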
Likely the trickiest part of this change is the behavior when a canon
form of a word is GC'd. The accelerating hash table uses linear probing,
so it's not easy to remove a hashed item: probe chains starting from
more than one initial hash may have collided, and leaving an empty cell
in the spot could cause a false miss of an item that is still present.
This was addressed with a special "deleted" pointer value, which is
reclaimed on insertions and hash resizings.
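
The "deleted" marker technique can be sketched with a small open-addressing table. This is an illustration rather than the code in this change: `CanonTable` and its members are hypothetical names, the hash is deliberately weak (first byte only) so that collisions are easy to reproduce, and resizing is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Fixed-size table of pointers to caller-owned strings, using linear
// probing. A removed entry is replaced by a marker, not by nullptr.
class CanonTable {
public:
    explicit CanonTable(std::size_t capacity) : slots_(capacity, nullptr) {
        assert(capacity > 0);
    }

    const std::string* find(const std::string& s) const {
        std::size_t i = locate(s);
        return i == slots_.size() ? nullptr : slots_[i];
    }

    // Returns false if an equal string is present or the table is full.
    bool insert(const std::string* s) {
        const std::size_t none = slots_.size();
        std::size_t target = none;  // first reusable slot seen so far
        std::size_t i = start(*s);
        for (std::size_t n = 0; n < slots_.size(); ++n) {
            const std::string* p = slots_[i];
            if (p == nullptr) {  // end of the probe chain
                if (target == none) target = i;
                break;
            }
            if (p == deleted_mark()) {
                if (target == none) target = i;  // reclaim the marker
            } else if (*p == *s) {
                return false;
            }
            i = (i + 1) % slots_.size();
        }
        if (target == none) return false;
        slots_[target] = s;
        return true;
    }

    // Leaves a marker behind so later items in the chain stay reachable.
    bool remove(const std::string& s) {
        std::size_t i = locate(s);
        if (i == slots_.size()) return false;
        slots_[i] = deleted_mark();
        return true;
    }

    std::size_t tombstones() const {
        std::size_t count = 0;
        for (const std::string* p : slots_)
            if (p == deleted_mark()) ++count;
        return count;
    }

private:
    static const std::string* deleted_mark() {
        static const std::string marker;  // its address is the sentinel
        return &marker;
    }
    // Deliberately weak hash (an assumption for the demo).
    std::size_t start(const std::string& s) const {
        std::size_t h = s.empty() ? 0 : static_cast<unsigned char>(s[0]);
        return h % slots_.size();
    }
    // Index of the slot holding `s`, or slots_.size() if absent.
    std::size_t locate(const std::string& s) const {
        std::size_t i = start(s);
        for (std::size_t n = 0; n < slots_.size(); ++n) {
            const std::string* p = slots_[i];
            if (p == nullptr) return slots_.size();  // truly empty: stop
            if (p != deleted_mark() && *p == s) return i;
            i = (i + 1) % slots_.size();
        }
        return slots_.size();
    }

    std::vector<const std::string*> slots_;
};
```

With this hash, "apple" and "avocado" land in the same probe chain. Removing "apple" leaves a marker, so a lookup for "avocado" keeps probing past it, and a later insert of another colliding string reuses the marked slot. A resize would simply skip markers when rehashing.
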