GC word symbols and pave way for UTF-8 everywhere (major) #270

Merged
merged 1 commit into from
Jun 28, 2016
@hostilefork hostilefork commented Jun 28, 2016

Previously, when a word was "interned" into the system, it would receive
a fixed index number from a table. This index number pointed into a
data area of UTF-8 strings, and was persistent--the system would never
reclaim a word's data from the table once it had been loaded. Every
spelling variation--even one fleetingly used--would add another permanent
record to the table.

The data was kept as UTF-8 because there was no need to decode it--the
words were immutable, and only needed to be hashed and compared.

This begins the unification of the properties of strings and words.
Strings are intended to use UTF-8 as their internal representation
anyway, so in anticipation of that change, words are put into a kind of
REBSER series node called a REBSTR. The previous term "REBSYM" could
have meant either a canon form or an arbitrary casing of a symbol; it
is repurposed to mean a small integer for words known to the C code
(defined in %words.r).

Terminology was examined and standardized, so precise terms like
"interning" replace vaguer ones (Intern_Utf8 vs. Make_Word).
To help avoid mistakes in future code, the two forms of REBSTR
are extracted from words as VAL_WORD_CANON and VAL_WORD_SPELLING.
This makes the internal logic easier to follow. Also, because not
every word has one of the small numeric symbols,
checks in the C++ build prevent coders from writing something like:

    if (VAL_WORD_SYM(word1) == VAL_WORD_SYM(word2)) {...}

Since not all words have "SYMs", two words that are not in %words.r
might both return SYM_0 to indicate that fact, which would make them
falsely appear equal (at least, equality is likely what the code
would have meant). The C build is unchecked for this, but the C++
build has trickery to notice it at compile time.
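
To make the pitfall concrete, here is a minimal C sketch. The `Str` struct, the `Sym` enum values other than SYM_0, and all three helper functions are hypothetical stand-ins for illustration, not the actual Ren-C definitions. It assumes interned canon forms are unique, so comparing canon pointers is a sound word-equality test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the real Ren-C types. */
typedef enum { SYM_0 = 0, SYM_APPEND = 1, SYM_INSERT = 2 } Sym;

typedef struct Str {
    const char *utf8;        /* spelling, stored as UTF-8 */
    const struct Str *canon; /* canon form (a canon points to itself) */
    Sym sym;                 /* SYM_0 when the word is not built in */
} Str;

/* Pitfall: every word outside the built-in list reports SYM_0, so two
 * unrelated words compare "equal" here. */
static bool same_word_by_sym(const Str *a, const Str *b) {
    return a->sym == b->sym;
}

/* Safer: assuming canon forms are unique, pointer identity of the
 * canons answers "are these the same word, ignoring case?". */
static bool same_word_by_canon(const Str *a, const Str *b) {
    return a->canon == b->canon;
}

/* Comparing against one known built-in symbol is fine, because SYM_0
 * never equals a real symbol number. */
static bool word_is_sym(const Str *w, Sym known) {
    assert(known != SYM_0);
    return w->sym == known;
}
```

In this sketch, "foo" and "bar" (neither built in) are equal according to `same_word_by_sym` but not according to `same_word_by_canon`, which is the kind of mistake the C++ build is described as catching at compile time.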

A major aspect of the change is that this removes the binding table.
Instead, space is reserved in the REBSTR node itself on the canon forms
of words--in the spot that non-canon forms use to point to the canon
form. This is where indices are stored during a bind.
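
The slot sharing described above can be sketched roughly as follows. The struct layout, the `is_canon` flag, the use of 0 as "no index", and the `bind_*` function names are assumptions made for illustration; the real REBSTR layout may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node layout; field names are illustrative only. */
typedef struct Str {
    const char *utf8;
    int is_canon;
    union {
        struct Str *canon; /* non-canon forms: pointer to the canon */
        int bind_index;    /* canon forms: scratch index during a bind
                              (0 means "not bound", by assumption) */
    } misc;
} Str;

/* Any spelling resolves to its canon before the shared slot is used. */
static Str *canon_of(Str *s) {
    return s->is_canon ? s : s->misc.canon;
}

/* Record an index for a word during a bind; the slot must be free. */
static void bind_add(Str *word, int index) {
    Str *c = canon_of(word);
    assert(index != 0 && c->misc.bind_index == 0);
    c->misc.bind_index = index;
}

/* Look up the index; 0 means the word is not part of this bind. */
static int bind_lookup(Str *word) {
    return canon_of(word)->misc.bind_index;
}

/* Clear the slot so the next bind starts from a clean state. */
static void bind_remove(Str *word) {
    canon_of(word)->misc.bind_index = 0;
}
```

Because the index lives on the canon node, looking up `Foo` and `foo` reaches the same slot without a separate table. The cost is that only one bind can use a given slot at a time.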

Dropping the bind table and leaning on the existing hash structure
for word lookups has benefits, but the previous code had one bind table
per thread. Although there was no larger threading story, Ren-C doesn't
want to have even less of one--so the technique is augmented with
a stack-based "Binder" that demonstrates how contention over the space
could be resolved. Ideally this would use some sort of lock-free
strategy that dynamically builds out as needed from the contended
REBSTR node that more than one thread (or recursion) wants to write to.

This leverages some new-ish features, like the ability to embed small
payloads directly into REBSERs themselves. Short UTF-8 strings that
fit into 16 bytes (on 32-bit platforms) or 32 bytes (on 64-bit
platforms) do not need a separate storage allocation.
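
A rough sketch of the embedding idea, under stated assumptions: the `Node` type, its field names, and the choice of four pointer-widths as the inline capacity (picked only because it yields 16 and 32 bytes on 32-bit and 64-bit platforms) are hypothetical, and this sketch counts the terminator against the capacity.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Assumed capacity: 4 pointer-widths gives 16 bytes on 32-bit and
 * 32 bytes on 64-bit platforms. Not taken from the Ren-C sources. */
#define INLINE_CAPACITY (4 * sizeof(void *))

typedef struct Node {
    int is_dynamic; /* nonzero when the data lives in a heap block */
    union {
        char inline_bytes[INLINE_CAPACITY]; /* short payloads */
        char *heap;                         /* longer payloads */
    } content;
} Node;

/* Store a NUL-terminated UTF-8 string; returns 0 if allocation fails. */
static int node_init(Node *n, const char *utf8) {
    assert(n != NULL && utf8 != NULL);
    size_t size = strlen(utf8) + 1; /* include the terminator */
    if (size <= INLINE_CAPACITY) {
        n->is_dynamic = 0;
        memcpy(n->content.inline_bytes, utf8, size);
        return 1;
    }
    n->is_dynamic = 1;
    n->content.heap = malloc(size);
    if (n->content.heap == NULL)
        return 0;
    memcpy(n->content.heap, utf8, size);
    return 1;
}

/* Callers read the bytes the same way regardless of where they live. */
static const char *node_utf8(const Node *n) {
    return n->is_dynamic ? n->content.heap : n->content.inline_bytes;
}

/* Only dynamic nodes own a separate allocation to release. */
static void node_free(Node *n) {
    if (n->is_dynamic)
        free(n->content.heap);
}
```

Most word spellings are short, so in a scheme like this the common case touches only the node itself and never calls the allocator.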

Probably the trickiest part of this change is the behavior when a canon
form of a word is GC'd. Because the accelerating hash table uses
linear probing, it's not easy to remove a hashed item: hash chains
starting from more than one initial hash may have collided,
and leaving an empty cell in the spot could cause a false miss of an
item that is still present. This is addressed with a special "deleted"
pointer value, whose slots are reclaimed on insertions and hash resizings.
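
The "deleted" marker technique (often called a tombstone) can be sketched as below. This is a generic illustration, not the Ren-C code: the `Table` type, the fixed `SLOTS` size, the function names, and the toy first-byte hash (chosen so collisions are easy to force) are all assumptions, and resizing is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SLOTS 8 /* fixed size; this sketch never resizes */

/* A unique address used as the "deleted" marker; never dereferenced. */
static const char deleted_marker = 0;
#define DELETED (&deleted_marker)

typedef struct { const char *slots[SLOTS]; } Table;

/* Toy hash on the first byte only, so collisions are easy to force. */
static size_t hash_of(const char *s) {
    return (size_t)(unsigned char)s[0];
}

/* Probing stops only at a truly empty slot; DELETED slots are skipped
 * so that items further along the chain are still found. */
static const char *table_find(const Table *t, const char *key) {
    size_t i = hash_of(key) % SLOTS;
    for (size_t n = 0; n < SLOTS; ++n, i = (i + 1) % SLOTS) {
        const char *s = t->slots[i];
        if (s == NULL)
            return NULL;
        if (s != DELETED && strcmp(s, key) == 0)
            return s;
    }
    return NULL;
}

/* Insertion reclaims the first DELETED slot seen, after confirming the
 * key is not already present later in the chain. Returns 0 if full. */
static int table_insert(Table *t, const char *key) {
    size_t i = hash_of(key) % SLOTS;
    const char **reuse = NULL;
    for (size_t n = 0; n < SLOTS; ++n, i = (i + 1) % SLOTS) {
        const char *s = t->slots[i];
        if (s == NULL) {
            if (reuse == NULL)
                reuse = &t->slots[i];
            break;
        }
        if (s == DELETED) {
            if (reuse == NULL)
                reuse = &t->slots[i];
            continue;
        }
        if (strcmp(s, key) == 0)
            return 1; /* already present */
    }
    if (reuse == NULL)
        return 0;
    *reuse = key;
    return 1;
}

/* Removal writes DELETED instead of NULL to keep chains intact. */
static void table_remove(Table *t, const char *key) {
    size_t i = hash_of(key) % SLOTS;
    for (size_t n = 0; n < SLOTS; ++n, i = (i + 1) % SLOTS) {
        const char *s = t->slots[i];
        if (s == NULL)
            return;
        if (s != DELETED && strcmp(s, key) == 0) {
            t->slots[i] = DELETED;
            return;
        }
    }
}

/* Count DELETED slots, to observe reclamation. */
static size_t table_tombstones(const Table *t) {
    size_t count = 0;
    for (size_t i = 0; i < SLOTS; ++i)
        if (t->slots[i] == DELETED)
            ++count;
    return count;
}
```

With keys "ab", "ac" and "ad" all hashing to the same start slot, removing "ac" by writing NULL would make a later lookup of "ad" stop early and miss it. Writing the marker instead keeps the probe going, and the next insertion along that chain reuses the slot.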

@hostilefork hostilefork merged commit 2f20e4b into metaeducation:master Jun 28, 2016
@hostilefork hostilefork deleted the gc-words branch July 10, 2016 11:27