Consider encoding when deduplicating strings. #5198
CRuby's fstring cache, used for frozen string deduplication, uses
In JRuby, we use a standard JDK collection, ConcurrentHashMap,
The additional overhead of the cache is:
In the typical case, where the requested string has already been
CRuby's fstring cache, used for frozen string deduplication, uses slightly different equalitye logic than the default equality for strings. Specifically, if two strings have the same 7bit ascii bytes, but two different ascii-compatible strings, the strings are still considered to be equal. But for the fstring cache, you can register the same 7-bit string with different ascii-compatible encodings and they both live in the cache. In JRuby, we use a standard JDK collection, ConcurrentHashMap, that always uses the standard equals() method that works like normal String equality as described above. We are forced to use a wrapper, both for storage and for lookup. This patch introduces that wrapper, and also introduces a thread-local caching mechanism to reduce the cost of looking up deduplicated strings in the cache. The additional overhead of the cache is: * The wrapper object and indirecting through it. * Constructing a wrapper object (only when the previous lookup added a new wrapper or this is the first lookup). * Accessing a previously cached wrapper via a thread-local (inverse of the above conditions) In the typical case, where the requested string has already been deduplicated, the system should eventually get to a point where there's no new entries being added, and the cached wrapper is used every time. There may be more overhead at startup to create the wrappers. There may be few calls to lookup a string that do not trigger a new entry, since most language constructs (FrozenString in IR for example) save the result, making the cache perhaps unnecessary.
just a random thought - how many single byte ascii-compatible encodings are there to manage? (a few)