Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added support for hashing symbols #1753

Closed
wants to merge 1 commit into from
Closed

Added support for hashing symbols #1753

wants to merge 1 commit into from

Conversation

leidegre
Copy link
Contributor

@leidegre leidegre commented Nov 28, 2019

I ran into this error trying to insert a symbol into a Map.

TypeError: Cannot convert a Symbol value to a string

Alternatively, it is possible to use the String constructor to create a string from a symbol, assuming it has a description, otherwise it will behave as an empty string (hash collision wise).

Can we merge this or fix this some other way?

@leidegre
Copy link
Contributor Author

leidegre commented Nov 28, 2019

This is actually a bug that is introduced by minification.

The minified version changes

if (typeof o.toString === 'function') {
  return hashString(o.toString());
}

into

if ("function" == typeof t.toString)
  return B("" + t);

This expression "" + t is illegal illegal for symbols specifically, but o.toString() is not.

@conartist6
Copy link
Contributor

Hi @leidegre. Those of us who want to see immutable maintained are currently working in a fork while we wait for any further news from Lee. I'd like to incorporate a fix for your issue, but this code seems like a bad idea since it it can easily cause hash collisions. I think it will be possible to leverage the existing mechanisms to hash Symbol objects, particularly since there's no need to worry about the method of extracting hashes that is in place specifically for Internet Explorer, which does not have a native Symbol implementation.

@leidegre
Copy link
Contributor Author

@conartist6 hash collision? Only private symbols without a description cause hash collisions! I think that's acceptable.

This is a non-issue for me atm. I needed this at one point but I had to design around it but it would be great if symbols were supported in the future since they are part of the language now.

@conartist6
Copy link
Contributor

conartist6 commented Aug 23, 2020

Only private symbols without a description cause hash collisions!

OK, but that's not true. With your code Symbol.for('foo') collides with Symbol('foo') which collides with 'foo'. That sounds bad to me.

Really we want symbols to have object semantics so that two symbols have the same hash code only if they are the same symbol. We have to use the symbol primitive, because Object(Symbol.for('foo')) !== Object(Symbol.for('foo')). The current code does not do the valueOf() call as it does with objects and functions. Once we have the primitive we want to assign it a numeric ID in such a way that we can retrieve it later. The problem is where to store the ID we assign. It can't be on the symbol, because symbol primitives are not extensible. It can't be in a WeakMap because a Symbol is not valid as a WeakMap key. Right now I only see one solution left, which is to create a new cache for symbols. It can't be a WeakMap, but it could be a Map or, even better (since compatibility is a concern), a plain old object.

The only downside is that Symbols do leak this way. Once created and inserted into an immutable structure they'll never be GC'd. But I think for 99% of code this is fine. Private symbols are the area of concern, and my impression is that private symbols are generally global or module-level references, held for the life of a program. That said, it would certainly be possible to construct a leak:

let m = new Map();
while(true) {
  const s = Symbol();
  m = m.set(s, null).delete(s);
  // s is now gone from m, but the hash cash prevents the GC collecting s
}

As of yet I can't imagine any legitimate usage for that pattern.

@leidegre
Copy link
Contributor Author

leidegre commented Aug 24, 2020

OK, but that's not true. With your code Symbol.for('foo') collides with Symbol('foo') which collides with 'foo'. That sounds bad to me.

Right. It's difficult to come up with a general solution that works all the time.

Think of the cost of what you are suggesting. Immutable.js is slower than it needs to be because it tries to edge as close as possible to whimsical JavaScript semantics when all you really care about are strict value semantics. It's okay that a couple of strings hash to the same hash code. It is not okay that most things hash to the same hash code.

If you care about hash collisions you should review the hash function used by Immutable.js because it's a really bad hash function. It fails every conceivable statistical test there is for hash functions.

I would fix that before worrying about strings and symbols with the same description having the same hash code.

@conartist6
Copy link
Contributor

I take your point that there's lots that could be done to improve here. I'm interested in what it is, but for the moment I'm accepting the current implementation as a given. Given the current implementation, I don't see that what I'm proposing incurs any significant perf penalty so long as there are not an indefinite number of symbols in use as keys.

@conartist6
Copy link
Contributor

conartist6 commented Aug 24, 2020

Also even that repo says, "See e.g. A Seven-Dimensional Analysis of Hashing Methods and its Implications on Query Processing for a concise overview of the best hash table strategies, confirming that the simplest Mult hashing (bernstein, FNV*, x17, sdbm) always beat "better" hash functions (Tabulation, Murmur, Farm, ...) when used in a hash table."

@leidegre
Copy link
Contributor Author

leidegre commented Aug 25, 2020

Also even that repo says, "See e.g. A Seven-Dimensional Analysis of Hashing Methods and its Implications on Query Processing for a concise overview of the best hash table strategies, confirming that the simplest Mult hashing (bernstein, FNV*, x17, sdbm) always beat "better" hash functions (Tabulation, Murmur, Farm, ...) when used in a hash table."

@conartist6 I thought you'd point that out. Did you actually read the paper tough? Have you run the benchmarks? The thing about the HAMT is that it benefits from a good hash function. If you benchmark the results you'll note that the hash quality does make a difference and it's proportional to the size of the hash table.

I've done extensive testing on this and I'm just pointing out some stuff that I have learned implementing persistent data structures. I still use Immutable.js but I have written both a persistent vector and persistent (hash) map implementation in JavaScript that out performs Immutable.js in almost all situations with gains between 2x to 6x and as much as up to 50x in the bizarre case.

@conartist6
Copy link
Contributor

conartist6 commented Aug 25, 2020

I did not read the paper. I'm really trying not to go down a rabbit hole here. I'm have more than enough pet projects to keep me busy. So I'm sure you're right that there's a better way to hash strings, and it's great to know that you have expertise in that area.

Can we circle back around to how we got into this with Symbols though? Here's how I see it.

Hashing Symbol.prototype.description:

  • Immutable's method of hashing strings is inefficient
  • description is optional for Symbols. There's no knowing how many symbols lack descriptions in the wild.
  • Collisions are more likely

Caching an autoincrementing ID:

  • Faster as the hashing is done by the engine internally
  • Uses memory in proportion to the total number of unique symbols seen
  • In line with the autoincrementing strategy used for objects

Do I have the considerations right?

@leidegre
Copy link
Contributor Author

  • description is optional for Symbols. There's no knowing how many symbols lack descriptions in the wild.

My experience with symbols have been that they all have descriptions. I think it's a reasonable constraint.

Caching an autoincrementing ID:

I just dislike this approach altogether. I don't think objects as keys makes any sense but that's not how Immutable.js works. And these opinions that I have about how I think it should work is also why I wrote my own implementation which puts greater emphasis on value semantics.

I just limit the set of allowed keys to JavaScript primitive types (including Date). And I treat arrays as tuples, so you can do ["foo", 123, new Date()] to have composite lookup keys (these also have reasonably good performance in V8). It's things like this I find lacking in JavaScript and I think they are more useful than all this object caching stuff and valueOf nonsense.

But again, It's just my opinion. I understand perfectly well that you're probably best served by keeping most of Immutable.js as-is. I just don't agree with some of the design decision Lee took with Immutable.js.

@conartist6
Copy link
Contributor

conartist6 commented Aug 25, 2020

Yeah immutable decided to keep faith with the decisions made by TC-39 here and I don't blame Lee for that. What you're building sounds interesting too though. Is it published anywhere? The composite keys would be very neat. You're not wrong that objects as keys are generally a bad sign, and that a symbol with a description would almost always make more sense.

My goal (with iter-tools) is to abstract operations over data structures so that a library like Immutable can be smaller and more focused, and hopefully permitting more data structures libraries with different opinions to flourish and interoperate. What I really want to see happen is actually something Lee took up with TC-39 -- reverse iterators. Otherwise there can never be an efficient implementation-independent way to write a last function.

@leidegre
Copy link
Contributor Author

leidegre commented Aug 26, 2020

To be clear, I don't blame Lee. He's done a great job with Immutable.js, I simply make different tradeoffs for performance reasons.

The stuff I've written is part of a system that I cannot share atm but I can share some of the relevant parts in a gist. I use xxhash32 which I've found to be about 30% slower than the djb2/Bernstein hash function used by Immutable.js but it's not a bottleneck and has proper hash function properties which passes the smasher test suite. I use this as a basis of all my equivalence testing (again, for good HAMT hash map performance you need a uniform distributions of bits in all the bits).

@leidegre
Copy link
Contributor Author

TC-39 -- reverse iterators.

While interesting to some. This is the kind of stuff I don't really care about. I have very stringent performance requirements and all these protocols tend to be just too slow. Iterators are orders of magnitude slower than plain old for loops in JavaScript which is really a shame because it is a nice abstraction but it's a no go for me. Also stuff like async iterators and overhead from promise creation. All the garbage generated just grinds stuff to a halt. I would be much happier if I was able to exert more control over my JavaScript execution. I'm really hoping that it will be possible to use WebAssembly and JavaScript seamlessly with very little overhead to be able to get more performance out.

The performance characteristics of things matter a lot to me.

@leebyron leebyron closed this Jun 17, 2021
@leebyron leebyron deleted the branch immutable-js:master June 17, 2021 22:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants