Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

next() function #15

Closed
jugglerchris opened this issue Mar 4, 2019 · 19 comments
Closed

next() function #15

jugglerchris opened this issue Mar 4, 2019 · 19 comments

Comments

@jugglerchris
Copy link

I've started to have a play with luster to see how it might fit into another project.
I was wondering about adding some of the table functions, but notice that the next function isn't implemented yet.

Before I start poking around in Table, do you have thoughts on an approach? Replacing the FxHashMap with something like IndexMap might work (that IndexMap doesn't quite do everything needed as far as I can tell).

Thanks,

Chris

@kyren
Copy link
Owner

kyren commented Mar 4, 2019

Before I start poking around in Table, do you have thoughts on an approach?

Not yet, I need to do some research into how precisely next works in puc lua first. I don't think IndexMap will be necessary or sufficient either. Let me look into this and get back to you.

@jugglerchris
Copy link
Author

Sure, thanks! I think it must just walk hash buckets in order, but I haven't actually looked myself yet!

@uazu
Copy link

uazu commented Mar 5, 2019

It has to support deletions as you iterate through (see Lua docs for next), so it's bound to be a little bit special.

@kyren
Copy link
Owner

kyren commented Mar 6, 2019

Okay, so as suspected PUC-Lua just walks the hash buckets in order to find a next value. The guarantee about supporting deletions as you iterate through is because a rehash will never occur on deletion, which makes perfect sense.

I briefly looked for a Rust HashMap implementation that exposes the internal bucket vector and came up empty. This is probably something I should have been thinking about earlier, tbh.

There are a few things we could do:

  1. Write our own HashMap, or equivalently copy a minimum subset of an existing HashMap implementation, thus giving direct access to the bucket vector.
  2. Continue looking for a HashMap that exposes enough of its internals to find a subsequent filled bucket for a given key, or send a PR to a HashMap implementation to add that functionality.
  3. Use a really bad data structure like BTreeMap<usize, Vec<(K, V)>> to emulate a HashMap with an exposed bucket order, as a terrible temporary solution
  4. I assumed incorrectly about how IndexMap worked, so since you can apparently efficiently go from a key to an index, it does seem to provide a viable next implementation? (unless I'm missing something else here) I really don't like this solution though as it would reliably have a useful behavior (preserving insertion order and iterating properly during concurrent insertion?) which is not part of Lua and not an intended feature of luster.

I guess I'm leaning towards 1 as a solution unless somebody knows of a HashMap with the proper API exposed. What does everyone else think?

@uazu
Copy link

uazu commented Mar 6, 2019

Lua says that deletions and modifications are okay, but insertions are not okay whilst iterating, but it doesn't enforce that at the language level like a Rust implementation would. I think a Rust HashMap would prefer to have things much better defined than that, and so adding functionality to support Lua's rather looser guarantees would not be accepted. (For example, a "give me the next key after this key" call would not have an well-defined meaning for "next key" in the presence of insertions, and that seems much too loose for Rust.) So as you say, option (1) seems like what you'll end up with in order to support Lua's behaviour, unfortunately. It doesn't look straightforward to disrupt the order in option (4) to stop people counting on it.

Also, what are you doing about the array part of the table? This is a whole other weird area of Lua. I have code in C which counts on the lua_next() iterator running through the array part first (in order) before the hash part, with tests to verify that the PUC-Rio implementation still has that behaviour just in case they change it. Depending on implementation behaviour like this is not ideal, but neither is writing code that tries to reconstruct the array part from a series of random indices.

What gets into the array part is also not terribly sane or well-defined (except by looking at the code or trial and error), e.g. v = {}; v[1] = 1; v[4] = 1; print(#v); v[2] = 1; print(#v) prints 1 then 4 (i.e. it appears like v[4] started off in the hash and then was moved to the array). I guess you could just copy whatever they do for maximum compatibility.

@uazu
Copy link

uazu commented Mar 6, 2019

Just thinking some more about this. If you don't want to store the array-part of the table separate to the hash, you could just simulate the array-length #v instead, e.g. increment and decrement it in a similar way to how Lua does it. That would still agree with the docs. The next() iterator wouldn't have the array at the front, but that's not documented behaviour so people aren't supposed to rely on that.

@kyren
Copy link
Owner

kyren commented Mar 6, 2019

Lua says that deletions and modifications are okay, but insertions are not okay whilst iterating, but it doesn't enforce that at the language level like a Rust implementation would. I think a Rust HashMap would prefer to have things much better defined than that, and so adding functionality to support Lua's rather looser guarantees would not be accepted. (For example, a "give me the next key after this key" call would not have an well-defined meaning for "next key" in the presence of insertions, and that seems much too loose for Rust.)

Modification is not really a thing we need to worry about, since nothing that is mutable would change the hash of the value (strings are immutable, you can mutate a table but when used as a key only a table's address is used, same for userdata / closures).

It seems totally reasonable, if extremely specific and weird, to ask what the next value after a given value is in the hash buckets. There's no actually concurrent mutation here, in fact in luster both operations would even take place in separate RefCell locks (the inside of Table is stored inside a GcCell after all). It of course may change after an insert due to a rehash, but that's what happens in PUC-Lua too so this is fine? I don't think this really has anything to do with Rust, rather than just being a really odd thing to need from a HashMap implementation. You can even already get the information if you're willing to walk the Keys iterator fully to get it.

Also, what are you doing about the array part of the table?

I already store the array part of tables separate to the hash part and it works very similarly to what Lua does, and yeah you're right it's not trivial. That actually was one of the first things I did.

Edit: I thought a lot about Lua tables at the beginning of the project to match the semantics, especially around sizing the array part vs the hash part, but I should have thought more about iteration at that time as well. Actually I think I did think about it but any answer that involved writing my own HashMap seemed too unreasonable and I guess I was hoping a better solution would magically present itself.

@uazu
Copy link

uazu commented Mar 6, 2019

Yes, agreed. As you say it's "extremely specific and weird", and not really a Rust-like interface, but it's not a safety problem for someone to expose that interface on their HashMap.

@uazu
Copy link

uazu commented Mar 6, 2019

Deletions wouldn't work with an IndexMap either. The last item is moved to fill the gap, messing up the indices. I can't see any easy solution yet.

@kyren
Copy link
Owner

kyren commented Mar 6, 2019

Deletions wouldn't work with an IndexMap either. The last item is moved to fill the gap, messing up the indices. I can't see any easy solution yet.

That's a good point, I didn't think about that. PUC-Lua doesn't remove keys with nil values from the table until an insert takes place, but since we wouldn't want to mimic that behavior, this would be a problem. That's fine, I thought that was a poor solution anyway.

It looks like implementing our own HashMap might be the way to go :/

@jugglerchris
Copy link
Author

Mostly for fun, and because I thought it'd be easier than understanding other hash maps well enough to modify, I've written a very naive hash map which has the required next function: https://github.com/jugglerchris/luster/tree/hash_next

By naive I mean it's the simplest possible hash map I could think of, so:

  • No shrinking (on purpose, but should probably consider doing so on insert if many items have been removed)
  • Linear probing on collisions
  • Uses Vec for the buckets
  • No doubt has bugs

I implemented enough to replace std::collections::HashMap in Table and get cargo test passing, but haven't yet used it to add next() to Table itself (I don't think it would take long). I'm happy to do a bit of clean up and start a PR if you want it as a stopgap until someone comes up with a decent one. :-)

@kyren
Copy link
Owner

kyren commented Mar 10, 2019

Ahhhh, you beat me to it, nice work!

I'm not an expert at hash map implementations at all, but I was thinking of just porting one that I had written previously in C++ to Rust, after checking std::collections::HashMap for additional ideas.

The hash map you wrote is not that far off from the one I was going to port, except:

  1. There's no difference between Empty and Tombstone that I can see
  2. You don't do robin hood bucket stealing
  3. Finding non-existent keys seems pathalogical, like it loops around the entire bucket vector before returning None? You should check the bucket error for the iterated key against the error for the current key and bail early (assuming you also do bucket stealing).
  4. It's missing some other tricks which I don't know how they translate to Rust exactly, like the end marker trick. I don't know how important that really is.

This feels deeply wrong, just copying an existing HashMap implementation, but honestly I think the best way to handle this is probably to do neither of these things and just minimally copy the one from the Rust stdlib... depending on how complex that is. I'm not super sure yet, if a minimal subset of std::collections::HashMap is thousands of lines of code, maybe the answer is not to do that?

I think maybe looking closely at std::collections::HashMap is the right next step either way. I'd like to spend the time to do a decent HashMap implementation now rather than later, it should at least not be much worse than the one in std.

@jugglerchris
Copy link
Author

Some comments:
1 and 3: Oops, that's a late-night bug. It's meant to bail out of the find() when it finds Empty, but not Tombstone (which could indicate something in between has been removed). I've pushed that fix since it's trivial. The Tombstones are needed to maintain the iteration order on deletions.
2. That's true (though I think easy to add?)
4. That's true too.

But I agree starting from a production HashMap makes sense. Using the stdlib one means you get a choice of two, though, as they're in the middle of replacing it with hashbrown: rust-lang/rust#58623 .

@kyren
Copy link
Owner

kyren commented Mar 11, 2019

1 and 3: Oops, that's a late-night bug. It's meant to bail out of the find() when it finds Empty, but not Tombstone (which could indicate something in between has been removed). I've pushed that fix

Oh okay that makes more sense.

I still think we should look at how complex the original std::collections::HashMap is and just copy a minimal subset of its implementation, for now. I'm guessing that it's much simpler than hashbrown's one, and it's extremely well tested and is of respectable performance. Later, we can either update it to a faster implementation or beg somebody who maintains an external HashMap for the right APIs. If hashbrown's HashMap is relatively simple we could also copy it instead?

I've been wracking my brain trying to figure out a way to implement this that is not really awful and doesn't require maintaining our own HashMap implementation, but I still haven't come up with anything.

@kyren
Copy link
Owner

kyren commented Mar 11, 2019

Maybe IndexMap + delay removes until the next insert is the right solution. IndexMap is respectably fast, at least.. I think PUC-Lua plays some garbage collector games with dead indexes though so that they're less bad.

Ugh, there are no good solutions here other than maintaining a whole bespoke HashMap :(

@jugglerchris
Copy link
Author

I've had a bit more time to play...

This time I've gone for adding a next_after(key) to a fork of hashbrown, which is slated to become the stdlib HashMap. I've still only lightly tested it, particularly the "no reordering under removals" bit - it didn't look like it would, but I'm not very confident yet.

Based on that I've implemented the Lua next function in Luster at https://github.com/jugglerchris/luster/tree/hash_next_hashbrown and tested it on trivial examples. It doesn't work directly (a problem with mutiple return values - I'll raise an issue shortly) but works fine appears to work fine in a for loop using the obvious implementation of pairs (see tests/running/table.lua).

@jugglerchris
Copy link
Author

On a related note, how do you feel about including some of the stdlib implemented in Lua? pairs is a one-liner that way. :-)

@kyren
Copy link
Owner

kyren commented May 15, 2023

Well, in the ridiculous amount of time I paused this project, hashbrown has published a raw table API that solves all of my problems 😆

Now solving this is (relatively) straightforward!

@kyren
Copy link
Owner

kyren commented May 15, 2023

aaaaand done, by 12fdd6a

@kyren kyren closed this as completed May 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants