# Rust Hour

As is tradition in the Rust community, this is a rewrite of a project originally written in a different language. As is also tradition, the rewrite instantly yielded a 10x performance improvement, before any deliberate optimization. After optimization, this program can solve all 35 test cases, and output the moves required to reach each solution, in around 10ms. It can solve all puzzles 100x in a row in less than half a second, or a little over a second if it must also keep track of its path. This is the result of various coding choices, but in no small part it is also thanks to the Rust compiler's aggressive optimization.

Our initial approach was a human solution: look at which cars block the way to the exit, check what in turn blocks those cars from moving, and so on recursively, building a tree-like structure. We proceeded to learn the bitter lesson: it was massively more efficient to just use a basic graph traversal algorithm and "dumbly" stumble our way through the solution space. In fact, even using A* here is probably bitter, since it relies on human notions of what makes a good move.

As with the Java version, our choice of graph traversal algorithm turned out to be less important than other decisions. Depth-first search performs better than breadth-first search, and A* with a basic heuristic lands in the middle. However, this is the difference between checking ~100k board states and ~70k: significant, but not decisive.

## Performance

An initial performance optimization, which we didn't realize would be so impactful at the time, was the decision to calculate and store a 6x6 array of chars (technically u8s/bytes) representing the board at construction time, instead of iterating through the list of cars whenever we needed to check collisions or anything else. We also ensured that our structs were as small and simple as possible.

One of the relevant performance decisions was how we kept track of which states we had visited. Without this, the program gets stuck in loops and takes a very long time to terminate. We add tens of thousands of board states to the visited set, and perform many times that number of containment checks. The obvious choice is a HashSet, which supports (theoretically) O(1) insertion and .contains() checking. This worked, but upon profiling our executable with valgrind's callgrind, we discovered that the majority of our program's runtime was being consumed by hashing operations. Our initial response was to switch to a BTreeSet, since those offer more consistent performance in theory, but a BTreeSet requires ordering, and making that many comparisons actually slowed the program down tenfold.

Upon further reading, we found that Rust's default hashing algorithm is SipHash, which is cryptographically secure and resistant to multiple forms of hash attack, but isn't as fast as the alternatives. Since cryptography was of no concern to this program, we looked for a better fit and found it in rustc_hash. This crate (Rust's term for an importable third-party module, like you might find on Python's pip) makes no claims about DoS resistance or cryptographic soundness, which is fine for our purposes; it is the hasher used internally by the Rust compiler, and it is optimized for maximum speed. Perfect. This sped up execution significantly, and until further optimizations were made, it meant the hashing function was no longer a major component of the runtime.

Another performance optimization was to store only the char-array representation of the board in the visited set, reducing calls to Board.clone() by ~100k per run. This helped significantly. Another was to store the chars as bytes (u8 in Rust terms) instead of Rust's default char, which is four bytes long to accommodate Unicode. You can also set the RUSTFLAGS environment variable so that the compiler autovectorizes your code whenever possible, using SIMD instructions to operate on multiple pieces of data in parallel without threading.
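One common way to enable that (an assumption about this project's setup, but a standard rustc flag) is to target the build machine's own CPU, which lets LLVM emit whatever SSE/AVX instructions it supports:

```shell
# Optimize for the CPU doing the compiling; the binary may not run elsewhere.
RUSTFLAGS="-C target-cpu=native" cargo build --release
```
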

The fastest configuration we have seen is plain DFS with a Vec operating as a stack, without calculating any heuristics whatsoever, using unsafe Rust, the mutate-clone-unmutate move generator, variable-length moves, and SSE optimizations. This did 100 iterations of all 35 test boards in 0.47 seconds.

When using map() instead of unsafe Rust, the time is roughly 30% slower. Using A* and calculating heuristics, the time nearly doubles again. Checking this with valgrind, we noticed that roughly the same number of nodes are added to the visited set, but more than double the number of calls to get_moves() and .contains() are made, which tracks with an inferior algorithm. Breadth-first search is the worst of them all: with a VecDeque as our queue, using push_back and pop_front, it took even longer than A*. That makes it an obvious non-starter here, though if we were optimizing for number of moves, it would be ideal.

Another non-optimization was the attempt to increase laziness by working with generators/coroutines, a feature common in languages such as Python. Theoretically, this would slightly improve runtime by not calculating the byte representation of a board until it is actually needed. In practice, however, generators require unstable features only available in the nightly compiler, precluding their use in the project on CSIL, which does not have the nightly toolchain installed. The state machine implementation also carries some overhead, which would offset the small performance gains otherwise expected. Finally, the syntax required to store and use generators (which do not have a compile-time-determinable size when they're a member of a struct, at least as far as my asynchronous Rust knowledge extends) is complex, requiring close knowledge of Rust's complicated async/await feature. Due to the above, this optimization was scrapped.

One profiling tool that was helpful in the development process was the flamegraph, available as a cargo subcommand (cargo flamegraph, installed via cargo install flamegraph). It generates a visualization that lets you discern where the hot path is and what is actually taking up the runtime.

All of this runs fully multi-threaded. Unless your computer has more than 35 hardware threads, every core will be utilized to its fullest potential when running the test cases, as each puzzle gets its own thread.

We acknowledge that this is running on a pretty fast computer; hence our stress-testing benchmark of running the most difficult puzzle 100 times in a row on a single core. Since that completes in less than half a second, it should run fine on your machine regardless of its age or speed (within reasonable limits). For reference, when testing on CSIL, the torture test takes about 15 seconds and seems to run across all 8 provisioned threads. (The CPU has 24 threads in total, so it's presumably split into thirds. It's also a CPU from 2013.)

## Reality check

Honestly, the performance barely matters. Practically nothing we did made the runtime untenable. We made a null hash function, and it "only" increased the runtime by an order of magnitude, to a few seconds. Disabling the visited set entirely was similar. My first implementation sorted a vector every time the hash() function was called; we didn't notice, because it was still under half a second. Running without any rustc optimizations still terminates within a few seconds.

It got to the point where we were concerned the algorithm was cheating. So we checked through all the solved boards. They seemed to be in order, were all valid board states, had the same number and arrangement of pieces as the puzzle boards, and agreed (minus a few differing moves) with the Java solutions and the ones provided to us. The most recent move was also clear: there was space directly behind the X car that it had evidently just moved out of, which tracks with expected behaviour. That wasn't enough, so we stepped through the solution-finding process for an entire board. This took forever. We didn't see a single illegal move.

Looking online, some other people have done this before. This person solved every possible 6x6 Rush Hour board in minutes. That's billions of boards. He also exported some very nice-looking diagrams of the "clusters", or the completely explored graphs, consisting of every state you can get to from one given state. Each cluster had hundreds to thousands of states, but that's not very many to traverse.

We checked the number of stored states contained in visited for each board. Depending on the algorithm and some internal structure, this varied from a few hundred to a few thousand. Again, not very many for a fast computer to traverse.

## Build instructions

You will have to compile the code yourself. No binaries are distributed with this software, because rustc optimizes for the specific CPU it is being run on to ensure maximum performance. If the compiler includes instructions such as AVX-512 or FMA optimizations in our binary that don't run on your machine, it will crash; hence the need for separate compilation. If you wish to install the Rust toolchain on your computer, there are detailed instructions available on the Rust website.

SFU's CSIL has a reasonably up-to-date copy of the Rust toolchain, so you may copy the project folder there and run cargo test --release from the rust-hour directory, which will recursively build all the dependencies (namely rust-core, Rust's standard library (for Vec), rustc_hash (for the hash sets), and all of their dependencies). It will automatically run the stress test (i.e. 100 iterations of each puzzle). If you would like to solve each puzzle just once, run cargo test solve_all --release instead. There are various environment variables you can set to enable SSE enhancements, but they are unnecessary.

The build process may take a while (2 seconds on my machine, 30 on CSIL). rustc is powerful, but it uses LLVM internally (like clang), so it is not as fast as some would like. You may also run cargo test, which doesn't optimize and leaves debug symbols in. This builds faster but runs ~20x slower: on one of our machines, 100 iterations takes about 17 seconds in dev mode. CSIL seems to be roughly 10x slower in general. After the first compilation, though, the project doesn't recompile on every run: cargo test or cargo test --release will run immediately without rebuilding. To force recompilation, run cargo clean.
