
skip global hashtable (N=15 in 5h13) #28

Merged: 13 commits into mikepound:main, Jul 24, 2023

Conversation

@NailLegProcessorDivide commented Jul 21, 2023

Based on the previous patch set.
Uses my understanding of the algorithm from #11 by @presseyt.
Some nasty implementation details remain (N=16 max because it still uses the point-list datatype for a quicker implementation, plus lots of rotation and deduplication all over the place).

Essentially perfectly parallelisable, as cores take a sub-polycube of order n and return the number of polycubes for which it is the canonical ancestor.

A bit slow (around 3x slower than before) at determining what is and isn't a canonical child of the polycube currently being explored.
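
For illustration, a minimal, self-contained sketch of that work-splitting shape using plain std threads; count_descendants is a hypothetical placeholder for the real per-seed enumeration, not this PR's actual code:

use std::thread;

// Hypothetical stand-in for the real per-seed work: count every order-N
// polycube whose canonical ancestor is `seed`. Returns a dummy value so
// the sketch actually runs.
fn count_descendants(seed: u64, _target_n: usize) -> u64 {
    seed % 3 + 1
}

fn main() {
    // e.g. the 166 order-6 seeds mentioned later in this thread
    let seeds: Vec<u64> = (0..166).collect();
    let n_threads = 8;
    let chunk_len = (seeds.len() + n_threads - 1) / n_threads;

    // Each thread sums descendant counts for its own chunk of seeds;
    // per-seed counts are independent, so no shared state is needed.
    let handles: Vec<_> = seeds
        .chunks(chunk_len)
        .map(|chunk| {
            let chunk = chunk.to_vec();
            thread::spawn(move || {
                chunk.iter().map(|&s| count_descendants(s, 15)).sum::<u64>()
            })
        })
        .collect();

    let total: u64 = handles.into_iter().map(|h| h.join().unwrap()).sum();
    println!("total: {total}");
}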

@NailLegProcessorDivide NailLegProcessorDivide marked this pull request as draft July 22, 2023 00:27
@datdenkikniet (Contributor) commented Jul 22, 2023

Oooo, cool! Would you mind rebasing on main or something? Getting an overview of the changes from the GitHub GUI is a bit... awful xD The actual changes aren't as sweeping as the PR page makes it seem.

@NailLegProcessorDivide (Author)

I still don't really like the code (style), but the performance is a good chunk better than it was (3h15 for N=15). Most things I've tried don't work, and profiling the optimised build just shows 73% of the time in is_canonical_root, with a few branches that each contribute multiple (2-6) percent of runtime.

Canonicalizing mirror pairs and double counting seemed to work, but was ~10-20% slower every way I tried it; I could be making a mistake, though.

I'm not an expert in vectorization and GPGPU stuff, but it doesn't look to me like it would be efficient here.

A fun idea this algorithm could lend itself to is a folding@home-style work distribution server, with multiple worker clients processing subtrees and reporting the sub-counts back, but I don't know how complex or worthwhile that would be.

@NailLegProcessorDivide (Author)

When running, it divides work based on the seed polycubes in a previous cache file, and needs cache files enabled for parallelism to work (otherwise it runs one thread building off N=1).
For N=15 I used an N=13 cache file, but any file over N=7 should work somewhat OK.

@datdenkikniet (Contributor)

What tool do you use for profiling, by the way? I can highly recommend flamegraph: it gives you a nice, hover-able SVG you can view in Firefox :)

@NailLegProcessorDivide (Author)

I've been using perf + flamegraph + perf annotate, but I assume that because of the release build and inlining, it thinks it's mostly one giant function with 73% of the runtime.

@datdenkikniet (Contributor) commented Jul 22, 2023

[image: flamegraph for N = 10]

This is what I'm seeing for N = 10: quite a bit of time spent in slice equality checks (which kinda makes sense; that's a pretty "expensive" thing to do given how efficient everything else is).

@NailLegProcessorDivide (Author) commented Jul 22, 2023

Interestingly, mine looks different now that I've switched to cargo flamegraph:

[image: Screenshot_20230722_161242]

Do you know if yours could have been a non-release build? It seems to have a lot of smaller stack frames and bits of iterator I'd expect to be optimized out, and contains should be very vectorisable. I know your CPUs are older, but I haven't looked at the exact ISA support.

@datdenkikniet (Contributor) commented Jul 22, 2023

I compile & run this with

cargo rustc --release --bin opencubes -- -C target-cpu=native
# Need to sudo because I can't get setting capabilities to work
sudo ~/.cargo/bin/flamegraph -- ./target/release/opencubes enumerate -cm hashless 10

Your flamegraph seems to be for running the opti_bit_set mode, so that would explain the difference :P I also think none of the CPUs available to me have AVX512, while yours definitely seems to have that, cool!

I have uploaded the full flamegraph. I think the iterator bits are all "virtual", in the sense that they're inlined, but we still see them in the flamegraph because the debug info says that's what's being executed in the inlined/optimized bits.

I'm on linux, and if you're not that may also make a difference to the flamegraph output.

@NailLegProcessorDivide (Author)

You're correct, my bad XD; it would help if I opened the flamegraph from the correct folder after generating it.
I see now.

Is there a way to make CubeMapPos generic over the array length, so that CubeMapPos<16> would be what we have now and CubeMapPos<32> would have 32 array elements? I know it would be doable in C++, but I haven't seen any hint that it is in Rust after a bit of looking.

@datdenkikniet (Contributor)

Yes, absolutely! You can have const generics that do exactly this. Unfortunately const-generic math isn't supported on stable yet, so it's somewhat limited, but it may/should be plenty for this case.
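
For illustration, a minimal sketch of the declaration side; the cubes field name and u16 element type here are assumptions, not the crate's actual layout:

// The array length as a const generic parameter.
pub struct CubeMapPos<const N: usize> {
    pub cubes: [u16; N], // hypothetical field name/element type
}

// CubeMapPos<16> matches the current fixed size;
// CubeMapPos<32> simply doubles the capacity.
pub type Pos16 = CubeMapPos<16>;
pub type Pos32 = CubeMapPos<32>;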

@NailLegProcessorDivide (Author)

Thanks!

@datdenkikniet (Contributor) commented Jul 22, 2023

Since adding generic bounds to non-instance methods can be annoying to read/type, and causes a lot of repetition of the const generic, I want to point out that const generics apply within entire impl blocks:

impl<const T: usize> CubeMapPos<T> {

    // `&self` and `&mut Self` mean you can call
    // `some_cube_map_pos.xy_rots_points(dims, count, &mut some_other_cube_map_pos);`,
    // so long as some_cube_map_pos and some_other_cube_map_pos have the
    // same type. The type also includes (const) generics.
    pub fn xy_rots_points(&self, shape: &Dim, count: usize, res: &mut Self) {
        // All code goes here. You can refer to T and it will be the
        // outer T.
    }

    // More functions that rely on T, or not.
}

Especially since you already have a bunch of methods whose first argument is a receiver, this should make the code easier to read and, hopefully, write.

@datdenkikniet (Contributor) commented Jul 22, 2023

Also, the way to get this to run properly in parallel is basically to:

  1. Generate the first cache file that has more entries than your machine has cores
  2. Run with parallelism

Right?

Though ATM it doesn't appear to produce correct results for that.

Edit: ah, load_cache doesn't actually canonicalize anything, and I don't think that would be the right format for gen_polycubes anyway... We should probably fix that.

@NailLegProcessorDivide (Author)

I had noticed it sometimes gave wrong numbers from some pcube files, but I never debugged it. Adding a sort at the bottom of From<&RawPCube> should solve it, I expect, as a few of my optimisations require the point list to be sorted.

For parallelism, yes, although for maximum speed I think it's more a balance between many threads locking and spending time on the progress bar / adding to the global total, versus the non-uniformity of the amount of work each parent polycube takes to find all its children. So something like N = 6 with 166 seeds might be suboptimal for 64 cores, because that's only ~2.5 jobs per core, of varying size.

I'm currently running N=15 again, and it looks like it should take a bit less than 2 hours on my machine, which would put an estimate for N=17 at under 5 days.

@datdenkikniet (Contributor) commented Jul 22, 2023

Hmmm, made an interesting observation:

The impl<const N: usize> From<RawPCube> for CubeMapPos<N> calls itself recursively in debug mode, and just hangs in release mode. The correct way to call it is (&value).into() or Self::from(&value).

I'm guessing it's never detected by the compiler, but the problem is that Into is auto-implemented from the From impl, so you need to be quite explicit that you want to convert the reference instead.
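
For illustration, a hypothetical reconstruction of that trap, with the types simplified and the conversions elided (not the crate's actual code):

// Hypothetical, simplified types.
struct RawPCube;
struct CubeMapPos<const N: usize>;

impl<const N: usize> From<&RawPCube> for CubeMapPos<N> {
    fn from(_src: &RawPCube) -> Self {
        CubeMapPos // the real conversion would go here
    }
}

impl<const N: usize> From<RawPCube> for CubeMapPos<N> {
    fn from(value: RawPCube) -> Self {
        // BUG: `value.into()` would resolve through the blanket `Into`
        // impl derived from *this* `From<RawPCube>` impl, so the
        // function would call itself forever.

        // Fix: name the reference conversion explicitly.
        Self::from(&value)
    }
}

fn main() {
    let cube: CubeMapPos<16> = RawPCube.into();
    let _ = cube;
}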

@NailLegProcessorDivide (Author)

I'll see if I can test and fix that when my code is done running.

I should probably add some basic tests (at least for the stuff in rotations and polycube_reps) and checks that the values at the end of the enumerate calls are correct.

Do you have any suggestion for how to set the first cache file it tries to load, by the way? Deleting and recreating the files as needed is pretty annoying.

@datdenkikniet (Contributor) commented Jul 22, 2023

I think we could just run one of the different algos capable of producing a set of seeds: start with a NaivePolyCube of size 1 and call unique_expansions a few times till we have enough (enough = some amount above the current level of concurrency, or configurable by a flag to get enough seeds for that folding@home idea!), and use that as the seed base. Which impl doesn't matter; I'm just most familiar with the one I wrote.

I don't think using the cache files for this impl makes a lot of sense, since generating the initial set can be done so easily.

Also, your suggestion of loading them + sorting them seems to do the trick :) N = 13 when starting from N = 6 in 03:44, using a whopping 5 MB of memory at most. Not bad!

Load + sort at the start:
    let mut canonicalized: Vec<_> = current
        .into_iter()
        .map(|v| {
            let dims = v.dims();
            let map = CubeMapPos::<32>::from(v);
            let dims = Dim {
                x: dims.0 as usize - 1,
                y: dims.1 as usize - 1,
                z: dims.2 as usize - 1,
            };
            to_min_rot_points(&map, &dims, calculate_from - 1)
        })
        .collect();

    canonicalized.sort();

@NailLegProcessorDivide (Author)

With the sorting fix: I think there might be rotation issues as well, if the definition of canonical orientation wasn't the same, but I'm not sure.

The latest code version counted N=15 in 1h52, which is pretty good: -60% from last night. Now I think most of the remaining performance to be gained is in sorting results faster, and outside of using the matrices as const generics and adding special cases where the order is predictable, I don't have many ideas.

@datdenkikniet (Contributor) commented Jul 22, 2023

AFAICT there's practically no way of getting around sorting plus canonicalizing for the specific type you want, but luckily that isn't very expensive if you have a list of all unique polycubes.

Dayum, not bad! I don't think there is a lot that can be done in the way of optimization, unless we want to "port" memchr (the underlying implementation of [u8]::contains) to support arrays instead of slices. (I think that might help because you can completely avoid doing any index checking?)

Edit: oh, actually, since we're running contains on a slice of u16s, we might not get an equally SIMD-accelerated version... It could be worth trying to switch to u8s instead and seeing if that gives any improvement. That does look like a bit of a painful rewrite, though.

Edit edit: okay, never mind, that doesn't look like it makes sense, since then, if I understand correctly, we can only compute up to N = 8? At any rate, manually SIMD-ing the contains might help. For reference: godbolt shows that the contains loop is unrolled, but not SIMD-d.

Edit edit edit: This is getting out of hand. I don't think we can out-optimize the compiler on this; IDK why I thought we could xD

@datdenkikniet (Contributor)

Also, if you want, I can do a review of the code and try to give some general pointers / readability feedback.

@NailLegProcessorDivide (Author)

Any review / feedback would be appreciated.

Comment on lines +466 to +467:

time / 1000000,
time % 1000000

@datdenkikniet (Contributor)

I really like this formatting! Given that it's being done relatively often around the crate, we should have some fmt_duration(&Duration) somewhere.
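
For illustration, a sketch of what such a helper could look like, assuming the time in the snippet is a microsecond count; the name fmt_duration comes from the comment above, while the output format here is a guess:

use std::time::Duration;

// Hypothetical helper: render a Duration as "seconds.microseconds",
// mirroring the `time / 1000000, time % 1000000` split above.
fn fmt_duration(d: &Duration) -> String {
    let micros = d.as_micros();
    format!("{}.{:06}", micros / 1_000_000, micros % 1_000_000)
}

fn main() {
    let d = Duration::from_micros(5_013_042);
    assert_eq!(fmt_duration(&d), "5.013042");
}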

Comment on rust/src/rotations.rs:
let x = dim.x;
let y = dim.y;
let z = dim.z;
let (x_col, y_col, z_col, rdim) = if x >= y && y >= z {
@datdenkikniet (Contributor)

I think stuff like this would benefit quite a lot from a little macro_rules, or at least a use MatrixCol::* within the function, so you don't have to spell that out all the time.

@datdenkikniet (Contributor) commented Jul 22, 2023

Correction: macro_rules would probably not actually improve readability here. use MatrixCol::* would, though!

fn renormalize(..) {
    use MatrixCol::*;

    ...
        (XP, YP, ZP, Dim {x: x, y: y, z: z})
    ..
}

@NailLegProcessorDivide (Author)

Is there a way to pass some sort of pcube writer that can have pcubes inserted over time, rather than taking an iterator all at once?
For now I have disabled writing pcube files from the point-list, as Compression is a part of cli.rs.
Other than that, I feel like this is in a good place to merge, as long as @datdenkikniet and @bertie2 are also happy with it.

@NailLegProcessorDivide NailLegProcessorDivide marked this pull request as ready for review July 23, 2023 16:23
@datdenkikniet (Contributor) commented Jul 24, 2023

LGTM, I think we can merge this.

It will need some more rust-ification and some docs, but that can be solved later.

> Is there a way to pass some sort of pcube writer that can have pcubes inserted over time rather than taking an iterator all at once?

That depends on what you mean by "inserted over time". You can spawn a separate thread that produces new pcubes and inserts them into a channel, and then you can iterate over the items using Receiver::into_iter.

There is no requirement that all items are "known" at the moment you create your iterator, though, so I think you'll need to be a bit more specific. The PCubeFile iterator also returns items "over time", only reading from the file when the next cube is requested.

Generally you represent that exact idea with an iterator: all you need to do is guarantee the (very basic) Iterator trait, and that's it. All of the details can be hidden nicely.
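
For illustration, a minimal sketch of that channel-backed idea, with a placeholder u64 standing in for the real pcube type:

use std::sync::mpsc;
use std::thread;

fn main() {
    let (tx, rx) = mpsc::channel::<u64>();

    // Producer: inserts "pcubes" into the channel over time.
    thread::spawn(move || {
        for cube in 0..10u64 {
            tx.send(cube).unwrap();
        }
        // Dropping `tx` here ends the consumer's iteration below.
    });

    // Consumer: a Receiver is an IntoIterator, so the writer can treat
    // it as a plain iterator that yields items as they arrive.
    for cube in rx {
        println!("writing pcube {cube}");
    }
}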

@bertie2 bertie2 merged commit 73eaa07 into mikepound:main Jul 24, 2023