
skip global hashtable (N=15 in 5h13) #28

Merged: 13 commits into mikepound:main, Jul 24, 2023

Conversation

@NailLegProcessorDivide commented Jul 21, 2023

Based on the previous patch set.
Uses my understanding of the algorithm from #11 by @presseyt.
Some nasty implementation details remain (N=16 max because it still uses the point-list datatype for a quicker implementation, plus lots of rotation and deduplication all over the place).

Essentially perfectly parallelisable, as cores take a sub-polycube of order n and return the number of polycubes for which it is the canonical ancestor.

A bit slow (around 3x slower than before) at determining what is and isn't a canonical child of the polycube currently being explored.
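
For illustration, a minimal, self-contained sketch of that work-splitting shape using plain std threads; count_descendants is a hypothetical placeholder for the real per-seed enumeration, not this PR's actual code:

use std::thread;

// Hypothetical stand-in for the real per-seed work: count every order-N
// polycube whose canonical ancestor is `seed`. Returns a dummy value so
// the sketch actually runs.
fn count_descendants(seed: u64, _target_n: usize) -> u64 {
    seed % 3 + 1
}

fn main() {
    // e.g. the 166 order-6 seeds mentioned later in this thread
    let seeds: Vec<u64> = (0..166).collect();
    let n_threads = 8;
    let chunk_len = (seeds.len() + n_threads - 1) / n_threads;

    // Each thread sums descendant counts for its own chunk of seeds;
    // per-seed counts are independent, so no shared state is needed.
    let handles: Vec<_> = seeds
        .chunks(chunk_len)
        .map(|chunk| {
            let chunk = chunk.to_vec();
            thread::spawn(move || {
                chunk.iter().map(|&s| count_descendants(s, 15)).sum::<u64>()
            })
        })
        .collect();

    let total: u64 = handles.into_iter().map(|h| h.join().unwrap()).sum();
    println!("total: {total}");
}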

@NailLegProcessorDivide NailLegProcessorDivide marked this pull request as draft July 22, 2023 00:27
@datdenkikniet (Contributor) commented Jul 22, 2023

Oooo, cool! Would you mind rebasing on main or something? Getting an overview of the changes from the GitHub GUI is a bit... awful xD The actual changes aren't as sweeping as the PR page makes it seem.

@NailLegProcessorDivide (Author)

I still don't really like the code (style), but the performance is a good chunk better than it was (3h15 for N=15). Most things I've tried don't work, and profiling the optimised build just shows 73% of the time in is_canonical_root, with a few branches that each contribute multiple (2-6) percent of runtime.

Canonicalizing mirror pairs and double counting seemed to work, but was ~10-20% slower every way I tried it; I could be making a mistake, though.

I'm not an expert in vectorization and GPGPU stuff, but it doesn't look to me like it would be efficient here.

A fun idea this algorithm could lend itself to is a folding@home-style work distribution server, with multiple worker clients processing subtrees and reporting the sub-counts back, but I don't know how complex or worthwhile that would be.

@NailLegProcessorDivide (Author)

When running, it divides work based on the seed polycubes in a previous cache file, and needs cache files enabled for parallelism to work (otherwise it runs one thread building off N=1).
For N=15 I used an N=13 cache file, but any file over N=7 should work somewhat OK.

@datdenkikniet (Contributor)

What tool do you use for profiling, by the way? I can highly recommend flamegraph: it gives you a nice, hover-able SVG you can view in Firefox :)

@NailLegProcessorDivide (Author)

I've been using perf + flamegraph + perf annotate, but I assume that because of the release build and inlining, it thinks it's mostly one giant function with 73% of the runtime.

@datdenkikniet (Contributor) commented Jul 22, 2023

[image: flamegraph for N = 10]

This is what I'm seeing for N = 10: quite a bit of time spent in slice equality checks (which kinda makes sense; that's a pretty "expensive" thing to do given how efficient everything else is).

@NailLegProcessorDivide (Author) commented Jul 22, 2023

Interestingly, mine looks different now that I've switched to cargo flamegraph:

[image: Screenshot_20230722_161242]

Do you know if yours could have been a non-release build? It seems to have a lot of smaller stack frames and bits of iterator I'd expect to be optimized out, and contains should be very vectorisable. I know your CPUs are older, but I haven't looked at the exact ISA support.

@datdenkikniet (Contributor) commented Jul 22, 2023

I compile & run this with

cargo rustc --release --bin opencubes -- -C target-cpu=native
# Need to sudo because I can't get setting capabilities to work
sudo ~/.cargo/bin/flamegraph -- ./target/release/opencubes enumerate -cm hashless 10

Your flamegraph seems to be for running the opti_bit_set mode, so that would explain the difference :P I also think none of the CPUs available to me have AVX512, while yours definitely seems to have that, cool!

I have uploaded the full flamegraph. I think the iterator bits are all "virtual", in the sense that they're inlined, but we still see them in the flamegraph because the debug info says that's what's being executed in the inlined/optimized bits.

I'm on linux, and if you're not that may also make a difference to the flamegraph output.

@NailLegProcessorDivide (Author)

You're correct, my bad XD; it would help if I opened the flamegraph from the correct folder after generating it.
I see now.

Is there a way to make CubeMapPos generic over the array length, so that CubeMapPos<16> would be what we have now and CubeMapPos<32> would have 32 array elements? I know it would be doable in C++, but I haven't seen any hint that it is in Rust after a bit of looking.

@datdenkikniet (Contributor)

Yes, absolutely! You can have const generics that do exactly this. Unfortunately const-generic math isn't supported on stable yet, so it's somewhat limited, but it may/should be plenty for this case.
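
For illustration, a minimal sketch of the declaration side; the cubes field name and u16 element type here are assumptions, not the crate's actual layout:

// The array length as a const generic parameter.
pub struct CubeMapPos<const N: usize> {
    pub cubes: [u16; N], // hypothetical field name/element type
}

// CubeMapPos<16> matches the current fixed size;
// CubeMapPos<32> simply doubles the capacity.
pub type Pos16 = CubeMapPos<16>;
pub type Pos32 = CubeMapPos<32>;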

@NailLegProcessorDivide (Author)

Thanks!

@datdenkikniet (Contributor) commented Jul 22, 2023

Since adding generic bounds to non-instance methods can be annoying to read/type, and causes a lot of repetition of the const generic, I want to point out that const generics apply within entire impl blocks:

impl<const T: usize> CubeMapPos<T> {

    // `&self` and `&mut Self` mean you can call
    // `some_cube_map_pos.xy_rots_points(dims, count, &mut some_other_cube_map_pos);`,
    // so long as some_cube_map_pos and some_other_cube_map_pos have the
    // same type. The type also includes (const) generics.
    pub fn xy_rots_points(&self, shape: &Dim, count: usize, res: &mut Self) {
        // All code goes here. You can refer to T and it will be the
        // outer T.
    }

    // More functions that rely on T, or not.
}

Especially since you already have a bunch of methods whose first argument is a receiver, this should make the code easier to read and, hopefully, write.

@datdenkikniet (Contributor) commented Jul 22, 2023

Also, the way to get this to run properly in parallel is basically to:

  1. Generate the first cache file that has more entries than your machine has cores
  2. Run with parallelism

Right?

Though ATM it doesn't appear to produce correct results for that.

Edit: ah, load_cache doesn't actually canonicalize anything, and I don't think that would be the right format for gen_polycubes anyway... We should probably fix that.

@NailLegProcessorDivide (Author)

I had noticed it sometimes gave wrong numbers from some pcube files, but I never debugged it. Adding a sort at the bottom of From<&RawPCube> should solve it, I expect, as a few of my optimisations require the point list to be sorted.

For parallelism, yes, although for maximum speed I think it's more a balance between many threads locking and spending time on the progress bar / adding to the global total, versus the non-uniformity of the amount of work each parent polycube takes to find all its children. So something like N = 6 with 166 seeds might be suboptimal for 64 cores, because that's only ~2.5 jobs per core, of varying size.

I'm currently running N=15 again, and it looks like it should take a bit less than 2 hours on my machine, which would put an estimate for N=17 at under 5 days.

@datdenkikniet (Contributor) commented Jul 22, 2023

Hmmm, made an interesting observation:

The impl<const N: usize> From<RawPCube> for CubeMapPos<N> calls itself recursively in debug mode, and just hangs in release mode. The correct way to call it is (&value).into() or Self::from(&value).

I'm guessing it's never detected by the compiler, but the problem is that Into is auto-implemented from the From impl, so you need to be quite explicit that you want to convert the reference instead.
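
For illustration, a hypothetical reconstruction of that trap, with the types simplified and the conversions elided (not the crate's actual code):

// Hypothetical, simplified types.
struct RawPCube;
struct CubeMapPos<const N: usize>;

impl<const N: usize> From<&RawPCube> for CubeMapPos<N> {
    fn from(_src: &RawPCube) -> Self {
        CubeMapPos // the real conversion would go here
    }
}

impl<const N: usize> From<RawPCube> for CubeMapPos<N> {
    fn from(value: RawPCube) -> Self {
        // BUG: `value.into()` would resolve through the blanket `Into`
        // impl derived from *this* `From<RawPCube>` impl, so the
        // function would call itself forever.

        // Fix: name the reference conversion explicitly.
        Self::from(&value)
    }
}

fn main() {
    let cube: CubeMapPos<16> = RawPCube.into();
    let _ = cube;
}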

@NailLegProcessorDivide (Author)

I'll see if I can test and fix that when my code is done running.

I should probably add some basic tests (at least for the stuff in rotations and polycube_reps) and checks that the values at the end of the enumerate calls are correct.

Do you have any suggestion for how to set the first cache file it tries to load, by the way? Deleting and recreating the files as needed is pretty annoying.

@datdenkikniet (Contributor) commented Jul 22, 2023

I think we could just run one of the different algos capable of producing a set of seeds: start with a NaivePolyCube of size 1 and call unique_expansions a few times till we have enough (enough = some amount above the current level of concurrency, or configurable by a flag to get enough seeds for that folding@home idea!), and use that as the seed base. Which impl doesn't matter; I'm just most familiar with the one I wrote.

I don't think using the cache files for this impl makes a lot of sense, since generating the initial set can be done so easily.

Also, your suggestion of loading them + sorting them seems to do the trick :) N = 13 when starting from N = 6 in 03:44, using a whopping 5 MB of memory at most. Not bad!

Load + sort at the start:
    let mut canonicalized: Vec<_> = current
        .into_iter()
        .map(|v| {
            let dims = v.dims();
            let map = CubeMapPos::<32>::from(v);
            let dims = Dim {
                x: dims.0 as usize - 1,
                y: dims.1 as usize - 1,
                z: dims.2 as usize - 1,
            };
            to_min_rot_points(&map, &dims, calculate_from - 1)
        })
        .collect();

    canonicalized.sort();

@NailLegProcessorDivide (Author)

With the sorting fix: I think there might be rotation issues as well, if the definition of canonical orientation wasn't the same, but I'm not sure.

The latest code version counted N=15 in 1h52, which is pretty good: -60% from last night. Now I think most of the remaining performance to be gained is in sorting results faster, and outside of using the matrices as const generics and adding special cases where the order is predictable, I don't have many ideas.

@datdenkikniet (Contributor) commented Jul 22, 2023

AFAICT there's practically no way of getting around sorting plus canonicalizing for the specific type you want, but luckily that isn't very expensive if you have a list of all unique polycubes.

Dayum, not bad! I don't think there is a lot that can be done in the way of optimization, unless we want to "port" memchr (the underlying implementation of [u8]::contains) to support arrays instead of slices. (I think that might help because you can completely avoid doing any index checking?)

Edit: oh, actually, since we're running contains on a slice of u16s, we might not get an equally SIMD-accelerated version... It could be worth trying to switch to u8s instead and seeing if that gives any improvement. That does look like a bit of a painful rewrite, though.

Edit edit: okay, never mind, that doesn't look like it makes sense, since then, if I understand correctly, we can only compute up to N = 8? At any rate, manually SIMD-ing the contains might help. For reference: godbolt shows that the contains loop is unrolled, but not SIMD-d.

Edit edit edit: This is getting out of hand. I don't think we can out-optimize the compiler on this; IDK why I thought we could xD

@datdenkikniet (Contributor)

Also, if you want, I can do a review of the code and try to give some general pointers / readability feedback.

@NailLegProcessorDivide (Author)

Any review / feedback would be appreciated.

Comment on lines +466 to +467:

time / 1000000,
time % 1000000

@datdenkikniet (Contributor)

I really like this formatting! Given that it's being done relatively often around the crate, we should have some fmt_duration(&Duration) somewhere.
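
For illustration, a sketch of what such a helper could look like, assuming the time in the snippet is a microsecond count; the name fmt_duration comes from the comment above, while the output format here is a guess:

use std::time::Duration;

// Hypothetical helper: render a Duration as "seconds.microseconds",
// mirroring the `time / 1000000, time % 1000000` split above.
fn fmt_duration(d: &Duration) -> String {
    let micros = d.as_micros();
    format!("{}.{:06}", micros / 1_000_000, micros % 1_000_000)
}

fn main() {
    let d = Duration::from_micros(5_013_042);
    assert_eq!(fmt_duration(&d), "5.013042");
}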

Comment on rust/src/rotations.rs:
let x = dim.x;
let y = dim.y;
let z = dim.z;
let (x_col, y_col, z_col, rdim) = if x >= y && y >= z {
@datdenkikniet (Contributor)

I think stuff like this would benefit quite a lot from a little macro_rules, or at least a use MatrixCol::* within the function, so you don't have to spell that out all the time.

@datdenkikniet (Contributor) commented Jul 22, 2023

Correction: macro_rules would probably not actually improve readability here. use MatrixCol::* would, though!

fn renormalize(..) {
    use MatrixCol::*;

    ...
        (XP, YP, ZP, Dim {x: x, y: y, z: z})
    ..
}

@NailLegProcessorDivide (Author)

Is there a way to pass some sort of pcube writer that can have pcubes inserted over time, rather than taking an iterator all at once?
For now I have disabled writing pcube files from the point-list, as Compression is a part of cli.rs.
Other than that, I feel like this is in a good place to merge, as long as @datdenkikniet and @bertie2 are also happy with it.

@NailLegProcessorDivide NailLegProcessorDivide marked this pull request as ready for review July 23, 2023 16:23
@datdenkikniet (Contributor) commented Jul 24, 2023

LGTM, I think we can merge this.

It will need some more rust-ification and some docs, but that can be solved later.

> Is there a way to pass some sort of pcube writer that can have pcubes inserted over time rather than taking an iterator all at once?

That depends on what you mean by "inserted over time". You can spawn a separate thread that produces new pcubes and inserts them into a channel, and then you can iterate over the items using Receiver::into_iter.

There is no requirement that all items are "known" at the moment you create your iterator, though, so I think you'll need to be a bit more specific. The PCubeFile iterator also returns items "over time", only reading from the file when the next cube is requested.

Generally you represent that exact idea with an iterator: all you need to do is guarantee the (very basic) Iterator trait, and that's it. All of the details can be hidden nicely.
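
For illustration, a minimal sketch of that channel-backed idea, with a placeholder u64 standing in for the real pcube type:

use std::sync::mpsc;
use std::thread;

fn main() {
    let (tx, rx) = mpsc::channel::<u64>();

    // Producer: inserts "pcubes" into the channel over time.
    thread::spawn(move || {
        for cube in 0..10u64 {
            tx.send(cube).unwrap();
        }
        // Dropping `tx` here ends the consumer's iteration below.
    });

    // Consumer: a Receiver is an IntoIterator, so the writer can treat
    // it as a plain iterator that yields items as they arrive.
    for cube in rx {
        println!("writing pcube {cube}");
    }
}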

@bertie2 bertie2 merged commit 73eaa07 into mikepound:main Jul 24, 2023