skip global hashtable (N=15 in 5h13) #28
Conversation
Oooo, cool! Would you mind rebasing on main or something? Getting an overview of the changes from the GitHub GUI is a bit... awful xD The actual changes aren't as sweeping as the PR page makes it seem.
(force-pushed from f47348b to f81f49a)
I still don't really like the code (style), but the performance is a good chunk better than it was (3h15 for N=15). Most things I've tried don't work, and profiling the optimised build just shows 73% in is_canonical_root, with a few branches that each contribute multiple (2-6) percent of runtime. Canonicalizing mirror pairs and double counting seemed to work but was ~10-20% slower every way I tried, though I could be making a mistake. I'm not an expert in vectorization and GPGPU stuff, but it doesn't look to me like it would be efficient. A fun idea this algorithm could lend itself to is a folding@home-style work distribution server, with multiple worker clients processing subtrees and reporting the sub-counts back, but I don't know how complex and worthwhile that could be.
When running, it divides work based on seed polycubes in a previous cache file, and needs cache files enabled for parallelism to work (otherwise it runs 1 thread building off N=1).
What tool do you use for profiling, by the way? Can highly recommend |
been using perf + flamegraph + perf annotate. |
I compile & run this with:

    cargo rustc --release --bin opencubes -- -C target-cpu=native
    # Need to sudo because I can't get setting capabilities to work
    sudo ~/.cargo/bin/flamegraph -- ./target/release/opencubes enumerate -cm hashless 10

Your flamegraph seems to be for running the opti_bit_set mode, so that would explain the difference :P I also think none of the CPUs available to me have AVX512, while yours definitely seems to have that, cool! I have uploaded the full flamegraph. I think the iterator bits are all "virtual", in the sense that they're inlined, but we still see them in the flamegraph because the debug info says that's what it's doing when executing the inlined/optimized bits? I'm on Linux, and if you're not, that may also make a difference to the flamegraph output.
You're correct, my bad XD, it would help if I opened the flamegraph from the correct folder after generating it. Is there a way to make CubeMapPos generic over the array length, so that CubeMapPos<16> would be what we have now and CubeMapPos<32> would have 32 array elements? I know it would be doable in C++, but I haven't seen any hint that it would be in Rust after a bit of looking.
Yes, absolutely! You can have const generics that do exactly this. Unfortunately const-generic math isn't supported on stable yet, so it's somewhat limited, but it should be plenty for this case.
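A minimal sketch of what that could look like, with an assumed `cubes` field name (not necessarily the crate's actual layout):

```rust
// Hypothetical sketch: CubeMapPos made generic over its backing array
// length with a const generic. The `cubes` field name is an assumption.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct CubeMapPos<const N: usize> {
    pub cubes: [u16; N],
}

impl<const N: usize> CubeMapPos<N> {
    pub fn new() -> Self {
        CubeMapPos { cubes: [0; N] }
    }
}

fn main() {
    // CubeMapPos::<16> matches the current layout; ::<32> doubles capacity.
    let small = CubeMapPos::<16>::new();
    let large = CubeMapPos::<32>::new();
    assert_eq!(small.cubes.len(), 16);
    assert_eq!(large.cubes.len(), 32);
}
```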
thanks! |
Since adding generic bounds to non-instance methods can be annoying to read/type, and causes a lot of repetition of the const generic, I want to point out that const generics apply within entire impl blocks:

    impl<const T: usize> CubeMapPos<T> {
        // `&self` and `&mut Self` mean you can call
        // `some_cube_map_pos.xy_rots_points(dims, count, &mut some_other_cube_map_pos)`,
        // so long as some_cube_map_pos and some_other_cube_map_pos have the
        // same type. The type also includes (const) generics.
        pub fn xy_rots_points(&self, shape: &Dim, count: usize, res: &mut Self) {
            // All code goes here. You can refer to T and it will be the
            // outer T
        }
        // More functions that rely on T, or not.
    }

Especially since you already have a bunch of methods whose first argument is a receiver, this should make the code easier to read & hopefully write.
Also, the way to get this to run properly in parallel is basically to:
Right? Though ATM it doesn't appear to produce correct results for that. Edit: ah, |
I had noticed it sometimes gave wrong numbers from some pcube files but never debugged it. Adding a sort at the bottom of From<&RawPCube> should solve it, I expect, as a few of my optimisations require the point list to be sorted. For parallel: yes, although for maximum speed I think it's more a balance between many threads locking and taking time on the progress bar / adding to the global total, versus the non-uniformity of the amount of work each parent polycube takes to find all its children. So something like N=6 with 166 seeds might be suboptimal for 64 cores, because that's 2.5 jobs of varying size per core. I'm currently running N=15 again and it looks like it should take a bit less than 2 hours on my machine, which would put an estimate for N=17 at under 5 days.
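The seed-splitting scheme described above can be sketched as a toy model (this is not the crate's actual code: `count_children` is a placeholder for the real child-enumeration routine, and the chunking is the naive version with the load-imbalance caveat mentioned):

```rust
use std::thread;

// Placeholder for the real routine that expands one seed polycube
// and counts the canonical children in its subtree.
fn count_children(seed: u64) -> u64 {
    seed % 7
}

// Split the seeds into one chunk per worker; each worker sums its
// subtree counts locally and the totals are combined at the end,
// so there is no locking on a shared counter in the hot loop.
fn parallel_count(seeds: Vec<u64>, n_threads: usize) -> u64 {
    let chunk = (seeds.len() + n_threads - 1) / n_threads;
    let mut handles = Vec::new();
    for part in seeds.chunks(chunk) {
        let part = part.to_vec();
        handles.push(thread::spawn(move || {
            part.into_iter().map(count_children).sum::<u64>()
        }));
    }
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    // e.g. the 166 seeds of N=6, spread over 8 workers
    let seeds: Vec<u64> = (0..166).collect();
    let total = parallel_count(seeds, 8);
    let serial: u64 = (0..166).map(count_children).sum();
    assert_eq!(total, serial);
}
```

With chunks this coarse the slowest chunk dominates, which is exactly the non-uniformity problem described above; a work-stealing pool or finer-grained job queue would smooth that out.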
Hmmm, made an interesting observation: The I'm guessing it's never detected by the compiler, but the problem is that |
I'll see if I can test and fix that when my code is done running. I should probably add some basic tests (at least for the stuff in rotation and polycube_reps) and checks that the values at the end of the enumerate calls are correct. Do you have any suggestion for how to set the first cache file it tries to load, by the way? Deleting and recreating the files as needed is pretty annoying.
I think we could just run one of the different algos capable of producing a set of seeds, i.e. start with

I don't think using the cache files for this impl makes a lot of sense, since generating the initial set can be done so easily. Also, your suggestion of loading them + sorting them seems to do the trick :) N=13 when starting from N=6 in 03:44, using a whopping 5 MB of memory at most, not bad!

Load + sort at the start:

    let mut canonicalized: Vec<_> = current
        .into_iter()
        .map(|v| {
            let dims = v.dims();
            let map = CubeMapPos::<32>::from(v);
            let dims = Dim {
                x: dims.0 as usize - 1,
                y: dims.1 as usize - 1,
                z: dims.2 as usize - 1,
            };
            to_min_rot_points(&map, &dims, calculate_from - 1)
        })
        .collect();
    canonicalized.sort();
With the sorting fix: I think there might be rotation issues as well if the definition of canonical orientation wasn't the same, but I'm not sure. The latest code version counted N=15 in 1h52, which is pretty good, being -60% from last night. Now I think most of the performance that can be gained is just sorting results faster, and outside of using the matrices as const generics and adding special cases where the order is predictable, I don't have many ideas.
(force-pushed from 16f7e59 to 63f15bb)
AFAICT there's practically no way of getting around sorting plus canonicalizing for the specific type you want, but luckily it isn't that expensive if you have a list of all unique polycubes. Dayum, not bad! I don't think there is a lot that can be done in the way of optimization, unless we want to "port" Edit: oh, actually, since we're running
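For reference, the sort-plus-dedup step mentioned here is cheap once every polycube is already in canonical form, since duplicates then compare equal and land adjacent after sorting. A minimal sketch, using a placeholder representation type rather than the crate's CubeMapPos:

```rust
// Placeholder canonical representation: any Ord type works the same way.
type CanonRep = [u16; 4];

// Sort once (O(n log n)), then collapse adjacent duplicates.
// This assumes every element has already been reduced to its
// canonical (minimum-rotation) form, so equal polycubes are equal reps.
fn unique_sorted(mut canon: Vec<CanonRep>) -> Vec<CanonRep> {
    canon.sort();
    canon.dedup();
    canon
}

fn main() {
    let reps = vec![[1, 2, 3, 0], [0, 1, 2, 3], [1, 2, 3, 0]];
    let uniq = unique_sorted(reps);
    assert_eq!(uniq, vec![[0, 1, 2, 3], [1, 2, 3, 0]]);
}
```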
Also, if you want, I can do a review of the code and try to give some general points / readability feedback.
Any review / feedback would be appreciated. |
    time / 1000000,
    time % 1000000
I really like this formatting! Given that it's being done relatively often around the crate, we should have some fmt_duration(&Duration) somewhere.
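A minimal sketch of such a helper, assuming the `time / 1000000, time % 1000000` microsecond split shown above is the format it should reproduce (the name and output format here are just suggestions, not the crate's API):

```rust
use std::time::Duration;

// Hypothetical fmt_duration helper: formats a Duration as
// "seconds.micros", mirroring the time / 1000000, time % 1000000
// pattern used around the crate.
fn fmt_duration(d: &Duration) -> String {
    let micros = d.as_micros();
    format!("{}.{:06}s", micros / 1_000_000, micros % 1_000_000)
}

fn main() {
    let d = Duration::from_micros(3_250_000);
    assert_eq!(fmt_duration(&d), "3.250000s");
}
```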
rust/src/hashless.rs (outdated)
    let x = dim.x;
    let y = dim.y;
    let z = dim.z;
    let (x_col, y_col, z_col, rdim) = if x >= y && y >= z {
I think stuff like this would quite greatly benefit from a little macro_rules!, or at least a use MatrixCol::* within the function so you don't have to spell that out all the time.
Correction: macro_rules! would probably actually not improve readability here. use MatrixCol::* would, though!

    fn renormalize(..) {
        use MatrixCol::*;
        ...
        (XP, YP, ZP, Dim { x, y, z })
        ..
    }
(force-pushed from 5f76ef7 to 63f15bb)
Is there a way to pass some sort of pcube writer that can have pcubes inserted over time, rather than taking an iterator all at once?
LGTM, I think we can merge this. It will need some more rust-ification and some docs, but that can be solved later.
That depends what you mean by "inserted over time". You can spawn a separate thread that produces new pcubes and inserts them into a There is no requirement that all items are "known" at the moment you create your iterator though, so I think you'll need to be a bit more specific. The Generally you represent that exact idea with an iterator, as all you need to do is to guarantee the (very basic) |
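The spawn-a-producer-thread idea can be sketched like this: a `std::sync::mpsc::Receiver<T>` implements `IntoIterator`, so anything that consumes an iterator can be fed items as they are produced over time. `Pcube` and `consume_all` below are stand-ins, not the crate's actual types:

```rust
use std::sync::mpsc;
use std::thread;

// Stand-in for the real polycube type.
type Pcube = Vec<u8>;

// Stand-in for any writer that takes "an iterator all at once".
fn consume_all(cubes: impl Iterator<Item = Pcube>) -> usize {
    cubes.count()
}

fn main() {
    let (tx, rx) = mpsc::channel::<Pcube>();

    // Producer thread inserts pcubes over time.
    let producer = thread::spawn(move || {
        for i in 0..5 {
            tx.send(vec![i]).unwrap();
        }
        // tx is dropped here, which closes the channel and
        // terminates the consumer's iterator.
    });

    // The consumer blocks on each item until the producer is done,
    // so the full set never has to exist up front.
    let n = consume_all(rx.into_iter());
    producer.join().unwrap();
    assert_eq!(n, 5);
}
```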
Based on the previous patch set.
Uses my understanding of this algorithm (#11) by @presseyt.
Some nasty implementation details remain (N=16 max because it still uses the point-list datatype for a quicker implementation; lots of rotation and deduplication all over the place).
Essentially perfectly parallelisable, as cores take a sub-polycube of order n and return the number of polycubes that are canonical ancestors.
A bit slow (around 3x slower than before) at determining what is and isn't a canonical child of the polycube currently being explored.