Join GitHub today
GitHub is home to over 28 million developers working together to host and review code, manage projects, and build software together.Sign up
A GPU ray tracer using a divide and conquer strategy instead of partitioning the geometry into a hierarchy.
Fetching latest commit…
Cannot retrieve the latest commit at this time.
|Failed to load latest commit information.|
dacrtGPU. A Divide and Conquer RayTracer as described by Benjamin Mora implemented on the GPU using CUDA and Thrust. Compile ./make Usage: ./build/dacrtGPU 2 2 && xv image.ppm The first argument is the number of rays pr. pixel and the second is the number of iterations. DONE Sort rays directly, not indices to avoid non-coallesced memory access (DONE) Don't do full leaf moved left/right arrays. Just do it for the nodes and then calculate the ray/sphere left/right leaf position after that using their index. This means we can do our inclusive_scan's over the nodes instead or rays and spheres. Can the same be done for non leaf partitions? Work queue idea: Update the pointer to the next work 'pool' and do it as an atomic operation. Then update the current ray owner afterwards. This can either be done atomic or not. In any case if it isn't done at the exact same time as the work index, the threads can simply iterate upwards until they find the correct dacrtnode which owns the ray (this will need to be done anyway) - It seems that scan + operation is faster than a workqueue. Given everyones obsession with work queues this may just be a problem with my old GPU, but I need to test it further. The test case can be found in DacrtNode. The atomic operation may be a major bottleneck, it is lower on newer architctures, but for my 1.2 it may just be too slow. Partition rays based on the geom bounds aswell? Perhaps do one or the other depending on some ray/geom ratio or based on how much a ray cone overlaps with a geometry bounding volume (e.g aabb) NO! Only partitioning the rays means I can do 'in node' partitioning. This would mean that I could support several different models or geometry types at the same time. It also means I can partition geometry based on it's distance to the cone apex, store the begin/end indices in a node and then later resume from that node. Is it possible to encode the MortonBounds of any 2 mortoncodes that share the first n most significant bits. If possibly then I wouldn't have to constantly reduce the bounds for each geometry partition iteration. (All values inside the bound share the same first n bits, so these can be ignored. Can I assume something about the lower 32-n bits? Are they all 1's and 0's (or close enough) or can express enough information with just the N'th bit?) - Yes! It is possible to find the lowest common bound of two morton codes. This is done by taking the most significant N bits that the two codes agree upon and filling the rest with 0 for min and 1 for max, corrosponding to a minimum and maximum path. Partition rays by sorting them according to their 5D Morton Curve, then corrospondingly partition the geometry spatially. - This is done in MortonDacrtNode and works splendidly. Reducing the size of the hypercube is what's most expensive by far when performing a DACRT step. How about only using every n'th ray to compute it instead of all of them? Say maximum 2048 rays pr cone. That would reduce cost drastically. Can I do this in a statistically sound way? Random isn't exactly cheap you know. And how much of an impact will it have when doing packet tracing? - This has been scrapped in favor of of sorting the rays along a morton curve. This allows me to lookup the lowest common morton encoded bound of a partition in constant time. The bounds are nowhere near as tight as actually reducing them, but it appears to be a lot faster. Move randomly generated numbers to the GPU - Done! This is done using a linear congruential generator, x = (Ax + C) % M, which is seeded by a global seed and a hashed local thread id. It both removes the memory overhead of generating a ton of random numbers on the CPU and storing them on the GPU, plus it increases speed by removing a giant memory transaction. TODO Let rays be created in screen space 'buckets', so instead of tracing the entire image each pass, we trace a smaller section with more coherent rays. DacrtNode: Can we do ray partitioning without converting them to hyper rays? - Sort them by major axis. - Extend all ray dirs, so their major axis has length 1.0f? Then the split - will never be performed along that axis. - Weigh all 6 dimensions when computing the bounding cone/plane split. (Needs to be done anyway with respect to the scenes AABB for evenly distributed spatial and angular split decisions) Then simply weigh the major axis angle as infinite to discourage its use. (Or use specialized kernels for each major axis, not that hard) - Proceed as usual with plane creation and intersection (but now without the constant conversion through a switch) MortonDacrtNode: - Partition along the major axis first to reduce depth complexity of the cones. - A major problem with Dacrt is that nodes can and will overlap, fx \ / / \ C1 / / \__/___/ \ C0 / \__/ where C1 is actually a subset of C0, but due to vertical division they get separated. Unfortunately there is no encoding scheme / representation that can solve this. If I do depth partitioning then I run the risk of having one ray from C0 'survive' the first depth partition and move into C1, but due to it's origin it will not get partitioned together with the rest of the rays from C1. I need to find a way to solve this without resorting a lot of rays (at least only resort them inside their own respective morton cells). Something something where live rays 'register' themselves and do a binary search to find the partition that they belong to in the next round? Compute the bounds of the geometry and use it to do early ray termination before exhaustive intersection. The left/right indice arrays also encode the left/right side info. Can we use this instead of the PartitionSide arrays to save memory? Will it be just as fast? Try using planes as bounds instead of a cone. It provides a tighter bound, so lower memory requirements, but at a higher computational cost. Amortise geometry sorting cost by using a morton curve subdivision (everyone else is anyway) When only a few rays remain, don't parallellize intersection over all rays, but do it over geometry instead. (Not an issue as long as I'm doing my fixed bounce pathtracer)