A GPU ray tracer using a divide and conquer strategy instead of partitioning the geometry into a hierarchy.
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
Kernels
Meta
Primitives
Rendering
Tests
Utils
CMakeLists.txt
Enums.h
Fragment.h
HyperCubes.cu
HyperCubes.h
LICENSE.txt
Material.cu
Material.h
README
Shading.cu
Shading.h
SphereContainer.cu
SphereContainer.h
SphereGeometry.cu
SphereGeometry.h
dacrtGPU.cu
make

README

dacrtGPU. A Divide and Conquer RayTracer as described by Benjamin Mora
implemented on the GPU using CUDA and Thrust.

Compile ./make
Usage: ./build/dacrtGPU 2 2 && xv image.ppm

The first argument is the number of rays pr. pixel and the second is the number
of iterations.



DONE

Sort rays directly, not indices to avoid non-coallesced memory access (DONE)

Don't do full leaf moved left/right arrays. Just do it for the nodes and then
calculate the ray/sphere left/right leaf position after that using their
index. This means we can do our inclusive_scan's over the nodes instead or
rays and spheres. Can the same be done for non leaf partitions?

Work queue idea: Update the pointer to the next work 'pool' and do it as an
atomic operation. Then update the current ray owner afterwards. This can
either be done atomic or not. In any case if it isn't done at the exact same
time as the work index, the threads can simply iterate upwards until they find
the correct dacrtnode which owns the ray (this will need to be done anyway)
- It seems that scan + operation is faster than a workqueue. Given everyones
  obsession with work queues this may just be a problem with my old GPU, but I
  need to test it further. The test case can be found in DacrtNode.  The atomic
  operation may be a major bottleneck, it is lower on newer architctures, but
  for my 1.2 it may just be too slow.

Partition rays based on the geom bounds aswell? Perhaps do one or the other
depending on some ray/geom ratio or based on how much a ray cone overlaps
with a geometry bounding volume (e.g aabb)
NO! Only partitioning the rays means I can do 'in node' partitioning. This
would mean that I could support several different models or geometry types
at the same time. It also means I can partition geometry based on it's
distance to the cone apex, store the begin/end indices in a node and then
later resume from that node.

Is it possible to encode the MortonBounds of any 2 mortoncodes that share the
first n most significant bits. If possibly then I wouldn't have to constantly
reduce the bounds for each geometry partition iteration. (All values inside the
bound share the same first n bits, so these can be ignored. Can I assume
something about the lower 32-n bits? Are they all 1's and 0's (or close enough)
or can express enough information with just the N'th bit?) 
- Yes! It is possible to find the lowest common bound of two morton codes. This
  is done by taking the most significant N bits that the two codes agree upon
  and filling the rest with 0 for min and 1 for max, corrosponding to a minimum
  and maximum path.

Partition rays by sorting them according to their 5D Morton Curve, then
corrospondingly partition the geometry spatially.
- This is done in MortonDacrtNode and works splendidly.

Reducing the size of the hypercube is what's most expensive by far when
performing a DACRT step. How about only using every n'th ray to compute it
instead of all of them? Say maximum 2048 rays pr cone. That would reduce cost
drastically. Can I do this in a statistically sound way? Random isn't exactly
cheap you know. And how much of an impact will it have when doing packet
tracing?
- This has been scrapped in favor of of sorting the rays along a morton
  curve. This allows me to lookup the lowest common morton encoded bound of a
  partition in constant time. The bounds are nowhere near as tight as actually
  reducing them, but it appears to be a lot faster.

Move randomly generated numbers to the GPU
- Done! This is done using a linear congruential generator, x = (Ax + C) % M,
  which is seeded by a global seed and a hashed local thread id. It both removes
  the memory overhead of generating a ton of random numbers on the CPU and
  storing them on the GPU, plus it increases speed by removing a giant memory
  transaction.



TODO

Let rays be created in screen space 'buckets', so instead of tracing the entire
image each pass, we trace a smaller section with more coherent rays.

DacrtNode: Can we do ray partitioning without converting them to hyper rays? 
- Sort them by major axis.
- Extend all ray dirs, so their major axis has length 1.0f? Then the split
- will never be performed along that axis.
- Weigh all 6 dimensions when computing the bounding cone/plane split. (Needs
  to be done anyway with respect to the scenes AABB for evenly distributed
  spatial and angular split decisions) Then simply weigh the major axis angle
  as infinite to discourage its use. (Or use specialized kernels for each
  major axis, not that hard)
- Proceed as usual with plane creation and intersection (but now without the
  constant conversion through a switch)


MortonDacrtNode:
- Partition along the major axis first to reduce depth complexity of the cones.
- A major problem with Dacrt is that nodes can and will overlap, fx

\      /   /
 \ C1 /   /
  \__/___/
   \ C0 /
    \__/

where C1 is actually a subset of C0, but due to vertical division they get
separated. Unfortunately there is no encoding scheme / representation that can
solve this. If I do depth partitioning then I run the risk of having one ray
from C0 'survive' the first depth partition and move into C1, but due to it's
origin it will not get partitioned together with the rest of the rays from C1. I
need to find a way to solve this without resorting a lot of rays (at least only
resort them inside their own respective morton cells). Something something where
live rays 'register' themselves and do a binary search to find the partition
that they belong to in the next round?


Compute the bounds of the geometry and use it to do early ray termination before
exhaustive intersection.


The left/right indice arrays also encode the left/right side info. Can we use
this instead of the PartitionSide arrays to save memory? Will it be just as fast?


Try using planes as bounds instead of a cone. It provides a tighter bound, so
lower memory requirements, but at a higher computational cost.


Amortise geometry sorting cost by using a morton curve subdivision (everyone
else is anyway)


When only a few rays remain, don't parallellize intersection over all rays, but
do it over geometry instead. (Not an issue as long as I'm doing my fixed bounce
pathtracer)