Rays1

Simple path tracer and benchmark based on Peter Shirley's Ray Tracing in One Weekend, inspired by Aras Pranckevicius's Daily Pathtracer.

What's this?

After I read the Daily Pathtracer articles and Ray Tracing in One Weekend, I wanted to try out some different optimization techniques. Just for fun and excercise. It was nice to read and follow Peter Shirley's book, the source code is clear and easy to understand.

Large scene	Medium scene	Small scene

This writing is not about "how to create a fast path tracer for real world use cases". If you want a really fast solution you should use more advanced algorithms and data structures. My goal was:

experimenting with different optimization strategies like:
- changing data layout (Array of Structures / Structure of Arrays)
- assisting the compiler to produce better code
- playing with SSE and AVX instructions
- multi-threading
- measuring the performance hit of virtual functions
try out tools like linux perf and the free Intel VTune Amplifier
compare different compilers
make it faster without changing the rendering algorithm

I've just tried some modifications and I'm happy with the results even if it's possible to improve it even more.

Content

Step 1-9: trying out different data layouts and other small tricks and try to help the compiler to autovectorize the code.
Step 11-12: SIMD intrinsics
Step 13: multi-threading

Why bother with this? There are a few exceptions but most programs should be fast and efficient. Slow programs use a lot of energy, cost a lot, drain batteries faster. Servers, mobile and IoT devices all need efficient programs. Be green and write efficient programs! :)

Results

For those who are impatient there are the results for Visual Studio 2019 + Windows 10 + Intel i9 9900K CPU with unrestricted power limit.

Other compilers behave differently (more on this).

scene	version	time	total rays	performance	speedup
large	step1: baseline	58.601s	25236618	0.431 mrays/s	1.00
large	step2: no getters	52.050s	25234423	0.485 mrays/s	1.13
large	step3: non-virtual hit()	37.680s	25225603	0.669 mrays/s	1.55
large	step4: Sphere[]	32.342s	25225603	0.780 mrays/s	1.81
large	step5: SoA spheres	32.360s	25225603	0.780 mrays/s	1.81
large	step6: 2 loops in sphere hit	28.556s	25222942	0.883 mrays/s	2.05
large	step7: Vec3 -> float, float, float	28.816s	25222942	0.875 mrays/s	2.03
large	step8: no dot() in hit()	8.692s	25232300	2.903 mrays/s	6.74
large	step9: normalized dir, inv_radius	8.433s	25239307	2.993 mrays/s	6.94
large	step10: SSE intrinsics	5.865s	25237638	4.303 mrays/s	9.98
large	step11: SIMD Vec3, randoms	5.866s	25238100	4.302 mrays/s	9.98
large	step12: AVX	3.842s	25241334	6.570 mrays/s	15.24
large	step13: threads	10.628s	630931100	59.362 mrays/s	137.73

scene	version	time	total rays	performance	speedup
medium	step1: baseline	6.017s	23083633	3.837 mrays/s	1.00
medium	step2: no getters	5.325s	23087944	4.336 mrays/s	1.13
medium	step3: non-virtual hit()	3.850s	23088349	5.997 mrays/s	1.56
medium	step4: Sphere[]	3.553s	23088349	6.499 mrays/s	1.69
medium	step5: SoA spheres	3.547s	23088349	6.509 mrays/s	1.70
medium	step6: 2 loops in sphere hit	3.208s	23088412	7.196 mrays/s	1.88
medium	step7: Vec3 -> float, float, float	3.313s	23088412	6.968 mrays/s	1.82
medium	step8: no dot() in hit()	1.837s	23088253	12.568 mrays/s	3.28
medium	step9: normalized dir, inv_radius	1.804s	23082621	12.793 mrays/s	3.33
medium	step10: SSE intrinsics	1.328s	23081522	17.381 mrays/s	4.53
medium	step11: SIMD Vec3, randoms	1.236s	23087726	18.684 mrays/s	4.87
medium	step12: AVX	1.051s	23087117	21.960 mrays/s	5.72
medium	step13: threads	2.679s	577132771	215.403 mrays/s	56.14

scene	version	time	total rays	performance	speedup
small	step1: baseline	0.973s	17005555	17.483 mrays/s	1.00
small	step2: no getters	0.847s	17005506	20.077 mrays/s	1.15
small	step3: non-virtual hit()	0.721s	17005011	23.574 mrays/s	1.35
small	step4: Sphere[]	0.712s	17005011	23.872 mrays/s	1.37
small	step5: SoA spheres	0.702s	17005011	24.224 mrays/s	1.39
small	step6: 2 loops in sphere hit	0.705s	17005011	24.109 mrays/s	1.38
small	step7: Vec3 -> float, float, float	0.716s	17005011	23.761 mrays/s	1.36
small	step8: no dot() in hit()	0.729s	17005011	23.316 mrays/s	1.33
small	step9: normalized dir, inv_radius	0.704s	16946841	24.084 mrays/s	1.38
small	step10: SSE intrinsics	0.590s	16568520	28.066 mrays/s	1.61
small	step11: SIMD Vec3, randoms	0.516s	16568518	32.089 mrays/s	1.84
small	step12: AVX	0.501s	16572334	33.057 mrays/s	1.89
small	step13: threads	1.289s	414187797	321.238 mrays/s	18.37

Note: I had to increase the workload in step13 because it finished too quickly.

Step 1. Preparations and Setting the Baseline

It's a small codebase and for simplicity I wrote most of the code in a single file. My initial code is almost the same as Peter Shirley's reference implementation.

Code walk-through:
main() calls benchmark() with every scenes: small, medium, large

benchmark(create_small_scene(), pixels, write_tga, "small");
benchmark(create_medium_scene(), pixels, write_tga, "medium");
benchmark(create_large_scene(), pixels, write_tga, "large");

Benchmark code is simple, some pseudo code:

// start time
clock_t t0 = std::clock();

// rendering
for every pixel in the screen buffer
    for every sample for this pixel
        calculate color

    result_pixel = average_of_samples

// end time
clock_t t1 = std::clock();

// write the result to a text file
log_results("baseline", id, elapsed_seconds, num_rays);

// write TGA if needed
tga_write_rgb24(filename, SCREEN_W, SCREEN_H, pixels);

clock() isn't the most precise timing function but we measure a couple of seconds so granularity is not a problem and it's cross platform.

clock() will be replaced in the multi-threaded version

We save the results in a text file for automatic processing by our python script.

Scene class contains the camera and the list of the spheres.
color() finds the closest intersecting sphere and calculates the color based on sphere material. Returns background color if no spheres hitted.

For implementation details you should read Peter Shirley`s minibook.
A few changes I've made:

I've used the numerically more stable (1 - t) * a + t * b formula for linear interpolation instead of a + (b - a) * t (wiki)
Vec3 members are public (x/y/z)
image is written in .tga format instead of .ppm and disabled by default, it requires -w option from command line
using fixed random seed for consistency and xorshift to generate random numbers
extended with benchmark: every run the program renders 3 scenes:
- small scene: it contains only 5 spheres. This is only for proving that for small N it's almost impossible to improve single-threaded performance.
- medium: 46 spheres, it's the same as the one used in Daily Pathtracer.
- large: 484 spheres, 1/20 of them are dielectric, 1/10 of them are metal.

I saved every version of the program to a unique folder because I wanted to test all versions on different computers with different compilers and I wanted to make it easy to automatize. They are saved in the src/step1, src/step2, ... directories. The latest version is in the src/latest directory.
Configurations like screen size and samples per pixel are in src/common/common.h
You can read about benchmarking here if you want to benchmark your system.

We have the baseline!

scene	version	time	total rays	performance	speedup
large	step1: baseline	58.601s	25236618	0.431 mrays/s	1.00
medium	step1: baseline	6.017s	23083633	3.837 mrays/s	1.00
small	step1: baseline	0.973s	17005555	17.483 mrays/s	1.00

Before We Start

We need to set goals.
I've optimized the program for execution speed. This program is not used in any real world scenario and thus we don't have any real data about usage patterns for example screen size, samples per pixel and number of spheres.

if we use extra large screen like 32k x 32k pixels and 1000 spp and only 1-2 spheres, we will spend a lot of time with generating random numbers and calculating camera rays. The time spent finding the closest sphere is probably negligible
if we have a few hundreds of spheres, we will spend almost all the time finding the closest intersecting sphere
if we have millions of spheres, we have to worry about caching more (and it's even more insane idea to use linear searching :))
if we have 100 billion spheres, we have completely different problems

In general this program is optimized for regular resolutions (HD, full hd, 4K, ...) with 1-1000 spp and not more than a few thousand spheres.
I'm focusing on the large and medium test scenes. As a minimum goal, I want to beat the performance of the Toy Path Tracer :)
I'm targeting x86_64 architecture with AVX2 instructions. I'm developing it mainly on Windows 10 with Visual Studio 2019 but regularly checking the performance on linux and macOS with gcc and clang too.

What are CPU-s Doing?

Basically every program does the same:

reads instructions from memory
reads data from memory
computes the result
writes the result to memory

Any of them can be a bottleneck and we can improve all of them.

Modern processors are beasts, they are capable of doing incredible amount of operations in best case scenarios. Our job is to be as close to these scenarios as possible :)

How big problem is when the CPU has to wait to load some missing data?
Most instructions can be executed in a few cycles on modern CPU-s and in worst case it takes a few hundred cycle to load a missing data from the memory. That's a lot of wasted time. CPU performance increases higher rate year over year than memory speed, so this bad situation is getting worse as time passes by.

Think about that you have to cook pancakes as fast as you can. What can go wrong?

If you don't know how to cook a pancake you have to read the book for cooking instructions and it will definitely slow you down. System memory contains data and instructions for your program. If your code is not cached, it takes the processor more time to read and decode it. For example C++ virtual functions and unpredictable conditions can hurt performance a lot.
If you don't have ingredients on your desk it will slow execution down too. If ingredients are in the refrigerator it's like an L1 cache miss :) If you have to go to the pantry it takes even more time like an L2 cache miss. If you have to go to the shop it's like you have to load the data from your disk.
How fast can you break the eggs? Can you use both of your left and right hands simultaneously? Can you simplify the steps without changing the results?
You finished baking the pancake but you don't have a plate to store it? So sad, you have to wait. (it's rare in single-threaded code)

Of course it's a highly simplified example, modern CPU-s are incredibly complicated and do a lot of tricks to run codes very fast even if they are poorly written. There is a very good article about performance speed limits.

Compilers can optimize your calculations, reduce the overhead of function calling by inlining, reorder your instructions and do a lot of tricks, but it won't help you much if your memory access pattern is wrong and the CPU has to wait for the input data.
Does your compiler support SIMD (single instruction, multiple data) and auto vectorization? Good. But it's your job to organize your data properly, otherwise the CPU will stall.
The highest priority of compilers is correctness, not performance. If it cannot prove that an optimization won't mess up your calculations, it won't do it.
Spend some time to learn more about your tools and compiler.

Step 2. Public Ray Members

Everybody knows that those simple getter functions are inlined by the compiler and do not make any performance hit:

// class Ray

// note: all the functions defined inside the class are implicitly inline
Vec3 origin() const
{
    return _origin;
}

Vec3 direction() const
{
    return _dir;
}

Are we right or are we wrong? Make the members public and remove the getter functions:

struct Ray
{
    // ...

    Vec3 _origin;
    Vec3 _dir;
};

... and check it!

scene	version	time	total rays	performance	speedup
large	step1: baseline	58.601s	25236618	0.431 mrays/s	1.00
large	step2: no getters	52.050s	25234423	0.485 mrays/s	1.13
medium	step1: baseline	6.017s	23083633	3.837 mrays/s	1.00
medium	step2: no getters	5.325s	23087944	4.336 mrays/s	1.13
small	step1: baseline	0.973s	17005555	17.483 mrays/s	1.00
small	step2: no getters	0.847s	17005506	20.077 mrays/s	1.15

It helps at least to the MSVC compiler, that's a great first step to improve our little program!
Why does it perform better? I've checked the generated assembly code and:

adding __forceinline to origin() and direction() in step1 didn't help to improve the performance
nothing prevented inlining these functions, there is neither call origin nor call direction instruction in the generated assembly code of step1
despite all that the MSVC compiler generates more instructions for Sphere::hit() in step1 than in step2.

Note: rant by Casey Muratori against setters/getters: video

Step 3. Non-virtual hit()

In the book there is a Hitable base class for everything that can be hit by rays. Every derived classes has to override the hit() function:

bool hit(Ray ray)
{
    if ray hit this thing
        return true
    else
        return false
}

We have only 2 derived classes:

Hitable
    |--Sphere (data for 1 sphere. hit() function returns true, if ray hits the sphere)
    |--HitableList (list of `Hitable`s. hit() returns true, if ray hits any element of the list)

It's easy to add new hitable types and we can build a hierarchy. One big drawback is that we have to make the hit() function virtual and it can degrade performance.

Can we remove virtual function call without losing the flexibility?
Yes, of course. We introduce Hitable struct, which contains

array of Sphere-s
array of Hitable-s

Extending our little path tracer with new features is still easy. For example for rendering meshes we need a new Mesh class and an array of Mesh-es in Hitable.

struct Hitable
{
    // [..]

    bool hit(const Ray& r, float t_min, float t_max, HitRecord* rec) const
    {
        // [..]

        // spheres
        for (uint32_t i = 0; i < _num_spheres; ++i)
        {
            if (_spheres[i]->hit(r, t_min, closest_so_far, &temp_rec))
                // [..]
        }

        // hitable lists
        for (uint32_t i = 0; i < _num_hitables; ++i)
        {
            if (_hitables[i].hit(r, t_min, closest_so_far, &temp_rec))
                // [..]
        }

        return hit_anything;
    }

    // spheres
    Sphere** _spheres = 0;
    uint32_t _num_spheres = 0;

    // meshes, etc.
    // ...

    // list
    Hitable* _hitables = 0;
    uint32_t _num_hitables = 0;
};

Remember: we spend a lot of time finding closest Ray-Sphere intersections. For a full hd screen with 100 spp in our large scene with 484 spheres we call the virtual function hit() at least 1920*1080*100*484 times (roughly 100 billion times plus bounced rays). What if we remove the overhead of a virtual function call?

scene	version	time	total rays	performance	speedup
large	step2: no getters	52.050s	25234423	0.485 mrays/s	1.00
large	step3: non-virtual hit()	37.680s	25225603	0.669 mrays/s	1.38
medium	step2: no getters	5.325s	23087944	4.336 mrays/s	1.00
medium	step3: non-virtual hit()	3.850s	23088349	5.997 mrays/s	1.38
small	step2: no getters	0.847s	17005506	20.077 mrays/s	1.00
small	step3: non-virtual hit()	0.721s	17005011	23.574 mrays/s	1.17

Removing virtual function calls from hit detection code was a huge benefit.
Speedup of the small scene is lower than speedup of the medium and the large scenes. We have only 5 sphere objects in the small scene and 5 virtual function calls for every sample in a pixel. Therefore reducing this overhead has less impact overall.

Step 4. Array of Spheres

The scene contains a lot of Sphere objects and we allocate them one by one with operator new thus they are scattered in the memory. During rendering we iterate all the Sphere objects from the first one to last one. It seems obvious that we should pack them in the memory next to each other.

Currently we have "array of Sphere pointers". It looks like this in the memory:

I was curious how far these Sphere objects are in the memory from the previous ones:

int num_bad = 0;
long long worst = 0;
long long average = 0;
uint8_t *lowest_ptr = (uint8_t*)world->_spheres[0];
uint8_t *highest_ptr = lowest_ptr;

for (uint32_t j = 1; j < world->_num_spheres; j++)
{
    // distance from previous Sphere object
    long long distance = (uint8_t*)world->_spheres[j] - (uint8_t*)world->_spheres[j - 1];

    if (distance != sizeof(Sphere))
    {
        num_bad++;

        average += abs(distance);

        if (abs(distance) > worst)
            worst = abs(distance);

        printf("%3d. 0x%p distance from previous: %lld\n", j, world->_spheres[j], distance);
    }

    // minimum / maximum values of sphere pointers
    if ((uint8_t *)world->_spheres[j] < lowest_ptr)
        lowest_ptr = (uint8_t *)world->_spheres[j];

    if ((uint8_t *)world->_spheres[j] > highest_ptr)
        highest_ptr = (uint8_t *)world->_spheres[j];
}

average /= world->_num_spheres;

// results
printf("sizeof(Sphere): %d, num_bad: %d / %d, worst case: %lld, average: %lld, hi-lo: %lld\n",
        (int)sizeof(Sphere),
        num_bad,
        world->_num_spheres,
        worst,
        average,
        (long long)(highest_ptr - lowest_ptr));

And a typical result on my Windows 10 machine:

[..]
471. 0x000001AF6C626680 distance from previous: 1376
472. 0x000001AF6C6261A0 distance from previous: -1248
473. 0x000001AF6C6264E0 distance from previous: 832
474. 0x000001AF6C626380 distance from previous: -352
475. 0x000001AF6C6263C0 distance from previous: 64
476. 0x000001AF6C626560 distance from previous: 416
477. 0x000001AF6C6266A0 distance from previous: 320
478. 0x000001AF6C6265E0 distance from previous: -192
479. 0x000001AF6C626960 distance from previous: 896
480. 0x000001AF6C626C60 distance from previous: 768
481. 0x000001AF6C6268E0 distance from previous: -896
482. 0x000001AF6C626CE0 distance from previous: 1024
483. 0x000001AF6C626D20 distance from previous: 64
sizeof(Sphere): 24, num_bad: 483 / 484, worst case: 12032, average: 647, hi-lo: 43808

The average distance between two Sphere objects is more than 600 bytes, it's far from ideal.

Let's change the code and allocate an array of Spheres, it will look like this:

scene	version	time	total rays	performance	speedup
large	step3: non-virtual hit()	37.680s	25225603	0.669 mrays/s	1.00
large	step4: Sphere[]	32.342s	25225603	0.780 mrays/s	1.17
medium	step3: non-virtual hit()	3.850s	23088349	5.997 mrays/s	1.00
medium	step4: Sphere[]	3.553s	23088349	6.499 mrays/s	1.08
small	step3: non-virtual hit()	0.721s	17005011	23.574 mrays/s	1.00
small	step4: Sphere[]	0.712s	17005011	23.872 mrays/s	1.01

Not a dramatic change but it's nice. The bigger the scene, the bigger the advantage is. Even 484 Sphere objects are a small amount of data, only 484*24 = 11KBytes.

Step 5. SoA Spheres

We can arrange our data in the memory in multiple ways.

Array of structures (AoS):

struct AoS
{
    float x, y, z;
};

AoS my_aos = new AoS[SIZE];

// processing
for (int i=0; i<SIZE; i++)
    my_aos[i].x = 111;

Structure of arrays (SoA):

struct SoA
{
    float *x;
    float *y;
    float *z;

    void SoA(int n)
    {
        x = new float[n];
        y = new float[n];
        z = new float[n];
    }
};

SoA my_soa(SIZE);

// processing
for (int i=0; i<SIZE; i++)
    my_soa.x[i] = 111;

AoS data layout in memory: XYZXYZXYZ...
SoA data layout in memory: XXX...YYY...ZZZ...

Both of them have advantages and disadvantages.

Usually SoA is the best layout when you want to use SSE or other SIMD technology. Also if struct has a lot of members but you process only a few of them in a for loop, SoA has the advantage of higher information density.
However: "Note that SoA can have the disadvantage of requiring more independent memory stream references. [..] This can require the use of more prefetches, additional address generation calculations, as well as having a greater impact on DRAM page access efficiency." (Intel optimization guide 4-22)

AoS is simplier in most cases and may perform better in some situations: non-SIMD processing of data or random accessing of objects could be faster.

Later we will use SSE and AVX instructions, it seems to be a good idea to rearrange our sphere data in SoA.
Sadly there is no native support in C++ for structure-of-arrays.
What are the SoA alternatives for std::vector<X> ?

The simplest version is just to write it by hand. You write the code to initialize, clean up, resize, etc. your arrays. It works and easy to read but after you wrote almost the same code 3 times, you will search for something more generalized solution.

struct Enemy
{
    float *x
    float *y;
    int *hitpoints;

    void resize(int new_size) { ... }
    void remove(int idx) { ... }
    void cleanUp() { ... }
    // ...

private:
    int _capacity;
    int _size;
    // allocator?
};

Some people write C++ templates for SoA data structure layout. It's far from simple and the result is... well, I have mixed feelings :) One problem is that it's hard to write and read the template code and it's full of C++ magic and traps. Bigger problem is that the usage of the implementations I've seen looked cumbersome for me. Also it's a template so expect slow compilations and possible sluggish debug performance.
But if you like it, use it. Write your own or find an implementation like this or this.
Almost the same as the first solution, but instead of writing the code by hand, you can generate it.
To generate SoA arrays I wrote a python script. It uses a single allocated byte buffer to store all members and handles aligning too. The config file is a python file.

The generated c++ class looks like this:

class SphereSOA
{
    [..]
    struct InstanceData
    {
        void *_buffer;
        uint32_t _count;
        uint32_t _capacity;

        // actual members we want to use
        Vec3 * __restrict center;
        float * __restrict radius;
        Material ** __restrict material;
    };
};

Note: center, radius and material point into the _buffer and will never overlap so it's safe to use __restrict which helps the compiler to generate better code. Without __restrict the compiler has to generate very conservative code sometimes to ensure correctness. With __restrict, we promise to the compiler that center, radius and material pointers do not overlap.
More: Demystifying The Restrict Keyword

Benchmark it!

scene	version	time	total rays	performance	speedup
large	step4: Sphere[]	32.342s	25225603	0.780 mrays/s	1.00
large	step5: SoA spheres	32.360s	25225603	0.780 mrays/s	1.00
medium	step4: Sphere[]	3.553s	23088349	6.499 mrays/s	1.00
medium	step5: SoA spheres	3.547s	23088349	6.509 mrays/s	1.00
small	step4: Sphere[]	0.712s	17005011	23.872 mrays/s	1.00
small	step5: SoA spheres	0.702s	17005011	24.224 mrays/s	1.01

This time there is no speedup :) That's OK. We do this rearrangement as a preparation for:

making it easier for the compiler to autovectorize our hit detection loop
making it easier for us to use SSE instructions

Step 6. Multi-pass sphere_hit()

CPU-s love short, branchless loops.
We spend most of the time in this loop (in medium and large scenes):

for (uint32_t i = 0, N = _soa_spheres.getCount(); i < N; ++i)
{
    if (sphere_hit(i, r, t_min, closest_so_far, &temp_rec))
    {
        hit_anything = true;
        closest_so_far = temp_rec.t;
        *result_rec = temp_rec;
    }
}

..and:

bool sphere_hit(int i, const Ray& r, float t_min, float t_max, HitRecord* rec) const
{
    const Vec3& center = _soa_spheres.getData()->center[i];
    const float radius = _soa_spheres.getData()->radius[i];

    Vec3 oc = r._origin - center;
    float a = dot(r._dir, r._dir);
    float b = dot(oc, r._dir);
    float c = dot(oc, oc) - radius * radius;
    float discriminant = b * b - a * c;
    if (discriminant > 0)
    {
        [..]
    }

    return false;
}

What does the if (discriminant > 0) mean? It's true if the ray hits the currently tested sphere. How often does this happen? I bet it's a rare case! Check it with a simple additional code. We count both cases:

if (discriminant > 0)
{
    g_NumTrue++;
    [..]
}
else
    g_NumFalse++;

and print the results:

printf("discr > 0:     %.2f%%\n", 100.0 * g_NumTrue / (double)(g_NumTrue+g_NumFalse));

small
discr > 0:     21.34%

medium
discr > 0:     3.68%

large
discr > 0:     0.45%

As expected, the more spheres we have, the lower the chance of a ray hitting the tested sphere.

Let's split the loop!

// 1. pass: calculate every discriminant
// our little program spends most of the time here
for (uint32_t i = 0, N = _soa_spheres.getCount(); i < N; ++i)
{
    const Vec3& center = centers[i];
    const float radius = radii[i];

    Vec3 oc = r._origin - center;
    float a = dot(r._dir, r._dir);
    float b = dot(oc, r._dir);
    float c = dot(oc, oc) - radius * radius;

    // save results
    discriminants[i] = b * b - a * c;
}

// 2. pass: check discriminants
for (uint32_t i = 0, N = _soa_spheres.getCount(); i < N; ++i)
{
    if (discriminants[i] > 0 && sphere_hit_splitted(i, r, t_min, closest_so_far, &temp_rec))
    {
        hit_anything = true;
        closest_so_far = temp_rec.t;
        *result_rec = temp_rec;
    }
}

Pros:

no branching in the first loop
it's easier for the compiler to autovectorize the code
easy to convert to SIMD code (see later)

Cons:

increased memory footprint: we need to store the discriminant of every sphere
we redundantly recalculate a, b, c in sphere_hit_splitted(). Luckily it's a rare case.

scene	version	time	total rays	performance	speedup
large	step5: SoA spheres	32.360s	25225603	0.780 mrays/s	1.00
large	step6: 2 loops in sphere hit	28.556s	25222942	0.883 mrays/s	1.13
medium	step5: SoA spheres	3.547s	23088349	6.509 mrays/s	1.00
medium	step6: 2 loops in sphere hit	3.208s	23088412	7.196 mrays/s	1.11
small	step5: SoA spheres	0.702s	17005011	24.224 mrays/s	1.00
small	step6: 2 loops in sphere hit	0.705s	17005011	24.109 mrays/s	1.00

Great improvement!

Step 7. Vec3 -> float, float, float

What if we change the SoA sphere data from this:

Vec3 * __restrict center;

to this:

float * __restrict center_x;
float * __restrict center_y;
float * __restrict center_z;

Can it improve the performance?

scene	version	time	total rays	performance	speedup
large	step6: 2 loops in sphere hit	28.556s	25222942	0.883 mrays/s	1.00
large	step7: Vec3 -> float, float, float	28.816s	25222942	0.875 mrays/s	0.99
medium	step6: 2 loops in sphere hit	3.208s	23088412	7.196 mrays/s	1.00
medium	step7: Vec3 -> float, float, float	3.313s	23088412	6.968 mrays/s	0.97
small	step6: 2 loops in sphere hit	0.705s	17005011	24.109 mrays/s	1.00
small	step7: Vec3 -> float, float, float	0.716s	17005011	23.761 mrays/s	0.99

Nope.

Step 8. Manual dot() in hit()

Another easy modification: replace our inline dot() function calls with manual dot product calculation:

Old:

Vec3 oc(r._origin.x - cx[i], r._origin.y - cy[i], r._origin.z - cz[i]);
float a = dot(r._dir, r._dir);
float b = dot(oc, r._dir);
float c = dot(oc, oc) - radius * radius;

New:

float ocx = r._origin.x - cx[i];
float ocy = r._origin.y - cy[i];
float ocz = r._origin.z - cz[i];
float a = r._dir.x * r._dir.x + r._dir.y * r._dir.y + r._dir.z * r._dir.z;
float b = ocx * r._dir.x + ocy * r._dir.y + ocz * r._dir.z;
float c = (ocx * ocx + ocy * ocy + ocz * ocz) - radius * radius;

Manually inlining a function instead of letting the compiler to do it? What a bad idea!

scene	version	time	total rays	performance	speedup
large	step7: Vec3 -> float, float, float	28.816s	25222942	0.875 mrays/s	1.00
large	step8: no dot() in hit()	8.692s	25232300	2.903 mrays/s	3.32
medium	step7: Vec3 -> float, float, float	3.313s	23088412	6.968 mrays/s	1.00
medium	step8: no dot() in hit()	1.837s	23088253	12.568 mrays/s	1.80
small	step7: Vec3 -> float, float, float	0.716s	17005011	23.761 mrays/s	1.00
small	step8: no dot() in hit()	0.729s	17005011	23.316 mrays/s	0.98

Surprise! Why is this huge improvement?
Add /Qvec-report:1 to the command line to enable the reporting feature of the MSVC compiler auto-vectorizer. Until step 7, it reports nothing. However in step 8:

--- Analyzing function: bool __cdecl Hitable::hit(struct Ray const & __ptr64,float,float,struct HitRecord * __ptr64)const __ptr64
rayweek1.cpp(104) : info C5001: loop vectorized

Hooray! Finally the compiler is capable of auto-vectorizing our hottest loop! (GCC has a similar option: -fopt-info-vec-optimized)

Conclusion: in theory, calling an inline function is the same as manual calculation. In practice, sometimes there is a huge difference. (similar to step2)

Step 9. Minor Optimizations: ray::dir and Precalculated Data

I cannot unsee the Daily Pathtracer, I have copied some idea from it:

We store the direction of the ray as unit vector so we can make some simplification in the calculations.

Ray(const Vec3 &origin, const Vec3 &dir)
{
    _origin = origin;
    _dir = dir.getUnitVector();
}

By definition a unit vector has a length of 1. The dot product of any unit vector with itself is 1 too.
Using this knowledge we can completely remove a from this expression:

float a = dot(r._dir, r._dir);
[..]
float temp = (-b - sqrtf(b * b - a * c)) / a;

Modified version:

float temp = (-b - sqrtf(b * b - c));

We store the radius of the sphere but we use always the square or the inverse of the radius. From now we store radius*radius as radius_sq and 1.0/radius as inv_radius.

scene	version	time	total rays	performance	speedup
large	step8: no dot() in hit()	8.692s	25232300	2.903 mrays/s	1.00
large	step9: normalized dir, inv_radius	8.433s	25239307	2.993 mrays/s	1.03
medium	step8: no dot() in hit()	1.837s	23088253	12.568 mrays/s	1.00
medium	step9: normalized dir, inv_radius	1.804s	23082621	12.793 mrays/s	1.02
small	step8: no dot() in hit()	0.729s	17005011	23.316 mrays/s	1.00
small	step9: normalized dir, inv_radius	0.704s	16946841	24.084 mrays/s	1.03

Step 10. SSE Intrinsics

SIMD stands for single instruction, multiple data. A single SIMD instruction can perform the same operation on multiple data points simultaneously. If you've never heard about it, there is a really nice introduction: Crunching Numbers with AVX and AVX2. (although it's about the AVX instruction set, the 256 bits wide big brother of SSE)
Some other good readings: What is SSE and AVX? and Intel Intrinsics Guide.
I'm not experienced in writing SIMD code, I'm curious how far I can go!

Hitable::hit() is still the hottest function. We're going to replace the 'find the closest intersecting sphere' loop. SSE registers are 128 bits wide so instead of a single 32-bit float data we can work with 4x 32-bit float data simultaneously. It's easy except a small tricky part :)

Old method:

pass1:
for i in 0..num_of_spheres, i += 1
    discriminant[i] = ...

pass2:
for i in 0..num_of_spheres
    if (discriminant > 0)
        check closer ray-sphere intersection
        check further ray-sphere intersection

We replace only the first pass with SSE instructions.

pass1:
for i in 0..num_of_spheres, i += 4
    calculate discriminant i, i+1, i+2, i+3

pass2:
    nothing changed

const __m128 ray_dir_x = _mm_set1_ps(r._dir.x);
const __m128 ray_dir_y = _mm_set1_ps(r._dir.y);
const __m128 ray_dir_z = _mm_set1_ps(r._dir.z);
const __m128 ray_origin_x = _mm_set1_ps(r._origin.x);
const __m128 ray_origin_y = _mm_set1_ps(r._origin.y);
const __m128 ray_origin_z = _mm_set1_ps(r._origin.z);


for (uint32_t i = 0, N = _soa_spheres.getCount(); i < N; i += SIMD_WIDTH)
{
    const __m128 ocx = _mm_sub_ps(ray_origin_x, _mm_load_ps(&cx[i]));
    const __m128 ocy = _mm_sub_ps(ray_origin_y, _mm_load_ps(&cy[i]));
    const __m128 ocz = _mm_sub_ps(ray_origin_z, _mm_load_ps(&cz[i]));

    __m128 b = _mm_mul_ps(ocx, ray_dir_x);
    b = _mm_add_ps(_mm_mul_ps(ocy, ray_dir_y), b);
    b = _mm_add_ps(_mm_mul_ps(ocz, ray_dir_z), b);
    _mm_storeu_ps(&all_b[i], b);

    const __m128 radius_sq = _mm_load_ps(&radii_sq[i]);
    __m128 c = _mm_add_ps(_mm_add_ps(_mm_mul_ps(ocx, ocx), _mm_mul_ps(ocy, ocy)), _mm_mul_ps(ocz, ocz));
    c = _mm_sub_ps(c, radius_sq);

    __m128 discr = _mm_sub_ps(_mm_mul_ps(b, b), c);
    _mm_storeu_ps(&discriminants[i], discr);
}

Remember: we spend almost all the time in the first pass.

Note: if you check the source code, this is the #define COLLECT 0 version.

The second loop is not vectorized. The reasons:

I cannot do it :) (I need more experience with SSE)
Probably it is useless. Likely only 1 of the 4 spheres intersects the ray.

Another idea! This multi-pass solution works pretty well, but the second for loop touches every element of the discriminant[] again. And we know that most of them are negative values. What if in the first pass we record the useful indices?
The scalar version would look like this:

uint16_t positives[1024];
int num_positives = 0;

for (uint32_t i = 0; i < _soa_spheres.getCount(); ++i)
{
    // [..]
    discriminants[i] = b * b - c;

    if (discriminants[i] > 0)
        positives[num_positives++] = i;
}

Here comes the SIMD version:

__m128 discr = _mm_sub_ps(_mm_mul_ps(b, b), c);
_mm_storeu_ps(&discriminants[i], discr);

int mask = (~_mm_movemask_ps(discr)) & 15;

if (mask)
{
    if (mask & (1<<0))
        positive_idx[num_positives++] = i;
    if (mask & (1<<1))
        positive_idx[num_positives++] = i + 1;
    if (mask & (1<<2))
        positive_idx[num_positives++] = i + 2;
    if (mask & (1<<3))
        positive_idx[num_positives++] = i + 3;
}

It looks scary for the first time. Remember that we work with 4x floats simultaneously and there is no 'multiple-if' instruction in c++. The highest bit of a 32-bit floating point number is the sign bit. If it's 1, this is a negative number. If it's 0, the value is positive. This is why we have two representations of zero :) The _mm_movemask_ps instruction copies the most significant bit (the sign bit) of each float. We have 4x floats, _mm_movemask_ps returns 4 bits of information.

Let's see an example:

discriminants[i]  : 10111111 10000000 00000000 00000000    // -1
discriminants[i+1]: 00111111 10000000 00000000 00000000    // +1
discriminants[i+2]: 10111111 00000000 00000000 00000000    // -0.5
discriminants[i+3]: 11000001 00100000 00000000 00000000    // -10
                    ^
                    sign bit

mask = _mm_movemask_ps(...)
mask: 00000000 00000000 00000000 00001101 (bin) or 13 (dec)

We're almost there! We want to filter the positive values, not the negative ones. So invert every bit and clear the upper 28 bits: mask = (~mask) & 15

Continuing this example:

mask:       00000000 00000000 00000000 00001101 (bin) or 13 (dec)
~mask:      11111111 11111111 11111111 11110010 (bin)
(~mask)&15) 00000000 00000000 00000000 00000010 (bin)

If mask is non-zero, it means that 1, 2, 3 or 4 discriminants are positive.

In the second pass, we iterate only these registered indices, which is probably a much smaller amount of data:

for (int i = 0; i < num_positives; ++i)
{
    ....
}

This is the #define COLLECT 1 version that runs a little faster on my machine.

Other small changes in the code:

Hitable::hit(): the results are calculated only in one place (result_rec-> lines). I've seen it in the Daily Pathtracer! :)
Sphere_hit_splitted() function is removed and inlined manually in Hitable::hit().
Hitable::hit(): the values of b are stored too to reduce redundant calculations, which increases memory footprint slightly
Because we work with a group of 4 spheres, we must ensure that the number of spheres is multiple of 4. We have to add a few lines of code to the scene creator functions:

// make sure that the number of spheres is multiple of SIMD width
while (scene->hitables->_soa_spheres.getCount() % SIMD_WIDTH != 0)
    scene->hitables->_soa_spheres.add(Vec3(999999999, 999999999, 999999999), 0, nullptr);

We add place holder spheres that have radius of 0 and handle them carefully in hit detection.

Final version is here, results are here:

scene	version	time	total rays	performance	speedup
large	step9: normalized dir, inv_radius	8.433s	25239307	2.993 mrays/s	1.00
large	step10: SSE intrinsics	5.865s	25237638	4.303 mrays/s	1.44
medium	step9: normalized dir, inv_radius	1.804s	23082621	12.793 mrays/s	1.00
medium	step10: SSE intrinsics	1.328s	23081522	17.381 mrays/s	1.36
small	step9: normalized dir, inv_radius	0.704s	16946841	24.084 mrays/s	1.00
small	step10: SSE intrinsics	0.590s	16568520	28.066 mrays/s	1.17

Our naive SSE code performs pretty well.

My minimal goal is reached as the program is a little faster now than the Toy Path Tracer on my computer, details are on the bottom of the page :)

Step 11. Minor Optimizations

Vec3 is based on __m128 instead of "float x, y, z"
Another idea copied from Fabian 'ryg' Giesen: compute -b (nb) not b by flipping vector
Now I have 4 random functions:

float myrand01() -> returns 0..1
float myrand02() -> returns 0..2
__m128 myrand01_x4 -> returns 0..1
__m128 myrand02_x4 -> returns 0..2

Some cycles can be saved using them, for example:

random_in_unit_sphere()
OLD:
p = 2.0f * Vec3((float)myrand(), (float)myrand(), (float)myrand()) - Vec3(1, 1, 1);

NEW:
p = Vec3(myrand02_x4()) - Vec3(1.0f);

random_in_unit_disk()
OLD:
p = 2.0 * Vec3((float)myrand(), (float)myrand(), 0) - Vec3(1, 1, 0);

NEW:
p = Vec3((float)myrand02(), (float)myrand02(), 0) - Vec3(1, 1, 0);

Some calculations are simplified because we know that Ray::dir is a unit vector.

And the results of all these changes:

scene	version	time	total rays	performance	speedup
large	step10: SSE intrinsics	5.865s	25237638	4.303 mrays/s	1.00
large	step11: SIMD Vec3, randoms	5.866s	25238100	4.302 mrays/s	1.00
medium	step10: SSE intrinsics	1.328s	23081522	17.381 mrays/s	1.00
medium	step11: SIMD Vec3, randoms	1.236s	23087726	18.684 mrays/s	1.07
small	step10: SSE intrinsics	0.590s	16568520	28.066 mrays/s	1.00
small	step11: SIMD Vec3, randoms	0.516s	16568518	32.089 mrays/s	1.14

Great for the small and medium scenes, almost nothing for the large scene. This is what I expected. The fewer number of spheres we have the bigger part of the total time we spend in random_in_unit_sphere() and random_in_unit_disk().

Step 12. AVX (Advanced Vector Extensions)

AVX over SSE:

registers are 256 bits wide instead of 128 bits, they hold 8 floats instead of 4
data must be aligned on a 32-byte boundary
and more

Important note for AMD CPU owners: before Zen2 (2019), AMD processors emulated instructions on 256 bits wide data as 2x128 so it was supported but wasn't faster than SSE.

Start with an easy change:

#define SIMD_WIDTH  8

We change some variable types and function names (it's almost a simple search & replace):

OLD:
const __m128 ray_dir_x = _mm_set1_ps(r._dir.getX());
[..]

NEW:
const __m256 ray_dir_x = _mm256_set1_ps(r._dir.getX());
[..]

(__m128 is replaced with __m256 and the prefix in function names changed from _mm_ to _mm256_)

AVX2 instruction set introduces a new useful function: fused multiply-accumulation (FMA). It's faster than a multiply operation followed by an add and more precise too.
Use it:

OLD:
__m128 c = _mm_add_ps(_mm_add_ps(_mm_mul_ps(cox, cox), _mm_mul_ps(coy, coy)), _mm_mul_ps(coz, coz));
c = _mm_sub_ps(c, radius_sq);

NEW:
__m256 c = _mm256_mul_ps(cox, cox);
c = _mm256_fmadd_ps(coy, coy, c);
c = _mm256_fmadd_ps(coz, coz, c);

_mm256_movemask_ps() returns 8 bits instead of 4:

int mask = (~_mm256_movemask_ps(discr)) & 255;

if (mask)
{
    if (mask & (1<<0))
        positive_idx[num_positives++] = i;
    if (mask & (1<<1))
        positive_idx[num_positives++] = i + 1;
    if (mask & (1<<2))
        positive_idx[num_positives++] = i + 2;
    if (mask & (1<<3))
        positive_idx[num_positives++] = i + 3;
    if (mask & (1<<4))
        positive_idx[num_positives++] = i + 4;
    if (mask & (1<<5))
        positive_idx[num_positives++] = i + 5;
    if (mask & (1<<6))
        positive_idx[num_positives++] = i + 6;
    if (mask & (1<<7))
        positive_idx[num_positives++] = i + 7;
}

That's all!

scene	version	time	total rays	performance	speedup
large	step11: SIMD Vec3, randoms	5.866s	25238100	4.302 mrays/s	1.00
large	step12: AVX	3.842s	25241334	6.570 mrays/s	1.53
medium	step11: SIMD Vec3, randoms	1.236s	23087726	18.684 mrays/s	1.00
medium	step12: AVX	1.051s	23087117	21.960 mrays/s	1.18
small	step11: SIMD Vec3, randoms	0.516s	16568518	32.089 mrays/s	1.00
small	step12: AVX	0.501s	16572334	33.057 mrays/s	1.03

Great!
Of course it didn't double the performance because Hitable::hit() is not the only function in our program that takes time (especially in the small scene).

Step 13. Multi-threading

Our single-threaded performance is now at an acceptable level, it's time to utilize all of our cores!

Some random topics and questions:

Splitting the workload across multiple threads and then collecting the results always has some overhead.

We need a scheduler: I wrote a very simple one. Threads render small pieces of the final picture, in our case the tiles are 32x32 pixels. A thread requests a tile index from the scheduler. When there are no more tiles, the thread stops.

static void thread_function(TileRenderScheduler* scheduler, uint32_t thread_idx, ThreadData* td)
{
    int tile;

    while ((tile = scheduler->get_tile()) != -1)
        render_tile(tile, td);
}

int get_tile()
{
    int tile = _next_tile.fetch_add(1);
    
    if (tile < _num_tiles)
        return tile;
    
    return -1;
}

std::atomic<int> _next_tile = 0;
int _num_total_tiles = 0;

Granularity: what's the ideal size of a tile? Small tiles -> big overhead because of the scheduler. Huge tiles: thread utilization can be low. After some experiments I have set tile size to 32x32 pixels.

How much threads should we use? On my 8-core CPU with hyper-threading enabled, std::thread::hardware_concurrency() returns 16.

The sphere structure contains temporary data (discriminant and b). Threads are not able to share these otherwise we get wrong results, they overwrite the results of the other threads. We have to make it per-thread data.

The CPU can change dynamically its frequency based on power consumption, temperature or some random boost time limit. Getting consistent performance and measuring it is hard. AVX heavy load on all cores -> throttling is expected.
My CPU is able to run at 5.0GHz clock speed when 1 core is active and 4.7GHz (94%) when 8 cores are active. That's a limit, not the guaranteed speed. Actually it ran at 4.65GHz during benchmarking.

It turned out that clock() is unable to measure time in multi-threaded programs, so I replaced it with std::chrono. (details)

Hyper-threading: it can increase or decrease the performance. "SMT or HT Technology threads share the L1 instruction and data caches. If their memory access patterns are incompatible, they can end up fighting over the cache and causing many cache misses. In the worst case, the total performance for the CPU core can actually decrease when a second thread is run." (source)

Just for my curiosity I have tried a float8 wrapper class for __m256. It doesn't affect the performance but the code is slightly more readable.

I had to increase the workload in common.h:

#if MULTITHREADED
// multi-threaded version is too fast, need more work :)
#define NUM_SAMPLES_PER_PIXEL (10 * 25)
#else
#define NUM_SAMPLES_PER_PIXEL 10
#endif

#endif

Efficiency: in ideal case running on 8 cores / 8 threads is almost 8 times faster than on a single thread (job scheduling always has some overhead). Typically hyper-threading boosts performance by 10-30%. (8 cores / 16 threads).

Source and results:

scene	version	time	total rays	performance	speedup
large	step12: AVX	3.842s	25241334	6.570 mrays/s	1.00
large	step13: threads	10.628s	630931100	59.362 mrays/s	9.04
medium	step12: AVX	1.051s	23087117	21.960 mrays/s	1.00
medium	step13: threads	2.679s	577132771	215.403 mrays/s	9.81
small	step12: AVX	0.501s	16572334	33.057 mrays/s	1.00
small	step13: threads	1.289s	414187797	321.238 mrays/s	9.72

Benchmarking

Do you want to benchmark your machine or your compiler?

Short version: open a terminal and run the bench.py script. Python3 and a c++ compiler is required.

Long version:
Although there are project files for XCode and Visual Studio, I always run the benchmark from command line.
You have two options to compile and run a program: do it directly or using bench.py.

src/common/common.h contains the output config for all versions (screen size, spp)
bench.py: compiles and runs all versions in "step1", "step2", ..., "latest" directories. Every directory contains at least the rayweek1.cpp. It produces out_*.txt and out_*.tga files.
src/latest/rayweek1.cpp is the latest version of the program.

Not needed for benchmarking but good to know: update_readme.py collects results from step*/out_*.txt and updates README.md (the template file is README_template.md). I've used it for updating the article automatically.

Run bench.py. It has some options:

--all               Run all tests from step1 to latest
--latest            Run latest only
--single-threaded   Run latest single-threaded version (step12)
--step NUMBER       Run stepNUMBER
--cxx COMPILER      Select compiler, available options:
                        cl: MSVC compiler
                        clang or clang++ or clang++-9 or similar: clang
                        g++ or gcc or g++-9 or similar: g++

--quick             Quick bench (lower resolution and spp)
--compile-only      Compile programs but doesn't run
--num NUMBER        Number of runs (1..32)
--save              Save the output to a .tga file (src/step*/*.tga)

Open a terminal and run the script:

Windows:

python bench.py --latest
cl.exe should be in the path (run vcvars64.bat if it's in your path or start "x64 Native Tools Command Prompt for VS 2019")

Linux:

python3 bench.py --latest
g++ should be installed

macOS:

python3 bench.py --latest
clang should be installed

By default the program doesn't save the output image (in the multi-threaded version it can lead to inconsistent results). To store the image, use -w option, for example on Windows:

step1\rayweek1.exe -w

Or if you use bench.py, add the --save option.

python bench.py --latest --save

A few tips to get consistent results:

set the CPU scaling governor to "performance" on linux ("powersave" is the default in most distributions)
change power plan to "High Performance" on Windows
disable real-time virus protection on Windows
use a high performance cooler

MSVC compiler flags:
/Ox /fp:fast /D "NDEBUG" /arch:AVX2 /GL /link /LTCG'
g++/clang flags:
-ffast-math -O3 -g -fno-rtti -fno-exceptions -std=c++17 -march=native -flto

Check bench.py for compilation details.

Some Benchmark Results

All results are in million rays / secs, higher is better.

All benchmarks ran on my machines:

Intel i9 9900K (8c/16t), Noctua NH-D15S cooler, Windows 10 Home (1909) / Linux Mint 19.2.
MacBook Pro 2016 without Touch Bar (2c/4t) / macOS Catalina

Hyper-threading On/Off

Does hyper-threading boost the performance?
Speedup is 1.2x - 1.4x.

Optimization Steps vs Compilers

I think it's pretty interesting to see how each optimization step changes the ranking of the compilers.
I've omitted the result of step13 (multi-threaded version) because it made the chart unreadable, but you can find all results here)

Price-performance Ratio, Desktop vs Laptop

It's a little unfair but let's compare my desktop PC and my Macbook Pro laptop. Their price was similar (my PC was a little cheaper).

All Results

For those who like numbers :)

Toy Path Tracer

To make the comparison fair, I disabled light sampling and set up similar parameters:

#define kBackbufferWidth 1280
#define kBackbufferHeight 720

#define CPU_CAN_DO_SIMD 1
#define DO_LIGHT_SAMPLING 0

// single-threaded
#define DO_SAMPLES_PER_PIXEL 10
#define CPU_CAN_DO_THREADS 0

// mt
#define DO_SAMPLES_PER_PIXEL (10*25)
#define CPU_CAN_DO_THREADS 1

On my computer the Toy Path Tracer of Aras can do 16.0 mrays/s in single-threaded mode and 165.4mrays/s running on 8c/16t.
After I enabled the AVX2 instruction set in the project settings, the performance improved to 16.8mr/s and 168mr/s. (MSVC compiler)
The best performance of my programs with MSVC compiler are 21.4mr/s in single-threaded and 212.1mr/s in multi-threaded version. (step12 and step13)

Good Readings

In random order:

Rules of optimization by Emil Persson

Performance Matters by Travis Downs

How to optimize C and C++ code in 2018 by Iurii Krasnoshchok

Optimizing software in C++ by Agner Fog

Intel 64 and IA-32 Architectures Optimization Reference Manual by Intel

Do You Prefer Fast or Precise? by MS C++ Team blog

Pitfalls of Object Oriented Programming - Revisited by Tony Albrecht

Memory, Cache, CPU optimization resources by Omar Cornut

easyperf.net by Denis Bakhvalov

How to Write Code the Compiler Can Actually Optimize? by Mike Acton (.pptx)

What We Talk About When We Talk About Performance by Bruce Dawson

Thank you

Thanks to Attila Kocsis and Mihaly Kovacs for suggestions.

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
pics		pics
projects		projects
soa_generator		soa_generator
src		src
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
README_template.md		README_template.md
bench.py		bench.py
update_readme.py		update_readme.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Rays1

What's this?

Content

Results

Step 1. Preparations and Setting the Baseline

Before We Start

What are CPU-s Doing?

Step 2. Public Ray Members

Step 3. Non-virtual hit()

Step 4. Array of Spheres

Step 5. SoA Spheres

Step 6. Multi-pass sphere_hit()

Step 7. Vec3 -> float, float, float

Step 8. Manual dot() in hit()

Step 9. Minor Optimizations: ray::dir and Precalculated Data

Step 10. SSE Intrinsics

Step 11. Minor Optimizations

Step 12. AVX (Advanced Vector Extensions)

Step 13. Multi-threading

Benchmarking

Some Benchmark Results

Hyper-threading On/Off

Optimization Steps vs Compilers

Price-performance Ratio, Desktop vs Laptop

All Results

Toy Path Tracer

Good Readings

Thank you

About

Releases

Packages

Contributors 2

Languages

License

montib/rays1bench

Folders and files

Latest commit

History

Repository files navigation

Rays1

What's this?

Content

Results

Step 1. Preparations and Setting the Baseline

Before We Start

What are CPU-s Doing?

Step 2. Public Ray Members

Step 3. Non-virtual hit()

Step 4. Array of Spheres

Step 5. SoA Spheres

Step 6. Multi-pass sphere_hit()

Step 7. Vec3 -> float, float, float

Step 8. Manual dot() in hit()

Step 9. Minor Optimizations: ray::dir and Precalculated Data

Step 10. SSE Intrinsics

Step 11. Minor Optimizations

Step 12. AVX (Advanced Vector Extensions)

Step 13. Multi-threading

Benchmarking

Some Benchmark Results

Hyper-threading On/Off

Optimization Steps vs Compilers

Price-performance Ratio, Desktop vs Laptop

All Results

Toy Path Tracer

Good Readings

Thank you

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages