/ vstat Public

SIMD-enabled descriptive statistics (mean, variance, covariance, correlation)

# heal-research/vstat

## Folders and files

NameName
Last commit message
Last commit date

## History

### Introduction

is a C++17 library of computationally efficient methods for calculating sample statistics (mean, variance, covariance, correlation).

• the implementation builds upon the SIMD abstraction layer provided by the EVE [1]
• it uses a data-parallel Youngs and Cramer [2] algorithm for numerically stable computations of sums and sums-of-squares.
• the results from independent data partitions are combined with the approach by Schubert and Gertz [3].
• the methods are validated for correctness against statistical methods from the GNU Scientific Library [4].

### Usage

To use this library you simply need to copy the contents of the `include` folder inside your project, and then `#include <vstat.hpp>`. Defining `VSTAT_NAMESPACE` before inclusion will allow you to set a custom namespace for the library.

Two convenience methods are provided for batch data:

• `univariate::accumulate` for univariate statistics (mean, variance, standard deviation)
• `bivariate::accumulate` for bivariate statistics (covariance, correlation)

The methods return a `statistics` object which contains all the stat values. For example:

```std::vector<float> values{ 1.0, 2.0, 3.0, 4.0 };
std::vector<float> weights{ 2.0, 4.0, 6.0, 8.0 };

// unweighted data
auto stats = univariate::accumulate<float>(values.begin(), values.end());
std::cout << "stats:\n" << stats << "\n";

count:                  4
sum:                    10
ssr:                    5
mean:                   2.5
variance:               1.25
sample variance:        1.66667

// weighted data
auto stats = univariate::accumulate<float>(values.begin(), values.end(), weights.begin());
std::cout << "stats:\n" << stats << "\n";

count:                  20
sum:                    60
ssr:                    20
mean:                   3
variance:               1
sample variance:        1.05263```

Besides iterators, it is also possible to provide raw pointers:

```float x[] = { 1., 1., 2., 6. };
float y[] = { 2., 4., 3., 1. };
size_t n = std::size(x);

auto stats = bivariate::accumulate<float>(x, y, n);
std::cout << "stats:\n" << stats << "\n";

// results
count:                  4
sum_x:                  10
ssr_x:                  17
mean_x:                 2.5
variance_x:             4.25
sample variance_x:      5.66667
sum_y:                  10
ssr_y:                  5
mean_y:                 2.5
variance_y:             1.25
sample variance_y:      1.66667
correlation:            -0.759257
covariance:             -1.75
sample covariance:      -2.33333```

It is also possible to use projections to aggregate stats over object properties:

```struct Foo {
float value;
};

Foo foos[] = { {1}, {3}, {5}, {2}, {8} };
auto stats = univariate::accumulate<float>(foos, std::size(foos), [](auto const& foo) { return foo.value; });
std::cout << "stats:\n" << stats << "\n";

// results
count:                  5
sum:                    19
ssr:                    30.8
mean:                   3.8
variance:               6.16
sample variance:        7.7

struct Foo {
float value;
};

struct Bar {
int value;
};

Foo foos[] = { {1}, {3}, {5}, {2}, {8} };
Bar bars[] = { {3}, {2}, {1}, {4}, {11} };

auto stats = bivariate::accumulate<float>(foos, bars, std::size(foos), [](auto const& foo) { return foo.value; },
[](auto const& bar) { return bar.value; });
std::cout << "stats:\n" << stats << "\n";

// results
count:                  5
sum_x:                  19
ssr_x:                  30.8
mean_x:                 3.8
variance_x:             6.16
sample variance_x:      7.7
sum_y:                  21
ssr_y:                  62.8
mean_y:                 4.2
variance_y:             12.56
sample variance_y:      15.7
correlation:            0.686676
covariance:             6.04
sample covariance:      7.55```

The methods above accept a batch of data and calculate relevant statistics. If the data is streaming, then one can also use accumulators. The accumulator is a lower-level object that is able to perform calculations online as new data arrives:

```univariate_accumulator<float> acc;
acc(1.0);
acc(2.0);
acc(3.0);
acc(4.0);
auto stats = univariate_statistics(acc);
std::cout << "stats:\n" << stats << "\n";

Count:                  4
Sum:                    10
Sum of squares:         5
Mean:                   2.5
Variance:               1.25
Sample variance:        1.66667```

The template parameter tells the accumulator how to represent data internally.

• if a scalar type is provided (e.g. `float` or `double`), the accumulator will perform all operations with scalars (i.e., no SIMD).
• if a SIMD-type is provided (e.g., `eve::wide`) then the accumulator will perform data-parallel operations

This allows the user to combine accumulators, for example using a SIMD-enabled accumulator to process the bulk of the data and a scalar accumulator for the left-over points.

#### Available statistics

• univariate

```struct univariate_statistics {
double count;
double sum;
double ssr;
double mean;
double variance;
double sample_variance;
};```
• bivariate

```struct bivariate_statistics {
double count;
double sum_x;
double sum_y;
double ssr_x;
double ssr_y;
double sum_xy;
double mean_x;
double mean_y;
double variance_x;
double variance_y;
double sample_variance_x;
double sample_variance_y;
double correlation;
double covariance;
double sample_covariance;
};```

### Benchmarks

The following libraries have been used for performance comparison in the univariate (variance) and bivariate (covariance) case:

#### Methodology

• we generate 1M values uniformly distributed between [-1, 1] and save them into a `double` and a `float` array
• increase the data size in 100k increments and benchmark the performance for each method using nanobench

#### Notes

• we did not use MKL as a backend for numpy and gsl (expect MKL performance to be higher)
• linasm methods for variance and covariance require precomputed array means, so means computation is factored into the benchmarks
• hardware: Ryzen 9 5950X

### Acknowledgements

SIMD-enabled descriptive statistics (mean, variance, covariance, correlation)

## Releases

No releases published

## Packages 0

No packages published

•
•