This Postgres extension provides various functions for operating on arrays, for instance taking the histogram of an array of numbers.
These functions are useful because if you have a lot values you want to aggregate,
queries that fetch each value from a separate row can have poor performance.
Storing all the values in a single row as a Postgres array
can drastically improve query performance.
For instance, computing a 1000-bucket histogram on one million float values
stored in separate rows took 12 seconds in a simple benchmark,
compared to 27 milliseconds with the array_to_hist
function.
Even if you store all the values together in an array column, these functions still outperform aggregations done in SQL or plpgsql, because the Postgres C API lets you skip a lot of the work for interfacing at those higher levels. For instance, the same benchmark gave 398ms for a SQL solution and 12 seconds for a plpgsql solution. We show further benchmark results below.
This package installs like any Postgres extension. First say:
make && sudo make install
You will need to have pg_config
in your path,
but normally that is already the case.
You can check with which pg_config
.
Then in the database of your choice say:
CREATE EXTENSION aggs_for_arrays;
The available functions are described below.
In general, these functions accept arrays of any integer or floating-point type,
namely SMALLINT
, INTEGER
, BIGINT
, REAL
, or DOUBLE PRECISION
(aka FLOAT
).
The return value will either be the same type (e.g. for a minimum),
a FLOAT (e.g. for a mean),
or an INTEGER
type (e.g. for histogram bucket counts).
If a function can take any numeric type,
its types are shown as T
.
Returns the bucket count based on the values and bucket characteristics you request.
Returns the mean of all the values in the array.
Returns the median. Does not require a pre-sorted input. If there are an even number of values, returns the mean of the two middle values.
Just like array_to_median
, but assumes values
is already sorted.
Returns the mode. Does not require a pre-sorted input. If there are several values tied for most common, returns their mean.
Just like array_to_mode
, but assumes values
is already sorted.
Returns the percentile you request,
where percentile
is a number from 0 to 1 inclusive.
Asking for 0 will always give the minimum,
1 for maximum, and 0.5 the median.
If you ask for a percentile that lands between two data points,
we return a linear interpolation between them.
Just like array_to_percentile
, but assumes values
is already sorted.
Just like array_to_percentile
,
but you can pass several percentiles
and get the result for each in a single call.
Just like array_to_percentiles
, but assumes values
is already sorted.
Returns the greatest value in the array.
Returns the least value in the array.
Returns a tuple with the min in position 1 and the max in position 2.
Computes the skewness of the given values.
Computes the kurtosis of the given values.
Assume you have two tables:
CREATE TABLE samples (
id INTEGER PRIMARY KEY,
measurement_id INTEGER NOT NULL,
value FLOAT NOT NULL
);
CREATE TABLE sample_groups {
id INTEGER PRIMARY KEY,
measurement_id INTEGER NOT NULL,
values FLOAT[] NOT NULL
};
These tables store the same information,
but samples
stores each sample in a separate row,
and sample_groups
stores a whole group in just one row.
You can run bench.sh
to test the performance of various approaches:
* SQL on `samples`.
* SQL on `sample_groups`.
* PLPGSQL on `sample_groups`.
* The `aggs_for_arrays` function on `sample_groups`.
The sorted_array_to_*
methods use sorted_samples
and sorted_sample_groups
instead.
function | SQL row-based | SQL array-based | PLPGSQL array-based | aggs_for_arrays |
---|---|---|---|---|
array_to_hist |
12218.1 ms | 398.235 ms | 12310.800 ms | 26.936 ms |
array_to_mean |
10630.0 ms | 121.677 ms | 390.983 ms | 25.226 ms |
array_to_median |
33587.0 ms | 1163.070 ms | 1258.160 ms | 47.996 ms |
sorted_array_to_median |
23239.5 ms | 30.107 ms | 41.225 ms | 14.835 ms |
array_to_mode |
13724.1 ms | 1505.310 ms | 1552.610 ms | 201.943 ms |
sorted_array_to_mode |
13195.2 ms | 1474.130 ms | 1577.770 ms | 45.171 ms |
array_to_percentile |
24218.2 ms | 2591.240 ms | 1698.570 ms | 179.916 ms |
sorted_array_to_percentile |
24305.5 ms | 2102.520 ms | 1204.140 ms | 21.947 ms |
array_to_percentiles |
32367.0 ms | 10735.300 ms | 3608.800 ms | 188.752 ms |
sorted_array_to_percentiles |
32294.3 ms | 10153.300 ms | 3120.830 ms | 22.227 ms |
array_to_max |
10613.2 ms | 115.094 ms | 398.791 ms | 17.321 ms |
array_to_min |
10600.5 ms | 113.859 ms | 400.926 ms | 17.204 ms |
array_to_min_max |
10727.9 ms | 169.226 ms | 824.539 ms | 23.922 ms |
array_to_skewness |
22267.2 ms | 802.463 ms | 1077.630 ms | 120.925 ms |
array_to_kurtosis |
22253.1 ms | 806.296 ms | 1075.960 ms | 112.210 ms |