Edited: 2021-09-28
You'll probably never need this, as aggregation (math operations) is not the strongest suit of any database system - in fact, aggregation should be best left to programming.
However, there are cases when aggregation with SQL is necessary, convenient, or required.
This guide is for that rare occasion.
I said MySQL, but I'm actually using MariaDB.
10.4.16-MariaDB-1:10.4.16+maria~focal
Check the MySQL versus MariaDB compatibility documentation if you really want to - link
I'm using RFM data at work, but for sake of convenient, I will refer to the table as mydataset
and its single column as mydata
.
index | data |
---|---|
1 | 45 |
2 | 32 |
3 | 74 |
.. | .. |
Aggregate functions official documentation - link
It's easy to get total number, total sum, minimum, maximum, mean, standard deviation, and variance - they are all functions provided by MySQL.
As such,
SELECT
COUNT(mydata) AS total_number,
SUM(mydata) AS total_sum,
MIN(mydata) AS minimum_value,
MAX(mydata) AS maximum_value,
AVG(mydata) AS mean,
STD(mydata) AS standard_variation,
VARIANCE(mydata) AS variance
FROM mydataset;
Apparently, there's no such thing as MODE
.
Since mode is the value that is repeated most often, the query below simply gets count of each value in mydata
and from the result in descending order, the first value is returned.
SELECT
mydata AS mode_value, COUNT(mydata) AS group_count
FROM mydataset
GROUP BY mydata
ORDER BY group_count DESC LIMIT 1;
There are other ways to go about this, but this is my limit.
This sort of gets me angry.
- StackOverflow was not at all helpful - they came up with overly complicated answers that I could not replicate (without using
PERCENTILE_CONT
). - There is no
MEDIAN
function, but there isPERCENTILE_CONT(0.5)
, a window function that pretty much returns the median. It is available since version 10.3, so it is available on 10.4. - DataGrip, the tool I use to manage databases, totally DERPED and its inspection told me
PERCENTILE_CONT
is anUnknown database function
.WHYYYYYYYY.
- DataGrip takes it further when it acts like
WITHIN GROUP
is not a thing. - I only executed the query out of frustration, only to find out that IT ACTUALLY WORKS.
- By the time I found that
PERCENTILE_CONT
works, I had nearly completed a complicated query that returns a median.
Here's the simple and neat way to get a median:
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY mydata) OVER () AS median_value FROM mydataset LIMIT 1;
Here's the overly complex way to get a median that I am weirdly, slightly proud of:
SELECT
# 4. Gets average of median value(s)
(SUM(t3.lower_median_value) + SUM(t3.upper_median_value))/COUNT(*) AS true_median
FROM (
# 3. From data, retrieves value from the median row(s)
SELECT DISTINCT
NTH_VALUE(t2.mydata, t1.lower_middle_row) OVER (ORDER BY t2.mydata) AS lower_median_value,
NTH_VALUE(t2.mydata, t1.upper_middle_row) OVER (ORDER BY t2.mydata) AS upper_median_value
FROM (
# 1. Gets row number(s) of median
SELECT
FLOOR(COUNT(mydata) / 2) AS lower_middle_row,
CEIL(COUNT(mydata) / 2) AS upper_middle_row
FROM mydataset
) AS t1, (
# 2. Gets data
SELECT mydata
FROM mydataset
) AS t2
) AS t3;
Since I put so much effort in it, I feel the need to explain the overly complex query.
- Count total number of rows
- Half the total count and you get an integer if there are even number of rows, or float if there are odd number of rows (i.e. if 10, then 5. if 9, then 4.5).
FLOOR
will round down,CEIL
will round up. If total count is even, I get two of the same number. If total count is odd, I get two different values (i.e. if 10, then 5 and 5. if 9, then 4 and 5). These are row number(s) of median. - Retrieve values at median rows, and average them - that's the median value.
This guide won't cover how to do complex statistics, but here are some notes on how you could go about it:
- Find a SQL Cookbook, or a Practical SQL guide. In fact, Practical SQL by Anthony DeBarros is recommended. Chapter 5 and chapter 10 covers statistics.
- Learn from my mistake, don't rely too much on code inspection on your db tool.
- Stored procedure is definitely not the way to go about it. Trust me, I've tried (it's very slow from some reason).
- Check first to see if a function that you need is available in Aggregate Functions and/or Window Functions.
- If the result of a function is not what you expect, you might need to refer to documentation. I mean, who knew skewness can be calculated in so many different ways?