# 03. GroupBy and OrderBy

**Aggregating data** (also referred to as windows, rolling up, summarizing, or grouping data) is creating some sort of total from a number of records. **Sum, min, max, count, and average** are common aggregate operations. In SQL you can group these totals on any specified columns, allowing you to control the scope of these aggregations easily.


In [1]:
%load_ext sql
%config SqlMagic.autocommit=False
%config SqlMagic.autolimit=20
%config SqlMagic.displaylimit=20
%sql postgresql://pliu:pliu@127.0.0.1:5432/north_wind

## 3.1 Grouping records

Counting the records in a table is the simplest and most common aggregation in SQL. The query below counts the total number of orders in the orders table.

In [2]:
%%sql

select count(*) as order_number from orders

 * postgresql://pliu:***@127.0.0.1:5432/north_wind
1 rows affected.


order_number
830


We can also add a condition to count only the records of a certain group. The query below counts the orders that were shipped via company 1.

In [21]:
%%sql
select count(*) as order_number from orders where ship_via=1;

 * postgresql://pliu:***@127.0.0.1:5432/north_wind
1 rows affected.


order_number
249


But this lumps many records into a single number. What if we want to break the count down by year?

In [22]:
%%sql
select extract(year from shipped_date) as year, count(*) as order_number from orders where ship_via=1 group by extract(year from shipped_date);

 * postgresql://pliu:***@127.0.0.1:5432/north_wind
4 rows affected.


year,order_number
,4
1997.0,130
1996.0,36
1998.0,79


The output of the above query is suddenly more meaningful: we now see the order count by year.
The row with an empty year (shown as None in the notebook) groups the records whose **shipped_date** is NULL.
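
To experiment with this grouping without access to the north_wind database, here is a minimal sketch in Python. It assumes an in-memory SQLite database as a stand-in for PostgreSQL, a hypothetical five-row orders table, and SQLite's `strftime('%Y', ...)` in place of `extract(year from ...)`.

```python
import sqlite3

# Hypothetical miniature of the orders table; SQLite stands in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute("create table orders (order_id integer, ship_via integer, shipped_date text)")
conn.executemany(
    "insert into orders values (?, ?, ?)",
    [
        (1, 1, "1996-07-16"),
        (2, 1, "1997-01-10"),
        (3, 1, "1997-03-02"),
        (4, 2, "1997-05-20"),  # different shipper, removed by the WHERE clause
        (5, 1, None),          # not shipped yet, so shipped_date is NULL
    ],
)

# strftime('%Y', ...) plays the role of extract(year from ...) in this sketch.
rows = conn.execute(
    """
    select cast(strftime('%Y', shipped_date) as integer) as year,
           count(*) as order_number
    from orders
    where ship_via = 1
    group by year
    """
).fetchall()

counts_by_year = dict(rows)
print(counts_by_year)
```

The NULL shipped_date forms its own group, which Python reports under the key `None`, just like the empty year in the output above.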

### 3.1.1 Grouping by using multiple columns

We can group data by multiple columns. To make the above example more precise, we can group by both year and month. Run the query below.

In [23]:
%%sql
select extract(year from shipped_date) as year, extract(month from shipped_date) as month, count(*) as order_number from orders where ship_via=1 group by extract(year from shipped_date), extract(month from shipped_date);

 * postgresql://pliu:***@127.0.0.1:5432/north_wind
24 rows affected.


year,month,order_number
,,4
1997.0,6.0,8
1997.0,11.0,13
1998.0,4.0,25
1997.0,3.0,6
1997.0,9.0,14
1996.0,7.0,4
1997.0,4.0,8
1998.0,5.0,5
1997.0,5.0,11


Alternatively, we can use **ordinal positions instead of the column names** in the GROUP BY. The ordinal positions correspond to each item’s numeric position in the SELECT statement. So, instead of writing **extract(year from shipped_date), extract(month from shipped_date)**, we can write **GROUP BY 1, 2** (note that **numbering starts at 1, not 0**). The query below is an example.

In [24]:
%%sql
select extract(year from shipped_date) as year, extract(month from shipped_date) as month, count(*) as order_number from orders where ship_via=1 group by 1, 2;

 * postgresql://pliu:***@127.0.0.1:5432/north_wind
24 rows affected.


year,month,order_number
,,4
1997.0,6.0,8
1997.0,11.0,13
1998.0,4.0,25
1997.0,3.0,6
1997.0,9.0,14
1996.0,7.0,4
1997.0,4.0,8
1998.0,5.0,5
1997.0,5.0,11
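
As a quick check that grouping by position and grouping by expression are equivalent, the sketch below runs both forms and compares the results. It assumes an in-memory SQLite database with a hypothetical four-row orders table, and uses `strftime` where the notebook uses `extract`.

```python
import sqlite3

# Hypothetical sample data; SQLite stands in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute("create table orders (order_id integer, shipped_date text)")
conn.executemany(
    "insert into orders values (?, ?)",
    [(1, "1997-01-10"), (2, "1997-01-25"), (3, "1997-02-03"), (4, "1998-01-07")],
)

select_list = """
    select cast(strftime('%Y', shipped_date) as integer) as year,
           cast(strftime('%m', shipped_date) as integer) as month,
           count(*) as order_number
    from orders
"""

# Form 1: repeat the full expressions in GROUP BY.
by_expression = conn.execute(
    select_list
    + " group by cast(strftime('%Y', shipped_date) as integer),"
    + " cast(strftime('%m', shipped_date) as integer)"
).fetchall()

# Form 2: refer to the first and second SELECT items by position.
by_position = conn.execute(select_list + " group by 1, 2").fetchall()

# Without ORDER BY the row order is unspecified, so compare sorted lists.
print(sorted(by_expression) == sorted(by_position))
```

Ordinal positions keep the query short, but they silently change meaning if someone reorders the SELECT list, so the explicit form can be the safer choice in long-lived queries.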


## 3.2 Ordering records

Notice that the year and month columns in the above example are not in the natural order we would expect. To sort them, we can use the ORDER BY clause, which goes at the end of a SQL statement, after any **WHERE** and **GROUP BY**.

Let's first try ordering the result by month.

In [25]:
%%sql
select extract(year from shipped_date) as year, extract(month from shipped_date) as month, count(*) as order_number from orders where ship_via=1 group by 1, 2 order by month;

 * postgresql://pliu:***@127.0.0.1:5432/north_wind
24 rows affected.


year,month,order_number
1997.0,1.0,13
1998.0,1.0,17
1997.0,2.0,8
1998.0,2.0,18
1997.0,3.0,6
1998.0,3.0,14
1998.0,4.0,25
1997.0,4.0,8
1998.0,5.0,5
1997.0,5.0,11


The above query orders the final result by month, but the years are now interleaved. So we need to order by year first, then by month.

In [26]:
%%sql
select extract(year from shipped_date) as year, extract(month from shipped_date) as month, count(*) as order_number from orders where ship_via=1 group by 1, 2 order by year, month;

 * postgresql://pliu:***@127.0.0.1:5432/north_wind
24 rows affected.


year,month,order_number
1996.0,7.0,4
1996.0,8.0,9
1996.0,9.0,4
1996.0,10.0,7
1996.0,11.0,6
1996.0,12.0,6
1997.0,1.0,13
1997.0,2.0,8
1997.0,3.0,6
1997.0,4.0,8


Notice that the default sort order is ascending. To sort in descending order instead, apply the DESC keyword to the year so that more recent records appear at the top of the results. Check the output of the query below. The row with a NULL year comes first because PostgreSQL sorts NULL as if it were larger than any other value, so NULLs appear first in descending order unless you add NULLS LAST.


In [27]:
%%sql
select extract(year from shipped_date) as year, extract(month from shipped_date) as month, count(*) as order_number from orders where ship_via=1 group by 1, 2 order by year desc, month;

 * postgresql://pliu:***@127.0.0.1:5432/north_wind
24 rows affected.


year,month,order_number
,,4
1998.0,1.0,17
1998.0,2.0,18
1998.0,3.0,14
1998.0,4.0,25
1998.0,5.0,5
1997.0,1.0,13
1997.0,2.0,8
1997.0,3.0,6
1997.0,4.0,8
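
The effect of mixing sort directions can be reproduced on a small scale. This is a minimal sketch assuming an in-memory SQLite database and a hypothetical pre-aggregated table named monthly; it contains no NULLs, because SQLite and PostgreSQL place NULLs differently by default.

```python
import sqlite3

# Hypothetical pre-aggregated counts; SQLite stands in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute("create table monthly (year integer, month integer, order_number integer)")
conn.executemany(
    "insert into monthly values (?, ?, ?)",
    [(1997, 2, 8), (1998, 1, 17), (1996, 12, 6), (1997, 1, 13), (1998, 2, 18)],
)

# Both keys ascending (the default direction).
ascending = conn.execute(
    "select year, month, order_number from monthly order by year, month"
).fetchall()

# DESC applies only to the key it follows: year descends, month still ascends.
newest_first = conn.execute(
    "select year, month, order_number from monthly order by year desc, month"
).fetchall()

for row in newest_first:
    print(row)
```

Each sort key carries its own direction, so `order by year desc, month desc` would be needed to reverse the months as well.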


## 3.3 Aggregate functions

We already used the COUNT(*) function to count records. But there are other aggregate functions, including SUM(), MIN(), MAX(), and AVG().

### 3.3.1 count

If you specify a column instead of an asterisk, it will count the number of **non-null** values in that column. For instance, we can take a count of shipped_date, which will count the number of non-null values. Compare the result with the output of

```sql
select count(*) as order_number from orders
```

You will notice that the two counts differ, because the column shipped_date contains null values.


In [28]:
%%sql
select count(shipped_date) as records_count from orders;

 * postgresql://pliu:***@127.0.0.1:5432/north_wind
1 rows affected.


records_count
809
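
The difference between the two counts can be seen side by side in a small sketch. It assumes an in-memory SQLite database and a hypothetical four-row orders table in which one order has not shipped yet.

```python
import sqlite3

# Hypothetical sample data; SQLite stands in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute("create table orders (order_id integer, shipped_date text)")
conn.executemany(
    "insert into orders values (?, ?)",
    [(1, "1997-01-10"), (2, "1997-02-03"), (3, None), (4, "1998-01-07")],
)

# count(*) counts rows; count(column) skips rows where the column is NULL.
all_rows, shipped_rows = conn.execute(
    "select count(*), count(shipped_date) from orders"
).fetchone()

print(all_rows, shipped_rows)
print("orders without a shipped_date:", all_rows - shipped_rows)
```

Subtracting the two counts is a handy way to find out how many NULLs a column holds.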


### 3.3.2 Average

The column freight holds the delivery cost. To find the average delivery cost for each month of 1997, filter on the year 1997, group by month, and take the average of the freight column.


In [31]:
%%sql
select extract(month from shipped_date) as month, avg(freight) from orders where extract(year from shipped_date)=1997 group by month;

 * postgresql://pliu:***@127.0.0.1:5432/north_wind
12 rows affected.


month,avg
1.0,72.49969721427469
2.0,48.58481457498338
3.0,77.18218732252717
4.0,61.19833384056886
5.0,101.29250009357928
6.0,107.7986676732699
7.0,54.90967742377712
8.0,97.81833358771271
9.0,73.09710496389552
10.0,127.17361079487536


The above result shows more decimal places than we need. You can use the round() function to trim them.

Note
The column freight has the type real, so avg(freight) returns double precision. PostgreSQL provides the two-argument round() only for numeric, so we can't call round(value, 2) directly on the average. We need to cast it to numeric first, then apply round.
In a real-world scenario, the round is not necessary, because **real is a lossy, inexact floating-point type. It only uses 4 bytes for storage and cannot store the presented numeric literals precisely to begin with**. In the query below, the cast to numeric(16,2) already rounds to two decimal places, so the extra round() call is redundant.

In [33]:
%%sql
select extract(month from shipped_date) as month, round(avg(freight):: numeric(16,2),2) from orders where extract(year from shipped_date)=1997 group by month;

 * postgresql://pliu:***@127.0.0.1:5432/north_wind
12 rows affected.


month,round
1.0,72.5
2.0,48.58
3.0,77.18
4.0,61.2
5.0,101.29
6.0,107.8
7.0,54.91
8.0,97.82
9.0,73.1
10.0,127.17
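
To see why a 4-byte floating-point column cannot hold a value such as 61.2 exactly, the sketch below round-trips a number through a 4-byte float using Python's standard struct module. It assumes that PostgreSQL's real is stored as an IEEE 754 single-precision value, which is the usual case; the helper name as_real is made up for this illustration.

```python
import struct


def as_real(value: float) -> float:
    """Round-trip a Python float through a 4-byte IEEE 754 float."""
    return struct.unpack("f", struct.pack("f", value))[0]


stored = as_real(61.2)

# The 4-byte representation is close to 61.2 but not identical to it.
print(stored == 61.2)
print(abs(stored - 61.2))

# Rounding to two decimals hides the representation error again.
print(round(stored, 2) == 61.2)
```

This is why sums and averages over a real column show trailing digits such as 1835.9501. For monetary values, a numeric column avoids the problem at the source.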




### 3.3.3 SUM

SUM() is another common aggregate operation. To find the total delivery cost for each month of 1997, run the query below:


In [37]:
%%sql
select extract(month from shipped_date) as month, sum(freight) as total_freight from orders where extract(year from shipped_date)=1997 group by month;

 * postgresql://pliu:***@127.0.0.1:5432/north_wind
12 rows affected.


month,total_freight
1.0,2392.4902
2.0,1311.79
3.0,2469.83
4.0,1835.9501
5.0,3241.3599
6.0,3233.9602
7.0,1702.2
8.0,3521.46
9.0,2777.6902
10.0,4578.25


### 3.3.4 Multiple aggregate functions in one query

There is **no limit on how many aggregate operations you can use in a single query**. The query below finds the total_freight, max_freight and avg_freight for each month of 1997 in a single query.


In [43]:
%%sql
select extract(month from shipped_date) as month,
sum(freight) as total_freight,
max(freight) as max_freight,
avg(freight) as avg_freight
from orders
where extract(year from shipped_date)=1997
group by month;

 * postgresql://pliu:***@127.0.0.1:5432/north_wind
12 rows affected.


month,total_freight,max_freight,avg_freight
1.0,2392.4902,370.61,72.49969721427469
2.0,1311.79,458.78,48.58481457498338
3.0,2469.83,708.95,77.18218732252717
4.0,1835.9501,367.63,61.19833384056886
5.0,3241.3599,789.95,101.29250009357928
6.0,3233.9602,1007.64,107.7986676732699
7.0,1702.2,379.13,54.90967742377712
8.0,3521.46,544.08,97.81833358771271
9.0,2777.6902,364.15,73.09710496389552
10.0,4578.25,810.05,127.17361079487536
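
Here is a small, self-contained version of the same idea, assuming an in-memory SQLite database and a hypothetical orders table with five rows. The freight values are chosen so that the results are easy to verify by hand.

```python
import sqlite3

# Hypothetical sample data; SQLite stands in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute("create table orders (shipped_date text, freight real)")
conn.executemany(
    "insert into orders values (?, ?)",
    [
        ("1997-01-05", 10.0), ("1997-01-12", 20.0), ("1997-01-30", 30.0),
        ("1997-02-08", 5.5), ("1997-02-21", 4.5),
    ],
)

# Every aggregate in the SELECT list is computed over the same groups.
rows = conn.execute(
    """
    select cast(strftime('%m', shipped_date) as integer) as month,
           sum(freight) as total_freight,
           max(freight) as max_freight,
           avg(freight) as avg_freight
    from orders
    where strftime('%Y', shipped_date) = '1997'
    group by month
    order by month
    """
).fetchall()

for month, total, highest, average in rows:
    print(month, total, highest, average)
```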


### 3.3.5 Use where to get specific aggregations

We can achieve some very **specific aggregations by leveraging the WHERE**. If you want the total delivery cost by year of all orders shipped by company 1, you only need to filter on ship_via=1. The query below sums only the delivery cost of orders shipped by company 1.


In [42]:
%%sql
select extract(year from shipped_date) as year,
sum(freight) as total_number1_freight
from orders
where extract(year from shipped_date)>=1996 and ship_via=1
group by year;

 * postgresql://pliu:***@127.0.0.1:5432/north_wind
3 rows affected.


year,total_number1_freight
1996.0,2052.7498
1997.0,8819.119
1998.0,5163.289


## 3.4 The having statement

We can use **where** to keep only the rows that satisfy certain conditions, but we can't use it to filter on an aggregated value. **The way aggregation works is that the software processes record by record, finding which ones it wants to keep based on the WHERE condition. After that, it crunches the records down on the GROUP BY and performs any aggregate functions, such as SUM(). If we wanted to filter on the SUM() value, we would need the filter to take place after it is calculated.**

To filter the aggregated value, we need to use **HAVING**. HAVING is the aggregated equivalent to WHERE. The WHERE keyword filters individual records, but HAVING filters aggregations.

The query below is meant to output only the months whose avg(freight) is greater than 60.

```sql
select extract(month from shipped_date) as month,
avg(freight) as avg_freight
from orders
where extract(year from shipped_date)=1997
group by month
having avg_freight>60;
```

Note that the above query will not run in PostgreSQL, because **some platforms (including Oracle and PostgreSQL)** do not support aliases in the HAVING clause. (PostgreSQL does accept output-column aliases in GROUP BY and ORDER BY, which is why the earlier queries could use group by month.) This means you must specify the aggregate function again in the HAVING clause.

For instance, to run the above query in PostgreSQL, write it as follows:

In [45]:
%%sql
select extract(month from shipped_date) as month,
avg(freight) as avg_freight
from orders
where extract(year from shipped_date)=1997
group by month
having avg(freight)>60;

 * postgresql://pliu:***@127.0.0.1:5432/north_wind
9 rows affected.


month,avg_freight
1.0,72.49969721427469
3.0,77.18218732252717
4.0,61.19833384056886
5.0,101.29250009357928
6.0,107.7986676732699
8.0,97.81833358771271
9.0,73.09710496389552
10.0,127.17361079487536
12.0,85.74918859713786
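
The division of labor between WHERE and HAVING can be checked on a handful of rows. This is a minimal sketch assuming an in-memory SQLite database and a hypothetical six-row orders table; the aggregate is repeated in HAVING, as PostgreSQL requires.

```python
import sqlite3

# Hypothetical sample data; SQLite stands in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute("create table orders (shipped_date text, freight real)")
conn.executemany(
    "insert into orders values (?, ?)",
    [
        ("1997-01-05", 50.0), ("1997-01-20", 90.0),   # January average: 70
        ("1997-02-11", 40.0), ("1997-02-25", 20.0),   # February average: 30
        ("1997-03-09", 100.0),                        # March average: 100
        ("1998-01-15", 500.0),                        # removed by WHERE before grouping
    ],
)

rows = conn.execute(
    """
    select cast(strftime('%m', shipped_date) as integer) as month,
           avg(freight) as avg_freight
    from orders
    where strftime('%Y', shipped_date) = '1997'   -- filters individual rows
    group by month
    having avg(freight) > 60                      -- filters whole groups
    order by month
    """
).fetchall()

print(rows)
```

The 1998 order never reaches the averages because WHERE discards it first, while February is discarded only after its average has been computed.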


## 3.5 Get distinct record

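It is not uncommon to want a set of distinct results from a query. You can write **distinct(column)** or **distinct column** to get the distinct values of a column. DISTINCT is a keyword rather than a function, and the parentheses merely wrap the column expression. The two queries below return the same result.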

In [52]:
%%sql
select distinct(customer_id) from orders limit 5;

 * postgresql://pliu:***@127.0.0.1:5432/north_wind
5 rows affected.


customer_id
TOMSP
LONEP
OLDWO
WARTH
MAGAA


In [53]:
%%sql
select distinct customer_id from orders limit 5;

 * postgresql://pliu:***@127.0.0.1:5432/north_wind
5 rows affected.


customer_id
TOMSP
LONEP
OLDWO
WARTH
MAGAA


We know there are 830 records in the orders table. But suppose we want to know how many distinct customer_id values they contain. We can combine an aggregate function with distinct, as in the query below.

In [54]:
%%sql
select count(distinct(customer_id)) from orders;

 * postgresql://pliu:***@127.0.0.1:5432/north_wind
1 rows affected.


count
89


The distinct keyword can also be applied to multiple columns.


In [59]:
%%sql
select distinct customer_id, ship_via from orders limit 5;

 * postgresql://pliu:***@127.0.0.1:5432/north_wind
5 rows affected.


customer_id,ship_via
GALED,1
WHITC,1
FRANR,2
ROMEY,2
THEBI,3
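
To make the difference between single-column and multi-column distinct concrete, here is a minimal sketch assuming an in-memory SQLite database and a hypothetical four-row orders table that contains one exact duplicate.

```python
import sqlite3

# Hypothetical sample data; SQLite stands in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute("create table orders (customer_id text, ship_via integer)")
conn.executemany(
    "insert into orders values (?, ?)",
    [("ALFKI", 1), ("ALFKI", 1), ("ALFKI", 2), ("ANATR", 1)],
)

# Number of different customers, ignoring how often each one ordered.
distinct_customers = conn.execute(
    "select count(distinct customer_id) from orders"
).fetchone()[0]

# With several columns, a row is a duplicate only if ALL listed columns match.
distinct_pairs = conn.execute(
    "select distinct customer_id, ship_via from orders"
).fetchall()

print(distinct_customers)
print(sorted(distinct_pairs))
```

Only the repeated ("ALFKI", 1) row collapses; ("ALFKI", 2) survives because its ship_via differs.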


But when we apply distinct to multiple columns, **we cannot use the distinct() form anymore**. Try the query below: the parentheses turn the columns into a single row value, so the result is one composite column instead of two separate columns.

In [60]:
%%sql
select distinct(customer_id, ship_via) from orders limit 5;

 * postgresql://pliu:***@127.0.0.1:5432/north_wind
5 rows affected.


row
"(ALFKI,1)"
"(ALFKI,2)"
"(ALFKI,3)"
"(ANATR,1)"
"(ANATR,3)"


# Exercise

1. Find the customer that has placed the most orders

In [None]:
%%sql

select customer_id, count(*) as customer_order_count from orders group by customer_id order by customer_order_count desc;