# Aggregation and grouping


In this part, you'll learn how to sort and limit query results, remove duplicates, compute statistics, group rows, and filter those groups. These operations are essential for preparing reports and are especially useful when you work with large tables.


We're going to be using `DB Fiddle` for this course. 

Navigate to: https://www.db-fiddle.com/

In the top right corner of the webpage, be sure to select `Database: PostgreSQL 13`.

Now, in the `Schema SQL` pane on the left, copy and paste the following:

```
CREATE TABLE IF NOT EXISTS "employees" (
    "department" TEXT,
    "first_name" TEXT,
    "last_name" TEXT,
    "year" INT,
    "salary" INT,
    "position" TEXT
);
INSERT INTO "employees" VALUES
    ('IT','Olivia','Pearson',2011,3000,'Trainee'),
    ('IT','Olivia','Pearson',2012,3000,'Trainee'),
    ('IT','Olivia','Pearson',2012,4200,'Junior Developer'),
    ('IT','Olivia','Pearson',2013,4900,'Junior Developer'),
    ('IT','Olivia','Pearson',2014,8100,'Senior Developer'),
    ('Management','Jack','Johnson',2011,4300,'Junior Project Manager'),
    ('Management','Jack','Johnson',2012,5100,'Project Manager'),
    ('Management','Jack','Johnson',2013,7200,'Senior Project Manager'),
    ('Management','Jack','Johnson',2014,7600,'Senior Project Manager'),
    ('Management','Jack','Johnson',2015,9500,'Head of Department'),
    ('IT','Harry','Taylor',2015,2700,'Trainee'),
    ('Human Resources','Lily','Bennett',2013,1900,'Junior HR Specialist'),
    ('Human Resources','Lily','Bennett',2014,2300,'HR Specialist'),
    ('Human Resources','Lily','Bennett',2015,3650,'Senior HR Specialist'),
    ('Accounting','Charlie','Johnson',2010,2000,'Junior Accountant'),
    ('Accounting','Charlie','Johnson',2011,2000,'Junior Accountant'),
    ('Accounting','Charlie','Johnson',2012,2500,'Accountant'),
    ('Accounting','Charlie','Johnson',2013,3200,'Accountant'),
    ('Accounting','Charlie','Johnson',2014,3700,'Senior Accountant'),
    ('Accounting','Charlie','Johnson',2015,4200,'Senior Accountant');
```


By now, you're already pretty skilled when it comes to filtering rows – but have you wondered how they are sorted in the result of an SQL query? 

Well, the answer is simple – by default, they are not sorted at all. 

The sequence in which rows appear is arbitrary and every database can behave differently. 

You can even perform the same SQL instruction a few times and get a different order each time – unless you ask the database to sort the rows, of course.

```
SELECT *
FROM orders
ORDER BY customer_id;
```


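In the above example, we've added a new piece: `ORDER BY`. After this keyword, you specify the column by which the rows will be sorted.

In this case, we want to sort by the customers' IDs, so we put `customer_id` in the `ORDER BY` clause.

Note that the `orders` table used in the examples throughout this part is only illustrative; it is not part of the schema you pasted above. The exercises use the `employees` table.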
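If you'd like to run the `orders` examples yourself, here is a minimal sketch of such a table. The column names (`customer_id`, `order_date`, `total_sum`) are taken from the example queries in this part; the column types and all the rows are assumptions made up for illustration.

```sql
-- Hypothetical schema for the "orders" table used in the examples.
-- Column names come from the example queries; types and rows are assumptions.
CREATE TABLE IF NOT EXISTS orders (
    customer_id INT,
    order_date  DATE,
    total_sum   NUMERIC(10, 2)
);

INSERT INTO orders (customer_id, order_date, total_sum) VALUES
    (100,  '2019-03-01', 1500.00),
    (100,  '2019-03-01',  750.50),
    (100,  '2019-07-15',  320.00),
    (101,  '2019-03-01',   89.99),
    (101,  '2020-02-10', 2400.00),
    (NULL, '2019-05-20',   45.00);  -- no customer ID, handy for the NULL examples
```

You can append this to the `Schema SQL` pane below the `employees` statements.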


### Exercise

Try it yourself. Select all columns from the employees table, and sort the result by the salary.

#### Show me the answer

```
SELECT *
FROM employees
ORDER BY salary;
```

# ORDER BY with conditions

Excellent! Now you can easily examine who's got the lowest and the highest salary. It's not that hard, as you can see.

We can filter rows and sort them at the same time. Just have a look:

```
SELECT *
FROM orders
WHERE customer_id = 100
ORDER BY total_sum;
```

The `WHERE` clause and `ORDER BY` work well together.

In this case, we'll only see the orders made by the customer with ID 100. The orders will be sorted by the total sum: the cheapest order will appear first and the most expensive one last.

### Exercise

Select only the rows related to 2011 from the employees table. Sort the result by the salary.

#### Show me the answer

```
SELECT * FROM employees
WHERE year = 2011
ORDER BY salary;
```


# Ascending and descending orders

As you can see, the lowest salary was shown first and the highest salary last. SQL sorts in ascending order by default.

If you want to be precise and make things clear, however, you can use the keyword `ASC` (short for ascending) after the column name:

```
SELECT *
FROM orders
ORDER BY total_sum ASC;
```

Adding the keyword `ASC` will change nothing, but it will show your intention in a very clear way.

We can also reverse the order and make the greatest values appear first.

```
SELECT *
FROM orders
ORDER BY total_sum DESC;
```

As you can see, we've added the word `DESC` (short for descending) after the column name. As a result, the highest values in the column `total_sum` will be shown first.

### Exercise

Select all rows from the employees table and sort them in the descending order by the column last_name.

#### Show me the answer

```
SELECT *
FROM employees
ORDER BY last_name DESC;
```

# Sort by a few columns

One more thing before we move on: you can sort your results by more than one column and each of them can be sorted in a different order:

```
SELECT *
FROM orders
ORDER BY customer_id ASC, total_sum DESC;
```

As you can see, the results will first be sorted by customer_id in the ascending order (lowest values first) and then, for each customer_id, the orders will be sorted by the total_sum in the descending order (greatest values first).

### Exercise

Select all rows from the employees table and sort them in the ascending order by the department and then in the descending order by the salary.

#### Show me the answer

```
SELECT *
FROM employees
ORDER BY
  department ASC,
  salary DESC;
```

# Limiting the output

We'll show you another feature of PostgreSQL. 

By default, PostgreSQL returns every row that matches the given criteria. This is what we normally expect, of course, but there are cases when we might want to change this behavior.

The more rows the database has to retrieve, the more time it takes. This isn't good, especially when we don't have to look at all the results but only need a small glimpse at the data. 

Take a look:

```
SELECT *
FROM orders
LIMIT 10;
```

`LIMIT n` returns at most the first n rows of the result. When you only need a glimpse of the data, this is usually much faster than returning everything. Keep in mind that without `ORDER BY`, the database is free to decide which n rows come first.

Many interactive SQL tools apply a similar cap to the rows they display, so that you get the response faster.

### Exercise

Select the `salary` and `position` columns from the employees table and return only the first five rows.

Try running the query without `LIMIT` first in order to see the difference for yourself.

#### Show me the answer

```
SELECT salary, position
FROM employees
LIMIT 5;
```

### Exercise

This time, show the ten highest salaries from the employees table. Select the position column as well. Modify the answer from the previous exercise.

#### Show me the answer

```
SELECT
  salary,
  position
FROM employees
ORDER BY salary DESC
LIMIT 10;
```

# Duplicate results


We'll now focus on another aspect of query results: duplicates. As mentioned before, the database returns every row which matches the given criteria, and sometimes this means that the same value shows up many times.

Imagine the following situation: we want to get the IDs of all customers who have ever placed an order. We might use the following code:

```
SELECT customer_id
FROM orders;
```

What's wrong with the code in this case? Well, try to do the exercise to find out.

### Exercise

Select the column year for all rows in the employees table. Then examine the result carefully.

#### Show me the answer

```
SELECT year
FROM employees;
```

# Select distinct values

Could you see the problem? There were many rows with the same year, so each year is shown many times in the results.

In our orders example, if there were many orders placed by the same customer, each customer ID would be shown many times in the results. Not good.

Fortunately, we can easily change this.

```
SELECT DISTINCT customer_id
FROM orders;
```

Before the column name, we've added the word `DISTINCT`. Now the database will remove duplicates and only show distinct values. Each customer_id will appear only once.
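As a small illustration on the sample data (assuming the `employees` schema from the beginning of this part is loaded), `DISTINCT` can be combined with `ORDER BY` to get a sorted list of unique values:

```sql
-- List each department once, in alphabetical order.
SELECT DISTINCT department
FROM employees
ORDER BY department;
-- With the sample data this returns four rows:
-- Accounting, Human Resources, IT, Management
```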

### Exercise

Select the column year from the employees table in such a way that each year is only shown once.

#### Show me the answer

```
SELECT DISTINCT year
FROM employees;
```

# Select distinct values in certain columns

You can also use `DISTINCT` on a group of columns. Take a look:

```
SELECT DISTINCT
  customer_id,
  order_date
FROM orders;
```

One customer may place many orders every day, but if we just want to know on which days each customer placed at least one order, the above query will return exactly that: each combination of `customer_id` and `order_date` appears only once.

### Exercise

Check what positions there are in every department. 

In order to do that, select the columns `department` and `position` from the `employees` table and eliminate duplicates.

#### Show me the answer

```
SELECT DISTINCT department, position
FROM employees;
```

# Count the rows

You already know that your database can do computation because we've already added or subtracted values in our SQL instructions. 

The database can do much more than that. It can compute statistics for multiple rows. 

This operation is called aggregation.

Let's start with something simple:

```
SELECT COUNT(*)
FROM orders;
```

Instead of the asterisk (`*`) which basically means "all", we've put the expression `COUNT(*)`.

`COUNT(*)` is a function. 

A function call in SQL typically consists of a name followed by parentheses. 

In the parentheses, you can put the information which the function needs to work. 

For example, `COUNT()` counts rows, and the argument in the parentheses determines what exactly is counted.

In this case, we've used `COUNT(*)` which basically means "count all rows". 

As a result, we'll just get the number of all rows in the orders table – and not their content.
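`COUNT(*)` also works together with `WHERE`, in which case only the rows that pass the filter are counted. A short sketch on the sample data, assuming the `employees` schema from the beginning of this part is loaded (the alias `it_rows` is just an illustrative name):

```sql
-- Count only the rows that belong to the IT department.
SELECT COUNT(*) AS it_rows
FROM employees
WHERE department = 'IT';
-- With the sample data: 6
```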

### Exercise

Count all rows in the employees table.

#### Show me the answer

```
SELECT COUNT(*)
FROM employees;
```

# Count the rows, ignore the NULLs

Naturally, the asterisk (`*`) isn't the only option available in the function `COUNT()`. 

For example, we may ask the database to count the values in a specific column:

```
SELECT COUNT(customer_id)
FROM orders;
```

What's the difference between `COUNT(*)` and `COUNT(customer_id)`? 

Well, the first option counts all rows in the table, while the second option counts only the rows where the column `customer_id` has a non-`NULL` value. 

In other words, if there is a `NULL` in the column `customer_id`, that row won't be counted.
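To see the difference without needing any table, here is a self-contained sketch. The `sample_orders` data and its `order_id` column are made up for illustration; only `customer_id` comes from the example above.

```sql
-- Four hypothetical orders, one of them without a customer ID.
WITH sample_orders (order_id, customer_id) AS (
    VALUES
        (1, 100),
        (2, 100),
        (3, NULL),
        (4, 200)
)
SELECT
  COUNT(*)           AS all_rows,
  COUNT(customer_id) AS rows_with_customer
FROM sample_orders;
-- all_rows = 4, rows_with_customer = 3
```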

### Exercise

Check how many non-`NULL` values in the column `position` there are in the `employees` table. Name the column `non_null_no`.

#### Show me the answer

```
SELECT COUNT(position) AS non_null_no
FROM employees;
```

# Count distinct values in a column

As you probably expect, we can also add the `DISTINCT` keyword in our `COUNT()` function:

```
SELECT COUNT(DISTINCT customer_id) AS distinct_customers
FROM orders;
```

This time, we count the distinct values in the column `customer_id`. 

In other words, this instruction tells us how many different customers have placed an order so far. 

If a customer places 5 orders, the customer will only be counted once.
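The same comparison can be tried on the sample data, assuming the `employees` schema from the beginning of this part is loaded. The aliases are illustrative names.

```sql
-- Compare the number of year values with the number of different years.
SELECT
  COUNT(year)          AS year_values,
  COUNT(DISTINCT year) AS distinct_years
FROM employees;
-- With the sample data: year_values = 20, distinct_years = 6 (2010 to 2015)
```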

### Exercise

Count how many different positions there are in the employees table. 

Name the column `distinct_positions`.

#### Show me the answer

```
SELECT COUNT(DISTINCT position) AS distinct_positions
FROM employees;
```

# Find the minimum and maximum value

Of course, `COUNT()` is not the only function out there. Let's learn some others!

```
SELECT MIN(total_sum)
FROM orders;
```

The function `MIN(total_sum)` returns the smallest value of the column `total_sum`. 

You can also use a similar function, namely `MAX()`. That's right, it returns the biggest value of the specified column. 
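Several aggregate functions can appear in one query. A brief sketch on the sample data, assuming the `employees` schema from the beginning of this part is loaded (the aliases are illustrative names):

```sql
-- Lowest and highest salary across the whole table, in a single query.
SELECT
  MIN(salary) AS lowest_salary,
  MAX(salary) AS highest_salary
FROM employees;
-- With the sample data: lowest_salary = 1900, highest_salary = 9500
```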


### Exercise

Select the highest salary from the employees table.

#### Show me the answer

```
SELECT MAX(salary)
FROM employees;
```

# Find the average value

Let's discuss another function:

```
SELECT AVG(total_sum)
FROM orders
WHERE customer_id = 100;
```

The function `AVG()` finds the average value of the specified column.

In the above example, we'll get the average order value for the customer with ID 100.
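Two details are worth seeing in action: like `COUNT(column)`, `AVG()` skips `NULL` values, and in PostgreSQL the average of an integer column comes back as a decimal number that you may want to round. The sketch below is self-contained; the `sample_sums` data is made up for illustration.

```sql
-- Three hypothetical order values, one of them missing.
WITH sample_sums (total_sum) AS (
    VALUES (100), (200), (NULL)
)
SELECT
  AVG(total_sum)           AS average,
  ROUND(AVG(total_sum), 2) AS rounded_average
FROM sample_sums;
-- The NULL is skipped, so the average is (100 + 200) / 2 = 150;
-- rounded_average is 150.00
```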

### Exercise

Find the average salary in the employees table for the year 2013.

#### Show me the answer

```
SELECT AVG(salary)
FROM employees
WHERE year = 2013;
```

# Find the sum

The last function that we'll discuss is `SUM()`.

Examine the example:

```
SELECT SUM(total_sum)
FROM orders
WHERE customer_id = 100;
```

The above instruction will find the total sum of all orders placed by the customer with ID 100.
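One thing to watch out for: when no rows match the `WHERE` condition, `SUM()` returns `NULL` rather than 0. The sketch below assumes the `employees` schema from the beginning of this part is loaded; that sample data contains no Marketing department, so the filter matches nothing. `COALESCE()` is one common way to substitute a default value.

```sql
-- No row matches, so SUM() has nothing to add up.
SELECT
  SUM(salary)              AS total,
  COALESCE(SUM(salary), 0) AS total_or_zero
FROM employees
WHERE department = 'Marketing';
-- total is NULL, total_or_zero is 0
```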

### Exercise

Find the sum of all salaries in the IT department in 2014. Remember to put the department name in single quotes!

#### Show me the answer

```
SELECT SUM(salary)
FROM employees
WHERE year = 2014
  AND department = 'IT';
```

# Group the rows and count them

We'll now go on to study even more sophisticated statistics. Look at the following statement:

```
SELECT
  customer_id,
  COUNT(*)
FROM orders
GROUP BY customer_id;
```

The new piece here is `GROUP BY` followed by a column name (`customer_id`). `GROUP BY` will group together all rows having the same value in the specified column.

In our example, all orders made by the same customer will be grouped together in one row. The function `COUNT(*)` will then count the rows in each group. 

As a result, we'll get a table where each `customer_id` will be shown together with the number of orders placed by that customer.
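Here is the same idea on the sample data, assuming the `employees` schema from the beginning of this part is loaded. The `ORDER BY` is only there to make the output order predictable, and `row_count` is an illustrative alias. Note that this counts rows (one per employee, year and position), not people.

```sql
-- One result row per department, with the number of table rows in that group.
SELECT
  department,
  COUNT(*) AS row_count
FROM employees
GROUP BY department
ORDER BY department;
-- Accounting      | 6
-- Human Resources | 3
-- IT              | 6
-- Management      | 5
```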

### Exercise

Find the number of employees in each department in 2013. Show the department name together with the number of employees. Name the second column `employees_no`.

#### Show me the answer

```
SELECT
  department,
  COUNT(*) AS employees_no
FROM employees
WHERE year = 2013
GROUP BY department;
```

# Find min and max values in groups

Of course, `COUNT(*)` isn't the only option. In fact, `GROUP BY` is used together with many other functions. Take a look:

```
SELECT
  customer_id,
  MAX(total_sum)
FROM orders
GROUP BY customer_id;
```

We've replaced `COUNT(*)` with `MAX(total_sum)`. Can you guess what happens now?

That's right, instead of counting the orders of each customer, we'll find the highest order value for each customer.

### Exercise

Show all departments together with their lowest and highest salary in 2014.

#### Show me the answer

```
SELECT
  department,
  MIN(salary),
  MAX(salary)
FROM employees
WHERE year = 2014
GROUP BY department;
```

# Find the average value in groups

Let's study one more example of this kind:

```
SELECT
  customer_id,
  AVG(total_sum)
FROM orders
WHERE order_date >= '2019-01-01'
  AND order_date < '2020-01-01'
GROUP BY customer_id;
```

As you can see, we now use the function `AVG(total_sum)`, which will compute the average order value for each of our customers, but only for their orders placed in 2019.

### Exercise

For each department, find the average salary in 2015.

#### Show me the answer

```
SELECT
  department,
  AVG(salary)
FROM employees
WHERE year = 2015
GROUP BY department;
```

# Group by a few columns


Here's one more thing about `GROUP BY` that we want to discuss. 

Sometimes we want to group the rows by more than one column. 

Let's imagine we have a few customers who place tons of orders every day, so we would like to know the daily sum of their orders.

```
SELECT
  customer_id,
  order_date,
  SUM(total_sum)
FROM orders
GROUP BY customer_id, order_date;
```

As you can see, we group by two columns: customer_id and order_date. We select these columns along with the function `SUM(total_sum)`.

Remember: in such queries, each column in the `SELECT` part must either appear in the `GROUP BY` clause or be used inside an aggregate function.
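The rule is easiest to remember after seeing it broken once. The sketch below assumes the `employees` schema from the beginning of this part is loaded; the first query is commented out because PostgreSQL rejects it with an error, and `highest_salary` is an illustrative alias.

```sql
-- Not valid: position is neither grouped nor aggregated.
-- SELECT department, position, MAX(salary)
-- FROM employees
-- GROUP BY department;

-- Valid: every selected column is in GROUP BY or inside an aggregate function.
SELECT
  department,
  position,
  MAX(salary) AS highest_salary
FROM employees
GROUP BY department, position;
```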

### Exercise

Find the average salary for each employee. Show the last name, the first name, and the average salary. Group the table by the last name and the first name.

#### Show me the answer

```
SELECT
  last_name,
  first_name,
  AVG(salary)
FROM employees
GROUP BY last_name, first_name;
```

# Filter groups

We'll have a look at how groups can be filtered. There is a special keyword `HAVING` reserved for this.

```
SELECT
  customer_id,
  order_date,
  SUM(total_sum)
FROM orders
GROUP BY customer_id, order_date
HAVING SUM(total_sum) > 2000;
```

The new part here comes at the end. 

We've used the keyword `HAVING` and then stated the condition to filter the results. 

In this case, we only want to show those customers who, on individual days, ordered goods with a total daily value of more than 2,000.

By the way, this is a good time to point out an important thing: in SQL, the clauses must always be put in the right order. You can't, for example, put `WHERE` before `FROM`. Similarly, `HAVING` must always follow `GROUP BY`, not the other way around. The order of the clauses covered so far is `SELECT`, `FROM`, `WHERE`, `GROUP BY`, `HAVING`, `ORDER BY`, `LIMIT`. Keep that in mind when you write your queries, especially longer ones.
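As a sketch of how the clauses line up, the query below uses all of them at once. It assumes the `employees` schema from the beginning of this part is loaded; the thresholds and the alias `average_salary` are arbitrary choices for illustration. Notice the difference between `WHERE`, which filters individual rows before grouping, and `HAVING`, which filters whole groups afterwards.

```sql
SELECT
  department,
  AVG(salary) AS average_salary   -- what to show
FROM employees                     -- where the rows come from
WHERE year >= 2013                 -- filter individual rows first
GROUP BY department                -- then build the groups
HAVING AVG(salary) > 3000          -- filter whole groups
ORDER BY average_salary DESC       -- sort the remaining groups
LIMIT 3;                           -- keep at most three rows
```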

### Exercise

Find the employees who have spent more than 2 years in the company. Select their last name and first name together with the number of years worked (name this column `years`).

#### Show me the answer

```
SELECT
  last_name,
  first_name,
  COUNT(DISTINCT year) AS years
FROM employees
GROUP BY last_name, first_name
HAVING COUNT(DISTINCT year) > 2;
```

### Exercise

Find the departments where the average salary in 2012 was higher than 3,000. Show the department name with the average salary.

#### Show me the answer

```
SELECT
  department,
  AVG(salary)
FROM employees
WHERE year = 2012
GROUP BY department
HAVING AVG(salary) > 3000;
```

# Order groups

There's one more thing before you go. Groups can be sorted just like rows. 

Take a look:

```
SELECT
  customer_id,
  order_date,
  SUM(total_sum)
FROM orders
GROUP BY customer_id, order_date
ORDER BY SUM(total_sum) DESC;
```

In this case, we'll order our rows according to the total daily sum of all orders by a specific customer. The rows with the highest value will appear first.
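In PostgreSQL, `ORDER BY` can also refer to a column alias defined in the `SELECT` part, which saves repeating the aggregate expression (this does not work in `WHERE` or `HAVING`). A sketch on the sample data, assuming the `employees` schema from the beginning of this part is loaded; `total_salary` is an illustrative alias.

```sql
-- Sum up all salary rows per department and sort by that sum via its alias.
SELECT
  department,
  SUM(salary) AS total_salary
FROM employees
GROUP BY department
ORDER BY total_salary DESC;
-- Management      | 33700
-- IT              | 25900
-- Accounting      | 17600
-- Human Resources |  7850
```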

### Exercise

Sort the employees according to their total salaries. The highest values should appear first. Show the last name, the first name, and the sum.

#### Show me the answer

```
SELECT
  last_name,
  first_name,
  SUM(salary)
FROM employees
GROUP BY last_name, first_name
ORDER BY SUM(salary) DESC;
```

# Put your skills into practice

### Exercise

Show the columns `last_name` and `first_name` from the `employees` table together with each person's average salary and the number of years they have worked in the company.

Use the following aliases: `average_salary` for each person's average salary and `years_worked` for the number of years worked in the company. 

Show only the employees who have spent more than 2 years in the company. Order the results by the average salary in descending order.

#### Show me the answer

```
SELECT
  last_name,
  first_name,
  AVG(salary) AS average_salary,
  COUNT(DISTINCT year) AS years_worked
FROM employees
GROUP BY last_name, first_name
HAVING COUNT(DISTINCT year) > 2
ORDER BY AVG(salary) DESC;
```