# Window Functions, Part 1

In this lesson, we're going to talk about `window functions` in PostgreSQL. 

SQL window functions make building complex aggregations much simpler. 

They are so powerful that they serve as a dividing point in time: people talk about SQL before window functions and SQL after window functions.

Once window functions and recursive Common Table Expressions were introduced to the language, SQL became [`Turing complete`](https://www.youtube.com/watch?v=RPQD7-AOjMI).

Window functions are a relatively new addition to SQL.

They were originally standardized with SQL:2003. 

PostgreSQL has supported window functions since version 8.4, released in 2009. It was not the first major database system to do so: Oracle, for example, offered analytic functions earlier.

We're going to start from the very beginning. 

Step by step, you will learn new elements that make up window functions. 

I hope that at the end of our course, you will take one more look at that scary example and think: this is now so easy!

## What is a window function exactly? 

It is a function that performs calculations across a set of table rows. 

The rows are somehow related to the current row.

For example, with window functions you can compute the sum of the values in the current row, the row before it and the row after it, as in the [picture](https://learnsql.com/static/postgresql-window-functions-window-functions-part2-ex4.gif):

<img src = 'https://learnsql.com/static/postgresql-window-functions-window-functions-part2-ex4.gif'>
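To make the picture concrete, here is a minimal sketch of that "current row, one before, one after" sum. It runs on an inline list of five numbers rather than on a real table, and the names `t`, `n` and `neighbour_sum` are made up for illustration. The `ORDER BY` and `ROWS BETWEEN` parts are frame syntax that this lesson does not explain yet, so treat it as a preview only.

```sql
-- Sum of the current value, the one before it and the one after it.
-- The inline VALUES list stands in for a table; t and n are illustrative names.
SELECT
  n,
  SUM(n) OVER (
    ORDER BY n
    ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
  ) AS neighbour_sum
FROM (VALUES (1), (2), (3), (4), (5)) AS t(n);
-- neighbour_sum for n = 1..5: 3, 6, 9, 12, 9
```

The first and last rows have only one neighbour, which is why their sums are smaller.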



We call them window functions precisely because the set of rows is called a window or a window frame. Take a look at the syntax:

````
<window_function> OVER (...)
````

`<window_function>` can be an aggregate function that you already know (`COUNT()`, `SUM()`, `AVG()` etc.), or another function, such as a ranking or an analytical function that you'll get to know in the coming lectures.

The window frame is defined in the `OVER(...)` clause. 

A large part of the course explains how to define the window frame with `OVER(...)`.

This is what we're going to talk about in the next section.


But first, let's get our schema:

https://www.db-fiddle.com/f/7FpnuSLVbfMFScZzCR2nsa/0

Here's the schema:

```

CREATE TABLE IF NOT EXISTS "employee" (
    "id" INT,
    "first_name" TEXT,
    "last_name" TEXT,
    "department_id" INT,
    "salary" INT,
    "years_worked" INT
);
INSERT INTO "employee" VALUES
    (1,'Diane','Turner',1,5330,4),
    (2,'Clarence','Robinson',1,3617,2),
    (3,'Eugene','Phillips',1,4877,2),
    (4,'Philip','Mitchell',1,5259,3),
    (5,'Ann','Wright',2,2094,5),
    (6,'Charles','Wilson',2,5167,5),
    (7,'Russell','Johnson',2,3762,4),
    (8,'Jacqueline','Cook',2,6923,3),
    (9,'Larry','Lee',3,2796,4),
    (10,'Willie','Patterson',3,4771,5),
    (11,'Janet','Ramirez',3,3782,2),
    (12,'Doris','Bryant',3,6419,1),
    (13,'Amy','Williams',3,6261,1),
    (14,'Keith','Scott',3,4928,8),
    (15,'Karen','Morris',4,6347,6),
    (16,'Kathy','Sanders',4,6286,1),
    (17,'Joe','Thompson',5,5639,3),
    (18,'Barbara','Clark',5,3232,1),
    (19,'Todd','Bell',5,4653,1),
    (20,'Ronald','Butler',5,2076,5);

CREATE TABLE IF NOT EXISTS "department" (
    "id" INT,
    "name" TEXT
);
INSERT INTO "department" VALUES
    (1,'IT'),
    (2,'Management'),
    (3,'Human Resources'),
    (4,'Accounting'),
    (5,'Help Desk');
    
CREATE TABLE IF NOT EXISTS "purchase" (
    "id" INT,
    "department_id" INT,
    "item" TEXT,
    "price" INT
);
INSERT INTO "purchase" VALUES
    (1,4,'monitor',531),
    (2,1,'printer',315),
    (3,3,'whiteboard',170),
    (4,5,'training',117),
    (5,3,'computer',2190),
    (6,1,'monitor',418),
    (7,3,'whiteboard',120),
    (8,3,'monitor',388),
    (9,5,'paper',37),
    (10,1,'paper',695),
    (11,3,'projector',407),
    (12,4,'garden party',986),
    (13,5,'projector',481),
    (14,2,'chair',180),
    (15,2,'desk',854),
    (16,2,'post-it',15),
    (17,3,'paper',60),
    (18,2,'tv',943),
    (19,2,'desk',478),
    (20,5,'keyboard',214);

```


# `OVER()`

Let's start by focusing on `OVER (...)`, which defines the window. 

The most basic example is `OVER()` and means that the window consists of all rows in the query. Take a look:

```
SELECT
  first_name,
  last_name,
  salary,  
  AVG(salary) OVER()
FROM employee;
```


That's not a very complicated query, but take a look at the last column:

```
AVG(salary) OVER()
```

`AVG(salary)` means we're looking for the average salary. 

Where exactly? 

Everywhere we can, because `OVER()` means 'for all rows in the query result'. 

In other words, we're looking for the average salary in the entire company.

Note that we did NOT group rows. 

`OVER()` makes it possible to show the details of single rows and the result of an aggregating function together. 

That wouldn't be so easy with GROUP BY: we would have to write a subquery, which is more complicated and usually less efficient.

`OVER()` makes our work simple and efficient at the same time.
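For comparison, the subquery version mentioned above might look roughly like this. It is a sketch against the `employee` table from the schema; the alias `avg_salary` is my own naming.

```sql
-- Same result as AVG(salary) OVER(), written with a scalar subquery instead.
SELECT
  first_name,
  last_name,
  salary,
  (SELECT AVG(salary) FROM employee) AS avg_salary  -- company-wide average
FROM employee;
```

With the sample data, the 20 salaries add up to 94219, so both versions should show an average of about 4710.95 in every row.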

# Exercise


Now it's your turn to write a window function. For each employee, find their first name, last name, salary and the sum of all salaries in the company.

Note that the last column is an aggregated column, even though you're not using a `GROUP BY`.



```
SELECT
  first_name,
  last_name,
  salary,
  SUM(salary) OVER()
FROM employee;
```

# Exercise

For each item in the purchase table, select its name (column item), price and the average price of all items.

```
SELECT
  item,
  price,
  AVG(price) OVER()
FROM purchase;
```

# Computations with OVER()

Typically, `OVER()` is used to compare the current row with an aggregate. 

For example, we can show an employee's salary next to the average salary. Actually, why don't we calculate the difference between these two values? 

Take a look:

```
SELECT
  first_name,
  last_name,
  salary,
  AVG(salary) OVER(),
  salary - AVG(salary) OVER() as difference
FROM employee;
```

The last column shows the difference between the employee's salary and the average salary. That's the typical usage of window functions: compare the current row with an aggregate for a group of rows. With window functions you can do such comparisons with one simple query.
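Because `AVG()` on an integer column returns a `numeric` with many decimal places, the difference can be hard to read. One possible variation, assuming you want two decimal places, wraps the expression in `ROUND()`; the alias `avg_salary` is an illustrative name.

```sql
-- Same comparison as above, rounded to two decimal places for readability.
SELECT
  first_name,
  last_name,
  salary,
  ROUND(AVG(salary) OVER(), 2) AS avg_salary,
  ROUND(salary - AVG(salary) OVER(), 2) AS difference
FROM employee;
```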

# Exercise

For each employee in table employee, select first and last name, years_worked, average of years spent in the company by all employees, and the difference between the years_worked and the average as difference.

```
SELECT
  first_name,
  last_name,
  years_worked,
  AVG(years_worked) OVER(),
  years_worked - AVG(years_worked) OVER() AS difference
FROM employee;
```

# Computations with OVER() - exercise 2

Now, take a look at another interesting example:

```
SELECT
  id,
  item,
  price,
  price::numeric / SUM(price) OVER()
FROM purchase
WHERE department_id = 2;
```

In the above query, we show all purchases from the department with id = 2. 

Note that we divide the price of the item purchased by the total price of all items purchased by that department. 

In this way, we can check what part of all expenditures each purchase constitutes.
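The `::numeric` cast matters because `price` and `SUM(price)` are both integer types, and PostgreSQL truncates integer division. The sketch below shows the uncast division next to a percentage version; the aliases `integer_ratio` and `percent_of_total` are names I made up for this example.

```sql
SELECT
  id,
  item,
  price,
  -- Integer division: every price here is smaller than the total, so this is 0.
  price / SUM(price) OVER() AS integer_ratio,
  -- Multiplying by 100.0 makes the expression numeric before dividing.
  ROUND(100.0 * price / SUM(price) OVER(), 1) AS percent_of_total
FROM purchase
WHERE department_id = 2;
```

In the sample data, department 2 spent 2470 in total, so the chair at 180 comes out at roughly 7.3 percent.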

## Exercise

For all employees from department with department_id = 3, show their:

- first_name.
- last_name.
- salary.
- the difference between their salary and the average of all salaries in that department as difference.

```
SELECT
  first_name,
  last_name,
  salary,
  salary - AVG(salary) OVER() AS difference
FROM employee
WHERE department_id = 3;
```

# OVER() and COUNT()

You can use all aggregate functions with OVER(). Let's try an example with COUNT:

```
SELECT 
  id, 
  name, 
  COUNT(id) OVER()
FROM department
ORDER BY name ASC;
```

Here, we show the id and name of each department, plus the number of all departments. At the end, we sort the rows by name.

## Exercise

For each employee that earns more than 4000, show their first_name, last_name, salary and the number of all employees who earn more than 4000.

```
SELECT
  first_name,
  last_name,
  salary,
  COUNT(id) OVER()
FROM employee
WHERE salary > 4000;
```

The concept of pure OVER() might seem easy, but let's do a few more questions to be on the safe side before we move on to more complex things.

## Exercise

For each purchase with department_id = 3, show its:

- id.
- department_id.
- item.
- price.
- maximum price from all purchases in this department.
- the difference between the maximum price and the price.



```
SELECT
  id,
  department_id,
  item,
  price,
  MAX(price) OVER(),
  MAX(price) OVER() - price AS difference
FROM purchase
WHERE department_id = 3;
```

## Exercise

For each purchase from any department, show its id, item, price, average price and the sum of all prices in that table.

```
SELECT
  id,
  item,
  price,
  AVG(price) OVER(),
  SUM(price) OVER()
FROM purchase;
```

# Range of OVER()

Of course, you can add a `WHERE` clause just as you do in any other query:

```
SELECT
  first_name,
  last_name,
  salary,
  AVG(salary) OVER(),
  salary - AVG(salary) OVER()
FROM employee
WHERE department_id = 1;
```

Now the calculation only covers the salaries in the department with id = 1. 

Earlier, we said that OVER() means 'for all rows in the query result'. 

This 'in the query result' part is very important – window functions work only on the rows returned by the query.

Here, this means we'll get the salary of each IT department employee and the average salary in that department, and not in the entire company.

That's a very important rule which you need to remember. 

Window functions are always executed AFTER the WHERE clause, so they work on whatever they find as the result.
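If you do want the company-wide average next to only the IT employees, one way to sketch it is to compute the window function in a subquery first and filter afterwards. The names `all_employees` and `company_avg` are illustrative.

```sql
-- The inner query sees all employees, so the average covers the whole company.
-- The outer WHERE then keeps only department 1.
SELECT
  first_name,
  last_name,
  salary,
  company_avg
FROM (
  SELECT
    first_name,
    last_name,
    salary,
    department_id,
    AVG(salary) OVER() AS company_avg
  FROM employee
) AS all_employees
WHERE department_id = 1;
```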


## Exercise

Show the first_name, last_name and salary of every person who works in departments with id 1, 2 or 3, along with the average salary calculated in those three departments.




```
SELECT
  first_name,
  last_name,
  salary,
  AVG(salary) OVER()
FROM employee
WHERE department_id IN (1, 2, 3);
```

# OVER and WHERE

Now, it might be tempting to use window functions in a WHERE clause, as in the example:

```
SELECT
  first_name,
  last_name,
  salary,
  AVG(salary) OVER()
FROM employee
WHERE salary > AVG(salary) OVER();
```

However, when you run this query, you'll get an error message. 

You cannot put window functions in WHERE.

Why? 

Window functions are applied after the rows are selected. 

If a window function were in a WHERE clause, you'd get a circular dependency: in order to compute the window function, you have to filter the rows with WHERE, which in turn requires computing the window function.
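A common way around this restriction is to compute the window function in one step and filter in the next. Here is a minimal sketch using a Common Table Expression; `with_avg` and `avg_salary` are names chosen for this example.

```sql
-- Step 1: attach the company-wide average to every employee.
WITH with_avg AS (
  SELECT
    first_name,
    last_name,
    salary,
    AVG(salary) OVER() AS avg_salary
  FROM employee
)
-- Step 2: filter on the already computed column.
SELECT
  first_name,
  last_name,
  salary,
  avg_salary
FROM with_avg
WHERE salary > avg_salary;
```

The filter now runs on a plain column of the intermediate result, so there is no circular dependency.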

# Summary

Let's review what we've learned so far.

- Use `<window_function> OVER()` to compute an aggregate for all rows in the query result.
- Window functions are applied after the rows are filtered by WHERE.
- Window functions are used to compute aggregates while keeping the details of individual rows.
- You can't use window functions in WHERE clauses.

# `OVER(PARTITION BY)` 

Here's a link to the fiddle for the next half of the class: https://www.db-fiddle.com/f/rsFrn1e6DDuPjRZVz95mqP/0



Earlier I taught you the simplest window function type – an aggregate with OVER(). 

In that case, the window consisted of all the rows in the query result. 

Next, I'll show you how you can change that window.

Here's a quick description of the tables:

### Train Table

Select all the information from the table train.

Each train has an id, model, maximum speed expressed in km/h, production year, the number of first class seats and second class seats. Pretty intuitive, right?


### Route Table

A route in our system is, in other words, a railroad connection between point A and B.

Each route has its own id, its friendly name, the from_city and the to_city, as well as the distance between these two cities in kilometers.

For simplicity, we assume that we only have intercity trains, i.e. there are no stations between from_city and to_city where the train could stop.

### Journey Table

Select all the information from the table journey.

A journey in our database is what passengers can buy tickets for. Each journey has its own id, is operated by a certain train, and goes via a certain route on a certain day.

Take a look at the first row: if you had wanted to go from Sheffield to Manchester with train 1 on 3 Jan 2016, you would have bought a ticket for journey with id 1.

### Ticket Table

Finally, there are tickets. Each ticket has its own id, price, seat class (1st or 2nd class) and the journey id for which it was bought. Show all these columns.



# PARTITION BY

In this part, we'll learn one construction which can be put in OVER(), namely PARTITION BY. 

The basic syntax looks like this:

```
<window_function> OVER (PARTITION BY column1, column2 ... column_n)
```

PARTITION BY works in a similar way to GROUP BY: it partitions the rows into groups, based on the columns in the PARTITION BY clause. Unlike GROUP BY, PARTITION BY does not collapse rows.

Let's look at an example. 

For each train, the query returns its id, model, first_class_places and the sum of first class places across all trains of the same model.

<img src = 'https://learnsql.com/static/postgresql-window-functions-window-functions-part3-ex5.png'>


With PARTITION BY, you can easily compute the statistics for the whole group but keep details about individual rows.

What functions can you use with PARTITION BY? 

You can use an aggregate function that you already know (COUNT(), SUM(), AVG(), etc.), or another function, such as a ranking or an analytical function that you'll get to know in the coming lectures.

Within parentheses, in turn, we've now put PARTITION BY, followed by the columns by which we want to partition (group).

```
SELECT
  id,
  model,
  first_class_places,
  SUM(first_class_places) OVER (PARTITION BY model)
FROM train;
```

As you can see, the query works fine. 

Imagine writing the same query using regular GROUP BY: you'd have to use a correlated subquery or a JOIN against a grouped subquery. 

The query would be neither as readable nor as efficient.

We no longer want to pay that price and PARTITION BY is the solution. 

Thanks to PARTITION BY, we can easily get the information about individual rows AND the information about the groups these rows belong to. 
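To see what PARTITION BY saves us, here is roughly what the GROUP BY version of the same query could look like. It assumes the `train` table from the fiddle; the aliases `t`, `m` and `model_places` are my own.

```sql
-- GROUP BY collapses rows, so the per-model sum has to be joined back
-- to the individual trains.
SELECT
  t.id,
  t.model,
  t.first_class_places,
  m.model_places
FROM train AS t
JOIN (
  SELECT
    model,
    SUM(first_class_places) AS model_places
  FROM train
  GROUP BY model
) AS m
  ON m.model = t.model;
```

One caveat of this sketch: trains with a NULL model would drop out of the join, while the PARTITION BY version keeps them.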

# Exercise

Show the id of each journey, its date and the number of journeys that took place on that date.

```
SELECT
  id,
  date,
  COUNT(id) OVER(PARTITION BY date)
FROM journey;
```

# Range of OVER(PARTITION BY)

Remember: window functions only work for those rows which are indeed returned by the query. Take a look at this query:

```
SELECT
  id,
  model,
  max_speed,
  COUNT(id) OVER (PARTITION BY max_speed)
FROM train
WHERE production_year != 2012;
```


We cut out the trains with production_year = 2012 and the query would not show them – that's pretty obvious. 

But the window function would not even count them – we could find out that there are only 2 trains with max_speed = 240, even though there is a third one which was produced in 2012. 

Note that a GROUP BY clause with a WHERE clause will behave in the same way – GROUP BY will only take into account rows which match the condition(s).
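The GROUP BY counterpart of the query above might look like this. It is a sketch against the same `train` table, with `trains_counted` as an illustrative alias.

```sql
-- WHERE runs first, so trains produced in 2012 are never counted.
SELECT
  max_speed,
  COUNT(id) AS trains_counted
FROM train
WHERE production_year != 2012
GROUP BY max_speed;
```

The counts per max_speed should match the window function version. The only difference is that here each max_speed appears once instead of once per train.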

# Exercise

Show id, model, first_class_places, second_class_places, and the number of trains of each model with more than 30 first class places and more than 180 second class places.

```
SELECT
  id,
  model,
  first_class_places,
  second_class_places,
  COUNT(id) OVER (PARTITION BY model)
FROM train
WHERE first_class_places > 30
  AND second_class_places > 180;
```

# PARTITION BY MULTIPLE COLUMNS

Of course, you can partition rows by multiple columns. 

Take a look:

```
SELECT
  route_id,
  ticket.id,
  ticket.price,
  SUM(price) OVER (PARTITION BY route_id, date)
FROM ticket
JOIN journey
ON ticket.journey_id = journey.id;
```

We wanted to show each ticket with the sum of the prices of all tickets on the particular route on the particular date. 

Neither of the tables would suffice on its own, so we had to join them together to get all the columns.
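The same idea works without a JOIN when both partition columns live in one table. As a small sketch, this query uses the `purchase` table from the first half of the class (so run it in the first fiddle); `item_total_in_department` is a made-up alias.

```sql
-- One window per (department, item) pair.
SELECT
  id,
  department_id,
  item,
  price,
  SUM(price) OVER (PARTITION BY department_id, item) AS item_total_in_department
FROM purchase;
```

In the sample data, department 2 bought two desks (854 and 478), so both desk rows show 1332. A purchase that is the only one of its kind in its department simply shows its own price.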


# Exercise

Show the id of each journey, the date on which it took place, the model of the train that was used, the max_speed of that train and the highest max_speed from all the trains that ever went on the same route on the same day.

```
SELECT
  journey.id,
  journey.date,
  train.model,
  train.max_speed,
  MAX(max_speed) OVER(PARTITION BY route_id, date)
FROM journey
JOIN train
  ON journey.train_id = train.id;
```

# Summary

Let's review what we've learned in this part:

- `OVER(PARTITION BY x)` works in a similar way to GROUP BY, defining the window as all the rows in the query result that have the same value in x.

- `x` can be a single column or multiple columns separated by commas.
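A single query can mix both kinds of windows. The following sketch, based on the `employee` table from the first half of the class, puts a per-department average next to the company-wide one; `department_avg` and `company_avg` are illustrative aliases.

```sql
SELECT
  first_name,
  last_name,
  department_id,
  salary,
  AVG(salary) OVER (PARTITION BY department_id) AS department_avg,  -- window: same department
  AVG(salary) OVER () AS company_avg                                -- window: all rows
FROM employee;
```

Each window function gets its own OVER clause, so the two averages are computed independently of each other.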

# Homework

Use the following schema (same as the first half of the class) for the next two questions:

https://www.db-fiddle.com/f/7FpnuSLVbfMFScZzCR2nsa/0

# Exercise

For each employee from department with id 1, 3 or 5, show their first name, last name, years_worked and the average number of years_worked in those departments.

```
SELECT
  first_name,
  last_name,
  years_worked,
  AVG(years_worked) OVER()
FROM employee
WHERE department_id IN (1, 3, 5);
```

# Exercise

For each purchase, show its:

- id.
- the name of the department.
- the item.
- the price.
- the minimum price from all the rows in the query result.
- the difference between the price and the minimum price.

```
SELECT
  purchase.id,
  name,
  item,
  price,
  MIN(price) OVER(),
  price - MIN(price) OVER()
FROM purchase
JOIN department
  ON purchase.department_id = department.id;
```

# PARTITION BY HW

Use the following schema for the homework problems about PARTITION BY: https://www.db-fiddle.com/f/rsFrn1e6DDuPjRZVz95mqP/0


# Exercise

For each journey, show its id, the production_year of the train on that journey, the number of journeys the train took and the number of journeys on the same route.



```
SELECT
  journey.id,
  production_year,
  COUNT(journey.id) OVER(PARTITION BY train_id),
  COUNT(journey.id) OVER(PARTITION BY route_id)
FROM train
JOIN journey
  ON train.id = journey.train_id;
```

# Exercise

For each ticket, show its id, price, date of its journey, the average price of tickets sold on that day and the number of tickets sold on that day. Exclude journeys with train_id = 5.

```
SELECT
  ticket.id,
  price,
  date,
  AVG(price) OVER(PARTITION BY date),
  COUNT(ticket.id) OVER(PARTITION BY date)
FROM ticket
JOIN journey
  ON ticket.journey_id = journey.id
WHERE train_id != 5;
```

# Exercise

For each ticket, show its id, price and a column named ratio. The ratio is the ticket price divided by the sum of the prices of all tickets purchased for the same journey.



```
SELECT
  id,
  price,
  price::numeric / SUM(price) OVER (PARTITION BY journey_id) AS ratio
FROM ticket;
```

Use this fiddle for the next couple of exercises: https://www.db-fiddle.com/f/hAm1rR8KearmtEKi3Lpwah/1

# Exercise

For each employee, show their first_name, last_name, department, salary, as well as the minimum and maximum salary in that department.

```
SELECT
  first_name,
  last_name,
  department,
  salary,
  MIN(salary) OVER(PARTITION BY department),
  MAX(salary) OVER(PARTITION BY department)
FROM employee;
```

# Exercise

For each employee, show their first_name, last_name, department, salary and the proportion of their salary to the sum of all salaries in that department. To avoid integer division, remember to cast the dividend to numeric.

```
SELECT
  first_name,
  last_name,
  department,
  salary,
  salary::numeric / SUM(salary) OVER(PARTITION BY department)
FROM employee;
```