# Notes on SQL and Databases

From DataQuest courses.

In [2]:
import sqlite3
import pandas as pd
from pathlib import Path

In [3]:
data_path = Path.home() / "datasets" / "tabular_practice"

def run_sql_query(query, db_fname="chinook.db"):
    conn = sqlite3.connect(data_path / db_fname)
    results = pd.read_sql_query(query, conn)
    conn.close()
    return results

## Random Notes

* Division of integers yields an integer (the fractional part is discarded). Cast to `REAL` for a real division: use `CAST(3 AS REAL) / 2` instead of `3 / 2`
* `ROUND(x, 2)` rounds to 2 digits after decimal point
* **Alias**: Rename an expression in `SELECT` or a table in `FROM` with `expression AS alias_name`.
* The order of execution is: `FROM`, `WHERE`, `GROUP BY`, `HAVING`, `SELECT`, `ORDER BY`. In standard SQL, an alias created in `SELECT` therefore cannot be used in `WHERE`,
  but it can be used in `ORDER BY`. Also, an alias created in `SELECT` cannot be used in another part of `SELECT`. Some dialects (SQLite among them) are more lenient about aliases in `WHERE`.
* `UPPER(x)` and `LOWER(x)` on string values
* Concatenate strings with `||`

  ```sql
  SELECT city || ', ' || state AS location
  FROM orders;
  ```

* `WHERE` for filtering: Equality is `=`. The standard not-equal operator is `<>`; most dialects, including SQLite, also accept `!=`. Conditions can be combined with `AND`, `OR`, `NOT`
* Check for missing value: `WHERE x IS NULL`
* Check for membership in list or complement: `WHERE x IN (v1, v2, v3)`, `WHERE x NOT IN (v1, v2, v3)`
* `SELECT DISTINCT`: Filters duplicate records (as specified by `SELECT`)
* `ORDER BY`, `ORDER BY .. DESC`: Sort records. Can sort by multiple columns. Default is ascending, but descending with `DESC`
* `CASE .. WHEN .. THEN .. WHEN .. THEN .. ELSE .. END`: Like switch statement
* Aggregation functions: `SUM`, `AVG`, `MIN`, `MAX`, `COUNT` (here, `COUNT(*)` counts number of rows, while `COUNT(colname)` counts number of non-null values in column `colname`)
* `COALESCE(a, b)` is short for `CASE WHEN a IS NULL THEN b ELSE a END`, so that `null` values are mapped to `b`
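
A few of the notes above can be checked quickly against an in-memory SQLite database. This is a minimal sketch; the table `demo` and its contents are made up for illustration and are not part of the course data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# Integer vs. real division, and rounding to two decimals.
int_div, real_div, rounded = conn.execute(
    "SELECT 3 / 2, CAST(3 AS REAL) / 2, ROUND(2.0 / 3, 2)"
).fetchone()

# Hypothetical table with one missing value.
conn.execute("CREATE TABLE demo (x INTEGER)")
conn.executemany("INSERT INTO demo VALUES (?)", [(1,), (None,), (3,)])

# COUNT(*) counts rows, COUNT(x) skips NULL, COALESCE maps NULL to 0.
counts = conn.execute(
    "SELECT COUNT(*), COUNT(x), SUM(COALESCE(x, 0)) FROM demo"
).fetchone()
conn.close()

print(int_div, real_div, rounded)
print(counts)  # (3, 2, 4)
```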

## Advanced Example

```sql
SELECT order_id, product_name, sales, quantity,
       CASE 
       WHEN sales BETWEEN 0 AND 50 THEN 'small sale'
       WHEN sales BETWEEN 51 AND 100 THEN 'medium sale'
       ELSE 'large sale'
       END AS sales_amount                        
FROM orders
WHERE order_id LIKE 'CA%'
ORDER BY quantity DESC
LIMIT 10;
```

* `CASE .. WHEN .. THEN .. WHEN .. THEN .. ELSE .. END`: Like switch statement
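* `BETWEEN a AND b`: Range comparison (inclusive on both ends). If `sales` is not an integer, a value such as 50.5 matches neither range in the query above and ends up in the `ELSE` branch
* `LIKE`: Pattern matching with wildcards: `%` matches any sequence of characters, `_` matches exactly one character. This is not regular-expression matching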
* `ORDER BY`, `ORDER BY .. DESC`: Sort records. Can sort by multiple columns. Default is ascending, but descending with `DESC`
* `LIMIT n`: Return only the first `n` records (supported by SQLite, MySQL and PostgreSQL, but not part of standard SQL)
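
The behaviour of `CASE`, `BETWEEN` and `LIKE` can be tried on a toy table. The sketch below uses a hypothetical four-row `orders` table (not the course data) and the same `CASE` expression as above; note how a non-integer value between the two ranges is labelled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT, sales REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("CA-1", 20.0), ("CA-2", 50.5), ("CA-3", 75.0), ("US-1", 10.0)],
)

labelled = conn.execute(
    """
    SELECT order_id,
           CASE
           WHEN sales BETWEEN 0 AND 50 THEN 'small sale'
           WHEN sales BETWEEN 51 AND 100 THEN 'medium sale'
           ELSE 'large sale'
           END AS sales_amount
      FROM orders
     WHERE order_id LIKE 'CA%'   -- keeps only IDs starting with CA
     ORDER BY order_id;
    """
).fetchall()
conn.close()

for row in labelled:
    print(row)
# ('CA-1', 'small sale')
# ('CA-2', 'large sale')   <- 50.5 falls between the two ranges
# ('CA-3', 'medium sale')
```

If such gaps are not intended, conditions like `sales <= 50` and `sales <= 100` avoid them, since `CASE` picks the first matching branch.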

## More Examples for Grouping and Aggregation

We can aggregate w.r.t. more than one column:

```sql
SELECT billing_country, billing_state, COUNT(*) AS num_row, AVG(total) AS avg_sale
  FROM invoice
 GROUP BY billing_country, billing_state 
 ORDER BY num_row DESC;
```

Aggregate expressions cannot be used for filtering in `WHERE`, since this is done before grouping. We can use `HAVING` to filter on aggregate expressions:

```sql
SELECT billing_country, billing_state, COUNT(*) AS num_row, AVG(total) AS avg_sale
  FROM invoice
 GROUP BY billing_country, billing_state
HAVING COUNT(*) > 40
 ORDER BY num_row DESC;
```

Note the order of execution:

* `FROM`
* `WHERE`: Rows are filtered before grouping is done
* `GROUP BY`
* `HAVING`
* `SELECT`: Here, aggregate expressions are specified. Aliases can be used in `ORDER BY`. Standard SQL does not allow them in `GROUP BY` or in `HAVING`,
  although some dialects (SQLite among them) accept them there
* `ORDER BY`
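
The difference between `WHERE` and `HAVING` can be seen on a small example. This is a sketch with a hypothetical six-row `invoice` table: rows are filtered first, then grouped, then groups are filtered. An aggregate inside `WHERE` is rejected by SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoice (billing_country TEXT, total REAL)")
conn.executemany(
    "INSERT INTO invoice VALUES (?, ?)",
    [("USA", 10), ("USA", 20), ("USA", 1), ("USA", 30),
     ("Canada", 5), ("Canada", 7)],
)

# WHERE drops the USA row with total 1 before grouping;
# HAVING then keeps only groups with more than two remaining rows.
rows = conn.execute(
    """
    SELECT billing_country, COUNT(*) AS num_row, AVG(total) AS avg_sale
      FROM invoice
     WHERE total > 4
     GROUP BY billing_country
    HAVING COUNT(*) > 2;
    """
).fetchall()

# An aggregate in WHERE is an error, since WHERE runs before grouping.
try:
    conn.execute(
        "SELECT billing_country FROM invoice "
        "WHERE COUNT(*) > 2 GROUP BY billing_country"
    )
    aggregate_in_where_ok = True
except sqlite3.OperationalError:
    aggregate_in_where_ok = False
conn.close()

print(rows)                   # [('USA', 3, 20.0)]
print(aggregate_in_where_ok)  # False
```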


## Examples of Inner Joins

Combinations of several tables by joins happen before anything else. They can be thought to result in a virtual combined table, on which the remaining command is executed. For example:

```sql
SELECT g.name AS genre, COUNT(*) as num_of_tracks
  FROM track as t
  JOIN genre as g
    ON t.genre_id = g.genre_id
 GROUP BY g.name
 ORDER BY num_of_tracks;
```

We can join more than two tables. It is customary that each `JOIN` is followed by its own `ON`:

```sql
SELECT t.track_id,
       t.name as track_name,
       m.name as media_type,
       i.unit_price
  FROM track as t
  JOIN invoice_line as i
    ON t.track_id = i.track_id 
  JOIN media_type as m
    ON t.media_type_id = m.media_type_id;
```

Another example:

```sql
SELECT i.*, e.first_name as employee
  FROM invoice as i
  JOIN customer as c
    ON i.customer_id = c.customer_id
  JOIN employee as e
    ON c.support_rep_id = e.employee_id;
```

## Self Join

A table can be joined with itself. This is useful if there are relationships between rows of the same table, encoded by index columns. For example, the `employee` table may encode who reports to whom, in which case we can run the following query:

```sql
SELECT e1.employee_id,
       e2.employee_id AS manager_id
  FROM employee AS e1
  JOIN employee AS e2
    ON e1.reports_to = e2.employee_id;
```

Different aliases are mandatory in a self join.

In [4]:
query = """
SELECT e1.employee_id,
       e2.employee_id AS manager_id
  FROM employee AS e1
  JOIN employee AS e2
    ON e1.reports_to = e2.employee_id;
"""
run_sql_query(query).head(10)

Unnamed: 0,employee_id,manager_id
0,2,1
1,6,1
2,3,2
3,4,2
4,5,2
5,7,6
6,8,6


## Left Join

`JOIN` is short for `INNER JOIN`. In an inner join, rows of the left or right table are dropped if they do not match any row on the opposite side. The resulting table can have fewer rows than either of the two tables.

In a `LEFT JOIN`, all rows of the left table are retained. If a row does not match any in the right table, the `SELECT` expressions for the right table evaluate to missing values. For example:

```sql
SELECT e1.first_name || ' ' || e1.last_name as report,
       e2.first_name || ' ' || e2.last_name as manager
  FROM employee AS e1
  LEFT JOIN employee AS e2
    ON e1.reports_to = e2.employee_id;
```

The result has one more row than the result for an inner join. Namely, the general manager is listed as `report`, with missing value for `manager`.



In [9]:
query = """
SELECT e1.first_name || ' ' || e1.last_name as report,
       e2.first_name || ' ' || e2.last_name as manager
  FROM employee AS e1
  LEFT JOIN employee AS e2
    ON e1.reports_to = e2.employee_id
 ORDER BY manager;
"""
run_sql_query(query)

Unnamed: 0,report,manager
0,Andrew Adams,
1,Nancy Edwards,Andrew Adams
2,Michael Mitchell,Andrew Adams
3,Robert King,Michael Mitchell
4,Laura Callahan,Michael Mitchell
5,Jane Peacock,Nancy Edwards
6,Margaret Park,Nancy Edwards
7,Steve Johnson,Nancy Edwards


## Cross Join

A cross join (`CROSS JOIN`) produces all combinations between rows of two tables; there is no `ON` condition. It can be used with a `WHERE` condition between keys, in which case the result is equivalent to an inner join.

For example, this query displays all pairs of distinct customers.

```sql
SELECT c1.first_name,
       c1.last_name,
       c1.email,
       c2.first_name AS first_name_2,
       c2.last_name AS last_name_2,
       c2.email AS email_2
  FROM customer AS c1
 CROSS JOIN customer AS c2
 WHERE c1.customer_id < c2.customer_id;
```
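
As a sanity check, a cross join of a table with `n` rows with itself has `n * n` rows, and the condition `c1.customer_id < c2.customer_id` keeps `n * (n - 1) / 2` of them. A minimal sketch with a hypothetical four-row `customer` table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, first_name TEXT)"
)
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [(1, "Ann"), (2, "Bob"), (3, "Cy"), (4, "Dee")],
)

# All combinations, including each customer paired with itself.
all_combinations = conn.execute(
    "SELECT COUNT(*) FROM customer AS c1 CROSS JOIN customer AS c2"
).fetchone()[0]

# Each unordered pair of distinct customers exactly once.
pairs = conn.execute(
    """
    SELECT c1.first_name, c2.first_name
      FROM customer AS c1
     CROSS JOIN customer AS c2
     WHERE c1.customer_id < c2.customer_id
     ORDER BY c1.customer_id, c2.customer_id;
    """
).fetchall()
conn.close()

print(all_combinations)  # 16
print(len(pairs))        # 6
```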

## Right Join and Full Join

`RIGHT JOIN` is like `LEFT JOIN`, but retaining all rows of the right table. At least for two tables, a right join can be converted into a left join by exchanging the tables.

`FULL JOIN` retains all rows of the union of both tables.

Some SQL dialects do not support these types of joins. SQLite, for example, only supports them from version 3.39 onwards.

## Further Notes on Joins

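The `ON` condition for matching rows does not have to be an equality between keys; it can be any expression. With outer joins, a filter on the right table belongs in the `ON` condition rather than in a subsequent `WHERE`: a `WHERE` filter is applied after the join and removes the unmatched rows (whose right-side columns are `NULL`) that the outer join was meant to keep.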

In this example, we query the number of purchases in 2020 for each track. We use left joins, since we also want to list the tracks which did not get purchased in 2020 at all.

```sql
SELECT t.track_id,
       t.name,
       COUNT(i.invoice_id) AS no_of_purchases
FROM track as t
LEFT JOIN invoice_line AS il
ON t.track_id = il.track_id
LEFT JOIN invoice AS i
ON il.invoice_id = i.invoice_id AND i.invoice_date LIKE '2020-%'
GROUP BY t.track_id, t.name;
```
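
To see why the date filter sits in the `ON` condition, we can compare both variants on toy data. The three tables below are hypothetical miniatures of the Chinook tables, with one track that was never purchased and one that was purchased only in 2019.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE track (track_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE invoice (invoice_id INTEGER PRIMARY KEY, invoice_date TEXT);
    CREATE TABLE invoice_line (invoice_id INTEGER, track_id INTEGER);
    INSERT INTO track VALUES (1, 'A'), (2, 'B'), (3, 'C');
    INSERT INTO invoice VALUES (10, '2020-01-05 00:00:00'),
                               (11, '2019-03-01 00:00:00');
    INSERT INTO invoice_line VALUES (10, 1), (11, 1), (11, 2);
    """
)

# Filter inside ON: tracks without 2020 purchases are kept with count 0.
filter_in_on = conn.execute(
    """
    SELECT t.track_id, t.name, COUNT(i.invoice_id) AS no_of_purchases
      FROM track AS t
      LEFT JOIN invoice_line AS il
        ON t.track_id = il.track_id
      LEFT JOIN invoice AS i
        ON il.invoice_id = i.invoice_id AND i.invoice_date LIKE '2020-%'
     GROUP BY t.track_id, t.name
     ORDER BY t.track_id;
    """
).fetchall()

# Filter in WHERE: rows with NULL invoice_date are removed after the join.
filter_in_where = conn.execute(
    """
    SELECT t.track_id, t.name, COUNT(i.invoice_id) AS no_of_purchases
      FROM track AS t
      LEFT JOIN invoice_line AS il
        ON t.track_id = il.track_id
      LEFT JOIN invoice AS i
        ON il.invoice_id = i.invoice_id
     WHERE i.invoice_date LIKE '2020-%'
     GROUP BY t.track_id, t.name
     ORDER BY t.track_id;
    """
).fetchall()
conn.close()

print(filter_in_on)     # [(1, 'A', 1), (2, 'B', 0), (3, 'C', 0)]
print(filter_in_where)  # [(1, 'A', 1)]
```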


## Example: Computing the Running Total

We can use a self join together with grouping and aggregation in order to display a table of all invoices along with the running total (or cumulative sum of total amounts). For this example, we assume that `invoice_id` and `invoice_date` have the same ordering.

We can think of this in two steps. First, the self join generates a virtual table where each invoice is paired with all previous ones:

```sql
SELECT i1.invoice_id,
       i1.invoice_date,
       i1.total,
       i2.invoice_id as prev_invoice_id,
       i2.total as prev_total
  FROM invoice AS i1
  JOIN invoice AS i2
    ON i1.invoice_id >= i2.invoice_id;
```

We can now group over `i1.invoice_id` and sum `i2.total` in each group:

```sql
SELECT i1.invoice_id,
       i1.invoice_date,
       i1.total,
       ROUND(SUM(i2.total), 2) AS running_total
  FROM invoice AS i1
  JOIN invoice AS i2
    ON i1.invoice_id >= i2.invoice_id
 GROUP BY i1.invoice_id, i1.invoice_date, i1.total;
```

Here, `i2.total` in the group for `i1.invoice_id` runs over all `i2.invoice_id <= i1.invoice_id`, which gives the running total.

In [10]:
query = """
SELECT i1.invoice_id,
       i1.invoice_date,
       i1.total,
       ROUND(SUM(i2.total), 2) AS running_total
  FROM invoice AS i1
  JOIN invoice AS i2
    ON i1.invoice_id >= i2.invoice_id
 GROUP BY i1.invoice_id, i1.invoice_date, i1.total;
"""
run_sql_query(query).head(10)

Unnamed: 0,invoice_id,invoice_date,total,running_total
0,1,2017-01-03 00:00:00,15.84,15.84
1,2,2017-01-03 00:00:00,9.9,25.74
2,3,2017-01-05 00:00:00,1.98,27.72
3,4,2017-01-06 00:00:00,7.92,35.64
4,5,2017-01-07 00:00:00,16.83,52.47
5,6,2017-01-10 00:00:00,1.98,54.45
6,7,2017-01-12 00:00:00,10.89,65.34
7,8,2017-01-13 00:00:00,9.9,75.24
8,9,2017-01-18 00:00:00,8.91,84.15
9,10,2017-01-18 00:00:00,1.98,86.13


## Set Operators

We can apply set operators to the results of several queries (taken as sets of rows). For this to work, the results must have the same number of columns and compatible column data types. The column names of the result come from the first query.

* `UNION`: Union. Duplicate rows are removed, so a row which appears in both results is included only once.
* `UNION ALL`: Concatenation of rows. A row which appears in both results is included twice.
* `INTERSECT`: Include only rows which appear in both results.
* `EXCEPT`: Set difference. Rows which appear in the first result, but not in the second.
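
The four operators can be compared side by side. This sketch uses two hypothetical single-column tables `a` (values 1, 2, 3) and `b` (values 2, 3, 4); the helper `values` is just for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE a (x INTEGER);
    CREATE TABLE b (x INTEGER);
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO b VALUES (2), (3), (4);
    """
)

def values(sql):
    """Run a single-column query and return its values as a list."""
    return [row[0] for row in conn.execute(sql)]

union = values("SELECT x FROM a UNION SELECT x FROM b ORDER BY x")
union_all = values("SELECT x FROM a UNION ALL SELECT x FROM b ORDER BY x")
intersect = values("SELECT x FROM a INTERSECT SELECT x FROM b ORDER BY x")
except_ = values("SELECT x FROM a EXCEPT SELECT x FROM b ORDER BY x")
conn.close()

print(union)      # [1, 2, 3, 4]
print(union_all)  # [1, 2, 2, 3, 3, 4]
print(intersect)  # [2, 3]
print(except_)    # [1]
```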

## Scalar Subqueries

We can use SQL queries in expressions of other queries. These queries are called *subqueries*. There are different kinds:

* Scalar subquery: Result is a single value
* Multi-rows subquery: Result is a single column (or list)
* Multi-columns subquery: Result is a general table

Here, we look at scalar subqueries. Example:

```sql
SELECT billing_country,
       ROUND(SUM(total) /
             (SELECT SUM(total)
                FROM invoice), 2) AS country_share
  FROM invoice
 GROUP BY billing_country
 ORDER BY country_share DESC
 LIMIT 5;
```

A scalar subquery can be used to obtain relative numbers. Subqueries can be used in `WHERE`, for example to count how
many rows have sales larger than average.

```sql
SELECT COUNT(*) AS rows_tally
  FROM invoice
 WHERE total > (SELECT AVG(total) AS total_avg
                  FROM invoice);
```

## Multi-row Subqueries

A multi-row (but single column) subquery can be used in an `IN` expression:

```sql
SELECT COUNT(*) AS tracks_tally
  FROM track
 WHERE media_type_id IN (SELECT media_type_id
                           FROM media_type
                          WHERE name LIKE '%MPEG%');
```

This query counts the number of tracks whose media type is MPEG. The alternative would be to join `track` and `media_type`, which seems more complex, and may be slower. It makes sense to use a subquery if the resulting list is much shorter than the number of rows of the table in the outer query.

For example, the following two queries return the same result:

```sql
SELECT c.first_name, c.last_name
  FROM invoice AS i
  JOIN customer AS c
    ON i.customer_id = c.customer_id
 GROUP BY i.customer_id
HAVING SUM(i.total) >= 100;
```

and

```sql
SELECT first_name, last_name
  FROM customer
 WHERE customer_id IN (SELECT customer_id
                         FROM invoice
                        GROUP BY customer_id
                       HAVING SUM(total) >= 100);
```
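
That both formulations agree can be checked on toy data. A minimal sketch, assuming hypothetical `customer` and `invoice` tables with three customers, two of whom reach a total of 100:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY,
                           first_name TEXT, last_name TEXT);
    CREATE TABLE invoice (customer_id INTEGER, total REAL);
    INSERT INTO customer VALUES (1, 'Ann', 'Lee'), (2, 'Bob', 'Ray'),
                                (3, 'Cy', 'Fox');
    INSERT INTO invoice VALUES (1, 60), (1, 50), (2, 99), (3, 100);
    """
)

# Variant 1: join, then group and filter on the aggregate.
via_join = conn.execute(
    """
    SELECT c.first_name, c.last_name
      FROM invoice AS i
      JOIN customer AS c
        ON i.customer_id = c.customer_id
     GROUP BY i.customer_id
    HAVING SUM(i.total) >= 100;
    """
).fetchall()

# Variant 2: multi-row subquery used with IN.
via_subquery = conn.execute(
    """
    SELECT first_name, last_name
      FROM customer
     WHERE customer_id IN (SELECT customer_id
                             FROM invoice
                            GROUP BY customer_id
                           HAVING SUM(total) >= 100);
    """
).fetchall()
conn.close()

# Row order is not specified by either query, so compare sorted results.
print(sorted(via_join) == sorted(via_subquery))  # True
```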


## Multi-column Subqueries

A query resulting in a table can be used in the `FROM` clause (or in `JOIN` clauses). This is useful to determine statistics from results containing other statistics. In this example, we compute the maximum sales per billing country, and then the average of these maximum values:

```sql
SELECT AVG(billing_country_max) AS billing_country_max_avg
  FROM (SELECT billing_country, MAX(total) AS billing_country_max
          FROM invoice
         GROUP BY billing_country);
```

Another use case is joins between a table with statistics and a table with extra information. For example, we can query customers with their average purchase amount, displaying names instead of customer IDs:

```sql
SELECT c.last_name, c.first_name, i.total_avg
  FROM (SELECT customer_id, AVG(total) AS total_avg
          FROM invoice
         GROUP BY customer_id) as i
  JOIN customer AS c
    ON c.customer_id = i.customer_id;
```

We can nest subqueries. For example, let us query customers with their percentage of total purchases, displaying names instead of customer IDs:

```sql
SELECT c.last_name, c.first_name, i.total_perc
  FROM (SELECT customer_id,
               SUM(total) * 100.0 /
               (SELECT SUM(total)
                  FROM invoice) AS total_perc
          FROM invoice
         GROUP BY customer_id) AS i
  JOIN customer AS c
    ON c.customer_id = i.customer_id
 ORDER BY i.total_perc DESC;
```

Here is an example with two multi-column subqueries. Per country, we display the number of invoices divided by the number of customers: the average number of invoices per customer, per country:

```sql
SELECT ct.country,
       CAST(i.invoice_tally AS REAL) / ct.customer_tally AS sale_avg_tally
  FROM (SELECT billing_country, COUNT(*) AS invoice_tally
          FROM invoice
         GROUP BY billing_country) AS i
  JOIN (SELECT country, COUNT(*) AS customer_tally
          FROM customer
         GROUP BY country) as ct
    ON i.billing_country = ct.country
 ORDER BY sale_avg_tally DESC;
```

Note the cast to `REAL`, since otherwise SQL uses integer division.

## Correlated Subqueries

An inner query is called *correlated* if it depends on values of the outer query. For example, we can display average sales per customer without using a join:

```sql
SELECT last_name, 
       first_name, 
       (SELECT AVG(total)
          FROM invoice i
         WHERE c.customer_id = i.customer_id) total_avg
  FROM customer c;
```

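The inner query depends on `c.customer_id` coming from the outer one. This query is an alternative to the following join (which, being an inner join, omits customers without any invoice, whereas the correlated version lists them with a missing average):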

```sql
SELECT c.last_name, c.first_name, AVG(i.total) AS total_avg
  FROM customer AS c
  JOIN invoice AS i
    ON c.customer_id = i.customer_id
 GROUP BY i.customer_id;
```

Correlated queries are a better choice than a join if the same table is used in inner and outer query. Here, we display all invoices whose purchase amount is larger than the average purchase amount for the same country:

```sql
SELECT invoice_id, billing_country, total
  FROM invoice AS oi
 WHERE total > (SELECT AVG(total)
                  FROM invoice AS ii
                 WHERE ii.billing_country = oi.billing_country)
 ORDER BY billing_country, total DESC;
```

The `EXISTS` / `NOT EXISTS` predicate checks whether a subquery returns at least one row or no rows at all. This can be used with correlated subqueries. For example, these are all tracks which have not been sold once:

```sql
SELECT track_id, name
  FROM track t
 WHERE NOT EXISTS (SELECT *
                     FROM invoice_line i
                    WHERE i.track_id = t.track_id);
```
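
The `NOT EXISTS` pattern can be tried on toy data and compared with a common alternative, a left join that keeps only unmatched rows. This is a sketch with hypothetical miniature `track` and `invoice_line` tables, in which track 2 was never sold.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE track (track_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE invoice_line (invoice_line_id INTEGER PRIMARY KEY,
                               track_id INTEGER);
    INSERT INTO track VALUES (1, 'A'), (2, 'B'), (3, 'C');
    INSERT INTO invoice_line VALUES (1, 1), (2, 1), (3, 3);
    """
)

# Correlated subquery: keep tracks for which no invoice line exists.
unsold_exists = conn.execute(
    """
    SELECT track_id, name
      FROM track t
     WHERE NOT EXISTS (SELECT *
                         FROM invoice_line i
                        WHERE i.track_id = t.track_id)
     ORDER BY track_id;
    """
).fetchall()

# Alternative: left join, then keep rows without a match on the right.
unsold_join = conn.execute(
    """
    SELECT t.track_id, t.name
      FROM track t
      LEFT JOIN invoice_line i
        ON i.track_id = t.track_id
     WHERE i.track_id IS NULL
     ORDER BY t.track_id;
    """
).fetchall()
conn.close()

print(unsold_exists)  # [(2, 'B')]
print(unsold_join)    # [(2, 'B')]
```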

In [11]:
query = """
SELECT invoice_id, billing_country, total
  FROM invoice AS oi
 WHERE total > (SELECT AVG(total)
                  FROM invoice AS ii
                 WHERE ii.billing_country = oi.billing_country)
 ORDER BY billing_country, total DESC
"""
run_sql_query(query).head(10)

Unnamed: 0,invoice_id,billing_country,total
0,381,Argentina,12.87
1,380,Argentina,9.9
2,218,Argentina,8.91
3,99,Australia,17.82
4,427,Australia,14.85
5,90,Australia,10.89
6,586,Australia,10.89
7,488,Austria,13.86
8,123,Austria,11.88
9,337,Austria,9.9


## Nested Subqueries

Queries with subqueries containing further subqueries are called nested. They are often the better alternative to using joins over 3 or more tables.

This query selects all playlists which contain at least one track with a duration greater than or equal to 15 minutes:

```sql
SELECT playlist_id, name
  FROM playlist
 WHERE playlist_id IN (SELECT playlist_id
                         FROM playlist_track
                        WHERE track_id IN (SELECT track_id
                                             FROM track
                                            WHERE milliseconds >= 900000));
```

The join alternative is:

```sql
SELECT DISTINCT p.playlist_id, p.name
  FROM playlist p
  JOIN playlist_track pt
    ON p.playlist_id = pt.playlist_id
  JOIN track t
    ON pt.track_id = t.track_id
 WHERE t.milliseconds >= 900000;
```

Here is a complex example, using both joins and nested subqueries. For each invoice billed in the USA that contains metal genre tracks, we display the purchase amount and the number of minutes of those tracks:

```sql
SELECT i.invoice_id,
       SUM(il.quantity * t.unit_price) AS total,
       SUM(t.minutes) AS minute
  FROM invoice AS i
  JOIN invoice_line AS il
    ON i.invoice_id = il.invoice_id
  JOIN (SELECT track_id,
               unit_price,
               milliseconds / 1000.0 / 60 as minutes
          FROM track
         WHERE genre_id IN (SELECT genre_id
                              FROM genre
                             WHERE name LIKE '%Metal%')) AS t
    ON t.track_id = il.track_id
 WHERE i.billing_country = 'USA'
 GROUP BY i.invoice_id;
```

An alternative is to avoid the join with `invoice` by using another subquery:

```sql
SELECT il.invoice_id,
       SUM(il.quantity * t.unit_price) AS total,
       SUM(t.minutes) AS minute
  FROM invoice_line AS il
  JOIN (SELECT track_id,
               unit_price,
               milliseconds / 1000.0 / 60 as minutes
          FROM track
         WHERE genre_id IN (SELECT genre_id
                              FROM genre
                             WHERE name LIKE '%Metal%')) AS t
    ON t.track_id = il.track_id
 WHERE il.invoice_id IN (SELECT invoice_id
                           FROM invoice
                          WHERE billing_country = 'USA')
 GROUP BY il.invoice_id;
```


In [12]:
query = """
SELECT il.invoice_id,
       SUM(il.quantity * t.unit_price) AS total,
       SUM(t.minutes) AS minute
  FROM invoice_line AS il
  JOIN (SELECT track_id,
               unit_price,
               milliseconds / 1000.0 / 60 as minutes
          FROM track
         WHERE genre_id IN (SELECT genre_id
                              FROM genre
                             WHERE name LIKE '%Metal%')) AS t
    ON t.track_id = il.track_id
 WHERE il.invoice_id IN (SELECT invoice_id
                           FROM invoice
                          WHERE billing_country = 'USA')
 GROUP BY il.invoice_id;
"""
run_sql_query(query).head(10)

Unnamed: 0,invoice_id,total,minute
0,4,1.98,8.788883
1,9,1.98,7.653433
2,17,1.98,9.959167
3,18,0.99,4.37985
4,42,0.99,3.4155
5,51,0.99,9.396667
6,66,0.99,1.63525
7,75,1.98,8.44625
8,89,0.99,9.8094
9,98,0.99,9.812017


## Common Table Expressions

Queries involving nested subqueries can be difficult to read and understand. We can simplify things by naming multi-column subqueries using `WITH`. For example:

```sql
WITH
city_sales_table AS (
SELECT billing_city, COUNT(*) AS billing_city_tally
  FROM invoice
 GROUP BY billing_city
)

SELECT AVG(billing_city_tally) AS billing_city_tally_avg
  FROM city_sales_table;
```

Importantly, `city_sales_table` is not computed and stored. This is just a way to split the complex expression into multiple parts.

We can define several subqueries in the common table expression:

```sql
WITH
country_invoice_total_table AS (
SELECT billing_country, SUM(total) AS invoice_total
  FROM invoice
 GROUP BY billing_country
),
country_total_table AS (
SELECT country, COUNT(*) AS customer_tally
  FROM customer
 GROUP BY country
)

SELECT ct.country, 
       ROUND(i.invoice_total / ct.customer_tally, 2) AS sale_avg
  FROM country_invoice_total_table AS i
  JOIN country_total_table AS ct
    ON i.billing_country = ct.country
 ORDER BY sale_avg DESC
 LIMIT 5;
```

## Recursive CTEs

The `employee` table has columns `employee_id` and `reports_to`; the latter is the ID of the employee's manager. This self-relation defines a graph. Recursive CTEs allow us to traverse such a graph in a query. Example:

```sql
WITH RECURSIVE
under_adams_table(employee_id, last_name, first_name, level) AS (

SELECT employee_id, last_name, first_name, 0
  FROM employee
 WHERE reports_to IS NULL

 UNION ALL
    
SELECT e.employee_id, 
       e.last_name, 
       e.first_name, 
       u.level + 1 AS level
  FROM employee e
  JOIN under_adams_table u
    ON e.reports_to = u.employee_id
 ORDER BY level
)

SELECT SUBSTR('>>>', 1, level)  || ' '  || last_name  || ' ' || first_name AS hierarchy 
  FROM under_adams_table;
```

The table `under_adams_table` is defined recursively via a set operation (`UNION ALL`). Namely, we join `employee` with the generation `level` of `under_adams_table` in order to create generation `level + 1`. The recursion terminates once a generation is empty. Such recursive
queries allow us to traverse a graph (in breadth-first order) in a SQL query.

In general, a recursive CTE has these parts:

* Anchor member: Query to select the tree root
* Compound operator: `UNION` or `UNION ALL`, follows the anchor member and combines it with the recursive member
* Recursive member: Query which references the CTE itself
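
The three parts are easiest to see in a recursive CTE without any table. The following sketch (the name `counter` is made up for illustration) generates the numbers 1 to 5; the recursion stops once the recursive member returns no rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
numbers = [
    row[0]
    for row in conn.execute(
        """
        WITH RECURSIVE
        counter(n) AS (
            SELECT 1              -- anchor member: the starting row

             UNION ALL            -- compound operator

            SELECT n + 1          -- recursive member: refers to counter
              FROM counter
             WHERE n < 5          -- no row for n = 5, so recursion ends
        )
        SELECT n
          FROM counter;
        """
    )
]
conn.close()

print(numbers)  # [1, 2, 3, 4, 5]
```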



In [13]:
query = """
WITH RECURSIVE
under_adams_table(employee_id, last_name, first_name, level) AS (

SELECT employee_id, last_name, first_name, 0
  FROM employee
 WHERE reports_to IS NULL

 UNION ALL
    
SELECT e.employee_id, 
       e.last_name, 
       e.first_name, 
       u.level + 1 AS level
  FROM employee e
  JOIN under_adams_table u
    ON e.reports_to = u.employee_id
 ORDER BY level
)

SELECT SUBSTR('>>>', 1, level)  || ' '  || last_name  || ' ' || first_name AS hierarchy 
  FROM under_adams_table;
"""
run_sql_query(query)

Unnamed: 0,hierarchy
0,Adams Andrew
1,> Edwards Nancy
2,> Mitchell Michael
3,>> Peacock Jane
4,>> Park Margaret
5,>> Johnson Steve
6,>> King Robert
7,>> Callahan Laura


This example prints the manager chain for every employee:

```sql
WITH RECURSIVE
managers_chain(employee_id, path) AS (

SELECT employee_id, last_name || ' ' || first_name AS path 
  FROM employee
 WHERE reports_to IS NULL
 
 UNION ALL

SELECT e.employee_id,
       mc.path  || '<--' || e.last_name || ' ' || e.first_name AS path
  FROM employee e
  JOIN managers_chain mc
    ON e.reports_to = mc.employee_id
)
 
SELECT path
  FROM managers_chain
 ORDER BY path;
```

In [14]:
query = """
WITH RECURSIVE
managers_chain(employee_id, path) AS (

SELECT employee_id, last_name || ' ' || first_name AS path 
  FROM employee
 WHERE reports_to IS NULL
 
 UNION ALL

SELECT e.employee_id,
       mc.path  || '<--' || e.last_name || ' ' || e.first_name AS path
  FROM employee e
  JOIN managers_chain mc
    ON e.reports_to = mc.employee_id
)
 
SELECT path
  FROM managers_chain
 ORDER BY path
"""
run_sql_query(query)

Unnamed: 0,path
0,Adams Andrew
1,Adams Andrew<--Edwards Nancy
2,Adams Andrew<--Edwards Nancy<--Johnson Steve
3,Adams Andrew<--Edwards Nancy<--Park Margaret
4,Adams Andrew<--Edwards Nancy<--Peacock Jane
5,Adams Andrew<--Mitchell Michael
6,Adams Andrew<--Mitchell Michael<--Callahan Laura
7,Adams Andrew<--Mitchell Michael<--King Robert


## Views in SQL

A view is a virtual table containing the result of a saved `SELECT` statement, dynamically created once we query the view. In other words, a view encapsulates queries into a reusable database object to reduce repetitive work and maintain data integrity. Unlike a regular table, a view doesn’t contain real data. However, like tables, we can query views and join them with other views or tables.

Views are beneficial because they allow us to:

* Hide Complexities:
  One of the main reasons for using views is to simplify complex SQL queries; after the view query is implemented, we can reuse it without
  knowing the details of the underlying query.
* Ensure Data Security:
  We can make our databases more secure by exposing part of tables instead of complete tables and granting access to particular subsets
  instead of the entire tables.
* Represent Data:
  Views are a great way to change data format and represent the data differently from their underlying tables.

Differences between views and CTEs (above):

* A SQL view is a database object whose query is only stored in the database, not the data returned by the query. On the other hand, a CTE is not stored as an object in the database, which means the CTE's query exists in the memory while the query is executing and is discarded when the query execution finishes.
* Although frequently used queries are ideal candidates for being implemented as views, we use CTEs for occasionally referenced queries.
* SQL views can be used for data access management too. They restrict users from accessing a particular part of data while they can still use the information they need. CTEs don't provide this capability.

Here is an example which selects some columns of `employee` for rows corresponding to managers:

```sql
CREATE VIEW manager AS
SELECT employee_id, first_name, last_name, title, email
  FROM employee
 WHERE employee_id IN (SELECT DISTINCT reports_to FROM employee);
```

A view is stored as an object in the database. It can be removed like this:

```sql
DROP VIEW manager;
```

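In SQLite, a view first has to be removed before it can be redefined. Some other dialects (PostgreSQL and MySQL, for example) offer `CREATE OR REPLACE VIEW` for this purpose.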
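
That a view stores the query and not the data can be observed directly. In this sketch, a hypothetical three-row `employee` table is used; the helper `manager_names` is only for illustration. After a new row is inserted into `employee`, the view returns an additional manager without being redefined.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (employee_id INTEGER PRIMARY KEY, "
    "first_name TEXT, reports_to INTEGER)"
)
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [(1, "Andrew", None), (2, "Nancy", 1), (3, "Jane", 2)],
)
conn.execute(
    """
    CREATE VIEW manager AS
    SELECT employee_id, first_name
      FROM employee
     WHERE employee_id IN (SELECT DISTINCT reports_to FROM employee);
    """
)

def manager_names():
    """Query the view; its SELECT is evaluated at this point."""
    sql = "SELECT first_name FROM manager ORDER BY employee_id"
    return [row[0] for row in conn.execute(sql)]

before = manager_names()
conn.execute("INSERT INTO employee VALUES (4, 'Steve', 3)")  # Jane now manages
after = manager_names()

conn.execute("DROP VIEW manager")
try:
    manager_names()
    view_exists = True
except sqlite3.OperationalError:  # the view is gone
    view_exists = False
conn.close()

print(before)       # ['Andrew', 'Nancy']
print(after)        # ['Andrew', 'Nancy', 'Jane']
print(view_exists)  # False
```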

## Simple Window Functions

When aggregating data with `GROUP BY`, the result is a table with one row per group. Window functions allow for the computation of more general statistics, whose values usually become a new column, so the original data is enhanced.

A window function is determined by the function (aggregation, ranking, distribution, offset) and a window definition in the `OVER` keyword.
With `OVER()`, there is a single window with all rows:

```sql
SELECT first_name,
       last_name,
       salary,
       salary - AVG(salary) OVER() AS difference
  FROM employee;
```

We can differentiate between managers and other employees:

```sql
WITH employee_and_manager AS (
SELECT *,
       title LIKE '%Manager%' AS manager
  FROM employee
)

SELECT first_name,
       last_name,
       salary,
       AVG(salary) OVER(PARTITION BY manager) AS avg_salary,
       salary - AVG(salary) OVER(PARTITION BY manager) AS difference
  FROM employee_and_manager;
```

With `OVER(ORDER BY [expr])`, we can order the rows (within each partition, if combined with `PARTITION BY`). The effect of this depends on the window function. For `SUM`, the window for the current row contains this row and all rows before. This is useful for computing cumulative sums:

```sql
SELECT *,
       SUM(quantity) OVER(ORDER BY sales_date) AS running_total_quantity
  FROM apple_sales_quantity_by_month;
```

Note some things:

* The rows of the result are usually returned in the order of the first `OVER(ORDER BY ...)` expression (if there is more than one), but only an outer `ORDER BY` guarantees the output order. Each window
  function uses its own ordering.
* There is one window per unique value in the `ORDER BY` sequence. In this sense, `ORDER BY` is related to `PARTITION BY`.

```sql
SELECT quantity,
       SUM(quantity) OVER(ORDER BY quantity) AS running_total_quantity
  FROM apple_sales_quantity_by_month;
```

results in

```
quantity  running_total_quantity
--------------------------------
25        25
30        55
40        135
40        135
47        182
50        232
--------------------------------
```

Note the change from 55 to 135, an increase by 40+40 (sum over two rows), and 135 staying the same for the two rows.
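
The table above can be reproduced with an in-memory database (window functions need SQLite 3.25 or newer). This sketch uses a hypothetical one-column `sales` table with the six quantities and, for comparison, adds a column with an explicit `ROWS` frame (framing is covered in the next section), which counts tied rows one at a time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (quantity INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?)", [(25,), (30,), (40,), (40,), (47,), (50,)]
)

rows = conn.execute(
    """
    SELECT quantity,
           SUM(quantity) OVER(ORDER BY quantity) AS range_total,
           SUM(quantity) OVER(
               ORDER BY quantity
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS rows_total
      FROM sales
     ORDER BY quantity, rows_total;
    """
).fetchall()
conn.close()

for row in rows:
    print(row)
# (25, 25, 25)
# (30, 55, 55)
# (40, 135, 95)    <- default frame includes both tied rows
# (40, 135, 135)
# (47, 182, 182)
# (50, 232, 232)
```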

## Window Framing

Window framing allows us to control the window size relative to the current row.

`ROWS BETWEEN [start_expr] AND [end_expr]`. Here, the expressions can be:

* `n PRECEDING`
* `m FOLLOWING`
* `CURRENT ROW`
* `UNBOUNDED PRECEDING` (only `[start_expr]`)
* `UNBOUNDED FOLLOWING` (only `[end_expr]`)

Examples:

* `1 PRECEDING AND 1 FOLLOWING`: Window of 3 rows, centered at current row.
* `2 PRECEDING AND 1 PRECEDING`: 2 rows just before current row.
* `UNBOUNDED PRECEDING AND CURRENT ROW`: From start of window until current row.
* `CURRENT ROW AND UNBOUNDED FOLLOWING`: From current row until end of window.

`RANGE BETWEEN [start_expr] AND [end_expr]`. This is slightly different. First, `n PRECEDING` and `m FOLLOWING` are not supported by every dialect; where they are, `n` and `m` are offsets in the `ORDER BY` value rather than row counts.
Second, rows with equal values in the `ORDER BY` sequence share the same window. See the discussion for `SUM` above.

**Note**: The default framing for a window function with `ORDER BY` is `RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW`, or, for short, `RANGE UNBOUNDED PRECEDING`. This explains the behaviour with `SUM` noted above.
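
Two of the `ROWS` frames listed above can be illustrated on a toy series. This is a minimal sketch, assuming a hypothetical table `series` with values 10 to 50 and SQLite 3.25 or newer. An empty frame, as for the first row of the second column, makes `SUM` return `NULL`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE series (id INTEGER PRIMARY KEY, value INTEGER)")
conn.executemany(
    "INSERT INTO series VALUES (?, ?)",
    [(1, 10), (2, 20), (3, 30), (4, 40), (5, 50)],
)

rows = conn.execute(
    """
    SELECT id,
           SUM(value) OVER(
               ORDER BY id
               ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
           ) AS centered_sum,
           SUM(value) OVER(
               ORDER BY id
               ROWS BETWEEN 2 PRECEDING AND 1 PRECEDING
           ) AS previous_two_sum
      FROM series
     ORDER BY id;
    """
).fetchall()
conn.close()

for row in rows:
    print(row)
# (1, 30, None)
# (2, 60, 10)
# (3, 90, 30)
# (4, 120, 50)
# (5, 90, 70)
```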

## Window Aggregate Functions

Here are some gotchas, concerning window aggregate functions (`SUM`, `AVG`, `MIN`, `MAX`, `COUNT`, `STDDEV`).

### Order of Execution

Window functions can only be used in `SELECT` and `ORDER BY`. They cannot be used in `WHERE`, for example, because they are applied to the result of the clauses executed before `SELECT` (such as `WHERE`, `GROUP BY`, `HAVING`). If they are aliased in `SELECT`, the aliases can be used in `ORDER BY`.

### Window Aggregate Functions in Aggregate Queries

Window aggregate functions can be used in aggregate queries (using `GROUP BY`), but they need to be defined on expressions which are already aggregated. For example, we would like to display the total salary per department along with the overall company total salary. This query does **not** work:

```sql
SELECT department,
       SUM(salary) AS total_department_salary,
       SUM(salary) OVER() AS total_company_salary
  FROM employees
 GROUP BY department;
```

Namely, the expression `salary` inside the window aggregate function is not itself an aggregate. We need to use this query:

```sql
SELECT department,
       SUM(salary) AS total_department_salary,
       SUM(SUM(salary)) OVER() AS total_company_salary
  FROM employees
 GROUP BY department;
```

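We can also use a scalar subquery, which may be simpler here. Also, while the double aggregation works out for `SUM`, `MIN` and `MAX`, it does not give the overall value for `AVG` or `STDDEV`: the average of per-group averages equals the overall average only if all groups have the same size.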

```sql
SELECT department,
       SUM(salary) AS total_department_salary,
       (SELECT SUM(salary)
          FROM employees) AS total_company_salary
  FROM employees
 GROUP BY department;
```
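
The caveat about `AVG` can be checked on a toy table. The sketch below assumes a hypothetical `employees` table with two departments of different size and an SQLite version with window function support (3.25 or newer). The double `SUM` matches the company total, while the double `AVG` differs from the overall average returned by the scalar subquery.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (department TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("A", 10), ("A", 20), ("A", 30), ("B", 100)],
)

rows = conn.execute(
    """
    SELECT department,
           SUM(salary) AS total_department_salary,
           SUM(SUM(salary)) OVER() AS total_company_salary,
           AVG(AVG(salary)) OVER() AS avg_of_department_avgs,
           (SELECT AVG(salary)
              FROM employees) AS avg_company_salary
      FROM employees
     GROUP BY department
     ORDER BY department;
    """
).fetchall()
conn.close()

for row in rows:
    print(row)
# Departments A and B have averages 20 and 100, so the average of
# averages is 60, whereas the average over all four salaries is 40.
```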

A more complex example is this. We would like to group by month and brand and display total revenues, but also total revenues per brand (summed over months):

```sql
WITH phone_sales_by_month_and_month AS (
SELECT *, EXTRACT(MONTH FROM sales_date) AS month
  FROM phone_sales_by_month
)

SELECT month,
       brand,
       SUM(quantity * unit_price) AS monthly_brand_revenue,
       SUM(SUM(quantity * unit_price)) OVER(PARTITION BY brand) AS total_brand_revenue
  FROM phone_sales_by_month_and_month
 GROUP BY brand, month
 ORDER BY brand, month;
```

The aggregate query sums revenues over `(brand, month)`, and then `SUM(...) OVER(PARTITION BY brand)` sums over month.

## Ranking Window Functions

These can be used for ranking rows. The usage with `ORDER BY` is mandatory, since the ordering defines the ranking. If combined with `PARTITION BY`, the rankings are done separately for each partition. The functions differ in how they treat ties, i.e. when the `ORDER BY` column contains non-unique values:

* `ROW_NUMBER()`: Assigns row numbers 1, 2, 3, ... consistent with the ordering. If the ordering column is not unique, the order in which tied rows are numbered is non-deterministic.
* `RANK()`: Ranks are numbers 1, 2, 3, ..., but equal values in the ordering column receive the same rank. Gives rise to rank sequences such as 1, 2, 2, 4, 4, 4, 7, 8, 8, 10. After a stretch of `n` equal values, the rank is increased by `n`.
* `DENSE_RANK()`: Same as `RANK()`, but the rank sequence does not have holes. For the example above, we'd have 1, 2, 2, 3, 3, 3, 4, 5, 5, 6.

The `NTILE(n)` window function is used to partition rows into `n` buckets of (nearly) equal size along an ordering. If the number of rows is not a multiple of the number of buckets, the first buckets contain one more row than the remaining ones.
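
The ranking functions can be compared side by side. The following is a small sketch using Python's `sqlite3` module (assuming SQLite 3.25 or newer); the `scores` table and its values are hypothetical and chosen to reproduce the rank sequences listed above.

```python
import sqlite3
from collections import Counter

# Hypothetical scores with ties, chosen to match the sequences above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (score INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?)",
    [(s,) for s in [10, 20, 20, 30, 30, 30, 40, 50, 50, 60]],
)

rows = conn.execute("""
SELECT score,
       ROW_NUMBER() OVER(ORDER BY score) AS row_num,
       RANK() OVER(ORDER BY score) AS rnk,
       DENSE_RANK() OVER(ORDER BY score) AS dense_rnk,
       NTILE(4) OVER(ORDER BY score) AS bucket
  FROM scores
 ORDER BY score, row_num;
""").fetchall()
conn.close()

print([r[2] for r in rows])  # [1, 2, 2, 4, 4, 4, 7, 8, 8, 10]
print([r[3] for r in rows])  # [1, 2, 2, 3, 3, 3, 4, 5, 5, 6]

# 10 rows in 4 buckets: the first two buckets get 3 rows, the rest 2.
bucket_sizes = dict(Counter(r[4] for r in rows))
print(bucket_sizes)
```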

## Offset Window Functions

These window functions are used to pick a value from a single row:

* Relative to current row: `LEAD`, `LAG`
* Relative to window: `FIRST_VALUE`, `LAST_VALUE`, `NTH_VALUE`

Examples:

For each brand, select the first month with a silver sale (between 49K and 60K revenue):

```sql
WITH temp_table AS
(
    SELECT *,
           FIRST_VALUE(sales_date) OVER(
               PARTITION BY brand
                   ORDER BY sales_date
           ) AS first_silver_sales_date
      FROM phone_sales_revenue_by_month
     WHERE revenue BETWEEN 49000 AND 60000
)

SELECT sales_date,
       brand,
       revenue
  FROM temp_table
 WHERE sales_date = first_silver_sales_date;
```

For each row, display the percentage change of revenue relative to the revenue of the first and of the last month. Note that this needs the window frame to be defined appropriately: with `ORDER BY`, the default frame runs from unbounded preceding to the current row, so `LAST_VALUE` would just return the value of the current row:

```sql
WITH temp_table AS (
    SELECT sales_date, brand, revenue,
           FIRST_VALUE(revenue) OVER(
               PARTITION BY brand
               ORDER BY sales_date
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS first_revenue,
           LAST_VALUE(revenue) OVER(
               PARTITION BY brand
               ORDER BY sales_date
               ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
           ) AS last_revenue
      FROM phone_sales_revenue_by_month
)

SELECT sales_date, brand, revenue,
       ROUND(100.0 * (revenue - first_revenue) / first_revenue, 2) AS first_month_pct_change,
       ROUND(100.0 * (revenue - last_revenue) / last_revenue, 2) AS last_month_pct_change
  FROM temp_table;
```
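
The effect of the default frame on `LAST_VALUE` can be seen in a small sketch with Python's `sqlite3` module (assuming SQLite 3.25 or newer). The table `revenue_by_month` and its three rows are hypothetical.

```python
import sqlite3

# Hypothetical monthly revenues for a single brand.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE revenue_by_month (sales_date TEXT, revenue INTEGER);
    INSERT INTO revenue_by_month VALUES
        ('2022-01-01', 100), ('2022-02-01', 150), ('2022-03-01', 120);
""")

rows = conn.execute("""
SELECT sales_date,
       LAST_VALUE(revenue) OVER(ORDER BY sales_date) AS last_default_frame,
       LAST_VALUE(revenue) OVER(
           ORDER BY sales_date
           ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
       ) AS last_explicit_frame
  FROM revenue_by_month
 ORDER BY sales_date;
""").fetchall()
conn.close()

for row in rows:
    print(row)
# ('2022-01-01', 100, 120)
# ('2022-02-01', 150, 120)
# ('2022-03-01', 120, 120)
```

With the default frame, `LAST_VALUE` simply echoes the current row's revenue. Only the explicit frame reaches the last month.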

Display the percentage change in quantity from this month to the next. This uses `LEAD` with the default offset of 1:

```sql
WITH temp_table AS (
SELECT brand,
       sales_date,
       quantity,
       LEAD(quantity) OVER(
           PARTITION BY brand
           ORDER BY sales_date
       ) AS next_month_sales
  FROM phone_sales_quantity_by_month
)

SELECT *,
       100.0 * (next_month_sales - quantity) / quantity AS sales_percentage_change
  FROM temp_table
 ORDER BY brand, sales_date;
```
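
One detail worth checking is what happens at the partition boundaries, where no previous or next row exists. The sketch below uses Python's `sqlite3` module (assuming SQLite 3.25 or newer) with a hypothetical `monthly_quantity` table. It also shows the optional third argument of `LEAD`, which supplies a default value.

```python
import sqlite3

# Hypothetical monthly quantities for a single brand.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE monthly_quantity (sales_date TEXT, quantity INTEGER);
    INSERT INTO monthly_quantity VALUES
        ('2022-01-01', 10), ('2022-02-01', 15), ('2022-03-01', 12);
""")

rows = conn.execute("""
SELECT sales_date,
       LAG(quantity) OVER(ORDER BY sales_date) AS prev_quantity,
       LEAD(quantity) OVER(ORDER BY sales_date) AS next_quantity,
       LEAD(quantity, 1, 0) OVER(ORDER BY sales_date) AS next_or_zero
  FROM monthly_quantity
 ORDER BY sales_date;
""").fetchall()
conn.close()

for row in rows:
    print(row)
# ('2022-01-01', None, 15, 15)
# ('2022-02-01', 10, 12, 12)
# ('2022-03-01', 15, None, 0)
```

Without a default, the missing neighbour is `NULL` (`None` in Python), so a percentage change computed from it is `NULL` as well for the last month of each partition.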

## Distribution Window Functions

These functions provide insights into the data distribution. They are typically about the rank of rows, or about percentiles.

* `CUME_DIST() OVER(...)`: Fraction of rows with values less than or equal to the current row value for the `ORDER BY` column. The last value is equal to 1, the first value is larger than 0.
* `PERCENT_RANK() OVER(...)`: Ratio `(rank - 1) / (num_rows - 1)` w.r.t. values of `ORDER BY` column. The first value is 0.
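
Both definitions can be verified on a tiny example. This is a sketch with Python's `sqlite3` module (assuming SQLite 3.25 or newer); the `sales` table and the column aliases are made up for illustration.

```python
import sqlite3

# Hypothetical quantities, including a tie.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (quantity INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?)", [(10,), (20,), (20,), (30,)])

rows = conn.execute("""
SELECT quantity,
       CUME_DIST() OVER(ORDER BY quantity) AS cumulative_fraction,
       PERCENT_RANK() OVER(ORDER BY quantity) AS relative_rank
  FROM sales
 ORDER BY quantity;
""").fetchall()
conn.close()

for row in rows:
    print(row)
# Cumulative fractions: 1/4, 3/4, 3/4, 4/4 (tied rows share the larger value).
# Relative ranks use RANK() values 1, 2, 2, 4: (rank - 1) / 3 = 0, 1/3, 1/3, 1.
```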

### WITHIN GROUP Clause

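Next, `PERCENTILE_CONT` and `PERCENTILE_DISC` are not really window functions, but aggregation functions operating on a sorted column (ordered-set aggregate functions). The sort ordering is specified with the `WITHIN GROUP` clause, which is part of the SQL standard (since SQL:2003) and is offered by several RDBMS such as PostgreSQL, Oracle, and SQL Server, but not by all of them.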

Example: This is how to compute a continuous quantile (interpolating between values):

```sql
SELECT PERCENTILE_CONT(0.50) WITHIN GROUP(ORDER BY quantity) AS "Median of Quantity"
  FROM phone_sales_quantity_by_month;
```

If the quantile value is supposed to be one of the column values (the first value in the ordering whose cumulative distribution is greater than or equal to the requested fraction):

```sql
SELECT PERCENTILE_DISC(0.50) WITHIN GROUP(ORDER BY quantity) AS "Median of Quantity"
  FROM phone_sales_quantity_by_month;
```

These can be used in subqueries. Since they are ordered-set aggregate functions rather than window functions, they reduce a group to a single value, so a scalar subquery containing them can also be used for filtering in a `WHERE` clause.
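
To make the difference between the two percentile functions concrete, here is a plain-Python sketch of the usual definitions (linear interpolation for the continuous version, first value reaching the requested fraction for the discrete one). The helper names `percentile_cont` and `percentile_disc` and the sample data are hypothetical, and this is only an illustration of the idea, not the implementation of any particular RDBMS.

```python
import math

def percentile_cont(values, fraction):
    """Interpolated percentile: position fraction * (n - 1) in the sorted list."""
    ordered = sorted(values)
    pos = fraction * (len(ordered) - 1)  # 0-based fractional position
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def percentile_disc(values, fraction):
    """First sorted value whose cumulative fraction of rows reaches `fraction`."""
    ordered = sorted(values)
    idx = max(math.ceil(fraction * len(ordered)) - 1, 0)
    return ordered[idx]

quantities = [10, 20, 30, 40]
median_cont = percentile_cont(quantities, 0.5)  # 25.0, between 20 and 30
median_disc = percentile_disc(quantities, 0.5)  # 20, an actual column value

# The discrete result is not always below the continuous one:
p40_cont = percentile_cont([10, 20, 30], 0.4)   # 18.0
p40_disc = percentile_disc([10, 20, 30], 0.4)   # 20
```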

### WINDOW Clause

The window definition `OVER(...)` of a window function can be named, using a window clause. This is useful if the same window definition is used several times in the same query, or simply to make a query easier to read.