# Exercise Session 6 - Information Systems for Engineers

## Queries with SQL - Part I

In the last notebooks, you used the **Data Definition Language (DDL)** to create tables, and then inserted, updated and removed rows. Now we shift to the querying side of the **Data Manipulation Language (DML)**: you will work on 'fixed' tables and query specific information from them.

- Load the SQL module

In [1]:
%load_ext sql

- Connect to the PostgreSQL database:  
```python
%sql postgresql://username:password@server_address/database_name
```
  If you use the Docker containers, the connection line in the cell below (In [2]) should work as is. If you use a local installation, you will usually need to change it to something like:
```python
%sql postgresql://user:[yourpassword]@localhost/postgres
```

In [2]:
%sql postgresql://postgres:example@db:5432

- Run the following script to create the schema and insert data into tables.
- Since we will work inside this schema for the rest of the sheet, you can run the script below to start over.  

***Warning:*** this will delete all your tables and data in the **ise_ex6_stocks** schema.


In [3]:
%%sql
DROP SCHEMA IF EXISTS ise_ex6_stocks CASCADE;
CREATE SCHEMA ise_ex6_stocks;


CREATE TABLE ise_ex6_stocks.aapl_s1_bid_a(
    time TIMESTAMP,
    open NUMERIC(7, 3),
    high NUMERIC(7, 3),
    low NUMERIC(7, 3),
    close NUMERIC(7, 3),
    volume FLOAT);

INSERT INTO ise_ex6_stocks.aapl_s1_bid_a
    (time, open, high, low, close, volume) VALUES
    ('2017-10-23 14:30:00.000', 156.848, 156.848, 156.848, 156.848, 0.36),
    ('2017-10-23 14:30:01.000', 156.837, 156.837, 156.788, 156.788, 15.012),
    ('2017-10-23 14:30:02.000', 156.797, 156.797, 156.767, 156.787, 22.502),
    ('2017-10-23 14:30:03.000', 156.787, 156.787, 156.778, 156.778, 30),
    ('2017-10-23 14:30:04.000', 156.778, 156.778, 156.748, 156.748, 15),
    ('2017-10-23 14:30:05.000', 156.788, 156.788, 156.727, 156.788, 37.9),
    ('2017-10-23 14:30:06.000', 156.748, 156.748, 156.737, 156.737, 15),
    ('2017-10-23 14:30:07.000', 156.737, 156.737, 156.737, 156.737, 30),
    ('2017-10-23 14:30:08.000', 156.647, 156.647, 156.618, 156.627, 23.627),
    ('2017-10-23 14:30:09.000', 156.618, 156.618, 156.598, 156.598, 8.01);
    
CREATE TABLE ise_ex6_stocks.aapl_s1_ask_a(
    time TIMESTAMP,
    open NUMERIC(7, 3),
    high NUMERIC(7, 3),
    low NUMERIC(7, 3),
    close NUMERIC(7, 3),
    volume FLOAT);

INSERT INTO ise_ex6_stocks.aapl_s1_ask_a
    (time, open, high, low, close, volume) VALUES
    ('2017-10-23 14:30:00.000', 156.882, 156.882, 156.882, 156.882, 7.5),
    ('2017-10-23 14:30:01.000', 156.913, 156.913, 156.822, 156.822, 7.985),
    ('2017-10-23 14:30:02.000', 156.853, 156.853, 156.843, 156.843, 15.59),
    ('2017-10-23 14:30:03.000', 156.833, 156.833, 156.808, 156.833, 0.457),
    ('2017-10-23 14:30:04.000', 156.832, 156.832, 156.792, 156.792, 15),
    ('2017-10-23 14:30:05.000', 156.833, 156.833, 156.752, 156.822, 0.987),
    ('2017-10-23 14:30:06.000', 156.823, 156.823, 156.783, 156.783, 7.73),
    ('2017-10-23 14:30:07.000', 156.753, 156.783, 156.753, 156.783, 10.302),
    ('2017-10-23 14:30:08.000', 156.702, 156.703, 156.672, 156.702, 15.247),
    ('2017-10-23 14:30:09.000', 156.692, 156.692, 156.692, 156.692, 0.151);

CREATE TABLE ise_ex6_stocks.aapl_s1_bid_b(
    time TIMESTAMP,
    open NUMERIC(7, 3),
    high NUMERIC(7, 3),
    low NUMERIC(7, 3),
    close NUMERIC(7, 3),
    volume FLOAT);

INSERT INTO ise_ex6_stocks.aapl_s1_bid_b
    (time, open, high, low, close, volume) VALUES
    ('2017-10-23 14:30:05.000', 156.788, 156.788, 156.727, 156.788, 37.9),
    ('2017-10-23 14:30:06.000', 156.748, 156.748, 156.737, 156.737, 15),
    ('2017-10-23 14:30:07.000', 156.737, 156.737, 156.737, 156.737, 30),
    ('2017-10-23 14:30:08.000', 156.647, 156.647, 156.618, 156.627, 23.627),
    ('2017-10-23 14:30:09.000', 156.618, 156.618, 156.598, 156.598, 8.01),
    ('2017-10-23 14:30:10.000', 156.598, 156.627, 156.598, 156.608, 0.665),
    ('2017-10-23 14:30:11.000', 156.607, 156.607, 156.597, 156.597, 17.6),
    ('2017-10-23 14:30:12.000', 156.652, 156.657, 156.598, 156.598, 22.53),
    ('2017-10-23 14:30:13.000', 156.597, 156.597, 156.568, 156.568, 22.5),
    ('2017-10-23 14:30:14.000', 156.567, 156.567, 156.567, 156.567, 0.23);
    
CREATE TABLE ise_ex6_stocks.goog_s1_bid_a(
    time TIMESTAMP,
    open NUMERIC(7, 3),
    high NUMERIC(7, 3),
    low NUMERIC(7, 3),
    close NUMERIC(7, 3),
    volume FLOAT);

INSERT INTO ise_ex6_stocks.goog_s1_bid_a
    (time, open, high, low, close, volume) VALUES   
    ('2017-10-23 14:30:01.000', 1005.708, 1005.708, 1005.708, 1005.708, 15),
    ('2017-10-23 14:30:02.000', 1005.267, 1005.267, 1005.267, 1005.267, 0.019),
    ('2017-10-23 14:30:03.000', 1005.148, 1005.148, 1005.148, 1005.148, 22.5),
    ('2017-10-23 14:30:04.000', 1005.148, 1005.528, 1005.148, 1005.528, 30),
    ('2017-10-23 14:30:05.000', 1004.997, 1005.068, 1004.997, 1005.068, 0.042),
    ('2017-10-23 14:30:06.000', 1005.068, 1005.068, 1005.068, 1005.068, 0.001),
    ('2017-10-23 14:30:07.000', 1005.058, 1005.058, 1005.058, 1005.058, 10),
    ('2017-10-23 14:30:09.000', 1005.068, 1005.068, 1004.987, 1004.987, 7.501),
    ('2017-10-23 14:30:10.000', 1004.797, 1004.797, 1004.797, 1004.797, 10),
    ('2017-10-23 14:30:12.000', 1004.798, 1004.798, 1004.527, 1004.527, 7.501);
    
CREATE TABLE ise_ex6_stocks.goog_s1_ask_a(
    time TIMESTAMP,
    open NUMERIC(7, 3),
    high NUMERIC(7, 3),
    low NUMERIC(7, 3),
    close NUMERIC(7, 3),
    volume FLOAT);

INSERT INTO ise_ex6_stocks.goog_s1_ask_a
    (time, open, high, low, close, volume) VALUES 
    ('2017-10-23 14:30:01.000', 1005.723, 1005.732, 1005.723, 1005.732, 10.019),
    ('2017-10-23 14:30:02.000', 1005.622, 1005.622, 1005.622, 1005.622, 7.5),
    ('2017-10-23 14:30:03.000', 1005.622, 1005.622, 1005.412, 1005.622, 15.013),
    ('2017-10-23 14:30:04.000', 1005.262, 1005.992, 1005.262, 1005.853, 17.505),
    ('2017-10-23 14:30:05.000', 1005.853, 1005.993, 1005.802, 1005.802, 0.094),
    ('2017-10-23 14:30:06.000', 1005.422, 1005.422, 1005.422, 1005.422, 7.5),
    ('2017-10-23 14:30:07.000', 1005.802, 1005.802, 1005.802, 1005.802, 10.2),
    ('2017-10-23 14:30:09.000', 1005.273, 1005.273, 1005.233, 1005.233, 15),
    ('2017-10-23 14:30:10.000', 1005.463, 1005.463, 1005.463, 1005.463, 10.1),
    ('2017-10-23 14:30:12.000', 1005.462, 1005.462, 1005.062, 1005.062, 0.235);

 * postgresql://postgres:***@db:5432
Done.
Done.
Done.
10 rows affected.
Done.
10 rows affected.
Done.
10 rows affected.
Done.
10 rows affected.
Done.
10 rows affected.


[]

Summary of the schema:  
- 5 tables of stock-market data from NASDAQ, 23.10.2017, ~14:30 New York time for 2 companies, Apple and Google.
- Different tables for 'bid' and 'ask' prices. Ask is the price for which you can buy a stock (someone is willing to sell to you at this price), while bid is the price for which you can sell or short it (someone is willing to buy from you at this price).  
- The price of a stock is updated whenever a transaction (either buy or sell) occurs.
- Each table contains data about prices and transaction volumes for a given stock and operation (bid/ask). Prices are aggregated into ___candlesticks___ of 1 second. Open is the first price within the 1-second interval, while close is the last one. Accordingly, high and low are the maximum and minimum prices within the same interval.
- Volume is the number of shares (in thousands) that were traded (either bought or sold) within the 1-second interval.
- Missing data, e.g. Google's stock does not have data for 23.10.2017 14:30:08, are usually an indication that no transactions were completed during that interval.


For brevity, tables are written without the schema name. However, you must include it in your SQL queries.  
For example:  
  $$ \text{aapl_s1_bid_a} \rightarrow \text{ise_ex6_stocks.aapl_s1_bid_a} $$
  
If you initialized the schema correctly, running the example below should generate the following table contents:

|time | open | high | low | close | volume|
|-:|:-|-:|-:|-:|-:|
|2017-10-23 14:30:00| 156.848| 156.848| 156.848|156.848|0.36|
|2017-10-23 14:30:01| 156.837| 156.837| 156.788|156.788|15.012|
|2017-10-23 14:30:02| 156.797| 156.797| 156.767|156.787|22.502|
|2017-10-23 14:30:03| 156.787| 156.787| 156.778|156.778|30.0|
|2017-10-23 14:30:04| 156.778| 156.778| 156.748|156.748|15.0|
|2017-10-23 14:30:05| 156.788| 156.788| 156.727|156.788|37.9|
|2017-10-23 14:30:06| 156.748| 156.748| 156.737|156.737|15.0|
|2017-10-23 14:30:07| 156.737| 156.737| 156.737|156.737|30.0|
|2017-10-23 14:30:08| 156.647| 156.647| 156.618|156.627|23.627|
|2017-10-23 14:30:09| 156.618| 156.618| 156.598|156.598|8.01|

In [4]:
%sql SELECT * FROM ise_ex6_stocks.aapl_s1_bid_a;

 * postgresql://postgres:***@db:5432
10 rows affected.


time,open,high,low,close,volume
2017-10-23 14:30:00,156.848,156.848,156.848,156.848,0.36
2017-10-23 14:30:01,156.837,156.837,156.788,156.788,15.012
2017-10-23 14:30:02,156.797,156.797,156.767,156.787,22.502
2017-10-23 14:30:03,156.787,156.787,156.778,156.778,30.0
2017-10-23 14:30:04,156.778,156.778,156.748,156.748,15.0
2017-10-23 14:30:05,156.788,156.788,156.727,156.788,37.9
2017-10-23 14:30:06,156.748,156.748,156.737,156.737,15.0
2017-10-23 14:30:07,156.737,156.737,156.737,156.737,30.0
2017-10-23 14:30:08,156.647,156.647,156.618,156.627,23.627
2017-10-23 14:30:09,156.618,156.618,156.598,156.598,8.01


### Write the SQL query for the following ___bag___ operations:

$$ \delta \text{ represents the distinct property: duplicates are removed.} $$

$$ \delta(\text{aapl_s1_bid_a} \cup \text{aapl_s1_bid_b}) $$

Check the results: there should be 15 distinct rows.

In [12]:
%%sql 
SELECT * FROM ise_ex6_stocks.aapl_s1_bid_a 
UNION 
SELECT * FROM ise_ex6_stocks.aapl_s1_bid_b; 

 * postgresql://postgres:***@db:5432
15 rows affected.


time,open,high,low,close,volume
2017-10-23 14:30:09,156.618,156.618,156.598,156.598,8.01
2017-10-23 14:30:04,156.778,156.778,156.748,156.748,15.0
2017-10-23 14:30:14,156.567,156.567,156.567,156.567,0.23
2017-10-23 14:30:06,156.748,156.748,156.737,156.737,15.0
2017-10-23 14:30:11,156.607,156.607,156.597,156.597,17.6
2017-10-23 14:30:03,156.787,156.787,156.778,156.778,30.0
2017-10-23 14:30:12,156.652,156.657,156.598,156.598,22.53
2017-10-23 14:30:08,156.647,156.647,156.618,156.627,23.627
2017-10-23 14:30:01,156.837,156.837,156.788,156.788,15.012
2017-10-23 14:30:13,156.597,156.597,156.568,156.568,22.5


$$ \tau_\text{attribute ordering}\text{ represents the ORDER BY clause with specified attribute and ordering} $$

$$ \tau_\text{time ascending}(\delta(\text{aapl_s1_bid_a} \cup \text{aapl_s1_bid_b})) $$

Check the results: there should be 15 ordered rows, beginning at 14:30:00 and ending at 14:30:14.

In [13]:
%%sql 
SELECT * FROM ise_ex6_stocks.aapl_s1_bid_a 
UNION 
SELECT * FROM ise_ex6_stocks.aapl_s1_bid_b
ORDER BY time;

 * postgresql://postgres:***@db:5432
15 rows affected.


time,open,high,low,close,volume
2017-10-23 14:30:00,156.848,156.848,156.848,156.848,0.36
2017-10-23 14:30:01,156.837,156.837,156.788,156.788,15.012
2017-10-23 14:30:02,156.797,156.797,156.767,156.787,22.502
2017-10-23 14:30:03,156.787,156.787,156.778,156.778,30.0
2017-10-23 14:30:04,156.778,156.778,156.748,156.748,15.0
2017-10-23 14:30:05,156.788,156.788,156.727,156.788,37.9
2017-10-23 14:30:06,156.748,156.748,156.737,156.737,15.0
2017-10-23 14:30:07,156.737,156.737,156.737,156.737,30.0
2017-10-23 14:30:08,156.647,156.647,156.618,156.627,23.627
2017-10-23 14:30:09,156.618,156.618,156.598,156.598,8.01


$$ \text{aapl_s1_bid_a} \cup \text{aapl_s1_bid_b} $$

When duplicates are not removed, the total number of rows equals the sum of the row counts of both tables. In our case, that's 20.

In [16]:
%%sql 
SELECT * FROM ise_ex6_stocks.aapl_s1_bid_a 
UNION ALL
SELECT * FROM ise_ex6_stocks.aapl_s1_bid_b;

 * postgresql://postgres:***@db:5432
20 rows affected.


time,open,high,low,close,volume
2017-10-23 14:30:00,156.848,156.848,156.848,156.848,0.36
2017-10-23 14:30:01,156.837,156.837,156.788,156.788,15.012
2017-10-23 14:30:02,156.797,156.797,156.767,156.787,22.502
2017-10-23 14:30:03,156.787,156.787,156.778,156.778,30.0
2017-10-23 14:30:04,156.778,156.778,156.748,156.748,15.0
2017-10-23 14:30:05,156.788,156.788,156.727,156.788,37.9
2017-10-23 14:30:06,156.748,156.748,156.737,156.737,15.0
2017-10-23 14:30:07,156.737,156.737,156.737,156.737,30.0
2017-10-23 14:30:08,156.647,156.647,156.618,156.627,23.627
2017-10-23 14:30:09,156.618,156.618,156.598,156.598,8.01


$$ \tau_\text{time descending}(\text{aapl_s1_bid_a} \cup \text{aapl_s1_bid_b}) $$

When sorting rows by time, it should be very easy to catch the duplicated entries. The first entry should now be at 14:30:14.

In [17]:
%%sql 
SELECT * FROM ise_ex6_stocks.aapl_s1_bid_a 
UNION ALL
SELECT * FROM ise_ex6_stocks.aapl_s1_bid_b
ORDER BY time DESC;

 * postgresql://postgres:***@db:5432
20 rows affected.


time,open,high,low,close,volume
2017-10-23 14:30:14,156.567,156.567,156.567,156.567,0.23
2017-10-23 14:30:13,156.597,156.597,156.568,156.568,22.5
2017-10-23 14:30:12,156.652,156.657,156.598,156.598,22.53
2017-10-23 14:30:11,156.607,156.607,156.597,156.597,17.6
2017-10-23 14:30:10,156.598,156.627,156.598,156.608,0.665
2017-10-23 14:30:09,156.618,156.618,156.598,156.598,8.01
2017-10-23 14:30:09,156.618,156.618,156.598,156.598,8.01
2017-10-23 14:30:08,156.647,156.647,156.618,156.627,23.627
2017-10-23 14:30:08,156.647,156.647,156.618,156.627,23.627
2017-10-23 14:30:07,156.737,156.737,156.737,156.737,30.0


So far we have learned that the two tables have 20 entries in total, but only 15 are left after removing duplicates. Our hypothesis is that the two tables share 5 identical entries. Can you validate it empirically?

$$ \text{aapl_s1_bid_a} \cap \text{aapl_s1_bid_b} $$

In [18]:
%%sql
SELECT * FROM ise_ex6_stocks.aapl_s1_bid_a 
INTERSECT
SELECT * FROM ise_ex6_stocks.aapl_s1_bid_b;

 * postgresql://postgres:***@db:5432
5 rows affected.


time,open,high,low,close,volume
2017-10-23 14:30:09,156.618,156.618,156.598,156.598,8.01
2017-10-23 14:30:06,156.748,156.748,156.737,156.737,15.0
2017-10-23 14:30:07,156.737,156.737,156.737,156.737,30.0
2017-10-23 14:30:08,156.647,156.647,156.618,156.627,23.627
2017-10-23 14:30:05,156.788,156.788,156.727,156.788,37.9
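
The counting argument behind this check can be sketched on a smaller example. The snippet below is only a sketch: it assumes Python's built-in `sqlite3` module as a stand-in for PostgreSQL and uses two hypothetical toy tables `a` and `b` that are not part of the exercise schema. As long as neither table contains duplicates on its own, the bag union should have as many rows as the deduplicated union plus the intersection.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical toy tables; they share two identical rows.
    CREATE TABLE a (time TEXT, volume REAL);
    CREATE TABLE b (time TEXT, volume REAL);
    INSERT INTO a VALUES ('14:30:00', 0.5), ('14:30:01', 15.0), ('14:30:02', 22.5);
    INSERT INTO b VALUES ('14:30:01', 15.0), ('14:30:02', 22.5), ('14:30:03', 30.0);
""")


def count_rows(query):
    """Return the number of rows produced by a SELECT statement."""
    return conn.execute(f"SELECT COUNT(*) FROM ({query})").fetchone()[0]


bag_union = count_rows("SELECT * FROM a UNION ALL SELECT * FROM b")  # keeps duplicates
set_union = count_rows("SELECT * FROM a UNION SELECT * FROM b")      # removes duplicates
shared = count_rows("SELECT * FROM a INTERSECT SELECT * FROM b")     # rows present in both

print(bag_union, set_union, shared)
```

For the Apple bid tables, the same identity reads 20 = 15 + 5.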


We can sort by any attribute in the table. Now we want to find out when the market was busiest for Apple stock.

$$ \tau_\text{volume descending}(\text{aapl_s1_bid_a} \cap \text{aapl_s1_bid_b}) $$

In [20]:
%%sql
SELECT * FROM ise_ex6_stocks.aapl_s1_bid_a 
INTERSECT
SELECT * FROM ise_ex6_stocks.aapl_s1_bid_b
ORDER BY volume DESC;

 * postgresql://postgres:***@db:5432
5 rows affected.


time,open,high,low,close,volume
2017-10-23 14:30:05,156.788,156.788,156.727,156.788,37.9
2017-10-23 14:30:07,156.737,156.737,156.737,156.737,30.0
2017-10-23 14:30:08,156.647,156.647,156.618,156.627,23.627
2017-10-23 14:30:06,156.748,156.748,156.737,156.737,15.0
2017-10-23 14:30:09,156.618,156.618,156.598,156.598,8.01


Now we want to find out which rows appear in only one of the tables. To that end, we use the $\setminus$ operator, which represents the set difference.

$$ \text{aapl_s1_bid_a} \setminus \text{aapl_s1_bid_b} $$

Look at the [Postgres](https://www.postgresql.org/docs/13/queries-union.html) documentation to find out how to implement the subtraction query.

Side note: the name of this operator differs between database systems (Oracle, for example, traditionally calls it `MINUS`), but the operation is the same.
Unfortunately, SQL databases and implementations are not always fully compatible with each other.


In [21]:
%%sql 
SELECT * FROM ise_ex6_stocks.aapl_s1_bid_a 
EXCEPT 
SELECT * FROM ise_ex6_stocks.aapl_s1_bid_b;

 * postgresql://postgres:***@db:5432
5 rows affected.


time,open,high,low,close,volume
2017-10-23 14:30:04,156.778,156.778,156.748,156.748,15.0
2017-10-23 14:30:03,156.787,156.787,156.778,156.778,30.0
2017-10-23 14:30:01,156.837,156.837,156.788,156.788,15.012
2017-10-23 14:30:00,156.848,156.848,156.848,156.848,0.36
2017-10-23 14:30:02,156.797,156.797,156.767,156.787,22.502


Each table has 10 rows. Table A contains 5 rows that it shares with table B and 5 rows of its own, so table B contains the same 5 shared rows. Let's verify that table B also has 5 rows of its own.

$$ \text{aapl_s1_bid_b} \setminus \text{aapl_s1_bid_a} $$

Question: in this problem, would it make a difference to add the `ALL` clause (which keeps duplicates instead of removing them) to the subtraction query?
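
As a hint on what `ALL` changes in general, here is a minimal sketch in plain Python rather than SQL. It assumes two hypothetical bags of rows `a` and `b` (single letters stand in for whole rows) and mimics the two semantics: without `ALL`, a row appears once if it occurs in the left input and not in the right one; with `ALL`, a row with m copies on the left and n copies on the right appears max(m - n, 0) times.

```python
from collections import Counter

# Hypothetical bags of rows; "x" occurs three times in a and once in b.
a = ["x", "x", "x", "y"]
b = ["x", "z"]

# Set-style difference: distinct rows of a that do not occur in b at all.
set_difference = sorted(set(a) - set(b))

# Bag-style difference: Counter subtraction keeps max(m - n, 0) copies of each row.
bag_difference = sorted((Counter(a) - Counter(b)).elements())

print(set_difference)
print(bag_difference)
```

The two results can only differ when a table contains the same row more than once, which is worth checking for the tables in this exercise.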

In [None]:
%%sql 

### Write the SQL query for expressions containing projections, selections or aggregations:

$$ \pi_\text{open, high, low, close}(\text{aapl_s1_bid_a}) $$

You should obtain a table with the following header:

| open | high | low | close|
|-:|-:|-:|-:|

In [22]:
%%sql
SELECT open,high,low,close FROM ise_ex6_stocks.aapl_s1_bid_a;

 * postgresql://postgres:***@db:5432
10 rows affected.


open,high,low,close
156.848,156.848,156.848,156.848
156.837,156.837,156.788,156.788
156.797,156.797,156.767,156.787
156.787,156.787,156.778,156.778
156.778,156.778,156.748,156.748
156.788,156.788,156.727,156.788
156.748,156.748,156.737,156.737
156.737,156.737,156.737,156.737
156.647,156.647,156.618,156.627
156.618,156.618,156.598,156.598


Let's remove timeslots with low trade activity from consideration. 

$$ \sigma_{\text{volume} > 10.0}(\text{aapl_s1_bid_a}) $$

You should obtain exactly 8 rows.

In [23]:
%%sql 
SELECT *
FROM ise_ex6_stocks.aapl_s1_bid_a
WHERE volume>10.0;

 * postgresql://postgres:***@db:5432
8 rows affected.


time,open,high,low,close,volume
2017-10-23 14:30:01,156.837,156.837,156.788,156.788,15.012
2017-10-23 14:30:02,156.797,156.797,156.767,156.787,22.502
2017-10-23 14:30:03,156.787,156.787,156.778,156.778,30.0
2017-10-23 14:30:04,156.778,156.778,156.748,156.748,15.0
2017-10-23 14:30:05,156.788,156.788,156.727,156.788,37.9
2017-10-23 14:30:06,156.748,156.748,156.737,156.737,15.0
2017-10-23 14:30:07,156.737,156.737,156.737,156.737,30.0
2017-10-23 14:30:08,156.647,156.647,156.618,156.627,23.627


### Now let's nest SQL queries

Now we want to apply SQL queries to results of prior SQL queries. Can you write the SQL query with two nested projection operators, using the nested table expressions from the lecture? 

$$ \pi_\text{open, close}(\pi_\text{open, high, low, close}(\text{aapl_s1_bid_a})) $$

Note: for a nested table expression in **FROM**, Postgres requires an alias introduced with **AS**. See the [**SELECT** documentation](https://www.postgresql.org/docs/13/sql-select.html) for more details.


In [38]:
%%sql 
SELECT open, close
FROM 
(SELECT open, high, low, close 
 FROM ise_ex6_stocks.aapl_s1_bid_a) 
AS foo;

 * postgresql://postgres:***@db:5432
10 rows affected.


open,close
156.848,156.848
156.837,156.788
156.797,156.787
156.787,156.778
156.778,156.748
156.788,156.788
156.748,156.737
156.737,156.737
156.647,156.627
156.618,156.598


In the last homework, you learned that projections commute with one another and a projection on a larger subset is absorbed by a smaller one. But is this fully supported by Postgres?

$$ \pi_\text{open, high, low, close}(\pi_\text{open, close}(\text{aapl_s1_bid_a})) $$

In [None]:
%%sql


If you correctly inverted the order of the projections, you should observe an exception from the `psycopg2` library: the outer projection cannot find the columns that didn't "survive" the inner projection.
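
The same behaviour can be reproduced outside of Postgres. The following is a minimal sketch, assuming Python's built-in `sqlite3` module as a stand-in for PostgreSQL and a hypothetical one-row table `prices`; the exact exception type and message will differ from what `psycopg2` reports.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (open REAL, high REAL, low REAL, close REAL)")
conn.execute("INSERT INTO prices VALUES (1.0, 2.0, 0.5, 1.5)")

# Narrowing: the outer projection only asks for columns the inner one kept.
narrow = conn.execute(
    "SELECT open, close FROM (SELECT open, high, low, close FROM prices) AS inner_q"
).fetchall()

# Widening: the outer projection asks for columns the inner one dropped.
try:
    conn.execute(
        "SELECT open, high, low, close FROM (SELECT open, close FROM prices) AS inner_q"
    )
    widening_failed = False
except sqlite3.OperationalError as err:
    widening_failed = True
    error_message = str(err)

print(narrow, widening_failed)
```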

Can you nest two selection operators?

$$ \sigma_{\text{time} > \text{23.10.2017 14:30:05}}(\sigma_{\text{volume} > 10.0}(\text{aapl_s1_bid_a})) $$

In [41]:
%%sql
SELECT * FROM
(
    SELECT * 
    FROM ise_ex6_stocks.aapl_s1_bid_a
    WHERE volume>10.0
) 
AS foo
WHERE time>'2017-10-23 14:30:05';

 * postgresql://postgres:***@db:5432
3 rows affected.


time,open,high,low,close,volume
2017-10-23 14:30:06,156.748,156.748,156.737,156.737,15.0
2017-10-23 14:30:07,156.737,156.737,156.737,156.737,30.0
2017-10-23 14:30:08,156.647,156.647,156.618,156.627,23.627


We know that in relational algebra the selection is commutative.

Do you think we should observe the same result as in the previous query? Would the resulting table be identical?

$$ \sigma_{\text{volume} > 10.0}(\sigma_{\text{time} > \text{23.10.2017 14:30:05}}(\text{aapl_s1_bid_a})) $$

In [None]:
%%sql 

Remember that SQL executes the **WHERE** clause (selection) before the **SELECT** clause (projection). It selects the rows on which it will operate. We can see this as a "reduction" of the initial table before it is projected on some columns.
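
One consequence of this order is that **WHERE** may filter on a column that the **SELECT** list then drops. The snippet below is a minimal sketch of that point, assuming Python's built-in `sqlite3` module as a stand-in for PostgreSQL and a hypothetical toy table `ticks` with made-up values.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical toy table with made-up values.
    CREATE TABLE ticks (time TEXT, close REAL, volume REAL);
    INSERT INTO ticks VALUES
        ('14:30:00', 156.75, 0.5),
        ('14:30:01', 156.5, 15.0),
        ('14:30:02', 156.25, 22.5);
""")

# volume is used to select rows, but it is not part of the projected columns.
rows = conn.execute(
    "SELECT time, close FROM ticks WHERE volume > 10.0 ORDER BY time"
).fetchall()

print(rows)
```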

$$ \pi_\text{open, close, volume}(\sigma_{\text{volume} > 10.0}(\text{aapl_s1_bid_a})) $$

In [None]:
%%sql 

The aggregation $ \gamma $ is represented by the **GROUP BY** clause together with aggregate functions in the **SELECT** clause. Without **GROUP BY**, an aggregate function treats the whole table as a single group, so in this example, where we have a single attribute, we can compute the **MAX** directly in **SELECT**.
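
To illustrate an aggregate without **GROUP BY**, here is a minimal sketch with different aggregate functions. It assumes Python's built-in `sqlite3` module as a stand-in for PostgreSQL and a hypothetical toy table `readings`; the aliases `min_value` and `n_rows` are made up for the illustration.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical toy table with a single attribute.
    CREATE TABLE readings (value REAL);
    INSERT INTO readings VALUES (2.5), (0.5), (4.0);
""")

# No GROUP BY: the whole table is one group, so exactly one row comes back.
min_value, n_rows = conn.execute(
    "SELECT MIN(value) AS min_value, COUNT(*) AS n_rows FROM readings"
).fetchone()

print(min_value, n_rows)
```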

$$ \gamma_{\text{MAX(volume)} \rightarrow \text{max_vol}}(\pi_\text{volume}(\text{aapl_s1_bid_a})) $$

Look in [Postgres documentation of **SELECT**](https://www.postgresql.org/docs/13/sql-select.html) for examples of using aggregation in **SELECT**.

In [None]:
%%sql 

We can locate the duplicated entries by counting how many times a given *time* appears in the table. To apply the **COUNT** aggregation function for each timeslot separately, we need to *group by* the time.
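
The idea of counting per group can be sketched on a toy example. The snippet assumes Python's built-in `sqlite3` module as a stand-in for PostgreSQL and a hypothetical table `slots` in which one slot occurs twice; the table and column names are not part of the exercise schema.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical toy table; slot 't2' appears twice.
    CREATE TABLE slots (slot TEXT, value REAL);
    INSERT INTO slots VALUES ('t1', 1.0), ('t2', 2.0), ('t2', 2.0), ('t3', 3.0);
""")

# One output row per distinct slot, with the number of rows in that group.
per_slot = conn.execute(
    "SELECT slot, COUNT(slot) AS n FROM slots GROUP BY slot ORDER BY slot"
).fetchall()

# Groups with a count above 1 are the duplicated entries.
duplicates = [slot for slot, n in per_slot if n > 1]

print(per_slot, duplicates)
```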

$$ \gamma_{\text{time, open, high, low, close, volume, COUNT(time)} \rightarrow \text{count_time}}(\text{aapl_s1_bid_a} \cup \text{aapl_s1_bid_b}) $$

See [Postgres docs](https://www.postgresql.org/docs/13/tutorial-agg.html) for an example of using **GROUP BY**.

In [None]:
%%sql 

We can combine grouping with many other SQL statements, such as **ORDER BY**.

$$ \tau_\text{time ascending}(\gamma_{\text{time, open, high, low, close, volume, COUNT(time)} \rightarrow \text{count_time}}(\text{aapl_s1_bid_a} \cup \text{aapl_s1_bid_b})) $$

In [None]:
%%sql

Which earlier query returns exactly the rows for which count_time > 1 in this last query?