# 5 really cool features of Postgres 10
<br /><br /><br /><br /><br />
## <div style="text-align: right">Author: Jakub Wilkowski </div>

# About me
<br />
## Currently:
### Python developer @ [10Clouds](https://10clouds.com/)
<img class="tenc-header__logo" src="https://10clouds.com/wp-content/themes/thegem/dist/images/10clouds-logo.svg" alt="10Clouds" style="background: #000001">
<br /><br />
## Previously:
#### * Database developer at a major telecom company
#### * Application development specialist in consulting/finance
#### * MSc in telecommunications

# Agenda
<br />

* A little bit about Postgres
* Environment setup
* New features:
  1. Identity columns
  2. Native partitioning
  3. Multicolumn statistics
  4. More parallelism
  5. Full text search support for JSON & JSONB columns
* Summary

# Why Postgres?

<div align="center"><iframe align="center" src="https://giphy.com/embed/c5iMjFfrUFpza" width="480" height="266" frameBorder="0" class="giphy-embed" allowFullScreen></iframe></div>

# Postgres
<br />
* open source
* object-relational DBMS
* great flexibility
* extensions!
* a lot of fun

# Setup

## Postgres docker images

In [37]:
!docker pull postgres:9.6
!docker run -p 5430:5432 --name jupy-old-postgres -e POSTGRES_PASSWORD=mysecretpassword -d postgres:9.6

9.6: Pulling from library/postgres
Digest: sha256:318757ed6291e6a1ef86312ac453b9b4a67b48495b59ca2dece909cb0c688c53
Status: Image is up to date for postgres:9.6
ea95e66ea557f1ad3eca203be4ca690ead9d2cd974dac76fe190902241f83305


In [38]:
!docker pull postgres:10
!docker run -p 5431:5432 --name jupy-new-postgres -e POSTGRES_PASSWORD=mysecretpassword -d postgres:10

10: Pulling from library/postgres
Digest: sha256:73a1c4e98fb961bb4a5c55ad6428470a3303bd3966abc442fe937814f6bbc002
Status: Image is up to date for postgres:10
36618fb47a3cc5a0646b32526799686808e50d51c251ad0179f02d998ac37c3e


## Connection to database(s)

### [jupyter sql magic](https://github.com/catherinedevlin/ipython-sql) used

In [5]:
%reload_ext sql
connection96="postgresql+psycopg2://postgres:mysecretpassword@localhost:5430/postgres"
connection10="postgresql+psycopg2://postgres:mysecretpassword@localhost:5431/postgres"

In [14]:
%%sql $connection96
select current_setting('server_version');

1 rows affected.


current_setting
9.6.5


In [21]:
%%sql $connection10
select current_setting('server_version')

1 rows affected.


current_setting
10.0


## Examples
<br/>

### All examples will be available in [another notebook](https://github.com/jakubwilkowski/pg10/blob/master/pg10_examples.ipynb) to fully leverage jupyter's capabilities

#  1. Identity columns

ID|datetime|user_id|amount|category_id
---|---|---|---|---
101|'2017-01-01'|123456|456|2
102|'2017-01-02'|123412|1000|4
...|...|...|...|...

```postgresql
INSERT INTO foo(
    id,
    datetime, user_id, amount, category_id
    )
SELECT
    (SELECT max(id) + 1 FROM foo),
    '2017-11-03', 11111, -230, 3;
```

## Example (pg10)


### Create table with identity column
<br />
```postgres
CREATE TABLE foo (
    id INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY, 
    val1 INTEGER);
    
INSERT INTO foo(val1) VALUES (1);
```

### Sequence restart
<br />
```postgres
ALTER TABLE foo ALTER COLUMN id RESTART WITH 1000;
INSERT INTO foo(val1) VALUES (2);
```

### Creating copy of the table
<br />
```postgres
CREATE TABLE bar (LIKE foo INCLUDING ALL);
INSERT INTO bar(val1) VALUES (3);
INSERT INTO foo(val1) VALUES (4);
```

### Querying foo & bar
<br />
```postgres
SELECT id, val1, 'foo' AS tbl FROM foo
UNION
SELECT id, val1, 'bar' AS tbl FROM bar
ORDER BY val1;
```
<br/>

id  | val1 | tbl
----|------|----
1   |1     |foo
1000|2     |foo
1   |3     |bar
1001|4     |foo

## Meanwhile in pg 9.6...
#### (The steps above were repeated using the old syntax)


### Querying foo & bar
<br/>
```postgres
SELECT id, val1, 'foo' as tbl FROM foo
UNION
SELECT id, val1, 'bar' as tbl FROM bar
ORDER BY val1;
```
<br/>

id  | val1 | tbl
----|------|----
1   |1     |foo
1000|2     |foo
1001|3     |bar
1002|4     |foo

```
                        Table "public.bar"
 Column |  Type   |                    Modifiers                     
--------+---------+-------------------------------------------------
 id     | integer | not null default nextval('foo_id_seq'::regclass)
 val1   | integer | 

Indexes:
    "bar_pkey" PRIMARY KEY, btree (id)
```
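
The `\d bar` output above hints at why `bar` behaves differently in the two versions: in 9.6 the copied table keeps drawing ids from `foo_id_seq`, while the pg10 identity column gets a sequence of its own. Below is a toy Python model of that difference. The `Sequence` class and `run_steps` helper are made up for illustration and say nothing about how Postgres implements sequences internally.

```python
class Sequence:
    """Toy stand-in for a Postgres sequence (illustration only)."""

    def __init__(self, start=1):
        self.next_value = start

    def nextval(self):
        # Hand out the current value and advance, like nextval().
        value = self.next_value
        self.next_value += 1
        return value

    def restart(self, value):
        # Mirrors ALTER ... RESTART WITH <value>.
        self.next_value = value


def run_steps(shared_sequence):
    """Replay the slide's steps; return (id, val1, tbl) rows in insert order."""
    foo_seq = Sequence()
    rows = [(foo_seq.nextval(), 1, "foo")]
    foo_seq.restart(1000)
    rows.append((foo_seq.nextval(), 2, "foo"))
    # Assumption: 9.6-style copy reuses foo's sequence, pg10 identity gets a fresh one.
    bar_seq = foo_seq if shared_sequence else Sequence()
    rows.append((bar_seq.nextval(), 3, "bar"))
    rows.append((foo_seq.nextval(), 4, "foo"))
    return rows


pg96_rows = run_steps(shared_sequence=True)
pg10_rows = run_steps(shared_sequence=False)
print(pg96_rows)
print(pg10_rows)
```

With a shared sequence, dropping `foo` takes the sequence (and `bar`'s default) with it, which is what the next slides run into.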

### Dropping?
<br/>
```postgres 
DROP TABLE foo;
ERROR:  cannot drop table foo because other objects depend on it
DETAIL:  default for table bar column id depends on sequence foo_id_seq
HINT:  Use DROP ... CASCADE to drop the dependent objects too.
```

```postgres
DROP TABLE foo CASCADE;
DROP TABLE
```

### Inserting again?
<br/>
```postgres 
INSERT INTO bar(val1) VALUES (5);
ERROR:  null value in column "id" violates not-null constraint
DETAIL:  Failing row contains (null, 5).
```

<div align="center"><iframe src="https://giphy.com/embed/8FK0n9SIlod7a" width="480" height="360" frameBorder="0" class="giphy-embed" allowFullScreen></iframe></div>

```
      Table "public.bar"
 Column |  Type   | Modifiers 
--------+---------+-----------
 id     | integer | not null
 val1   | integer | 
Indexes:
    "bar_pkey" PRIMARY KEY, btree (id)
```

# 2. Native partitioning

## What is partitioning for?
<br/>
 * Create another level of abstraction – we want to query only one (master) table
 * The data themselves should be dispatched to different child tables
 * We expect gains in terms of performance, especially reads

## How did it use to work?

 * Create a master table.
 * Create as many inherited tables with datetime constraints as needed.
 * Create indices, keys, and other constraints on child tables.
 * Create a trigger on the master table that will dispatch rows to proper child tables before insert.

## What's the current state?

 * Create a master table, **specify partitioning rule**.
 * Create as many ~~(inherited)~~ child tables with datetime constraints as needed.
 * Create indices, keys, and other constraints on child tables.
 * ~~Create a trigger on the master table that will dispatch rows to proper child tables before insert.~~

## Implementation
<br />
[How Do You Fight Smog with Machine Learning? We Tried, and This Is What Happened](https://10clouds.com/blog/machine-learning-smog-app/)

### 1. Create master table, specify partitioning rule
<br/>
```postgres
CREATE TABLE measurement(
  id INTEGER GENERATED ALWAYS AS IDENTITY,
  datetime TIMESTAMPTZ,
  site_id INTEGER,
  pollutant_id INTEGER,
  value FLOAT)
PARTITION BY RANGE (datetime);
```

### 2. Create a couple of child tables. Define data range limits they should store
<br/>
```postgres
CREATE TABLE measurement_201708
PARTITION OF measurement(datetime)
FOR VALUES FROM ('2017-08-01') TO ('2017-09-01');

CREATE TABLE measurement_201709
PARTITION OF measurement(datetime)
FOR VALUES FROM ('2017-09-01') TO ('2017-10-01');

CREATE TABLE measurement_201710
PARTITION OF measurement(datetime)
FOR VALUES FROM ('2017-10-01') TO ('2017-11-01');
```
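
To make the dispatching a bit more tangible, here is a rough Python sketch of what range routing boils down to for the three partitions above. It assumes the lower bound is inclusive and the upper bound exclusive, ignores time zones (naive datetimes), and the `route` helper is purely hypothetical. Postgres does this internally, not like this.

```python
from bisect import bisect_right
from datetime import datetime

# Partition boundaries taken from the CREATE TABLE statements above:
# each partition covers [BOUNDS[i], BOUNDS[i + 1]).
BOUNDS = [
    datetime(2017, 8, 1),
    datetime(2017, 9, 1),
    datetime(2017, 10, 1),
    datetime(2017, 11, 1),
]
PARTITIONS = ["measurement_201708", "measurement_201709", "measurement_201710"]


def route(ts):
    """Return the name of the child table that would store a row with this timestamp."""
    idx = bisect_right(BOUNDS, ts) - 1
    if idx < 0 or idx >= len(PARTITIONS):
        # No partition covers this value, so the row has nowhere to go.
        raise ValueError(f"no partition for {ts}")
    return PARTITIONS[idx]


print(route(datetime(2017, 9, 1)))
print(route(datetime(2017, 10, 15, 12, 30)))
```

A timestamp outside all ranges raises an error in this sketch, so partitions need to exist before the data for them arrives.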

### 3. Add all needed keys and indices, for each child table
<br/>
```postgres
ALTER TABLE measurement_201708 ADD PRIMARY KEY (id);
ALTER TABLE measurement_201708 ADD CONSTRAINT fk_measurement_201708_site 
  FOREIGN KEY (site_id) REFERENCES site(id);
CREATE INDEX idx_measurement_201708_datetime 
  ON measurement_201708(datetime);
```

## Messing with partitions

### Insert
<br/>
```postgres 
INSERT INTO measurement(
  datetime, 
  site_id, 
  pollutant_id, 
  value)
SELECT 
  '2017-08-01'::TIMESTAMPTZ + ((random()*90)::int) * INTERVAL '1 day',
  (1 + random()*(SELECT max(id)-1 FROM site))::int,
  (1 + random()*(SELECT max(id)-1 FROM pollutant))::int,
  random()
FROM generate_series(1,1000);
```

### Select
<br/>
```postgres
SELECT * FROM measurement 
WHERE datetime BETWEEN '2017-09-20' AND '2017-09-27';
```

### Explain
<br/>
```postgres
EXPLAIN SELECT * FROM measurement 
WHERE datetime BETWEEN '2017-09-20' AND '2017-09-27';
```
<br/>

QUERY PLAN|
---|
...|
Bitmap Index Scan on **idx_measurement_201709_datetime** (cost=0.00..4.22 rows=7 width=0)|


In Django, `inspectdb` can generate a model for the partitioned table, which can then be registered with a fake migration. A function that creates the next partition together with its keys and indexes can be triggered once a month.

Inserts should also be noticeably faster than with the old trigger-based approach.
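
A minimal sketch of such a partition-creating function, assuming the `measurement` and `site` tables from the earlier slides and the same naming scheme for keys and indexes. The `partition_ddl` helper is hypothetical; it only builds the SQL strings, and running them (e.g. from a monthly scheduled job) is left out.

```python
from datetime import date


def partition_ddl(year, month, parent="measurement"):
    """Build the DDL statements for one monthly partition of `parent`."""
    start = date(year, month, 1)
    # The upper bound is the first day of the following month.
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    child = f"{parent}_{start:%Y%m}"
    return [
        f"CREATE TABLE {child} PARTITION OF {parent} "
        f"FOR VALUES FROM ('{start}') TO ('{end}');",
        f"ALTER TABLE {child} ADD PRIMARY KEY (id);",
        f"ALTER TABLE {child} ADD CONSTRAINT fk_{child}_site "
        f"FOREIGN KEY (site_id) REFERENCES site(id);",
        f"CREATE INDEX idx_{child}_datetime ON {child}(datetime);",
    ]


for statement in partition_ddl(2017, 11):
    print(statement)
```

Generating the statements from one place also takes some of the sting out of the repeated per-partition commands.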

## Problems
<br/>
 * A lot of repeated commands
 * No unique keys across child tables

# 3. Multicolumn statistics
## aka correlated statistics

## How does postgres estimate number of returned rows?
<br/>
```postgres
SELECT * 
FROM SOME_TABLE 
WHERE col1 = cond1
  AND col2 = cond2
  AND col3 = cond3;
```

$$rows\,to\,retrieve=total\,number\,of\,rows *p_{predicate\,1}*p_{predicate\,2}*p_{predicate\,3}$$

<div align="center"><iframe src="https://giphy.com/embed/3o85xpYXnjNyfScn28" width="480" height="288" frameBorder="0" class="giphy-embed" allowFullScreen></iframe></div>

## Example

```postgres
CREATE TABLE counting_log (
  id INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY, 
  datetime TIMESTAMP WITH TIME ZONE, 
  child_id INTEGER, 
  word TEXT);
```
<br/>

id|datetime|child_id|word|
--|--------|--------|----
1|'2017-11-08 12:00:00'|123|'eeny'
2|'2017-11-08 15:30:00'|130|'meeny'

```postgres
INSERT INTO counting_log(
  datetime,
  child_id, 
  word)
SELECT 
  current_timestamp, 
  i%1000, 
  CASE WHEN i%4=1 THEN 'eeny' 
    WHEN i%4=2 THEN 'meeny' 
    WHEN i%4=3 THEN 'miny' 
    WHEN i%4=0 THEN 'moe' 
    ELSE 'nope' END 
FROM generate_series(1, 1000000) i;
```

```postgres
EXPLAIN SELECT datetime FROM counting_log WHERE child_id=123;
```
<br/>

QUERY PLAN|
-------------------|
Bitmap Heap Scan on counting_log  (cost=19.92..2688.01 **rows=967** width=8)|
Recheck Cond: (child_id = 123)|
->  Bitmap Index Scan on idx_counting_log_child_id  (cost=0.00..19.68 rows=967 width=0)|
Index Cond: (child_id = 123)|

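### Estimated 971 rows with child_id=123
#### (estimates come from sampled statistics, so they shift slightly between `ANALYZE` runs; the plan above is from a run that gave 967)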

```postgres
EXPLAIN SELECT datetime FROM counting_log WHERE word='miny';
```
<br/>

QUERY PLAN|
----------------|
 Seq Scan on counting_log  (cost=0.00..19643.00 **rows=249133** width=8)|
   Filter: (word = 'miny'::text)|

### Estimated 252867 rows with word='miny'
#### (again, the plan above is from a run that gave 249133)

## Let's do some math!

$$rows\,to\,retrieve=total\,number\,of\,rows *p_{predicate\,1}*p_{predicate\,2}\\
=1000000*\frac{971}{1000000}*\frac{252867}{1000000}\approx 245.534$$

```postgres
EXPLAIN SELECT datetime FROM counting_log WHERE word='miny' and child_id=123;
```
<br/>

QUERY PLAN|
----------|
Bitmap Heap Scan on counting_log (cost=19.77..2699.92 **rows=245** width=8)|
  Recheck Cond: (child_id = 123)|
  Filter: (word = 'miny'::text)|
  -> Bitmap Index Scan on idx_counting_log_child_id (cost=0.00..19.71 rows=971 width=0)|
        Index Cond: (child_id = 123)|

## The reality

```postgres
SELECT count(datetime) FROM counting_log WHERE word='miny' and child_id=123;
```
<br/>

 count |
-------|
  1000|

# <div style="text-align: center">245 != 1000</div>

## What if?
<br/><br/>
* We actually wanted to join another table to the results above (e.g. children info)?
* ... and because of such a big underestimation the query planner chose a nested loop instead of a hash join?
* ... and we had more tables to join, with **a lot** more data in them?

<div align="center"><iframe src="https://giphy.com/embed/687qS11pXwjCM" width="480" height="480" frameBorder="0" class="giphy-embed" allowFullScreen></iframe></div>

<img src="https://upload.wikimedia.org/wikipedia/en/thumb/8/82/Reddit_logo_and_wordmark.svg/1280px-Reddit_logo_and_wordmark.svg.png" style="background: #FFFFFF">

<img src="http://ww1.prweb.com/prfiles/2017/05/25/14370539/Hacker%20Noon%20-%20how%20hackers%20start%20their%20afternoon%20AMI%20David%20Smooke.jpg">

<img src="https://scontent-frx5-1.xx.fbcdn.net/v/t31.0-8/456226_10150559388837382_1784277255_o.jpg?oh=b089bd5375b2a9ce7ab5c58295595a03&oe=5AAEF011">

<div align="center"><iframe src="https://giphy.com/embed/hFmIU5GQF18Aw" width="343" height="480" frameBorder="0" class="giphy-embed" allowFullScreen></iframe></div>

## Postgres 10 to the rescue!

```postgres
CREATE STATISTICS st_counting_log_child_id_word 
  ON child_id, word FROM counting_log;
```

```postgres
ANALYZE counting_log;
EXPLAIN SELECT datetime FROM counting_log WHERE word='miny' and child_id=123;
```
<br/>

QUERY PLAN |
-----------|
 Bitmap Heap Scan on counting_log  (cost=19.93..2692.85 **rows=968** width=8)|
   Recheck Cond: (child_id = 123)|
   Filter: (word = 'miny'::text)|
   ->  Bitmap Index Scan on idx_counting_log_child_id  (cost=0.00..19.68 rows=968 width=0)|
         Index Cond: (child_id = 123)|

# <div style="text-align: center">968 &asymp; 1000</div>

### Let's look closer at our new statistics
<br/>
```postgres
SELECT stxname, stxkeys, stxkind, stxndistinct, stxdependencies 
  FROM pg_statistic_ext 
  WHERE stxname = 'st_counting_log_child_id_word';
```
<br/>

stxname            | stxkeys | stxkind |  stxndistinct  |   stxdependencies    
-------------------------------|---------|--------|----------------|----------------------
 st_counting_log_child_id_word | 3 4     | {d,f}   | {"3, 4": 1000} | {"3 => 4": 1.000000}
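
Those two values can be sanity-checked without a database. The sketch below regenerates the `(child_id, word)` pairs with the same rule as the `INSERT` above (the `WORDS` mapping and `generate_rows` helper are illustrative names) and counts distinct pairs and words per child. Reading columns 3 and 4 as `child_id` and `word` is an assumption based on the column order of `counting_log`.

```python
# Same mapping as the CASE expression in the INSERT statement.
WORDS = {1: "eeny", 2: "meeny", 3: "miny", 0: "moe"}


def generate_rows(n=1_000_000):
    """Yield (child_id, word) pairs following the slide's generation rule."""
    for i in range(1, n + 1):
        yield i % 1000, WORDS[i % 4]


words_per_child = {}
for child_id, word in generate_rows():
    words_per_child.setdefault(child_id, set()).add(word)

# Number of distinct (child_id, word) combinations.
distinct_pairs = sum(len(words) for words in words_per_child.values())
# If every child_id maps to exactly one word, child_id fully determines word.
max_words_per_child = max(len(words) for words in words_per_child.values())
# Rows that match both predicates from the example query.
matching = sum(1 for c, w in generate_rows() if c == 123 and w == "miny")

print(distinct_pairs, max_words_per_child, matching)
```

Because 1000 is divisible by 4, `i % 1000` already fixes `i % 4`, which is exactly the kind of dependency the planner cannot see without the extended statistics.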

# 4. More parallelism!

## Parallel queries in Postgres so far
<br/>
* Postgres 9.6:
  * Parallel Scans
    * Sequential scan only
  * Parallel Joins
    * Nested loop
    * Hash join
  * Parallel Aggregation
  

## With pg10 we also get:
<br/>
* Postgres 10:
  * Parallel Scans
    * sequential scan
    * **bitmap heap scan**
    * **index scan**
    * **index-only scan**
  * Parallel Joins
    * nested loop
    * hash join
    * **merge join**
  * Parallel Aggregation 

## New settings

### Minimum size of a table for which parallelism can be triggered
<br/>
```postgres
show min_parallel_table_scan_size;
```
<br/>

min_parallel_table_scan_size| 
--|
8MB|

### Minimum size of an index for which parallelism can be triggered
<br/>
```postgres
show min_parallel_index_scan_size;
```
<br/>

min_parallel_index_scan_size| 
--|
512kB|

### Maximum number of parallel workers to be used
<br/>
```postgres
show max_parallel_workers;
```
<br/>

max_parallel_workers| 
--|
8|

## Example

```postgres
CREATE TABLE trigonometry 
AS 
SELECT 
  i AS arg, 
  sin(i) AS sine, 
  cos(i) AS cosine, 
  tan(i) AS tangent 
FROM generate_series(0, 100000, 0.01) i;

CREATE INDEX idx_trigonometry_arg ON trigonometry(arg);
CREATE INDEX idx_trigonometry_sine ON trigonometry(sine);
CREATE INDEX idx_trigonometry_cosine ON trigonometry(cosine);
```
<br/>

  arg  |         sine         |        cosine        |       tangent        
-------|----------------------|----------------------|----------------------
     0 |                    0 |                    1 |                    0
  0.01 |  0.00999983333416666 |    0.999950000416665 |   0.0100003333466672


### Parallel aggregate (old stuff)

```postgres
EXPLAIN SELECT count(arg) FROM trigonometry WHERE arg > 50000;
```                                         
<br/>

QUERY PLAN|
--------------------------------------------------------|
 Finalize Aggregate  (cost=138908.78..138908.79 rows=1 width=8)|
   ->  Gather  (cost=138908.56..138908.77 rows=2 width=8)|
         Workers Planned: 2|
         ->  Partial Aggregate  (cost=137908.56..137908.57 rows=1 width=8)|
               ->  Parallel Seq Scan on trigonometry  (cost=0.00..134436.34 rows=1388889 width=32)|
                     Filter: (arg > '50000'::numeric)|


### Parallel index scan (new and shiny)

```postgres
EXPLAIN SELECT * FROM trigonometry WHERE arg > 50000;
```
<br/>

QUERY PLAN|
--------------------------------------------------|
 Index Scan using idx_trigonometry_arg on trigonometry  (cost=0.43..202722.77 rows=4988362 width=32)|
   Index Cond: (arg > '50000'::numeric)|

### Parallel index scan (2nd attempt)

```postgres
SET parallel_setup_cost=100;
EXPLAIN SELECT * FROM trigonometry WHERE arg > 50000;
```
<br/>

QUERY PLAN|
------------------------------|
 Index Scan using idx_trigonometry_arg on trigonometry  (cost=0.43..202722.77 rows=4988362 width=32)|
   Index Cond: (arg > '50000'::numeric)|


<div align="center"><iframe src="https://giphy.com/embed/Az1CJ2MEjmsp2" width="480" height="221" frameBorder="0" class="giphy-embed" allowFullScreen></iframe></div>

### Parallel index scan (one more try)

```postgres
SET parallel_setup_cost=1000;
EXPLAIN SELECT arg FROM trigonometry 
        WHERE sine > 0.999 AND arg >100 AND arg < 10000;
```
<br/>

QUERY PLAN|
----------------------------|
 **Gather  (cost=1000.43..40550.21 rows=13356 width=8)**|
   **Workers Planned: 2**|
   ->  **Parallel Index Scan** using idx_trigonometry_arg on trigonometry  (cost=0.43..38214.61 rows=5565 width=8)|
         Index Cond: ((arg > '100'::numeric) AND (arg < '10000'::numeric))|
         Filter: (sine > '0.999'::double precision)|

## Let's spread some chaos

<div align="center"><iframe src="https://giphy.com/embed/moiWSfviYKNgc" width="480" height="360" frameBorder="0" class="giphy-embed" allowFullScreen></iframe></div>

```postgres
SET max_parallel_workers=0;
SET force_parallel_mode=on;
EXPLAIN ANALYZE SELECT arg FROM trigonometry 
                  WHERE sine > 0.999 AND arg >100 AND arg < 10000;
```

QUERY PLAN|
------------------|
 Gather  (cost=1000.43..40550.21 rows=13356 width=8) (actual time=1.015..238.902 rows=14097 loops=1)|
   **Workers Planned: 2**|
   **Workers Launched: 0**|
   ...|

# 5. Full text search support for JSON & JSONB columns

## In previous releases full text search worked only on TEXT columns

## Example

```postgres
CREATE TABLE transactions(
  id INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY, 
  transaction_id VARCHAR(10), 
  user_id INTEGER, 
  created_datetime TIMESTAMP WITH TIME ZONE, 
  result BOOL, 
  amount INT ,
  response_data JSON);
```
<br/>

response_data  |
---------------|
 {"transaction":                                              
     {"id": "91a40daa7d",                                     
      "transaction_datetime": "2017-11-07 14:11:55.328364+00",
      "amount": 833,                                          
      "is_success": "false",                                  
      "message": "insufficient funds"}}

### Creating a GIN index
<br/>
```postgres
CREATE INDEX idx_transactions_response_data 
  ON transactions USING GIN (to_tsvector('english', response_data));
```
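
As far as I understand it, `to_tsvector` on a JSON document only looks at the string values, so keys and non-string values such as `amount` do not end up in the index. Here is a rough Python sketch of that extraction step for the sample document above. The `string_values` helper is hypothetical, and tokenizing and stemming are not modelled at all.

```python
import json


def string_values(node):
    """Yield every string value in a parsed JSON document, in document order."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        # Keys are skipped on purpose; only values are visited.
        for value in node.values():
            yield from string_values(value)
    elif isinstance(node, list):
        for item in node:
            yield from string_values(item)
    # Numbers, booleans and nulls are ignored.


doc = json.loads("""
{"transaction":
    {"id": "91a40daa7d",
     "transaction_datetime": "2017-11-07 14:11:55.328364+00",
     "amount": 833,
     "is_success": "false",
     "message": "insufficient funds"}}
""")

values = list(string_values(doc))
print(values)
```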

### Querying using TS

```postgres
SELECT transaction_id 
  FROM transactions 
  WHERE to_tsvector('english', response_data->'transaction'->'message')  
    @@ to_tsquery('english', 'insufficient') LIMIT 5;
```
<br/>

transaction_id |
----------------|
 91a40daa7d|
 4fc852f17b|
 6b32294ed4|
 a9ae624482|
 23eb7d83c1|

### to_tsquery
<br/>
 >insufficient -> insuffici
 
<br/>
 ```postgres
 SELECT transaction_id 
  FROM transactions
  WHERE to_tsvector('english', response_data->'transaction'->'message') 
    @@ 'insuffici' LIMIT 5;
 ```

### Explain
<br/>
```postgres
EXPLAIN SELECT transaction_id FROM transactions 
        WHERE to_tsvector('english', response_data) 
          @@ to_tsquery('english', 'insufficient') LIMIT 5;
```
<br/>

QUERY PLAN |
--------------|
 Limit  (cost=8.04..23.42 rows=5 width=11)|
   ->  Bitmap Heap Scan on transactions  (cost=8.04..23.42 rows=5 width=11)|
         Recheck Cond: (to_tsvector('english'::regconfig, response_data) @@ '''insuffici'''::tsquery)|
         ->  **Bitmap Index Scan on idx_transactions_response_data**  (cost=0.00..8.04 rows=5 width=0)|
               Index Cond: (to_tsvector('english'::regconfig, response_data) @@ '''insuffici'''::tsquery)|

# Wait, there's even more

* logical replication
* quorum commit for synchronous replication
* XML tables (`XMLTABLE`)
* SCRAM authentication
* renaming of some functions
* \if \elif \else statements in psql
* ...

# Wrapping it up

# Questions?

# Thank you!
<br/>
<div align="center"><iframe align="center" src="https://giphy.com/embed/c5iMjFfrUFpza" width="480" height="266" frameBorder="0" class="giphy-embed" allowFullScreen></iframe></div>