 # TP TUNING
 
 ## Prerequisites
 In this section, we will cover different aspects of the optimizer.
 
 Before going on, please connect to the database.
 

In [29]:
-- connection: host='localhost' dbname='ds2' user='ds2'

## <span style="color:blue">EX - 1</span>

 In this exercise, we will see what happens when you don't use an index or any other optimization such as partitioning.


 Postgres exposes column statistics through a catalog view called pg_stats. The underlying data is stored in the pg_statistic catalog, but that one is harder to read.<br/>
Click on <a href="https://www.postgresql.org/docs/11/view-pg-stats.html">pg_stats</a> to get a complete description.

In our first step, we will focus on a test table we will create.


In [36]:
SET max_parallel_workers_per_gather TO 0;
DROP TABLE IF EXISTS test;
CREATE TABLE test (i integer not null, t text);
explain (ANALYZE , TIMING ON )INSERT INTO test SELECT CASE WHEN i > 700000 THEN 700000 ELSE i/1000 END, md5(i::text) FROM generate_series(1, 1000000) i;

4 row(s) returned.


QUERY PLAN
Insert on test (cost=0.00..22.50 rows=1000 width=36) (actual time=1417.053..1417.053 rows=0 loops=1)
-> Function Scan on generate_series i (cost=0.00..22.50 rows=1000 width=36) (actual time=74.285..511.145 rows=1000000 loops=1)
Planning Time: 0.039 ms
Execution Time: 1418.554 ms


Note that our data was inserted in about 1.4 s (Execution Time: 1418.554 ms); the exact time depends on your computer.<br/>
To get the description of the table, run the following command:

In [37]:
SELECT *
FROM information_schema.columns
WHERE table_schema = 'ds2'
  AND table_name   = 'test';

2 row(s) returned.


table_catalog,table_schema,table_name,column_name,ordinal_position,column_default,is_nullable,data_type,character_maximum_length,character_octet_length,numeric_precision,numeric_precision_radix,numeric_scale,datetime_precision,interval_type,interval_precision,character_set_catalog,character_set_schema,character_set_name,collation_catalog,collation_schema,collation_name,domain_catalog,domain_schema,domain_name,udt_catalog,udt_schema,udt_name,scope_catalog,scope_schema,scope_name,maximum_cardinality,dtd_identifier,is_self_referencing,is_identity,identity_generation,identity_start,identity_increment,identity_maximum,identity_minimum,identity_cycle,is_generated,generation_expression,is_updatable
ds2,ds2,test,i,1,,NO,integer,,,32.0,2.0,0.0,,,,,,,,,,,,,ds2,pg_catalog,int4,,,,,1,NO,NO,,,,,,NO,NEVER,,YES
ds2,ds2,test,t,2,,YES,text,,1073741824.0,,,,,,,,,,,,,,,,ds2,pg_catalog,text,,,,,2,NO,NO,,,,,,NO,NEVER,,YES


To get the index information on the test table :

In [38]:
SELECT * from pg_indexes where  tablename='test';

0 row(s) returned.


To update statistics on our new test table, use the ANALYZE command :

In [39]:
ANALYZE test;

Let's have a look at the statistics of the test table:

In [40]:
select * from pg_stats where schemaname ='ds2' and tablename='test';


2 row(s) returned.


schemaname,tablename,attname,inherited,null_frac,avg_width,n_distinct,most_common_vals,most_common_freqs,histogram_bounds,correlation,most_common_elems,most_common_elem_freqs,elem_count_histogram
ds2,test,i,False,0,4,702,"{700000,509,576,546,29,55,116,204,306,597}","[0.297967, 0.00156667, 0.00156667, 0.0015, 0.00146667, 0.00146667, 0.00146667, 0.00146667, 0.00146667, 0.00146667]","{0,7,14,21,28,36,42,49,57,64,71,79,85,93,100,107,114,122,128,135,142,149,156,163,170,177,184,191,197,203,211,218,226,233,240,246,253,260,267,274,280,287,294,301,309,315,322,329,336,342,350,356,363,370,377,384,391,398,404,412,419,426,433,440,447,453,460,466,473,480,486,493,500,507,514,521,528,534,541,549,556,563,569,577,585,592,600,607,613,619,626,633,641,649,657,664,671,678,685,692,700}",1.0,,,
ds2,test,t,False,0,33,-1,,,"{00011fd5429cc89f957c34d212c36252,02b8218b311ddda3a774c36de620d95f,0559421fe048f9f6ed2f447465e0722b,07e524517a453884055bbd6032a08f3a,0a48aac296bf7d6a0a07a10b49bb3b71,0cb8c116516b7f2391c17cf0ab60dbce,0f566d6e331b34a0ce5874d738ad988c,11d88e3c1aecc77dcfff66e4dee8b7a3,14627f30de88bb291227a31c4c13f439,17109878b1d5e163dd35a26102eebb13,199a777aa2d5adb3a49b14c34a5506f0,1c2a6c86e32a5ab370433d2f4a463993,1ebac7cab305e276e1a581869fd7f6fe,2159edec712ef3331affcaf4dd5a5cfc,242d88300bb43bdf65686f300b5bc1c2,26ebe658c275aab8905d05580286d361,29681a3824efcfe2a2debd1b1ae45491,2bdcdbd9863bdf0a755491e015027ebb,2e577395f009a278eb3aaf5ff35537da,30ea3de8d7fc8ae70afe8ee506956683,3391bcde91286e5493d2f93c87665ee0,3644f3492bd7df2b92e9d31ba9ecf9c6,3863d6ec59f692f4fef00aa2ce1ae73e,3acb938057a800b4d83f8fd7005228b0,3d5e0fc9d50c27447c21c03185f4aff0,3fdb09320238e109ace6bd2e812aba71,427b6c3fa22a8b08981e0092c94f9c84,44ff361eb96953006ac4188e9c3ccc2a,4780f2d8c5b37cc95d353381a2b6c1e8,4a319c28b327bd859c7b9cb9bbff1f17,4cbf73044fa6661cd21923f0d0f43da3,4f76880ea31215348f339fe12e5cf295,5200812fff9954324faeae0919f4c85e,54ecd1b65bf8ad35021858062271013b,577472c54a6e3527ec5d06f7ee5fa650,59dfe77359ec74d4bbe81f06b7caef60,5ca3842ce5dbd1f19c9c6a1cdab60b68,5edca37083272210f9eb7d282ac158a4,616a492327460f972a14f5015dbc1dcc,63f4e1892e7560d3fe2f9b906815d838,6676f42ad40765db9a7832ee5e35f542,68ea2aeabee59a7db1ec407950784664,6b9a2ff002dee23ad2457f16331f9152,6e18b12a82c40871803b88fa42cc1a20,709fd8d46c1c7bcc1ddabda873e968dd,731d7f5490a6e7b524a9f2dba421edbf,75d9e38a4d78ed2dd7e1bb8cb4414a8f,782aa3af32886233b7cc07728d6b4303,7aa9e5057da3262d2808b8bf31aac707,7db1fb8a484fbfe185cc44c3048f2a8f,80321e5289efca1e78b44010d3607c9c,82b80df8e4df1330c0aa8506a5aa4525,853a5378660114be3f3520b79a024766,87bd028b9fa7cedea0649c2171954b15,8a347bc83565ef0dd3a626bee0dc850e,8cd7adacb562101c6f099cd219a4a832,8f8da19b39860ad047dc392f622a1094,9279cf0a498aaaf0597e097e56536d37,951124d4a093eeae83d9726a20295498,97bd1dc1795bcf7a222c862a
e3b35b34,9a56fc17905148c7a4f9bc01b866a09d,9cea3d3cb97d16530b6771b1aaf6ad82,9f84cfcee67ad94befdb8e71e442f80d,a20fceca4c94729f4a5db8f16f59b5bb,a48a66786b0ef1acb1012bc7c2fb3811,a729a21ff3c02a1c9a774f483f6dcc1a,a9c837026a5ab119d92076bfb58e77f5,ac5df0ebd85404b7e04d4160b0d69250,aecfdedf0df7ad70b1677aa26c27d277,b1062637c41eaf99c5a790ec8c0afbc6,b3af0f8b07961e3f420142ccd9c4fa87,b61158cfee1c2c7ed7066ebce620b36f,b90e873c33fb53270b093eb70994d002,bb9de0146ad1c2e14ed5bcf155d2bc39,be0da409e1e12fdad28d0e8cbea172d1,c0cde430463d529a7c517beea3a758e9,c325a1f2d1226ac996f297052c91a683,c585de0e5e67333eb1e2a8aa25096278,c8419e6ce51753e4d54c961b9baf3023,cad2dd4ac4d8d86663098a2843654559,cd39c9f46dd2cc6bb78d6037a0724af6,cfef0630b5462a56d1e533ffcd745019,d27c82fcb9760185d9aac0c382d30bbc,d51b99b58e8509f1378a4f2efa6e350e,d7c5bf2b14b310234ffe995160ee8c91,da3d08f40a6098687860d2eb54faee62,dcb6d0244774e4f3f044d6b5f3efe51f,df12e6b73f124c24dff6900bf4d06dc3,e1bcdd0c47356d12c6cb7be7a97aff88,e4468f72c7cedbf7424120defaa3e8e2,e6c1a72588295368cabb74c061b34b86,e954b76d2a11bf9c1ac1ed7ff7493cf4,ebd3938ff5a72cb62993325aa7a1f313,ee5fee2f92db3b87a87b70d60f4e07d1,f0abdb9e54dbbc7d6f556c153b813c95,f35f7f5774e48f997fc4b9e36beb761f,f5e146703585b5ec963d6783f8c884e4,f89393c73a4844d57324d6abd25b4721,fb3f3b6300f1cb0d388dcd87ccfb4804,fdba50eeefe3d6a98c8318b72d86fdb0,ffff1b67c9bf77f1dd4df612f6f3911d}",-0.000203858,,,


You can get a full field description here :
https://www.postgresql.org/docs/11/view-pg-stats.html<br/>

Great, our statistics are up-to-date.<br/>

Let's look at how the RDBMS estimates the cost of a SELECT.<br/>

Postgres defines a cost for each operation :
* cpu_tuple_cost : cost of processing each row during a query. The default is 0.01.
* cpu_operator_cost : cost of processing each operator or function executed during a query. The default is 0.0025.
* seq_page_cost :  cost of a disk page fetch that is part of a series of sequential fetches. The default is 1.0.<br/>

A complete description of the PostgreSQL planner settings is available here: https://www.postgresql.org/docs/11/runtime-config-query.html
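
The values actually in effect on your server can be read from the pg_settings view. A minimal sketch; your values may differ from the defaults listed above if the configuration was changed:

```sql
-- List the planner cost constants used in this exercise
SELECT name, setting
FROM pg_settings
WHERE name IN ('seq_page_cost', 'cpu_tuple_cost', 'cpu_operator_cost')
ORDER BY name;
```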

To estimate the cost, the optimizer will take into account : 
* system variables
* number of pages to analyze (relpages)
* number of rows to analyze (reltuples)

For example, when you run a full scan of your table via the following SQL query :

In [43]:
explain analyze select * from test;

3 row(s) returned.


QUERY PLAN
Seq Scan on test (cost=0.00..18334.00 rows=1000000 width=37) (actual time=0.009..167.979 rows=1000000 loops=1)
Planning Time: 0.027 ms
Execution Time: 259.614 ms


Even though the optimizer knows how the data is distributed, it has only one way to read it here: a sequential scan.<br/>
Since the table has no index, this method is used whatever the value of your predicate is.<br/>

The RDBMS has to read all table pages sequentially, and consequently all rows.<br/>
So the cost is based on:
* the cost of processing each row: the total number of rows * cpu_tuple_cost (0.01)
* the cost of fetching each disk page: the total number of pages * seq_page_cost (1.0)
* the cost of evaluating the predicate: the total number of rows * cpu_operator_cost (0.0025) for each operator in the WHERE clause; this term is zero here because the query has no predicate

If we add up these costs, we get the same estimate as the optimizer.

Estimates of the number of rows and pages are stored in the reltuples and relpages columns of the pg_class catalog.

In [44]:
select to_char(reltuples,'999 999 999 999') reltuples, relpages from pg_class where relname ='test';

1 row(s) returned.


reltuples,relpages
1 000 000,8334


To reproduce the estimated cost of "select * from test;", run the following query:

In [47]:
select relname,relpages * current_setting('seq_page_cost')::float + 
reltuples * current_setting('cpu_tuple_cost')::float "Estimated cost"
from pg_class where relname = 'test';

1 row(s) returned.


relname,Estimated cost
test,18334


So, according to our computation, reading the whole table costs 18334, which is the estimate we got from the EXPLAIN command.

If you look at the estimated cost of a query with a predicate, you will notice that it is higher than for our previous query:

In [48]:
explain analyze select * from test where i =56;

5 row(s) returned.


QUERY PLAN
Seq Scan on test (cost=0.00..20834.00 rows=995 width=37) (actual time=5.385..75.641 rows=1000 loops=1)
Filter: (i = 56)
Rows Removed by Filter: 999000
Planning Time: 0.351 ms
Execution Time: 75.738 ms


Could you explain why ?
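
One way to check your answer is to extend the cost computation with a per-row operator term. This is a sketch that assumes a sequential scan evaluating a single comparison operator per row, as in the filter (i = 56):

```sql
-- Sequential scan cost with one operator evaluated on every row
SELECT relname,
       relpages  * current_setting('seq_page_cost')::float
     + reltuples * current_setting('cpu_tuple_cost')::float
     + reltuples * current_setting('cpu_operator_cost')::float AS "Estimated cost with one filter"
FROM pg_class
WHERE relname = 'test';
```

With the statistics shown earlier (8334 pages, 1 000 000 rows) and the default settings, this is 8334 + 10000 + 2500 = 20834, the cost reported in the plan above.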

## <span style="color:blue">EX - 2</span>

In this exercise, we will see how PostgreSQL uses a single index.

As a first step, we create a test table with an index.

In [71]:
DROP TABLE IF EXISTS test_ex2;
CREATE TABLE test_ex2 (i integer not null, t text);
INSERT INTO test_ex2 SELECT i/1000 , md5(i::text) FROM generate_series(1, 1000000) i;
CREATE INDEX ON test_ex2 (i);

In order to get the description of the table, run the following command :

In [72]:
SELECT *
FROM information_schema.columns
WHERE table_schema = 'ds2'
  AND table_name   = 'test_ex2';

2 row(s) returned.


table_catalog,table_schema,table_name,column_name,ordinal_position,column_default,is_nullable,data_type,character_maximum_length,character_octet_length,numeric_precision,numeric_precision_radix,numeric_scale,datetime_precision,interval_type,interval_precision,character_set_catalog,character_set_schema,character_set_name,collation_catalog,collation_schema,collation_name,domain_catalog,domain_schema,domain_name,udt_catalog,udt_schema,udt_name,scope_catalog,scope_schema,scope_name,maximum_cardinality,dtd_identifier,is_self_referencing,is_identity,identity_generation,identity_start,identity_increment,identity_maximum,identity_minimum,identity_cycle,is_generated,generation_expression,is_updatable
ds2,ds2,test_ex2,i,1,,NO,integer,,,32.0,2.0,0.0,,,,,,,,,,,,,ds2,pg_catalog,int4,,,,,1,NO,NO,,,,,,NO,NEVER,,YES
ds2,ds2,test_ex2,t,2,,YES,text,,1073741824.0,,,,,,,,,,,,,,,,ds2,pg_catalog,text,,,,,2,NO,NO,,,,,,NO,NEVER,,YES


To get the index information on the test_ex2 table :

In [73]:
SELECT * from pg_indexes where  tablename='test_ex2';

1 row(s) returned.


schemaname,tablename,indexname,tablespace,indexdef
ds2,test_ex2,test_ex2_i_idx,,CREATE INDEX test_ex2_i_idx ON ds2.test_ex2 USING btree (i)


To update statistics on our new test table, use the ANALYZE command :

In [74]:
ANALYZE test_ex2;

For each object of the ds2 schema, we can see the statistics collected by autovacuum or by the ANALYZE command.
In this exercise, we limit the statistics to the test_ex2 table:

In [75]:
select * from pg_stats where schemaname ='ds2' and tablename='test_ex2';

2 row(s) returned.


schemaname,tablename,attname,inherited,null_frac,avg_width,n_distinct,most_common_vals,most_common_freqs,histogram_bounds,correlation,most_common_elems,most_common_elem_freqs,elem_count_histogram
ds2,test_ex2,i,False,0,4,1000,"{88,121,462,256,630,329,772,341,437,690,858}","[0.00166667, 0.00156667, 0.00156667, 0.00153333, 0.00153333, 0.0015, 0.0015, 0.00146667, 0.00146667, 0.00146667, 0.00146667]","{0,9,20,29,39,51,61,71,80,91,102,111,122,133,143,153,163,173,183,193,202,211,221,231,242,252,263,273,283,293,304,314,324,334,346,355,365,376,385,395,405,414,424,434,445,455,466,476,486,494,506,516,525,534,543,552,562,572,582,592,602,611,620,631,640,651,662,671,682,693,703,712,721,732,742,751,760,770,782,792,801,811,821,830,840,850,862,872,882,891,901,911,920,930,940,950,960,969,980,989,999}",1.0,,,
ds2,test_ex2,t,False,0,33,-1,,,"{0007168c1e1770d30a98ea7e360765d5,028d3195b14596277152579c1bae5352,0554ec66129b068ef7c80a0348340cbd,07ce5d3561f516f8689b70eff968ccb4,0a050629e2c0a2e6e41e138b77865af7,0c80e3d57a3432d485fea1011a81c8c3,0f1c381472b8afd30eef6d47700e7998,1196fa678dd3d887663bd9f1956b81aa,142bb2ae5a123688573bb04fbcd9dc9b,166f80f6cf8ef395010ef595a2315a5c,18f9c702993871054248bd4d7aa6a2cc,1badf207f2eff054598b02e78e0a6be2,1e18a0e3c0797db0375d3c766ac819ab,20952ab17d587d48ca96fae18c5df364,231a1c5436c4d9c432a9bc5324bd4844,25b32e400b6beadf075c093e420ac9ca,289a6beac1dd62628360c48c3f5cf94a,2b11ea160ec7c3cd10136ee17f6d9433,2d7b3ba90017562996dc6972bc0dc734,2fda4672e7844afb8ab8390295e869bc,324d5602b48cb31d156f6ef7d71a1a84,34bf77a88007769816c6b0726942c7e6,374fa5298650ef07e026041869767a5e,3a1341e6a28786bdbd8a2a9cba8baac0,3ca33d675f7f3c89cba02ca70418d600,3f20ad37c509e1dc2a7a12a0824636a0,41cfb05ae1c68f5227207e7a132595c7,44a216e7d9bae2a03d269497452c3547,47841769810b5f7d2ab6e66c84b7d9b0,4a064e446bc91651a64079fad6031f57,4c89af489433acbf68a37510a6718266,4f6e12bcaa78bf3c575fc96544f30700,520288bf2878bc2801da3d4cb4c41a1c,54ab849f3ade7b57acac9c4eb3ebeacf,57507e5ba69779d77e834a1a92494eb6,59bb0aea62b70ddc63832302636c713c,5c65d38bc61d0ab8c09ea47c756df900,5ede571a16d5fdae4b6a0826cc82436c,61701c46e652bbfcb74aaeb1ae6b88e2,6410bb923bcf940b7c57331f7b7db3c6,668f7137d32b5d3bf6167a4094495801,693b2c1ba87b96919678359abe66da1f,6bd9d12a9cf01904886e10eba8d2c2eb,6e3e0a5717b7f31aca1dbccdcb5e5dca,715543dd49d090906d2b2a328c6492e7,73958810c61e4d9037d33967542added,75edb935e65ca9a0214956a0be78917b,78789a0ba95cedc0fb3d32bb34a6482d,7b1aa388acdfc03a1155e488b45c4e01,7def3bfc41bf9c95a481b9c60365ae71,8088fa5c81aa5ce69a6d6798395c7ac0,8330f09177fd12af0aca71089532926a,85741b76407565b1ec275a79c7d8dae8,87e5be3ffcf76c392eb8a35d89d5d920,8a46c74a81c9c5bc97fe8a291551f3b3,8ca24c3e826218e67841e5f7b729818d,8f626b44a3e98f14696103fbd6136837,91d93ccd83ec1c202b2c2cc8b08509a9,94923763cab000d6c72a140cdf9ce576,9716982d803c8229316c
01da227ccd9e,99a090c72547fe93a3ee5f9ead67207a,9c09c2a76881271bb51058cb3e71ce07,9e7aa04abbdfeca3f582500213571d0c,a10a7690e51ec71d99a03b90ea873d39,a3653cf7e4123607aff15e05505d9b60,a5ea91aa366ad7d591a920c2b1b59d46,a851fa15a489c113d6b0c9971c254d71,aae87768dfaa9b4facbdbcb92531e59f,ad987255388f5f9b5aa8ff17125c0371,b01a27f15881937e4ec3f5a53836f942,b2e17f763ad51f7e065935ba25c88b5c,b59cf327b93e64a8e6b293f45d3ff5c7,b83a5b000ed68d14e210168d5038b093,bac26283fbe9f5947628e197830336da,bd0cc7a543e3ae1b20ce0607a3540954,bf8401bb24eefda7adbfc5e48cb125a1,c21721f351b4c39bf61f7d9a987615e5,c4bb75db9f42bfc6d0e33c40075c9d3f,c759e3c8416c367ae0e4c28be1184e48,ca1cbbc532426702fa4a69468af1bb59,cc8e05b9bde0f802f8af10273bb89d5c,cf261d964e08e799ccfd8f3e48c747cf,d1ceb8907ca026bb7d653f1eeeef1043,d4731ba3996611dc513541000dc9186e,d728c0034dfca71f9b3efefdb8405971,d9bf83e047eaae238e34b0cff46cf738,dc30b3a5d3ae42ad70e63bea37c144a1,dee24b8e949ec6e4f106ab8ad7415555,e14ca0a88717906cb42c7262d23c0c93,e3d9548296ce2c267c27aa8a06457d84,e6546d296c00155a02f48bdf1040a9a0,e8ee6a7ab30903dfada1e3290bdd4de8,ebbd743b5209be066e6db702d9afc560,ee13f7e9c8cd39408076113a7e42c1ef,f0be7122f61dab38f7ed9015f1bc6235,f3571090e4da52fc2c86ef210ef1bd5a,f5e288cbce5a1df3e1acdc02d2fd31d3,f8684263662112a24159baf79930eba7,fb050706bad22f149a3a005203b0208b,fd61a51457dab424e687eb5f9b116545,fffbc65cff7705cc5bca00328d92c004}",-0.00639372,,,


Now, let's compute the selectivity on the i column.<br/>
The RDBMS looks at the data distribution through the most_common_vals and most_common_freqs columns.<br/>
most_common_freqs gives the selectivities and most_common_vals gives the value associated with each selectivity.

In [84]:
SELECT tablename, attname, value, freq selectivity, freq * 1000000 cardinality 
FROM (SELECT tablename, attname, mcv.value, mcv.freq FROM pg_stats,
LATERAL ROWS FROM (unnest(most_common_vals::text::int[]), unnest(most_common_freqs)) AS mcv(value, freq)
WHERE tablename = 'test_ex2'
AND attname = 'i') get_mcv order by freq DESC limit 5 ;


5 row(s) returned.


tablename,attname,value,selectivity,cardinality
test_ex2,i,307,0.00103556,1035.56
test_ex2,i,3,0.00103111,1031.11
test_ex2,i,164,0.00103,1030.0
test_ex2,i,816,0.00103,1030.0
test_ex2,i,524,0.00102889,1028.89


According to the result, we identified the selectivity of each listed i value, but is it relevant?<br/>
Of course, that depends on the predicate we use in our query.

Let's take an example with the predicate "i = x" where x is a value from the result of the previous query.<br/>
<span style="color:red">
Watch out, statistics are based on a sample so the selectivity can differ if you rerun an ANALYZE command.</span>

So, the optimizer should estimate the number of rows as follows: <br/>
Selectivity * 1 000 000
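
The same computation can be written against the planner's own row count instead of the hard-coded 1 000 000. This is a sketch that assumes the value you pick appears in most_common_vals; the aliases s, c and mcv are only illustrative:

```sql
-- Estimated rows = MCV frequency * reltuples, for the most frequent values of i
SELECT mcv.value,
       mcv.freq                      AS selectivity,
       round(mcv.freq * c.reltuples) AS estimated_rows
FROM pg_stats s
JOIN pg_class c
  ON c.oid = (quote_ident(s.schemaname) || '.' || quote_ident(s.tablename))::regclass
CROSS JOIN LATERAL ROWS FROM (
       unnest(s.most_common_vals::text::int[]),
       unnest(s.most_common_freqs)
     ) AS mcv(value, freq)
WHERE s.tablename = 'test_ex2'
  AND s.attname = 'i'
ORDER BY mcv.freq DESC
LIMIT 5;
```

Reading reltuples from pg_class keeps the estimate in line with what the planner uses, even if the table size changes between two ANALYZE runs.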

Now let us check what the optimizer will estimate : 
Replace the x variable  with your value.

In [78]:
explain (ANALYZE, BUFFERS, TIMING ON, VERBOSE) select * from  test_ex2 where i = x; -- CHANGE X

6 row(s) returned.


QUERY PLAN
Index Scan using test_ex2_i_idx on ds2.test_ex2 (cost=0.42..62.67 rows=1500 width=37) (actual time=0.033..0.206 rows=1000 loops=1)
"Output: i, t"
Index Cond: (test_ex2.i = 329)
Buffers: shared hit=10 read=5
Planning Time: 0.110 ms
Execution Time: 0.324 ms



From the first row of the plan, our computed cardinality matches the optimizer's estimate.<br/>

You can compute the estimate error with the following formula:

$$\text{error} = \frac{|\text{actual rows} - \text{estimated rows}|}{\text{actual rows}}$$
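
As a worked example, the plan shown above estimates 1500 rows and returns 1000 actual rows. Your own numbers will differ, since they depend on the sample:

```sql
-- |actual rows - estimated rows| / actual rows, with the values from the plan above
SELECT abs(1000 - 1500)::float / 1000 AS estimate_error;
```

Here the error is 0.5, meaning the estimate is off by 50% of the actual row count.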

Depending on the sample, the error may be larger or smaller.<br/>
Don't forget that sampling is random and may not represent your data correctly.<br/>
<br/>
By default, Postgres samples 300 * default_statistics_target rows, where default_statistics_target = 100, i.e. 30 000 rows.<br/>
This parameter applies to every table, but you can override it per column with ALTER TABLE ... SET STATISTICS.<br/>

To improve the selectivity estimate, we can increase the sample size by raising the statistics target of column i to 10000, so that the sample covers the whole dataset.


In [95]:
ALTER table test_ex2 ALTER COLUMN i SET STATISTICS 10000;

In [96]:
ANALYZE test_ex2;
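
To check which columns carry a per-column statistics target, one option is to read attstattarget from pg_attribute. This is a sketch assuming PostgreSQL 11, as in the documentation linked in this lab, where -1 means the column falls back to default_statistics_target:

```sql
-- Per-column statistics targets of test_ex2
SELECT attname, attstattarget
FROM pg_attribute
WHERE attrelid = 'test_ex2'::regclass
  AND attnum > 0          -- skip system columns
  AND NOT attisdropped;
```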

Now, can you recompute the new selectivity and cardinality?


In [98]:
SELECT tablename, attname, value, freq selectivity, freq * 1000000 cardinality 
FROM (SELECT tablename, attname, mcv.value, mcv.freq FROM pg_stats,
LATERAL ROWS FROM (unnest(most_common_vals::text::int[]), unnest(most_common_freqs)) AS mcv(value, freq)
WHERE tablename = 'test_ex2'
AND attname = 'i') get_mcv order by freq DESC limit 5 ;

5 row(s) returned.


tablename,attname,value,selectivity,cardinality
test_ex2,i,2,0.001,1000
test_ex2,i,3,0.001,1000
test_ex2,i,4,0.001,1000
test_ex2,i,5,0.001,1000
test_ex2,i,1,0.001,1000


In [99]:
explain (ANALYZE, BUFFERS, TIMING ON, VERBOSE) select * from  test_ex2 where i = x ; -- CHANGE X

6 row(s) returned.


QUERY PLAN
Index Scan using test_ex2_i_idx on ds2.test_ex2 (cost=0.42..41.92 rows=1000 width=37) (actual time=0.010..0.390 rows=1000 loops=1)
"Output: i, t"
Index Cond: (test_ex2.i = 307)
Buffers: shared hit=15
Planning Time: 0.114 ms
Execution Time: 0.488 ms


Can you compute the new error? What do you think?

Perfect, our estimate now equals the actual number of rows.<br/>
The estimate error is now 0, but at what cost!
Indeed, the larger the sample is, the longer the ANALYZE process takes, but in return the quality of the planner's estimates improves.<br/>
Finally, the right question is: which default_statistics_target value represents the data well enough?



From the statistics, the optimizer is able to estimate how many rows a "=" predicate will return, and can consequently choose a suitable access path.
<br/>In our use case, the predicate is highly selective (selectivity close to 0), so the best way to collect the result is to go through the index.<br/>

## <span style="color:blue">EX - 3</span>

In this exercise, we will see how the histogram is used.<br/>
The histogram is used to estimate the selectivity of range predicates over the values that are not listed in most_common_vals.<br/>
Indeed, when a column has many distinct values, listing all of them in most_common_vals and most_common_freqs would take too much space, so the histogram is a good alternative to describe the data distribution.<br/>
First, create a new test table with an index.

In [196]:
DROP TABLE IF EXISTS testhistogram;
CREATE TABLE testhistogram (i integer not null, t text);
INSERT INTO testhistogram SELECT i, md5(i::text) FROM generate_series(1, 1000) i;
INSERT INTO testhistogram SELECT i, md5(i::text) FROM generate_series(1, 1000) i;
INSERT INTO testhistogram SELECT i, md5(i::text) FROM generate_series(1, 1000) i;
INSERT INTO testhistogram SELECT i, md5(i::text) FROM generate_series(1, 1000) i;
INSERT INTO testhistogram SELECT i, md5(i::text) FROM generate_series(1, 1000) i;
INSERT INTO testhistogram SELECT i, md5(i::text) FROM generate_series(1, 1000) i;
INSERT INTO testhistogram SELECT i, md5(i::text) FROM generate_series(1, 1000) i;
INSERT INTO testhistogram SELECT i, md5(i::text) FROM generate_series(1, 1000) i;
INSERT INTO testhistogram SELECT i, md5(i::text) FROM generate_series(1, 1000) i;
INSERT INTO testhistogram SELECT i, md5(i::text) FROM generate_series(1, 1000) i;
CREATE INDEX ON testhistogram (i);

The number of buckets is derived from the default_statistics_target variable.<br/>
To customize the number of buckets for a column, run the following command:

In [197]:
ALTER table testhistogram ALTER COLUMN i SET STATISTICS 10;

We inserted 10 000 rows with 1 000 distinct values into the test table, so with 10 buckets each bucket should span about 100 distinct values.<br/>
Since every value appears 10 times, each bucket should cover about 1 000 rows.

In [207]:
analyze testhistogram;
select histogram_bounds,most_common_vals, most_common_freqs from pg_stats where schemaname ='ds2' and tablename='testhistogram' and attname='i' ;

1 row(s) returned.


histogram_bounds,most_common_vals,most_common_freqs
"{11,109,208,307,406,505,604,703,802,901,1000}","{1,2,3,4,5,6,7,8,9,10}","[0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001]"


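In my demonstration I got the following result:<br/>
{11,109,208,307,406,505,604,703,802,901,1000}.<br/>
The values 1 to 10 are stored in most_common_vals, so the histogram starts at 11: the first bucket covers the values between 11 and 109, the second the values between 109 and 208, and so on, i.e. about 99 distinct values per bucket.<br/>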

Could you identify your bucket structure ?
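
To see what each bucket really contains, one option is to group the rows with width_bucket. This sketch uses the bounds from my demonstration, so replace the array with your own histogram_bounds; bucket 0 collects the values below the first bound (here the most common values 1 to 10) and bucket 11 the values equal to or above the last bound:

```sql
-- Count rows and distinct values per histogram bucket
SELECT width_bucket(i, ARRAY[11,109,208,307,406,505,604,703,802,901,1000]) AS bucket,
       min(i)            AS lowest_value,
       max(i)            AS highest_value,
       count(DISTINCT i) AS distinct_values,
       count(*)          AS row_count
FROM testhistogram
GROUP BY 1
ORDER BY 1;
```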


From the histogram, we want to determine the selectivity of a query whose predicate is "i < 100".<br/>
The official documentation details how Postgres computes it: <br/>
https://www.postgresql.org/docs/11/row-estimation-examples.html<br/>

The value 100 falls in the first bucket (11 to 109), so no bucket lies entirely below it.<br/>
In my case, the histogram part of the computation is the following:<br/>
selectivity = (full_buckets + (100 - bucket[1].min)/(bucket[1].max - bucket[1].min))/num_buckets<br/>
selectivity = (0 + (100 - 11)::numeric/(109 - 11))/10<br/>
selectivity ≈ 0.0908<br/>
Note the cast: without it, the division is an integer division and the fraction is truncated to 0.<br/>



In [215]:
select round((0 + (100 - 11)::numeric/(109 - 11))/10, 4) Selectivity, round(((0 + (100 - 11)::numeric/(109 - 11))/10) * 10000) Cardinality;

1 row(s) returned.


selectivity,cardinality
0.0908,908


From the histogram alone, I get a cardinality of about 908 for the predicate i < 100.<br/>
Now, let's compare it with the optimizer's cardinality:

In [216]:
EXPLAIN (analyze)SELECT * FROM testhistogram WHERE  i < 100;

7 row(s) returned.


QUERY PLAN
Bitmap Heap Scan on testhistogram (cost=20.02..116.49 rows=998 width=37) (actual time=0.044..0.270 rows=990 loops=1)
Recheck Cond: (i < 100)
Heap Blocks: exact=16
-> Bitmap Index Scan on testhistogram_i_idx (cost=0.00..19.77 rows=998 width=0) (actual time=0.036..0.036 rows=990 loops=1)
Index Cond: (i < 100)
Planning Time: 0.076 ms
Execution Time: 0.396 ms


The optimizer estimates 998 rows, which is higher than my histogram-only estimate (908) and much closer to the actual value (990).<br/>
The difference comes from the most common values: the values 1 to 10 are stored in most_common_vals rather than in the histogram, and they also satisfy the predicate.<br/>
The optimizer therefore adds their frequencies to the histogram estimate, as described in the documentation linked above.
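
The row-estimation examples in the PostgreSQL documentation combine the most-common-values list with the histogram for range predicates. Below is a sketch of that calculation for i < 100, assuming the statistics shown above (values 1 to 10 in most_common_vals with frequency 0.001 each, first bucket from 11 to 109, 10 buckets, no NULLs). The names mcv_part and hist_part are only illustrative:

```sql
-- selectivity = MCV frequencies below 100 + histogram fraction scaled to the non-MCV rows
SELECT round(mcv_part + (1 - mcv_part) * hist_part, 4)        AS selectivity,
       round((mcv_part + (1 - mcv_part) * hist_part) * 10000) AS estimated_rows
FROM (SELECT 10 * 0.001                              AS mcv_part,
             ((100 - 11)::numeric / (109 - 11)) / 10 AS hist_part) AS s;
```

This gives a selectivity of about 0.0999, i.e. roughly 999 rows, within one row of the 998 reported by the planner. Since the statistics are sampled, your figures may differ slightly.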

------------------------------------------------------------------------------------------------------------------------------

## <span style="color:blue">EX - 4</span>

In this exercise, we will see how to influence the optimizer.

Create a new test table as indicated below :


In [223]:
DROP TABLE IF EXISTS test_ex2;
CREATE TABLE test_ex2 (i integer not null, t text);
INSERT INTO test_ex2 SELECT i , md5(i::text) FROM generate_series(1, 1000000) i;
CREATE INDEX ON test_ex2 (i);
ANALYZE test_ex2;

In [224]:
SET enable_bitmapscan=on;
SET enable_indexscan=on; 
SET enable_seqscan=on;

Now run a query with the predicate "i = 100":

In [226]:
EXPLAIN (analyze)SELECT * FROM test_ex2 WHERE  i = 100;

4 row(s) returned.


QUERY PLAN
Index Scan using test_ex2_i_idx on test_ex2 (cost=0.42..8.44 rows=1 width=37) (actual time=0.021..0.022 rows=1 loops=1)
Index Cond: (i = 100)
Planning Time: 0.224 ms
Execution Time: 0.042 ms


What is the cost of the query ? <br/>
Do you agree with the optimizer's choice ? <br/>

We are going to force the optimizer to avoid using the index scan method :


In [227]:
SET enable_indexscan=off; 

Rerun the query:

In [228]:
EXPLAIN (analyze)SELECT * FROM test_ex2 WHERE  i = 100;

7 row(s) returned.


QUERY PLAN
Bitmap Heap Scan on test_ex2 (cost=4.43..8.45 rows=1 width=37) (actual time=0.018..0.018 rows=1 loops=1)
Recheck Cond: (i = 100)
Heap Blocks: exact=1
-> Bitmap Index Scan on test_ex2_i_idx (cost=0.00..4.43 rows=1 width=0) (actual time=0.014..0.014 rows=1 loops=1)
Index Cond: (i = 100)
Planning Time: 0.072 ms
Execution Time: 0.040 ms


What do you notice?<br/>
Can you compare the costs of the index scan and the bitmap index scan?

Now, force the optimizer to use another way to scan data :

In [229]:
SET enable_bitmapscan=off;

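With index scans and bitmap scans both disabled, the only access path left is a sequential scan. To see what it costs, run the following query on the test table from EX - 1, which has no index and the same number of rows as test_ex2: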

In [231]:
EXPLAIN (analyze)SELECT * FROM test WHERE  i < 5500;

5 row(s) returned.


QUERY PLAN
Seq Scan on test (cost=0.00..20834.00 rows=701964 width=37) (actual time=0.031..112.405 rows=700000 loops=1)
Filter: (i < 5500)
Rows Removed by Filter: 300000
Planning Time: 0.086 ms
Execution Time: 158.067 ms


Compare all the costs: is the optimizer's first choice justified?

Before moving to the next exercise, re-enable all scan methods:

In [232]:
SET enable_bitmapscan=on;
SET enable_indexscan=on; 
SET enable_seqscan=on;
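
If you want to avoid forgetting to re-enable a scan method, one option is to scope the change to a transaction with SET LOCAL. This is a sketch that assumes your client sends the statements in a single session and does not wrap each one in its own transaction:

```sql
BEGIN;
SET LOCAL enable_indexscan = off;   -- reverts automatically at COMMIT or ROLLBACK
EXPLAIN SELECT * FROM test_ex2 WHERE i = 100;
ROLLBACK;
```

After the ROLLBACK, enable_indexscan is back to its previous value without any explicit SET.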

## <span style="color:blue">EX - 5</span>

In this exercise, we will cover multi-column indexes.

In [233]:
DROP TABLE IF EXISTS testmultiindex;
CREATE TABLE testmultiindex (i integer not null, j integer not null, t text);
INSERT INTO testmultiindex SELECT i,j ,md5(i::text) FROM generate_series(1, 1000) i,generate_series(1, 1000) j;
CREATE INDEX ON testmultiindex (i, j);

Your application frequently runs the following query:<br/>
SELECT * from testmultiindex where j = x;  -- where x is any integer<br/>
You notice that the query has become slow. Can you provide the execution plan?<br/>


In [234]:
explain analyze SELECT * from testmultiindex where j = 34;

5 row(s) returned.


QUERY PLAN
Seq Scan on testmultiindex (cost=0.00..21846.00 rows=5000 width=40) (actual time=0.012..94.421 rows=1000 loops=1)
Filter: (j = 34)
Rows Removed by Filter: 999000
Planning Time: 0.123 ms
Execution Time: 94.542 ms


What do you advise to improve this query? Justify your solution.

The cost is very high because the data is read with a sequential scan, which means a lot of reads: the index on (i, j) is of little use for a predicate on j alone, since j is not its leading column. To minimize the cost, I suggest dropping the index on testmultiindex (i, j) and creating a new one on j only. Don't forget that every index has to be maintained on each write, which hurts write performance: it is recommended to drop any useless index.
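
To judge whether an index is useless, one possible check is how often it has been scanned and how much space it takes. This sketch relies on the cumulative counters of pg_stat_user_indexes, which are only meaningful after the workload has run for a while:

```sql
-- Usage count and size of each index on testmultiindex
SELECT indexrelname,
       idx_scan,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE relname = 'testmultiindex';
```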

In [235]:
SELECT * from pg_indexes where  tablename='testmultiindex';

1 row(s) returned.


schemaname,tablename,indexname,tablespace,indexdef
ds2,testmultiindex,testmultiindex_i_j_idx,,"CREATE INDEX testmultiindex_i_j_idx ON ds2.testmultiindex USING btree (i, j)"


In [236]:
drop index testmultiindex_i_j_idx;
create index test_j on testmultiindex(j);

Now your application needs to run 2 new queries :<br/>
SELECT * FROM testmultiindex where i = x AND j = y; -- where x and y are any integer <br/>
SELECT * FROM testmultiindex where i = x OR j = y;<br/>
Do you think the execution plan is still optimal ?<br/>


In [237]:
explain analyze SELECT * FROM testmultiindex where i = 4 AND j = 234;

9 row(s) returned.


QUERY PLAN
Bitmap Heap Scan on testmultiindex (cost=19.89..2913.33 rows=1 width=41) (actual time=0.528..2.237 rows=1 loops=1)
Recheck Cond: (j = 234)
Filter: (i = 4)
Rows Removed by Filter: 999
Heap Blocks: exact=1000
-> Bitmap Index Scan on test_j (cost=0.00..19.89 rows=995 width=0) (actual time=0.342..0.342 rows=1000 loops=1)
Index Cond: (j = 234)
Planning Time: 0.374 ms
Execution Time: 2.271 ms


In [238]:
explain analyze SELECT * FROM testmultiindex where i = 4 OR j = 234;

5 row(s) returned.


QUERY PLAN
Seq Scan on testmultiindex (cost=0.00..24346.00 rows=1989 width=41) (actual time=0.042..82.243 rows=1999 loops=1)
Filter: ((i = 4) OR (j = 234))
Rows Removed by Filter: 998001
Planning Time: 0.053 ms
Execution Time: 82.443 ms


For both queries, the execution plan shows a high cost for filtering on the i column, which has no index.

What do you suggest?

I would suggest creating an index on i in addition to the one on j. A multi-column index on (i, j) would not be a good fit because it would not help with the predicate i = x OR j = y.

In [None]:
CREATE INDEX test_idx_i ON testmultiindex (i);
-- the index on j (test_j) was already created above
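
To verify the effect, rerun both plans once single-column indexes exist on i and on j. This is only a sketch: the planner is able to combine two indexes with a BitmapAnd or BitmapOr node, but whether it does so here depends on your statistics and cost settings:

```sql
ANALYZE testmultiindex;
EXPLAIN SELECT * FROM testmultiindex WHERE i = 4 AND j = 234;
EXPLAIN SELECT * FROM testmultiindex WHERE i = 4 OR j = 234;
```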