# Query Performance Tuning

In this section we will go through basic techniques for tuning the performance of queries.

* Preparing Database
* Interpreting Explain Plans
* Overview of Cost Based Optimizer
* Performance Tuning using Indexes
* Criteria for Indexing
* Criteria for Partitioning
* Writing Queries – Partition Pruning
* Overview of Query Hints

## Preparing Database

Let us prepare the retail tables that are used by the queries in this section.
* Ensure that we have the required database and user for the retail data. The database might already be provided as part of our labs.

```shell
psql -U postgres -h localhost -p 5432 -W
```

```sql
CREATE DATABASE itversity_retail_db;
CREATE USER itversity_retail_user WITH ENCRYPTED PASSWORD 'retail_password';
GRANT ALL ON DATABASE itversity_retail_db TO itversity_retail_user;
```

* Create the tables using the script provided. You can use either `psql` or **SQLAlchemy**.

```shell
psql -U itversity_retail_user \
  -h localhost \
  -p 5432 \
  -d itversity_retail_db \
  -W

\i /data/retail_db/create_db_tables_pg.sql
```

* Load the data using the script provided.

```shell
\i /data/retail_db/load_db_tables_pg.sql
```

* Run queries to validate that we have data in all 6 tables.

In [None]:
%load_ext sql

In [None]:
%env DATABASE_URL=postgresql://itversity_retail_user:retail_password@localhost:5432/itversity_retail_db

In [None]:
%sql SELECT * FROM departments LIMIT 10

In [None]:
%sql SELECT * FROM categories LIMIT 10

In [None]:
%sql SELECT * FROM products LIMIT 10

In [None]:
%sql SELECT * FROM orders LIMIT 10

In [None]:
%sql SELECT * FROM order_items LIMIT 10

In [None]:
%sql SELECT * FROM customers LIMIT 10

## Interpreting Explain Plans

Let us review the explain plans below and understand the key terms that help us interpret them.
* Seq Scan
* Index Scan
* Nested Loop

Here are the explain plans for different queries.
* Explain plan for query to get number of orders.

```sql
EXPLAIN
SELECT count(1) FROM orders;
```

```text
                            QUERY PLAN
-------------------------------------------------------------------
 Aggregate  (cost=1386.04..1386.05 rows=1 width=8)
   ->  Seq Scan on orders  (cost=0.00..1213.83 rows=68883 width=0)
(2 rows)
```

* Explain plan for query to get number of orders by date.

```sql
EXPLAIN
SELECT order_date, count(1) AS order_count
FROM orders
GROUP BY order_date;
```

```text
                            QUERY PLAN
-------------------------------------------------------------------
 HashAggregate  (cost=1558.24..1561.88 rows=364 width=16)
   Group Key: order_date
   ->  Seq Scan on orders  (cost=0.00..1213.83 rows=68883 width=8)
(3 rows)
```

* Explain plan for query to get order details for a given order id.

```sql
EXPLAIN
SELECT * FROM orders
WHERE order_id = 2;
```

```text
                                QUERY PLAN
---------------------------------------------------------------------------
 Index Scan using orders_pkey on orders  (cost=0.29..8.31 rows=1 width=26)
   Index Cond: (order_id = 2)
(2 rows)
```

* Explain plan for query to get order and order item details for a given order id.

```sql
EXPLAIN
SELECT o.*,
    oi.order_item_subtotal
FROM orders o JOIN order_items oi
    ON o.order_id = oi.order_item_order_id
WHERE o.order_id = 2;
```

```text
                                    QUERY PLAN
-----------------------------------------------------------------------------------
 Nested Loop  (cost=0.29..3427.82 rows=4 width=34)
   ->  Index Scan using orders_pkey on orders o  (cost=0.29..8.31 rows=1 width=26)
         Index Cond: (order_id = 2)
   ->  Seq Scan on order_items oi  (cost=0.00..3419.47 rows=4 width=12)
         Filter: (order_item_order_id = 2)
(5 rows)
```

```{note}
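We should understand the order in which query plans are interpreted. A plan is a tree: read it from the innermost (most indented) nodes outwards, as each child node passes its rows to the parent node above it.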
```
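To make the reading order concrete, here is a minimal Python sketch that parses the text of the plan above into its nodes. The regular expression, the `parse_plan` helper and the `depth` field are illustrative assumptions rather than part of Postgres or any client library; only lines carrying a `(cost=...)` annotation are treated as plan nodes.

```python
import re

# Plan text copied from the EXPLAIN output above.
PLAN = """\
 Nested Loop  (cost=0.29..3427.82 rows=4 width=34)
   ->  Index Scan using orders_pkey on orders o  (cost=0.29..8.31 rows=1 width=26)
         Index Cond: (order_id = 2)
   ->  Seq Scan on order_items oi  (cost=0.00..3419.47 rows=4 width=12)
         Filter: (order_item_order_id = 2)
"""

# Hypothetical pattern: optional indentation and arrow, node name, then costs.
NODE_PATTERN = re.compile(
    r"^(?P<indent>\s*)(?:->\s+)?(?P<node>.+?)\s+"
    r"\(cost=(?P<startup>\d+\.\d+)\.\.(?P<total>\d+\.\d+) "
    r"rows=(?P<rows>\d+) width=(?P<width>\d+)\)"
)


def parse_plan(plan_text):
    """Return one dict per plan node, in the order the nodes are printed."""
    nodes = []
    for line in plan_text.splitlines():
        match = NODE_PATTERN.match(line)
        if match is None:
            continue  # detail lines such as "Index Cond" carry no cost
        nodes.append({
            "depth": len(match.group("indent")),  # deeper = closer to the data
            "node": match.group("node"),
            "startup_cost": float(match.group("startup")),
            "total_cost": float(match.group("total")),
            "rows": int(match.group("rows")),
        })
    return nodes


nodes = parse_plan(PLAN)
root, children = nodes[0], nodes[1:]
# The child with the highest total cost is the first candidate for tuning.
costliest = max(children, key=lambda n: n["total_cost"])
print(costliest["node"], costliest["total_cost"])
```

In this plan the sequential scan on order_items accounts for almost all of the cost reported by the Nested Loop, so that is the node worth tuning first.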

* Explain plan for a query with multiple joins

```sql
EXPLAIN
SELECT 
    o.order_date,
    d.department_id,
    d.department_name,
    c.category_name,
    p.product_name,
    round(sum(oi.order_item_subtotal)::numeric, 2) AS revenue
FROM orders o
    JOIN order_items oi
        ON o.order_id = oi.order_item_order_id
    JOIN products p
        ON p.product_id = oi.order_item_product_id
    JOIN categories c
        ON c.category_id = p.product_category_id
    JOIN departments d
        ON d.department_id = c.category_department_id
GROUP BY
    o.order_date,
    d.department_id,
    d.department_name,
    c.category_id,
    c.category_name,
    p.product_id,
    p.product_name
ORDER BY o.order_date,
    revenue DESC;
```

```text
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=76368.54..76799.03 rows=172198 width=211)
   Sort Key: o.order_date, (round((sum(oi.order_item_subtotal))::numeric, 2)) DESC
   ->  Finalize GroupAggregate  (cost=25958.31..43735.23 rows=172198 width=211)
         Group Key: o.order_date, d.department_id, c.category_id, p.product_id
         ->  Gather Merge  (cost=25958.31..39886.09 rows=101293 width=187)
               Workers Planned: 1
               ->  Partial GroupAggregate  (cost=24958.30..27490.62 rows=101293 width=187)
                     Group Key: o.order_date, d.department_id, c.category_id, p.product_id
                     ->  Sort  (cost=24958.30..25211.53 rows=101293 width=187)
                           Sort Key: o.order_date, d.department_id, c.category_id, p.product_id
                           ->  Hash Join  (cost=2495.48..7188.21 rows=101293 width=187)
                                 Hash Cond: (c.category_department_id = d.department_id)
                                 ->  Hash Join  (cost=2472.43..6897.32 rows=101293 width=79)
                                       Hash Cond: (p.product_category_id = c.category_id)
                                       ->  Hash Join  (cost=2470.13..6609.69 rows=101293 width=63)
                                             Hash Cond: (oi.order_item_product_id = p.product_id)
                                             ->  Hash Join  (cost=2411.87..6284.70 rows=101293 width=20)
                                                   Hash Cond: (oi.order_item_order_id = o.order_id)
                                                   ->  Parallel Seq Scan on order_items oi  (cost=0.00..2279.93 rows=101293 width=16)
                                                   ->  Hash  (cost=1213.83..1213.83 rows=68883 width=12)
                                                         ->  Seq Scan on orders o  (cost=0.00..1213.83 rows=68883 width=12)
                                             ->  Hash  (cost=41.45..41.45 rows=1345 width=47)
                                                   ->  Seq Scan on products p  (cost=0.00..41.45 rows=1345 width=47)
                                       ->  Hash  (cost=1.58..1.58 rows=58 width=20)
                                             ->  Seq Scan on categories c  (cost=0.00..1.58 rows=58 width=20)
                                 ->  Hash  (cost=15.80..15.80 rows=580 width=112)
                                       ->  Seq Scan on departments d  (cost=0.00..15.80 rows=580 width=112)
(27 rows)
```

## Overview of Cost Based Optimizer

Let us get an overview of the cost based optimizer.
* Databases use a cost based optimizer to generate the execution plans that we review using `EXPLAIN`. In the earlier days, they used rule based optimizers.
* For the cost based optimizer to generate an optimal plan, we need to ensure that statistics about the data in our tables are collected regularly.
* We can analyze tables (using `ANALYZE` in Postgres) to collect statistics. Typically DBAs schedule statistics collection at regular intervals.
* Here are some of the statistics typically collected.
  * Approximate number of records at table level.
  * Approximate number of unique values at index level.
* When plans are generated, the cost based optimizer uses these statistics to choose the plan with the lowest estimated cost for our query.
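As a worked example of how statistics turn into a cost, the Postgres documentation describes the estimated cost of a plain sequential scan as (pages read times `seq_page_cost`) plus (rows scanned times `cpu_tuple_cost`), with default values of 1.0 and 0.01. The sketch below applies that formula to the `Seq Scan on orders` node shown earlier. The row count is taken from the plan; the page count of 525 is an assumption inferred from the reported cost, not a value read from the database.

```python
# Default planner settings documented by Postgres.
SEQ_PAGE_COST = 1.0
CPU_TUPLE_COST = 0.01


def seq_scan_cost(pages, rows,
                  seq_page_cost=SEQ_PAGE_COST,
                  cpu_tuple_cost=CPU_TUPLE_COST):
    """Estimated total cost of a sequential scan without a filter."""
    return pages * seq_page_cost + rows * cpu_tuple_cost


ORDERS_ROWS = 68883  # rows=68883 in the plan for "Seq Scan on orders"
ORDERS_PAGES = 525   # assumption: inferred from the cost, not measured

estimated = seq_scan_cost(ORDERS_PAGES, ORDERS_ROWS)
print(round(estimated, 2))
```

Under these assumptions the result agrees with the `cost=0.00..1213.83` reported for the scan, which shows why stale row or page statistics lead directly to misleading cost estimates.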

## Performance Tuning using Indexes

Let us understand how we can improve the performance of a query by creating an index on order_items.order_item_order_id.

* We have order level details in orders and item level details in order_items.
* When customers want to review their orders, they need details about order_items. In almost all scenarios in an order management system, we fetch both order and order_items details by passing the order_id of pending or outstanding orders.
* Let us review the explain plan for the query without an index on order_items.order_item_order_id.

```sql
EXPLAIN
SELECT o.*,
    oi.order_item_subtotal
FROM orders o JOIN order_items oi
    ON o.order_id = oi.order_item_order_id
WHERE o.order_id = 2;
```

```text
                                    QUERY PLAN
-----------------------------------------------------------------------------------
 Nested Loop  (cost=0.29..3427.82 rows=3 width=34)
   ->  Index Scan using orders_pkey on orders o  (cost=0.29..8.31 rows=1 width=26)
         Index Cond: (order_id = 2)
   ->  Seq Scan on order_items oi  (cost=0.00..3419.47 rows=3 width=12)
         Filter: (order_item_order_id = 2)
(5 rows)
```

* Develop a piece of code that passes 2000 random order ids to the query and measures the time taken.

In [None]:
!pip install psycopg2

In [None]:
import psycopg2

In [None]:
%%time
connection = psycopg2.connect(
    host='localhost',
    port='5432',
    database='itversity_retail_db',
    user='itversity_retail_user',
    password='retail_password'
)
cursor = connection.cursor()
query = '''SELECT count(1) 
FROM orders o JOIN order_items oi 
    ON o.order_id = oi.order_item_order_id
WHERE o.order_id = %s
'''
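from random import randrange

ctr = 0
while True:
    if ctr == 2000:
        break
    # Pick a random order id instead of always passing 1
    order_id = randrange(1, 68883)
    cursor.execute(query, (order_id,))
    ctr += 1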
cursor.close()
connection.close()

* Create index on order_items.order_item_order_id

In [None]:
%load_ext sql

In [None]:
%env DATABASE_URL=postgresql://itversity_retail_user:retail_password@localhost:5432/itversity_retail_db

In [None]:
%%sql

CREATE INDEX order_items_order_id_idx 
ON order_items(order_item_order_id);

* Run explain plan after creating index on order_items.order_item_order_id

```sql
EXPLAIN
SELECT o.*,
    oi.order_item_subtotal
FROM orders o JOIN order_items oi
    ON o.order_id = oi.order_item_order_id
WHERE o.order_id = 2;
```

```text
                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.71..16.81 rows=3 width=34)
   ->  Index Scan using orders_pkey on orders o  (cost=0.29..8.31 rows=1 width=26)
         Index Cond: (order_id = 2)
   ->  Index Scan using order_items_order_id_idx on order_items oi  (cost=0.42..8.47 rows=3 width=12)
         Index Cond: (order_item_order_id = 2)
(5 rows)
```

* Run the code again to see how much time it takes to get the results for 2000 random orders.

In [None]:
import psycopg2

In [None]:
%%time

from random import randrange
connection = psycopg2.connect(
    host='localhost',
    port='5432',
    database='itversity_retail_db',
    user='itversity_retail_user',
    password='retail_password'
)
cursor = connection.cursor()
query = '''SELECT count(1) 
FROM orders o JOIN order_items oi 
    ON o.order_id = oi.order_item_order_id
WHERE o.order_id = %s
'''
ctr = 0
while True:
    if ctr == 2000:
        break
    order_id = randrange(1, 68883)
    cursor.execute(query, (order_id,))
    ctr += 1
cursor.close()
connection.close()

```{warning}
Keep in mind that indexes on a table can have a negative impact on write operations, since inserts, updates and deletes also have to maintain the indexes.
```
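The timing cells in this section repeat the same loop. A minimal sketch of a reusable helper is shown below; the name `time_random_lookups` and its parameters are hypothetical, and `run_query` is any callable that executes the query for one id, so the helper itself does not depend on `psycopg2`.

```python
import time
from random import randrange


def time_random_lookups(run_query, low, high, iterations=2000):
    """Call run_query with random ids in [low, high) and return elapsed seconds."""
    start = time.perf_counter()
    for _ in range(iterations):
        run_query(randrange(low, high))
    return time.perf_counter() - start


# Stand-in for a real query so that the sketch runs without a database.
seen_ids = []
elapsed = time_random_lookups(seen_ids.append, 1, 68883, iterations=100)
print(f"{len(seen_ids)} lookups in {elapsed:.6f} seconds")
```

With an open `psycopg2` cursor, `run_query` could be `lambda order_id: cursor.execute(query, (order_id,))`, which lets us take the same measurement before and after creating an index.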

## Criteria for Indexing

Let us go through some of the criteria for creating indexes on tables.
* Indexes are required to enforce constraints such as Primary Key and Unique. They are created automatically when we define a column (or columns) as Primary Key or Unique.
* Too many indexes on a given table can slow down inserts, updates and deletes on that table. Hence, make sure to strike the right balance by creating indexes only when they are required.
* A thorough analysis needs to be done of how the queries from the application will hit the table.
* We might have to create indexes on the foreign key columns of the child table. Postgres does not create these automatically.
* When we have tables with multiple parents, we need to be diligent about how the index should be created.
  * Shall we create 2 indexes?
  * Shall we create 1 index with both the columns pointing to the 2 tables?
  * If we want to create 1 index with both the columns, what should be the order of the columns?
* Here are some of the scenarios from the application perspective based upon which we can consider creating indexes.
  * A customer checking all of their orders.
    * We need to get the data from orders using customer id and hence we need to add an index on **orders.order_customer_id**.
  * A customer checking the details of a given order, which include order_item_subtotal as well as product names.
    * We need to join **orders**, **order_items** as well as **products**.
    * **order_items** is the child table of both **orders** and **products**.
    * We can create a composite index on **order_items.order_item_order_id** and **order_items.order_item_product_id**.
  * A customer care executive checking **all the order details placed by a customer using at least the first 3 characters of the customer's first name**.
    * We can consider creating an index on **customers.customer_fname** using upper or lower. You can also consider adding **customer_id** to the index along with customer_fname.
    * Also, to get all the order details for a given customer, we have to ensure that there is an index on **orders.order_customer_id**.
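To see why the column order of a composite index matters, recall that a B-tree index keeps its entries sorted by the indexed columns from left to right. The sketch below imitates that with sorted Python lists and `bisect`. It is a toy model with made-up (order id, product id) pairs, not a description of how Postgres stores indexes internally.

```python
from bisect import bisect_left, bisect_right

# Made-up (order_item_order_id, order_item_product_id) pairs.
order_items = [(1, 957), (2, 1073), (2, 502), (2, 403), (4, 897), (4, 365), (4, 502)]

# Index on (order_id, product_id): entries sorted by order id first.
oid_pid_index = sorted(order_items)
# Index on (product_id, order_id): entries sorted by product id first.
pid_oid_index = sorted((pid, oid) for oid, pid in order_items)


def seek_by_order_id(index, order_id):
    """Binary search for the contiguous run of entries with this leading order id."""
    start = bisect_left(index, (order_id,))
    end = bisect_right(index, (order_id, float("inf")))
    return index[start:end]


def scan_positions_by_order_id(index, order_id):
    """Order id is the second column here, so every entry has to be checked."""
    return [pos for pos, (_, oid) in enumerate(index) if oid == order_id]


matches = seek_by_order_id(oid_pid_index, 2)
positions = scan_positions_by_order_id(pid_oid_index, 2)
print(matches)
print(positions)
```

With order id as the leading column, the matching entries sit next to each other and a single seek finds them. With product id leading, the entries for one order are scattered across the index, so a lookup by order id cannot narrow the search.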

In [None]:
%load_ext sql

In [None]:
%env DATABASE_URL=postgresql://itversity_retail_user:retail_password@localhost:5432/itversity_retail_db

In [None]:
%%sql

DROP INDEX order_items_order_id_idx

In [None]:
%%sql

SELECT min(customer_id), max(customer_id), count(1)
FROM customers

In [None]:
import psycopg2

In [None]:
%%time

from random import randrange
connection = psycopg2.connect(
    host='localhost',
    port='5432',
    database='itversity_retail_db',
    user='itversity_retail_user',
    password='retail_password'
)
cursor = connection.cursor()
query = '''SELECT count(1) 
FROM orders o
WHERE order_customer_id = %s
'''
ctr = 0
while True:
    if ctr == 2000:
        break
    customer_id = randrange(10950, 12435)
    cursor.execute(query, (customer_id,))
    ctr += 1
cursor.close()
connection.close()

In [None]:
%%sql

CREATE INDEX orders_customer_id_idx
ON orders(order_customer_id)

In [None]:
%%time

from random import randrange
connection = psycopg2.connect(
    host='localhost',
    port='5432',
    database='itversity_retail_db',
    user='itversity_retail_user',
    password='retail_password'
)
cursor = connection.cursor()
query = '''SELECT count(1) 
FROM orders o
WHERE order_customer_id = %s
'''
ctr = 0
while True:
    if ctr == 2000:
        break
    customer_id = randrange(10950, 12435)
    cursor.execute(query, (customer_id,))
    ctr += 1
cursor.close()
connection.close()

In [None]:
%%time

from random import randrange
connection = psycopg2.connect(
    host='localhost',
    port='5432',
    database='itversity_retail_db',
    user='itversity_retail_user',
    password='retail_password'
)
cursor = connection.cursor()
query = '''SELECT count(1) 
FROM orders o
    JOIN order_items oi
        ON o.order_id = oi.order_item_order_id
    JOIN products p
        ON p.product_id = oi.order_item_product_id
WHERE order_id = %s
'''
ctr = 0
while True:
    if ctr == 2000:
        break
    order_id = randrange(1, 68883)
    cursor.execute(query, (order_id,))
    ctr += 1
cursor.close()
connection.close()

In [None]:
%%sql

CREATE INDEX order_items_oid_pid_idx 
ON order_items(order_item_order_id, order_item_product_id);

In [None]:
%%time

from random import randrange
connection = psycopg2.connect(
    host='localhost',
    port='5432',
    database='itversity_retail_db',
    user='itversity_retail_user',
    password='retail_password'
)
cursor = connection.cursor()
query = '''SELECT count(1) 
FROM orders o
    JOIN order_items oi
        ON o.order_id = oi.order_item_order_id
    JOIN products p
        ON p.product_id = oi.order_item_product_id
WHERE order_id = %s
'''
ctr = 0
while True:
    if ctr == 2000:
        break
    order_id = randrange(1, 68883)
    cursor.execute(query, (order_id,))
    ctr += 1
cursor.close()
connection.close()

```{note}
As our products table has only a handful of records, there will not be a significant difference in performance between the 2 approaches.
* Index on order_items.order_item_order_id
* Index on order_items.order_item_order_id, order_items.order_item_product_id

However, if you create the index using product id as the driving field, the performance will not be as good as with the above 2 approaches.
```

In [None]:
%%sql

DROP INDEX order_items_oid_pid_idx

In [None]:
%%sql

CREATE INDEX order_items_pid_oid_idx 
ON order_items(order_item_product_id, order_item_order_id);

In [None]:
%%time

from random import randrange
connection = psycopg2.connect(
    host='localhost',
    port='5432',
    database='itversity_retail_db',
    user='itversity_retail_user',
    password='retail_password'
)
cursor = connection.cursor()
query = '''SELECT count(1) 
FROM orders o
    JOIN order_items oi
        ON o.order_id = oi.order_item_order_id
    JOIN products p
        ON p.product_id = oi.order_item_product_id
WHERE order_id = %s
'''
ctr = 0
while True:
    if ctr == 2000:
        break
    order_id = randrange(1, 68883)
    cursor.execute(query, (order_id,))
    ctr += 1
cursor.close()
connection.close()

```{note}
Here are the indexes to tune the performance of searches that compare at least the first 3 characters of the customer's first name.
```

In [None]:
%%sql

DROP INDEX IF EXISTS orders_customer_id_idx

In [None]:
%%sql

DROP INDEX IF EXISTS customers_customer_fname_idx

* Explain plan for the query without indexes.

```sql
EXPLAIN
SELECT * 
FROM orders o JOIN customers c
    ON o.order_customer_id = c.customer_id
WHERE upper(c.customer_fname) = upper('mar');
```

```text
                               QUERY PLAN
-------------------------------------------------------------------------
 Hash Join  (cost=42.38..1437.09 rows=40 width=99)
   Hash Cond: (o.order_customer_id = c.customer_id)
   ->  Seq Scan on orders o  (cost=0.00..1213.83 rows=68883 width=26)
   ->  Hash  (cost=42.29..42.29 rows=7 width=73)
         ->  Seq Scan on customers c  (cost=0.00..42.29 rows=7 width=73)
               Filter: (upper((customer_fname)::text) = 'MAR'::text)
(6 rows)
```

In [None]:
%%sql

CREATE INDEX customers_customer_fname_idx
ON customers(upper(customer_fname))

In [None]:
%%sql

CREATE INDEX orders_customer_id_idx
ON orders(order_customer_id)

* Explain plan for the query with indexes. Check the cost: it is significantly lower than in the plan generated for the same query without indexes.

```sql
EXPLAIN
SELECT * 
FROM orders o JOIN customers c
    ON o.order_customer_id = c.customer_id
WHERE upper(c.customer_fname) = upper('mar');
```

```text
                                           QUERY PLAN
-------------------------------------------------------------------------------------------------
 Nested Loop  (cost=8.67..204.43 rows=40 width=99)
   ->  Bitmap Heap Scan on customers c  (cost=4.33..18.58 rows=7 width=73)
         Recheck Cond: (upper((customer_fname)::text) = 'MAR'::text)
         ->  Bitmap Index Scan on customers_customer_fname_idx  (cost=0.00..4.33 rows=7 width=0)
               Index Cond: (upper((customer_fname)::text) = 'MAR'::text)
   ->  Bitmap Heap Scan on orders o  (cost=4.34..26.49 rows=6 width=26)
         Recheck Cond: (order_customer_id = c.customer_id)
         ->  Bitmap Index Scan on orders_customer_id_idx  (cost=0.00..4.34 rows=6 width=0)
               Index Cond: (order_customer_id = c.customer_id)
(9 rows)
```

## Criteria for Partitioning

Let us understand how we can leverage partitioning to fine tune performance.
* Partitioning is another key strategy to boost the performance of queries.
* It is used extensively as a performance tuning strategy for tables created to support reporting requirements.
* Even in transactional systems, we can leverage partitioning as a performance tuning technique when dealing with large tables.
* For application log tables, we might want to discard all the irrelevant data after a specific time period. If partitioning is used, we can detach and/or drop the partitions quickly.
* Over a period of time, most of the orders will be in **CLOSED** status. We can use list partitioning to ensure that all the **CLOSED** orders are moved to a separate partition. This can improve the performance of activity related to active orders.
* In reporting databases, we might partition the transaction tables at the daily level so that we can easily filter and process data to pre-aggregate it and store it in the reporting data marts.
* Most of the tables in an ODS or Data Lake will be timestamped and partitioned at the daily or monthly level so that we can remove or archive old partitions easily.
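The **CLOSED** orders scenario can be pictured with a small Python model in which each partition is a separate list. The partition names, the sample orders and the `route` function are hypothetical; in Postgres the routing is done by the database according to the partition definitions.

```python
# Hypothetical sample of (order_id, order_status) pairs.
orders = [
    (1, "CLOSED"), (2, "PENDING_PAYMENT"), (3, "COMPLETE"),
    (4, "CLOSED"), (5, "PROCESSING"), (6, "CLOSED"),
]


def route(order_status):
    """List partitioning rule: CLOSED rows go to their own partition."""
    return "orders_closed" if order_status == "CLOSED" else "orders_active"


partitions = {"orders_closed": [], "orders_active": []}
for order in orders:
    partitions[route(order[1])].append(order)

# Work on active orders reads only the smaller active partition.
active_ids = [order_id for order_id, _ in partitions["orders_active"]]
print(active_ids)

# Discarding old data removes a whole partition instead of deleting row by row.
dropped = partitions.pop("orders_closed")
print(len(dropped), "closed orders removed in one step")
```

The same idea applies to log tables partitioned by time period: removing an expired period is one operation on a partition rather than a large delete.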

## Writing Queries – Partition Pruning

Let us understand how to write queries that leverage partitioning.
* Make sure to include a condition on the partition column.
* An equality condition typically yields the best results, as it can narrow the query down to a single partition.
* Queries with a condition on the partition key result in partition pruning: the data in the other partitions is ignored completely.
* As partition pruning results in less I/O, the overall performance of such queries can improve drastically.
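Partition pruning can be illustrated with a toy Python model in which orders are stored in one list per month. The sample data and the helper names are assumptions made for the illustration; the point is only to compare how many rows each kind of query has to read.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sample of (order_id, order_date, order_status) rows.
orders = [
    (1, date(2013, 7, 25), "CLOSED"),
    (2, date(2013, 7, 25), "PENDING_PAYMENT"),
    (3, date(2013, 8, 2), "COMPLETE"),
    (4, date(2013, 8, 15), "CLOSED"),
    (5, date(2013, 9, 1), "PROCESSING"),
    (6, date(2013, 9, 9), "CLOSED"),
]

# Partition key: (year, month) of order_date.
partitions = defaultdict(list)
for row in orders:
    partitions[(row[1].year, row[1].month)].append(row)


def count_for_month(year, month):
    """Filter on the partition key: only the matching partition is read."""
    rows = partitions.get((year, month), [])
    return len(rows), len(rows)  # (matching rows, rows read)


def count_for_status(status):
    """Filter on another column: every partition has to be read."""
    rows_read = sum(len(rows) for rows in partitions.values())
    matching = sum(row[2] == status for rows in partitions.values() for row in rows)
    return matching, rows_read


print(count_for_month(2013, 8))
print(count_for_status("CLOSED"))
```

The query on the partition key reads only the rows of one month, while the query on order_status has to read every partition to find its matches.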

## Overview of Query Hints

Let us get an overview of query hints.
* In databases that support hints, we can specify a hint using `/*+ HINT */` as part of the query. Postgres does not support hints out of the box; this comment syntax is provided by extensions such as `pg_hint_plan`.
* Make sure there are no typos in the hint.
* If a hint contains a typo or refers to an index that does not exist, it is ignored.
* In case of complex queries, the CBO might choose a suboptimal index or an inappropriate join type.
* If we are sure that the query should use a particular index or join type, we can use a hint to force the optimizer to choose it.

## Exercise - Tuning Queries

As part of this exercise, you need to prepare the data set, go through the explain plan and come up with the right indexes to tune the performance.

* As of now, the customer email id in the customers table contains the same value (**XXXXXXXXX**) for all customers.
* Let us update the customer email id.
  * Use the initial (first character) of customer_fname.
  * Use the full string of customer_lname.
  * Use row_number by partitioning the data by the first character of customer_fname and the full customer_lname, sorted by customer_id.
  * Make sure row_number is at least 3 digits; if not, pad it with 0 and concatenate it to the email id.
  * Also make sure email ids are in upper case. Here are some examples.

|customer_id|customer_fname|customer_lname|rank|customer_email|
|-----------|--------------|--------------|----|--------------|
|11591|Ann|Alexander|1|AALEXANDER001@SOME.COM|
|12031|Ashley|Benitez|1|ABENITEZ001@SOME.COM|
|11298|Anthony|Best|1|ABEST001@SOME.COM|
|11304|Alexander|Campbell|1|ACAMPBELL001@SOME.COM|
|11956|Alan|Campos|1|ACAMPOS001@SOME.COM|
|12075|Aaron|Carr|1|ACARR001@SOME.COM|
|12416|Aaron|Cline|1|ACLINE001@SOME.COM|
|10967|Alexander|Cunningham|1|ACUNNINGHAM001@SOME.COM|
|12216|Ann|Deleon|1|ADELEON001@SOME.COM|
|11192|Andrew|Dickson|1|ADICKSON001@SOME.COM|

* Let us assume that customer care will search for customer details using at least the first 4 characters.
* Generate the explain plan for the query.
* Create a unique index on customer_email.
* Generate the explain plan again and review the differences.

In [1]:
%env DATABASE_URL=postgresql://itversity_retail_user:retail_password@localhost:5432/itversity_retail_db

env: DATABASE_URL=postgresql://itversity_retail_user:retail_password@localhost:5432/itversity_retail_db


In [2]:
%load_ext sql

In [14]:
%%sql

SELECT q.*,
    upper(concat(substring(customer_fname, 1, 1), customer_lname, lpad(rnk::varchar, 3, '0'), '@SOME.COM')) AS customer_email
FROM (  
    SELECT customer_id,
        customer_fname,
        customer_lname,
        rank() OVER (
            PARTITION BY substring(customer_fname, 1, 1), customer_lname
            ORDER BY customer_id
        ) AS rnk
    FROM customers
) q
ORDER BY customer_email
LIMIT 10

 * postgresql://itversity_retail_user:***@localhost:5432/itversity_retail_db
10 rows affected.


customer_id,customer_fname,customer_lname,rnk,customer_email
11591,Ann,Alexander,1,AALEXANDER001@SOME.COM
12031,Ashley,Benitez,1,ABENITEZ001@SOME.COM
11298,Anthony,Best,1,ABEST001@SOME.COM
11304,Alexander,Campbell,1,ACAMPBELL001@SOME.COM
11956,Alan,Campos,1,ACAMPOS001@SOME.COM
12075,Aaron,Carr,1,ACARR001@SOME.COM
12416,Aaron,Cline,1,ACLINE001@SOME.COM
10967,Alexander,Cunningham,1,ACUNNINGHAM001@SOME.COM
12216,Ann,Deleon,1,ADELEON001@SOME.COM
11192,Andrew,Dickson,1,ADICKSON001@SOME.COM
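
The email rule can also be cross-checked outside the database. The Python sketch below mirrors the SQL: it numbers customers within each (first initial, last name) group in customer_id order and pads the number to 3 digits. The function name `build_emails` and the second customer in the sample are hypothetical, added only to show the counter moving past 001.

```python
from collections import defaultdict


def build_emails(customers):
    """Map customer_id to email for (customer_id, fname, lname) tuples."""
    counters = defaultdict(int)
    emails = {}
    # Sorting by customer_id matches ORDER BY customer_id in the window.
    for customer_id, fname, lname in sorted(customers):
        group = (fname[:1], lname)
        counters[group] += 1
        email = f"{fname[:1]}{lname}{counters[group]:03d}@SOME.COM"
        emails[customer_id] = email.upper()
    return emails


sample = [
    (11591, "Ann", "Alexander"),
    (99999, "Amy", "Alexander"),  # hypothetical customer, not from the data set
    (12031, "Ashley", "Benitez"),
]
emails = build_emails(sample)
print(emails[11591])
print(emails[99999])
```

Because customer_id is unique, numbering the rows this way gives the same result as `rank()` in the query above and as `row_number()` in the exercise description.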
